C.analysis_tools

class clibas.dataanalysis.DataAnalysisTools(*args)

Bases: Handler

Collection of analysis tools for sequencing data visualization and statistics.

Most methods are pipeline operations: they return a callable for use in processing pipelines.

Provides methods for analyzing sequence length distributions, quality scores, library convergence metrics, and dimensionality reduction with clustering. Typically accessed through the clibas facade after initialization.

Example

>>> import clibas as C
>>> C.initialize_from_config('config.yaml')
>>> # analysis tools are now ready to use
>>> # as C.analysis_tools

Note

This class is not typically instantiated directly. Use the clibas initialization system to access analysis functionality.
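
The pipeline-operation pattern used by most methods here (configure first, get back a callable, then apply it to the data) can be pictured with a small library-independent sketch. The names `make_length_report` and `run_pipeline` are hypothetical and not part of clibas, and a plain list stands in for the Data object.

```python
from collections import Counter

def make_length_report():
    """Hypothetical factory: returns an operation to be run later on the data."""
    def op(data):
        # Side effect only: record a summary and leave the data untouched.
        op.last_report = Counter(len(seq) for seq in data)
        return data
    op.last_report = None
    return op

def run_pipeline(data, ops):
    """Apply each operation in order, feeding its output to the next one."""
    for op in ops:
        data = op(data)
    return data

report = make_length_report()
seqs = ["ACDE", "ACD", "ACDE"]
out = run_pipeline(seqs, [report])
```

Because every operation returns the data it received, analysis steps of this kind can be placed anywhere in a chain without affecting downstream steps.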

Methods

length_analysis

DataAnalysisTools.length_analysis(where=None, save_txt=False)

Analyze sequence length distributions.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Computes and plots the distribution of sequence lengths for each sample. Results are saved as histogram plots in the analysis output directory.

Parameters:

where (str or None) – Dataset selector. When analyzing SequencingSample datasets, must be either ‘dna’ or ‘pep’. For direct inputs such as lists, NumPy arrays, or pandas Series objects, leave as None; the provided input is then analyzed directly.
save_txt (bool) – If True, save the length distribution data to a CSV file alongside the plots. Default is False.

Returns:

Operation that accepts a Data object, generates length distribution analysis, and returns the unmodified Data object.

Return type:

callable

Example

>>> # analyze the length distribution of peptide datasets
>>> len_analysis = C.analysis_tools.length_analysis(where='pep', save_txt=True)
>>> data = len_analysis(data)
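
As a rough illustration of the quantity being plotted, a length distribution for direct list input could be computed as below. This is a minimal sketch, not the clibas implementation; `length_distribution` is a hypothetical helper, and reporting fractions rather than raw counts is an assumption.

```python
from collections import Counter

def length_distribution(sequences):
    """Map each observed sequence length to the fraction of sequences with it."""
    counts = Counter(len(s) for s in sequences)
    total = sum(counts.values())
    # Sort by length so the result reads like a histogram, left to right.
    return {length: n / total for length, n in sorted(counts.items())}

peptides = ["ACDEF", "ACD", "ACDEF", "ACDEFGH"]
dist = length_distribution(peptides)
```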

q_score_analysis

DataAnalysisTools.q_score_analysis(loc=None, save_txt=False)

Analyze quality score distributions.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Computes mean and standard deviation of quality scores for each position in the specified regions or across entire sequences. Results are plotted in the analysis output directory.

Parameters:

loc (list, optional) – List of integers specifying which regions of DNA to analyze. If None, analyzes entire sequences. When specified, collapses sample internal state.
save_txt (bool) – If True, save the quality score statistics to a CSV file alongside the plots. Default is False.

Returns:

Operation that accepts a Data object, generates quality score analysis, and returns the unmodified Data object.

Return type:

callable

Example

>>> # summarize Q scores in regions 0 and 1
>>> q_analysis = C.analysis_tools.q_score_analysis(loc=[0, 1], save_txt=True)
>>> data = q_analysis(data)
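
The per-position statistics can be sketched as follows for reads of equal length. This is an illustration under stated assumptions: `per_position_q_stats` is a hypothetical helper, the scores are assumed to be already decoded into integers, region selection via `loc` is not modeled, and the population standard deviation is used (the library may use the sample version).

```python
from statistics import mean, pstdev

def per_position_q_stats(q_scores):
    """Return a (mean, std) pair for each position across all reads.

    q_scores: list of equal-length lists of integer quality scores.
    """
    # zip(*rows) transposes reads into per-position columns.
    columns = list(zip(*q_scores))
    return [(mean(col), pstdev(col)) for col in columns]

reads = [[30, 20, 10], [40, 20, 30]]
stats = per_position_q_stats(reads)
```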

sequence_convergence_analysis

DataAnalysisTools.sequence_convergence_analysis(where=None)

Analyze library convergence at the sequence level.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Computes normalized Shannon entropy and position-wise sequence conservation to assess library convergence. Results are plotted in the analysis output directory.

Parameters:

where (str or None) – Dataset selector. When analyzing SequencingSample datasets, must be either ‘dna’ or ‘pep’. For direct inputs such as lists, NumPy arrays, or pandas Series objects, leave as None; the provided input is then analyzed directly.

Returns:

Operation that accepts a Data object, generates convergence analysis, and returns the unmodified Data object.

Return type:

callable

Example

>>> conv_analysis = C.analysis_tools.sequence_convergence_analysis(where='pep')
>>> data = conv_analysis(data)
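
The two metrics named above can be sketched as below. The exact definitions used by clibas are not given in this reference, so both helpers are hypothetical: entropy is taken over the frequencies of distinct sequences and normalized by the log of the total read count (an assumption), and conservation is taken as the frequency of the most common token at each position of equal-length sequences (also an assumption).

```python
import math
from collections import Counter

def normalized_shannon_entropy(sequences):
    """Entropy of the sequence frequency distribution, scaled to [0, 1]."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total < 2:
        return 0.0
    h = sum((n / total) * math.log(total / n) for n in counts.values())
    # Assumed normalization: log(total) is the entropy of an all-unique library.
    return h / math.log(total)

def positional_conservation(sequences):
    """Fraction of sequences carrying the most common token at each position."""
    return [
        Counter(col).most_common(1)[0][1] / len(col)
        for col in zip(*sequences)
    ]

converged = ["AAA", "AAA", "AAA", "AAA"]
diverse = ["AAA", "AAC", "ACA", "CAA"]
h_converged = normalized_shannon_entropy(converged)
h_diverse = normalized_shannon_entropy(diverse)
conservation = positional_conservation(diverse)
```

Under these definitions a fully converged library scores 0 and a library in which every read is unique scores 1.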

token_convergence_analysis

DataAnalysisTools.token_convergence_analysis(where=None, loc=None, alphabet=None, save_txt=False)

Analyze library convergence at the token (amino acid/nucleobase) level.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Computes the frequency of each token at each position in the dataset. Generates frequency heatmaps and sequence logos. When analyzing specific regions, collapses sample internal state.

Parameters:

where (str or None) – Dataset selector. When analyzing SequencingSample datasets, must be either ‘dna’ or ‘pep’. For direct inputs such as lists, NumPy arrays, or pandas Series objects, leave as None; the provided input is then analyzed directly. When where is None, the alphabet argument must be specified.
loc (list, optional) – List of integers specifying which regions to analyze. If None, analyzes entire sequences. When specified, collapses sample internal state.
alphabet (list, tuple, ndarray, or str, optional) –
Token alphabet for analysis. Can be one of:
- a sequence (list, tuple, or ndarray) specifying custom tokens, or
- the string ‘aa’ for the amino acid alphabet as specified in the config, or
- the string ‘base’ for the nucleotide base alphabet as specified in the config.
If None, the alphabet is automatically inferred from ‘where’.
save_txt (bool) – If True, save the frequency data to a CSV file alongside the plots. Default is False.

Returns:

Operation that accepts a Data object, generates token-level convergence analysis, and returns the unmodified Data object.

Return type:

callable

Example

>>> token_analysis = C.analysis_tools.token_convergence_analysis(
...     where='pep', loc=[1, 3], save_txt=True
... )
>>> data = token_analysis(data)
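
The position-by-token frequency table that underlies the heatmaps and logos can be sketched as follows. `token_frequencies` is a hypothetical helper, not the clibas implementation; it assumes equal-length sequences and an explicitly supplied alphabet.

```python
from collections import Counter

def token_frequencies(sequences, alphabet):
    """Return one row per position; each row holds a frequency per alphabet token."""
    n = len(sequences)
    freqs = []
    for column in zip(*sequences):  # iterate over positions
        counts = Counter(column)
        # Tokens absent from this position get a frequency of 0.
        freqs.append([counts[tok] / n for tok in alphabet])
    return freqs

freq = token_frequencies(["ACD", "ACE", "GCD", "ACD"], alphabet="ACDEG")
```

Each row sums to 1 as long as every observed token is in the alphabet, which is one reason the alphabet has to be known when it cannot be inferred from `where`.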

umap_hdbscan_analysis

DataAnalysisTools.umap_hdbscan_analysis(top_n=None, where=None, F=None, cluster_fasta=False, alphabet=None, return_modified=False, single_manifold=False)

Perform UMAP dimensionality reduction and HDBSCAN clustering.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Embeds sequences into 2D space using UMAP and clusters them using HDBSCAN. Automatically optimizes hyperparameters based on dataset size and composition. Generates interactive HTML dashboards and clustering summaries.

The analysis pipeline:

1. Featurize sequences (one-hot encoding or custom feature matrix)
2. Infer optimal UMAP and HDBSCAN hyperparameters
3. Embed sequences to 2D using UMAP
4. Cluster embedded sequences with HDBSCAN
5. Generate interactive visualizations and summaries

Parameters:

top_n (int, optional) – Number of most abundant sequences to analyze. If None, analyzes entire dataset. Recommended: 100-5000 for computational efficiency.
where (str or None) – Dataset selector. When analyzing SequencingSample datasets, must be either ‘dna’ or ‘pep’. For direct inputs such as lists, NumPy arrays, or pandas Series objects, leave as None; the provided input is then analyzed directly. When where is None, the alphabet argument must be specified.
F (str, ndarray, or None) – Feature matrix specification. If None, one-hot encoding is used. String options include ‘varimax’, ‘pep_ECFP3’, ‘pep_ECFP4’, ‘pep_SMILES’. A custom 2D array with F.shape[0] == len(alphabet) can also be provided.
cluster_fasta (bool) – If True, generates separate FASTA files for each cluster. Default is False.
alphabet (list, tuple, ndarray, or str, optional) –
Token alphabet for analysis. Can be one of:
- a sequence (list, tuple, or ndarray) specifying custom tokens, or
- the string ‘aa’ for the amino acid alphabet as specified in the config, or
- the string ‘base’ for the nucleotide base alphabet as specified in the config.
If None, the alphabet is automatically inferred from ‘where’.
return_modified (bool) – If True, the operation returns a Data object carrying the embeddings and cluster labels. If False, it returns the original data unmodified. Default is False.
single_manifold (bool) – If True, embeds all samples into a single shared manifold for direct comparison. If False, each sample is embedded independently. Default is False.

Returns:

Operation that accepts a Data object, performs UMAP/HDBSCAN analysis, and returns either the original or modified Data object.

Return type:

callable

Note

Generates multiple output files per sample:

- Interactive HTML dashboard for exploration
- Clustering summary CSV
- Entry-count-cluster mapping CSV
- Static matplotlib plots
- Optional: per-cluster FASTA files (if cluster_fasta=True)
- Optional: single manifold comparison dashboard (if single_manifold=True)

Example

>>> umap_analysis = C.analysis_tools.umap_hdbscan_analysis(
...     top_n=1000, where='pep', F='pep_ECFP4', cluster_fasta=True
... )
>>> data = umap_analysis(data)
>>>
>>> # for comparing multiple samples on the same manifold
>>> umap_analysis = C.analysis_tools.umap_hdbscan_analysis(
...     top_n=500, where='pep', single_manifold=True
... )
>>> data = umap_analysis(data)
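
The featurization step and the F.shape[0] == len(alphabet) requirement can be illustrated with a short NumPy sketch. `featurize` is a hypothetical helper, not the clibas implementation: it assumes equal-length sequences, treats each row of F as the feature vector of one alphabet token, and falls back to an identity matrix (one-hot encoding) when F is None. The embedding and clustering steps are not shown.

```python
import numpy as np

def featurize(sequences, alphabet, F=None):
    """Turn equal-length sequences into a 2D feature matrix, one row per sequence."""
    alphabet = list(alphabet)
    if F is None:
        F = np.eye(len(alphabet))  # one-hot: row i encodes token i
    F = np.asarray(F, dtype=float)
    if F.ndim != 2 or F.shape[0] != len(alphabet):
        raise ValueError("F must be 2D with one row per alphabet token")
    index = {tok: i for i, tok in enumerate(alphabet)}
    # Integer-encode the tokens, look up their feature rows, then flatten per sequence.
    idx = np.array([[index[tok] for tok in seq] for seq in sequences])
    return F[idx].reshape(len(sequences), -1)

X_onehot = featurize(["AC", "CA"], alphabet="AC")
X_custom = featurize(["AC"], alphabet="AC", F=[[0.5], [2.0]])
```

The resulting matrix has shape (number of sequences, sequence length × F.shape[1]), which is the kind of input a 2D embedding method would consume.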