C.fastq_parser¶

class clibas.parsers.FastqParser(*args)[source]¶

Bases: Handler

Processor for FASTQ sequencing data with filtering and translation capabilities.

FastqParser provides methods for processing NGS data, including DNA-to-peptide translation, quality filtering, and sequence validation against library designs. Most operations work on Data objects and return transformed Data instances.

Most methods are pipeline operations: they return a callable for use in processing pipelines.

The parser is typically accessed through the clibas facade after initialization:

Example

>>> import clibas as C
>>> C.initialize_from_config('config.yaml')
>>> #parser is now ready to use
>>> #as C.fastq_parser

Note

This class is not typically invoked directly. Use the clibas initialization system to access parser functionality.
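
Because pipeline operations are callables that accept and return a Data object, they can be chained by ordinary function composition. The sketch below shows the pattern on a plain list of strings; `run_pipeline`, `upper`, and `min_len` are hypothetical stand-ins for illustration, not part of clibas.

```python
from functools import reduce

def run_pipeline(data, *ops):
    """Apply each operation to the output of the previous one."""
    return reduce(lambda acc, op: op(acc), ops, data)

# Toy factories (hypothetical): each returns a callable,
# mirroring the "returns a callable" pattern described above.
def upper():
    return lambda reads: [r.upper() for r in reads]

def min_len(n):
    return lambda reads: [r for r in reads if len(r) >= n]

result = run_pipeline(["acgt", "ac", "ttgaca"], upper(), min_len(4))
print(result)
```

With real parser operations, the same helper would presumably be called with the callables returned by methods such as trim_reads or translate.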

Methods¶

trim_reads¶

FastqParser.trim_reads(left='', right='', tol=None)[source]

Trim DNA reads based on constant flanking sequences.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Removes adapter overhangs and non-target sequences by anchoring to specified 5’ and 3’ constant regions. Useful for processing merged paired-end reads from tools like FLASH that may leave adapter sequences at read termini.

The operation locks onto the specified left (5’) and right (3’) anchor sequences and keeps only the region between them. An anchor matches only if it is within the specified mismatch tolerance. If an anchor is not found, the corresponding side of the read is left untrimmed.

Parameters:

left (str) – 5’ anchor sequence. Everything upstream of this sequence is trimmed. Use empty string ‘’ to skip 5’ trimming.
right (str) – 3’ anchor sequence. Everything downstream of this sequence is trimmed. Use empty string ‘’ to skip 3’ trimming.
tol (int) – Maximum number of mismatches allowed when matching anchor sequences (Hamming distance tolerance).

Returns:

Operation that accepts a Data object, applies trimming transformation, and returns the modified Data object.

Return type:

callable

Example

>>> trimmer = C.fastq_parser.trim_reads(left='ATCG', right='GCTA', tol=1)
>>> data = trimmer(data)
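
The anchoring logic can be illustrated on a single read. This is a minimal sketch, not the clibas implementation: `hamming`, `find_anchor`, and `trim_read` are hypothetical helpers, and the choice to keep the anchors themselves in the output is an assumption.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_anchor(read, anchor, tol, from_right=False):
    """Start index of the first (or last) window within tol mismatches, else None."""
    starts = range(len(read) - len(anchor) + 1)
    if from_right:
        starts = reversed(starts)
    for i in starts:
        if hamming(read[i:i + len(anchor)], anchor) <= tol:
            return i
    return None

def trim_read(read, left="", right="", tol=0):
    """Keep the span from the left anchor through the right anchor.

    Assumption: anchors are retained; an unmatched side stays untrimmed.
    """
    start, end = 0, len(read)
    if left:
        i = find_anchor(read, left, tol)
        if i is not None:
            start = i
    if right:
        j = find_anchor(read, right, tol, from_right=True)
        if j is not None:
            end = j + len(right)
    return read[start:end]

trimmed = trim_read("TTATCGGGGCCCGCTAAA", left="ATCG", right="GCTA", tol=0)
print(trimmed)
```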

translate¶

FastqParser.translate(force_at_frame=None, stop_readthrough=False)[source]

Translate DNA sequences to peptides/proteins in silico.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Performs translation of DNA sequencing data to peptides/proteins. Supports ORF detection via a configurable sequence pattern or forced translation at a specified frame. Designed for one-ORF-per-read NGS data rather than long reads with multiple ORFs.

An underscore ‘_’ is appended to the C-terminus if the final codon is incomplete and no stop codon is present.

Parameters:

force_at_frame (int, optional) –
Translation frame (0, 1, or 2). If None, an automatic ORF search is performed using the pattern specified in the config (orf_locator). The read is searched in the 5’-to-3’ direction, and translation begins downstream of the first match. If no orf_locator is specified, “ATG” is used as the default.

If force_at_frame is specified, it forces translation to start at the given frame regardless of pattern matching.
For example:
>>> DNA:              TACGACTCACTATAGGGTTAACTTTAAGAAGGA
>>> force_at_frame=0  ---------->
>>> force_at_frame=1   ---------->
>>> force_at_frame=2    ---------->
stop_readthrough (bool) – If True, translation continues past stop codons until the 3’ end of the read. If False, translation terminates at the first stop codon. Default is False.

Returns:

Operation that accepts a Data object, translates DNA to peptides, and returns the modified Data object with peptide sequences.

Return type:

callable

Example

>>> translator = C.fastq_parser.translate(force_at_frame=0)
>>> data = translator(data)
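
Forced-frame translation can be sketched for a single read as follows. This is an illustration under stated assumptions, not the clibas code: it uses the standard genetic code rather than a configured translation table, `translate_at_frame` is a hypothetical helper, and rendering stop codons as '*' under read-through is an assumption.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_at_frame(dna, frame=0, stop_readthrough=False):
    """Translate one read starting at the given frame offset (0, 1, or 2)."""
    peptide = []
    for i in range(frame, len(dna), 3):
        codon = dna[i:i + 3]
        if len(codon) < 3:
            peptide.append("_")  # incomplete final codon, no stop reached
            break
        aa = CODON_TABLE.get(codon, "+")  # '+' marks an ambiguous codon (e.g. contains N)
        if aa == "*" and not stop_readthrough:
            break  # terminate at the first stop codon
        peptide.append(aa)
    return "".join(peptide)

print(translate_at_frame("ATGGCTTAA", frame=0))
print(translate_at_frame("ATGGCTTAA", frame=1))
```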

revcom¶

FastqParser.revcom()[source]

Reverse complement DNA sequences.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Computes the reverse complement of DNA sequences and reverses the corresponding quality scores for each sample in the Data object. Useful for processing reads sequenced from the opposite strand.

Returns:

Operation that accepts a Data object, applies reverse complement transformation to DNA sequences and reverses quality scores, and returns the modified Data object.

Return type:

callable

Note

Calling this operation on Data containing peptide sequences raises a warning. The peptide data is left as is (no transform is applied).

Example

>>> revcom_op = C.fastq_parser.revcom()
>>> data = revcom_op(data)
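
For a single read, the transformation amounts to complementing and reversing the bases while reversing the quality string so that each score stays attached to its base. A minimal sketch; `revcom_read` is a hypothetical helper operating on plain strings rather than Data objects.

```python
# Complement map for the four bases; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcom_read(seq, qual):
    """Reverse-complement the bases and reverse the per-base qualities."""
    return seq.translate(COMPLEMENT)[::-1], qual[::-1]

seq, qual = revcom_read("AACGT", "IIH#5")
print(seq, qual)
```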

len_filter¶

FastqParser.len_filter(where=None, len_range=None)[source]

Filter dataset sequences by length.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Filters sequences based on library design specifications or a custom length range. Sequences that don’t meet the length criteria are discarded from all datasets in each sample.

Parameters:

where (str) – Dataset to evaluate for filtering. Must be ‘dna’ or ‘pep’. Length is checked against this dataset, but filtering is applied across all associated data in each sample.
len_range (tuple or list, optional) – Two-element sequence specifying (min_length, max_length) range. If None, filtering is performed according to library design specifications.

Returns:

Operation that accepts a Data object, applies length filtering, and returns the modified Data object.

Return type:

callable

Example

>>> #filter by design specifications
>>> length_filt = C.fastq_parser.len_filter(where='pep')
>>> data = length_filt(data)
>>> #filter by custom range
>>> length_filt = C.fastq_parser.len_filter(where='pep', len_range=(10, 50))
>>> data = length_filt(data)
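
The phrase "filtering is applied across all associated data" can be pictured as computing a keep/discard mask on one dataset and applying it to every dataset of the sample. A sketch with plain lists; `length_mask` and `apply_mask` are hypothetical helpers, and treating both bounds as inclusive is an assumption.

```python
def length_mask(seqs, len_range):
    """True for each sequence whose length lies within [min_length, max_length]."""
    lo, hi = len_range
    return [lo <= len(s) <= hi for s in seqs]

def apply_mask(dataset, mask):
    """Keep only the entries whose mask value is True."""
    return [item for item, keep in zip(dataset, mask) if keep]

pep = ["MACDEFGHIK", "MAC", "MKLVWYTPPKAA"]
dna = ["dna0", "dna1", "dna2"]  # placeholders for the matching DNA reads

mask = length_mask(pep, (10, 50))  # lengths are evaluated on 'pep' only
pep_kept, dna_kept = apply_mask(pep, mask), apply_mask(dna, mask)
print(pep_kept, dna_kept)
```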

cr_filter¶

FastqParser.cr_filter(where=None, loc=None, tol=1)[source]

Filter datasets by constant region integrity.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Filters sequences based on constant region integrity. Sequences with mutations in specified constant regions exceeding the tolerance threshold are discarded from all datasets in each sample.

Parameters:

where (str) – Dataset to evaluate for filtering. Must be ‘dna’ or ‘pep’. Constant regions are checked in this dataset, but filtering is applied across all associated data in each sample.
loc (list) – List of integers specifying which constant region indices to operate on. Must reference constant (non-variable) regions in the library design.
tol (int) – Maximum number of mismatches allowed in the constant regions (Hamming distance). Sequences exceeding this threshold are discarded. Default is 1.

Returns:

Operation that accepts a Data object, applies constant region filtering, and returns the modified Data object.

Return type:

callable

Note

Insertions and deletions in constant regions are not validated by this operation, only substitutions.

Example

Consider a library design with the following structure:

        seq:  ACDEF11133211AWVFRTQ12345YTPPK
     region:  [-0-][---1--][--2--][-3-][-4-]
is_variable:  False  True   False True False

Regions 0, 2, and 4 are constant regions with expected sequences.

>>> #allow up to 1 mutation in region 2 (AWVFRTQ)
>>> cr_filt = C.fastq_parser.cr_filter(where='pep', loc=[2], tol=1)
>>> data = cr_filt(data)
>>> #sequences with >1 mutation in AWVFRTQ are discarded
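
The check for the design above can be sketched on plain strings. This is an illustration only: `REGION_SLICES` and `passes_cr` are hypothetical names, the region boundaries are read off the diagram, and counting mismatches as a total over all listed regions (rather than per region) is an assumption.

```python
TEMPLATE = "ACDEF11133211AWVFRTQ12345YTPPK"
# Region boundaries taken from the diagram above (assumed 0-based, end-exclusive).
REGION_SLICES = {0: slice(0, 5), 1: slice(5, 13), 2: slice(13, 20),
                 3: slice(20, 25), 4: slice(25, 30)}

def passes_cr(seq, loc, tol):
    """True if substitutions in the listed constant regions do not exceed tol."""
    mismatches = 0
    for r in loc:
        expected = TEMPLATE[REGION_SLICES[r]]
        observed = seq[REGION_SLICES[r]]
        mismatches += sum(a != b for a, b in zip(observed, expected))
    return mismatches <= tol

intact = "ACDEFGGGSSGGGAWVFRTQKLMNPYTPPK"
one_sub = "ACDEFGGGSSGGGAWVARTQKLMNPYTPPK"   # F -> A in region 2
two_subs = "ACDEFGGGSSGGGAAVARTQKLMNPYTPPK"  # W -> A and F -> A in region 2
print([passes_cr(s, loc=[2], tol=1) for s in (intact, one_sub, two_subs)])
```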

vr_filter¶

FastqParser.vr_filter(where=None, loc=None, sets=None)[source]

Filter by variable region composition.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Filters sequences based on variable region composition. Sequences containing amino acids or nucleotides outside the allowed monomer sets are discarded from all datasets in each sample.

Parameters:

where (str) – Dataset to evaluate for filtering. Must be ‘dna’ or ‘pep’. Variable regions are checked in this dataset, but filtering is applied across all associated data in each sample.
loc (list) – List of integers specifying which variable region indices to validate. Must reference variable regions in the library design.
sets (list) – List of integers specifying which monomer subsets to validate. Each integer corresponds to a monomer set defined in the library design configuration. Only specified sets are checked; others are ignored.

Returns:

Operation that accepts a Data object, applies variable region filtering, and returns the modified Data object.

Return type:

callable

Example

Consider a library design with the following structure:

        seq:  ACDEF11133211AWVFRTQ12345YTPPK
     region:  [-0-][---1--][--2--][-3-][-4-]
is_variable:  False  True   False True False

Region 1 contains variable positions with monomer sets 1, 2, and 3. The library design defines which amino acids are allowed for each set.

>>> #validate only sets 1 and 3 in region 1 (set 2 is not checked)
>>> vr_filt = C.fastq_parser.vr_filter(where='pep', loc=[1], sets=[1, 3])
>>> data = vr_filt(data)
>>>
>>> #this would raise an error (region 2 is constant, not variable)
>>> vr_filt = C.fastq_parser.vr_filter(where='pep', loc=[2], sets=[1])
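
The effect of the sets argument can be sketched for region 1 of the design above. The monomer sets below are invented for illustration (the real ones come from the library design configuration), and `passes_vr` is a hypothetical helper.

```python
TEMPLATE_REGION_1 = "11133211"  # monomer-set index at each variable position
# Hypothetical allowed residues per monomer set.
MONOMER_SETS = {1: set("ACDE"), 2: set("KR"), 3: set("FWY")}

def passes_vr(region_seq, sets):
    """True if every position belonging to a checked set holds an allowed residue."""
    for set_char, residue in zip(TEMPLATE_REGION_1, region_seq):
        set_id = int(set_char)
        if set_id in sets and residue not in MONOMER_SETS[set_id]:
            return False
    return True

print(passes_vr("ACDFWKAE", sets=[1, 3]))  # all checked positions valid
print(passes_vr("ACDFWZAE", sets=[1, 3]))  # invalid residue sits in unchecked set 2
```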

filt_ambiguous¶

FastqParser.filt_ambiguous(where=None)[source]

Filter sequences containing ambiguous tokens.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Removes sequences containing ambiguous characters. For DNA, this includes ‘N’ nucleotides from uncertain base calls. For peptides, this includes any amino acids outside the translation table specification (such as ‘+’ tokens stemming from ambiguous codons).

Parameters:

where (str) – Dataset to evaluate for filtering. Must be ‘dna’ or ‘pep’. Ambiguous tokens are checked in this dataset, but filtering is applied across all associated data in each sample.

Returns:

Operation that accepts a Data object, applies ambiguous token filtering, and returns the modified Data object.

Return type:

callable

Example

>>> ambig_filt = C.fastq_parser.filt_ambiguous(where='pep')
>>> data = ambig_filt(data)
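
Conceptually, a sequence is kept only if every token belongs to the allowed alphabet. A minimal sketch; `keep_unambiguous` is a hypothetical helper, and the 20-letter peptide alphabet stands in for whatever the configured translation table defines.

```python
DNA_ALPHABET = "ACGT"
PEP_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumption: standard 20 amino acids

def keep_unambiguous(seqs, alphabet):
    """Discard sequences containing any token outside the alphabet."""
    allowed = set(alphabet)
    return [s for s in seqs if set(s) <= allowed]

print(keep_unambiguous(["ACGT", "ACNT", "GGCA"], DNA_ALPHABET))
print(keep_unambiguous(["MACK", "MA+K"], PEP_ALPHABET))
```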

drop_data¶

FastqParser.drop_data(where=None)[source]

Drop specified datasets from samples.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Removes the specified dataset from all samples to free memory or simplify downstream processing. Useful when certain data types are no longer needed in the pipeline.

Parameters:

where (str) – Dataset to drop. Must be ‘dna’, ‘pep’, or ‘Q’ (quality scores).

Returns:

Operation that accepts a Data object, drops the specified dataset, and returns the modified Data object.

Return type:

callable

Example

>>> drop_op = C.fastq_parser.drop_data(where='dna')
>>> data = drop_op(data)

q_score_filt¶

FastqParser.q_score_filt(minQ=None, avgQ=None, loc=None)[source]

Filter data by Phred quality scores.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Filters sequences based on Phred quality scores (Q). Reads that fail the specified quality criteria are discarded from all datasets in each sample.

Exactly one of minQ or avgQ must be provided:

If minQ is given, a read passes only if all relevant Q scores are greater than or equal to minQ.
If avgQ is given, a read passes only if the average Q score across the relevant regions is greater than or equal to avgQ.

The optional loc argument restricts filtering to specific regions (e.g., primer sites, amplicons). If omitted, filtering is applied across all regions.

Parameters:

minQ (int, optional) – Minimum quality score threshold. All Q scores in the specified regions must be >= this value. Mutually exclusive with avgQ.
avgQ (int, optional) – Average quality score threshold. The mean Q score across the specified regions must be >= this value. Mutually exclusive with minQ.
loc (list[int], optional) – List of region indices to evaluate for quality filtering. If None, all regions are considered.

Returns:

A pipeline operation that accepts a Data object, applies quality score filtering according to the specified criteria, and returns the modified Data object.

Return type:

callable

Example

>>> # require all Q scores in regions 0, 1, 2 to be >= 30
>>> q_filt = C.fastq_parser.q_score_filt(minQ=30, loc=[0, 1, 2])
>>> data = q_filt(data)

>>> # alternatively, require average Q score across all regions >= 35
>>> q_filt = C.fastq_parser.q_score_filt(avgQ=35)
>>> data = q_filt(data)
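
The two criteria can be sketched on a raw FASTQ quality string. This sketch assumes Phred+33 encoding and ignores the loc restriction; `phred_scores` and `passes_q` are hypothetical helpers, not clibas functions.

```python
def phred_scores(qual, offset=33):
    """Decode an ASCII quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]

def passes_q(qual, minQ=None, avgQ=None):
    """Apply exactly one of the two criteria to a single read."""
    if (minQ is None) == (avgQ is None):
        raise ValueError("provide exactly one of minQ or avgQ")
    scores = phred_scores(qual)
    if minQ is not None:
        return min(scores) >= minQ      # every base must reach minQ
    return sum(scores) / len(scores) >= avgQ  # mean must reach avgQ

print(phred_scores("II5I"))
print(passes_q("II5I", minQ=30), passes_q("II5I", avgQ=35))
```

A single low-quality base fails the minQ test but can still pass avgQ if the remaining bases are good, which is the practical difference between the two modes.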

fetch_at¶

FastqParser.fetch_at(where=None, loc=None)[source]

Extract specified regions from sequences.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Extracts and retains only the specified regions from sequences in a dataset specified by ‘where’, discarding other regions. For DNA datasets, quality scores are also truncated to match. Collapses the sample’s internal state after extraction.

Parameters:

where (str) – Dataset to extract from. Must be ‘dna’ or ‘pep’.
loc (list) – List of integers specifying which regions to extract and retain.

Returns:

Operation that accepts a Data object, extracts specified regions, and returns the modified Data object with truncated sequences.

Return type:

callable

Example

>>> #truncate peptide sequences to contain only regions 1 and 3:
>>> fetch_op = C.fastq_parser.fetch_at(where='pep', loc=[1, 3])
>>> data = fetch_op(data)
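
Region extraction can be pictured as slicing and concatenating. The sketch reuses the region layout from the cr_filter example; `REGION_SLICES` and `fetch_regions` are hypothetical names, and concatenating the regions in the order given by loc is an assumption.

```python
# Region boundaries for the design ACDEF11133211AWVFRTQ12345YTPPK (0-based, end-exclusive).
REGION_SLICES = {0: slice(0, 5), 1: slice(5, 13), 2: slice(13, 20),
                 3: slice(20, 25), 4: slice(25, 30)}

def fetch_regions(seq, loc):
    """Concatenate the requested regions, discarding everything else."""
    return "".join(seq[REGION_SLICES[r]] for r in loc)

pep = "ACDEFGGGSSGGGAWVFRTQKLMNPYTPPK"
fetched = fetch_regions(pep, [1, 3])
print(fetched)
```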

unpad¶

FastqParser.unpad()[source]

Remove padding from sequence arrays.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Removes padding columns from DNA, peptide, and quality score arrays. Columns where every value is a padding token are removed, reducing memory usage and simplifying downstream analysis.

Returns:

Operation that accepts a Data object, removes padding, and returns the modified Data object.

Return type:

callable

Example

>>> unpad_op = C.fastq_parser.unpad()
>>> data = unpad_op(data)
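
The column rule can be sketched on equal-length strings. The padding token '-' and the helper `unpad_rows` are assumptions made for illustration; the actual arrays and padding token are internal to clibas.

```python
PAD = "-"  # hypothetical padding token

def unpad_rows(rows, pad=PAD):
    """Drop every column in which all rows hold the padding token."""
    keep = [j for j, col in enumerate(zip(*rows)) if any(ch != pad for ch in col)]
    return ["".join(row[j] for j in keep) for row in rows]

rows = ["ACD--", "AC---", "ACDE-"]
unpadded = unpad_rows(rows)
print(unpadded)
```

Only the last column is removed here: column 3 still holds a real token in the third row, so padding in the other rows is preserved.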

demultiplex_sample_barcodes¶

FastqParser.demultiplex_sample_barcodes(barcode_loc=None, barcode_tol=None)[source]

Demultiplex samples based on DNA barcode sequences.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Splits a multiplexed sample into separate sub-samples according to DNA barcode sequences present in specified regions. Each read is assigned to the sample whose barcode it most closely matches (within tolerance). Reads that match multiple barcodes are assigned to the first matching barcode with a warning.

The operation requires a sample barcode mapping to be specified in the config file (sample_barcodes in FastqParserConfig), which defines the relationship between sample names and their corresponding barcode sequences.

This operation creates new Sample objects for each barcode, named using the original sample name concatenated with the barcode name. The original Data object is replaced with a new Data instance containing the demultiplexed samples.

Parameters:

barcode_loc (list) – List of integers specifying the DNA region(s) containing the barcode sequence. Must reference positions in the library design.
barcode_tol (int) – Maximum Hamming distance allowed between observed and expected barcodes. Reads exceeding this threshold are not assigned to any barcode and are discarded.

Returns:

Operation that accepts a Data object, performs sample demultiplexing based on barcodes, and returns a new Data object containing the demultiplexed samples.

Return type:

callable

Note

Requires a sample barcode mapping to be specified in the config file (sample_barcodes) as a dictionary mapping barcode names to their sequences.

Example

>>> # demultiplex using barcode in region 2, allowing up to 1 mismatch
>>> demux_op = C.fastq_parser.demultiplex_sample_barcodes(
...     barcode_loc=[2], barcode_tol=1
... )
>>> demultiplexed_data = demux_op(data)
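
The assignment rule (closest barcode within tolerance, first match on ties) can be sketched as follows. The barcode mapping, read identifiers, and `assign_sample` helper are invented for illustration and are not the clibas implementation.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

SAMPLE_BARCODES = {"ctrl": "ACGT", "treated": "TGCA"}  # hypothetical mapping

def assign_sample(observed, barcodes, tol):
    """Name of the closest barcode within tol, or None; ties go to the first one."""
    best_name, best_dist = None, tol + 1
    for name, bc in barcodes.items():
        d = hamming(observed, bc)
        if d < best_dist:  # strict '<' keeps the earlier barcode on ties
            best_name, best_dist = name, d
    return best_name

# (read id, observed barcode region) pairs
reads = [("r1", "ACGT"), ("r2", "TGCC"), ("r3", "GGGG")]
demuxed = {}
for read_id, bc in reads:
    name = assign_sample(bc, SAMPLE_BARCODES, tol=1)
    if name is not None:  # unassigned reads are discarded
        demuxed.setdefault(name, []).append(read_id)
print(demuxed)
```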

demultiplex_aa_barcodes¶

FastqParser.demultiplex_aa_barcodes(barcode_loc=None, barcode_tol=None)[source]

Demultiplex degenerate amino acids using DNA barcodes.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Resolves ambiguous amino acid identities in non-proteinogenic saturation mutagenesis workflows by reading DNA barcodes. The procedure was first described in https://doi.org/10.1073/pnas.1809901115.

Briefly, during npAA saturation mutagenesis, several peptide libraries containing different genetic codes are prepared and analyzed together. For example, an AUG codon may encode Met in sublibrary 1, some other amino acid (e.g., N-Me-Gly) in sublibrary 2, β-Ala in sublibrary 3, and so on. The identity of the amino acid encoded by the AUG codon in this example is therefore ambiguous. When the sublibraries are prepared, their mRNAs are usually barcoded in the UTR region to encode the identity of such degenerate amino acids. This barcode can be read out by NGS to determine the identity of the degenerate amino acid in any given peptide.

This operation requires a “barcode translation table” to be specified in the config file as part of the “constants” declaration. The barcode translation table is a dictionary which denotes the mapping between the barcodes and the degenerate amino acids. It looks like this:

{"CACGAT":
  {"$": "f"},
"ATGTCG":
  {"$": "g"},
"AGGCTT":
  {"$": "c"}
}

Here, CACGAT, ATGTCG, and AGGCTT are mRNA barcodes; $ is the placeholder token for the amino acid to be demultiplexed; and f, g, and c are the specific npAAs corresponding to these barcodes. The translation_table in the config file must map the degenerate codon (AUG in this example) to a placeholder token such as $, because during translation it is impossible to tell which amino acid the codon actually encodes. AUG is used here only as an example; any other codon can play this role. More than one amino acid can be demultiplexed at a time.

In the example above, this operation replaces $ with f in all reads containing barcode CACGAT in the DNA region specified by barcode_loc.

The observed barcode may differ from the specified barcode (CACGAT) by no more than barcode_tol mismatches (maximum Hamming distance). The same procedure is carried out for every other barcode in the barcode translation table.

This operation collapses the sample’s internal state.

Parameters:

barcode_loc (list) – List of integers specifying the DNA region containing the barcode sequence.
barcode_tol (int) – Maximum Hamming distance allowed between observed and expected barcodes. Sequences exceeding this threshold are not demultiplexed.

Returns:

Operation that accepts a Data object, performs barcode-based demultiplexing, and returns the modified Data object with resolved amino acid identities.

Return type:

callable

Note

Requires a barcode translation table to be specified in the configuration file, mapping barcode sequences to amino acid substitutions.

Example

>>> demux_op = C.fastq_parser.demultiplex_aa_barcodes(
...     barcode_loc=[3], barcode_tol=2
... )
>>> data = demux_op(data)
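
The placeholder swap can be sketched for a single read using the barcode translation table shown above. `resolve_peptide` is a hypothetical helper; matching against the first barcode within tolerance is an assumption about tie handling.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

# Barcode translation table from the description above.
BARCODE_TABLE = {"CACGAT": {"$": "f"}, "ATGTCG": {"$": "g"}, "AGGCTT": {"$": "c"}}

def resolve_peptide(pep, barcode, table, tol):
    """Replace placeholder tokens according to the first barcode within tol."""
    for expected, swaps in table.items():
        if hamming(barcode, expected) <= tol:
            return pep.translate(str.maketrans(swaps))
    return pep  # no barcode within tolerance: left unresolved

print(resolve_peptide("MA$K", "CACGAT", BARCODE_TABLE, tol=0))
print(resolve_peptide("MA$K", "ATGTCC", BARCODE_TABLE, tol=1))
```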

save¶

FastqParser.save(where=None, fmt=None)[source]

Save datasets to a file.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Writes the specified dataset from each sample to a file in the parser output directory. Files are named using the sample name and timestamp. Does not transform the Data object.

Parameters:

where (str) – Dataset to save. Must be ‘dna’ or ‘pep’.
fmt (str) – Output file format. Must be ‘npy’, ‘csv’, or ‘fasta’.

Returns:

Operation that accepts a Data object, saves the specified data, and returns the unmodified Data object.

Return type:

callable

Example

>>> #save peptide datasets as .fasta files
>>> save_op = C.fastq_parser.save(where='pep', fmt='fasta')
>>> data = save_op(data)

count_summary¶

FastqParser.count_summary(where=None, top_n=None, fmt=None)[source]

Generate sequence count summary for each sample.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Counts the occurrences of each unique sequence in the specified dataset and writes results to a file in the parser output directory. Does not transform the Data object.

Parameters:

where (str) – Dataset to analyze. Must be ‘dna’ or ‘pep’.
top_n (int, optional) – If specified, only the top N most abundant sequences are included in the output. If None, all unique sequences are reported.
fmt (str) – Output file format. Must be ‘csv’ or ‘fasta’.

Returns:

Operation that accepts a Data object, generates count summaries, and returns the unmodified Data object.

Return type:

callable

Example

>>> #save top 100 peptides by count from each sample as .csv files
>>> count_op = C.fastq_parser.count_summary(where='pep', top_n=100, fmt='csv')
>>> data = count_op(data)
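
The counting step is equivalent in spirit to tallying unique sequences and keeping the most abundant ones. A sketch using the standard library; the column names and in-memory CSV output are assumptions, not the format clibas writes.

```python
import csv
import io
from collections import Counter

peptides = ["MACK", "MACK", "MWYK", "MACK", "MWYK", "MGGK"]

# Tally unique sequences and keep the top_n most abundant.
top = Counter(peptides).most_common(2)

# Write the summary as CSV (to an in-memory buffer for this sketch).
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["sequence", "count"])  # hypothetical header
writer.writerows(top)
print(buf.getvalue())
```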

library_design_match¶

FastqParser.library_design_match(where=None)[source]

Summarize sequence matches to library design templates.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Analyzes how many sequences from each sample match each library design template. Results are written to a CSV file showing the distribution of sequences across different library designs. Does not transform the Data object.

Parameters:

where (str) – Dataset to analyze. Must be ‘dna’ or ‘pep’.

Returns:

Operation that accepts a Data object, generates library design match summary, and returns the unmodified Data object.

Return type:

callable

Example

>>> match_op = C.fastq_parser.library_design_match(where='pep')
>>> data = match_op(data)

dataset_wide_count_summary¶

FastqParser.dataset_wide_count_summary(where=None, top_n=None)[source]

Generate merged count summary across all samples.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Counts unique sequences across all samples and merges results into a single spreadsheet showing counts and percentages for each sample. Two CSV files are generated: one with absolute counts and one with percentages. Does not transform the Data object.

Parameters:

where (str) – Dataset to analyze. Must be ‘dna’ or ‘pep’.
top_n (int, optional) – If specified, only the top N most abundant sequences across all samples are included. If None, all unique sequences are reported.

Returns:

Operation that accepts a Data object, generates dataset-wide count summary, and returns the unmodified Data object.

Return type:

callable

Note

Requires at least 2 samples in the dataset. Single-sample datasets will log a warning and skip this operation.

Example

>>> #summarize top 500 peptides (aggregate across all samples)
>>> summary_op = C.fastq_parser.dataset_wide_count_summary(where='pep', top_n=500)
>>> data = summary_op(data)
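
The merged summary can be pictured as one row per sequence and one column per sample, built once for counts and once for percentages. A sketch with toy data; the sample names are invented, and computing percentages relative to each sample's total read count is an assumption.

```python
from collections import Counter

samples = {
    "s1": ["MACK", "MACK", "MWYK"],
    "s2": ["MACK", "MWYK", "MGGK", "MACK"],
}

# Per-sample tallies and the aggregate across all samples.
counts = {name: Counter(seqs) for name, seqs in samples.items()}
total = sum(counts.values(), Counter())

# Rank sequences by aggregate abundance and keep the top N.
top_seqs = [seq for seq, _ in total.most_common(2)]

# One row per sequence, one column per sample.
count_table = {seq: [counts[n][seq] for n in samples] for seq in top_seqs}
percent_table = {
    seq: [100 * counts[n][seq] / len(samples[n]) for n in samples]
    for seq in top_seqs
}
print(count_table)
print(percent_table)
```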