C.data_loader

class clibas.dataloaders.FastqLoader(*args)[source]

Bases: Handler

FASTQ file loading and streaming utilities.

FastqLoader reads sequencing data from FASTQ files, both uncompressed (.fastq) and gzipped (.fastq.gz), and can optionally stream them for memory-efficient processing of large files. It is typically accessed through the clibas facade after initialization.

Example

>>> import clibas as C
>>> C.initialize_from_config('config.yaml')
>>> #loader tools are now ready to use
>>> #as C.data_loader

Note

This class is not typically instantiated directly. Use the clibas initialization system to access loader functionality.

Methods

fetch_fastq_from_dir

FastqLoader.fetch_fastq_from_dir(data_dir=None)[source]

Load all FASTQ files from a directory into memory.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Reads all .fastq files in the specified directory and combines them into a single Data object. Use for small to medium datasets that fit in memory.

Parameters:

data_dir (str, optional) – Directory containing FASTQ files. If None, uses the sequencing_data directory from configuration.

Returns:

Operation that when called returns a Data object containing all samples from the directory.

Return type:

callable

Raises:

IOError – If directory is invalid or contains no .fastq files.

Example

>>> fetch_op = C.data_loader.fetch_fastq_from_dir('path/to/fastq/files')
>>> data = fetch_op()
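
The "returns a callable" behaviour can be illustrated with a minimal standard-library sketch. This is not the clibas implementation: parse_fastq and make_fetch_op are hypothetical names, a plain dict stands in for the Data object, and the four-lines-per-record FASTQ layout is assumed.

```python
from pathlib import Path


def parse_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from an open FASTQ text handle.

    Assumes the common four-lines-per-record layout.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, not needed here
        quality = handle.readline().rstrip("\n")
        yield header.strip()[1:], sequence, quality


def make_fetch_op(data_dir):
    """Hypothetical helper: return a zero-argument callable that loads data_dir."""

    def fetch_op():
        directory = Path(data_dir)
        paths = sorted(directory.glob("*.fastq")) if directory.is_dir() else []
        if not paths:
            raise IOError(f"{data_dir!r} is not a directory with .fastq files")
        # Plain dict standing in for a Data object: sample name -> list of reads.
        data = {}
        for path in paths:
            with open(path, "rt") as handle:
                data[path.stem] = list(parse_fastq(handle))
        return data

    return fetch_op
```

In this sketch nothing touches the file system until the returned callable is invoked, so the operation can be built early and run later as one step of a pipeline.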

fetch_gz_from_dir

FastqLoader.fetch_gz_from_dir(data_dir=None)[source]

Load all gzipped FASTQ files from a directory into memory.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Reads all .fastq.gz files in the specified directory and combines them into a single Data object. Use for small to medium datasets that fit in memory.

Parameters:

data_dir (str, optional) – Directory containing gzipped FASTQ files. If None, uses the sequencing_data directory from configuration.

Returns:

Operation that when called returns a Data object containing all samples from the directory.

Return type:

callable

Raises:

IOError – If directory is invalid or contains no .fastq.gz files.

Example

>>> fetch_op = C.data_loader.fetch_gz_from_dir('path/to/fastq/files')
>>> data = fetch_op()
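
Gzipped and uncompressed FASTQ files can be read by the same line-based code once they are opened in text mode. The sketch below shows one way to do that with the standard library; open_fastq and count_reads are hypothetical helpers and say nothing about how clibas opens files internally.

```python
import gzip


def open_fastq(path):
    """Open a .fastq or .fastq.gz file as a text stream."""
    path = str(path)
    if path.endswith(".gz"):
        # "rt" makes gzip decompress and decode to str on the fly.
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_reads(path):
    """Count records, assuming four lines per FASTQ record."""
    with open_fastq(path) as handle:
        return sum(1 for _ in handle) // 4
```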

stream_from_fastq_dir

FastqLoader.stream_from_fastq_dir(data_dir=None)[source]

Stream sequencing samples from a directory containing FASTQ files.

Generator that yields one SequencingSample per FASTQ file, enabling memory-efficient processing of multiple files without loading all data at once.

Parameters:

data_dir (str, optional) – Directory containing FASTQ files. If None, uses the sequencing_data directory from the config file.

Yields:

SequencingSample – One sample per FASTQ file in the directory.

Raises:

IOError – If directory is invalid or contains no .fastq files.

Example

>>> streamer = C.data_loader.stream_from_fastq_dir(data_dir='../fastq')
>>> #process samples one by one
>>> C.pipeline.stream(streamer=streamer, save_summary=True)
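
The one-sample-per-file behaviour can be sketched as a plain Python generator. This is an illustration under stated assumptions: stream_samples is a hypothetical name, a (name, sequences) tuple stands in for SequencingSample, and records are assumed to span four lines each.

```python
from itertools import islice
from pathlib import Path


def stream_samples(data_dir):
    """Yield (sample_name, sequences) for each .fastq file, one file at a time."""
    paths = sorted(Path(data_dir).glob("*.fastq"))
    if not paths:
        # In a generator this is raised on the first next(), not at call time.
        raise IOError(f"no .fastq files found in {data_dir!r}")
    for path in paths:
        with open(path, "rt") as handle:
            # Sequence lines are the 2nd line of every 4-line record.
            sequences = [line.rstrip("\n") for line in islice(handle, 1, None, 4)]
        yield path.stem, sequences
```

Because each file is opened only when the consumer asks for the next item, at most one file's reads are held in memory at a time in this sketch.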

stream_from_gz_dir

FastqLoader.stream_from_gz_dir(data_dir=None)[source]

Stream sequencing samples from a directory of gzipped FASTQ files.

Generator that yields one SequencingSample per .fastq.gz file, enabling memory-efficient processing of multiple compressed files without loading all data at once.

Parameters:

data_dir (str, optional) – Directory containing gzipped FASTQ files. If None, uses the sequencing_data directory from the config file.

Yields:

SequencingSample – One sample per .fastq.gz file in the directory.

Raises:

IOError – If directory is invalid or contains no .fastq.gz files.

Example

>>> streamer = C.data_loader.stream_from_gz_dir(data_dir='../fastq')
>>> #process samples one by one
>>> C.pipeline.stream(streamer=streamer, save_summary=True)

stream_from_gz_file

FastqLoader.stream_from_gz_file(fname=None, reads_per_chunk=None)[source]

Stream a gzipped FASTQ file in chunks for memory-efficient processing.

Reads a large .fastq.gz file in manageable chunks, yielding SequencingSample objects containing the specified number of reads. Useful for processing files too large to fit in memory.

Parameters:
  • fname (str) – Path to the .fastq.gz file to stream.

  • reads_per_chunk (int) – Number of reads to process in each chunk.

Yields:

SequencingSample – Sample objects containing up to reads_per_chunk sequences each. Sample names are suffixed with chunk numbers (e.g., sample_001, sample_002).

Raises:
  • ValueError – If reads_per_chunk is not an integer.

  • IOError – If file cannot be opened or read.

Example

>>> streamer = C.data_loader.stream_from_gz_file(fname='example.fastq.gz', reads_per_chunk=int(5e6))
>>> #process file chunks one by one
>>> C.pipeline.stream(streamer=streamer, save_summary=True)
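
Chunked reading of a gzipped FASTQ file can be sketched with gzip and itertools.islice. The function below is a hypothetical stand-in, not the clibas code: stream_gz_chunks is an invented name, a (name, sequences) tuple replaces SequencingSample, and the three-digit chunk suffix only mirrors the naming example given above.

```python
import gzip
from itertools import islice
from pathlib import Path


def stream_gz_chunks(fname, reads_per_chunk):
    """Yield (chunk_name, sequences) with at most reads_per_chunk reads each."""
    # Checks run on the first next() call because this is a generator.
    if isinstance(reads_per_chunk, bool) or not isinstance(reads_per_chunk, int):
        raise ValueError("reads_per_chunk must be an integer")
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be positive")

    # Strip ".gz" and ".fastq" to get the base sample name.
    base = Path(fname).name
    for suffix in (".gz", ".fastq"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]

    with gzip.open(fname, "rt") as handle:
        chunk_no = 0
        while True:
            # Four lines per record, so pull 4 * reads_per_chunk lines.
            lines = list(islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            chunk_no += 1
            sequences = [line.rstrip("\n") for line in lines[1::4]]
            yield f"{base}_{chunk_no:03d}", sequences
```

In this sketch the last chunk simply holds whatever reads remain, so it can be shorter than the others.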