C.data_loader
- class clibas.dataloaders.FastqLoader(*args)
Bases: Handler
FASTQ file loading and streaming utilities.
Provides FastqLoader for reading sequencing data from FASTQ files, supporting both uncompressed (.fastq) and gzipped (.fastq.gz) formats with optional streaming for memory-efficient processing of large files. Typically accessed through the clibas facade after initialization.
Example
>>> import clibas as C
>>> C.initialize_from_config('config.yaml')
>>> # loader tools are now ready to use
>>> # as C.data_loader
Note
This class is not typically instantiated directly. Use the clibas initialization system and access the loader as C.data_loader.
Methods
fetch_fastq_from_dir
- FastqLoader.fetch_fastq_from_dir(data_dir=None)
Load all FASTQ files from a directory into memory.
Note
Pipeline Operation - Returns a callable for use in processing pipelines.
Reads all .fastq files in the specified directory and combines them into a single Data object. Use for small to medium datasets that fit in memory.
- Parameters:
data_dir (str, optional) – Directory containing FASTQ files. If None, uses the sequencing_data directory from configuration.
- Returns:
Operation that when called returns a Data object containing all samples from the directory.
- Return type:
callable
- Raises:
IOError – If directory is invalid or contains no .fastq files.
Example
>>> fetch_op = C.data_loader.fetch_fastq_from_dir('path/to/fastq/files')
>>> data = fetch_op()
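The "returns a callable" behaviour described above is a deferred-operation pattern: building the operation reads nothing, and the files are parsed only when the operation is called. The sketch below illustrates that pattern in plain Python. It is not the clibas implementation; `make_fetch_op` and `parse_fastq` are hypothetical names, and a plain dict stands in for the library's Data object.

```python
from pathlib import Path


def parse_fastq(path):
    """Parse one uncompressed FASTQ file into (read_id, sequence, quality) tuples."""
    records = []
    with open(path, "r") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file (or a blank line)
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            records.append((header[1:], sequence, quality))
    return records


def make_fetch_op(data_dir):
    """Hypothetical stand-in: build a zero-argument operation; nothing is read yet."""
    def fetch_op():
        directory = Path(data_dir)
        files = sorted(directory.glob("*.fastq")) if directory.is_dir() else []
        if not files:
            raise IOError(f"No .fastq files found in {data_dir!r}")
        # A plain dict (sample name -> reads) stands in for a Data object.
        return {f.stem: parse_fastq(f) for f in files}
    return fetch_op
```

In this sketch the directory check and all file reads happen inside `fetch_op()`, so the operation can be created early and executed later as one step of a pipeline.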
fetch_gz_from_dir
- FastqLoader.fetch_gz_from_dir(data_dir=None)
Load all gzipped FASTQ files from a directory into memory.
Note
Pipeline Operation - Returns a callable for use in processing pipelines.
Reads all .fastq.gz files in the specified directory and combines them into a single Data object. Use for small to medium datasets that fit in memory.
- Parameters:
data_dir (str, optional) – Directory containing gzipped FASTQ files. If None, uses the sequencing_data directory from configuration.
- Returns:
Operation that when called returns a Data object containing all samples from the directory.
- Return type:
callable
- Raises:
IOError – If directory is invalid or contains no .fastq.gz files.
Example
>>> fetch_op = C.data_loader.fetch_gz_from_dir('path/to/fastq/files')
>>> data = fetch_op()
stream_from_fastq_dir
- FastqLoader.stream_from_fastq_dir(data_dir=None)
Stream sequencing samples from a directory containing FASTQ files.
Generator that yields one SequencingSample per FASTQ file, enabling memory-efficient processing of multiple files without loading all data at once.
- Parameters:
data_dir (str, optional) – Directory containing FASTQ files. If None, uses the sequencing_data directory from the config file.
- Yields:
SequencingSample – One sample per FASTQ file in the directory.
- Raises:
IOError – If directory is invalid or contains no .fastq files.
Example
>>> streamer = C.data_loader.stream_from_fastq_dir(data_dir='../fastq')
>>> # process samples one by one
>>> C.pipeline.stream(streamer=streamer, save_summary=True)
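To show why a per-file generator keeps memory use low, here is a minimal sketch of the same idea in plain Python. It is an illustration under assumptions, not the library's code: `stream_samples` and `count_reads` are hypothetical names, and a `(name, n_reads)` tuple stands in for a SequencingSample.

```python
from pathlib import Path


def count_reads(path):
    """Count reads in an uncompressed FASTQ file (four lines per read)."""
    with open(path, "r") as handle:
        return sum(1 for _ in handle) // 4


def stream_samples(data_dir):
    """Hypothetical stand-in: lazily yield one (name, n_reads) pair per .fastq file."""
    directory = Path(data_dir)
    files = sorted(directory.glob("*.fastq")) if directory.is_dir() else []
    if not files:
        raise IOError(f"No .fastq files found in {data_dir!r}")
    for path in files:
        # Only one file is opened and processed per iteration.
        yield path.stem, count_reads(path)
```

Because `stream_samples` is a generator function, nothing runs until the first item is requested, so in this sketch the IOError surfaces on first iteration rather than at call time. Whether clibas raises at the same point is not stated here.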
stream_from_gz_dir
- FastqLoader.stream_from_gz_dir(data_dir=None)
Stream sequencing samples from a directory of gzipped FASTQ files.
Generator that yields one SequencingSample per .fastq.gz file, enabling memory-efficient processing of multiple compressed files without loading all data at once.
- Parameters:
data_dir (str, optional) – Directory containing gzipped FASTQ files. If None, uses the sequencing_data directory from the config file.
- Yields:
SequencingSample – One sample per .fastq.gz file in the directory.
- Raises:
IOError – If directory is invalid or contains no .fastq.gz files.
Example
>>> streamer = C.data_loader.stream_from_gz_dir(data_dir='../fastq')
>>> # process samples one by one
>>> C.pipeline.stream(streamer=streamer, save_summary=True)
stream_from_gz_file
- FastqLoader.stream_from_gz_file(fname=None, reads_per_chunk=None)
Stream a gzipped FASTQ file in chunks for memory-efficient processing.
Reads a large .fastq.gz file in manageable chunks, yielding SequencingSample objects containing the specified number of reads. Useful for processing files too large to fit in memory.
- Parameters:
fname (str) – Path to the .fastq.gz file to stream.
reads_per_chunk (int) – Number of reads to process in each chunk.
- Yields:
SequencingSample – Sample objects containing reads_per_chunk sequences. Sample names are suffixed with chunk numbers (e.g., sample_001, sample_002).
- Raises:
ValueError – If reads_per_chunk is not an integer.
IOError – If file cannot be opened or read.
Example
>>> streamer = C.data_loader.stream_from_gz_file(fname='example.fastq.gz', reads_per_chunk=int(5e6))
>>> # process file chunks one by one
>>> C.pipeline.stream(streamer=streamer, save_summary=True)
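Chunked streaming of a gzipped FASTQ file can be sketched with the standard library alone: read four lines per read, take `reads_per_chunk` reads at a time, and name each chunk with a zero-padded counter. The code below is a hedged illustration, not the clibas implementation; `stream_gz_chunks` is a hypothetical name, a `(chunk_name, reads)` tuple stands in for a SequencingSample, and the `_001` naming simply follows the example suffixes given above.

```python
import gzip
from itertools import islice
from pathlib import Path


def stream_gz_chunks(fname, reads_per_chunk):
    """Hypothetical stand-in: yield (chunk_name, reads) from a .fastq.gz file."""
    if not isinstance(reads_per_chunk, int):
        raise ValueError("reads_per_chunk must be an integer")
    base = Path(fname).name
    if base.endswith(".fastq.gz"):
        base = base[: -len(".fastq.gz")]
    with gzip.open(fname, "rt") as handle:
        chunk_no = 0
        while True:
            # Each FASTQ read occupies four lines.
            lines = list(islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            chunk_no += 1
            reads = [
                (
                    lines[i][1:].rstrip("\n"),   # read id without the leading '@'
                    lines[i + 1].rstrip("\n"),   # sequence
                    lines[i + 3].rstrip("\n"),   # quality string
                )
                for i in range(0, len(lines) - 3, 4)
            ]
            yield f"{base}_{chunk_no:03d}", reads
```

In this sketch only one chunk is held in memory at a time, and the final chunk contains fewer than `reads_per_chunk` reads when the total is not an exact multiple.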