C.preprocessor

class clibas.datapreprocessors.DataPreprocessor(*args)

Bases: Handler

Preprocessing tools for machine learning dataset preparation.

Provides methods for filtering, featurization, sampling, and data augmentation of sequence datasets. Operations transform Data objects for downstream machine learning workflows. Typically accessed through the clibas facade after initialization.

Example

>>> import clibas as C
>>> C.initialize_from_config('config.yaml')
>>> # Preprocessor is now ready to use
>>> # as C.preprocessor

Note

This class is not typically instantiated directly. Use the clibas initialization system to access preprocessing functionality.
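
Every method documented on this page returns a callable that takes a Data object and returns one, so operations can be applied one after another. The sketch below shows only that general pattern. The helper name `chain` is hypothetical and not part of clibas, and plain lists stand in for Data objects; clibas may provide its own pipeline mechanism.

```python
from functools import reduce


def chain(*operations):
    """Compose operations left to right into a single callable (hypothetical helper)."""
    def run(data):
        # Feed the output of each operation into the next one.
        return reduce(lambda current, op: op(current), operations, data)
    return run


# Stand-in operations on a plain list of strings; real operations
# would accept and return a Data object instead.
def drop_empty(seqs):
    return [s for s in seqs if s]


def deduplicate(seqs):
    return sorted(set(seqs))


pipeline = chain(drop_empty, deduplicate)
result = pipeline(["ACDE", "", "GHIK", "ACDE"])
print(result)
```

Note that an operation returning None (for example to_h5 with return_data=False) would have to come last in such a chain.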

Methods

token_filter

DataPreprocessor.token_filter(tokens_to_filter_by=None)

Filter sequences containing specific tokens.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Removes all sequences (X array entries) that contain any of the specified tokens. Useful for filtering out sequences with ambiguous or unwanted characters.

Parameters:

tokens_to_filter_by (list, tuple, or ndarray) – Single-letter encoded tokens to filter out (e.g., amino acids, bases).

Returns:

Operation that accepts a Data object, filters sequences containing specified tokens, and returns the modified Data object.

Return type:

callable

Example

>>> # remove sequences containing 'X' or 'Z' amino acids
>>> token_filt = C.preprocessor.token_filter(tokens_to_filter_by=['X', 'Z'])
>>> data = token_filt(data)
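
To make the filtering rule concrete, here is a minimal NumPy sketch of the same idea applied to a bare 2D array of single-letter tokens. The function name `drop_rows_with_tokens` is hypothetical; this is an illustration of the row-level rule, not the clibas implementation.

```python
import numpy as np


def drop_rows_with_tokens(X, tokens):
    """Keep only the rows of X that contain none of the given tokens."""
    X = np.asarray(X)
    # np.isin marks every cell holding an unwanted token;
    # any(axis=1) flags rows with at least one such cell.
    has_token = np.isin(X, list(tokens)).any(axis=1)
    return X[~has_token]


X = np.array([list("ACDE"), list("AXDE"), list("GHIK"), list("GZIK")])
filtered = drop_rows_with_tokens(X, ["X", "Z"])
print(filtered)
```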

intrasample_unique

DataPreprocessor.intrasample_unique()

Remove duplicate sequences within each sample.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Removes duplicate entries within each sample’s X dataset. Equivalent to calling np.unique(X, axis=0) on each sample, so entries are re-sorted in the process and the original row order is not preserved.

Returns:

Operation that accepts a Data object, removes intra-sample duplicates, and returns the modified Data object.

Return type:

callable

Example

>>> unique_op = C.preprocessor.intrasample_unique()
>>> data = unique_op(data)
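
The effect of np.unique(X, axis=0) on a single array can be seen in the short sketch below. The toy array is made up for illustration; the point is that duplicates are dropped and the surviving rows come back in sorted order.

```python
import numpy as np

# Toy X array with two duplicated rows, deliberately not in sorted order.
X = np.array([
    list("GHIK"),
    list("ACDE"),
    list("GHIK"),
    list("ACDE"),
    list("ACDF"),
])

# Row-wise deduplication; rows are returned sorted lexicographically.
unique_X = np.unique(X, axis=0)
print(unique_X)
```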

intersample_unique

DataPreprocessor.intersample_unique()

Remove sequences found in multiple samples.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Compares X arrays across all samples and removes entries that appear in more than one sample. Only sequences unique to a single sample are retained.

Returns:

Operation that accepts a Data object, removes inter-sample duplicates, and returns the modified Data object.

Return type:

callable

Note

If X arrays have different widths, they will be padded to the maximum width before comparison.

Example

>>> intersample_op = C.preprocessor.intersample_unique()
>>> data = intersample_op(data)
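
The following sketch illustrates the rule on plain NumPy arrays: pad every array to the maximum width, then keep only rows that occur in exactly one sample. The function names, the empty-string pad token, and right-side padding are assumptions made for the example, not details taken from clibas.

```python
from collections import Counter

import numpy as np


def pad_to_width(X, width, pad=""):
    """Right-pad a 2D token array to the given width (padding side is an assumption)."""
    X = np.asarray(X)
    out = np.full((X.shape[0], width), pad, dtype=X.dtype)
    out[:, : X.shape[1]] = X
    return out


def keep_sample_specific(samples, pad=""):
    """Return padded arrays containing only rows found in a single sample."""
    width = max(np.asarray(X).shape[1] for X in samples)
    padded = [pad_to_width(X, width, pad) for X in samples]

    # Count in how many samples each distinct row occurs.
    # A set per sample ensures within-sample duplicates count once.
    counts = Counter()
    for X in padded:
        counts.update({tuple(row) for row in X.tolist()})

    kept = []
    for X in padded:
        mask = np.array([counts[tuple(row)] == 1 for row in X.tolist()], dtype=bool)
        kept.append(X[mask])
    return kept


a = np.array([list("ACD"), list("GHI")])
b = np.array([list("ACDE"), ["G", "H", "I", ""]])
a_only, b_only = keep_sample_specific([a, b])
print(a_only)
print(b_only)
```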

filter_external

DataPreprocessor.filter_external(external=None, max_hd=None)

Filter sequences similar to an external dataset.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Removes sequences that are within a specified Hamming distance of any sequence in an external dataset. Useful for removing validation/test sequences or known contaminants from training data.

Parameters:
  • external – External dataset to compare against. Should be castable to np.ndarray or compatible with AnalysisSample.

  • max_hd (int) – Maximum Hamming distance threshold. Sequences with distance ≤ max_hd from any external sequence are removed.

Returns:

Operation that accepts a Data object, filters sequences similar to external dataset, and returns the modified Data object.

Return type:

callable

Example

>>> # remove sequences similar to validation set
>>> external_filt = C.preprocessor.filter_external(
...     external=validation_sequences, max_hd=2
... )
>>> data = external_filt(data)
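
As an illustration of the Hamming-distance criterion, the sketch below removes every row that lies within max_hd mismatches of any external row. It assumes both arrays have the same width and fit in memory as a broadcast comparison; the function name `filter_by_hamming` is hypothetical and the clibas implementation may differ.

```python
import numpy as np


def filter_by_hamming(X, external, max_hd):
    """Drop rows of X whose Hamming distance to any external row is <= max_hd."""
    X = np.asarray(X)
    external = np.asarray(external)
    # Broadcast to shape (n_X, n_external, width) and count mismatching positions.
    hd = (X[:, None, :] != external[None, :, :]).sum(axis=2)
    # A row is removed if it is close to at least one external sequence.
    too_close = (hd <= max_hd).any(axis=1)
    return X[~too_close]


X = np.array([list("ACDE"), list("ACDF"), list("GHIK"), list("WWWW")])
external = np.array([list("ACDE")])
kept = filter_by_hamming(X, external, max_hd=1)
print(kept)
```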

merge

DataPreprocessor.merge()

Merge all samples into a single dataset.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Combines all samples in the Data object into a single merged sample. The merged sample is named ‘merged_data’.

Returns:

Operation that accepts a Data object, merges all samples, and returns the modified Data object containing a single sample.

Return type:

callable

Example

>>> merge_op = C.preprocessor.merge()
>>> data = merge_op(data)
>>> # data now contains a single merged sample

sample

DataPreprocessor.sample(sample_size=None)

Randomly sample from each dataset.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Randomly samples a specified number or fraction of sequences from each sample in the dataset. Applied independently to each sample.

Parameters:

sample_size (int or float) – Number or fraction of sequences to sample. If ≤ 1, interpreted as fraction of dataset to keep. If > 1, interpreted as absolute number of sequences to sample.

Returns:

Operation that accepts a Data object, samples from each dataset, and returns the modified Data object.

Return type:

callable

Example

>>> # sample 50% of each dataset
>>> sample_op = C.preprocessor.sample(sample_size=0.5)
>>> data = sample_op(data)
>>>
>>> # sample exactly 1000 sequences from each dataset
>>> sample_op = C.preprocessor.sample(sample_size=1000)
>>> data = sample_op(data)
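
The fraction-versus-count rule for sample_size can be sketched on a bare array as follows. Sampling without replacement, rounding the fractional size, and capping the count at the dataset size are assumptions of this example; the function name `sample_rows` is hypothetical.

```python
import numpy as np


def sample_rows(X, sample_size, rng):
    """Randomly pick rows of X; sample_size <= 1 is a fraction, > 1 is a count."""
    X = np.asarray(X)
    n = X.shape[0]
    if sample_size <= 1:
        k = int(round(sample_size * n))   # fraction of the dataset (rounding is an assumption)
    else:
        k = min(int(sample_size), n)      # absolute count, capped at n (assumption)
    idx = rng.choice(n, size=k, replace=False)  # without replacement (assumption)
    return X[idx]


rng = np.random.default_rng(0)
X = np.arange(40).reshape(10, 4)
half = sample_rows(X, 0.5, rng)
three = sample_rows(X, 3, rng)
print(half.shape, three.shape)
```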

shuffle

DataPreprocessor.shuffle()

Randomly shuffle sequences within each sample.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Randomly reorders sequences within each sample’s X dataset. Applied independently to each sample.

Returns:

Operation that accepts a Data object, shuffles sequences within each sample, and returns the modified Data object.

Return type:

callable

Example

>>> shuffle_op = C.preprocessor.shuffle()
>>> data = shuffle_op(data)

tt_split

DataPreprocessor.tt_split(test_fraction=None)

Perform train/test split on dataset.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Splits a single sample into separate train and test datasets. The input Data object must contain exactly one sample.

Parameters:

test_fraction (float) – Fraction of data to allocate to test set (between 0 and 1). Remaining data goes to training set.

Returns:

Operation that accepts a Data object with one sample, splits it, and returns a Data object with two samples named ‘train_data’ and ‘test_data’.

Return type:

callable

Raises:

ValueError – If input Data contains more than one sample.

Example

>>> # split into 80% train, 20% test
>>> split_op = C.preprocessor.tt_split(test_fraction=0.2)
>>> data = split_op(data)
>>> # data now contains train_data and test_data samples
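
A train/test split of this kind can be sketched on a single array as shown below. Whether clibas shuffles before splitting and how it rounds the test size are not stated above, so the random permutation and rounding here are assumptions; the function name is hypothetical and only the output names 'train_data' and 'test_data' come from the description.

```python
import numpy as np


def train_test_split_rows(X, test_fraction, rng):
    """Split the rows of X into disjoint train and test subsets."""
    X = np.asarray(X)
    n = X.shape[0]
    n_test = int(round(test_fraction * n))  # rounding rule is an assumption
    perm = rng.permutation(n)               # random assignment is an assumption
    return {
        "train_data": X[perm[n_test:]],
        "test_data": X[perm[:n_test]],
    }


rng = np.random.default_rng(0)
X = np.arange(40).reshape(10, 4)
split = train_test_split_rows(X, test_fraction=0.2, rng=rng)
print(split["train_data"].shape, split["test_data"].shape)
```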

to_h5

DataPreprocessor.to_h5(F=None, alphabet=None, reshape=False, chunks=None, return_data=False)

Featurize sequences and save to HDF5 files.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Converts sequence datasets to numerical feature representations and saves them as HDF5 files. Useful when featurized datasets are too large to fit in memory. Processes data in chunks for memory efficiency.

Parameters:
  • F (str, ndarray, or None) – Feature matrix specification. If None, uses one-hot encoding. String options include ‘varimax’, ‘pep_ECFP3’, ‘pep_ECFP4’, ‘pep_SMILES’. A custom 2D array with F.shape[0] == len(alphabet) can also be provided.

  • alphabet (tuple, list, or ndarray, optional) – Token alphabet. If None, uses peptide alphabet from the config file.

  • reshape (bool) – If True, represents sequences as multidimensional tensors. If False, unrolls to vectors. Default is False.

  • chunks (int) – Process data in chunks of this size for memory efficiency.

  • return_data (bool) – If True, returns unmodified Data object. If False, returns None. Default is False.

Returns:

Operation that accepts a Data object, featurizes sequences to HDF5 files, and returns Data object or None based on return_data.

Return type:

callable

Note

HDF5 files are saved in the ml_data directory specified in the config file.

Example

>>> to_h5_op = C.preprocessor.to_h5(F='pep_ECFP4', chunks=20, reshape=True)
>>> to_h5_op(data)

featurize_X

DataPreprocessor.featurize_X(F=None, alphabet=None, reshape=False)

Featurize sequence datasets in memory.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Converts sequence datasets to numerical feature representations. Use when featurized datasets fit in memory. For large datasets, use to_h5() instead.

Parameters:
  • F (str, ndarray, or None) – Feature matrix specification. If None, uses one-hot encoding. String options include ‘varimax’, ‘pep_ECFP3’, ‘pep_ECFP4’, ‘pep_SMILES’. A custom 2D array with F.shape[0] == len(alphabet) can also be provided.

  • alphabet (tuple, list, or ndarray, optional) – Token alphabet. If None, uses peptide alphabet from the config file.

  • reshape (bool) – If True, represents sequences as multidimensional tensors. If False, unrolls to vectors. Default is False.

Returns:

Operation that accepts a Data object, featurizes X datasets, and returns the modified Data object.

Return type:

callable

Example

>>> featurize_op = C.preprocessor.featurize_X(F='pep_ECFP4', reshape=True)
>>> data = featurize_op(data)
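
The relationship between F, alphabet, and reshape can be illustrated with a lookup-based sketch: each token is replaced by its row of F, and the result is either kept as a 3D tensor or unrolled to a vector per sequence. The function name `featurize` and the exact output layout are assumptions for illustration; the named string options such as 'pep_ECFP4' are not reproduced here.

```python
import numpy as np


def featurize(X, alphabet, F=None, reshape=False):
    """Map each token to its feature row in F (one-hot if F is None)."""
    alphabet = list(alphabet)
    if F is None:
        F = np.eye(len(alphabet))  # identity matrix gives one-hot encoding
    F = np.asarray(F)
    if F.shape[0] != len(alphabet):
        raise ValueError("F must have one row per alphabet token")

    # Translate tokens to integer indices into the alphabet.
    index = {token: i for i, token in enumerate(alphabet)}
    codes = np.array([[index[t] for t in row] for row in np.asarray(X).tolist()])

    features = F[codes]  # shape: (n_sequences, sequence_length, n_features)
    if reshape:
        return features
    # Unroll each sequence into a single flat feature vector.
    return features.reshape(features.shape[0], -1)


X = np.array([list("ACA"), list("CCA")])
flat = featurize(X, alphabet="AC")
tensor = featurize(X, alphabet="AC", reshape=True)
print(flat.shape, tensor.shape)
```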

featurize_for_RFA

DataPreprocessor.featurize_for_RFA(alphabet=None, order=None)

Featurize sequences for Reference-Free Analysis (RFA) models as implemented in DOMEK workflows. See https://www.cell.com/chem/abstract/S2451-9294(25)00328-6

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Converts peptide sequences to feature representations compatible with RFA models. See https://www.nature.com/articles/s41467-024-51895-5 for methodology details.

Parameters:
  • alphabet (list, tuple, or ndarray) – Amino acid alphabet comprising the sequences. Alphabet size determines feature dimensions.

  • order (str) – RFA model order. Must be ‘first’ or ‘second’. Higher orders not yet implemented.

Returns:

Operation that accepts a Data object, featurizes sequences for RFA, and returns the modified Data object with flattened 2D feature arrays.

Return type:

callable

Example

>>> rfa_op = C.preprocessor.featurize_for_RFA(alphabet='aa', order='second')
>>> data = rfa_op(data)

drop

DataPreprocessor.drop(sample_to_drop=None)

Remove a sample from the dataset by name.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Removes the specified sample from the Data object. If sample name is not found, logs a warning and returns data unchanged.

Parameters:

sample_to_drop (str) – Name of the sample to remove.

Returns:

Operation that accepts a Data object, removes the specified sample, and returns the modified Data object.

Return type:

callable

Example

>>> drop_op = C.preprocessor.drop(sample_to_drop='validation_data')
>>> data = drop_op(data)

pad_and_random_shift

DataPreprocessor.pad_and_random_shift(new_x_dim=None)

Expand and randomly shift sequences for data augmentation.

Note

Pipeline Operation - Returns a callable for use in processing pipelines.

Pads each sequence to a fixed width and applies a random circular shift. The shift is constrained so that non-pad values remain within bounds, effectively redistributing padding without truncating sequence data. Useful for data augmentation and positional invariance.

Parameters:

new_x_dim (int) – Target width for padded sequences. Must be ≥ current sequence width.

Returns:

Operation that accepts a Data object, pads and shifts sequences, and returns the modified Data object.

Return type:

callable

Example

If new_x_dim=6, a sequence array:

[['A', 'B', 'C', 'D'],
 ['E', 'F', 'G', 'H'],
 ['I', 'J', 'K', 'L'],
 ['M', 'N', 'O', 'P']]

might become:

[['', '', 'A', 'B', 'C', 'D'],
 ['', '', 'E', 'F', 'G', 'H'],
 ['I', 'J', 'K', 'L', '', ''],
 ['', 'M', 'N', 'O', 'P', '']]

with padding randomly distributed on either side.

>>> augment_op = C.preprocessor.pad_and_random_shift(new_x_dim=20)
>>> data = augment_op(data)
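
The padding-and-shift behavior in the example above can be approximated with the NumPy sketch below. It assumes every input row is fully occupied (no existing pad tokens) and uses an empty string as the pad token; under that assumption a constrained circular shift is the same as placing each row at a random offset inside the wider array. The function name `pad_and_shift` is hypothetical.

```python
import numpy as np


def pad_and_shift(X, new_x_dim, rng, pad=""):
    """Pad rows of X to new_x_dim and place each row at a random offset."""
    X = np.asarray(X)
    n, width = X.shape
    if new_x_dim < width:
        raise ValueError("new_x_dim must be >= current sequence width")

    out = np.full((n, new_x_dim), pad, dtype=X.dtype)
    # One offset per row, between 0 and the amount of added padding (inclusive),
    # so no sequence data is ever pushed out of bounds.
    offsets = rng.integers(0, new_x_dim - width + 1, size=n)
    cols = offsets[:, None] + np.arange(width)[None, :]
    out[np.arange(n)[:, None], cols] = X
    return out


rng = np.random.default_rng(0)
X = np.array([list("ABCD"), list("EFGH"), list("IJKL"), list("MNOP")])
shifted = pad_and_shift(X, new_x_dim=6, rng=rng)
print(shifted)
```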