5. Yeast display: chymotrypsin selection

This notebook demonstrates the use of clibas to analyze yeast display library selections using data from Angelini and co-workers as an example. We will try to reproduce their results, which can be found in S. Linciano et al. Screening macrocyclic peptide libraries by yeast display allows control of selection process and affinity ranking. Nat. Commun. 2025, 16, 5367

The authors conducted affinity selection of mono- and bicyclic peptides against five structurally diverse targets using yeast surface display. For this example, we will analyze only the selection against α-chymotrypsin, which is denoted as PT5 in the paper.

Disulfide bond formation was used to achieve peptide macrocyclization using the standard genetic code (no genetic code reprogramming). The peptide library was constructed by mixing 5 sub-libraries each containing peptides with either 2 or 3 fixed cysteine residues in a 9-15-residue insert. The 3-Cys peptides may bear an extra Cys in the random insert, and therefore will form bicyclic disulfide-bonded structures.

NGS data was provided in a single-end read .fastq format, and the reads look like this:

001 angelini v2.png

Reads are barcoded by 12-nt barcodes at the 5’-end; there are 5 barcodes in total, and the PT5-derived library is barcoded by bc2 (highlighted in blue in the figure above). It’s not clear what other barcodes represent, but this is not important for us here.

The ORF of interest is flanked by a longer constant DNA region that is also not important for our purposes. In the paper, the authors report trimming the reads to the 5- and 6-nt anchor sequences highlighted in red in the figure above prior to conducting downstream analysis. The enclosed region encodes the variable region sequence as well as a few constant region amino acids (EAC at the N-terminal side, and CSG at the C-terminal side). We can process the reads similarly, and write the library designs accordingly:

003 angelini v2.png

The corresponding config file then looks like this:

experiment: "Angelini_yeast_selections"

LibraryDesigns:
  dna_templates:
    - "111111111111GGAAGAAGGTGTTCAATTGGACAAGAGAGAAGCTTGT111111111111111111111TGTTCCGGTGGTGGTGGCTCTGGTGGCGGTGGTAGCGGAGGTGGTGGTAGTTACCCATACGA"
    - "111111111111GGAAGAAGGTGTTCAATTGGACAAGAGAGAAGCTTGT111111111111111111111111111TGTTCCGGTGGTGGTGGCTCTGGTGGCGGTGGTAGCGGAGGTGGTGGTAGTTACCCATACGA"
    - "111111111111GGAAGAAGGTGTTCAATTGGACAAGAGAGAAGCTTGT111111111111111111111111111111111111111TGTTCCGGTGGTGGTGGCTCTGGTGGCGGTGGTAGCGGAGGTGGTGGTAGTTACCCATACGA"
    - "111111111111GGAAGAAGGTGTTCAATTGGACAAGAGAGAAGCTTGT111111111111111111111111111111111111111TGTTCCGGTGGTGGTGGCTCTGGTGGCGGTGGTAGCGGAGGTGGTGGTAGTTACCCATACGA"
    - "111111111111GGAAGAAGGTGTTCAATTGGACAAGAGAGAAGCTTGT111111111111111111111111111111111111111TGTTCCGGTGGTGGTGGCTCTGGTGGCGGTGGTAGCGGAGGTGGTGGTAGTTACCCATACGA"

  dna_monomers:
    1: ["A", "G", "T", "C"] #N

  pep_templates:
    - "E3211111112SG"
    - "E321111111112SG"
    - "E3211121111111112SG"
    - "E3211111121111112SG"
    - "E3211111111121112SG"

  pep_monomers:
    1: ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]
    2: ["C"]
    3: ["A"]

FastqParserConfig:
  sample_barcodes:
    bc1: "CACAGAGATACT"
    bc2: "AGTGTACACTGC"
    bc3: "GACACTGATGGT"
    bc4: "CCTTACGATGTG"
    bc5: "TCTGAGGATCCA"

TrackerConfig:
  logs: "./outputs/Angelini_yeast_v2"                 # Directory for writing logs to
  parser_out: "./outputs/Angelini_yeast_v2"           # Directory that stores fastq parser outputs
  analysis_out: "./outputs/Angelini_yeast_v2"         # Directory that stores outputs of data analysis operations

LoggerConfig:
  verbose: true                     # Verbose loggers print to the console
  log_to_file: true                 # Write logs to file
  level: "INFO"                     # Logger level; accepted values: "DEBUG", "INFO", "WARNING", "ERROR"

The key feature of this config is the FastqParserConfig element that holds the barcode information:

FastqParserConfig:
  sample_barcodes:
    bc1: "CACAGAGATACT"
    bc2: "AGTGTACACTGC"
    bc3: "GACACTGATGGT"
    bc4: "CCTTACGATGTG"
    bc5: "TCTGAGGATCCA"

We will just try to reproduce the original paper without doing any additional analyses – refer to earlier examples for this. clibas provides a function called C.fastq_parser.demultiplex_sample_barcodes which allows us to split a .fastq file into several sub-samples by their respective barcodes. This operation takes barcode_loc as a keyword: it points to a region in a DNA library design that encodes the barcode. For this reason, we indicate the first 12-nt as a standalone variable region in the designs above: so that we can point to it to match the barcodes as barcode_loc=[0]. This operation has to be called before read trimming, because the barcode information will be lost after that point.

Next, to (roughly) reproduce the original paper, we trim the reads, do their Q-score filtering (in the paper, reads containing Q scores below 18 are discarded), and run in silico translation. Trimmed reads don’t have a start codon, so we have to “force” translation to start. Because the trimming is done in-frame with the ORF, we can set force_at_frame=0 for C.fastq_parser.translate operation.

After, peptide products are filtered by length: only the ORFs between 11- and 19-residues-long are retained in the paper. Here, we can just call C.fastq_parser.len_filter(where='pep') to discard all translation products that do not comply with our design specifications.

Finally, just like in the paper, we clip peptide sequences to regions 0 and 1 (effectively removing the C-terminal SG), and count them, writing the outputs to a .csv file.

The entire pipeline is ~12 lines of code:

[1]:
import clibas as C

C.initialize(config_path="Angelini_yeast_config.yaml")

C.pipeline.enque(
    [
        C.fastq_parser.demultiplex_sample_barcodes(barcode_loc=[0], barcode_tol=0),
        C.fastq_parser.trim_reads(left="GAAGC", right="TCCGGT", tol=0),
        C.fastq_parser.q_score_filt(minQ=18, loc=None),
        C.fastq_parser.translate(force_at_frame=0, stop_readthrough=False),
        C.fastq_parser.len_filter(where="pep"),
        C.fastq_parser.fetch_at(where="pep", loc=[0, 1]),
        C.fastq_parser.count_summary(where="pep", top_n=None, fmt="csv"),
    ]
)

loader = C.data_loader.fetch_fastq_from_dir(
    data_dir="./sequencing_data/Angelini_yeast/PT5"
)
data = C.pipeline.load_and_run(loader=loader, save_summary=True)
[INFO]: <clibas> succesfully loaded config and is ready. . .
[INFO]: 7 ops appended to pipeline; current queue size: 7
[INFO]: Queuing <fetch_dir_fastq> op. . .
[INFO]: Fetching Angelini_Yeast_PT5.fastq. . .
[INFO]: The operation took 1.178 s
[INFO]: Angelini_Yeast_PT5 dataset size: 723231
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <split_by_barcode> op. . .
[INFO]: The operation took 4.656 s
[INFO]: Angelini_Yeast_PT5_bc1 dataset size: 139216
[INFO]: Angelini_Yeast_PT5_bc2 dataset size: 101324
[INFO]: Angelini_Yeast_PT5_bc3 dataset size: 170597
[INFO]: Angelini_Yeast_PT5_bc4 dataset size: 182184
[INFO]: Angelini_Yeast_PT5_bc5 dataset size: 112118
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <trim_exp> op. . .
[INFO]: The operation took 1.374 s
[INFO]: Angelini_Yeast_PT5_bc1 dataset size: 139216
[INFO]: Angelini_Yeast_PT5_bc2 dataset size: 101324
[INFO]: Angelini_Yeast_PT5_bc3 dataset size: 170597
[INFO]: Angelini_Yeast_PT5_bc4 dataset size: 182184
[INFO]: Angelini_Yeast_PT5_bc5 dataset size: 112118
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <q_score_filter> op. . .
[INFO]: The operation took 0.447 s
[INFO]: Angelini_Yeast_PT5_bc1 dataset size: 55132
[INFO]: Angelini_Yeast_PT5_bc2 dataset size: 45392
[INFO]: Angelini_Yeast_PT5_bc3 dataset size: 51595
[INFO]: Angelini_Yeast_PT5_bc4 dataset size: 47350
[INFO]: Angelini_Yeast_PT5_bc5 dataset size: 53989
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <translate_dna> op. . .
[INFO]: The operation took 1.274 s
[INFO]: Angelini_Yeast_PT5_bc1 dataset size: 55132
[INFO]: Angelini_Yeast_PT5_bc2 dataset size: 45392
[INFO]: Angelini_Yeast_PT5_bc3 dataset size: 51595
[INFO]: Angelini_Yeast_PT5_bc4 dataset size: 47350
[INFO]: Angelini_Yeast_PT5_bc5 dataset size: 53989
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <length_filter> op. . .
[INFO]: The operation took 0.202 s
[INFO]: Angelini_Yeast_PT5_bc1 dataset size: 53330
[INFO]: Angelini_Yeast_PT5_bc2 dataset size: 41794
[INFO]: Angelini_Yeast_PT5_bc3 dataset size: 46750
[INFO]: Angelini_Yeast_PT5_bc4 dataset size: 46126
[INFO]: Angelini_Yeast_PT5_bc5 dataset size: 7971
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <fetch_region> op. . .
[INFO]: The operation took 0.06 s
[INFO]: Angelini_Yeast_PT5_bc1 dataset size: 53330
[INFO]: Angelini_Yeast_PT5_bc2 dataset size: 41794
[INFO]: Angelini_Yeast_PT5_bc3 dataset size: 46750
[INFO]: Angelini_Yeast_PT5_bc4 dataset size: 46126
[INFO]: Angelini_Yeast_PT5_bc5 dataset size: 7971
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <fastq_count_summary> op. . .
[INFO]: The operation took 0.747 s
[INFO]: Angelini_Yeast_PT5_bc1 dataset size: 53330
[INFO]: Angelini_Yeast_PT5_bc2 dataset size: 41794
[INFO]: Angelini_Yeast_PT5_bc3 dataset size: 46750
[INFO]: Angelini_Yeast_PT5_bc4 dataset size: 46126
[INFO]: Angelini_Yeast_PT5_bc5 dataset size: 7971
[INFO]: -----------------------------------------------------------------

Let’s load the resulting peptide count files and compare:

[6]:
from pathlib import Path

import pandas as pd

out_dir = C.fastq_parser.dirs.logs
sample_name = "Angelini_Yeast_PT5_bc2"
out_dir = Path(out_dir) / sample_name

fname = next(Path(out_dir).glob(f"{sample_name}_pep_count_summary*.csv"))
df = pd.read_csv(fname)
df.head(20).style.set_table_styles(
    [{"selector": "th, td", "props": [("font-family", "Consolas, monospace")]}]
)
[6]:
  Index Peptide DNA pep count Dataset %
0 1 EACRDWTWIQC GAAGCTTGTCGTGATTGGACTTGGATTCAGTGTTCCGGT 10119 24.211609
1 2 EACRDWTWVSC GAAGCTTGTAGGGATTGGACGTGGGTTTCGTGTTCCGGT 7988 19.112791
2 3 EACHRWGVRFSLC GAAGCTTGTCATCGTTGGGGGGTGAGGTTTTCGTTGTGTTCCGGT 2194 5.249557
3 4 EACLHVCLERGQCLFMC GAAGCTTGTCTGCATGTGTGTTTGGAGCGTGGGCAGTGTCTGTTTATGTGTTCCGGT 1991 4.763842
4 5 EACRDWTWVAC GAAGCTTGTCGTGATTGGACTTGGGTGGCGTGTTCCGGT 1977 4.730344
5 6 EACCYRIYCPC GAAGCTTGTTGTTATAGGATTTATTGTCCGTGTTCCGGT 1564 3.742164
6 7 EACAWCLRGTICC GAAGCTTGTGCTTGGTGTCTTAGGGGTACGATTTGTTGTTCCGGT 1302 3.115280
7 8 EACHSWFTRMSAC GAAGCTTGTCATTCGTGGTTTACGAGGATGAGTGCTTGTTCCGGT 1114 2.665454
8 9 EACLYMCARTRPCNALC GAAGCTTGTTTGTATATGTGTGCGCGTACTCGTCCGTGTAATGCGTTGTGTTCCGGT 1097 2.624779
9 10 EACAFCIGRPICC GAAGCTTGTGCTTTTTGTATTGGTAGGCCTATTTGTTGTTCCGGT 1078 2.579318
10 11 EACTCLEMRSYMCDIRC GAAGCTTGTACGTGTCTGGAGATGAGGTCTTATATGTGTGATATTCGGTGTTCCGGT 922 2.206058
11 12 EACRPVCLDSMVCLFMC GAAGCTTGTCGTCCTGTTTGTTTGGATAGTATGGTTTGTTTGTTTATGTGTTCCGGT 599 1.433220
12 13 EACVPICLADLRCLFMC GAAGCTTGTGTGCCTATTTGTCTGGCGGATCTTAGGTGTTTGTTTATGTGTTCCGGT 557 1.332727
13 14 EACTCWGTHRYLCHVTC GAAGCTTGTACTTGTTGGGGTACGCATCGTTATTTGTGTCATGTTACTTGTTCCGGT 501 1.198737
14 15 EACARCFAINEMCELMC GAAGCTTGTGCTCGTTGTTTTGCTATTAATGAGATGTGTGAGCTTATGTGTTCCGGT 485 1.160454
15 16 EACHVWGERFRRC GAAGCTTGTCATGTTTGGGGTGAGAGGTTTCGTAGGTGTTCCGGT 455 1.088673
16 17 EACLLRCWGRRGCSIRC GAAGCTTGTTTGCTGCGTTGTTGGGGTAGGCGTGGTTGTAGTATTAGGTGTTCCGGT 422 1.009714
17 18 EACPCQLGSMAVCFRFC GAAGCTTGTCCTTGTCAGCTTGGGAGTATGGCTGTTTGTTTTCGGTTTTGTTCCGGT 421 1.007322
18 19 EACCYMDFINLLCEQRC GAAGCTTGTTGTTATATGGATTTTATTAATCTTCTTTGTGAGCAGCGTTGTTCCGGT 409 0.978609
19 20 EACVYMLPDQC GAAGCTTGTGTGTATATGTTGCCTGATCAGTGTTCCGGT 342 0.818299

These results are not identical, but are very similar to what is reported in the paper; minor differences likely stem from the fact that filtering operation are not commutative, and a different order of operations can lead to different outcomes. Either way, our results are qualititatively nearly identical. Compare:

[10]:
# Translated_BC2 chemo_GOOD.txt is taken directly from Angelini and co-workers
paper_df = pd.read_csv("Translated_BC2 chemo_GOOD.txt", header=None, sep=r"\s+")
paper_df.head(20).style.set_table_styles(
    [{"selector": "th, td", "props": [("font-family", "Consolas, monospace")]}]
)
[10]:
  0 1 2
0 EACRDWTWIQC 10202 GAAGCTTGTCGTGATTGGACTTGGATTCAGTGTT
1 EACRDWTWVSC 8033 GAAGCTTGTAGGGATTGGACGTGGGTTTCGTGTT
2 EACHRWGVRFSLC 2229 GAAGCTTGTCATCGTTGGGGGGTGAGGTTTTCGTTGTGTT
3 EACRDWTWVAC 1999 GAAGCTTGTCGTGATTGGACTTGGGTGGCGTGTT
4 EACLHVCLERGQCLFMC 1977 GAAGCTTGTCTGCATGTGTGTTTGGAGCGTGGGCAGTGTCTGTTTATGTGTT
5 EACCYRIYCPC 1589 GAAGCTTGTTGTTATAGGATTTATTGTCCGTGTT
6 EACAWCLRGTICC 1317 GAAGCTTGTGCTTGGTGTCTTAGGGGTACGATTTGTTGTT
7 EACHSWFTRMSAC 1135 GAAGCTTGTCATTCGTGGTTTACGAGGATGAGTGCTTGTT
8 EACLYMCARTRPCNALC 1107 GAAGCTTGTTTGTATATGTGTGCGCGTACTCGTCCGTGTAATGCGTTGTGTT
9 EACAFCIGRPICC 1093 GAAGCTTGTGCTTTTTGTATTGGTAGGCCTATTTGTTGTT
10 EACTCLEMRSYMCDIRC 921 GAAGCTTGTACGTGTCTGGAGATGAGGTCTTATATGTGTGATATTCGGTGTT
11 EACRPVCLDSMVCLFMC 612 GAAGCTTGTCGTCCTGTTTGTTTGGATAGTATGGTTTGTTTGTTTATGTGTT
12 EACVPICLADLRCLFMC 567 GAAGCTTGTGTGCCTATTTGTCTGGCGGATCTTAGGTGTTTGTTTATGTGTT
13 EACTCWGTHRYLCHVTC 509 GAAGCTTGTACTTGTTGGGGTACGCATCGTTATTTGTGTCATGTTACTTGTT
14 EACARCFAINEMCELMC 492 GAAGCTTGTGCTCGTTGTTTTGCTATTAATGAGATGTGTGAGCTTATGTGTT
15 EACHVWGERFRRC 466 GAAGCTTGTCATGTTTGGGGTGAGAGGTTTCGTAGGTGTT
16 EACLLRCWGRRGCSIRC 428 GAAGCTTGTTTGCTGCGTTGTTGGGGTAGGCGTGGTTGTAGTATTAGGTGTT
17 EACPCQLGSMAVCFRFC 427 GAAGCTTGTCCTTGTCAGCTTGGGAGTATGGCTGTTTGTTTTCGGTTTTGTT
18 EACCYMDFINLLCEQRC 412 GAAGCTTGTTGTTATATGGATTTTATTAATCTTCTTTGTGAGCAGCGTTGTT
19 EACVYMLPDQC 346 GAAGCTTGTGTGTATATGTTGCCTGATCAGTGTT
[ ]: