2. mRNA display: CCL22 selection¶
This notebook demonstrates the use of clibas to analyze mRNA display library selections using data from Payne and co-workers as an example.
For further details and the source of data, see B. Zhang et al. Discovery of Selective Cyclic D‑Sulfopeptide Ligands of the Chemokine CCL22 via Mirror-Image mRNA Display with Genetic Reprogramming. J. Am. Chem. Soc. 2024, 146, 34253−34259
In the paper, the authors design and screen mRNA display libraries of cyclic peptides containing non-standard sulfo-tyrosine amino acids. Additionally, the cyclization is achieved with the use of N-chloroacetylated amino acids used for translation initiation. The electrophilic chloroacetyl moiety spontaneously cyclizes with a downstream cysteine residue to yield thioether-closed macrocyclic structures. Currently, this is one of the most common cyclization strategies employed in mRNA display selections.
The authors released NGS data for the first seven rounds of selection against mirror image CCL22 (D-CCL22), which was chemically synthesized. The NGS-derived DNA reads look something like this:
The reads, provided in single-end format, cover the entire ORF and start with a defined constant 5’ sequence. This makes their parsing relatively easy. The peptide library design features an N-terminal Cl-Ac-L-Tyr (non-canonical translation initiation), followed by a random region of variable length encoded by an (NNS)x (x=4-15) insert. The C-terminal sequence contains a cysteine residue used for cyclization, and “SALSA” as a linker. This design allows us to write DNA and peptide library templates for a clibas pipeline like this:
Note that the N-terminal Cl-Ac-L-Tyr could be broken into a separate constant region of length 1, and the cysteine could instead be specified as part of the C-terminal constant region. In other words, peptide templates can also be specified like this: Y111111CSALSA. It is a matter of preference. Specifying the designs as 21111113SALSA allows us to later call C.fastq_parser.vr_filter(where='pep', loc=[0], sets=[1, 2, 3]) to ensure that all peptide sequences lacking the C-terminal
cysteine are discarded.
To write a .yaml config file for these libraries, we also need to specify a custom translation table where the ATG elongation codon encodes sulfo-tyrosine (denoted as y in this example), and provide the custom_ini_aa to account for the non-standard translation initiator amino acid. Specifying aa_SMILES allows us to later run UMAP/HDBSCAN analysis using extended connectivity fingerprint representations of peptide sequences. If not provided, the operation can still be called
but the peptides will need to be one-hot encoded.
In addition, we specify orf_locator: "AGGAGA.....ATG". This helps the parser locate the ORF at the 5’ end of the reads. This is optional. If not specified, "ATG" will be used to locate ORFs which works fine most of the time. TrackerConfig and LoggerConfig specifications are also optional.
experiment: "Payne_CCL22_selection"
constants:
description: |
Star symbol (*) is internally reserved for stop codons that terminate
translation.
Plus and underscore symbols (+ and _) are internally reserved tokens.
Numerals (1234567890) are internally reserved for library design
specifications. These symbols (123456790+_) should not be used to
encode amino acids.
Other symbols are OK.
translation_table:
ATA: "I"
ATC: "I"
ATT: "I"
ATG: "y"
ACA: "T"
ACC: "T"
ACG: "T"
ACT: "T"
AAC: "N"
AAT: "N"
AAA: "K"
AAG: "K"
AGC: "S"
AGT: "S"
AGA: "R"
AGG: "R"
CTA: "L"
CTC: "L"
CTG: "L"
CTT: "L"
CCA: "P"
CCC: "P"
CCG: "P"
CCT: "P"
CAC: "H"
CAT: "H"
CAA: "Q"
CAG: "Q"
CGA: "R"
CGC: "R"
CGG: "R"
CGT: "R"
GTA: "V"
GTC: "V"
GTG: "V"
GTT: "V"
GCA: "A"
GCC: "A"
GCG: "A"
GCT: "A"
GAC: "D"
GAT: "D"
GAA: "E"
GAG: "E"
GGA: "G"
GGC: "G"
GGG: "G"
GGT: "G"
TCA: "S"
TCC: "S"
TCG: "S"
TCT: "S"
TTC: "F"
TTT: "F"
TTA: "L"
TTG: "L"
TAC: "Y"
TAT: "Y"
TAA: "*"
TAG: "*"
TGC: "C"
TGT: "C"
TGA: "*"
TGG: "W"
custom_ini_aa: "Y"
aa_SMILES:
A: "N[C@@H](C)C(=O)"
C: "N[C@@H](CS)C(=O)"
D: "N[C@@H](CC(=O)O)C(=O)"
E: "N[C@@H](CCC(=O)O)C(=O)"
F: "N[C@@H](Cc1ccccc1)C(=O)"
G: "NCC(=O)"
H: "N[C@@H](Cc1c[nH]cn1)C(=O)"
I: "N[C@@H]([C@H](CC)C)C(=O)"
K: "N[C@@H](CCCCN)C(=O)"
L: "N[C@@H](CC(C)C)C(=O)"
N: "N[C@@H](CC(=O)N)C(=O)"
P: "O=C[C@@H]1CCCN1"
Q: "N[C@@H](CCC(=O)N)C(=O)"
R: "N[C@@H](CCCNC(=N)N)C(=O)"
S: "N[C@@H](CO)C(=O)"
T: "N[C@@H]([C@H](O)C)C(=O)"
V: "N[C@@H](C(C)C)C(=O)"
W: "N[C@@H](Cc1c[nH]c2c1cccc2)C(=O)"
Y: "N[C@@H](Cc1ccc(O)cc1)C(=O)"
y: "N[C@@H](CC1=CC=C(OS(O)(=O)=O)C=C1)C=O"
complement_table:
- "ATCGN"
- "TAGCN"
LibraryDesigns:
dna_templates:
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112112112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112112112112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112112112112112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112112112112112112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112112112112112112112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112112112112112112112112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
- "TAATACGACTCACTATAGGGTTAACTTTAACAAGGAGAAAAACATG112112112112112112112112112112112112112112112TGCTCCGCGTTATCTGCATAGGACGGGGGGCGGAAA"
dna_monomers:
1: ["A", "G", "T", "C"] #N
2: ["G", "C"] #S
pep_templates:
- "211113SALSA"
- "2111113SALSA"
- "21111113SALSA"
- "211111113SALSA"
- "2111111113SALSA"
- "21111111113SALSA"
- "211111111113SALSA"
- "2111111111113SALSA"
- "21111111111113SALSA"
- "211111111111113SALSA"
- "2111111111111113SALSA"
- "21111111111111113SALSA"
pep_monomers:
1: ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "y", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]
2: ["Y"]
3: ["C"]
TrackerConfig:
logs: "./outputs/Payne_CCL22" # Directory for writing logs to
parser_out: "./outputs/Payne_CCL22" # Directory that stores fastq parser outputs
analysis_out: "./outputs/Payne_CCL22" # Directory that stores outputs of data analysis operations
LoggerConfig:
verbose: true # Verbose loggers print to the console
log_to_file: false # Write logs to file
level: "INFO" # Logger level; accepted values: "DEBUG", "INFO", "WARNING", "ERROR"
FastqParserConfig:
orf_locator: "AGGAGA.....ATG" # A regex pattern that has to match to initiate the ORF
To build a pipeline for these designs, we first intialize clibas facade using this config file. In this case, the Payne_CCL22_config.yaml is placed in the same directory as this notebook.
[1]:
import clibas as C
C.initialize(config_path='Payne_CCL22_config.yaml')
[INFO]: <clibas> succesfully loaded config and is ready. . .
Now we can build the pipeline. We trim the reads at the 3’ (optional), translate them to DNA right after, and process from there. A few things to note:
we collect DNA/peptide lengths before doing any filtering to see an unbiased distribution of reads.
thecount_summaryoperation is called multiple times:
first, before doing any filtering to see the “raw” composition of the library at the peptide level
then at the end, after removing noise, to generate both
.csvand.fastafiles for the most enriched peptides.
we call
umap_hdbscan_analysisto see how the resulting peptides cluster after all processing is done. It is discussed in more detail below.
[2]:
C.pipeline.enque(
[
# trim DNA reads
C.fastq_parser.trim_reads(left='', right='GGCGGAAA', tol=1),
# translate DNA -> peptide, use the ORF_locator to find ORFs
C.fastq_parser.translate(stop_readthrough=False),
# length analysis of trimmed DNA reads and the resulting peptides
# we do it before filtering any reads to get an unbiased view
C.analysis_tools.length_analysis(where='pep'),
C.analysis_tools.length_analysis(where='dna'),
# analyze average Phred quality scores
# we do it before filtering any reads to get an unbiased view
C.analysis_tools.q_score_analysis(loc=None),
# count the peptides before doing any filtrtion
C.fastq_parser.count_summary(where='pep', top_n=1000, fmt='csv'),
# discard the peptides of incorrect length (incompatible with library designs)
C.fastq_parser.len_filter(where='pep'),
# discard the peptides with incorrect/overly mutated constant regions
# up to 3 mutations are tolerated
C.fastq_parser.cr_filter(where='pep', loc=[1], tol=3),
# discard the peptides contained unallowed monomers in the variable region
C.fastq_parser.vr_filter(where='pep', loc=[0], sets=[1, 2, 3]),
# discard the DNA with average Q scores below 30 in region 0 (random insert region)
C.fastq_parser.q_score_filt(avgQ=30, loc=[0]),
# clip peptide sequences to region 0 (random insert region)
C.fastq_parser.fetch_at(where='pep', loc=[0]),
# discard all peptides containing any ambiguous amino acids
# stemming from N nucleotide base calling etc
C.fastq_parser.filt_ambiguous(where='pep'),
# analyze the convergence of the peptide library at the sequence level
C.analysis_tools.sequence_convergence_analysis(where='pep'),
# analyze the convergence of the peptide/DNA library at the token level
# by token, we mean amino acid/nucleotide: basically positional freq analysis
C.analysis_tools.token_convergence_analysis(where='pep', loc=None),
C.analysis_tools.token_convergence_analysis(where='dna', loc=None),
# unpad all arrays: optional, maybe used to free up some memory; not really needed in this case
C.fastq_parser.unpad(),
# save the peptide datasets as numpy .npy files - if any downstream analysis in python is envisioned
C.fastq_parser.save(where='pep', fmt='npy'),
# count peptides/DNA again - this time after all filtration ops have been called
C.fastq_parser.count_summary(where='pep', top_n=1000, fmt='csv'),
C.fastq_parser.count_summary(where='pep', top_n=100, fmt='fasta'),
C.fastq_parser.count_summary(where='dna', top_n=1000, fmt='csv'),
C.fastq_parser.count_summary(where='dna', top_n=100, fmt='fasta'),
# make a table showing the number of reads belonging to each individual
# library template for every sample in the batch
C.fastq_parser.library_design_match(where='pep'),
# make one joint peptide count summary - track peptides from sample to sample
C.fastq_parser.dataset_wide_count_summary(where='pep', top_n=2000),
# run UMAP/HDBSCAN analysis on top 1000 peptides; use ECFP4 representation,
# and embed all samples onto a single manifold for consistency
C.analysis_tools.umap_hdbscan_analysis(top_n=1000,
where='pep',
F='pep_ECFP4',
single_manifold=True,
return_modified=False
)
]
)
[INFO]: 24 ops appended to pipeline; current queue size: 24
Note that at this stage, no data processing has taken place. The pipeline only asserts the validity of the passed arguments, and their consistency with the specified library designs.
To run the analysis, we need to specify a data loader, and then execute the pipeline. In this case, the data is provided as unzipped .fastq files, so we use C.data_loader.fetch_fastq_from_dir to load it (as opposed to C.data_loader.fetch_gz_from_dir which is used to load .fastq.gz files). These .fastq files should be located in a directory as specified by the data_dir keyword: './sequencing_data/Payne_CCL22' in this case. The pipeline will fetch all .fastq files
from this folder.
[3]:
loader = C.data_loader.fetch_fastq_from_dir(data_dir='./sequencing_data/Payne_CCL22')
C.pipeline.load_and_run(loader=loader, save_summary=True)
[INFO]: Queuing <fetch_dir_fastq> op. . .
[INFO]: Fetching DCCL22_r1.fastq. . .
[INFO]: Fetching DCCL22_r2.fastq. . .
[INFO]: Fetching DCCL22_r3.fastq. . .
[INFO]: Fetching DCCL22_r4.fastq. . .
[INFO]: Fetching DCCL22_r5.fastq. . .
[INFO]: Fetching DCCL22_r6.fastq. . .
[INFO]: Fetching DCCL22_r7.fastq. . .
[INFO]: The operation took 1.745 s
[INFO]: DCCL22_r1 dataset size: 140263
[INFO]: DCCL22_r2 dataset size: 124194
[INFO]: DCCL22_r3 dataset size: 156584
[INFO]: DCCL22_r4 dataset size: 177065
[INFO]: DCCL22_r5 dataset size: 176882
[INFO]: DCCL22_r6 dataset size: 160607
[INFO]: DCCL22_r7 dataset size: 153025
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <trim_exp> op. . .
[INFO]: The operation took 2.601 s
[INFO]: DCCL22_r1 dataset size: 140263
[INFO]: DCCL22_r2 dataset size: 124194
[INFO]: DCCL22_r3 dataset size: 156584
[INFO]: DCCL22_r4 dataset size: 177065
[INFO]: DCCL22_r5 dataset size: 176882
[INFO]: DCCL22_r6 dataset size: 160607
[INFO]: DCCL22_r7 dataset size: 153025
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <translate_dna> op. . .
[INFO]: The operation took 5.298 s
[INFO]: DCCL22_r1 dataset size: 140263
[INFO]: DCCL22_r2 dataset size: 124194
[INFO]: DCCL22_r3 dataset size: 156584
[INFO]: DCCL22_r4 dataset size: 177065
[INFO]: DCCL22_r5 dataset size: 176882
[INFO]: DCCL22_r6 dataset size: 160607
[INFO]: DCCL22_r7 dataset size: 153025
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <length_summary> op. . .
[INFO]: The operation took 1.866 s
[INFO]: DCCL22_r1 dataset size: 140263
[INFO]: DCCL22_r2 dataset size: 124194
[INFO]: DCCL22_r3 dataset size: 156584
[INFO]: DCCL22_r4 dataset size: 177065
[INFO]: DCCL22_r5 dataset size: 176882
[INFO]: DCCL22_r6 dataset size: 160607
[INFO]: DCCL22_r7 dataset size: 153025
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <length_summary> op. . .
[INFO]: The operation took 3.494 s
[INFO]: DCCL22_r1 dataset size: 140263
[INFO]: DCCL22_r2 dataset size: 124194
[INFO]: DCCL22_r3 dataset size: 156584
[INFO]: DCCL22_r4 dataset size: 177065
[INFO]: DCCL22_r5 dataset size: 176882
[INFO]: DCCL22_r6 dataset size: 160607
[INFO]: DCCL22_r7 dataset size: 153025
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <q_score_summary> op. . .
[INFO]: The operation took 5.523 s
[INFO]: DCCL22_r1 dataset size: 140263
[INFO]: DCCL22_r2 dataset size: 124194
[INFO]: DCCL22_r3 dataset size: 156584
[INFO]: DCCL22_r4 dataset size: 177065
[INFO]: DCCL22_r5 dataset size: 176882
[INFO]: DCCL22_r6 dataset size: 160607
[INFO]: DCCL22_r7 dataset size: 153025
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <fastq_count_summary> op. . .
[INFO]: The operation took 0.369 s
[INFO]: DCCL22_r1 dataset size: 140263
[INFO]: DCCL22_r2 dataset size: 124194
[INFO]: DCCL22_r3 dataset size: 156584
[INFO]: DCCL22_r4 dataset size: 177065
[INFO]: DCCL22_r5 dataset size: 176882
[INFO]: DCCL22_r6 dataset size: 160607
[INFO]: DCCL22_r7 dataset size: 153025
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <length_filter> op. . .
[INFO]: The operation took 0.698 s
[INFO]: DCCL22_r1 dataset size: 89709
[INFO]: DCCL22_r2 dataset size: 79399
[INFO]: DCCL22_r3 dataset size: 97393
[INFO]: DCCL22_r4 dataset size: 105638
[INFO]: DCCL22_r5 dataset size: 89824
[INFO]: DCCL22_r6 dataset size: 64076
[INFO]: DCCL22_r7 dataset size: 68564
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <constant_region_filter> op. . .
[INFO]: The operation took 0.218 s
[INFO]: DCCL22_r1 dataset size: 81987
[INFO]: DCCL22_r2 dataset size: 72161
[INFO]: DCCL22_r3 dataset size: 87713
[INFO]: DCCL22_r4 dataset size: 93528
[INFO]: DCCL22_r5 dataset size: 79021
[INFO]: DCCL22_r6 dataset size: 59114
[INFO]: DCCL22_r7 dataset size: 65769
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <variable_region_filter> op. . .
[INFO]: The operation took 1.965 s
[INFO]: DCCL22_r1 dataset size: 80385
[INFO]: DCCL22_r2 dataset size: 70912
[INFO]: DCCL22_r3 dataset size: 86047
[INFO]: DCCL22_r4 dataset size: 91694
[INFO]: DCCL22_r5 dataset size: 77318
[INFO]: DCCL22_r6 dataset size: 58110
[INFO]: DCCL22_r7 dataset size: 64921
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <q_score_filter> op. . .
[INFO]: The operation took 0.334 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <fetch_region> op. . .
[INFO]: The operation took 0.125 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <filter_ambiguous> op. . .
[INFO]: The operation took 2.314 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <sequence_level_convergence_summary> op. . .
[INFO]: The operation took 3.444 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <token_level_convergence_analysis> op. . .
[INFO]: The operation took 20.792 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <token_level_convergence_analysis> op. . .
[INFO]: The operation took 48.9 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <unpad_data> op. . .
[INFO]: The operation took 2.145 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <save_data> op. . .
[INFO]: The operation took 0.071 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <fastq_count_summary> op. . .
[INFO]: The operation took 1.059 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <fastq_count_summary> op. . .
[INFO]: The operation took 1.036 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <fastq_count_summary> op. . .
[INFO]: The operation took 0.5 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <fastq_count_summary> op. . .
[INFO]: The operation took 0.449 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <library_design_summary> op. . .
[INFO]: The operation took 0.019 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <joint_dataset_count_summary> op. . .
[INFO]: The operation took 2.139 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[INFO]: Queuing <umap_hdbscan_summary> op. . .
[INFO]: The operation took 559.987 s
[INFO]: DCCL22_r1 dataset size: 79957
[INFO]: DCCL22_r2 dataset size: 70565
[INFO]: DCCL22_r3 dataset size: 85548
[INFO]: DCCL22_r4 dataset size: 91204
[INFO]: DCCL22_r5 dataset size: 76942
[INFO]: DCCL22_r6 dataset size: 57750
[INFO]: DCCL22_r7 dataset size: 64462
[INFO]: -----------------------------------------------------------------
[3]:
<Data container holding 7 samples>:
<SequencingSample DCCL22_r1 containing 79957 entries>
<SequencingSample DCCL22_r2 containing 70565 entries>
<SequencingSample DCCL22_r3 containing 85548 entries>
<SequencingSample DCCL22_r4 containing 91204 entries>
<SequencingSample DCCL22_r5 containing 76942 entries>
<SequencingSample DCCL22_r6 containing 57750 entries>
<SequencingSample DCCL22_r7 containing 64462 entries>
This example demonstrates the difficulties of reproducing NGS analysis results even when data is made publicly available. Our results are significantly different from what the authors report.
Below is a table from the paper showing the top-25 peptide sequences and their abundance (ignore the lower case sequence notation, it is not important here). The non-standard sulfo-tyrosine is represented as x:
Peptide YFTYWYHTNNTGRC is the top-1 selection hit; it comprises 31.37% of the post-selection library. In our case, YFTYWYHTNNTGRC has only 16.9% abundance and is not even the top-1 hit!
[9]:
from pathlib import Path
import pandas as pd
out_dir = C.fastq_parser.dirs.logs
exp_name = C.fastq_parser.exp_name
sample_name = "DCCL22_r7"
out_dir = Path(out_dir) / sample_name
fnames = list(Path(out_dir).glob(f"{sample_name}_pep_count_summary*.csv"))
final_peptide_list = pd.read_csv(fnames[1])
cols = ['Index', 'Peptide', 'pep count', 'Dataset %']
final_peptide_list[cols].head(25).style.set_table_styles(
[{'selector': 'th, td', 'props': [('font-family', 'Consolas, monospace')]}]
)
[9]:
| Index | Peptide | pep count | Dataset % | |
|---|---|---|---|---|
| 0 | 1 | YYTHWIYTVNPRRC | 14434 | 22.391486 |
| 1 | 2 | YFTYWYHTNNTGRC | 10900 | 16.909187 |
| 2 | 3 | YTSHWYITRVTARC | 4980 | 7.725482 |
| 3 | 4 | YYTFWYYTNVRRRC | 4307 | 6.681456 |
| 4 | 5 | YTWELPSWYLEHYFPTC | 2021 | 3.135180 |
| 5 | 6 | YFAHWLVTYNPARC | 1588 | 2.463467 |
| 6 | 7 | YIYYVNyTTLTyLFTDC | 1316 | 2.041513 |
| 7 | 8 | YETHWFVTYRAELC | 1227 | 1.903447 |
| 8 | 9 | YPTWYTLHYYPLFEC | 1050 | 1.628867 |
| 9 | 10 | YYTFWYFTSEAHKC | 916 | 1.420992 |
| 10 | 11 | YISHWYVTNVARTC | 912 | 1.414787 |
| 11 | 12 | YTTFWYVTRRPTAC | 874 | 1.355838 |
| 12 | 13 | YTAYWYTTSNTNRC | 843 | 1.307747 |
| 13 | 14 | YTTHWYDTDNKSRC | 749 | 1.161925 |
| 14 | 15 | YHTYWLDTYRPLVC | 671 | 1.040923 |
| 15 | 16 | YLEWSLDyyELTYyRSC | 649 | 1.006795 |
| 16 | 17 | YHTYWYSTTKTRHC | 641 | 0.994384 |
| 17 | 18 | YGPQWYRRyYLyC | 528 | 0.819087 |
| 18 | 19 | YLLNyRGHATQQVFWWC | 510 | 0.791164 |
| 19 | 20 | YVIyKGVRNySWHASWC | 468 | 0.726009 |
| 20 | 21 | YRYyTSTYEySWNC | 430 | 0.667060 |
| 21 | 22 | YLIFTWRTKRDCIIAyC | 369 | 0.572430 |
| 22 | 23 | YyGWHCPLKHFRTC | 342 | 0.530545 |
| 23 | 24 | YyTHPTTTySyYFFNC | 244 | 0.378518 |
| 24 | 25 | YyTYWSyLTQLYHyTSC | 225 | 0.349043 |
Aside from the top-1 peptide discrepancy, our list generally looks quite different, with many peptides completely absent from the top-25 in the original publication. For example, our top-7 and top-16 sulfo-tyrosine-containing sequences, YIYYVNyTTLTyLFTDC and YLEWSLDyyELTYyRSC, are not found in Table S3 above.
It is not clear where the differences come from since the details of NGS data analysis are not mentioned in the original publication. However, it is unlikely due to overzealous read filtering on our part. The peptide list obtained right after in silico translation, before any filtering is applied, already suggests YYTHWIYTVNPRRC, which is completely absent from Table 3, as the top-1 peptide:
[11]:
initial_peptide_list = pd.read_csv(fnames[0])
cols = ['Index', 'Peptide', 'pep count', 'Dataset %']
initial_peptide_list[cols].head(25).style.set_table_styles(
[{'selector': 'th, td', 'props': [('font-family', 'Consolas, monospace')]}]
)
[11]:
| Index | Peptide | pep count | Dataset % | |
|---|---|---|---|---|
| 0 | 1 | YYTHWIYTVNPRRCSALSA | 14226 | 9.296520 |
| 1 | 2 | nan | 13679 | 8.939062 |
| 2 | 3 | YPKWYLKyYTNFPRYLHRTGGG_ | 12061 | 7.881719 |
| 3 | 4 | YFTYWYHTNNTGRCSALSA | 10782 | 7.045908 |
| 4 | 5 | YTSHWYITRVTARCSALSA | 4923 | 3.217121 |
| 5 | 6 | YTyyCySyyFNWYPRFPRYLHRTGGG_ | 4409 | 2.881229 |
| 6 | 7 | YYTFWYYTNVRRRCSALSA | 4233 | 2.766215 |
| 7 | 8 | YRAyCWYyySRyyWFPRYLHRTGGG_ | 2387 | 1.559876 |
| 8 | 9 | YPSWYTYWYyRSFPRYLHRTGGG_ | 2089 | 1.365136 |
| 9 | 10 | YTWELPSWYLEHYFPTCSALSA | 1976 | 1.291292 |
| 10 | 11 | YFAHWLVTYNPARCSALSA | 1566 | 1.023362 |
| 11 | 12 | YIHYyCyyyTyySFPRYLHRTGGG_ | 1374 | 0.897893 |
| 12 | 13 | YIYYVNyTTLTyLFTDCSALSA | 1285 | 0.839732 |
| 13 | 14 | YETHWFVTYRAELCSALSA | 1212 | 0.792027 |
| 14 | 15 | YPTWYTLHYYPLFECSALSA | 1028 | 0.671786 |
| 15 | 16 | YHySLSAYyKAWSESFPRYLHRTGGG_ | 960 | 0.627348 |
| 16 | 17 | YYTFWYFTSEAHKCSALSA | 904 | 0.590753 |
| 17 | 18 | YISHWYVTNVARTCSALSA | 890 | 0.581604 |
| 18 | 19 | YTTFWYVTRRPTACSALSA | 868 | 0.567228 |
| 19 | 20 | YTAYWYTTSNTNRCSALSA | 823 | 0.537821 |
| 20 | 21 | YyTyCFFVyyFyyyHFPRYLHRTGGG_ | 814 | 0.531939 |
| 21 | 22 | YFCYTLYyAyyTAyFPRYLHRTGGG_ | 784 | 0.512335 |
| 22 | 23 | YHyTTyYCDyWLyWTRFPRYLHRTGGG_ | 776 | 0.507107 |
| 23 | 24 | YTTHWYDTDNKSRCSALSA | 750 | 0.490116 |
| 24 | 25 | YHTYWLDTYRPLVCSALSA | 658 | 0.429995 |
In the table above, the sequence contains the constant region that is clipped later in the pipeline, but it is the same peptide: YYTHWIYTVNPRRC vs YYTHWIYTVNPRRCSALSA.
It is possible that the authors used different in silico translation logic (searched for ORFs differently), or lost this sequence during their filtering procedures (e.g., when ascertaining Q-scores).
Without suggesting which analysis pipeline is “correct”, we hope to illustrate with this example that the details of NGS data processing can have a major influence on the apparent composition of the resulting library!
As a side note, an inspection of the initial and final peptide lists reveals large differences in the library composition. The initial dataset is dominated by DNA reads that either do not encode an ORF (peptide nan in the table above) or encode ORFs lacking a stop codon. The top-2 peptide YPKWYLKyYTNFPRYLHRTGGG_ seems to be a result of a frameshift mutation in the original library. This peptide has neither the downstream cysteine nor the SALSA linker encoded by the library designs.
The underscore _ symbol at 3’-end is an internal clibas convention that labels incomplete codons at the 3’-end of a sequence (i.e., when the translation operation runs into 1- or 2-nt-long codon at the end of the ORF).
All of these frameshifted peptides are lost during the filtering process, both in our example here and in the original paper. Whether they warrant a closer inspection, and whether they may real selection hits is an open question.
These frameshifted reads, which comprise a large fraction of the round 7 dataset, are all lost upon read filtering:
[22]:
out_dir = C.fastq_parser.dirs.logs
exp_name = C.fastq_parser.exp_name
fname = next(Path(out_dir).glob(f"{exp_name}_pipeline_summary*.csv"))
summary = pd.read_csv(fname)
# only show the pipeline summary for the seventh round of selection
cols = ['Unnamed: 0', 'elapsed time, s', 'DCCL22_r7']
summary[cols].style.format({
'elapsed time, s': '{:.1f}',
'DCCL22_r7': '{:.0f}'
}).set_table_styles(
[{'selector': 'th, td', 'props': [('font-family', 'Consolas, monospace')]}]
)
[22]:
| Unnamed: 0 | elapsed time, s | DCCL22_r7 | |
|---|---|---|---|
| 0 | nan | nan | 0 |
| 1 | fetch_dir_fastq | 1.7 | 153025 |
| 2 | trim_exp | 2.6 | 153025 |
| 3 | translate_dna | 5.3 | 153025 |
| 4 | length_summary | 1.9 | 153025 |
| 5 | length_summary | 3.5 | 153025 |
| 6 | q_score_summary | 5.5 | 153025 |
| 7 | fastq_count_summary | 0.4 | 153025 |
| 8 | length_filter | 0.7 | 68564 |
| 9 | constant_region_filter | 0.2 | 65769 |
| 10 | variable_region_filter | 2.0 | 64921 |
| 11 | q_score_filter | 0.3 | 64462 |
| 12 | fetch_region | 0.1 | 64462 |
| 13 | filter_ambiguous | 2.3 | 64462 |
| 14 | sequence_level_convergence_summary | 3.4 | 64462 |
| 15 | token_level_convergence_analysis | 20.8 | 64462 |
| 16 | token_level_convergence_analysis | 48.9 | 64462 |
| 17 | unpad_data | 2.1 | 64462 |
| 18 | save_data | 0.1 | 64462 |
| 19 | fastq_count_summary | 1.1 | 64462 |
| 20 | fastq_count_summary | 1.0 | 64462 |
| 21 | fastq_count_summary | 0.5 | 64462 |
| 22 | fastq_count_summary | 0.4 | 64462 |
| 23 | library_design_summary | 0.0 | 64462 |
| 24 | joint_dataset_count_summary | 2.1 | 64462 |
| 25 | umap_hdbscan_summary | 560.0 | 64462 |
[26]:
fraction_retained = summary['DCCL22_r7'].iloc[-1] / summary['DCCL22_r7'].iloc[1]
print(f'Only {100 * fraction_retained:.1f}% reads survived the filtration. In other words, {100 - 100 * fraction_retained:.1f}% was discarded.')
Only 42.1% reads survived the filtration. In other words, 57.9% was discarded.
This filtering of 57.9% of the library strongly affected the apparent composition of the final dataset. Compare the relative abundances for several peptides before and after the filtering:
[31]:
peptide = initial_peptide_list.iloc[0]['Peptide']
initial_abundance = initial_peptide_list.iloc[0]['Dataset %']
final_abundance = final_peptide_list.iloc[0]['Dataset %']
print(f"Peptide: {peptide}; initial: {initial_abundance:.1f}%; final: {final_abundance:.1f}%")
Peptide: YYTHWIYTVNPRRCSALSA; initial: 9.3%; final: 22.4%
[32]:
peptide = initial_peptide_list.iloc[13]['Peptide']
initial_abundance = initial_peptide_list.iloc[13]['Dataset %']
final_abundance = final_peptide_list.iloc[7]['Dataset %']
print(f"Peptide: {peptide}; initial: {initial_abundance:.1f}%; final: {final_abundance:.1f}%")
Peptide: YETHWFVTYRAELCSALSA; initial: 0.8%; final: 1.9%
Whether these abundance differences are meaninfgul is besides the point of this example. With this demonstration, we hope to illustrate how clibas’s ability to collect statistics at arbitrary points in a pipeline enables a multifaceted analysis of NGS datasets.
This concludes example 2. The results have been written to ./outputs/Payne_CCL22 as specified in the config and can be interrogated further if desired. See the 1. Phage display: FXIa selection notebook for a walkthrough of the most important outputs.