4. SELEX: antibacterial aptamers

This notebook demonstrates the use of clibas to analyze SELEX aptamer selections using data from Reischer and co-workers as an example.

For further details and the source of data, see C. Kolm et al. DNA aptamers against bacterial cells can be efficiently selected by a SELEX process using state‑of‑the art qPCR and ultra‑deep sequencing. Sci. Rep. 2020, 10, 20917

In the paper, the authors conduct SELEX selections to identify DNA aptamere with specific affinity for live Enterococcus faecalis cells. The authors conducted 11 rounds of selection using a simple DNA library containing a random 40±3 nt insert flanked by a 5’- (TAGGGAAGAGAAGGACATATGAT) and 3’-end (TTGACTAGTACATGACCACTTGA) constant regions. NGS data provided by the authors is in the paired-end read format. At present, paired-end read merging is not supported in clibas; prior to running this analysis, we have merged the reads separately using FLASH with the following commands:

--max-overlap=200 --allow-outies --max-mismatch-density=0.10

The reads are provided with adapter sequences trimmed, so it is easy to write library designs:

001 selex.png

Naturally, there is no need for peptide library designs and a translation table in this case. The config file can be quite minimal:

experiment: "Reischer_antibacterial_SELEX"

LibraryDesigns:
  dna_templates:
    - "TAGGGAAGAGAAGGACATATGAT1111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT11111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT111111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT1111111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT11111111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT111111111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT1111111111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"

  dna_monomers:
    1: ["A", "G", "T", "C"] #N

TrackerConfig:
  logs: "./outputs/Reischer_SELEX"                 # Directory for writing logs to
  parser_out: "./outputs/Reischer_SELEX"           # Directory that stores fastq parser outputs
  analysis_out: "./outputs/Reischer_SELEX"         # Directory that stores outputs of data analysis operations

LoggerConfig:
  verbose: false                    # Verbose loggers print to the console
  log_to_file: true                 # Write logs to file
  level: "INFO"                     # Logger level; accepted values: "DEBUG", "INFO", "WARNING", "ERROR"

By way of illustration, we will silence the logger for this example, and instead write logs to a file – see the LoggerConfig specifications above.

[1]:
import clibas as C

C.initialize(config_path="Reischer_antibacterial_SELEX_config.yaml")

This paper by Reischer is one of the few examples that we could find that details NGS data processing. Below, we build a pipeline that roughly reproduces what is described in the paper. The authors mention the following two filtering steps:

An error rate of 20% (≤ 4 nt mismatches) within the constant regions was allowed; Sequences not fulfilling these conditions were discarded

and

Next, fastp 0.20.0 was used to filter sequences by quality score, discarding all sequences of an overall Phred quality score ≤ 30

The pipeline below is intentionally minimal – we are not collecting any optional stats, just trying to reproduce the paper:

[3]:
C.pipeline.enque(
    [
        # discard the DNA of incorrect length (incompatible with library designs)
        C.fastq_parser.len_filter(where="dna"),
        # discard the DNA with incorrect/overly mutated constant regions
        # up to 4 mutations are tolerated in a single region;
        # this can be achieved by calling cr_filter twice:
        C.fastq_parser.cr_filter(where="dna", loc=[0], tol=4),
        C.fastq_parser.cr_filter(where="dna", loc=[2], tol=4),
        # discard the DNA with average Q scores below 30 in region 1 (random insert region)
        C.fastq_parser.q_score_filt(minQ=30, loc=[1]),
        # clip DNA sequences to region 1 (random insert region)
        C.fastq_parser.fetch_at(where="dna", loc=[1]),
        # discard all DNA containing any ambiguous bases (N, etc)
        C.fastq_parser.filt_ambiguous(where="dna"),
        # count retained DNA sequences
        C.fastq_parser.count_summary(where="dna", top_n=1000, fmt="csv"),
    ]
)
[4]:
"""run everything from the specified folder together in memory"""
loader = C.data_loader.fetch_gz_from_dir(
    data_dir="./sequencing_data/Reischer_bacterial_SELEX"
)
data = C.pipeline.load_and_run(loader=loader, save_summary=True)

The minimal pipeline above demonstrates how full analysis can be performed with essentially just 7 operations and a few lines of code. Without collecting optional statistics, the pipeline is fast. It took 31.5 s to process all 11 .fastq.gz files:

[8]:
from pathlib import Path

import pandas as pd

out_dir = C.fastq_parser.dirs.parser_out
exp_name = C.fastq_parser.exp_name

fname = next(Path(out_dir).glob(f"{exp_name}_pipeline_summary*.csv"))
summary = pd.read_csv(fname)

t = summary["elapsed time, s"].sum()
print(f"The run was completed in {t:.1f} s!")
The run was completed in 31.5 s!
[25]:
c = [x for x in summary.columns if x.startswith("Reischer")]
reads = int(summary.iloc[1][c].sum())
print(f"The initial dataset had {reads:.2e} reads across 11 samples (.fastq files)")
The initial dataset had 2.54e+06 reads across 11 samples (.fastq files)

Most importantly, we arrive at the same top sequences as the authors:

[32]:
out_dir = C.fastq_parser.dirs.parser_out
sample_name = "Reischer_bact_SELEX_r11"

f_dir = Path(out_dir) / sample_name
fname = next(Path(f_dir).glob(f"{sample_name}_dna_count_summary*.csv"))

df = pd.read_csv(fname)
df.head(20).style.set_table_styles(
    [{"selector": "th, td", "props": [("font-family", "Consolas, monospace")]}]
)
[32]:
  Index DNA dna count Dataset %
0 1 TTTCTCAACGGGACCATCACTTACCTCAAGTACTTGGACG 2771 1.405793
1 2 GTCAACTCATTTATGGTGCTCCTCGTACCTCAGGTGGTTA 475 0.240979
2 3 CCGGCTATCTCCCTACCGTGGCCGAGTACCTCAAACGTTT 456 0.231339
3 4 GGCTCTCTGGTCTTCAAGGCCCATGATTACAGTCAGATCA 345 0.175027
4 5 CTGCCTGACTTCATAATGCTTCTTCTCCCTGTGGTACTTG 321 0.162851
5 6 CCTTAGGTCTTAACTATCAGGCGGTCTGTATCAATTCGAT 294 0.149153
6 7 GGCAGGGAGCGACCGGGTCATTGATTATCTGTCAAAGTGT 293 0.148646
7 8 GGCCATACCTCGTGCCTTCTGTGATCATCTCTATCAATTG 271 0.137485
8 9 CTCAATCATCAAGGGTCTACTTCCCGCTTGTGGGCCATTC 263 0.133426
9 10 CCTCTCTCTTACTGCTACTGGGCAGGGTACTCAATTACGT 251 0.127338
10 11 CGGTCCCGACTCAATATTGTTCCCTCCCCTTATCAGGCGG 236 0.119728
11 12 TCCTCTAATCAACTCTATGCCTTATCCCCTTGGTCAGGAC 217 0.110089
12 13 CATGGCTCCCTCTTCAACTTCAAGTCAGTGATCTGTCAAA 210 0.106538
13 14 CAGGTCTCGTCCCTTGTGGAACAGGAATACTGGCATCACA 208 0.105523
14 15 CCTGGATCATCGATGGCAAAAGCGCATTCCCGGCATGTGGC 206 0.104509
15 16 ACGCACATCATGAATTGGCCACTCATCACTTTATCGTGGT 202 0.102479
16 17 CCTCTGTCAGAGCCATCATAGAGACTATGATCCCTGGTCA 202 0.102479
17 18 GGCCATCCCCCAATCGCGGTGGGCTATGCACCTCAACAAG 192 0.097406
18 19 CCTTCGCCTCTCTACAAGGGCGCAATGCTTCGCTCAATCGT 186 0.094362
19 20 ACTGGCCTTGACACCCTGTTGTGGCTTGATGACAATAACA 185 0.093855

During processing, we truncated the reads to show only the variable region (C.fastq_parser.fetch_at(where='dna', loc=[1])), but these results are broadly consistent with the paper: top-1, 2, 3, 8, 10, 11 and 12 are the same (see the snippet from the paper below):

002 selex.png