4. SELEX: antibacterial aptamers¶

This notebook demonstrates the use of clibas to analyze SELEX aptamer selections using data from Reischer and co-workers as an example.

For further details and the source of data, see C. Kolm et al. DNA aptamers against bacterial cells can be efficiently selected by a SELEX process using state‑of‑the art qPCR and ultra‑deep sequencing. Sci. Rep. 2020, 10, 20917

In the paper, the authors conduct SELEX selections to identify DNA aptamere with specific affinity for live Enterococcus faecalis cells. The authors conducted 11 rounds of selection using a simple DNA library containing a random 40±3 nt insert flanked by a 5’- (TAGGGAAGAGAAGGACATATGAT) and 3’-end (TTGACTAGTACATGACCACTTGA) constant regions. NGS data provided by the authors is in the paired-end read format. At present, paired-end read merging is not supported in clibas; prior to running this analysis, we have merged the reads separately using FLASH with the following commands:

--max-overlap=200 --allow-outies --max-mismatch-density=0.10

The reads are provided with adapter sequences trimmed, so it is easy to write library designs:

Naturally, there is no need for peptide library designs and a translation table in this case. The config file can be quite minimal:

experiment: "Reischer_antibacterial_SELEX"

LibraryDesigns:
  dna_templates:
    - "TAGGGAAGAGAAGGACATATGAT1111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT11111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT111111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT1111111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT11111111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT111111111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"
    - "TAGGGAAGAGAAGGACATATGAT1111111111111111111111111111111111111111111TTGACTAGTACATGACCACTTGA"

  dna_monomers:
    1: ["A", "G", "T", "C"] #N

TrackerConfig:
  logs: "./outputs/Reischer_SELEX"                 # Directory for writing logs to
  parser_out: "./outputs/Reischer_SELEX"           # Directory that stores fastq parser outputs
  analysis_out: "./outputs/Reischer_SELEX"         # Directory that stores outputs of data analysis operations

LoggerConfig:
  verbose: false                    # Verbose loggers print to the console
  log_to_file: true                 # Write logs to file
  level: "INFO"                     # Logger level; accepted values: "DEBUG", "INFO", "WARNING", "ERROR"

By way of illustration, we will silence the logger for this example, and instead write logs to a file – see the LoggerConfig specifications above.

[1]:

import clibas as C

C.initialize(config_path="Reischer_antibacterial_SELEX_config.yaml")

This paper by Reischer is one of the few examples that we could find that details NGS data processing. Below, we build a pipeline that roughly reproduces what is described in the paper. The authors mention the following two filtering steps:

An error rate of 20% (≤ 4 nt mismatches) within the constant regions was allowed; Sequences not fulfilling these conditions were discarded

and

Next, fastp 0.20.0 was used to filter sequences by quality score, discarding all sequences of an overall Phred quality score ≤ 30

The pipeline below is intentionally minimal – we are not collecting any optional stats, just trying to reproduce the paper:

[3]:

C.pipeline.enque(
    [
        # discard the DNA of incorrect length (incompatible with library designs)
        C.fastq_parser.len_filter(where="dna"),
        # discard the DNA with incorrect/overly mutated constant regions
        # up to 4 mutations are tolerated in a single region;
        # this can be achieved by calling cr_filter twice:
        C.fastq_parser.cr_filter(where="dna", loc=[0], tol=4),
        C.fastq_parser.cr_filter(where="dna", loc=[2], tol=4),
        # discard the DNA with average Q scores below 30 in region 1 (random insert region)
        C.fastq_parser.q_score_filt(minQ=30, loc=[1]),
        # clip DNA sequences to region 1 (random insert region)
        C.fastq_parser.fetch_at(where="dna", loc=[1]),
        # discard all DNA containing any ambiguous bases (N, etc)
        C.fastq_parser.filt_ambiguous(where="dna"),
        # count retained DNA sequences
        C.fastq_parser.count_summary(where="dna", top_n=1000, fmt="csv"),
    ]
)

[4]:

"""run everything from the specified folder together in memory"""
loader = C.data_loader.fetch_gz_from_dir(
    data_dir="./sequencing_data/Reischer_bacterial_SELEX"
)
data = C.pipeline.load_and_run(loader=loader, save_summary=True)

The minimal pipeline above demonstrates how full analysis can be performed with essentially just 7 operations and a few lines of code. Without collecting optional statistics, the pipeline is fast. It took 31.5 s to process all 11 .fastq.gz files:

[8]:

from pathlib import Path

import pandas as pd

out_dir = C.fastq_parser.dirs.parser_out
exp_name = C.fastq_parser.exp_name

fname = next(Path(out_dir).glob(f"{exp_name}_pipeline_summary*.csv"))
summary = pd.read_csv(fname)

t = summary["elapsed time, s"].sum()
print(f"The run was completed in {t:.1f} s!")

The run was completed in 31.5 s!

[25]:

c = [x for x in summary.columns if x.startswith("Reischer")]
reads = int(summary.iloc[1][c].sum())
print(f"The initial dataset had {reads:.2e} reads across 11 samples (.fastq files)")

The initial dataset had 2.54e+06 reads across 11 samples (.fastq files)

Most importantly, we arrive at the same top sequences as the authors:

[32]:

out_dir = C.fastq_parser.dirs.parser_out
sample_name = "Reischer_bact_SELEX_r11"

f_dir = Path(out_dir) / sample_name
fname = next(Path(f_dir).glob(f"{sample_name}_dna_count_summary*.csv"))

df = pd.read_csv(fname)
df.head(20).style.set_table_styles(
    [{"selector": "th, td", "props": [("font-family", "Consolas, monospace")]}]
)

[32]:

	Index	DNA	dna count	Dataset %
0	1	TTTCTCAACGGGACCATCACTTACCTCAAGTACTTGGACG	2771	1.405793
1	2	GTCAACTCATTTATGGTGCTCCTCGTACCTCAGGTGGTTA	475	0.240979
2	3	CCGGCTATCTCCCTACCGTGGCCGAGTACCTCAAACGTTT	456	0.231339
3	4	GGCTCTCTGGTCTTCAAGGCCCATGATTACAGTCAGATCA	345	0.175027
4	5	CTGCCTGACTTCATAATGCTTCTTCTCCCTGTGGTACTTG	321	0.162851
5	6	CCTTAGGTCTTAACTATCAGGCGGTCTGTATCAATTCGAT	294	0.149153
6	7	GGCAGGGAGCGACCGGGTCATTGATTATCTGTCAAAGTGT	293	0.148646
7	8	GGCCATACCTCGTGCCTTCTGTGATCATCTCTATCAATTG	271	0.137485
8	9	CTCAATCATCAAGGGTCTACTTCCCGCTTGTGGGCCATTC	263	0.133426
9	10	CCTCTCTCTTACTGCTACTGGGCAGGGTACTCAATTACGT	251	0.127338
10	11	CGGTCCCGACTCAATATTGTTCCCTCCCCTTATCAGGCGG	236	0.119728
11	12	TCCTCTAATCAACTCTATGCCTTATCCCCTTGGTCAGGAC	217	0.110089
12	13	CATGGCTCCCTCTTCAACTTCAAGTCAGTGATCTGTCAAA	210	0.106538
13	14	CAGGTCTCGTCCCTTGTGGAACAGGAATACTGGCATCACA	208	0.105523
14	15	CCTGGATCATCGATGGCAAAAGCGCATTCCCGGCATGTGGC	206	0.104509
15	16	ACGCACATCATGAATTGGCCACTCATCACTTTATCGTGGT	202	0.102479
16	17	CCTCTGTCAGAGCCATCATAGAGACTATGATCCCTGGTCA	202	0.102479
17	18	GGCCATCCCCCAATCGCGGTGGGCTATGCACCTCAACAAG	192	0.097406
18	19	CCTTCGCCTCTCTACAAGGGCGCAATGCTTCGCTCAATCGT	186	0.094362
19	20	ACTGGCCTTGACACCCTGTTGTGGCTTGATGACAATAACA	185	0.093855

During processing, we truncated the reads to show only the variable region (C.fastq_parser.fetch_at(where='dna', loc=[1])), but these results are broadly consistent with the paper: top-1, 2, 3, 8, 10, 11 and 12 are the same (see the snippet from the paper below):