Fast primer screening with the pydna.primer_screen module#

Visit the full library documentation here

This notebook showcases how to use the pydna.primer_screen module to screen your primer collection. I routinely use it to find the annealing positions of about two thousands primers in a fraction of a second.

A provision is to have the sequences primer sequences in a format that pydna can read using the pydna.parsers.parse_primers.

The primer list can consist of Primer objects returned by :func:pydna.parsers.parse_primers or any objects with a seq attribute, such as pydna.seqrecord.SeqRecord or Bio.SeqRecord.SeqRecord.

This module uses the Aho–Corasick algorithm.

If the same primer list is used repeatedly, creating an automaton with pydna.make_automaton greatly speeds up repeated searches. See at the end of this notebook for how to create, save, and use the automaton.

%%capture
# Install pydna and download data files from the pydna Github repository (only when running on Colab)
import sys
if 'google.colab' in sys.modules:
    %pip install pydna[clipboard,download,express,gel]
    !curl -LO "https://github.com/pydna-group/pydna/raw/master/docs/notebooks/FAS2_S288C_wild-type_locus.gb"
    !curl -LO "https://github.com/pydna-group/pydna/raw/master/docs/notebooks/fas2__NatMX4_locus.gb"
    !curl -LO "https://github.com/pydna-group/pydna/raw/master/docs/notebooks/fas2__KanMX4_locus.gb"
    !curl -LO "https://github.com/pydna-group/pydna/raw/master/docs/notebooks/pIL68.gb"
    !curl -LO "https://github.com/pydna-group/pydna/raw/master/docs/notebooks/pIL75.gb"

Primer list#

The code cell below establishes a primer list. This list is a sublist of actual oligonucleotides that sit in a freezer in a lab in the north of Portugal. That freezer contain over two thousand primers, all of which are named “N_primername” where N is an integer. The number in the first part of the primer name indicate the index of that primer in the original list.

from pydna.parsers import parse_primers

# Eleven primers in FASTA format
primers = parse_primers ("""
>51_TefTermFwd A.gos 20-mer
CAGATGCGAAGTTAAGTGCG
>82_MSW_fwd this primer sits inside the loxP of pUG6
TCCTTGACAGTCTTGACG
>149_MX4rev (25-mer) Primer in the Ashbya gossypi TEF terminator in the reverse direction. NEW (2021-01-05)
ACAAATGACAAGTTCTTGAAAACAA
>255_kanC (22-mer)
TGATTTTGATGACGAGCGTAAT
>594_pRS306_rev 24-mer
tcgctttcttcccttcctttctcg
>595_pRS306_fwd
taaagggagcccccgatttagagc
>607_XDHcasrv (28-mer)
aaacagctatgaccatgattacgccaag
>700_sc_fas2-B1: (25-mer)
ATTTCTCTATGTAAAGACAGAGCAG
>701_sc_fas2-A1: (25-mer)
CTATATTTCTATTCTATCCGAACTC
>1215_URA3r 27-mer
CGGTTTCTTTGAAATTTTTTTGATTCG
>1564_KANMX_rev
CACTCGCATCAACCAAACC
""")

# A list of None is created
pl = [None for i in range(1565)]

for primer in primers:
    # Extract the number from the name
    number, rest = primer.id.split("_", maxsplit=1)
    # put the primer in the correct location in the list
    pl[int(number)] = primer

The primer list is just a list where most elements are empty (None) so that the primer index in the name matches the index of the primer.

assert len(pl) == 1565

Eleven elements should have a value that is not None

assert len([p for p in pl if p is not None]) == 11

Read data files#

from pydna.readers import read

The following three files are wild-type (wt) and two mutant alleles where the coding sequence of a gene has been replaced by two different marker genes. These were made in the lab by one of my students.

wt = read("FAS2_S288C_wild-type_locus.gb")
nat = read("fas2__NatMX4_locus.gb")
kan = read("fas2__KanMX4_locus.gb")

The following two files are plasmids from Liachko I, Dunham MJ. (2014). An autonomously replicating sequence for use in a wide range of budding yeasts. FEMS Yeast Res. 14(2):364–367 (pubmed).

pIL68 = read("pIL68.gb")
pIL75 = read("pIL75.gb")

Most important functions provided by `pydna.primer_screen`#

The six main functions are imported below:

from pydna.primer_screen import forward_primers
from pydna.primer_screen import reverse_primers
from pydna.primer_screen import primer_pairs
from pydna.primer_screen import flanking_primer_pairs
from pydna.primer_screen import diff_primer_pairs
from pydna.primer_screen import diff_primer_triplets

/home/runner/work/pydna/pydna/src/pydna/primer_screen.py:63: FutureWarning: The primer_screen module is experimental and not yet extensively tested. api may change in future versions.
  warnings.warn(

Function `forward_primers`#

The kan sequence is the Saccharomyces cerevisiae fas2 locus where the FAS2 open reading frame (orf) and 1000 bp upstream and downstream. The orf has been replaced by the KanMX4 marker gene from the pUG6 plasmid.

kan

Dseqrecord(-3613)

The forward_primers function returns a dictionary {int: list} representing the index and position(s) of forward primers from the pl list annealing to the kan sequence.

forward_primers(kan, pl)

{701: [534], 82: [1168], 255: [1979], 51: [2434]}

The returned dictionary has the form:

{ primer_A_index : [primer_A_location1, primer_A_location2, ...]
  primer_B_index : [primer_B_location1, primer_B_location2, ...] }

The location is the position where the 3’ part of the primer anneals. The forward primer in the figure below anneals at position 14 on the 20 bp template.

 5-gtcatgatctagtcgatgtta-3
   |||||||||||||||||||||

         5'-tagtcg-3'      <=== forward primer, location = 14
            ||||||
   |||||||||||||||||||||
 3-cagtactagatcagctacaat-5
                 ^
   0-------------1-----2 position
                 4     0

Function `reverse_primers`#

The reverse_primers function returns a dictionary {int: list} representing the index and position(s) of reverse primers from pl annealing to the kan sequence.

reverse_primers(kan, pl)

{149: [2306], 1564: [1940]}

The resulting dict has the same form as in the previous example describing the forward_primers function. In the example below, the reverse primeranneals at position 9 on thje 20 bp template.

5-gtcatgatctagtcgatgtta-3
  |||||||||||||||||||||
           ||||||
         3-atcagc-5       <=== reverse primer, location = 9

  |||||||||||||||||||||
3-cagtactagatcagctacaat-5
           ^
  0--------9----------2 position
                      0

primer_pairs(kan, pl, short=350, long=2000)

[amplicon_tuple(fp=701, rp=149, fposition=534, rposition=2306, size=1822),
 amplicon_tuple(fp=701, rp=1564, fposition=534, rposition=1940, size=1450),
 amplicon_tuple(fp=82, rp=149, fposition=1168, rposition=2306, size=1181),
 amplicon_tuple(fp=82, rp=1564, fposition=1168, rposition=1940, size=809),
 amplicon_tuple(fp=255, rp=149, fposition=1979, rposition=2306, size=374)]

Function `primer_pairs`#

The primer_pairs function return a list of named tuples. They each indicate a forward and a reverse primer, their positions and the size of the potential PCR product. PCR products with a size between optional keyword arguments short (500 bp by default) and long (2000 bp by default) are returned.

Function `flanking_primer_pairs`#

The flanking_primer_pairs function returns primers pairs that flank a target, in the example above from position 550 to 1200.

flanking_primer_pairs(kan, pl, target=(550, 1200))

[amplicon_tuple(fp=82, rp=1564, fposition=1168, rposition=1940, size=809),
 amplicon_tuple(fp=82, rp=149, fposition=1168, rposition=2306, size=1181)]

Function `diff_primer_pairs`#

The diff_primer_pairs function takes a list or tuple with Dseqrecord objects and returns another kind of named tuple primer_tuplethe same kind of named tuples as the previous two functions.

diff_primer_pairs((nat, kan), pl)

[(primer_tuple(seq=Dseqrecord(-3308), fp=82, rp=149, size=944),
  primer_tuple(seq=Dseqrecord(-3613), fp=82, rp=149, size=1181))]

Each tuple contain potential PCR products from a primer pair that results in different PCR product sizes. This can be used to find primers that can differentiate between two or more DNA sequences, such as before and after some genetic modification.

For example, Primers 1 and 2 both form PCR products from sequenceA and B below, but of different sizes. Primers 1 and 2 could be used to verify genetic modifications such as cloning an insert into a plasmid vector.

 1>              <2
-------NNNNNNNNN----  sequenceA


 1>          <2
-------XXXXX----      sequenceB

Using primers 82 and 149 results in a 944 bp PCR product from the kan sequence and and 1181 bp product from the nat sequence.

In this example, the nat and kan represents the same kind of deletion in the S. cerevisiae fas2 locus, but one was wade with the KanMX4 and the other with the NatMX4 from the pAG25 plasmid pubmed. They differ in size by about 300 bp:

nat, kan

(Dseqrecord(-3308), Dseqrecord(-3613))

Function `diff_primer_triplets`#

The two plasmids pIL68 and pIL75 are closely related and there is no single primer pair in our list that would result in PCR products of different sizes.

diff_primer_pairs((pIL68, pIL75), pl)

[]

pIL68, pIL75

(Dseqrecord(o4852), Dseqrecord(o5213))

The diff_primer_triplets function looks for primer triplets that can be used to distinguish two sequences.

For example, Primers 1, 2 and 3 form PCR products from sequenceA and B below, but of different sizes. Primer 1 binds both sequences while primers 2 and 3 bind one sequence each. This primer triplet could be used to verify genetic modifications by using a PCR reaction with three primers.

 1>        <2
-------NNNNNNNNN----  sequenceA

 1>     <3
-------XXXXX--------  sequenceB

The fuction returns a list of tuples. Each tuple contain potential PCR products from a primer pair that results in different PCR product sizes. This can be used to find primers that can differentiate between two or more DNA sequences, such as before and after some genetic modification.

diff_primer_triplets((pIL68, pIL75), pl)

[(primer_tuple(seq=Dseqrecord(o4852), fp=1215, rp=594, size=1474),
  primer_tuple(seq=Dseqrecord(o5213), fp=51, rp=594, size=548)),
 (primer_tuple(seq=Dseqrecord(o4852), fp=1215, rp=594, size=1474),
  primer_tuple(seq=Dseqrecord(o5213), fp=255, rp=594, size=1005))]

In this example, primers 1215, 51 and 594 could be used to amplify a 1474 bp product from the pIL68 (4852 bp) and a 548 bp product from the pIL75 (5213 bp). Alternatively, the primer 51 could be substiuted for primer 255 to give two products slightly closer in size.

Automaton for a list of primers.#

An automaton can be made prior to primer screening for a list of Primer objects for faster primer search.

from pydna.primer_screen import make_automaton

# build automaton
atm = make_automaton(pl, limit = 16)

# save automaton
import pickle # The standard lib pickle 
atm.save("atm.automaton", pickle.dumps)

# load automaton
import ahocorasick
atm = ahocorasick.load("atm.automaton", pickle.loads)

# use automaton
forward_primers(kan, pl, automaton=atm)

{701: [534], 82: [1168], 255: [1979], 51: [2434]}

This automaton can be reused as an optional argument across calls to

forward_primers
reverse_primers
primer_pairs
flanking_primer_pairs
diff_primer_pairs
diff_primer_triplets

The primer list can contain None, this can be used to remove primers from the primer_list for the automaton, while keeping the original index for each primer.

The limit is the part of the primer used to find annealing positions. The automaton processes the uppercase 3’ part of each primer up to limit. It has to be rebuilt if a different limit is needed.

The primers can contain ambiguous bases from the extended IUPAC DNA alphabet:

Code	Bases	Meaning
A	A	Adenine
C	C	Cytosine
G	G	Guanine
T	T	Thymine
U	T	Uracil (RNA)
R	AG	puRine
Y	CT	pYrimidine
S	GC	Strong (3 H-bonds)
W	AT	Weak (2 H-bonds)
K	GT	Keto
M	AC	aMino
B	CGT	not A
D	AGT	not C
H	ACT	not G
V	ACG	not T
N	ACGT	any
X	-	unknown

Fast primer screening with the pydna.primer_screen module

Contents

Fast primer screening with the pydna.primer_screen module#

Primer list#

Read data files#

Most important functions provided by pydna.primer_screen#

Function forward_primers#

Function reverse_primers#

Function primer_pairs#

Function flanking_primer_pairs#

Function diff_primer_pairs#

Function diff_primer_triplets#