Pydna is a Python package providing code for simulating the creation of
recombinant DNA molecules using molecular biology techniques.
Development of pydna happens in this GitHub repository.
Provided:
- PCR simulation
- Assembly simulation based on shared identical sequences
- Primer design for amplification of a given sequence
- Automatic design of primer tails for Gibson assembly or homologous recombination
- Restriction digestion and cut&paste cloning
- Agarose gel simulation
- Download sequences from Genbank
- Parsing various sequence formats including the capacity to handle broken Genbank format
The most important modules and how to import functions or classes from
them are listed below. Class names start with a capital letter,
function names with a lowercase letter:
from pydna.module import function
from pydna.module import Class
Example: from pydna.gel import Gel
pydna
├── amplify
│   ├── Anneal
│   └── pcr
│
├── assembly
│   └── Assembly
│
├── design
│   ├── assembly_fragments
│   └── primer_design
│
├── dseqrecord
│   └── Dseqrecord
│
├── gel
│   └── Gel
│
├── genbank
│   ├── genbank
│   └── Genbank
│
├── parsers
│   ├── parse
│   └── parse_primers
│
└── readers
    ├── read
    └── read_primers
Documentation is available as docstrings provided in the source code for
each module.
These docstrings can be inspected by reading the source code directly.
See further below on how to obtain the code for pydna.
In the Python shell, use the built-in help function to view a
function’s docstring:
The docstrings are also used to provide an automatically generated reference
manual available online at Read the Docs.
Docstrings can be explored using IPython, an advanced Python shell with
TAB-completion and introspection capabilities. To see which functions
are available in pydna, type pydna.<TAB> (where <TAB> refers to the TAB key).
Use pydna.open_config_folder?<ENTER> to view the docstring or
pydna.open_config_folder??<ENTER> to view the source code.
In the Spyder IDE it is possible to place the cursor immediately before
the name of a module, class or function and press ctrl+i to bring up
docstrings in a separate window.
Code snippets are indicated by three greater-than signs:
Please join the Google group for pydna. This is the preferred location
for help. If you find bugs in pydna itself, open an issue at the
GitHub repository.
pcr is a convenience function for the Anneal class to simplify its
usage, especially from the command line. If no PCR product or more than
one PCR product is formed, a ValueError is raised.
args is any iterable of Dseqrecords or an iterable of iterables of
Dseqrecords. args will be greedily flattened.
Parameters:
args (iterable containing sequence objects) – Several arguments are also accepted.
limit (int = 13, optional) – limit length of the annealing part of the primers.
Notes
Sequences in args can be of type:
string
Seq
SeqRecord (or subclass)
Dseqrecord (or subclass)
The last sequence will be assumed to be the template while
all preceding sequences will be assumed to be primers.
This is a powerful function, use with care!
Returns:
product – A pydna.amplicon.Amplicon object representing the PCR
product. The direction of the PCR product will be the same as for
the template sequence.
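The greedy flattening of args and the "primers first, template last" convention can be pictured with a minimal pure-Python sketch. The helper names flatten and split_primers_and_template are hypothetical and are not part of pydna; plain strings stand in for sequence objects here.

```python
from collections.abc import Iterable


def flatten(args):
    """Recursively flatten nested iterables, treating strings as single items."""
    flat = []
    for item in args:
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def split_primers_and_template(*args):
    """Return (primers, template); the last sequence is taken as the template."""
    sequences = flatten(args)
    if len(sequences) < 2:
        raise ValueError("need at least one primer and a template")
    return sequences[:-1], sequences[-1]


# Hypothetical placeholder values; real calls would pass sequence objects.
primers, template = split_primers_and_template(
    ["fwd_primer", ["rev_primer"]], "template_seq"
)
print(primers, template)
```

Because the last item is always treated as the template, the order of the arguments matters even when they arrive nested.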
Assembly of a list of linear DNA fragments into linear or circular
constructs. The Assembly is meant to replace the Assembly method as it
is easier to use. Accepts a list of Dseqrecords (source fragments) to
initiate an Assembly object. Several methods are available for analysis
of overlapping sequences, graph construction and assembly.
This function takes the same parameters as the
pydna.genbank.Genbank.nucleotide method. The email address stored
in the pydna_email environment variable is used. The easiest way to set
this permanently is to edit the pydna.ini file.
See the documentation of pydna.open_config_folder().
If no accession is given, a very short Genbank entry
is used as an example (see below). This can be useful for testing the
connection to Genbank.
Please note that this result is also cached by default by settings in
the pydna.ini file.
See the documentation of pydna.open_config_folder().
LOCUS       CS570233                  14 bp    DNA     linear   PAT 18-MAY-2007
DEFINITION  Sequence 6 from Patent WO2007025016.
ACCESSION   CS570233
VERSION     CS570233.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1
  AUTHORS   Shaw,R.W. and Cottenoir,M.
  TITLE     Inhibition of metallo-beta-lactamase by double-stranded dna
  JOURNAL   Patent: WO 2007025016-A1 6 01-MAR-2007;
            Texas Tech University System (US)
FEATURES             Location/Qualifiers
     source          1..14
                     /organism="synthetic construct"
                     /mol_type="unassigned DNA"
                     /db_xref="taxon:32630"
                     /note="This is a 14bp aptamer inhibitor."
ORIGIN
        1 atgttcctac atga
//
This method downloads a Genbank nucleotide record from Genbank. This method is
cached by default. This can be controlled by editing the pydna_cached_funcs environment
variable. The best way to do this permanently is to edit the
pydna.ini file. See the documentation of pydna.open_config_folder().
Item is a string containing one genbank accession number
for a nucleotide file. Genbank nucleotide accession numbers have this format:
A12345 = 1 letter + 5 numerals
AB123456 = 2 letters + 6 numerals
The accession number is sometimes followed by a point and version number
BK006936.2
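As a rough illustration of the two accession layouts listed above, the following sketch checks a string against them with a regular expression. The names ACCESSION_RE and looks_like_accession are hypothetical and are not part of pydna; the pattern covers only the two layouts named here plus the optional version suffix, so other identifier styles (for example NM_005546) are deliberately not matched.

```python
import re

# 1 letter + 5 numerals, or 2 letters + 6 numerals, optionally followed by ".version".
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?$")


def looks_like_accession(item):
    """Return True if item matches one of the two layouts described above."""
    return ACCESSION_RE.match(item) is not None


for candidate in ("A12345", "AB123456", "BK006936.2", "NM_005546"):
    print(candidate, looks_like_accession(candidate))
```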
Item can also contain optional interval information in the following formats:
BK006936.2 REGION: complement(613900..615202)
NM_005546 REGION: 1..100
NM_005546 REGION: complement(1..100)
21614549:1-100
21614549:c100-1
21614549 1-100
21614549 c100-1
It is useful to set an interval for large genbank records to limit the download time.
The items above containing interval information can be obtained directly by
looking up an entry in Genbank and setting “Change region shown” on the
upper right side of the page. The ACCESSION line of the displayed Genbank
file will have the formatting shown.
Alternatively, seq_start and seq_stop can be set explicitly to the sequence intervals to be
downloaded.
If strand is “c”, “C”, “crick”, “Crick”, “antisense”, “Antisense”,
“2”, 2, “-” or “-1”, the antisense (Crick) strand is returned, otherwise
the sense (Watson) strand is returned.
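The strand rule above amounts to a membership test. A minimal sketch, assuming the accepted values are exactly those listed; the names CRICK_ALIASES and selects_crick are hypothetical and not taken from pydna:

```python
# Values that select the antisense (Crick) strand, as listed in the text above.
CRICK_ALIASES = {"c", "C", "crick", "Crick", "antisense", "Antisense", "2", 2, "-", "-1"}


def selects_crick(strand):
    """Return True if strand selects the Crick strand, False for Watson."""
    return strand in CRICK_ALIASES


print(selects_crick("c"), selects_crick(1))
```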
Dseqrecord is a double stranded version of the Biopython SeqRecord [1] class.
The Dseqrecord object holds a Dseq object describing the sequence.
Additionally, Dseqrecord holds meta information about the sequence in the
form of a list of SeqFeatures, in the same way as the SeqRecord does.
The Dseqrecord can be initialized with a string, Seq, Dseq, SeqRecord
or another Dseqrecord. The sequence information will be stored in a
Dseq object in all cases.
Dseqrecord objects can be read or parsed from sequences in FASTA, EMBL or Genbank formats.
See the pydna.readers and pydna.parsers modules for further information.
There is a short representation associated with the Dseqrecord.
Dseqrecord(-3) represents a linear sequence of length 3
while Dseqrecord(o7)
represents a circular sequence of length 7.
Dseqrecord and Dseq share the same concept of length. This length can be larger
than each strand alone if they are staggered as in the example below.
<-- length -->
GATCCTTT
     AAAGCCTAG
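The shared concept of length can be sketched in plain Python. The function below is an illustration only, not pydna's implementation; the name duplex_layout is hypothetical. It assumes both strands are written 5'->3' and that a positive ovhg means the Crick strand protrudes on the left while a negative ovhg means the Watson strand protrudes.

```python
def duplex_layout(watson, crick, ovhg):
    """Return (watson_row, crick_row, length) for a linear staggered duplex."""
    watson_row = " " * max(ovhg, 0) + watson
    # The Crick strand is reversed so that it reads 3'->5' under the Watson strand.
    crick_row = " " * max(-ovhg, 0) + crick[::-1]
    return watson_row, crick_row, max(len(watson_row), len(crick_row))


watson_row, crick_row, length = duplex_layout("GATCCTTT", "GATCCGAAA", -5)
print(watson_row)
print(crick_row)
print(length)
```

For the staggered molecule shown above, the Watson strand has 8 nt and the Crick strand 9 nt, while the overall length is 14.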
Parameters:
record (string, Seq, SeqRecord, Dseq or other Dseqrecord object) – This data will be used to form the seq property
circular (bool, optional) – True or False reflecting the shape of the DNA molecule
linear (bool, optional) – True or False reflecting the shape of the DNA molecule
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL.
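The difference between the two encodings can be shown with the standard library alone. This sketch assumes the checksum is a base64-encoded SHA-1 digest of the uppercase sequence with padding stripped; it does not reproduce how pydna handles the two strands, topology or any prefix. The function name sha1_checksums is hypothetical.

```python
import base64
import hashlib


def sha1_checksums(sequence):
    """Return (standard, urlsafe) base64 SHA-1 checksums of a sequence string."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    standard = base64.b64encode(digest).decode("ascii").rstrip("=")
    urlsafe = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return standard, urlsafe


standard, urlsafe = sha1_checksums("atgttcctacatga")
print(standard)
print(urlsafe)
```

The two strings differ only where the standard alphabet would use + or /.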
Returns the sequence as a string using a format supported by Biopython
SeqIO [2]. Default is “gb” which is short for Genbank.
Allowed Formats are for example:
“fasta”: The standard FASTA format.
“fasta-2line”: No line wrapping and exactly two lines per record.
“genbank” (or “gb”): The GenBank flat file format.
“embl”: The EMBL flat file format.
“imgt”: The IMGT variant of the EMBL format.
The format string can be modified with the keyword “dscode” if
the underlying dscode string is desired in the output. For example:
Writes the Dseqrecord to a file using the format f, which must
be a format supported by Biopython SeqIO for writing [3]. Default
is “gb” which is short for Genbank. Note that Biopython SeqIO reads
more formats than it writes.
Filename is the path to the file where the sequence is to be
written. The filename is optional; if it is not given, the
description property (string) is used together with the format.
If obj is the Dseqrecord object, the default file name will be:
<obj.locus>.<f>
Where <f> is “gb” by default. If the filename already exists
and the sequence it contains is different, a new file name will be
used so that the old file is not lost:
This method returns a new circular sequence (Dseqrecord object), which has been rotated
in such a way that there is maximum overlap between the sequence and
ref, which may be a string, Biopython Seq, SeqRecord object or
another Dseqrecord object.
The reason for using this could be to rotate a new recombinant plasmid so
that it starts at the same position after cloning. See the example below:
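The idea of rotating a circular sequence to line up with a reference can be sketched with the standard library. This is not pydna's algorithm: it works on one strand only, ignores the reverse complement, and uses the longest common substring as a stand-in for "maximum overlap". The function name rotate_to_reference is hypothetical.

```python
from difflib import SequenceMatcher


def rotate_to_reference(circular_seq, ref):
    """Rotate circular_seq so its longest match with ref sits at the same offset."""
    # Doubling the sequence lets a match run across the origin of the circle.
    doubled = circular_seq + circular_seq
    matcher = SequenceMatcher(None, doubled, ref, autojunk=False)
    match = matcher.find_longest_match(0, len(doubled), 0, len(ref))
    if match.size == 0:
        return circular_seq
    start = (match.a - match.b) % len(circular_seq)
    return circular_seq[start:] + circular_seq[:start]


print(rotate_to_reference("ttgacaacgt", "acgtttgaca"))
```

Here the input is a rotation of the reference, so the rotated result starts at the same position as the reference.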
Digest a Dseqrecord object with one or more restriction enzymes.
Returns a list of linear Dseqrecords. If there are no cuts, an empty
list is returned.
See also Dseq.cut()
Parameters:
enzymes (enzyme object or iterable of such objects) – A Bio.Restriction.XXX restriction object or iterable of such.
Returns:
Dseqrecord_frags – list of Dseqrecord objects formed by the digestion
Dseq describes a double stranded DNA fragment, linear or circular.
Dseq can be initiated in two ways. The first uses two strings representing the
Watson (upper, sense) strand and the Crick (lower, antisense) strand, plus an
optional value describing the stagger between the strands on the left side (ovhg).
Alternatively, a single string representation using dsIUPAC codes can be used.
If a single string is used, the letters of that string are interpreted as base
pairs rather than single bases. For example “A” would indicate the basepair
“A/T”. An expanded IUPAC code is used where the letters PEXI have been assigned
to GATC on the Watson strand with no pairing base on the Crick strand: G/””, A/””,
T/”” and C/””. The letters QFZJ have been assigned the opposite base pairs with
an empty Watson strand: “”/G, “”/A, “”/T, and “”/C.
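The single-string code can be pictured by expanding it into two display rows. The sketch below is an illustration under the letter assignments stated above, not pydna's parser; the name expand_dscode is hypothetical, and only the uppercase letters G, A, T, C, PEXI and QFZJ are handled.

```python
WATSON_ONLY = dict(zip("PEXI", "GATC"))  # base on Watson, nothing on Crick
CRICK_ONLY = dict(zip("QFZJ", "GATC"))   # base on Crick, nothing on Watson
PAIRED = {"G": "C", "A": "T", "T": "A", "C": "G"}


def expand_dscode(code):
    """Return (watson_row, crick_row); a space marks a missing base."""
    watson_row, crick_row = [], []
    for letter in code:
        if letter in WATSON_ONLY:
            watson_row.append(WATSON_ONLY[letter])
            crick_row.append(" ")
        elif letter in CRICK_ONLY:
            watson_row.append(" ")
            crick_row.append(CRICK_ONLY[letter])
        else:
            watson_row.append(letter)
            crick_row.append(PAIRED[letter])  # KeyError for unsupported letters
    return "".join(watson_row), "".join(crick_row)


for row in expand_dscode("PEXIGATCQFZJ"):
    print(repr(row))
```

The Crick row is written so that each base sits under its partner on the Watson row.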
watson (str) – a string representing the Watson (sense) DNA strand or a basepair
representation.
crick (str, optional) – a string representing the Crick (antisense) DNA strand.
ovhg (int, optional) – A positive or negative number to describe the stagger between the
Watson and Crick strands.
See below for a detailed explanation.
circular (bool, optional) – True indicates that sequence is circular, False that it is linear.
Examples
Dseq is a subclass of the Biopython Bio.Seq.Seq class. The constructor
can accept two strings representing the Watson (sense) and Crick(antisense)
DNA strands. These are interpreted as single stranded DNA. There is a check
for complementarity between the strands.
If the DNA molecule is staggered on the left side, an integer ovhg
(overhang) must be given, describing the stagger between the Watson and Crick strand
in the 5’ end of the fragment.
Additionally, the optional boolean parameter circular can be given to indicate if the
DNA molecule is circular.
The most common usage of the Dseq class is probably not to use it directly, but to
create it as part of a Dseqrecord object (see pydna.dseqrecord.Dseqrecord).
This works in the same way as for the relationship between the Bio.Seq.Seq and
Bio.SeqRecord.SeqRecord classes in Biopython.
There are multiple ways of creating a Dseq object directly listed below, but you can also
use the function Dseq.from_full_sequence_and_overhangs() to create a Dseq:
Two arguments (string, string), no overhang provided:
If Watson and Crick are given, but not ovhg, an attempt will be made to find the best annealing
between the strands. There are important limitations to this. If there are several ways to
anneal the strands, this will fail. For long fragments it is quite slow.
Three arguments (string, string, ovhg=int):
The ovhg parameter is an integer describing the length of the Crick strand overhang on the
left side (the 5’ end of Watson strand).
The ovhg parameter controls the stagger at the five prime end:
If the ovhg parameter is specified a Crick strand also needs to be supplied, or
an exception is raised.
>>> Dseq(watson="agt", ovhg=2)
Traceback (most recent call last):
...
ValueError: ovhg (overhang) defined without a crick strand.
The shape or topology of the fragment is set by the circular parameter, True or False (default).
>>> Dseq("aaa", "ttt", ovhg=0)  # A linear sequence by default
Dseq(-3)
aaa
ttt
>>> Dseq("aaa", "ttt", ovhg=0, circular=False)  # A linear sequence if circular is False
Dseq(-3)
aaa
ttt
>>> Dseq("aaa", "ttt", ovhg=0, circular=True)  # A circular sequence
Dseq(o3)
aaa
ttt
>>> Dseq("aaa", "ttt", ovhg=1, circular=False)
Dseq(-4)
 aaa
ttt
>>> Dseq("aaa", "ttt", ovhg=-1)
Dseq(-4)
aaa
 ttt
>>> Dseq("aaa", "ttt", circular=True, ovhg=0)
Dseq(o3)
aaa
ttt
The molecular weight of the DNA/RNA molecule in g/mol.
The molecular weight data in Biopython Bio.Data.IUPACData
is used. The DNA is assumed to have a 5’-phosphate as many
DNA fragments from restriction digestion do:
   P-G-A-T-T-A-C-A-OH
     | | | | | | |
  OH-C-T-A-A-T-G-T-P
The molecular weights listed in the unambiguous_dna_weights
dictionary refer to free monophosphate nucleotides.
One water molecule is removed for every phosphodiester bond
formed between nucleotides. For linear molecules, the weight
of one water molecule is added to account for the terminal
hydroxyl group and a hydrogen on the 5’ terminal phosphate
group.
>>> ds = Dseq("TAAG", circular=True)
>>> ds.shifted(1)  # First bp moved to right side:
Dseq(o4)
AAGT
TTCA
>>> ds.shifted(-1)  # Last bp moved to left side:
Dseq(o4)
GTAA
CATT
Returns a 2-tuple of strings describing the structure of the 5’ end of
the DNA fragment.
The tuple contains (type, sticky) where type is either “5’” or “3’”.
sticky is always in lower case and contains the sequence of the
protruding end in 5’-3’ direction.
Fill in of five prime protruding ends with a DNA polymerase
that has only DNA polymerase activity (such as Exo-Klenow [5]).
Exo-Klenow is a modified version of the Klenow fragment of E.
coli DNA polymerase I, which has been engineered to lack both
3’-5’ proofreading and 5’-3’ exonuclease activities.
The fill-in can be done in the presence of any combination of A, G, C or T.
The default is all four nucleotides together.
Fill in of five prime protruding ends with a DNA polymerase
that has only DNA polymerase activity (such as Exo-Klenow [6]).
Exo-Klenow is a modified version of the Klenow fragment of E.
coli DNA polymerase I, which has been engineered to lack both
3’-5’ proofreading and 5’-3’ exonuclease activities.
The fill-in can be done in the presence of any combination of A, G, C or T.
The default is all four nucleotides together.
Simulates treatment with a nuclease that has both 5’-3’ and 3’-5’ single
strand specific exonuclease activity (such as mung bean nuclease [7]).
Mung bean nuclease is a nuclease enzyme derived from mung bean sprouts
that preferentially degrades single-stranded DNA and RNA into
5’-phosphate- and 3’-hydroxyl-containing nucleotides.
Treatment results in blunt DNA, regardless of whether the protruding end
is 5’ or 3’.
Simulates treatment with a nuclease that has both 5’-3’ and 3’-5’ single
strand specific exonuclease activity (such as mung bean nuclease [8]).
Mung bean nuclease is a nuclease enzyme derived from mung bean sprouts
that preferentially degrades single-stranded DNA and RNA into
5’-phosphate- and 3’-hydroxyl-containing nucleotides.
Treatment results in blunt DNA, regardless of whether the protruding end
is 5’ or 3’.
Fill in 5’ protruding ends and nibble 3’ protruding ends.
This is done using a DNA polymerase providing 3’-5’ nuclease activity
such as T4 DNA polymerase. This can be done in presence of any
combination of the four nucleotides A, G, C or T.
T4 DNA polymerase is widely used to “polish” DNA ends because of its
strong 3’-5’ exonuclease activity. In the absence of dNTPs, it chews
back 3′ overhangs to create blunt ends; in the presence of limiting
dNTPs, it can fill in 5′ overhangs; and by carefully controlling
reaction time, temperature, and nucleotide supply, you can generate
defined recessed or blunt termini.
Tuning the nucleotide set can facilitate engineering of partial
sticky ends. The default is all four nucleotides together.
      aaagatc-3        aaa            3' ends are always removed.
      |||       --->   |||            A and T needed or the molecule will
3-ctagttt              ttt            degrade completely.

5-gatcaaa              gatcaaaGATC    5' ends are filled in the
      |||       --->   |||||||||||    presence of GATC
      tttctag-5        CTAGtttctag

5-gatcaaa              gatcaaaGAT     5' ends are partially filled in the
      |||       --->    |||||||||     presence of GAT to produce a 1 nt
      tttctag-5         TAGtttctag    5' overhang

5-gatcaaa              gatcaaaGA      5' ends are partially filled in the
      |||       --->     |||||||      presence of GA to produce a 2 nt
      tttctag-5          AGtttctag    5' overhang

5-gatcaaa              gatcaaaG       5' ends are partially filled in the
      |||       --->      |||||       presence of G to produce a 3 nt
      tttctag-5           Gtttctag    5' overhang
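The partial fill-in cases can be reproduced with a small stand-alone sketch. It models only the extension of a recessed 3' end across a 5' overhang and stops at the first base whose complement is not supplied; it is not pydna's implementation, and the name fill_in_overhang is hypothetical.

```python
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}


def fill_in_overhang(overhang, nucleotides="GATC"):
    """Return (added, remaining) for a 5' overhang written 5'->3'.

    added is the newly synthesised stretch (5'->3'), remaining is the number
    of overhang nucleotides left single stranded.
    """
    available = set(nucleotides.lower())
    added = []
    # The polymerase reads the template 3'->5', starting next to the duplex.
    for base in reversed(overhang.lower()):
        new_base = COMPLEMENT[base]
        if new_base not in available:
            break
        added.append(new_base)
    return "".join(added), len(overhang) - len(added)


for supplied in ("GATC", "GAT", "GA", "G"):
    print(supplied, fill_in_overhang("gatc", supplied))
```

With the gatc overhang used in the figure, supplying GAT, GA or G leaves 1, 2 or 3 nt of 5' overhang respectively.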
5’ => 3’ resection at the left side (start) of the molecule.
The argument n indicates the number of nucleotides that are to be
removed. The outcome of this depends on the structure of the molecule.
See the two examples below:
The figure below indicates a recess of length two from a blunt DNA
fragment. The resulting DNA fragment has a 3’ protruding single strand.
gatc           tc
||||   -->     ||
ctag         ctag
The figure below indicates a recess of length two from a DNA fragment
with a 5’ sticky end resulting in a blunt sequence.
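Both outcomes of a left-side 5' => 3' resection can be sketched on plain display strings, covering the blunt fragment and a fragment with a 2 nt 5' overhang. This is an illustration, not pydna's implementation; the name resect_5to3_left is hypothetical, a space marks a missing base, and the Crick row is written 3'->5' under the Watson row.

```python
def resect_5to3_left(watson_row, crick_row, n):
    """Remove n nucleotides from the 5' end of the Watson strand (left side)."""
    bases = watson_row.lstrip(" ")
    pad = len(watson_row) - len(bases)
    # Replace the removed bases with spaces so the rows stay aligned.
    new_watson = " " * (pad + min(n, len(bases))) + bases[n:]
    # Drop columns that are empty on both strands.
    columns = [(w, c) for w, c in zip(new_watson, crick_row) if (w, c) != (" ", " ")]
    return "".join(w for w, _ in columns), "".join(c for _, c in columns)


print(resect_5to3_left("gatc", "ctag", 2))  # blunt fragment
print(resect_5to3_left("gatc", "  ag", 2))  # fragment with a 2 nt 5' overhang
```

The first call leaves the Crick strand protruding with its 3' end; the second removes exactly the overhang and gives a blunt fragment.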
5’ => 3’ resection at the right side (end) of the molecule.
The argument n indicates the number of nucleotides that are to be
removed. The outcome of this depends on the structure of the molecule.
See the two examples below:
The figure below indicates a recess of length two from a blunt DNA
fragment. The resulting DNA fragment has a 3’ protruding single strand.
gatc         gatc
||||   -->   ||
ctag         ct
The figure below indicates a recess of length two from a DNA fragment
with a 5’ sticky end resulting in a blunt sequence.
3’ => 5’ resection at the left side (beginning) of the molecule.
The argument n indicates the number of nucleotides that are to be
removed. The outcome of this depends on the structure of the molecule.
See the two examples below:
The figure below indicates a recess of length two from a blunt DNA
fragment. The resulting DNA fragment has a 5’ protruding single strand.
gatc         gatc
||||   -->     ||
ctag           ag
The figure below indicates a recess of length two from a DNA fragment
with a 3’ sticky end resulting in a blunt sequence.
3’ => 5’ resection at the right side (end) of the molecule.
The argument n indicates the number of nucleotides that are to be
removed. The outcome of this depends on the structure of the molecule.
See the two examples below:
The figure below indicates a recess of length two from a blunt DNA
fragment. The resulting DNA fragment has a 5’ protruding single strand.
gatc         ga
||||   -->   ||
ctag         ctag
The figure below indicates a recess of length two from a DNA fragment
with a 3’ sticky end resulting in a blunt sequence.
Terminal deoxynucleotidyl transferase (TdT) is a template-independent
DNA polymerase that adds nucleotides to the 3′-OH ends of DNA, typically
single-stranded or recessed 3′ ends. In cloning, it’s classically used
to create homopolymer tails (e.g. poly-dG on a vector and poly-dC on an insert)
so that fragments can anneal via complementary overhangs (“tailing” cloning).
This activity is also present in some DNA polymerases, such as Taq polymerase.
This property is used in the popular T/A cloning protocol ([9]).
USER Enzyme is a mixture of Uracil DNA glycosylase (UDG) and the
DNA glycosylase-lyase Endonuclease VIII.
UDG catalyses the excision of a uracil base, forming an abasic
or apyrimidinic site (AP site). Endonuclease VIII removes the AP
site, creating a DNA gap.
((cut_watson, ovhg), enz), for example ((396, -4), EcoRI)
The cut_watson (positive integer) is the cut position of the sequence as for example
returned by the Bio.Restriction module.
The ovhg (overhang, positive or negative integer or 0) has the same meaning as
for restriction enzymes in the Bio.Restriction module and for
pydna.dseq.Dseq objects (see docstring for this module and example below)
Returns a list of cutsites, represented as ((cut_watson, ovhg), enz):
cut_watson is a positive integer contained in [0,len(seq)), where seq is the sequence
that will be cut. It represents the position of the cut on the watson strand, using the full
sequence as a reference. By “full sequence” I mean the one you would get from str(Dseq).
ovhg is the overhang left after the cut. It has the same meaning as ovhg in
the Bio.Restriction enzyme objects, or pydna’s Dseq property.
enz is the enzyme object. It’s not necessary to perform the cut, but can be
used to keep track of which enzyme was used.
Cuts are only returned if the recognition site and overhang are on the double-strand
part of the sequence.
Two lists of 2-tuples of integers are returned. Each tuple
(from, to) contains the start and end positions of a single
stranded region, shorter than or equal to length.
In the example below, the middle 2 nt part is released from the
molecule.
DNA molecules can fall apart by melting if they have internal single
stranded regions. In the example below, the molecule has two gaps
on opposite sides, two nucleotides apart, which means that it hangs
together by two basepairs.
This molecule can melt into two separate 8 bp double stranded
molecules, each with 3 nt 3’ overhangs as depicted below.
A list of 2-tuples is returned. Each tuple ((cut_watson, ovhg), None)
contains the cut position and the overhang value in the same format as
returned by the get_cutsites method for restriction enzymes.
Note that this function deals with melting that results in two double
stranded DNA molecules.
See get_ss_meltsites for melting of single stranded regions from
molecules.
For a given cut expressed as ((cut_watson, ovhg), enz), returns
a tuple (cut_watson, cut_crick, ovhg).
cut_watson: see get_cutsites docs
cut_crick: equivalent of cut_watson in the crick strand
ovhg: see get_cutsites docs
The cut can be None if it represents the left or right end of the sequence.
Then it will return the position of the watson and crick ends with respect
to the “full sequence”. The is_left parameter is only used in this case.
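The conversion from a cut to (cut_watson, cut_crick, ovhg) can be sketched as follows. The relation cut_crick = cut_watson - ovhg is an assumption derived from the sign convention described above (a negative ovhg, as for EcoRI, places the Crick cut to the right of the Watson cut); the sketch does not handle the case where the cut is None, and the name cut_parameters is hypothetical.

```python
def cut_parameters(cut, seq_length, circular=False):
    """Return (cut_watson, cut_crick, ovhg) for a cut ((cut_watson, ovhg), enz)."""
    (cut_watson, ovhg), _enzyme = cut
    # Assumed relation between the two cut positions on the full sequence.
    cut_crick = cut_watson - ovhg
    if circular:
        cut_crick %= seq_length  # wrap around the origin of a circular molecule
    return cut_watson, cut_crick, ovhg


# The enzyme is represented by a plain string here instead of an enzyme object.
print(cut_parameters(((396, -4), "EcoRI"), seq_length=1000))
```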
Single stranded DNA molecules shorter than or equal to length are shed from
a double stranded DNA molecule without affecting the length of the
remaining molecule.
In the examples below, the middle 2 nt part is released from the
molecule.
Returns pairs of cutsites that render the edges of the resulting fragments.
A fragment produced by restriction is represented by a tuple of length 2 that
may contain cutsites or None:
Two cutsites: represents the extraction of a fragment between those two
cutsites, in that orientation. To represent the opening of a circular
molecule with a single cutsite, we put the same cutsite twice.
None, cutsite: represents the extraction of a fragment between the left
edge of linear sequence and the cutsite.
cutsite, None: represents the extraction of a fragment between the cutsite
and the right edge of a linear sequence.
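The three kinds of pairs described above can be generated from a sorted list of cutsites with a few lines of Python. This is a sketch of the pairing rule only, not necessarily how pydna implements it; the name cutsite_pairs is hypothetical and the cutsites are assumed to be sorted by position already.

```python
def cutsite_pairs(cutsites, circular):
    """Pair up neighbouring cutsites so that each pair delimits one fragment."""
    if not cutsites:
        return []
    if circular:
        # Close the circle by returning to the first cutsite.
        edges = list(cutsites) + [cutsites[0]]
    else:
        # None stands for the left and right ends of a linear molecule.
        edges = [None, *cutsites, None]
    return list(zip(edges, edges[1:]))


# Cutsites are represented by plain strings here for brevity.
print(cutsite_pairs(["cut_a", "cut_b"], circular=False))
print(cutsite_pairs(["cut_a"], circular=True))
```

A circular molecule with a single cutsite yields one pair holding the same cutsite twice, which represents opening the circle.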
If no sequences are found, an empty list is returned. This is a greedy
function, use carefully.
Parameters:
data (string or iterable) –
The data parameter is a string containing:
1. an absolute path to a local file. The file will be read in text
   mode and parsed for EMBL, FASTA and Genbank sequences. Can be
   a string or a Path object.
2. a string containing one or more sequences in EMBL, GENBANK,
   or FASTA format. Mixed formats are allowed.
data can be a list or other iterable where the elements are 1 or 2.
ds (bool) – If True double stranded Dseqrecord objects are returned.
If False single stranded Bio.SeqRecord[10] objects are
returned.
This function designs a forward primer and a reverse primer for PCR amplification
of a given template sequence.
The template argument is a Dseqrecord object or equivalent containing the template sequence.
The optional fp and rp arguments can contain an existing primer for the sequence (either the forward or reverse primer).
Only one of the two primers can be specified, not both, since there would then be nothing to design; use the pydna.amplify.pcr function instead in that case.
The limit argument is the minimum length of the primer. The default value is 13.
If one of the primers is given, the other primer is designed to match in terms of Tm.
If both primers are designed, they will be designed to target_tm.
tm_func is a function that takes an ASCII string representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but they can be substituted with a custom-made function.
estimate_function is a tm_func-like function that is used to get a first guess for the primer design, that is then used as starting
point for the final result. This is useful when the tm_func function is slow to calculate (e.g. it relies on an
external API, such as the NEB primer design API). The estimate_function should be faster than the tm_func function.
The default value is None.
To use the default tm_func as estimate function to get the NEB Tm faster, you can do:
primer_design(dseqr, target_tm=55, tm_func=tm_neb, estimate_function=tm_default).
The function returns a pydna.amplicon.Amplicon class instance. This object has
the object.forward_primer and object.reverse_primer properties which contain the designed primers.
fp (pydna.primer.Primer, optional) – optional pydna.primer.Primer object containing the forward primer.
rp (pydna.primer.Primer, optional) – optional pydna.primer.Primer object containing the reverse primer.
target_tm (float, optional) – target tm for the primers, set to 55°C by default.
tm_func (function) – Function used for tm calculation. This function takes an ASCII string
representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but they can be
substituted with a custom-made function.
This function returns a list of pydna.amplicon.Amplicon objects where
primers have been modified with tails so that the fragments can be fused in
the order they appear in the list by, for example, Gibson assembly or homologous
recombination.
For two adjacent fragments a and b, we can modify the reverse primer of a and the forward primer of b with tails to allow
fusion by fusion PCR, Gibson assembly or in-vivo homologous recombination.
The basic requirements for the primers for the three techniques are the same.
The first argument of this function is a list of sequence objects containing
Amplicons and other similar objects.
At least every second sequence object needs to be an Amplicon.
This rule exists because if a sequence object that is not a PCR product
is to be fused with another fragment, that other fragment needs to be an Amplicon
so that the primer of the other object can be modified to include the whole stretch
of sequence homology needed for the fusion. See the example below where a is a
non-amplicon (a linear plasmid vector for instance)
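The rule can be expressed as a check on neighbouring fragments. The sketch below reads "at least every second sequence object needs to be an Amplicon" as "no two neighbouring fragments may both be non-Amplicons", which is an interpretation rather than pydna's own validation. The name every_second_is_amplicon is hypothetical, and fragments are represented only by a label for their kind.

```python
def every_second_is_amplicon(kinds):
    """Return True if each pair of neighbouring fragments contains an Amplicon."""
    return all("Amplicon" in pair for pair in zip(kinds, kinds[1:]))


print(every_second_is_amplicon(["Dseqrecord", "Amplicon", "Dseqrecord", "Amplicon"]))
print(every_second_is_amplicon(["Amplicon", "Dseqrecord", "Dseqrecord"]))
```

In the second call two non-Amplicon fragments are neighbours, so neither has a primer that could carry the homology needed at their junction.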
The overlap argument controls how many base pairs of overlap are required between
adjacent sequence fragments. In the junction between Amplicons, tails with a
length of about half of this value are added to the two primers
closest to the junction.
In the case of an Amplicon adjacent to a Dseqrecord object, the tail will
be twice as long (1*overlap) since the
recombining sequence is present entirely on this primer:
Note that if the sequence of DNA fragments starts or stops with an Amplicon,
the very first and very last primer will not be modified, i.e. assemblies are
always assumed to be linear. There are simple tricks around that for circular
assemblies depicted in the last two examples below.
The maxlink argument controls the cut-off length for sequences that will be
synthesized by adding them to primers for the adjacent fragment(s). The
argument list may contain short spacers (such as spacers between fusion proteins).
Example 1: Linear assembly of PCR products (pydna.amplicon.Amplicon class objects)

>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon4
         >       <         >       <

                 ⇣
  pydna.design.assembly_fragments
                 ⇣

>       <-       ->       <-                 pydna.assembly.Assembly
Amplicon1         Amplicon3
         Amplicon2         Amplicon4     ➤  Amplicon1Amplicon2Amplicon3Amplicon4
        ->       <-       ->       <

Example 2: Linear assembly of alternating Amplicons and other fragments

>       <         >       <
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2

                 ⇣
  pydna.design.assembly_fragments
                 ⇣

>       <--     -->       <--                pydna.assembly.Assembly
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2     ➤  Amplicon1Dseqrecd1Amplicon2Dseqrecd2

Example 3: Linear assembly of alternating Amplicons and other fragments

Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2
         >       <       -->       <

                 ⇣
  pydna.design.assembly_fragments
                 ⇣
                                             pydna.assembly.Assembly
Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2     ➤  Dseqrecd1Amplicon1Dseqrecd2Amplicon2
       -->       <--     -->       <

Example 4: Circular assembly of alternating Amplicons and other fragments

                 ->       <==
Dseqrecd1         Amplicon2
         Amplicon1         Dseqrecd1
       -->       <-

                 ⇣
  pydna.design.assembly_fragments
                 ⇣
                                             pydna.assembly.Assembly
                 ->       <==
Dseqrecd1         Amplicon2                  -Dseqrecd1Amplicon1Amplicon2-
         Amplicon1                       ➤  |                             |
       -->       <-                          -----------------------------

Example 5: Circular assembly of Amplicons

>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
         >       <         >       <

                 ⇣
  pydna.design.assembly_fragments
                 ⇣

>       <=       ->       <-
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
        ->       <-       +>       <

                 ⇣
make new Amplicon using the Amplicon1.template and
the last fwd primer and the first rev primer.
                 ⇣
                                             pydna.assembly.Assembly
+>       <=       ->       <-
Amplicon1         Amplicon3                  -Amplicon1Amplicon2Amplicon3-
         Amplicon2                       ➤  |                             |
        ->       <-                          -----------------------------
Parameters:
f (list of pydna.amplicon.Amplicon and other Dseqrecord like objects) – list Amplicon and Dseqrecord object for which fusion primers should be constructed.
overlap (int, optional) – Length of required overlap between fragments.
maxlink (int, optional) – Maximum length of spacer sequences that may be present in f. These will be included in tails for designed primers.
circular (bool, optional) – If True, the assembly is circular. If False, the assembly is linear.
>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.design import primer_design
>>> a = primer_design(Dseqrecord("atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"))
>>> b = primer_design(Dseqrecord("ccaaacccaccaggtaccttatgtaagtacttcaagtcgccagaagacttcttggtcaagttgcc"))
>>> c = primer_design(Dseqrecord("tgtactggtgctgaaccttgtatcaagttgggtgttgacgccattgccccaggtggtcgtttcgtt"))
>>> from pydna.design import assembly_fragments
>>> # We would like a circular recombination, so the first sequence has to be repeated
>>> fa1, fb, fc, fa2 = assembly_fragments([a, b, c, a])
>>> # Since all fragments are Amplicons, we need to extract the rp of the 1st and fp of the last fragments.
>>> from pydna.amplify import pcr
>>> fa = pcr(fa2.forward_primer, fa1.reverse_primer, a)
>>> [fa, fb, fc]
[Amplicon(100), Amplicon(101), Amplicon(102)]
>>> fa.name, fb.name, fc.name = "fa fb fc".split()
>>> from pydna.assembly import Assembly
>>> assemblyobj = Assembly([fa, fb, fc])
>>> assemblyobj
Assembly
fragments....: 100bp 101bp 102bp
limit(bp)....: 25
G.nodes......: 6
algorithm....: common_sub_strings
>>> assemblyobj.assemble_linear()
[Contig(-231), Contig(-166), Contig(-36)]
>>> assemblyobj.assemble_circular()[0].seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> (a + b + c).looped().seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> print(assemblyobj.assemble_circular()[0].figure())
 -|fa|36
|     \/
|     /\
|     36|fb|36
|           \/
|           /\
|           36|fc|36
|                 \/
|                 /\
|                 36-
|                    |
 --------------------
Compares two or more DNA sequences for equality i.e. if they
represent the same DNA molecule.
Two linear sequences are considered equal if either:
They have the same sequence (case insensitive)
One sequence is the reverse complement of the other
Two circular sequences are considered equal if they are circular
permutations meaning that they have the same length and:
One sequence can be found in the concatenation of the other sequence with itself.
The reverse complement of one sequence can be found in the concatenation of the other sequence with itself.
The topology for the comparison can be set by passing one of the keywords
linear or circular as True or False.
If circular or linear is not set, it will be deduced from the topology of
each sequence for sequences that have a linear or circular attribute
(like Dseq and Dseqrecord).
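The comparison rules above can be sketched on plain strings. This is only an illustration of the stated rules, not pydna's implementation: the names same_molecule and reverse_complement are made up for this sketch, only unambiguous bases are handled, and the topology is passed explicitly instead of being deduced from the objects.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string (hypothetical helper)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def same_molecule(x: str, y: str, circular: bool = False) -> bool:
    """Return True if x and y describe the same molecule under the rules above."""
    x, y = x.upper(), y.upper()  # the comparison is case insensitive
    if len(x) != len(y):
        return False
    candidates = (y, reverse_complement(y))
    if circular:
        # A circular permutation of x is always a substring of x + x.
        doubled = x + x
        return any(candidate in doubled for candidate in candidates)
    return x in candidates


print(same_molecule("AAAT", "attt"))                 # reverse complements
print(same_molecule("AAGT", "GTAA", circular=True))  # circular permutation
print(same_molecule("AAGT", "GTAA"))                 # not equal as linear molecules
```

The length check matters for the circular case: without it, any short substring of x + x would be reported as equal.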
This function takes a string containing one genbank sequence
in Genbank format and returns a named tuple containing two fields,
the gbtext containing a string with the corrected genbank sequence and
jseq which contains the JSON intermediate.
Examples
>>> s = '''LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
... DEFINITION .
... ACCESSION
... VERSION
... SOURCE .
... ORGANISM .
... COMMENT
... COMMENT ApEinfo:methylated:1
... ORIGIN
... 1 aaa
... //'''
>>> from pydna.readers import read
>>> read(s)
... /site-packages/Bio/GenBank/Scanner.py:1388: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013\n'
"correct?\n:%r" % line, BiopythonParserWarning)
Traceback (most recent call last):
File "... /pydna/readers.py", line 48, in read
results = results.pop()
IndexError: pop from empty list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "... /pydna/readers.py", line 50, in read
raise ValueError("No sequences found in data:\n({})".format(data[:79]))
ValueError: No sequences found in data:
(LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
DEFINITI)
>>> from pydna.genbankfixer import gbtext_clean
>>> s2, j2 = gbtext_clean(s)
>>> print(s2)
LOCUS New_DNA 3 bp ds-DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION
VERSION
SOURCE .
ORGANISM .
COMMENT
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
>>> s3 = read(s2)
>>> s3
Dseqrecord(o3)
>>> print(s3.format())
LOCUS New_DNA 3 bp DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION New_DNA
VERSION New_DNA
KEYWORDS .
SOURCE
ORGANISM .
.
COMMENT
ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
This file serves to define dscode, the DNA alphabet used in pydna.
Each symbol represents a basepair (two opposing bases in the two antiparallel
DNA strands).
The alphabet is defined at the end of this docstring, which serves as the single
source of truth. The alphabet is used to construct the codestrings dictionary,
which has the following keys (strings) in the order indicated:
un_ambiguous_ds_dna
ds_rna
ambiguous_ds_dna
single_stranded_dna_rna
loops_dna_rna
mismatched_dna_rna
gap
Each value of the codestrings dictionary is a multiline string. This string
has five lines following this form:
W (line 1) and C (line 3) are complementary bases in a double stranded DNA
molecule and S (line 5) are the symbols of the alphabet used to
describe the base pair above the symbol.
Line 2 must contain only the pipe character, indicating basepairing, and
line 4 must be empty. The lines must be of equal length and a series of
tests are performed to ensure the integrity of the alphabet.
The string definition as well as the keys for the codestrings dict follow this
line and are contained in the last 13 lines of the docstring:
A regular expression for finding double-stranded regions flanked by single-stranded DNA
that can be melted to shed a single-stranded fragment.
This function returns a regular expression that finds double-stranded regions
(of length <= length) that are flanked by single-stranded regions on the same
side in dscode format. These regions are useful to identify as potential melt
sites, since melting them leads to the shedding of a single-stranded fragment.
The regular expression finds double stranded patches flanked by empty
positions on the same side (see figure below). Melting of this kind of
sites leads to the shedding of a single stranded fragment.
A regular expression for finding double-stranded regions flanked by single-stranded DNA
that can be melted to shed multiple double stranded fragments.
This function returns a regular expression that finds double-stranded regions
(of length <= length) that are flanked by single-stranded regions on opposite
sides in dscode format. These regions are useful to identify as potential melt
sites, since melting them leads to separation into multiple double stranded fragments.
The regular expression finds double stranded patches flanked by empty
positions on opposite sides (see figure below). Melting of this kind of
sites leads to separation into multiple double stranded fragments.
aaaGFTTAIAttt <– dscode
aaaG TTACAttt <– “TTA” is found by the regex for length <= 3
tttCTAAT Taaa
Find double strand breaks in DNA in dscode format.
An empty watson position next to an empty crick position in the dsDNA
leads to a discontinuous DNA. This function is used to show breaks in
DNA in Dseq.__init__.
Two line string representation of a sequence of dscode symbols.
See pydna.alphabet module for the definition of the pydna dscode
alphabet. The dscode has a symbol (ascii) character for base pairs
and single stranded DNA.
This function is used by the Dseq.__repr__() method.
Parameters:
data (TYPE, optional) – DESCRIPTION. The default is “”.
Returns:
A two line string containing the Watson and Crick strands.
The Amplicon class holds information about a PCR reaction involving two
primers and one template. This class is used by the Anneal class and is not
meant to be instantiated directly.
Parameters:
forward_primer (SeqRecord(Biopython)) – SeqRecord object holding the forward (sense) primer
reverse_primer (SeqRecord(Biopython)) – SeqRecord object holding the reverse (antisense) primer
template (Dseqrecord) – Dseqrecord object holding the template (circular or linear)
This module provides the Anneal class and the pcr() function
for PCR simulation. The pcr function is simpler to use, but expects only one
PCR product. The Anneal class should be used if more flexibility is required.
Primers with 5’ tails as well as inverse PCR on circular templates are handled
correctly.
pcr is a convenience function for the Anneal class to simplify its
usage, especially from the command line. If more than one or no PCR
product is formed, a ValueError is raised.
args is any iterable of Dseqrecords or an iterable of iterables of
Dseqrecords. args will be greedily flattened.
Parameters:
args (iterable containing sequence objects) – Several arguments are also accepted.
limit (int = 13, optional) – limit length of the annealing part of the primers.
Notes
sequences in args could be of type:
string
Seq
SeqRecord (or subclass)
Dseqrecord (or subclass)
The last sequence will be assumed to be the template while
all preceding sequences will be assumed to be primers.
This is a powerful function, use with care!
Returns:
product – A pydna.amplicon.Amplicon object representing the PCR
product. The direction of the PCR product will be the same as for
the template sequence.
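To make the behaviour described above more concrete, here is a toy PCR on plain strings. It is a simplified sketch, not how pydna.amplify works internally: toy_pcr, find_all and reverse_complement are hypothetical names, only the top strand of a linear template is searched, and the last limit bases of each primer must match exactly. As in the description, anything other than exactly one product raises a ValueError.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string (hypothetical helper)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def find_all(haystack: str, needle: str) -> list:
    """Start positions of every (possibly overlapping) occurrence of needle."""
    return [i for i in range(len(haystack) - len(needle) + 1)
            if haystack.startswith(needle, i)]


def toy_pcr(forward_primer: str, reverse_primer: str, template: str, limit: int = 13) -> str:
    """Return the top strand of the single PCR product, or raise ValueError."""
    fwd = forward_primer.upper()
    rev_rc = reverse_complement(reverse_primer.upper())
    tmpl = template.upper()
    fwd_site = fwd[-limit:]    # 3' end of the forward primer
    rev_site = rev_rc[:limit]  # 3' end of the reverse primer, read on the top strand
    fwd_hits = find_all(tmpl, fwd_site)
    rev_hits = find_all(tmpl, rev_site)
    if len(fwd_hits) != 1 or len(rev_hits) != 1:
        raise ValueError("expected exactly one annealing site per primer")
    start, end = fwd_hits[0], rev_hits[0]
    if end < start + len(fwd_site):
        raise ValueError("primers do not point towards each other")
    # Primer tails (5' extensions) become part of the product.
    return fwd + tmpl[start + len(fwd_site):end] + rev_rc


product = toy_pcr("ttATGAAA", "ccTTAAAACC", "GGATCCATGAAACCCGGGTTTTAAGAATTC", limit=6)
print(product)
```

The product keeps the direction of the template, and the 5' tails "tt" and "cc" end up at its two ends.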
Assembly of sequences by homologous recombination.
Should also be useful for related techniques such as Gibson assembly and fusion
PCR. Given a list of sequences (Dseqrecords), all sequences are analyzed for
shared homology longer than the set limit.
A graph is constructed where each overlapping region forms a node and
sequences separating the overlapping regions form edges.
Assembly of a list of linear DNA fragments into linear or circular
constructs. The Assembly is meant to replace the Assembly method as it
is easier to use. Accepts a list of Dseqrecords (source fragments) to
initiate an Assembly object. Several methods are available for analysis
of overlapping sequences, graph construction and assembly.
Improved implementation of the assembly module. To see a list of issues with the previous implementation,
see [issues tagged with fixed-with-new-assembly-model](pydna-group/pydna#issues)
Turn a list of locations into a list of tuples of those locations, where each tuple contains
locations that overlap. For example, if locs = [loc1, loc2, loc3], and loc1 and loc2 overlap,
the output will be [(loc1, loc2), (loc3,)].
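A minimal sketch of this grouping, assuming locations are simplified to half-open (start, end) integer tuples rather than Biopython location objects. The function name group_overlapping is hypothetical, and the groups are returned in sorted order, which may differ from the ordering used by the library.

```python
def group_overlapping(intervals):
    """Group half-open (start, end) intervals that overlap, directly or through a chain."""
    groups = []
    current = []
    current_end = None
    for interval in sorted(intervals):
        start, end = interval
        if current and start < current_end:
            # Overlaps the running group: extend it.
            current.append(interval)
            current_end = max(current_end, end)
        else:
            if current:
                groups.append(tuple(current))
            current = [interval]
            current_end = end
    if current:
        groups.append(tuple(current))
    return groups


loc1, loc2, loc3 = (0, 5), (3, 8), (10, 12)
print(group_overlapping([loc1, loc2, loc3]))
```

Intervals that merely touch, such as (0, 5) and (5, 8), are kept apart in this sketch because half-open intervals share no position.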
tuple[tuple[str, str], tuple[str, str]] – A tuple of two tuples, each containing the type of end ("5'", "3'", or "blunt")
and the sequence of the overhang. The first tuple is for the left end, second for the right end.
Assembly algorithm to find blunt overlaps. Used for blunt ligation.
It basically returns [(len(seqx), 0, 0)] if the right end of seqx is blunt and the
left end of seqy is blunt (compatible with blunt ligation). Otherwise, it returns an empty list.
Assembly algorithm to find common substrings of length == limit. see the docs of
the function common_sub_strings_str for more details. It is case insensitive.
Starting from the rightmost edge of the match, return a new match encompassing the max
number of bases. This can be used to return a longer match if a primer aligns for longer
than the limit or a shorter match if there are mismatches. This is convenient to maintain
as many features as possible. It is used in PCR assembly.
>>> seq = Dseqrecord('AAAAACGTCCCGT')
>>> primer = Dseqrecord('ACGTCCCGT')
>>> match = (13, 9, 0)  # an empty match at the end of each
>>> zip_match_leftwards(seq, primer, match)
(4, 0, 9)
Works in circular molecules if the match spans the origin:
>>> seq = Dseqrecord('TCCCGTAAAAACG', circular=True)
>>> primer = Dseqrecord('ACGTCCCGT')
>>> match = (6, 9, 0)
>>> zip_match_leftwards(seq, primer, match)
(10, 0, 9)
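The leftward extension can be sketched on plain strings as follows. This is an illustration under assumptions, not the library code: zip_leftwards is a hypothetical name, matches are (start in seq, start in primer, length) tuples as in the examples above, and the topology is passed as a flag instead of being read from a Dseqrecord.

```python
def zip_leftwards(seq: str, primer: str, match: tuple, circular: bool = False) -> tuple:
    """Extend a match leftwards from its right edge for as long as the bases agree."""
    start_x, start_y, length = match
    seq, primer = seq.upper(), primer.upper()
    end_x = start_x + length  # right edge of the match on seq
    end_y = start_y + length  # right edge of the match on primer
    # On a linear sequence we cannot walk past position 0; on a circular one we wrap.
    max_steps = end_y if circular else min(end_x, end_y)
    max_steps = min(max_steps, len(seq))
    steps = 0
    while steps < max_steps:
        x = (end_x - steps - 1) % len(seq)
        y = end_y - steps - 1
        if seq[x] != primer[y]:
            break
        steps += 1
    new_start_x = (end_x - steps) % len(seq) if circular else end_x - steps
    return (new_start_x, end_y - steps, steps)


print(zip_leftwards("AAAAACGTCCCGT", "ACGTCCCGT", (13, 9, 0)))
print(zip_leftwards("TCCCGTAAAAACG", "ACGTCCCGT", (6, 9, 0), circular=True))
```

With a mismatch in the primer the walk stops early, which gives the shorter match mentioned in the description.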
Transform a Dseqrecord to a sequence string where U is replaced by T, everything is upper case and
circular sequences are repeated twice. This is used for PCR, to support primers with U’s (e.g. for USER cloning).
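As a rough sketch of that transformation on a plain string (the function name pcr_search_string and the explicit circular flag are assumptions made for this example):

```python
def pcr_search_string(seq: str, circular: bool) -> str:
    """Upper-case the sequence, read U as T, and double circular sequences."""
    normalized = seq.upper().replace("U", "T")
    # Doubling lets a match that spans the origin appear as one contiguous hit.
    return normalized * 2 if circular else normalized


print(pcr_search_string("acgu", circular=False))
print(pcr_search_string("acgu", circular=True))
```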
Assembly algorithm to find overlaps between a primer and a template. It accepts mismatches.
When there are mismatches, it only returns the common part between the primer and the template.
If seqx is a primer and seqy is a template, it represents the binding of a forward primer.
If seqx is a template and seqy is a primer, it represents the binding of a reverse primer,
where the primer has been passed as its reverse complement (see examples).
Convert an assembly to a string representation, for example:
((1, 2, [8:14], [1:7]),(2, 3, [10:17], [1:8]))
becomes:
('1[8:14]:2[1:7]', '2[10:17]:3[1:8]')
The reason for this is that by default, a feature '[8:14]' when present in a tuple
is printed to the console as SimpleLocation(ExactPosition(8),ExactPosition(14),strand=1) (very long).
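The conversion can be illustrated with locations simplified to (start, end) tuples. The name assembly_to_str is hypothetical, and strands are ignored in this sketch.

```python
def assembly_to_str(assembly):
    """Format each (u, v, loc_u, loc_v) edge as 'u[start:end]:v[start:end]'."""
    return tuple(
        f"{u}[{loc_u[0]}:{loc_u[1]}]:{v}[{loc_v[0]}:{loc_v[1]}]"
        for u, v, loc_u, loc_v in assembly
    )


assembly = ((1, 2, (8, 14), (1, 7)), (2, 3, (10, 17), (1, 8)))
print(assembly_to_str(assembly))
```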
Based on the topology of the locations of an assembly, determine if it is circular.
This does not work for insertion assemblies, that’s why assemble takes the optional argument is_insertion.
Turn this kind of edge representation fragment 1, fragment 2, right edge on 1, left edge on 2
a = [(1, 2, 'loc1a', 'loc2a'), (2, 3, 'loc2b', 'loc3b'), (3, 1, 'loc3c', 'loc1c')]
Into this: fragment 1, left edge on 1, right edge on 1
b = [(1, 'loc1c', 'loc1a'), (2, 'loc2a', 'loc2b'), (3, 'loc3b', 'loc3c')]
Turn this kind of subfragment representation fragment 1, left edge on 1, right edge on 1
a = [(1, 'loc1c', 'loc1a'), (2, 'loc2a', 'loc2b'), (3, 'loc3b', 'loc3c')]
Into this: fragment 1, fragment 2, right edge on 1, left edge on 2
b = [(1, 2, 'loc1a', 'loc2a'), (2, 3, 'loc2b', 'loc3b'), (3, 1, 'loc3c', 'loc1c')]
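Both conversions can be sketched for the circular case shown above, where the last edge closes the cycle back onto the first fragment. The function names are made up for this example and linear assemblies, whose terminal fragments lack one edge, are deliberately left out.

```python
def edges_to_subfragments(assembly):
    """(u, v, right edge on u, left edge on v) -> (fragment, left edge, right edge)."""
    return [
        # The left edge of fragment u comes from the previous join (index -1 wraps around).
        (u, assembly[i - 1][3], loc_u)
        for i, (u, _v, loc_u, _loc_v) in enumerate(assembly)
    ]


def subfragments_to_edges(subfragments):
    """(fragment, left edge, right edge) -> (u, v, right edge on u, left edge on v)."""
    n = len(subfragments)
    return [
        (frag, subfragments[(i + 1) % n][0], right, subfragments[(i + 1) % n][1])
        for i, (frag, _left, right) in enumerate(subfragments)
    ]


a = [(1, 2, "loc1a", "loc2a"), (2, 3, "loc2b", "loc3b"), (3, 1, "loc3c", "loc1c")]
b = edges_to_subfragments(a)
print(b)
print(subfragments_to_edges(b) == a)
```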
From the fragment representation returned by edge_representation2subfragment_representation, get the subfragments that are joined together.
Subfragments are the slices of the fragments that are joined together
For example:
  --A--
TACGTAAT
  --B--
 TCGTAACGA
Gives: TACGTAA / CGTAACGA
To reproduce:
a = Dseqrecord('TACGTAAT')
b = Dseqrecord('TCGTAACGA')
f = Assembly([a, b], limit=5)
a0 = f.get_linear_assemblies()[0]
print(assembly2str(a0))
a0_subfragment_rep = edge_representation2subfragment_representation(a0, False)
for f in get_assembly_subfragments([a, b], a0_subfragment_rep):
    print(f.seq)
# prints TACGTAA and CGTAACGA
Assembly of a list of DNA fragments into linear or circular constructs.
Accepts a list of Dseqrecords (source fragments) to
initiate an Assembly object. Several methods are available for analysis
of overlapping sequences, graph construction and assembly.
The assembly contains a directed graph, where nodes represent fragments and
edges represent overlaps between fragments:
The node keys are integers, representing the index of the fragment in the
input list of fragments. The sign of the node key represents the orientation
of the fragment, positive for forward orientation, negative for reverse orientation.
The edges contain the locations of the overlaps in the fragments. For an edge (u, v, key):
u and v are the nodes connected by the edge.
key is a string that represents the location of the overlap. In the format:
‘u[start:end](strand):v[start:end](strand)’.
Edges have a ‘locations’ attribute, which is a list of two FeatureLocation objects,
representing the location of the overlap in the u and v fragment, respectively.
You can think of an edge as a representation of the join of two fragments.
If fragment 1 and 2 share a subsequence of 6bp, [8:14] in fragment 1 and [1:7] in fragment 2,
there will be 4 edges representing that overlap in the graph, for all possible
orientations of the fragments (see add_edges_from_match for details):
(1,2,'1[8:14]:2[1:7]')
(2,1,'2[1:7]:1[8:14]')
(-1,-2,'-1[0:6]:-2[10:16]')
(-2,-1,'-2[10:16]:-1[0:6]')
An assembly can be thought of as a tuple of graph edges, but instead of representing them with node indexes and keys, we represent them
as u, v, locu, locv, where u and v are the nodes connected by the edge, and locu and locv are the locations of the overlap in the first
and second fragment. Assemblies are then represented as:
limit (int, optional) – The shortest shared homology to be considered, this is passed as the third argument to the algorithm function.
For certain algorithms, this might be ignored.
algorithm (function, optional) – The algorithm used to determine the shared sequences. It’s a function that takes two Dseqrecord objects as inputs,
and will get passed the third argument (limit), that may or may not be used. It must return a list of overlaps
(see common_sub_strings for an example).
use_fragment_order (bool, optional) – It’s set to True by default to reproduce legacy pydna behaviour: only assemblies that start with the first fragment and end with the last are considered.
You should set it to False.
use_all_fragments (bool, optional) – Constrain the assembly to use all fragments.
Examples
from assembly2 import Assembly, assembly2str
from pydna.dseqrecord import Dseqrecord
Add edges to the graph from a match returned by the algorithm function (see pydna.common_substrings). For
the format of edges, see the documentation of the Assembly class.
Matches are directional, because not all algorithm functions return the same match for (u,v) and (v,u). For example,
homologous recombination does but sticky end ligation does not. The function returns two edges:
Fragments in the orientation they were passed, with locations of the match (u, v, loc_u, loc_v)
Reverse complement of the fragments with inverted order, with flipped locations (-v, -u, flip(loc_v), flip(loc_u))
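The two edges can be sketched with locations reduced to (start, end) tuples. Here flip mirrors a location onto the reverse complement of a fragment of known length. The names are hypothetical and strands are not tracked. Fragment lengths of 14 and 17 bp are assumed because they reproduce the flipped coordinates quoted in the Assembly class description.

```python
def flip(loc, fragment_length):
    """Coordinates of the same region on the reverse complement of the fragment."""
    start, end = loc
    return (fragment_length - end, fragment_length - start)


def edges_from_match(u, v, loc_u, loc_v, len_u, len_v):
    """Return the edge as passed and its reverse-complement counterpart."""
    forward = (u, v, loc_u, loc_v)
    reverse = (-v, -u, flip(loc_v, len_v), flip(loc_u, len_u))
    return [forward, reverse]


for edge in edges_from_match(1, 2, (8, 14), (1, 7), len_u=14, len_v=17):
    print(edge)
```

Flipping twice returns the original location, which is a quick sanity check for this kind of coordinate change.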
Get linear assemblies, applying the constraints described in __init__, ensuring that paths represent
real assemblies (see assembly_is_valid). Subassemblies are removed (see remove_subassemblies).
Convert a node path in the format [1, 2, 3] (as returned by networkx.cycles.simple_cycles) to a list of all
possible assemblies.
There may be multiple assemblies for a given node path, if there are several edges connecting two nodes,
for example two overlaps between 1 and 2, and single overlap between 2 and 3 should return 3 assemblies.
Get the number of possible assemblies from a list of node paths. Basically, for each path
passed as a list of integers / nodes, we calculate the number of paths possible connecting
the nodes in that order, given the graph (all the edges connecting them).
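A small sketch of the idea, assuming the graph is reduced to a dictionary that maps a (u, v) node pair to the list of overlap keys joining those nodes. The dictionary layout and the function name are assumptions for this example. Each step of the path contributes one choice of edge, so the assemblies are the Cartesian product of the per-step choices and their number is the product of the edge counts.

```python
from itertools import product


def assemblies_for_path(path, edges):
    """Enumerate every way of walking path when parallel edges exist between nodes."""
    steps = [
        [(u, v, key) for key in edges.get((u, v), [])]
        for u, v in zip(path, path[1:])
    ]
    return list(product(*steps))


# Hypothetical graph: two overlaps join 1 to 2 and three overlaps join 2 to 3.
edges = {(1, 2): ["a", "b"], (2, 3): ["c", "d", "e"]}
found = assemblies_for_path([1, 2, 3], edges)
print(len(found))
print(found[0])
```

For a cycle, the first node would be appended to the end of the path so that the closing edge is counted as well.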
Sorts the fragment representing a cycle so that they represent an insertion assembly if possible,
else returns None.
Here we check if one of the joins between fragments represents the edges of an insertion assembly
The fragment must be linear, and the join must be as indicated below
The above example will be [(1, 2, [4:6], [0:2]), (2, 3, [6:8], [0:2]), (3, 1, [8:10], [9:11])]
These could be returned in any order by simple_cycles, so we sort the edges so that the first
and last u and v match the fragment that gets the insertion (1 in the example above).
Assemblies that represent the insertion of a fragment or series of fragments inside a linear construct. For instance,
digesting CCCCGAATTCCCCGAATTC with EcoRI and inserting the fragment with two overhangs into the EcoRI site of AAAGAATTCAAA.
This is not so much meant for the use-case of linear fragments that represent actual linear fragments, but for linear
fragments that represent a genome region. This can then be used to simulate homologous recombination.
Get a dictionary where the keys are the nodes in the graph, and the values are dictionaries with keys
left, right, containing (for each fragment) the locations where the fragment is joined to another fragment on its left
and right side. The values in left and right are often the same, except in restriction-ligation with partial overlap enabled,
where we can end up with a situation like this:
GGTCTCCCCAATT and aGGTCTCCAACCAA as fragments
# Partial overlap in assembly 1[9:11]:2[8:10]
GGTCTCCxxAACCAA
CCAGAGGGGTTxxTT
# Partial overlap in 2[10:12]:1[7:9]
aGGTCTCCxxCCAATT
tCCAGAGGTTGGxxAA
Check whether only adjacent edges within each fragment are used in the assembly. This is useful to check if a cut and ligate assembly is valid,
and prevent including partially digested fragments. For example, imagine the following fragment being an input for a digestion
and ligation assembly, where the enzyme cuts at the sites indicated by the vertical lines:
xyz-------|-------|-------|---------
We would only want assemblies that contain subfragments start-x, x-y, y-z, z-end, and not start-x, y-end, for instance.
The latter would indicate that the fragment was partially digested.
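The check can be sketched with integer coordinates, assuming each used subfragment is given as a (left, right) pair of boundary positions and that the boundaries include the fragment start and end along with the cut sites. The function name and the coordinates below are invented for this illustration.

```python
def uses_only_adjacent_cuts(boundaries, used_subfragments):
    """True if every used subfragment runs from one boundary to the very next one."""
    ordered = sorted(boundaries)
    adjacent_pairs = set(zip(ordered, ordered[1:]))
    return all(pair in adjacent_pairs for pair in used_subfragments)


# start, x, y, z, end of a 36 bp fragment (hypothetical positions)
boundaries = [0, 10, 18, 26, 36]
print(uses_only_adjacent_cuts(boundaries, [(0, 10), (10, 18)]))  # start-x, x-y
print(uses_only_adjacent_cuts(boundaries, [(0, 10), (18, 36)]))  # start-x, y-end
```

The second call is rejected because the subfragment from y to the end skips the cut at z, which is what a partial digestion would look like.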
An assembly that represents a PCR, where fragments is a list of primer, template, primer (in that order).
It always uses the primer_template_overlap algorithm and accepts the mismatches argument to indicate
the number of mismatches allowed in the overlap. Only supports substitution mismatches, not indels.
Get linear assemblies, applying the constraints described in __init__, ensuring that paths represent
real assemblies (see assembly_is_valid). Subassemblies are removed (see remove_subassemblies).
Assemblies that represent the insertion of a fragment or series of fragments inside a linear construct. For instance,
digesting CCCCGAATTCCCCGAATTC with EcoRI and inserting the fragment with two overhangs into the EcoRI site of AAAGAATTCAAA.
This is not so much meant for the use-case of linear fragments that represent actual linear fragments, but for linear
fragments that represent a genome region. This can then be used to simulate homologous recombination.
Overrides the parent method to ensure that the 5’ of the crick strand of the product matches the
sequence of the reverse primer. This is important when using primers with dUTP (for USER cloning).
Get linear assemblies, applying the constraints described in __init__, ensuring that paths represent
real assemblies (see assembly_is_valid). Subassemblies are removed (see remove_subassemblies).
In the example below, we plan to assemble a plasmid from a backbone and an insert, using the EcoRI and SalI enzymes.
Note how 2 circular products are returned, one contains the insert (acgt)
and the desired part of the backbone (cccccc), the other contains the
reversed insert (tgga) and the cut-out part of the backbone (aaa).
Returns the products for Golden Gate assembly. This is the same as
restriction ligation assembly, but with a different name. Check the documentation
for restriction_ligation_assembly for more details.
Parameters:
frags (list[Dseqrecord]) – List of DNA fragments to assemble
enzymes (list[AbstractCut]) – List of restriction enzymes to use
allow_blunt (bool, optional) – If True, allow blunt end ligations, by default True
circular_only (bool, optional) – If True, only return circular assemblies, by default False
In the example below, we plan to assemble a plasmid from a backbone and an insert,
using the EcoRI enzyme. The insert and insertion site in the backbone are flanked by
EcoRI sites, so there are two possible products depending on the orientation of the insert.
Returns the products for Gateway assembly / Gateway cloning.
Parameters:
frags (list[Dseqrecord]) – List of DNA fragments to assemble
reaction_type (Literal['BP', 'LR']) – Type of Gateway reaction
greedy (bool, optional) – If True, use greedy gateway consensus sites, by default False
circular_only (bool, optional) – If True, only return circular assemblies, by default False
multi_site_only (bool, optional) – If True, only return products where 2 sites recombined. Even if input sequences
contain multiple att sites (typically 2), a product could be generated where only one
site recombines. That’s typically not what you want, so you can set this to True to
only return products where both att sites recombined.
Now let’s understand the multi_site_only parameter. Let’s consider a case where we are swapping fragments
between two plasmids using an LR reaction. Experimentally, we expect to obtain two plasmids, resulting from the
swapping between the two att sites. That’s what we get if we set multi_site_only to True.
However, if we set multi_site_only to False, we get 4 products, which also include the intermediate products
where the two plasmids are combined into a single one through recombination of a single att site. This is an
intermediate of the reaction, and typically we don’t want it:
Returns the products resulting from the integration of an insert (or inserts joined
through in vivo recombination) into the genome through homologous recombination.
Example of a homologous recombination event, where a plasmid is excised from the
genome (circular sequence of 25 bp), and that part is removed from the genome,
leaving a shorter linear sequence (32 bp).
Returns the products resulting from the integration of an insert (or inserts joined
through cre-lox recombination among them) into the genome through cre-lox integration.
Also works with lox66 and lox71 (see pydna.cre_lox for more details).
Below an example with lox66 and lox71 (irreversible integration).
Here, the result of excision is still returned because there is a low
probability of it happening, but it’s considered a rare event.
Finds the flanking common substrings between stringx and stringy
longer than limit. This means that the results only contain substrings
that start or end at the ends of stringx and stringy.
This function is case sensitive.
Returns a list of tuples describing the substrings.
The list is sorted longest -> shortest.
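One way to picture such flanking overlaps is a naive search for a suffix of one string that equals a prefix of the other. The sketch below is not the library algorithm: the name terminal_overlaps, the (start in x, start in y, length) tuple layout and the reading of the threshold as "at least limit" are all assumptions made for the example.

```python
def terminal_overlaps(x: str, y: str, limit: int = 15) -> list:
    """Find overlaps where the end of one string equals the start of the other."""
    hits = []
    # Trying the longest length first keeps the result sorted longest -> shortest.
    for n in range(min(len(x), len(y)), limit - 1, -1):
        if x.endswith(y[:n]):
            hits.append((len(x) - n, 0, n))  # end of x matches start of y
        if y.endswith(x[:n]):
            hits.append((0, len(y) - n, n))  # end of y matches start of x
    # Drop duplicates (possible when x == y) while keeping the order.
    return list(dict.fromkeys(hits))


print(terminal_overlaps("AAACCCGGG", "CCGGGTTT", limit=3))
print(terminal_overlaps("AAACCCGGG", "ccgggTTT", limit=3))  # case sensitive: no hit
```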
This module contains functions for primer design for various purposes.
primer_design: designs primers for a sequence, or a matching primer for an existing primer. Returns an Amplicon object (same as the amplify module returns).
assembly_fragments: adds tails to primers for a linear assembly through homologous recombination or Gibson assembly.
circular_assembly_fragments: adds tails to primers for a circular assembly through homologous recombination or Gibson assembly.
This function designs a forward primer and a reverse primer for PCR amplification
of a given template sequence.
The template argument is a Dseqrecord object or equivalent containing the template sequence.
The optional fp and rp arguments can contain an existing primer for the sequence (either the forward or reverse primer).
Either one of the primers can be specified, but not both (since then there is nothing to design; use the pydna.amplify.pcr function instead).
The limit argument is the minimum length of the primer. The default value is 13.
If one of the primers is given, the other primer is designed to match in terms of Tm.
If both primers are designed, they will be designed to match target_tm.
tm_func is a function that takes an ASCII string representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but these can be substituted by a custom-made function.
estimate_function is a tm_func-like function that is used to get a first guess for the primer design, that is then used as starting
point for the final result. This is useful when the tm_func function is slow to calculate (e.g. it relies on an
external API, such as the NEB primer design API). The estimate_function should be faster than the tm_func function.
The default value is None.
To use the default tm_func as estimate function to get the NEB Tm faster, you can do:
primer_design(dseqr, target_tm=55, tm_func=tm_neb, estimate_function=tm_default).
The function returns a pydna.amplicon.Amplicon class instance. This object has
the object.forward_primer and object.reverse_primer properties which contain the designed primers.
fp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.
rp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.
target_tm (float, optional) – target tm for the primers, set to 55°C by default.
tm_func (function) – Function used for tm calculation. This function takes an ASCII string
representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but these can be
substituted by a custom-made function.
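The interplay of limit, target_tm and tm_func can be illustrated with a deliberately simple sketch. It is not the algorithm used by primer_design: pick_forward_primer is a hypothetical helper that only looks at prefixes of the template, and wallace_tm applies the Wallace rule (2 °C per A or T, 4 °C per G or C) as a stand-in for the functions in pydna.tm.

```python
def wallace_tm(primer: str) -> float:
    """Wallace rule estimate of the melting temperature, in degrees Celsius."""
    p = primer.upper()
    return 2.0 * (p.count("A") + p.count("T")) + 4.0 * (p.count("G") + p.count("C"))


def pick_forward_primer(template: str, target_tm: float = 55.0,
                        limit: int = 13, tm_func=wallace_tm) -> str:
    """Choose the template prefix of at least limit bases whose Tm is closest to target_tm."""
    if len(template) < limit:
        raise ValueError("template is shorter than the minimum primer length")
    candidates = [template[:n] for n in range(limit, len(template) + 1)]
    return min(candidates, key=lambda primer: abs(tm_func(primer) - target_tm))


primer = pick_forward_primer("ATGC" * 10, target_tm=55.0, limit=13)
print(primer, wallace_tm(primer))
```

Any callable that maps a string to a float can be passed as tm_func, which is the same contract the parameter description above states.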
This function returns a list of pydna.amplicon.Amplicon objects where
primers have been modified with tails so that the fragments can be fused in
the order they appear in the list by for example Gibson assembly or homologous
recombination.
we can modify the reverse primer of a and forward primer of b with tails to allow
fusion by fusion PCR, Gibson assembly or in-vivo homologous recombination.
The basic requirements for the primers for the three techniques are the same.
The first argument of this function is a list of sequence objects containing
Amplicons and other similar objects.
At least every second sequence object needs to be an Amplicon.
This rule exists because if a sequence object that is not a PCR product
is to be fused with another fragment, that other fragment needs to be an Amplicon
so that the primer of the other object can be modified to include the whole stretch
of sequence homology needed for the fusion. See the example below where a is a
non-amplicon (a linear plasmid vector for instance).
The overlap argument controls how many base pairs of overlap are required between
adjacent sequence fragments. In the junction between Amplicons, tails with a
length of about half of this value are added to the two primers
closest to the junction.
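The effect of splitting the overlap between two primers can be sketched on the level of the PCR products, using plain strings. The function add_fusion_tails is hypothetical and works on top-strand sequences only. The default of 35 is an arbitrary value for the sketch. Each product is extended by roughly half the overlap borrowed from its neighbour, so that together they share overlap base pairs.

```python
def add_fusion_tails(a: str, b: str, overlap: int = 35):
    """Extend two adjacent amplicon sequences so that they share overlap base pairs."""
    from_b = overlap // 2       # bases of b added via the reverse primer of a
    from_a = overlap - from_b   # bases of a added via the forward primer of b
    a_tailed = a + b[:from_b]
    b_tailed = a[len(a) - from_a:] + b
    return a_tailed, b_tailed


a = "AAAAAAAAAACCCCC"
b = "GGGGGTTTTTTTTTT"
a_tailed, b_tailed = add_fusion_tails(a, b, overlap=6)
print(a_tailed)
print(b_tailed)
```

Here the shared stretch is CCCGGG: it ends the first product and starts the second, which is the homology a Gibson assembly or an in vivo recombination would act on.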
In the case of an Amplicon adjacent to a Dseqrecord object, the tail will
be twice as long (1*overlap) since the
recombining sequence is present entirely on this primer:
Note that if the sequence of DNA fragments starts or stops with an Amplicon,
the very first and very last primer will not be modified, i.e. assemblies are
always assumed to be linear. There are simple tricks around that for circular
assemblies, depicted in the last two examples below.
The maxlink argument controls the cut-off length for sequences that will be
synthesized by adding them to primers for the adjacent fragment(s). The
argument list may contain short spacers (such as spacers between fusion proteins).
Example 1: Linear assembly of PCR products (pydna.amplicon.Amplicon class objects)
> < > <
Amplicon1 Amplicon3
Amplicon2 Amplicon4
> < > <
⇣
pydna.design.assembly_fragments
⇣
> <- -> <- pydna.assembly.Assembly
Amplicon1 Amplicon3
Amplicon2 Amplicon4 ➤ Amplicon1Amplicon2Amplicon3Amplicon4
-> <- -> <
Example 2: Linear assembly of alternating Amplicons and other fragments
> < > <
Amplicon1 Amplicon2
Dseqrecd1 Dseqrecd2
⇣
pydna.design.assembly_fragments
⇣
> <-- --> <-- pydna.assembly.Assembly
Amplicon1 Amplicon2
Dseqrecd1 Dseqrecd2 ➤ Amplicon1Dseqrecd1Amplicon2Dseqrecd2
Example 3: Linear assembly of alternating Amplicons and other fragments
Dseqrecd1 Dseqrecd2
Amplicon1 Amplicon2
> < --> <
⇣
pydna.design.assembly_fragments
⇣
pydna.assembly.Assembly
Dseqrecd1 Dseqrecd2
Amplicon1 Amplicon2 ➤ Dseqrecd1Amplicon1Dseqrecd2Amplicon2
--> <-- --> <
Example 4: Circular assembly of alternating Amplicons and other fragments
-> <==
Dseqrecd1 Amplicon2
Amplicon1 Dseqrecd1
--> <-
⇣
pydna.design.assembly_fragments
⇣
pydna.assembly.Assembly
-> <==
Dseqrecd1 Amplicon2 -Dseqrecd1Amplicon1Amplicon2-
Amplicon1 ➤ | |
--> <- -----------------------------
Example 5: Circular assembly of Amplicons
> < > <
Amplicon1 Amplicon3
Amplicon2 Amplicon1
> < > <
⇣
pydna.design.assembly_fragments
⇣
> <= -> <-
Amplicon1 Amplicon3
Amplicon2 Amplicon1
-> <- +> <
⇣
make new Amplicon using the Amplicon1.template and
the last fwd primer and the first rev primer.
⇣
pydna.assembly.Assembly
+> <= -> <-
Amplicon1 Amplicon3 -Amplicon1Amplicon2Amplicon3-
Amplicon2 ➤ | |
-> <- -----------------------------
Parameters:
f (list of pydna.amplicon.Amplicon and other Dseqrecord like objects) – list Amplicon and Dseqrecord object for which fusion primers should be constructed.
overlap (int, optional) – Length of required overlap between fragments.
maxlink (int, optional) – Maximum length of spacer sequences that may be present in f. These will be included in tails for designed primers.
circular (bool, optional) – If True, the assembly is circular. If False, the assembly is linear.
>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.design import primer_design
>>> a=primer_design(Dseqrecord("atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"))
>>> b=primer_design(Dseqrecord("ccaaacccaccaggtaccttatgtaagtacttcaagtcgccagaagacttcttggtcaagttgcc"))
>>> c=primer_design(Dseqrecord("tgtactggtgctgaaccttgtatcaagttgggtgttgacgccattgccccaggtggtcgtttcgtt"))
>>> from pydna.design import assembly_fragments
>>> # We would like a circular recombination, so the first sequence has to be repeated
>>> fa1,fb,fc,fa2=assembly_fragments([a,b,c,a])
>>> # Since all fragments are Amplicons, we need to extract the rp of the 1st and fp of the last fragments.
>>> from pydna.amplify import pcr
>>> fa=pcr(fa2.forward_primer,fa1.reverse_primer,a)
>>> [fa,fb,fc]
[Amplicon(100), Amplicon(101), Amplicon(102)]
>>> fa.name,fb.name,fc.name="fa fb fc".split()
>>> from pydna.assembly import Assembly
>>> assemblyobj=Assembly([fa,fb,fc])
>>> assemblyobj
Assembly
fragments....: 100bp 101bp 102bp
limit(bp)....: 25
G.nodes......: 6
algorithm....: common_sub_strings
>>> assemblyobj.assemble_linear()
[Contig(-231), Contig(-166), Contig(-36)]
>>> assemblyobj.assemble_circular()[0].seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> (a+b+c).looped().seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> print(assemblyobj.assemble_circular()[0].figure())
 -|fa|36
|     \/
|     /\
|     36|fb|36
|           \/
|           /\
|           36|fc|36
|                 \/
|                 /\
|                 36-
|                   |
 -------------------
>>>
Dseq describes a double stranded DNA fragment, linear or circular.
Dseq can be initiated in two ways: using two strings, representing the
Watson (upper, sense) strand and the Crick (lower, antisense) strand, and an
optional value describing the stagger between the strands on the left side (ovhg).
Alternatively, a single string representation using dsIUPAC codes can be used.
If a single string is used, the letters of that string are interpreted as base
pairs rather than single bases. For example “A” would indicate the basepair
“A/T”. An expanded IUPAC code is used where the letters PEXI have been assigned
to GATC on the Watson strand with no pairing base on the Crick strand: G/””, A/””,
T/”” and C/””. The letters QFZJ have been assigned the opposite base pairs with
an empty Watson strand: “”/G, “”/A, “”/T, and “”/C.
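The letter assignments above can be illustrated with a small stand-alone sketch. It does not use pydna; the helper name expand_dscode and the two-row output are assumptions made for illustration only, and pydna's own parser may treat case and other symbols differently.

```python
# Hypothetical helper, not part of pydna: expand a dsIUPAC-style string
# into aligned Watson and Crick rows (Crick drawn 3'->5' under Watson).
COMPLEMENT = {"G": "C", "A": "T", "T": "A", "C": "G"}
WATSON_ONLY = {"P": "G", "E": "A", "X": "T", "I": "C"}  # no pairing base on Crick
CRICK_ONLY = {"Q": "G", "F": "A", "Z": "T", "J": "C"}   # empty Watson strand


def expand_dscode(code: str) -> tuple[str, str]:
    """Return (watson_row, crick_row); a space marks a missing base."""
    watson, crick = [], []
    for letter in code.upper():
        if letter in COMPLEMENT:        # ordinary base pair, e.g. "A" means A/T
            watson.append(letter)
            crick.append(COMPLEMENT[letter])
        elif letter in WATSON_ONLY:     # base on the Watson strand only
            watson.append(WATSON_ONLY[letter])
            crick.append(" ")
        elif letter in CRICK_ONLY:      # base on the Crick strand only
            watson.append(" ")
            crick.append(CRICK_ONLY[letter])
        else:
            raise ValueError(f"unsupported letter: {letter!r}")
    return "".join(watson), "".join(crick)


top, bottom = expand_dscode("PEXIGATCQFZJ")
print(top)
print(bottom)
```

The single string "PEXIGATCQFZJ" thus describes a fragment with a four-nucleotide single-stranded Watson stretch on the left, four base pairs in the middle and a four-nucleotide single-stranded Crick stretch on the right.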
watson (str) – a string representing the Watson (sense) DNA strand or a basepair
representation.
crick (str, optional) – a string representing the Crick (antisense) DNA strand.
ovhg (int, optional) – A positive or negative number to describe the stagger between the
Watson and Crick strands.
see below for a detailed explanation.
circular (bool, optional) – True indicates that sequence is circular, False that it is linear.
Examples
Dseq is a subclass of the Biopython Bio.Seq.Seq class. The constructor
can accept two strings representing the Watson (sense) and Crick(antisense)
DNA strands. These are interpreted as single stranded DNA. There is a check
for complementarity between the strands.
If the DNA molecule is staggered on the left side, an integer ovhg
(overhang) must be given, describing the stagger between the Watson and Crick strand
in the 5’ end of the fragment.
Additionally, the optional boolean parameter circular can be given to indicate if the
DNA molecule is circular.
The most common usage of the Dseq class is probably not to use it directly, but to
create it as part of a Dseqrecord object (see pydna.dseqrecord.Dseqrecord).
This works in the same way as for the relationship between the Bio.Seq.Seq and
Bio.SeqRecord.SeqRecord classes in Biopython.
There are multiple ways of creating a Dseq object directly listed below, but you can also
use the function Dseq.from_full_sequence_and_overhangs() to create a Dseq:
Two arguments (string, string), no overhang provided:
If Watson and Crick are given, but not ovhg, an attempt will be made to find the best annealing
between the strands. There are important limitations to this. If there are several ways to
anneal the strands, this will fail. For long fragments it is quite slow.
Three arguments (string, string, ovhg=int):
The ovhg parameter is an integer describing the length of the Crick strand overhang on the
left side (the 5’ end of Watson strand).
The ovhg parameter controls the stagger at the five prime end:
If the ovhg parameter is specified a Crick strand also needs to be supplied, or
an exception is raised.
>>> Dseq(watson="agt",ovhg=2)
Traceback (most recent call last):
...
ValueError: ovhg (overhang) defined without a crick strand.
The shape or topology of the fragment is set by the circular parameter, True or False (default).
>>> Dseq("aaa","ttt",ovhg=0) # A linear sequence by default
Dseq(-3)
aaa
ttt
>>> Dseq("aaa","ttt",ovhg=0,circular=False) # A linear sequence if circular is False
Dseq(-3)
aaa
ttt
>>> Dseq("aaa","ttt",ovhg=0,circular=True) # A circular sequence
Dseq(o3)
aaa
ttt
>>> Dseq("aaa","ttt",ovhg=1,circular=False)
Dseq(-4)
 aaa
ttt
>>> Dseq("aaa","ttt",ovhg=-1)
Dseq(-4)
aaa
 ttt
>>> Dseq("aaa","ttt",circular=True,ovhg=0)
Dseq(o3)
aaa
ttt
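How ovhg shifts the two strands relative to each other, and how the overall length follows from it, can be sketched without pydna. The function render_duplex below is hypothetical; it only mimics the two-row drawing and assumes that the Crick strand is given 5'->3' and drawn reversed.

```python
def render_duplex(watson: str, crick: str, ovhg: int = 0) -> str:
    """Draw a staggered duplex; crick is given 5'->3' and drawn 3'->5'."""
    # A positive ovhg pushes the Watson strand to the right,
    # a negative ovhg pushes the Crick strand to the right.
    watson_offset = max(ovhg, 0)
    crick_offset = max(-ovhg, 0)
    length = max(watson_offset + len(watson), crick_offset + len(crick))
    top = " " * watson_offset + watson
    bottom = " " * crick_offset + crick[::-1]
    return f"length={length}\n{top}\n{bottom}"


print(render_duplex("aaa", "ttt", ovhg=1))
print(render_duplex("aaa", "ttt", ovhg=-1))
```

With ovhg=1 the Crick strand protrudes by one nucleotide on the left, with ovhg=-1 the Watson strand does; in both cases the fragment spans four positions although each strand is only three nucleotides long.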
The molecular weight of the DNA/RNA molecule in g/mol.
The molecular weight data in Biopython Bio.Data.IUPACData
is used. The DNA is assumed to have a 5’-phosphate as many
DNA fragments from restriction digestion do:
 P-G-A-T-T-A-C-A-OH
   | | | | | | |
OH-C-T-A-A-T-G-T-P
The molecular weights listed in the unambiguous_dna_weights
dictionary refer to free monophosphate nucleotides.
One water molecule is removed for every phosphodiester bond
formed between nucleotides. For linear molecules, the weight
of one water molecule is added to account for the terminal
hydroxyl group and a hydrogen on the 5’ terminal phosphate
group.
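The bookkeeping described above can be written out as a short stand-alone sketch. The weights are hard-coded here and are assumed to correspond to the unambiguous_dna_weights table in Biopython; verify them against your installed version before relying on the numbers. The function names are hypothetical.

```python
# Average masses (g/mol) of free deoxynucleoside monophosphates.
# Assumption: these match Bio.Data.IUPACData.unambiguous_dna_weights.
MONOPHOSPHATE_WEIGHTS = {"A": 331.2218, "C": 307.1971, "G": 347.2212, "T": 322.2085}
WATER = 18.0153  # g/mol, lost for every phosphodiester bond formed


def strand_weight(strand: str, circular: bool = False) -> float:
    """Weight of one strand carrying a 5'-phosphate."""
    if not strand:
        return 0.0
    total = sum(MONOPHOSPHATE_WEIGHTS[base] for base in strand.upper())
    # A linear strand of n nucleotides has n - 1 bonds, a circular one has n.
    bonds = len(strand) if circular else len(strand) - 1
    return total - bonds * WATER


def duplex_weight(watson: str, crick: str, circular: bool = False) -> float:
    """Sum of both strands of a double stranded molecule."""
    return strand_weight(watson, circular) + strand_weight(crick, circular)


print(round(duplex_weight("GATTACA", "TGTAATC"), 1))
```

A linear strand is therefore exactly one water molecule heavier than the circular strand with the same sequence.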
>>> ds=Dseq("TAAG",circular=True)
>>> ds.shifted(1) # First bp moved to right side:
Dseq(o4)
AAGT
TTCA
>>> ds.shifted(-1) # Last bp moved to left side:
Dseq(o4)
GTAA
CATT
Returns a 2-tuple of strings describing the structure of the 5’ end of
the DNA fragment.
The tuple contains (type, sticky) where type is either “5’” or “3’”.
sticky is always in lower case and contains the sequence of the
protruding end in 5’-3’ direction.
Fill in of five prime protruding end with a DNA polymerase
that has only DNA polymerase activity (such as Exo-Klenow [12]).
Exo-Klenow is a modified version of the Klenow fragment of E.
coli DNA polymerase I, which has been engineered to lack both
3’-5’ proofreading and 5’-3’ exonuclease activities.
and any combination of A, G, C or T. Default are all four
nucleotides together.
Simulates treatment with a nuclease with both 5’-3’ and 3’-5’ single
strand specific exonuclease activity (such as mung bean nuclease [14]).
Mung bean nuclease is a nuclease enzyme derived from mung bean sprouts
that preferentially degrades single-stranded DNA and RNA into
5’-phosphate- and 3’-hydroxyl-containing nucleotides.
Treatment results in blunt DNA, regardless of whether the protruding end
is 5’ or 3’.
Fill in 5’ protruding ends and nibble 3’ protruding ends.
This is done using a DNA polymerase providing 3’-5’ nuclease activity
such as T4 DNA polymerase. This can be done in presence of any
combination of the four nucleotides A, G, C or T.
T4 DNA polymerase is widely used to “polish” DNA ends because of its
strong 3′-5′ exonuclease activity. In the absence of dNTPs, it chews
back 3′ overhangs to create blunt ends; in the presence of limiting
dNTPs, it can fill in 5′ overhangs; and by carefully controlling
reaction time, temperature, and nucleotide supply, you can generate
defined recessed or blunt termini.
Tuning the nucleotide set can facilitate engineering of partial
sticky ends. Default are all four nucleotides together.
      aaagatc-3        aaa                3' ends are always removed.
      |||      --->    |||                A and T needed or the molecule will
3-ctagttt              ttt                degrade completely.

5-gatcaaa              gatcaaaGATC        5' ends are filled in the
      |||      --->    |||||||||||        presence of GATC
      tttctag-5        CTAGtttctag

5-gatcaaa              gatcaaaGAT         5' ends are partially filled in the
      |||      --->     |||||||||         presence of GAT to produce a 1 nt
      tttctag-5         TAGtttctag        5' overhang

5-gatcaaa              gatcaaaGA          5' ends are partially filled in the
      |||      --->      |||||||          presence of GA to produce a 2 nt
      tttctag-5          AGtttctag        5' overhang

5-gatcaaa              gatcaaaG           5' ends are partially filled in the
      |||      --->       |||||           presence of G to produce a 3 nt
      tttctag-5           Gtttctag        5' overhang
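The partial fill-in of a 5' overhang with a restricted nucleotide set can be sketched as follows. This is a simplified stand-alone model, not pydna's implementation: the hypothetical function fill_in looks at a single 5' protruding strand and ignores the 3'-5' exonuclease activity entirely.

```python
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}


def fill_in(overhang: str, nucleotides: str = "GATC") -> tuple[str, int]:
    """Extend the recessed 3' end along a 5' overhang (given 5'->3').

    Returns the added bases (5'->3') and the length of the 5' overhang
    that remains once the polymerase runs out of usable nucleotides.
    """
    available = set(nucleotides.lower())
    added = []
    # Synthesis starts at the template base next to the duplex, which is
    # the 3' end of the overhang, and proceeds toward its 5' end.
    for template_base in reversed(overhang.lower()):
        new_base = COMPLEMENT[template_base]
        if new_base not in available:
            break  # the polymerase stalls at the first missing nucleotide
        added.append(new_base)
    return "".join(added), len(overhang) - len(added)


for supplied in ("GATC", "GAT", "GA", "G"):
    print(supplied, fill_in("gatc", supplied))
```

For the gatc overhang, leaving out one, two or three of the required nucleotides leaves a 5' overhang of one, two or three nucleotides, respectively.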
5’ => 3’ resection at the left side (start) of the molecule.
The argument n indicates the number of nucleotides that are to be
removed. The outcome of this depends on the structure of the molecule.
See the two examples below:
The figure below indicates a recess of length two from a blunt DNA
fragment. The resulting DNA fragment has a 3’ protruding single strand.
gatc         tc
||||  -->    ||
ctag       ctag
The figure below indicates a recess of length two from a DNA fragment
with a 5’ sticky end resulting in a blunt sequence.
5’ => 3’ resection at the right side (end) of the molecule.
The argument n indicates the number of nucleotides that are to be
removed. The outcome of this depends on the structure of the molecule.
See the two examples below:
The figure below indicates a recess of length two from a blunt DNA
fragment. The resulting DNA fragment has a 3’ protruding single strand.
gatc       gatc
||||  -->  ||
ctag       ct
The figure below indicates a recess of length two from a DNA fragment
with a 5’ sticky end resulting in a blunt sequence.
3’ => 5’ resection at the left side (beginning) of the molecule.
The argument n indicates the number of nucleotides that are to be
removed. The outcome of this depends on the structure of the molecule.
See the two examples below:
The figure below indicates a recess of length two from a blunt DNA
fragment. The resulting DNA fragment has a 5’ protruding single strand.
gatc       gatc
||||  -->    ||
ctag         ag
The figure below indicates a recess of length two from a DNA fragment
with a 3’ sticky end resulting in a blunt sequence.
3’ => 5’ resection at the right side (end) of the molecule.
The argument n indicates the number of nucleotides that are to be
removed. The outcome of this depends on the structure of the molecule.
See the two examples below:
The figure below indicates a recess of length two from a blunt DNA
fragment. The resulting DNA fragment has a 5’ protruding single strand.
gatc       ga
||||  -->  ||
ctag       ctag
The figure below indicates a recess of length two from a DNA fragment
with a 3’ sticky end resulting in a blunt sequence.
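The four resection cases (5'->3' or 3'->5', left or right side) differ only in which strand end loses nucleotides. The sketch below works on the two-row drawing used in the figures, where the upper row is the Watson strand (5' end on the left) and the lower row is the Crick strand (5' end on the right). The function resect and its arguments are hypothetical and not part of pydna.

```python
def _blank_left(row: str, n: int) -> str:
    """Replace the n leftmost bases of a row with spaces."""
    start = len(row) - len(row.lstrip(" "))
    n = min(n, len(row.strip(" ")))
    return " " * (start + n) + row[start + n:]


def _blank_right(row: str, n: int) -> str:
    """Replace the n rightmost bases of a row with spaces."""
    end = len(row.rstrip(" "))
    n = min(n, len(row.strip(" ")))
    return row[: end - n] + " " * (len(row) - end + n)


def resect(top: str, bottom: str, n: int, direction: str, side: str) -> tuple[str, str]:
    """Remove n nucleotides from one strand end of an aligned two-row duplex."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if direction == "5->3":
        # 5' ends: Watson on the left, Crick on the right.
        if side == "left":
            return _blank_left(top, n), bottom
        return top, _blank_right(bottom, n)
    if direction == "3->5":
        # 3' ends: Crick on the left, Watson on the right.
        if side == "left":
            return top, _blank_left(bottom, n)
        return _blank_right(top, n), bottom
    raise ValueError("direction must be '5->3' or '3->5'")


for direction in ("5->3", "3->5"):
    for side in ("left", "right"):
        print(direction, side, resect("gatc", "ctag", 2, direction, side))
```

Applied to the blunt gatc/ctag fragment with n=2, the four calls give the same strands as the four blunt-end figures in this section.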
Terminal deoxynucleotidyl transferase (TdT) is a template-independent
DNA polymerase that adds nucleotides to the 3′-OH ends of DNA, typically
single-stranded or recessed 3′ ends. In cloning, it’s classically used
to create homopolymer tails (e.g. poly-dG on a vector and poly-dC on an insert)
so that fragments can anneal via complementary overhangs (“tailing” cloning).
This activity is also present in some DNA polymerases, such as Taq polymerase.
This property is used in the popular T/A cloning protocol ([16]).
USER Enzyme is a mixture of Uracil DNA glycosylase (UDG) and the
DNA glycosylase-lyase Endonuclease VIII.
UDG catalyses the excision of a uracil base, forming an abasic
or apyrimidinic site (AP site). Endonuclease VIII removes the AP
site creating a DNA gap.
((cut_watson, ovhg), enz), for example ((396, -4), EcoRI)
The cut_watson (positive integer) is the cut position of the sequence as for example
returned by the Bio.Restriction module.
The ovhg (overhang, positive or negative integer or 0) has the same meaning as
for restriction enzymes in the Bio.Restriction module and for
pydna.dseq.Dseq objects (see docstring for this module and example below)
Returns a list of cutsites, represented as ((cut_watson, ovhg), enz):
cut_watson is a positive integer contained in [0,len(seq)), where seq is the sequence
that will be cut. It represents the position of the cut on the watson strand, using the full
sequence as a reference. By “full sequence” I mean the one you would get from str(Dseq).
ovhg is the overhang left after the cut. It has the same meaning as ovhg in
the Bio.Restriction enzyme objects, or pydna’s Dseq property.
enz is the enzyme object. It’s not necessary to perform the cut, but can be
used to keep track of which enzyme was used.
Cuts are only returned if the recognition site and overhang are on the double-strand
part of the sequence.
Two lists of 2-tuples of integers are returned. Each tuple
(((from, to))) contains the start and end positions of a single
stranded region, shorter or equal to length.
In the example below, the middle 2 nt part is released from the
molecule.
DNA molecules can fall apart by melting if they have internal single
stranded regions. In the example below, the molecule has two gaps
on opposite sides, two nucleotides apart, which means that it hangs
together by two basepairs.
This molecule can melt into two separate 8 bp double stranded
molecules, each with 3 nt 3’ overhangs as depicted below.
A list of 2-tuples is returned. Each tuple (((cut_watson, ovhg), None))
contains cut position and the overhang value in the same format as
returned by the get_cutsites method for restriction enzymes.
Note that this function deals with melting that results in two double
stranded DNA molecules.
See get_ss_meltsites for melting of single stranded regions from
molecules.
For a given cut expressed as ((cut_watson, ovhg), enz), returns
a tuple (cut_watson, cut_crick, ovhg).
cut_watson: see get_cutsites docs
cut_crick: equivalent of cut_watson in the crick strand
ovhg: see get_cutsites docs
The cut can be None if it represents the left or right end of the sequence.
Then it will return the position of the watson and crick ends with respect
to the “full sequence”. The is_left parameter is only used in this case.
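The relation between the three values can be sketched as below. The formula cut_crick = cut_watson - ovhg is an assumption based on the Bio.Restriction convention, where a 5' overhang enzyme such as EcoRI (G^AATTC) has ovhg = -4; the function name and the handling of circular molecules are illustrative only, and the case where the cut is None is not covered.

```python
def cut_parameters(cutsite, seq_len: int, circular: bool = False):
    """Return (cut_watson, cut_crick, ovhg) for ((cut_watson, ovhg), enz)."""
    (cut_watson, ovhg), _enzyme = cutsite
    # Assumption: the Crick cut lies ovhg positions to the left of the
    # Watson cut (to the right when ovhg is negative, as for EcoRI).
    cut_crick = cut_watson - ovhg
    if circular:
        cut_crick %= seq_len  # wrap around the origin of a circular molecule
    return cut_watson, cut_crick, ovhg


# The enzyme is only carried along, so a plain string is enough here.
print(cut_parameters(((396, -4), "EcoRI"), seq_len=1000))
```

For the ((396, -4), EcoRI) example mentioned earlier, this places the Crick cut four positions to the right of the Watson cut, which corresponds to a four-nucleotide 5' overhang.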
Single stranded DNA molecules shorter or equal to length shed from
a double stranded DNA molecule without affecting the length of the
remaining molecule.
In the examples below, the middle 2 nt part is released from the
molecule.
Returns pairs of cutsites that render the edges of the resulting fragments.
A fragment produced by restriction is represented by a tuple of length 2 that
may contain cutsites or None:
Two cutsites: represents the extraction of a fragment between those two
cutsites, in that orientation. To represent the opening of a circular
molecule with a single cutsite, we put the same cutsite twice.
None, cutsite: represents the extraction of a fragment between the left
edge of linear sequence and the cutsite.
cutsite, None: represents the extraction of a fragment between the cutsite
and the right edge of a linear sequence.
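The pairing rules listed above can be expressed in a few lines. This is a sketch under the assumption that the cutsites are already sorted by position; the function name cutsite_pairs is hypothetical.

```python
def cutsite_pairs(cutsites: list, circular: bool) -> list:
    """Pair up neighbouring cutsites so that each pair delimits one fragment."""
    if not cutsites:
        return []
    if circular:
        # Close the circle: the last fragment runs from the last cutsite
        # back to the first one. A single cutsite is paired with itself.
        edges = [*cutsites, cutsites[0]]
    else:
        # None stands for the left and right edges of a linear sequence.
        edges = [None, *cutsites, None]
    return list(zip(edges, edges[1:]))


a = ((3, -4), "EcoRI")
b = ((11, -4), "BamHI")
print(cutsite_pairs([a, b], circular=False))
print(cutsite_pairs([a], circular=True))
```

Two cuts in a linear molecule therefore give three fragments, while two cuts in a circular molecule give two.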
This module provides the Dseqrecord class, for handling double stranded
DNA sequences. The Dseqrecord holds sequence information in the form of a pydna.dseq.Dseq
object. The Dseq and Dseqrecord classes are subclasses of Biopython’s
Seq and SeqRecord classes, respectively.
The Dseq and Dseqrecord classes support the notion of circular and linear DNA topology.
Dseqrecord is a double stranded version of the Biopython SeqRecord [17] class.
The Dseqrecord object holds a Dseq object describing the sequence.
Additionally, Dseqrecord holds meta information about the sequence in the
form of a list of SeqFeatures, in the same way as the SeqRecord does.
The Dseqrecord can be initialized with a string, Seq, Dseq, SeqRecord
or another Dseqrecord. The sequence information will be stored in a
Dseq object in all cases.
Dseqrecord objects can be read or parsed from sequences in FASTA, EMBL or Genbank formats.
See the pydna.readers and pydna.parsers modules for further information.
There is a short representation associated with the Dseqrecord.
Dseqrecord(-3) represents a linear sequence of length 3
while Dseqrecord(o7)
represents a circular sequence of length 7.
Dseqrecord and Dseq share the same concept of length. This length can be larger
than each strand alone if they are staggered as in the example below.
<-- length -->
GATCCTTT
     AAAGCCTAG
Parameters:
record (string, Seq, SeqRecord, Dseq or other Dseqrecord object) – This data will be used to form the seq property
circular (bool, optional) – True or False reflecting the shape of the DNA molecule
linear (bool, optional) – True or False reflecting the shape of the DNA molecule
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL.
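The difference between the two encodings can be shown with the standard library alone. The sketch assumes the classic SEGUID definition (SHA-1 digest of the upper-case sequence, Base64 encoded, padding removed); the function name is hypothetical and pydna's own checksum functions may add prefixes or normalise the sequence differently.

```python
import base64
import hashlib


def sha1_checksums(sequence: str) -> tuple[str, str]:
    """Return the (standard, URL-safe) Base64 forms of the SHA-1 digest."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    standard = base64.b64encode(digest).decode("ascii").rstrip("=")
    urlsafe = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return standard, urlsafe


standard, urlsafe = sha1_checksums("atgttcctacatga")
print(standard)
print(urlsafe)
```

Both strings encode the same digest; the URL-safe form only swaps + for - and / for _.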
Returns the sequence as a string using a format supported by Biopython
SeqIO [18]. Default is “gb” which is short for Genbank.
Allowed Formats are for example:
“fasta”: The standard FASTA format.
“fasta-2line”: No line wrapping and exactly two lines per record.
“genbank” (or “gb”): The GenBank flat file format.
“embl”: The EMBL flat file format.
“imgt”: The IMGT variant of the EMBL format.
The format string can be modified with the keyword “dscode” if
the underlying dscode string is desired in the output. For example:
Writes the Dseqrecord to a file using the format f, which must
be a format supported by Biopython SeqIO for writing [19]. Default
is “gb” which is short for Genbank. Note that Biopython SeqIO reads
more formats than it writes.
Filename is the path to the file where the sequence is to be
written. The filename is optional, if it is not given, the
description property (string) is used together with the format.
If obj is the Dseqrecord object, the default file name will be:
<obj.locus>.<f>
Where <f> is “gb” by default. If the filename already exists
AND the sequence it contains is different, a new file name will be
used so that the old file is not lost:
This method returns a new circular sequence (Dseqrecord object), which has been rotated
in such a way that there is maximum overlap between the sequence and
ref, which may be a string, Biopython Seq, SeqRecord object or
another Dseqrecord object.
The reason for using this could be to rotate a new recombinant plasmid so
that it starts at the same position after cloning. See the example below:
Digest a Dseqrecord object with one or more restriction enzymes.
Returns a list of linear Dseqrecords. If there are no cuts, an empty
list is returned.
See also Dseq.cut()
Parameters:
enzymes (enzyme object or iterable of such objects) – A Bio.Restriction.XXX restriction object or iterable of such.
Returns:
Dseqrecord_frags – list of Dseqrecord objects formed by the digestion
This module provides a class for downloading sequences from genbank
called Genbank and a function that does the same thing called genbank.
The function can be used if the environmental variable pydna_email has
been set to a valid email address. The easiest way to do this permanently is to edit the
pydna.ini file. See the documentation of pydna.open_config_folder()
This method downloads a genbank nucleotide record from genbank. This method is
cached by default. This can be controlled by editing the pydna_cached_funcs environment
variable. The best way to do this permanently is to edit the
pydna.ini file. See the documentation of pydna.open_config_folder()
Item is a string containing one genbank accession number
for a nucleotide file. Genbank nucleotide accession numbers have this format:
A12345 = 1 letter + 5 numerals
AB123456 = 2 letters + 6 numerals
The accession number is sometimes followed by a point and version number
BK006936.2
Item can also contain optional interval information in the following formats:
BK006936.2 REGION: complement(613900..615202)
NM_005546 REGION: 1..100
NM_005546 REGION: complement(1..100)
21614549:1-100
21614549:c100-1
21614549 1-100
21614549 c100-1
It is useful to set an interval for large genbank records to limit the download time.
The items above containing interval information can be obtained directly by
looking up an entry in Genbank and setting the Change region shown on the
upper right side of the page. The ACCESSION line of the displayed Genbank
file will have the formatting shown.
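The item formats listed above can be told apart with two regular expressions. This is a stand-alone sketch, not the parser used by pydna; the function parse_item and its return value (accession, start, stop, strand) are assumptions made for illustration, with strand 2 meaning the complementary strand.

```python
import re

# "BK006936.2 REGION: complement(613900..615202)" or "NM_005546 REGION: 1..100"
REGION = re.compile(
    r"^(?P<acc>\S+)\s+REGION:\s*(?P<comp>complement\()?"
    r"(?P<start>\d+)\.\.(?P<stop>\d+)\)?$"
)
# "21614549:1-100", "21614549:c100-1", "21614549 1-100" or "21614549 c100-1"
SHORT = re.compile(r"^(?P<acc>[^\s:]+)[:\s]\s*(?P<comp>c)?(?P<a>\d+)-(?P<b>\d+)$")


def parse_item(item: str):
    """Split an item string into (accession, start, stop, strand)."""
    item = item.strip()
    match = REGION.match(item)
    if match:
        strand = 2 if match["comp"] else 1
        return match["acc"], int(match["start"]), int(match["stop"]), strand
    match = SHORT.match(item)
    if match:
        a, b = int(match["a"]), int(match["b"])
        # The complement notation lists the larger coordinate first.
        return match["acc"], min(a, b), max(a, b), 2 if match["comp"] else 1
    return item, None, None, 1  # plain accession, no interval


for example in ("BK006936.2 REGION: complement(613900..615202)", "21614549:c100-1"):
    print(parse_item(example))
```

A plain accession such as BK006936.2 falls through both patterns and is returned without an interval.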
Alternatively, seq_start and seq_stop can be set explicitly to the sequence intervals to be
downloaded.
If strand is “c”, “C”, “crick”, “Crick”, “antisense”, “Antisense”,
“2”, 2, “-” or “-1”, the antisense (Crick) strand is returned, otherwise
the sense (Watson) strand is returned.
This function takes the same parameters as the
pydna.genbank.Genbank.nucleotide() method. The email address stored
in the pydna_email environment variable is used. The easiest way to set
this permanently is to edit the pydna.ini file.
See the documentation of pydna.open_config_folder()
If no accession is given, a very short Genbank
entry
is used as an example (see below). This can be useful for testing the
connection to Genbank.
Please note that this result is also cached by default by settings in
the pydna.ini file.
See the documentation of pydna.open_config_folder()
LOCUS       CS570233                  14 bp    DNA     linear   PAT 18-MAY-2007
DEFINITION  Sequence 6 from Patent WO2007025016.
ACCESSION   CS570233
VERSION     CS570233.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1
  AUTHORS   Shaw,R.W. and Cottenoir,M.
  TITLE     Inhibition of metallo-beta-lactamase by double-stranded dna
  JOURNAL   Patent: WO 2007025016-A1 6 01-MAR-2007;
            Texas Tech University System (US)
FEATURES             Location/Qualifiers
     source          1..14
                     /organism="synthetic construct"
                     /mol_type="unassigned DNA"
                     /db_xref="taxon:32630"
                     /note="This is a 14bp aptamer inhibitor."
ORIGIN
        1 atgttcctac atga
//
This module provides the gbtext_clean() function which can clean up broken Genbank files enough to
pass the BioPython Genbank parser
Almost all of this code was lifted from BioJSON (levskaya/BioJSON) by Anselm Levskaya.
The original code was not accompanied by any software licence. This parser is based on pyparsing.
There are some modifications to deal with fringe cases.
The parser first produces JSON as an intermediate format which is then formatted back into a
string in Genbank format.
The parser is not complete, so some fields do not survive the roundtrip (see below).
This should not be a difficult fix. The returned result has two properties,
.jseq which is the intermediate JSON produced by the parser and .gbtext
which is the formatted genbank string.
This function takes a string containing one genbank sequence
in Genbank format and returns a named tuple containing two fields,
the gbtext containing a string with the corrected genbank sequence and
jseq which contains the JSON intermediate.
Examples
>>> s = '''LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
... DEFINITION .
... ACCESSION
... VERSION
... SOURCE .
... ORGANISM .
... COMMENT
... COMMENT ApEinfo:methylated:1
... ORIGIN
... 1 aaa
... //'''
>>> from pydna.readers import read
>>> read(s)
... /site-packages/Bio/GenBank/Scanner.py:1388: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013\n'
"correct?\n:%r" % line, BiopythonParserWarning)
Traceback (most recent call last):
File "... /pydna/readers.py", line 48, in read
results = results.pop()
IndexError: pop from empty list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "... /pydna/readers.py", line 50, in read
raise ValueError("No sequences found in data:\n({})".format(data[:79]))
ValueError: No sequences found in data:
(LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
DEFINITI)
>>> from pydna.genbankfixer import gbtext_clean
>>> s2, j2 = gbtext_clean(s)
>>> print(s2)
LOCUS New_DNA 3 bp ds-DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION
VERSION
SOURCE .
ORGANISM .
COMMENT
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
>>> s3 = read(s2)
>>> s3
Dseqrecord(o3)
>>> print(s3.format())
LOCUS New_DNA 3 bp DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION New_DNA
VERSION New_DNA
KEYWORDS .
SOURCE
ORGANISM .
.
COMMENT
ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
A DNA ladder is a list of FakeSeq objects that has to be initiated with
Size (bp), amount of substance (mol) and Relative mobility (Rf).
Rf is a float value between 0.000 and 1.000. These are used together with
the cubic spline interpolator in the gel module to calculate migration
distance from fragment length. The Rf values are calculated manually from
a gel image. An example can be found in scripts/molecular_weight_standards.ods.
If it’s possible to anneal for minimal_annealing length, but with mismatches, it raises an error.
>>> oligonucleotide_hybridization_overhangs("cATGGC","GCCATa",5)
Traceback (most recent call last):
...
ValueError: The oligonucleotides can anneal with mismatches
If there are mismatches given the minimal annealing length, it raises an error.
>>> fwd_primer3=Primer("cATGGC")
>>> rvs_primer3=Primer("GCCATa")
>>> oligonucleotide_hybridization(fwd_primer3,rvs_primer3,5)
Traceback (most recent call last):
...
ValueError: The oligonucleotides can anneal with mismatches
This module provides classes that roughly map to the OpenCloning
data model, which is defined using LinkML <https://linkml.io>, and available as a python
package opencloning-linkml. These classes
are documented there, and the ones in this module essentially replace the fields pointing to
sequences and primers (which use ids in the data model) to Dseqrecord and Primer
objects, respectively. Similarly, it uses Location from Biopython instead of a string,
which is what the data model uses.
When using pydna to plan cloning, it stores the provenance of Dseqrecord objects in
their source attribute. Not all methods generate sources so far, so refer to the
documentation notebooks for examples on how to use this feature. The history method of
Dseqrecord objects can be used to get a string representation of the provenance of the
sequence. You can also use the CloningStrategy class to create a JSON representation of
the cloning strategy. That CloningStrategy can be loaded in the OpenCloning web interface
to see a representation of the cloning strategy.
Not all fields can be readily serialized to be converted to regular types in pydantic. For
instance, the coordinates field of the GenomeCoordinatesSource class is a
SimpleLocation object, or the input field of Source is a list of SourceInput
objects, which can be Dseqrecord or Primer objects, or AssemblyFragment objects.
For these types of fields, you have to define a field_serializer method to serialize them
to the correct type.
Context manager that is used to determine how ids are assigned to objects when
mapping them to the OpenCloning data model. If use_python_internal_id is True,
the built-in python id() function is used to assign ids to objects. That function
produces a unique integer for each object in python, so it’s guaranteed to be unique.
If use_python_internal_id is False, the object’s .id attribute
(must be a string integer) is used to assign ids to objects. This is useful
when the objects already have meaningful ids,
and you want to keep references to them in SourceInput objects (which sequences and
primers are used in a particular source).
Parameters:
use_python_internal_id (bool) – If True, use Python’s built-in id() function.
If False, use the object’s .id attribute (must be a string integer).
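The behaviour described above can be sketched with contextlib. The names id_assignment, get_id and _STATE are hypothetical and are not the names used in pydna; the sketch only shows how a context manager can switch the id strategy for a block of code and restore the previous setting afterwards.

```python
from contextlib import contextmanager

# Module-level setting for this sketch; True means "use Python's id()".
_STATE = {"use_python_internal_id": True}


@contextmanager
def id_assignment(use_python_internal_id: bool = True):
    """Temporarily choose how ids are assigned inside a with-block."""
    previous = _STATE["use_python_internal_id"]
    _STATE["use_python_internal_id"] = use_python_internal_id
    try:
        yield
    finally:
        # Restore the previous setting even if the block raised an error.
        _STATE["use_python_internal_id"] = previous


def get_id(obj) -> int:
    """Return the id used when mapping obj to the data model."""
    if _STATE["use_python_internal_id"]:
        return id(obj)
    return int(obj.id)  # .id must be a string integer such as "12"


class Fragment:
    """Stand-in for a sequence or primer object with an .id attribute."""

    def __init__(self, id_: str):
        self.id = id_


fragment = Fragment("12")
with id_assignment(use_python_internal_id=False):
    print(get_id(fragment))
```

Inside the with-block the meaningful id stored on the object is used; outside it, the interpreter-assigned integer is used again.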
Generates a JSON representation of the model using Pydantic’s to_json method.
Parameters:
indent – Indentation to use in the JSON output. If None is passed, the output will be compact.
ensure_ascii – If True, the output is guaranteed to have all incoming non-ASCII characters escaped.
If False (the default), these characters will be output as-is.
include – Field(s) to include in the JSON output.
exclude – Field(s) to exclude from the JSON output.
context – Additional context to pass to the serializer.
by_alias – Whether to serialize using field aliases.
exclude_unset – Whether to exclude fields that have not been explicitly set.
exclude_defaults – Whether to exclude fields that are set to their default value.
exclude_none – Whether to exclude fields that have a value of None.
exclude_computed_fields – Whether to exclude computed fields.
While this can be useful for round-tripping, it is usually recommended to use the dedicated
round_trip parameter instead.
round_trip – If True, dumped values should be valid as input for non-idempotent types such as Json[T].
warnings – How to handle serialization errors. False/“none” ignores them, True/“warn” logs errors,
“error” raises a PydanticSerializationError.
fallback – A function to call when an unknown value is encountered. If not provided,
a PydanticSerializationError is raised.
serialize_as_any – Whether to serialize fields with duck-typing serialization behavior.
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
Parameters:
mode – The mode in which to_python should run.
If mode is ‘json’, the output will only contain JSON serializable types.
If mode is ‘python’, the output may contain non-JSON-serializable Python objects.
include – A set of fields to include in the output.
exclude – A set of fields to exclude from the output.
context – Additional context to pass to the serializer.
by_alias – Whether to use the field’s alias in the dictionary key if defined.
exclude_unset – Whether to exclude fields that have not been explicitly set.
exclude_defaults – Whether to exclude fields that are set to their default value.
exclude_none – Whether to exclude fields that have a value of None.
exclude_computed_fields – Whether to exclude computed fields.
While this can be useful for round-tripping, it is usually recommended to use the dedicated
round_trip parameter instead.
round_trip – If True, dumped values should be valid as input for non-idempotent types such as Json[T].
warnings – How to handle serialization errors. False/“none” ignores them, True/“warn” logs errors,
“error” raises a PydanticSerializationError (pydantic_core.PydanticSerializationError).
fallback – A function to call when an unknown value is encountered. If not provided,
a PydanticSerializationError (pydantic_core.PydanticSerializationError) is raised.
serialize_as_any – Whether to serialize fields with duck-typing serialization behavior.
If no sequences are found, an empty list is returned. This is a greedy
function; use it carefully.
Parameters:
data (string or iterable) –
The data parameter is a string containing either:
1. an absolute path to a local file. The file will be read in text
mode and parsed for EMBL, FASTA and Genbank sequences. Can be
a string or a Path object.
2. a string containing one or more sequences in EMBL, GENBANK,
or FASTA format. Mixed formats are allowed.
data can also be a list or other iterable where the elements are of form 1 or 2.
ds (bool) – If True double stranded Dseqrecord objects are returned.
If False single stranded Bio.SeqRecord[20] objects are
returned.
This module provides fast primer screening using the Aho-Corasick string-search
algorithm. It is useful for PCR diagnostic purposes when given a list of primers
and a single sequence or list of sequences to analyze.
The Aho-Corasick algorithm efficiently finds all occurrences of a set of sequences
within a larger text. If the same primer list is used repeatedly, creating an
automaton greatly speeds up repeated searches. See make_automaton() for
information on creating, saving, and loading such automata.
The primer list can contain None. This can be used to remove primers
from the primer_list for the automaton while keeping the original index
for each primer.
The limit is the part of the primer used to find annealing positions.
The automaton processes the uppercase 3’ part of each primer up to limit.
It has to be rebuilt if a different limit is needed.
The primers can contain ambiguous bases from the extended IUPAC DNA alphabet.
The automaton can be saved and loaded like this (from the pyahocorasick docs):
import pickle
import ahocorasick
from pydna.primer_screen import make_automaton, forward_primers

# build automaton
atm = make_automaton(primer_list, limit=16)

# save automaton
atm.save("atm.automaton", pickle.dumps)

# load automaton
atm = ahocorasick.load("atm.automaton", pickle.loads)

# use automaton
fps = forward_primers(template, primer_list, automaton=atm)
Parameters:
primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or
any object with a seq property such as Bio.SeqRecord.SeqRecord.
limit (int, optional) – This is the primer part in the 3’-end that has to
anneal. The default is 16.
Returns:
pyahocorasick automaton made for the list of Primer objects.
This function accepts two integers representing PCR product sizes
and returns True or False indicating the ease with which the size
differences can be distinguished on a typical agarose gel.
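A size comparison of this kind could look like the minimal sketch below. The function name sizes_distinguishable and the 20 % relative-difference threshold are assumptions chosen for illustration; they are not taken from pydna and the rule actually used by the library may differ.

```python
def sizes_distinguishable(size_a: int, size_b: int, min_ratio: float = 0.2) -> bool:
    """Return True if two PCR product sizes are likely to separate on a gel.

    Hypothetical rule: the sizes must differ by at least `min_ratio`
    (20 % by default) of the larger product.
    """
    larger = max(size_a, size_b)
    if larger <= 0:
        raise ValueError("PCR product sizes must be positive integers")
    return abs(size_a - size_b) / larger >= min_ratio


# Two products of 933 bp and 1212 bp differ by about 23 % of the larger one.
print(sizes_distinguishable(933, 1212))
# Two products of 1000 bp and 1050 bp differ by less than 5 %.
print(sizes_distinguishable(1000, 1050))
```

A relative threshold is used here because band separation on agarose depends on the ratio of sizes rather than on the absolute difference.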
Where a key such as primer_A_index (integer) is the index for a primer
in primer_list and the value is a list of locations (integers) where
the primer binds.
The concept of location is the same as used in pydna.primer.
The forward primer in the figure below anneals at position 14 on the
template.
Parameters:
seq (Dseqrecord) – Target sequence to find primer annealing positions.
primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object
with a seq property such as Bio.SeqRecord.SeqRecord.
limit (int, optional) – This is the part at the 3’-end of each primer that has to
anneal. The default is 16.
automaton (ahocorasick.Automaton, optional) – Automaton made with make_automaton(). The default is None.
Returns:
Dict of lists where keys are primer indices in primer_list and
values are lists with primer locations.
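The shape of this return value can be illustrated with a small pure-Python sketch. It uses plain str.find instead of an Aho-Corasick automaton, works on strings rather than Primer objects, and does not handle ambiguous bases. The function name forward_positions is hypothetical, and the position convention (the index just past the last annealing base) is an assumption; the convention actually used is the one defined in pydna.primer.

```python
def forward_positions(template, primers, limit=16):
    """Map primer index -> list of positions where the primer's 3' part anneals.

    Assumption: a position is the index just past the last annealing base
    on the top strand of the template.
    """
    template = template.upper()
    hits = {}
    for index, primer in enumerate(primers):
        if primer is None:  # None entries keep the original indices stable
            continue
        footprint = primer.upper()[-limit:]  # only the 3' part has to anneal
        start = template.find(footprint)
        while start != -1:
            hits.setdefault(index, []).append(start + len(footprint))
            start = template.find(footprint, start + 1)  # allow overlapping hits
    return hits


template = "aaaaGATTACAcccc"
primers = ["ttGATTACA", None, "gggg"]
print(forward_positions(template, primers, limit=7))
```

Primers that do not anneal are simply absent from the dictionary, so the keys are always indices into the original list.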
Where a key such as primer_A_index (integer) is the index for a primer
in primer_list and the value is a list of locations (integers) where
the primer binds.
The concept of location is the same as used in pydna.primer.
The reverse primer below anneals at position 9.
Parameters:
seq (Dseqrecord) – Target sequence to find primer annealing positions.
primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object
with a seq property such as Bio.SeqRecord.SeqRecord.
limit (int, optional) – This is the part in the 3’-end of each primer that has to
anneal. The default is 16.
automaton (ahocorasick.Automaton, optional) – Automaton made with make_automaton(). The default is None.
Returns:
Dict of lists where keys are primer indices in primer_list and
values are lists with primer locations.
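A reverse primer anneals to the top strand through its reverse complement, which the following sketch makes explicit. As before, it is a plain string search and not the automaton-based implementation. The names reverse_complement and reverse_positions are hypothetical, and the position is assumed to be the index of the first top-strand base covered by the annealing part; pydna.primer defines the convention actually used.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def reverse_positions(template, primers, limit=16):
    """Map primer index -> list of positions where a reverse primer anneals.

    Assumption: a position is the index of the first top-strand base
    covered by the 3' part of the primer.
    """
    template = template.upper()
    hits = {}
    for index, primer in enumerate(primers):
        if primer is None:
            continue
        # The 3' part of the primer pairs with this stretch of the top strand.
        site = reverse_complement(primer[-limit:])
        start = template.find(site)
        while start != -1:
            hits.setdefault(index, []).append(start)
            start = template.find(site, start + 1)
    return hits


print(reverse_positions("aaaaGATTACAcccc", ["ttGGGGTGT"], limit=7))
```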
Primer pairs that form PCR products larger than short and smaller
than long.
The PCR product size includes the PCR primers. Only unique primer pairs
are returned. This means that the forward and reverse primers can only
bind in one position on the template each.
The indices are the primer_list indices and positions are the positions of
the primers as described in forward_primers() and reverse_primers()
functions.
The size includes the length of each primer, so it is the true total length
of the PCR product.
Parameters:
seq (Dseqrecord) – Target sequence to find primer annealing positions.
primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object
with a seq property such as Bio.SeqRecord.SeqRecord.
limit (int, optional) – This is the part in the 3’-end of each primer that has to
anneal. The default is 16.
short (int, optional) – Lower limit for the size of the PCR products. The default is 500.
long (int, optional) – Upper limit for the size of the PCR products. The default is 1500.
automaton (ahocorasick.Automaton, optional) – Automaton made with make_automaton(). The default is None.
Returns:
List of tuples (index_fp, position_fp, index_rp, position_rp, size)
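The pairing logic described above can be sketched as follows. The function pair_products is hypothetical and takes precomputed position dictionaries as input. It assumes that a forward position is the index just past the forward primer's 3' end and that a reverse position is the index where the reverse primer's annealing site starts on the top strand, so that the product size is the two primer lengths plus the distance between the two positions.

```python
def pair_products(fwd_hits, rev_hits, primers, short=500, long=1500):
    """Return (index_fp, position_fp, index_rp, position_rp, size) tuples.

    fwd_hits and rev_hits map primer index -> list of positions
    (see the assumptions about positions in the text above).
    """
    pairs = []
    for i_fp, fp_positions in fwd_hits.items():
        if len(fp_positions) != 1:  # only primers binding exactly once
            continue
        for i_rp, rp_positions in rev_hits.items():
            if len(rp_positions) != 1:
                continue
            p_fp, p_rp = fp_positions[0], rp_positions[0]
            if p_rp < p_fp:  # primers must point towards each other
                continue
            # The size includes both primers in full.
            size = len(primers[i_fp]) + (p_rp - p_fp) + len(primers[i_rp])
            if short < size < long:
                pairs.append((i_fp, p_fp, i_rp, p_rp, size))
    return pairs


primers = ["A" * 20, "C" * 22, "G" * 18]
fwd_hits = {0: [100], 2: [50, 400]}  # primer 2 binds twice and is skipped
rev_hits = {1: [700]}
print(pair_products(fwd_hits, rev_hits, primers))
```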
Primer pairs that flank a target position (begin..end). This means that
forward primers have to bind before or at the begin position and reverse primers
have to bind at or after the end position.
The function returns a list of the same flat 5-namedtuples of integers returned
from the primer_pairs() function.
Parameters:
seq (Dseqrecord) – Target sequence to find primer annealing positions.
primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object
with a seq property such as Bio.SeqRecord.SeqRecord.
target (tuple[int, int]) – Start and stop position for target sequence.
limit (int, optional) – This is the part in the 3’-end of each primer that has to
anneal. The default is 16.
automaton (ahocorasick.Automaton, optional) – Automaton made with make_automaton(). The default is None.
Returns:
List of tuples (index_fp, position_fp, index_rp, position_rp, size).
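The flanking condition amounts to a simple filter over such 5-tuples. The helper below is a hypothetical sketch that takes an already computed list of pairs rather than a sequence and a primer list.

```python
def flanking(pairs, target):
    """Keep pairs whose primers bind outside the target (begin, end).

    Each pair is (index_fp, position_fp, index_rp, position_rp, size).
    """
    begin, end = target
    return [
        pair
        for pair in pairs
        if pair[1] <= begin and pair[3] >= end  # fp before/at begin, rp at/after end
    ]


pairs = [(0, 100, 1, 700, 642), (2, 300, 1, 700, 440)]
print(flanking(pairs, target=(250, 650)))
```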
Given an iterable of sequences and a primer list, primers are selected that result in
unique product sizes from each of the input sequences.
Primers 1 and 2 both form PCR products from sequenceA and B below, but of
different sizes. Primers 1 and 2 could be used to verify genetic modifications such
as cloning an insert into a plasmid vector.
The callback function returns True or False for each set of PCR products. It is
meant to filter for PCR products that are likely to migrate to
sufficiently distinct locations to be distinguishable on a typical agarose gel.
Only products larger than short and smaller than long are returned.
An example of the output for two sequences (Dseqrecord(-3308), Dseqrecord(-3613)).
Primers 501 and 1806 would yield a 933 bp product with the 3308 bp sequence and the same
primer pair would give 1212 bp with the 3613 bp sequence.
A list of named 4-tuples is returned (Sequence, forward_primer, reverse_primer, size_bp),
where each tuple has one entry for each sequence in the input argument.
Parameters:
primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object
with a seq property such as Bio.SeqRecord.SeqRecord.
limit (int, optional) – This is the part in the 3’-end of each primer that has to
anneal. The default is 16.
short (int, optional) – Lower limit for the size of the PCR products. The default is 500.
long (int, optional) – Upper limit for the size of the PCR products. The default is 1500.
automaton (ahocorasick.Automaton, optional) – Automaton made with make_automaton(). The default is None.
callback (callable[[list], bool], optional) – A function accepting a list of integers and returning True or False.
The default is callback.
Given a list of sequences and a primer list, primer triplets are selected that result in
PCR products of different sizes from each of the input sequences.
Primers 1, 2 and 3 form PCR products from sequenceA and B below, but of
different sizes. Primer 1 binds both sequences while primers 2 and 3 bind one
sequence each. This primer triplet could be used to verify genetic
modifications.
The callback function is used to give a score for the PCR products. This score can
be used to decide if a collection of PCR products are likely to migrate to distinct
locations on a typical agarose gel.
Only products larger than short and smaller than long are returned.
An example of the output for two sequences = [Dseqrecord(-7664), Dseqrecord(-3613)].
Primer pair 701, 700 would produce a 724 bp product with the 7664 bp sequence while
the primer pair 701, 1564 would give a 1450 bp product with the 3613 bp sequence.
Parameters:
primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object
with a seq property such as Bio.SeqRecord.SeqRecord.
limit (int, optional) – This is the part in the 3’-end of each primer that has to
anneal. The default is 16.
short (int, optional) – Lower limit for the size of the PCR products. The default is 500.
long (int, optional) – Upper limit for the size of the PCR products. The default is 2000.
automaton (ahocorasick.Automaton, optional) – Automaton made with make_automaton(). The default is None.
callback (callable[[list], bool], optional) – A function accepting a list of integers and returning True or False.
The default is callback.
The table argument is the name of a codon table (string). These names
can be for example “Standard” or “Alternative Yeast Nuclear” for the
yeast CUG clade where the CUG codon is translated as serine instead
of the standard leucine.
Over forty translation tables are available from the BioPython
Bio.Data.CodonTable module. Look at the keys of the dictionary
CodonTable.ambiguous_generic_by_name.
These are based on tables in this file provided by NCBI:
  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
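The effect of choosing a different table can be sketched without Biopython. The dictionaries below are built by hand from the standard table shown above, and the yeast variant changes only the CTG (CUG) codon from leucine to serine. The names STANDARD, ALT_YEAST and translate are illustrative and are not the Bio.Data.CodonTable API.

```python
from itertools import product

BASES = "TCAG"
# Amino acids in the same order as the table above (first base, then
# second base, then third base, each running through T, C, A, G).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}
# Hypothetical stand-in for the "Alternative Yeast Nuclear" table:
# only the CTG codon differs in this sketch.
ALT_YEAST = {**STANDARD, "CTG": "S"}


def translate(dna, table):
    """Translate complete codons of an unambiguous DNA or RNA string."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGCTGTAA", STANDARD))
print(translate("ATGCTGTAA", ALT_YEAST))
```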
The default is False. True means that translation terminates at the first
in frame stop codon. False translates to the end.
cds (bool, optional) – The default is False. If True, checks that the sequence starts with a
valid alternative start codon, that the sequence length is a multiple of three, and
that there is a single in frame stop codon at the end. If these tests fail,
an exception is raised.
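The three tests performed for cds=True can be written out as a short sketch. It uses the start codons marked (s) and the stop codons of the standard table above, and it returns a list of problems instead of raising an exception. The function name cds_problems is hypothetical.

```python
START_CODONS = {"ATG", "TTG", "CTG"}  # codons marked (s) in the standard table
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(dna):
    """Return a list of reasons why `dna` is not a complete CDS (empty if valid)."""
    dna = dna.upper()
    problems = []
    if len(dna) % 3:
        problems.append("length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("first codon is not a start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("last codon is not a stop codon")
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        problems.append("extra in frame stop codon")
    return problems


print(cds_problems("ATGAAATAA"))
print(cds_problems("GTGAAATAGTAA"))
```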
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL.
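The difference between the two encodings can be shown with the standard library alone. The sketch below assumes a SHA-1 digest of the uppercased sequence with the base64 padding removed, as in the original SEGUID definition; the exact input that pydna hashes is not stated here, so the values printed should not be read as pydna output.

```python
import base64
import hashlib


def sha1_checksums(seq):
    """Return (standard base64, URL-safe base64) SHA-1 checksums of a sequence."""
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    standard = base64.b64encode(digest).decode("ascii").rstrip("=")
    urlsafe = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return standard, urlsafe


standard, urlsafe = sha1_checksums("gattaca")
print(standard)
print(urlsafe)
# The URL-safe form differs only in the two substituted characters.
print(urlsafe == standard.replace("+", "-").replace("/", "_"))
```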
Turn a nucleotide sequence into a protein sequence by creating a new sequence object.
This method will translate DNA or RNA sequences. It should not
be used on protein sequences as any result will be biologically
meaningless.
Parameters:
table – Which codon table to use. This can be either a name
(string), an NCBI identifier (integer), or a CodonTable
object (useful for non-standard genetic codes). This
defaults to the “Standard” table.
stop_symbol – Single character string, what to use for
terminators. This defaults to the asterisk, “*”.
to_stop – Boolean, defaults to False, meaning do a full
translation continuing on past any stop codons (translated as the
specified stop_symbol). If True, translation is terminated at
the first in frame stop codon (and the stop_symbol is not
appended to the returned protein sequence).
cds – Boolean, indicates this is a complete CDS. If True,
this checks the sequence starts with a valid alternative start
codon (which will be translated as methionine, M), that the
sequence length is a multiple of three, and that there is a
single in frame stop codon at the end (this will be excluded
from the protein sequence, regardless of the to_stop option).
If these tests fail, an exception is raised.
gap – Single character string to denote the symbol used for gaps.
Defaults to the minus sign.
A Seq object is returned if translate is called on a Seq
object; a MutableSeq object is returned if translate is called
on a MutableSeq object.
It isn’t a valid CDS under NCBI table 1, due to both the start codon
and also the in frame stop codons:
>>> coding_dna.translate(table=1, cds=True)
Traceback (most recent call last):
    ...
Bio.Data.CodonTable.TranslationError: First codon 'GTG' is not a start codon
If the sequence has no in-frame stop codon, then the to_stop argument
has no effect:
NOTE - Ambiguous codons like “TAN” or “NNN” could be an amino acid
or a stop codon. These are translated as “X”. Any invalid codon
(e.g. “TA?” or “T-A”) will throw a TranslationError.
NOTE - This does NOT behave like the python string’s translate
method. For that use str(my_seq).translate(…) instead
Transcribe a DNA sequence into RNA and return the RNA sequence as a new Seq object.
Following the usual convention, the sequence is interpreted as the
coding strand of the DNA double helix, not the template strand. This
means we can get the RNA sequence just by switching T to U.
As Seq objects are immutable, a TypeError is raised if
transcribe is called on a Seq object with inplace=True.
Trying to transcribe an RNA sequence has no effect.
If you have a nucleotide sequence which might be DNA or RNA
(or even a mixture), calling the transcribe method will ensure
any T becomes U.
Trying to transcribe a protein sequence will replace any
T for Threonine with U for Selenocysteine, which has no
biologically plausible rationale.
As Seq objects are immutable, a TypeError is raised if
transcribe is called on a Seq object with inplace=True.
Trying to back-transcribe DNA has no effect. If you have a nucleotide
sequence which might be DNA or RNA (or even a mixture), calling the
back-transcribe method will ensure any U becomes T.
Trying to back-transcribe a protein sequence will replace any U for
Selenocysteine with T for Threonine, which is biologically meaningless.
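Because the sequence is read as the coding strand, both operations reduce to a character substitution. The following sketch works on plain strings and is only an illustration of that idea, not the Seq implementation.

```python
def transcribe(seq):
    """Coding-strand DNA -> RNA: every T becomes U (case preserved)."""
    return seq.replace("T", "U").replace("t", "u")


def back_transcribe(seq):
    """RNA -> coding-strand DNA: every U becomes T (case preserved)."""
    return seq.replace("U", "T").replace("u", "t")


print(transcribe("GATTACA"))       # DNA to RNA
print(transcribe("GAUUACA"))       # already RNA, unchanged
print(transcribe("GATUACA"))       # mixture, every T becomes U
print(back_transcribe("GAUUACA"))  # RNA back to DNA
```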
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL.
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL or a filename.
Examples
>>> from pydna.seqrecord import SeqRecord
>>> a = SeqRecord("gattaca")
>>> a.seguid()  # original seguid is +bKGnebMkia5kNg/gF7IORXMnIU
'lsseguid=tp2jzeCM2e3W4yxtrrx09CMKa_8'
Return the longest common substring between the sequence
and another sequence (other). The other sequence can be a string,
Seq, SeqRecord, Dseq or Dseqrecord.
The method returns a SeqFeature with type “read” as this method
is mostly used to map sequence reads to the sequence. This can be
changed by passing a type as keyword with some other string value.
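The underlying search can be illustrated with difflib from the standard library. This is a sketch on plain strings that returns the start, the end and the matched text; it is not necessarily the algorithm pydna uses, and it does not build a SeqFeature.

```python
from difflib import SequenceMatcher


def longest_common_substring(seq, other):
    """Return (start, end, match) of the longest substring of `seq` found in `other`.

    The comparison is case insensitive; start and end are positions in `seq`.
    """
    a, b = seq.upper(), other.upper()
    # autojunk=False disables difflib's heuristic that ignores frequent characters.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.a, match.a + match.size, a[match.a:match.a + match.size]


print(longest_common_substring("aaagattacattt", "ccgattacagg"))
```

The start and end values correspond to what a feature location on the sequence would need.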
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the “Software”), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Algorithm described in Pierre Duval, Jean. 1983. Factorizing Words
over an Ordered Alphabet. Journal of Algorithms & Computational Technology
4 (4) (December 1): 363–381. and Algorithms on strings and sequences based
on Lyndon words, David Eppstein 2011.
https://gist.github.com/dvberkel/1950267
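For orientation, a textbook version of Duval's algorithm is sketched below, together with its common use for finding the lexicographically smallest rotation of a circular string. This is a generic implementation written for illustration, not the code from the gist or from pydna, and the function names are assumptions.

```python
def lyndon_factorization(s):
    """Split s into a non-increasing sequence of Lyndon words (Duval, 1983)."""
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:  # emit the repeated Lyndon word of length j - k
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def smallest_rotation(s):
    """Lexicographically smallest rotation of s, found in linear time."""
    doubled = s + s
    n = len(doubled)
    i = start = 0
    while i < n // 2:
        start = i  # start of the last Lyndon word beginning in the first copy
        j, k = i + 1, i
        while j < n and doubled[k] <= doubled[j]:
            k = i if doubled[k] < doubled[j] else k + 1
            j += 1
        while i <= k:
            i += j - k
    return doubled[start:start + n // 2]


print(lyndon_factorization("banana"))
print(smallest_rotation("banana"))
```

A canonical rotation of this kind gives circular sequences a unique linear representation, which is useful for comparing them or computing checksums.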
Turn a three letter code protein sequence into one with one letter code.
The single input argument ‘seq’ should be a protein sequence using three
letter codes, as a python string.
This function returns the amino acid sequence as a string using the one
letter amino acid codes. Output follows the IUPAC standard (including
ambiguous characters B for “Asx”, J for “Xle” and X for “Xaa”, and also U
for “Sel” and O for “Pyl”) plus “Ter” for a terminator given as an
asterisk.
Any unknown
character (including possible gap characters), is changed into ‘Xaa’.
Examples
>>> from Bio.SeqUtils import seq3
>>> seq3("MAIVMGRWKGAR*")
'MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer'
>>> from pydna.utils import seq31
>>> seq31('MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer')
'M A I V M G R W K G A R *'
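The conversion itself is a table lookup over chunks of three characters, as the sketch below shows. The function name three_to_one is hypothetical, the output has no spaces between letters (unlike the output shown above), and mapping unknown codes to "X" is an assumption of this sketch.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    # Ambiguous and special codes
    "Asx": "B", "Glx": "Z", "Xle": "J", "Xaa": "X",
    "Sel": "U", "Pyl": "O", "Ter": "*",
}


def three_to_one(seq):
    """Convert a three letter code protein string to one letter codes."""
    codes = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Unknown or incomplete codes become "X" (an assumption of this sketch).
    return "".join(THREE_TO_ONE.get(code.capitalize(), "X") for code in codes)


print(three_to_one("MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer"))
```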
Compares two or more DNA sequences for equality i.e. if they
represent the same DNA molecule.
Two linear sequences are considered equal if either:
They have the same sequence (case insensitive)
One sequence is the reverse complement of the other
Two circular sequences are considered equal if they are circular
permutations meaning that they have the same length and:
One sequence can be found in the concatenation of the other sequence with itself.
The reverse complement of one sequence can be found in the concatenation of the other sequence with itself.
The topology for the comparison can be chosen by setting one of the keywords
linear or circular to True or False.
If circular or linear is not set, it will be deduced from the topology of
each sequence for sequences that have a linear or circular attribute
(like Dseq and Dseqrecord).
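The rules above translate directly into string operations. The sketch below compares two plain strings at a time with an explicitly chosen topology; the function names are hypothetical, and deducing the topology from Dseq or Dseqrecord objects is left out.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rc(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def same_linear(a, b):
    """Equal as linear molecules: same sequence or reverse complement."""
    a, b = a.upper(), b.upper()
    return a == b or a == rc(b)


def same_circular(a, b):
    """Equal as circular molecules: b or rc(b) is a rotation of a."""
    a, b = a.upper(), b.upper()
    if len(a) != len(b):
        return False
    doubled = a + a  # every rotation of a is a substring of a + a
    return b in doubled or rc(b) in doubled


print(same_linear("GATTACA", "tgtaatc"))    # reverse complements
print(same_circular("GATTACA", "TACAGAT"))  # circular permutation
print(same_linear("GATTACA", "TACAGAT"))    # not equal as linear molecules
```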
Create a location object from a start and end position.
If the end position is less than the start position, the location is circular. It handles negative positions.
Note this special case: 0 is the same as len(seq)
>>> str(create_location(5, 0, 10))
'[5:10]'
Note the special case where if start and end are the same,
the location spans the entire sequence (it’s not empty).
>>> str(create_location(5, 5, 10))
'join{[5:10], [0:5]}'
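The wrap-around logic can be sketched with plain tuples instead of Biopython location objects. The function location_parts is hypothetical; it returns a list of (start, end) parts, and it assumes that negative positions are interpreted modulo the sequence length, as in Python indexing.

```python
def location_parts(start, end, length):
    """Return the (start, end) parts of a location on a sequence of `length`.

    One part means a simple location; two parts mean the location
    crosses the origin of a circular sequence.
    """
    start %= length  # negative positions count from the end (assumption)
    end %= length    # this also maps end == length to 0
    if start < end:
        return [(start, end)]
    # end <= start: wrap around the origin and drop an empty second part
    parts = [(start, length), (0, end)]
    return [(s, e) for s, e in parts if s < e]


print(location_parts(5, 0, 10))   # same as end == len(seq)
print(location_parts(5, 5, 10))   # spans the entire sequence
print(location_parts(8, 2, 10))   # crosses the origin
print(location_parts(-3, -1, 10))
```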