Pydna is a Python package for simulating the creation of
recombinant DNA molecules using molecular biology
techniques. Development of pydna happens in this GitHub repository.
Provided:
PCR simulation
Assembly simulation based on shared identical sequences
Primer design for amplification of a given sequence
Automatic design of primer tails for Gibson assembly or homologous recombination
Restriction digestion and cut&paste cloning
Agarose gel simulation
Download sequences from Genbank
Parsing various sequence formats, including the capacity to handle broken Genbank format
The most important modules and how to import functions or classes from
them are listed below. Class names start with a capital letter,
functions with a lowercase letter:
from pydna.module import function
from pydna.module import Class
Example: from pydna.gel import Gel
pydna
├── amplify
│ ├── Anneal
│ └── pcr
├── assembly
│ └── Assembly
├── design
│ ├── assembly_fragments
│ └── primer_design
├── download
│ └── download_text
├── dseqrecord
│ └── Dseqrecord
├── gel
│ └── Gel
├── genbank
│ ├── genbank
│ └── Genbank
├── parsers
│ ├── parse
│ └── parse_primers
└── readers
    ├── read
    └── read_primers
Documentation is available as docstrings provided in the source code for
each module.
These docstrings can be inspected by reading the source code directly.
See further below on how to obtain the code for pydna.
In the python shell, use the built-in help function to view a
function’s docstring:
The docstrings are also used to provide an automatically generated reference
manual available online at
read the docs.
Docstrings can be explored using IPython, an
advanced Python shell with
TAB-completion and introspection capabilities. To see which functions
are available in pydna,
type pydna.<TAB> (where <TAB> refers to the TAB key).
Use pydna.open_config_folder?<ENTER> to view the docstring or
pydna.open_config_folder??<ENTER> to view the source code.
In the Spyder IDE it is possible
to place the cursor immediately before the name of a module, class or
function and press ctrl+i to bring up docstrings in a separate window.
Code snippets are indicated by three greater-than signs:
Please join the
Google group
for pydna; this is the preferred location for help. If you find bugs
in pydna itself, open an issue at the
Github repository.
pcr is a convenience function for the Anneal class to simplify its
usage, especially from the command line. If more than one or no PCR
product is formed, a ValueError is raised.
args is any iterable of Dseqrecords or an iterable of iterables of
Dseqrecords. args will be greedily flattened.
Parameters:
args (iterable containing sequence objects) – Several arguments are also accepted.
limit (int = 13, optional) – limit length of the annealing part of the primers.
Notes
sequences in args could be of type:
string
Seq
SeqRecord (or subclass)
Dseqrecord (or subclass)
The last sequence will be assumed to be the template while
all preceding sequences will be assumed to be primers.
This is a powerful function, use with care!
Returns:
product – A pydna.amplicon.Amplicon object representing the PCR
product. The direction of the PCR product will be the same as for
the template sequence.
Assembly of a list of linear DNA fragments into linear or circular
constructs. The Assembly is meant to replace the Assembly method as it
is easier to use. Accepts a list of Dseqrecords (source fragments) to
initiate an Assembly object. Several methods are available for analysis
of overlapping sequences, graph construction and assembly.
This function takes the same parameters as the
pydna.genbank.Genbank.nucleotide method. The email address stored
in the pydna_email environment variable is used. The easiest way to set
this permanently is to edit the pydna.ini file.
See the documentation of pydna.open_config_folder()
If no accession is given, a very short Genbank
entry
is used as an example (see below). This can be useful for testing the
connection to Genbank.
Please note that this result is also cached by default by settings in
the pydna.ini file.
See the documentation of pydna.open_config_folder()
LOCUS       CS570233                  14 bp    DNA     linear   PAT 18-MAY-2007
DEFINITION  Sequence 6 from Patent WO2007025016.
ACCESSION   CS570233
VERSION     CS570233.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1
  AUTHORS   Shaw,R.W. and Cottenoir,M.
  TITLE     Inhibition of metallo-beta-lactamase by double-stranded dna
  JOURNAL   Patent: WO 2007025016-A1 6 01-MAR-2007;
            Texas Tech University System (US)
FEATURES             Location/Qualifiers
     source          1..14
                     /organism="synthetic construct"
                     /mol_type="unassigned DNA"
                     /db_xref="taxon:32630"
                     /note="This is a 14bp aptamer inhibitor."
ORIGIN
        1 atgttcctac atga
//
This method downloads a Genbank nucleotide record from Genbank. This method is
cached by default. This can be controlled by editing the pydna_cached_funcs environment
variable. The best way to do this permanently is to edit the
pydna.ini file. See the documentation of pydna.open_config_folder()
Item is a string containing one genbank accession number
for a nucleotide file. Genbank nucleotide accession numbers have this format:
A12345 = 1 letter + 5 numerals
AB123456 = 2 letters + 6 numerals
The accession number is sometimes followed by a point and version number
BK006936.2
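The two accession layouts and the optional version suffix described above can be checked with a small regular expression. This is a minimal sketch, not pydna's own parsing code: the helper name `split_accession` is hypothetical, and it deliberately covers only the two layouts listed here (other identifiers, such as RefSeq-style names with an underscore, are not handled).

```python
import re

# Hypothetical helper, not part of pydna. It accepts only the two layouts
# listed above (1 letter + 5 numerals, 2 letters + 6 numerals), each with
# an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Za-z]\d{5}|[A-Za-z]{2}\d{6})(?:\.(\d+))?$")


def split_accession(item):
    """Return (accession, version); version is None when absent."""
    match = ACCESSION_RE.match(item.strip())
    if match is None:
        raise ValueError("not a recognised accession: {!r}".format(item))
    accession, version = match.groups()
    return accession, int(version) if version is not None else None


print(split_accession("A12345"))
print(split_accession("BK006936.2"))
```

Rejecting anything outside the two documented layouts keeps the sketch honest about what it covers; a real client would need a broader pattern.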
Item can also contain optional interval information in the following formats:
BK006936.2 REGION: complement(613900..615202)
NM_005546 REGION: 1..100
NM_005546 REGION: complement(1..100)
21614549:1-100
21614549:c100-1
21614549 1-100
21614549 c100-1
It is useful to set an interval for large Genbank records to limit the download time.
The items above containing interval information can be obtained directly by
looking up an entry in Genbank and setting the "Change region shown" option on the
upper right side of the page. The ACCESSION line of the displayed Genbank
file will have the formatting shown.
Alternatively, seq_start and seq_stop can be set explicitly to the sequence intervals to be
downloaded.
If strand is “c”, “C”, “crick”, “Crick”, “antisense”, “Antisense”,
“2”, 2, “-” or “-1”, the antisense (Crick) strand is returned, otherwise
the sense (Watson) strand is returned.
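The strand rule above amounts to a membership test. The following is a minimal sketch of that rule only; the names `CRICK_VALUES` and `is_crick` are made up for illustration and are not pydna identifiers.

```python
# Values that select the antisense (Crick) strand, copied from the text above.
# Hypothetical names; this is not pydna's implementation.
CRICK_VALUES = {"c", "C", "crick", "Crick", "antisense", "Antisense", "2", 2, "-", "-1"}


def is_crick(strand):
    """Return True if `strand` selects the antisense (Crick) strand."""
    return strand in CRICK_VALUES


for value in ("c", 2, "-1", 1, "watson"):
    print(repr(value), "->", "Crick" if is_crick(value) else "Watson")
```

Any value not in the set, including 1 and "watson", falls through to the sense (Watson) strand, matching the "otherwise" clause.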
Dseqrecord is a double stranded version of the Biopython SeqRecord [1] class.
The Dseqrecord object holds a Dseq object describing the sequence.
Additionally, Dseqrecord holds meta information about the sequence in the
form of a list of SeqFeatures, in the same way as the SeqRecord does.
The Dseqrecord can be initialized with a string, Seq, Dseq, SeqRecord
or another Dseqrecord. The sequence information will be stored in a
Dseq object in all cases.
Dseqrecord objects can be read or parsed from sequences in FASTA, EMBL or Genbank formats.
See the pydna.readers and pydna.parsers modules for further information.
There is a short representation associated with the Dseqrecord.
Dseqrecord(-3) represents a linear sequence of length 3
while Dseqrecord(o7)
represents a circular sequence of length 7.
Dseqrecord and Dseq share the same concept of length. This length can be larger
than each strand alone if they are staggered as in the example below.
<-- length -->
GATCCTTT
     AAAGCCTAG
Parameters:
record (string, Seq, SeqRecord, Dseq or other Dseqrecord object) – This data will be used to form the seq property
circular (bool, optional) – True or False reflecting the shape of the DNA molecule
linear (bool, optional) – True or False reflecting the shape of the DNA molecule
Digest a Dseqrecord object with one or more restriction enzymes.
Returns a list of linear Dseqrecords. If there are no cuts, an empty
list is returned.
See also Dseq.cut()
Parameters:
enzymes (enzyme object or iterable of such objects) – A Bio.Restriction.XXX restriction object or iterable of such.
Returns:
Dseqrecord_frags – list of Dseqrecord objects formed by the digestion
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL.
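The difference between the two encodings can be seen with the standard library alone. This sketch assumes the checksum is a SHA-1 digest of the upper-cased sequence with the base64 padding stripped; that assumption and the function name `sha1_checksums` are illustrative and not taken from pydna.

```python
import base64
import hashlib


def sha1_checksums(sequence):
    """Return (standard, urlsafe) base64 forms of a SHA-1 sequence digest.

    Assumption for illustration: the digest is taken over the upper-cased
    sequence and trailing '=' padding is removed.
    """
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    standard = base64.standard_b64encode(digest).decode("ascii").rstrip("=")
    urlsafe = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return standard, urlsafe


standard, urlsafe = sha1_checksums("atgttcctacatga")
print(standard)
print(urlsafe)
```

The two strings differ only where the standard alphabet would emit + or /, which the URL-safe alphabet replaces with - and _.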
This method returns a new circular sequence (Dseqrecord object), which has been rotated
in such a way that there is maximum overlap between the sequence and
ref, which may be a string, Biopython Seq, SeqRecord object or
another Dseqrecord object.
The reason for using this could be to rotate a new recombinant plasmid so
that it starts at the same position after cloning. See the example below:
Writes the Dseqrecord to a file using the format f, which must
be a format supported by Biopython SeqIO for writing [3]. Default
is “gb” which is short for Genbank. Note that Biopython SeqIO reads
more formats than it writes.
Filename is the path to the file where the sequence is to be
written. The filename is optional, if it is not given, the
description property (string) is used together with the format.
If obj is the Dseqrecord object, the default file name will be:
<obj.locus>.<f>
Where <f> is “gb” by default. If the filename already exists
AND the sequence it contains is different, a new file name will be
used so that the old file is not lost:
Dseq holds information for a double stranded DNA fragment.
Dseq also holds information describing the topology of
the DNA fragment (linear or circular).
Parameters:
watson (str) – a string representing the watson (sense) DNA strand.
crick (str, optional) – a string representing the crick (antisense) DNA strand.
ovhg (int, optional) – A positive or negative number to describe the stagger between the
watson and crick strands.
see below for a detailed explanation.
linear (bool, optional) – True indicates that sequence is linear, False that it is circular.
circular (bool, optional) – True indicates that sequence is circular, False that it is linear.
Examples
Dseq is a subclass of the Biopython Seq object. It stores two
strings representing the watson (sense) and crick (antisense) strands,
two properties called linear and circular, and a numeric value ovhg
(overhang) describing the stagger for the watson and crick strand
in the 5’ end of the fragment.
The most common usage is probably to create a Dseq object as a
part of a Dseqrecord object (see pydna.dseqrecord.Dseqrecord).
There are three ways of creating a Dseq object directly listed below, but you can also
use the function Dseq.from_full_sequence_and_overhangs() to create a Dseq:
The given string will be interpreted as the watson strand of a
blunt, linear double stranded sequence object. The crick strand
is created automatically from the watson strand.
If both watson and crick are given, but not ovhg an attempt
will be made to find the best annealing between the strands.
There are limitations to this. For long fragments it is quite
slow. The length of the annealing sequences has to be at least
half the length of the shortest of the strands.
Three arguments (string, string, ovhg=int):
The ovhg parameter is an integer describing the length of the
crick strand overhang in the 5’ end of the molecule.
The ovhg parameter controls the stagger at the five prime end:
If the ovhg parameter is specified a crick strand also
needs to be supplied, otherwise an exception is raised.
>>> Dseq(watson="agt",ovhg=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pydna_/dsdna.py", line 169, in __init__
    else:
ValueError: ovhg defined without crick strand!
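The stagger controlled by ovhg, as described above, can be illustrated without pydna. The sketch below assumes that a positive ovhg means the crick strand protrudes at the 5’ end of the molecule (so the watson strand is indented) and a negative value means the opposite; `render_stagger` is a hypothetical helper, not a pydna function.

```python
def render_stagger(watson, crick, ovhg):
    """Return (top_line, bottom_line, length) for a staggered duplex.

    Both strands are given 5'->3', so the crick strand is reversed for
    display. Assumption: positive ovhg indents watson, negative indents crick.
    """
    top = " " * max(ovhg, 0) + watson
    bottom = " " * max(-ovhg, 0) + crick[::-1]
    return top, bottom, max(len(top), len(bottom))


for ovhg in (2, -2):
    top, bottom, length = render_stagger("agt", "actta", ovhg)
    print("ovhg = {}, length = {}".format(ovhg, length))
    print(top)
    print(bottom)
```

The returned length spans both strands, which is why it can exceed the length of either strand alone when they are staggered.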
The shape of the fragment is set by circular = True or False.
Note that both ends of the DNA fragment have to be compatible to set
circular = True.
Fill in five prime protruding ends and chew back
three prime protruding ends using a DNA polymerase providing both
5’-3’ DNA polymerase activity and 3’-5’ nuclease activity
(such as T4 DNA polymerase). This can be done in the presence of any
combination of the four nucleotides A, G, C or T. Removing one or more nucleotides
can facilitate engineering of sticky ends. The default is all four nucleotides together.
Returns False if:
- Cut positions fall outside the sequence (could be moved to Biopython)
- Overhang is not double stranded
- Recognition site is not double stranded or is outside the sequence
- For enzymes that cut twice, it checks that at least one possibility is valid
Fill in of five prime protruding end with a DNA polymerase
that has only DNA polymerase activity (such as exo-klenow [4])
and any combination of A, G, C or T. Default are all four
nucleotides together.
For a given cut expressed as ((cut_watson, ovhg), enz), returns
a tuple (cut_watson, cut_crick, ovhg).
cut_watson: see get_cutsites docs
cut_crick: equivalent of cut_watson in the crick strand
ovhg: see get_cutsites docs
The cut can be None if it represents the left or right end of the sequence.
Then it will return the position of the watson and crick ends with respect
to the “full sequence”. The is_left parameter is only used in this case.
Returns pairs of cutsites that render the edges of the resulting fragments.
A fragment produced by restriction is represented by a tuple of length 2 that
may contain cutsites or None:
Two cutsites: represents the extraction of a fragment between those two
cutsites, in that orientation. To represent the opening of a circular
molecule with a single cutsite, we put the same cutsite twice.
None, cutsite: represents the extraction of a fragment between the left
edge of linear sequence and the cutsite.
cutsite, None: represents the extraction of a fragment between the cutsite
and the right edge of a linear sequence.
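The pairing rules above can be sketched in a few lines. This is an illustration under stated assumptions, not pydna's code: cutsites are treated as opaque values already sorted by position, and `cutsite_pairs` is a hypothetical name.

```python
def cutsite_pairs(cutsites, circular):
    """Pair up sorted cutsites into the edges of the resulting fragments.

    None stands for the left or right end of a linear sequence. On a
    circular molecule the last cutsite pairs with the first, so a single
    cutsite is paired with itself.
    """
    if not cutsites:
        return []
    if circular:
        return list(zip(cutsites, cutsites[1:] + cutsites[:1]))
    padded = [None] + list(cutsites) + [None]
    return list(zip(padded, padded[1:]))


print(cutsite_pairs(["c1", "c2"], circular=False))
print(cutsite_pairs(["c1"], circular=True))
```

Using None as an end marker keeps the linear and circular cases in the same tuple shape, so downstream code can treat every fragment uniformly.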
Returns a list of cutsites, represented as ((cut_watson, ovhg), enz):
cut_watson is a positive integer contained in [0,len(seq)), where seq is the sequence
that will be cut. It represents the position of the cut on the watson strand, using the full
sequence as a reference. By “full sequence” I mean the one you would get from str(Dseq).
ovhg is the overhang left after the cut. It has the same meaning as ovhg in
the Bio.Restriction enzyme objects, or pydna’s Dseq property.
enz is the enzyme object. It’s not necessary to perform the cut, but can be
used to keep track of which enzyme was used.
Cuts are only returned if the recognition site and overhang are on the double-strand
part of the sequence.
This can only be done if the two ends are compatible,
otherwise a TypeError is raised.
Examples
>>> from pydna.dseq import Dseq
>>> a=Dseq("catcgatc")
>>> a
Dseq(-8)
catcgatc
gtagctag
>>> a.looped()
Dseq(o8)
catcgatc
gtagctag
>>> a.T4("t")
Dseq(-8)
catcgat
 tagctag
>>> a.T4("t").looped()
Dseq(o7)
catcgat
gtagcta
>>> a.T4("a")
Dseq(-8)
catcga
  agctag
>>> a.T4("a").looped()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pydna/dsdna.py", line 357, in looped
    if type5 == type3 and str(sticky5) == str(rc(sticky3)):
TypeError: DNA cannot be circularized.
5' and 3' sticky ends not compatible!
>>>
Transcribe a DNA sequence into RNA and return the RNA sequence as a new Seq object.
Following the usual convention, the sequence is interpreted as the
coding strand of the DNA double helix, not the template strand. This
means we can get the RNA sequence just by switching T to U.
As Seq objects are immutable, a TypeError is raised if
transcribe is called on a Seq object with inplace=True.
Trying to transcribe an RNA sequence has no effect.
If you have a nucleotide sequence which might be DNA or RNA
(or even a mixture), calling the transcribe method will ensure
any T becomes U.
Trying to transcribe a protein sequence will replace any
T for Threonine with U for Selenocysteine, which has no
biologically plausible rationale.
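Because the sequence is read as the coding strand, transcription reduces to a character substitution. The sketch below works on plain strings and is only an illustration of that rule; it is not the Biopython method itself.

```python
def transcribe(dna):
    """Return the RNA equivalent of a coding-strand DNA string (T -> U)."""
    return dna.replace("T", "U").replace("t", "u")


print(transcribe("ATGTTCCTACATGA"))
print(transcribe("AUGUUC"))  # already RNA, so nothing changes
```

The second call shows why transcribing an RNA sequence has no effect: there is no T left to replace.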
If no sequences are found, an empty list is returned. This is a greedy
function, use carefully.
Parameters:
data (string or iterable) –
The data parameter is a string containing:
1. an absolute path to a local file. The file will be read in text mode and parsed for EMBL, FASTA and Genbank sequences. Can be a string or a Path object.
2. a string containing one or more sequences in EMBL, GENBANK, or FASTA format. Mixed formats are allowed.
3. data can be a list or other iterable where the elements are 1 or 2
ds (bool) – If True double stranded Dseqrecord objects are returned.
If False single stranded Bio.SeqRecord[6] objects are
returned.
This function designs a forward primer and a reverse primer for PCR amplification
of a given template sequence.
The template argument is a Dseqrecord object or equivalent containing the template sequence.
The optional fp and rp arguments can contain an existing primer for the sequence (either the forward or reverse primer).
Only one of the two primers can be specified, not both (since then there is nothing to design; use the pydna.amplify.pcr function instead).
The limit argument is the minimum length of the primer. The default value is 13.
If one of the primers is given, the other primer is designed to match in terms of Tm.
If both primers are designed, they will be designed to target_tm.
tm_func is a function that takes an ASCII string representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but can be substituted for a custom made function.
estimate_function is a tm_func-like function that is used to get a first guess for the primer design, that is then used as starting
point for the final result. This is useful when the tm_func function is slow to calculate (e.g. it relies on an
external API, such as the NEB primer design API). The estimate_function should be faster than the tm_func function.
The default value is None.
To use the default tm_func as estimate function to get the NEB Tm faster, you can do:
primer_design(dseqr, target_tm=55, tm_func=tm_neb, estimate_function=tm_default).
The function returns a pydna.amplicon.Amplicon class instance. This object has
the object.forward_primer and object.reverse_primer properties which contain the designed primers.
fp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.
rp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.
target_tm (float, optional) – target tm for the primers, set to 55°C by default.
tm_func (function) – Function used for tm calculation. This function takes an ASCII string
representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but can be
substituted for a custom made function.
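A custom function only has to honour the contract stated above: take an ASCII oligonucleotide string and return a float. As a minimal sketch, the function below uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C); the name `tm_wallace` is hypothetical and the rule is a rough approximation intended for short oligonucleotides, not the method used by pydna.tm.

```python
def tm_wallace(oligo):
    """Estimate Tm in degrees C with the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = oligo.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return float(2 * at + 4 * gc)


print(tm_wallace("atgactgctaaccc"))
```

Under the signature shown earlier, such a function would presumably be passed as tm_func=tm_wallace; being cheap to evaluate, it could also serve as an estimate_function alongside a slower tm_func.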
This function returns a list of pydna.amplicon.Amplicon objects where
primers have been modified with tails so that the fragments can be fused in
the order they appear in the list by for example Gibson assembly or homologous
recombination.
we can modify the reverse primer of a and forward primer of b with tails to allow
fusion by fusion PCR, Gibson assembly or in-vivo homologous recombination.
The basic requirements for the primers for the three techniques are the same.
The first argument of this function is a list of sequence objects containing
Amplicons and other similar objects.
At least every second sequence object needs to be an Amplicon.
This rule exists because if a sequence object that is not a PCR product
is to be fused with another fragment, that other fragment needs to be an Amplicon
so that the primer of the other object can be modified to include the whole stretch
of sequence homology needed for the fusion. See the example below where a is a
non-amplicon (a linear plasmid vector for instance).
The overlap argument controls how many base pairs of overlap are required between
adjacent sequence fragments. In the junction between Amplicons, tails with a
length of about half of this value are added to the two primers
closest to the junction.
In the case of an Amplicon adjacent to a Dseqrecord object, the tail will
be twice as long (1*overlap) since the
recombining sequence is present entirely on this primer:
Note that if the sequence of DNA fragments starts or stops with an Amplicon,
the very first and very last primer will not be modified, i.e. assemblies are
always assumed to be linear. There are simple tricks around that for circular
assemblies, depicted in the last two examples below.
The maxlink argument controls the cut-off length for sequences that will be
synthesized by adding them to primers for the adjacent fragment(s). The
argument list may contain short spacers (such as spacers between fusion proteins).
Example 1: Linear assembly of PCR products (pydna.amplicon.Amplicon class objects)
>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon4
         >       <         >       <
                  ⇣
   pydna.design.assembly_fragments
                  ⇣
>       <-       ->       <-                 pydna.assembly.Assembly
Amplicon1         Amplicon3
         Amplicon2         Amplicon4     ➤   Amplicon1Amplicon2Amplicon3Amplicon4
        ->       <-       ->       <
Example 2: Linear assembly of alternating Amplicons and other fragments
>       <         >       <
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2
                  ⇣
   pydna.design.assembly_fragments
                  ⇣
>       <--     -->       <--                pydna.assembly.Assembly
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2     ➤   Amplicon1Dseqrecd1Amplicon2Dseqrecd2
Example 3: Linear assembly of alternating Amplicons and other fragments
Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2
         >       <       -->       <
                  ⇣
   pydna.design.assembly_fragments
                  ⇣
                                             pydna.assembly.Assembly
Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2     ➤   Dseqrecd1Amplicon1Dseqrecd2Amplicon2
       -->       <--     -->       <
Example 4: Circular assembly of alternating Amplicons and other fragments
                 ->       <==
Dseqrecd1         Amplicon2
         Amplicon1         Dseqrecd1
       -->       <-
                  ⇣
   pydna.design.assembly_fragments
                  ⇣
                                             pydna.assembly.Assembly
                 ->       <==
Dseqrecd1         Amplicon2                  -Dseqrecd1Amplicon1Amplicon2-
         Amplicon1                       ➤  |                             |
       -->       <-                          -----------------------------
Example 5: Circular assembly of Amplicons
>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
         >       <         >       <
                  ⇣
   pydna.design.assembly_fragments
                  ⇣
>       <=       ->       <-
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
        ->       <-       +>       <
                  ⇣
make new Amplicon using the Amplicon1.template and
the last fwd primer and the first rev primer.
                  ⇣
                                             pydna.assembly.Assembly
+>       <=       ->       <-
 Amplicon1         Amplicon3                 -Amplicon1Amplicon2Amplicon3-
          Amplicon2                      ➤  |                             |
         ->       <-                         -----------------------------
Parameters:
f (list of pydna.amplicon.Amplicon and other Dseqrecord like objects) – list Amplicon and Dseqrecord object for which fusion primers should be constructed.
overlap (int, optional) – Length of required overlap between fragments.
maxlink (int, optional) – Maximum length of spacer sequences that may be present in f. These will be included in tails for designed primers.
circular (bool, optional) – If True, the assembly is circular. If False, the assembly is linear.
>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.design import primer_design
>>> a=primer_design(Dseqrecord("atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"))
>>> b=primer_design(Dseqrecord("ccaaacccaccaggtaccttatgtaagtacttcaagtcgccagaagacttcttggtcaagttgcc"))
>>> c=primer_design(Dseqrecord("tgtactggtgctgaaccttgtatcaagttgggtgttgacgccattgccccaggtggtcgtttcgtt"))
>>> from pydna.design import assembly_fragments
>>> # We would like a circular recombination, so the first sequence has to be repeated
>>> fa1,fb,fc,fa2=assembly_fragments([a,b,c,a])
>>> # Since all fragments are Amplicons, we need to extract the rp of the 1st and fp of the last fragments.
>>> from pydna.amplify import pcr
>>> fa=pcr(fa2.forward_primer,fa1.reverse_primer,a)
>>> [fa,fb,fc]
[Amplicon(100), Amplicon(101), Amplicon(102)]
>>> fa.name,fb.name,fc.name="fa fb fc".split()
>>> from pydna.assembly import Assembly
>>> assemblyobj=Assembly([fa,fb,fc])
>>> assemblyobj
Assembly
fragments....: 100bp 101bp 102bp
limit(bp)....: 25
G.nodes......: 6
algorithm....: common_sub_strings
>>> assemblyobj.assemble_linear()
[Contig(-231), Contig(-166), Contig(-36)]
>>> assemblyobj.assemble_circular()[0].seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> (a+b+c).looped().seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> print(assemblyobj.assemble_circular()[0].figure())
 -|fa|36
|     \/
|     /\
|     36|fb|36
|           \/
|           /\
|           36|fc|36
|                 \/
|                 /\
|                 36-
|                    |
 --------------------
>>>
Compares two or more DNA sequences for equality i.e. if they
represent the same DNA molecule.
Two linear sequences are considered equal if either:
They have the same sequence (case insensitive)
One sequence is the reverse complement of the other
Two circular sequences are considered equal if they are circular
permutations meaning that they have the same length and:
One sequence can be found in the concatenation of the other sequence with itself.
The reverse complement of one sequence can be found in the concatenation of the other sequence with itself.
The topology for the comparison can be chosen by setting one of the keywords
linear or circular to True or False.
If circular or linear is not set, it will be deduced from the topology of
each sequence for sequences that have a linear or circular attribute
(like Dseq and Dseqrecord).
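The equality rules above can be written out for plain strings. This is a simplified sketch of the stated rules only: the topology is passed explicitly rather than deduced from the objects, only the letters A, C, G and T are complemented, and the names `rc` and `same_molecule` are hypothetical.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def rc(seq):
    """Reverse complement of a lower-case a/c/g/t string."""
    return seq.translate(COMPLEMENT)[::-1]


def same_molecule(seq1, seq2, circular=False):
    """Apply the linear or circular equality rules to two strings."""
    a, b = seq1.lower(), seq2.lower()
    if len(a) != len(b):
        return False
    if circular:
        doubled = b + b  # every rotation of b is a substring of b + b
        return a in doubled or rc(a) in doubled
    return a == b or a == rc(b)


print(same_molecule("AAGC", "GCAA"))                 # linear comparison
print(same_molecule("AAGC", "GCAA", circular=True))  # circular comparison
```

Searching in the doubled sequence is a compact way to test all circular permutations at once; the explicit length check prevents a shorter sequence from matching as a mere substring.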
This function takes a string containing one genbank sequence
in Genbank format and returns a named tuple containing two fields,
the gbtext containing a string with the corrected genbank sequence and
jseq which contains the JSON intermediate.
Examples
>>> s = '''LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
... DEFINITION .
... ACCESSION
... VERSION
... SOURCE .
... ORGANISM .
... COMMENT
... COMMENT ApEinfo:methylated:1
... ORIGIN
... 1 aaa
... //'''
>>> from pydna.readers import read
>>> read(s)
... /site-packages/Bio/GenBank/Scanner.py:1388: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013\n'
"correct?\n:%r" % line, BiopythonParserWarning)
Traceback (most recent call last):
File "... /pydna/readers.py", line 48, in read
results = results.pop()
IndexError: pop from empty list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "... /pydna/readers.py", line 50, in read
raise ValueError("No sequences found in data:\n({})".format(data[:79]))
ValueError: No sequences found in data:
(LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
DEFINITI)
>>> from pydna.genbankfixer import gbtext_clean
>>> s2, j2 = gbtext_clean(s)
>>> print(s2)
LOCUS New_DNA 3 bp ds-DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION
VERSION
SOURCE .
ORGANISM .
COMMENT
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
>>> s3 = read(s2)
>>> s3
Dseqrecord(o3)
>>> print(s3.format())
LOCUS New_DNA 3 bp DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION New_DNA
VERSION New_DNA
KEYWORDS .
SOURCE
ORGANISM .
.
COMMENT
ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
The Amplicon class holds information about a PCR reaction involving two
primers and one template. This class is used by the Anneal class and is not
meant to be instantiated directly.
Parameters:
forward_primer (SeqRecord(Biopython)) – SeqRecord object holding the forward (sense) primer
reverse_primer (SeqRecord(Biopython)) – SeqRecord object holding the reverse (antisense) primer
template (Dseqrecord) – Dseqrecord object holding the template (circular or linear)
This module provides the Anneal class and the pcr() function
for PCR simulation. The pcr function is simpler to use, but expects only one
PCR product. The Anneal class should be used if more flexibility is required.
Primers with 5’ tails as well as inverse PCR on circular templates are handled
correctly.
Assembly of sequences by homologous recombination.
Should also be useful for related techniques such as Gibson assembly and fusion
PCR. Given a list of sequences (Dseqrecords), all sequences are analyzed for
shared homology longer than the set limit.
A graph is constructed where each overlapping region forms a node and
sequences separating the overlapping regions form edges.
Improved implementation of the assembly module. To see a list of issues with the previous implementation,
see [issues tagged with fixed-with-new-assembly-model](pydna-group/pydna#issues)
Turn a list of locations into a list of tuples of those locations, where each tuple contains
locations that overlap. For example, if locs = [loc1, loc2, loc3], and loc1 and loc2 overlap,
the output will be [(loc1, loc2), (loc3,)].
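The grouping described above can be sketched with plain intervals. The code below makes several assumptions for illustration: locations are half-open (start, end) integer pairs rather than Biopython location objects, grouping is transitive (a location joins a group if it overlaps anything already in it), and `group_overlapping` is a hypothetical name.

```python
def group_overlapping(locs):
    """Group half-open (start, end) intervals that overlap into tuples."""
    groups = []
    current, current_end = [], None
    for start, end in sorted(locs):
        if current and start < current_end:
            # Overlaps the group collected so far: extend it.
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            if current:
                groups.append(tuple(current))
            current, current_end = [(start, end)], end
    if current:
        groups.append(tuple(current))
    return groups


print(group_overlapping([(0, 5), (3, 8), (10, 12)]))
```

Sorting first means a single pass is enough, because any interval overlapping the current group must start before the group's furthest end.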
tuple[tuple[str, str], tuple[str, str]] – A tuple of two tuples, each containing the type of end (‘5’’, ‘3’’, or ‘blunt’)
and the sequence of the overhang. The first tuple is for the left end, second for the right end.
Assembly algorithm to find blunt overlaps. Used for blunt ligation.
It basically returns [(len(seqx), 0, 0)] if the right end of seqx is blunt and the
left end of seqy is blunt (compatible with blunt ligation). Otherwise, it returns an empty list.
Parameters:
seqx (_Dseqrecord) – The first sequence
seqy (_Dseqrecord) – The second sequence
limit (int) – There for compatibility, but it is ignored
Returns:
list[SequenceOverlap] – A list of overlaps between the two sequences
Assembly algorithm to find common substrings of length == limit. See the docs of
the function common_sub_strings_str for more details. It is case insensitive.
Starting from the rightmost edge of the match, return a new match encompassing the max
number of bases. This can be used to return a longer match if a primer aligns for longer
than the limit or a shorter match if there are mismatches. This is convenient to maintain
as many features as possible. It is used in PCR assembly.
>>> seq = _Dseqrecord('AAAAACGTCCCGT')
>>> primer = _Dseqrecord('ACGTCCCGT')
>>> match = (13, 9, 0) # an empty match at the end of each
>>> zip_match_leftwards(seq, primer, match)
(4, 0, 9)
Works in circular molecules if the match spans the origin:
>>> seq = _Dseqrecord('TCCCGTAAAAACG', circular=True)
>>> primer = _Dseqrecord('ACGTCCCGT')
>>> match = (6, 9, 0)
>>> zip_match_leftwards(seq, primer, match)
(10, 0, 9)
Transform a Dseqrecord to a sequence string where U is replaced by T, everything is upper case and
circular sequences are repeated twice. This is used for PCR, to support primers with U’s (e.g. for USER cloning).
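The transformation described above is easy to state on plain strings. This sketch takes the sequence and its topology as separate arguments instead of a Dseqrecord, and the name `pcr_search_string` is made up for illustration.

```python
def pcr_search_string(sequence, circular=False):
    """Upper-case the sequence, replace U with T, and double it if circular."""
    text = sequence.upper().replace("U", "T")
    return text * 2 if circular else text


print(pcr_search_string("acgu"))
print(pcr_search_string("acgu", circular=True))
```

Doubling a circular sequence lets an ordinary substring search find primer sites that span the origin.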
Assembly algorithm to find overlaps between a primer and a template. It accepts mismatches.
When there are mismatches, it only returns the common part between the primer and the template.
If seqx is a primer and seqy is a template, it represents the binding of a forward primer.
If seqx is a template and seqy is a primer, it represents the binding of a reverse primer,
where the primer has been passed as its reverse complement (see examples).
Convert an assembly to a string representation, for example:
((1, 2, [8:14], [1:7]),(2, 3, [10:17], [1:8]))
becomes:
('1[8:14]:2[1:7]', '2[10:17]:3[1:8]')
The reason for this is that by default, a feature ‘[8:14]’ when present in a tuple
is printed to the console as SimpleLocation(ExactPosition(8),ExactPosition(14),strand=1) (very long).
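The conversion can be sketched with locations written as plain (start, end) tuples instead of Biopython location objects. That substitution and the name assembly_to_str are assumptions made for this illustration.

```python
def assembly_to_str(assembly):
    """Format (u, v, loc_u, loc_v) edges as compact 'u[s:e]:v[s:e]' strings.

    Locations are (start, end) tuples here, standing in for location objects.
    """
    return tuple(
        f"{u}[{loc_u[0]}:{loc_u[1]}]:{v}[{loc_v[0]}:{loc_v[1]}]"
        for u, v, loc_u, loc_v in assembly
    )


assembly = ((1, 2, (8, 14), (1, 7)), (2, 3, (10, 17), (1, 8)))
print(assembly_to_str(assembly))
```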
Based on the topology of the locations of an assembly, determine if it is circular.
This does not work for insertion assemblies, that’s why assemble takes the optional argument is_insertion.
Turn this kind of edge representation (fragment 1, fragment 2, right edge on 1, left edge on 2):
a = [(1, 2, 'loc1a', 'loc2a'), (2, 3, 'loc2b', 'loc3b'), (3, 1, 'loc3c', 'loc1c')]
Into this (fragment 1, left edge on 1, right edge on 1):
b = [(1, 'loc1c', 'loc1a'), (2, 'loc2a', 'loc2b'), (3, 'loc3b', 'loc3c')]
Turn this kind of subfragment representation (fragment 1, left edge on 1, right edge on 1):
a = [(1, 'loc1c', 'loc1a'), (2, 'loc2a', 'loc2b'), (3, 'loc3b', 'loc3c')]
Into this (fragment 1, fragment 2, right edge on 1, left edge on 2):
b = [(1, 2, 'loc1a', 'loc2a'), (2, 3, 'loc2b', 'loc3b'), (3, 1, 'loc3c', 'loc1c')]
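Both conversions can be sketched for the circular case shown above. The function names are hypothetical and the sketch covers circular assemblies only. A linear assembly would need separate handling of the first and last fragment, which have no left and right edge respectively.

```python
def edges_to_subfragments_circular(edges):
    """(u, v, right edge on u, left edge on v) -> (u, left edge on u, right edge on u)."""
    subfragments = []
    for i, (u, _v, loc_u, _loc_v) in enumerate(edges):
        # The left edge of u comes from the previous edge (wrapping around).
        left_on_u = edges[i - 1][3]
        subfragments.append((u, left_on_u, loc_u))
    return subfragments


def subfragments_to_edges_circular(subfragments):
    """(u, left edge on u, right edge on u) -> (u, v, right edge on u, left edge on v)."""
    n = len(subfragments)
    edges = []
    for i, (u, _left, right) in enumerate(subfragments):
        v, left_on_v, _ = subfragments[(i + 1) % n]
        edges.append((u, v, right, left_on_v))
    return edges


a = [(1, 2, "loc1a", "loc2a"), (2, 3, "loc2b", "loc3b"), (3, 1, "loc3c", "loc1c")]
b = edges_to_subfragments_circular(a)
print(b)
print(subfragments_to_edges_circular(b) == a)
```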
From the fragment representation returned by edge_representation2subfragment_representation, get the subfragments that are joined together.
Subfragments are the slices of the fragments that are joined together
For example:
  --A--
TACGTAAT
  --B--
 TCGTAACGA

Gives: TACGTAA / CGTAACGA
To reproduce:
a = Dseqrecord('TACGTAAT')
b = Dseqrecord('TCGTAACGA')
f = Assembly([a, b], limit=5)
a0 = f.get_linear_assemblies()[0]
print(assembly2str(a0))
a0_subfragment_rep = edge_representation2subfragment_representation(a0, False)
for f in get_assembly_subfragments([a, b], a0_subfragment_rep):
    print(f.seq)
# prints TACGTAA and CGTAACGA
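The slicing step can be illustrated on plain strings. In this sketch, locations are (start, end) tuples, None marks a missing edge at the end of a linear assembly, and fragments are numbered from 1. All of these conventions, and the function name, are assumptions for illustration.

```python
def get_subfragments(fragments, subfragment_rep):
    """Slice each fragment from the start of its left edge to the end of its right edge."""
    result = []
    for index, left, right in subfragment_rep:
        seq = fragments[index - 1]  # fragments are numbered from 1 in this sketch
        start = 0 if left is None else left[0]
        end = len(seq) if right is None else right[1]
        result.append(seq[start:end])
    return result


fragments = ["TACGTAAT", "TCGTAACGA"]
# The shared 5 bp stretch CGTAA sits at [2:7] in the first and [1:6] in the second fragment.
rep = [(1, None, (2, 7)), (2, (1, 6), None)]
print(get_subfragments(fragments, rep))
```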
Assembly of a list of DNA fragments into linear or circular constructs.
Accepts a list of Dseqrecords (source fragments) to
initiate an Assembly object. Several methods are available for analysis
of overlapping sequences, graph construction and assembly.
The assembly contains a directed graph, where nodes represent fragments and
edges represent overlaps between fragments.
The node keys are integers, representing the index of the fragment in the
input list of fragments. The sign of the node key represents the orientation
of the fragment, positive for forward orientation, negative for reverse orientation.
The edges contain the locations of the overlaps in the fragments. For an edge (u, v, key):
u and v are the nodes connected by the edge.
key is a string that represents the location of the overlap. In the format:
‘u[start:end](strand):v[start:end](strand)’.
Edges have a ‘locations’ attribute, which is a list of two FeatureLocation objects,
representing the location of the overlap in the u and v fragment, respectively.
You can think of an edge as a representation of the join of two fragments.
If fragment 1 and 2 share a subsequence of 6bp, [8:14] in fragment 1 and [1:7] in fragment 2,
there will be 4 edges representing that overlap in the graph, for all possible
orientations of the fragments (see add_edges_from_match for details):
(1,2,'1[8:14]:2[1:7]')
(2,1,'2[1:7]:1[8:14]')
(-1,-2,'-1[0:6]:-2[10:16]')
(-2,-1,'-2[10:16]:-1[0:6]')
An assembly can be thought of as a tuple of graph edges, but instead of representing them with node indexes and keys, we represent them
as u, v, locu, locv, where u and v are the nodes connected by the edge, and locu and locv are the locations of the overlap in the first
and second fragment. An assembly is then a tuple of such (u, v, locu, locv) tuples, for example
((1, 2, [8:14], [1:7]), (2, 3, [10:17], [1:8])).
limit (int, optional) – The shortest shared homology to be considered, this is passed as the third argument to the algorithm function.
For certain algorithms, this might be ignored.
algorithm (function, optional) – The algorithm used to determine the shared sequences. It’s a function that takes two Dseqrecord objects as inputs,
and will get passed the third argument (limit), that may or may not be used. It must return a list of overlaps
(see common_sub_strings for an example).
use_fragment_order (bool, optional) – It’s set to True by default to reproduce legacy pydna behaviour: only assemblies that start with the first fragment and end with the last are considered.
You should set it to False.
use_all_fragments (bool, optional) – Constrain the assembly to use all fragments.
Examples
from pydna.assembly2 import Assembly, assembly2str
from pydna.dseqrecord import Dseqrecord
Add edges to the graph from a match returned by the algorithm function (see pydna.common_substrings). For
the format of edges, see the documentation of the Assembly class.
Matches are directional, because not all algorithm functions return the same match for (u,v) and (v,u). For example,
homologous recombination does but sticky end ligation does not. The function returns two edges:
Fragments in the orientation they were passed, with locations of the match (u, v, loc_u, loc_v)
Reverse complement of the fragments with inverted order, with flipped locations (-v, -u, flip(loc_v), flip(loc_u))
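The two edges can be sketched with (start, end) tuples as locations. Flipping a location onto the reverse complement needs the fragment length. The lengths 14 and 17 below are assumptions chosen so that the result agrees with the '-2[10:16]:-1[0:6]' key shown in the Assembly class description. Function names are hypothetical.

```python
def flip(location, fragment_length):
    """Map a (start, end) location onto the reverse complement of a fragment."""
    start, end = location
    return (fragment_length - end, fragment_length - start)


def edges_from_match(u, v, loc_u, loc_v, len_u, len_v):
    """Return the forward edge and its reverse-complement counterpart."""
    forward = (u, v, loc_u, loc_v)
    reverse = (-v, -u, flip(loc_v, len_v), flip(loc_u, len_u))
    return [forward, reverse]


# Fragment lengths (14 and 17) are illustrative assumptions.
for edge in edges_from_match(1, 2, (8, 14), (1, 7), len_u=14, len_v=17):
    print(edge)
```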
Get linear assemblies, applying the constraints described in __init__, ensuring that paths represent
real assemblies (see assembly_is_valid). Subassemblies are removed (see remove_subassemblies).
Convert a node path in the format [1, 2, 3] (as returned by _nx.cycles.simple_cycles) to a list of all
possible assemblies.
There may be multiple assemblies for a given node path, if there are several edges connecting two nodes,
for example two overlaps between 1 and 2, and a single overlap between 2 and 3 should return 2 assemblies
(one for each way of choosing an edge between 1 and 2).
Get the number of possible assemblies from a list of node paths. Basically, for each path
passed as a list of integers / nodes, we calculate the number of paths possible connecting
the nodes in that order, given the graph (all the edges connecting them).
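The count for one node path is the product of the number of edges between each pair of consecutive nodes. A minimal sketch, assuming the graph is summarised as a dictionary mapping (u, v) to the number of parallel edges (a stand-in for the real multigraph; names are hypothetical):

```python
from math import prod


def count_assemblies(edge_counts, node_path, circular=False):
    """Number of distinct edge choices along node_path."""
    pairs = list(zip(node_path, node_path[1:]))
    if circular:
        # A cycle also needs an edge from the last node back to the first.
        pairs.append((node_path[-1], node_path[0]))
    return prod(edge_counts.get(pair, 0) for pair in pairs)


# Two overlaps between 1 and 2, one between 2 and 3, one between 3 and 1.
edge_counts = {(1, 2): 2, (2, 3): 1, (3, 1): 1}
print(count_assemblies(edge_counts, [1, 2, 3]))
print(count_assemblies(edge_counts, [1, 2, 3], circular=True))
```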
Sorts the fragment representing a cycle so that they represent an insertion assembly if possible,
else returns None.
Here we check if one of the joins between fragments represents the edges of an insertion assembly
The fragment must be linear, and the join must be as indicated below
The above example will be [(1, 2, [4:6], [0:2]), (2, 3, [6:8], [0:2]), (3, 1, [8:10], [9:11])]
These could be returned in any order by simple_cycles, so we sort the edges so that the first
and last u and v match the fragment that gets the insertion (1 in the example above).
Assemblies that represent the insertion of a fragment or series of fragments inside a linear construct. For instance,
digesting CCCCGAATTCCCCGAATTC with EcoRI and inserting the fragment with two overhangs into the EcoRI site of AAAGAATTCAAA.
This is not so much meant for the use-case of linear fragments that represent actual linear fragments, but for linear
fragments that represent a genome region. This can then be used to simulate homologous recombination.
Get a dictionary where the keys are the nodes in the graph, and the values are dictionaries with keys
left, right, containing (for each fragment) the locations where the fragment is joined to another fragment on its left
and right side. The values in left and right are often the same, except in restriction-ligation with partial overlap enabled,
where we can end up with a situation like this:
GGTCTCCCCAATT and aGGTCTCCAACCAA as fragments
# Partial overlap in assembly 1[9:11]:2[8:10]
GGTCTCCxxAACCAA
CCAGAGGGGTTxxTT
# Partial overlap in 2[10:12]:1[7:9]
aGGTCTCCxxCCAATT
tCCAGAGGTTGGxxAA
Check whether only adjacent edges within each fragment are used in the assembly. This is useful to check if a cut and ligate assembly is valid,
and prevent including partially digested fragments. For example, imagine the following fragment being an input for a digestion
and ligation assembly, where the enzyme cuts at the sites indicated by the vertical lines:
       x       y       z
-------|-------|-------|---------
We would only want assemblies that contain subfragments start-x, x-y, y-z, z-end, and not start-x, y-end, for instance.
The latter would indicate that the fragment was partially digested.
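The adjacency rule can be sketched for a single linear fragment. Cut positions and subfragments are plain integers and (start, end) tuples, and the function name is hypothetical. The positions 7, 15 and 23 and the length 33 are read off the diagram above and are purely illustrative.

```python
def uses_adjacent_cuts(fragment_length, cut_positions, subfragments):
    """True if every subfragment spans two neighbouring boundaries."""
    boundaries = [0] + sorted(cut_positions) + [fragment_length]
    allowed = set(zip(boundaries, boundaries[1:]))
    return all(sub in allowed for sub in subfragments)


cuts = [7, 15, 23]  # x, y and z in the diagram
print(uses_adjacent_cuts(33, cuts, [(0, 7), (7, 15)]))   # start-x, x-y
print(uses_adjacent_cuts(33, cuts, [(0, 7), (15, 33)]))  # start-x, y-end
```

The second call returns False because y-end skips the cut at z, which corresponds to a partially digested fragment.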
An assembly that represents a PCR, where fragments is a list of primer, template, primer (in that order).
It always uses the primer_template_overlap algorithm and accepts the mismatches argument to indicate
the number of mismatches allowed in the overlap. Only supports substitution mismatches, not indels.
Get linear assemblies, applying the constraints described in __init__, ensuring that paths represent
real assemblies (see assembly_is_valid). Subassemblies are removed (see remove_subassemblies).
Assemblies that represent the insertion of a fragment or series of fragments inside a linear construct. For instance,
digesting CCCCGAATTCCCCGAATTC with EcoRI and inserting the fragment with two overhangs into the EcoRI site of AAAGAATTCAAA.
This is not so much meant for the use-case of linear fragments that represent actual linear fragments, but for linear
fragments that represent a genome region. This can then be used to simulate homologous recombination.
Get linear assemblies, applying the constraints described in __init__, ensuring that paths represent
real assemblies (see assembly_is_valid). Subassemblies are removed (see remove_subassemblies).
In the example below, we plan to assemble a plasmid from a backbone and an insert, using the EcoRI and SalI enzymes.
Note how 2 circular products are returned, one contains the insert (acgt)
and the desired part of the backbone (cccccc), the other contains the
reversed insert (tgga) and the cut-out part of the backbone (aaa).
Returns the products for Golden Gate assembly. This is the same as
restriction ligation assembly, but with a different name. Check the documentation
for restriction_ligation_assembly for more details.
Parameters:
frags (list[_Dseqrecord]) – List of DNA fragments to assemble
enzymes (list[_AbstractCut]) – List of restriction enzymes to use
allow_blunt (bool, optional) – If True, allow blunt end ligations, by default True
circular_only (bool, optional) – If True, only return circular assemblies, by default False
In the example below, we plan to assemble a plasmid from a backbone and an insert,
using the EcoRI enzyme. The insert and insertion site in the backbone are flanked by
EcoRI sites, so there are two possible products depending on the orientation of the insert.
Returns the products for Gateway assembly / Gateway cloning.
Parameters:
frags (list[_Dseqrecord]) – List of DNA fragments to assemble
reaction_type (Literal['BP', 'LR']) – Type of Gateway reaction
greedy (bool, optional) – If True, use greedy gateway consensus sites, by default False
circular_only (bool, optional) – If True, only return circular assemblies, by default False
multi_site_only (bool, optional) – If True, only return products where 2 sites recombined. Even if input sequences
contain multiple att sites (typically 2), a product could be generated where only one
site recombines. That’s typically not what you want, so you can set this to True to
only return products where both att sites recombined.
Now let’s understand the multi_site_only parameter. Let’s consider a case where we are swapping fragments
between two plasmids using an LR reaction. Experimentally, we expect to obtain two plasmids, resulting from the
swapping between the two att sites. That’s what we get if we set multi_site_only to True.
However, if we set multi_site_only to False, we get 4 products, which also include the intermediate products
where the two plasmids are combined into a single one through recombination of a single att site. This is an
intermediate of the reaction, and typically we don’t want it:
Returns the products resulting from the integration of an insert (or inserts joined
through in vivo recombination) into the genome through homologous recombination.
Parameters:
genome (_Dseqrecord) – Target genome sequence
inserts (list[_Dseqrecord]) – DNA fragment(s) to insert
Example of a homologous recombination event, where a plasmid is excised from the
genome (circular sequence of 25 bp), and that part is removed from the genome,
leaving a shorter linear sequence (32 bp).
Returns the products resulting from the integration of an insert (or inserts joined
through cre-lox recombination among them) into the genome through cre-lox integration.
Also works with lox66 and lox71 (see pydna.cre_lox for more details).
Parameters:
genome (_Dseqrecord) – Target genome sequence
inserts (list[_Dseqrecord] or _Dseqrecord) – DNA fragment(s) to insert
Below an example with lox66 and lox71 (irreversible integration).
Here, the result of excision is still returned because there is a low
probability of it happening, but it’s considered a rare event.
Finds the flanking common substrings between stringx and stringy
longer than limit. This means that the results only contain substrings
that start or end at the ends of stringx and stringy.
This function is case sensitive.
Returns a list of tuples describing the substrings.
The list is sorted longest -> shortest.
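One simplified reading of this description is a search for stretches where the end of one string equals the start of the other. The sketch below implements only that suffix/prefix case and is not pydna's algorithm. The (start in x, start in y, length) tuple layout and the strict "longer than limit" comparison are assumptions taken from the wording above, and the function name is hypothetical.

```python
def terminal_overlaps_sketch(stringx, stringy, limit):
    """Find suffix/prefix overlaps longer than limit (case sensitive)."""
    found = set()
    max_len = min(len(stringx), len(stringy))
    for length in range(limit + 1, max_len + 1):
        if stringx[-length:] == stringy[:length]:  # end of x matches start of y
            found.add((len(stringx) - length, 0, length))
        if stringy[-length:] == stringx[:length]:  # end of y matches start of x
            found.add((0, len(stringy) - length, length))
    # Longest first, as described above.
    return sorted(found, key=lambda t: (-t[2], t[0], t[1]))


print(terminal_overlaps_sketch("ATGCCCGGG", "CCGGGTTTT", 3))
```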
This module contains functions for primer design for various purposes.
primer_design designs primers for a sequence, or a matching primer for an existing primer. It returns an Amplicon object (the same type the amplify module returns).
assembly_fragments adds tails to primers for a linear assembly through homologous recombination or Gibson assembly.
circular_assembly_fragments adds tails to primers for a circular assembly through homologous recombination or Gibson assembly.
This function designs a forward primer and a reverse primer for PCR amplification
of a given template sequence.
The template argument is a Dseqrecord object or equivalent containing the template sequence.
The optional fp and rp arguments can contain an existing primer for the sequence (either the forward or reverse primer).
Only one of the two primers can be specified, not both: if both are given there is nothing to design, so use the pydna.amplify.pcr function instead.
The limit argument is the minimum length of the primer. The default value is 13.
If one of the primers is given, the other primer is designed to match in terms of Tm.
If both primers are designed, they will be designed to match target_tm.
tm_func is a function that takes an ascii string representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but can be substituted for a custom made function.
estimate_function is a tm_func-like function that is used to get a first guess for the primer design, that is then used as starting
point for the final result. This is useful when the tm_func function is slow to calculate (e.g. it relies on an
external API, such as the NEB primer design API). The estimate_function should be faster than the tm_func function.
The default value is None.
To use the default tm_func as estimate function to get the NEB Tm faster, you can do:
primer_design(dseqr, target_tm=55, tm_func=tm_neb, estimate_function=tm_default).
The function returns a pydna.amplicon.Amplicon class instance. This object has
the object.forward_primer and object.reverse_primer properties which contain the designed primers.
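The roles of limit, target_tm and tm_func can be illustrated with a deliberately simple design loop on plain strings. This is not pydna's algorithm. The function names are hypothetical, and the Tm function is the rough Wallace rule (2 °C per A/T, 4 °C per G/C), used here only as a stand-in for the functions in pydna.tm.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    return sequence.translate(COMPLEMENT)[::-1]


def wallace_tm(primer):
    """Rough Tm estimate: 2 degrees per A/T and 4 degrees per G/C."""
    s = primer.upper()
    return 2.0 * (s.count("A") + s.count("T")) + 4.0 * (s.count("G") + s.count("C"))


def grow_primer(template, target_tm=55.0, limit=13, tm_func=wallace_tm):
    """Extend a primer from the template start until tm_func reaches target_tm."""
    length = limit  # never shorter than the minimum length
    while length < len(template) and tm_func(template[:length]) < target_tm:
        length += 1
    return template[:length]


def design_pair(template, target_tm=55.0, limit=13, tm_func=wallace_tm):
    forward = grow_primer(template, target_tm, limit, tm_func)
    reverse = grow_primer(reverse_complement(template), target_tm, limit, tm_func)
    return forward, reverse


template = "atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"
fp, rp = design_pair(template)
print(fp, wallace_tm(fp))
print(rp, wallace_tm(rp))
```

Swapping in a slower, more accurate tm_func is what the estimate_function argument is meant to make cheaper: a fast estimate picks the starting length and the slow function refines it.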
fp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.
rp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.
target_tm (float, optional) – target tm for the primers, set to 55°C by default.
tm_func (function) – Function used for tm calculation. This function takes an ascii string
representing an oligonucleotide as argument and returns a float.
Some useful functions can be found in the pydna.tm module, but can be
substituted for a custom made function.
This function returns a list of pydna.amplicon.Amplicon objects where
primers have been modified with tails so that the fragments can be fused in
the order they appear in the list by for example Gibson assembly or homologous
recombination.
Given two Amplicons a and b, we can modify the reverse primer of a and the forward primer of b with tails to allow
fusion by fusion PCR, Gibson assembly or in-vivo homologous recombination.
The basic requirements for the primers for the three techniques are the same.
The first argument of this function is a list of sequence objects containing
Amplicons and other similar objects.
At least every second sequence object needs to be an Amplicon.
This rule exists because if a sequence object that is not a PCR product
is to be fused with another fragment, that other fragment needs to be an Amplicon
so that the primer of the other object can be modified to include the whole stretch
of sequence homology needed for the fusion. See the example below where a is a
non-amplicon (a linear plasmid vector for instance)
The overlap argument controls how many base pairs of overlap are required between
adjacent sequence fragments. At a junction between two Amplicons, tails of
about half this length are added to the two primers
closest to the junction.
In the case of an Amplicon adjacent to a Dseqrecord object, the tail will
be twice as long (1*overlap) since the
recombining sequence is present entirely on this primer:
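The tail-length rule can be written out as a small function. This sketch only computes lengths, not sequences. The function name and the overlap value of 35 used in the calls are illustrative assumptions, and the way an odd overlap is split between the two primers is a choice made for this sketch.

```python
def tail_lengths(left_is_amplicon, right_is_amplicon, overlap):
    """Return (tail on the left fragment's reverse primer,
    tail on the right fragment's forward primer) for one junction."""
    if left_is_amplicon and right_is_amplicon:
        half = overlap // 2
        return (overlap - half, half)  # roughly half of the overlap on each primer
    if left_is_amplicon:
        return (overlap, 0)  # the whole overlap sits on the single available primer
    if right_is_amplicon:
        return (0, overlap)
    raise ValueError("at least one fragment at each junction must be an Amplicon")


print(tail_lengths(True, True, 35))   # Amplicon next to Amplicon
print(tail_lengths(True, False, 35))  # Amplicon next to a Dseqrecord
```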
Note that if the sequence of DNA fragments starts or stops with an Amplicon,
the very first and very last primer will not be modified, i.e. assemblies are
always assumed to be linear. There are simple tricks around that for circular
assemblies, depicted in the last two examples below.
The maxlink argument controls the cut-off length for sequences that will be
synthesized by adding them to primers for the adjacent fragment(s). The
argument list may contain short spacers (such as spacers between fusion proteins).
Example 1: Linear assembly of PCR products (pydna.amplicon.Amplicon class objects)

>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon4
         >       <         >       <

                     ⇣
                     pydna.design.assembly_fragments
                     ⇣

>       <-       ->       <-                      pydna.assembly.Assembly
Amplicon1         Amplicon3
         Amplicon2         Amplicon4     ➤  Amplicon1Amplicon2Amplicon3Amplicon4
        ->       <-       ->       <

Example 2: Linear assembly of alternating Amplicons and other fragments

>       <         >       <
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2

                     ⇣
                     pydna.design.assembly_fragments
                     ⇣

>       <--     -->       <--                     pydna.assembly.Assembly
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2     ➤  Amplicon1Dseqrecd1Amplicon2Dseqrecd2

Example 3: Linear assembly of alternating Amplicons and other fragments

Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2
         >       <       -->       <

                     ⇣
                     pydna.design.assembly_fragments
                     ⇣
                                                  pydna.assembly.Assembly
Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2     ➤  Dseqrecd1Amplicon1Dseqrecd2Amplicon2
       -->       <--     -->       <

Example 4: Circular assembly of alternating Amplicons and other fragments

                 ->       <==
Dseqrecd1         Amplicon2
         Amplicon1         Dseqrecd1
       -->       <-

                     ⇣
                     pydna.design.assembly_fragments
                     ⇣
                                                  pydna.assembly.Assembly
                 ->       <==
Dseqrecd1         Amplicon2                  -Dseqrecd1Amplicon1Amplicon2-
         Amplicon1                       ➤  |                             |
       -->       <-                          -----------------------------

Example 5: Circular assembly of Amplicons

>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
         >       <         >       <

                     ⇣
                     pydna.design.assembly_fragments
                     ⇣

>       <=       ->       <-
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
        ->       <-       +>       <

                     ⇣
             make new Amplicon using the Amplicon1.template and
             the last fwd primer and the first rev primer.
                     ⇣
                                                  pydna.assembly.Assembly
+>       <=       ->       <-
Amplicon1         Amplicon3                  -Amplicon1Amplicon2Amplicon3-
         Amplicon2                       ➤  |                             |
        ->       <-                          -----------------------------
Parameters:
f (list of pydna.amplicon.Amplicon and other Dseqrecord like objects) – list of Amplicon and Dseqrecord objects for which fusion primers should be constructed.
overlap (int, optional) – Length of required overlap between fragments.
maxlink (int, optional) – Maximum length of spacer sequences that may be present in f. These will be included in tails for designed primers.
circular (bool, optional) – If True, the assembly is circular. If False, the assembly is linear.
>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.design import primer_design
>>> a = primer_design(Dseqrecord("atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"))
>>> b = primer_design(Dseqrecord("ccaaacccaccaggtaccttatgtaagtacttcaagtcgccagaagacttcttggtcaagttgcc"))
>>> c = primer_design(Dseqrecord("tgtactggtgctgaaccttgtatcaagttgggtgttgacgccattgccccaggtggtcgtttcgtt"))
>>> from pydna.design import assembly_fragments
>>> # We would like a circular recombination, so the first sequence has to be repeated
>>> fa1, fb, fc, fa2 = assembly_fragments([a, b, c, a])
>>> # Since all fragments are Amplicons, we need to extract the rp of the 1st and fp of the last fragments.
>>> from pydna.amplify import pcr
>>> fa = pcr(fa2.forward_primer, fa1.reverse_primer, a)
>>> [fa, fb, fc]
[Amplicon(100), Amplicon(101), Amplicon(102)]
>>> fa.name, fb.name, fc.name = "fa fb fc".split()
>>> from pydna.assembly import Assembly
>>> assemblyobj = Assembly([fa, fb, fc])
>>> assemblyobj
Assembly
fragments....: 100bp 101bp 102bp
limit(bp)....: 25
G.nodes......: 6
algorithm....: common_sub_strings
>>> assemblyobj.assemble_linear()
[Contig(-231), Contig(-166), Contig(-36)]
>>> assemblyobj.assemble_circular()[0].seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> (a + b + c).looped().seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> print(assemblyobj.assemble_circular()[0].figure())
 -|fa|36
|     \/
|     /\
|     36|fb|36
|           \/
|           /\
|           36|fc|36
|                 \/
|                 /\
|                 36-
|                    |
 --------------------
>>>
Dseq holds information for a double stranded DNA fragment.
Dseq also holds information describing the topology of
the DNA fragment (linear or circular).
Parameters:
watson (str) – a string representing the watson (sense) DNA strand.
crick (str, optional) – a string representing the crick (antisense) DNA strand.
ovhg (int, optional) – A positive or negative number to describe the stagger between the
watson and crick strands.
see below for a detailed explanation.
linear (bool, optional) – True indicates that sequence is linear, False that it is circular.
circular (bool, optional) – True indicates that sequence is circular, False that it is linear.
Examples
Dseq is a subclass of the Biopython Seq object. It stores two
strings representing the watson (sense) and crick (antisense) strands,
two properties called linear and circular, and a numeric value ovhg
(overhang) describing the stagger for the watson and crick strand
in the 5’ end of the fragment.
The most common usage is probably to create a Dseq object as a
part of a Dseqrecord object (see pydna.dseqrecord.Dseqrecord).
There are three ways of creating a Dseq object directly listed below, but you can also
use the function Dseq.from_full_sequence_and_overhangs() to create a Dseq:
Only one argument (string):
The given string will be interpreted as the watson strand of a
blunt, linear double stranded sequence object. The crick strand
is created automatically from the watson strand.
Two arguments (string, string):
If both watson and crick are given, but not ovhg, an attempt
will be made to find the best annealing between the strands.
There are limitations to this. For long fragments it is quite
slow. The length of the annealing sequences have to be at least
half the length of the shortest of the strands.
Three arguments (string, string, ovhg=int):
The ovhg parameter is an integer describing the length of the
crick strand overhang in the 5’ end of the molecule.
The ovhg parameter controls the stagger at the five prime end:
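The stagger can be visualised with a small string sketch. It assumes that a positive ovhg means the crick strand protrudes at the left (5’) end of the molecule and a negative ovhg means the watson strand protrudes, and that the crick strand is given 5’ to 3’ and drawn reversed under the watson strand. The function names are hypothetical and the code does not use the Dseq class.

```python
def render(watson, crick, ovhg):
    """Draw the watson strand over the reversed crick strand, honouring ovhg."""
    top = " " * max(ovhg, 0) + watson
    bottom = " " * max(-ovhg, 0) + crick[::-1]
    return top + "\n" + bottom


def duplex_length(watson, crick, ovhg):
    """Length from the leftmost to the rightmost base of either strand."""
    return max(len(watson) + max(ovhg, 0), len(crick) + max(-ovhg, 0))


print(render("nnnnn", "nnn", -2))  # watson protrudes by two bases at the left
print(render("nnn", "nnnnn", 2))   # crick protrudes by two bases at the left
print(duplex_length("GATCCTTT", "GATCCGAAA", -5))
```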
If the ovhg parameter is specified a crick strand also
needs to be supplied, otherwise an exception is raised.
>>> Dseq(watson="agt", ovhg=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pydna_/dsdna.py", line 169, in __init__
    else:
ValueError: ovhg defined without crick strand!
The shape of the fragment is set by circular = True or False.
Note that both ends of the DNA fragment have to be compatible to set
circular = True.
This can only be done if the two ends are compatible,
otherwise a TypeError is raised.
Examples
>>> from pydna.dseq import Dseq
>>> a = Dseq("catcgatc")
>>> a
Dseq(-8)
catcgatc
gtagctag
>>> a.looped()
Dseq(o8)
catcgatc
gtagctag
>>> a.T4("t")
Dseq(-8)
catcgat
 tagctag
>>> a.T4("t").looped()
Dseq(o7)
catcgat
gtagcta
>>> a.T4("a")
Dseq(-8)
catcga
  agctag
>>> a.T4("a").looped()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pydna/dsdna.py", line 357, in looped
    if type5 == type3 and str(sticky5) == str(rc(sticky3)):
TypeError: DNA cannot be circularized.
5' and 3' sticky ends not compatible!
>>>
Fill in five prime protruding ends with a DNA polymerase
that has only DNA polymerase activity (such as exo-Klenow [7])
and any combination of A, G, C or T. The default is all four
nucleotides together.
Transcribe a DNA sequence into RNA and return the RNA sequence as a new Seq object.
Following the usual convention, the sequence is interpreted as the
coding strand of the DNA double helix, not the template strand. This
means we can get the RNA sequence just by switching T to U.
As Seq objects are immutable, a TypeError is raised if
transcribe is called on a Seq object with inplace=True.
Trying to transcribe an RNA sequence has no effect.
If you have a nucleotide sequence which might be DNA or RNA
(or even a mixture), calling the transcribe method will ensure
any T becomes U.
Trying to transcribe a protein sequence will replace any
T for Threonine with U for Selenocysteine, which has no
biologically plausible rationale.
Fill in five prime protruding ends and chew back
three prime protruding ends with a DNA polymerase providing both
5’-3’ DNA polymerase activity and 3’-5’ nuclease activity
(such as T4 DNA polymerase). This can be done in the presence of any
combination of the four nucleotides A, G, C or T. Removing one or more nucleotides
can facilitate engineering of sticky ends. The default is all four nucleotides together.
Returns False if:
- Cut positions fall outside the sequence (could be moved to Biopython)
- Overhang is not double stranded
- Recognition site is not double stranded or is outside the sequence
- For enzymes that cut twice, it checks that at least one possibility is valid
Returns a list of cutsites, represented as ((cut_watson, ovhg), enz):
cut_watson is a positive integer contained in [0,len(seq)), where seq is the sequence
that will be cut. It represents the position of the cut on the watson strand, using the full
sequence as a reference. By “full sequence” I mean the one you would get from str(Dseq).
ovhg is the overhang left after the cut. It has the same meaning as ovhg in
the Bio.Restriction enzyme objects, or pydna’s Dseq property.
enz is the enzyme object. It’s not necessary to perform the cut, but can be
used to keep track of which enzyme was used.
Cuts are only returned if the recognition site and overhang are on the double-strand
part of the sequence.
For a given cut expressed as ((cut_watson, ovhg), enz), returns
a tuple (cut_watson, cut_crick, ovhg).
cut_watson: see get_cutsites docs
cut_crick: equivalent of cut_watson in the crick strand
ovhg: see get_cutsites docs
The cut can be None if it represents the left or right end of the sequence.
Then it will return the position of the watson and crick ends with respect
to the “full sequence”. The is_left parameter is only used in this case.
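For an actual cut (not a sequence end), the crick position follows from the watson position and the overhang. The sketch below assumes the relation cut_crick = cut_watson - ovhg, wrapped around the origin for circular sequences, and uses EcoRI (G^AATTC, ovhg of -4 in the Bio.Restriction convention) as the worked case. The function name is hypothetical and the None case for sequence ends is not covered.

```python
def cut_positions(cut_watson, ovhg, seq_length, circular=False):
    """Return (cut_watson, cut_crick, ovhg) for a cut inside the sequence."""
    cut_crick = cut_watson - ovhg  # assumed relation between the two strands
    if circular:
        cut_crick %= seq_length  # a cut may wrap around the origin
    return (cut_watson, cut_crick, ovhg)


sequence = "AAAGAATTCAAA"
site_start = sequence.find("GAATTC")
cut_watson = site_start + 1  # EcoRI cuts the watson strand after the G
print(cut_positions(cut_watson, -4, len(sequence)))
```

Here the watson strand is cut at position 4 and the crick strand at position 8, leaving the four-base AATT 5’ overhang.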
Returns pairs of cutsites that render the edges of the resulting fragments.
A fragment produced by restriction is represented by a tuple of length 2 that
may contain cutsites or None:
Two cutsites: represents the extraction of a fragment between those two
cutsites, in that orientation. To represent the opening of a circular
molecule with a single cutsite, we put the same cutsite twice.
None, cutsite: represents the extraction of a fragment between the left
edge of linear sequence and the cutsite.
cutsite, None: represents the extraction of a fragment between the cutsite
and the right edge of a linear sequence.
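The pairing rules listed above can be sketched with cutsites reduced to plain integer positions (the real cutsites are ((cut_watson, ovhg), enz) tuples). The function name is hypothetical.

```python
def cutsite_pairs_sketch(cutsites, circular):
    """Pair up neighbouring cutsites to describe the resulting fragments."""
    if not cutsites:
        return []  # no cuts, no fragments
    cuts = sorted(cutsites)
    if circular:
        # Each cut pairs with the next one; the last wraps around to the first.
        # A single cut pairs with itself, which opens the circle.
        return list(zip(cuts, cuts[1:] + cuts[:1]))
    padded = [None] + cuts + [None]  # None marks the ends of a linear sequence
    return list(zip(padded, padded[1:]))


print(cutsite_pairs_sketch([4, 20], circular=False))
print(cutsite_pairs_sketch([4, 20], circular=True))
print(cutsite_pairs_sketch([4], circular=True))
```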
This module provides the Dseqrecord class, for handling double stranded
DNA sequences. The Dseqrecord holds sequence information in the form of a pydna.dseq.Dseq
object. The Dseq and Dseqrecord classes are subclasses of Biopython's
Seq and SeqRecord classes, respectively.
The Dseq and Dseqrecord classes support the notion of circular and linear DNA topology.
Dseqrecord is a double stranded version of the Biopython SeqRecord [9] class.
The Dseqrecord object holds a Dseq object describing the sequence.
Additionally, Dseqrecord holds meta information about the sequence in the
form of a list of SeqFeatures, in the same way as the SeqRecord does.
The Dseqrecord can be initialized with a string, Seq, Dseq, SeqRecord
or another Dseqrecord. The sequence information will be stored in a
Dseq object in all cases.
Dseqrecord objects can be read or parsed from sequences in FASTA, EMBL or Genbank formats.
See the pydna.readers and pydna.parsers modules for further information.
There is a short representation associated with the Dseqrecord.
Dseqrecord(-3) represents a linear sequence of length 3
while Dseqrecord(o7)
represents a circular sequence of length 7.
Dseqrecord and Dseq share the same concept of length. This length can be larger
than each strand alone if they are staggered as in the example below.
<-- length -->
GATCCTTT
     AAAGCCTAG
Parameters:
record (string, Seq, SeqRecord, Dseq or other Dseqrecord object) – This data will be used to form the seq property
circular (bool, optional) – True or False reflecting the shape of the DNA molecule
linear (bool, optional) – True or False reflecting the shape of the DNA molecule
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL.
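The difference between the two encodings can be shown with the standard library alone. The sketch assumes the checksum is a SHA-1 digest of the upper-cased sequence with base64 padding stripped. Those details, and the function name, are assumptions for illustration, and any handling specific to circular or double stranded sequences is left out.

```python
import base64
import hashlib


def checksum_pair(sequence):
    """Return the digest in standard and URL-safe base64 (padding stripped)."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    standard = base64.b64encode(digest).decode("ascii").rstrip("=")
    urlsafe = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return standard, urlsafe


standard, urlsafe = checksum_pair("atgttcctacatga")
print(standard)
print(urlsafe)
```

The two strings differ only where the standard alphabet uses + or /.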
Writes the Dseqrecord to a file using the format f, which must
be a format supported by Biopython SeqIO for writing [11]. Default
is “gb” which is short for Genbank. Note that Biopython SeqIO reads
more formats than it writes.
Filename is the path to the file where the sequence is to be
written. The filename is optional, if it is not given, the
description property (string) is used together with the format.
If obj is the Dseqrecord object, the default file name will be:
<obj.locus>.<f>
Where <f> is “gb” by default. If the filename already exists
and the sequence it contains is different, a new file name will be
used so that the old file is not lost:
This method returns a new circular sequence (Dseqrecord object), which has been rotated
in such a way that there is maximum overlap between the sequence and
ref, which may be a string, Biopython Seq, SeqRecord object or
another Dseqrecord object.
The reason for using this could be to rotate a new recombinant plasmid so
that it starts at the same position after cloning. See the example below:
Digest a Dseqrecord object with one or more restriction enzymes.
Returns a list of linear Dseqrecords. If there are no cuts, an empty
list is returned.
See also Dseq.cut()
Parameters:
enzymes (enzyme object or iterable of such objects) – A Bio.Restriction.XXX restriction object or iterable of such.
Returns:
Dseqrecord_frags – list of Dseqrecord objects formed by the digestion
This module provides a class for downloading sequences from Genbank
called Genbank and a function that does the same thing called genbank.
The function can be used if the environment variable pydna_email has
been set to a valid email address. The easiest way to do this permanently is to edit the
pydna.ini file. See the documentation of pydna.open_config_folder().
This method downloads a Genbank nucleotide record from Genbank. This method is
cached by default. This can be controlled by editing the pydna_cached_funcs environment
variable. The best way to do this permanently is to edit the
pydna.ini file. See the documentation of pydna.open_config_folder().
Item is a string containing one genbank accession number
for a nucleotide file. Genbank nucleotide accession numbers have this format:
A12345 = 1 letter + 5 numerals
AB123456 = 2 letters + 6 numerals
The accession number is sometimes followed by a point and version number
BK006936.2
Item can also contain optional interval information in the following formats:
BK006936.2 REGION: complement(613900..615202)
NM_005546 REGION: 1..100
NM_005546 REGION: complement(1..100)
21614549:1-100
21614549:c100-1
21614549 1-100
21614549 c100-1
It is useful to set an interval for large genbank records to limit the download time.
The items above containing interval information can be obtained directly by
looking up an entry in Genbank and setting the Change region shown on the
upper right side of the page. The ACCESSION line of the displayed Genbank
file will have the formatting shown.
Alternatively, seq_start and seq_stop can be set explicitly to the sequence intervals to be
downloaded.
If strand is “c”, “C”, “crick”, “Crick”, “antisense”, “Antisense”,
“2”, 2, “-” or “-1”, the antisense (Crick) strand is returned, otherwise
the sense (Watson) strand is returned.
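The item formats and strand values listed above can be made concrete with a small parser. This is a sketch under stated assumptions, not the code pydna uses: the names parse_item and normalise_strand and the two regular expressions are hypothetical and only cover the formats shown in this section.

```python
import re

# "ACC REGION: 1..100" and "ACC REGION: complement(1..100)"
REGION = re.compile(
    r"^(?P<acc>\S+)\s+REGION:\s*(?P<comp>complement\()?"
    r"(?P<start>\d+)\.\.(?P<stop>\d+)\)?$"
)
# "ACC:1-100", "ACC:c100-1", "ACC 1-100" and "ACC c100-1"
SHORT = re.compile(r"^(?P<acc>[^\s:]+)[:\s]\s*(?P<c>c)?(?P<a>\d+)-(?P<b>\d+)$")

ANTISENSE = {"c", "C", "crick", "Crick", "antisense", "Antisense", "2", 2, "-", "-1"}


def normalise_strand(strand):
    """Return 2 for the antisense (Crick) strand, otherwise 1."""
    return 2 if strand in ANTISENSE else 1


def parse_item(item):
    """Split item into (accession, seq_start, seq_stop, strand)."""
    item = item.strip()
    m = REGION.match(item)
    if m:
        return m["acc"], int(m["start"]), int(m["stop"]), 2 if m["comp"] else 1
    m = SHORT.match(item)
    if m:
        a, b = int(m["a"]), int(m["b"])
        # In the "c100-1" form the coordinates are written high to low.
        return (m["acc"], b, a, 2) if m["c"] else (m["acc"], a, b, 1)
    return item, None, None, 1  # plain accession without an interval


print(parse_item("21614549:c100-1"))  # ('21614549', 1, 100, 2)
```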
This function takes the same parameters as the
pydna.genbank.Genbank.nucleotide method. The email address stored
in the pydna_email environment variable is used. The easiest way to set
this permanently is to edit the pydna.ini file.
See the documentation of pydna.open_config_folder().
If no accession is given, a very short Genbank entry
is used as an example (see below). This can be useful for testing the
connection to Genbank.
Please note that this result is also cached by default by settings in
the pydna.ini file.
See the documentation of pydna.open_config_folder()
LOCUS       CS570233                  14 bp    DNA     linear   PAT 18-MAY-2007
DEFINITION  Sequence 6 from Patent WO2007025016.
ACCESSION   CS570233
VERSION     CS570233.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1
  AUTHORS   Shaw,R.W. and Cottenoir,M.
  TITLE     Inhibition of metallo-beta-lactamase by double-stranded dna
  JOURNAL   Patent: WO 2007025016-A1 6 01-MAR-2007;
            Texas Tech University System (US)
FEATURES             Location/Qualifiers
     source          1..14
                     /organism="synthetic construct"
                     /mol_type="unassigned DNA"
                     /db_xref="taxon:32630"
                     /note="This is a 14bp aptamer inhibitor."
ORIGIN
        1 atgttcctac atga
//
This module provides the gbtext_clean() function which can clean up broken Genbank files enough to
pass the Biopython Genbank parser.
Almost all of this code was lifted from BioJSON (levskaya/BioJSON) by Anselm Levskaya.
The original code was not accompanied by any software licence. This parser is based on pyparsing.
There are some modifications to deal with fringe cases.
The parser first produces JSON as an intermediate format which is then formatted back into a
string in Genbank format.
The parser is not complete, so some fields do not survive the roundtrip (see below).
This should not be a difficult fix. The returned result has two properties,
.jseq which is the intermediate JSON produced by the parser and .gbtext
which is the formatted genbank string.
This function takes a string containing one genbank sequence
in Genbank format and returns a named tuple containing two fields,
the gbtext containing a string with the corrected genbank sequence and
jseq which contains the JSON intermediate.
Examples
>>> s = '''LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
... DEFINITION .
... ACCESSION
... VERSION
... SOURCE .
... ORGANISM .
... COMMENT
... COMMENT ApEinfo:methylated:1
... ORIGIN
... 1 aaa
... //'''
>>> from pydna.readers import read
>>> read(s)
... /site-packages/Bio/GenBank/Scanner.py:1388: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013\n'
"correct?\n:%r" % line, BiopythonParserWarning)
Traceback (most recent call last):
File "... /pydna/readers.py", line 48, in read
results = results.pop()
IndexError: pop from empty list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "... /pydna/readers.py", line 50, in read
raise ValueError("No sequences found in data:\n({})".format(data[:79]))
ValueError: No sequences found in data:
(LOCUS New_DNA 3 bp DNA CIRCULAR SYN 19-JUN-2013
DEFINITI)
>>> from pydna.genbankfixer import gbtext_clean
>>> s2, j2 = gbtext_clean(s)
>>> print(s2)
LOCUS New_DNA 3 bp ds-DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION
VERSION
SOURCE .
ORGANISM .
COMMENT
COMMENT ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
>>> s3 = read(s2)
>>> s3
Dseqrecord(o3)
>>> print(s3.format())
LOCUS New_DNA 3 bp DNA circular SYN 19-JUN-2013
DEFINITION .
ACCESSION New_DNA
VERSION New_DNA
KEYWORDS .
SOURCE
ORGANISM .
.
COMMENT
ApEinfo:methylated:1
FEATURES Location/Qualifiers
ORIGIN
1 aaa
//
A DNA ladder is a list of FakeSeq objects that has to be initiated with
Size (bp), amount of substance (mol) and Relative mobility (Rf).
Rf is a float value between 0.000 and 1.000. These are used together with
the cubic spline interpolator in the gel module to calculate migration
distance from fragment length. The Rf values are calculated manually from
a gel image. An example can be found in scripts/molecular_weight_standards.ods.
This module provides classes that roughly map to the OpenCloning
data model, which is defined using LinkML <https://linkml.io>, and available as a python
package opencloning-linkml. These classes
are documented there, and the ones in this module essentially replace the fields pointing to
sequences and primers (which use ids in the data model) to Dseqrecord and Primer
objects, respectively. Similarly, it uses Location from Biopython instead of a string,
which is what the data model uses.
When using pydna to plan cloning, it stores the provenance of Dseqrecord objects in
their source attribute. Not all methods generate sources so far, so refer to the
documentation notebooks for examples on how to use this feature. The history method of
Dseqrecord objects can be used to get a string representation of the provenance of the
sequence. You can also use the CloningStrategy class to create a JSON representation of
the cloning strategy. That CloningStrategy can be loaded in the OpenCloning web interface
to see a representation of the cloning strategy.
Context manager that is used to determine how ids are assigned to objects when
mapping them to the OpenCloning data model. If use_python_internal_id is True,
the built-in python id() function is used to assign ids to objects. That function
produces a unique integer for each object in python, so it’s guaranteed to be unique.
If use_python_internal_id is False, the object’s .id attribute (must be a string integer)
is used to assign ids to objects. This is useful when the objects already have meaningful ids,
and you want to keep references to them in SourceInput objects (which sequences and
primers are used in a particular source).
Parameters:
use_python_internal_id (bool) – If True, use Python’s built-in id() function.
If False, use the object’s .id attribute (must be a string integer).
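How such a context manager can switch the id source is sketched below with the standard library only. The names id_mode_sketch, get_id and the module-level _STATE dictionary are hypothetical and chosen for illustration; pydna's own implementation may be organised differently.

```python
from contextlib import contextmanager

# Hypothetical module-level switch consulted whenever an id is needed.
_STATE = {"use_python_internal_id": True}


@contextmanager
def id_mode_sketch(use_python_internal_id=True):
    """Temporarily change how ids are assigned, restoring the old mode on exit."""
    previous = _STATE["use_python_internal_id"]
    _STATE["use_python_internal_id"] = use_python_internal_id
    try:
        yield
    finally:
        _STATE["use_python_internal_id"] = previous


def get_id(obj):
    """Return id(obj), or int(obj.id) when the internal id mode is switched off."""
    if _STATE["use_python_internal_id"]:
        return id(obj)
    return int(obj.id)  # the .id attribute must be a string integer
```

Inside a "with id_mode_sketch(False):" block, an object whose .id is "42" gets the id 42; outside the block the built-in id() is used again.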
Generates a JSON representation of the model using Pydantic’s to_json method.
Parameters:
indent – Indentation to use in the JSON output. If None is passed, the output will be compact.
ensure_ascii – If True, the output is guaranteed to have all incoming non-ASCII characters escaped.
If False (the default), these characters will be output as-is.
include – Field(s) to include in the JSON output.
exclude – Field(s) to exclude from the JSON output.
context – Additional context to pass to the serializer.
by_alias – Whether to serialize using field aliases.
exclude_unset – Whether to exclude fields that have not been explicitly set.
exclude_defaults – Whether to exclude fields that are set to their default value.
exclude_none – Whether to exclude fields that have a value of None.
exclude_computed_fields – Whether to exclude computed fields.
While this can be useful for round-tripping, it is usually recommended to use the dedicated
round_trip parameter instead.
round_trip – If True, dumped values should be valid as input for non-idempotent types such as Json[T].
warnings – How to handle serialization errors. False/“none” ignores them, True/“warn” logs errors,
“error” raises a PydanticSerializationError.
fallback – A function to call when an unknown value is encountered. If not provided,
a PydanticSerializationError is raised.
serialize_as_any – Whether to serialize fields with duck-typing serialization behavior.
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
Parameters:
mode – The mode in which to_python should run.
If mode is ‘json’, the output will only contain JSON serializable types.
If mode is ‘python’, the output may contain non-JSON-serializable Python objects.
include – A set of fields to include in the output.
exclude – A set of fields to exclude from the output.
context – Additional context to pass to the serializer.
by_alias – Whether to use the field’s alias in the dictionary key if defined.
exclude_unset – Whether to exclude fields that have not been explicitly set.
exclude_defaults – Whether to exclude fields that are set to their default value.
exclude_none – Whether to exclude fields that have a value of None.
exclude_computed_fields – Whether to exclude computed fields.
While this can be useful for round-tripping, it is usually recommended to use the dedicated
round_trip parameter instead.
round_trip – If True, dumped values should be valid as input for non-idempotent types such as Json[T].
warnings – How to handle serialization errors. False/“none” ignores them, True/“warn” logs errors,
“error” raises a PydanticSerializationError.
fallback – A function to call when an unknown value is encountered. If not provided,
a PydanticSerializationError is raised.
serialize_as_any – Whether to serialize fields with duck-typing serialization behavior.
If no sequences are found, an empty list is returned. This is a greedy
function, use carefully.
Parameters:
data (string or iterable) – The data parameter is a string containing:
1. an absolute path to a local file. The file will be read in text mode and parsed for EMBL, FASTA and Genbank sequences. Can be a string or a Path object.
2. a string containing one or more sequences in EMBL, GENBANK, or FASTA format. Mixed formats are allowed.
data can also be a list or other iterable where the elements are 1 or 2.
ds (bool) – If True double stranded Dseqrecord objects are returned.
If False single stranded Bio.SeqRecord[12] objects are
returned.
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL.
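The relation between the two encodings can be shown with the standard library. The sketch assumes the classic SEGUID definition (SHA-1 digest of the uppercased sequence, base64 encoded, trailing "=" padding removed); the function name seguid_sketch is hypothetical, and the strings it produces are not claimed to be identical to what pydna returns.

```python
import base64
import hashlib


def seguid_sketch(seq, urlsafe=False):
    """SHA-1 checksum of seq in base64, optionally with the URL-safe alphabet."""
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    encode = base64.urlsafe_b64encode if urlsafe else base64.b64encode
    return encode(digest).decode("ascii").rstrip("=")


standard = seguid_sketch("gattaca")
url_safe = seguid_sketch("gattaca", urlsafe=True)
# The two checksums differ only in the characters + / versus - _
assert url_safe == standard.replace("+", "-").replace("/", "_")
```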
Turn a nucleotide sequence into a protein sequence by creating a new sequence object.
This method will translate DNA or RNA sequences. It should not
be used on protein sequences as any result will be biologically
meaningless.
Parameters:
table – Which codon table to use? This can be either a name
(string), an NCBI identifier (integer), or a CodonTable
object (useful for non-standard genetic codes). This
defaults to the “Standard” table.
stop_symbol – Single character string, what to use for
terminators. This defaults to the asterisk, “*”.
to_stop – Boolean, defaults to False meaning do a full
translation continuing on past any stop codons (translated as the
specified stop_symbol). If True, translation is terminated at
the first in frame stop codon (and the stop_symbol is not
appended to the returned protein sequence).
cds – Boolean, indicates this is a complete CDS. If True,
this checks the sequence starts with a valid alternative start
codon (which will be translated as methionine, M), that the
sequence length is a multiple of three, and that there is a
single in frame stop codon at the end (this will be excluded
from the protein sequence, regardless of the to_stop option).
If these tests fail, an exception is raised.
gap – Single character string to denote symbol used for gaps.
Defaults to the minus sign.
A Seq object is returned if translate is called on a Seq
object; a MutableSeq object is returned if translate is called
on a MutableSeq object.
It isn’t a valid CDS under NCBI table 1, due to both the start codon
and also the in frame stop codons:
>>> coding_dna.translate(table=1, cds=True)
Traceback (most recent call last):
    ...
Bio.Data.CodonTable.TranslationError: First codon 'GTG' is not a start codon
If the sequence has no in-frame stop codon, then the to_stop argument
has no effect:
NOTE - Ambiguous codons like “TAN” or “NNN” could be an amino acid
or a stop codon. These are translated as “X”. Any invalid codon
(e.g. “TA?” or “T-A”) will throw a TranslationError.
NOTE - This does NOT behave like the python string’s translate
method. For that use str(my_seq).translate(…) instead
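The effect of the to_stop and stop_symbol arguments can be illustrated with a small stand-alone translator for the standard genetic code. This is a sketch and not Biopython's implementation: the name translate_sketch is hypothetical, only the standard table is supported, and any codon that is not a plain A/C/G/T triplet is written as "X" instead of raising an error.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sketch(seq, stop_symbol="*", to_stop=False):
    """Translate DNA or RNA; optionally stop at the first in-frame stop codon."""
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            if to_stop:
                break  # the stop symbol is not appended
            aa = stop_symbol
        protein.append(aa)
    return "".join(protein)


print(translate_sketch("ATGGCCTAAGGG"))                # MA*G
print(translate_sketch("ATGGCCTAAGGG", to_stop=True))  # MA
```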
Transcribe a DNA sequence into RNA and return the RNA sequence as a new Seq object.
Following the usual convention, the sequence is interpreted as the
coding strand of the DNA double helix, not the template strand. This
means we can get the RNA sequence just by switching T to U.
As Seq objects are immutable, a TypeError is raised if
transcribe is called on a Seq object with inplace=True.
Trying to transcribe an RNA sequence has no effect.
If you have a nucleotide sequence which might be DNA or RNA
(or even a mixture), calling the transcribe method will ensure
any T becomes U.
Trying to transcribe a protein sequence will replace any
T for Threonine with U for Selenocysteine, which has no
biologically plausible rationale.
As Seq objects are immutable, a TypeError is raised if
transcribe is called on a Seq object with inplace=True.
Trying to back-transcribe DNA has no effect. If you have a nucleotide
sequence which might be DNA or RNA (or even a mixture), calling the
back-transcribe method will ensure any U becomes T.
Trying to back-transcribe a protein sequence will replace any U for
Selenocysteine with T for Threonine, which is biologically meaningless.
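Because the sequence is read as the coding strand, both operations reduce to a character substitution. A minimal sketch on plain strings follows; the function names are hypothetical and are not the Biopython methods themselves.

```python
def transcribe_sketch(seq):
    """Coding-strand DNA to RNA: every T becomes U, case preserved."""
    return seq.replace("T", "U").replace("t", "u")


def back_transcribe_sketch(seq):
    """RNA back to coding-strand DNA: every U becomes T, case preserved."""
    return seq.replace("U", "T").replace("u", "t")


print(transcribe_sketch("ATGGCCattg"))       # AUGGCCauug
print(back_transcribe_sketch("AUGGCCauug"))  # ATGGCCattg
```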
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL.
This checksum is the same as seguid but with base64.urlsafe
encoding instead of the normal base64. This means that
the characters + and / are replaced with - and _ so that
the checksum can be part of a URL or a filename.
Examples
>>> from pydna.seqrecord import SeqRecord
>>> a = SeqRecord("gattaca")
>>> a.seguid()  # original seguid is +bKGnebMkia5kNg/gF7IORXMnIU
'lsseguid=tp2jzeCM2e3W4yxtrrx09CMKa_8'
Return the longest common substring between the sequence
and another sequence (other). The other sequence can be a string,
Seq, SeqRecord, Dseq or Dseqrecord.
The method returns a SeqFeature with type “read” as this method
is mostly used to map sequence reads to the sequence. This can be
changed by passing a type as keyword with some other string value.
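Finding the longest common substring itself can be sketched with the standard library's difflib. This is not how pydna computes it; the function name is hypothetical, and a plain (start, end) tuple stands in for the SeqFeature that the method returns.

```python
from difflib import SequenceMatcher


def longest_common_substring(seq, other):
    """Return (start, end) in seq of the longest substring shared with other."""
    seq, other = seq.lower(), other.lower()
    matcher = SequenceMatcher(None, seq, other, autojunk=False)
    match = matcher.find_longest_match(0, len(seq), 0, len(other))
    return match.a, match.a + match.size


start, end = longest_common_substring("ggatccaaattt", "ccaaatg")
print(start, end)  # 4 10
```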
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the “Software”), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Algorithm described in Pierre Duval, Jean. 1983. Factorizing Words
over an Ordered Alphabet. Journal of Algorithms & Computational Technology
4 (4) (December 1): 363–381. and Algorithms on strings and sequences based
on Lyndon words, David Eppstein 2011.
https://gist.github.com/dvberkel/1950267
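Assuming the cited algorithm is used here to find the lexicographically smallest rotation of a circular sequence, the idea can be sketched as follows. The code is a textbook-style variant of the Lyndon factorization approach applied to the doubled string, written for illustration; the function name is hypothetical and the code is not copied from pydna or from the gist.

```python
def smallest_rotation_sketch(s):
    """Return the lexicographically smallest rotation of the string s."""
    n = len(s)
    if n == 0:
        return s
    doubled = s + s
    i = 0
    answer = 0
    while i < n:
        answer = i  # start of the last Lyndon factor beginning before n
        j, k = i + 1, i
        while j < 2 * n and doubled[k] <= doubled[j]:
            if doubled[k] < doubled[j]:
                k = i  # restart the comparison from the factor start
            else:
                k += 1
            j += 1
        while i <= k:
            i += j - k  # skip over repetitions of the factor just found
    return doubled[answer:answer + n]


print(smallest_rotation_sketch("taaa"))  # aaat
```

A canonical rotation of this kind gives every circular permutation of a plasmid sequence the same string representation.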
Turn a three letter code protein sequence into one with one letter code.
The single input argument ‘seq’ should be a protein sequence using three
letter codes, as a python string.
This function returns the amino acid sequence as a string using the one
letter amino acid codes. Output follows the IUPAC standard (including
ambiguous characters B for “Asx”, J for “Xle” and X for “Xaa”, and also U
for “Sel” and O for “Pyl”) plus “Ter” for a terminator given as an
asterisk.
Any unknown
character (including possible gap characters), is changed into ‘Xaa’.
Examples
>>> from Bio.SeqUtils import seq3
>>> seq3("MAIVMGRWKGAR*")
'MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer'
>>> from pydna.utils import seq31
>>> seq31('MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer')
'M A I V M G R W K G A R *'
Compares two or more DNA sequences for equality i.e. if they
represent the same DNA molecule.
Two linear sequences are considered equal if either:
They have the same sequence (case insensitive)
One sequence is the reverse complement of the other
Two circular sequences are considered equal if they are circular
permutations meaning that they have the same length and:
One sequence can be found in the concatenation of the other sequence with itself.
The reverse complement of one sequence can be found in the concatenation of the other sequence with itself.
The topology for the comparison can be set using one of the keywords
linear or circular to True or False.
If circular or linear is not set, it will be deduced from the topology of
each sequence for sequences that have a linear or circular attribute
(like Dseq and Dseqrecord).
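The comparison rules above can be written down directly for plain strings. This is an illustrative sketch only: the names revcomp and same_molecule are hypothetical, exactly two sequences are compared, and the topology is passed explicitly instead of being deduced from the objects.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def revcomp(seq):
    """Reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def same_molecule(a, b, circular=False):
    """True if a and b describe the same linear or circular DNA molecule."""
    a, b = a.lower(), b.lower()
    if len(a) != len(b):
        return False
    if circular:
        doubled = b + b  # contains every circular permutation of b
        return a in doubled or revcomp(a) in doubled
    return a == b or a == revcomp(b)


print(same_molecule("AAAT", "attt"))                 # True, reverse complement
print(same_molecule("aaat", "taaa"))                 # False as linear molecules
print(same_molecule("aaat", "taaa", circular=True))  # True, circular permutation
```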
Create a location object from a start and end position.
If the end position is less than the start position, the location is circular. It handles negative positions.
Note this special case, 0 is the same as len(seq)
>>> str(create_location(5, 0, 10))
'[5:10]'
Note the special case where if start and end are the same,
the location spans the entire sequence (it’s not empty).
>>> str(create_location(5, 5, 10))
'join{[5:10], [0:5]}'