pydna package#

copyright:

Copyright 2013-2023 by Björn Johansson. All rights reserved.

license:

This code is part of the pydna package, governed by the license in LICENSE.txt that should be included as part of this package.

pydna#

Pydna is a Python package for simulating the construction of recombinant DNA molecules with molecular biology techniques. Development of pydna happens in the pydna GitHub repository.

Provided:
  1. PCR simulation

  2. Assembly simulation based on shared identical sequences

  3. Primer design for amplification of a given sequence

  4. Automatic design of primer tails for Gibson assembly or homologous recombination.

  5. Restriction digestion and cut&paste cloning

  6. Agarose gel simulation

  7. Download sequences from Genbank

  8. Parsing various sequence formats including the capacity to handle broken Genbank format

pydna package layout#

The most important modules, and how to import functions or classes from them, are listed below. Class names start with a capital letter, function names with a lowercase letter:

from pydna.module import function
from pydna.module import Class

Example: from pydna.gel import Gel

pydna
   ├── amplify
   │         ├── Anneal
   │         └── pcr
   │
   ├── assembly
   │          └── Assembly
   │
   ├── design
   │        ├── assembly_fragments
   │        └── primer_design
   │
   ├── dseqrecord
   │            └── Dseqrecord
   │
   ├── gel
   │     └── Gel
   │
   ├── genbank
   │         ├── genbank
   │         └── Genbank
   │
   ├── parsers
   │         ├── parse
   │         └── parse_primers
   │
   └── readers
             ├── read
             └── read_primers

How to use the documentation#

Documentation is available as docstrings provided in the source code for each module. These docstrings can be inspected by reading the source code directly. See further below on how to obtain the code for pydna.

In the Python shell, use the built-in help function to view a function’s docstring:

>>> from pydna import readers
>>> help(readers.read)
... 

The docstrings are also used to generate a reference manual, which is available online at Read the Docs.

Docstrings can be explored using IPython, an advanced Python shell with TAB-completion and introspection capabilities. To see which functions are available in pydna, type pydna.<TAB> (where <TAB> refers to the TAB key). Use pydna.open_config_folder?<ENTER> to view the docstring or pydna.open_config_folder??<ENTER> to view the source code.

In the Spyder IDE, place the cursor immediately before the name of a module, class or function and press ctrl+i to bring up its docstring in a separate window.

Code snippets are indicated by three greater-than signs:

>>> x=41
>>> x=x+1
>>> x
42

pydna source code#

The pydna source code is available on GitHub.

How to get more help#

Please join the Google group for pydna; this is the preferred place to ask for help. If you find bugs in pydna itself, open an issue at the GitHub repository.

Examples of pydna in use#

See this repository for a collection of examples.

pydna.get_env()[source]#

Print an ASCII table containing all environment variables.

Pydna-related variables have names that start with pydna_.

ASCII-art logotype of pydna.

Submodules#

pydna.all module#

This module provides most pydna functionality in the local namespace.

Example

>>> from pydna.all import *
>>> Dseq("aaa")
Dseq(-3)
aaa
ttt
>>> Dseqrecord("aaa")
Dseqrecord(-3)
>>> from pydna.all import __all__
>>> __all__
['Anneal', 'pcr', 'Assembly', 'genbank', 'Genbank', 'Dseqrecord',
'Dseq', 'read', 'read_primer', 'parse', 'parse_primers', 'primer_design', 'assembly_fragments', 'eq', 'gbtext_clean']
>>>
class pydna.all.Anneal(primers, template, limit=13, **kwargs)[source]#

Bases: object

The Anneal class has the following important attributes:

forward_primers#

Description of forward_primers.

Type:

list

reverse_primers#

Description of reverse_primers.

Type:

list

template#

A copy of the template argument. Primer annealing sites have been added as features that can be visualized in a sequence editor such as ApE.

Type:

Dseqrecord

limit#

The minimum length of primer annealing; the default is 13 bp.

Type:

int, optional

property products#
report()#

Returns a short report describing whether and where the primers anneal to the template.

pydna.all.pcr(*args, **kwargs) → Amplicon[source]#

pcr is a convenience function for the Anneal class to simplify its usage, especially from the command line. If more than one or no PCR product is formed, a ValueError is raised.

args is any iterable of Dseqrecords or an iterable of iterables of Dseqrecords. args will be greedily flattened.

Parameters:
  • args (iterable containing sequence objects) – Several arguments are also accepted.

  • limit (int = 13, optional) – limit length of the annealing part of the primers.

Notes

Sequences in args can be of type:

  • string

  • Seq

  • SeqRecord (or subclass)

  • Dseqrecord (or subclass)

The last sequence will be assumed to be the template while all preceding sequences will be assumed to be primers.

This is a powerful function, use with care!
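The argument handling described above (greedy flattening, then treating the last item as the template) can be sketched in plain Python. This is only an illustration and not pydna's implementation: flatten and split_primers_and_template are hypothetical helpers, and plain strings stand in for sequence objects. A real flattener would also have to treat Seq and Dseqrecord objects as single items, since they are iterable themselves.

```python
from collections.abc import Iterable


def flatten(items):
    """Recursively flatten nested iterables, treating strings as single items."""
    flat = []
    for item in items:
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def split_primers_and_template(*args):
    """Return (primers, template); the last flattened item is the template."""
    *primers, template = flatten(args)
    return primers, template


# The same result is obtained whether the primers are passed one by one
# or grouped in a list or tuple.
primers, template = split_primers_and_template(["p1", "p2"], "template")
print(primers, template)
```

Because every nesting level is flattened, pcr(p1, p2, template), pcr([p1, p2], template) and pcr((p1, p2), template) all describe the same reaction.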

Returns:

product – A pydna.amplicon.Amplicon object representing the PCR product. The direction of the PCR product will be the same as for the template sequence.

Return type:

Amplicon

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.readers import read
>>> from pydna.amplify import pcr
>>> from pydna.primer import Primer
>>> template = Dseqrecord("tacactcaccgtctatcattatctactatcgactgtatcatctgatagcac")
>>> from Bio.SeqRecord import SeqRecord
>>> p1 = Primer("tacactcaccgtctatcattatc")
>>> p2 = Primer("cgactgtatcatctgatagcac").reverse_complement()
>>> pcr(p1, p2, template)
Amplicon(51)
>>> pcr([p1, p2], template)
Amplicon(51)
>>> pcr((p1,p2,), template)
Amplicon(51)
>>>
class pydna.all.Assembly(frags: List[Dseqrecord], limit: int = 25, algorithm: Callable[[str, str, int], List[Tuple[int, int, int]]] = common_sub_strings)[source]#

Bases: object

Assembly of a list of linear DNA fragments into linear or circular constructs. The Assembly is meant to replace the Assembly method as it is easier to use. Accepts a list of Dseqrecords (source fragments) to initiate an Assembly object. Several methods are available for analysis of overlapping sequences, graph construction and assembly.

Parameters:
  • fragments (list) – a list of Dseqrecord objects.

  • limit (int, optional) – The shortest shared homology to be considered

  • algorithm (function, optional) – The algorithm used to determine the shared sequences.

  • max_nodes (int) – The maximum number of nodes in the graph. This can be tweaked to manage sequences with a high number of shared sub sequences.

Examples

>>> from pydna.assembly import Assembly
>>> from pydna.dseqrecord import Dseqrecord
>>> a = Dseqrecord("acgatgctatactgCCCCCtgtgctgtgctcta")
>>> b = Dseqrecord("tgtgctgtgctctaTTTTTtattctggctgtatc")
>>> c = Dseqrecord("tattctggctgtatcGGGGGtacgatgctatactg")
>>> x = Assembly((a,b,c), limit=14)
>>> x
Assembly
fragments....: 33bp 34bp 35bp
limit(bp)....: 14
G.nodes......: 6
algorithm....: common_sub_strings
>>> x.assemble_circular()
[Contig(o59), Contig(o59)]
>>> x.assemble_circular()[0].seq.watson
'acgatgctatactgCCCCCtgtgctgtgctctaTTTTTtattctggctgtatcGGGGGt'
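The length of the circular product in the example above can be checked by hand. The sketch below is a simplification and not the algorithm used by Assembly: it only looks for overlaps between the end of one fragment and the start of the next, whereas the class searches for shared sequences anywhere in the fragments and builds a graph from them. The helper terminal_overlap is hypothetical.

```python
def terminal_overlap(left, right, limit):
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Returns 0 if the longest such overlap is shorter than `limit`.
    """
    for size in range(min(len(left), len(right)), limit - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


a = "acgatgctatactgCCCCCtgtgctgtgctcta"
b = "tgtgctgtgctctaTTTTTtattctggctgtatc"
c = "tattctggctgtatcGGGGGtacgatgctatactg"

# Overlaps for the circular order a -> b -> c -> a.
overlaps = [terminal_overlap(x, y, 14) for x, y in ((a, b), (b, c), (c, a))]

# Each shared region is counted once in the circular product.
circular_length = len(a) + len(b) + len(c) - sum(overlaps)
print(overlaps, circular_length)  # [14, 15, 14] 59
```

The result, 59 bp, agrees with the Contig(o59) objects returned by assemble_circular().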
assemble_linear(**kwargs)#
assemble_circular(**kwargs)#
pydna.all.genbank(accession: str = 'CS570233.1', *args, email=None, **kwargs) → Dseqrecord[source]#

Download a Genbank nucleotide record.

This function takes the same parameters as the pydna.genbank.Genbank.nucleotide() method. The email address stored in the pydna_email environment variable is used. The easiest way to set this permanently is to edit the pydna.ini file. See the documentation of pydna.open_config_folder().

If no accession is given, a very short Genbank entry is used as an example (see below). This can be useful for testing the connection to Genbank.

Please note that this result is also cached by default by settings in the pydna.ini file. See the documentation of pydna.open_config_folder()

LOCUS       CS570233                  14 bp    DNA     linear   PAT 18-MAY-2007
DEFINITION  Sequence 6 from Patent WO2007025016.
ACCESSION   CS570233
VERSION     CS570233.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1
  AUTHORS   Shaw,R.W. and Cottenoir,M.
  TITLE     Inhibition of metallo-beta-lactamase by double-stranded dna
  JOURNAL   Patent: WO 2007025016-A1 6 01-MAR-2007;
            Texas Tech University System (US)
FEATURES             Location/Qualifiers
     source          1..14
                     /organism="synthetic construct"
                     /mol_type="unassigned DNA"
                     /db_xref="taxon:32630"
                     /note="This is a 14bp aptamer inhibitor."
ORIGIN
        1 atgttcctac atga
//
class pydna.all.Genbank(users_email: str, *, tool: str = 'pydna')[source]#

Bases: object

Class to facilitate download from genbank. It is easier and quicker to use the pydna.genbank.genbank() function directly.

Parameters:

users_email (string) – Has to be a valid email address. You should always tell Genbank who you are, so that they can contact you.

Examples

>>> from pydna.genbank import Genbank
>>> gb=Genbank("bjornjobb@gmail.com")
>>> rec = gb.nucleotide("LP002422.1")   # <- entry from genbank
>>> print(len(rec))
1
nucleotide(item: str, seq_start: int | None = None, seq_stop: int | None = None, strand: Literal[1, 2] = 1) → Dseqrecord[source]#

This method downloads a Genbank nucleotide record. The method is cached by default; caching can be controlled through the pydna_cached_funcs environment variable. The best way to do this permanently is to edit the pydna.ini file. See the documentation of pydna.open_config_folder().

Item is a string containing one genbank accession number for a nucleotide file. Genbank nucleotide accession numbers have this format:

A12345 = 1 letter + 5 numerals
AB123456 = 2 letters + 6 numerals

The accession number is sometimes followed by a point and a version number:

BK006936.2
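The two accession formats described above, with the optional version suffix, can be expressed as a regular expression. This is a sketch for illustration only; pydna is not known to validate accessions this way, and the pattern deliberately covers just the two formats listed here (it rejects, for example, RefSeq-style identifiers such as NM_005546). The name is_simple_accession is hypothetical.

```python
import re

# 1 letter + 5 numerals, or 2 letters + 6 numerals, optionally followed
# by a point and a version number.
ACCESSION = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?")


def is_simple_accession(text):
    """True if `text` matches one of the two formats described above."""
    return ACCESSION.fullmatch(text) is not None


for candidate in ("A12345", "AB123456", "BK006936.2", "NM_005546"):
    print(candidate, is_simple_accession(candidate))
```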

Item can also contain optional interval information in the following formats:

BK006936.2 REGION: complement(613900..615202)
NM_005546 REGION: 1..100
NM_005546 REGION: complement(1..100)
21614549:1-100
21614549:c100-1
21614549 1-100
21614549 c100-1

It is useful to set an interval for large Genbank records to limit the download time. The items above containing interval information can be obtained directly by looking up an entry in Genbank and setting the "Change region shown" option on the upper right side of the page. The ACCESSION line of the displayed Genbank file will have the formatting shown.

Alternatively, seq_start and seq_stop can be set explicitly to the sequence intervals to be downloaded.

If strand is 2, “c”, “C”, “crick”, “Crick”, “antisense”, “Antisense”, “2”, “-” or “-1”, the antisense (Crick) strand is returned; otherwise the sense (Watson) strand is returned.

Result is returned as a Dseqrecord object.

class pydna.all.Dseqrecord(record, *args, circular=None, n=5e-14, source=None, **kwargs)[source]#

Bases: SeqRecord

Dseqrecord is a double-stranded version of the Biopython SeqRecord [1] class. The Dseqrecord object holds a Dseq object describing the sequence. Additionally, Dseqrecord holds meta information about the sequence in the form of a list of SeqFeatures, in the same way as the SeqRecord does.

The Dseqrecord can be initialized with a string, Seq, Dseq, SeqRecord or another Dseqrecord. The sequence information will be stored in a Dseq object in all cases.

Dseqrecord objects can be read or parsed from sequences in FASTA, EMBL or Genbank formats. See the pydna.readers and pydna.parsers modules for further information.

There is a short representation associated with the Dseqrecord. Dseqrecord(-3) represents a linear sequence of length 3 while Dseqrecord(o7) represents a circular sequence of length 7.

Dseqrecord and Dseq share the same concept of length. This length can be larger than each strand alone if they are staggered as in the example below.

<-- length -->
GATCCTTT
     AAAGCCTAG
Parameters:
  • record (string, Seq, SeqRecord, Dseq or other Dseqrecord object) – This data will be used to form the seq property

  • circular (bool, optional) – True or False reflecting the shape of the DNA molecule

  • linear (bool, optional) – True or False reflecting the shape of the DNA molecule

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("aaa")
>>> a
Dseqrecord(-3)
>>> a.seq
Dseq(-3)
aaa
ttt
>>> from pydna.seq import Seq
>>> b=Dseqrecord(Seq("aaa"))
>>> b
Dseqrecord(-3)
>>> b.seq
Dseq(-3)
aaa
ttt
>>> from Bio.SeqRecord import SeqRecord
>>> c=Dseqrecord(SeqRecord(Seq("aaa")))
>>> c
Dseqrecord(-3)
>>> c.seq
Dseq(-3)
aaa
ttt

References

source: Source | None = None#
classmethod from_string(record: str = '', *args, circular=False, n=5e-14, **kwargs)[source]#

docstring.

classmethod from_SeqRecord(record: SeqRecord, *args, circular=None, n=5e-14, **kwargs)[source]#
property circular#

The circular property cannot be set directly. Use looped().

m()[source]#

This method returns the mass of the DNA molecule in grams. This is calculated as the product of the molecular weight of the Dseq object and the amount of substance n (in mol) given when the Dseqrecord was created.

extract_feature(n)[source]#

Extracts a feature and creates a new Dseqrecord object.

Parameters:

n (int) – Indicates the feature to extract

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("atgtaa")
>>> a.add_feature(2,4)
>>> b=a.extract_feature(0)
>>> b
Dseqrecord(-2)
>>> b.seq
Dseq(-2)
gt
ca
add_feature(x=None, y=None, seq=None, type_='misc', strand=1, *args, **kwargs)[source]#

Add a feature of type misc to the feature list of the sequence.

Parameters:
  • x (int) – Indicates start of the feature

  • y (int) – Indicates end of the feature

Examples

>>> from pydna.seqrecord import SeqRecord
>>> a=SeqRecord("atgtaa")
>>> a.features
[]
>>> a.add_feature(2,4)
>>> a.features
[SeqFeature(SimpleLocation(ExactPosition(2), ExactPosition(4), strand=1), type='misc', qualifiers=...)]
seguid()[source]#

Url safe SEGUID for the sequence.

This checksum is the same as seguid but with base64.urlsafe encoding instead of the normal base64. This means that the characters + and / are replaced with - and _ so that the checksum can be part of a URL.
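The difference between the two encodings can be shown with the standard library alone. The sketch below hashes a plain uppercase string with SHA-1, which is how the original SEGUID checksum is commonly described; it does not reproduce the double-stranded checksum returned by seguid(), so the printed values will not match the doctest below. The function name sha1_checksums is hypothetical.

```python
import base64
import hashlib


def sha1_checksums(sequence):
    """Return (standard, url_safe) base64 encodings of a SHA-1 digest."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    # Trailing "=" padding is stripped in both variants.
    standard = base64.b64encode(digest).decode("ascii").rstrip("=")
    url_safe = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return standard, url_safe


standard, url_safe = sha1_checksums("aa")
print(standard)
print(url_safe)
```

The two strings differ only where the standard alphabet would use + or /, which the URL-safe alphabet writes as - and _.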

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a = Dseqrecord("aa")
>>> a.seguid()
'ldseguid=TEwydy0ugvGXh3VJnVwgtxoyDQA'
looped()[source]#

Circular version of the Dseqrecord object.

The underlying linear Dseq object has to have compatible ends.

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("aaa")
>>> a
Dseqrecord(-3)
>>> b=a.looped()
>>> b
Dseqrecord(o3)
>>>
tolinear()[source]#

Returns a linear, blunt copy of a circular Dseqrecord object. The underlying Dseq object has to be circular.

This method is deprecated, use slicing instead. See example below.

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("aaa", circular = True)
>>> a
Dseqrecord(o3)
>>> b=a[:]
>>> b
Dseqrecord(-3)
>>>
terminal_transferase(nucleotides='a')[source]#

docstring.

format(format: str = 'gb')[source]#

Returns the sequence as a string using a format supported by Biopython SeqIO [2]. The default is “gb”, which is short for Genbank. Allowed formats include:

  • “fasta”: The standard FASTA format.

  • “fasta-2line”: No line wrapping and exactly two lines per record.

  • “genbank” (or “gb”): The GenBank flat file format.

  • “embl”: The EMBL flat file format.

  • “imgt”: The IMGT variant of the EMBL format.

The format string can be modified with the keyword “dscode” if the underlying dscode string is desired in the output. For example:

Dseqrecord("PEXIGATCQFZJ").format("fasta-2line dscode")

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> x=Dseqrecord("aaa")
>>> x.annotations['date'] = '02-FEB-2013'
>>> x
Dseqrecord(-3)
>>> print(x.format("gb"))
LOCUS       name                       3 bp    DNA     linear   UNK 02-FEB-2013
DEFINITION  description.
ACCESSION   id
VERSION     id
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
ORIGIN
        1 aaa
//
>>> print(Dseqrecord("PEXIGATCQFZJ").format("fasta-2line"))
>id description
GATCGATCGATC
>>> print(Dseqrecord("PEXIGATCQFZJ").format("fasta-2line dscode"))
>id description
PEXIGATCQFZJ

References

write(filename=None, f='gb')[source]#

Writes the Dseqrecord to a file using the format f, which must be a format supported by Biopython SeqIO for writing [3]. Default is “gb” which is short for Genbank. Note that Biopython SeqIO reads more formats than it writes.

Filename is the path to the file where the sequence is to be written. The filename is optional; if it is not given, the description property (string) is used together with the format.

If obj is the Dseqrecord object, the default file name will be:

<obj.locus>.<f>

Where <f> is “gb” by default. If the file already exists and the sequence it contains is different, a new file name will be used so that the old file is not lost:

<obj.locus>_NEW.<f>

References

find(other)[source]#
find_aminoacids(other)[source]#
>>> from pydna.dseqrecord import Dseqrecord
>>> s=Dseqrecord("atgtacgatcgtatgctggttatattttag")
>>> s.seq.translate()
ProteinSeq('MYDRMLVIF*')
>>> "RML" in s
True
>>> "MMM" in s
False
>>> s.seq.rc().translate()
ProteinSeq('LKYNQHTIVH')
>>> "QHT" in s.rc()
True
>>> "QHT" in s
False
>>> slc = s.find_aa("RML")
>>> slc
slice(9, 18, None)
>>> s[slc]
Dseqrecord(-9)
>>> code = s[slc].seq
>>> code
Dseq(-9)
cgtatgctg
gcatacgac
>>> code.translate()
ProteinSeq('RML')
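The slice returned in the example above can be reproduced with a small stand-alone sketch. It assumes the standard genetic code and searches only reading frame 0 of the given strand; how pydna handles other frames is not stated here, so this should not be read as the library's implementation. The names translate and find_aa_slice are hypothetical.

```python
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, with codons enumerated in t, c, a, g order.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate reading frame 0; incomplete trailing codons are ignored."""
    dna = dna.lower()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def find_aa_slice(dna, peptide):
    """Slice of `dna` that encodes `peptide` in frame 0, or None."""
    index = translate(dna).find(peptide)
    if index == -1:
        return None
    return slice(3 * index, 3 * (index + len(peptide)))


s = "atgtacgatcgtatgctggttatattttag"
slc = find_aa_slice(s, "RML")
print(slc, s[slc])  # slice(9, 18, None) cgtatgctg
```

Each amino acid corresponds to three bases, so a match at position 3 of the protein maps to positions 9 to 18 of the DNA.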
find_aa(other)#

Alias of find_aminoacids().
map_trace_files(pth, limit=25)[source]#
linearize(*enzymes)[source]#

Similar to cut().

Throws an exception if there is not exactly one cut, i.e. if there are no digestion products or more than one.

no_cutters(batch: RestrictionBatch = None)[source]#

docstring.

unique_cutters(batch: RestrictionBatch = None)[source]#

docstring.

once_cutters(batch: RestrictionBatch = None)[source]#

docstring.

twice_cutters(batch: RestrictionBatch = None)[source]#

docstring.

n_cutters(n=3, batch: RestrictionBatch = None)[source]#

docstring.

cutters(batch: RestrictionBatch = None)[source]#

docstring.

number_of_cuts(*enzymes)[source]#

The number of cuts by digestion with the Restriction enzymes contained in the iterable.

reverse_complement()[source]#

Reverse complement.
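What reverse complementing does to the two strands can be sketched with plain strings. This is an illustration for unambiguous bases only, not pydna's implementation; reverse_complement and duplex are hypothetical helpers.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(watson):
    """Complement every base and reverse the result."""
    return watson.translate(COMPLEMENT)[::-1]


def duplex(watson):
    """Two-line view: the Watson strand on top, its complement below."""
    return watson, watson.translate(COMPLEMENT)


print(duplex("ggaatt"))                      # ('ggaatt', 'ccttaa')
print(duplex(reverse_complement("ggaatt")))  # ('aattcc', 'ttaagg')
```

The molecule itself is unchanged; the operation only swaps which strand is shown on top.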

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("ggaatt")
>>> a
Dseqrecord(-6)
>>> a.seq
Dseq(-6)
ggaatt
ccttaa
>>> a.reverse_complement().seq
Dseq(-6)
aattcc
ttaagg
>>>
rc()#

Alias of reverse_complement().
synced(ref, limit=25)[source]#

This method returns a new circular sequence (Dseqrecord object), which has been rotated in such a way that there is maximum overlap between the sequence and ref, which may be a string, Biopython Seq, SeqRecord object or another Dseqrecord object.

The reason for using this could be to rotate a new recombinant plasmid so that it starts at the same position after cloning. See the example below:

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("gaat", circular=True)
>>> a.seq
Dseq(o4)
gaat
ctta
>>> d = a[2:] + a[:2]
>>> d.seq
Dseq(-4)
atga
tact
>>> insert=Dseqrecord("CCC")
>>> recombinant = (d+insert).looped()
>>> recombinant.seq
Dseq(o7)
atgaCCC
tactGGG
>>> recombinant.synced(a).seq
Dseq(o7)
gaCCCat
ctGGGta
upper()[source]#

Returns an uppercase copy.

>>> from pydna.dseqrecord import Dseqrecord
>>> my_seq = Dseqrecord("aAa")
>>> my_seq.seq
Dseq(-3)
aAa
tTt
>>> upper = my_seq.upper()
>>> upper.seq
Dseq(-3)
AAA
TTT
>>>

Returns:

Dseqrecord object in uppercase

Return type:

Dseqrecord

lower()[source]#
>>> from pydna.dseqrecord import Dseqrecord
>>> my_seq = Dseqrecord("aAa")
>>> my_seq.seq
Dseq(-3)
aAa
tTt
>>> upper = my_seq.upper()
>>> upper.seq
Dseq(-3)
AAA
TTT
>>> lower = my_seq.lower()
>>> lower
Dseqrecord(-3)
>>>
Returns:

Dseqrecord object in lowercase

Return type:

Dseqrecord

orfs(minsize=300)[source]#

docstring.

orfs_to_features(minsize=300)[source]#

docstring.

copy_gb_to_clipboard()[source]#

docstring.

copy_fasta_to_clipboard()[source]#

docstring.

figure(feature=0, highlight='\x1b[48;5;11m', plain='\x1b[0m')[source]#

docstring.

shifted(shift)[source]#

Circular Dseqrecord with a new origin <shift>.

This only works on circular Dseqrecords. If we consider the following circular sequence:

GAAAT   <-- watson strand
CTTTA   <-- crick strand

The T and the G on the watson strand are linked together, as are the A and the C of the crick strand.

If shift is 1, this indicates a new origin at position 1:

new origin at the | symbol:

G|AAAT
C|TTTA

new sequence:

AAATG
TTTAC

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("aaat",circular=True)
>>> a
Dseqrecord(o4)
>>> a.seq
Dseq(o4)
aaat
ttta
>>> b=a.shifted(1)
>>> b
Dseqrecord(o4)
>>> b.seq
Dseq(o4)
aata
ttat
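On the level of a single strand, moving the origin of a circular sequence is a rotation of the string. The sketch below shows this for the Watson strand only; it is not pydna's implementation (which also has to move features), and the name shifted_string is hypothetical.

```python
def shifted_string(circular_sequence, shift):
    """Rotate a non-empty circular sequence so that it starts at `shift`."""
    shift %= len(circular_sequence)
    return circular_sequence[shift:] + circular_sequence[:shift]


print(shifted_string("GAAAT", 1))  # AAATG
print(shifted_string("aaat", 1))   # aata
```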
cut(*enzymes)[source]#

Digest a Dseqrecord object with one or more restriction enzymes.

Returns a list of linear Dseqrecords. If there are no cuts, an empty list is returned.

See also Dseq.cut().

Parameters:

enzymes (enzyme object or iterable of such objects) – A Bio.Restriction.XXX restriction object or iterable of such.

Returns:

Dseqrecord_frags – list of Dseqrecord objects formed by the digestion

Return type:

list

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("ggatcc")
>>> from Bio.Restriction import BamHI
>>> a.cut(BamHI)
(Dseqrecord(-5), Dseqrecord(-5))
>>> frag1, frag2 = a.cut(BamHI)
>>> frag1.seq
Dseq(-5)
g
cctag
>>> frag2.seq
Dseq(-5)
gatcc
    g
apply_cut(left_cut, right_cut)[source]#
history()[source]#

Returns a string representation of the cloning history of the sequence. Returns an empty string if the sequence has no source.

Check the documentation notebooks for extensive examples.

Returns:

A string representation of the cloning history of the sequence.

Return type:

str

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.assembly2 import gibson_assembly
>>> fragments = [
...    Dseqrecord("TTTTacgatAAtgctccCCCC", circular=False, name="fragment1"),
...    Dseqrecord("CCCCtcatGGGG", circular=False, name="fragment2"),
...    Dseqrecord("GGGGatataTTTT", circular=False, name="fragment3"),
... ]
>>> product, *_ = gibson_assembly(fragments, limit=4)
>>> product.name = "product_name"
>>> print(product.history())
╙── product_name (Dseqrecord(o34))
    └─╼ GibsonAssemblySource
        ├─╼ fragment1 (Dseqrecord(-21))
        ├─╼ fragment2 (Dseqrecord(-12))
        └─╼ fragment3 (Dseqrecord(-13))
join(fragments)[source]#

Join an iterable of Dseqrecords with this instance as the separator.

Example:

>>> sep = Dseqrecord("a")
>>> joined = sep.join([Dseqrecord("A"), Dseqrecord("B"), Dseqrecord("C")])
>>> joined
Dseqrecord(-5)
>>> joined.seq
Dseq(-5)
AaBaC
TtVtG
class pydna.all.Dseq(watson: str | bytes, crick: str | bytes | None = None, ovhg=None, circular=False, pos=0)[source]#

Bases: Seq

Dseq describes a double stranded DNA fragment, linear or circular.

Dseq can be initialized in two ways. The first uses two strings, representing the Watson (upper, sense) strand and the Crick (lower, antisense) strand, together with an optional value describing the stagger between the strands on the left side (ovhg).

Alternatively, a single string representation using dsIUPAC codes can be used. If a single string is used, the letters of that string are interpreted as base pairs rather than single bases. For example “A” would indicate the base pair “A/T”. An expanded IUPAC code is used where the letters PEXI have been assigned to GATC on the Watson strand with no pairing base on the Crick strand: G/””, A/””, T/”” and C/””. The letters QFZJ have been assigned the opposite base pairs with an empty Watson strand: “”/G, “”/A, “”/T and “”/C.

PEXIGATCQFZJ  would indicate the linear double-stranded fragment:

GATCGATC
    CTAGCTAG
Parameters:
  • watson (str) – a string representing the Watson (sense) DNA strand or a base pair representation.

  • crick (str, optional) – a string representing the Crick (antisense) DNA strand.

  • ovhg (int, optional) – A positive or negative number to describe the stagger between the Watson and Crick strands. See below for a detailed explanation.

  • circular (bool, optional) – True indicates that sequence is circular, False that it is linear.

Examples

Dseq is a subclass of the Biopython Bio.Seq.Seq class. The constructor can accept two strings representing the Watson (sense) and Crick (antisense) DNA strands. These are interpreted as single stranded DNA. There is a check for complementarity between the strands.

If the DNA molecule is staggered on the left side, an integer ovhg (overhang) must be given, describing the stagger between the Watson and Crick strand in the 5’ end of the fragment.

Additionally, the optional boolean parameter circular can be given to indicate if the DNA molecule is circular.

The most common usage of the Dseq class is probably not to use it directly, but to create it as part of a Dseqrecord object (see pydna.dseqrecord.Dseqrecord). This works in the same way as for the relationship between the Bio.Seq.Seq and Bio.SeqRecord.SeqRecord classes in Biopython.

There are multiple ways of creating a Dseq object directly, listed below, but you can also use the function Dseq.from_full_sequence_and_overhangs() to create a Dseq:

Two arguments (string, string), no overhang provided:

>>> from pydna.dseq import Dseq
>>> Dseq("gggaaat","ttt")
Dseq(-7)
gggaaat
   ttt

If Watson and Crick are given, but not ovhg, an attempt will be made to find the best annealing between the strands. There are important limitations to this. If there are several ways to anneal the strands, this will fail. For long fragments it is quite slow.

Three arguments (string, string, ovhg=int):

The ovhg parameter is an integer describing the length of the Crick strand overhang on the left side (the 5’ end of Watson strand).

The ovhg parameter controls the stagger at the five prime end:

dsDNA       overhang

  nnn...    2
nnnnn...

 nnnn...    1
nnnnn...

nnnnn...    0
nnnnn...

nnnnn...   -1
 nnnn...

nnnnn...   -2
  nnn...

Example of creating Dseq objects with different amounts of stagger:

>>> Dseq(watson="att", crick="acata", ovhg=-2)
Dseq(-7)
att
  ataca
>>> Dseq(watson="ata",crick="acata",ovhg=-1)
Dseq(-6)
ata
 ataca
>>> Dseq(watson="taa",crick="actta",ovhg=0)
Dseq(-5)
taa
attca
>>> Dseq(watson="aag",crick="actta",ovhg=1)
Dseq(-5)
 aag
attca
>>> Dseq(watson="agt",crick="actta",ovhg=2)
Dseq(-5)
  agt
attca
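The lengths reported in the examples above follow from the stagger. A positive ovhg shifts the Watson strand to the right, a negative ovhg shifts the Crick strand to the right, and the length is the full span covered by either strand. The sketch below captures this arithmetic only; duplex_length is a hypothetical helper and is not part of pydna.

```python
def duplex_length(watson, crick, ovhg):
    """Total span of a staggered duplex for a given left-side stagger."""
    if ovhg >= 0:
        # Watson starts `ovhg` positions to the right of Crick.
        return max(ovhg + len(watson), len(crick))
    # Crick starts `-ovhg` positions to the right of Watson.
    return max(len(watson), len(crick) - ovhg)


for watson, crick, ovhg in [
    ("att", "acata", -2),
    ("ata", "acata", -1),
    ("taa", "actta", 0),
    ("aag", "actta", 1),
    ("agt", "actta", 2),
]:
    print(ovhg, duplex_length(watson, crick, ovhg))
```

For the five examples above this gives 7, 6, 5, 5 and 5, the same numbers shown in the Dseq(-7) to Dseq(-5) representations.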

If the ovhg parameter is specified a Crick strand also needs to be supplied, or an exception is raised.

>>> Dseq(watson="agt", ovhg=2)
Traceback (most recent call last):
    ...
ValueError: ovhg (overhang) defined without a crick strand.

The shape or topology of the fragment is set by the circular parameter, True or False (default).

>>> Dseq("aaa", "ttt", ovhg = 0)  # A linear sequence by default
Dseq(-3)
aaa
ttt
>>> Dseq("aaa", "ttt", ovhg = 0, circular = False)  # A linear sequence if circular is False
Dseq(-3)
aaa
ttt
>>> Dseq("aaa", "ttt", ovhg = 0, circular = True)  # A circular sequence
Dseq(o3)
aaa
ttt
>>> Dseq("aaa", "ttt", ovhg=1, circular = False)
Dseq(-4)
 aaa
ttt
>>> Dseq("aaa","ttt",ovhg=-1)
Dseq(-4)
aaa
 ttt
>>> Dseq("aaa", "ttt", circular = True , ovhg=0)
Dseq(o3)
aaa
ttt
>>> a=Dseq("tttcccc","aaacccc")
>>> a
Dseq(-11)
    tttcccc
ccccaaa
>>> a.ovhg
4
>>> b=Dseq("ccccttt","ccccaaa")
>>> b
Dseq(-11)
ccccttt
    aaacccc
>>> b.ovhg
-4
>>>

dsIUPAC [4] is an extension to the IUPAC alphabet used to describe single-stranded (ss) regions:

    aaaGATC       GATCccc          ad-hoc representations
CTAGttt               gggCTAG

QFZJaaaPEXI       PEXIcccQFZJ      dsIUPAC
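The correspondence between the two notations can be sketched as a small decoder. The letter mapping below is inferred from the examples in this documentation (PEXI and QFZJ both stand for G, A, T, C, on the Watson and Crick side respectively); it handles uppercase single-strand codes and unambiguous bases only and is not pydna's implementation. The function name render is hypothetical.

```python
WATSON_ONLY = dict(zip("PEXI", "GATC"))  # base on Watson, gap on Crick
CRICK_ONLY = dict(zip("QFZJ", "GATC"))   # gap on Watson, base pairs with Crick
COMPLEMENT = str.maketrans("GATCgatc", "CTAGctag")


def render(dscode):
    """Return the (top, bottom) lines of the two-line representation."""
    top, bottom = [], []
    for symbol in dscode:
        if symbol in WATSON_ONLY:
            top.append(WATSON_ONLY[symbol])
            bottom.append(" ")
        elif symbol in CRICK_ONLY:
            top.append(" ")
            bottom.append(CRICK_ONLY[symbol].translate(COMPLEMENT))
        else:
            # Ordinary letters are complete base pairs.
            top.append(symbol)
            bottom.append(symbol.translate(COMPLEMENT))
    return "".join(top).rstrip(), "".join(bottom).rstrip()


for line in render("QFZJaaaPEXI"):
    print(line)
```

The bottom line is written 3’ to 5’, as in the representations shown throughout this page.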

Coercing to string

>>> str(a)
'ggggtttcccc'

A Dseq object can be longer than either the watson or the crick strand.

<-- length -->
GATCCTTT
     AAAGCCTAG

<-- length -->
      GATCCTTT
AAAGCCCTA

The slicing of a linear Dseq object works mostly as it does for a string.

>>> s="ggatcc"
>>> s[2:3]
'a'
>>> s[2:4]
'at'
>>> s[2:4:-1]
''
>>> s[::2]
'gac'
>>> from pydna.dseq import Dseq
>>> d=Dseq(s, circular=False)
>>> d[2:3]
Dseq(-1)
a
t
>>> d[2:4]
Dseq(-2)
at
ta
>>> d[2:4:-1]
Dseq(-0)


>>> d[::2]
Dseq(-3)
gac
ctg

The slicing of a circular Dseq object has a slightly different meaning.

>>> s="ggAtCc"
>>> d=Dseq(s, circular=True)
>>> d
Dseq(o6)
ggAtCc
ccTaGg
>>> d[4:3]
Dseq(-5)
CcggA
GgccT
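For a single strand, slicing a circular sequence can be sketched as follows: when start is not smaller than stop, the slice runs to the end of the string and continues from the beginning. This is a minimal sketch for non-negative indices and a step of one; circular_slice is a hypothetical helper and not pydna's implementation.

```python
def circular_slice(sequence, start, stop):
    """Slice a circular sequence, wrapping around when start >= stop."""
    if start < stop:
        return sequence[start:stop]
    return sequence[start:] + sequence[:stop]


print(circular_slice("ggAtCc", 4, 3))  # CcggA
print(circular_slice("ggatcc", 3, 3))  # tccgga
```

With start equal to stop the whole molecule is returned, linearized at that position.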

The slice [X:X] produces an empty slice for a string, while this will return the linearized sequence starting at X:

>>> s="ggatcc"
>>> d=Dseq(s, circular=True)
>>> d
Dseq(o6)
ggatcc
cctagg
>>> d[3:3]
Dseq(-6)
tccgga
aggcct
>>>
classmethod quick(data: bytes, *args, circular=False, pos=0, **kwargs)[source]#

Fastest way to instantiate an object of the Dseq class.

No checks of parameters are made. Does not call Bio.Seq.Seq.__init__() which has lots of time consuming checks.

classmethod from_representation(dsdna: str, *args, **kwargs)[source]#
classmethod from_full_sequence_and_overhangs(full_sequence: str, crick_ovhg: int, watson_ovhg: int)[source]#

Create a linear Dseq object from a full sequence and the 3’ overhangs of each strand.

The order of the parameters is like this because the 3’ overhang of the crick strand is the one on the left side of the sequence.

Parameters:
  • full_sequence (str) – The full sequence of the Dseq object.

  • crick_ovhg (int) – The overhang of the crick strand in the 3’ end. Equivalent to Dseq.ovhg.

  • watson_ovhg (int) – The overhang of the watson strand in the 5’ end.

Returns:

A Dseq object.

Return type:

Dseq

Examples

>>> Dseq.from_full_sequence_and_overhangs('AAAAAA', crick_ovhg=2, watson_ovhg=2)
Dseq(-6)
  AAAA
TTTT
>>> Dseq.from_full_sequence_and_overhangs('AAAAAA', crick_ovhg=-2, watson_ovhg=2)
Dseq(-6)
AAAAAA
  TT
>>> Dseq.from_full_sequence_and_overhangs('AAAAAA', crick_ovhg=2, watson_ovhg=-2)
Dseq(-6)
  AA
TTTTTT
>>> Dseq.from_full_sequence_and_overhangs('AAAAAA', crick_ovhg=-2, watson_ovhg=-2)
Dseq(-6)
AAAA
  TTTT
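The four examples above can be reproduced with string slicing. In the sketch below, which is inferred from those examples and is not pydna's implementation, a positive crick_ovhg removes bases from the left end of the Watson strand and a negative one removes them from the Crick strand; watson_ovhg works the same way at the right end. The name render_with_overhangs is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def render_with_overhangs(full_sequence, crick_ovhg, watson_ovhg):
    """Return the (top, bottom) lines of the two-line representation."""
    n = len(full_sequence)
    watson_start = max(crick_ovhg, 0)
    watson_end = n - max(-watson_ovhg, 0)
    crick_start = max(-crick_ovhg, 0)
    crick_end = n - max(watson_ovhg, 0)
    top = " " * watson_start + full_sequence[watson_start:watson_end]
    bottom = " " * crick_start + full_sequence[crick_start:crick_end].translate(COMPLEMENT)
    return top, bottom


for crick_ovhg, watson_ovhg in [(2, 2), (-2, 2), (2, -2), (-2, -2)]:
    top, bottom = render_with_overhangs("AAAAAA", crick_ovhg, watson_ovhg)
    print(top)
    print(bottom)
```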
property watson: str#

The watson (upper) strand of the double stranded fragment 5’-3’.

property crick: str#

The crick (lower) strand of the double stranded fragment 5’-3’.

property left_ovhg: int#

The 5’ overhang of the lower strand compared to the upper.

See module docstring for more information.

property ovhg: int#

The 5’ overhang of the lower strand compared to the upper.

See module docstring for more information.

property right_ovhg: int#

Overhang at the right side (end).

property watson_ovhg: int#

Overhang at the right side (end).

to_blunt_string() → str#

A string representation of the sequence. The returned string is the watson strand of a blunt version of the sequence.

>>> ds = Dseq.from_representation(
... '''
... GAATTC
...   TAA
... ''')
>>> str(ds)
'GAATTC'
>>> ds = Dseq.from_representation(
... '''
...   ATT
... CTTAAG
... ''')
>>> str(ds)
'GAATTC'
Returns:

A string representation of the sequence.

Return type:

str

mw() float[source]#

The molecular weight of the DNA/RNA molecule in g/mol.

The molecular weight data in Biopython Bio.Data.IUPACData is used. The DNA is assumed to have a 5’-phosphate as many DNA fragments from restriction digestion do:

 P - G-A-T-T-A-C-A - OH
     | | | | | | |
OH - C-T-A-A-T-G-T - P

The molecular weights listed in the unambiguous_dna_weights dictionary refer to free monophosphate nucleotides. One water molecule is removed for every phosphodiester bond formed between nucleotides. For linear molecules, the weight of one water molecule is added to account for the terminal hydroxyl group and a hydrogen on the 5’ terminal phosphate group.

If the DNA is discontinuous, the internal 5’ end is assumed to have a phosphate and the internal 3’ end a hydroxyl group:

 P - G---A---T - OH  P - C---A - OH
     |   |   |           |   |
OH - C---T---A---A---T---G---T - P

Examples

>>> from pydna.dseq import Dseq
>>> ds_lin_obj = Dseq("GATTACA")
>>> ds_lin_obj
Dseq(-7)
GATTACA
CTAATGT
>>> round(ds_lin_obj.mw(), 1)
4359.8
>>> ds_circ_obj = Dseq("GATTACA", circular = True)
>>> round(ds_circ_obj.mw(), 1)
4323.8
>>> ssobj = Dseq("PEXXEIE")
>>> ssobj
Dseq(-7)
GATTACA
|||||||
>>> round(ssobj.mw(), 1)
2184.4
>>> ds_lin_obj2 = Dseq("GATZFCA")
>>> ds_lin_obj2
Dseq(-7)
GAT  CA
CTAATGT
>>> round(ds_lin_obj2.mw(), 1)
3724.4
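The water bookkeeping described above can be reproduced for continuous strands in a few lines of plain Python. This is a sketch under stated assumptions, not pydna's code: the nucleotide weights are taken to be the free monophosphate values from Biopython's unambiguous_dna_weights, the weight of water (18.015 g/mol) is an assumed average value, and `strand_mw` and `duplex_mw` are hypothetical helper names. Discontinuous strands are not handled.

```python
# Free monophosphate nucleotide weights in g/mol (assumed to match
# Bio.Data.IUPACData.unambiguous_dna_weights).
WEIGHTS = {"A": 331.2218, "C": 307.1971, "G": 347.2212, "T": 322.2085}
WATER = 18.015  # assumed average weight of one water molecule

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def strand_mw(strand, circular=False):
    """Weight of one continuous strand carrying a 5'-phosphate."""
    total = sum(WEIGHTS[nt] for nt in strand.upper())
    # A linear strand of n nucleotides has n - 1 phosphodiester bonds,
    # a circular one has n; each bond costs one water molecule.
    bonds = len(strand) if circular else len(strand) - 1
    return total - bonds * WATER

def duplex_mw(watson, circular=False):
    """Weight of a blunt duplex defined by its watson strand."""
    crick = watson.upper().translate(COMPLEMENT)[::-1]
    return strand_mw(watson, circular) + strand_mw(crick, circular)

print(round(duplex_mw("GATTACA"), 1))                 # 4359.8
print(round(duplex_mw("GATTACA", circular=True), 1))  # 4323.8
print(round(strand_mw("GATTACA"), 1))                 # 2184.4
```

Under these assumptions the values agree with the linear, circular and single stranded doctest results above.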
find(sub: _SeqAbstractBaseClass | str | bytes, start=0, end=sys.maxsize) int[source]#

This method behaves like the Python string method of the same name.

Returns an integer, the index of the first occurrence of substring argument sub in the (sub)sequence given by [start:end].

Returns -1 if the subsequence is NOT found.

The search is case sensitive.

Parameters:
  • sub (string or Seq object) – a string or another Seq object to look for.

  • start (int, optional) – slice start.

  • end (int, optional) – slice end.

Examples

>>> from pydna.dseq import Dseq
>>> seq = Dseq("agtaagt")
>>> seq
Dseq(-7)
agtaagt
tcattca
>>> seq.find("taa")
2
>>> seq = Dseq(watson="agta",crick="actta",ovhg=-2)
>>> seq
Dseq(-7)
agta
  attca
>>> seq.find("taa")
-1
>>> seq = Dseq(watson="agta",crick="actta",ovhg=-2)
>>> seq
Dseq(-7)
agta
  attca
>>> seq.find("ta")
2
reverse_complement() Dseq[source]#

Dseq object where watson and crick have switched places.

This represents the same double stranded sequence.

Examples

>>> from pydna.dseq import Dseq
>>> a=Dseq("catcgatc")
>>> a
Dseq(-8)
catcgatc
gtagctag
>>> b=a.reverse_complement()
>>> b
Dseq(-8)
gatcgatg
ctagctac
>>>
rc() Dseq#

Documented identically to reverse_complement() above; the description and example there apply unchanged.
shifted(shift: int) DseqType[source]#

Shifted copy of a circular Dseq object.

>>> ds = Dseq("TAAG", circular = True)
>>> ds.shifted(1) # First bp moved to right side:
Dseq(o4)
AAGT
TTCA
>>> ds.shifted(-1) # Last bp moved to left side:
Dseq(o4)
GTAA
CATT
looped() DseqType[source]#

Circularized Dseq object.

This can only be done if the two ends are compatible, otherwise a TypeError is raised.

Examples

>>> from pydna.dseq import Dseq
>>> a=Dseq("catcgatc")
>>> a
Dseq(-8)
catcgatc
gtagctag
>>> a.looped()
Dseq(o8)
catcgatc
gtagctag
>>> b = Dseq("iatcgatj")
>>> b
Dseq(-8)
catcgat
 tagctag
>>> b.looped()
Dseq(o7)
catcgat
gtagcta
>>> c = Dseq("jatcgati")
>>> c
Dseq(-8)
 atcgatc
gtagcta
>>> c.looped()
Dseq(o7)
catcgat
gtagcta
>>> d = Dseq("ietcgazj")
>>> d
Dseq(-8)
catcga
  agctag
>>> d.looped()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pydna/dsdna.py", line 357, in looped
    if type5 == type3 and str(sticky5) == str(rc(sticky3)):
TypeError: DNA cannot be circularized.
5' and 3' sticky ends not compatible!
>>>
five_prime_end() Tuple[str, str][source]#

Returns a 2-tuple of strings describing the structure of the 5’ end of the DNA fragment.

The tuple contains (type, sticky), where type is “5’”, “3’” or “blunt”. sticky is always in lower case and contains the sequence of the protruding end in the 5’-3’ direction; it is empty for a blunt end.

See examples below:

Examples

>>> from pydna.dseq import Dseq
>>> a = Dseq("aa", "tttg", ovhg=2)
>>> a
Dseq(-4)
  aa
gttt
>>> a.five_prime_end()
("3'", 'tg')
>>> a = Dseq("caaa", "tt", ovhg=-2)
>>> a
Dseq(-4)
caaa
  tt
>>> a.five_prime_end()
("5'", 'ca')
>>> a = Dseq("aa", "tt")
>>> a
Dseq(-2)
aa
tt
>>> a.five_prime_end()
('blunt', '')
three_prime_end() Tuple[str, str][source]#

Returns a 2-tuple of strings describing the structure of the 3’ end of the DNA fragment.

>>> a = Dseq("aa", "gttt", ovhg=0)
>>> a
Dseq(-4)
aa
tttg
>>> a.three_prime_end()
("5'", 'gt')
>>> a = Dseq("aaac", "tt", ovhg=0)
>>> a
Dseq(-4)
aaac
tt
>>> a.three_prime_end()
("3'", 'ac')
>>> from pydna.dseq import Dseq
>>> a=Dseq("aaa", "ttt")
>>> a
Dseq(-3)
aaa
ttt
>>> a.three_prime_end()
('blunt', '')
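The (type, sticky) tuples returned by five_prime_end and three_prime_end can be derived from the two strands and the left overhang alone. The sketch below is plain Python and only illustrates the idea; the function bodies are assumptions that reproduce the doctest results above, and `can_loop` is a hypothetical helper built on the comparison visible in the looped() traceback (`type5 == type3 and str(sticky5) == str(rc(sticky3))`).

```python
COMPLEMENT = str.maketrans("acgt", "tgca")

def rc(seq):
    """Reverse complement of a lower-case DNA string."""
    return seq.lower().translate(COMPLEMENT)[::-1]

def five_prime_end(watson, crick, ovhg):
    """Left end of the molecule; both strands are written 5'->3'."""
    if ovhg < 0:   # watson protrudes on the left: a 5' overhang
        return "5'", watson[:-ovhg].lower()
    if ovhg > 0:   # crick protrudes on the left: a 3' overhang
        return "3'", crick[-ovhg:].lower()
    return "blunt", ""

def three_prime_end(watson, crick, ovhg):
    """Right end of the molecule."""
    right = len(watson) - len(crick) + ovhg
    if right > 0:  # watson protrudes on the right: a 3' overhang
        return "3'", watson[-right:].lower()
    if right < 0:  # crick protrudes on the right: a 5' overhang
        return "5'", crick[:-right].lower()
    return "blunt", ""

def can_loop(watson, crick, ovhg):
    """True if the two ends are of the same type and complementary."""
    type5, sticky5 = five_prime_end(watson, crick, ovhg)
    type3, sticky3 = three_prime_end(watson, crick, ovhg)
    return type5 == type3 and sticky5 == rc(sticky3)

print(five_prime_end("aa", "tttg", 2))       # ("3'", 'tg')
print(three_prime_end("aaac", "tt", 0))      # ("3'", 'ac')
print(can_loop("catcga", "gatcga", -2))      # False
```

The last call corresponds to the molecule `d` in the looped() example, whose ends are both 5’ overhangs but not complementary.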
fill_in(nucleotides: None | str = None) DseqType[source]#

Fill in of five prime protruding ends with a DNA polymerase that has only DNA polymerase activity (such as Exo-Klenow [5]). Exo-Klenow is a modified version of the Klenow fragment of E. coli DNA polymerase I, which has been engineered to lack both 3’-5’ proofreading and 5’-3’ exonuclease activities.

The fill-in can be done in the presence of any combination of A, G, C or T. The default is all four nucleotides together.

Parameters:

nucleotides (str)

Examples

>>> from pydna.dseq import Dseq
>>> b=Dseq("caaa", "cttt")
>>> b
Dseq(-5)
caaa
 tttc
>>> b.fill_in()
Dseq(-5)
caaag
gtttc
>>> b.fill_in("g")
Dseq(-5)
caaag
gtttc
>>> b.fill_in("tac")
Dseq(-5)
caaa
 tttc
>>> c=Dseq("aaac", "tttg")
>>> c
Dseq(-5)
 aaac
gttt
>>> c.fill_in()
Dseq(-5)
 aaac
gttt
>>> a=Dseq("aaa", "ttt")
>>> a
Dseq(-3)
aaa
ttt
>>> a.fill_in()
Dseq(-3)
aaa
ttt

References

klenow(nucleotides: None | str = None) DseqType#

Documented identically to fill_in() above; the description and examples there apply unchanged.

nibble_to_blunt() DseqType[source]#

Simulates treatment with a nuclease that has both 5’-3’ and 3’-5’ single strand specific exonuclease activity (such as mung bean nuclease [7]).

Mung bean nuclease is a nuclease enzyme derived from mung bean sprouts that preferentially degrades single-stranded DNA and RNA into 5’-phosphate- and 3’-hydroxyl-containing nucleotides.

Treatment results in blunt DNA, regardless of whether the protruding end is 5’ or 3’.

    ggatcc    ->     gatcc
     ctaggg          ctagg

     ggatcc   ->     ggatc
    tcctag           cctag

>>> from pydna.dseq import Dseq
>>> b=Dseq("caaa", "cttt")
>>> b
Dseq(-5)
caaa
 tttc
>>> b.mung()
Dseq(-3)
aaa
ttt
>>> c=Dseq("aaac", "tttg")
>>> c
Dseq(-5)
 aaac
gttt
>>> c.mung()
Dseq(-3)
aaa
ttt
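Trimming both ends to blunt amounts to removing whichever strand protrudes at each side. A minimal sketch in plain Python, assuming the molecule is described by its two strands (both written 5'-3') and the left overhang; the free function below is a hypothetical stand-in for the method, not pydna's implementation.

```python
def nibble_to_blunt(watson, crick, ovhg):
    """Return (watson, crick, 0) with every single stranded end removed."""
    # Overhang at the right side, measured before any trimming:
    # positive if watson protrudes, negative if crick protrudes.
    right = len(watson) - len(crick) + ovhg
    if ovhg < 0:                      # watson 5' end protrudes on the left
        watson = watson[-ovhg:]
    elif ovhg > 0:                    # crick 3' end protrudes on the left
        crick = crick[:len(crick) - ovhg]
    if right > 0:                     # watson 3' end protrudes on the right
        watson = watson[:len(watson) - right]
    elif right < 0:                   # crick 5' end protrudes on the right
        crick = crick[-right:]
    return watson, crick, 0

print(nibble_to_blunt("caaa", "cttt", -1))  # ('aaa', 'ttt', 0)
print(nibble_to_blunt("aaac", "tttg", 1))   # ('aaa', 'ttt', 0)
```

The two calls mirror the molecules `b` and `c` in the example above.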

References

mung() DseqType#

Documented identically to nibble_to_blunt() above; the description and examples there apply unchanged.

T4(nucleotides=None) DseqType[source]#

Fill in 5’ protruding ends and nibble 3’ protruding ends.

This is done using a DNA polymerase providing 3’-5’ nuclease activity such as T4 DNA polymerase. This can be done in presence of any combination of the four nucleotides A, G, C or T.

T4 DNA polymerase is widely used to “polish” DNA ends because of its strong 3′-5′ exonuclease activity. In the absence of dNTPs, it chews back 3′ overhangs to create blunt ends; in the presence of limiting dNTPs, it can fill in 5′ overhangs; and by carefully controlling reaction time, temperature, and nucleotide supply, defined recessed or blunt termini can be generated.

Tuning the nucleotide set can facilitate engineering of partial sticky ends. The default is all four nucleotides together.

      aaagatc-3        aaa      3' ends are always removed.
      |||       --->   |||      A and T needed or the molecule will
3-ctagttt              ttt      degrade completely.

5-gatcaaa              gatcaaaGATC      5' ends are filled in the
      |||       --->   |||||||||||      presence of GATC
      tttctag-5        CTAGtttctag

5-gatcaaa              gatcaaaGAT       5' ends are partially filled in the
      |||       --->    |||||||||       presence of GAT to produce a 1 nt
      tttctag-5         TAGtttctag      5' overhang

5-gatcaaa              gatcaaaGA       5' ends are partially filled in the
      |||       --->     |||||||       presence of GA to produce a 2 nt
      tttctag-5          AGtttctag     5' overhang

5-gatcaaa              gatcaaaG        5' ends are partially filled in the
      |||       --->      |||||        presence of G to produce a 3 nt
      tttctag-5           Gtttctag     5' overhang
Parameters:

nucleotides (str)

Examples

>>> from pydna.dseq import Dseq
>>> a = Dseq.from_representation(
... '''
... gatcaaa
...     tttctag
... ''')
>>> a
Dseq(-11)
gatcaaa
    tttctag
>>> a.T4()
Dseq(-11)
gatcaaagatc
ctagtttctag
>>> a.T4("GAT")
Dseq(-11)
gatcaaagat
 tagtttctag
>>> a.T4("GA")
Dseq(-11)
gatcaaaga
  agtttctag
>>> a.T4("G")
Dseq(-11)
gatcaaag
   gtttctag
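The partial fill-in shown in the diagrams follows from one rule: synthesis stops at the first template position whose complementary nucleotide is not supplied. The sketch below expresses that rule in plain Python; `extend_recessed_end` is a hypothetical helper, not part of pydna, and it models only the fill-in of a single 5’ overhang (not the removal of 3’ overhangs).

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}

def extend_recessed_end(template_overhang, nucleotides="GATC"):
    """Bases added (5'->3') to a recessed 3' end.

    template_overhang is the protruding 5' overhang of the opposite
    strand, written 5'->3'.
    """
    allowed = set(nucleotides.lower())
    added = []
    # The polymerase reads the template from its 3' end towards its 5' end.
    for base in reversed(template_overhang.lower()):
        new_base = COMPLEMENT[base]
        if new_base not in allowed:
            break  # synthesis stalls at the first missing nucleotide
        added.append(new_base)
    return "".join(added)

print(extend_recessed_end("gatc"))         # gatc
print(extend_recessed_end("gatc", "GAT"))  # gat
print(extend_recessed_end("gatc", "G"))    # g
```

With the overhang `gatc` from the example above, supplying GAT, GA or G leaves a 5’ overhang of 1, 2 or 3 nucleotides respectively.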
t4(nucleotides=None) DseqType#

Documented identically to T4() above; the description and examples there apply unchanged.
nibble_five_prime_left(n: int = 1) DseqType[source]#

5’ => 3’ resection at the left side (start) of the molecule.

The argument n indicates the number of nucleotides to be removed. The outcome depends on the structure of the molecule. See the two examples below:

The figure below indicates a recess of length two from a blunt DNA fragment. The resulting DNA fragment has a 3’ protruding single strand.

gatc           tc
||||   -->     ||
ctag         ctag

The figure below indicates a recess of length two from a DNA fragment with a 5’ sticky end resulting in a blunt sequence.

ttgatc         gatc
  ||||   -->   ||||
  ctag         ctag
>>> from pydna.dseq import Dseq
>>> ds = Dseq("gatc")
>>> ds
Dseq(-4)
gatc
ctag
>>> ds.nibble_five_prime_left(2)
Dseq(-4)
  tc
ctag
>>> ds.nibble_five_prime_left(3)
Dseq(-4)
   c
ctag
>>> ds.nibble_five_prime_left(4)
Dseq(-4)
||||
ctag
>>> ds = Dseq.from_representation(
... '''
... GGgatc
...   ctag
... ''')
>>> ds
Dseq(-6)
GGgatc
  ctag
>>> ds.nibble_five_prime_left(2)
Dseq(-4)
gatc
ctag
Parameters:

n (int, optional) – The default is 1. This is the number of nucleotides removed.

Returns:

A new Dseq object with n nucleotides removed from the 5’ end at the left side.

Return type:

DseqType

nibble_five_prime_right(n: int = 1) DseqType[source]#

5’ => 3’ resection at the right side (end) of the molecule.

The argument n indicates the number of nucleotides to be removed. The outcome depends on the structure of the molecule. See the two examples below:

The figure below indicates a recess of length two from a blunt DNA fragment. The resulting DNA fragment has a 3’ protruding single strand.

gatc         gatc
||||   -->   ||
ctag         ct

The figure below indicates a recess of length two from a DNA fragment with a 5’ sticky end resulting in a blunt sequence.

gatc         gatc
||||   -->   ||||
ctagtt       ctag
>>> from pydna.dseq import Dseq
>>> ds = Dseq("gatc")
>>> ds
Dseq(-4)
gatc
ctag
>>> ds.nibble_five_prime_right(2)
Dseq(-4)
gatc
ct
>>> ds.nibble_five_prime_right(3)
Dseq(-4)
gatc
c
>>> ds.nibble_five_prime_right(4)
Dseq(-4)
gatc
||||
>>> ds = Dseq.from_representation(
... '''
... gatc
... ctagGG
... ''')
>>> ds.nibble_five_prime_right(2)
Dseq(-4)
gatc
ctag
exo1_front(n: int = 1) DseqType#

Documented identically to nibble_five_prime_left() above; the description and examples there apply unchanged.

exo1_end(n: int = 1) DseqType#

Documented identically to nibble_five_prime_right() above; the description and examples there apply unchanged.
nibble_three_prime_left(n=1) DseqType[source]#

3’ => 5’ resection at the left side (beginning) of the molecule.

The argument n indicates the number of nucleotides to be removed. The outcome depends on the structure of the molecule. See the two examples below:

The figure below indicates a recess of length two from a blunt DNA fragment. The resulting DNA fragment has a 5’ protruding single strand.

gatc         gatc
||||   -->     ||
ctag           ag

The figure below indicates a recess of length two from a DNA fragment with a 3’ sticky end resulting in a blunt sequence.

  gatc         gatc
  ||||   -->   ||||
ttctag         ctag
>>> from pydna.dseq import Dseq
>>> ds = Dseq("gatc")
>>> ds
Dseq(-4)
gatc
ctag
>>> ds.nibble_three_prime_left(2)
Dseq(-4)
gatc
  ag
>>> ds.nibble_three_prime_left(3)
Dseq(-4)
gatc
   g
>>> ds.nibble_three_prime_left(4)
Dseq(-4)
gatc
||||
>>> ds = Dseq.from_representation(
... '''
...   gatc
... CCctag
... ''')
>>> ds
Dseq(-6)
  gatc
CCctag
>>> ds.nibble_three_prime_left(2)
Dseq(-4)
gatc
ctag
nibble_three_prime_right(n=1) DseqType[source]#

3’ => 5’ resection at the right side (end) of the molecule.

The argument n indicates the number of nucleotides to be removed. The outcome depends on the structure of the molecule. See the two examples below:

The figure below indicates a recess of length two from a blunt DNA fragment. The resulting DNA fragment has a 5’ protruding single strand.

gatc         ga
||||   -->   ||
ctag         ctag

The figure below indicates a recess of length two from a DNA fragment with a 3’ sticky end resulting in a blunt sequence.

gatctt       gatc
||||   -->   ||||
ctag         ctag
>>> from pydna.dseq import Dseq
>>> ds = Dseq("gatc")
>>> ds
Dseq(-4)
gatc
ctag
>>> ds.nibble_three_prime_right(2)
Dseq(-4)
ga
ctag
>>> ds.nibble_three_prime_right(3)
Dseq(-4)
g
ctag
>>> ds.nibble_three_prime_right(4)
Dseq(-4)
||||
ctag
>>> ds = Dseq.from_representation(
... '''
... gatcCC
... ctag
... ''')
>>> ds.nibble_three_prime_right(2)
Dseq(-4)
gatc
ctag
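The four resection methods differ only in which strand end is shortened and whether the left overhang changes. A sketch in plain Python, assuming the molecule is given as its two strands (both written 5'-3') plus the left overhang, and that 0 <= n <= the length of the strand being shortened; these free functions are hypothetical stand-ins for the methods, not pydna's implementation.

```python
def nibble_five_prime_left(watson, crick, ovhg, n=1):
    # The watson 5' end is on the left, so the crick overhang grows by n.
    return watson[n:], crick, ovhg + n

def nibble_five_prime_right(watson, crick, ovhg, n=1):
    # The crick 5' end is on the right; the left overhang is unaffected.
    return watson, crick[n:], ovhg

def nibble_three_prime_left(watson, crick, ovhg, n=1):
    # The crick 3' end is on the left, so the crick overhang shrinks by n.
    return watson, crick[:len(crick) - n], ovhg - n

def nibble_three_prime_right(watson, crick, ovhg, n=1):
    # The watson 3' end is on the right; the left overhang is unaffected.
    return watson[:len(watson) - n], crick, ovhg

blunt = ("gatc", "gatc", 0)  # gatc is its own reverse complement
print(nibble_five_prime_left(*blunt, n=2))    # ('tc', 'gatc', 2)
print(nibble_three_prime_left(*blunt, n=2))   # ('gatc', 'ga', -2)
```

Resecting a 2 nt sticky end by 2 returns the blunt molecule, as in the last example of each method above.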
no_cutters(batch: RestrictionBatch | None = None) RestrictionBatch[source]#

Enzymes in a RestrictionBatch not cutting sequence.

unique_cutters(batch: RestrictionBatch | None = None) RestrictionBatch[source]#

Enzymes in a RestrictionBatch cutting sequence once.

once_cutters(batch: RestrictionBatch | None = None) RestrictionBatch#

Enzymes in a RestrictionBatch cutting sequence once.

twice_cutters(batch: RestrictionBatch | None = None) RestrictionBatch[source]#

Enzymes in a RestrictionBatch cutting sequence twice.

n_cutters(n=3, batch: RestrictionBatch | None = None) RestrictionBatch[source]#

Enzymes in a RestrictionBatch cutting n times.

cutters(batch: RestrictionBatch | None = None) RestrictionBatch[source]#

Enzymes in a RestrictionBatch cutting sequence at least once.

seguid() str[source]#

SEGUID checksum for the sequence.

isblunt() bool[source]#

Return True if Dseq is linear and blunt, and False if it is staggered or circular.

Examples

>>> from pydna.dseq import Dseq
>>> a=Dseq("gat")
>>> a
Dseq(-3)
gat
cta
>>> a.isblunt()
True
>>> a=Dseq("gat", "atcg")
>>> a
Dseq(-4)
 gat
gcta
>>> a.isblunt()
False
>>> a=Dseq("gat", "gatc")
>>> a
Dseq(-4)
gat
ctag
>>> a.isblunt()
False
>>> a=Dseq("gat", circular=True)
>>> a
Dseq(o3)
gat
cta
>>> a.isblunt()
False
terminal_transferase(nucleotides: str = 'a') DseqType[source]#

Terminal deoxynucleotidyl transferase (TdT) is a template-independent DNA polymerase that adds nucleotides to the 3′-OH ends of DNA, typically single-stranded or recessed 3′ ends. In cloning, it’s classically used to create homopolymer tails (e.g. poly-dG on a vector and poly-dC on an insert) so that fragments can anneal via complementary overhangs (“tailing” cloning).

This activity is also present in some DNA polymerases, such as Taq polymerase. This property is used in the popular T/A cloning protocol ([9]).

gct          gcta
|||   -->    |||
cga         acga
>>> from pydna.dseq import Dseq
>>> a = Dseq("gct")
>>> a
Dseq(-3)
gct
cga
>>> a.terminal_transferase()
Dseq(-5)
 gcta
acga
>>> a.terminal_transferase("G")
Dseq(-5)
 gctG
Gcga
Parameters:

nucleotides (str, optional) – The default is “a”.

Returns:

A new Dseq object with the given nucleotides added to both 3’ ends.

Return type:

DseqType
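In terms of the two strands (both written 5'-3') and the left overhang, tailing simply appends the nucleotides to each 3’ end. A small sketch in plain Python; the free function is a hypothetical stand-in that reproduces the example above and is not pydna's code.

```python
def terminal_transferase(watson, crick, ovhg, nucleotides="a"):
    """Append nucleotides to the 3' end of both strands."""
    # The crick 3' end is on the left, so the left overhang grows
    # by the number of nucleotides added.
    return watson + nucleotides, crick + nucleotides, ovhg + len(nucleotides)

print(terminal_transferase("gct", "agc", 0))       # ('gcta', 'agca', 1)
print(terminal_transferase("gct", "agc", 0, "G"))  # ('gctG', 'agcG', 1)
```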

References

user() DseqType[source]#

USER Enzyme treatment.

USER Enzyme is a mixture of Uracil DNA glycosylase (UDG) and the DNA glycosylase-lyase Endonuclease VIII.

UDG catalyses the excision of a uracil base, forming an abasic or apyrimidinic site (AP site). Endonuclease VIII removes the AP site, creating a DNA gap.

tagaagtaggUat          tagaagtagg at
|||||||||||||  --->    |||||||||| ||
atcUtcatccata          atc tcatccata
>>> a = Dseq("tagaagtaggUat", "atcUtcatccata"[::-1], 0)
>>> a
Dseq(-13)
tagaagtaggUat
atcutcatccAta
>>> a.user()
Dseq(-13)
tagaagtagg at
atc tcatccAta
Returns:

DNA fragment with uracil bases removed.

Return type:

DseqType

cut(*enzymes: EnzymesType) Tuple[DseqType, ...][source]#

Returns a tuple of linear Dseq fragments produced in the digestion. If there are no cuts, an empty tuple is returned.

Parameters:

enzymes (enzyme object or iterable of such objects) – One or more Bio.Restriction enzyme objects, or an iterable of them.

Returns:

frags – tuple of Dseq objects formed by the digestion

Return type:

tuple

Examples

>>> from pydna.dseq import Dseq
>>> seq=Dseq("ggatccnnngaattc")
>>> seq
Dseq(-15)
ggatccnnngaattc
cctaggnnncttaag
>>> from Bio.Restriction import BamHI,EcoRI
>>> type(seq.cut(BamHI))
<class 'tuple'>
>>> for frag in seq.cut(BamHI): print(repr(frag))
Dseq(-5)
g
cctag
Dseq(-14)
gatccnnngaattc
    gnnncttaag
>>> seq.cut(EcoRI, BamHI) ==  seq.cut(BamHI, EcoRI)
True
>>> a,b,c = seq.cut(EcoRI, BamHI)
>>> a+b+c
Dseq(-15)
ggatccnnngaattc
cctaggnnncttaag
>>>
cutsite_is_valid(cutsite: Tuple[Tuple[int, int], AbstractCut | None | _cas]) bool[source]#

Check if a cutsite is valid.

A cutsite is a nested 2-tuple with this form:

((cut_watson, ovhg), enz), for example ((396, -4), EcoRI)

The cut_watson (positive integer) is the cut position of the sequence as for example returned by the Bio.Restriction module.

The ovhg (overhang, positive or negative integer or 0) has the same meaning as for restriction enzymes in the Bio.Restriction module and for pydna.dseq.Dseq objects (see docstring for this module and example below)

Enzyme can be None.

Enzyme overhang

EcoRI  -4     --GAATTC--        --G       AATTC--
                ||||||     -->    |           |
              --CTTAAG--        --CTTAA       G--

KpnI    4     --GGTACC--        --GGTAC       C--
                ||||||     -->    |           |
              --CCATGG--        --C       CATGG--

SmaI    0     --CCCGGG--        --CCC       GGG--
                ||||||     -->    |||       |||
              --GGGCCC--        --GGG       CCC--
>>> from Bio.Restriction import EcoRI, KpnI, SmaI
>>> EcoRI.ovhg
-4
>>> KpnI.ovhg
4
>>> SmaI.ovhg
0

Returns False if:

  • Cut positions fall outside the sequence (could be moved to Biopython)

  • Overhang is not double stranded

  • Recognition site is not double stranded or is outside the sequence

  • For enzymes that cut twice, no possibility is valid

Parameters:

cutsite (CutSiteType) – The cutsite to check, in the form ((cut_watson, ovhg), enz).

Returns:

True if cutsite can cut the DNA fragment.

Return type:

bool

get_cutsites(*enzymes: EnzymesType) List[Tuple[Tuple[int, int], AbstractCut | None | _cas]][source]#

Returns a list of cutsites, represented as ((cut_watson, ovhg), enz):

  • cut_watson is a non-negative integer contained in [0,len(seq)), where seq is the sequence that will be cut. It represents the position of the cut on the watson strand, using the full sequence as a reference. The “full sequence” is the one you would get from str(Dseq).

  • ovhg is the overhang left after the cut. It has the same meaning as ovhg in the Bio.Restriction enzyme objects, or pydna’s Dseq property.

  • enz is the enzyme object. It’s not necessary to perform the cut, but can be used to keep track of which enzyme was used.

Cuts are only returned if the recognition site and overhang are on the double-strand part of the sequence.

Parameters:

enzymes (Union[RestrictionBatch,list[_AbstractCut]])

Return type:

list[tuple[tuple[int,int], _AbstractCut]]

Examples

>>> from Bio.Restriction import EcoRI
>>> from pydna.dseq import Dseq
>>> seq = Dseq('AAGAATTCAAGAATTC')
>>> seq.get_cutsites(EcoRI)
[((3, -4), EcoRI), ((11, -4), EcoRI)]

cut_watson is defined with respect to the “full sequence”, not the watson strand:

>>> dseq = Dseq.from_full_sequence_and_overhangs('aaGAATTCaa', 1, 0)
>>> dseq
Dseq(-10)
 aGAATTCaa
ttCTTAAGtt
>>> dseq.get_cutsites([EcoRI])
[((3, -4), EcoRI)]

Cuts are only returned if the recognition site and overhang are on the double-strand part of the sequence.

>>> Dseq('GAATTC').get_cutsites([EcoRI])
[((1, -4), EcoRI)]
>>> Dseq.from_full_sequence_and_overhangs('GAATTC', -1, 0).get_cutsites([EcoRI])
[]
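For a fully double stranded linear sequence, the (cut_watson, ovhg) values can be reproduced by locating the recognition site and adding the cut offset on the watson strand. The sketch below uses only the standard library and a small hypothetical `ENZYMES` table with enzyme names as strings; it does not use Bio.Restriction and ignores single stranded regions and circular molecules.

```python
import re

# name -> (recognition site, cut offset on the watson strand, ovhg)
# G^AATTC for EcoRI and CCC^GGG for SmaI, as in the diagrams above.
ENZYMES = {"EcoRI": ("GAATTC", 1, -4), "SmaI": ("CCCGGG", 3, 0)}

def find_cutsites(seq, name):
    """List of ((cut_watson, ovhg), name) for a blunt linear sequence."""
    site, offset, ovhg = ENZYMES[name]
    # A lookahead pattern also finds overlapping occurrences.
    return [((m.start() + offset, ovhg), name)
            for m in re.finditer(f"(?={site})", seq.upper())]

print(find_cutsites("AAGAATTCAAGAATTC", "EcoRI"))
# [((3, -4), 'EcoRI'), ((11, -4), 'EcoRI')]
```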
left_end_position() Tuple[int, int][source]#

The index in the full sequence of the watson and crick start positions.

full sequence (str(self)) for all three cases is AAA

AAA              AA               AAA
 TT             TTT               TTT
Returns (0, 1)  Returns (1, 0)    Returns (0, 0)
right_end_position() Tuple[int, int][source]#

The index in the full sequence of the watson and crick end positions.

full sequence (str(self)) for all three cases is AAA

AAA               AA                   AAA
TT                TTT                  TTT
Returns (3, 2)    Returns (2, 3)       Returns (3, 3)
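Both end positions follow directly from the overhang values. A sketch in plain Python, assuming the sign conventions of from_full_sequence_and_overhangs (a positive ovhg means the crick strand protrudes on the left, a positive watson_ovhg means the watson strand protrudes on the right); the free functions are hypothetical and take the values as arguments instead of reading them from a Dseq object.

```python
def left_end_position(ovhg):
    """(watson_start, crick_start) as indices into the full sequence."""
    return (ovhg, 0) if ovhg > 0 else (0, -ovhg)

def right_end_position(full_length, watson_ovhg):
    """(watson_end, crick_end) as indices into the full sequence."""
    if watson_ovhg > 0:
        return (full_length, full_length - watson_ovhg)
    return (full_length + watson_ovhg, full_length)

print(left_end_position(-1))      # (0, 1)
print(right_end_position(3, 1))   # (3, 2)
```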

get_ss_meltsites(length: int) tuple[int, int][source]#

Single stranded DNA melt sites

Two lists of 2-tuples of integers are returned. Each tuple (from, to) contains the start and end positions of a single stranded region shorter than or equal to length.

In the example below, the middle 2 nt part is released from the molecule.

tagaa ta gtatg
||||| || |||||  -->   [(6,8)], []
atcttcatccatac

tagaagtaggtatg
||||| || |||||  -->   [], [(6,8)]
atctt at catac

The output of this method is used in the melt_ss_dna method in order to determine the start and end positions of single stranded regions.

See get_ds_meltsites for melting ds sequences.

Examples

>>> from pydna.dseq import Dseq
>>> ds = Dseq("tagaaqtaqgtatg")
>>> ds
Dseq(-14)
tagaa ta gtatg
atcttcatccatac
>>> cutsites = ds.get_ss_meltsites(2)
>>> cutsites
([(6, 8)], [])
>>> ds[6:8]
Dseq(-2)
ta
at
>>> ds = Dseq("tagaaptapgtatg")
>>> ds
Dseq(-14)
tagaagtaggtatg
atctt at catac
>>> cutsites = ds.get_ss_meltsites(2)
>>> cutsites
([], [(6, 8)])
get_ds_meltsites(length: int) List[Tuple[Tuple[int, int], AbstractCut | None | _cas]][source]#

Double stranded DNA melt sites

DNA molecules can fall apart by melting if they have internal single stranded regions. In the example below, the molecule has two gaps on opposite sides, two nucleotides apart, which means that it hangs together by two basepairs.

This molecule can melt into two separate 8 bp double stranded molecules, each with 3 nt 3’ overhangs as depicted below.

tagaagta gtatg        tagaagta          gtatg
||||| || |||||  -->   |||||             |||||
atctt atccatac        atctt          atccatac

A list of 2-tuples is returned. Each tuple ((cut_watson, ovhg), None) contains the cut position and the overhang value in the same format as returned by the get_cutsites method for restriction enzymes.

Note that this function deals with melting that results in two double stranded DNA molecules.

See get_ss_meltsites for melting of single stranded regions from molecules.

Examples

>>> from pydna.dseq import Dseq
>>> ds = Dseq("tagaaptaqgtatg")
>>> ds
Dseq(-14)
tagaagta gtatg
atctt atccatac
>>> cutsite = ds.get_ds_meltsites(2)
>>> cutsite
[((8, 2), None)]
cast_to_ds_right()[source]#

NNNN           NNNNGATC
||||      -->  ||||||||
NNNNCTAG       NNNNCTAG

NNNNGATC       NNNNGATC
||||      -->  ||||||||
NNNN           NNNNCTAG

cast_to_ds()[source]#

Sequentially calls cast_to_ds_left and cast_to_ds_right.

cast_to_ds_left()[source]#
GATCNNNN       GATCNNNN
    ||||  -->  ||||||||
    NNNN       CTAGNNNN

    NNNN       GATCNNNN
    ||||  -->  ||||||||
CTAGNNNN       CTAGNNNN

get_cut_parameters(cut: Tuple[Tuple[int, int], AbstractCut | None | _cas] | None, is_left: bool) Tuple[int, int, int][source]#

For a given cut expressed as ((cut_watson, ovhg), enz), returns a tuple (cut_watson, cut_crick, ovhg).

  • cut_watson: see get_cutsites docs

  • cut_crick: equivalent of cut_watson in the crick strand

  • ovhg: see get_cutsites docs

The cut can be None if it represents the left or right end of the sequence. Then it will return the position of the watson and crick ends with respect to the “full sequence”. The is_left parameter is only used in this case.

melt(length)[source]#

TBD

Parameters:

length (TYPE) – DESCRIPTION.

Returns:

DESCRIPTION.

Return type:

TYPE

melt_ss_dna(length) tuple[Dseq, list[Dseq]][source]#

Melt to separate single stranded DNA

Single stranded DNA molecules shorter than or equal to length are shed from a double stranded DNA molecule without affecting the length of the remaining molecule.

In the examples below, the middle 2 nt part is released from the molecule.

tagaa ta gtatg        tagaa    gtatg          ta
||||| || |||||  -->   |||||    |||||     +    ||
atcttcatccatac        atcttcatccatac

tagaagtaggtatg        tagaagtaggtatg
||||| || |||||  -->   |||||    |||||     +    ||
atctt at catac        atctt    catac          at

Examples

>>> from pydna.dseq import Dseq
>>> ds = Dseq("tagaaqtaqgtatg")
>>> ds
Dseq(-14)
tagaa ta gtatg
atcttcatccatac
>>> new, strands  = ds.melt_ss_dna(2)
>>> new
Dseq(-14)
tagaa    gtatg
atcttcatccatac
>>> strands[0]
Dseq(-2)
ta
||
>>> ds = Dseq("tagaaptapgtatg")
>>> ds
Dseq(-14)
tagaagtaggtatg
atctt at catac
>>> new, strands = ds.melt_ss_dna(2)
>>> new
Dseq(-14)
tagaagtaggtatg
atctt    catac
>>> strands[0]
Dseq(-2)
||
at
shed_ss_dna(watson_cutpairs: list[tuple[int, int]] = None, crick_cutpairs: list[tuple[int, int]] = None)[source]#

Separate parts of one of the DNA strands

Examples

>>> from pydna.dseq import Dseq
>>> ds = Dseq("tagaagtaggtatg")
>>> ds
Dseq(-14)
tagaagtaggtatg
atcttcatccatac
>>> new, strands = ds.shed_ss_dna([(6, 8)],[])
>>> new
Dseq(-14)
tagaag  ggtatg
atcttcatccatac
>>> strands[0]
Dseq(-2)
ta
||
>>> new, strands = ds.shed_ss_dna([],[(6, 8)])
>>> new
Dseq(-14)
tagaagtaggtatg
atcttc  ccatac
>>> strands[0]
Dseq(-2)
||
at
>>> ds = Dseq("tagaagtaggtatg")
>>> new, (strand1, strand2) = ds.shed_ss_dna([(6, 8), (9, 11)],[])
>>> new
Dseq(-14)
tagaag  g  atg
atcttcatccatac
>>> strand1
Dseq(-2)
ta
||
>>> strand2
Dseq(-2)
gt
||
apply_cut(left_cut: Tuple[Tuple[int, int], AbstractCut | None | _cas], right_cut: Tuple[Tuple[int, int], AbstractCut | None | _cas]) Dseq[source]#

Extracts a subfragment of the sequence between two cuts.

For more detail see the documentation of get_cutsite_pairs.

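Parameters:
  • left_cut – Cutsite at the left edge of the fragment, in the form ((cut_watson, ovhg), enz), or None for the left end of a linear sequence.

  • right_cut – Cutsite at the right edge of the fragment, in the same form, or None for the right end of a linear sequence.

Return type: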

Dseq

Examples

>>> from Bio.Restriction import EcoRI
>>> from pydna.dseq import Dseq
>>> dseq = Dseq('aaGAATTCaaGAATTCaa')
>>> cutsites = dseq.get_cutsites([EcoRI])
>>> cutsites
[((3, -4), EcoRI), ((11, -4), EcoRI)]
>>> p1, p2, p3 = dseq.get_cutsite_pairs(cutsites)
>>> p1
(None, ((3, -4), EcoRI))
>>> dseq.apply_cut(*p1)
Dseq(-7)
aaG
ttCTTAA
>>> p2
(((3, -4), EcoRI), ((11, -4), EcoRI))
>>> dseq.apply_cut(*p2)
Dseq(-12)
AATTCaaG
    GttCTTAA
>>> p3
(((11, -4), EcoRI), None)
>>> dseq.apply_cut(*p3)
Dseq(-7)
AATTCaa
    Gtt
>>> dseq = Dseq('TTCaaGAA', circular=True)
>>> cutsites = dseq.get_cutsites([EcoRI])
>>> cutsites
[((6, -4), EcoRI)]
>>> pair = dseq.get_cutsite_pairs(cutsites)[0]
>>> pair
(((6, -4), EcoRI), ((6, -4), EcoRI))
>>> dseq.apply_cut(*pair)
Dseq(-12)
AATTCaaG
    GttCTTAA
get_cutsite_pairs(cutsites: List[Tuple[Tuple[int, int], AbstractCut | None | _cas]]) List[Tuple[None | Tuple[Tuple[int, int], AbstractCut | None | _cas], None | Tuple[Tuple[int, int], AbstractCut | None | _cas]]][source]#

Returns pairs of cutsites that render the edges of the resulting fragments.

A fragment produced by restriction is represented by a tuple of length 2 that may contain cutsites or None:

  • Two cutsites: represents the extraction of a fragment between those two cutsites, in that orientation. To represent the opening of a circular molecule with a single cutsite, we put the same cutsite twice.

  • None, cutsite: represents the extraction of a fragment between the left edge of linear sequence and the cutsite.

  • cutsite, None: represents the extraction of a fragment between the cutsite and the right edge of a linear sequence.
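
The pairing rule described above can be sketched in a few lines of plain Python. This is a minimal illustration and not pydna's implementation: pair_cutsites is a hypothetical helper, enzymes are represented by plain strings so that Biopython is not needed, and the cutsites are assumed to be sorted by position.

```python
from typing import List, Optional, Tuple

# A cutsite is ((watson_position, overhang), enzyme). The enzyme is a plain
# string here (an assumption) so that the sketch runs without Biopython.
Cutsite = Tuple[Tuple[int, int], str]
Pair = Tuple[Optional[Cutsite], Optional[Cutsite]]


def pair_cutsites(cutsites: List[Cutsite], circular: bool) -> List[Pair]:
    """Pair consecutive cutsites into fragment edges (hypothetical helper)."""
    if not cutsites:
        return []
    if circular:
        # Close the circle: the last fragment ends at the first cutsite.
        edges = [*cutsites, cutsites[0]]
    else:
        # None marks the left and right ends of a linear molecule.
        edges = [None, *cutsites, None]
    return list(zip(edges, edges[1:]))


linear = pair_cutsites([((3, -4), "EcoRI"), ((11, -4), "EcoRI")], circular=False)
opened = pair_cutsites([((6, -4), "EcoRI")], circular=True)
```

A circular molecule with a single cutsite gives one pair holding the same cutsite twice, which corresponds to opening the circle.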

Parameters:

cutsites (list[tuple[tuple[int,int], _AbstractCut]])

Return type:

list[tuple[tuple[tuple[int,int], _AbstractCut]|None, tuple[tuple[int,int], _AbstractCut]|None]]

Examples

>>> from Bio.Restriction import EcoRI
>>> from pydna.dseq import Dseq
>>> dseq = Dseq('aaGAATTCaaGAATTCaa')
>>> cutsites = dseq.get_cutsites([EcoRI])
>>> cutsites
[((3, -4), EcoRI), ((11, -4), EcoRI)]
>>> dseq.get_cutsite_pairs(cutsites)
[(None, ((3, -4), EcoRI)), (((3, -4), EcoRI), ((11, -4), EcoRI)), (((11, -4), EcoRI), None)]
>>> dseq = Dseq('TTCaaGAA', circular=True)
>>> cutsites = dseq.get_cutsites([EcoRI])
>>> cutsites
[((6, -4), EcoRI)]
>>> dseq.get_cutsite_pairs(cutsites)
[(((6, -4), EcoRI), ((6, -4), EcoRI))]
get_parts()[source]#

Returns a DseqParts instance containing the parts (strings) of a dsDNA sequence. DseqParts instance field names:

 "sticky_left5"
 |
 |      "sticky_right5"
 |      |
---    ---
GGGATCC
   TAGGTCA
   ----
     |
     "middle"

 "sticky_left3"
 |
 |      "sticky_right3"
 |      |
---    ---
   ATCCAGT
CCCTAGG
   ----
     |
     "middle"

   "single_watson" (only an upper strand)
   |
-------
ATCCAGT
|||||||

   "single_crick" (only a lower strand)
   |
-------

|||||||
CCCTAGG

Up to seven groups (0..6) are captured, but some are mutually exclusive which means that one of them is an empty string:

0 or 1, not both, a DNA fragment has either 5’ or 3’ sticky end.

2 or 5 or 6, a DNA molecule has a ds region or is single stranded.

3 or 4, not both, either 5’ or 3’ sticky end.

Note that internal single stranded regions are not identified and will be contained in the middle part if they are present.

Examples

>>> from pydna.dseq import Dseq
>>> ds = Dseq("PPPATCFQZ")
>>> ds
Dseq(-9)
GGGATC
   TAGTCA
>>> parts = ds.get_parts()
>>> parts
DseqParts(sticky_left5='PPP', sticky_left3='', middle='ATC', sticky_right3='', sticky_right5='FQZ', single_watson='', single_crick='')
>>> Dseq(parts.sticky_left5)
Dseq(-3)
GGG
|||
>>> Dseq(parts.middle)
Dseq(-3)
ATC
TAG
>>> Dseq(parts.sticky_right5)
Dseq(-3)
|||
TCA
Parameters:

datastring (str) – A string with dscode.

Returns:

Seven string fields describing the DNA molecule: fragment(sticky_left5='', sticky_left3='', middle='', sticky_right3='', sticky_right5='', single_watson='', single_crick='')

Return type:

namedtuple

pydna.all.read(data, ds=True)[source]#

This function is similar to the parse() function but expects one and only one sequence; otherwise an exception is raised.

Parameters:
  • data (string) – see below

  • ds (bool) – Double stranded or single stranded DNA. If True, a Dseqrecord object is returned, otherwise a Bio.SeqRecord object.

Returns:

contains the first Dseqrecord or SeqRecord object parsed.

Return type:

Dseqrecord

Notes

The data parameter is similar to the data parameter for parse().

See also

parse

pydna.all.read_primer(data)[source]#

Use this function to read a primer sequence from a string or a local file. The usage is similar to the parse_primer() function.

pydna.all.parse(data, ds=True) list[Dseqrecord | SeqRecord][source]#

Return all DNA sequences found in data.

If no sequences are found, an empty list is returned. This is a greedy function; use it carefully.

Parameters:
  • data (string or iterable) –

    The data parameter is a string containing:

    1. an absolute path to a local file. The file will be read in text mode and parsed for EMBL, FASTA and Genbank sequences. Can be a string or a Path object.

    2. a string containing one or more sequences in EMBL, GENBANK, or FASTA format. Mixed formats are allowed.

    3. a list or other iterable in which each element is of kind 1 or 2 above.

  • ds (bool) – If True double stranded Dseqrecord objects are returned. If False single stranded Bio.SeqRecord [10] objects are returned.

Returns:

contains Dseqrecord or SeqRecord objects

Return type:

list

References

See also

read

pydna.all.parse_primers(data)[source]#

docstring.

pydna.all.primer_design(template, fp=None, rp=None, limit=13, target_tm=55.0, tm_func=tm_default, estimate_function=None, **kwargs)[source]#

This function designs a forward primer and a reverse primer for PCR amplification of a given template sequence.

The template argument is a Dseqrecord object or equivalent containing the template sequence.

The optional fp and rp arguments can contain an existing primer for the sequence (the forward or the reverse primer, respectively). Only one of the two primers can be specified, not both, since there would then be nothing to design; use the pydna.amplify.pcr function instead in that case.

The limit argument is the minimum length of the primer. The default value is 13.

If one of the primers is given, the other primer is designed to match it in terms of Tm. If both primers are designed, they will be designed to match target_tm.

tm_func is a function that takes an ascii string representing an oligonucleotide as argument and returns a float. Some useful functions can be found in the pydna.tm module, but a custom function can be used instead.

estimate_function is a tm_func-like function that is used to get a first guess for the primer design, which is then used as the starting point for the final result. This is useful when tm_func is slow to calculate (e.g. when it relies on an external API, such as the NEB primer design API). The estimate_function should be faster than tm_func. The default value is None. For example, to get the NEB Tm faster by using the default Tm function as the estimate function: primer_design(dseqr, target_tm=55, tm_func=tm_neb, estimate_function=tm_default).

The function returns a pydna.amplicon.Amplicon class instance. This object has the object.forward_primer and object.reverse_primer properties which contain the designed primers.
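
How limit, target_tm and tm_func interact can be illustrated with a toy version of the primer extension step. This is a sketch under stated assumptions and not the algorithm used by primer_design: wallace_tm (the Wallace rule, 2 °C per A/T and 4 °C per G/C) and extend_until_tm are hypothetical names, and the primer is simply extended from limit nucleotides until its Tm reaches target_tm.

```python
def wallace_tm(oligo: str) -> float:
    """Toy Tm estimate (Wallace rule): 2 degrees per A/T, 4 degrees per G/C."""
    s = oligo.lower()
    at = s.count("a") + s.count("t")
    gc = s.count("g") + s.count("c")
    return 2.0 * at + 4.0 * gc


def extend_until_tm(template: str, target_tm: float, limit: int = 13,
                    tm_func=wallace_tm) -> str:
    """Shortest 5' prefix of template, at least `limit` nt long, whose Tm
    (according to tm_func) reaches target_tm. Hypothetical helper."""
    for length in range(limit, len(template) + 1):
        candidate = template[:length]
        if tm_func(candidate) >= target_tm:
            return candidate
    # The template is too short to reach the target: use all of it.
    return template


template = "atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"
forward = extend_until_tm(template, target_tm=55.0)
```

Any function that maps a string to a float can be passed as tm_func, which mirrors how a custom Tm function can be supplied to primer_design.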

Parameters:
  • template (pydna.dseqrecord.Dseqrecord) – a Dseqrecord object. The only required argument.

  • fp (pydna.primer.Primer, optional) – an existing forward primer.

  • rp (pydna.primer.Primer, optional) – an existing reverse primer.

  • target_tm (float, optional) – target tm for the primers, set to 55°C by default.

  • tm_func (function) – Function used for tm calculation. This function takes an ascii string representing an oligonucleotide as argument and returns a float. Some useful functions can be found in the pydna.tm module, but a custom function can be used instead.

Returns:

result

Return type:

Amplicon

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> t=Dseqrecord("atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg")
>>> t
Dseqrecord(-64)
>>> from pydna.design import primer_design
>>> ampl = primer_design(t)
>>> ampl
Amplicon(64)
>>> ampl.forward_primer
f64 17-mer:5'-atgactgctaacccttc-3'
>>> ampl.reverse_primer
r64 18-mer:5'-catcgtaagtttcgaacg-3'
>>> print(ampl.figure())
5atgactgctaacccttc...cgttcgaaacttacgatg3
                     ||||||||||||||||||
                    3gcaagctttgaatgctac5
5atgactgctaacccttc3
 |||||||||||||||||
3tactgacgattgggaag...gcaagctttgaatgctac5
>>> pf = "GGATCC" + ampl.forward_primer
>>> pr = "GGATCC" + ampl.reverse_primer
>>> pf
f64 23-mer:5'-GGATCCatgactgct..ttc-3'
>>> pr
r64 24-mer:5'-GGATCCcatcgtaag..acg-3'
>>> from pydna.amplify import pcr
>>> pcr_prod = pcr(pf, pr, t)
>>> print(pcr_prod.figure())
      5atgactgctaacccttc...cgttcgaaacttacgatg3
                           ||||||||||||||||||
                          3gcaagctttgaatgctacCCTAGG5
5GGATCCatgactgctaacccttc3
       |||||||||||||||||
      3tactgacgattgggaag...gcaagctttgaatgctac5
>>> print(pcr_prod.seq)
GGATCCatgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatgGGATCC
>>> from pydna.primer import Primer
>>> pf = Primer("atgactgctaacccttccttggtgttg", id="myprimer")
>>> ampl = primer_design(t, fp = pf)
>>> ampl.forward_primer
myprimer 27-mer:5'-atgactgctaaccct..ttg-3'
>>> ampl.reverse_primer
r64 32-mer:5'-catcgtaagtttcga..atc-3'
pydna.all.assembly_fragments(f, overlap=35, maxlink=40, circular=False)[source]#

This function returns a list of pydna.amplicon.Amplicon objects where primers have been modified with tails so that the fragments can be fused in the order they appear in the list by, for example, Gibson assembly or homologous recombination.

Given that we have two linear pydna.amplicon.Amplicon objects a and b, we can modify the reverse primer of a and the forward primer of b with tails to allow fusion by fusion PCR, Gibson assembly or in-vivo homologous recombination. The basic requirements for the primers are the same for the three techniques.

 _________ a _________
/                     \
agcctatcatcttggtctctgca
                  |||||
                 <gacgt
agcct>
|||||
tcggatagtagaaccagagacgt

                        __________ b ________
                       /                     \
                       TTTATATCGCATGACTCTTCTTT
                                         |||||
                                        <AGAAA
                       TTTAT>
                       |||||
                       AAATATAGCGTACTGAGAAGAAA

agcctatcatcttggtctctgcaTTTATATCGCATGACTCTTCTTT
||||||||||||||||||||||||||||||||||||||||||||||
tcggatagtagaaccagagacgtAAATATAGCGTACTGAGAAGAAA
\___________________ c ______________________/

Design tailed primers incorporating a part of the next or previous fragment to be assembled.

agcctatcatcttggtctctgca
|||||||||||||||||||||||
                gagacgtAAATATA

|||||||||||||||||||||||
tcggatagtagaaccagagacgt

                       TTTATATCGCATGACTCTTCTTT
                       |||||||||||||||||||||||

                ctctgcaTTTATAT
                       |||||||||||||||||||||||
                       AAATATAGCGTACTGAGAAGAAA

PCR products with flanking sequences are formed in the PCR process.

agcctatcatcttggtctctgcaTTTATAT
||||||||||||||||||||||||||||||
tcggatagtagaaccagagacgtAAATATA
                \____________/

                   identical
                   sequences
                 ____________
                /            \
                ctctgcaTTTATATCGCATGACTCTTCTTT
                ||||||||||||||||||||||||||||||
                gagacgtAAATATAGCGTACTGAGAAGAAA

The fragments can be fused by any of the techniques mentioned earlier to form c:

agcctatcatcttggtctctgcaTTTATATCGCATGACTCTTCTTT
||||||||||||||||||||||||||||||||||||||||||||||
tcggatagtagaaccagagacgtAAATATAGCGTACTGAGAAGAAA
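
The tailing scheme in the figures above can be reproduced with plain strings. This is only a sketch: add_overlap_tails is a hypothetical helper working on the upper strands alone, and splitting the overlap evenly between the two primers is an assumption.

```python
from typing import Tuple


def add_overlap_tails(a: str, b: str, overlap: int) -> Tuple[str, str, str]:
    """Return the tailed PCR products of a and b (upper strands) and the
    sequence they share. Hypothetical helper, not part of pydna."""
    if overlap < 2:
        raise ValueError("overlap must be at least 2")
    half = overlap // 2
    rest = overlap - half
    # The reverse primer of a carries a tail copied from the start of b ...
    a_product = a + b[:rest]
    # ... and the forward primer of b carries a tail copied from the end of a.
    b_product = a[-half:] + b
    shared = a[-half:] + b[:rest]
    return a_product, b_product, shared


a = "agcctatcatcttggtctctgca"
b = "TTTATATCGCATGACTCTTCTTT"
a_product, b_product, shared = add_overlap_tails(a, b, overlap=14)
```

With overlap=14 the two products end and start with the same 14 nt stretch, which is the "identical sequences" region marked in the figure.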

The first argument of this function is a list of sequence objects containing Amplicons and other similar objects.

At least every second sequence object needs to be an Amplicon.

This rule exists because if a sequence object that is not a PCR product is to be fused with another fragment, that other fragment needs to be an Amplicon so that its primer can be modified to include the whole stretch of sequence homology needed for the fusion. See the example below, where a is a non-amplicon (a linear plasmid vector for instance).

 _________ a _________           __________ b ________
/                     \         /                     \
agcctatcatcttggtctctgca   <-->  TTTATATCGCATGACTCTTCTTT
|||||||||||||||||||||||         |||||||||||||||||||||||
tcggatagtagaaccagagacgt                          <AGAAA
                                TTTAT>
                                |||||||||||||||||||||||
                          <-->  AAATATAGCGTACTGAGAAGAAA

     agcctatcatcttggtctctgcaTTTATATCGCATGACTCTTCTTT
     ||||||||||||||||||||||||||||||||||||||||||||||
     tcggatagtagaaccagagacgtAAATATAGCGTACTGAGAAGAAA
     \___________________ c ______________________/

In this case only the forward primer of b is fitted with a tail containing a part of a:

agcctatcatcttggtctctgca
|||||||||||||||||||||||
tcggatagtagaaccagagacgt

                       TTTATATCGCATGACTCTTCTTT
                       |||||||||||||||||||||||
                                        <AGAAA
         tcttggtctctgcaTTTATAT
                       |||||||||||||||||||||||
                       AAATATAGCGTACTGAGAAGAAA

PCR products with flanking sequences are formed in the PCR process.

agcctatcatcttggtctctgcaTTTATAT
||||||||||||||||||||||||||||||
tcggatagtagaaccagagacgtAAATATA
                \____________/

                   identical
                   sequences
                 ____________
                /            \
                ctctgcaTTTATATCGCATGACTCTTCTTT
                ||||||||||||||||||||||||||||||
                gagacgtAAATATAGCGTACTGAGAAGAAA

The fragments can be fused by for example Gibson assembly:

agcctatcatcttggtctctgcaTTTATAT
||||||||||||||||||||||||||||||
tcggatagtagaacca

                             TCGCATGACTCTTCTTT
                ||||||||||||||||||||||||||||||
                gagacgtAAATATAGCGTACTGAGAAGAAA

to form c:

agcctatcatcttggtctctgcaTTTATATCGCATGACTCTTCTTT
||||||||||||||||||||||||||||||||||||||||||||||
tcggatagtagaaccagagacgtAAATATAGCGTACTGAGAAGAAA

The overlap argument controls how many base pairs of overlap are required between adjacent sequence fragments. In the junction between Amplicons, tails with a length of about half of this value are added to the two primers closest to the junction.

>       <
Amplicon1
         Amplicon2
         >       <

         ⇣

>       <-
Amplicon1
         Amplicon2
        ->       <

In the case of an Amplicon adjacent to a Dseqrecord object, the tail will be twice as long (1*overlap) since the recombining sequence is present entirely on this primer:

Dseqrecd1
         Amplicon1
         >       <

         ⇣

Dseqrecd1
         Amplicon1
       -->       <
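
The tail length rules above can be summarized in a small function. This is a sketch only: tail_lengths is a hypothetical helper, and giving the extra base to the left fragment when overlap is odd is an assumption, since the text only says "about half".

```python
from typing import Tuple


def tail_lengths(left_is_amplicon: bool, right_is_amplicon: bool,
                 overlap: int = 35) -> Tuple[int, int]:
    """Tail lengths at one junction, as (tail on the reverse primer of the
    left fragment, tail on the forward primer of the right fragment)."""
    if left_is_amplicon and right_is_amplicon:
        # Both primers can be modified, so the overlap is shared between them.
        half = overlap // 2
        return overlap - half, half
    if right_is_amplicon:
        # Only the right fragment has primers: its tail carries everything.
        return 0, overlap
    if left_is_amplicon:
        return overlap, 0
    raise ValueError("one of two adjacent fragments must be an Amplicon")


both = tail_lengths(True, True)
vector_then_amplicon = tail_lengths(False, True)
```

The ValueError branch reflects the rule that at least every second sequence object needs to be an Amplicon.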

Note that if the sequence of DNA fragments starts or stops with an Amplicon, the very first and very last primer will not be modified, i.e. assemblies are always assumed to be linear. There are simple tricks around that for circular assemblies, depicted in the last two examples below.

The maxlink argument controls the cut-off length for sequences that will be synthesized by adding them to primers for the adjacent fragment(s). The argument list may contain short spacers (such as spacers between fusion proteins).

Example 1: Linear assembly of PCR products (pydna.amplicon.Amplicon class objects)

>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon4
         >       <         >       <

                     ⇣
                     pydna.design.assembly_fragments
                     ⇣

>       <-       ->       <-                      pydna.assembly.Assembly
Amplicon1         Amplicon3
         Amplicon2         Amplicon4     ➤  Amplicon1Amplicon2Amplicon3Amplicon4
        ->       <-       ->       <

Example 2: Linear assembly of alternating Amplicons and other fragments

>       <         >       <
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2

                     ⇣
                     pydna.design.assembly_fragments
                     ⇣

>       <--     -->       <--                     pydna.assembly.Assembly
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2     ➤  Amplicon1Dseqrecd1Amplicon2Dseqrecd2

Example 3: Linear assembly of alternating Amplicons and other fragments

Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2
         >       <       -->       <

                     ⇣
             pydna.design.assembly_fragments
                     ⇣
                                                  pydna.assembly.Assembly
Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2     ➤  Dseqrecd1Amplicon1Dseqrecd2Amplicon2
       -->       <--     -->       <

Example 4: Circular assembly of alternating Amplicons and other fragments

                 ->       <==
Dseqrecd1         Amplicon2
         Amplicon1         Dseqrecd1
       -->       <-
                     ⇣
                     pydna.design.assembly_fragments
                     ⇣
                                                   pydna.assembly.Assembly
                 ->       <==
Dseqrecd1         Amplicon2                    -Dseqrecd1Amplicon1Amplicon2-
         Amplicon1                       ➤    |                             |
       -->       <-                            -----------------------------

Example 5: Circular assembly of Amplicons

>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
         >       <         >       <

                     ⇣
                     pydna.design.assembly_fragments
                     ⇣

>       <=       ->       <-
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
        ->       <-       +>       <

                     ⇣
             make new Amplicon using the Amplicon1.template and
             the last fwd primer and the first rev primer.
                     ⇣
                                                   pydna.assembly.Assembly
+>       <=       ->       <-
 Amplicon1         Amplicon3                  -Amplicon1Amplicon2Amplicon3-
          Amplicon2                      ➤   |                             |
         ->       <-                          -----------------------------
Parameters:
  • f (list of pydna.amplicon.Amplicon and other Dseqrecord like objects) – list of Amplicon and Dseqrecord objects for which fusion primers should be constructed.

  • overlap (int, optional) – Length of required overlap between fragments.

  • maxlink (int, optional) – Maximum length of spacer sequences that may be present in f. These will be included in tails for designed primers.

  • circular (bool, optional) – If True, the assembly is circular. If False, the assembly is linear.

Returns:

seqs

[Amplicon1,
 Amplicon2, ...]

Return type:

list of pydna.amplicon.Amplicon and other Dseqrecord like objects

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.design import primer_design
>>> a=primer_design(Dseqrecord("atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"))
>>> b=primer_design(Dseqrecord("ccaaacccaccaggtaccttatgtaagtacttcaagtcgccagaagacttcttggtcaagttgcc"))
>>> c=primer_design(Dseqrecord("tgtactggtgctgaaccttgtatcaagttgggtgttgacgccattgccccaggtggtcgtttcgtt"))
>>> from pydna.design import assembly_fragments
>>> # We would like a circular recombination, so the first sequence has to be repeated
>>> fa1,fb,fc,fa2 = assembly_fragments([a,b,c,a])
>>> # Since all fragments are Amplicons, we need to extract the rp of the 1st and fp of the last fragments.
>>> from pydna.amplify import pcr
>>> fa = pcr(fa2.forward_primer, fa1.reverse_primer, a)
>>> [fa,fb,fc]
[Amplicon(100), Amplicon(101), Amplicon(102)]
>>> fa.name, fb.name, fc.name = "fa fb fc".split()
>>> from pydna.assembly import Assembly
>>> assemblyobj = Assembly([fa,fb,fc])
>>> assemblyobj
Assembly
fragments....: 100bp 101bp 102bp
limit(bp)....: 25
G.nodes......: 6
algorithm....: common_sub_strings
>>> assemblyobj.assemble_linear()
[Contig(-231), Contig(-166), Contig(-36)]
>>> assemblyobj.assemble_circular()[0].seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> (a+b+c).looped().seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> print(assemblyobj.assemble_circular()[0].figure())
 -|fa|36
|     \/
|     /\
|     36|fb|36
|           \/
|           /\
|           36|fc|36
|                 \/
|                 /\
|                 36-
|                    |
 --------------------
pydna.all.eq(*args, **kwargs)[source]#

Compare two or more DNA sequences for equality.

Compares two or more DNA sequences for equality i.e. if they represent the same double stranded DNA molecule.

Parameters:
  • args (iterable) – iterable containing sequences. The sequences can be strings, Biopython Seq or SeqRecord, Dseqrecord or dsDNA objects.

  • circular (bool, optional) – If True, consider all molecules circular; if False, consider them linear.

  • linear (bool, optional) – If True, consider all molecules linear; if False, consider them circular.

Returns:

eq – Returns True or False

Return type:

bool

Notes

Compares two or more DNA sequences for equality i.e. if they represent the same DNA molecule.

Two linear sequences are considered equal if either:

  1. They have the same sequence (case insensitive)

  2. One sequence is the reverse complement of the other

Two circular sequences are considered equal if they are circular permutations meaning that they have the same length and:

  1. One sequence can be found in the concatenation of the other sequence with itself.

  2. The reverse complement of one sequence can be found in the concatenation of the other sequence with itself.

The topology for the comparison can be set by passing one of the keywords linear or circular as True or False.

If circular or linear is not set, it will be deduced from the topology of each sequence for sequences that have a linear or circular attribute (like Dseq and Dseqrecord).
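
The two sets of rules above can be written out for plain strings as follows. This is a minimal sketch, not the code behind eq: same_linear and same_circular are hypothetical names, and only the unambiguous bases a, c, g and t are handled.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a lower-cased DNA string."""
    return seq.lower().translate(_COMPLEMENT)[::-1]


def same_linear(a: str, b: str) -> bool:
    """Same sequence (case insensitive) or reverse complements of each other."""
    a, b = a.lower(), b.lower()
    return a == b or a == reverse_complement(b)


def same_circular(a: str, b: str) -> bool:
    """Circular permutations: same length, and a or its reverse complement
    occurs in b concatenated with itself."""
    a, b = a.lower(), b.lower()
    if len(a) != len(b):
        return False
    doubled = b + b
    return a in doubled or reverse_complement(a) in doubled


linear_result = same_linear("Taaa", "aTaa")
circular_result = same_circular("Taaa", "aTaa")
```

Concatenating one sequence with itself makes every rotation of it appear as a substring, which is why a simple containment test is enough for the circular case.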

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.utils import eq
>>> eq("aaa","AAA")
True
>>> eq("aaa","AAA","TTT")
True
>>> eq("aaa","AAA","TTT","tTt")
True
>>> eq("aaa","AAA","TTT","tTt", linear=True)
True
>>> eq("Taaa","aTaa", linear = True)
False
>>> eq("Taaa","aTaa", circular = True)
True
>>> a=Dseqrecord("Taaa")
>>> b=Dseqrecord("aTaa")
>>> eq(a,b)
False
>>> eq(a,b,circular=True)
True
>>> a=a.looped()
>>> b=b.looped()
>>> eq(a,b)
True
>>> eq(a,b,circular=False)
False
>>> eq(a,b,linear=True)
False
>>> eq(a,b,linear=False)
True
>>> eq("ggatcc","GGATCC")
True
>>> eq("ggatcca","GGATCCa")
True
>>> eq("ggatcca","tGGATCC")
True
pydna.all.gbtext_clean(gbtext)[source]#

This function takes a string containing one sequence in Genbank format and returns a named tuple with two fields: gbtext, a string with the corrected Genbank sequence, and jseq, which contains the JSON intermediate.

Examples

>>> s = '''LOCUS       New_DNA      3 bp    DNA   CIRCULAR SYN        19-JUN-2013
... DEFINITION  .
... ACCESSION
... VERSION
... SOURCE      .
...   ORGANISM  .
... COMMENT
... COMMENT     ApEinfo:methylated:1
... ORIGIN
...         1 aaa
... //'''
>>> from pydna.readers import read
>>> read(s)  
... /site-packages/Bio/GenBank/Scanner.py:1388: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS       New_DNA      3 bp    DNA   CIRCULAR SYN        19-JUN-2013\n'
  "correct?\n:%r" % line, BiopythonParserWarning)
Traceback (most recent call last):
  File "... /pydna/readers.py", line 48, in read
    results = results.pop()
IndexError: pop from empty list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "... /pydna/readers.py", line 50, in read
    raise ValueError("No sequences found in data:\n({})".format(data[:79]))
ValueError: No sequences found in data:
(LOCUS       New_DNA      3 bp    DNA   CIRCULAR SYN        19-JUN-2013
DEFINITI)
>>> from pydna.genbankfixer import gbtext_clean
>>> s2, j2 = gbtext_clean(s)
>>> print(s2)
LOCUS       New_DNA                    3 bp ds-DNA     circular SYN 19-JUN-2013
DEFINITION  .
ACCESSION
VERSION
SOURCE      .
ORGANISM  .
COMMENT
COMMENT     ApEinfo:methylated:1
FEATURES             Location/Qualifiers
ORIGIN
        1 aaa
//
>>> s3 = read(s2)
>>> s3
Dseqrecord(o3)
>>> print(s3.format())
LOCUS       New_DNA                    3 bp    DNA     circular SYN 19-JUN-2013
DEFINITION  .
ACCESSION   New_DNA
VERSION     New_DNA
KEYWORDS    .
SOURCE
  ORGANISM  .
            .
COMMENT
            ApEinfo:methylated:1
FEATURES             Location/Qualifiers
ORIGIN
        1 aaa
//

pydna.alphabet module#

dscode - The nucleic acid alphabet used in pydna

This file serves to define dscode, the DNA alphabet used in pydna. Each symbol represents a basepair (two opposing bases in the two antiparallel DNA strands).

The alphabet is defined at the end of this docstring, which serves as the single source of truth. The alphabet is used to construct the codestrings dictionary, which has the following keys (strings) in the order indicated:

  1. un_ambiguous_ds_dna

  2. ds_rna

  3. ambiguous_ds_dna

  4. single_stranded_dna_rna

  5. loops_dna_rna

  6. mismatched_dna_rna

  7. gap

Each value of the codestrings dictionary is a multiline string. This string has five lines following this form:

W             1   Watson symbol
|             2   Pipe
C             3   Crick symbol
<empty line>  4
S             5   dscode symbol

W (line 1) and C (line 3) are complementary bases in a double stranded DNA molecule and S (line 5) is the symbol of the alphabet used to describe the base pair above it.

Line 2 must contain only the pipe character, indicating basepairing, and line 4 must be empty. The lines must be of equal length, and a series of tests are performed to ensure the integrity of the alphabet.

The string definitions as well as the keys for the codestrings dict follow this line and are contained in the last 13 lines of the docstring:

un_ambiguous_ds_dna
|    ds_rna
|    |  ambiguous_ds_dna
|    |  |           single_stranded_dna_rna
|    |  |           |          loops_dna_rna
|    |  |           |          |          mismatched_dna_rna
|    |  |           |          |          |                  gap
|    |  |           |          |          |                  |
GATC UA RYMKSWHBVDN GATC••••U• -----AGCTU AAACCCGGGTTTUUUGCT •
|||| || ||||||||||| |||||||||| |||||||||| |||||||||||||||||| |
CTAG AU YRKMSWDVBHN ••••CTAG•U AGCTU----- ACGACTAGTCGTGCTUUU •

GATC UO RYMKSWHBVDN PEXIQFZJ$% 0123456789 !#{}&*()<>@:?[]=_; •
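
The five-line form and its integrity rules can be illustrated with a small parser. This is a sketch under stated assumptions: parse_codestring is a hypothetical helper and not how pydna builds its tables, and the input is the single_stranded_dna_rna column copied from the definition above.

```python
from typing import Dict, Tuple


def parse_codestring(block: str) -> Dict[Tuple[str, str], str]:
    """Map (watson, crick) pairs to dscode symbols for one five-line block.
    Hypothetical helper illustrating the format checks described above."""
    lines = block.split("\n")
    if len(lines) != 5:
        raise ValueError("a codestring must have exactly five lines")
    watson, pipes, crick, empty, symbols = lines
    if set(pipes) != {"|"}:
        raise ValueError("line 2 must contain only the pipe character")
    if empty:
        raise ValueError("line 4 must be empty")
    if not len(watson) == len(pipes) == len(crick) == len(symbols):
        raise ValueError("lines 1, 2, 3 and 5 must be of equal length")
    return {(w, c): s for w, c, s in zip(watson, crick, symbols)}


single_stranded = "GATC••••U•\n||||||||||\n••••CTAG•U\n\nPEXIQFZJ$%"
table = parse_codestring(single_stranded)
```

Looking up ("G", "•") in the resulting table gives "P", the symbol for a G on the Watson strand with no base opposite.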

pydna.alphabet.regex_ss_melt_factory(length: int) Pattern[source]#

A regular expression for finding double-stranded regions flanked by single-stranded DNA that can be melted to shed a single-stranded fragment.

This function returns a regular expression that finds double-stranded regions (of length <= length) that are flanked by single-stranded regions on the same side in dscode format. These regions are useful to identify as potential melt sites, since melting them leads to the shedding of a single-stranded fragment.

The regular expression finds double stranded patches flanked by empty positions on the same side (see figure below). Melting of this kind of sites leads to the shedding of a single stranded fragment.

GFTTAJA   <-- dscode representing the ds DNA below.

G TTA A   <-- "TTA" is found by the regex for length <= 3
CTAATGT
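
A simplified version of such a regular expression can be built from the single stranded symbols of the alphabet. This is a sketch and not the pattern returned by regex_ss_melt_factory: ss_melt_regex is a hypothetical name, only upper case DNA symbols are handled, and it works on str rather than bytes.

```python
import re

WATSON_EMPTY = "QFZJ"  # symbols with a base on the Crick strand only
CRICK_EMPTY = "PEXI"   # symbols with a base on the Watson strand only


def ss_melt_regex(length: int) -> re.Pattern:
    """Find ds patches of at most `length` bp flanked on both sides by
    positions that are empty on the same strand (hypothetical helper)."""
    ds_run = f"[GATC]{{1,{length}}}"
    return re.compile(
        # Watson is empty on both flanks: the Watson fragment is shed.
        f"(?<=[{WATSON_EMPTY}])(?P<watson>{ds_run})(?=[{WATSON_EMPTY}])"
        # Crick is empty on both flanks: the Crick fragment is shed.
        f"|(?<=[{CRICK_EMPTY}])(?P<crick>{ds_run})(?=[{CRICK_EMPTY}])"
    )


match = ss_melt_regex(3).search("GFTTAJA")
groups = match.groupdict()
```

The lookbehind and lookahead assertions check the flanking symbols without consuming them, so only the double stranded patch itself is captured.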

Examples

>>> from pydna.dseq import Dseq
>>> from pydna.alphabet import regex_ss_melt_factory
>>> regex = regex_ss_melt_factory(3)
>>> s = Dseq("GFTTAJA")
>>> s
Dseq(-7)
G TTA A
CTAATGT
>>> mobj = regex.search(s._data)
>>> mobj.groupdict()
{'watson': b'TTA', 'crick': None}
Parameters:

length (int) – Max length of double stranded region flanked by single stranded regions.

Returns:

regular expression object.

Return type:

re.Pattern

pydna.alphabet.regex_ds_melt_factory(length: int) Pattern[source]#

A regular expression for finding double-stranded regions flanked by single-stranded DNA that can be melted to shed multiple double stranded fragments.

This function returns a regular expression that finds double-stranded regions (of length <= length) that are flanked by single-stranded regions on opposite sides in dscode format. These regions are useful to identify as potential melt sites, since melting them leads to separation into multiple double stranded fragments.

The regular expression finds double stranded patches flanked by empty positions on opposite sides (see figure below). Melting of this kind of site leads to separation into multiple double stranded fragments.

aaaGFTTAIAttt   <-- dscode

aaaG TTACAttt   <-- "TTA" is found by the regex for length <= 3
tttCTAAT Taaa

Examples

>>> from pydna.dseq import Dseq
>>> from pydna.alphabet import regex_ds_melt_factory
>>> regex = regex_ds_melt_factory(3)
>>> s = Dseq("aaaGFTTAIAttt")
>>> s
Dseq(-13)
aaaG TTACAttt
tttCTAAT Taaa
>>> mobj = regex.search(s._data)
>>> mobj.groupdict()
{'watson': None, 'crick': b'TTA'}
Parameters:

length (int) – Max length of double stranded region flanked by single stranded regions.

Returns:

regular expression object.

Return type:

re.Pattern

class pydna.alphabet.DseqParts(sticky_left5: str, sticky_left3: str, middle: str, sticky_right3: str, sticky_right5: str, single_watson: str, single_crick: str)[source]#

Bases: object

sticky_left5: str#
sticky_left3: str#
middle: str#
sticky_right3: str#
sticky_right5: str#
single_watson: str#
single_crick: str#
pydna.alphabet.get_parts(datastring: str) DseqParts[source]#

Returns a DseqParts instance containing the parts of a dsDNA sequence.

The datastring argument should contain a string with dscode symbols.

A regular expression is used to capture the single stranded regions at the ends as well as the ds region in the middle, if any.

The figure below numbers the regex capture groups and what they capture as well as the DseqParts instance field name for each group.

 group 0 "sticky_left5"
 |
 |      group 3"sticky_right5"
 |      |
---    ---
GGGATCC
   TAGGTCA
   ----
     |
     group 2 "middle"

 group 1 "sticky_left3"
 |
 |      group 4 "sticky_right3"
 |      |
---    ---
   ATCCAGT
CCCTAGG
   ----
     |
     group 2 "middle"

   group 5 "single_watson" (only an upper strand)
   |
-------
ATCCAGT
|||||||

   group 6 "single_crick" (only a lower strand)
   |
-------

|||||||
CCCTAGG

Up to seven groups (0..6) are captured. Some are mutually exclusive, which means that one of them is an empty string:

0 or 1, not both, a DNA fragment has either 5’ or 3’ sticky end.

2 or 5 or 6, a DNA molecule has a ds region or is entirely single stranded.

3 or 4, not both, either 5’ or 3’ sticky end.

Note that internal single stranded regions are not identified and will be contained in the middle part if they are present.
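
The grouping can be approximated with a short regular expression over the single stranded symbols. This is a sketch, not the expression used by get_parts: Parts and split_parts are hypothetical names, and only the upper case DNA symbols PEXI (Watson only) and QFZJ (Crick only) are treated as single stranded.

```python
import re
from typing import NamedTuple

WATSON_ONLY = "PEXI"  # base on the upper strand, nothing below
CRICK_ONLY = "QFZJ"   # base on the lower strand, nothing above


class Parts(NamedTuple):
    sticky_left5: str = ""
    sticky_left3: str = ""
    middle: str = ""
    sticky_right3: str = ""
    sticky_right5: str = ""
    single_watson: str = ""
    single_crick: str = ""


_ENDS = re.compile(
    rf"(?:(?P<left5>[{WATSON_ONLY}]+)|(?P<left3>[{CRICK_ONLY}]+))?"
    r"(?P<middle>.*?)"
    rf"(?:(?P<right3>[{WATSON_ONLY}]+)|(?P<right5>[{CRICK_ONLY}]+))?"
)


def split_parts(ds: str) -> Parts:
    """Split a dscode string into end overhangs and a middle part."""
    # Entirely single stranded molecules are handled first.
    if ds and all(c in WATSON_ONLY for c in ds):
        return Parts(single_watson=ds)
    if ds and all(c in CRICK_ONLY for c in ds):
        return Parts(single_crick=ds)
    m = _ENDS.fullmatch(ds)
    return Parts(
        sticky_left5=m.group("left5") or "",
        sticky_left3=m.group("left3") or "",
        middle=m.group("middle") or "",
        sticky_right3=m.group("right3") or "",
        sticky_right5=m.group("right5") or "",
    )


parts = split_parts("PPPATCFQZ")
```

Because the middle group accepts any symbol, internal single stranded regions end up in middle, as noted above.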

Parameters:

datastring (str) – A string with dscode.

Returns:

Seven string fields describing the DNA molecule: DseqParts(sticky_left5='', sticky_left3='', middle='', sticky_right3='', sticky_right5='', single_watson='', single_crick='')

Return type:

DseqParts

pydna.alphabet.dsbreaks(datastring: str) list[str][source]#

Find double strand breaks in DNA in dscode format.

An empty watson position next to an empty crick position in the dsDNA leads to a discontinuous DNA. This function is used to show breaks in DNA in Dseq.__init__.

>>> from pydna.alphabet import dsbreaks
>>> x, = dsbreaks("GATPFTAA")
>>> print(x)
[0:8]
GATG TAA
CTA TATT
>>> dsbreaks("GATC")
[]
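
The break condition itself is simple to express. The following is a sketch only, not the implementation of dsbreaks: break_positions is a hypothetical helper that reports positions instead of formatted strings and handles only the upper case DNA symbols.

```python
from typing import List

WATSON_ONLY = "PEXI"  # the Crick position is empty
CRICK_ONLY = "QFZJ"   # the Watson position is empty


def break_positions(ds: str) -> List[int]:
    """Indices i such that the molecule is discontinuous between the symbols
    at i - 1 and i, i.e. an empty Crick position is next to an empty Watson
    position (hypothetical helper)."""
    positions = []
    for i in range(1, len(ds)):
        left, right = ds[i - 1], ds[i]
        if (left in WATSON_ONLY and right in CRICK_ONLY) or (
            left in CRICK_ONLY and right in WATSON_ONLY
        ):
            positions.append(i)
    return positions


found = break_positions("GATPFTAA")
```

In "GATPFTAA" the symbol P has no Crick base and the following F has no Watson base, so neither strand connects the two halves.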
Parameters:

datastring (str) – A string representing DNA in dscode format.

Returns:

A list of 3-line strings.

Return type:

list[str]

pydna.alphabet.representation_tuple(datastring: str = '', length_limit_for_repr: int = 30, chunk: int = 4)[source]#

Two line string representation of a sequence of dscode symbols.

See pydna.alphabet module for the definition of the pydna dscode alphabet. The dscode has a symbol (ascii) character for base pairs and single stranded DNA.

This function is used by the Dseq.__repr__() method.

Parameters:

datastring (str, optional) – A string of dscode symbols. The default is “”.

Returns:

A two line string containing the Watson and Crick strands.

Return type:

str

pydna.alphabet.anneal_strands(strand_a: str, strand_b: str) bool[source]#

Test if two DNA strands containing dscode anneal or not.

Both strands are assumed to be given in 5’ -> 3’ direction.

Examples

>>> from pydna.alphabet import anneal_strands
>>> a = "TTA"
>>> b = "AAT"[::-1]
>>> anneal_strands(a, b)
True
>>> anneal_strands(b, a)
True
>>> c = "UUA"
>>> anneal_strands(c, b)
True
>>> anneal_strands(a.lower(), b)
True
>>> anneal_strands("TG", "AA")
False
Parameters:
  • strand_a (str) – A single DNA strand.

  • strand_b (str) – A single DNA strand.

Returns:

True if annealing is perfect.

Return type:

bool
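
The idea behind the check can be sketched without pydna, for plain bases only. The helper name `anneals_perfectly` and the complement table are assumptions made for this illustration; the real function also understands the other dscode symbols, which this sketch ignores.

```python
# Hypothetical helper, not part of pydna: plain-base annealing check.
_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def anneals_perfectly(strand_a: str, strand_b: str) -> bool:
    """Return True if two 5'->3' strands are perfect reverse complements."""
    # Normalise case and treat U like T, as the doctest above suggests.
    a = strand_a.upper().replace("U", "T")
    b = strand_b.upper().replace("U", "T")
    if len(a) != len(b):
        return False
    # Pair a[0] with b[-1], a[1] with b[-2], and so on.
    return all(_COMPLEMENT.get(x) == y for x, y in zip(a, reversed(b)))


print(anneals_perfectly("TTA", "TAA"))  # True
print(anneals_perfectly("TG", "AA"))    # False
```

Because both strands are written 5' -> 3', the second strand has to be read backwards before the bases are paired.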

pydna.amplicon module#

This module provides the Amplicon class for PCR simulation. This class is not meant to be used directly but is used by the amplify module.

class pydna.amplicon.Amplicon(record, *args, template=None, forward_primer=None, reverse_primer=None, **kwargs)[source]#

Bases: Dseqrecord

The Amplicon class holds information about a PCR reaction involving two primers and one template. This class is used by the Anneal class and is not meant to be instantiated directly.

Parameters:
  • forward_primer (SeqRecord(Biopython)) – SeqRecord object holding the forward (sense) primer

  • reverse_primer (SeqRecord(Biopython)) – SeqRecord object holding the reverse (antisense) primer

  • template (Dseqrecord) – Dseqrecord object holding the template (circular or linear)

classmethod from_SeqRecord(record, *args, path=None, **kwargs)[source]#
reverse_complement()[source]#

Reverse complement.

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("ggaatt")
>>> a
Dseqrecord(-6)
>>> a.seq
Dseq(-6)
ggaatt
ccttaa
>>> a.reverse_complement().seq
Dseq(-6)
aattcc
ttaagg
>>>
rc()#

Reverse complement.

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("ggaatt")
>>> a
Dseqrecord(-6)
>>> a.seq
Dseq(-6)
ggaatt
ccttaa
>>> a.reverse_complement().seq
Dseq(-6)
aattcc
ttaagg
>>>
figure()[source]#

This method returns a simple figure of the two primers binding to a part of the template.

5tacactcaccgtctatcattatc...cgactgtatcatctgatagcac3
                           ||||||||||||||||||||||
                          3gctgacatagtagactatcgtg5
5tacactcaccgtctatcattatc3
 |||||||||||||||||||||||
3atgtgagtggcagatagtaatag...gctgacatagtagactatcgtg5
Returns:

figure – A string containing a text representation of the primers annealing on the template (see example above).

Return type:

string

set_forward_primer_footprint(length)[source]#
set_reverse_primer_footprint(length)[source]#
program()[source]#
dbd_program()[source]#
primers()[source]#

pydna.amplify module#

This module provides the Anneal class and the pcr() function for PCR simulation. The pcr function is simpler to use, but expects only one PCR product. The Anneal class should be used if more flexibility is required.

Primers with 5’ tails as well as inverse PCR on circular templates are handled correctly.

class pydna.amplify.Anneal(primers, template, limit=13, **kwargs)[source]#

Bases: object

The Anneal class has the following important attributes:

forward_primers#

Description of forward_primers.

Type:

list

reverse_primers#

Description of reverse_primers.

Type:

list

template#

A copy of the template argument. Primer annealing sites have been added as features that can be visualized in a sequence editor such as ApE.

Type:

Dseqrecord

limit#

The limit of PCR primer annealing, default is 13 bp.

Type:

int, optional

property products#
report()#

Returns a short report describing whether and where primers anneal to the template.

pydna.amplify.pcr(*args, **kwargs) Amplicon[source]#

pcr is a convenience function for the Anneal class to simplify its usage, especially from the command line. If more than one or no PCR product is formed, a ValueError is raised.

args is any iterable of Dseqrecords or an iterable of iterables of Dseqrecords. args will be greedily flattened.

Parameters:
  • args (iterable containing sequence objects) – Several arguments are also accepted.

  • limit (int = 13, optional) – limit length of the annealing part of the primers.

Notes

sequences in args could be of type:

  • string

  • Seq

  • SeqRecord (or subclass)

  • Dseqrecord (or subclass)

The last sequence will be assumed to be the template while all preceding sequences will be assumed to be primers.

This is a powerful function, use with care!

Returns:

product – A pydna.amplicon.Amplicon object representing the PCR product. The direction of the PCR product will be the same as for the template sequence.

Return type:

Amplicon

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.readers import read
>>> from pydna.amplify import pcr
>>> from pydna.primer import Primer
>>> template = Dseqrecord("tacactcaccgtctatcattatctactatcgactgtatcatctgatagcac")
>>> from Bio.SeqRecord import SeqRecord
>>> p1 = Primer("tacactcaccgtctatcattatc")
>>> p2 = Primer("cgactgtatcatctgatagcac").reverse_complement()
>>> pcr(p1, p2, template)
Amplicon(51)
>>> pcr([p1, p2], template)
Amplicon(51)
>>> pcr((p1,p2,), template)
Amplicon(51)
>>>
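
The argument handling described above (greedy flattening, last item is the template) can be sketched in plain Python. The names `flatten` and `split_primers_and_template` are hypothetical, and the sketch only flattens lists and tuples; which iterables pydna itself flattens is not stated here.

```python
def flatten(args):
    """Recursively flatten nested lists/tuples, keeping other items whole."""
    flat = []
    for item in args:
        # Strings are iterable too, but each one is a single sequence here.
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def split_primers_and_template(*args):
    """Treat the last flattened item as template, the rest as primers."""
    *primers, template = flatten(args)
    return primers, template


print(split_primers_and_template(["p1", "p2"], "template"))
print(split_primers_and_template("p1", "p2", "template"))
```

Both calls give the same split, which mirrors how `pcr(p1, p2, template)` and `pcr([p1, p2], template)` are equivalent in the doctest.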

pydna.assembly module#

Assembly of sequences by homologous recombination.

Should also be useful for related techniques such as Gibson assembly and fusion PCR. Given a list of sequences (Dseqrecords), all sequences are analyzed for shared homology longer than the set limit.

A graph is constructed where each overlapping region forms a node and sequences separating the overlapping regions form edges.

            -- A --
catgatctacgtatcgtgt     -- B --
            atcgtgtactgtcatattc
                        catattcaaagttct



--x--> A --y--> B --z-->   (Graph)

Nodes:

A : atcgtgt
B : catattc

Edges:

x : catgatctacgt
y : actgt
z : aaagttct

The NetworkX package is used to trace linear and circular paths through the graph.
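
The node and edge sequences in the figure can be reproduced with a small plain-Python sketch. It is a simplification under stated assumptions: only terminal overlaps between consecutive fragments are considered, the fragment order is fixed, and `longest_terminal_overlap` is a hypothetical helper rather than a pydna function.

```python
def longest_terminal_overlap(left: str, right: str) -> str:
    """Longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


fragments = ["catgatctacgtatcgtgt", "atcgtgtactgtcatattc", "catattcaaagttct"]

# Nodes: the shared region between each pair of consecutive fragments.
nodes = [longest_terminal_overlap(a, b) for a, b in zip(fragments, fragments[1:])]

# Edges: what remains of each fragment once the flanking nodes are removed.
edges = []
start = 0
for frag, node_after in zip(fragments, nodes + [""]):
    end = len(frag) - len(node_after)
    edges.append(frag[start:end])
    start = len(node_after)

print(nodes)  # ['atcgtgt', 'catattc']
print(edges)  # ['catgatctacgt', 'actgt', 'aaagttct']
```

Joining edges and nodes alternately (x, A, y, B, z) gives the linear product traced through the graph.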

class pydna.assembly.Assembly(frags: List[Dseqrecord], limit: int = 25, algorithm: Callable[[str, str, int], List[Tuple[int, int, int]]] = common_sub_strings)[source]#

Bases: object

Assembly of a list of linear DNA fragments into linear or circular constructs. The Assembly is meant to replace the Assembly method as it is easier to use. Accepts a list of Dseqrecords (source fragments) to initiate an Assembly object. Several methods are available for analysis of overlapping sequences, graph construction and assembly.

Parameters:
  • fragments (list) – a list of Dseqrecord objects.

  • limit (int, optional) – The shortest shared homology to be considered

  • algorithm (function, optional) – The algorithm used to determine the shared sequences.

  • max_nodes (int) – The maximum number of nodes in the graph. This can be tweaked to manage sequences with a high number of shared sub sequences.

Examples

>>> from pydna.assembly import Assembly
>>> from pydna.dseqrecord import Dseqrecord
>>> a = Dseqrecord("acgatgctatactgCCCCCtgtgctgtgctcta")
>>> b = Dseqrecord("tgtgctgtgctctaTTTTTtattctggctgtatc")
>>> c = Dseqrecord("tattctggctgtatcGGGGGtacgatgctatactg")
>>> x = Assembly((a,b,c), limit=14)
>>> x
Assembly
fragments....: 33bp 34bp 35bp
limit(bp)....: 14
G.nodes......: 6
algorithm....: common_sub_strings
>>> x.assemble_circular()
[Contig(o59), Contig(o59)]
>>> x.assemble_circular()[0].seq.watson
'acgatgctatactgCCCCCtgtgctgtgctctaTTTTTtattctggctgtatcGGGGGt'
assemble_linear(**kwargs)#
assemble_circular(**kwargs)#

pydna.assembly2 module#

Improved implementation of the assembly module. To see a list of issues with the previous implementation, see [issues tagged with fixed-with-new-assembly-model](pydna-group/pydna#issues)

pydna.assembly2.gather_overlapping_locations(locs: list[Location], fragment_length: int) list[tuple[Location, ...]][source]#

Turn a list of locations into a list of tuples of those locations, where each tuple contains locations that overlap. For example, if locs = [loc1, loc2, loc3], and loc1 and loc2 overlap, the output will be [(loc1, loc2), (loc3,)].
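
A rough sketch of this grouping, using plain (start, end) tuples instead of Location objects. Two things are assumptions of the sketch: overlaps are chained transitively, and locations never wrap around the end of the fragment (so fragment_length is not needed). The name `group_overlapping` is hypothetical.

```python
def group_overlapping(intervals):
    """Group half-open (start, end) intervals that overlap, chaining overlaps."""
    groups, current, current_end = [], [], None
    for start, end in sorted(intervals):
        if current and start < current_end:
            # Overlaps the running group: extend it.
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            # No overlap: close the running group and start a new one.
            if current:
                groups.append(tuple(current))
            current, current_end = [(start, end)], end
    if current:
        groups.append(tuple(current))
    return groups


print(group_overlapping([(0, 5), (3, 8), (10, 12)]))
# [((0, 5), (3, 8)), ((10, 12),)]
```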

pydna.assembly2.ends_from_cutsite(cutsite: Tuple[Tuple[int, int], AbstractCut | None | _cas], seq: Dseq) tuple[tuple[str, str], tuple[str, str]][source]#

Get the sticky or blunt ends created by a restriction enzyme cut.

Parameters:
  • cutsite (CutSiteType) – A tuple ((cut_watson, ovhg), enzyme) describing where the cut occurs

  • seq (_Dseq) – The DNA sequence being cut

Raises:

ValueError – If cutsite is None

Returns:

A tuple of two tuples, each containing the type of end ("5'", "3'", or 'blunt') and the sequence of the overhang. The first tuple is for the left end, second for the right end.

Return type:

tuple[tuple[str, str], tuple[str, str]]

Examples

>>> from Bio.Restriction import NotI
>>> from pydna.dseq import Dseq
>>> from pydna.assembly2 import ends_from_cutsite
>>> x = Dseq("ctcgGCGGCCGCcagcggccg")
>>> x.get_cutsites(NotI)
[((6, -4), NotI)]
>>> ends_from_cutsite(x.get_cutsites(NotI)[0], x)
(("5'", 'ggcc'), ("5'", 'ggcc'))

pydna.assembly2.restriction_ligation_overlap(seqx: Dseqrecord, seqy: Dseqrecord, enzymes=RestrictionBatch, partial=False, allow_blunt=False) list[Tuple[int, int, int]][source]#

Assembly algorithm to find overlaps that would result from restriction and ligation.

Like in sticky and gibson, the order matters (see example below of partial overlap)

Parameters:
  • seqx (Dseqrecord) – The first sequence

  • seqy (Dseqrecord) – The second sequence

  • enzymes (RestrictionBatch) – The enzymes to use

  • partial (bool) – Whether to allow partial overlaps

  • allow_blunt (bool) – Whether to allow blunt ends

Returns:

A list of overlaps between the two sequences

Return type:

list[SequenceOverlap]

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.assembly2 import restriction_ligation_overlap
>>> from Bio.Restriction import EcoRI, RgaI, DrdI, EcoRV
>>> x = Dseqrecord("ccGAATTCaa")
>>> y = Dseqrecord("aaaaGAATTCgg")
>>> restriction_ligation_overlap(x, y, [EcoRI])
[(3, 5, 4)]
>>> restriction_ligation_overlap(y, x, [EcoRI])
[(5, 3, 4)]

Partial overlap, note how it is not symmetric

>>> x = Dseqrecord("GACTAAAGGGTC")
>>> y = Dseqrecord("AAGCGATCGCAAGCGATCGCAA")
>>> restriction_ligation_overlap(x, y, [RgaI, DrdI], partial=True)
[(6, 5, 1), (6, 15, 1)]
>>> restriction_ligation_overlap(y, x, [RgaI, DrdI], partial=True)
[]

Blunt overlap, returns length of the overlap 0

>>> x = Dseqrecord("aaGATATCcc")
>>> y = Dseqrecord("ttttGATATCaa")
>>> restriction_ligation_overlap(x, y, [EcoRV], allow_blunt=True)
[(5, 7, 0)]
>>> restriction_ligation_overlap(y, x, [EcoRV], allow_blunt=True)
[(7, 5, 0)]

pydna.assembly2.combine_algorithms(*algorithms: Callable[[Dseqrecord, Dseqrecord, int], list[Tuple[int, int, int]]]) Callable[[Dseqrecord, Dseqrecord, int], list[Tuple[int, int, int]]][source]#

Combine assembly algorithms, if any of them returns a match, the match is returned.

This can be used for example in a ligation where you want to allow both sticky and blunt end ligation.
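
One way such a combinator could look is sketched below. It assumes the combined function concatenates the matches of every algorithm in order; the name `combine` and the two toy algorithms are hypothetical and operate on plain strings instead of Dseqrecord objects.

```python
from typing import Callable

Match = tuple[int, int, int]
Algorithm = Callable[[str, str, int], list[Match]]


def combine(*algorithms: Algorithm) -> Algorithm:
    """Return a function that collects the matches of all given algorithms."""
    def combined(seqx, seqy, limit):
        matches = []
        for algorithm in algorithms:
            matches.extend(algorithm(seqx, seqy, limit))
        return matches
    return combined


# Toy stand-ins for real overlap algorithms (assumptions for this sketch).
def toy_sticky(seqx, seqy, limit):
    return []  # pretend no sticky ends are compatible


def toy_blunt(seqx, seqy, limit):
    return [(len(seqx), 0, 0)]  # pretend both ends are blunt


ligation = combine(toy_sticky, toy_blunt)
print(ligation("AAAAAA", "TTTTTT", 0))  # [(6, 0, 0)]
```

The returned function has the same signature as its inputs, so it can be used wherever a single algorithm is expected.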

pydna.assembly2.blunt_overlap(seqx: Dseqrecord, seqy: Dseqrecord, limit=None) list[Tuple[int, int, int]][source]#

Assembly algorithm to find blunt overlaps. Used for blunt ligation.

It basically returns [(len(seqx), 0, 0)] if the right end of seqx is blunt and the left end of seqy is blunt (compatible with blunt ligation). Otherwise, it returns an empty list.

Parameters:
  • seqx (Dseqrecord) – The first sequence

  • seqy (Dseqrecord) – The second sequence

  • limit (int) – There for compatibility, but it is ignored

Returns:

A list of overlaps between the two sequences

Return type:

list[SequenceOverlap]

Examples

>>> from pydna.assembly2 import blunt_overlap
>>> from pydna.dseqrecord import Dseqrecord
>>> x = Dseqrecord("AAAAAA")
>>> y = Dseqrecord("TTTTTT")
>>> blunt_overlap(x, y)
[(6, 0, 0)]

pydna.assembly2.common_sub_strings(seqx: Dseqrecord, seqy: Dseqrecord, limit=25) list[Tuple[int, int, int]][source]#

Assembly algorithm to find common substrings of length >= limit. See the docs of the function common_sub_strings_str for more details. It is case insensitive.

>>> from pydna.dseqrecord import Dseqrecord
>>> x = Dseqrecord("TAAAAAAT")
>>> y = Dseqrecord("CCaAaAaACC")
>>> common_sub_strings(x, y, limit=5)
[(1, 2, 6), (1, 3, 5), (2, 2, 5)]
pydna.assembly2.terminal_overlap(seqx: Dseqrecord, seqy: Dseqrecord, limit=25, trim_ends: None | str = None)[source]#

Assembly algorithm to find terminal overlaps (e.g. for Gibson assembly). The order matters, we want alignments like:

seqx:    oooo------xxxx
seqy:              xxxx------oooo
Product: oooo------xxxx------oooo

Not like:

seqx:               oooo------xxxx
seqy:     xxxx------oooo
Product (unwanted): oooo
Parameters:
  • seqx (Dseqrecord) – The first sequence

  • seqy (Dseqrecord) – The second sequence

  • limit (int) – Minimum length of the overlap

  • trim_ends (str) – The ends to trim, either "5'" or "3'". If None, no trimming is done

Returns:

A list of overlaps between the two sequences

Return type:

list[SequenceOverlap]

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.assembly2 import terminal_overlap
>>> x = Dseqrecord("ttactaAAAAAA")
>>> y = Dseqrecord("AAAAAAcgcacg")
>>> terminal_overlap(x, y, limit=5)
[(6, 0, 6), (7, 0, 5)]
>>> terminal_overlap(y, x, limit=5)
[]

Trimming the ends

>>> from pydna.dseq import Dseq
>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.assembly2 import terminal_overlap
>>> x = Dseqrecord(Dseq.from_full_sequence_and_overhangs("aaaACGT", 0, 3))
>>> y = Dseqrecord(Dseq.from_full_sequence_and_overhangs("ACGTccc", 3, 0))
>>> terminal_overlap(x, y, limit=4)
[(3, 0, 4)]
>>> terminal_overlap(x, y, limit=4, trim_ends="5'")
[(3, 0, 4)]
>>> terminal_overlap(x, y, limit=4, trim_ends="3'")
[]
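
The suffix/prefix search itself can be sketched on plain strings. This is an illustration only: `terminal_overlap_sketch` is a hypothetical name, upper-casing both inputs is an assumption, and overhang trimming (trim_ends) is left out entirely.

```python
def terminal_overlap_sketch(x: str, y: str, limit: int = 25):
    """Find suffixes of x (at least `limit` long) that are prefixes of y."""
    x, y = x.upper(), y.upper()
    matches = []
    # Try the longest candidate first, down to the minimum length.
    for size in range(min(len(x), len(y)), limit - 1, -1):
        if x[len(x) - size:] == y[:size]:
            # (start in x, start in y, length of the overlap)
            matches.append((len(x) - size, 0, size))
    return matches


print(terminal_overlap_sketch("ttactaAAAAAA", "AAAAAAcgcacg", limit=5))
# [(6, 0, 6), (7, 0, 5)]
print(terminal_overlap_sketch("AAAAAAcgcacg", "ttactaAAAAAA", limit=5))
# []
```

Swapping the arguments gives no match because the search only compares the right end of the first sequence with the left end of the second, which is why the order matters.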

pydna.assembly2.gibson_overlap(seqx: Dseqrecord, seqy: Dseqrecord, limit=25)[source]#

Assembly algorithm to find terminal overlaps for Gibson assembly. It is a wrapper around terminal_overlap with trim_ends=”5’”.

pydna.assembly2.in_fusion_overlap(seqx: Dseqrecord, seqy: Dseqrecord, limit=25)[source]#

Assembly algorithm to find terminal overlaps for in-fusion assembly. It is a wrapper around terminal_overlap with trim_ends=”3’”.

pydna.assembly2.pcr_fusion_overlap(seqx: Dseqrecord, seqy: Dseqrecord, limit=25)[source]#

Assembly algorithm to find terminal overlaps for PCR fusion assembly. It is a wrapper around terminal_overlap with trim_ends=None.

pydna.assembly2.sticky_end_sub_strings(seqx: Dseqrecord, seqy: Dseqrecord, limit: bool = False)[source]#

Assembly algorithm for ligation of sticky ends.

For now, if limit is 0 / False (default), only full overlaps are considered. Otherwise, partial overlaps are also returned.

Parameters:
  • seqx (Dseqrecord) – The first sequence

  • seqy (Dseqrecord) – The second sequence

  • limit (bool) – Whether to allow partial overlaps

Returns:

A list of overlaps between the two sequences

Return type:

list[SequenceOverlap]

Ligation of fully overlapping sticky ends, note how the order matters

>>> from pydna.dseq import Dseq
>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.assembly2 import sticky_end_sub_strings
>>> x = Dseqrecord(Dseq.from_full_sequence_and_overhangs("AAAAAA", 0, 3))
>>> y = Dseqrecord(Dseq.from_full_sequence_and_overhangs("AAAAAA", 3, 0))
>>> sticky_end_sub_strings(x, y, limit=False)
[(3, 0, 3)]
>>> sticky_end_sub_strings(y, x, limit=False)
[]

Ligation of partially overlapping sticky ends, specified with limit=True

>>> x = Dseqrecord(Dseq.from_full_sequence_and_overhangs("AAAAAA", 0, 2))
>>> y = Dseqrecord(Dseq.from_full_sequence_and_overhangs("AAAAAA", 3, 0))
>>> sticky_end_sub_strings(x, y, limit=False)
[]
>>> sticky_end_sub_strings(x, y, limit=True)
[(4, 0, 2)]
pydna.assembly2.zip_match_leftwards(seqx: SeqRecord, seqy: SeqRecord, match: Tuple[int, int, int]) Tuple[int, int, int][source]#

Starting from the rightmost edge of the match, return a new match encompassing the max number of bases. This can be used to return a longer match if a primer aligns for longer than the limit or a shorter match if there are mismatches. This is convenient to maintain as many features as possible. It is used in PCR assembly.

>>> seq = Dseqrecord('AAAAACGTCCCGT')
>>> primer = Dseqrecord('ACGTCCCGT')
>>> match = (13, 9, 0) # an empty match at the end of each
>>> zip_match_leftwards(seq, primer, match)
(4, 0, 9)

Works in circular molecules if the match spans the origin:

>>> seq = Dseqrecord('TCCCGTAAAAACG', circular=True)
>>> primer = Dseqrecord('ACGTCCCGT')
>>> match = (6, 9, 0)
>>> zip_match_leftwards(seq, primer, match)
(10, 0, 9)
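
For linear sequences, the leftward extension can be sketched on plain strings as below. The name `zip_leftwards` is hypothetical, the comparison is case-sensitive, and circular templates are not handled; these are simplifications of the sketch, not statements about pydna.

```python
def zip_leftwards(x: str, y: str, match):
    """Extend a match leftwards from its right edge while bases agree."""
    start_x, start_y, length = match
    # Right edge of the match on each sequence.
    end_x, end_y = start_x + length, start_y + length
    n = 0
    while n < end_x and n < end_y and x[end_x - n - 1] == y[end_y - n - 1]:
        n += 1
    return (end_x - n, end_y - n, n)


print(zip_leftwards("AAAAACGTCCCGT", "ACGTCCCGT", (13, 9, 0)))  # (4, 0, 9)
```

Starting from an empty match at the right ends, the walk stops when the primer is exhausted, so the full 9 bases are recovered.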

pydna.assembly2.zip_match_rightwards(seqx: Dseqrecord, seqy: Dseqrecord, match: Tuple[int, int, int]) Tuple[int, int, int][source]#

Same as zip_match_leftwards, but towards the right.

pydna.assembly2.seqrecord2_uppercase_DNA_string(seqr: SeqRecord) str[source]#

Transform a Dseqrecord to a sequence string where U is replaced by T, everything is upper case and circular sequences are repeated twice. This is used for PCR, to support primers with U’s (e.g. for USER cloning).
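
A plain-string sketch of the transformation is given below. `uppercase_dna_string` is a hypothetical name that takes a string and a flag instead of a record. Doubling a circular sequence lets an ordinary substring search find sites that span the origin.

```python
def uppercase_dna_string(sequence: str, circular: bool = False) -> str:
    """Upper-case, replace U by T, and repeat circular sequences twice."""
    out = sequence.upper().replace("U", "T")
    return out * 2 if circular else out


print(uppercase_dna_string("acgu"))  # ACGT

# A site spanning the origin of a circular template becomes searchable.
doubled = uppercase_dna_string("TCCCGTAAAAACG", circular=True)
print(doubled.find("ACGTCCCGT"))  # 10
```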

pydna.assembly2.primer_template_overlap(seqx: Dseqrecord | Primer, seqy: Dseqrecord | Primer, limit=25, mismatches=0) list[Tuple[int, int, int]][source]#

Assembly algorithm to find overlaps between a primer and a template. It accepts mismatches. When there are mismatches, it only returns the common part between the primer and the template.

If seqx is a primer and seqy is a template, it represents the binding of a forward primer. If seqx is a template and seqy is a primer, it represents the binding of a reverse primer, where the primer has been passed as its reverse complement (see examples).

Parameters:
  • seqx (Dseqrecord | Primer) – The primer

  • seqy (Dseqrecord | Primer) – The template

  • limit (int) – Minimum length of the overlap

  • mismatches (int) – Maximum number of mismatches (only substitutions, no deletion or insertion)

Returns:

A list of overlaps between the primer and the template

Return type:

list[SequenceOverlap]

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.primer import Primer
>>> from pydna.assembly2 import primer_template_overlap
>>> template = Dseqrecord("AATTAGCAGCGATCGAGT", circular=True)
>>> primer = Primer("TTAGCAGC")
>>> primer_template_overlap(primer, template, limit=8, mismatches=0)
[(0, 2, 8)]

This actually represents the binding of the primer GCTGCTAA (reverse complement)

>>> primer_template_overlap(template, primer, limit=8, mismatches=0)
[(2, 0, 8)]
>>> primer_template_overlap(primer, template.reverse_complement(), limit=8, mismatches=0)
[]
>>> primer_template_overlap(primer.reverse_complement(), template, limit=8, mismatches=0)
[]

pydna.assembly2.reverse_complement_assembly(assembly: list[Tuple[int, int, Location | None, Location | None]], fragments: list[Dseqrecord]) list[Tuple[int, int, Location | None, Location | None]][source]#

Complement an assembly, i.e. reverse the order of the fragments and the orientation of the overlaps.

pydna.assembly2.filter_linear_subassemblies(linear_assemblies: list[list[Tuple[int, int, Location | None, Location | None]]], circular_assemblies: list[list[Tuple[int, int, Location | None, Location | None]]], fragments: list[Dseqrecord]) list[list[Tuple[int, int, Location | None, Location | None]]][source]#

Remove linear assemblies which are sub-assemblies of circular assemblies.

pydna.assembly2.remove_subassemblies(assemblies: list[list[Tuple[int, int, Location | None, Location | None]]]) list[list[Tuple[int, int, Location | None, Location | None]]][source]#

Filter out subassemblies, i.e. assemblies that are contained within another assembly.

For example:

[(1, 2, '1[8:14]:2[1:7]'), (2, 3, '2[10:17]:3[1:8]')]
[(1, 2, '1[8:14]:2[1:7]')]

The second one is a subassembly of the first one.
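
The filtering can be sketched with assemblies written as lists of (u, v, key) tuples, as in the example above. Treating "contained within" as "appears as a contiguous run" is an assumption of this sketch, and both function names are hypothetical.

```python
def is_contiguous_sublist(small, big):
    """True if `small` appears as a contiguous run inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def drop_subassemblies(assemblies):
    """Keep only assemblies that are not contained in a longer kept one."""
    kept = []
    # Visit the longest assemblies first so containers are kept before parts.
    for candidate in sorted(assemblies, key=len, reverse=True):
        if not any(is_contiguous_sublist(candidate, other) for other in kept):
            kept.append(candidate)
    return kept


full = [(1, 2, '1[8:14]:2[1:7]'), (2, 3, '2[10:17]:3[1:8]')]
part = [(1, 2, '1[8:14]:2[1:7]')]
print(drop_subassemblies([part, full]) == [full])  # True
```

A side effect of this simple version is that exact duplicates are also dropped, since an assembly is trivially contained in an identical one.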

pydna.assembly2.assembly2str(assembly: list[Tuple[int, int, Location | None, Location | None]]) str[source]#

Convert an assembly to a string representation, for example: ((1, 2, [8:14], [1:7]),(2, 3, [10:17], [1:8])) becomes: (‘1[8:14]:2[1:7]’, ‘2[10:17]:3[1:8]’)

The reason for this is that by default, a feature ‘[8:14]’ when present in a tuple is printed to the console as SimpleLocation(ExactPosition(8), ExactPosition(14), strand=1) (very long).
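
The conversion can be sketched with locations written as plain (start, end) tuples. This is a simplification: the real function works on Biopython locations, and None locations or strand information are not handled here. `assembly_to_strings` is a hypothetical name.

```python
def assembly_to_strings(assembly):
    """Format each (u, v, loc_u, loc_v) edge as 'u[start:end]:v[start:end]'."""
    return tuple(
        f"{u}[{loc_u[0]}:{loc_u[1]}]:{v}[{loc_v[0]}:{loc_v[1]}]"
        for u, v, loc_u, loc_v in assembly
    )


assembly = ((1, 2, (8, 14), (1, 7)), (2, 3, (10, 17), (1, 8)))
print(assembly_to_strings(assembly))
# ('1[8:14]:2[1:7]', '2[10:17]:3[1:8]')
```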

pydna.assembly2.assembly2str_tuple(assembly: list[Tuple[int, int, Location | None, Location | None]]) str[source]#

Convert an assembly to a string representation, like ((1, 2, [8:14], [1:7]),(2, 3, [10:17], [1:8]))

pydna.assembly2.assembly_has_mismatches(fragments: list[Dseqrecord], assembly: list[Tuple[int, int, Location | None, Location | None]]) bool[source]#

Check if an assembly has mismatches. This should never happen and if so it returns an error.

pydna.assembly2.assembly_is_circular(assembly: list[Tuple[int, int, Location | None, Location | None]], fragments: list[Dseqrecord]) bool[source]#

Based on the topology of the locations of an assembly, determine if it is circular. This does not work for insertion assemblies, that’s why assemble takes the optional argument is_insertion.

pydna.assembly2.assemble(fragments: list[Dseqrecord], assembly: list[Tuple[int, int, Location | None, Location | None]], is_insertion: bool = False) Dseqrecord[source]#

Generate a Dseqrecord from an assembly and a list of fragments.

pydna.assembly2.annotate_primer_binding_sites(input_dseqr: Dseqrecord, fragments: list[Dseqrecord]) Dseqrecord[source]#

Annotate the primer binding sites in a Dseqrecord.

pydna.assembly2.edge_representation2subfragment_representation(assembly: list[Tuple[int, int, Location | None, Location | None]], is_circular: bool) list[Tuple[int, Location | None, Location | None]][source]#

Turn this kind of edge representation:

fragment 1, fragment 2, right edge on 1, left edge on 2
a = [(1, 2, 'loc1a', 'loc2a'), (2, 3, 'loc2b', 'loc3b'), (3, 1, 'loc3c', 'loc1c')]

Into this:

fragment 1, left edge on 1, right edge on 1
b = [(1, 'loc1c', 'loc1a'), (2, 'loc2a', 'loc2b'), (3, 'loc3b', 'loc3c')]
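
A minimal sketch of this conversion follows, with locations as plain strings. The circular case follows the example above; the linear case (None for the outer edges of the first and last fragment) is an assumption based on the `Location | None` type hints. `edges_to_subfragments` is a hypothetical name.

```python
def edges_to_subfragments(assembly, is_circular):
    """Convert (u, v, loc_u, loc_v) edges into (fragment, left, right) tuples."""
    assembly = list(assembly)
    if is_circular:
        # The closing edge supplies the left edge of the first fragment.
        padded = [assembly[-1]] + assembly
    else:
        # Assumed convention: outer ends of a linear assembly have no edge.
        first_u, last_v = assembly[0][0], assembly[-1][1]
        padded = [(None, first_u, None, None)] + assembly + [(last_v, None, None, None)]
    return [
        (right_edge[0], left_edge[3], right_edge[2])
        for left_edge, right_edge in zip(padded, padded[1:])
    ]


a = [(1, 2, 'loc1a', 'loc2a'), (2, 3, 'loc2b', 'loc3b'), (3, 1, 'loc3c', 'loc1c')]
print(edges_to_subfragments(a, True))
# [(1, 'loc1c', 'loc1a'), (2, 'loc2a', 'loc2b'), (3, 'loc3b', 'loc3c')]
```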

pydna.assembly2.subfragment_representation2edge_representation(assembly: list[Tuple[int, Location | None, Location | None]], is_circular: bool) list[Tuple[int, int, Location | None, Location | None]][source]#

Turn this kind of subfragment representation:

fragment 1, left edge on 1, right edge on 1
a = [(1, 'loc1c', 'loc1a'), (2, 'loc2a', 'loc2b'), (3, 'loc3b', 'loc3c')]

Into this:

fragment 1, fragment 2, right edge on 1, left edge on 2
b = [(1, 2, 'loc1a', 'loc2a'), (2, 3, 'loc2b', 'loc3b'), (3, 1, 'loc3c', 'loc1c')]
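
The reverse direction can be sketched the same way, again with locations as plain strings. `subfragments_to_edges` is a hypothetical name, and the sketch assumes that a circular assembly only differs from a linear one by the extra closing edge.

```python
def subfragments_to_edges(subfragments, is_circular):
    """Convert (fragment, left, right) tuples into (u, v, loc_u, loc_v) edges."""
    edges = [
        (u, v, right_u, left_v)
        for (u, _left_u, right_u), (v, left_v, _right_v)
        in zip(subfragments, subfragments[1:])
    ]
    if is_circular:
        # Close the circle: join the last fragment back to the first one.
        last, first = subfragments[-1], subfragments[0]
        edges.append((last[0], first[0], last[2], first[1]))
    return edges


subs = [(1, 'loc1c', 'loc1a'), (2, 'loc2a', 'loc2b'), (3, 'loc3b', 'loc3c')]
print(subfragments_to_edges(subs, True))
# [(1, 2, 'loc1a', 'loc2a'), (2, 3, 'loc2b', 'loc3b'), (3, 1, 'loc3c', 'loc1c')]
```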

pydna.assembly2.get_assembly_subfragments(fragments: list[Dseqrecord], subfragment_representation: list[Tuple[int, Location | None, Location | None]]) list[Dseqrecord][source]#

From the fragment representation returned by edge_representation2subfragment_representation, get the subfragments that are joined together.

Subfragments are the slices of the fragments that are joined together

For example:

  --A--
TACGTAAT
  --B--
 TCGTAACGA

Gives: TACGTAA / CGTAACGA

To reproduce:

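from pydna.dseqrecord import Dseqrecord
from pydna.assembly2 import (Assembly, assembly2str,
    edge_representation2subfragment_representation, get_assembly_subfragments)

a = Dseqrecord('TACGTAAT')
b = Dseqrecord('TCGTAACGA')
asm = Assembly([a, b], limit=5)
a0 = asm.get_linear_assemblies()[0]
print(assembly2str(a0))
a0_subfragment_rep = edge_representation2subfragment_representation(a0, False)
# Use a separate loop variable so the Assembly object is not shadowed
for subfragment in get_assembly_subfragments([a, b], a0_subfragment_rep):
    print(subfragment.seq)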

# prints TACGTAA and CGTAACGA

Subfragments: cccccgtatcgtgt, atcgtgtactgtcatattc

pydna.assembly2.extract_subfragment(seq: Dseqrecord, start_location: Location | None, end_location: Location | None) Dseqrecord[source]#

Extract a subfragment from a sequence for an assembly, given the start and end locations of the subfragment.

pydna.assembly2.is_sublist(sublist: list, my_list: list, my_list_is_cyclic: bool = False) bool[source]#

Returns True if argument sublist is a sublist of argument my_list (can be treated as cyclic), False otherwise.

Examples

>>> is_sublist([1, 2], [1, 2, 3], False)
True
>>> is_sublist([1, 2], [1, 3, 2], False)
False

# See the case here for cyclic lists
>>> is_sublist([3, 1], [1, 2, 3], False)
False
>>> is_sublist([3, 1], [1, 2, 3], True)
True
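
One way to implement such a check is sketched below under the hypothetical name `is_sublist_sketch`. For the cyclic case, the list is extended with its own first items so that runs wrapping around the end can be found with ordinary slicing.

```python
def is_sublist_sketch(sublist, my_list, my_list_is_cyclic=False):
    """True if `sublist` occurs as a contiguous run in `my_list`."""
    n = len(sublist)
    if n > len(my_list):
        return False
    haystack = list(my_list)
    if my_list_is_cyclic:
        # Append the first n-1 items to allow wrap-around matches.
        haystack += my_list[: max(n - 1, 0)]
    return any(haystack[i:i + n] == list(sublist) for i in range(len(haystack) - n + 1))


print(is_sublist_sketch([1, 2], [1, 2, 3]))        # True
print(is_sublist_sketch([3, 1], [1, 2, 3]))        # False
print(is_sublist_sketch([3, 1], [1, 2, 3], True))  # True
```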

pydna.assembly2.circular_permutation_min_abs(lst: list) list[source]#

Returns the circular permutation of lst with the smallest absolute value first.

Examples

>>> circular_permutation_min_abs([1, 2, 3])
[1, 2, 3]
>>> circular_permutation_min_abs([3, 1, 2])
[1, 2, 3]
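
The rotation can be sketched in a few lines. `rotate_to_min_abs` is a hypothetical name, and picking the first occurrence when several elements share the smallest absolute value is an assumption of this sketch.

```python
def rotate_to_min_abs(lst):
    """Rotate `lst` so the element with the smallest absolute value is first."""
    if not lst:
        return []
    start = min(range(len(lst)), key=lambda i: abs(lst[i]))
    return lst[start:] + lst[:start]


print(rotate_to_min_abs([3, 1, 2]))   # [1, 2, 3]
print(rotate_to_min_abs([2, -1, 3]))  # [-1, 3, 2]
```

Using the absolute value means a fragment keeps its position as the starting point regardless of the sign that encodes its orientation.
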
class pydna.assembly2.Assembly(frags: list[Dseqrecord], limit: int = 25, algorithm: Callable[[Dseqrecord, Dseqrecord, int], list[Tuple[int, int, int]]] = common_sub_strings, use_fragment_order: bool = True, use_all_fragments: bool = False)[source]#

Bases: object

Assembly of a list of DNA fragments into linear or circular constructs. Accepts a list of Dseqrecords (source fragments) to initiate an Assembly object. Several methods are available for analysis of overlapping sequences, graph construction and assembly.

The assembly contains a directed graph, where nodes represent fragments and edges represent overlaps between fragments:

  • The node keys are integers, representing the index of the fragment in the input list of fragments. The sign of the node key represents the orientation of the fragment, positive for forward orientation, negative for reverse orientation.

  • The edges contain the locations of the overlaps in the fragments. For an edge (u, v, key):
    • u and v are the nodes connected by the edge.

    • key is a string that represents the location of the overlap. In the format: ‘u[start:end](strand):v[start:end](strand)’.

    • Edges have a ‘locations’ attribute, which is a list of two FeatureLocation objects, representing the location of the overlap in the u and v fragment, respectively.

    • You can think of an edge as a representation of the join of two fragments.

If fragment 1 and 2 share a subsequence of 6bp, [8:14] in fragment 1 and [1:7] in fragment 2, there will be 4 edges representing that overlap in the graph, for all possible orientations of the fragments (see add_edges_from_match for details):

  • (1, 2, '1[8:14]:2[1:7]')

  • (2, 1, '2[1:7]:1[8:14]')

  • (-1, -2, '-1[0:6]:-2[10:16]')

  • (-2, -1, '-2[10:16]:-1[0:6]')
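
The coordinates in these four keys follow from flipping each location onto the reverse complement strand. The sketch below reproduces them with plain tuples; the fragment lengths (14 and 17) are assumptions chosen to be consistent with the keys listed above, and `flip` and `edges_for_overlap` are hypothetical helpers.

```python
def flip(start, end, length):
    """Coordinates of [start:end] on the reverse complement of a sequence."""
    return length - end, length - start


def edges_for_overlap(u, v, loc_u, loc_v, len_u, len_v):
    """List the four (u, v, key) edges for one shared subsequence."""
    def key(a, loc_a, b, loc_b):
        return f"{a}[{loc_a[0]}:{loc_a[1]}]:{b}[{loc_b[0]}:{loc_b[1]}]"

    rc_u = flip(*loc_u, len_u)
    rc_v = flip(*loc_v, len_v)
    return [
        (u, v, key(u, loc_u, v, loc_v)),
        (v, u, key(v, loc_v, u, loc_u)),
        (-u, -v, key(-u, rc_u, -v, rc_v)),
        (-v, -u, key(-v, rc_v, -u, rc_u)),
    ]


for edge in edges_for_overlap(1, 2, (8, 14), (1, 7), len_u=14, len_v=17):
    print(edge)
```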

An assembly can be thought of as a tuple of graph edges, but instead of representing them with node indexes and keys, we represent them as u, v, locu, locv, where u and v are the nodes connected by the edge, and locu and locv are the locations of the overlap in the first and second fragment. Assemblies are then represented as:

  • Linear: ((1, 2, [8:14], [1:7]), (2, 3, [10:17], [1:8]))

  • Circular: ((1, 2, [8:14], [1:7]), (2, 3, [10:17], [1:8]), (3, 1, [12:17], [1:6]))

Note that the first and last fragment are the same in a circular assembly.

The following constraints are applied to remove duplicate assemblies:

  • Circular assemblies: the first subfragment is not reversed, and has the smallest index in the input fragment list. use_fragment_order is ignored.

  • Linear assemblies:
    • Using uid (see add_edges_from_match) to identify unique edges.

Parameters:
  • frags (list) – A list of Dseqrecord objects.

  • limit (int, optional) – The shortest shared homology to be considered, this is passed as the third argument to the algorithm function. For certain algorithms, this might be ignored.

  • algorithm (function, optional) – The algorithm used to determine the shared sequences. It’s a function that takes two Dseqrecord objects as inputs, and will get passed the third argument (limit), that may or may not be used. It must return a list of overlaps (see common_sub_strings for an example).

  • use_fragment_order (bool, optional) – It’s set to True by default to reproduce legacy pydna behaviour: only assemblies that start with the first fragment and end with the last are considered. You should set it to False.

  • use_all_fragments (bool, optional) – Constrain the assembly to use all fragments.

Examples

from pydna.assembly2 import Assembly, assembly2str
from pydna.dseqrecord import Dseqrecord

example_fragments = (
    Dseqrecord('AacgatCAtgctcc', name='a'),
    Dseqrecord('TtgctccTAAattctgc', name='b'),
    Dseqrecord('CattctgcGAGGacgatG', name='c'),
)

asm = Assembly(example_fragments, limit=5, use_fragment_order=False)
print('Linear ===============')
for assembly in asm.get_linear_assemblies():
    print(' ', assembly2str(assembly))
print('Circular =============')
for assembly in asm.get_circular_assemblies():
    print(' ', assembly2str(assembly))

# Prints
# Linear ===============
#   ('1[8:14]:2[1:7]', '2[10:17]:3[1:8]')
#   ('2[10:17]:3[1:8]', '3[12:17]:1[1:6]')
#   ('3[12:17]:1[1:6]', '1[8:14]:2[1:7]')
#   ('1[1:6]:3[12:17]',)
#   ('2[1:7]:1[8:14]',)
#   ('3[1:8]:2[10:17]',)
# Circular =============
#   ('1[8:14]:2[1:7]', '2[10:17]:3[1:8]', '3[12:17]:1[1:6]')

classmethod assembly_is_valid(fragments: list[Dseqrecord | Primer], assembly: list[Tuple[int, int, Location | None, Location | None]], is_circular: bool, use_all_fragments: bool, is_insertion: bool = False) bool[source]#

Returns True if the assembly is valid, False otherwise. See function comments for conditions tested.

add_edges_from_match(match: Tuple[int, int, int], u: int, v: int, first: Dseqrecord, secnd: Dseqrecord)[source]#

Add edges to the graph from a match returned by the algorithm function (see pydna.common_substrings). For the format of edges, see the documentation of the Assembly class.

Matches are directional, because not all algorithm functions return the same match for (u,v) and (v,u). For example, homologous recombination does but sticky end ligation does not. The function returns two edges:

  • Fragments in the orientation they were passed, with locations of the match (u, v, loc_u, loc_v)

  • Reverse complement of the fragments with inverted order, with flipped locations (-v, -u, flip(loc_v), flip(loc_u))

format_assembly_edge(graph_edge: tuple[int, int, str]) Tuple[int, int, Location | None, Location | None][source]#

Go from the (u, v, key) to the (u, v, locu, locv) format.

get_linear_assemblies(only_adjacent_edges: bool = False, max_assemblies: int = 50) list[list[Tuple[int, int, Location | None, Location | None]]][source]#

Get linear assemblies, applying the constraints described in __init__, ensuring that paths represent real assemblies (see assembly_is_valid). Subassemblies are removed (see remove_subassemblies).

node_path2assembly_list(cycle: list[int], circular: bool) list[list[Tuple[int, int, Location | None, Location | None]]][source]#

Convert a node path in the format [1, 2, 3] (as returned by networkx.cycles.simple_cycles) to a list of all possible assemblies.

There may be multiple assemblies for a given node path, if there are several edges connecting two nodes, for example two overlaps between 1 and 2, and single overlap between 2 and 3 should return 3 assemblies.

get_unique_linear_paths(G_with_begin_end: MultiDiGraph, max_paths=10000) list[list[int]][source]#

Get unique linear paths from the graph, removing those that contain the same node twice.

get_possible_assembly_number(paths: list[list[int]]) int[source]#

Get the number of possible assemblies from a list of node paths. Basically, for each path passed as a list of integers / nodes, we calculate the number of paths possible connecting the nodes in that order, given the graph (all the edges connecting them).

get_circular_assemblies(only_adjacent_edges: bool = False, max_assemblies: int = 50) list[list[Tuple[int, int, Location | None, Location | None]]][source]#

Get circular assemblies, applying the constraints described in __init__, ensuring that paths represent real assemblies (see assembly_is_valid).

format_insertion_assembly(assembly: list[Tuple[int, int, Location | None, Location | None]]) list[Tuple[int, int, Location | None, Location | None]] | None[source]#

Sorts the fragments representing a cycle so that they represent an insertion assembly if possible, else returns None.

Here we check if one of the joins between fragments represents the edges of an insertion assembly. The fragment must be linear, and the join must be as indicated below:

--------         -------           Fragment 1
    ||            ||
    xxxxxxxx      ||               Fragment 2
          ||      ||
          oooooooooo               Fragment 3

The above example will be [(1, 2, [4:6], [0:2]), (2, 3, [6:8], [0:2]), (3, 1, [8:10], [9:11])]

These could be returned in any order by simple_cycles, so we sort the edges so that the first and last u and v match the fragment that gets the insertion (1 in the example above).
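
The reordering step can be pictured as rotating a cyclic list of edges. The sketch below is a simplified illustration under stated assumptions: locations are plain strings, the helper name rotate_cycle_to_start_at is hypothetical, and the additional checks that the real method performs (linearity of the receiving fragment, relative position of the joins) are left out.

```python
def rotate_cycle_to_start_at(assembly, fragment):
    """Rotate a cyclic list of (u, v, loc_u, loc_v) edges so the first edge leaves `fragment`."""
    for i, edge in enumerate(assembly):
        if edge[0] == fragment:
            # Everything from this edge onwards, followed by the edges that preceded it.
            return assembly[i:] + assembly[:i]
    return None  # `fragment` does not take part in this cycle


# The example cycle above, in an arbitrary starting order.
cycle = [
    (2, 3, "[6:8]", "[0:2]"),
    (3, 1, "[8:10]", "[9:11]"),
    (1, 2, "[4:6]", "[0:2]"),
]
rotated = rotate_cycle_to_start_at(cycle, 1)
print(rotated[0][0], rotated[-1][1])  # 1 1
```

After rotation, the first u and the last v are both fragment 1, the fragment that receives the insertion.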

format_insertion_assembly_edge_case(assembly: list[Tuple[int, int, Location | None, Location | None]]) list[Tuple[int, int, Location | None, Location | None]][source]#

Edge case from manulera/OpenCloning_backend#329

get_insertion_assemblies(only_adjacent_edges: bool = False, max_assemblies: int = 50) list[list[Tuple[int, int, Location | None, Location | None]]][source]#

Assemblies that represent the insertion of a fragment or series of fragments inside a linear construct. For instance, digesting CCCCGAATTCCCCGAATTC with EcoRI and inserting the fragment with two overhangs into the EcoRI site of AAAGAATTCAAA. This is not so much meant for the use-case of linear fragments that represent actual linear fragments, but for linear fragments that represent a genome region. This can then be used to simulate homologous recombination.

assemble_linear(only_adjacent_edges: bool = False, max_assemblies: int = 50) list[Dseqrecord][source]#

Assemble linear constructs, from assemblies returned by self.get_linear_assemblies.

assemble_circular(only_adjacent_edges: bool = False, max_assemblies: int = 50) list[Dseqrecord][source]#

Assemble circular constructs, from assemblies returned by self.get_circular_assemblies.

assemble_insertion(only_adjacent_edges: bool = False) list[Dseqrecord][source]#

Assemble insertion constructs, from assemblies returned by self.get_insertion_assemblies.

get_locations_on_fragments() dict[int, dict[str, list[Location]]][source]#

Get a dictionary where the keys are the nodes in the graph, and the values are dictionaries with keys left, right, containing (for each fragment) the locations where the fragment is joined to another fragment on its left and right side. The values in left and right are often the same, except in restriction-ligation with partial overlap enabled, where we can end up with a situation like this:

GGTCTCCCCAATT and aGGTCTCCAACCAA as fragments

# Partial overlap in assembly 1[9:11]:2[8:10]
GGTCTCCxxAACCAA
CCAGAGGGGTTxxTT

# Partial overlap in 2[10:12]:1[7:9]
aGGTCTCCxxCCAATT
tCCAGAGGTTGGxxAA

Would return:

{
    1: {'left': [7:9], 'right': [9:11]},
    2: {'left': [8:10], 'right': [10:12]},
    -1: {'left': [2:4], 'right': [4:6]},
    -2: {'left': [2:4], 'right': [4:6]}
}
assembly_uses_only_adjacent_edges(assembly, is_circular: bool) bool[source]#

Check whether only adjacent edges within each fragment are used in the assembly. This is useful to check if a cut and ligate assembly is valid, and prevent including partially digested fragments. For example, imagine the following fragment being an input for a digestion and ligation assembly, where the enzyme cuts at the sites indicated by the vertical lines:

       x       y       z
-------|-------|-------|---------

We would only want assemblies that contain subfragments start-x, x-y, y-z, z-end, and not start-x, y-end, for instance. The latter would indicate that the fragment was partially digested.
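
A toy version of this check, using the boundary labels from the diagram, might look as follows. It is only a sketch of the idea: subfragments are given as pairs of hypothetical boundary names rather than as the location objects used by the real method.

```python
# Boundaries of the fragment in the diagram above, from left to right.
BOUNDARIES = ["start", "x", "y", "z", "end"]


def is_fully_digested(subfragment, boundaries=BOUNDARIES):
    """True if the subfragment spans two neighbouring boundaries (no uncut site inside)."""
    left, right = subfragment
    return boundaries.index(right) - boundaries.index(left) == 1


def uses_only_adjacent_edges(subfragments, boundaries=BOUNDARIES):
    """True if every subfragment used in the assembly is fully digested."""
    return all(is_fully_digested(s, boundaries) for s in subfragments)


complete = [("start", "x"), ("x", "y"), ("y", "z"), ("z", "end")]
partial = [("start", "x"), ("y", "end")]  # y-end still contains the uncut site z

print(uses_only_adjacent_edges(complete))  # True
print(uses_only_adjacent_edges(partial))   # False
```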

class pydna.assembly2.PCRAssembly(frags: list[Dseqrecord | Primer], limit=25, mismatches=0)[source]#

Bases: Assembly

An assembly that represents a PCR, where fragments is a list of primer, template, primer (in that order). It always uses the primer_template_overlap algorithm and accepts the mismatches argument to indicate the number of mismatches allowed in the overlap. Only supports substitution mismatches, not indels.
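
To make the "substitutions only" restriction concrete, the sketch below counts mismatches position by position, which by construction cannot represent insertions or deletions. It is not the primer_template_overlap algorithm: the function names are hypothetical, the whole primer is compared rather than an annealing footprint, and the case-insensitive comparison is an assumption of this example.

```python
def count_substitutions(primer: str, site: str) -> int:
    """Number of positions at which two equally long sequences differ."""
    if len(primer) != len(site):
        raise ValueError("sequences must have the same length (indels are not modelled)")
    return sum(a != b for a, b in zip(primer.upper(), site.upper()))


def find_annealing_sites(primer: str, template: str, mismatches: int = 0):
    """Start positions where the primer matches with at most `mismatches` substitutions."""
    n = len(primer)
    return [
        i
        for i in range(len(template) - n + 1)
        if count_substitutions(primer, template[i : i + n]) <= mismatches
    ]


template = "aaaACGTACGTaaa"
primer = "ACGTTCGT"  # differs from the template site by a single substitution

print(find_annealing_sites(primer, template, mismatches=0))  # []
print(find_annealing_sites(primer, template, mismatches=1))  # [3]
```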

get_linear_assemblies(only_adjacent_edges: bool = False, max_assemblies: int = 50) list[list[Tuple[int, int, Location | None, Location | None]]][source]#

Get linear assemblies, applying the constraints described in __init__, ensuring that paths represent real assemblies (see assembly_is_valid). Subassemblies are removed (see remove_subassemblies).

get_circular_assemblies(only_adjacent_edges: bool = False)[source]#

Get circular assemblies, applying the constraints described in __init__, ensuring that paths represent real assemblies (see assembly_is_valid).

get_insertion_assemblies(only_adjacent_edges: bool = False)[source]#

Assemblies that represent the insertion of a fragment or series of fragments inside a linear construct. For instance, digesting CCCCGAATTCCCCGAATTC with EcoRI and inserting the fragment with two overhangs into the EcoRI site of AAAGAATTCAAA. This is not so much meant for the use-case of linear fragments that represent actual linear fragments, but for linear fragments that represent a genome region. This can then be used to simulate homologous recombination.

assemble_linear(only_adjacent_edges: bool = False, max_assemblies: int = 50) list[Dseqrecord][source]#

Overrides the parent method to ensure that the 5’ of the crick strand of the product matches the sequence of the reverse primer. This is important when using primers with dUTP (for USER cloning).

class pydna.assembly2.SingleFragmentAssembly(frags: [<class 'pydna.dseqrecord.Dseqrecord'>], limit=25, algorithm=common_sub_strings)[source]#

Bases: Assembly

An assembly that represents the circularisation or splicing of a single fragment.

get_circular_assemblies(only_adjacent_edges: bool = False, max_assemblies: int = 50) list[list[Tuple[int, int, Location | None, Location | None]]][source]#

Get circular assemblies, applying the constraints described in __init__, ensuring that paths represent real assemblies (see assembly_is_valid).

get_insertion_assemblies(only_adjacent_edges: bool = False, max_assemblies: int = 50) list[list[Tuple[int, int, Location | None, Location | None]]][source]#

This could be renamed splicing assembly, but the essence is similar.

get_linear_assemblies()[source]#

Get linear assemblies, applying the constraints described in __init__, ensuring that paths represent real assemblies (see assembly_is_valid). Subassemblies are removed (see remove_subassemblies).

pydna.assembly2.common_function_assembly_products(frags: list[Dseqrecord], limit: int | None, algorithm: Callable, circular_only: bool, filter_results_function: Callable | None = None, only_adjacent_edges: bool = False) list[Dseqrecord][source]#

Common function to avoid code duplication. Could be simplified further once SingleFragmentAssembly and Assembly are merged.

Parameters:
  • frags (list[Dseqrecord]) – List of DNA fragments to assemble

  • limit (int or None) – Minimum overlap length required, or None if not applicable

  • algorithm (Callable) – Function that determines valid overlaps between fragments

  • circular_only (bool) – If True, only return circular assemblies

  • filter_results_function (Callable or None) – Function that filters the results

  • only_adjacent_edges (bool) – If True, only return assemblies that use only adjacent edges

Returns:

List of assembled DNA molecules

Return type:

list[Dseqrecord]

pydna.assembly2.gibson_assembly(frags: list[Dseqrecord], limit: int = 25, circular_only: bool = False) list[Dseqrecord][source]#

Returns the products for Gibson assembly.

Parameters:
  • frags (list[Dseqrecord]) – List of DNA fragments to assemble

  • limit (int, optional) – Minimum overlap length required, by default 25

  • circular_only (bool, optional) – If True, only return circular assemblies, by default False

Returns:

List of assembled DNA molecules

Return type:

list[Dseqrecord]
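
The role of the limit parameter can be illustrated with plain strings. The sketch below joins two single-stranded sequences when the end of one is identical to the start of the other over at least limit characters. It is a deliberately simplified model, not the gibson_assembly implementation: it ignores the second strand, reverse complements and circular products, and the function name is hypothetical.

```python
def fuse_by_terminal_overlap(left: str, right: str, limit: int = 25):
    """Join `left` and `right` if `left` ends with at least `limit` characters that start `right`."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    longest = min(len(left), len(right))
    # Try the longest possible overlap first, then progressively shorter ones.
    for length in range(longest, limit - 1, -1):
        if left.endswith(right[:length]):
            return left + right[length:]
    return None  # no terminal overlap of the required length


frag_a = "acgatgctatactgCCCCCtgtgctgtgctcta"
frag_b = "tgtgctgtgctctaTTTTTtattctggctgtatc"  # shares 14 characters with the end of frag_a

print(fuse_by_terminal_overlap(frag_a, frag_b, limit=14))
print(fuse_by_terminal_overlap(frag_a, frag_b, limit=25))  # None: the overlap is too short
```

With the default limit of 25 the 14-character overlap is rejected, which mirrors how a stricter minimum overlap reduces the number of products returned.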

pydna.assembly2.in_fusion_assembly(frags: list[Dseqrecord], limit: int = 25, circular_only: bool = False) list[Dseqrecord][source]#

Returns the products for in-fusion assembly. This is the same as Gibson assembly, but with a different name.

Parameters:
  • frags (list[Dseqrecord]) – List of DNA fragments to assemble

  • limit (int, optional) – Minimum overlap length required, by default 25

  • circular_only (bool, optional) – If True, only return circular assemblies, by default False

Returns:

List of assembled DNA molecules

Return type:

list[Dseqrecord]

pydna.assembly2.fusion_pcr_assembly(frags: list[Dseqrecord], limit: int = 25, circular_only: bool = False) list[Dseqrecord][source]#

Returns the products for fusion PCR assembly. This is the same as Gibson assembly, but with a different name.

Parameters:
  • frags (list[Dseqrecord]) – List of DNA fragments to assemble

  • limit (int, optional) – Minimum overlap length required, by default 25

  • circular_only (bool, optional) – If True, only return circular assemblies, by default False

Returns:

List of assembled DNA molecules

Return type:

list[Dseqrecord]

pydna.assembly2.in_vivo_assembly(frags: list[Dseqrecord], limit: int = 25, circular_only: bool = False) list[Dseqrecord][source]#

Returns the products for in vivo assembly (IVA), which relies on homologous recombination between the fragments.

Parameters:
  • frags (list[Dseqrecord]) – List of DNA fragments to assemble

  • limit (int, optional) – Minimum overlap length required, by default 25

  • circular_only (bool, optional) – If True, only return circular assemblies, by default False

Returns:

List of assembled DNA molecules

Return type:

list[Dseqrecord]

pydna.assembly2.restriction_ligation_assembly(frags: list[Dseqrecord], enzymes: list[AbstractCut], allow_blunt: bool = True, circular_only: bool = False) list[Dseqrecord][source]#

Returns the products for restriction ligation assembly:

  • Finds cutsites in the fragments

  • Finds all products that could be assembled by ligating the fragments based on those cutsites

  • Will NOT return products that combine an existing end with an end generated by the same enzyme (see example below)

Parameters:
  • frags (list[Dseqrecord]) – List of DNA fragments to assemble

  • enzymes (list[AbstractCut]) – List of restriction enzymes to use

  • allow_blunt (bool, optional) – If True, allow blunt end ligations, by default True

  • circular_only (bool, optional) – If True, only return circular assemblies, by default False

Returns:

List of assembled DNA molecules

Return type:

list[Dseqrecord]

Examples

In the example below, we plan to assemble a plasmid from a backbone and an insert, using the EcoRI and SalI enzymes. Note how 2 circular products are returned: one contains the insert (aggt) and the desired part of the backbone (cccccc), the other contains the reversed insert (tgga) and the cut-out part of the backbone (aaa).

>>> from pydna.assembly2 import restriction_ligation_assembly
>>> from pydna.dseqrecord import Dseqrecord
>>> from Bio.Restriction import EcoRI, SalI
>>> backbone = Dseqrecord("cccGAATTCaaaGTCGACccc", circular=True)
>>> insert = Dseqrecord("ggGAATTCaggtGTCGACgg")
>>> products = restriction_ligation_assembly([backbone, insert], [EcoRI, SalI], circular_only=True)
>>> products[0].seq
Dseq(o22)
TCGACccccccGAATTCaggtG
AGCTGggggggCTTAAGtccaC
>>> products[1].seq
Dseq(o19)
AATTCaaaGTCGACacctG
TTAAGtttCAGCTGtggaC

Note that passing a pre-cut fragment will not work.

>>> restriction_products = insert.cut([EcoRI, SalI])
>>> cut_insert = restriction_products[1]
>>> restriction_ligation_assembly([backbone, cut_insert], [EcoRI, SalI], circular_only=True)
[]

It also works with a single fragment, for circularization:

>>> seq = Dseqrecord("GAATTCaaaGAATTC")
>>> products =restriction_ligation_assembly([seq], [EcoRI])
>>> products[0].seq
Dseq(o9)
AATTCaaaG
TTAAGtttC
pydna.assembly2.golden_gate_assembly(frags: list[Dseqrecord], enzymes: list[AbstractCut], allow_blunt: bool = True, circular_only: bool = False) list[Dseqrecord][source]#

Returns the products for Golden Gate assembly. This is the same as restriction ligation assembly, but with a different name. Check the documentation for restriction_ligation_assembly for more details.

Parameters:
  • frags (list[Dseqrecord]) – List of DNA fragments to assemble

  • enzymes (list[AbstractCut]) – List of restriction enzymes to use

  • allow_blunt (bool, optional) – If True, allow blunt end ligations, by default True

  • circular_only (bool, optional) – If True, only return circular assemblies, by default False

Returns:

List of assembled DNA molecules

Return type:

list[Dseqrecord]

Examples

See the example for restriction_ligation_assembly.

pydna.assembly2.ligation_assembly(frags: list[Dseqrecord], allow_blunt: bool = False, allow_partial_overlap: bool = False, circular_only: bool = False) list[Dseqrecord][source]#

Returns the products for ligation assembly. As inputs, pass the fragments (digested if needed) that will be ligated.

For most cases, you probably should use restriction_ligation_assembly instead.

Parameters:
  • frags (list[Dseqrecord]) – List of DNA fragments to assemble

  • allow_blunt (bool, optional) – If True, allow blunt end ligations, by default False

  • allow_partial_overlap (bool, optional) – If True, allow partial overlaps between sticky ends, by default False

  • circular_only (bool, optional) – If True, only return circular assemblies, by default False

Returns:

List of assembled DNA molecules

Return type:

list[Dseqrecord]

Examples

In the example below, we plan to assemble a plasmid from a backbone and an insert, using the EcoRI enzyme. The insert and insertion site in the backbone are flanked by EcoRI sites, so there are two possible products depending on the orientation of the insert.

>>> from pydna.assembly2 import ligation_assembly
>>> from pydna.dseqrecord import Dseqrecord
>>> from Bio.Restriction import EcoRI
>>> backbone = Dseqrecord("cccGAATTCaaaGAATTCccc", circular=True)
>>> backbone_cut = backbone.cut(EcoRI)[1]
>>> insert = Dseqrecord("ggGAATTCaggtGAATTCgg")
>>> insert_cut = insert.cut(EcoRI)[1]
>>> products = ligation_assembly([backbone_cut, insert_cut])
>>> products[0].seq
Dseq(o22)
AATTCccccccGAATTCaggtG
TTAAGggggggCTTAAGtccaC
>>> products[1].seq
Dseq(o22)
AATTCccccccGAATTCacctG
TTAAGggggggCTTAAGtggaC
pydna.assembly2.assembly_is_multi_site(asm: list[list[Tuple[int, int, Location | None, Location | None]]]) bool[source]#

Returns True if the assembly is a multi-site assembly, False otherwise.

pydna.assembly2.gateway_assembly(frags: list[Dseqrecord], reaction_type: Literal['BP', 'LR'], greedy: bool = False, circular_only: bool = False, multi_site_only: bool = False) list[Dseqrecord][source]#

Returns the products for Gateway assembly / Gateway cloning.

Parameters:
  • frags (list[Dseqrecord]) – List of DNA fragments to assemble

  • reaction_type (Literal['BP', 'LR']) – Type of Gateway reaction

  • greedy (bool, optional) – If True, use greedy gateway consensus sites, by default False

  • circular_only (bool, optional) – If True, only return circular assemblies, by default False

  • multi_site_only (bool, optional) – If True, only return products where 2 sites recombined. Even if input sequences contain multiple att sites (typically 2), a product could be generated where only one site recombines. That’s typically not what you want, so you can set this to True to only return products where both att sites recombined.

Returns:

List of assembled DNA molecules

Return type:

list[Dseqrecord]

Examples

Below is an example with dummy Gateway sequences, composed of minimal sequences and the consensus att sites.

>>> from pydna.assembly2 import gateway_assembly
>>> from pydna.dseqrecord import Dseqrecord
>>> attB1 = "ACAACTTTGTACAAAAAAGCAGAAG"
>>> attP1 = "AAAATAATGATTTTATTTGACTGATAGTGACCTGTTCGTTGCAACAAATTGATGAGCAATGCTTTTTTATAATGCCAACTTTGTACAAAAAAGCTGAACGAGAAGCGTAAAATGATATAAATATCAATATATTAAATTAGATTTTGCATAAAAAACAGACTACATAATACTGTAAAACACAACATATCCAGTCACTATGAATCAACTACTTAGATGGTATTAGTGACCTGTA"
>>> attR1 = "ACAACTTTGTACAAAAAAGCTGAACGAGAAACGTAAAATGATATAAATATCAATATATTAAATTAGATTTTGCATAAAAAACAGACTACATAATACTGTAAAACACAACATATGCAGTCACTATG"
>>> attL1 = "CAAATAATGATTTTATTTTGACTGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAAAAAGCAGGCT"
>>> seq1 = Dseqrecord("aaa" + attB1 + "ccc")
>>> seq2 = Dseqrecord("aaa" + attP1 + "ccc")
>>> seq3 = Dseqrecord("aaa" + attR1 + "ccc")
>>> seq4 = Dseqrecord("aaa" + attL1 + "ccc")
>>> products_BP = gateway_assembly([seq1, seq2], "BP")
>>> products_LR = gateway_assembly([seq3, seq4], "LR")
>>> len(products_BP)
2
>>> len(products_LR)
2

Now let’s understand the multi_site_only parameter. Let’s consider a case where we are swapping fragments between two plasmids using an LR reaction. Experimentally, we expect to obtain two plasmids, resulting from the swapping between the two att sites. That’s what we get if we set multi_site_only to True.

>>> attL2 = 'aaataatgattttattttgactgatagtgacctgttcgttgcaacaaattgataagcaatgctttcttataatgccaactttgtacaagaaagctg'
>>> attR2 = 'accactttgtacaagaaagctgaacgagaaacgtaaaatgatataaatatcaatatattaaattagattttgcataaaaaacagactacataatactgtaaaacacaacatatccagtcactatg'
>>> insert = Dseqrecord("cccccc" + attL1 + "ccc" + attL2 + "cccccc", circular=True)
>>> backbone = Dseqrecord("ttttt" + attR1 + "aaa" + attR2, circular=True)
>>> products = gateway_assembly([insert, backbone], "LR", multi_site_only=True)
>>> len(products)
2

However, if we set multi_site_only to False, we get 4 products, which also include the intermediate products where the two plasmids are combined into a single one through recombination of a single att site. This is an intermediate of the reaction, and typically we don’t want it:

>>> products = gateway_assembly([insert, backbone], "LR", multi_site_only=False)
>>> print([len(p) for p in products])
[469, 237, 232, 469]
pydna.assembly2.common_function_integration_products(frags: list[Dseqrecord], limit: int | None, algorithm: Callable) list[Dseqrecord][source]#

Common function to avoid code duplication for integration products.

Parameters:
  • frags (list[Dseqrecord]) – List of DNA fragments to integrate

  • limit (int or None) – Minimum overlap length required, or None if not applicable

  • algorithm (Callable) – Function that determines valid overlaps between fragments

Returns:

List of integrated DNA molecules

Return type:

list[Dseqrecord]

pydna.assembly2.common_handle_insertion_fragments(genome: Dseqrecord, inserts: list[Dseqrecord]) list[Dseqrecord][source]#

Common function to handle / validate insertion fragments.

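Parameters:
  • genome (Dseqrecord) – Target genome sequence

  • inserts (list[Dseqrecord]) – DNA fragment(s) to insert

Returns: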

List containing genome and insert fragments

Return type:

list[Dseqrecord]

pydna.assembly2.common_function_excision_products(genome: Dseqrecord, limit: int | None, algorithm: Callable) list[Dseqrecord][source]#

Common function to avoid code duplication for excision products.

Parameters:
  • genome (Dseqrecord) – Target genome sequence

  • limit (int or None) – Minimum overlap length required, or None if not applicable

  • algorithm (Callable) – Function that determines valid overlaps between fragments

Returns:

List of excised DNA molecules

Return type:

list[Dseqrecord]

pydna.assembly2.homologous_recombination_integration(genome: Dseqrecord, inserts: list[Dseqrecord], limit: int = 40) list[Dseqrecord][source]#

Returns the products resulting from the integration of an insert (or inserts joined through in vivo recombination) into the genome through homologous recombination.

Parameters:
  • genome (Dseqrecord) – Target genome sequence

  • inserts (list[Dseqrecord]) – DNA fragment(s) to insert

  • limit (int, optional) – Minimum homology length required, by default 40

Returns:

List of integrated DNA molecules

Return type:

list[Dseqrecord]

Examples

Below is an example with a single insert.

>>> from pydna.assembly2 import homologous_recombination_integration
>>> from pydna.dseqrecord import Dseqrecord
>>> homology = "AAGTCCGTTCGTTTTACCTG"
>>> genome = Dseqrecord(f"aaaaaa{homology}ccccc{homology}aaaaaa")
>>> insert = Dseqrecord(f"{homology}gggg{homology}")
>>> products = homologous_recombination_integration(genome, [insert], 20)
>>> str(products[0].seq)
'aaaaaaAAGTCCGTTCGTTTTACCTGggggAAGTCCGTTCGTTTTACCTGaaaaaa'

Below is an example with two inserts joined through homology.

>>> homology2 = "ATTACAGCATGGGAAGAAAGA"
>>> insert_1 = Dseqrecord(f"{homology}gggg{homology2}")
>>> insert_2 = Dseqrecord(f"{homology2}cccc{homology}")
>>> products = homologous_recombination_integration(genome, [insert_1, insert_2], 20)
>>> str(products[0].seq)
'aaaaaaAAGTCCGTTCGTTTTACCTGggggATTACAGCATGGGAAGAAAGAccccAAGTCCGTTCGTTTTACCTGaaaaaa'
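
For intuition, the single-insert case can be reproduced with plain string operations. This is a sketch under simplifying assumptions, not what pydna does internally: the homology arms are taken to be the first and last arm_length characters of the insert, only the top strand is considered, and integrate_by_homology is a hypothetical name.

```python
def integrate_by_homology(genome: str, insert: str, arm_length: int):
    """Replace the genome region flanked by the insert's homology arms with the insert."""
    left_arm, right_arm = insert[:arm_length], insert[-arm_length:]
    start = genome.find(left_arm)
    if start == -1:
        return None
    # Look for the right arm downstream of the left arm.
    end = genome.find(right_arm, start + arm_length)
    if end == -1:
        return None
    return genome[:start] + insert + genome[end + arm_length:]


homology = "AAGTCCGTTCGTTTTACCTG"
genome = f"aaaaaa{homology}ccccc{homology}aaaaaa"
insert = f"{homology}gggg{homology}"

print(integrate_by_homology(genome, insert, 20))
# aaaaaaAAGTCCGTTCGTTTTACCTGggggAAGTCCGTTCGTTTTACCTGaaaaaa
```

The printed string matches the product of the single-insert doctest above: the ccccc segment between the two homology arms has been replaced by gggg.
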
pydna.assembly2.homologous_recombination_excision(genome: Dseqrecord, limit: int = 40) list[Dseqrecord][source]#

Returns the products resulting from the excision of a fragment from the genome through homologous recombination.

Parameters:
  • genome (Dseqrecord) – Target genome sequence

  • limit (int, optional) – Minimum homology length required, by default 40

Returns:

List containing excised plasmid and remaining genome sequence

Return type:

list[Dseqrecord]

Examples

Example of a homologous recombination event, where a plasmid is excised from the genome (circular sequence of 25 bp), and that part is removed from the genome, leaving a shorter linear sequence (32 bp).

>>> from pydna.assembly2 import homologous_recombination_excision
>>> from pydna.dseqrecord import Dseqrecord
>>> homology = "AAGTCCGTTCGTTTTACCTG"
>>> genome = Dseqrecord(f"aaaaaa{homology}ccccc{homology}aaaaaa")
>>> products = homologous_recombination_excision(genome, 20)
>>> products
[Dseqrecord(o25), Dseqrecord(-32)]
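
The sizes of the two products can be checked with a small string-based sketch. It assumes the repeated sequence is already known and passed in explicitly, whereas the real function finds the homology itself; excise_between_repeats is a hypothetical helper written for this illustration.

```python
def excise_between_repeats(genome: str, repeat: str):
    """Split a linear sequence at two copies of `repeat` into (excised circle, remaining linear)."""
    first = genome.find(repeat)
    if first == -1:
        return None
    second = genome.find(repeat, first + len(repeat))
    if second == -1:
        return None
    # The excised circle keeps one copy of the repeat plus the sequence between the copies.
    circle = genome[first:second]
    # The remaining linear molecule keeps the other copy of the repeat.
    remaining = genome[:first] + genome[second:]
    return circle, remaining


homology = "AAGTCCGTTCGTTTTACCTG"
genome = f"aaaaaa{homology}ccccc{homology}aaaaaa"
circle, remaining = excise_between_repeats(genome, homology)

print(len(circle), len(remaining))  # 25 32
```
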
pydna.assembly2.cre_lox_integration(genome: Dseqrecord, inserts: list[Dseqrecord]) list[Dseqrecord][source]#

Returns the products resulting from the integration of an insert (or inserts joined through cre-lox recombination among them) into the genome through cre-lox integration.

Also works with lox66 and lox71 (see pydna.cre_lox for more details).

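Parameters:
  • genome (Dseqrecord) – Target genome sequence

  • inserts (list[Dseqrecord]) – DNA fragment(s) to insert

Returns: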

List of integrated DNA molecules

Return type:

list[Dseqrecord]

Examples

Below is an example of reversible integration and excision.

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.assembly2 import cre_lox_integration, cre_lox_excision
>>> from pydna.cre_lox import LOXP_SEQUENCE
>>> a = Dseqrecord(f"cccccc{LOXP_SEQUENCE}aaaaa")
>>> b = Dseqrecord(f"{LOXP_SEQUENCE}bbbbb", circular=True)
>>> [a, b]
[Dseqrecord(-45), Dseqrecord(o39)]
>>> res = cre_lox_integration(a, [b])
>>> res
[Dseqrecord(-84)]
>>> res2 = cre_lox_excision(res[0])
>>> res2
[Dseqrecord(o39), Dseqrecord(-45)]

Below is an example with lox66 and lox71 (irreversible integration). Here, the result of excision is still returned because there is a low probability of it happening, but it’s considered a rare event.

>>> lox66 = 'ATAACTTCGTATAGCATACATTATACGAACGGTA'
>>> lox71 = 'TACCGTTCGTATAGCATACATTATACGAAGTTAT'
>>> a = Dseqrecord(f"cccccc{lox66}aaaaa")
>>> b = Dseqrecord(f"{lox71}bbbbb", circular=True)
>>> res = cre_lox_integration(a, [b])
>>> res
[Dseqrecord(-84)]
>>> res2 = cre_lox_excision(res[0])
>>> res2
[Dseqrecord(o39), Dseqrecord(-45)]
pydna.assembly2.cre_lox_excision(genome: Dseqrecord) list[Dseqrecord][source]#

Returns the products for CRE-lox excision.

Parameters:

genome (Dseqrecord) – Target genome sequence

Returns:

List containing excised plasmid and remaining genome sequence

Return type:

list[Dseqrecord]

Examples

Below is an example of reversible integration and excision.

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.assembly2 import cre_lox_integration, cre_lox_excision
>>> from pydna.cre_lox import LOXP_SEQUENCE
>>> a = Dseqrecord(f"cccccc{LOXP_SEQUENCE}aaaaa")
>>> b = Dseqrecord(f"{LOXP_SEQUENCE}bbbbb", circular=True)
>>> [a, b]
[Dseqrecord(-45), Dseqrecord(o39)]
>>> res = cre_lox_integration(a, [b])
>>> res
[Dseqrecord(-84)]
>>> res2 = cre_lox_excision(res[0])
>>> res2
[Dseqrecord(o39), Dseqrecord(-45)]

Below is an example with lox66 and lox71 (irreversible integration). Here, the result of excision is still returned because there is a low probability of it happening, but it’s considered a rare event.

>>> lox66 = 'ATAACTTCGTATAGCATACATTATACGAACGGTA'
>>> lox71 = 'TACCGTTCGTATAGCATACATTATACGAAGTTAT'
>>> a = Dseqrecord(f"cccccc{lox66}aaaaa")
>>> b = Dseqrecord(f"{lox71}bbbbb", circular=True)
>>> res = cre_lox_integration(a, [b])
>>> res
[Dseqrecord(-84)]
>>> res2 = cre_lox_excision(res[0])
>>> res2
[Dseqrecord(o39), Dseqrecord(-45)]
pydna.assembly2.crispr_integration(genome: Dseqrecord, inserts: list[Dseqrecord], guides: list[Primer], limit: int = 40) list[Dseqrecord][source]#

Returns the products for CRISPR integration.

Parameters:
  • genome (Dseqrecord) – Target genome sequence

  • inserts (list[Dseqrecord]) – DNA fragment(s) to insert

  • guides (list[Primer]) – List of guide RNAs as Primer objects. This may change in the future.

  • limit (int, optional) – Minimum overlap length required, by default 40

Returns:

List of integrated DNA molecules

Return type:

list[Dseqrecord]

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.assembly2 import crispr_integration
>>> from pydna.primer import Primer
>>> genome = Dseqrecord("aaccggttcaatgcaaacagtaatgatggatgacattcaaagcac", name="genome")
>>> insert = Dseqrecord("aaccggttAAAAAAAAAttcaaagcac", name="insert")
>>> guide = Primer("ttcaatgcaaacagtaatga", name="guide")
>>> product, *_ = crispr_integration(genome, [insert], [guide], 8)
>>> product
Dseqrecord(-27)
pydna.assembly2.pcr_assembly(template: Dseqrecord, fwd_primer: Primer, rvs_primer: Primer, add_primer_features: bool = False, limit: int = 14, mismatches: int = 0) list[Dseqrecord][source]#

Returns the products for PCR assembly.

Parameters:
  • template (Dseqrecord) – Template sequence

  • fwd_primer (Primer) – Forward primer

  • rvs_primer (Primer) – Reverse primer

  • add_primer_features (bool, optional) – If True, add primer features to the product, by default False

  • limit (int, optional) – Minimum overlap length required, by default 14

  • mismatches (int, optional) – Maximum number of mismatches, by default 0

Returns:

List of assembled DNA molecules

Return type:

list[Dseqrecord]

pydna.codon module#

docstring.

pydna.common_sub_strings module#

This module is based on the Py-rstr-max package written by Romain Brixtel (rbrixtel_at_gmail_dot_com) (https://brixtel.users.greyc.fr), available from https://code.google.com/p/py-rstr-max and gip0/py-rstr-max. The original code was covered by an MIT licence.

pydna.common_sub_strings.common_sub_strings(stringx: str, stringy: str, limit: int = 25) List[Tuple[int, int, int]][source]#

Finds all common substrings between stringx and stringy, and returns them sorted by length.

This function is case sensitive.

Parameters:
  • stringx (str)

  • stringy (str)

  • limit (int, optional)

Returns:

[(startx1, starty1, length1),(startx2, starty2, length2), …]

startx1 = start position in x where substring 1 starts
starty1 = start position in y where substring 1 starts
length1 = length of substring 1

Return type:

list of tuple
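
The return format can be reproduced with a brute-force search. The sketch below is not the suffix-array based algorithm used by this module. It is a naive O(len(x) * len(y)) illustration that reports every maximal exact match of at least limit characters, and the function name is hypothetical.

```python
def naive_common_sub_strings(stringx: str, stringy: str, limit: int = 25):
    """Return (startx, starty, length) for maximal exact matches of at least `limit` characters."""
    matches = []
    for i in range(len(stringx)):
        for j in range(len(stringy)):
            if stringx[i] != stringy[j]:
                continue
            # Skip positions that only continue a match starting further to the left.
            if i > 0 and j > 0 and stringx[i - 1] == stringy[j - 1]:
                continue
            # Extend the match to the right as far as possible.
            length = 0
            while (
                i + length < len(stringx)
                and j + length < len(stringy)
                and stringx[i + length] == stringy[j + length]
            ):
                length += 1
            if length >= limit:
                matches.append((i, j, length))
    # Longest matches first.
    return sorted(matches, key=lambda m: m[2], reverse=True)


print(naive_common_sub_strings("agctatgtatcttgcatcgta", "gcatcgtagtctatttgcttac", limit=8))
# [(13, 0, 8)]
```

The comparison is case sensitive, so "ACGT" and "acgt" share no substrings in this sketch.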

pydna.common_sub_strings.terminal_overlap(stringx: str, stringy: str, limit: int = 15) List[Tuple[int, int, int]][source]#

Finds the flanking common substrings between stringx and stringy that are at least limit characters long. This means that the results only contain substrings that start or end at the ends of stringx and stringy.

This function is case sensitive.

Returns a list of tuples describing the substrings. The list is sorted longest -> shortest.

Parameters:
  • stringx (str)

  • stringy (str)

  • limit (int, optional)

Returns:

[(startx1,starty1,length1),(startx2,starty2,length2), …]

startx1 = start position in x where substring 1 starts
starty1 = start position in y where substring 1 starts
length1 = length of substring 1

Return type:

list of tuple

Examples

>>> from pydna.common_sub_strings import terminal_overlap
>>> terminal_overlap("agctatgtatcttgcatcgta", "gcatcgtagtctatttgcttac", limit=8)
[(13, 0, 8)]
             <-- 8 ->
<---- 13 --->
agctatgtatcttgcatcgta                    stringx
             gcatcgtagtctatttgcttac      stringy
             0
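
The same result can be obtained with a direct suffix/prefix comparison. This is a minimal sketch rather than the module's implementation: the function name is hypothetical, and unlike a maximal-match search it would also report shorter nested overlaps when the sequence ends are repetitive.

```python
def naive_terminal_overlap(stringx: str, stringy: str, limit: int = 15):
    """Return (startx, starty, length) for overlaps between the ends of the two strings."""
    found = []
    longest = min(len(stringx), len(stringy))
    # Longest overlaps first, never shorter than `limit` (and never empty).
    for length in range(longest, max(limit, 1) - 1, -1):
        # End of x matches start of y.
        if stringx.endswith(stringy[:length]):
            hit = (len(stringx) - length, 0, length)
            if hit not in found:
                found.append(hit)
        # End of y matches start of x.
        if stringy.endswith(stringx[:length]):
            hit = (0, len(stringy) - length, length)
            if hit not in found:
                found.append(hit)
    return found


print(naive_terminal_overlap("agctatgtatcttgcatcgta", "gcatcgtagtctatttgcttac", limit=8))
# [(13, 0, 8)]
```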

pydna.contig module#

class pydna.contig.Contig(record, *args, graph=None, nodemap=None, **kwargs)[source]#

Bases: Dseqrecord

This class holds information about a DNA assembly. This class is instantiated by the Assembly class and is not meant to be used directly.

classmethod from_string(record: str = '', *args, graph=None, nodemap=None, **kwargs)[source]#

docstring.

classmethod from_SeqRecord(record, *args, graph=None, nodemap=None, **kwargs)[source]#
reverse_complement()[source]#

Reverse complement.

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("ggaatt")
>>> a
Dseqrecord(-6)
>>> a.seq
Dseq(-6)
ggaatt
ccttaa
>>> a.reverse_complement().seq
Dseq(-6)
aattcc
ttaagg
>>>
rc()#

Reverse complement.

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("ggaatt")
>>> a
Dseqrecord(-6)
>>> a.seq
Dseq(-6)
ggaatt
ccttaa
>>> a.reverse_complement().seq
Dseq(-6)
aattcc
ttaagg
>>>
detailed_figure()[source]#

Returns a text representation of the assembled fragments.

Linear:

acgatgctatactgCCCCCtgtgctgtgctcta
                   TGTGCTGTGCTCTA
                   tgtgctgtgctctaTTTTTtattctggctgtatc

Circular:

||||||||||||||
acgatgctatactgCCCCCtgtgctgtgctcta
                   TGTGCTGTGCTCTA
                   tgtgctgtgctctaTTTTTtattctggctgtatc
                                      TATTCTGGCTGTATC
                                      tattctggctgtatcGGGGGtacgatgctatactg
                                                           ACGATGCTATACTG
figure()[source]#

Compact ascii representation of the assembled fragments.

Each fragment is represented by:

Size of common 5' substring|Name and size of DNA fragment|
Size of common 5' substring

Linear:

frag20| 6
       \\/
       /\\
        6|frag23| 6
                 \\/
                 /\\
                  6|frag14

Circular:

 -|2577|61
|       \\/
|       /\\
|       61|5681|98
|               \\/
|               /\\
|               98|2389|557
|                       \\/
|                       /\\
|                       557-
|                          |
 --------------------------
figure_mpl()[source]#

Graphic representation of the assembly.

Returns:

A representation of a linear or circular assembly.

Return type:

matplotlib.figure.Figure

pydna.cre_lox module#

pydna.cre_lox.cre_loxP_overlap(x: Dseqrecord, y: Dseqrecord, _l: None = None) list[tuple[int, int, int]][source]#

Find matching loxP sites between two sequences.

pydna.cre_lox.get_regex_dict(original_dict: dict[str, str]) dict[str, str][source]#

Get the regex dictionary for the original dictionary.

pydna.cre_lox.find_loxP_sites(seq: Dseqrecord) dict[str, list[Location]][source]#

Find all loxP sites in a sequence and return a dictionary with the name and positions of the sites.

pydna.cre_lox.annotate_loxP_sites(seq: Dseqrecord) Dseqrecord[source]#

pydna.crispr module#

Provides the Dseq class for handling double stranded DNA sequences.

Dseq is a subclass of Bio.Seq.Seq. The Dseq class is mostly useful as a part of the pydna.dseqrecord.Dseqrecord class which can hold more meta data.

The Dseq class support the notion of circular and linear DNA topology.

class pydna.crispr.cas9(protospacer)[source]#

Bases: _cas

docstring.

    |----size----------|

    ---protospacer------
                    -fst3
    fst5             |-|
    |--------------|
                        PAM
5-NNGGAAGAGTAATACACTA-AAANGGNN-3
||||||||||||||||||| ||||||||
3-NNCCTTCTCATTATGTGAT-TTTNCCNN-5
    ||||||||||||||||| |||
5-GGAAGAGTAATACACTA-AAAg-u-a-a-g-g  Scaffold
    ---gRNA spacer---    u-a
                        u-a
                        u-a
                        u-a
                        a-u
                        g-u-g
                        a    a
                        g-c-a
                        c-g
                        u-a
                        a-u
                        g   a  tetraloop
                        a-a
scaffold = 'GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGG'#
pam = '.GG'#
size = 20#
fst5 = 17#
fst3 = -3#
ovhg = 0#
search(dna, linear=True)[source]#

docstring.
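
The class attributes above (pam, size, fst5) are enough to illustrate how a cut position can be located on one strand. The following is a minimal sketch in plain Python, not pydna's implementation: the helper name find_cas9_cut_sites is hypothetical, only the given strand is searched, and the cut is assumed to fall fst5 = 17 nucleotides into the protospacer, as drawn in the diagram.

```python
import re


def find_cas9_cut_sites(dna: str, protospacer: str, pam: str = ".GG", fst5: int = 17) -> list[int]:
    """Return 0-based cut positions on the given strand (hypothetical helper).

    A position p means the strand is cut between dna[p - 1] and dna[p].
    """
    # A lookahead is used so that overlapping matches are also reported.
    pattern = re.compile(f"(?=(?:{re.escape(protospacer)}{pam}))", re.IGNORECASE)
    return [m.start() + fst5 for m in pattern.finditer(dna)]


# The protospacer and PAM from the diagram, with arbitrary flanking bases.
dna = "ttGGAAGAGTAATACACTAAAAtGGcc"
cuts = find_cas9_cut_sites(dna, "GGAAGAGTAATACACTAAAA")
left, right = dna[: cuts[0]], dna[cuts[0] :]
```

The two pieces left and right correspond to the fragments on either side of the gap drawn between ...ACACTA and AAANGG in the diagram.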

pydna.crispr.protospacer(guide_construct, cas=cas9)[source]#

docstring.

pydna.design module#

This module contains functions for primer design for various purposes.

  • primer_design: designs primers for a sequence, or a matching primer for an existing primer. Returns an Amplicon object (the same type that the amplify module returns).

  • assembly_fragments: adds tails to primers for a linear assembly through homologous recombination or Gibson assembly.

  • circular_assembly_fragments: adds tails to primers for a circular assembly through homologous recombination or Gibson assembly.

pydna.design.primer_design(template, fp=None, rp=None, limit=13, target_tm=55.0, tm_func=tm_default, estimate_function=None, **kwargs)[source]#

This function designs a forward primer and a reverse primer for PCR amplification of a given template sequence.

The template argument is a Dseqrecord object or equivalent containing the template sequence.

The optional fp and rp arguments can contain an existing primer for the sequence (either the forward or the reverse primer). Only one of the two primers can be specified, not both, since there would then be nothing to design; use the pydna.amplify.pcr function instead in that case.

The limit argument is the minimum length of the primer. The default value is 13.

If one of the primers is given, the other primer is designed to match it in terms of Tm. If both primers are designed, they will be designed to target_tm.

tm_func is a function that takes an ascii string representing an oligonucleotide as argument and returns a float. Some useful functions can be found in the pydna.tm module, but can be substituted for a custom made function.

estimate_function is a tm_func-like function that is used to get a first guess for the primer design, that is then used as starting point for the final result. This is useful when the tm_func function is slow to calculate (e.g. it relies on an external API, such as the NEB primer design API). The estimate_function should be faster than the tm_func function. The default value is None. To use the default tm_func as estimate function to get the NEB Tm faster, you can do: primer_design(dseqr, target_tm=55, tm_func=tm_neb, estimate_function=tm_default).

The function returns a pydna.amplicon.Amplicon class instance. This object has the object.forward_primer and object.reverse_primer properties which contain the designed primers.
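
The roles of limit, target_tm and tm_func can be illustrated with a small stand-alone sketch. It is not the algorithm used by primer_design: the function names are hypothetical, the Wallace rule (2 °C per A/T, 4 °C per G/C) is used as a stand-in for tm_default, and the primer is simply grown from the 5' end until the target is reached. With a different Tm model the resulting length differs from the 17-mer shown in the Examples.

```python
def wallace_tm(primer: str) -> float:
    """Rough Tm estimate (Wallace rule); a stand-in, not pydna's tm_default."""
    s = primer.upper()
    return 2.0 * (s.count("A") + s.count("T")) + 4.0 * (s.count("G") + s.count("C"))


def grow_primer(template: str, limit: int = 13, target_tm: float = 55.0, tm_func=wallace_tm) -> str:
    """Return the shortest prefix of template, at least `limit` long, whose Tm reaches target_tm."""
    length = limit
    while length < len(template) and tm_func(template[:length]) < target_tm:
        length += 1
    return template[:length]


template = "atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"
forward = grow_primer(template)
```

Any callable that maps a string to a float can be passed as tm_func, which mirrors how the tm_func argument is described above.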

Parameters:
  • template (pydna.dseqrecord.Dseqrecord) – a Dseqrecord object. The only required argument.

  • fp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.

  • rp (pydna.primer.Primer, optional) – optional pydna.primer.Primer objects containing one primer each.

  • target_tm (float, optional) – target tm for the primers, set to 55°C by default.

  • tm_func (function) – Function used for tm calculation. This function takes an ascii string representing an oligonucleotide as argument and returns a float. Some useful functions can be found in the pydna.tm module, but can be substituted for a custom made function.

Returns:

result

Return type:

Amplicon

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> t=Dseqrecord("atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg")
>>> t
Dseqrecord(-64)
>>> from pydna.design import primer_design
>>> ampl = primer_design(t)
>>> ampl
Amplicon(64)
>>> ampl.forward_primer
f64 17-mer:5'-atgactgctaacccttc-3'
>>> ampl.reverse_primer
r64 18-mer:5'-catcgtaagtttcgaacg-3'
>>> print(ampl.figure())
5atgactgctaacccttc...cgttcgaaacttacgatg3
                     ||||||||||||||||||
                    3gcaagctttgaatgctac5
5atgactgctaacccttc3
 |||||||||||||||||
3tactgacgattgggaag...gcaagctttgaatgctac5
>>> pf = "GGATCC" + ampl.forward_primer
>>> pr = "GGATCC" + ampl.reverse_primer
>>> pf
f64 23-mer:5'-GGATCCatgactgct..ttc-3'
>>> pr
r64 24-mer:5'-GGATCCcatcgtaag..acg-3'
>>> from pydna.amplify import pcr
>>> pcr_prod = pcr(pf, pr, t)
>>> print(pcr_prod.figure())
      5atgactgctaacccttc...cgttcgaaacttacgatg3
                           ||||||||||||||||||
                          3gcaagctttgaatgctacCCTAGG5
5GGATCCatgactgctaacccttc3
       |||||||||||||||||
      3tactgacgattgggaag...gcaagctttgaatgctac5
>>> print(pcr_prod.seq)
GGATCCatgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatgGGATCC
>>> from pydna.primer import Primer
>>> pf = Primer("atgactgctaacccttccttggtgttg", id="myprimer")
>>> ampl = primer_design(t, fp = pf)
>>> ampl.forward_primer
myprimer 27-mer:5'-atgactgctaaccct..ttg-3'
>>> ampl.reverse_primer
r64 32-mer:5'-catcgtaagtttcga..atc-3'
pydna.design.assembly_fragments(f, overlap=35, maxlink=40, circular=False)[source]#

This function returns a list of pydna.amplicon.Amplicon objects where primers have been modified with tails so that the fragments can be fused in the order they appear in the list by, for example, Gibson assembly or homologous recombination.

Given that we have two linear pydna.amplicon.Amplicon objects a and b, we can modify the reverse primer of a and the forward primer of b with tails to allow fusion by fusion PCR, Gibson assembly or in-vivo homologous recombination. The basic requirements for the primers for the three techniques are the same.

 _________ a _________
/                     \
agcctatcatcttggtctctgca
                  |||||
                 <gacgt
agcct>
|||||
tcggatagtagaaccagagacgt

                        __________ b ________
                       /                     \
                       TTTATATCGCATGACTCTTCTTT
                                         |||||
                                        <AGAAA
                       TTTAT>
                       |||||
                       AAATATAGCGTACTGAGAAGAAA

agcctatcatcttggtctctgcaTTTATATCGCATGACTCTTCTTT
||||||||||||||||||||||||||||||||||||||||||||||
tcggatagtagaaccagagacgtAAATATAGCGTACTGAGAAGAAA
\___________________ c ______________________/

Design tailed primers incorporating a part of the next or previous fragment to be assembled.

agcctatcatcttggtctctgca
|||||||||||||||||||||||
                gagacgtAAATATA

|||||||||||||||||||||||
tcggatagtagaaccagagacgt

                       TTTATATCGCATGACTCTTCTTT
                       |||||||||||||||||||||||

                ctctgcaTTTATAT
                       |||||||||||||||||||||||
                       AAATATAGCGTACTGAGAAGAAA

PCR products with flanking sequences are formed in the PCR process.

agcctatcatcttggtctctgcaTTTATAT
||||||||||||||||||||||||||||||
tcggatagtagaaccagagacgtAAATATA
                \____________/

                   identical
                   sequences
                 ____________
                /            \
                ctctgcaTTTATATCGCATGACTCTTCTTT
                ||||||||||||||||||||||||||||||
                gagacgtAAATATAGCGTACTGAGAAGAAA

The fragments can be fused by any of the techniques mentioned earlier to form c:

agcctatcatcttggtctctgcaTTTATATCGCATGACTCTTCTTT
||||||||||||||||||||||||||||||||||||||||||||||
tcggatagtagaaccagagacgtAAATATAGCGTACTGAGAAGAAA

The first argument of this function is a list of sequence objects containing Amplicons and other similar objects.

At least every second sequence object needs to be an Amplicon.

This rule exists because if a sequence object that is not a PCR product is to be fused with another fragment, that other fragment needs to be an Amplicon so that its primer can be modified to include the whole stretch of sequence homology needed for the fusion. See the example below where a is a non-amplicon (a linear plasmid vector for instance).

 _________ a _________           __________ b ________
/                     \         /                     \
agcctatcatcttggtctctgca   <-->  TTTATATCGCATGACTCTTCTTT
|||||||||||||||||||||||         |||||||||||||||||||||||
tcggatagtagaaccagagacgt                          <AGAAA
                                TTTAT>
                                |||||||||||||||||||||||
                          <-->  AAATATAGCGTACTGAGAAGAAA

     agcctatcatcttggtctctgcaTTTATATCGCATGACTCTTCTTT
     ||||||||||||||||||||||||||||||||||||||||||||||
     tcggatagtagaaccagagacgtAAATATAGCGTACTGAGAAGAAA
     \___________________ c ______________________/

In this case only the forward primer of b is fitted with a tail containing a part of a:

agcctatcatcttggtctctgca
|||||||||||||||||||||||
tcggatagtagaaccagagacgt

                       TTTATATCGCATGACTCTTCTTT
                       |||||||||||||||||||||||
                                        <AGAAA
         tcttggtctctgcaTTTATAT
                       |||||||||||||||||||||||
                       AAATATAGCGTACTGAGAAGAAA

PCR products with flanking sequences are formed in the PCR process.

agcctatcatcttggtctctgcaTTTATAT
||||||||||||||||||||||||||||||
tcggatagtagaaccagagacgtAAATATA
                \____________/

                   identical
                   sequences
                 ____________
                /            \
                ctctgcaTTTATATCGCATGACTCTTCTTT
                ||||||||||||||||||||||||||||||
                gagacgtAAATATAGCGTACTGAGAAGAAA

The fragments can be fused by for example Gibson assembly:

agcctatcatcttggtctctgcaTTTATAT
||||||||||||||||||||||||||||||
tcggatagtagaacca

                             TCGCATGACTCTTCTTT
                ||||||||||||||||||||||||||||||
                gagacgtAAATATAGCGTACTGAGAAGAAA

to form c:

agcctatcatcttggtctctgcaTTTATATCGCATGACTCTTCTTT
||||||||||||||||||||||||||||||||||||||||||||||
tcggatagtagaaccagagacgtAAATATAGCGTACTGAGAAGAAA

The overlap argument controls how many base pairs of overlap are required between adjacent sequence fragments. In the junction between Amplicons, tails with a length of about half of this value are added to the two primers closest to the junction.

>       <
Amplicon1
         Amplicon2
         >       <

         ⇣

>       <-
Amplicon1
         Amplicon2
        ->       <

In the case of an Amplicon adjacent to a Dseqrecord object, the tail will be twice as long (1*overlap) since the recombining sequence is present entirely on this primer:

Dseqrecd1
         Amplicon1
         >       <

         ⇣

Dseqrecd1
         Amplicon1
       -->       <

Note that if the sequence of DNA fragments starts or stops with an Amplicon, the very first and very last primer will not be modified, i.e. assemblies are always assumed to be linear. There are simple tricks around that for circular assemblies, depicted in the last two examples below.

The maxlink argument controls the cut off length for sequences that will be synthesized by adding them to primers for the adjacent fragment(s). The argument list may contain short spacers (such as spacers between fusion proteins).
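
The overlap rule described above (about half of overlap on each primer between two Amplicons, the full overlap on one primer when the neighbour is not an Amplicon) can be sketched on plain top-strand strings. This is an illustration only, not the code of assembly_fragments; the function name and its boolean flags are hypothetical, and spacers (maxlink) are ignored.

```python
def add_junction_overlap(left, right, overlap=35, left_is_amplicon=True, right_is_amplicon=True):
    """Return (left, right) top strands extended so that they share `overlap` bp."""
    if left_is_amplicon and right_is_amplicon:
        from_right = overlap // 2         # tail carried by the reverse primer of `left`
        from_left = overlap - from_right  # tail carried by the forward primer of `right`
    elif right_is_amplicon:
        from_left, from_right = overlap, 0
    elif left_is_amplicon:
        from_left, from_right = 0, overlap
    else:
        raise ValueError("at least one of two adjacent fragments must be an Amplicon")
    new_left = left + right[:from_right]
    new_right = left[len(left) - from_left:] + right
    return new_left, new_right


a = "agcctatcatcttggtctctgca"
b = "TTTATATCGCATGACTCTTCTTT"
pa, pb = add_junction_overlap(a, b, overlap=14)
va, vb = add_junction_overlap(a, b, overlap=14, left_is_amplicon=False)
```

With overlap=14 the first call reproduces the two PCR products drawn earlier in this docstring, and the second call reproduces the case where a is a non-amplicon and the whole homology sits on the forward primer of b.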

Example 1: Linear assembly of PCR products (pydna.amplicon.Amplicon class objects)

>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon4
         >       <         >       <

                     ⇣
                     pydna.design.assembly_fragments
                     ⇣

>       <-       ->       <-                      pydna.assembly.Assembly
Amplicon1         Amplicon3
         Amplicon2         Amplicon4     ➤  Amplicon1Amplicon2Amplicon3Amplicon4
        ->       <-       ->       <

Example 2: Linear assembly of alternating Amplicons and other fragments

>       <         >       <
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2

                     ⇣
                     pydna.design.assembly_fragments
                     ⇣

>       <--     -->       <--                     pydna.assembly.Assembly
Amplicon1         Amplicon2
         Dseqrecd1         Dseqrecd2     ➤  Amplicon1Dseqrecd1Amplicon2Dseqrecd2

Example 3: Linear assembly of alternating Amplicons and other fragments

Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2
         >       <       -->       <

                     ⇣
             pydna.design.assembly_fragments
                     ⇣
                                                  pydna.assembly.Assembly
Dseqrecd1         Dseqrecd2
         Amplicon1         Amplicon2     ➤  Dseqrecd1Amplicon1Dseqrecd2Amplicon2
       -->       <--     -->       <

Example 4: Circular assembly of alternating Amplicons and other fragments

                 ->       <==
Dseqrecd1         Amplicon2
         Amplicon1         Dseqrecd1
       -->       <-
                     ⇣
                     pydna.design.assembly_fragments
                     ⇣
                                                   pydna.assembly.Assembly
                 ->       <==
Dseqrecd1         Amplicon2                    -Dseqrecd1Amplicon1Amplicon2-
         Amplicon1                       ➤    |                             |
       -->       <-                            -----------------------------

Example 5: Circular assembly of Amplicons

>       <         >       <
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
         >       <         >       <

                     ⇣
                     pydna.design.assembly_fragments
                     ⇣

>       <=       ->       <-
Amplicon1         Amplicon3
         Amplicon2         Amplicon1
        ->       <-       +>       <

                     ⇣
             make new Amplicon using the Amplicon1.template and
             the last fwd primer and the first rev primer.
                     ⇣
                                                   pydna.assembly.Assembly
+>       <=       ->       <-
 Amplicon1         Amplicon3                  -Amplicon1Amplicon2Amplicon3-
          Amplicon2                      ➤   |                             |
         ->       <-                          -----------------------------
Parameters:
  • f (list of pydna.amplicon.Amplicon and other Dseqrecord like objects) – list of Amplicon and Dseqrecord objects for which fusion primers should be constructed.

  • overlap (int, optional) – Length of required overlap between fragments.

  • maxlink (int, optional) – Maximum length of spacer sequences that may be present in f. These will be included in tails for designed primers.

  • circular (bool, optional) – If True, the assembly is circular. If False, the assembly is linear.

Returns:

seqs

[Amplicon1,
 Amplicon2, ...]

Return type:

list of pydna.amplicon.Amplicon and other Dseqrecord like objects

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.design import primer_design
>>> a=primer_design(Dseqrecord("atgactgctaacccttccttggtgttgaacaagatcgacgacatttcgttcgaaacttacgatg"))
>>> b=primer_design(Dseqrecord("ccaaacccaccaggtaccttatgtaagtacttcaagtcgccagaagacttcttggtcaagttgcc"))
>>> c=primer_design(Dseqrecord("tgtactggtgctgaaccttgtatcaagttgggtgttgacgccattgccccaggtggtcgtttcgtt"))
>>> from pydna.design import assembly_fragments
>>> # We would like a circular recombination, so the first sequence has to be repeated
>>> fa1,fb,fc,fa2 = assembly_fragments([a,b,c,a])
>>> # Since all fragments are Amplicons, we need to extract the rp of the 1st and fp of the last fragments.
>>> from pydna.amplify import pcr
>>> fa = pcr(fa2.forward_primer, fa1.reverse_primer, a)
>>> [fa,fb,fc]
[Amplicon(100), Amplicon(101), Amplicon(102)]
>>> fa.name, fb.name, fc.name = "fa fb fc".split()
>>> from pydna.assembly import Assembly
>>> assemblyobj = Assembly([fa,fb,fc])
>>> assemblyobj
Assembly
fragments....: 100bp 101bp 102bp
limit(bp)....: 25
G.nodes......: 6
algorithm....: common_sub_strings
>>> assemblyobj.assemble_linear()
[Contig(-231), Contig(-166), Contig(-36)]
>>> assemblyobj.assemble_circular()[0].seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> (a+b+c).looped().seguid()
'cdseguid=85t6tfcvWav0wnXEIb-lkUtrl4s'
>>> print(assemblyobj.assemble_circular()[0].figure())
 -|fa|36
|     \/
|     /\
|     36|fb|36
|           \/
|           /\
|           36|fc|36
|                 \/
|                 /\
|                 36-
|                    |
 --------------------
>>>
pydna.design.circular_assembly_fragments(f, overlap=35, maxlink=40)[source]#

Equivalent to assembly_fragments with circular=True.

Deprecated, kept for backward compatibility. Use assembly_fragments with circular=True instead.

pydna.design.user_assembly_design(f: list[Amplicon], max_overlap: int = 15, min_overlap: int = 4, max_tail=50) list[Amplicon][source]#

pydna.dseq module#

Provides the Dseq class for handling double stranded DNA sequences.

Dseq is a subclass of Bio.Seq.Seq. The Dseq class is mostly useful as a part of the pydna.dseqrecord.Dseqrecord class which can hold more meta data.

The Dseq class supports the notion of circular and linear DNA topology.

class pydna.dseq.CircularBytes(value: bytes | bytearray | memoryview)[source]#

Bases: bytes

A circular bytes sequence: indexing and slicing wrap around index 0.

cutaround(start: int, length: int) bytes[source]#

Return a circular slice of given length starting at index start. Can exceed len(self), wrapping around as needed.

Examples

s = CircularBytes(b"ABCDE")
assert s.cutaround(3, 7) == b"DEABCDE"
assert s.cutaround(-1, 4) == b"EABC"
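
The wrap-around behaviour of cutaround can be pictured with modular indexing. The sketch below is a plain-Python stand-in operating on ordinary bytes, not the CircularBytes implementation; the function name cutaround_sketch is hypothetical.

```python
def cutaround_sketch(data: bytes, start: int, length: int) -> bytes:
    """Return `length` bytes starting at `start`, wrapping around the end of data."""
    n = len(data)
    if n == 0 or length <= 0:
        return b""
    # Python's % always yields a non-negative index for a positive modulus,
    # so negative start values wrap in the same way as positive ones.
    return bytes(data[(start + i) % n] for i in range(length))


wrapped = cutaround_sketch(b"ABCDE", 3, 7)
```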

find(sub: bytes | bytearray | memoryview | str, start: int = 0, end: int | None = None) int[source]#

Find a subsequence in the circular sequence, possibly wrapping across the origin. Returns -1 if not found.
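
One common way to search across the origin is to append the first len(sub) - 1 bytes to the end before searching. The following is a sketch under that assumption, with a hypothetical function name; it does not reproduce the start and end arguments, and it treats a subsequence longer than the sequence as not found, which may differ from the real method.

```python
def circular_find_sketch(data: bytes, sub: bytes) -> int:
    """Index of the first occurrence of sub in circular data, or -1."""
    if not sub:
        return 0
    if len(sub) > len(data):
        return -1  # simplification: no multiple wrap-around
    # Extending by len(sub) - 1 bytes exposes every match that spans the origin.
    extended = data + data[: len(sub) - 1]
    return extended.find(sub)


origin_hit = circular_find_sketch(b"ABCDE", b"EAB")
```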

class pydna.dseq.Dseq(watson: str | bytes, crick: str | bytes | None = None, ovhg=None, circular=False, pos=0)[source]#

Bases: Seq

Dseq describes a double stranded DNA fragment, linear or circular.

Dseq can be initialized in two ways. The first uses two strings, representing the Watson (upper, sense) strand and the Crick (lower, antisense) strand, and an optional value describing the stagger between the strands on the left side (ovhg).

Alternatively, a single string representation using dsIUPAC codes can be used. If a single string is used, the letters of that string are interpreted as base pairs rather than single bases. For example “A” would indicate the base pair “A/T”. An expanded IUPAC code is used where the letters PEXI have been assigned to GATC on the Watson strand with no pairing base on the Crick strand: G/””, A/””, T/”” and C/””. The letters QFZJ have been assigned the complementary Crick strand bases with an empty Watson strand: “”/C, “”/T, “”/A and “”/G.

PEXIGATCQFZJ  would indicate the linear double-stranded fragment:

GATCGATC
    CTAGCTAG
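
A decoder for this notation can be sketched in a few lines. It is an illustration, not pydna's parser: the function name is hypothetical, and the Crick-only mapping used here (Q, F, Z, J to C, T, A, G) is read off the printed examples in this documentation, such as PEXIGATCQFZJ above. The lower row is returned as displayed, i.e. 3’ to 5’ from left to right.

```python
WATSON_ONLY = {"P": "G", "E": "A", "X": "T", "I": "C"}
CRICK_ONLY = {"Q": "C", "F": "T", "Z": "A", "J": "G"}  # assumption taken from the examples
PAIRED = {"G": "C", "A": "T", "T": "A", "C": "G"}


def dsiupac_to_rows(code: str) -> tuple[str, str]:
    """Return the (upper, lower) display rows for a dsIUPAC string."""
    upper, lower = [], []
    for letter in code.upper():
        if letter in WATSON_ONLY:
            upper.append(WATSON_ONLY[letter])
            lower.append(" ")
        elif letter in CRICK_ONLY:
            upper.append(" ")
            lower.append(CRICK_ONLY[letter])
        elif letter in PAIRED:
            upper.append(letter)
            lower.append(PAIRED[letter])
        else:
            raise ValueError(f"unsupported letter: {letter!r}")
    return "".join(upper), "".join(lower)


rows = dsiupac_to_rows("PEXIGATCQFZJ")
```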
Parameters:
  • watson (str) – a string representing the Watson (sense) DNA strand or a base pair representation.

  • crick (str, optional) – a string representing the Crick (antisense) DNA strand.

  • ovhg (int, optional) – A positive or negative number to describe the stagger between the Watson and Crick strands. See below for a detailed explanation.

  • circular (bool, optional) – True indicates that sequence is circular, False that it is linear.

Examples

Dseq is a subclass of the Biopython Bio.Seq.Seq class. The constructor can accept two strings representing the Watson (sense) and Crick (antisense) DNA strands. These are interpreted as single stranded DNA. There is a check for complementarity between the strands.

If the DNA molecule is staggered on the left side, an integer ovhg (overhang) must be given, describing the stagger between the Watson and Crick strand in the 5’ end of the fragment.

Additionally, the optional boolean parameter circular can be given to indicate if the DNA molecule is circular.

The most common usage of the Dseq class is probably not to use it directly, but to create it as part of a Dseqrecord object (see pydna.dseqrecord.Dseqrecord). This works in the same way as for the relationship between the Bio.Seq.Seq and Bio.SeqRecord.SeqRecord classes in Biopython.

There are multiple ways of creating a Dseq object directly listed below, but you can also use the function Dseq.from_full_sequence_and_overhangs() to create a Dseq:

Two arguments (string, string), no overhang provided:

>>> from pydna.dseq import Dseq
>>> Dseq("gggaaat","ttt")
Dseq(-7)
gggaaat
   ttt

If Watson and Crick are given, but not ovhg, an attempt will be made to find the best annealing between the strands. There are important limitations to this. If there are several ways to anneal the strands, this will fail. For long fragments it is quite slow.

Three arguments (string, string, ovhg=int):

The ovhg parameter is an integer describing the length of the Crick strand overhang on the left side (the 5’ end of Watson strand).

The ovhg parameter controls the stagger at the five prime end:

dsDNA       overhang

  nnn...    2
nnnnn...

 nnnn...    1
nnnnn...

nnnnn...    0
nnnnn...

nnnnn...   -1
 nnnn...

nnnnn...   -2
  nnn...

Example of creating Dseq objects with different amounts of stagger:

>>> Dseq(watson="att", crick="acata", ovhg=-2)
Dseq(-7)
att
  ataca
>>> Dseq(watson="ata",crick="acata",ovhg=-1)
Dseq(-6)
ata
 ataca
>>> Dseq(watson="taa",crick="actta",ovhg=0)
Dseq(-5)
taa
attca
>>> Dseq(watson="aag",crick="actta",ovhg=1)
Dseq(-5)
 aag
attca
>>> Dseq(watson="agt",crick="actta",ovhg=2)
Dseq(-5)
  agt
attca
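
The relation between ovhg, the printed figure and the reported length can be sketched without pydna. The two helpers below are hypothetical and only mimic the layout of the repr shown above: the Crick strand is given 5’ to 3’ and printed reversed, a positive ovhg indents the Watson row and a negative ovhg indents the Crick row.

```python
def stagger_rows(watson: str, crick: str, ovhg: int) -> tuple[str, str]:
    """Return the (upper, lower) rows as they appear in the figures above."""
    upper = " " * max(ovhg, 0) + watson
    lower = " " * max(-ovhg, 0) + crick[::-1]
    return upper, lower


def duplex_length(watson: str, crick: str, ovhg: int) -> int:
    """Total length of the fragment, including single stranded ends."""
    return max(len(watson) + max(ovhg, 0), len(crick) + max(-ovhg, 0))


rows_minus2 = stagger_rows("att", "acata", -2)
```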

If the ovhg parameter is specified a Crick strand also needs to be supplied, or an exception is raised.

>>> Dseq(watson="agt", ovhg=2)
Traceback (most recent call last):
    ...
ValueError: ovhg (overhang) defined without a crick strand.

The shape or topology of the fragment is set by the circular parameter, True or False (default).

>>> Dseq("aaa", "ttt", ovhg = 0)  # A linear sequence by default
Dseq(-3)
aaa
ttt
>>> Dseq("aaa", "ttt", ovhg = 0, circular = False)  # A linear sequence if circular is False
Dseq(-3)
aaa
ttt
>>> Dseq("aaa", "ttt", ovhg = 0, circular = True)  # A circular sequence
Dseq(o3)
aaa
ttt
>>> Dseq("aaa", "ttt", ovhg=1, circular = False)
Dseq(-4)
 aaa
ttt
>>> Dseq("aaa","ttt",ovhg=-1)
Dseq(-4)
aaa
 ttt
>>> Dseq("aaa", "ttt", circular = True , ovhg=0)
Dseq(o3)
aaa
ttt
>>> a=Dseq("tttcccc","aaacccc")
>>> a
Dseq(-11)
    tttcccc
ccccaaa
>>> a.ovhg
4
>>> b=Dseq("ccccttt","ccccaaa")
>>> b
Dseq(-11)
ccccttt
    aaacccc
>>> b.ovhg
-4
>>>

dsIUPAC [11] is an extension to the IUPAC alphabet used to describe single stranded regions:

    aaaGATC       GATCccc          ad-hoc representations
CTAGttt               gggCTAG

QFZJaaaPEXI       PEXIcccQFZJ      dsIUPAC

Coercing to string

>>> str(a)
'ggggtttcccc'

A Dseq object can be longer than either the watson or crick strands.

<-- length -->
GATCCTTT
     AAAGCCTAG

<-- length -->
      GATCCTTT
AAAGCCCTA

The slicing of a linear Dseq object works mostly as it does for a string.

>>> s="ggatcc"
>>> s[2:3]
'a'
>>> s[2:4]
'at'
>>> s[2:4:-1]
''
>>> s[::2]
'gac'
>>> from pydna.dseq import Dseq
>>> d=Dseq(s, circular=False)
>>> d[2:3]
Dseq(-1)
a
t
>>> d[2:4]
Dseq(-2)
at
ta
>>> d[2:4:-1]
Dseq(-0)


>>> d[::2]
Dseq(-3)
gac
ctg

The slicing of a circular Dseq object has a slightly different meaning.

>>> s="ggAtCc"
>>> d=Dseq(s, circular=True)
>>> d
Dseq(o6)
ggAtCc
ccTaGg
>>> d[4:3]
Dseq(-5)
CcggA
GgccT
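
On a plain string the circular slicing rule can be written as follows. This is a sketch with a hypothetical function name that handles only the top strand and assumes 0 <= start, stop <= len(seq); it also covers the case where start equals stop, which yields the whole sequence linearized at that position.

```python
def circular_slice(seq: str, start: int, stop: int) -> str:
    """Slice a circular sequence; start >= stop wraps around the origin."""
    if start < stop:
        return seq[start:stop]
    return seq[start:] + seq[:stop]


wrapped_slice = circular_slice("ggAtCc", 4, 3)
```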

The slice [X:X] produces an empty slice for a string, while this will return the linearized sequence starting at X:

>>> s="ggatcc"
>>> d=Dseq(s, circular=True)
>>> d
Dseq(o6)
ggatcc
cctagg
>>> d[3:3]
Dseq(-6)
tccgga
aggcct
>>>
classmethod quick(data: bytes, *args, circular=False, pos=0, **kwargs)[source]#

Fastest way to instantiate an object of the Dseq class.

No checks of parameters are made. Does not call Bio.Seq.Seq.__init__() which has lots of time consuming checks.

classmethod from_representation(dsdna: str, *args, **kwargs)[source]#
classmethod from_full_sequence_and_overhangs(full_sequence: str, crick_ovhg: int, watson_ovhg: int)[source]#

Create a linear Dseq object from a full sequence and the 3’ overhangs of each strand.

The order of the parameters is like this because the 3’ overhang of the crick strand is the one on the left side of the sequence.

Parameters:
  • full_sequence (str) – The full sequence of the Dseq object.

  • crick_ovhg (int) – The overhang of the crick strand in the 3’ end. Equivalent to Dseq.ovhg.

  • watson_ovhg (int) – The overhang of the watson strand in the 5’ end.

Returns:

A Dseq object.

Return type:

Dseq

Examples

>>> Dseq.from_full_sequence_and_overhangs('AAAAAA', crick_ovhg=2, watson_ovhg=2)
Dseq(-6)
  AAAA
TTTT
>>> Dseq.from_full_sequence_and_overhangs('AAAAAA', crick_ovhg=-2, watson_ovhg=2)
Dseq(-6)
AAAAAA
  TT
>>> Dseq.from_full_sequence_and_overhangs('AAAAAA', crick_ovhg=2, watson_ovhg=-2)
Dseq(-6)
  AA
TTTTTT
>>> Dseq.from_full_sequence_and_overhangs('AAAAAA', crick_ovhg=-2, watson_ovhg=-2)
Dseq(-6)
AAAA
  TTTT
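
The sign conventions of crick_ovhg and watson_ovhg can be checked with a small stand-alone sketch that blanks out the missing bases on each side. The function name is hypothetical and the code only reproduces the printed rows of the four examples above; it is not the pydna implementation.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def rows_from_full_sequence(full_sequence: str, crick_ovhg: int, watson_ovhg: int) -> tuple[str, str]:
    """Return the (upper, lower) display rows, trailing blanks removed."""
    n = len(full_sequence)
    top = list(full_sequence)
    bottom = list(full_sequence.translate(COMPLEMENT))
    # Left side: positive crick_ovhg means the Crick strand protrudes.
    if crick_ovhg > 0:
        top[:crick_ovhg] = " " * crick_ovhg
    else:
        bottom[: -crick_ovhg] = " " * -crick_ovhg
    # Right side: positive watson_ovhg means the Watson strand protrudes.
    if watson_ovhg > 0:
        bottom[n - watson_ovhg:] = " " * watson_ovhg
    else:
        top[n + watson_ovhg:] = " " * -watson_ovhg
    return "".join(top).rstrip(), "".join(bottom).rstrip()


both_positive = rows_from_full_sequence("AAAAAA", 2, 2)
```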
property watson: str#

The watson (upper) strand of the double stranded fragment 5’-3’.

Returns:

The watson strand as a string.

Return type:

str

property crick: str#

The crick (lower) strand of the double stranded fragment 5’-3’.

Returns:

The crick strand as a string.

Return type:

str

property left_ovhg: int#

The 5’ overhang of the lower strand compared to the upper.

See module docstring for more information.

Returns:

The overhang as an integer.

Return type:

int

property ovhg: int#

The 5’ overhang of the lower strand compared to the upper.

See module docstring for more information.

Returns:

The overhang as an integer.

Return type:

int

property right_ovhg: int#

Overhang at the right side (end).

property watson_ovhg: int#

Overhang at the right side (end).

to_blunt_string() str#

A string representation of the sequence. The returned string is the watson strand of a blunt version of the sequence.

>>> ds = Dseq.from_representation(
... '''
... GAATTC
...   TAA
... ''')
>>> str(ds)
'GAATTC'
>>> ds = Dseq.from_representation(
... '''
...   ATT
... CTTAAG
... ''')
>>> str(ds)
'GAATTC'
Returns:

A string representation of the sequence.

Return type:

str

mw() float[source]#

The molecular weight of the DNA/RNA molecule in g/mol.

The molecular weight data in Biopython Bio.Data.IUPACData is used. The DNA is assumed to have a 5’-phosphate as many DNA fragments from restriction digestion do:

 P - G-A-T-T-A-C-A - OH
     | | | | | | |
OH - C-T-A-A-T-G-T - P

The molecular weights listed in the unambiguous_dna_weights dictionary refer to free monophosphate nucleotides. One water molecule is removed for every phosphodiester bond formed between nucleotides. For linear molecules, the weight of one water molecule is added to account for the terminal hydroxyl group and a hydrogen on the 5’ terminal phosphate group.
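
The bookkeeping described above can be sketched for fully paired (blunt) molecules. The helper names are hypothetical, and the numeric constants are assumed to match Biopython's unambiguous_dna_weights table and its average weight of water; they should be checked against the installed Biopython version. Staggered or discontinuous molecules are not handled.

```python
# Assumed values (g/mol) for free nucleoside monophosphates and water.
WEIGHTS = {"A": 331.2218, "C": 307.1971, "G": 347.2212, "T": 322.2085}
WATER = 18.0153
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def strand_mw(seq: str, circular: bool = False) -> float:
    """Weight of one strand: sum of monophosphates minus one water per bond."""
    total = sum(WEIGHTS[base] for base in seq.upper())
    bonds = len(seq) if circular else len(seq) - 1
    return total - bonds * WATER


def blunt_duplex_mw(watson: str, circular: bool = False) -> float:
    """Weight of a blunt double stranded molecule given its Watson strand."""
    crick = watson.upper().translate(COMPLEMENT)[::-1]
    return strand_mw(watson, circular) + strand_mw(crick, circular)


linear_weight = round(blunt_duplex_mw("GATTACA"), 1)
```

Under these assumptions the sketch gives the same rounded values as the linear, circular and single stranded GATTACA examples in this docstring.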

If the DNA is discontinuous, the internal 5’- end is assumed to have a phosphate and the 3’- a hydroxyl group:

 P - G---A---T - OH  P - C---A - OH
     |   |   |           |   |
OH - C---T---A---A---T---G---T - P

Examples

>>> from pydna.dseq import Dseq
>>> ds_lin_obj = Dseq("GATTACA")
>>> ds_lin_obj
Dseq(-7)
GATTACA
CTAATGT
>>> round(ds_lin_obj.mw(), 1)
4359.8
>>> ds_circ_obj = Dseq("GATTACA", circular = True)
>>> round(ds_circ_obj.mw(), 1)
4323.8
>>> ssobj = Dseq("PEXXEIE")
>>> ssobj
Dseq(-7)
GATTACA
|||||||
>>> round(ssobj.mw(), 1)
2184.4
>>> ds_lin_obj2 = Dseq("GATZFCA")
>>> ds_lin_obj2
Dseq(-7)
GAT  CA
CTAATGT
>>> round(ds_lin_obj2.mw(), 1)
3724.4
find(sub: _SeqAbstractBaseClass | str | bytes, start=0, end=sys.maxsize) int[source]#

This method behaves like the python string method of the same name.

Returns an integer, the index of the first occurrence of substring argument sub in the (sub)sequence given by [start:end].

Returns -1 if the subsequence is NOT found.

The search is case sensitive.

Parameters:
  • sub (string or Seq object) – a string or another Seq object to look for.

  • start (int, optional) – slice start.

  • end (int, optional) – slice end.

Examples

>>> from pydna.dseq import Dseq
>>> seq = Dseq("agtaagt")
>>> seq
Dseq(-7)
agtaagt
tcattca
>>> seq.find("taa")
2
>>> seq = Dseq(watson="agta",crick="actta",ovhg=-2)
>>> seq
Dseq(-7)
agta
  attca
>>> seq.find("taa")
-1
>>> seq = Dseq(watson="agta",crick="actta",ovhg=-2)
>>> seq
Dseq(-7)
agta
  attca
>>> seq.find("ta")
2
reverse_complement() Dseq[source]#

Dseq object where watson and crick have switched places.

This represents the same double stranded sequence.

Examples

>>> from pydna.dseq import Dseq
>>> a=Dseq("catcgatc")
>>> a
Dseq(-8)
catcgatc
gtagctag
>>> b=a.reverse_complement()
>>> b
Dseq(-8)
gatcgatg
ctagctac
>>>
rc() Dseq#

Dseq object where watson and crick have switched places.

This represents the same double stranded sequence.

Examples

>>> from pydna.dseq import Dseq
>>> a=Dseq("catcgatc")
>>> a
Dseq(-8)
catcgatc
gtagctag
>>> b=a.reverse_complement()
>>> b
Dseq(-8)
gatcgatg
ctagctac
>>>
shifted(shift: int) DseqType[source]#

Shifted copy of a circular Dseq object.

>>> ds = Dseq("TAAG", circular = True)
>>> ds.shifted(1) # First bp moved to right side:
Dseq(o4)
AAGT
TTCA
>>> ds.shifted(-1) # Last bp moved to left side:
Dseq(o4)
GTAA
CATT
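
For the top strand alone, shifting a circular sequence is a rotation. A minimal sketch with a hypothetical function name:

```python
def shifted_sketch(seq: str, shift: int) -> str:
    """Rotate a circular sequence so that position `shift` becomes the new origin."""
    if not seq:
        return seq
    k = shift % len(seq)  # also maps negative shifts onto valid indices
    return seq[k:] + seq[:k]


rotated = shifted_sketch("TAAG", 1)
```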
looped() DseqType[source]#

Circularized Dseq object.

This can only be done if the two ends are compatible, otherwise a TypeError is raised.

Examples

>>> from pydna.dseq import Dseq
>>> a=Dseq("catcgatc")
>>> a
Dseq(-8)
catcgatc
gtagctag
>>> a.looped()
Dseq(o8)
catcgatc
gtagctag
>>> b = Dseq("iatcgatj")
>>> b
Dseq(-8)
catcgat
 tagctag
>>> b.looped()
Dseq(o7)
catcgat
gtagcta
>>> c = Dseq("jatcgati")
>>> c
Dseq(-8)
 atcgatc
gtagcta
>>> c.looped()
Dseq(o7)
catcgat
gtagcta
>>> d = Dseq("ietcgazj")
>>> d
Dseq(-8)
catcga
  agctag
>>> d.looped()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pydna/dsdna.py", line 357, in looped
    if type5 == type3 and str(sticky5) == str(rc(sticky3)):
TypeError: DNA cannot be circularized.
5' and 3' sticky ends not compatible!
>>>
five_prime_end() Tuple[str, str][source]#

Returns a 2-tuple of strings describing the structure of the 5’ end of the DNA fragment.

The tuple contains (type, sticky) where type is either “5’”, “3’” or “blunt”. sticky is always in lower case and contains the sequence of the protruding end in 5’-3’ direction.

See examples below:

Examples

>>> from pydna.dseq import Dseq
>>> a = Dseq("aa", "tttg", ovhg=2)
>>> a
Dseq(-4)
  aa
gttt
>>> a.five_prime_end()
("3'", 'tg')
>>> a = Dseq("caaa", "tt", ovhg=-2)
>>> a
Dseq(-4)
caaa
  tt
>>> a.five_prime_end()
("5'", 'ca')
>>> a = Dseq("aa", "tt")
>>> a
Dseq(-2)
aa
tt
>>> a.five_prime_end()
('blunt', '')
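
The three cases in the examples above follow directly from the sign of ovhg. The sketch below uses a hypothetical stand-alone function that takes both strands 5’ to 3’ together with ovhg, rather than a Dseq object.

```python
def five_prime_end_sketch(watson: str, crick: str, ovhg: int) -> tuple[str, str]:
    """Classify the left end of a linear fragment from its strands and ovhg."""
    if ovhg > 0:
        # The Crick strand protrudes with its 3' end.
        return "3'", crick[-ovhg:].lower()
    if ovhg < 0:
        # The Watson strand protrudes with its 5' end.
        return "5'", watson[:-ovhg].lower()
    return "blunt", ""


left_end = five_prime_end_sketch("aa", "tttg", 2)
```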
three_prime_end() Tuple[str, str][source]#

Returns a tuple describing the structure of the 3’ end of the DNA fragment.

>>> a = Dseq("aa", "gttt", ovhg=0)
>>> a
Dseq(-4)
aa
tttg
>>> a.three_prime_end()
("5'", 'gt')
>>> a = Dseq("aaac", "tt", ovhg=0)
>>> a
Dseq(-4)
aaac
tt
>>> a.three_prime_end()
("3'", 'ac')
>>> from pydna.dseq import Dseq
>>> a=Dseq("aaa", "ttt")
>>> a
Dseq(-3)
aaa
ttt
>>> a.three_prime_end()
('blunt', '')
fill_in(nucleotides: None | str = None) DseqType[source]#

Fill in five prime protruding ends with a DNA polymerase that has only DNA polymerase activity (such as Exo-Klenow [12]). Exo-Klenow is a modified version of the Klenow fragment of E. coli DNA polymerase I, which has been engineered to lack both 3’-5’ proofreading and 5’-3’ exonuclease activities.

The nucleotides argument can be any combination of A, G, C or T. The default is all four nucleotides together.

Parameters:

nucleotides (str)

Examples

>>> from pydna.dseq import Dseq
>>> b=Dseq("caaa", "cttt")
>>> b
Dseq(-5)
caaa
 tttc
>>> b.fill_in()
Dseq(-5)
caaag
gtttc
>>> b.fill_in("g")
Dseq(-5)
caaag
gtttc
>>> b.fill_in("tac")
Dseq(-5)
caaa
 tttc
>>> c=Dseq("aaac", "tttg")
>>> c
Dseq(-5)
 aaac
gttt
>>> c.fill_in()
Dseq(-5)
 aaac
gttt
>>> a=Dseq("aaa", "ttt")
>>> a
Dseq(-3)
aaa
ttt
>>> a.fill_in()
Dseq(-3)
aaa
ttt
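
The behaviour in these examples can be mimicked on the two printed rows. The following is a sketch, not pydna's code: the function name is hypothetical, the rows are given as displayed (upper row 5’ to 3’, lower row 3’ to 5’, padded with leading spaces), and extension stops at the first template base whose complement is not among the supplied nucleotides.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def fill_in_rows(upper: str, lower: str, nucleotides: str = "GATC") -> tuple[str, str]:
    """Extend recessed 3' ends over 5' overhangs; 3' overhangs are left alone."""
    width = max(len(upper), len(lower))
    top = list(upper.ljust(width))
    bottom = list(lower.ljust(width))
    allowed = set(nucleotides.lower())

    # Right side: the Watson 3' end grows rightwards over a Crick 5' overhang.
    i = len(upper.rstrip())
    while i < width and bottom[i] != " ":
        base = bottom[i].translate(COMPLEMENT)
        if base.lower() not in allowed:
            break
        top[i] = base
        i += 1

    # Left side: the Crick 3' end grows leftwards over a Watson 5' overhang.
    j = len(lower) - len(lower.lstrip()) - 1
    while j >= 0 and top[j] != " ":
        base = top[j].translate(COMPLEMENT)
        if base.lower() not in allowed:
            break
        bottom[j] = base
        j -= 1

    return "".join(top).rstrip(), "".join(bottom).rstrip()


filled = fill_in_rows("caaa", " tttc")
```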

References

klenow(nucleotides: None | str = None) DseqType#

Fill in five prime protruding ends with a DNA polymerase that has only DNA polymerase activity (such as Exo-Klenow [13]). Exo-Klenow is a modified version of the Klenow fragment of E. coli DNA polymerase I, which has been engineered to lack both 3’-5’ proofreading and 5’-3’ exonuclease activities.

The nucleotides argument can be any combination of A, G, C or T. The default is all four nucleotides together.

Parameters:

nucleotides (str)

Examples

>>> from pydna.dseq import Dseq
>>> b=Dseq("caaa", "cttt")
>>> b
Dseq(-5)
caaa
 tttc
>>> b.fill_in()
Dseq(-5)
caaag
gtttc
>>> b.fill_in("g")
Dseq(-5)
caaag
gtttc
>>> b.fill_in("tac")
Dseq(-5)
caaa
 tttc
>>> c=Dseq("aaac", "tttg")
>>> c
Dseq(-5)
 aaac
gttt
>>> c.fill_in()
Dseq(-5)
 aaac
gttt
>>> a=Dseq("aaa", "ttt")
>>> a
Dseq(-3)
aaa
ttt
>>> a.fill_in()
Dseq(-3)
aaa
ttt

References

nibble_to_blunt() DseqType[source]#

Simulates treatment with a nuclease that has both 5’-3’ and 3’-5’ single-strand-specific exonuclease activity (such as mung bean nuclease [14]).

Mung bean nuclease is a nuclease enzyme derived from mung bean sprouts that preferentially degrades single-stranded DNA and RNA into 5’-phosphate- and 3’-hydroxyl-containing nucleotides.

Treatment results in blunt DNA, regardless of whether the protruding end is 5’ or 3’.

    ggatcc    ->     gatcc
     ctaggg          ctagg

     ggatcc   ->     ggatc
    tcctag           cctag

>>> from pydna.dseq import Dseq
>>> b=Dseq("caaa", "cttt")
>>> b
Dseq(-5)
caaa
 tttc
>>> b.mung()
Dseq(-3)
aaa
ttt
>>> c=Dseq("aaac", "tttg")
>>> c
Dseq(-5)
 aaac
gttt
>>> c.mung()
Dseq(-3)
aaa
ttt
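
A minimal sketch of the trimming on plain strings, assuming both strands are given 5’-3’ and that ovhg is the stagger at the left side. The function name is hypothetical and the code is not taken from pydna.

```python
def nibble_to_blunt_sketch(watson: str, crick: str, ovhg: int) -> tuple[str, str]:
    """Remove single-stranded overhangs at both ends of a linear duplex."""
    right = len(watson) - len(crick) + ovhg  # stagger at the right side
    if ovhg < 0:
        watson = watson[-ovhg:]   # 5' overhang on watson at the left
    elif ovhg > 0:
        crick = crick[:-ovhg]     # 3' overhang on crick at the left
    if right > 0:
        watson = watson[:-right]  # 3' overhang on watson at the right
    elif right < 0:
        crick = crick[-right:]    # 5' overhang on crick at the right
    return watson, crick


print(nibble_to_blunt_sketch("caaa", "cttt", -1))
print(nibble_to_blunt_sketch("aaac", "tttg", 1))
```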

References

mung() DseqType#

Simulates treatment with a nuclease that has both 5’-3’ and 3’-5’ single-strand-specific exonuclease activity (such as mung bean nuclease [15]).

Mung bean nuclease is a nuclease enzyme derived from mung bean sprouts that preferentially degrades single-stranded DNA and RNA into 5’-phosphate- and 3’-hydroxyl-containing nucleotides.

Treatment results in blunt DNA, regardless of whether the protruding end is 5’ or 3’.

    ggatcc    ->     gatcc
     ctaggg          ctagg

     ggatcc   ->     ggatc
    tcctag           cctag

>>> from pydna.dseq import Dseq
>>> b=Dseq("caaa", "cttt")
>>> b
Dseq(-5)
caaa
 tttc
>>> b.mung()
Dseq(-3)
aaa
ttt
>>> c=Dseq("aaac", "tttg")
>>> c
Dseq(-5)
 aaac
gttt
>>> c.mung()
Dseq(-3)
aaa
ttt

References

T4(nucleotides=None) DseqType[source]#

Fill in 5’ protruding ends and nibble 3’ protruding ends.

This is done using a DNA polymerase providing 3’-5’ nuclease activity such as T4 DNA polymerase. This can be done in presence of any combination of the four nucleotides A, G, C or T.

T4 DNA polymerase is widely used to “polish” DNA ends because of its strong 3’-5’ exonuclease activity. In the absence of dNTPs, it chews back 3′ overhangs to create blunt ends; in the presence of limiting dNTPs, it can fill in 5′ overhangs; and by carefully controlling reaction time, temperature, and nucleotide supply, you can generate defined recessed or blunt termini.

Tuning the nucleotide set can facilitate engineering of partial sticky ends. The default is all four nucleotides together.

      aaagatc-3        aaa      3' ends are always removed.
      |||       --->   |||      A and T needed or the molecule will
3-ctagttt              ttt      degrade completely.

5-gatcaaa              gatcaaaGATC      5' ends are filled in the
      |||       --->   |||||||||||      presence of GATC
      tttctag-5        CTAGtttctag

5-gatcaaa              gatcaaaGAT       5' ends are partially filled in the
      |||       --->    |||||||||       presence of GAT to produce a 1 nt
      tttctag-5         TAGtttctag      5' overhang

5-gatcaaa              gatcaaaGA       5' ends are partially filled in the
      |||       --->     |||||||       presence of GA to produce a 2 nt
      tttctag-5          AGtttctag     5' overhang

5-gatcaaa              gatcaaaG        5' ends are partially filled in the
      |||       --->      |||||        presence of G to produce a 3 nt
      tttctag-5           Gtttctag     5' overhang
Parameters:

nucleotides (str)

Examples

>>> from pydna.dseq import Dseq
>>> a = Dseq.from_representation(
... '''
... gatcaaa
...     tttctag
... ''')
>>> a
Dseq(-11)
gatcaaa
    tttctag
>>> a.T4()
Dseq(-11)
gatcaaagatc
ctagtttctag
>>> a.T4("GAT")
Dseq(-11)
gatcaaagat
 tagtttctag
>>> a.T4("GA")
Dseq(-11)
gatcaaaga
  agtttctag
>>> a.T4("G")
Dseq(-11)
gatcaaag
   gtttctag
t4(nucleotides=None) DseqType#

Fill in 5’ protruding ends and nibble 3’ protruding ends.

This is done using a DNA polymerase providing 3’-5’ nuclease activity such as T4 DNA polymerase. This can be done in presence of any combination of the four nucleotides A, G, C or T.

T4 DNA polymerase is widely used to “polish” DNA ends because of its strong 3’-5’ exonuclease activity. In the absence of dNTPs, it chews back 3′ overhangs to create blunt ends; in the presence of limiting dNTPs, it can fill in 5′ overhangs; and by carefully controlling reaction time, temperature, and nucleotide supply, you can generate defined recessed or blunt termini.

Tuning the nucleotide set can facilitate engineering of partial sticky ends. The default is all four nucleotides together.

      aaagatc-3        aaa      3' ends are always removed.
      |||       --->   |||      A and T needed or the molecule will
3-ctagttt              ttt      degrade completely.

5-gatcaaa              gatcaaaGATC      5' ends are filled in the
      |||       --->   |||||||||||      presence of GATC
      tttctag-5        CTAGtttctag

5-gatcaaa              gatcaaaGAT       5' ends are partially filled in the
      |||       --->    |||||||||       presence of GAT to produce a 1 nt
      tttctag-5         TAGtttctag      5' overhang

5-gatcaaa              gatcaaaGA       5' ends are partially filled in the
      |||       --->     |||||||       presence of GA to produce a 2 nt
      tttctag-5          AGtttctag     5' overhang

5-gatcaaa              gatcaaaG        5' ends are partially filled in the
      |||       --->      |||||        presence of G to produce a 3 nt
      tttctag-5           Gtttctag     5' overhang
Parameters:

nucleotides (str)

Examples

>>> from pydna.dseq import Dseq
>>> a = Dseq.from_representation(
... '''
... gatcaaa
...     tttctag
... ''')
>>> a
Dseq(-11)
gatcaaa
    tttctag
>>> a.T4()
Dseq(-11)
gatcaaagatc
ctagtttctag
>>> a.T4("GAT")
Dseq(-11)
gatcaaagat
 tagtttctag
>>> a.T4("GA")
Dseq(-11)
gatcaaaga
  agtttctag
>>> a.T4("G")
Dseq(-11)
gatcaaag
   gtttctag
nibble_five_prime_left(n: int = 1) DseqType[source]#

5’ => 3’ resection at the left side (start) of the molecule.

The argument n indicates the number of nucleotides that are to be removed. The outcome depends on the structure of the molecule. See the two examples below:

The figure below indicates a recess of length two from a blunt DNA fragment. The resulting DNA fragment has a 3’ protruding single strand.

gatc           tc
||||   -->     ||
ctag         ctag

The figure below indicates a recess of length two from a DNA fragment with a 5’ sticky end resulting in a blunt sequence.

ttgatc         gatc
  ||||   -->   ||||
  ctag         ctag
>>> from pydna.dseq import Dseq
>>> ds = Dseq("gatc")
>>> ds
Dseq(-4)
gatc
ctag
>>> ds.nibble_five_prime_left(2)
Dseq(-4)
  tc
ctag
>>> ds.nibble_five_prime_left(3)
Dseq(-4)
   c
ctag
>>> ds.nibble_five_prime_left(4)
Dseq(-4)
||||
ctag
>>> ds = Dseq.from_representation(
... '''
... GGgatc
...   ctag
... ''')
>>> ds
Dseq(-6)
GGgatc
  ctag
>>> ds.nibble_five_prime_left(2)
Dseq(-4)
gatc
ctag
Parameters:

n (int, optional) – The default is 1. This is the number of nucleotides removed.

Returns:

A new Dseq object with n nucleotides removed from the 5’ end at the left side.

Return type:

DseqType
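
The effect of the resection on the left stagger can be sketched with plain strings. Both helpers below are hypothetical and not part of pydna. They assume 5’-3’ strands, an ovhg that gives the stagger at the left side, and an n that does not exceed the length of the watson strand.

```python
def render(watson: str, crick: str, ovhg: int) -> str:
    """Two-line text figure; the crick strand is shown 3'-5', that is, reversed."""
    top = " " * max(ovhg, 0) + watson
    bottom = " " * max(-ovhg, 0) + crick[::-1]
    return top + "\n" + bottom


def nibble_five_prime_left_sketch(watson: str, crick: str, ovhg: int, n: int = 1):
    # Removing n bases from the 5' end of watson moves its start n positions right.
    return watson[n:], crick, ovhg + n


print(render(*nibble_five_prime_left_sketch("gatc", "gatc", 0, 2)))
print(render(*nibble_five_prime_left_sketch("GGgatc", "gatc", -2, 2)))
```

The first call reproduces the blunt-to-3’-overhang figure and the second the 5’-sticky-to-blunt figure.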

nibble_five_prime_right(n: int = 1) DseqType[source]#

5’ => 3’ resection at the right side (end) of the molecule.

The argument n indicates the number of nucleotides that are to be removed. The outcome depends on the structure of the molecule. See the two examples below:

The figure below indicates a recess of length two from a blunt DNA fragment. The resulting DNA fragment has a 3’ protruding single strand.

gatc         gatc
||||   -->   ||
ctag         ct

The figure below indicates a recess of length two from a DNA fragment with a 5’ sticky end resulting in a blunt sequence.

gatc         gatc
||||   -->   ||||
ctagtt       ctag
>>> from pydna.dseq import Dseq
>>> ds = Dseq("gatc")
>>> ds
Dseq(-4)
gatc
ctag
>>> ds.nibble_five_prime_right(2)
Dseq(-4)
gatc
ct
>>> ds.nibble_five_prime_right(3)
Dseq(-4)
gatc
c
>>> ds.nibble_five_prime_right(4)
Dseq(-4)
gatc
||||
>>> ds = Dseq.from_representation(
... '''
... gatc
... ctagGG
... ''')
>>> ds.nibble_five_prime_right(2)
Dseq(-4)
gatc
ctag
exo1_front(n: int = 1) DseqType#

5’ => 3’ resection at the left side (start) of the molecule.

The argument n indicates the number of nucleotides that are to be removed. The outcome depends on the structure of the molecule. See the two examples below:

The figure below indicates a recess of length two from a blunt DNA fragment. The resulting DNA fragment has a 3’ protruding single strand.

gatc           tc
||||   -->     ||
ctag         ctag

The figure below indicates a recess of length two from a DNA fragment with a 5’ sticky end resulting in a blunt sequence.

ttgatc         gatc
  ||||   -->   ||||
  ctag         ctag
>>> from pydna.dseq import Dseq
>>> ds = Dseq("gatc")
>>> ds
Dseq(-4)
gatc
ctag
>>> ds.nibble_five_prime_left(2)
Dseq(-4)
  tc
ctag
>>> ds.nibble_five_prime_left(3)
Dseq(-4)
   c
ctag
>>> ds.nibble_five_prime_left(4)
Dseq(-4)
||||
ctag
>>> ds = Dseq.from_representation(
... '''
... GGgatc
...   ctag
... ''')
>>> ds
Dseq(-6)
GGgatc
  ctag
>>> ds.nibble_five_prime_left(2)
Dseq(-4)
gatc
ctag
Parameters:

n (int, optional) – The default is 1. This is the number of nucleotides removed.

Returns:

A new Dseq object with n nucleotides removed from the 5’ end at the left side.

Return type:

DseqType

exo1_end(n: int = 1) DseqType#

5’ => 3’ resection at the right side (end) of the molecule.

The argument n indicates the number of nucleotides that are to be removed. The outcome depends on the structure of the molecule. See the two examples below:

The figure below indicates a recess of length two from a blunt DNA fragment. The resulting DNA fragment has a 3’ protruding single strand.

gatc         gatc
||||   -->   ||
ctag         ct

The figure below indicates a recess of length two from a DNA fragment with a 5’ sticky end resulting in a blunt sequence.

gatc         gatc
||||   -->   ||||
ctagtt       ctag
>>> from pydna.dseq import Dseq
>>> ds = Dseq("gatc")
>>> ds
Dseq(-4)
gatc
ctag
>>> ds.nibble_five_prime_right(2)
Dseq(-4)
gatc
ct
>>> ds.nibble_five_prime_right(3)
Dseq(-4)
gatc
c
>>> ds.nibble_five_prime_right(4)
Dseq(-4)
gatc
||||
>>> ds = Dseq.from_representation(
... '''
... gatc
... ctagGG
... ''')
>>> ds.nibble_five_prime_right(2)
Dseq(-4)
gatc
ctag
nibble_three_prime_left(n=1) DseqType[source]#

3’ => 5’ resection at the left side (beginning) of the molecule.

The argument n indicates the number of nucleotides that are to be removed. The outcome depends on the structure of the molecule. See the two examples below:

The figure below indicates a recess of length two from a blunt DNA fragment. The resulting DNA fragment has a 5’ protruding single strand.

gatc         gatc
||||   -->     ||
ctag           ag

The figure below indicates a recess of length two from a DNA fragment with a 3’ sticky end resulting in a blunt sequence.

  gatc         gatc
  ||||   -->   ||||
ttctag         ctag
>>> from pydna.dseq import Dseq
>>> ds = Dseq("gatc")
>>> ds
Dseq(-4)
gatc
ctag
>>> ds.nibble_three_prime_left(2)
Dseq(-4)
gatc
  ag
>>> ds.nibble_three_prime_left(3)
Dseq(-4)
gatc
   g
>>> ds.nibble_three_prime_left(4)
Dseq(-4)
gatc
||||
>>> ds = Dseq.from_representation(
... '''
...   gatc
... CCctag
... ''')
>>> ds
Dseq(-6)
  gatc
CCctag
>>> ds.nibble_three_prime_left(2)
Dseq(-4)
gatc
ctag
nibble_three_prime_right(n=1) DseqType[source]#

3’ => 5’ resection at the right side (end) of the molecule.

The argument n indicates the number of nucleotides that are to be removed. The outcome depends on the structure of the molecule. See the two examples below:

The figure below indicates a recess of length two from a blunt DNA fragment. The resulting DNA fragment has a 5’ protruding single strand.

gatc         ga
||||   -->   ||
ctag         ctag

The figure below indicates a recess of length two from a DNA fragment with a 3’ sticky end resulting in a blunt sequence.

gatctt       gatc
||||   -->   ||||
ctag         ctag
>>> from pydna.dseq import Dseq
>>> ds = Dseq("gatc")
>>> ds
Dseq(-4)
gatc
ctag
>>> ds.nibble_three_prime_right(2)
Dseq(-4)
ga
ctag
>>> ds.nibble_three_prime_right(3)
Dseq(-4)
g
ctag
>>> ds.nibble_three_prime_right(4)
Dseq(-4)
||||
ctag
>>> ds = Dseq.from_representation(
... '''
... gatcCC
... ctag
... ''')
>>> ds.nibble_three_prime_right(2)
Dseq(-4)
gatc
ctag
no_cutters(batch: RestrictionBatch | None = None) RestrictionBatch[source]#

Enzymes in a RestrictionBatch not cutting sequence.

unique_cutters(batch: RestrictionBatch | None = None) RestrictionBatch[source]#

Enzymes in a RestrictionBatch cutting sequence once.

once_cutters(batch: RestrictionBatch | None = None) RestrictionBatch#

Enzymes in a RestrictionBatch cutting sequence once.

twice_cutters(batch: RestrictionBatch | None = None) RestrictionBatch[source]#

Enzymes in a RestrictionBatch cutting sequence twice.

n_cutters(n=3, batch: RestrictionBatch | None = None) RestrictionBatch[source]#

Enzymes in a RestrictionBatch cutting n times.

cutters(batch: RestrictionBatch | None = None) RestrictionBatch[source]#

Enzymes in a RestrictionBatch cutting sequence at least once.

seguid() str[source]#

SEGUID checksum for the sequence.

isblunt() bool[source]#

isblunt.

Return True if the Dseq is linear and blunt, and False if it is staggered or circular.

Examples

>>> from pydna.dseq import Dseq
>>> a=Dseq("gat")
>>> a
Dseq(-3)
gat
cta
>>> a.isblunt()
True
>>> a=Dseq("gat", "atcg")
>>> a
Dseq(-4)
 gat
gcta
>>> a.isblunt()
False
>>> a=Dseq("gat", "gatc")
>>> a
Dseq(-4)
gat
ctag
>>> a.isblunt()
False
>>> a=Dseq("gat", circular=True)
>>> a
Dseq(o3)
gat
cta
>>> a.isblunt()
False
terminal_transferase(nucleotides: str = 'a') DseqType[source]#

Terminal deoxynucleotidyl transferase (TdT) is a template-independent DNA polymerase that adds nucleotides to the 3′-OH ends of DNA, typically single-stranded or recessed 3′ ends. In cloning, it’s classically used to create homopolymer tails (e.g. poly-dG on a vector and poly-dC on an insert) so that fragments can anneal via complementary overhangs (“tailing” cloning).

This activity is also present in some DNA polymerases, such as Taq polymerase. This property is used in the popular T/A cloning protocol ([16]).

gct          gcta
|||   -->    |||
cga         acga
>>> from pydna.dseq import Dseq
>>> a = Dseq("gct")
>>> a
Dseq(-3)
gct
cga
>>> a.terminal_transferase()
Dseq(-5)
 gcta
acga
>>> a.terminal_transferase("G")
Dseq(-5)
 gctG
Gcga
Parameters:

nucleotides (str, optional) – The default is “a”.

Returns:

A new Dseq object with the given nucleotides added to both 3’ ends.

Return type:

DseqType
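
A minimal sketch of the tailing on plain strings. The function name is hypothetical, and it assumes 5’-3’ strands with ovhg as the stagger at the left side; it is not pydna’s implementation.

```python
def terminal_transferase_sketch(watson: str, crick: str, ovhg: int, nucleotides: str = "a"):
    # Both strands grow at their 3' end, which is the end of each 5'-3' string.
    # The 3' end of crick lies at the left side, so the left stagger grows as well.
    return watson + nucleotides, crick + nucleotides, ovhg + len(nucleotides)


# Blunt "gct" duplex; the crick strand of gct is agc when written 5'-3'.
print(terminal_transferase_sketch("gct", "agc", 0))
```

Reading the returned crick strand backwards (acga) gives the lower line of the figure above.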

References

user() DseqType[source]#

USER Enzyme treatment.

USER Enzyme is a mixture of Uracil DNA glycosylase (UDG) and the DNA glycosylase-lyase Endonuclease VIII.

UDG catalyses the excision of a uracil base, forming an abasic or apyrimidinic site (AP site). Endonuclease VIII removes the AP site creating a DNA gap.

tagaagtaggUat          tagaagtagg at
|||||||||||||  --->    |||||||||| ||
atcUtcatccata          atc tcatccata
>>> from pydna.dseq import Dseq
>>> a = Dseq("tagaagtaggUat", "atcUtcatccata"[::-1], 0)
>>> a
Dseq(-13)
tagaagtaggUat
atcutcatccAta
>>> a.user()
Dseq(-13)
tagaagtagg at
atc tcatccAta
Returns:

DNA fragment with uracil bases removed.

Return type:

DseqType

cut(*enzymes: EnzymesType) Tuple[DseqType, ...][source]#

Returns a tuple of linear Dseq fragments produced in the digestion. If there are no cuts, an empty tuple is returned.

Parameters:

enzymes (enzyme object or iterable of such objects) – One or more Bio.Restriction enzyme objects, or an iterable of such objects.

Returns:

frags – tuple of Dseq objects formed by the digestion

Return type:

tuple

Examples

>>> from pydna.dseq import Dseq
>>> seq=Dseq("ggatccnnngaattc")
>>> seq
Dseq(-15)
ggatccnnngaattc
cctaggnnncttaag
>>> from Bio.Restriction import BamHI,EcoRI
>>> type(seq.cut(BamHI))
<class 'tuple'>
>>> for frag in seq.cut(BamHI): print(repr(frag))
Dseq(-5)
g
cctag
Dseq(-14)
gatccnnngaattc
    gnnncttaag
>>> seq.cut(EcoRI, BamHI) ==  seq.cut(BamHI, EcoRI)
True
>>> a,b,c = seq.cut(EcoRI, BamHI)
>>> a+b+c
Dseq(-15)
ggatccnnngaattc
cctaggnnncttaag
cutsite_is_valid(cutsite: Tuple[Tuple[int, int], AbstractCut | None | _cas]) bool[source]#

Check if a cutsite is valid.

A cutsite is a nested 2-tuple with this form:

((cut_watson, ovhg), enz), for example ((396, -4), EcoRI)

The cut_watson (positive integer) is the cut position of the sequence as for example returned by the Bio.Restriction module.

The ovhg (overhang, positive or negative integer or 0) has the same meaning as for restriction enzymes in the Bio.Restriction module and for pydna.dseq.Dseq objects (see docstring for this module and example below)

Enzyme can be None.

Enzyme overhang

EcoRI  -4     --GAATTC--        --G       AATTC--
                ||||||     -->    |           |
              --CTTAAG--        --CTTAA       G--

KpnI    4     --GGTACC--        --GGTAC       C--
                ||||||     -->    |           |
              --CCATGG--        --C       CATGG--

SmaI    0     --CCCGGG--        --CCC       GGG--
                ||||||     -->    |||       |||
              --GGGCCC--        --GGG       CCC--
>>> from Bio.Restriction import EcoRI, KpnI, SmaI
>>> EcoRI.ovhg
-4
>>> KpnI.ovhg
4
>>> SmaI.ovhg
0

Returns False if:

  • Cut positions fall outside the sequence (could be moved to Biopython)

TODO: example

  • Overhang is not double stranded

TODO: example

  • Recognition site is not double stranded or is outside the sequence

TODO: example

  • For enzymes that cut twice, it checks that at least one possibility is valid

TODO: example

Parameters:

cutsite (CutSiteType) – A cutsite of the form ((cut_watson, ovhg), enz).

Returns:

True if cutsite can cut the DNA fragment.

Return type:

bool

get_cutsites(*enzymes: EnzymesType) List[Tuple[Tuple[int, int], AbstractCut | None | _cas]][source]#

Returns a list of cutsites, represented as ((cut_watson, ovhg), enz):

  • cut_watson is a positive integer contained in [0,len(seq)), where seq is the sequence that will be cut. It represents the position of the cut on the watson strand, using the full sequence as a reference. By “full sequence” I mean the one you would get from str(Dseq).

  • ovhg is the overhang left after the cut. It has the same meaning as ovhg in the Bio.Restriction enzyme objects, or pydna’s Dseq property.

  • enz is the enzyme object. It’s not necessary to perform the cut, but can be used to keep track of which enzyme was used.

Cuts are only returned if the recognition site and overhang are on the double-strand part of the sequence.

Parameters:

enzymes (Union[RestrictionBatch,list[_AbstractCut]])

Return type:

list[tuple[tuple[int,int], _AbstractCut]]

Examples

>>> from Bio.Restriction import EcoRI
>>> from pydna.dseq import Dseq
>>> seq = Dseq('AAGAATTCAAGAATTC')
>>> seq.get_cutsites(EcoRI)
[((3, -4), EcoRI), ((11, -4), EcoRI)]

cut_watson is defined with respect to the “full sequence”, not the watson strand:

>>> dseq = Dseq.from_full_sequence_and_overhangs('aaGAATTCaa', 1, 0)
>>> dseq
Dseq(-10)
 aGAATTCaa
ttCTTAAGtt
>>> dseq.get_cutsites([EcoRI])
[((3, -4), EcoRI)]

Cuts are only returned if the recognition site and overhang are on the double-strand part of the sequence.

>>> Dseq('GAATTC').get_cutsites([EcoRI])
[((1, -4), EcoRI)]
>>> Dseq.from_full_sequence_and_overhangs('GAATTC', -1, 0).get_cutsites([EcoRI])
[]
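
How cut_watson relates to the recognition site can be sketched without Bio.Restriction. The helper below is hypothetical. Its defaults describe EcoRI (G^AATTC, ovhg -4), it scans only the watson strand of a fully double-stranded linear sequence, and it does not perform the double-strand check described above.

```python
def find_cutsites_sketch(seq: str, site: str = "GAATTC", cut_offset: int = 1, ovhg: int = -4):
    """Return (cut_watson, ovhg) pairs for every occurrence of site in seq."""
    seq, site = seq.upper(), site.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        # The watson cut lies cut_offset bases into the recognition site.
        cuts.append((start + cut_offset, ovhg))
        start = seq.find(site, start + 1)
    return cuts


print(find_cutsites_sketch("AAGAATTCAAGAATTC"))
```

The positions agree with the first components of the get_cutsites results shown above.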
left_end_position() Tuple[int, int][source]#

The index in the full sequence of the watson and crick start positions.

full sequence (str(self)) for all three cases is AAA

AAA              AA               AAA
 TT             TTT               TTT
Returns (0, 1)  Returns (1, 0)    Returns (0, 0)
right_end_position() Tuple[int, int][source]#

The index in the full sequence of the watson and crick end positions.

full sequence (str(self)) for all three cases is AAA

AAA               AA                   AAA
TT                TTT                  TTT
Returns (3, 2)    Returns (2, 3)       Returns (3, 3)
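
Both end positions follow from the strand lengths and the left stagger. The sketch below uses hypothetical helper names and assumes the sign convention in which a positive ovhg means that the watson strand starts to the right of the crick strand.

```python
def left_end_position_sketch(ovhg: int) -> tuple[int, int]:
    """Start index of (watson, crick) in the full sequence."""
    return max(ovhg, 0), max(-ovhg, 0)


def right_end_position_sketch(watson: str, crick: str, ovhg: int) -> tuple[int, int]:
    """End index of (watson, crick) in the full sequence."""
    watson_start, crick_start = left_end_position_sketch(ovhg)
    return watson_start + len(watson), crick_start + len(crick)


print(left_end_position_sketch(-1))
print(right_end_position_sketch("AAA", "TT", 0))
```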

get_ss_meltsites(length: int) tuple[int, int][source]#

Single stranded DNA melt sites

Two lists of 2-tuples of integers are returned. Each tuple (from, to) contains the start and end positions of a single stranded region shorter than or equal to length.

In the example below, the middle 2 nt part is released from the molecule.

tagaa ta gtatg
||||| || |||||  -->   [(6,8)], []
atcttcatccatac

tagaagtaggtatg
||||| || |||||  -->   [], [(6,8)]
atctt at catac

The output of this method is used in the melt_ss_dna method in order to determine the start and end positions of single stranded regions.

See get_ds_meltsites for melting ds sequences.

Examples

>>> from pydna.dseq import Dseq
>>> ds = Dseq("tagaaqtaqgtatg")
>>> ds
Dseq(-14)
tagaa ta gtatg
atcttcatccatac
>>> cutsites = ds.get_ss_meltsites(2)
>>> cutsites
([(6, 8)], [])
>>> ds[6:8]
Dseq(-2)
ta
at
>>> ds = Dseq("tagaaptapgtatg")
>>> ds
Dseq(-14)
tagaagtaggtatg
atctt at catac
>>> cutsites = ds.get_ss_meltsites(2)
>>> cutsites
([], [(6, 8)])
get_ds_meltsites(length: int) List[Tuple[Tuple[int, int], AbstractCut | None | _cas]][source]#

Double stranded DNA melt sites

DNA molecules can fall apart by melting if they have internal single stranded regions. In the example below, the molecule has two gaps on opposite sides, two nucleotides apart, which means that it hangs together by two basepairs.

This molecule can melt into two separate 8 bp double stranded molecules, each with 3 nt 3’ overhangs as depicted below.

tagaagta gtatg        tagaagta          gtatg
||||| || |||||  -->   |||||             |||||
atctt atccatac        atctt          atccatac

A list of 2-tuples is returned. Each tuple ((cut_watson, ovhg), None) contains the cut position and the overhang value in the same format as returned by the get_cutsites method for restriction enzymes.

Note that this function deals with melting that results in two double stranded DNA molecules.

See get_ss_meltsites for melting of single stranded regions from molecules.

Examples

>>> from pydna.dseq import Dseq
>>> ds = Dseq("tagaaptaqgtatg")
>>> ds
Dseq(-14)
tagaagta gtatg
atctt atccatac
>>> cutsite = ds.get_ds_meltsites(2)
>>> cutsite
[((8, 2), None)]
cast_to_ds_right()[source]#

NNNN           NNNNGATC
||||     -->   ||||||||
NNNNCTAG       NNNNCTAG

NNNNGATC       NNNNGATC
||||     -->   ||||||||
NNNN           NNNNCTAG

cast_to_ds()[source]#

Sequentially calls cast_to_ds_left and cast_to_ds_right.

cast_to_ds_left()[source]#

GATCNNNN       GATCNNNN
    ||||  -->  ||||||||
    NNNN       CTAGNNNN

    NNNN       GATCNNNN
    ||||  -->  ||||||||
CTAGNNNN       CTAGNNNN

get_cut_parameters(cut: Tuple[Tuple[int, int], AbstractCut | None | _cas] | None, is_left: bool) Tuple[int, int, int][source]#

For a given cut expressed as ((cut_watson, ovhg), enz), returns a tuple (cut_watson, cut_crick, ovhg).

  • cut_watson: see get_cutsites docs

  • cut_crick: equivalent of cut_watson in the crick strand

  • ovhg: see get_cutsites docs

The cut can be None if it represents the left or right end of the sequence. Then it will return the position of the watson and crick ends with respect to the “full sequence”. The is_left parameter is only used in this case.
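
For a cut that is not None on a linear molecule, the returned triple can be sketched as below. The helper is hypothetical, and the relation cut_crick = cut_watson - ovhg is an assumption; it is consistent with the EcoRI geometry and with the apply_cut example elsewhere in this module.

```python
def cut_parameters_sketch(cut):
    """Unpack ((cut_watson, ovhg), enz) into (cut_watson, cut_crick, ovhg)."""
    (cut_watson, ovhg), _enzyme = cut
    cut_crick = cut_watson - ovhg  # assumed relation, linear molecules only
    return cut_watson, cut_crick, ovhg


COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

seq = "aaGAATTCaaGAATTCaa"
cut_watson, cut_crick, _ = cut_parameters_sketch(((3, -4), "EcoRI"))
top = seq[:cut_watson]                           # watson part left of the cut
bottom = seq[:cut_crick].translate(COMPLEMENT)   # crick part, written 3'-5'
print(top)
print(bottom)
```

The two printed lines correspond to the leftmost EcoRI fragment, with a four nucleotide 5’ overhang on the lower strand.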

melt(length)[source]#

TBD

Parameters:

length (TYPE) – DESCRIPTION.

Returns:

DESCRIPTION.

Return type:

TYPE

melt_ss_dna(length) tuple[Dseq, list[Dseq]][source]#

Melt to separate single stranded DNA

Single stranded DNA molecules shorter than or equal to length are shed from a double stranded DNA molecule without affecting the length of the remaining molecule.

In the examples below, the middle 2 nt part is released from the molecule.

tagaa ta gtatg        tagaa    gtatg          ta
||||| || |||||  -->   |||||    |||||     +    ||
atcttcatccatac        atcttcatccatac

tagaagtaggtatg        tagaagtaggtatg
||||| || |||||  -->   |||||    |||||     +    ||
atctt at catac        atctt    catac          at

Examples

>>> from pydna.dseq import Dseq
>>> ds = Dseq("tagaaqtaqgtatg")
>>> ds
Dseq(-14)
tagaa ta gtatg
atcttcatccatac
>>> new, strands  = ds.melt_ss_dna(2)
>>> new
Dseq(-14)
tagaa    gtatg
atcttcatccatac
>>> strands[0]
Dseq(-2)
ta
||
>>> ds = Dseq("tagaaptapgtatg")
>>> ds
Dseq(-14)
tagaagtaggtatg
atctt at catac
>>> new, strands = ds.melt_ss_dna(2)
>>> new
Dseq(-14)
tagaagtaggtatg
atctt    catac
>>> strands[0]
Dseq(-2)
||
at
shed_ss_dna(watson_cutpairs: list[tuple[int, int]] = None, crick_cutpairs: list[tuple[int, int]] = None)[source]#

Separate parts of one of the DNA strands

Examples

>>> from pydna.dseq import Dseq
>>> ds = Dseq("tagaagtaggtatg")
>>> ds
Dseq(-14)
tagaagtaggtatg
atcttcatccatac
>>> new, strands = ds.shed_ss_dna([(6, 8)],[])
>>> new
Dseq(-14)
tagaag  ggtatg
atcttcatccatac
>>> strands[0]
Dseq(-2)
ta
||
>>> new, strands = ds.shed_ss_dna([],[(6, 8)])
>>> new
Dseq(-14)
tagaagtaggtatg
atcttc  ccatac
>>> strands[0]
Dseq(-2)
||
at
>>> ds = Dseq("tagaagtaggtatg")
>>> new, (strand1, strand2) = ds.shed_ss_dna([(6, 8), (9, 11)],[])
>>> new
Dseq(-14)
tagaag  g  atg
atcttcatccatac
>>> strand1
Dseq(-2)
ta
||
>>> strand2
Dseq(-2)
gt
||
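
The bookkeeping for one strand can be sketched with a plain string, where shed stretches are replaced by spaces so that all positions are preserved. The helper name is hypothetical and the code is not taken from pydna.

```python
def shed_sketch(strand: str, cutpairs: list[tuple[int, int]]) -> tuple[str, list[str]]:
    """Blank out the given (start, end) stretches and return them separately."""
    remaining = list(strand)
    shed = []
    for start, end in cutpairs:
        shed.append(strand[start:end])
        remaining[start:end] = " " * (end - start)  # keep the strand length unchanged
    return "".join(remaining), shed


new, strands = shed_sketch("tagaagtaggtatg", [(6, 8), (9, 11)])
print(repr(new))
print(strands)
```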
apply_cut(left_cut: Tuple[Tuple[int, int], AbstractCut | None | _cas], right_cut: Tuple[Tuple[int, int], AbstractCut | None | _cas]) Dseq[source]#

Extracts a subfragment of the sequence between two cuts.

For more detail see the documentation of get_cutsite_pairs.

Parameters:
Return type:

Dseq

Examples

>>> from Bio.Restriction import EcoRI
>>> from pydna.dseq import Dseq
>>> dseq = Dseq('aaGAATTCaaGAATTCaa')
>>> cutsites = dseq.get_cutsites([EcoRI])
>>> cutsites
[((3, -4), EcoRI), ((11, -4), EcoRI)]
>>> p1, p2, p3 = dseq.get_cutsite_pairs(cutsites)
>>> p1
(None, ((3, -4), EcoRI))
>>> dseq.apply_cut(*p1)
Dseq(-7)
aaG
ttCTTAA
>>> p2
(((3, -4), EcoRI), ((11, -4), EcoRI))
>>> dseq.apply_cut(*p2)
Dseq(-12)
AATTCaaG
    GttCTTAA
>>> p3
(((11, -4), EcoRI), None)
>>> dseq.apply_cut(*p3)
Dseq(-7)
AATTCaa
    Gtt
>>> dseq = Dseq('TTCaaGAA', circular=True)
>>> cutsites = dseq.get_cutsites([EcoRI])
>>> cutsites
[((6, -4), EcoRI)]
>>> pair = dseq.get_cutsite_pairs(cutsites)[0]
>>> pair
(((6, -4), EcoRI), ((6, -4), EcoRI))
>>> dseq.apply_cut(*pair)
Dseq(-12)
AATTCaaG
    GttCTTAA
get_cutsite_pairs(cutsites: List[Tuple[Tuple[int, int], AbstractCut | None | _cas]]) List[Tuple[None | Tuple[Tuple[int, int], AbstractCut | None | _cas], None | Tuple[Tuple[int, int], AbstractCut | None | _cas]]][source]#

Returns pairs of cutsites that render the edges of the resulting fragments.

A fragment produced by restriction is represented by a tuple of length 2 that may contain cutsites or None:

  • Two cutsites: represents the extraction of a fragment between those two cutsites, in that orientation. To represent the opening of a circular molecule with a single cutsite, we put the same cutsite twice.

  • None, cutsite: represents the extraction of a fragment between the left edge of linear sequence and the cutsite.

  • cutsite, None: represents the extraction of a fragment between the cutsite and the right edge of a linear sequence.

Parameters:

cutsites (list[tuple[tuple[int,int], _AbstractCut]])

Return type:

list[tuple[tuple[tuple[int,int], _AbstractCut]|None, tuple[tuple[int,int], _AbstractCut]|None]]

Examples

>>> from Bio.Restriction import EcoRI
>>> from pydna.dseq import Dseq
>>> dseq = Dseq('aaGAATTCaaGAATTCaa')
>>> cutsites = dseq.get_cutsites([EcoRI])
>>> cutsites
[((3, -4), EcoRI), ((11, -4), EcoRI)]
>>> dseq.get_cutsite_pairs(cutsites)
[(None, ((3, -4), EcoRI)), (((3, -4), EcoRI), ((11, -4), EcoRI)), (((11, -4), EcoRI), None)]
>>> dseq = Dseq('TTCaaGAA', circular=True)
>>> cutsites = dseq.get_cutsites([EcoRI])
>>> cutsites
[((6, -4), EcoRI)]
>>> dseq.get_cutsite_pairs(cutsites)
[(((6, -4), EcoRI), ((6, -4), EcoRI))]
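
The pairing rule described above can be sketched as follows. The function is hypothetical and takes the topology as an explicit argument; returning an empty list when there are no cutsites is an assumption of this sketch.

```python
def cutsite_pairs_sketch(cutsites, circular=False):
    """Pair consecutive cutsites; None marks the edge of a linear sequence."""
    if not cutsites:
        return []
    ordered = sorted(cutsites, key=lambda cut: cut[0])
    if circular:
        # Close the circle: the last fragment ends at the first cutsite.
        ordered = ordered + [ordered[0]]
    else:
        ordered = [None] + ordered + [None]
    return list(zip(ordered, ordered[1:]))


cuts = [((3, -4), "EcoRI"), ((11, -4), "EcoRI")]
for pair in cutsite_pairs_sketch(cuts):
    print(pair)
```

Enzyme objects are replaced by strings here so that the sketch runs without Bio.Restriction.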
get_parts()[source]#

Returns a DseqParts instance containing the parts (strings) of a dsDNA sequence. DseqParts instance field names:

 "sticky_left5"
 |
 |      "sticky_right5"
 |      |
---    ---
GGGATCC
   TAGGTCA
   ----
     |
     "middle"

 "sticky_left3"
 |
 |      "sticky_right3"
 |      |
---    ---
   ATCCAGT
CCCTAGG
   ----
     |
     "middle"

   "single_watson" (only an upper strand)
   |
-------
ATCCAGT
|||||||

   "single_crick" (only a lower strand)
   |
-------

|||||||
CCCTAGG

Up to seven groups (0..6) are captured, but some are mutually exclusive which means that one of them is an empty string:

0 or 1, not both, a DNA fragment has either 5’ or 3’ sticky end.

2 or 5 or 6, a DNA molecule has a ds region or is single stranded.

3 or 4, not both, either 5’ or 3’ sticky end.

Note that internal single stranded regions are not identified and will be contained in the middle part if they are present.

Examples

>>> from pydna.dseq import Dseq
>>> ds = Dseq("PPPATCFQZ")
>>> ds
Dseq(-9)
GGGATC
   TAGTCA
>>> parts = ds.get_parts()
>>> parts
DseqParts(sticky_left5='PPP', sticky_left3='', middle='ATC', sticky_right3='', sticky_right5='FQZ', single_watson='', single_crick='')
>>> Dseq(parts.sticky_left5)
Dseq(-3)
GGG
|||
>>> Dseq(parts.middle)
Dseq(-3)
ATC
TAG
>>> Dseq(parts.sticky_right5)
Dseq(-3)
|||
TCA
Parameters:

datastring (str) – A string with dscode.

Returns:

Seven string fields describing the DNA molecule: fragment(sticky_left5=’’, sticky_left3=’’, middle=’’, sticky_right3=’’, sticky_right5=’’, single_watson=’’, single_crick=’’)

Return type:

namedtuple

pydna.dseqrecord module#

This module provides the Dseqrecord class, for handling double stranded DNA sequences. The Dseqrecord holds sequence information in the form of a pydna.dseq.Dseq object. The Dseq and Dseqrecord classes are subclasses of Biopython’s Seq and SeqRecord classes, respectively.

The Dseq and Dseqrecord classes support the notion of circular and linear DNA topology.

class pydna.dseqrecord.Dseqrecord(record, *args, circular=None, n=5e-14, source=None, **kwargs)[source]#

Bases: SeqRecord

Dseqrecord is a double stranded version of the Biopython SeqRecord [17] class. The Dseqrecord object holds a Dseq object describing the sequence. Additionally, Dseqrecord holds meta information about the sequence in the form of a list of SeqFeatures, in the same way as the SeqRecord does.

The Dseqrecord can be initialized with a string, Seq, Dseq, SeqRecord or another Dseqrecord. The sequence information will be stored in a Dseq object in all cases.

Dseqrecord objects can be read or parsed from sequences in FASTA, EMBL or Genbank formats. See the pydna.readers and pydna.parsers modules for further information.

There is a short representation associated with the Dseqrecord. Dseqrecord(-3) represents a linear sequence of length 3 while Dseqrecord(o7) represents a circular sequence of length 7.

Dseqrecord and Dseq share the same concept of length. This length can be larger than each strand alone if they are staggered as in the example below.

<-- length -->
GATCCTTT
     AAAGCCTAG
Parameters:
  • record (string, Seq, SeqRecord, Dseq or other Dseqrecord object) – This data will be used to form the seq property

  • circular (bool, optional) – True or False reflecting the shape of the DNA molecule

  • linear (bool, optional) – True or False reflecting the shape of the DNA molecule

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("aaa")
>>> a
Dseqrecord(-3)
>>> a.seq
Dseq(-3)
aaa
ttt
>>> from pydna.seq import Seq
>>> b=Dseqrecord(Seq("aaa"))
>>> b
Dseqrecord(-3)
>>> b.seq
Dseq(-3)
aaa
ttt
>>> from Bio.SeqRecord import SeqRecord
>>> c=Dseqrecord(SeqRecord(Seq("aaa")))
>>> c
Dseqrecord(-3)
>>> c.seq
Dseq(-3)
aaa
ttt

References

source: Source | None = None#
classmethod from_string(record: str = '', *args, circular=False, n=5e-14, **kwargs)[source]#

docstring.

classmethod from_SeqRecord(record: SeqRecord, *args, circular=None, n=5e-14, **kwargs)[source]#
property circular#

The circular property cannot be set directly. Use looped().

m()[source]#

This method returns the mass of the DNA molecule in grams. This is calculated as the product between the molecular weight of the Dseq object and the

extract_feature(n)[source]#

Extracts a feature and creates a new Dseqrecord object.

Parameters:

n (int) – Indicates the feature to extract

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("atgtaa")
>>> a.add_feature(2,4)
>>> b=a.extract_feature(0)
>>> b
Dseqrecord(-2)
>>> b.seq
Dseq(-2)
gt
ca
add_feature(x=None, y=None, seq=None, type_='misc', strand=1, *args, **kwargs)[source]#

Add a feature of type misc to the feature list of the sequence.

Parameters:
  • x (int) – Indicates start of the feature

  • y (int) – Indicates end of the feature

Examples

>>> from pydna.seqrecord import SeqRecord
>>> a=SeqRecord("atgtaa")
>>> a.features
[]
>>> a.add_feature(2,4)
>>> a.features
[SeqFeature(SimpleLocation(ExactPosition(2), ExactPosition(4), strand=1), type='misc', qualifiers=...)]
seguid()[source]#

Url safe SEGUID for the sequence.

This checksum is the same as seguid but with base64.urlsafe encoding instead of the normal base64. This means that the characters + and / are replaced with - and _ so that the checksum can be part of a URL.
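The difference between the two encodings can be illustrated with the standard library alone. This sketch hashes a plain string with SHA-1 and does not reproduce the SEGUID algorithm for double-stranded sequences; `sha1_base64_pair` is a hypothetical helper.

```python
import base64
import hashlib


def sha1_base64_pair(sequence: str) -> tuple[str, str]:
    """Return (standard, URL-safe) base64 encodings of a SHA-1 digest."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    standard = base64.b64encode(digest).decode("ascii").rstrip("=")
    urlsafe = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return standard, urlsafe


standard, urlsafe = sha1_base64_pair("AA")
# The two strings differ only where "+" or "/" occur in the standard form.
print(standard.replace("+", "-").replace("/", "_") == urlsafe)
```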

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a = Dseqrecord("aa")
>>> a.seguid()
'ldseguid=TEwydy0ugvGXh3VJnVwgtxoyDQA'
looped()[source]#

Circular version of the Dseqrecord object.

The underlying linear Dseq object has to have compatible ends.

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("aaa")
>>> a
Dseqrecord(-3)
>>> b=a.looped()
>>> b
Dseqrecord(o3)
>>>
tolinear()[source]#

Returns a linear, blunt copy of a circular Dseqrecord object. The underlying Dseq object has to be circular.

This method is deprecated, use slicing instead. See example below.

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("aaa", circular = True)
>>> a
Dseqrecord(o3)
>>> b=a[:]
>>> b
Dseqrecord(-3)
>>>
terminal_transferase(nucleotides='a')[source]#

docstring.

format(format: str = 'gb')[source]#

Returns the sequence as a string using a format supported by Biopython SeqIO [18]. Default is “gb” which is short for Genbank. Allowed formats include, for example:

  • “fasta”: The standard FASTA format.

  • “fasta-2line”: No line wrapping and exactly two lines per record.

  • “genbank” (or “gb”): The GenBank flat file format.

  • “embl”: The EMBL flat file format.

  • “imgt”: The IMGT variant of the EMBL format.

The format string can be modified with the keyword “dscode” if the underlying dscode string is desired in the output. For example:

Dseqrecord("PEXIGATCQFZJ").format("fasta-2line dscode")

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> x=Dseqrecord("aaa")
>>> x.annotations['date'] = '02-FEB-2013'
>>> x
Dseqrecord(-3)
>>> print(x.format("gb"))
LOCUS       name                       3 bp    DNA     linear   UNK 02-FEB-2013
DEFINITION  description.
ACCESSION   id
VERSION     id
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
ORIGIN
        1 aaa
//
>>> print(Dseqrecord("PEXIGATCQFZJ").format("fasta-2line"))
>id description
GATCGATCGATC
>>> print(Dseqrecord("PEXIGATCQFZJ").format("fasta-2line dscode"))
>id description
PEXIGATCQFZJ

References

write(filename=None, f='gb')[source]#

Writes the Dseqrecord to a file using the format f, which must be a format supported by Biopython SeqIO for writing [19]. Default is “gb” which is short for Genbank. Note that Biopython SeqIO reads more formats than it writes.

Filename is the path to the file where the sequence is to be written. The filename is optional; if it is not given, the description property (string) is used together with the format.

If obj is the Dseqrecord object, the default file name will be:

<obj.locus>.<f>

Where <f> is “gb” by default. If the filename already exists AND the sequence it contains is different, a new file name will be used so that the old file is not lost:

<obj.locus>_NEW.<f>
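The naming rule can be sketched with pathlib. This is a simplified illustration, not the pydna implementation: `choose_filename` is a hypothetical helper, and it compares the raw file text rather than the sequences stored in the files.

```python
from pathlib import Path


def choose_filename(locus: str, new_text: str,
                    f: str = "gb", folder: str = ".") -> Path:
    """Pick <locus>.<f>, or <locus>_NEW.<f> if a different file already exists."""
    path = Path(folder) / f"{locus}.{f}"
    if path.exists() and path.read_text() != new_text:
        # Keep the old file untouched and write under a new name instead.
        path = Path(folder) / f"{locus}_NEW.{f}"
    return path
```

Writing identical content a second time therefore reuses the original name, while changed content ends up next to the old file.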

References

find(other)[source]#
find_aminoacids(other)[source]#
>>> from pydna.dseqrecord import Dseqrecord
>>> s=Dseqrecord("atgtacgatcgtatgctggttatattttag")
>>> s.seq.translate()
ProteinSeq('MYDRMLVIF*')
>>> "RML" in s
True
>>> "MMM" in s
False
>>> s.seq.rc().translate()
ProteinSeq('LKYNQHTIVH')
>>> "QHT" in s.rc()
True
>>> "QHT" in s
False
>>> slc = s.find_aa("RML")
>>> slc
slice(9, 18, None)
>>> s[slc]
Dseqrecord(-9)
>>> code = s[slc].seq
>>> code
Dseq(-9)
cgtatgctg
gcatacgac
>>> code.translate()
ProteinSeq('RML')
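The slice in the example follows from codon arithmetic: amino acid i of a translation that starts at the first nucleotide is encoded by nucleotides 3i to 3i+3. A minimal sketch under that assumption; `aa_to_dna_slice` is a hypothetical helper, not part of pydna:

```python
def aa_to_dna_slice(protein: str, motif: str, frame: int = 0):
    """Return the DNA slice encoding the first occurrence of motif, or None."""
    i = protein.find(motif)
    if i == -1:
        return None
    return slice(frame + 3 * i, frame + 3 * (i + len(motif)))


dna = "atgtacgatcgtatgctggttatattttag"
slc = aa_to_dna_slice("MYDRMLVIF*", "RML")
print(slc)       # slice(9, 18, None)
print(dna[slc])  # cgtatgctg
```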
find_aa(other)#
>>> from pydna.dseqrecord import Dseqrecord
>>> s=Dseqrecord("atgtacgatcgtatgctggttatattttag")
>>> s.seq.translate()
ProteinSeq('MYDRMLVIF*')
>>> "RML" in s
True
>>> "MMM" in s
False
>>> s.seq.rc().translate()
ProteinSeq('LKYNQHTIVH')
>>> "QHT" in s.rc()
True
>>> "QHT" in s
False
>>> slc = s.find_aa("RML")
>>> slc
slice(9, 18, None)
>>> s[slc]
Dseqrecord(-9)
>>> code = s[slc].seq
>>> code
Dseq(-9)
cgtatgctg
gcatacgac
>>> code.translate()
ProteinSeq('RML')
map_trace_files(pth, limit=25)[source]#
linearize(*enzymes)[source]#

Similar to cut().

Throws an exception if there is not exactly one cut, i.e. if digestion yields no products or more than one.

no_cutters(batch: RestrictionBatch = None)[source]#

docstring.

unique_cutters(batch: RestrictionBatch = None)[source]#

docstring.

once_cutters(batch: RestrictionBatch = None)[source]#

docstring.

twice_cutters(batch: RestrictionBatch = None)[source]#

docstring.

n_cutters(n=3, batch: RestrictionBatch = None)[source]#

docstring.

cutters(batch: RestrictionBatch = None)[source]#

docstring.

number_of_cuts(*enzymes)[source]#

The number of cuts by digestion with the Restriction enzymes contained in the iterable.

reverse_complement()[source]#

Reverse complement.
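For a single strand written as a plain string, the operation amounts to complementing each base and reversing the result. The sketch below handles only unambiguous bases (a, c, g, t) and is independent of pydna:

```python
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(strand: str) -> str:
    """Complement every base, then reverse the strand."""
    return strand.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ggaatt"))  # aattcc
```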

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("ggaatt")
>>> a
Dseqrecord(-6)
>>> a.seq
Dseq(-6)
ggaatt
ccttaa
>>> a.reverse_complement().seq
Dseq(-6)
aattcc
ttaagg
>>>
rc()#

Reverse complement.

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("ggaatt")
>>> a
Dseqrecord(-6)
>>> a.seq
Dseq(-6)
ggaatt
ccttaa
>>> a.reverse_complement().seq
Dseq(-6)
aattcc
ttaagg
>>>
synced(ref, limit=25)[source]#

This method returns a new circular sequence (Dseqrecord object), which has been rotated in such a way that there is maximum overlap between the sequence and ref, which may be a string, Biopython Seq, SeqRecord object or another Dseqrecord object.

The reason for using this could be to rotate a new recombinant plasmid so that it starts at the same position after cloning. See the example below:

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("gaat", circular=True)
>>> a.seq
Dseq(o4)
gaat
ctta
>>> d = a[2:] + a[:2]
>>> d.seq
Dseq(-4)
atga
tact
>>> insert=Dseqrecord("CCC")
>>> recombinant = (d+insert).looped()
>>> recombinant.seq
Dseq(o7)
atgaCCC
tactGGG
>>> recombinant.synced(a).seq
Dseq(o7)
gaCCCat
ctGGGta
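The idea of rotating a circular sequence to line up with a reference can be sketched on plain strings. This is a deliberately simplified stand-in, not the algorithm used by synced(): it ignores the limit parameter and simply picks the rotation that shares the longest prefix with the reference. `rotate_to_reference` is a hypothetical name.

```python
def rotate_to_reference(circular_seq: str, ref: str) -> str:
    """Rotate circular_seq so that it shares the longest possible prefix with ref."""
    if not circular_seq:
        return circular_seq
    seq_cmp, ref_cmp = circular_seq.lower(), ref.lower()

    def shared_prefix(start: int) -> int:
        # Length of the common prefix between this rotation and the reference.
        rotated = seq_cmp[start:] + seq_cmp[:start]
        count = 0
        for a, b in zip(rotated, ref_cmp):
            if a != b:
                break
            count += 1
        return count

    best = max(range(len(circular_seq)), key=shared_prefix)
    return circular_seq[best:] + circular_seq[:best]


print(rotate_to_reference("atgaCCC", "gaat"))  # gaCCCat
```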
upper()[source]#

Returns an uppercase copy.

>>> from pydna.dseqrecord import Dseqrecord
>>> my_seq = Dseqrecord("aAa")
>>> my_seq.seq
Dseq(-3)
aAa
tTt
>>> upper = my_seq.upper()
>>> upper.seq
Dseq(-3)
AAA
TTT
>>>

Returns:

Dseqrecord object in uppercase

Return type:

Dseqrecord

lower()[source]#
>>> from pydna.dseqrecord import Dseqrecord
>>> my_seq = Dseqrecord("aAa")
>>> my_seq.seq
Dseq(-3)
aAa
tTt
>>> upper = my_seq.upper()
>>> upper.seq
Dseq(-3)
AAA
TTT
>>> lower = my_seq.lower()
>>> lower
Dseqrecord(-3)
>>>
Returns:

Dseqrecord object in lowercase

Return type:

Dseqrecord

orfs(minsize=300)[source]#

docstring.

orfs_to_features(minsize=300)[source]#

docstring.

copy_gb_to_clipboard()[source]#

docstring.

copy_fasta_to_clipboard()[source]#

docstring.

figure(feature=0, highlight='\x1b[48;5;11m', plain='\x1b[0m')[source]#

docstring.

shifted(shift)[source]#

Circular Dseqrecord with a new origin <shift>.

This only works on circular Dseqrecords. If we consider the following circular sequence:

GAAAT   <-- watson strand
CTTTA   <-- crick strand

The T and the G on the watson strand are linked together as well as the A and the C of the crick strand.

If shift is 1, this indicates a new origin at position 1:

new origin at the | symbol:

G|AAAT
C|TTTA

new sequence:

AAATG
TTTAC
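On a plain string, moving the origin of a circular sequence is a rotation. A minimal sketch, assuming a non-empty sequence; `rotate_origin` is a hypothetical helper and only handles one strand:

```python
def rotate_origin(circular_seq: str, shift: int) -> str:
    """Return the circular sequence rewritten to start at position shift."""
    shift %= len(circular_seq)  # shifts wrap around the circle
    return circular_seq[shift:] + circular_seq[:shift]


print(rotate_origin("GAAAT", 1))  # AAATG
```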

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("aaat",circular=True)
>>> a
Dseqrecord(o4)
>>> a.seq
Dseq(o4)
aaat
ttta
>>> b=a.shifted(1)
>>> b
Dseqrecord(o4)
>>> b.seq
Dseq(o4)
aata
ttat
cut(*enzymes)[source]#

Digest a Dseqrecord object with one or more restriction enzymes.

Returns a list of linear Dseqrecords. If there are no cuts, an empty list is returned.

See also Dseq.cut()

Parameters:

enzymes (enzyme object or iterable of such objects) – A Bio.Restriction.XXX restriction object or iterable of such.

Returns:

Dseqrecord_frags – list of Dseqrecord objects formed by the digestion

Return type:

list

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> a=Dseqrecord("ggatcc")
>>> from Bio.Restriction import BamHI
>>> a.cut(BamHI)
(Dseqrecord(-5), Dseqrecord(-5))
>>> frag1, frag2 = a.cut(BamHI)
>>> frag1.seq
Dseq(-5)
g
cctag
>>> frag2.seq
Dseq(-5)
gatcc
    g
apply_cut(left_cut, right_cut)[source]#
annotations: _AnnotationsDict#
dbxrefs: list[str]#
history()[source]#

Returns a string representation of the cloning history of the sequence. Returns an empty string if the sequence has no source.

Check the documentation notebooks for extensive examples.

Returns:

A string representation of the cloning history of the sequence.

Return type:

str

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.assembly2 import gibson_assembly
>>> fragments = [
...    Dseqrecord("TTTTacgatAAtgctccCCCC", circular=False, name="fragment1"),
...    Dseqrecord("CCCCtcatGGGG", circular=False, name="fragment2"),
...    Dseqrecord("GGGGatataTTTT", circular=False, name="fragment3"),
... ]
>>> product, *_ = gibson_assembly(fragments, limit=4)
>>> product.name = "product_name"
>>> print(product.history())
╙── product_name (Dseqrecord(o34))
    └─╼ GibsonAssemblySource
        ├─╼ fragment1 (Dseqrecord(-21))
        ├─╼ fragment2 (Dseqrecord(-12))
        └─╼ fragment3 (Dseqrecord(-13))
join(fragments)[source]#

Join an iterable of Dseqrecords with this instance as the separator.

Example:

>>> sep = Dseqrecord("a")
>>> joined = sep.join([Dseqrecord("A"), Dseqrecord("B"), Dseqrecord("C")])
>>> joined
Dseqrecord(-5)
>>> joined.seq
Dseq(-5)
AaBaC
TtVtG

pydna.fakeseq module#

docstring.

class pydna.fakeseq.FakeSeq(length: int, n: float = 50e-15, rf: float = 0.0)[source]#

Bases: object

docstring.

m() float[source]#

Mass of the DNA molecule in grams.

M() float[source]#

M grams/mol.

pydna.fusionpcr module#

pydna.fusionpcr.fuse_by_pcr(fragments, limit=15)[source]#

docstring.

pydna.fusionpcr.list_parts(fusion_pcr_fragment)[source]#

pydna.gateway module#

pydna.gateway.gateway_overlap(seqx: Dseqrecord, seqy: Dseqrecord, reaction: str, greedy: bool) list[tuple[int, int, int]][source]#

Find gateway overlaps. If greedy is True, a more permissive consensus site is used to find attP sites, which might give false positives.

pydna.gateway.find_gateway_sites(seq: Dseqrecord, greedy: bool) dict[str, list[SimpleLocation]][source]#

Find all gateway sites in a sequence and return a dictionary with the name and positions of the sites.

pydna.gateway.annotate_gateway_sites(seq: Dseqrecord, greedy: bool) Dseqrecord[source]#

pydna.gel module#

docstring.

pydna.gel.interpolator(mwstd)[source]#

docstring.

pydna.gel.gel(samples=None, gel_length=600, margin=50, interpolator=interpolator(mwstd=_mwstd))[source]#

pydna.genbank module#

This module provides a class for downloading sequences from genbank called Genbank and a function that does the same thing called genbank.

The function can be used if the environment variable pydna_email has been set to a valid email address. The easiest way to do this permanently is to edit the pydna.ini file. See the documentation of pydna.open_config_folder()

class pydna.genbank.Genbank(users_email: str, *, tool: str = 'pydna')[source]#

Bases: object

Class to facilitate download from genbank. It is easier and quicker to use the pydna.genbank.genbank() function directly.

Parameters:

users_email (string) – Has to be a valid email address. You should always tell Genbank who you are, so that they can contact you.

Examples

>>> from pydna.genbank import Genbank
>>> gb=Genbank("bjornjobb@gmail.com")
>>> rec = gb.nucleotide("LP002422.1")   # <- entry from genbank
>>> print(len(rec))
1
nucleotide(item: str, seq_start: int | None = None, seq_stop: int | None = None, strand: Literal[1, 2] = 1) Dseqrecord[source]#

This method downloads a genbank nucleotide record from genbank. This method is cached by default. This can be controlled by editing the pydna_cached_funcs environment variable. The best way to do this permanently is to edit the pydna.ini file. See the documentation of pydna.open_config_folder()

Item is a string containing one genbank accession number for a nucleotide file. Genbank nucleotide accession numbers have this format:

A12345 = 1 letter + 5 numerals
AB123456 = 2 letters + 6 numerals

The accession number is sometimes followed by a point and version number

BK006936.2

Item can also contain optional interval information in the following formats:

BK006936.2 REGION: complement(613900..615202)
NM_005546 REGION: 1..100
NM_005546 REGION: complement(1..100)
21614549:1-100
21614549:c100-1
21614549 1-100
21614549 c100-1

It is useful to set an interval for large genbank records to limit the download time. The items above contain interval information and can be obtained directly by looking up an entry in Genbank and setting “Change region shown” on the upper right side of the page. The ACCESSION line of the displayed Genbank file will have the formatting shown.
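The interval formats listed above can be recognised with two regular expressions. This is an illustrative sketch, not the parser used by pydna; `parse_item` is a hypothetical helper that returns the accession, the interval and a strand value of 1 or 2.

```python
import re

# "ACC REGION: 1..100" and "ACC REGION: complement(1..100)"
_REGION = re.compile(
    r"^(?P<acc>\S+)\s+REGION:\s*(?P<comp>complement\()?"
    r"(?P<start>\d+)\.\.(?P<stop>\d+)\)?$"
)
# "ACC:1-100", "ACC:c100-1", "ACC 1-100" and "ACC c100-1"
_SHORT = re.compile(r"^(?P<acc>[^\s:]+)[:\s]\s*(?P<comp>c)?(?P<a>\d+)-(?P<b>\d+)$")


def parse_item(item: str):
    """Return (accession, start, stop, strand); start and stop are None without an interval."""
    item = item.strip()
    m = _REGION.match(item)
    if m:
        return m["acc"], int(m["start"]), int(m["stop"]), 2 if m["comp"] else 1
    m = _SHORT.match(item)
    if m:
        a, b = int(m["a"]), int(m["b"])
        return m["acc"], min(a, b), max(a, b), 2 if m["comp"] else 1
    return item, None, None, 1


print(parse_item("BK006936.2 REGION: complement(613900..615202)"))
print(parse_item("21614549:c100-1"))
```

In this sketch a leading "c" or a complement(...) wrapper selects strand 2, and the interval is always reported with the smaller coordinate first.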

Alternatively, seq_start and seq_stop can be set explicitly to the sequence intervals to be downloaded.

If strand is 2, “c”, “C”, “crick”, “Crick”, “antisense”, “Antisense”, “2”, “-” or “-1”, the antisense (Crick) strand is returned; otherwise the sense (Watson) strand is returned.
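The alias handling described above can be sketched as a small normalisation step. This is an assumption-laden illustration, not pydna's code: `normalize_strand` is a hypothetical helper, and because it compares case-insensitively it accepts slightly more spellings (for example "CRICK" or the integer -1) than the list above.

```python
_CRICK_ALIASES = {"c", "crick", "antisense", "2", "-", "-1"}


def normalize_strand(strand) -> int:
    """Map the Crick aliases to 2; anything else maps to 1 (Watson)."""
    return 2 if str(strand).strip().lower() in _CRICK_ALIASES else 1


print(normalize_strand("Crick"))  # 2
print(normalize_strand(1))        # 1
```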

Result is returned as a Dseqrecord object.

References

pydna.genbank.genbank(accession: str = 'CS570233.1', *args, email=None, **kwargs) Dseqrecord[source]#

Download a genbank nucleotide record.

This function takes the same parameters as the pydna.genbank.Genbank.nucleotide() method. The email address stored in the pydna_email environment variable is used. The easiest way to set this permanently is to edit the pydna.ini file. See the documentation of pydna.open_config_folder()

If no accession is given, a very short Genbank entry is used as an example (see below). This can be useful for testing the connection to Genbank.

Please note that this result is also cached by default by settings in the pydna.ini file. See the documentation of pydna.open_config_folder()

LOCUS       CS570233                  14 bp    DNA     linear   PAT 18-MAY-2007
DEFINITION  Sequence 6 from Patent WO2007025016.
ACCESSION   CS570233
VERSION     CS570233.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1
  AUTHORS   Shaw,R.W. and Cottenoir,M.
  TITLE     Inhibition of metallo-beta-lactamase by double-stranded dna
  JOURNAL   Patent: WO 2007025016-A1 6 01-MAR-2007;
            Texas Tech University System (US)
FEATURES             Location/Qualifiers
     source          1..14
                     /organism="synthetic construct"
                     /mol_type="unassigned DNA"
                     /db_xref="taxon:32630"
                     /note="This is a 14bp aptamer inhibitor."
ORIGIN
        1 atgttcctac atga
//

pydna.genbankfixer module#

This module provides the gbtext_clean() function, which can clean up broken Genbank files enough to pass the Biopython Genbank parser.

Almost all of this code was lifted from BioJSON (levskaya/BioJSON) by Anselm Levskaya. The original code was not accompanied by any software licence. This parser is based on pyparsing.

There are some modifications to deal with fringe cases.

The parser first produces JSON as an intermediate format which is then formatted back into a string in Genbank format.

The parser is not complete, so some fields do not survive the roundtrip (see below). This should not be a difficult fix. The returned result has two properties, .jseq which is the intermediate JSON produced by the parser and .gbtext which is the formatted genbank string.

pydna.genbankfixer.parseGBLoc(s, l_, t)[source]#

retwingles parsed genbank location strings, assumes no joins of RC and FWD sequences

pydna.genbankfixer.strip_multiline(s, l_, t)[source]#
pydna.genbankfixer.toInt(s, l_, t)[source]#
pydna.genbankfixer.strip_indent(str)[source]#
pydna.genbankfixer.concat_dict(dlist)[source]#

more or less dict(list of string pairs) but merges vals with the same keys so no duplicates occur

pydna.genbankfixer.toJSON(gbkstring)[source]#
pydna.genbankfixer.wrapstring(str_, rowstart, rowend, padfirst=True)[source]#

wraps the provided string in lines of length rowend-rowstart and padded on the left by rowstart. -> if padfirst is false the first line is not padded

pydna.genbankfixer.locstr(locs, strand)[source]#

genbank formatted location string, assumes no join’d combo of rev and fwd seqs

pydna.genbankfixer.originstr(sequence)[source]#

formats dna sequence as broken, numbered lines ala genbank
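The Genbank ORIGIN layout seen in the records in this document (a right-justified position followed by blocks of ten bases) can be sketched as below. The helper name and the 60 bases per line default are assumptions based on the usual Genbank flat file layout, not taken from the genbankfixer source.

```python
def origin_lines(sequence: str, per_line: int = 60, block: int = 10) -> list[str]:
    """Format a sequence as numbered lines of space-separated blocks."""
    lines = []
    for start in range(0, len(sequence), per_line):
        row = sequence[start:start + per_line]
        blocks = [row[i:i + block] for i in range(0, len(row), block)]
        # The 1-based position of the first base is right-justified in 9 columns.
        lines.append(f"{start + 1:>9} " + " ".join(blocks))
    return lines


print("\n".join(origin_lines("atgttcctacatga")))
```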

pydna.genbankfixer.toGB(jseq)[source]#

parses json jseq data and prints out ApE compatible genbank

pydna.genbankfixer.gbtext_clean(gbtext)[source]#

This function takes a string containing one genbank sequence in Genbank format and returns a named tuple containing two fields, the gbtext containing a string with the corrected genbank sequence and jseq which contains the JSON intermediate.

Examples

>>> s = '''LOCUS       New_DNA      3 bp    DNA   CIRCULAR SYN        19-JUN-2013
... DEFINITION  .
... ACCESSION
... VERSION
... SOURCE      .
...   ORGANISM  .
... COMMENT
... COMMENT     ApEinfo:methylated:1
... ORIGIN
...         1 aaa
... //'''
>>> from pydna.readers import read
>>> read(s)  
... /site-packages/Bio/GenBank/Scanner.py:1388: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS       New_DNA      3 bp    DNA   CIRCULAR SYN        19-JUN-2013\n'
  "correct?\n:%r" % line, BiopythonParserWarning)
Traceback (most recent call last):
  File "... /pydna/readers.py", line 48, in read
    results = results.pop()
IndexError: pop from empty list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "... /pydna/readers.py", line 50, in read
    raise ValueError("No sequences found in data:\n({})".format(data[:79]))
ValueError: No sequences found in data:
(LOCUS       New_DNA      3 bp    DNA   CIRCULAR SYN        19-JUN-2013
DEFINITI)
>>> from pydna.genbankfixer import gbtext_clean
>>> s2, j2 = gbtext_clean(s)
>>> print(s2)
LOCUS       New_DNA                    3 bp ds-DNA     circular SYN 19-JUN-2013
DEFINITION  .
ACCESSION
VERSION
SOURCE      .
ORGANISM  .
COMMENT
COMMENT     ApEinfo:methylated:1
FEATURES             Location/Qualifiers
ORIGIN
        1 aaa
//
>>> s3 = read(s2)
>>> s3
Dseqrecord(o3)
>>> print(s3.format())
LOCUS       New_DNA                    3 bp    DNA     circular SYN 19-JUN-2013
DEFINITION  .
ACCESSION   New_DNA
VERSION     New_DNA
KEYWORDS    .
SOURCE
  ORGANISM  .
            .
COMMENT
            ApEinfo:methylated:1
FEATURES             Location/Qualifiers
ORIGIN
        1 aaa
//

pydna.ladders module#

Agarose gel DNA ladders.

A DNA ladder is a list of FakeSeq objects that has to be initiated with Size (bp), amount of substance (mol) and Relative mobility (Rf).

Rf is a float value between 0.000 and 1.000. These are used together with the cubic spline interpolator in the gel module to calculate migration distance from fragment length. The Rf values are calculated manually from a gel image. An example can be found in scripts/molecular_weight_standards.ods.

pydna.oligonucleotide_hybridization module#

This module contains the functions for oligonucleotide hybridization.

pydna.oligonucleotide_hybridization.oligonucleotide_hybridization_overhangs(fwd_oligo_seq: str, rvs_oligo_seq: str, minimal_annealing: int) list[int][source]#

Returns possible overhangs between two oligos given a minimal annealing length, and returns an error if mismatches are found.

See manulera/OpenCloning_backend#302 for notation.

>>> from pydna.oligonucleotide_hybridization import oligonucleotide_hybridization_overhangs
>>> oligonucleotide_hybridization_overhangs("ATGGC", "GCCAT", 3)
[0]
>>> oligonucleotide_hybridization_overhangs("aATGGC", "GCCAT", 5)
[-1]
>>> oligonucleotide_hybridization_overhangs("ATGGC", "GCCATa", 5)
[1]
>>> oligonucleotide_hybridization_overhangs("ATGGC", "GCCATaaGCCAT", 5)
[0, 7]

If the minimal annealing length is longer than the length of the shortest oligo, it returns an empty list.

>>> oligonucleotide_hybridization_overhangs("ATGGC", "GCCATaaGCCAT", 100)
[]

If it’s possible to anneal for minimal_annealing length, but with mismatches, it raises an error.

>>> oligonucleotide_hybridization_overhangs("cATGGC", "GCCATa", 5)
Traceback (most recent call last):
    ...
ValueError: The oligonucleotides can anneal with mismatches
pydna.oligonucleotide_hybridization.oligonucleotide_hybridization(fwd_primer: Primer, rvs_primer: Primer, minimal_annealing: int) list[Dseqrecord][source]#

Returns a list of Dseqrecord objects representing the hybridization of two primers.

>>> from pydna.primer import Primer
>>> from pydna.oligonucleotide_hybridization import oligonucleotide_hybridization
>>> fwd_primer = Primer("ATGGC")
>>> rvs_primer = Primer("GCCA")
>>> oligonucleotide_hybridization(fwd_primer, rvs_primer, 3)[0].seq
Dseq(-5)
ATGGC
 ACCG

Multiple values can be returned:

>>> rvs_primer2 = Primer("GCCATaaGCCAT")
>>> oligonucleotide_hybridization(fwd_primer, rvs_primer2, 3)[0].seq
Dseq(-12)
ATGGC
TACCGaaTACCG
>>> oligonucleotide_hybridization(fwd_primer, rvs_primer2, 3)[1].seq
Dseq(-12)
       ATGGC
TACCGaaTACCG

If no possible overhangs are found, it returns an empty list.

>>> oligonucleotide_hybridization(fwd_primer, rvs_primer, 100)
[]

If there are mismatches given the minimal annealing length, it raises an error.

>>> fwd_primer3 = Primer("cATGGC")
>>> rvs_primer3 = Primer("GCCATa")
>>> oligonucleotide_hybridization(fwd_primer3, rvs_primer3, 5)
Traceback (most recent call last):
    ...
ValueError: The oligonucleotides can anneal with mismatches

pydna.opencloning_models module#

This module provides classes that roughly map to the OpenCloning data model, which is defined using LinkML (https://linkml.io) and available as the python package opencloning-linkml. These classes are documented there; the ones in this module essentially replace the fields pointing to sequences and primers (which use ids in the data model) with Dseqrecord and Primer objects, respectively. Similarly, this module uses Location from Biopython instead of a string, which is what the data model uses.

When pydna is used to plan cloning, it stores the provenance of Dseqrecord objects in their source attribute. Not all methods generate sources so far, so refer to the documentation notebooks for examples on how to use this feature. The history method of Dseqrecord objects can be used to get a string representation of the provenance of the sequence. You can also use the CloningStrategy class to create a JSON representation of the cloning strategy. That CloningStrategy can be loaded in the OpenCloning web interface to see a representation of the cloning strategy.

Contributing#

Not all fields can be readily serialized to regular types in pydantic. For instance, the coordinates field of the GenomeCoordinatesSource class is a SimpleLocation object, and the input field of Source is a list of SourceInput objects, which can be Dseqrecord or Primer objects, or AssemblyFragment objects. For these types of fields, you have to define a field_serializer method to serialize them to the correct type.

pydna.opencloning_models.id_mode(use_python_internal_id: bool = True)[source]#

Context manager that is used to determine how ids are assigned to objects when mapping them to the OpenCloning data model. If use_python_internal_id is True, the built-in python id() function is used to assign ids to objects. That function produces a unique integer for each object in python, so it’s guaranteed to be unique. If use_python_internal_id is False, the object’s .id attribute (must be a string integer) is used to assign ids to objects. This is useful when the objects already have meaningful ids, and you want to keep references to them in SourceInput objects (which sequences and primers are used in a particular source).

Parameters:

use_python_internal_id (bool) – If True, use Python’s built-in id() function. If False, use the object’s .id attribute (must be a string integer).
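The mechanism described above, a context manager that switches an id strategy held in thread-local storage, can be sketched with the standard library. This is a re-implementation for illustration only and may differ from the pydna source; the `Record` class is a hypothetical stand-in for a Dseqrecord or Primer.

```python
import threading
from contextlib import contextmanager

_state = threading.local()  # each thread sees its own setting


@contextmanager
def id_mode(use_python_internal_id: bool = True):
    """Temporarily change how get_id() assigns ids in the current thread."""
    previous = getattr(_state, "use_python_internal_id", True)
    _state.use_python_internal_id = use_python_internal_id
    try:
        yield
    finally:
        # Restore the earlier setting even if the block raised.
        _state.use_python_internal_id = previous


def get_id(obj) -> int:
    if getattr(_state, "use_python_internal_id", True):
        return id(obj)
    return int(obj.id)  # the .id attribute must be a string integer


class Record:
    """Hypothetical stand-in for an object with an .id attribute."""

    def __init__(self, id_: str):
        self.id = id_


rec = Record("123")
with id_mode(use_python_internal_id=False):
    print(get_id(rec))         # 123
print(get_id(rec) == id(rec))  # True
```

Restoring the previous value in the finally clause lets such blocks nest safely.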

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.opencloning_models import get_id, id_mode
>>> dseqr = Dseqrecord("ATGC")
>>> dseqr.name = "my_sequence"
>>> dseqr.id = "123"
>>> get_id(dseqr) == id(dseqr)
True
>>> with id_mode(use_python_internal_id=False):
...     get_id(dseqr)
123
pydna.opencloning_models.get_id(obj: Primer | Dseqrecord) int[source]#

Get ID using the current strategy from thread-local storage (see id_mode)

Parameters:

obj (Primer | Dseqrecord) – The object to get the id of

Returns:

The id of the object

Return type:

int

class pydna.opencloning_models.SequenceLocationStr[source]#

Bases: str

A string representation of a sequence location, genbank-like.

classmethod from_biopython_location(location: Location)[source]#
to_biopython_location() Location[source]#
classmethod field_validator(v)[source]#
classmethod from_start_and_end(start: int, end: int, seq_len: int | None = None, strand: int | None = 1)[source]#
get_ncbi_format_coordinates() str[source]#

Return start, end, strand in the same format as the NCBI eutils API (1-based, inclusive)
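The coordinate conversion can be sketched as follows, assuming the input uses Biopython's 0-based, end-exclusive positions with strand 1 or -1, and that the NCBI side expects strand 1 (plus) or 2 (minus). `to_ncbi_coordinates` is a hypothetical helper returning a tuple; the return format of the method above is a string and is not reproduced here.

```python
def to_ncbi_coordinates(start: int, end: int, strand: int = 1) -> tuple[int, int, int]:
    """Convert 0-based, end-exclusive coordinates to 1-based, inclusive ones."""
    # Only the start moves: an exclusive end equals the inclusive 1-based end.
    return start + 1, end, 1 if strand == 1 else 2


print(to_ncbi_coordinates(0, 100, 1))   # (1, 100, 1)
print(to_ncbi_coordinates(9, 18, -1))   # (10, 18, 2)
```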

class pydna.opencloning_models.ConfiguredBaseModel[source]#

Bases: BaseModel

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.TextFileSequence(*, id: int, type: Literal['TextFileSequence'] = 'TextFileSequence', sequence_file_format: SequenceFileFormat, overhang_crick_3prime: int | None = 0, overhang_watson_3prime: int | None = 0, file_content: str | None = None)[source]#

Bases: TextFileSequence

classmethod from_dseqrecord(dseqr: Dseqrecord)[source]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.PrimerModel(*, id: int, type: Literal['Primer'] = 'Primer', name: str | None = None, database_id: int | None = None, sequence: str | None = None)[source]#

Bases: Primer

classmethod from_primer(primer: Primer)[source]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.SourceInput(*, sequence: object)[source]#

Bases: ConfiguredBaseModel

sequence: object#
to_pydantic_model() SourceInput[source]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.AssemblyFragment(*, sequence: object, left_location: Location | None = None, right_location: Location | None = None, reverse_complemented: bool)[source]#

Bases: SourceInput

left_location: Location | None#
right_location: Location | None#
reverse_complemented: bool#
static from_biopython_location(location: Location | None)[source]#
to_pydantic_model() AssemblyFragment[source]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.Source(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>)[source]#

Bases: ConfiguredBaseModel

input: list[SourceInput | AssemblyFragment]#
TARGET_MODEL#

alias of Source

serialize_input(input: list[SourceInput | AssemblyFragment]) list[SourceInput | AssemblyFragment][source]#
to_pydantic_model(seq_id: int)[source]#
to_unserialized_dict()[source]#

Converts into a dictionary without serializing the fields. This is used to be able to recast.

add_to_history_graph(history_graph: nx.DiGraph, seq: Dseqrecord)[source]#

Add the source to the history graph.

It does not use the get_id function, because it just uses it to have unique identifiers for graph nodes, not to store them anywhere.

history_string(seq: Dseqrecord)[source]#

Returns a string representation of the cloning history of the sequence. See dseqrecord.history() for examples.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.AssemblySource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool)[source]#

Bases: Source

circular: bool#
TARGET_MODEL#

alias of AssemblySource

classmethod from_subfragment_representation(assembly: SubFragmentRepresentationAssembly, fragments: list['Dseqrecord'], is_circular: bool)[source]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.DatabaseSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, database_id: int)[source]#

Bases: Source

TARGET_MODEL#

alias of DatabaseSource

database_id: int#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.UploadedFileSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, file_name: str, index_in_file: int, sequence_file_format: str)[source]#

Bases: Source

TARGET_MODEL#

alias of UploadedFileSource

file_name: str#
index_in_file: int#
sequence_file_format: str#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.RepositoryIdSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str)[source]#

Bases: Source

TARGET_MODEL#

alias of RepositoryIdSource

repository_id: str#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.RepositoryIdSourceWithSequenceFileUrl(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str, sequence_file_url: str | None = None)[source]#

Bases: RepositoryIdSource

Auxiliary class to avoid code duplication in the sources that have a sequence file url.

sequence_file_url: str | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.AddgeneIdSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str, sequence_file_url: str | None = None, addgene_sequence_type: ~opencloning_linkml.datamodel._models.AddgeneSequenceType | None = None)[source]#

Bases: RepositoryIdSourceWithSequenceFileUrl

TARGET_MODEL#

alias of AddgeneIdSource

addgene_sequence_type: AddgeneSequenceType | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.BenchlingUrlSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str)[source]#

Bases: RepositoryIdSource

TARGET_MODEL#

alias of BenchlingUrlSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.SnapGenePlasmidSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str)[source]#

Bases: RepositoryIdSource

TARGET_MODEL#

alias of SnapGenePlasmidSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.EuroscarfSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str)[source]#

Bases: RepositoryIdSource

TARGET_MODEL#

alias of EuroscarfSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.WekWikGeneIdSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str, sequence_file_url: str | None = None)[source]#

Bases: RepositoryIdSourceWithSequenceFileUrl

TARGET_MODEL#

alias of WekWikGeneIdSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.SEVASource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str, sequence_file_url: str | None = None)[source]#

Bases: RepositoryIdSourceWithSequenceFileUrl

TARGET_MODEL#

alias of SEVASource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.IGEMSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str, sequence_file_url: str | None = None)[source]#

Bases: RepositoryIdSourceWithSequenceFileUrl

TARGET_MODEL#

alias of IGEMSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.OpenDNACollectionsSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str, sequence_file_url: str | None = None)[source]#

Bases: RepositoryIdSourceWithSequenceFileUrl

TARGET_MODEL#

alias of OpenDNACollectionsSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.NCBISequenceSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str, coordinates: ~Bio.SeqFeature.SimpleLocation | None = None)[source]#

Bases: RepositoryIdSource

TARGET_MODEL#

alias of NCBISequenceSource

coordinates: SimpleLocation | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.GenomeCoordinatesSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, repository_id: str, coordinates: ~Bio.SeqFeature.SimpleLocation, assembly_accession: str | None = None, locus_tag: str | None = None, gene_id: int | None = None)[source]#

Bases: NCBISequenceSource

TARGET_MODEL#

alias of GenomeCoordinatesSource

assembly_accession: str | None#
locus_tag: str | None#
gene_id: int | None#
coordinates: SimpleLocation#
serialize_coordinates(coordinates: SimpleLocation) str[source]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.RestrictionAndLigationSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool, restriction_enzymes: list[~Bio.Restriction.Restriction.AbstractCut])[source]#

Bases: AssemblySource

restriction_enzymes: list[AbstractCut]#
TARGET_MODEL#

alias of RestrictionAndLigationSource

serialize_restriction_enzymes(restriction_enzymes: list[AbstractCut]) list[str][source]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.GibsonAssemblySource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool)[source]#

Bases: AssemblySource

TARGET_MODEL#

alias of GibsonAssemblySource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.InFusionSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool)[source]#

Bases: AssemblySource

TARGET_MODEL#

alias of InFusionSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.OverlapExtensionPCRLigationSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool)[source]#

Bases: AssemblySource

TARGET_MODEL#

alias of OverlapExtensionPCRLigationSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.InVivoAssemblySource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool)[source]#

Bases: AssemblySource

TARGET_MODEL#

alias of InVivoAssemblySource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.LigationSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool)[source]#

Bases: AssemblySource

TARGET_MODEL#

alias of LigationSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.GatewaySource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool, reaction_type: ~opencloning_linkml.datamodel._models.GatewayReactionType, greedy: bool = False)[source]#

Bases: AssemblySource

TARGET_MODEL#

alias of GatewaySource

reaction_type: GatewayReactionType#
greedy: bool#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.HomologousRecombinationSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool)[source]#

Bases: AssemblySource

TARGET_MODEL#

alias of HomologousRecombinationSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.CRISPRSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool)[source]#

Bases: HomologousRecombinationSource

TARGET_MODEL#

alias of CRISPRSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.CreLoxRecombinationSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool)[source]#

Bases: AssemblySource

TARGET_MODEL#

alias of CreLoxRecombinationSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.PCRSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, circular: bool, add_primer_features: bool = False)[source]#

Bases: AssemblySource

TARGET_MODEL#

alias of PCRSource

add_primer_features: bool#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.SequenceCutSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, left_edge: ~typing.Tuple[~typing.Tuple[int, int], ~Bio.Restriction.Restriction.AbstractCut | None | ~pydna.crispr._cas] | None, right_edge: ~typing.Tuple[~typing.Tuple[int, int], ~Bio.Restriction.Restriction.AbstractCut | None | ~pydna.crispr._cas] | None)[source]#

Bases: Source

left_edge: CutSiteType | None#
right_edge: CutSiteType | None#
property TARGET_MODEL#

Represents the source of a sequence

serialize_cut_site(cut_site: Tuple[Tuple[int, int], AbstractCut | None | _cas] | None) RestrictionSequenceCut | SequenceCut | None[source]#
classmethod from_parent(parent: Dseqrecord, left_edge: CutSiteType, right_edge: CutSiteType)[source]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.OligoHybridizationSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, overhang_crick_3prime: int | None = None)[source]#

Bases: Source

TARGET_MODEL#

alias of OligoHybridizationSource

overhang_crick_3prime: int | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.PolymeraseExtensionSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>)[source]#

Bases: Source

TARGET_MODEL#

alias of PolymeraseExtensionSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.AnnotationSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>, annotation_tool: ~opencloning_linkml.datamodel._models.AnnotationTool, annotation_tool_version: str | None = None, annotation_report: list[~opencloning_linkml.datamodel._models.AnnotationReport | ~opencloning_linkml.datamodel._models.PlannotateAnnotationReport] | None = None)[source]#

Bases: Source

TARGET_MODEL#

alias of AnnotationSource

annotation_tool: AnnotationTool#
annotation_tool_version: str | None#
annotation_report: list[_AnnotationReport | _PlannotateAnnotationReport] | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.ReverseComplementSource(*, input: list[~pydna.opencloning_models.SourceInput | ~pydna.opencloning_models.AssemblyFragment] = <factory>)[source]#

Bases: Source

TARGET_MODEL#

alias of ReverseComplementSource

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class pydna.opencloning_models.CloningStrategy(*, sequences: list[~opencloning_linkml.datamodel._models.Sequence | ~opencloning_linkml.datamodel._models.TemplateSequence | ~opencloning_linkml.datamodel._models.TextFileSequence | ~opencloning_linkml.datamodel._models.ManuallyTypedSequence | ~opencloning_linkml.datamodel._models.Primer], sources: list[~opencloning_linkml.datamodel._models.Source | ~opencloning_linkml.datamodel._models.DatabaseSource | ~opencloning_linkml.datamodel._models.CollectionSource | ~opencloning_linkml.datamodel._models.ManuallyTypedSource | ~opencloning_linkml.datamodel._models.UploadedFileSource | ~opencloning_linkml.datamodel._models.RepositoryIdSource | ~opencloning_linkml.datamodel._models.SequenceCutSource | ~opencloning_linkml.datamodel._models.AssemblySource | ~opencloning_linkml.datamodel._models.OligoHybridizationSource | ~opencloning_linkml.datamodel._models.PolymeraseExtensionSource | ~opencloning_linkml.datamodel._models.AnnotationSource | ~opencloning_linkml.datamodel._models.ReverseComplementSource | ~opencloning_linkml.datamodel._models.PCRSource | ~opencloning_linkml.datamodel._models.LigationSource | ~opencloning_linkml.datamodel._models.HomologousRecombinationSource | ~opencloning_linkml.datamodel._models.GibsonAssemblySource | ~opencloning_linkml.datamodel._models.InFusionSource | ~opencloning_linkml.datamodel._models.OverlapExtensionPCRLigationSource | ~opencloning_linkml.datamodel._models.InVivoAssemblySource | ~opencloning_linkml.datamodel._models.RestrictionAndLigationSource | ~opencloning_linkml.datamodel._models.GatewaySource | ~opencloning_linkml.datamodel._models.CreLoxRecombinationSource | ~opencloning_linkml.datamodel._models.CRISPRSource | ~opencloning_linkml.datamodel._models.RestrictionEnzymeDigestionSource | ~opencloning_linkml.datamodel._models.AddgeneIdSource | ~opencloning_linkml.datamodel._models.WekWikGeneIdSource | ~opencloning_linkml.datamodel._models.SEVASource | 
~opencloning_linkml.datamodel._models.BenchlingUrlSource | ~opencloning_linkml.datamodel._models.SnapGenePlasmidSource | ~opencloning_linkml.datamodel._models.EuroscarfSource | ~opencloning_linkml.datamodel._models.IGEMSource | ~opencloning_linkml.datamodel._models.OpenDNACollectionsSource | ~opencloning_linkml.datamodel._models.NCBISequenceSource | ~opencloning_linkml.datamodel._models.GenomeCoordinatesSource], primers: ~typing.List[~pydna.opencloning_models.PrimerModel] | None = <factory>, description: str | None = None, files: list[~opencloning_linkml.datamodel._models.AssociatedFile | ~opencloning_linkml.datamodel._models.SequencingFile] | None = None, schema_version: str | None = '0.4.9', backend_version: str | None = None, frontend_version: str | None = None)[source]#

Bases: CloningStrategy

primers: List[PrimerModel] | None#
add_primer(primer: Primer)[source]#
add_dseqrecord(dseqr: Dseqrecord)[source]#
reassign_ids()[source]#
classmethod from_dseqrecords(dseqrs: list['Dseqrecord'], description: str = '')[source]#
model_dump_json(*args, **kwargs)[source]#

Generates a JSON representation of the model using Pydantic’s to_json method.

Parameters:
  • indent – Indentation to use in the JSON output. If None is passed, the output will be compact.

  • ensure_ascii – If True, the output is guaranteed to have all incoming non-ASCII characters escaped. If False (the default), these characters will be output as-is.

  • include – Field(s) to include in the JSON output.

  • exclude – Field(s) to exclude from the JSON output.

  • context – Additional context to pass to the serializer.

  • by_alias – Whether to serialize using field aliases.

  • exclude_unset – Whether to exclude fields that have not been explicitly set.

  • exclude_defaults – Whether to exclude fields that are set to their default value.

  • exclude_none – Whether to exclude fields that have a value of None.

  • exclude_computed_fields – Whether to exclude computed fields. While this can be useful for round-tripping, it is usually recommended to use the dedicated round_trip parameter instead.

  • round_trip – If True, dumped values should be valid as input for non-idempotent types such as Json[T].

  • warnings – How to handle serialization errors. False/"none" ignores them, True/"warn" logs errors, "error" raises a pydantic_core.PydanticSerializationError.

  • fallback – A function to call when an unknown value is encountered. If not provided, a pydantic_core.PydanticSerializationError is raised.

  • serialize_as_any – Whether to serialize fields with duck-typing serialization behavior.

Returns:

A JSON string representation of the model.

model_dump(*args, **kwargs)[source]#

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

Parameters:
  • mode – The mode in which to_python should run. If mode is ‘json’, the output will only contain JSON serializable types. If mode is ‘python’, the output may contain non-JSON-serializable Python objects.

  • include – A set of fields to include in the output.

  • exclude – A set of fields to exclude from the output.

  • context – Additional context to pass to the serializer.

  • by_alias – Whether to use the field’s alias in the dictionary key if defined.

  • exclude_unset – Whether to exclude fields that have not been explicitly set.

  • exclude_defaults – Whether to exclude fields that are set to their default value.

  • exclude_none – Whether to exclude fields that have a value of None.

  • exclude_computed_fields – Whether to exclude computed fields. While this can be useful for round-tripping, it is usually recommended to use the dedicated round_trip parameter instead.

  • round_trip – If True, dumped values should be valid as input for non-idempotent types such as Json[T].

  • warnings – How to handle serialization errors. False/"none" ignores them, True/"warn" logs errors, "error" raises a pydantic_core.PydanticSerializationError.

  • fallback – A function to call when an unknown value is encountered. If not provided, a pydantic_core.PydanticSerializationError is raised.

  • serialize_as_any – Whether to serialize fields with duck-typing serialization behavior.

Returns:

A dictionary representation of the model.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'strict': False, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

pydna.parsers module#

Provides functions for parsing sequence data, chiefly parse and parse_primers.

pydna.parsers.extract_from_text(text)[source]#

docstring.

pydna.parsers.embl_gb_fasta(text)[source]#

Parse embl, genbank or fasta format from text.

Returns list of Bio.SeqRecord.SeqRecord

annotations[“molecule_type”] annotations[“topology”]

pydna.parsers.parse(data, ds=True) list[Dseqrecord | SeqRecord][source]#

Return all DNA sequences found in data.

If no sequences are found, an empty list is returned. This is a greedy function, use carefully.

Parameters:
  • data (string or iterable) –

    The data parameter can be one of the following:

    1. An absolute path to a local file, given as a string or a Path object. The file is read in text mode and parsed for EMBL, FASTA and Genbank sequences.

    2. A string containing one or more sequences in EMBL, Genbank or FASTA format. Mixed formats are allowed.

    3. A list or other iterable whose elements are each of kind 1 or 2.

  • ds (bool) – If True, double-stranded Dseqrecord objects are returned. If False, single-stranded Bio.SeqRecord [20] objects are returned.

Returns:

A list containing Dseqrecord or SeqRecord objects.

Return type:

list

References

See also

read

pydna.parsers.parse_primers(data)[source]#

docstring.

pydna.parsers.parse_snapgene(file_path: str) list[Dseqrecord][source]#

Parse a SnapGene file and return its contents as a list of Dseqrecord objects.

Parameters:

file_path (str) – The path to the SnapGene file to parse.

Returns:

The sequences parsed from the SnapGene file.

Return type:

list[Dseqrecord]

pydna.primer module#

This module provides the Primer class, which is a subclass of the Biopython SeqRecord.

class pydna.primer.Primer(record, *args, amplicon=None, position=None, footprint=0, **kwargs)[source]#

Bases: SeqRecord

Primer and its position on a template, footprint and tail.

property footprint#
property tail#
reverse_complement(*args, **kwargs)[source]#

Return the reverse complement of the sequence.

pydna.primer_screen module#

Fast primer screening#

This module provides fast primer screening using the Aho-Corasick string-search algorithm. It is useful for PCR diagnostic purposes when given a list of primers and a single sequence or list of sequences to analyze.

The primer list can consist of Primer objects returned by pydna.parsers.parse_primers() or any objects with a seq attribute, such as pydna.seqrecord.SeqRecord or Bio.SeqRecord.SeqRecord.

The Aho-Corasick algorithm efficiently finds all occurrences of a set of sequences within a larger text. If the same primer list is used repeatedly, creating an automaton greatly speeds up repeated searches. See make_automaton() for information on creating, saving, and loading such automata.

Functions#

References

Aho-Corasick algorithm:

https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

This module uses pyahocorasick:

Documentation: https://pyahocorasick.readthedocs.io/en/latest

GitHub: WojciechMula/pyahocorasick

PyPI: https://pypi.python.org/pypi/pyahocorasick

class pydna.primer_screen.amplicon_tuple(fp, rp, fposition, rposition, size)#

Bases: tuple

fp#

Alias for field number 0

fposition#

Alias for field number 2

rp#

Alias for field number 1

rposition#

Alias for field number 3

size#

Alias for field number 4

class pydna.primer_screen.primer_tuple(seq, fp, rp, size)#

Bases: tuple

fp#

Alias for field number 1

rp#

Alias for field number 2

seq#

Alias for field number 0

size#

Alias for field number 3
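
The field numbers listed for amplicon_tuple and primer_tuple match what collections.namedtuple produces. The sketch below rebuilds stand-ins with the documented field order to show that each field can be read by name or by index. The stand-in definitions and all values are illustrative assumptions, not the library's own code or real screening output.

```python
from collections import namedtuple

# Stand-ins mirroring the documented field order (not imported from pydna).
amplicon_tuple = namedtuple(
    "amplicon_tuple", ["fp", "rp", "fposition", "rposition", "size"]
)
primer_tuple = namedtuple("primer_tuple", ["seq", "fp", "rp", "size"])

# Made-up values, used only to show how the fields are accessed.
hit = amplicon_tuple(fp=3, rp=7, fposition=120, rposition=620, size=500)

# Access by name and by index refer to the same value.
print(hit.fp, hit[0])      # field number 0
print(hit.size, hit[4])    # field number 4

# Tuples unpack positionally in the documented order.
fp, rp, fposition, rposition, size = hit
print(primer_tuple._fields)
```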

pydna.primer_screen.closest_diff(nums: list[int]) int[source]#

Smallest difference between two consecutive integers in a sorted list.

Given a list of integers, e.g. 1, 5, 7, 11, 19, return the smallest absolute difference between neighbouring values, in this case 7 - 5 = 2.

>>> closest_diff([1, 5, 7, 11, 19])
2
Parameters:

nums (list[int]) – List of integers.

Raises:

ValueError – At least two numbers are required.

Returns:

The smallest difference, always >= 0.

Return type:

int
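
The behaviour described above can be sketched in a few lines. This is a stand-alone illustration, not the pydna implementation: the name closest_diff_sketch is hypothetical, and sorting the input first is an assumption made so that the function also works on unsorted lists.

```python
def closest_diff_sketch(nums: list[int]) -> int:
    """Smallest difference between neighbouring values after sorting."""
    if len(nums) < 2:
        raise ValueError("At least two numbers are required.")
    ordered = sorted(nums)  # assumption: sort defensively
    # Compare each value with its successor and keep the smallest gap.
    return min(b - a for a, b in zip(ordered, ordered[1:]))


print(closest_diff_sketch([1, 5, 7, 11, 19]))
```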

pydna.primer_screen.expand_iupac_to_dna(seq: str) list[str][source]#

Expand an extended IUPAC DNA string into the unambiguous IUPAC nucleotide alphabet.

Expands a string containing extended IUPAC code (ACGTURYSWKMBDHVN) including U for uracil into all possible DNA strings using only AGCT.

Returns a list of strings.

Example:

>>> expand_iupac_to_dna("ATNG")
['ATGG', 'ATAG', 'ATTG', 'ATCG']
>>> x = expand_iupac_to_dna("ACGTURYSWKMBDHVN")
>>> len(x)
20736
Parameters:

seq (str) – String containing extended IUPAC DNA.

Returns:

List of strings in unambiguous IUPAC nucleotide alphabet.

Return type:

list[str]
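
One way to picture the expansion is as a Cartesian product over the bases each IUPAC code stands for. The sketch below is an independent illustration, not the library's code: the name expand_sketch and the lookup table are assumptions, U is assumed to map to T, and the order of the returned strings may differ from the order pydna produces.

```python
from itertools import product

# Bases represented by each extended IUPAC code (U treated as T here).
IUPAC_TO_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_sketch(seq: str) -> list[str]:
    """Return every unambiguous DNA string matching an IUPAC string."""
    choices = [IUPAC_TO_DNA[base] for base in seq.upper()]
    # product() picks one concrete base per position in every combination.
    return ["".join(combo) for combo in product(*choices)]


print(sorted(expand_sketch("ATNG")))
print(len(expand_sketch("ACGTURYSWKMBDHVN")))
```

The count for the second call is the product of the number of choices per position: five single-base codes, six two-base codes, four three-base codes and N, giving 2^6 × 3^4 × 4 = 20736.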

pydna.primer_screen.make_automaton(primer_list: Sequence[Primer | None], limit: str = 16) Automaton[source]#

Aho-Corasick automaton for a list of primers.

The automaton can be built ahead of primer screening from a list of Primer objects, which makes the primer search faster.

This automaton can be reused as an optional argument across calls to forward_primers(), reverse_primers(), primer_pairs(), flanking_primer_pairs(), diff_primer_pairs(), and diff_primer_triplets().

The primer list can contain None. This can be used to leave primers out of the automaton while keeping the original index of every other primer in primer_list.

The limit is the part of the primer used to find annealing positions. The automaton processes the uppercase 3’ part of each primer up to limit. It has to be rebuilt if a different limit is needed.

The primers can contain ambiguous bases from the extended IUPAC DNA alphabet.

The automaton can be saved and loaded like this (from the pyahocorasick docs):

import pickle
import ahocorasick
from pydna.primer_screen import make_automaton, forward_primers

# build the automaton from an existing primer_list
atm = make_automaton(primer_list, limit=16)

# save the automaton
atm.save("atm.automaton", pickle.dumps)

# load the automaton from the same path
atm = ahocorasick.load("atm.automaton", pickle.loads)

# use the automaton; template is the Dseqrecord to screen
fps = forward_primers(template, primer_list, automaton=atm)
Parameters:
  • primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object with a seq property such as Bio.SeqRecord.SeqRecord.

  • limit (int, optional) – The length of the 3’-end part of each primer that has to anneal. The default is 16.

Returns:

pyahocorasick automaton made for the list of Primer objects.

Return type:

ahocorasick.Automaton
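
The roles of limit and of None entries can be pictured without pyahocorasick. The sketch below only derives the search key for each primer. It assumes that the key is the last limit bases at the 3'-end converted to uppercase, and that primers are plain strings. The name annealing_keys is hypothetical, and the real function builds an ahocorasick.Automaton rather than a dict.

```python
from typing import Optional, Sequence


def annealing_keys(
    primers: Sequence[Optional[str]], limit: int = 16
) -> dict[int, str]:
    """Map each primer index to the 3'-end key used for searching."""
    keys: dict[int, str] = {}
    for index, primer in enumerate(primers):
        if primer is None:
            # A None placeholder is skipped, so the indices of the
            # remaining primers stay the same as in the original list.
            continue
        keys[index] = primer[-limit:].upper()
    return keys


primers = ["acgtacgtacgtacgtacgt", None, "ttttGGGG"]
print(annealing_keys(primers, limit=4))
```

Because the keys depend on limit, any structure built from them has to be rebuilt when a different limit is wanted, which matches the note above.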

pydna.primer_screen.callback(a: int, b: int) bool[source]#

PCR product sizes quality control.

This function accepts two integers representing PCR product sizes and returns True or False indicating the ease with which the size differences can be distinguished on a typical agarose gel.

Parameters:
  • a (int) – One size.

  • b (int) – Another size.

Returns:

True if the two sizes can be readily distinguished, False otherwise.

Return type:

bool

pydna.primer_screen.forward_primers(seq: Dseqrecord, primer_list: Sequence[Primer | None], limit: int = 16, automaton: Automaton = None) dict[int, list[int]][source]#

Forward primers from primer_list annealing to seq with at least limit base pairs.

The optional automaton can speed up the primer search if the same primer list is often used, see make_automaton() for more information.

The resulting dict has the form:

{ primer_A_index : [location1, location2, ...]
  primer_B_index : [location1, location2, ...] }

Where a key such as primer_A_index (integer) is the index for a primer in primer_list and the value is a list of locations (integers) where the primer binds.

The concept of location is the same as used in pydna.primer. The forward primer in the figure below anneals at position 14 on the template.

5-gtcatgatctagtcgatgtta-3
  |||||||||||||||||||||

        5'-tagtcg-3' = forward primer, location = 14
           ||||||
  |||||||||||||||||||||
3-cagtactagatcagctacaat-5
                |
  012345678911111111112 position
            01234567890
Parameters:
  • seq (Dseqrecord) – Target sequence to find primer annealing positions.

  • primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object with a seq property such as Bio.SeqRecord.SeqRecord.

  • limit (int, optional) – This is the length of the part at the 3’-end of each primer that has to anneal. The default is 16.

  • automaton (ahocorasick.Automaton, optional) – Automaton made with the make_automaton(). The default is None.

Returns:

Dict of lists where keys are primer indices in primer_list and values are lists with primer locations.

Return type:

dict[int, list[int]]
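
The numbers in the figure can be reproduced with a plain string search. This is a minimal sketch, assuming (as the figure suggests) that the location of a forward primer is the 0-based index of the template base that pairs with the primer's 3’ end. The function name forward_locations is hypothetical and plain strings stand in for Dseqrecord and Primer objects.

```python
def forward_locations(template: str, primer: str, limit: int = 16) -> list[int]:
    """Find where the 3' footprint of a forward primer matches the top strand."""
    footprint = primer[-limit:].lower()
    template = template.lower()
    hits = []
    start = template.find(footprint)
    while start != -1:
        # report the index of the base under the primer's 3' end
        hits.append(start + len(footprint) - 1)
        start = template.find(footprint, start + 1)
    return hits


print(forward_locations("gtcatgatctagtcgatgtta", "tagtcg", limit=6))  # [14]
```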

pydna.primer_screen.reverse_primers(seq: Dseqrecord, primer_list: list[Primer] | tuple[Primer], limit: int = 16, automaton: Automaton = None) dict[int, list[int]][source]#

Primers from primer_list annealing in reverse to seq with at least limit base pairs.

The optional automaton can speed up the primer search if the same primer list is often used, see make_automaton() for more information.

The resulting dict has the form:

{ primer_A_index : [location1, location2, ...],
  primer_B_index : [location1, location2, ...] }

Where a key such as primer_A_index (integer) is the index for a primer in primer_list and the value is a list of locations (integers) where the primer binds.

The concept of location is the same as used in pydna.primer. The reverse primer below anneals at position 9.

5-gtcatgatctagtcgatgtta-3
  |||||||||||||||||||||
           ||||||
         3-atcagc-5 = reverse primer, location = 9

  |||||||||||||||||||||
3-cagtactagatcagctacaat-5
           |
  012345678911111111112 position
            01234567890
Parameters:
  • seq (Dseqrecord) – Target sequence to find primer annealing positions.

  • primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object with a seq property such as Bio.SeqRecord.SeqRecord.

  • limit (int, optional) – This is the length of the part at the 3’-end of each primer that has to anneal. The default is 16.

  • automaton (ahocorasick.Automaton, optional) – Automaton made with the make_automaton(). The default is None.

Returns:

Dict of lists where keys are primer indices in primer_list and values are lists with primer locations.

Return type:

dict[int, list[int]]
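
For a reverse primer, the search runs on the reverse complement of the primer's 3’ part. The sketch below reproduces the figure under the assumption that the location is the 0-based index of the top-strand base opposite the primer's 3’ end. The primer is written 5’ to 3’ (cgacta, shown as 3-atcagc-5 in the figure). The helper names are hypothetical and ambiguous bases are not handled.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def reverse_locations(template: str, primer: str, limit: int = 16) -> list[int]:
    """Find where the 3' footprint of a reverse primer anneals, in top-strand coordinates."""
    footprint = reverse_complement(primer[-limit:]).lower()
    template = template.lower()
    hits = []
    start = template.find(footprint)
    while start != -1:
        # the primer's 3' end pairs with the first base of the match
        hits.append(start)
        start = template.find(footprint, start + 1)
    return hits


print(reverse_locations("gtcatgatctagtcgatgtta", "cgacta", limit=6))  # [9]
```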

pydna.primer_screen.primer_pairs(seq: Dseqrecord, primer_list: list[Primer] | tuple[Primer], short: int = 500, long: int = 2000, limit: int = 16, automaton: Automaton = None) list[amplicon_tuple[int, int, int, int, int]][source]#

Primer pairs that form PCR products larger than short and smaller than long.

The PCR product size includes the PCR primers. Only unique primer pairs are returned. This means that the forward and reverse primers can only bind in one position on the template each.

If you suspect that primers bind on multiple locations, use the forward_primers() and reverse_primers() functions.

The function returns a list of flat named 5-tuples of integers with this form:

[
 (index_fp1, index_rp1, position_fp1, position_rp1, size1),
 (index_fp2, index_rp2, position_fp2, position_rp2, size2),
]

The indices are the primer_list indices and positions are the positions of the primers as described in forward_primers() and reverse_primers() functions. The size includes the length of each primer, so it is the true total length of the PCR product.

Parameters:
  • seq (Dseqrecord) – Target sequence to find primer annealing positions.

  • primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object with a seq property such as Bio.SeqRecord.SeqRecord.

  • limit (int, optional) – This is the length of the part at the 3’-end of each primer that has to anneal. The default is 16.

  • short (int, optional) – Lower limit for the size of the PCR products. The default is 500.

  • long (int, optional) – Upper limit for the size of the PCR products. The default is 2000.

  • automaton (ahocorasick.Automaton, optional) – Automaton made with the make_automaton(). The default is None.

Returns:

List of tuples (index_fp, position_fp, index_rp, position_rp, size)

Return type:

list[tuple[int, int, int, int, int]]
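
The size window can be sketched as a simple filter over named tuples. The Amplicon type, its field names and their order are hypothetical stand-ins for illustration. The comparison is strict on both sides, following the wording "larger than short and smaller than long".

```python
from collections import namedtuple

# Hypothetical stand-in for the named 5-tuple; field names are illustrative only.
Amplicon = namedtuple("Amplicon", "index_fp index_rp position_fp position_rp size")


def within_size_window(amplicons, short=500, long=2000):
    """Keep amplicons strictly larger than short and strictly smaller than long."""
    return [a for a in amplicons if short < a.size < long]


candidates = [
    Amplicon(0, 1, 120, 600, 500),
    Amplicon(0, 2, 120, 601, 501),
    Amplicon(3, 4, 50, 2030, 1999),
    Amplicon(3, 5, 50, 2031, 2000),
]
kept = within_size_window(candidates)
print([a.size for a in kept])  # [501, 1999]
```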

pydna.primer_screen.flanking_primer_pairs(seq: Dseqrecord, primer_list: list[Primer] | tuple[Primer], target: tuple[int, int], limit: int = 16, automaton: Automaton = None) list[amplicon_tuple[int, int, int, int, int]][source]#

Primer pairs that flank a target position (begin..end). This means that forward primers have to bind before or at the begin position and reverse primers have to bind at or after the end position.

The function returns a list of the same flat 5-namedtuples of integers returned from the primer_pairs() function.

[
 (index_fp1, position_fp1, index_rp1, position_rp1, size1),
 (index_fp2, position_fp2, index_rp2, position_rp2, size2),
 ]
Parameters:
  • seq (Dseqrecord) – Target sequence to find primer annealing positions.

  • primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object with a seq property such as Bio.SeqRecord.SeqRecord.

  • target (tuple[int, int]) – Start and stop position for target sequence.

  • limit (int, optional) – This is the length of the part at the 3’-end of each primer that has to anneal. The default is 16.

  • automaton (ahocorasick.Automaton, optional) – Automaton made with the make_automaton(). The default is None.

Returns:

List of tuples (index_fp, position_fp, index_rp, position_rp, size).

Return type:

list[tuple[int, int, int, int, int]]
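
The flanking condition described above reduces to two comparisons. The sketch below is illustrative only: the function name flanks_target is hypothetical and positions are treated as plain integers in top-strand coordinates.

```python
def flanks_target(position_fp: int, position_rp: int, target: tuple[int, int]) -> bool:
    """True if the forward primer binds at or before begin and the reverse primer at or after end."""
    begin, end = target
    return position_fp <= begin and position_rp >= end


target = (1000, 1500)
print(flanks_target(950, 1600, target))   # True: both primers lie outside the target
print(flanks_target(1100, 1600, target))  # False: forward primer binds inside the target
```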

pydna.primer_screen.diff_primer_pairs(sequences: list[Dseqrecord] | tuple[Dseqrecord], primer_list: list[Primer] | tuple[Primer], short: int = 500, long: int = 1500, limit: int = 16, automaton: Automaton = None, callback: Callable[[list], bool] = callback) tuple[tuple[Dseqrecord, int, int, int]][source]#

Primer pairs for diagnostic PCR.

Given an iterable of sequences and a primer list, primers are selected that result in unique product sizes from each of the input sequences.

Primers 1 and 2 both form PCR products from sequenceA and B below, but of different sizes. Primers 1 and 2 could be used to verify genetic modifications such as cloning an insert into a plasmid vector.

 1>              <2
-------NNNNNNNNN----  sequenceA

 1>           <2
-------XXXXX--------  sequenceB

The callback function returns True or False for the sizes of the PCR products. It is meant to filter for PCR products that are likely to migrate to sufficiently distinct locations to be distinguishable on a typical agarose gel.

Only products larger than short and smaller than long are returned.

The example below shows the output for two sequences (Dseqrecord(-3308), Dseqrecord(-3613)). Primers 501 and 1806 would yield a 933 bp product with the 3308 bp sequence and the same primer pair would give a 1212 bp product with the 3613 bp sequence.

A list of tuples is returned, where each tuple holds one named 4-tuple (Sequence, forward_primer, reverse_primer, size_bp) for each sequence in the input argument.

[
    ((Dseqrecord(-3308), 501, 1806, 933), (Dseqrecord(-3613), 501, 1806, 1212)),
]
Parameters:
  • sequences (list[Dseqrecord] | tuple[Dseqrecord]) – Target sequences in which to find primer annealing positions.

  • primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object with a seq property such as Bio.SeqRecord.SeqRecord.

  • limit (int, optional) – This is the length of the part at the 3’-end of each primer that has to anneal. The default is 16.

  • short (int, optional) – Lower limit for the size of the PCR products. The default is 500.

  • long (int, optional) – Upper limit for the size of the PCR products. The default is 1500.

  • automaton (ahocorasick.Automaton, optional) – Automaton made with the make_automaton(). The default is None.

  • callback (callable[[list], bool], optional) – A function accepting a list of integers and returning True or False. The default is callback.

Returns:

(Sequence, forward_primer, reverse_primer, size_bp)

Return type:

list[tuple[Dseqrecord, int, int, int]]
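
The selection logic can be sketched with plain dictionaries. This is a simplified illustration, not the pydna implementation: each input sequence is represented by a hypothetical mapping from a (forward, reverse) primer index pair to a product size, and the predicate is_distinct is an assumed two-argument size check.

```python
from itertools import combinations


def pick_diagnostic_pairs(products_per_sequence, is_distinct):
    """Keep primer pairs that give a product on every sequence, with pairwise distinct sizes."""
    # products_per_sequence must be non-empty: one dict per input sequence
    shared = set.intersection(*(set(p) for p in products_per_sequence))
    result = []
    for pair in sorted(shared):
        sizes = [p[pair] for p in products_per_sequence]
        if all(is_distinct(a, b) for a, b in combinations(sizes, 2)):
            result.append((pair, sizes))
    return result


def differ_by_20_percent(a, b):
    """Assumed criterion: sizes differ by at least 20 % of the larger one."""
    return abs(a - b) >= 0.2 * max(a, b)


products = [
    {(501, 1806): 933, (10, 20): 700},   # products on the first sequence
    {(501, 1806): 1212, (10, 20): 710},  # products on the second sequence
]
print(pick_diagnostic_pairs(products, differ_by_20_percent))
# [((501, 1806), [933, 1212])]
```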

pydna.primer_screen.diff_primer_triplets(sequences: list[Dseqrecord] | tuple[Dseqrecord], primer_list: list[Primer] | tuple[Primer], limit: int = 16, short: int = 500, long: int = 1500, automaton: Automaton = None, callback: Callable[[list], bool] = callback) tuple[tuple[tuple[Dseqrecord, int, int, int]]][source]#

Primer triplets for diagnostic PCR.

Given a list of sequences and a primer list, primer triplets are selected that result in PCR products of different sizes from each of the input sequences.

Primers 1, 2 and 3 form PCR products from sequenceA and B below, but of different sizes. Primer 1 binds both sequences while primers 2 and 3 bind one sequence each. This primer triplet could be used to verify genetic modifications.

 1>        <2
-------NNNNNNNNN----  sequenceA

 1>     <3
-------XXXXX--------  sequenceB

The callback function returns True or False for the PCR products. It is used to decide whether a collection of PCR products is likely to migrate to distinct locations on a typical agarose gel.

Only products larger than short and smaller than long are returned.

The example below shows the output for two sequences, [Dseqrecord(-7664), Dseqrecord(-3613)]. Primer pair 701, 700 would produce a 724 bp product with the 7664 bp sequence while the primer pair 701, 1564 would give a 1450 bp product with the 3613 bp sequence.

[
    ((Dseqrecord(-7664), 701, 700, 724), (Dseqrecord(-3613), 701, 1564, 1450)),
 ]
Parameters:
  • sequences (list[Dseqrecord] | tuple[Dseqrecord]) – Target sequences in which to find primer annealing positions.

  • primer_list (list[Primer] | tuple[Primer]) – This is a list of pydna.primer.Primer objects or any object with a seq property such as Bio.SeqRecord.SeqRecord.

  • limit (int, optional) – This is the length of the part at the 3’-end of each primer that has to anneal. The default is 16.

  • short (int, optional) – Lower limit for the size of the PCR products. The default is 500.

  • long (int, optional) – Upper limit for the size of the PCR products. The default is 1500.

  • automaton (ahocorasick.Automaton, optional) – Automaton made with the make_automaton(). The default is None.

  • callback (callable[[list], bool], optional) – A function accepting a list of integers and returning True or False. The default is callback.

Returns:

(Sequence, forward_primer, reverse_primer, size_bp)

Return type:

list[tuple[Dseqrecord, int, int, int]]

pydna.readers module#

Provides two functions, read and read_primer.

pydna.readers.read(data, ds=True)[source]#

This function is similar to the parse() function but expects one and only one sequence; otherwise an exception is raised.

Parameters:
  • data (string) – see below

  • ds (bool) – Double stranded or single stranded DNA, if True return Dseqrecord objects, else Bio.SeqRecord objects.

Returns:

The first Dseqrecord or SeqRecord object parsed.

Return type:

Dseqrecord

Notes

The data parameter is similar to the data parameter for parse().

See also

parse
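
The relationship between a parse-style function (returns all records) and a read-style function (insists on exactly one) can be sketched in plain Python. The minimal FASTA parser and the names parse_fasta and read_one are hypothetical and only illustrate the pattern; they are not the pydna implementation and handle no other formats.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return all (name, sequence) records found in FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def read_one(text: str) -> tuple[str, str]:
    """Return the single record in text, raising ValueError for zero or several records."""
    records = parse_fasta(text)
    if len(records) != 1:
        raise ValueError(f"expected exactly one sequence, found {len(records)}")
    return records[0]


print(read_one(">a\nacgt\n"))  # ('a', 'acgt')
```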

pydna.readers.read_primer(data)[source]#

Use this function to read a primer sequence from a string or a local file. The usage is similar to the parse_primers() function.

pydna.seq module#

A subclass of Biopython Bio.Seq.Seq

Has a number of extra methods and uses the pydna._pretty_str.pretty_str class instead of str for a nicer output in the IPython shell.

class pydna.seq.Seq(data: str | bytes | bytearray | _SeqAbstractBaseClass | SequenceDataAbstractBaseClass | dict | None, length: int | None = None)[source]#

Bases: Seq

docstring.

translate(table: [str, int] = "Standard", stop_symbol: [str] = "*", to_stop: bool = False, cds: bool = False, gap: str = "-") Seq[source]#

Translate into protein.

The table argument is the name of a codon table (string). These names can be for example “Standard” or “Alternative Yeast Nuclear” for the yeast CUG clade where the CUG codon is translated as serine instead of the standard leucine.

Over forty translation tables are available from the Biopython Bio.Data.CodonTable module. Look at the keys of the dictionary CodonTable.ambiguous_generic_by_name. These are based on tables in this file provided by NCBI:

https://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt

Standard table

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

Parameters:
  • table ([str, int], optional) – The default is “Standard”. Can be a table id integer, see here for table numbering https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

  • stop_symbol ([str], optional) – The default is “*”. Single character string to indicate translation stop.

  • to_stop (bool, optional) – The default is False. True means that translation terminates at the first in frame stop codon. False translates to the end.

  • cds (bool, optional) – The default is False. If True, checks that the sequence starts with a valid alternative start codon, that the sequence length is a multiple of three, and that there is a single in frame stop codon at the end. If these tests fail, an exception is raised.

  • gap (str, optional) – The default is “-”.

Returns:

A Biopython Seq object with the translated amino acid code.

Return type:

Bio.Seq.Seq
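
The effect of stop_symbol and to_stop can be illustrated with a pure Python sketch of the standard table. This is not the Biopython or pydna implementation: it supports only the standard code, ignores the cds and gap options, and raises KeyError on ambiguous bases. The function name translate_standard is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard code, in TCAG order for each codon position
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_standard(dna: str, stop_symbol: str = "*", to_stop: bool = False) -> str:
    """Translate DNA or RNA with the standard table, codon by codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    # trailing bases that do not form a complete codon are ignored
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            if to_stop:
                break
            aa = stop_symbol
        protein.append(aa)
    return "".join(protein)


coding_dna = "GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate_standard(coding_dna))                   # VAIVMGR*KGAR*
print(translate_standard(coding_dna, stop_symbol="@"))  # VAIVMGR@KGAR@
print(translate_standard(coding_dna, to_stop=True))     # VAIVMGR
```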

transcribe() Seq[source]#

Transcribe a DNA sequence into RNA and return the RNA sequence as a new Seq object.

gc() float[source]#

Return GC content.

cai(organism: str = 'sce') float[source]#

docstring.

rarecodons(organism: str = 'sce') List[slice][source]#

docstring.

startcodon(organism: str = 'sce') float | None[source]#

docstring.

stopcodon(organism: str = 'sce') float | None[source]#

docstring.

express(organism: str = 'sce') PrettyTable[source]#

docstring.

orfs2(minsize: int = 30) List[str][source]#

docstring.

orfs(minsize: int = 100) List[Tuple[int, int]][source]#
seguid() str[source]#

Url safe SEGUID [21] for the sequence.

This checksum is the same as seguid but with base64.urlsafe encoding instead of the normal base64. This means that the characters + and / are replaced with - and _ so that the checksum can be part of a URL.

Examples

>>> from pydna.seq import Seq
>>> a = Seq("aa")
>>> a.seguid()
'lsseguid=gBw0Jp907Tg_yX3jNgS4qQWttjU'

References

reverse_complement()[source]#

Return the reverse complement as a DNA sequence.

>>> Seq("CGA").reverse_complement()
Seq('TCG')

Any U in the sequence is treated as a T:

>>> Seq("CGAUT").reverse_complement()
Seq('AATCG')

In contrast, reverse_complement_rna returns an RNA sequence:

>>> Seq("CGA").reverse_complement_rna()
Seq('UCG')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement()
MutableSeq('TCG')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement(inplace=True)
MutableSeq('TCG')
>>> my_seq
MutableSeq('TCG')

As Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

rc()#

Return the reverse complement as a DNA sequence.

>>> Seq("CGA").reverse_complement()
Seq('TCG')

Any U in the sequence is treated as a T:

>>> Seq("CGAUT").reverse_complement()
Seq('AATCG')

In contrast, reverse_complement_rna returns an RNA sequence:

>>> Seq("CGA").reverse_complement_rna()
Seq('UCG')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement()
MutableSeq('TCG')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement(inplace=True)
MutableSeq('TCG')
>>> my_seq
MutableSeq('TCG')

As Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

class pydna.seq.ProteinSeq(data: str | bytes | bytearray | _SeqAbstractBaseClass | SequenceDataAbstractBaseClass | dict | None, length: int | None = None)[source]#

Bases: Seq

docstring.

translate()[source]#

Turn a nucleotide sequence into a protein sequence by creating a new sequence object.

This method will translate DNA or RNA sequences. It should not be used on protein sequences as any result will be biologically meaningless.

Parameters:
  • table – Which codon table to use? This can be either a name (string), an NCBI identifier (integer), or a CodonTable object (useful for non-standard genetic codes). This defaults to the “Standard” table.

  • stop_symbol – Single character string, what to use for terminators. This defaults to the asterisk, “*”.

  • to_stop – Boolean, defaults to False meaning do a full translation continuing on past any stop codons (translated as the specified stop_symbol). If True, translation is terminated at the first in frame stop codon (and the stop_symbol is not appended to the returned protein sequence).

  • cds – Boolean, indicates this is a complete CDS. If True, this checks the sequence starts with a valid alternative start codon (which will be translated as methionine, M), that the sequence length is a multiple of three, and that there is a single in frame stop codon at the end (this will be excluded from the protein sequence, regardless of the to_stop option). If these tests fail, an exception is raised.

  • gap – Single character string to denote symbol used for gaps. Defaults to the minus sign.

A Seq object is returned if translate is called on a Seq object; a MutableSeq object is returned if translate is called on a MutableSeq object.

e.g. Using the standard table:

>>> coding_dna = Seq("GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
>>> coding_dna.translate()
Seq('VAIVMGR*KGAR*')
>>> coding_dna.translate(stop_symbol="@")
Seq('VAIVMGR@KGAR@')
>>> coding_dna.translate(to_stop=True)
Seq('VAIVMGR')

Now using NCBI table 2, where TGA is not a stop codon:

>>> coding_dna.translate(table=2)
Seq('VAIVMGRWKGAR*')
>>> coding_dna.translate(table=2, to_stop=True)
Seq('VAIVMGRWKGAR')

In fact, GTG is an alternative start codon under NCBI table 2, meaning this sequence could be a complete CDS:

>>> coding_dna.translate(table=2, cds=True)
Seq('MAIVMGRWKGAR')

It isn’t a valid CDS under NCBI table 1, due to both the start codon and also the in frame stop codons:

>>> coding_dna.translate(table=1, cds=True)
Traceback (most recent call last):
    ...
Bio.Data.CodonTable.TranslationError: First codon 'GTG' is not a start codon

If the sequence has no in-frame stop codon, then the to_stop argument has no effect:

>>> coding_dna2 = Seq("TTGGCCATTGTAATGGGCCGC")
>>> coding_dna2.translate()
Seq('LAIVMGR')
>>> coding_dna2.translate(to_stop=True)
Seq('LAIVMGR')

NOTE - Ambiguous codons like “TAN” or “NNN” could be an amino acid or a stop codon. These are translated as “X”. Any invalid codon (e.g. “TA?” or “T-A”) will throw a TranslationError.

NOTE - This does NOT behave like the python string’s translate method. For that use str(my_seq).translate(…) instead

complement()[source]#

Return the complement as a DNA sequence.

>>> Seq("CGA").complement()
Seq('GCT')

Any U in the sequence is treated as a T:

>>> Seq("CGAUT").complement()
Seq('GCTAA')

In contrast, complement_rna returns an RNA sequence:

>>> Seq("CGAUT").complement_rna()
Seq('GCUAA')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.complement()
MutableSeq('GCT')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.complement(inplace=True)
MutableSeq('GCT')
>>> my_seq
MutableSeq('GCT')

As Seq objects are immutable, a TypeError is raised if complement is called on a Seq object with inplace=True.

complement_rna()[source]#

Return the complement as an RNA sequence.

>>> Seq("CGA").complement_rna()
Seq('GCU')

Any T in the sequence is treated as a U:

>>> Seq("CGAUT").complement_rna()
Seq('GCUAA')

In contrast, complement returns a DNA sequence by default:

>>> Seq("CGA").complement()
Seq('GCT')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.complement_rna()
MutableSeq('GCU')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.complement_rna(inplace=True)
MutableSeq('GCU')
>>> my_seq
MutableSeq('GCU')

As Seq objects are immutable, a TypeError is raised if complement_rna is called on a Seq object with inplace=True.

reverse_complement()[source]#

Return the reverse complement as a DNA sequence.

>>> Seq("CGA").reverse_complement()
Seq('TCG')

Any U in the sequence is treated as a T:

>>> Seq("CGAUT").reverse_complement()
Seq('AATCG')

In contrast, reverse_complement_rna returns an RNA sequence:

>>> Seq("CGA").reverse_complement_rna()
Seq('UCG')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement()
MutableSeq('TCG')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement(inplace=True)
MutableSeq('TCG')
>>> my_seq
MutableSeq('TCG')

As Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

rc()#

Return the reverse complement as a DNA sequence.

>>> Seq("CGA").reverse_complement()
Seq('TCG')

Any U in the sequence is treated as a T:

>>> Seq("CGAUT").reverse_complement()
Seq('AATCG')

In contrast, reverse_complement_rna returns an RNA sequence:

>>> Seq("CGA").reverse_complement_rna()
Seq('UCG')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement()
MutableSeq('TCG')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement(inplace=True)
MutableSeq('TCG')
>>> my_seq
MutableSeq('TCG')

As Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

reverse_complement_rna()[source]#

Return the reverse complement as an RNA sequence.

>>> Seq("CGA").reverse_complement_rna()
Seq('UCG')

Any T in the sequence is treated as a U:

>>> Seq("CGAUT").reverse_complement_rna()
Seq('AAUCG')

In contrast, reverse_complement returns a DNA sequence:

>>> Seq("CGA").reverse_complement()
Seq('TCG')

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement_rna()
MutableSeq('UCG')
>>> my_seq
MutableSeq('CGA')
>>> my_seq.reverse_complement_rna(inplace=True)
MutableSeq('UCG')
>>> my_seq
MutableSeq('UCG')

As Seq objects are immutable, a TypeError is raised if reverse_complement_rna is called on a Seq object with inplace=True.

transcribe()[source]#

Transcribe a DNA sequence into RNA and return the RNA sequence as a new Seq object.

Following the usual convention, the sequence is interpreted as the coding strand of the DNA double helix, not the template strand. This means we can get the RNA sequence just by switching T to U.

>>> from Bio.Seq import Seq
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> coding_dna.transcribe()
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')

The sequence is modified in-place and returned if inplace is True:

>>> sequence = MutableSeq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
>>> sequence
MutableSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> sequence.transcribe()
MutableSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> sequence
MutableSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> sequence.transcribe(inplace=True)
MutableSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> sequence
MutableSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')

As Seq objects are immutable, a TypeError is raised if transcribe is called on a Seq object with inplace=True.

Trying to transcribe an RNA sequence has no effect. If you have a nucleotide sequence which might be DNA or RNA (or even a mixture), calling the transcribe method will ensure any T becomes U.

Trying to transcribe a protein sequence will replace any T for Threonine with U for Selenocysteine, which has no biologically plausible rationale.

>>> from Bio.Seq import Seq
>>> my_protein = Seq("MAIVMGRT")
>>> my_protein.transcribe()
Seq('MAIVMGRU')
back_transcribe()[source]#

Return the DNA sequence from an RNA sequence by creating a new Seq object.

>>> from Bio.Seq import Seq
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')

The sequence is modified in-place and returned if inplace is True:

>>> sequence = MutableSeq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
>>> sequence
MutableSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> sequence.back_transcribe()
MutableSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> sequence
MutableSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
>>> sequence.back_transcribe(inplace=True)
MutableSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
>>> sequence
MutableSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')

As Seq objects are immutable, a TypeError is raised if back_transcribe is called on a Seq object with inplace=True.

Trying to back-transcribe DNA has no effect. If you have a nucleotide sequence which might be DNA or RNA (or even a mixture), calling the back-transcribe method will ensure any U becomes T.

Trying to back-transcribe a protein sequence will replace any U for Selenocysteine with T for Threonine, which is biologically meaningless.

>>> from Bio.Seq import Seq
>>> my_protein = Seq("MAIVMGRU")
>>> my_protein.back_transcribe()
Seq('MAIVMGRT')
seguid() str[source]#

Url safe SEGUID [22] for the sequence.

This checksum is the same as seguid but with base64.urlsafe encoding instead of the normal base64. This means that the characters + and / are replaced with - and _ so that the checksum can be part of a URL.

Examples

>>> from pydna.seq import ProteinSeq
>>> a = ProteinSeq("aa")
>>> a.seguid()
'lsseguid=gBw0Jp907Tg_yX3jNgS4qQWttjU'

References

molecular_weight() float[source]#
pI() float[source]#
instability_index() float[source]#

Instability index according to Guruprasad et al.

A value above 40 means the protein has a short half-life.

Guruprasad K., Reddy B.V.B., Pandit M.W. Protein Engineering 4:155-161(1990).

pydna.seqrecord module#

A subclass of the Biopython SeqRecord class.

Has a number of extra methods and uses the pydna._pretty_str.pretty_str class instead of str for a nicer output in the IPython shell.

class pydna.seqrecord.SeqRecord(seq, id='id', name='name', description='description', dbxrefs=None, features=None, annotations=None, letter_annotations=None)[source]#

Bases: SeqRecord

A subclass of the Biopython SeqRecord class.

Has a number of extra methods and uses the pydna._pretty_str.pretty_str class instead of str for a nicer output in the IPython shell.

classmethod from_Bio_SeqRecord(sr: SeqRecord)[source]#

Creates a pydna SeqRecord from a Biopython SeqRecord.

property locus#

Alias for name property.

property accession#

Alias for id property.

property definition#

Alias for description property.

reverse_complement(*args, **kwargs)[source]#

Return the reverse complement of the sequence.

rc(*args, **kwargs)#

Return the reverse complement of the sequence.

isorf(table=1)[source]#

Detect if sequence is an open reading frame (orf) in the 5’-3’ direction.

Translation tables are numbered according to the NCBI numbering [23].

Parameters:

table (int) – Sets the translation table, default is 1 (standard code)

Returns:

True if sequence is an orf, False otherwise.

Return type:

bool

Examples

>>> from pydna.seqrecord import SeqRecord
>>> a=SeqRecord("atgtaa")
>>> a.isorf()
True
>>> b=SeqRecord("atgaaa")
>>> b.isorf()
False
>>> c=SeqRecord("atttaa")
>>> c.isorf()
False
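
The three cases above can be reasoned through with a simplified check. The sketch below is an assumption-laden illustration, not the pydna implementation: it accepts only ATG as start codon, uses only the stop codons of the standard code, and ignores the table argument. The name looks_like_orf is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard code


def looks_like_orf(seq: str) -> bool:
    """True if seq starts with ATG, ends with a stop codon and has no internal stop."""
    s = seq.upper()
    if len(s) < 6 or len(s) % 3:
        return False
    codons = [s[i:i + 3] for i in range(0, len(s), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(c in STOP_CODONS for c in codons[1:-1])
    )


print(looks_like_orf("atgtaa"))  # True: start codon followed by a stop codon
print(looks_like_orf("atgaaa"))  # False: no stop codon at the end
print(looks_like_orf("atttaa"))  # False: ATT is not accepted as start codon
```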

References

translate()[source]#

docstring.

add_colors_to_features_for_ape()[source]#

Assign colors to features, compatible with the ApE editor.

add_feature(x=None, y=None, seq=None, type_='misc', strand=1, *args, **kwargs)[source]#

Add a feature of type misc to the feature list of the sequence.

Parameters:
  • x (int) – Indicates start of the feature

  • y (int) – Indicates end of the feature

Examples

>>> from pydna.seqrecord import SeqRecord
>>> a=SeqRecord("atgtaa")
>>> a.features
[]
>>> a.add_feature(2,4)
>>> a.features
[SeqFeature(SimpleLocation(ExactPosition(2),
                           ExactPosition(4),
                           strand=1),
            type='misc',
            qualifiers=...)]
list_features()[source]#

Print ASCII table with all features.

Examples

>>> from pydna.seq import Seq
>>> from pydna.seqrecord import SeqRecord
>>> a=SeqRecord(Seq("atgtaa"))
>>> a.add_feature(2,4)
>>> print(a.list_features())
+-----+---------------+-----+-----+-----+-----+------+------+
| Ft# | Label or Note | Dir | Sta | End | Len | type | orf? |
+-----+---------------+-----+-----+-----+-----+------+------+
|   0 | L:ft2         | --> | 2   | 4   |   2 | misc |  no  |
+-----+---------------+-----+-----+-----+-----+------+------+
extract_feature(n)[source]#

Extract feature and return a new SeqRecord object.

Parameters:

n (int) – Indicates the feature to extract

Examples

>>> from pydna.seqrecord import SeqRecord
>>> a=SeqRecord("atgtaa")
>>> a.add_feature(2,4)
>>> b=a.extract_feature(0)
>>> b
SeqRecord(seq=Seq('gt'), id='ft2', name='part_name',
          description='description', dbxrefs=[])
sorted_features()[source]#

Return a list of the features sorted by start position.

Examples

>>> from pydna.seqrecord import SeqRecord
>>> a=SeqRecord("atgtaa")
>>> a.add_feature(3,4)
>>> a.add_feature(2,4)
>>> print(a.features)
[SeqFeature(SimpleLocation(ExactPosition(3), ExactPosition(4),
                           strand=1),
            type='misc', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(2), ExactPosition(4),
                           strand=1),
            type='misc', qualifiers=...)]
>>> print(a.sorted_features())
[SeqFeature(SimpleLocation(ExactPosition(2), ExactPosition(4),
                           strand=1),
            type='misc', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(3), ExactPosition(4),
                           strand=1),
            type='misc', qualifiers=...)]
seguid()[source]#

Return the url safe SEGUID [24] for the sequence.

This checksum is the same as seguid but with base64.urlsafe encoding instead of the normal Base64. This means that the characters + and / are replaced with - and _ so that the checksum can be part of a URL or a filename.
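The character substitution described above can be illustrated with the standard library alone. This is a minimal sketch, assuming the checksum is a SHA-1 digest of the uppercased sequence; the function name `sha1_checksums` is hypothetical, and pydna's own `seguid()` adds a prefix and may differ in other details.

```python
import base64
import hashlib


def sha1_checksums(sequence):
    """Return (standard, url_safe) Base64 SHA-1 digests of an uppercased sequence."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    # A 20-byte digest encodes to 28 characters ending in one "=" pad, removed here.
    standard = base64.b64encode(digest).decode("ascii").rstrip("=")
    url_safe = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return standard, url_safe


standard, url_safe = sha1_checksums("gattaca")
# The two encodings differ only in the two substituted characters.
print(url_safe == standard.replace("+", "-").replace("/", "_"))
```

Because neither `+` nor `/` can occur in the URL-safe form, the result can be embedded in a path or query string without escaping.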

Examples

>>> from pydna.seqrecord import SeqRecord
>>> a=SeqRecord("gattaca")
>>> a.seguid() # original seguid is +bKGnebMkia5kNg/gF7IORXMnIU
'lsseguid=tp2jzeCM2e3W4yxtrrx09CMKa_8'

References

comment(newcomment='')[source]#

docstring.

datefunction()[source]#

docstring.

stamp(now=datefunction, tool='pydna', separator=' ', comment='')[source]#

Add seguid checksum to COMMENTS sections

The checksum is stored in object.annotations["comment"]. This shows in the COMMENTS section of a formatted genbank file.

For blunt linear sequences:

SEGUID <seguid>

For circular sequences:

cSEGUID <seguid>

For linear sequences which are not blunt:

lSEGUID <seguid>

Examples

>>> from pydna.seqrecord import SeqRecord
>>> a = SeqRecord("aa")
>>> a.stamp()
'lsseguid=gBw0Jp907Tg_yX3jNgS4qQWttjU'
>>> a.annotations["comment"][:41]
'pydna lsseguid=gBw0Jp907Tg_yX3jNgS4qQWttj'
lcs(other, *args, limit=25, **kwargs)[source]#

Return the longest common substring between the sequence and another sequence (other).

The other sequence can be a string, Seq, SeqRecord, Dseq or DseqRecord. The method returns a SeqFeature with type “read” as this method is mostly used to map sequence reads to the sequence. This can be changed by passing a type as keyword with some other string value.
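The idea of a longest common substring with a minimum length can be sketched with `difflib` from the standard library. This is an illustration only, not pydna's implementation: the name `longest_common_substring` is hypothetical, it works on plain strings, is case sensitive, and returns a `(start, end)` tuple rather than a SeqFeature.

```python
from difflib import SequenceMatcher


def longest_common_substring(seq, other, limit=25):
    """Return (start, end) of the longest shared substring in seq, or None."""
    # autojunk=False disables difflib's heuristic that ignores frequent characters.
    matcher = SequenceMatcher(None, seq, other, autojunk=False)
    match = matcher.find_longest_match(0, len(seq), 0, len(other))
    if match.size < limit:
        return None  # shared stretch is shorter than the requested minimum
    return match.a, match.a + match.size


print(longest_common_substring("GGATCC", "GATC", limit=4))   # (1, 5)
print(longest_common_substring("CCCCC", "GGATCC", limit=6))  # None
```

The `limit` argument plays the same role as in the method above: matches shorter than the limit are treated as no match.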

Examples

>>> from pydna.seqrecord import SeqRecord
>>> a = SeqRecord("GGATCC")
>>> a.lcs("GGATCC", limit=6)
SeqFeature(SimpleLocation(ExactPosition(0),
                          ExactPosition(6), strand=1),
                          type='read',
                          qualifiers=...)
>>> a.lcs("GATC", limit=4)
SeqFeature(SimpleLocation(ExactPosition(1),
                          ExactPosition(5), strand=1),
                          type='read',
                          qualifiers=...)
>>> a = SeqRecord("CCCCC")
>>> a.lcs("GGATCC", limit=6)
SeqFeature(None)
gc()[source]#

Return GC content.

cai(organism='sce')[source]#

docstring.

rarecodons(organism='sce')[source]#

docstring.

startcodon(organism='sce')[source]#

docstring.

stopcodon(organism='sce')[source]#

docstring.

express(organism='sce')[source]#

docstring.

copy()[source]#

docstring.

dump(filename, protocol=None)[source]#

docstring.

class pydna.seqrecord.ProteinSeqRecord(seq, id='id', name='name', description='description', dbxrefs=None, features=None, annotations=None, letter_annotations=None)[source]#

Bases: SeqRecord

reverse_complement(*args, **kwargs)[source]#

Return the reverse complement of the sequence.

rc(*args, **kwargs)#

Return the reverse complement of the sequence.

isorf(*args, **kwargs)[source]#

Detect if sequence is an open reading frame (orf) in the 5’-3’ direction.

Translation tables are numbered according to the NCBI numbering [25].

Parameters:

table (int) – Sets the translation table, default is 1 (standard code)

Returns:

True if sequence is an orf, False otherwise.

Return type:

bool

Examples

>>> from pydna.seqrecord import SeqRecord
>>> a=SeqRecord("atgtaa")
>>> a.isorf()
True
>>> b=SeqRecord("atgaaa")
>>> b.isorf()
False
>>> c=SeqRecord("atttaa")
>>> c.isorf()
False
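
The three examples can be reproduced with a simplified check. This is a sketch under stated assumptions, not the library's method: the name `looks_like_orf` is hypothetical, only the standard code is considered, and ATG is taken as the only start codon (alternative start codons are ignored).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard code


def looks_like_orf(seq):
    """True if seq is ATG, whole codons, and exactly one stop codon at the end."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3:
        return False  # too short, or not a whole number of codons
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(codon in STOP_CODONS for codon in codons[1:-1])
    )


for example in ("atgtaa", "atgaaa", "atttaa"):
    print(example, looks_like_orf(example))
```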

References

gc()[source]#

Return GC content.

cai(*args, **kwargs)[source]#

docstring.

rarecodons(*args, **kwargs)[source]#

docstring.

startcodon(*args, **kwargs)[source]#

docstring.

stopcodon(*args, **kwargs)[source]#

docstring.

express(*args, **kwargs)[source]#

docstring.

pydna.sequence_picker module#

pydna.sequence_picker.genbank_accession(s: str) Dseqrecord[source]#

docstring.

pydna.sequence_regex module#

pydna.sequence_regex.compute_regex_site(site: str) str[source]#

Creates a regex pattern from a string that may contain degenerate bases.

Parameters:

site – The string to convert to a regex pattern.

Returns:

The regex pattern.
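
A minimal sketch of how a site with degenerate bases can be turned into a regular expression, using the IUPAC nucleotide codes. The name `degenerate_to_regex` and the lookup table are illustrative assumptions; the pattern produced by `compute_regex_site` may be formatted differently.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def degenerate_to_regex(site):
    """Translate each base of site into a literal or a character class."""
    return "".join(IUPAC[base] for base in site.upper())


pattern = degenerate_to_regex("CCWGG")
print(pattern)                               # CC[AT]GG
print(bool(re.fullmatch(pattern, "CCAGG")))  # True
print(bool(re.fullmatch(pattern, "CCGGG")))  # False
```

In this sketch a character outside the table raises a KeyError, which makes typos in a site visible early.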

pydna.sequence_regex.dseqrecord_finditer(pattern: str, seq: Dseqrecord) list[Match][source]#

Finds all matches of a regex pattern in a Dseqrecord.

Parameters:
  • pattern – The regex pattern to search for.

  • seq – The Dseqrecord to search in.

Returns:

A list of matches.

pydna.threading_timer_decorator_exit module#

MIT License

Copyright (c) 2015 Aaron Hall

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

pydna.threading_timer_decorator_exit.cdquit(fn_name)[source]#
pydna.threading_timer_decorator_exit.exit_after(s)[source]#

Use as a decorator to exit the process if the function takes longer than s seconds.
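
A timeout decorator of this kind can be sketched with `threading.Timer`. This is an assumption-laden illustration, not the module's code: the name `timeout_after` is hypothetical, and the sketch interrupts the main thread with a KeyboardInterrupt rather than reproducing whatever exit behaviour `exit_after` implements.

```python
import _thread
import functools
import threading


def timeout_after(seconds):
    """Interrupt the main thread if the decorated call runs longer than seconds."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # The timer fires in a background thread and raises
            # KeyboardInterrupt in the main thread.
            timer = threading.Timer(seconds, _thread.interrupt_main)
            timer.daemon = True
            timer.start()
            try:
                return fn(*args, **kwargs)
            finally:
                timer.cancel()  # the call finished (or failed) in time
        return wrapper
    return decorator


@timeout_after(5)
def quick():
    return 42


print(quick())
```

Cancelling the timer in a `finally` clause ensures that a function which returns or raises early never triggers a late interrupt.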

pydna.threading_timer_decorator_exit.a(*args, **kwargs)[source]#
pydna.threading_timer_decorator_exit.b(*args, **kwargs)[source]#
pydna.threading_timer_decorator_exit.c(*args, **kwargs)[source]#
pydna.threading_timer_decorator_exit.d(*args, **kwargs)[source]#
pydna.threading_timer_decorator_exit.countdown(*args, **kwargs)[source]#
pydna.threading_timer_decorator_exit.main()[source]#

pydna.tm module#

This module provides functions for melting temperature calculations.

pydna.tm.tm_default(seq, check=True, strict=True, c_seq=None, shift=0, nn_table=mt.DNA_NN4, tmm_table=None, imm_table=None, de_table=None, dnac1=500 / 2, dnac2=500 / 2, selfcomp=False, Na=40, K=0, Tris=75.0, Mg=1.5, dNTPs=0.8, saltcorr=7, func=mt.Tm_NN)[source]#
pydna.tm.tm_dbd(seq, check=True, strict=True, c_seq=None, shift=0, nn_table=mt.DNA_NN3, tmm_table=None, imm_table=None, de_table=None, dnac1=250, dnac2=250, selfcomp=False, Na=50, K=0, Tris=0, Mg=1.5, dNTPs=0.8, saltcorr=1, func=mt.Tm_NN)[source]#
pydna.tm.tm_product(seq: str, K=0.050)[source]#

Tm calculation for the amplicon, according to:

Rychlik, Spencer, and Rhoads, 1990, Optimization of the annealing temperature for DNA amplification in vitro http://www.ncbi.nlm.nih.gov/pubmed/2243783

pydna.tm.ta_default(fp: str, rp: str, seq: str, tm=tm_default, tm_product=tm_product)[source]#

Ta calculation, according to:

Rychlik, Spencer, and Rhoads, 1990, Optimization of the annealing temperature for DNA amplification in vitro http://www.ncbi.nlm.nih.gov/pubmed/2243783

The formula described uses the length and GC content of the product and salt concentration (monovalent cations).
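
The calculation can be sketched as follows, assuming the commonly quoted form of the equations from the cited paper: the product Tm depends on GC percentage, product length and the monovalent cation concentration K (in mol/L), and Ta combines it with the lower of the two primer Tms. The function names are hypothetical and the primer Tms are passed in as plain numbers; the constants should be checked against the paper before relying on them.

```python
import math


def product_tm(seq, K=0.050):
    """Tm of the amplicon from GC percentage, length and monovalent salt (mol/L)."""
    gc_percent = 100 * sum(seq.upper().count(base) for base in "GC") / len(seq)
    return 81.5 + 16.6 * math.log10(K) + 0.41 * gc_percent - 675 / len(seq)


def annealing_temperature(tm_fp, tm_rp, seq, K=0.050):
    """Ta from the less stable primer and the product Tm."""
    return 0.3 * min(tm_fp, tm_rp) + 0.7 * product_tm(seq, K) - 14.9


amplicon = "GC" * 50 + "AT" * 50  # 200 bp, 50 % GC
print(round(product_tm(amplicon), 2))
print(round(annealing_temperature(55.0, 58.0, amplicon), 2))
```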

pydna.tm.ta_dbd(fp, rp, seq, tm=tm_dbd, tm_product=None)[source]#
pydna.tm.program(amplicon, tm=tm_default, ta=ta_default)[source]#

Returns a string containing a text representation of a suggested PCR program using Taq or similar polymerase.

|95°C|95°C               |    |tmf:59.5
|____|_____          72°C|72°C|tmr:59.7
|3min|30s  \ 59.1°C _____|____|60s/kb
|    |      \______/ 0:32|5min|GC 51%
|    |       30s         |    |1051bp
pydna.tm.taq_program(amplicon, tm=tm_default, ta=ta_default)#

Returns a string containing a text representation of a suggested PCR program using Taq or similar polymerase.

|95°C|95°C               |    |tmf:59.5
|____|_____          72°C|72°C|tmr:59.7
|3min|30s  \ 59.1°C _____|____|60s/kb
|    |      \______/ 0:32|5min|GC 51%
|    |       30s         |    |1051bp
pydna.tm.dbd_program(amplicon, tm=tm_dbd, ta=ta_dbd)[source]#

Text representation of a suggested PCR program.

Using a polymerase with a DNA binding domain such as Pfu-Sso7d.

|98°C|98°C               |    |tmf:53.8
|____|_____          72°C|72°C|tmr:54.8
|30s |10s  \ 57.0°C _____|____|15s/kb
|    |      \______/ 0:15|5min|GC 51%
|    |       10s         |    |1051bp

|98°C|98°C      |    |tmf:82.5
|____|____      |    |tmr:84.4
|30s |10s \ 72°C|72°C|15s/kb
|    |     \____|____|GC 52%
|    |      3:45|5min|15058bp
pydna.tm.pfu_sso7d_program(amplicon, tm=tm_dbd, ta=ta_dbd)#

Text representation of a suggested PCR program.

Using a polymerase with a DNA binding domain such as Pfu-Sso7d.

|98°C|98°C               |    |tmf:53.8
|____|_____          72°C|72°C|tmr:54.8
|30s |10s  \ 57.0°C _____|____|15s/kb
|    |      \______/ 0:15|5min|GC 51%
|    |       10s         |    |1051bp

|98°C|98°C      |    |tmf:82.5
|____|____      |    |tmr:84.4
|30s |10s \ 72°C|72°C|15s/kb
|    |     \____|____|GC 52%
|    |      3:45|5min|15058bp
pydna.tm.Q5(primer: str, *args, **kwargs)[source]#

For Q5, Ta is the lower of the two Tms plus 1 °C (up to 72 °C). For Phusion, Ta is the lower of the two Tms plus 3 °C (up to 72 °C).
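
The rule stated above amounts to a capped offset on the lower primer Tm. A minimal sketch, with the hypothetical helper name `suggested_ta` and the primer Tms supplied by the caller:

```python
def suggested_ta(tm_forward, tm_reverse, offset, ceiling=72.0):
    """Lower of the two Tms plus an offset, never above the ceiling."""
    return min(min(tm_forward, tm_reverse) + offset, ceiling)


print(suggested_ta(60.0, 65.0, offset=1.0))  # Q5-style rule: 61.0
print(suggested_ta(71.5, 75.0, offset=3.0))  # Phusion-style rule, capped: 72.0
```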

pydna.tm.tmbresluc(primer: str, *args, primerc=500.0, saltc=50, **kwargs)[source]#

Returns the tm for a primer using a formula adapted to polymerases with a DNA binding domain, such as the Phusion polymerase.

Parameters:
  • primer (string) – primer sequence 5’-3’

  • primerc (float) – primer concentration in nM, set to 500.0 nM by default.

  • saltc (float, optional) – Monovalent cation concentration in mM, set to 50.0 mM by default.

  • thermodynamics (bool, optional) – prints details of the thermodynamic data to stdout. For debugging only.

Returns:

tm – the tm of the primer

Return type:

float

pydna.tm.tm_neb(primer, conc=0.5, prodcode='q5-0')[source]#

pydna.types module#

Types used in the pydna package.

pydna.utils module#

Miscellaneous functions.

pydna.utils.three_frame_orfs(dna: str, limit: int = 100, startcodons: tuple = ('ATG',), stopcodons: tuple = ('TAG', 'TAA', 'TGA'))[source]#

Overlapping orfs in three frames.

pydna.utils.shift_location(original_location, shift, lim)[source]#

docstring.

pydna.utils.shift_feature(feature, shift, lim)[source]#

Return a new feature with shifted location.

pydna.utils.smallest_rotation(s)[source]#

Smallest rotation of a string.

Algorithm described in Pierre Duval, Jean. 1983. Factorizing Words over an Ordered Alphabet. Journal of Algorithms & Computational Technology 4 (4) (December 1): 363–381. and Algorithms on strings and sequences based on Lyndon words, David Eppstein 2011. https://gist.github.com/dvberkel/1950267

Examples

>>> from pydna.utils import smallest_rotation
>>> smallest_rotation("taaa")
'aaat'
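
For short strings the same result can be obtained by brute force. The sketch below is not the linear-time algorithm cited above; it simply generates every rotation and takes the lexicographic minimum, which costs quadratic time. The name `smallest_rotation_naive` is hypothetical.

```python
def smallest_rotation_naive(s):
    """Lexicographically smallest rotation of s, by trying every rotation."""
    if not s:
        return s
    return min(s[i:] + s[:i] for i in range(len(s)))


print(smallest_rotation_naive("taaa"))  # aaat
```

A canonical rotation like this is useful for comparing circular sequences, since all rotations of the same circle reduce to one string.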
pydna.utils.anneal_from_left(watson: str, crick: str) int[source]#

The length of the common prefix shared by two strings.

Parameters:
  • watson (str) – The first string.

  • crick (str) – The second string.

Returns:

The length of the common prefix.

Return type:

int
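
Counting a shared prefix can be sketched in a few lines. The helper name `common_prefix_length` is hypothetical, and the sketch compares the two strings exactly as given, without any complementing or case folding.

```python
from itertools import takewhile


def common_prefix_length(first, second):
    """Number of leading positions at which the two strings agree."""
    matching = takewhile(lambda pair: pair[0] == pair[1], zip(first, second))
    return sum(1 for _ in matching)


print(common_prefix_length("GATTACA", "GATCACA"))  # 3
```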

pydna.utils.cai(seq: str, organism: str = 'sce', weights_dict: dict = None)[source]#

docstring.

pydna.utils.rarecodons(seq: str, organism='sce')[source]#

docstring.

pydna.utils.express(seq: str, organism='sce')[source]#

docstring.

NOT IMPLEMENTED YET

pydna.utils.open_folder(pth)[source]#

docstring.

pydna.utils.rc(sequence: StrOrBytes) StrOrBytes[source]#

Reverse complement.

accepts mixed DNA/RNA

pydna.utils.complement(sequence: StrOrBytes) StrOrBytes[source]#

Complement.

accepts mixed DNA/RNA
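
A translation table is one simple way to complement a sequence that mixes DNA and RNA letters. This is a sketch under assumptions: the names are hypothetical, only `str` input is handled (the functions above also accept bytes), only the unambiguous letters are mapped, and both A and U are complemented to DNA letters.

```python
# U pairs with A; A is complemented to T, so the output uses DNA letters.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def complement_sketch(sequence):
    """Complement each base, preserving case and order."""
    return sequence.translate(_COMPLEMENT)


def reverse_complement_sketch(sequence):
    """Complement each base and reverse the result."""
    return complement_sketch(sequence)[::-1]


print(reverse_complement_sketch("GGATCCa"))  # tGGATCC
print(complement_sketch("AUG"))              # TAC
```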

pydna.utils.identifier_from_string(s: str) str[source]#

Return a valid python identifier.

based on the argument s or an empty string

pydna.utils.flatten(*args) List[source]#

Flattens an iterable of iterables.

Down to str, bytes, bytearray or any of the pydna or Biopython seq objects
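
Recursive flattening that stops at string-like objects can be sketched as below. The name `flatten_sketch` is hypothetical, and the sketch treats only `str`, `bytes` and `bytearray` as atomic; sequence objects from pydna or Biopython would have to be added to the `ATOMIC` tuple to be kept whole.

```python
from collections.abc import Iterable

ATOMIC = (str, bytes, bytearray)  # iterable types that must not be split up


def flatten_sketch(*args):
    """Flatten arbitrarily nested iterables into a single list."""
    flat = []
    for item in args:
        if isinstance(item, Iterable) and not isinstance(item, ATOMIC):
            flat.extend(flatten_sketch(*item))  # descend into the container
        else:
            flat.append(item)
    return flat


print(flatten_sketch([1, [2, ["ab", (3,)]]], "cd"))  # [1, 2, 'ab', 3, 'cd']
```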

pydna.utils.seq31(seq)[source]#

Turn a three letter code protein sequence into one with one letter code.

The single input argument ‘seq’ should be a protein sequence using three letter codes, as a python string.

This function returns the amino acid sequence as a string using the one letter amino acid codes. Output follows the IUPAC standard (including ambiguous characters B for “Asx”, J for “Xle” and X for “Xaa”, and also U for “Sel” and O for “Pyl”) plus “Ter” for a terminator given as an asterisk.

Any unknown character (including possible gap characters), is changed into ‘Xaa’.

Examples

>>> from Bio.SeqUtils import seq3
>>> seq3("MAIVMGRWKGAR*")
'MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer'
>>> from pydna.utils import seq31
>>> seq31('MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer')
'M  A  I  V  M  G  R  W  K  G  A  R  *'
pydna.utils.randomRNA(length, maxlength=None)[source]#

docstring.

pydna.utils.randomDNA(length, maxlength=None)[source]#

docstring.

pydna.utils.randomORF(length, maxlength=None)[source]#

docstring.

pydna.utils.randomprot(length, maxlength=None)[source]#

docstring.

pydna.utils.eq(*args, **kwargs)[source]#

Compare two or more DNA sequences for equality.

Compares two or more DNA sequences for equality i.e. if they represent the same double stranded DNA molecule.

Parameters:
  • args (iterable) – iterable containing sequences args can be strings, Biopython Seq or SeqRecord, Dseqrecord or dsDNA objects.

  • circular (bool, optional) – Consider all molecules circular or linear

  • linear (bool, optional) – Consider all molecules circular or linear

Returns:

eq – Returns True or False

Return type:

bool

Notes

Compares two or more DNA sequences for equality i.e. if they represent the same DNA molecule.

Two linear sequences are considered equal if either:

  1. They have the same sequence (case insensitive)

  2. One sequence is the reverse complement of the other

Two circular sequences are considered equal if they are circular permutations meaning that they have the same length and:

  1. One sequence can be found in the concatenation of the other sequence with itself.

  2. The reverse complement of one sequence can be found in the concatenation of the other sequence with itself.

The topology for the comparison can be set by passing one of the keywords linear or circular as True or False.

If circular or linear is not set, it will be deduced from the topology of each sequence for sequences that have a linear or circular attribute (like Dseq and Dseqrecord).
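
The two sets of rules in these notes can be sketched for plain strings. This is an illustration of the stated rules only, not the implementation of eq: the helper names are hypothetical, exactly two sequences are compared, and the topology is chosen by calling one function or the other.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def _rc(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def same_linear(a, b):
    """Equal if identical (ignoring case) or reverse complements of each other."""
    a, b = a.upper(), b.upper()
    return a == b or a == _rc(b)


def same_circular(a, b):
    """Equal if b, or its reverse complement, is a rotation of a."""
    a, b = a.upper(), b.upper()
    if len(a) != len(b):
        return False
    doubled = a + a  # every rotation of a is a substring of a + a
    return b in doubled or _rc(b) in doubled


print(same_linear("Taaa", "aTaa"))       # False
print(same_circular("Taaa", "aTaa"))     # True
print(same_linear("ggatcca", "tGGATCC")) # True
```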

Examples

>>> from pydna.dseqrecord import Dseqrecord
>>> from pydna.utils import eq
>>> eq("aaa","AAA")
True
>>> eq("aaa","AAA","TTT")
True
>>> eq("aaa","AAA","TTT","tTt")
True
>>> eq("aaa","AAA","TTT","tTt", linear=True)
True
>>> eq("Taaa","aTaa", linear = True)
False
>>> eq("Taaa","aTaa", circular = True)
True
>>> a=Dseqrecord("Taaa")
>>> b=Dseqrecord("aTaa")
>>> eq(a,b)
False
>>> eq(a,b,circular=True)
True
>>> a=a.looped()
>>> b=b.looped()
>>> eq(a,b)
True
>>> eq(a,b,circular=False)
False
>>> eq(a,b,linear=True)
False
>>> eq(a,b,linear=False)
True
>>> eq("ggatcc","GGATCC")
True
>>> eq("ggatcca","GGATCCa")
True
>>> eq("ggatcca","tGGATCC")
True
pydna.utils.cuts_overlap(left_cut, right_cut, seq_len)[source]#
pydna.utils.location_boundaries(loc: SimpleLocation | CompoundLocation)[source]#
pydna.utils.locations_overlap(loc1: SimpleLocation | CompoundLocation, loc2: SimpleLocation | CompoundLocation, seq_len)[source]#
pydna.utils.sum_is_sticky(three_prime_end: tuple[str, str], five_prime_end: tuple[str, str], partial: bool = False) int[source]#

Return the overlap length if the 3’ end of seq1 and 5’ end of seq2 ends are sticky and compatible for ligation. Return 0 if they are not compatible.

pydna.utils.limit_iterator(iterator, limit)[source]#

Call the function with an iterator to raise an error if the number of items is greater than the limit.
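
A guard of this kind can be written as a generator. The sketch below is an assumption about the general pattern, not the library's code: the name `limited` is hypothetical and ValueError is chosen arbitrarily as the error type.

```python
def limited(iterator, limit):
    """Yield items from iterator, raising if more than limit items appear."""
    for count, item in enumerate(iterator, start=1):
        if count > limit:
            raise ValueError(f"Too many items (limit is {limit})")
        yield item


print(list(limited(range(3), 3)))  # [0, 1, 2]
```

Because the check happens lazily, the error is raised only when the consumer actually reaches the item that exceeds the limit.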

pydna.utils.create_location(start: int, end: int, lim: int, strand: int | None = None) Location[source]#

Create a location object from a start and end position. If the end position is less than the start position, the location is circular. It handles negative positions.

Parameters:
  • start (int) – The start position of the location.

  • end (int) – The end position of the location.

  • lim (int) – The length of the sequence.

  • strand (int, optional) – The strand of the location. None, 1 or -1.

Returns:

location – The location object. Can be a SimpleLocation or a CompoundLocation if the feature spans the origin of a circular sequence.

Return type:

Location

Examples

>>> from pydna.utils import create_location
>>> str(create_location(0, 5, 10,-1))
'[0:5](-)'
>>> str(create_location(0, 5, 10,+1))
'[0:5](+)'
>>> str(create_location(0, 5, 10))
'[0:5]'
>>> str(create_location(8, 2, 10))
'join{[8:10], [0:2]}'
>>> str(create_location(8, 2, 10,-1))
'join{[0:2](-), [8:10](-)}'
>>> str(create_location(-2, 2, 10))
'join{[8:10], [0:2]}'

Note this special case, 0 is the same as len(seq):

>>> str(create_location(5, 0, 10))
'[5:10]'

Note the special case where if start and end are the same, the location spans the entire sequence (it’s not empty):

>>> str(create_location(5, 5, 10))
'join{[5:10], [0:5]}'
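
The wrap-around behaviour shown in these examples can be sketched with plain tuples. This is a simplified illustration, not the function's implementation: the name `wrap_parts` is hypothetical, strands are ignored, and the result is a list of `(start, end)` pairs instead of a Location object.

```python
def wrap_parts(start, end, lim):
    """Split a possibly origin-spanning range on a circle of length lim."""
    start %= lim  # negative positions count from the end
    end %= lim    # an end equal to lim is treated like 0
    if start < end:
        return [(start, end)]
    parts = [(start, lim)]       # from start up to the origin
    if end > 0:
        parts.append((0, end))   # and from the origin to end
    return parts


print(wrap_parts(8, 2, 10))   # [(8, 10), (0, 2)]
print(wrap_parts(-2, 2, 10))  # [(8, 10), (0, 2)]
print(wrap_parts(5, 0, 10))   # [(5, 10)]
print(wrap_parts(5, 5, 10))   # [(5, 10), (0, 5)]
```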