Representing sequences in pydna#
%%capture
# Install pydna (only when running on Colab)
import sys
if 'google.colab' in sys.modules:
    %pip install pydna[clipboard,download,express,gel]
Pydna contains classes to represent double stranded DNA sequences that can:
Be linear
Be circular
Contain overhangs (sticky ends).
These sequences can be used to simulate molecular biology methods such as cloning and PCR. The main classes used to represent sequences are Dseq and Dseqrecord.
Dseq represents the sequence only. Think of it as a FASTA file.
Dseqrecord can contain sequence features and other info such as publication, authors, etc. Think of it as a Genbank file.
Dseq Class#
We can create a Dseq object in different ways.
For a linear sequence without overhangs, we create a Dseq object passing a string with the sequence. For example:
from pydna.dseq import Dseq
my_seq = Dseq("aatat")
my_seq
Dseq(-5)
aatat
ttata
In the console representation above, there are three lines:
Dseq(-5) indicates that the sequence is linear and has 5 basepairs.
aatat, the top / sense / watson strand, referred to from now on as the watson strand.
ttata, the bottom / anti-sense / crick strand, referred to from now on as the crick strand.
Now, let’s create a circular sequence:
my_seq = Dseq("aatat", circular=True)
my_seq
Dseq(o5)
aatat
ttata
Note how o5 indicates that the sequence is circular and has 5 basepairs.
One way to represent a linear sequence with overhangs is to instantiate Dseq with the following arguments:
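The watson strand as a string in the 5’-3’ direction.
The crick strand as a string in the 5’-3’ direction.
The 5’ overhang ovhg (overhang), an integer giving the stagger between the two strands on the left side. A negative value means the watson strand extends beyond the crick strand by that many bases, a positive value means the crick strand extends beyond the watson strand, and 0 means a blunt end.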
Dseq("actag", "ctag", -1)
Dseq(-5)
actag
 gatc
Note how the bottom strand is passed in the 5’-3’ direction, but it is represented in the 3’-5’ direction in the console output.
If you omit the ovhg argument, pydna will try to find the value that makes the watson and crick strands complementary.
Dseq("actag", "ctag")
Dseq(-5)
actag
 gatc
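To make the idea of finding the ovhg value more concrete, here is a minimal sketch in plain Python. It is not pydna's implementation: guess_ovhg is a hypothetical helper that tries every possible stagger, keeps the one with the longest fully complementary overlap, and only handles the ACGT alphabet.

```python
def guess_ovhg(watson, crick):
    """Return the stagger that pairs the most bases, or None if none pair.

    Hypothetical helper for illustration only; not part of pydna.
    Both strands are given in the 5'-3' direction.
    """
    comp = str.maketrans("acgtACGT", "tgcaTGCA")
    crick_rev = crick[::-1]  # crick as it is drawn, in the 3'-5' direction
    best, best_len = None, 0
    for ovhg in range(-(len(watson) - 1), len(crick)):
        # Column at which each strand starts in the two-line drawing
        w_start = max(ovhg, 0)
        c_start = max(-ovhg, 0)
        # Columns covered by both strands
        left = max(w_start, c_start)
        right = min(w_start + len(watson), c_start + len(crick))
        top = watson[left - w_start:right - w_start]
        bottom = crick_rev[left - c_start:right - c_start]
        # Keep the stagger with the longest fully complementary overlap
        if top.translate(comp).lower() == bottom.lower() and len(top) > best_len:
            best, best_len = ovhg, len(top)
    return best


print(guess_ovhg("actag", "ctag"))  # -1
```

Repetitive sequences can pair equally well at several staggers, so a real implementation needs a rule for such ties; this sketch simply keeps the first one it finds.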
The best way to get a feeling for the meaning of ovhg is to visualise the possible scenarios as such:
dsDNA       overhang

  nnn...    2
nnnnn...

 nnnn...    1
nnnnn...

nnnnn...    0
nnnnn...

nnnnn...   -1
 nnnn...

nnnnn...   -2
  nnn...
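The scenarios above can be reproduced with a few lines of plain Python. The sketch below assumes a hypothetical helper, draw_duplex, which is not part of pydna. It only shows how the sign of ovhg decides which strand is indented in the two-line drawing.

```python
def draw_duplex(watson, crick, ovhg):
    """Draw both strands; crick is given 5'-3' and shown 3'-5'.

    Hypothetical helper for illustration only; not part of pydna.
    """
    # A positive ovhg indents the watson strand, a negative one the crick strand
    top = " " * max(ovhg, 0) + watson
    bottom = " " * max(-ovhg, 0) + crick[::-1]
    return top + "\n" + bottom


print(draw_duplex("actag", "ctag", -1))
# actag
#  gatc
```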
Of note, the DNA sequence can be passed in lower case or upper case, and it is not restricted to the conventional ATCG nucleotides: the class supports the IUPAC ambiguous nucleotide code.
Dseq("Actag", "Ctag", -1)
Dseq(-5)
Actag
 gatC
Another way to pass the overhangs is to use the from_full_sequence_and_overhangs classmethod, which only needs the watson/sense strand. This is useful if you only store the entire sequence (e.g. in a FASTA file), or if you want to specify overhangs on both sides of the double stranded DNA when you create the object.
Both the watson_ovhg and crick_ovhg can be passed following the same rules as above. Specifically, the crick_ovhg argument is identical to the conventional ovhg argument. The watson_ovhg argument is the ovhg argument applied to the reverse complementary sequence.
my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg=-3, watson_ovhg=-2)
my_seq
Dseq(-8)
aaatta
   aattt
The possible scenarios, applying positive and negative crick_ovhg and watson_ovhg to a Dseq object, are visualised in the output of the code below:
for crick_ovhg in [-2, 2]:
    for watson_ovhg in [-3, 3]:
        print("watson_ovhg is " + str(watson_ovhg) + ", crick_ovhg is " + str(crick_ovhg))
        my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg, watson_ovhg)
        print(my_seq.__repr__() + "\n")
watson_ovhg is -3, crick_ovhg is -2
Dseq(-8)
aaatt
  taattt

watson_ovhg is 3, crick_ovhg is -2
Dseq(-8)
aaattaaa
  taa

watson_ovhg is -3, crick_ovhg is 2
Dseq(-8)
  att
tttaattt

watson_ovhg is 3, crick_ovhg is 2
Dseq(-8)
  attaaa
tttaa
The drawing below can help visualize the meaning of the overhangs.
  (-3)--(-2)--(-1)--(x)--(x)--(x)--(-1)--(-2)

5'( a)--( a)--( a)--(t)--(t)--(a)--( a)--( a)3'
3'( t)--( t)--( t)--(a)--(a)--(t)--( t)--( t)5'

5'( a)--( a)--( a)--(t)--(t)--(a)--(  )--(  )3'
3'(  )--(  )--(  )--(a)--(a)--(t)--( t)--( t)5'
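The relationship between the full sequence, the two overhang arguments and the resulting strands can be sketched in plain Python. This is an illustration under stated assumptions, not pydna's implementation: split_strands and reverse_complement are hypothetical helpers, and only the ACGT alphabet is handled.

```python
def reverse_complement(seq):
    """Reverse complement of an ACGT sequence, preserving case."""
    return seq.translate(str.maketrans("acgtACGT", "tgcaTGCA"))[::-1]


def split_strands(full, crick_ovhg, watson_ovhg):
    """Return (watson, crick, ovhg), both strands written 5'-3'.

    Hypothetical helper for illustration only; not part of pydna.
    """
    watson = full
    crick = reverse_complement(full)
    # Left side of the drawing
    if crick_ovhg > 0:
        watson = watson[crick_ovhg:]   # watson is recessed on the left
    elif crick_ovhg < 0:
        crick = crick[:crick_ovhg]     # crick is recessed on the left (its 3' end)
    # Right side of the drawing
    if watson_ovhg > 0:
        crick = crick[watson_ovhg:]    # crick is recessed on the right (its 5' end)
    elif watson_ovhg < 0:
        watson = watson[:watson_ovhg]  # watson is recessed on the right
    return watson, crick, crick_ovhg


print(split_strands("aaattaaa", crick_ovhg=-3, watson_ovhg=-2))
# ('aaatta', 'tttaa', -3)
```

The crick strand is returned in the 5'-3' direction, so tttaa here corresponds to aattt in the console drawing, where it is shown 3'-5'.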
To check the overhangs of a Dseq object, call the methods five_prime_end and three_prime_end, which describe the left and right ends of the sequence, respectively. Each returns a tuple with the type of end and the overhang sequence. For example:
my_seq = Dseq("aatat", "ttata", ovhg=-2)
print(my_seq.__repr__())
print(my_seq.five_prime_end())
print(my_seq.three_prime_end())
Dseq(-7)
aatat
  atatt
("5'", 'aa')
("5'", 'tt')
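To see where these tuples come from, the sketch below derives both ends from the two strands and ovhg in plain Python. It is an approximation for illustration, not pydna's code: describe_ends is a hypothetical helper, and the "blunt" label it uses for flush ends is an assumption.

```python
def describe_ends(watson, crick, ovhg):
    """Return (left_end, right_end) as (type, overhang) tuples.

    Hypothetical helper for illustration only; not part of pydna.
    Both strands are given in the 5'-3' direction.
    """
    # Left end: a negative ovhg means the watson strand sticks out (5' overhang)
    if ovhg < 0:
        left = ("5'", watson[:-ovhg])
    elif ovhg > 0:
        left = ("3'", crick[-ovhg:])
    else:
        left = ("blunt", "")

    # Right end: how far watson reaches beyond crick on the right side
    stagger = len(watson) - len(crick) + ovhg
    if stagger < 0:
        right = ("5'", crick[:-stagger])
    elif stagger > 0:
        right = ("3'", watson[-stagger:])
    else:
        right = ("blunt", "")
    return left, right


print(describe_ends("aatat", "ttata", -2))
# (("5'", 'aa'), ("5'", 'tt'))
```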
To join the sticky ends of a sequence into a circular sequence (i.e., a plasmid), use the looped method. The sticky ends must be compatible for this to work.
my_seq = Dseq("aatat", "ttata", ovhg=-2)
my_seq.looped()
Dseq(o5)
aatat
ttata
To change the origin of a circular sequence or plasmid, use the shifted method, passing the number of bases between the original origin and the new one:
my_seq = Dseq("aatat", circular=True)
my_seq.shifted(2)
Dseq(o5)
tataa
atatt
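Shifting the origin amounts to rotating the sequence. A minimal sketch on a plain string, assuming a hypothetical helper shift_origin that is not part of pydna:

```python
def shift_origin(seq, shift):
    """Rotate a circular sequence so that position `shift` becomes position 0.

    Hypothetical helper for illustration only; not part of pydna.
    """
    shift %= len(seq)  # wrap shifts larger than the sequence length
    return seq[shift:] + seq[:shift]


print(shift_origin("aatat", 2))  # tataa
```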
__getitem__, __repr__, and __str__ methods#
Slicing sequences (__getitem__)#
__getitem__ is the method that Python calls when you use square brackets [] after an object. Below is an example with the built-in Python list:
my_list = [1, 2, 3]
print('using square brackets:', my_list[1:])
print('is the same as using __getitem__:', my_list.__getitem__(slice(1, None)))
using square brackets: [2, 3]
is the same as using __getitem__: [2, 3]
pydna implements __getitem__ for Dseq objects so that it returns a slice of the Dseq object, defined by a start value and a stop value, as in string slicing. Note that __getitem__ (and, consequently, []) uses zero-based indexing.
my_seq = Dseq("aatataa")
my_seq[2:5]
Dseq(-3)
tat
ata
__getitem__ respects overhangs.
my_seq = Dseq.from_full_sequence_and_overhangs("aatataa", crick_ovhg=0, watson_ovhg=-1)
my_seq[2:]
Dseq(-5)
tata
atatt
Note that index zero corresponds to the leftmost base of the sequence, which might not necessarily be on the watson strand. Let’s create a sequence that has an overhang on the left side.
sequence_with_overhangs = Dseq.from_full_sequence_and_overhangs("aatacgttcc", crick_ovhg=3, watson_ovhg=0)
sequence_with_overhangs
Dseq(-10)
   acgttcc
ttatgcaagg
When we index starting from 2, we don’t start counting on the watson, but on the crick strand since it is the leftmost one.
sequence_with_overhangs[2:]
Dseq(-8)
 acgttcc
atgcaagg
Slicing circular sequences#
When slicing circular Dseq objects we get linear sequences.
circular_seq = Dseq("aatctaa", circular=True)
circular_seq[1:5]
Dseq(-4)
atct
taga
We can slice circular sequences across the origin (where index is zero) if the first index is bigger than the second index. This is demonstrated in the example below:
circular_seq[5:2]
Dseq(-4)
aaaa
tttt
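The wrap-around behaviour can be illustrated on a plain string. This is only a sketch of the idea, using a hypothetical helper circular_slice that is not part of pydna:

```python
def circular_slice(seq, start, stop):
    """Slice a circular sequence, wrapping across the origin if needed.

    Hypothetical helper for illustration only; not part of pydna.
    """
    if start < stop:
        return seq[start:stop]
    # start is not smaller than stop: read to the end, then continue from index 0
    return seq[start:] + seq[:stop]


print(circular_slice("aatctaa", 1, 5))  # atct
print(circular_slice("aatctaa", 5, 2))  # aaaa
```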
Printing sequences to the console: __repr__ and __str__#
__repr__ and __str__ are methods available on all Python objects that return a string representation of the object. __str__ is called by the print function, and __repr__ is used by the console or notebook when an expression is evaluated without being assigned to a variable. Below is an example with a date object:
import datetime
my_date = datetime.date(2023, 8, 15)
print('> print statement:', my_date)
print('> repr:', repr(my_date))
print('> repr from class method:', my_date.__repr__())
print()
print('> console output:')
my_date
> print statement: 2023-08-15
> repr: datetime.date(2023, 8, 15)
> repr from class method: datetime.date(2023, 8, 15)
> console output:
datetime.date(2023, 8, 15)
In a similar way, __repr__ and __str__ methods are used by pydna to represent sequences as strings for different purposes:
__repr__ is used to make a figure-like representation that shows both strands and the overhangs.
__str__ is used to return the entire sequence as a string of characters (from the left-most to the right-most base of both strands), the way we would store it in a FASTA file.
my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg=-3, watson_ovhg=-2)
print('> figure-like representation:\n', my_seq.__repr__())
print()
print('> string representation:\n', my_seq)
> figure-like representation:
Dseq(-8)
aaatta
   aattt
> string representation:
aaattaaa
Note that in the string representation, the bases correspond to the entire sequence provided, even when they are present on only one of the two strands. In the example above, the last two positions (aa) are missing from the watson strand and are covered only by the crick strand.
Edge cases#
You can create arbitrary double-stranded sequences that are not complementary if you specify both strands and an overhang, but you won’t be able to use them for molecular biology simulations. For example:
Dseq("xxxx", "atat", ovhg=2)
Dseq(-6)
  xxxx
tata