# Representing sequences in pydna

> Visit the full library documentation [here](https://pydna-group.github.io/pydna/)

```python
%%capture
# Install pydna (only when running on Colab)
import sys
if 'google.colab' in sys.modules:
    %pip install pydna
```

Pydna contains classes to represent double-stranded DNA sequences that can:

* Be linear
* Be circular
* Contain overhangs (sticky ends).

These sequences can be used to simulate molecular biology methods such as cloning and PCR. The main classes used to represent sequences are `Dseq` and `Dseqrecord`.

* `Dseq` represents the sequence only. Think of it as a FASTA file.
* `Dseqrecord` can contain sequence features and other info such as publication, authors, etc. Think of it as a Genbank file.

> NOTE: The `Dseq` class is a subclass of biopython's `Seq`, whose documentation can be found [here](https://biopython.org/wiki/Seq). `Dseqrecord` is a subclass of biopython's `SeqRecord`, whose documentation can be found [here](https://biopython.org/wiki/SeqRecord).

## Dseq Class

We can create a `Dseq` object in different ways.

For a linear sequence without overhangs, we create a `Dseq` object by passing a string with the sequence. For example:

```python
from pydna.dseq import Dseq
my_seq = Dseq("aatat")
my_seq
```

    Dseq(-5)
    aatat
    ttata

In the console representation above, there are three lines:

1. `Dseq(-5)` indicates that the sequence is linear and has 5 basepairs.
2. `aatat`, the top / sense / watson strand, referred to from now on as the **watson** strand.
3. `ttata`, the bottom / anti-sense / crick strand, referred to from now on as the **crick** strand.

Now, let's create a circular sequence:

```python
my_seq = Dseq("aatat", circular=True)
my_seq
```

    Dseq(o5)
    aatat
    ttata

> Note how `o5` indicates that the sequence is circular and has 5 basepairs.

One way to represent a linear sequence with overhangs is to instantiate `Dseq` with the following arguments:

* The `watson` strand as a string in the 5'-3' direction.
* The `crick` strand as a string in the 5'-3' direction.
* The 5' overhang `ovhg`, which can be positive or negative. A positive value is the number of bases by which the `crick` strand extends beyond the `watson` strand on the left side; a negative value means that the `watson` strand extends beyond the `crick` strand instead.

```python
Dseq("actag", "ctag", -1)
```

    Dseq(-5)
    actag
     gatc

> Note how the bottom strand is passed in the 5'-3' direction, but it is represented in the 3'-5' direction in the console output.

If you omit the `ovhg` argument, pydna will try to find the value that makes the `watson` and `crick` strands complementary.

```python
Dseq("actag", "ctag")
```

    Dseq(-5)
    actag
     gatc

The best way to get a feeling for the meaning of `ovhg` is to visualise the possible scenarios:

```
dsDNA       overhang

  nnn...    2
nnnnn...

 nnnn...    1
nnnnn...

nnnnn...    0
nnnnn...

nnnnn...   -1
 nnnn...

nnnnn...   -2
  nnn...
```

Of note, the DNA sequence can be passed in lower case, upper case or a mix of both, and it is not restricted to the conventional A, T, C and G nucleotides: the class supports the IUPAC ambiguous nucleotide code.

```python
Dseq("Actag", "Ctag", -1)
```

    Dseq(-5)
    Actag
     gatC

Another way to pass the overhangs is to use the `from_full_sequence_and_overhangs` classmethod, which only needs the `watson`/sense strand. This is useful if you can only store the entire sequence (e.g. in a FASTA file), or if you want to specify overhangs on both sides of the double-stranded DNA when you create the object.

Both `watson_ovhg` and `crick_ovhg` follow the same rules as above. Specifically, the `crick_ovhg` argument is identical to the conventional `ovhg` argument. The `watson_ovhg` argument is the `ovhg` argument applied to the reverse complementary sequence.
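
The rules for `crick_ovhg` and `watson_ovhg` can be sketched in plain Python. The helper below is hypothetical: `split_strands` and `reverse_complement` are illustration names, not pydna API, and the code only handles unambiguous A/C/G/T bases. It follows the behaviour described in this section and is not necessarily how pydna implements it.

```python
# Hypothetical helpers (not part of pydna), assuming only a/c/g/t bases.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def split_strands(full_sequence: str, crick_ovhg: int, watson_ovhg: int):
    """Return (watson, crick), both written in the 5'-3' direction."""
    watson = full_sequence
    crick = reverse_complement(full_sequence)

    # Left side: a negative crick_ovhg removes bases from the 3' end of the
    # crick strand, a positive one removes bases from the 5' end of watson.
    if crick_ovhg < 0:
        crick = crick[:crick_ovhg]
    elif crick_ovhg > 0:
        watson = watson[crick_ovhg:]

    # Right side: a negative watson_ovhg removes bases from the 3' end of the
    # watson strand, a positive one removes bases from the 5' end of crick.
    if watson_ovhg < 0:
        watson = watson[:watson_ovhg]
    elif watson_ovhg > 0:
        crick = crick[watson_ovhg:]

    return watson, crick


print(split_strands("aaattaaa", crick_ovhg=-3, watson_ovhg=-2))
# → ('aaatta', 'tttaa')
```

The two returned strands, together with `ovhg=crick_ovhg`, are what you would otherwise pass to `Dseq` by hand. With pydna, the classmethod does this in a single call:
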
```python
my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg=-3, watson_ovhg=-2)
my_seq
```

    Dseq(-8)
    aaatta
       aattt

The possible scenarios, applying positive and negative `crick_ovhg` and `watson_ovhg` values to a `Dseq` object, are visualised in the output of the code below:

```python
for crick_ovhg in [-2, 2]:
    for watson_ovhg in [-3, 3]:
        print("watson_ovhg is " + str(watson_ovhg) + ", crick_ovhg is " + str(crick_ovhg))
        my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg, watson_ovhg)
        print(my_seq.__repr__() + "\n")
```

    watson_ovhg is -3, crick_ovhg is -2
    Dseq(-8)
    aaatt
      taattt

    watson_ovhg is 3, crick_ovhg is -2
    Dseq(-8)
    aaattaaa
      taa

    watson_ovhg is -3, crick_ovhg is 2
    Dseq(-8)
      att
    tttaattt

    watson_ovhg is 3, crick_ovhg is 2
    Dseq(-8)
      attaaa
    tttaa

The drawing below can help visualize the meaning of the overhangs. The first pair of strands is the full double-stranded sequence; the second pair is the result of applying `crick_ovhg=-3` and `watson_ovhg=-2`.

```
  (-3)--(-2)--(-1)--(x)--(x)--(x)--(-1)--(-2)

5'( a)--( a)--( a)--(t)--(t)--(a)--( a)--( a)3'
3'( t)--( t)--( t)--(a)--(a)--(t)--( t)--( t)5'

5'( a)--( a)--( a)--(t)--(t)--(a)--(  )--(  )3'
3'(  )--(  )--(  )--(a)--(a)--(t)--( t)--( t)5'
```

To check the overhangs of a `Dseq` object, call the methods `five_prime_end` and `three_prime_end`, which return the 5' and 3' overhangs, respectively. The example below shows a `Dseq` object and the output of both methods:

```python
my_seq = Dseq("aatat", "ttata", ovhg=-2)
print(my_seq.__repr__())
print(my_seq.five_prime_end())
print(my_seq.three_prime_end())
```

    Dseq(-7)
    aatat
      atatt
    ("5'", 'aa')
    ("5'", 'tt')

If you now want to join your sequence's sticky ends to make a circular sequence (i.e. a plasmid), you can use the `looped` method. The sticky ends must be compatible to do so.

```python
my_seq = Dseq("aatat", "ttata", ovhg=-2)
my_seq.looped()
```

    Dseq(o5)
    aatat
    ttata

If you want to change the circular origin of the sequence/plasmid, you can use the `shifted` method.
It takes the number of bases between the original origin and the new origin:

```python
my_seq = Dseq("aatat", circular=True)
my_seq.shifted(2)
```

    Dseq(o5)
    tataa
    atatt

## `__getitem__`, `__repr__`, and `__str__` methods

### Slicing sequences (`__getitem__`)

`__getitem__` is the method that is called when you use square brackets `[]` after a python object. Below is an example with the builtin python `list`:

```python
my_list = [1, 2, 3]

print('using square brackets:', my_list[1:])
print('is the same as using __getitem__:', my_list.__getitem__(slice(1, None)))
```

    using square brackets: [2, 3]
    is the same as using __getitem__: [2, 3]

The `__getitem__` method is modified in pydna to deal with `Dseq` objects: it returns a slice of the `Dseq` object, defined by a start value and a stop value, similarly to string slicing. Note that `__getitem__` (and, consequently, `[]`) uses zero-based indexing.

```python
my_seq = Dseq("aatataa")
my_seq[2:5]
```

    Dseq(-3)
    tat
    ata

`__getitem__` respects overhangs.

```python
my_seq = Dseq.from_full_sequence_and_overhangs("aatataa", crick_ovhg=0, watson_ovhg=-1)
my_seq[2:]
```

    Dseq(-5)
    tata
    atatt

Note that index zero corresponds to the leftmost base of the sequence, which is not necessarily on the `watson` strand. Let's create a sequence that has an overhang on the left side.

```python
sequence_with_overhangs = Dseq.from_full_sequence_and_overhangs("aatacgttcc", crick_ovhg=3, watson_ovhg=0)
sequence_with_overhangs
```

    Dseq(-10)
       acgttcc
    ttatgcaagg

When we slice starting from `2`, we don't start counting on the `watson` strand, but on the `crick` strand, since it is the leftmost one.

```python
sequence_with_overhangs[2:]
```

    Dseq(-8)
     acgttcc
    atgcaagg

#### Slicing circular sequences

When slicing circular `Dseq` objects we get linear sequences.
```python
circular_seq = Dseq("aatctaa", circular=True)
circular_seq[1:5]
```

    Dseq(-4)
    atct
    taga

We can slice circular sequences across the origin (where the index is zero) if the first index is bigger than the second index, as in the example below:

```python
circular_seq[5:2]
```

    Dseq(-4)
    aaaa
    tttt

### Printing sequences to the console: `__repr__` and `__str__`

`__repr__` and `__str__` are methods present in all python classes that return a string representation of an object. `__str__` is called by the `print` function, and `__repr__` is used by the console or notebook output when the object is evaluated without being assigned to a variable. Below is an example with a `date` object:

```python
import datetime

my_date = datetime.date(2023, 8, 15)

print('> print statement:', my_date)
print('> repr:', repr(my_date))
print('> repr from class method:', my_date.__repr__())
print()
print('> console output:')
my_date
```

    > print statement: 2023-08-15
    > repr: datetime.date(2023, 8, 15)
    > repr from class method: datetime.date(2023, 8, 15)

    > console output:
    datetime.date(2023, 8, 15)

In a similar way, the `__repr__` and `__str__` methods are used by pydna to represent sequences as strings for different purposes:

* `__repr__` is used to make a figure-like representation that shows both strands and the overhangs.
* `__str__` is used to return the entire sequence as a string of characters (from the left-most to the right-most base of both strands), the way we would store it in a FASTA file.

```python
my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg=-3, watson_ovhg=-2)
print('> figure-like representation:\n', my_seq.__repr__())
print()
print('> string representation:\n', my_seq)
```

    > figure-like representation:
     Dseq(-8)
    aaatta
       aattt

    > string representation:
     aaattaaa

Note that in the string representation, the bases correspond to the entire sequence provided, even when they are only present on either the `watson` or the `crick` strand.
In the example above, the last two `aa` bases are missing from the `watson` strand; those positions are only covered by the `crick` strand.

## Edge cases

You can create arbitrary double-stranded sequences that are not complementary if you specify both strands and an overhang, but you won't be able to use them for molecular biology simulations. For example:

```python
Dseq("xxxx", "atat", ovhg=2)
```

    Dseq(-6)
      xxxx
    tata
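
If you build sequences from two strands and an explicit overhang, it may help to check beforehand that the strands actually pair. The function below is a minimal sketch in plain Python. `strands_pair` is a hypothetical name, not a pydna function, and it assumes unambiguous A/C/G/T bases only, so IUPAC ambiguity codes would be reported as mismatches.

```python
# Hypothetical check (not part of pydna), assuming only a/c/g/t bases.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def strands_pair(watson: str, crick: str, ovhg: int) -> bool:
    """Return True if every overlapping position of the two strands pairs.

    Both strands are given 5'-3'. `ovhg` follows the Dseq convention:
    positive when crick extends on the left, negative when watson does.
    """
    bottom = crick[::-1]           # crick written 3'-5', as in the repr
    watson_offset = max(ovhg, 0)   # columns watson is shifted to the right
    crick_offset = max(-ovhg, 0)   # columns crick is shifted to the right

    # Columns of the figure where both strands have a base
    start = max(watson_offset, crick_offset)
    end = min(watson_offset + len(watson), crick_offset + len(bottom))
    if end <= start:
        return False  # the strands do not overlap at all

    for column in range(start, end):
        top_base = watson[column - watson_offset]
        bottom_base = bottom[column - crick_offset]
        # Characters outside a/c/g/t are left unchanged by translate,
        # so they never match their partner.
        if top_base.translate(COMPLEMENT).lower() != bottom_base.lower():
            return False
    return True


print(strands_pair("actag", "ctag", -1))  # → True
print(strands_pair("xxxx", "atat", 2))    # → False
```

The first call uses the sticky-end example from the `Dseq` section, the second one the non-complementary sequence shown above.
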