# Representing sequences in pydna
> Visit the full library documentation [here](https://pydna-group.github.io/pydna/)

<a target="_blank" href="https://colab.research.google.com/github/pydna-group/pydna/blob/master/docs/notebooks/Dseq.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


```python
%%capture
# Install pydna (only when running on Colab)
import sys
if 'google.colab' in sys.modules:
    %pip install pydna

```

Pydna contains classes to represent double-stranded DNA sequences that can:

* Be linear
* Be circular
* Contain overhangs (sticky ends)

These sequences can be used to simulate molecular biology methods such as cloning and PCR. The main classes used to represent sequences are `Dseq` and `Dseqrecord`.
* `Dseq` represents the sequence only. Think of it as a FASTA file.
* `Dseqrecord` can contain sequence features and other information such as publication, authors, etc. Think of it as a GenBank file.

> NOTE: The `Dseq` class is a subclass of Biopython's `Seq`, whose documentation can be found [here](https://biopython.org/wiki/Seq). `Dseqrecord` is a subclass of Biopython's `SeqRecord`, whose documentation can be found [here](https://biopython.org/wiki/SeqRecord).


## Dseq Class

We can create a `Dseq` object in different ways.

For a linear sequence without overhangs, we create a `Dseq` object by passing a string with the sequence. For example:


```python
from pydna.dseq import Dseq
my_seq = Dseq("aatat")
my_seq
```


    Dseq(-5)
    aatat
    ttata


In the console representation above, there are three lines:
1. `Dseq(-5)` indicates that the sequence is linear and has 5 basepairs.
2. `aatat`, the top / sense / watson strand, referred to from now on as the **watson** strand.
3. `ttata`, the bottom / anti-sense / crick strand, referred to from now on as the **crick** strand.

Now, let's create a circular sequence:


```python
my_seq = Dseq("aatat", circular=True)
my_seq
```


    Dseq(o5)
    aatat
    ttata


> Note how `o5` indicates that the sequence is circular and has 5 basepairs.

One way to represent a linear sequence with overhangs is to instantiate `Dseq` with the following arguments:
* The `watson` strand as a string in the 5'-3' direction.
* The `crick` strand as a string in the 5'-3' direction.
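* The 5' overhang `ovhg`, an integer that can be positive, negative or zero. Its absolute value is the number of unpaired bases at the left end: a negative value means that the `watson` strand extends beyond the `crick` strand, and a positive value means that the `crick` strand extends beyond the `watson` strand.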


```python
Dseq("actag", "ctag", -1)
```


    Dseq(-5)
    actag
     gatc


> Note how the bottom strand is passed in the 5'-3' direction, but it is represented in the 3'-5' direction in the console output.

If you omit the `ovhg` argument, pydna will try to find the value that makes the `watson` and `crick` strands complementary.


```python
Dseq("actag", "ctag")
```


    Dseq(-5)
    actag
     gatc
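
To make the idea of finding the value that makes the strands complementary more concrete, here is a minimal plain-Python sketch. It is not pydna's implementation: the helper `find_ovhg` is hypothetical, handles only unambiguous a/c/g/t bases, and simply tries every offset and keeps the one with the longest complementary overlap.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def find_ovhg(watson, crick):
    """Guess ovhg for two strands that are both written 5'-3'."""
    crick_shown = crick[::-1]  # bottom strand as drawn, 3'-5'
    best = None  # (overlap length, ovhg)
    for ovhg in range(-(len(watson) - 1), len(crick)):
        w_start = max(ovhg, 0)   # column where the watson strand starts
        c_start = max(-ovhg, 0)  # column where the crick strand starts
        start = max(w_start, c_start)
        end = min(w_start + len(watson), c_start + len(crick))
        w_part = watson[start - w_start:end - w_start]
        c_part = crick_shown[start - c_start:end - c_start]
        # Keep this offset if the paired columns are complementary.
        if w_part.translate(COMPLEMENT).lower() == c_part.lower():
            if best is None or end - start > best[0]:
                best = (end - start, ovhg)
    if best is None:
        raise ValueError("strands are not complementary at any offset")
    return best[1]


print(find_ovhg("actag", "ctag"))  # -1
```

For strands that can pair at several offsets, this sketch picks the offset with the longest overlap; that tie-breaking rule is an assumption made for illustration.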


The best way to get a feeling for the meaning of `ovhg` is to visualise the possible scenarios as follows:

```
dsDNA       overhang

  nnn...    2
nnnnn...

  nnnn...   1
nnnnn...

nnnnn...    0
nnnnn...

nnnnn...   -1
  nnnn...

nnnnn...   -2
  nnn...
```
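
The scenarios above can be reproduced with a few lines of plain Python. The sketch below is an illustration rather than pydna code: the hypothetical helper `layout` pads whichever strand starts further to the right, and the total length is the number of columns spanned by both strands.

```python
def layout(watson, crick, ovhg):
    """Return (top line, bottom line, total length) for a linear duplex.

    `watson` and `crick` are both given 5'-3'; the bottom line shows the
    crick strand reversed (3'-5'), as in the console output.
    """
    top = " " * max(ovhg, 0) + watson           # positive ovhg: watson starts later
    bottom = " " * max(-ovhg, 0) + crick[::-1]  # negative ovhg: crick starts later
    return top, bottom, max(len(top), len(bottom))


for ovhg in (2, 1, 0, -1, -2):
    top, bottom, length = layout("nnnnn", "nnnnn", ovhg)
    print(f"ovhg={ovhg:>2}, length={length}\n{top}\n{bottom}\n")
```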

Note that DNA sequences can be passed in lower case or upper case, and they are not restricted to the conventional A, T, C and G nucleotides: the class supports the IUPAC ambiguous nucleotide code.


```python
Dseq("Actag", "Ctag", -1)
```


    Dseq(-5)
    Actag
     gatC


Another way to pass the overhangs is to use the `from_full_sequence_and_overhangs` classmethod, which only needs the `watson`/sense strand. This is useful if you can only store the entire sequence (e.g. in a FASTA file), or if you want to specify overhangs on both sides of the double-stranded DNA when you create the object.

Both `watson_ovhg` and `crick_ovhg` follow the same rules as above. Specifically, the `crick_ovhg` argument is identical to the conventional `ovhg` argument. The `watson_ovhg` argument is the `ovhg` argument applied to the reverse complementary sequence.


```python
my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg=-3, watson_ovhg=-2)
my_seq
```


    Dseq(-8)
    aaatta
       aattt


The possible scenarios, combining positive and negative values of `crick_ovhg` and `watson_ovhg`, are visualised in the output of the code below:


```python
for crick_ovhg in [-2, 2]:
    for watson_ovhg in [-3, 3]:
        print(f"watson_ovhg is {watson_ovhg}, crick_ovhg is {crick_ovhg}")
        my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg=crick_ovhg, watson_ovhg=watson_ovhg)
        print(repr(my_seq) + "\n")
```

    watson_ovhg is -3, crick_ovhg is -2
    Dseq(-8)
    aaatt
      taattt
    
    watson_ovhg is 3, crick_ovhg is -2
    Dseq(-8)
    aaattaaa
      taa
    
    watson_ovhg is -3, crick_ovhg is 2
    Dseq(-8)
      att
    tttaattt
    
    watson_ovhg is 3, crick_ovhg is 2
    Dseq(-8)
      attaaa
    tttaa
    

The drawing below can help visualize the meaning of the overhangs.
```
  (-3)--(-2)--(-1)--(x)--(x)--(x)--(-1)--(-2)

5'( a)--( a)--( a)--(t)--(t)--(a)--( a)--( a)3'
3'( a)--( a)--( a)--(t)--(t)--(a)--( a)--( a)5'

5'( a)--( a)--( a)--(t)--(t)--(a)--(  )--(  )3'
3'(  )--(  )--(  )--(t)--(t)--(a)--( a)--( a)5'
```
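
As a rough cross-check of these rules, the plain-Python sketch below derives both strands from the full sequence and the two overhang values. The helper `split_strands` is hypothetical (it is not part of pydna) and only handles a/c/g/t bases; it reproduces the strands printed in the outputs above.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def split_strands(full_sequence, crick_ovhg, watson_ovhg):
    """Return (watson, crick), both written 5'-3'."""
    n = len(full_sequence)
    # A positive crick_ovhg removes bases from the left of the watson strand,
    # a negative watson_ovhg removes bases from its right.
    watson = full_sequence[max(crick_ovhg, 0):n + min(watson_ovhg, 0)]
    # A negative crick_ovhg removes bases from the left of the crick strand,
    # a positive watson_ovhg removes bases from its right (top-strand coordinates).
    covered_by_crick = full_sequence[max(-crick_ovhg, 0):n - max(watson_ovhg, 0)]
    crick = covered_by_crick.translate(COMPLEMENT)[::-1]
    return watson, crick


watson, crick = split_strands("aaattaaa", crick_ovhg=-3, watson_ovhg=-2)
print(watson)       # aaatta
print(crick[::-1])  # aattt (reversed, as drawn on the bottom line)
```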

To check the overhangs of a `Dseq` object, call the methods `five_prime_end` and `three_prime_end`, which describe the left and right ends of the sequence, respectively. Each returns a tuple with the overhang type and the overhang sequence, as shown in the example below:


```python
my_seq = Dseq("aatat", "ttata", ovhg=-2)
print(repr(my_seq))
print(my_seq.five_prime_end())
print(my_seq.three_prime_end())
```

    Dseq(-7)
    aatat
      atatt
    ("5'", 'aa')
    ("5'", 'tt')
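
The tuples above can be read as (overhang type, overhang sequence). As a hedged illustration, the following plain-Python sketch derives the same tuples from the two strands and `ovhg`. The function `describe_ends` is hypothetical and is not pydna's code; it only mirrors the output shown for this example.

```python
def describe_ends(watson, crick, ovhg):
    """Return the (type, sequence) tuples for the left and right ends.

    Both strands are given 5'-3'.
    """
    def end(offset, five_prime_strand, three_prime_strand):
        if offset < 0:  # the 5' end of one strand is unpaired
            return "5'", five_prime_strand[:-offset]
        if offset > 0:  # the 3' end of the other strand is unpaired
            return "3'", three_prime_strand[-offset:]
        return "blunt", ""

    left = end(ovhg, watson, crick)
    # On the right side the roles are swapped: a longer crick strand
    # leaves its 5' end unpaired, a longer watson strand its 3' end.
    right_offset = len(watson) - len(crick) + ovhg
    right = end(right_offset, crick, watson)
    return left, right


print(describe_ends("aatat", "ttata", -2))  # (("5'", 'aa'), ("5'", 'tt'))
```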


If you now want to join the sticky ends of your sequence to make a circular sequence (e.g. a plasmid), you can use the `looped` method. The sticky ends must be compatible to do so.


```python
my_seq = Dseq("aatat", "ttata", ovhg=-2)
my_seq.looped()
```


    Dseq(o5)
    aatat
    ttata
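
What "compatible" means can be sketched in plain Python. In this simplified model, two ends are compatible when they are of the same type and one overhang is the reverse complement of the other. The helper `can_loop` is hypothetical and only approximates the check; pydna's own rules may differ in the details.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def can_loop(left_end, right_end):
    """Each end is a (type, sequence) tuple such as ("5'", "aa")."""
    left_type, left_seq = left_end
    right_type, right_seq = right_end
    return left_type == right_type and left_seq == reverse_complement(right_seq)


print(can_loop(("5'", "aa"), ("5'", "tt")))  # True: the ends of the example above
print(can_loop(("5'", "aa"), ("5'", "gg")))  # False
```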


To change the origin of a circular sequence or plasmid, use the `shifted` method, passing the number of bases between the original origin and the new one:


```python
my_seq = Dseq("aatat", circular=True)
my_seq.shifted(2)
```


    Dseq(o5)
    tataa
    atatt
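
On the top strand, shifting the origin amounts to rotating the sequence. A minimal plain-Python sketch (the helper `rotate` is hypothetical, not a pydna function, and the modulo handling of large shifts is an assumption) reproduces the top line of the output above:

```python
def rotate(sequence, shift):
    """Move the origin of a circular sequence `shift` bases to the right."""
    shift %= len(sequence)
    return sequence[shift:] + sequence[:shift]


print(rotate("aatat", 2))  # tataa
```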


## `__getitem__`, `__repr__`, and `__str__` methods


### Slicing sequences (`__getitem__`)

`__getitem__` is the method that is called when you use square brackets `[]` after a Python object. Below is an example with the built-in Python `list`:


```python
my_list = [1, 2, 3]

print('using square brackets:', my_list[1:])
print('is the same as using __getitem__:', my_list.__getitem__(slice(1, None)))
```

    using square brackets: [2, 3]
    is the same as using __getitem__: [2, 3]


In pydna, the `__getitem__` method is modified to handle `Dseq` objects: it returns a slice of the `Dseq` object, defined by a start value and a stop value, similarly to string slicing. Note that `__getitem__` (and, consequently, `[]`) uses zero-based indexing.


```python
my_seq = Dseq("aatataa")
my_seq[2:5]

```


    Dseq(-3)
    tat
    ata


`__getitem__` respects overhangs.


```python
my_seq = Dseq.from_full_sequence_and_overhangs("aatataa", crick_ovhg=0, watson_ovhg=-1)
my_seq[2:]
```


    Dseq(-5)
    tata
    atatt


Note that index zero corresponds to the leftmost base of the sequence, which might not necessarily be on the `watson` strand. Let's create a sequence that has an overhang on the left side.


```python
sequence_with_overhangs = Dseq.from_full_sequence_and_overhangs("aatacgttcc", crick_ovhg=3, watson_ovhg=0)
sequence_with_overhangs
```


    Dseq(-10)
       acgttcc
    ttatgcaagg


When we index starting from `2`, we don't start counting on the `watson` strand, but on the `crick` strand, since it is the leftmost one.


```python
sequence_with_overhangs[2:]
```


    Dseq(-8)
     acgttcc
    atgcaagg
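
One way to picture this is to treat both printed lines as padded strings and slice them by column. The sketch below is plain Python and only an illustration (it does not use pydna); it reproduces the two strand lines of the output above.

```python
top = "   acgttcc"     # watson strand, padded with the 3-base offset
bottom = "ttatgcaagg"  # crick strand as drawn (3'-5')

# Index 0 is the leftmost column, which here holds only a crick base.
start = 2
print(top[start:])     # one space followed by acgttcc
print(bottom[start:])  # atgcaagg
```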


#### Slicing circular sequences
When slicing circular `Dseq` objects we get linear sequences.


```python
circular_seq = Dseq("aatctaa", circular=True)
circular_seq[1:5]
```


    Dseq(-4)
    atct
    taga


We can slice circular sequences across the origin (where index is zero) if the first index is bigger than the second index. This is demonstrated in the example below:


```python
circular_seq[5:2]
```


    Dseq(-4)
    aaaa
    tttt
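
The wrap-around can be sketched for the top strand in plain Python. The helper `circular_slice` is hypothetical and models only the two cases described here (start smaller or larger than stop):

```python
def circular_slice(sequence, start, stop):
    """Slice a circular sequence, wrapping past the origin if needed."""
    if start < stop:
        return sequence[start:stop]
    if start > stop:
        # Read to the end, then continue from the origin.
        return sequence[start:] + sequence[:stop]
    raise ValueError("start == stop is not modelled in this sketch")


print(circular_slice("aatctaa", 1, 5))  # atct
print(circular_slice("aatctaa", 5, 2))  # aaaa
```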


### Printing sequences to the console: `__repr__` and `__str__`

`__repr__` and `__str__` are methods present in all Python classes that return a string representation of an object. `__str__` is called by the `print` function, and `__repr__` is used by the console or notebook output when an expression is evaluated without being assigned to a variable. Below is an example with a `date` object:


```python
import datetime

my_date = datetime.date(2023, 8, 15)

print('> print statement:', my_date)
print('> repr:', repr(my_date))
print('> repr from class method:', my_date.__repr__())

print()
print('> console output:')
my_date
```

    > print statement: 2023-08-15
    > repr: datetime.date(2023, 8, 15)
    > repr from class method: datetime.date(2023, 8, 15)
    
    > console output:


    datetime.date(2023, 8, 15)


In a similar way, `__repr__` and `__str__` methods are used by pydna to represent sequences as strings for different purposes:

* `__repr__` is used to make a figure-like representation that shows both strands and the overhangs.
* `__str__` is used to return the entire sequence as a string of characters (from the left-most to the right-most base of both strands), the way we would store it in a FASTA file.


```python
my_seq = Dseq.from_full_sequence_and_overhangs("aaattaaa", crick_ovhg=-3, watson_ovhg=-2)
print('> figure-like representation:\n', my_seq.__repr__())
print()
print('> string representation:\n', my_seq)

```

    > figure-like representation:
     Dseq(-8)
    aaatta
       aattt
    
    > string representation:
     aaattaaa


Note that in the string representation, the bases correspond to the entire sequence provided, even when they are present on only one of the `watson` or `crick` strands. In the example above, the last two bases (`aa`) are missing from the `watson` strand and are present only as their complement on the `crick` strand.
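
The difference between the two representations can be imitated with a small stand-alone class. `TwoStrands` below is a hypothetical toy, not pydna's `Dseq`: it takes both strands 5'-3' plus `ovhg`, draws them in `__repr__`, and fills single-stranded columns with the complement of the other strand in `__str__`. It assumes a/c/g/t bases only.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


class TwoStrands:
    """Toy model of a linear double-stranded sequence."""

    def __init__(self, watson, crick, ovhg):
        top = " " * max(ovhg, 0) + watson
        bottom = " " * max(-ovhg, 0) + crick[::-1]  # drawn 3'-5'
        width = max(len(top), len(bottom))
        self.top = top.ljust(width)
        self.bottom = bottom.ljust(width)

    def __repr__(self):
        # Figure-like view: size, then one line per strand.
        return f"TwoStrands(-{len(self.top)})\n{self.top.rstrip()}\n{self.bottom.rstrip()}"

    def __str__(self):
        # Full sequence: take the top base, or the complement of the bottom one.
        return "".join(
            t if t != " " else b.translate(COMPLEMENT)
            for t, b in zip(self.top, self.bottom)
        )


seq = TwoStrands("aaatta", "tttaa", ovhg=-3)
print(repr(seq))
print(str(seq))  # aaattaaa
```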

## Edge cases

You can create arbitrary double-stranded sequences that are not complementary if you specify both strands and an overhang, but you won't be able to use them for molecular biology simulations. For example:


```python
Dseq("xxxx", "atat", ovhg=2)
```


    Dseq(-6)
      xxxx
    tata