Designing Primers for a Kozak sequence library
%%capture
# Install pydna (only when running on Colab)
import sys
if 'google.colab' in sys.modules:
    %pip install pydna[clipboard,download,express,gel] teemi
⚠️ This notebook uses the extra dependency teemi. Run it in Google Colab, or in an environment where you install teemi as well as pydna.
In this notebook we explore the combinatorial space of the most abundant Kozak sequences and make repair primers for the experiments. We will use pydna to simulate the CRISPR experiment and the homology-directed repair with the oligos we make.
About Kozak sequences
Kozak sequences are short hexamer motifs (often GCC(A/G)CC) that sit directly upstream of the start codon in eukaryotic mRNAs and enhance translation initiation. They are interesting because they help determine how efficiently a gene is translated into protein: strong Kozak motifs boost translation, while weak ones can limit it. This makes them important for tuning gene expression in biotechnology. In this example, imagine we are using K. phaffii or your favourite protein production host. Happy bioengineering!
Combinatorial space
We want to limit our search space, so we restrict the combinatorial space to the following nucleotides, which were the most abundant at each position in the position weight matrix (PWM) analysis.
position1 = ["C", "T", "A"]
position2 = ["C", "A", "G"]
position3 = [ "A", "G"]
position4 = ["C", "T", "A"]
position5 = ["C", "G"]
position6 = ["C"]
nucleotide_list = [position1, position2, position3, position4, position5, position6]
from teemi.design.combinatorial_design import get_combinatorial_list
# make all combinations
kozak = get_combinatorial_list(nucleotide_list)
print(f'{len(kozak)} combinations generated')
kozak[:5]
108 combinations generated
[('C', 'C', 'A', 'C', 'C', 'C'),
('C', 'C', 'A', 'C', 'G', 'C'),
('C', 'C', 'A', 'T', 'C', 'C'),
('C', 'C', 'A', 'T', 'G', 'C'),
('C', 'C', 'A', 'A', 'C', 'C')]
# Join each combination (a sequence of single nucleotides) into one string
def make_to_string(list_of_list):
    return [''.join(sp) for sp in list_of_list]
all_combinations_as_str = make_to_string(kozak)
all_combinations_as_str[:5]
['CCACCC', 'CCACGC', 'CCATCC', 'CCATGC', 'CCAACC']
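As a cross-check that does not depend on teemi, the enumeration and the string conversion above can be sketched with the standard library alone. This is a minimal sketch, not the teemi implementation: `positions`, `expected_count`, and `hexamers` are names chosen here for illustration.

```python
from itertools import product
from math import prod

# Allowed nucleotides at each of the six Kozak positions (same as above)
positions = [
    ["C", "T", "A"],
    ["C", "A", "G"],
    ["A", "G"],
    ["C", "T", "A"],
    ["C", "G"],
    ["C"],
]

# The library size is the product of the number of choices per position
expected_count = prod(len(choices) for choices in positions)

# itertools.product varies the last position fastest
hexamers = ["".join(combo) for combo in product(*positions)]

print(expected_count, len(hexamers))
print(hexamers[:5])
```

The count is simply 3 × 3 × 2 × 3 × 2 × 1, so adding one more nucleotide at any position grows the number of oligos to order quickly.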
Making primers for homology-directed repair with oligos
This is a dummy example where we want to test Kozak sequences for a GFP gene that has been integrated into K. phaffii.
from pydna.dseqrecord import Dseqrecord
from pydna.crispr import cas9, protospacer
from pydna.genbank import Genbank
# Initialize your favourite gene
gb = Genbank("myself@email.com") # Tell Genbank who you are!
gene = gb.nucleotide("LN515608.1") # Synthetic construct for Aequorea victoria partial gfp gene for GFP
target_dseq = Dseqrecord(gene)
print(target_dseq)
Dseqrecord
circular: False
size: 735
ID: LN515608.1
Name: LN515608
Description: Synthetic construct for Aequorea victoria partial gfp gene for GFP
Number of features: 4
/molecule_type=DNA
/topology=linear
/data_file_division=SYN
/date=03-MAR-2015
/accessions=['LN515608']
/sequence_version=1
/keywords=['']
/source=synthetic construct
/organism=synthetic construct
/taxonomy=['other sequences', 'artificial sequences']
/references=[Reference(title='XerC-mediated DNA inversion at the inverted repeats of the UU172-phase-variable element of Ureaplasma parvum serovar 3', ...), Reference(title='Direct Submission', ...)]
Dseq(-735)
AGTA..CTAG
TCAT..GATC
promoter_region = Dseqrecord('GACGCACCAATCTAGCACAGGCACAGTGTTAACTAGATCTCAACCCTTACCCAAGTCAGAGCCGCAGAGATTGGCAACAAACTCTAGAAACCCGGGGCACGAGGACAATATGAGCTGTGCAGGCTGGTCGAGACTCGTCTAGTTGGTATTACGGTACTAGACGTCGTTGTATCCTTAGGGGACTAGAGTCAGGTAGGTAATAGGGGGTTCCCCTATCTATTATATTTAACTAGTGATACCTTCTCGAACTGTGTGAGCTGCTGCCTCAGCGAATTTCGTTCTGGACCggTACGTGTGT')
full_seq = promoter_region + target_dseq
full_seq
Dseqrecord(-1033)
We know from previous experiments that this sgRNA works really well.
sgRNA = 'AGCGAATTTCGTTCTGGAC'
Let’s simulate how it cuts our construct.
# Choose guides
guide = ["AGCGAATTTCGTTCTGGAC"]
# Create an enzyme object with the protospacer
enzyme = cas9(guide[0])
# Simulate the cut with the enzyme
print('cutting with guide:', full_seq.cut(enzyme))
cutting with guide: (Dseqrecord(-284), Dseqrecord(-749))
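To make the cut simulation less of a black box, here is a minimal string-level sketch of the commonly used SpCas9 rule: the protospacer must be followed by an NGG PAM, and the blunt cut falls 3 bp upstream of the PAM. It is not pydna's implementation, it only searches the top strand, and `find_cas9_cut` and `toy_target` are hypothetical names introduced for illustration. The coordinate it reports is not guaranteed to match the fragment sizes pydna prints above.

```python
import re
from typing import Optional


def find_cas9_cut(seq: str, protospacer: str) -> Optional[int]:
    """Return the 0-based top-strand cut index, or None if no site is found.

    Assumes the textbook SpCas9 rule: protospacer directly followed by NGG,
    blunt cut 3 bp upstream of the PAM. Only the top strand is searched.
    """
    seq = seq.upper()
    protospacer = protospacer.upper()
    # Lookahead keeps the PAM out of the match, so match.end() is the PAM start
    match = re.search(re.escape(protospacer) + r"(?=[ACGT]GG)", seq)
    if match is None:
        return None
    return match.end() - 3


# Toy target: the last 34 bases of promoter_region (contains the guide and its PAM)
toy_target = "CTCAGCGAATTTCGTTCTGGACCggTACGTGTGT"
guide_seq = "AGCGAATTTCGTTCTGGAC"

cut = find_cas9_cut(toy_target, guide_seq)
left, right = toy_target[:cut], toy_target[cut:]
print(cut, left, right)
```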
Let’s design the 5′ and 3′ ends of the repair oligos.
We want to keep the oligos at around 60 nt to make them more affordable to synthesize.
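# 30 bases of the promoter, from 50 to 20 bases upstream of the gene start
# (the final 20 promoter bases, which include the PAM, are replaced by the Kozak hexamer)
five_prime = promoter_region[-50:-20]
# The first 30 bases of the gene
three_prime = target_dseq[:30]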
print(f'Five prime end of the repair oligos is {len(five_prime)} bases')
print(f'Three prime end of the repair oligos is {len(three_prime)} bases')
Five prime end of the repair oligos is 30 bases
Three prime end of the repair oligos is 30 bases
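The negative slice used for the 5′ arm is easy to misread, so here is a small sketch of what it keeps and what it leaves out. It works on plain strings and assumes only the last 50 bases of `promoter_region` (copied from above); `promoter_tail`, `kept`, `replaced`, and `retained_guide_bases` are illustrative names.

```python
# Last 50 bases of promoter_region, copied from the sequence above
promoter_tail = "CTGTGTGAGCTGCTGCCTCAGCGAATTTCGTTCTGGACCggTACGTGTGT"
guide_seq = "AGCGAATTTCGTTCTGGAC"

# Same slice as in the notebook: 30 bases ending 20 bases before the gene
kept = promoter_tail[-50:-20]
# The final 20 promoter bases are not part of the oligo
replaced = promoter_tail[-20:]

# How many bases of the protospacer remain in the 5' arm
guide_start = promoter_tail.upper().find(guide_seq)
retained_guide_bases = len(kept) - guide_start

print("kept     :", kept)
print("replaced :", replaced)
print("protospacer bases left in the arm:", retained_guide_bases)
```

Because the PAM and the 3′ end of the protospacer fall in the replaced stretch, a repaired locus should no longer carry the intact target site, which is usually what you want from a repair template.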
# Repeat each arm once per Kozak variant
five_prime_list = [str(five_prime.seq)] * len(all_combinations_as_str)
three_prime_list = [str(three_prime.seq)] * len(all_combinations_as_str)
# making a dataframe
import pandas as pd
my_dict = {'five':five_prime_list, "kozak":all_combinations_as_str, "three":three_prime_list}
kozak_df = pd.DataFrame(my_dict)
kozak_df['primer'] = kozak_df['five'] + kozak_df['kozak'] + kozak_df['three']
kozak_df
| | five | kozak | three | primer |
|---|---|---|---|---|
| 0 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCG | CCACCC | AGTAAAGGAGAAGAACTTTTCACTGGAGTT | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGCCACCCAGTAAAGGAG... |
| 1 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCG | CCACGC | AGTAAAGGAGAAGAACTTTTCACTGGAGTT | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGCCACGCAGTAAAGGAG... |
| 2 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCG | CCATCC | AGTAAAGGAGAAGAACTTTTCACTGGAGTT | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGCCATCCAGTAAAGGAG... |
| 3 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCG | CCATGC | AGTAAAGGAGAAGAACTTTTCACTGGAGTT | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGCCATGCAGTAAAGGAG... |
| 4 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCG | CCAACC | AGTAAAGGAGAAGAACTTTTCACTGGAGTT | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGCCAACCAGTAAAGGAG... |
| ... | ... | ... | ... | ... |
| 103 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCG | AGGCGC | AGTAAAGGAGAAGAACTTTTCACTGGAGTT | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGAGGCGCAGTAAAGGAG... |
| 104 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCG | AGGTCC | AGTAAAGGAGAAGAACTTTTCACTGGAGTT | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGAGGTCCAGTAAAGGAG... |
| 105 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCG | AGGTGC | AGTAAAGGAGAAGAACTTTTCACTGGAGTT | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGAGGTGCAGTAAAGGAG... |
| 106 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCG | AGGACC | AGTAAAGGAGAAGAACTTTTCACTGGAGTT | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGAGGACCAGTAAAGGAG... |
| 107 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCG | AGGAGC | AGTAAAGGAGAAGAACTTTTCACTGGAGTT | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGAGGAGCAGTAAAGGAG... |
108 rows × 4 columns
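Before ordering, it can be worth sanity-checking the whole library. The sketch below rebuilds the 108 primers from plain strings (arms copied from the table above) and checks three things: every oligo has the same length, all oligos are unique, and none of them still contains the protospacer followed by an NGG PAM. The names `FIVE_ARM`, `THREE_ARM`, `target_site`, and `problems` are illustrative, and only the top strand is checked.

```python
import re
from itertools import product

positions = [
    ["C", "T", "A"],
    ["C", "A", "G"],
    ["A", "G"],
    ["C", "T", "A"],
    ["C", "G"],
    ["C"],
]
FIVE_ARM = "CTGTGTGAGCTGCTGCCTCAGCGAATTTCG"
THREE_ARM = "AGTAAAGGAGAAGAACTTTTCACTGGAGTT"
GUIDE = "AGCGAATTTCGTTCTGGAC"

# 5' arm + Kozak hexamer + 3' arm, one primer per combination
primers = [FIVE_ARM + "".join(k) + THREE_ARM for k in product(*positions)]

# An intact target site is the protospacer directly followed by NGG
target_site = re.compile(GUIDE + "[ACGT]GG")
expected_length = len(FIVE_ARM) + 6 + len(THREE_ARM)

problems = [
    p for p in primers
    if len(p) != expected_length or target_site.search(p)
]

print(f"{len(primers)} primers, {len(set(primers))} unique, "
      f"length {expected_length}, {len(problems)} flagged")
```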
from Bio.Seq import Seq
# Turn each primer into a pydna Dseqrecord
oligos = []
for i, row in kozak_df.iterrows():
    repair_oligo = Dseqrecord(
        Seq(row['primer']),
        id=f"kozak_repair_oligo_{i+1}",
        name=f"Repair oligo {i+1} for kozak experiment",
        description="Designed repair oligo from DataFrame"
    )
    oligos.append(repair_oligo)
# Show the first oligo (name and sequence now refer to the same record)
print(f'{oligos[0].name} : {oligos[0].seq} ')
print(f'{oligos[0].name} length : {len(oligos[0].seq)} ')
Repair oligo 1 for kozak experiment : CTGTGTGAGCTGCTGCCTCAGCGAATTTCGCCACCCAGTAAAGGAGAAGAACTTTTCACTGGAGTT
Repair oligo 1 for kozak experiment length : 66
from pydna.assembly2 import in_vivo_assembly
# Cut once, then assemble each repair oligo with the two fragments of full_seq
upstream_fragment, downstream_fragment = full_seq.cut(enzyme)
assembled_KOs = []
for repair_oligo in oligos:
    products = in_vivo_assembly((upstream_fragment, repair_oligo, downstream_fragment), limit=30)
    assembled_KOs.append(products[0])
# assembled_KOs now contains one assembly per repair oligo
# Print the product(s) of the last assembly as a spot check
for p in products:
    print(p)
Dseqrecord
circular: False
size: 1019
ID: id
Name: name
Description: description
Number of features: 4
/molecule_type=DNA
Dseq(-1019)
GACG..CTAG
CTGC..GATC
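The 1019 bp product is 14 bp shorter than the 1033 bp starting construct, which is what you expect when 20 promoter bases are swapped for a 6-base Kozak hexamer. The sketch below reproduces that logic on plain strings. It is a deliberately naive model of homology-directed repair (exact matching of both arms, top strand only), not what `in_vivo_assembly` does internally; `simulate_hdr` and the toy flanks `"GAA"` and `"ACGT"` are hypothetical and chosen for illustration.

```python
def simulate_hdr(target: str, oligo: str, arm_len: int = 30) -> str:
    """Replace the stretch between the two homology arms with the oligo payload.

    Naive model: both arms must match the target exactly, and the 3' arm
    must lie downstream of the 5' arm.
    """
    target_u, oligo_u = target.upper(), oligo.upper()
    left_arm, right_arm = oligo_u[:arm_len], oligo_u[-arm_len:]
    left = target_u.find(left_arm)
    if left == -1:
        raise ValueError("5' homology arm not found in target")
    right = target_u.find(right_arm, left + arm_len)
    if right == -1:
        raise ValueError("3' homology arm not found downstream of 5' arm")
    return target_u[:left] + oligo_u + target_u[right + arm_len:]


FIVE_ARM = "CTGTGTGAGCTGCTGCCTCAGCGAATTTCG"
THREE_ARM = "AGTAAAGGAGAAGAACTTTTCACTGGAGTT"
REPLACED = "TTCTGGACCggTACGTGTGT"  # last 20 promoter bases, incl. the PAM

# Toy locus: arbitrary short flanks around the real arms
toy_locus = "GAA" + FIVE_ARM + REPLACED + THREE_ARM + "ACGT"
repair_oligo = FIVE_ARM + "CCACCC" + THREE_ARM

edited = simulate_hdr(toy_locus, repair_oligo)
print(len(toy_locus), "->", len(edited))
print(edited)
```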
They seem to repair as expected, so let's order them from IDT.
oligo_data = []
for oligo in oligos:
    oligo_data.append({
        "Oligo Name": oligo.id,
        "Sequence": str(oligo.seq),
        # Optional IDT columns (set your preferred defaults)
        "Scale": "25nm",
        "Purification": "STD"
    })
idt_df = pd.DataFrame(oligo_data)
idt_df
| | Oligo Name | Sequence | Scale | Purification |
|---|---|---|---|---|
| 0 | kozak_repair_oligo_1 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGCCACCCAGTAAAGGAG... | 25nm | STD |
| 1 | kozak_repair_oligo_2 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGCCACGCAGTAAAGGAG... | 25nm | STD |
| 2 | kozak_repair_oligo_3 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGCCATCCAGTAAAGGAG... | 25nm | STD |
| 3 | kozak_repair_oligo_4 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGCCATGCAGTAAAGGAG... | 25nm | STD |
| 4 | kozak_repair_oligo_5 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGCCAACCAGTAAAGGAG... | 25nm | STD |
| ... | ... | ... | ... | ... |
| 103 | kozak_repair_oligo_104 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGAGGCGCAGTAAAGGAG... | 25nm | STD |
| 104 | kozak_repair_oligo_105 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGAGGTCCAGTAAAGGAG... | 25nm | STD |
| 105 | kozak_repair_oligo_106 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGAGGTGCAGTAAAGGAG... | 25nm | STD |
| 106 | kozak_repair_oligo_107 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGAGGACCAGTAAAGGAG... | 25nm | STD |
| 107 | kozak_repair_oligo_108 | CTGTGTGAGCTGCTGCCTCAGCGAATTTCGAGGAGCAGTAAAGGAG... | 25nm | STD |
108 rows × 4 columns
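To get the order sheet out of the notebook, the rows can be written as CSV. The sketch below uses only the standard library and two example rows; the column names are the ones used in this notebook and are not taken from an official IDT template, so check them against the upload form you actually use. `rows` and `csv_text` are illustrative names.

```python
import csv
import io

FIVE_ARM = "CTGTGTGAGCTGCTGCCTCAGCGAATTTCG"
THREE_ARM = "AGTAAAGGAGAAGAACTTTTCACTGGAGTT"

# Two example rows in the same shape as oligo_data above
rows = [
    {
        "Oligo Name": f"kozak_repair_oligo_{i}",
        "Sequence": FIVE_ARM + kozak + THREE_ARM,
        "Scale": "25nm",
        "Purification": "STD",
    }
    for i, kozak in enumerate(["CCACCC", "CCACGC"], start=1)
]

# Write to an in-memory buffer; swap in open("order.csv", "w", newline="")
# to write a file instead (the file name is just an example)
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["Oligo Name", "Sequence", "Scale", "Purification"]
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

With the DataFrame from the notebook, `idt_df.to_csv("order.csv", index=False)` should give the equivalent file in one line.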