# Working with Features using the Dseqrecord class > Before working with features, check how to import sequences from files in the [Importing_Seqs notebook](./Importing_Seqs.ipynb). > > For full library documentation, visit [here](https://pydna-group.github.io/pydna/). Some sequence file formats (like Genbank) include features, describing key biological properties of sequence regions. In Genbank, features "include genes, gene products, as well as regions of biological significance reported in the sequence." (See [here](https://www.ncbi.nlm.nih.gov/genbank/samplerecord/) for a description of a Genbank file and associated terminologies/annotations) Examples include coding sequences (CDS), introns, promoters, etc. pydna offers many ways to easily view, add, extract, and write features into a Genbank file via the `Dseqrecord` class. After reading a file into a `Dseqrecord` object, we can check out the list of features in the record using the following code. This example uses the sample record [U49845](https://www.ncbi.nlm.nih.gov/genbank/samplerecord/). Open In Colab ```python %%capture # Install pydna (only when running on Colab) import sys if 'google.colab' in sys.modules: %pip install pydna ``` ```python from pydna.dseqrecord import Dseqrecord from pydna.parsers import parse #Import your file into python. file_path = "./U49845.gb" records = parse(file_path) sample_record = records[0] # List all features for feature in sample_record.features: print(feature) ``` type: source location: [0:5028](+) qualifiers: Key: chromosome, Value: ['IX'] Key: db_xref, Value: ['taxon:4932'] Key: mol_type, Value: ['genomic DNA'] Key: organism, Value: ['Saccharomyces cerevisiae'] type: mRNA location: [<0:>206](+) qualifiers: Key: product, Value: ['TCP1-beta'] type: CDS location: [<0:206](+) qualifiers: Key: codon_start, Value: ['3'] Key: product, Value: ['TCP1-beta'] Key: protein_id, Value: ['AAA98665.1'] Key: translation, Value: ['SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM'] type: gene location: [<686:>3158](+) qualifiers: Key: gene, Value: ['AXL2'] type: mRNA location: [<686:>3158](+) qualifiers: Key: gene, Value: ['AXL2'] Key: product, Value: ['Axl2p'] type: CDS location: [686:3158](+) qualifiers: Key: codon_start, Value: ['1'] Key: gene, Value: ['AXL2'] Key: note, Value: ['plasma membrane glycoprotein'] Key: product, Value: ['Axl2p'] Key: protein_id, Value: ['AAA98666.1'] Key: translation, Value: ['MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESFTFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFNVILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNEVFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPETSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYVYLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYGDVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQDHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSANATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIACGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLNNPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQSQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDSYGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTKHRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRLVDFSNKSNVNVGQVKDIHGRIPEML'] type: gene location: [<3299:>4037](-) qualifiers: Key: gene, Value: ['REV7'] type: mRNA location: [<3299:>4037](-) qualifiers: Key: gene, Value: ['REV7'] Key: product, Value: ['Rev7p'] type: CDS location: [3299:4037](-) qualifiers: Key: codon_start, Value: ['1'] Key: gene, Value: ['REV7'] Key: product, Value: ['Rev7p'] Key: protein_id, Value: ['AAA98667.1'] Key: translation, Value: ['MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQFVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISGDDKILNGVYSQYEEGESIFGSLF'] Additional ways to view and search for particular features are shown at the bottom of the page under "Other Methods to Viewing Features" ## Adding Features and Qualifiers To add new feature to describe a region of interest to a record, for instance a region that you would like to perform a PCR, you need to create a `SeqFeature` (sequence feature). The minimal information required is: * A `FeatureLocation`: position of the feature in the sequence. * The `type` of feature you want to add. 🚨🚨 **VERY IMPORTANT** 🚨🚨. Note that `FeatureLocation`s are like python ranges (zero-based open intervals), whereas in GenBank files, locations are one-based closed intervals. For instance, the following code adds a new feature from the 2nd to the 5th nucleotide (`FeatureLocation(3, 15)`), of the `gene` type, but in the GenBank file will be represented as `4..15`. ```python from Bio.SeqFeature import FeatureLocation, SeqFeature # Create a dummy record dummy_record = Dseqrecord("aaaATGCGTACGTGAacgt") # Define the locations of a CDS location = FeatureLocation(3, 15) # Create a SeqFeature with the type mRNA my_feature = SeqFeature(location=location, type="gene") # Add my_feature to dummy_record with .append dummy_record.features.append(my_feature) # Confirm that my_feature has been added print(dummy_record.features[-1]) # Print the feature in GenBank format (see how the location is `4..15`) print(dummy_record.format("genbank")) ``` type: gene location: [3:15] qualifiers: LOCUS name 19 bp DNA linear UNK 01-JAN-1980 DEFINITION description. ACCESSION id VERSION id KEYWORDS . SOURCE . ORGANISM . . FEATURES Location/Qualifiers gene 4..15 ORIGIN 1 aaaatgcgta cgtgaacgt // To give further information about a feature, we can add a qualifier using the `qualifiers` property of `SeqFeature`, which contains a dictionary of qualifiers. For instance, if I would like to note a new feature of type 'domain', between 3-9 bases as my region of interest, I can instantiate the `SeqFeature` class object as such. > Note that a new feature is always added to the last position of the features list. ```python location = FeatureLocation(3, 9) # Create a SeqFeature with a qualifier my_feature2 = SeqFeature(location=location, type="domain", qualifiers={"Note": ["Region of interest"]}) # Add my_feature to my_record with .append dummy_record.features.append(my_feature2) # Confirm that my_feature has been added print('>> Feature was added:') print(dummy_record.features[-1]) print() # Print the feature in GenBank format print('>> GenBank format:') print(dummy_record.format("genbank")) ``` >> Feature was added: type: domain location: [3:9] qualifiers: Key: Note, Value: ['Region of interest'] >> GenBank format: LOCUS name 19 bp DNA linear UNK 01-JAN-1980 DEFINITION description. ACCESSION id VERSION id KEYWORDS . SOURCE . ORGANISM . . FEATURES Location/Qualifiers gene 4..15 domain 4..9 /Note="Region of interest" ORIGIN 1 aaaatgcgta cgtgaacgt // **🤔 Best practices for qualifiers:** The values in the `qualifiers` dictionary should be lists. The reason for this is that in a GenBank file, a single feature can have multiple values for a single qualifier. Below is a real world of the ase1 CDS example from the _S. pombe_ genome in EMBL format: ``` FT CDS join(1878362..1878785,1878833..1880604) FT /colour=2 FT /primary_name="ase1" FT /product="antiparallel microtubule cross-linking factor FT Ase1" FT /systematic_id="SPAPB1A10.09" FT /controlled_curation="term=species distribution, conserved FT in eukaryotes; date=20081110" FT /controlled_curation="term=species distribution, conserved FT in metazoa; date=20081110" FT /controlled_curation="term=species distribution, conserved FT in vertebrates; date=20081110" FT /controlled_curation="term=species distribution, FT predominantly single copy (one to one); date=20081110" FT /controlled_curation="term=species distribution, conserved FT in fungi; date=20081110" FT /controlled_curation="term=species distribution, conserved FT in eukaryotes only; date=20081110" ``` Note how there are several `controlled_curation` qualifiers, therefore it makes sense to store them as a list. By default, you can add any type of object in the qualifiers dictionary, and most things will work if you add a string. However, you risk overwriting the existing value for a qualifier, so best practice is: 1. Check if the qualifier already exists using `if "qualifier_name" in feature.qualifiers` 2. If it exists, append to the existing list of values using `feature.qualifiers["qualifier_name"].append("new_value")` 3. If it does not exist, add it to the qualifiers dictionary using `feature.qualifiers["qualifier_name"] = ["new_value"]` Note that `Bio.SeqFeatures` does not automatically assume a sequence strand for the feature. If you would like to refer to a feature on the positive or minus strand, you can add a parameter in `FeatureLocation` specifying `strand=+1` or `strand=-1`. ```python #Create a location specifying the minus strand location = FeatureLocation(15, 19, strand=-1) my_feature3 = SeqFeature(location=location, type="domain", qualifiers={"gene":["example_domain"]}) dummy_record.features.append(my_feature3) print(dummy_record.features[-1]) print(dummy_record.format("genbank")) ``` type: domain location: [15:19](-) qualifiers: Key: gene, Value: ['example_domain'] LOCUS name 19 bp DNA linear UNK 01-JAN-1980 DEFINITION description. ACCESSION id VERSION id KEYWORDS . SOURCE . ORGANISM . . FEATURES Location/Qualifiers gene 4..15 domain 4..9 /Note="Region of interest" domain complement(16..19) /gene="example_domain" ORIGIN 1 aaaatgcgta cgtgaacgt // ### Adding a Feature with Parts To add a feature with parts, like a CDS with introns, we need to use a `CompoundLocation` object when creating a `SeqFeature`. The example code below adds a CDS with two parts, between 3-9bp and 12-15bp, to my features list. In a real-world scenario this would represent a CDS with an intron that skips the `ACG` codon: ATGCGT~~ACG~~TGA ```python from Bio.SeqFeature import CompoundLocation # Define the locations of the CDS locations = [FeatureLocation(3, 9), FeatureLocation(12, 15)] # Create a compound location from these parts compound_location = CompoundLocation(locations) # Create a SeqFeature with this compound location, including type and qualifiers. cds_feature = SeqFeature(location=compound_location, type="CDS", qualifiers={"gene": ["example_gene"]}) # Add the feature to the Dseqrecord dummy_record.features.append(cds_feature) print(dummy_record.features[-1]) print(dummy_record.format("genbank")) ``` type: CDS location: join{[3:9], [12:15]} qualifiers: Key: gene, Value: ['example_gene'] LOCUS name 19 bp DNA linear UNK 01-JAN-1980 DEFINITION description. ACCESSION id VERSION id KEYWORDS . SOURCE . ORGANISM . . FEATURES Location/Qualifiers gene 4..15 domain 4..9 /Note="Region of interest" domain complement(16..19) /gene="example_domain" CDS join(4..9,13..15) /gene="example_gene" ORIGIN 1 aaaatgcgta cgtgaacgt // We can even extract a protein record as follows (see how the protein sequence is `MR`, skipping the intron): ```python sub_record = dummy_record.features[-1].extract(dummy_record) print(sub_record.translate()) ``` ID: id Name: name Description: description Number of features: 0 /molecule_type=DNA ProteinSeq('MR') ### Standard Feature Types and Qualifiers `pydna` and `Bio.SeqFeature` suppports all the conventional feature types through the `type` parameters. A non-exhaustive list include gene, CDS, promoter, exon, intron, 5' UTR, 3' UTR, terminator, enhancer, and RBS. You can also define custom features, which could be useful for synthetic biology applications. For instance, you might want to have Bio_brick or spacer features to describe a synthetic standardised plasmid construct. It is important to note that while `pydna` and `Bio.SeqFeature` does not restrict the feature types you can use, sticking to standard types helps maintain compatibility with other bioinformatics tools and databases. Please refer to the official [GenBank_Feature_Table](https://www.insdc.org/submitting-standards/feature-table/#2), that lists the standard feature types and their associated qualifiers. Further documentation for `SeqFeature`, `CompoundLocation`, and `FeatureLocation` can be found in the `SeqFeature` module [here](https://biopython.org/docs/1.75/api/Bio.SeqFeature.html). ### Handling Origin Spanning Features An origin spanning feature is a special type of feature that crosses over a circular sequence's origin. In pydna, such a feature is represented as a feature with parts, joining the part of the sequence immediately before the origin and immediately after the origin. They can be added using `CompoundLocation` as normal. An origin spanning feature, between base 19 to base 6, in a 25bp long circular sequence, is represented like so: ``` type: gene location: join{[19:25](+), [0:6](+)} qualifiers: gene, Value: example_gene ``` This feature will be displayed as a single feature in SnapGene viewer and Benchling, since they support this convention. ```python circular_record = Dseqrecord('ACGTGAaaaaaaaaaaaaaATGCGT', circular=True) location = [FeatureLocation(19,25), FeatureLocation(0, 6)] ori_feat_location = CompoundLocation(location) ori_feature = SeqFeature(location=ori_feat_location, type="misc", qualifiers={"gene": ["example origin spanning gene"]}) circular_record.features.append(ori_feature) print('>> Feature:') print(circular_record.features[-1]) # Note how the feature sequence is extracted properly across the origin. print('>> Feature sequence:') print(circular_record.features[-1].extract(circular_record).seq) print() print('>> GenBank format:') print(circular_record.format("genbank")) ``` >> Feature: type: misc location: join{[19:25], [0:6]} qualifiers: Key: gene, Value: ['example origin spanning gene'] >> Feature sequence: ATGCGTACGTGA >> GenBank format: LOCUS name 25 bp DNA circular UNK 01-JAN-1980 DEFINITION description. ACCESSION id VERSION id KEYWORDS . SOURCE . ORGANISM . . FEATURES Location/Qualifiers misc join(20..25,1..6) /gene="example origin spanning gene" ORIGIN 1 acgtgaaaaa aaaaaaaaaa tgcgt // ### Other Methods to Viewing Features pydna also provides the `list_features` method as a simple way to list all the features in a `Dseqrecord` object. ```python print(sample_record.list_features()) ``` +-----+------------------+-----+-------+-------+------+--------+------+ | Ft# | Label or Note | Dir | Sta | End | Len | type | orf? | +-----+------------------+-----+-------+-------+------+--------+------+ | 0 | nd | --> | 0 | 5028 | 5028 | source | no | | 1 | nd | --> | <0 | >206 | 206 | mRNA | no | | 2 | nd | --> | <0 | 206 | 206 | CDS | no | | 3 | nd | --> | <686 | >3158 | 2472 | gene | yes | | 4 | nd | --> | <686 | >3158 | 2472 | mRNA | yes | | 5 | N:plasma membran | --> | 686 | 3158 | 2472 | CDS | yes | | 6 | nd | <-- | <3299 | >4037 | 738 | gene | yes | | 7 | nd | <-- | <3299 | >4037 | 738 | mRNA | yes | | 8 | nd | <-- | 3299 | 4037 | 738 | CDS | yes | +-----+------------------+-----+-------+-------+------+--------+------+ This method is convenient for checking-out a brief overview of each feature, without reading through an entire sequence record. Alternatively, we can look for specific features using their qualifiers. For instance: ```python # Filter based on feature type print('Getting all CDS features:') cds_features = [f for f in sample_record.features if f.type == "CDS"] for feature in cds_features: print(feature) ``` Getting all CDS features: type: CDS location: [<0:206](+) qualifiers: Key: codon_start, Value: ['3'] Key: product, Value: ['TCP1-beta'] Key: protein_id, Value: ['AAA98665.1'] Key: translation, Value: ['SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM'] type: CDS location: [686:3158](+) qualifiers: Key: codon_start, Value: ['1'] Key: gene, Value: ['AXL2'] Key: note, Value: ['plasma membrane glycoprotein'] Key: product, Value: ['Axl2p'] Key: protein_id, Value: ['AAA98666.1'] Key: translation, Value: ['MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESFTFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFNVILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNEVFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPETSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYVYLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYGDVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQDHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSANATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIACGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLNNPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQSQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDSYGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTKHRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRLVDFSNKSNVNVGQVKDIHGRIPEML'] type: CDS location: [3299:4037](-) qualifiers: Key: codon_start, Value: ['1'] Key: gene, Value: ['REV7'] Key: product, Value: ['Rev7p'] Key: protein_id, Value: ['AAA98667.1'] Key: translation, Value: ['MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQFVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISGDDKILNGVYSQYEEGESIFGSLF'] ```python # Find a particular feature by its qualifier (e.g. gene name) rev7_cds_feature = next(f for f in sample_record.features if f.type == "gene" and "gene" in f.qualifiers and "REV7" in f.qualifiers["gene"] ) print(rev7_cds_feature) ``` type: gene location: [<3299:>4037](-) qualifiers: Key: gene, Value: ['REV7'] If you would like to search for another type of features, simply replace the `"gene"` with your desired feature type in quotation marks. ### Removing Features In pydna, we can search for the feature that we would like to remove using the feature's types or qualififers. For instance, we can modify the features list to exclude all CDS: ```python #Remove all CDS type features from my feature list sample_record.features = [f for f in sample_record.features if not (f.type == "CDS")] for feature in sample_record.features: print(feature) ``` type: source location: [0:5028](+) qualifiers: Key: chromosome, Value: ['IX'] Key: db_xref, Value: ['taxon:4932'] Key: mol_type, Value: ['genomic DNA'] Key: organism, Value: ['Saccharomyces cerevisiae'] type: mRNA location: [<0:>206](+) qualifiers: Key: product, Value: ['TCP1-beta'] type: gene location: [<686:>3158](+) qualifiers: Key: gene, Value: ['AXL2'] type: mRNA location: [<686:>3158](+) qualifiers: Key: gene, Value: ['AXL2'] Key: product, Value: ['Axl2p'] type: gene location: [<3299:>4037](-) qualifiers: Key: gene, Value: ['REV7'] type: mRNA location: [<3299:>4037](-) qualifiers: Key: gene, Value: ['REV7'] Key: product, Value: ['Rev7p'] We can also modify the features list to exclude a specific gene: ```python #Exclude REV7 from my feature list sample_record.features = [f for f in sample_record.features if not ('gene' in f.qualifiers and 'REV7' in f.qualifiers['gene'])] for feature in sample_record.features: print(feature) ``` type: source location: [0:5028](+) qualifiers: Key: chromosome, Value: ['IX'] Key: db_xref, Value: ['taxon:4932'] Key: mol_type, Value: ['genomic DNA'] Key: organism, Value: ['Saccharomyces cerevisiae'] type: mRNA location: [<0:>206](+) qualifiers: Key: product, Value: ['TCP1-beta'] type: gene location: [<686:>3158](+) qualifiers: Key: gene, Value: ['AXL2'] type: mRNA location: [<686:>3158](+) qualifiers: Key: gene, Value: ['AXL2'] Key: product, Value: ['Axl2p']