SEQUENCE FILE NAMES


Naming Conventions

At MGEL we name sequence files with a convention similar to the one we use for labeling plates containing DNA clones.  Each sequence file name consists of nine identification fields.  Below is an example of a file name broken into its nine fields, followed by a legend explaining the meaning and limits of each field.

[Figure: example sequence file name broken into its nine fields]

Field 1 - First letter of the genus of the organism from which the sequence was derived.  In the example, the P stands for Pinus.

Field 2 - First letter of the species name of the organism from which the sequence was derived.  In the example, the T stands for taeda.

Field 3 - First letter of the subspecies from which the sequence was derived.  As shown, if a subspecies name is not known or not applicable, an underscore (_) is used to show that the subspecies field is empty.

Field 4 - First letter or number of the cultivar/genotype from which the sequence was derived.  As with Field 3, an underscore can be used to indicate the lack of a known cultivar or genotype.  In the example, the 7 stands for the genotype "7-56."

Field 5 - A one-letter abbreviation for the method by which the sequence was obtained.  At MGEL, we currently have sequences with the following Field 5 designations:

In the example, the G indicates that this sequence is from genomic DNA.

Field 6 - A one-character code (letter or number) for the type of DNA analyzer used in sequencing.  The following options are currently available:
     4 = 454/Roche Applied Sciences Genome Sequencer
     C = capillary sequencer (e.g., ABI 3730)
     I = Illumina Genome Analyzer
   In the example, the 4 indicates that a 454 Genome Sequencer 20 was used in sequencing.

Field 7 - This is always an underscore, _, and is simply used to delimit the first six fields from the last two fields.

Field 8 - A number (from 1 to 10,000), zero-padded to five digits as in the example, indicating the nth file of this type in the sequence database.  For example, the 00004 in Field 8 indicates that this is the fourth file of sequences placed on the MGEL server whose Fields 1-7 read PT_7R4_.

Field 9 - A file extension that indicates the type of data the file contains.  The extension .fas indicates a FASTA file, while the extension .fas.qual identifies a quality file.  Note that each .fas file in the MGEL database should have a corresponding .fas.qual file with an identical string of letters/numbers in Fields 1-8.  Sequences generated on a capillary sequencer may have a corresponding trace file as well.  The trace files we produce generally have either the extension .abi or .scf.
   The example above is a quality file (.fas.qual).
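
The field layout in the legend above can be checked mechanically.  What follows is a minimal Python sketch under stated assumptions, not MGEL's own tooling.  The names parse_sequence_file_name and missing_quality_files are hypothetical, and the sample name PT_7G4_00004.fas.qual is assembled from the field descriptions rather than copied from the database.  Because the Field 5 codes are not enumerated here, the pattern accepts any single uppercase letter in that position.  It also assumes uppercase letters in Fields 1-3 and a five-digit, zero-padded Field 8.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Analyzer codes listed under Field 6.
ANALYZERS = {
    "4": "454/Roche Applied Sciences Genome Sequencer",
    "C": "capillary sequencer",
    "I": "Illumina Genome Analyzer",
}

# Assumed layout: Fields 1-6 are one character each, Field 7 is "_",
# Field 8 is five zero-padded digits, Field 9 is .fas or .fas.qual.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z])"
    r"(?P<species>[A-Z])"
    r"(?P<subspecies>[A-Z_])"
    r"(?P<genotype>[A-Z0-9_])"
    r"(?P<method>[A-Z])"  # assumption: any letter, since the codes are not listed
    r"(?P<analyzer>[4CI])"
    r"_"
    r"(?P<number>\d{5})"
    r"(?P<extension>\.fas\.qual|\.fas)$"
)


class SequenceFileName(NamedTuple):
    genus: str
    species: str
    subspecies: Optional[str]  # None when the field is "_"
    genotype: Optional[str]    # None when the field is "_"
    method: str
    analyzer: str
    number: int
    extension: str


def parse_sequence_file_name(name: str) -> SequenceFileName:
    """Split a file name into its fields, raising ValueError if it does not fit."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid sequence file name: {name!r}")
    number = int(match["number"])
    if not 1 <= number <= 10000:
        raise ValueError(f"Field 8 out of range (1 to 10,000): {number}")

    def blank_to_none(value: str) -> Optional[str]:
        return None if value == "_" else value

    return SequenceFileName(
        genus=match["genus"],
        species=match["species"],
        subspecies=blank_to_none(match["subspecies"]),
        genotype=blank_to_none(match["genotype"]),
        method=match["method"],
        analyzer=match["analyzer"],
        number=number,
        extension=match["extension"],
    )


def missing_quality_files(names: Iterable[str]) -> List[str]:
    """Return the .fas names that have no matching .fas.qual name."""
    present = set(names)
    return sorted(
        n for n in present
        if n.endswith(".fas") and n + ".qual" not in present
    )
```

Calling parse_sequence_file_name("PT_7G4_00004.fas.qual") yields a record whose subspecies is None and whose number is 4.  Passing a directory listing to missing_quality_files reports any FASTA file that lacks its quality file.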

Species Abbreviations

As discussed above, sequence files are named, in part, based on the genus and specific epithet of the organism from which the sequences were derived.  Here is a list of the two-letter abbreviations (Fields 1 and 2 above) used for the species with which we currently work.

GB = Ginkgo biloba
GH = Gossypium hirsutum
GR = Gossypium raimondii
PA = Picea abies
PT = Pinus taeda
SB = Sorghum bicolor
SP = Sorghum propinquum
TA = Triticum aestivum
TD = Taxodium distichum

If we were to add Gossypium barbadense sequences to our database, we could not use the abbreviation GB to represent this species, as GB is already being used to represent Ginkgo biloba.  Consequently, an alternative would have to be picked (e.g., GP for Gossypium 'Pima'; Pima cotton is a common name for G. barbadense).
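
The collision rule above can be expressed as a small lookup table with a guard.  This is only an illustrative Python sketch: the function names default_abbreviation and register_species are hypothetical, and the table simply restates the list above.

```python
from typing import Dict, Optional

# Two-letter abbreviations (Fields 1 and 2) restated from the list above.
SPECIES_ABBREVIATIONS: Dict[str, str] = {
    "GB": "Ginkgo biloba",
    "GH": "Gossypium hirsutum",
    "GR": "Gossypium raimondii",
    "PA": "Picea abies",
    "PT": "Pinus taeda",
    "SB": "Sorghum bicolor",
    "SP": "Sorghum propinquum",
    "TA": "Triticum aestivum",
    "TD": "Taxodium distichum",
}


def default_abbreviation(binomial: str) -> str:
    """First letter of the genus plus first letter of the specific epithet."""
    genus, epithet = binomial.split()[:2]
    return (genus[0] + epithet[0]).upper()


def register_species(
    table: Dict[str, str], binomial: str, abbreviation: Optional[str] = None
) -> str:
    """Add a species to the table, refusing a code already used by another species."""
    code = abbreviation if abbreviation is not None else default_abbreviation(binomial)
    if len(code) != 2 or not code.isalpha() or not code.isupper():
        raise ValueError(f"abbreviation must be two uppercase letters: {code!r}")
    existing = table.get(code)
    if existing is not None and existing != binomial:
        raise ValueError(f"{code} is already used for {existing}")
    table[code] = binomial
    return code
```

With this table, register_species(table, "Gossypium barbadense") raises ValueError because GB is taken, while register_species(table, "Gossypium barbadense", "GP") succeeds.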