SEQUENCE FILE NAMES
At MGEL, we name sequence files using a convention similar to the one we use for labeling plates containing DNA clones. Each sequence file name consists of nine identification fields. Below is an example of a file name broken into its nine fields and a legend explaining the meaning and limits of each field.
Field 1 - First letter of the genus of the organism from which the sequence was derived. In the example, the P stands for Pinus.
Field 2 - First letter of the species name of the organism from which the sequence was derived. In the example, the T stands for taeda.
Field 3 - First letter of the subspecies from which the sequence was derived. As shown, if a subspecies name is not known or not applicable, an underscore, _, is used to show that the subspecies field is empty.
Field 4 - First letter or number of the cultivar/genotype from which the sequence was derived. As with Field 3, an underscore can be used to indicate the lack of a known cultivar or genotype. In the example, the 7 stands for the genotype "7-56."
Field 5 - A one-letter abbreviation for the method by which the sequence was obtained. At MGEL, we currently have sequences with the following Field 5 designations:
- S = single/low-copy, Cot-filtered. An 'S' sequence is from the slowest, mathematically-discernible kinetic component of a genome as defined via standard Cot analysis.
- M = moderately repetitive, Cot-filtered. An 'M' sequence is from the middle component of a three component genome as delineated by standard Cot analysis.
- H = highly repetitive, Cot-filtered. An 'H' sequence is from the fastest reassociating component (excluding foldback DNA) of a three component genome as defined by standard Cot analysis.
- R = repetitive, Cot-filtered. An 'R' sequence is prepared from the fastest reassociating component (excluding foldback DNA) of a two component genome as determined by standard Cot analysis OR from a library prepared from the combined 'H' and 'M' components of a three component genome.
- U = ultra-rapidly reassociating DNA. A 'U' sequence is from the DNA that binds to a hydroxyapatite (HAP) column (in 0.12 M sodium phosphate buffer) at a Cot value of virtually zero. Ultra-rapidly reassociating DNA has long been thought to be the result of intramolecular base pairing (hence its other names, foldback DNA and snap-back DNA), although recent studies suggest that this explanation is too simplistic.
- T = theoretical single-copy, Cot-filtered. A 'T' sequence is isolated from any DNA that remains single-stranded at 0.1 times the theoretical Cot value for single-copy DNA as predicted from genome size (see Peterson 2005 for further explanation).
- G = genomic (random) DNA sequence.
In the example, the G indicates that this sequence is from genomic DNA.
Field 6 - A one-character abbreviation for the type of DNA analyzer used in sequencing. The following options are currently available:
- 4 = 454/Roche Applied Sciences Genome Sequencer
- C = capillary sequencer (e.g., ABI 3730)
- I = Illumina Genome Analyzer
In the example, the 4 indicates that a 454 Genome Sequencer 20 was used in sequencing.
Field 7 - This is always an underscore, _, and simply separates the first six fields from the last two fields.
Field 8 - A number (from 1 to 10,000), written with leading zeros, indicating the nth file of this type in the sequence database. For example, the 00004 in Field 8 indicates that this is the fourth file of sequences to be placed on the MGEL server that is described (in Fields 1-7) by PT_7R4_.
Field 9 - A file extension that indicates the type of data contained therein. The extension .fas indicates a FASTA file, while the extension .fas.qual is used to identify a quality file. Of note, each .fas file in the MGEL database should have a corresponding .fas.qual file that possesses an identical string of letters/numbers in Fields 1-8. If sequences are generated on a capillary sequencer, they may have a corresponding trace file as well. Trace files that we produce generally have either the extension .abi or .scf.
The example above is a quality file (.fas.qual).
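The nine-field layout described above lends itself to a simple pattern match. The following is a minimal sketch in Python, not an official MGEL tool. The names parse_name and missing_quality_files are hypothetical, and the sketch assumes that all letters are upper case and that Field 8 is always written as exactly five digits (as in 00004). The file name used in the usage lines is an illustrative one assembled from the field values given in the legend.

```python
import re

# Field 5 and Field 6 codes, copied from the legend above.
METHODS = {
    "S": "single/low-copy, Cot-filtered",
    "M": "moderately repetitive, Cot-filtered",
    "H": "highly repetitive, Cot-filtered",
    "R": "repetitive, Cot-filtered",
    "U": "ultra-rapidly reassociating DNA",
    "T": "theoretical single-copy, Cot-filtered",
    "G": "genomic (random) DNA",
}
ANALYZERS = {
    "4": "454/Roche Applied Sciences Genome Sequencer",
    "C": "capillary sequencer",
    "I": "Illumina Genome Analyzer",
}

# Assumptions (not stated in the legend): upper-case letters only,
# and Field 8 is zero-padded to exactly five digits.
NAME_RE = re.compile(
    r"(?P<genus>[A-Z])"            # Field 1
    r"(?P<species>[A-Z])"          # Field 2
    r"(?P<subspecies>[A-Z_])"      # Field 3, "_" if empty
    r"(?P<genotype>[A-Z0-9_])"     # Field 4, "_" if empty
    r"(?P<method>[SMHRUTG])"       # Field 5
    r"(?P<analyzer>[4CI])"         # Field 6
    r"_"                           # Field 7, always an underscore
    r"(?P<number>\d{5})"           # Field 8
    r"(?P<extension>\.fas\.qual|\.fas|\.abi|\.scf)"  # Field 9
)


def parse_name(name):
    """Return a dict of fields, or None if the name does not fit the convention."""
    match = NAME_RE.fullmatch(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["number"] = int(fields["number"])
    # Field 8 runs from 1 to 10,000 according to the legend.
    if not 1 <= fields["number"] <= 10000:
        return None
    return fields


def missing_quality_files(names):
    """List .fas files that lack a .fas.qual file with identical Fields 1-8."""
    present = set(names)
    return sorted(
        n for n in present
        if n.endswith(".fas") and n + ".qual" not in present
    )


fields = parse_name("PT_7G4_00004.fas.qual")
print(METHODS[fields["method"]], "/", ANALYZERS[fields["analyzer"]])
```

Returning None for a non-conforming name, rather than raising an exception, is only a design choice for this sketch; it makes it easy to filter a directory listing for names that do not follow the convention.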
As discussed above, sequence files are named, in part, based upon the genus and specific epithet of the organism from which the sequences were derived. Here is a list of the two-letter abbreviations (Fields 1 and 2 above) used for species with which we currently work.
GB = Ginkgo biloba
GH = Gossypium hirsutum
GR = Gossypium raimondii
PA = Picea abies
PT = Pinus taeda
SB = Sorghum bicolor
SP = Sorghum propinquum
TA = Triticum aestivum
TD = Taxodium distichum
If we were to add Gossypium barbadense sequences to our database, we could not use the abbreviation GB to represent this species because GB is already being used to represent Ginkgo biloba. Consequently, an alternative would have to be picked (e.g., GP for Gossypium 'Pima'; Pima cotton is a common name for G. barbadense).
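The collision described above can be caught automatically when a new species is added. The following is a minimal sketch in Python under stated assumptions: the function names default_code and register are hypothetical, the table is copied from the list above, and the alternative abbreviation is always supplied by a person rather than generated.

```python
# Two-letter abbreviations currently in use (copied from the list above).
SPECIES_CODES = {
    "GB": "Ginkgo biloba",
    "GH": "Gossypium hirsutum",
    "GR": "Gossypium raimondii",
    "PA": "Picea abies",
    "PT": "Pinus taeda",
    "SB": "Sorghum bicolor",
    "SP": "Sorghum propinquum",
    "TA": "Triticum aestivum",
    "TD": "Taxodium distichum",
}


def default_code(binomial):
    """First letter of the genus plus first letter of the species (Fields 1 and 2)."""
    genus, species = binomial.split()[:2]
    return (genus[0] + species[0]).upper()


def register(codes, binomial, fallback=None):
    """Add a species to `codes` and return the abbreviation that was assigned.

    If the default abbreviation already belongs to a different species,
    the hand-picked `fallback` is used instead; if no usable fallback is
    given, a ValueError is raised so the clash cannot go unnoticed.
    """
    code = default_code(binomial)
    if code in codes and codes[code] != binomial:
        if fallback is None or fallback in codes:
            raise ValueError(
                f"{code} already stands for {codes[code]}; "
                f"choose an unused alternative for {binomial}"
            )
        code = fallback
    codes[code] = binomial
    return code


codes = dict(SPECIES_CODES)
print(register(codes, "Gossypium barbadense", fallback="GP"))
```

Raising an error when no alternative is given, instead of inventing one, mirrors the text: the replacement abbreviation is a judgment call (such as GP for 'Pima') rather than something derivable from the species name.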