Creating Tab-delimited Output Files

Record

Tab-delimited files organized by record are written in the order in which the records were read from the GBSeq XML file. Each record's information is written on one row, except when two or more genes fall within the same group column; in that case the accession number is left blank in the following rows to indicate that they belong to the same record. Each CDS (coding sequence) may be placed in a group column, in the miscellaneous column, or nowhere (if its name is in the delete set). This is the most comprehensive file type CDSParser can produce because it includes much of the record information; it also lets the user see which genes are present in each GenBank record.
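When such a file is read back by a script, the rows with a blank accession number need to be attached to the record above them. The following is a minimal sketch of that step, not part of CDSParser itself. It assumes the accession number is in the first column and that the file has no header row; the function name group_rows_by_record and the sample values are hypothetical.

```python
import csv


def group_rows_by_record(lines, accession_col=0):
    """Group rows so that rows with a blank accession number are attached
    to the most recent row that had one.

    `lines` is any iterable of tab-delimited text lines (an open file works).
    The column position of the accession number is an assumption.
    """
    records = []  # list of (accession, [rows]) in file order
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip completely empty lines
        accession = row[accession_col].strip() if len(row) > accession_col else ""
        if accession or not records:
            # A non-blank accession starts a new record.
            records.append((accession, [row]))
        else:
            # A blank accession continues the previous record.
            records[-1][1].append(row)
    return records


# Hypothetical sample: the second row continues the first record.
sample_lines = [
    "ACC0001\tcox1\tATGAAA\n",
    "\tcox1\tATGCCC\n",
    "ACC0002\tcytb\tATGGGG\n",
]
grouped = group_rows_by_record(sample_lines)
for accession, rows in grouped:
    print(accession, len(rows))
```

To read an actual output file, pass the open file object in place of sample_lines and adjust accession_col to match the columns enabled in defaults.def.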

Taxa

Tab-delimited files organized by taxa are written in taxonomic (alphabetical) order. Records may be written individually in taxonomic order, or combined to represent a synthesis of the available sequences at a certain taxonomic level. The user may specify the number of consecutive names in the taxonomy that must be the same for records to be combined. This value allows CDSParser to combine sequences from various GenBank records to form the most complete set of available sequences for a certain taxonomic level (such as order or genus). CDSParser selects sequences based on the presence of an amino acid sequence and the length of the nucleotide sequence. (Note: the miscellaneous column is not written when GenBank records are combined, because the output file would quickly exceed the number of columns a spreadsheet program permits.)
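The combining step can be pictured as grouping coding sequences by the first few names of their lineage and keeping one sequence per gene in each group. The sketch below illustrates that idea only; it is not CDSParser's code. The exact selection rule is an assumption (a sequence with an amino acid translation is preferred, and ties are broken by the longer nucleotide sequence), and the names Cds, rank and combine_by_taxon are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cds:
    lineage: tuple    # taxonomy names, broadest first
    gene: str
    nucleotide: str
    protein: str      # empty string if no amino acid sequence is present


def rank(cds):
    # Assumed rule: presence of a protein sequence first, then nucleotide length.
    return (bool(cds.protein), len(cds.nucleotide))


def combine_by_taxon(cds_list, depth):
    """Keep the best-ranked CDS per (lineage prefix, gene).

    `depth` plays the role of the number of consecutive taxonomy names
    that must be the same.
    """
    best = {}
    for cds in cds_list:
        key = (cds.lineage[:depth], cds.gene)
        current = best.get(key)
        if current is None or rank(cds) > rank(current):
            best[key] = cds
    # Alphabetical order by lineage prefix, then by gene name.
    return dict(sorted(best.items(), key=lambda item: item[0]))


# Hypothetical data: two records share the first two taxonomy names.
a = Cds(("Eukaryota", "Metazoa", "Aves"), "cox1", "ATGAAA", "")
b = Cds(("Eukaryota", "Metazoa", "Mammalia"), "cox1", "ATG", "M")
c = Cds(("Eukaryota", "Metazoa", "Aves"), "cytb", "ATGCCC", "MP")

combined = combine_by_taxon([a, b, c], depth=2)
for (prefix, gene), cds in combined.items():
    print(" ".join(prefix), gene, len(cds.nucleotide))
```

With depth=2 the two cox1 sequences compete for one slot; with depth=3 they fall into separate groups and both are kept.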

Configuring with defaults.def

Altering the values in defaults.def alters the output of CDSParser. Changes to this file only take effect after CDSParser is restarted. Following each name, 0 stands for false and 1 for true. The first two lines in the file specify whether CDSParser reads in tRNA and rRNA sequences; there is no option for mRNA sequences. The next set of values, ending in "ColR", dictates which columns are written to the tab-delimited files organized by record. The third set of values, ending in "ColT", dictates which columns are written to the tab-delimited files organized by taxa. The last value, "addSpecies", specifies whether the organism name should be added to the lineage or taxonomy. "mSequenceColR" refers to the sequence of the GenBank record, not an individual coding sequence. "nameColR" and "nameColT" refer to the name of the gene or coding sequence. "gNotesColR" and "gNotesColT" refer to notes about a gene or coding sequence. "nSequenceColR" and "nSequenceColT" refer to the nucleotide coding sequence. "pSequenceColR" and "pSequenceColT" refer to the protein coding sequence. "lengthColR" and "lengthColT" refer to the nucleotide coding sequence length. See defaults.def for an example of what this file should look like.
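To check which columns a given defaults.def would enable, the file can be read into a dictionary of true/false values. The sketch below is an illustration only and assumes a layout of one name followed by whitespace and a 0 or 1 per line; the actual layout should be confirmed against the defaults.def shipped with CDSParser. The function names parse_defaults and enabled_columns are hypothetical, and the sample text uses only option names mentioned above.

```python
def parse_defaults(text):
    """Parse 'name value' lines into a dict of booleans.

    Assumed layout: one setting per line, name then 0 or 1, separated by
    whitespace. Lines that do not match are ignored.
    """
    settings = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[1] not in ("0", "1"):
            continue
        settings[parts[0]] = parts[1] == "1"
    return settings


def enabled_columns(settings, suffix):
    """Return the names ending in `suffix` ('ColR' or 'ColT') that are set to 1."""
    return [name for name, on in settings.items() if name.endswith(suffix) and on]


# Hypothetical excerpt in the assumed layout.
sample_text = """nameColR 1
mSequenceColR 0
nSequenceColR 1
nameColT 1
lengthColT 0
addSpecies 0
"""

settings = parse_defaults(sample_text)
print("Record columns:", enabled_columns(settings, "ColR"))
print("Taxa columns:", enabled_columns(settings, "ColT"))
print("Add species to lineage:", settings["addSpecies"])
```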