Gp

From Christoph's Personal Wiki

GP is a set of small programs for working with sequence data. It is part of genpak. The programs are written as a sort of 'biological' extension to the standard Unix tools (sed, awk, grep, and the whole myriad of other small, useful tools). They read from standard input and write to standard output, so you can place them in a pipe like any other Linux command or use them in a CGI script. All programs are written in ANSI C and are meant to (1) help manipulate large data sets in intensive batch processing; and (2) facilitate building CGI-based local web servers that provide some basic functions.

List of GP programs / utilities

gp_qs 
quickly finds a sequence within a larger sequence and prints out the positions. Sometimes you just don't need BLAST, for example when you only want to know where exactly your primer binds in a given sequence. You can either type the query sequence directly as a command-line argument, like
gp_qs ACTGACTG [sequence filename] 
or give the name of a file containing it as the argument instead.
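The kind of search gp_qs performs can be illustrated with a minimal Python sketch. This is not the gp_qs source: the function name find_positions is hypothetical, and the sketch assumes 1-based positions, a forward-strand-only search, and overlapping matches being reported, none of which is stated above.

```python
def find_positions(query, target):
    """Return 1-based start positions of every match of query in target.

    Assumptions (not taken from gp_qs itself): positions are 1-based,
    only the given strand is searched, overlapping matches are reported.
    """
    query, target = query.upper(), target.upper()
    if not query:
        return []
    positions = []
    start = target.find(query)
    while start != -1:
        positions.append(start + 1)          # convert 0-based index to 1-based
        start = target.find(query, start + 1)  # step by one to allow overlaps
    return positions


if __name__ == "__main__":
    print(find_positions("ACTG", "ACTGACTGTTACTG"))
```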
gp_getseq 
quickly retrieves a sequence fragment. Usage is simple: gp_getseq Position1 Position2 [sequence filename] Position1 is the number of the first base to be retrieved, and Position2 is the last base to be retrieved. Note that if Position2 < Position1, the retrieved sequence is complementary to the fragment Position2...Position1.
gp_gc 
prints out the GC content of a given sequence or sequences. Can also compute the mean and standard error (SE) for a larger number of sequences.
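As a rough illustration of the calculation, here is a minimal Python sketch. The function names are hypothetical, and it assumes that GC content is reported as a fraction of the unambiguous bases (A, C, G, T) and that SE means the sample standard deviation divided by the square root of the number of sequences; gp_gc itself may differ on both points.

```python
import math
import statistics


def gc_content(seq):
    """GC fraction of a sequence, ignoring anything that is not A, C, G or T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def mean_and_se(values):
    """Mean and standard error (sample SD / sqrt(n)) of a list of numbers."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, 0.0  # SE is undefined for one value; 0.0 is a placeholder
    return mean, statistics.stdev(values) / math.sqrt(len(values))


if __name__ == "__main__":
    seqs = ["ATGC", "GGCC", "ATAT"]
    print(mean_and_se([gc_content(s) for s in seqs]))
```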
gp_map 
automatically generates graphical gene maps. You provide a simple input (a list of genes, their positions, maybe some parameters) and the program outputs a PNG image showing the gene map. If the -H option is specified, an IMAP file is additionally created: this allows clickable graphical maps to be created on the fly.
gp_tm 
prints out the Tm (melting temperature) of a given sequence. Three algorithms can be used: the exact nearest-neighbor algorithm, the approximate GC-content algorithm, and the evil and false 4*[GC] + 2*[AT] algorithm.
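Only the third formula is simple enough to show in a few lines. The Python sketch below (function name hypothetical, not part of gp_tm) implements 4*[GC] + 2*[AT], often called the Wallace rule, with the result in degrees Celsius. As noted above, it should be treated as a crude estimate for short oligonucleotides at best.

```python
def tm_rule_of_thumb(seq):
    """Tm estimate in degrees Celsius using 4*(G+C) + 2*(A+T).

    This is only the crude rule of thumb, not the nearest-neighbor
    or GC-content methods; other characters are ignored.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 4 * gc + 2 * at


if __name__ == "__main__":
    print(tm_rule_of_thumb("ACTGACTG"))
```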
gp_matrix 
looks for promoters in a set of sequence files, using the Staden matrix (see: Hertz, G. and Stormo, G.D. 1996. Escherichia coli promoter sequences: analysis and prediction. Meth. Enzym. 273). Basically, you have a matrix file containing scores and penalties for nucleotides at different positions in the supposed -35 and -10 boxes, as well as in the +1 region of a given sequence (see the file "matryca" in the data/ directory, which is the same as the E. coli matrix published in Hertz et al.).
The program loads sequences from the sequence file and then scans each of them, trying all possible combinations of gap lengths between the +1, -10 and -35 boxes at all possible positions in the sequence, so as to find the combination that gives the highest score for the sequence. It then prints formatted output in the following form:
  1. score sequence...[-35 core]...[-10 core]...[start]...
The '|' characters denote the boundaries of the fragments matched against the matrix.
In the "data" directory you will find the original Staden E. coli matrix. The myco.mtx Mycoplasma pneumoniae matrix and the program have been described in Weiner, J. et al. 2000, "Transcription in Mycoplasma pneumoniae".
gp_mkmtx 
creates nucleotide frequency matrices, such as those used by the gp_matrix program.
gp_shift 
sometimes you have a list of genes:
100000 101000 gene1
200000 201000 gene2
400000 391000 gene3
...
...and would like to, for example, print out the promoter regions, that is, sequences from -100 to +10 relative to the 5'-end of the genes. gp_shift is useful for this.
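The coordinate arithmetic involved can be sketched in Python as follows. This is an illustration, not gp_shift's actual behaviour or options: the function name is hypothetical, and it assumes that the first number on each line is the 5'-end of the gene, that a gene whose first number is larger than its second lies on the reverse strand, and that offsets are simply added to (or, on the reverse strand, subtracted from) the 5' coordinate.

```python
def promoter_region(start, end, upstream=100, downstream=10):
    """Return (first, last) coordinates of the region around a gene's 5'-end.

    Assumed convention: start is the 5'-end; start > end means the gene
    is on the reverse strand, so the offsets run in the opposite direction.
    """
    if start <= end:                       # forward strand
        return start - upstream, start + downstream
    return start + upstream, start - downstream  # reverse strand


if __name__ == "__main__":
    genes = [(100000, 101000, "gene1"),
             (200000, 201000, "gene2"),
             (400000, 391000, "gene3")]
    for start, end, name in genes:
        first, last = promoter_region(start, end)
        print(first, last, name)
```

With these assumptions, the region for a reverse-strand gene comes out with its first coordinate larger than its second, matching the way gene3 is written in the list.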
gp_randseq 
unless the option -r is set, it prints out random fragments from a sequence file. The default fragment length is 100, and you can change it with the option -l length. If you set -r, however, completely random sequences are generated. You can set their GC content with the option -g value. There is also an option -m, which stands for "Markov chains", but all it does is ensure that the probability of selecting a nucleotide depends on the previous nucleotide; these probabilities are also taken from a sequence file.
gp_seq2prot 
converts a nucleotide sequence to a protein sequence. The sequence is supposed to start with a start codon: this is mandatory. A missing stop codon or a premature end of the input sequence (for example, in the middle of a codon) results only in a warning message.
You can provide your own codon tables; for the format of the codon_file look at data/standard.cdn and data/myco.cdn. Basically, you need not provide the whole table; it is enough to point out the differences. To see a codon file, type gp_seq2prot -p.
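The behaviour described above can be sketched in Python. This is not the gp_seq2prot source and does not read the .cdn file format: the function name and the overrides dictionary are hypothetical stand-ins for "a table that only lists the differences", ATG is assumed to be the only accepted start codon, and unknown codons are written as X.

```python
import itertools
import warnings

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(codon): aa
            for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO)}


def translate(seq, overrides=None):
    """Translate a coding sequence; overrides lists only the differing codons."""
    table = dict(STANDARD)
    table.update(overrides or {})
    seq = seq.upper()
    if not seq.startswith("ATG"):          # assumption: ATG is the start codon
        raise ValueError("sequence must begin with a start codon")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = table.get(seq[i:i + 3], "X")  # X marks codons with unknown bases
        if aa == "*":
            return "".join(protein)        # stop codon reached
        protein.append(aa)
    if len(seq) % 3:
        warnings.warn("input ends in the middle of a codon")
    warnings.warn("no stop codon found")
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
    # In Mycoplasma, TGA codes for tryptophan instead of stop.
    print(translate("ATGTGAGCCTAA", overrides={"TGA": "W"}))
```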
gp_findorf 
prints out all ORFs contained in a sequence. gp_findorf always looks for the longest ORF within the given limit. See also the notes for gp_seq2prot.
gp_cusage 
prints out the codon usage of sequence(s). It takes the same options as gp_seq2prot; actually, this *is* nearly the same program. I just like to keep them separate.
gp_slen 
sequence length. Sometimes useful. Can also compute the mean and SE for a set of sequences.
gp_dimer 
records frequencies of nucleotide pairs: AA, AC, AG...TT. This is sometimes useful for characterizing a sequence. You can also record frequencies of nucleotide pairs separated by a given number of nucleotides, to check, for example, how often an 'A' comes five nucleotides downstream of a 'T'. Believe it or not, it is useful.
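A minimal Python sketch of this kind of counting follows. The function name and the distance parameter are hypothetical: distance is the offset between the two positions (1 means adjacent bases), which may not match how gp_dimer itself counts the separation. The sketch returns raw counts rather than normalized frequencies.

```python
from collections import Counter


def pair_counts(seq, distance=1):
    """Count pairs (seq[i], seq[i + distance]) over the whole sequence.

    distance=1 gives ordinary dinucleotides; distance=5 pairs each base
    with the base five positions downstream. Pairs containing characters
    other than A, C, G, T are skipped.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - distance):
        first, second = seq[i], seq[i + distance]
        if first in "ACGT" and second in "ACGT":
            counts[first + second] += 1
    return counts


if __name__ == "__main__":
    print(pair_counts("AACGTTAACG"))
    print(pair_counts("TACGTATTTTTA", distance=5)["TA"])
```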
gp_trimer 
records frequencies of nucleotide trimers: AAA, AAC, AAG...TTT.
gp_pattern 
records frequencies of patterns of a given length. Note that the number of possible patterns increases exponentially with the pattern length: for a tetramer, for example, there are 4^4 = 256 possible patterns.
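To make the growth concrete, here is a small Python sketch (function names hypothetical, not taken from gp_pattern) that counts all overlapping patterns of length k and computes how many distinct patterns are possible for a four-letter alphabet.

```python
from collections import Counter


def pattern_counts(seq, k):
    """Count every overlapping window of length k in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def possible_patterns(k):
    """Number of distinct patterns of length k over A, C, G, T."""
    return 4 ** k


if __name__ == "__main__":
    print(pattern_counts("ACGTACGT", 4))
    for k in range(1, 9):
        print(k, possible_patterns(k))
```

Already at k = 8 there are 65536 possible patterns, so most of them will not occur at all in a short sequence.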
gp_primer 
calculates oligonucleotide stem/loop and dimer structural parameters. This is what most of the web pages and programs like "Oligo" do. The set of thermodynamic parameters used here comes from a paper by SantaLucia et al.
gp_acc 
this program can be used to convert a sequence into a set of so-called auto-cross-correlation coefficients, which can be further analysed by, for example, principal component analysis (PCA). If you want to learn more about it, read Jonsson et al., 1991, "A multivariate representation and analysis of DNA sequence data".
gp_scan 
used to further analyse the auto-cross-correlation terms to find out more about patterns or regularities occurring in a sequence.

External links