CRITICA Documentation

Jonathan Badger

August 1, 2000

Introduction

CRITICA is a series of programs designed to find protein coding regions in intron-less genomic DNA. It doesn't require (or use) any sequence annotation for its predictions. However, it does require that at least some of the genes in the sequence have homologs in data-banks for CRITICA's comparative analysis and that that the overall dicodon bias in these genes is typical of the organism for CRITICA's non-comparative analysis. In practice, this means that one should either analyze complete genomes with CRITICA or a sizeable chunk (>=100,000 nucleotides). It is not required that the sequence be in one contig, however.

Installation

CRITICA is implemented as a series of Perl 5.0 scripts and ANSI C code. To make the C programs simply type make (if your compiler is not GCC, you'll have to edit the Makefile). Put the generated programs somewhere on your path. The scripts in the scripts directory also need to be on your path (Critica.pm should be placed somewhere on your PERLLIB path). The location of perl in the scripts may have to be edited if your perl is not in /usr/bin. Also, you may have to adjust several environmental variables for the script blast-contigs: CRITICA_BLASTN (default value ``blastn'') is the name of the BLASTN program used by CRITICA. CRITICA_BLASTPARM (default value ``-warnings -nogaps E=1e-4 E2=1e-4'' for use with WU-BLAST2; a good set of values for NCBI BLAST2 is ``-gF -e1e-4'') is a set of parameters to be sent to BLASTN (if you use a version of BLASTN that generates gapped alignments, you should turn off this feature by adding a flag such as -nogap to CRITICA_BLASTPARM). CRITICA_BLASTDB (default value ``gb'') is the name of the BLASTN database to be searched. WinCritica.mak is a makefile for Microsoft Visual C++ and Makefile is the UNIX makefile. The two files critica.tex and critica.html (with a stylesheet in critica.css) are simply the documentation that you are reading now in L^ATEX and html format. LICENSE is a copy of the terms of the GNU Public License (GPL).

Running CRITICA - a tutorial

Probably the best way to demonstrate how to use CRITICA is to walk though a typical analysis. To start with, you need the DNA to be analyzed to be in FASTA (also called Pearson) format. Here is an example of part of a FASTA file:

>ECOLI
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG

Let's say you were going to analyze E. coli with CRITICA and the genomic data was contained in a file named contigs. The first part of a CRITICA analysis requires running blastn on small subsequences of the genomic data: the script to do this is called blast-contigs. Below is an example:

        blast-contigs contigs > EC.blast

If you had additional data not in the main database. you could additionally run blast-contigs as follows:

        blast-contigs -db=mydata contigs >> EC.blast

BLASTing a genome against GenBank often takes overnight; be prepared.

Next the BLAST output has to be processed into a format readable by CRITICA.

	make-blastpairs EC.blast > EC.blast.pairs

Now the program scanblastpairs can be used to tabulate the triplets aligned to E. coli in the blast.pairs file

	scanblastpairs contigs EC.blast.pairs EC.triplets

Now we can predict coding regions. Because it is best to have CRITICA iterate several times, a script iterate-CRITICA has been supplied. It is run as follows:

	iterate-CRITICA ecorfs contigs EC.triplets

This will take twenty minutes (on a 300 Mhz Pentium II) to several hours (on slower machines) to run and will generate numerous files, among them ecorfs0.cds, ecorfs1.cds, ecorfs2.cds, and ecorfs3.cds. These files are the predictions made by CRITICA at each iteration. The last file (ecorfs3.cds) should be the most accurate. Here is part of the ecorfs3.cds file for interpretation:

ECOLI  35371  34781  2.86e-35   16   3399   2708  -93 GTG    48  12 AGGA    
ECOLI  36162  35377  8.28e-50   32   2171   6818   62 ATG    40   8 GGAG    
ECOLI  37824  36271  8.18e-45   64   1002   7385   63 ATG    84   3 GAGGTG  
ECOLI  39115  37898  7.12e-53  128      0  10183   63 ATG    76   7 AGGAG   
ECOLI  40386  39244  4.09e-50  128    324   9258   63 ATG    73   7 GAGGT  
ECOLI  41931  40417  1.29e-53   64    225  10286   63 ATG  -123   0 -  
  ^      ^      ^      ^        ^      ^    ^      ^^^^^^   ^^^^^^^^^^^^
contig start   end    P-value   M     comp  dicod  initiator    SD

The P-value is the amount of statistical support for the coding region. Like BLAST, low scores are better. The matrix (M) is the best comparative matrix - high matrices mean the homologs found were more distant. The start and end are the start and end of the coding region, including the terminator. Comp and dicod represent the scores the orf has from comparative support and dicodon support respectively, The initiator information consists of the score given to the initiator codon and the codon itself. The SD information consists of the score given to the best SD sequence for the chosen initiator, the number of nucleotides between the end of the SD sequence and the start of the coding region, and the SD sequence itself.

Summary of programs and scripts

addlongorfs

Usage: addlongorfs [options] cds-file seq-file 
Valid addlongorfs options:
        -orf-aa-length=length
        -genetic-code=number

This program lists orfs over a certain length that aren't included in the cds file given as input. The main use of this program is that it is called by iterate-CRITICA if the -add-longorfs= option is given.

annotation-compare

usage: annotation-compare annotation.cds prediction.cds

This program compares the results of a CRITICA run against a file of known coding regions of the format contig-name start end (...). It outputs the percentage of false positives and false negatives, as well as the percentages of too early and late start positions. It also generates files ending in .fp, .fn, .early and .late which have the respective CRITICA calls in them.

blast-contigs

Usage: blast-contigs [-db=dbname] fasta-file

This script blasts the DNA in a fasta file against a DNA database. Either the default database (usually GenBank or a non-redundant DNA database) is used, or if the -db= option is chosen, a database of the user's choice. The DNA to be blasted is split into small fragments before BLASTing.

compare-orf-lists

Usage: compare-orf-lists list1 list2

This script compares lists of coding regions. The lists can either be in the format of CRITICA .cds files or in a simple format:

  
contig start end

The resulting list is a list of coding regions in list1 not in list2 and the vice versa. The start position of the coding region doesn't matter in these comparisons.

critica

Usage: critica [options] seq triplet-scores
Valid critica options:
        -scoring-matrix=file
        -dicodon-scores=file
        -init-scores=file
        -sd-scores=file
        -prom-scores=file
        -no-sdscores
        -prom-find
        -genetic-code=number
        -threshold=value
        -alpha=value
        -strict-threshold
        -frameshift-threshold=value
        -quick-stats

This is the main CRITICA program. It takes as arguments the sequence to be analyzed and set of comparative triplets (created by scanblastpairs -- see above). It outputs a list of predicted coding regions with start positions within two orders of magnitude of the best start position for each coding region.

Options

-scoring-matrix=file -- this causes CRITICA to use an empirically derived scoring matrix instead of the theoretically determined scores as described in the manuscript. An advantage of the empirical scores is that conservative changes can be rewarded. However, in practice I've usually found that this does not significantly change the results, suggesting that identities are sufficient for finding coding regions.

-dicodon-scores -- this causes CRITICA to use a set of dicodon scores (generated by the dicodontable program -- see below) for noncomparative support.

-init-scores -- this causes CRITICA to use a set of initiator scores (generated by iterate-critica -- see below) for initiator choice.

-sd-scores -- this causes CRITICA to use a set of SD scores (generated by iterate-critica -- see below) to improve initiator choice.

-prom-scores -- this causes CRITICA to use a set of promoter scores (generated by iterate-critica -- see below) to improve initiator choice.

-no-sdscores -- this turns off the search for SD sequences. This is useful for analyzing organisms which you know don't use SD sequences.

-prom-find -- This turns on the currently experimental (and disabled by default) promoter finding routines of CRITICA.

-genetic code=number -- this causes CRITICA to use a different genetic code than the ``universal'' one. The numbers used are are the standard NCBI genetic code numbers.

-threshold=value -- this sets the threshold for the maximum p-value allowed for an region to be considered a coding region. The default is $1\times 10^{-4}$ , which seems to work the best.

-alpha=value -- this sets the parameter ``alpha'' which scales the dicodon scores down to compensate for their non-independent nature. The default value is 1.0.

-strict-threshold -- by default CRITICA accepts as coding regions regions that meet the demand of the threshold even if extending them to a terminator causes the region to drop below the amount of support required by the threshold. Using -strict-threshold eliminates them. In practice this tends not to be desirable, however.

-frameshift-threshold=value -- causes CRITICA to check for potential frameshift errors in the sequences it is analyzing. The value supplied is the number of nucleotides between two regions of coding evidence in difference frames below which the regions are assumed to be caused by a frameshift. A good value to use is 10. The potential frameshifts are outputed to stderr - in the case of iterate-critica (below) the messages are redirected to the .mesg files.

-quick-stats -- one of the more time consuming parts of a CRITICA run is calculating the parameters K and lambda. This option causes CRITICA to use conservative values (0.2 and 0.015 respectively) and thus speeds the run (at the expense of losing some marginal cases).

dicodontable

Usage: dicodontable [options] cds-file seq-file
Valid dicodontable options:
        -fraction-coding=fraction

This program generates a set of dicodon scores from a list of coding regions (generated by best-inits, above) and the FASTA file of the DNA to be analyzed. The option -fraction-coding allows adjustment for the fact that in the first (comparative evidence only iteration) less than a realistic percentage of the DNA is called coding. The default behavior of iterate CRITICA is to set this parameter to 0.8 in the first iteration, and not set it at all in following iterations. The table generated by dicodontable has the following format:

dicodon dicodon-score frequency-in-noncoding DNA

empirical-matrix

Usage: empirical-matrix coding-blast-pairs

This program creates a scoring matrix including conservative changes from a blast-pairs file created from a BLASTN run of known coding regions in the organism of interest. The scores are based on the log odds of two triplets being aligned in the coding frame (the 1 frame) vs. being aligned in other frames. The matrix is used via the -scoring-matrix option of CRITICA and iterate-CRITICA.

extractcoding

Usage: extractcoding [options] fasta-file cds-file
valid extractcoding options:
     -desc  (include gene descriptions in headers)
     -prot  (output translated sequences)
     -genetic-code=NCBI-num
     -min=num (minimum length (in amino acids) of extracted genes)
     -max=num (maximum length (in amino acids) of extracted genes)

This program generates FASTA files of the genes predicted by CRITICA (actually, it can read any file that starts contig-name start end). The default is to output DNA sequences; the program can also translate the genes if desired (through the -prot option. The -desc option is useful particularly for generating FASTA files from annotation files of the format contig-name start end description....

histo-orf

Usage: histo-orf bin-size cds-file

This script produces a histogram of coding-region lengths (in amino acids). The arguments are the size (in amino acids) of the histogram bin size and the list of coding regions. The list can either be in the format of CRITICA .cds files or any file that starts contig-name start end.

The program produces input showing the number of orfs that have a length less than the number on the left (and whose length is greater or equal to the number on the left on the previous line). The percent of all coding regions in the bin is listed, as well as the culminative percentage.

intergenic

Usage: intergenic [options] fasta-file cds-file
valid intergenic options:
     -desc  (include gene descriptions in headers)
     -min=num (minimum length (in amino acids) of extracted genes)
     -max=num (maximum length (in amino acids) of extracted genes)

This program generates FASTA files of the regions of the sequences provided that are not covered by the genes listed in cds-file.

iterate-critica

iterate-critica: [options] output-name seq-file triplets-scores
Valid iterate-critica options:
        -iterations=number (default: 3)
        -scoring-matrix=file
        -initial-dicod=file
        -initial-init=file
        -initial-sd=file
        -initial-prom=file
        -no-sdscores
        -prom-find
        -threshold=value
        -alpha=value
        -fraction-coding=value
        -add-longorfs=length
        -genetic-code=number
        -strict-threshold
        -frameshift-threshold=value
        -quick-stats

This script is the normal way of automating a CRITICA run. The options are mostly identical to those of the CRITICA program itself except for the ability to set the number of iterations.

Some options of note -- -fraction-coding also changing the default value is 0.8, (80%), which is typical of the fraction of DNA which is actually part of a gene in prokaryotes. If you know that your organism is quite different from this you can set your own value.

-add-longorfs=length adds to the first iteration all ORFS that would encode proteins over a certain length (given in amino acids). This can be helpful if there is very little comparative information available for the organism being analyzed.

lookat

Usage: lookat [options] fasta-file [contig-name] start end
valid lookat options:
     -num include num bases upstream of given region
     num include num bases downstream of given region

This program allows the user to examine any sub-sequence of a DNA or protein sequence. If the start is less than end, the reverse complement of the subsequence is given (for DNA of course!)

make-blast-pairs

Usage: make-blastpairs [-exclude=string] blast-file

This script converts BLAST output (generated by blast -contigs, above), into a FASTA file with alternating records representing query DNA and its corresponding BLAST match. The option -exclude= allows the exclusion of DNA from a particular genus or species, generally the same genus or species of the query organism. In general, this is unneeded because make-blastpairs automatically excludes matches with more than 97% identity.

map-critica-orfs

Usage: map-critica-orfs cds-list

This script lists the amount of overlap of each coding region predicted by CRITICA with the previous coding region. The output is the same as the input but with two additional columns - the first is the number of nucleotides of overlap with the previous coding region (negative numbers mean there is no overlap but a gap between the two genes). The second is the amount of overlap represented as the percent of the length of the coding region.

motiffind

Usage: motiffind [-options] pattern fasta-file
Valid motiffind options:
        -compstrand

This program finds instances of the pattern supplied in the FASTA file. In the FASTA file contains DNA sequences, ambiguous codes such as R and Y can be used (the program decides if a sequence is protein if its GC is less than 0.20). If the compstrand option is selected, the reverse complement is searched (for DNA sequences).

removeoverlaps

Usage: removeoverlaps seq-file cds-file percent-allowed

This program removes coding regions which overlap more than the percentage given. The coding region removed is the one with the least support.

reportscore

Usage: reportscore [options] seq triplet-scores contig start end matrix
Valid options:
        -dicodon-scores=file
        -init-scores=file
        -sd-scores=file
        -genetic-code=number

The program allows one to determine the score and p-value of a region of DNA. This is useful for understanding why CRITICA did not call an expected region a coding region.

scanblastpairs

Usage: scanblastpairs seq blast-pairs triplets

This program takes as inputs the FASTA file of the sequence to be analyzed, the blast-pairs generated by make-blastpairs (see above), and generates a table of triplets aligned to each triplet in the sequence file.

translate

Usage: translate [options] fasta-file [contig-name] start end
valid translate options:
     -num include num residues upstream of given region
     num include num residues downstream of given region
     -genetic-code=NCBI-num

This program allows the user to translate any sub-sequence of a DNA sequence. If the start is less than end, the reverse complement of the subsequence is used.

What's New

Version 1.05

Changes to Array.c and Array.h to make them more portable. Thanks to Peter Li and Xiaoying Lin.
BLAST2 puts carriage returns in sequence names to keep lines less than 80 characters in length. This caused make-blastpairs to get confused. It has been fixed.
Pseudocount method in dicodontable.c could cause noncoding count to go negative. Gary Olsen discovered this and suggested a fix.

Version 1.04

CRITICA now runs under Windows (when compiled with Visual C++). Thanks to Brian Schick.
blast-contigs now also works with NCBI BLAST2, which has a slightly different output format than traditional BLAST.
CRITICA now handles periods, spaces, tabs, and numbers in contig names more gracefully. There shouldn't be any need for renaming contigs any more.

License

CRITICA (and its source code) are copyrighted by Jonathan Badger but are licensed under the GNU Public License (GPL). What this means is that not only are you free to use CRITICA and have access to its source code, but are free to incorporate parts of CRITICA into your own programs. However, if such programs are used beyond your institution or company, you must release their source code under the GPL as well.

Frankly, I'm surprised I have to say this explicitly. The GPL simply embodies what I've always understood as the scientific spirit of cooperation which distinguishes modern scientists from secret-hoarding medieval alchemists. Just as bench biologists make the mutants and DNA sequences mentioned in their papers freely available to their colleagues so that additional research can be done, computational biologists should be expected to make their source code of the programs mentioned in their papers freely available ¹

Yet there is a disturbing trend in computational biology to only make programs available through web servers and make access to the source code impossible or only possible after signing legal documents which generally forbid use of the licensed source code in the licensee's projects. This behavior can only impede scientific progress. So don't do it.

Contact Information

If you wish to contact me about CRITICA, here is my address:

Jonathan Badger
Department of Computer Science
University of Waterloo
Waterloo, Ontario N2L 3G1
Canada
email: jhbadger@monod.uwaterloo.ca 
phone: (519) 888-4567 ext. 5321
fax: (519) 885-1208

Footnotes

... available ¹: I'm not against proprietary software (sometimes proprietary software has a nicer interface and fewer quirks than the equivalent research software). However, writing a proprietary sequence analysis package is no more science than the writing of a proprietary word processor and thus is not worthy of a scientific publication.

Jonathan Badger 2000-08-01