DNArates_1.1

Gary J. Olsen
December 26, 1998

The DNArates program takes a set of sequences and a corresponding
phylogenetic tree and produces a maximum likelihood estimate of the rate of
nucleotide substitution at each sequence position.


Summary of program input and output:
------------------------------------

Input is read from standard input.  The format is very much like that of the
fastDNAml program.

The first line of the input file gives the number of sequences and the number
of bases per sequence.  Also on this line are the requested program option
letters.  Any auxiliary data required by the options follow on subsequent
lines.

Next, the program expects a data matrix.  The first 10 characters of the
first line of data for a given sequence are interpreted as the name (blanks
are counted).  Elsewhere in the data matrix, blanks and numbers are ignored.
The default data matrix format is interleaved.  If all the data for a
sequence are on one input line, then interleaved and noninterleaved are
equivalent.

Following the data matrix there must be a line with the number of
user-specified trees for which rates are to be estimated (as with the U
option in fastDNAml).

The rest of the input file is one or more user-specified trees with branch
lengths (as with the U and L options in fastDNAml).

The program writes to standard output.

The output lists the estimated rate of change at every site in the sequence,
or "Undefined" if there are not sufficient unambiguous data at the site.

If the C option is specified, the program also categorizes the rates into the
requested number of categories.  The current categorization algorithm is
rather crude, but is probably adequate if the number of categories is large
enough.  A weighting mask is also created in which sites with Undefined rates
are assigned a weight of zero.

If the Y option is specified, the program writes the weights and categories
data to a file in a format appropriate for use by fastDNAml.
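The input layout described above can be illustrated with a short Python
sketch.  It is not part of DNArates and does not reproduce the program's own
reader; the function names are hypothetical, auxiliary option lines are
assumed to have been removed already, and the handling of option letters
written without separating blanks is an assumption.

```python
import re


def parse_header(line):
    """Split the first input line into (n_seqs, n_sites, option_letters)."""
    fields = line.split()
    n_seqs, n_sites = int(fields[0]), int(fields[1])
    # Assumption: option letters may be separated by blanks or run together.
    options = [ch.upper() for field in fields[2:] for ch in field]
    return n_seqs, n_sites, options


def parse_interleaved(lines, n_seqs, n_sites):
    """Read an interleaved data matrix into a {name: sequence} dict.

    The first 10 characters of a sequence's first data line are its name
    (blanks counted); elsewhere blanks and digits are ignored.
    """
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    names, seqs = [], []
    for i, line in enumerate(rows):
        if i < n_seqs:
            names.append(line[:10])
            seqs.append("")
            data = line[10:]
        else:
            data = line
        # Rows cycle through the sequences in order, block after block.
        seqs[i % n_seqs] += re.sub(r"[\s0-9]", "", data)
    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(f"{name!r}: expected {n_sites} sites, got {len(seq)}")
    return dict(zip(names, seqs))
```

For example, parse_header("5 114 C Y") returns (5, 114, ["C", "Y"]).  Names
are kept as the raw 10-character field, trailing blanks included, because
the text above says blanks are counted.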


Changes from version 1.0:
-------------------------

Empirical base frequencies are now the default (matching current versions of
fastDNAml).

Fixed a bug that led to an incorrect rate at the last position of the
alignment.

Fixed a bug that caused a user-supplied minimum number of informative
residues per column to be ignored.


Summary of options for first line of input file (also see examples below and
the documentation of options usage with the fastDNAml program):
------------------------------------------------------------------------

   1 - Print data.  Toggles print data option (default = noprint).

   C - Categorize rates.  Requires an auxiliary line with a C and the
       desired number of categories (values 2 - 35 are permitted).

   F - User base frequencies.  The user needs to supply an auxiliary data
       line with the frequencies of A, C, G and T.

   I - Interleave.  Turns off the data interleave format.

   L - Userlengths.  Branch lengths are required on the user tree, so the
       option is ignored.

   M - Minimum informative sequences.  Requires an auxiliary data line with
       an M and the minimum number of sequences in which a sequence position
       (alignment column) must have unambiguous information in order for the
       rate at the site to be defined (default = 4).

   T - Transition/transversion ratio.  Requires an auxiliary line with a T
       and the ratio of observed transitions to transversions
       (default = 2.0).

   U - User trees.  A user tree is required, so the option is ignored.

   W - User weights.  Requires weights auxiliary data.

   Y - Categories file.  Writes the weights and categories to a file.

The option scripts usertree, weights, n_categories and categories_file are
useful for adding the appropriate options to the input data matrix.  The
option script weights_categories is useful for adding the resulting outfile
to a fastDNAml input file.
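As a rough illustration of how the option letters and their auxiliary lines
fit together, the following Python sketch assembles the first lines of an
input file.  It is not one of the option scripts named above; the function
name, its parameters, and the order in which auxiliary lines are emitted are
assumptions made for the example.  The W option is left out because it needs
a weights data block.

```python
def build_header(n_seqs, n_sites, categories=None, base_freqs=None,
                 min_informative=None, tt_ratio=None, categories_file=False):
    """Return the option line plus auxiliary lines as a list of strings."""
    letters, aux = [], []
    if categories is not None:
        if not 2 <= categories <= 35:
            raise ValueError("C accepts 2 to 35 categories")
        letters.append("C")
        aux.append(f"C {categories}")
    if base_freqs is not None:
        if len(base_freqs) != 4:
            raise ValueError("F needs frequencies for A, C, G and T")
        letters.append("F")
        aux.append("F " + " ".join(str(f) for f in base_freqs))
    if min_informative is not None:
        letters.append("M")
        aux.append(f"M {min_informative}")
    if tt_ratio is not None:
        letters.append("T")
        aux.append(f"T {tt_ratio}")
    if categories_file:
        letters.append("Y")  # Y takes no auxiliary line
    first = " ".join([str(n_seqs), str(n_sites)] + letters)
    return [first] + aux
```

build_header(5, 114, categories=9, categories_file=True) returns
["5 114 C Y", "C 9"], and build_header(5, 114, categories=35,
min_informative=5) returns ["5 114 C M", "C 35", "M 5"].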


Input file requesting output in 9 categories and writing the categorizations
to a file:
--------------------------------------------------------------

5 114 C Y
C 9
Sequence1 ACACGGTGTCGTATCATGCTGCAGGATGCTAGACTGCGTCANATGTTCGTACTAACTGTG
Sequence2 ACGCGGTGTCGTGTCATGCTACATTATGCTAGACTGCGTCGGATGCTCGTATTGACTGCG
Sequence3 ACGCGGTGCCGTGTNATGCTGCATTATGCTCGACTGCGRCGGATGCTAGTATTGACTGCG
Sequence4 ACGCGCTGCCGTGTCATCCTACACGATGCYAGACAGCGTCAGCTGCTAGTACTGGCTGAG
Sequence5 ACGCGCTGTCGTGTCATACTGCAGGATGCTAGACTGCGTCAGCTGCTAGTACTGGCTGAG
AGCTCGATGATCGGTGACGTAGACTCAGGGGCCATGCCGCGAGTTTGCGATGCG
AGCACGGTGATCAATGACGTAGNCTCAGGRTCCACGCCGTGACTTTGTGATNCG
AGCACGATGACCGATGACGTAGACTGAGGGTCCGTGCCGCGACTTTGTGATGCG
ACCTCGGTGATTGATGACGTAGACTGCGGGTCCATGCCGCGATTTTGCGRTGCG
ACCTCGATGCTCGATGACGTAGACTGCGGGTCCATGCCGTGATTTTGCGATGCG
1
( Sequence3: 0.061772, Sequence2: 0.053462, ( Sequence1: 0.082889, ( Sequence4: 0.067423, Sequence5: 0.018731 ): 0.087748 ): 0.069398 ):0.0;


Corresponding output file:
--------------------------

DNArates, version 1.1.0, December 23, 1998

Portions based on Joseph Felsenstein's Nucleic acid sequence
Maximum Likelihood method, version 3.3

5 Species, 114 Sites

There must be at least 4 informative residues per column

Analyzing 43 distinct data patterns (columns)

Empirical Base Frequencies:

   A    0.18574
   C    0.24395
   G    0.31333
  T(U)  0.25698

Transition/transversion ratio = 2.000000

(Transition/transversion parameter = 1.568115)

User-defined tree:

5 taxon user-supplied tree read

Total length of tree branches = 0.441423

Site      Rate
----   ---------
   1      1.1327
   2      1.1327
   3      3.5630
   4      1.1327
   5      1.1327
   6      4.1351
   7      1.1327
   8      1.1327
   9      8.2367
  10      1.1327
  11      1.1327
  12      1.1327
  13      3.5630
  14      1.1327
  15   Undefined
  16      1.1327
  17      1.1327
  18     10.4105
  19      1.1327
  20      1.1327
  21      9.3672
  22      1.1327
  23      1.1327
  24    819.6077
  25      4.7154
  26      1.1327
 ...
 112      1.1327
 113      1.1327
 114      1.1327

Weights
1111111111 1111011111 1111111111 1111111111 1111111111 1111111111
1111111111 1111100001 1111111111 1111111111 1111111111 1111

C 9 1.13270 2.28233 3.57910 5.61266 8.80165 13.80255 21.64484 33.94294 819.60770

Categories
1131131151 1131911511 5119411111 2111311131 3121121711 1313211151
1313114112 3232199991 1111182111 3112311115 1151111311 1111

Weights and categories also written to weight_rate.1627

==============================================================================

Input file requesting 35 rate categories and requiring 5 informative residues
(out of 5 sequences) instead of the default 4:
--------------------------------------------------------------------

5 114 C M
C 35
M 5
Sequence1 ACACGGTGTCGTATCATGCTGCAGGATGCTAGACTGCGTCANATGTTCGTACTAACTGTG
Sequence2 ACGCGGTGTCGTGTCATGCTACATTATGCTAGACTGCGTCGGATGCTCGTATTGACTGCG
Sequence3 ACGCGGTGCCG-GTNATGCTGCATTATGCTCGACTGCGRCGGATGCTAGTATTGACTGCG
Sequence4 ACGCGCTGCC-TGTNATCCTACACGATGCYAGACAGCGTCAGCTGCTAGTACTGGCTGAG
Sequence5 ACGCGCTGTCGTGTCATACTGCAGGATGCTAGACTGCGTCAGCTGCTAGTACTGGCTGAG
AGCTCGATGATCGGTGAC-TAGACTCAGGGGCCATGCCGCGAGTTTGCGATGCG
AGCACGGTGATCAATGA--TAGNCTCAGGRTCCACGCCGTGACTTTGTGATNCG
AGCACGATGACCGATG---TAGACTGAGGGTCCGTGCCGCGACTTTGTGATGCG
ACCTCGGTGATTGAT----TAGACTGCGGGTCCATGCCGCGATTTTGCGRTGCG
ACCTCGATGCTCGA-----TAGACTGCGGGTCCATGCCGTGATTTTGCGATGCG
1
(Sequence3: 0.061772, Sequence2: 0.053462, (Sequence1: 0.082889, (Sequence4: 0.067423, Sequence5: 0.018731): 0.087748): 0.069398):0.0;


Corresponding output file (note the increase in alignment columns with
undefined rates):
---------------------------------------------------------

DNArates, version 1.1.0, December 23, 1998

Portions based on Joseph Felsenstein's Nucleic acid sequence
Maximum Likelihood method, version 3.3

5 Species, 114 Sites

There must be at least 5 informative residues per column

Analyzing 37 distinct data patterns (columns)

Empirical Base Frequencies:

   A    0.18668
   C    0.25534
   G    0.30458
  T(U)  0.25339

Transition/transversion ratio = 2.000000

(Transition/transversion parameter = 1.557217)

User-defined tree:

5 taxon user-supplied tree read

Total length of tree branches = 0.441423

Site      Rate
----   ---------
   1      1.1327
   2      1.1327
   3      3.4961
   4      1.1327
   5      1.1327
   6      4.1017
   7      1.1327
   8      1.1327
   9      8.2145
  10      1.1327
  11   Undefined
  12   Undefined
  13      3.4961
  14      1.1327
  15   Undefined
  16      1.1327
  17      1.1327
  18     10.1187
  19      1.1327
  20      1.1327
  21      9.2412
  22      1.1327
  23      1.1327
  24    819.6077
  25      4.5339
  26      1.1327
 ...
 110      1.1327
 111      1.1327
 112   Undefined
 113      1.1327
 114      1.1327

Weights
1111111111 0011011111 1111111111 1111111111 1011111111 1111111111
1111111111 1111000001 1101111111 1111111111 1111111111 1011

C 9 1.13270 2.31344 3.60422 5.61520 8.74819 13.62924 21.23367 33.08099 819.60770

Categories
1131131151 9931911511 5119411111 2111311131 2921131711 1313211151
1313114112 3232999991 1191182111 3112311115 1151111311 1911

==============================================================================

Input file with a user-supplied weighting mask, user-supplied base
frequencies, and non-interleaved data format; note that the blank lines and
indentation of the sequence continuation lines are optional:
----------------------------------------------------------------------

5 114 C W F I
C 9
F 0.25 0.25 0.25 0.25
Weights   111111111111111111111111111111111111111111111111111111111111
          111111111111000000000000111111111111110000111111111111

Sequence1 ACACGGTGTCGTATCATGCTGCAGGATGCTAGACTGCGTCANATGTTCGTACTAACTGTG
          AGCTCGATGATCGGTGAC-TAGACTCAGGGGCCATGCCGCGAGTTTGCGATGCG

Sequence2 ACGCGGTGTCGTGTCATGCTACATTATGCTAGACTGCGTCGGATGCTCGTATTGACTGCG
          AGCACGGTGATCAATGA--TAGNCTCAGGRTCCACGCCGTGACTTTGTGATNCG

Sequence3 ACGCGGTGCCG-GTNATGCTGCATTATGCTCGACTGCGRCGGATGCTAGTATTGACTGCG
          AGCACGATGACCGATG---TAGACTGAGGGTCCGTGCCGCGACTTTGTGATGCG

Sequence4 ACGCGCTGCC-TGTNATCCTACACGATGCYAGACAGCGTCAGCTGCTAGTACTGGCTGAG
          ACCTCGGTGATTGAT----TAGACTGCGGGTCCATGCCGCGATTTTGCGRTGCG

Sequence5 ACGCGCTGTCGTGTCATACTGCAGGATGCTAGACTGCGTCAGCTGCTAGTACTGGCTGAG
          ACCTCGATGCTCGA-----TAGACTGCGGGTCCATGCCGTGATTTTGCGATGCG

1
(Sequence3: 0.061772, Sequence2: 0.053462, (Sequence1: 0.082889, (Sequence4: 0.067423, Sequence5: 0.018731): 0.087748): 0.069398):0.0;

==============================================================================
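
The relationship between the rate table and the weighting mask (sites with
Undefined rates get a weight of zero) can be sketched in Python as follows.
This is only an illustration, not the program's code: the function names are
hypothetical, the caller is assumed to pass only the rows of the Site/Rate
table with no elided sites, and the grouping into blocks of ten simply
mirrors the sample output.

```python
def parse_rates(table_lines):
    """Map site number -> rate (float), or None where the rate is Undefined."""
    rates = {}
    for line in table_lines:
        parts = line.split()
        # Skip the "Site Rate" heading, its underline, and blank lines.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        rates[int(parts[0])] = None if parts[1] == "Undefined" else float(parts[1])
    return rates


def weight_mask(rates, n_sites):
    """Return one '0' or '1' per site: '0' where the rate is Undefined."""
    return "".join("0" if rates[site] is None else "1"
                   for site in range(1, n_sites + 1))


def group_tens(mask):
    """Insert a blank after every ten characters, as in the sample output."""
    return " ".join(mask[i:i + 10] for i in range(0, len(mask), 10))
```

Given the rows for sites 1 to 3 with site 2 Undefined, weight_mask returns
"101".  weight_mask raises KeyError if a site is missing from the table,
which is why the elided tables shown in this file cannot be used directly.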