TreeLign: Automatic Update of 16s rRNA Phylogenetic Tree and Multiple Alignment in Metagenomics

Download FastTreeLign Package Here, updated on March 22 2010

INSTALLATION

After downloading the file TreeLign.tar to your computer, untar it by the following command, and the TreeLign package will be created.

  • tar xvf FTreeLign.tar

  • The TreeLign software package is a free software developed in LINUX, compile it using g++ (4.4.1 or above) in LINUX operating system with the following command (Tests have been done in Ubuntu and Fedora 11). Perl (5.0 or above) is also required for executing recursive_treelign.pl .

  • make

  • DESCRIPTION:

    The executable file 'treelign' is designed for assigning a new sequence(species) to the phylogeny and multiple alignment of a set of known sequences(species); executable file 'edgecost' is able to compute the lengths of branches on the phylogenetic tree using Dorit S Algorithm; recursive_treelign.pl aims to improve the phylogeny and multiple alignment at the same time by iteratively leaving out and reassigning each leaf node (species) back using treelign.

    FUNCTIONS:

    Assume a binary phylogenetic tree and its corresponding multiple sequence alignment is well-established,
    [1] If you have a novel 16s rRNA sequence at hand and would like to know the following questions, please try 'treelign'
        [a] assign the novel sequence to the best place in the phylogenetic tree and construct a larger tree
        [b] align the novel sequence to the multiple sequence alignment and construct a larger alignment
    [2] If you would like to optimize both the phylogenetic tree and multiple alignment simultaneously, please try 'recursive_treelign.pl'
    [3] If you would like to assign an edge cost for each branch on the phylogenetic tree, you may use any packages that are available (i.e. TreePuzzle) or try 'edgecost' in our package. (The computational method implemented in edgecost is from Dorit et. al.)
    [4] NEW Here we propose an heuristic to improve the time perfomance of TreeLign at the expense of accuracy. We first obtain the best hit of searching the query sequence against the alignment using BLAST, then narrow down the search space through only searching branches that are within a certain distance to the best hit node. Be careful choosing radius for the search space.

    USAGE:

    Given a multiple sequence alignment of a dataset, its corresponding phylogenetic tree, and a novel sequence, 'treelign' aligns the novel sequence with the multiple sequence alignement, and assigns the novel sequence to the best position on the phylogenetic tree simultaneously. 'treelign' tries to assign the novel sequence to every possible branch in the tree and the best position is determined in the sense that the number of mutations introduced to the resulting phylogeny is minimal. The relative distances (path lengths) between each pair of nodes in the original tree remain intact in the resulting tree. While the length of the newly assigned branch incident to an exterior node that labels the novel sequence, is calculated in proportion to sequence dissimilarities. Note that although newly assigned branch lengths are able to indicate sequence dissimilarities and evolutionary distances, the computing method employed in our algorithm may not conform to that of the original tree. If you'd like compute conformed branch lengths for the resulting tree using the same method, please refer to FUNCTIONS [3].

    (1) TreeLign

    Usage: Based on a multiple sequence alignment of a set of speceis, and its corresponding phylogenetic tree, treelign aligns a incoming novel sequence with the multiple alignment and aligns the novel sequence into the phylogeny simultanously.
         
    treelign
    -i input_aln : an incoming multiple sequence alignment (MSA) file
    -t input_tree : an incoming (binary newick) phylogenetic tree, each exterior node of which represents a distinct sequence in the incoming MSA
    -a input_seq : an incoming fasta file which contains a query sequence
    -o output_aln : an outcoming MSA file, the alignment of the query sequence and the incoming MSA
    -n output_tree:an outcoming (binary newick) phylogenetic tree, constructed by assigning the novel sequence to the incoming tree
    -f alnformat : format of input MSA, 0-clustalw, 1-fasta
    -F alnformat : format of output MSA, 0-clustalw, 1-fasta

    -x besthit : optional, the best hit result for searching input_seq against input_aln using BLAST

    -r radius : optional, the radius of the search space

    example
    ./treelign -i example/T1.aln -t example/T1.tree -a example/T1.seq -o example/T1.newaln -n example/T1.newtree -f 0 -F 0
    ./treelign -i example/test.aln -t example/test.tree -a example/test.seq -o example/test.newaln -n example/test.newtree -f 0 -F 1
    ./treelign -i example/test.aln -t example/test.tree -a example/test.seq -o example/test.newaln -n example/test.newtree -f 0 -F 1
    ./treelign -i example/test.aln -t example/test.tree -a example/test.seq -o example/test.newaln -n example/test.newtree -f 0 -F 1 -x 2107 -r 0.3
    notice
    The lengths of names of species can not exceed 200 letters.
    The incoming phylgoenetic tree should be a binarized newick tree; otherwise, an error may occur.
    If you only want to obtain the resulting phylogenetic tree rather than the alignment, you may remove the poorly aligned columns in input_aln (e.g. which are composed of more than 80% gaps) before using TreeLign. Removing low quality columns makes TreeLign run faster and more accurately.
    If besthit and radius is not set, TreeLign still tries every possible position for phylogenetic assignment. However, if they are set, TreeLign will only search nodes that are within certain distance to the best hit node, therefore, the accuracy of besthit and the value of radius will greatly affect the accuracy of TreeLign. Usually, the larger the radius is, the more accurate the result is.

    (2) recursive_treelign.pl

    Usage: Given a binary phylogenetic tree and its corresponding sequence sequence alignments, recursive_treelign.pl recursively removes every leaf node off the tree and re-assign them back using treelign in order to improve the phylogeny and the alignments simultaneously.

    perl recursive_treelign.pl input_aln input_tree output_aln output_tree format_of_input_aln format_of_output_aln num_of_pass working_dir logfile

    program
    input_aln : an incoming multiple sequence alignments file
    input_tree : an incoming (binary newick) phylogenetic tree file, each exterior node of which represents a distinct sequence in the incoming MSA
    output_aln : an optimized multiple sequence alignments file
    output_tree : an optimized (binary newick) phylogenetic tree
    format_of_input_aln : 0 - clustalw, 1 - fasta
    format_of_output_aln: 0 - clustalw, 1 - fasta
    num_of_pass : optional, an positive interger; if num_of_pass is set, the script terminates when the total number of mutations converages or when every leaf node has been removed and re-inserted back for the given number of passes no matter the number of mutations converages or not
    working_dir : optional, a working directory for saving intermediate files; if not set, delete all temporary files when the work is done
    logfile : optional, a logging file that records the input/output Maximum Parsimony score (the number of mutations)
    example
    perl recursive_treelign.pl example/T1.aln example/T1.tree 0 example/T1_res.aln example/T1_res.tree 0 2
    perl recursive_treelign.pl example/T1.aln example/T1.tree example/T1_res.aln example/T1_res.tree 0 0 2 tmp_working_dir logfile

    (3) edgecost

    Usage: Given a binary phylogenetic tree and its corresponding multiple sequence alignemnt, edgecost computes the branch lengths of the phylogeny based on the multiple alignment according to Dorit S Algorithm.

    edgecost
    -t input_tree : an incoming (binary newick) phylogenetic tree, of newick format
    -i input_aln : an incoming multiple sequence alignment, must be clustalw format
    -s treeSizeCap: tree size capacity, any number that is larger than the number of nodes on the tree should be OK
    -n output_tree: the outcoming phylogenetic tree, with edge cost assigned
    example
    ./edgecost -t example/T1.newtree -i example/T1.newaln -s 300 -n example/T1.edgecost.newaln
    ./edgecost -t example/test.newtree -i example/test.newaln -s 500 -n example/test.edgecost.newaln
    notice
    Notice: If a segmentation fault comes up, it might be because that the treeSizeCap is not large enough, try setting a larger number for it.