bioMCMC's posts

Post has attachment

Public

**Abstract**

A variety of algorithms have been proposed for reconstructing trees that show the evolutionary relationships between species by comparing differences in genetic data across present-day taxa. If the leaf-to-leaf distances in a tree can be accurately estimated, then it is possible to reconstruct this tree from these estimated distances, using polynomial-time methods such as the popular `Neighbor-Joining' algorithm. There is a precise combinatorial condition under which distance-based methods are guaranteed to return a correct tree (in full or in part) based on the requirement that the input distances all lie within some `safety radius' of the true distances. Here, we explore a stochastic analogue of this condition, and mathematically establish upper and lower bounds on this `stochastic safety radius' for distance-based tree reconstruction methods. Using simulations, we show how this notion provides a new way to compare the performance of distance-based tree reconstruction methods. This may help explain why Neighbor-Joining performs so well, as its stochastic safety radius appears close to optimal (while its more classical safety radius is the same as many other less accurate methods).

Post has attachment

Public

**Abstract**

"We introduce the Phylogenetic Likelihood Library (PLL), a highly optimized application programming interface for developing likelihood-based phylogenetic inference and post-analysis software. The PLL implements appropriate data structures and functions that allow users to quickly implement common, error-prone, and labor-intensive tasks, such as likelihood calculations, model parameter as well as branch length optimization, and tree space exploration. The highly optimized and parallelized implementation of the phylogenetic likelihood function and a thorough documentation provide a framework for rapid development of scalable parallel phylogenetic software. By example of two likelihood-based phylogenetic codes we show that the PLL improves the sequential performance of current software by a factor of two to ten while requiring only one month of programming time for integration. We show that, when numerical scaling for preventing floating point underflow is enabled, the double precision likelihood calculations in the PLL are up to 1.9 times faster than those in BEAGLE. On an empirical DNA dataset with 2,000 taxa the AVX version of PLL is 4 times faster than BEAGLE (scaling enabled and required). The PLL is available at http://www.libpll.org under the GNU General Public License (GPL)."

Post has attachment

Public

**Motivation:**Abundance profiling (also called “phylogenetic profiling”) is a crucial step in understanding the diversity of a metagenomic sample, and one of the basic techniques used for this is taxonomic identification of the metagenomic reads.

**Results:**We present TIPP (taxon identification and phylogenetic profiling), a new marker-based taxon identification and abundance profiling method. TIPP combines SEPP, a phylogenetic placement method, with statistical techniques to control the classification precision and recall, and results in improved abundance profiles. TIPP is highly accurate even in the presence of high indel errors and novel genomes, and matches or improves on previous approaches, including NBC, mOTU, PhymmBL, MetaPhyler, and MetaPhlAn.

**Availability:**Software and supplementary materials are available at http://www.cs.utexas.edu/users/phylo/software/sepp/tipp-submission/.

Post has attachment

Public

**Abstract**

"Terraces are potentially large sets of trees with precisely the same likelihood or parsimony score, which can be induced by missing sequences in partitioned multi-locus phylogenetic data matrices. The set of trees on a terrace can be characterized by enumeration algorithms or consensus methods that exploit the pattern of partial taxon coverage in the data, independent of the sequence data themselves. Terraces add ambiguity and complexity to phylogenetic inference particularly in settings where inference is already challenging: data sets with many taxa and relatively few loci.

In this paper we present five new findings about terraces and their impacts on phylogenetic inference. First we clarify assumptions about model parameters that are necessary for the existence of terraces. Second, we explore the dependence of terrace size on partitioning scheme and indicate how to find the partitioning scheme associated with the largest terrace containing a given tree. Third, we highlight the impact of terraces on bootstrap estimates of confidence limits in clades, and characterize the surprising result that the bootstrap proportion for a clade can be entirely determined by the frequency of bipartitions on a terrace, with some bipartitions receiving high support even when incorrect. Fourth, we dissect some effects of prior distributions of edge lengths on the computed posterior probabilities of clades on terraces, to understand an example in which long edges "attract" each other in Bayesian inference. Fifth, we show that even if data are not partitioned, patterns of missing data studied in the terrace problem can lead to instances of apparent statistical inconsistency when even a small element of heterotachy is introduced to the model generating the sequence data. Finally, we discuss strategies for remediation of some of these problems."

Post has shared content

Public

Post has attachment

Public

Post has attachment

Public

**Abstract**

"This article reviews the various models that have been used to describe the relationships between gene trees and species trees. Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can co-exist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice-versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a more reliable basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution."

Post has attachment

Public

**Abstract**

"Many important stochastic counting models can be written as general birth-death processes (BDPs). BDPs are continuous-time Markov chains on the non-negative integers and can be used to easily parameterize a rich variety of probability distributions. Although the theoretical properties of general BDPs are well understood, traditionally statistical work on BDPs has been limited to the simple linear (Kendall) process, which arises in ecology and evolutionary applications. Aside from a few simple cases, it remains impossible to find analytic expressions for the likelihood of a discretely-observed BDP, and computational difficulties have hindered development of tools for statistical inference. But the gap between BDP theory and practical methods for estimation has narrowed in recent years. There are now robust methods for evaluating likelihoods for realizations of BDPs: finite-time transition, first passage, equilibrium probabilities, and distributions of summary statistics that arise commonly in applications. Recent work has also exploited the connection between continuously- and discretely-observed BDPs to derive EM algorithms for maximum likelihood estimation. Likelihood-based inference for previously intractable BDPs is much easier than previously thought and regression approaches analogous to Poisson regression are straightforward to derive. In this review, we outline the basic mathematical theory for BDPs and demonstrate new tools for statistical inference using data from BDPs. We give six examples of BDPs and derive EM algorithms to fit their parameters by maximum likelihood. We show how to compute the distribution of integral summary statistics and give an example application to the total cost of an epidemic. Finally, we suggest future directions for innovation in this important class of stochastic processes."

Post has attachment

Public

**Abstract**

"Given the rapid increase of species with a sequenced genome, the need to identify orthologous genes between them has emerged as a central bioinformatics task. Many different methods exist for orthology detection, which makes it difficult to decide which one to choose for a particular application.

Here, we review the latest developments and issues in the orthology field, and summarise the most recent results reported at the third 'Quest for Orthologs' meeting. We focus on community efforts such as the adoption of reference proteomes, standard file formats, and benchmarking. Progress in these areas is good and they are already beneficial to both orthology consumers and providers. However, a major current issue is that the massive increase in complete proteomes poses computational challenges to many of the ortholog database providers, as most orthology inference algorithms scale at least quadratically with the number of proteomes.

The Quest for Orthologs consortium is an open community with a number of working groups that join efforts to enhance various aspects of orthology analysis such as defining standard formats and datasets, documenting community resources, and benchmarking. All such materials are available at http://questfororthologs.org."

Post has attachment

Public

Wait while more posts are being loaded