Leonardo de Oliveira Martins
1,434 followers
Wandering around the boundaries of my own uncertainty

Leonardo's posts

Comparative methods in the genomic era
Every modern biologist should know about phylogenetic comparative methods, even if they are not familiar with the term. The idea is that we cannot compare biological traits without taking into account the evolutionary relations between the species/populatio...

Bar charts must start at zero (or something)
The other day I was mentioning to a colleague that a bar chart should start at zero, and I may have given the impression that it was just my personal taste. It is not. It is a universal standard in statistical visualisation. However, since it is very eas...

When our intuition fails us with collections of trees
Recently I realised that some ideas regarding the distribution of phylogenetic trees are not as straightforward as they seem. The first case is that the frequency of the most common tree does not give us information about the dispersion of the distribution....
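The first point can be illustrated with a toy example. The sketch below is not taken from the post: the tree labels are hypothetical stand-ins for topologies, and Shannon entropy is used here as one possible measure of dispersion (an assumption, since the post may have a different measure in mind). Both samples have the same most common tree frequency, yet one is spread over far more topologies.

```python
from collections import Counter
from math import log


def mode_frequency(trees):
    """Relative frequency of the most common tree in the sample."""
    counts = Counter(trees)
    return counts.most_common(1)[0][1] / len(trees)


def shannon_entropy(trees):
    """Shannon entropy (in nats) of the empirical tree distribution."""
    counts = Counter(trees)
    n = len(trees)
    return -sum((c / n) * log(c / n) for c in counts.values())


# Hypothetical samples: each label stands in for a distinct tree topology.
# Two topologies, each sampled 50 times.
concentrated = ["A"] * 50 + ["B"] * 50
# One topology sampled 50 times, plus 50 topologies sampled once each.
dispersed = ["A"] * 50 + [f"T{i}" for i in range(50)]

for name, sample in [("concentrated", concentrated), ("dispersed", dispersed)]:
    print(name, mode_frequency(sample), round(shannon_entropy(sample), 3))
```

In both samples the most common tree has frequency 0.5, but the entropy of the second sample is much larger, so the mode frequency alone cannot tell them apart.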

Abstract
"Phylogenetic models are an important tool in molecular evolution allowing us to study the pattern and rate of sequence change. The recent influx of new sequence data in the biosciences means that in order to address evolutionary questions we need a means for rapid and easy model development and implementation. Here we present GeLL, a Java library that lets users use text to quickly and efficiently define novel forms of discrete data and create new substitution models that describe how those data change on a phylogeny. GeLL allows users to define general substitution models and data structures in a way that is not possible in other existing libraries, including mixture models and non-reversible models. Classes are provided for calculating likelihoods, optimizing model parameters and branch lengths, ancestral reconstruction and sequence simulation."

Abstract
"As researchers collect increasingly large molecular data sets to reconstruct the Tree of Life, the heterogeneity of signals in the genomes of diverse organisms poses challenges for traditional phylogenetic analysis. A class of phylogenetic methods known as "species tree methods" have been proposed to directly address one important source of gene tree heterogeneity, namely the incomplete lineage sorting or deep coalescence that occurs when evolving lineages radiate rapidly, resulting in a diversity of gene trees from a single underlying species tree. Although such methods are gaining in popularity, they are being adopted with caution in some quarters, in part because of an increasing number of examples of strong phylogenetic conflict between concatenation or supermatrix methods and species tree methods. Here we review theory and empirical examples that help clarify these conflicts. Thinking of concatenation as a special case of the more general model provided by the multispecies coalescent can help explain a number of differences in the behavior of the two methods on phylogenomic data sets. Recent work suggests that species tree methods are more robust than concatenation approaches to some of the classic challenges of phylogenetic analysis, including rapidly evolving sites in DNA sequences, base compositional heterogeneity and long branch attraction. We show that approaches such as binning, designed to augment the signal in species tree analyses, can distort the distribution of gene trees and are inconsistent. Computationally efficient species tree methods that incorporate biological realism are a key to phylogenetic analysis of whole genome data."

Abstract
"The distribution of divergence times between member species of a community reflects the pattern of species composition. In this paper, we contrast the species composition of a community against the meta-community, which we define as the species composition of a set of target communities. We regard the collection of species that comprise a community as a sample from the set of member species of the meta-community, and interpret the pattern of the community species composition in terms of the type of species sampled from the meta-community. A newly defined effective species sampling proportion explains the amount of the difference between the divergence time distributions of the community and that of the meta-community, assuming random sampling. We propose a new index of phylogenetic skew (PS), as the ratio of the maximum likelihood estimate of the effective species sampling proportion to the observed sampling proportion. A PS value of 1 is interpreted as random sampling. If the value is greater than 1, the sampling is suspected to be phylogenetically skewed. If it is less than 1, systematic thinning of species is likely. Unlike other indices, the PS does not depend on species richness as long as the community has more than a few members of a species. Because it is possible to compare partially observed communities, the index may be effectively used in exploratory analysis to detect candidate communities with unique species compositions from a large number of communities."
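As a rough illustration of how the index described in this abstract could be read, the sketch below computes PS as the ratio of an estimated effective sampling proportion to the observed sampling proportion. The function names, the tolerance `tol`, and the input values are all hypothetical; the maximum likelihood estimation of the effective proportion is not shown in the abstract and is taken here as a given number.

```python
def phylogenetic_skew(effective_proportion, observed_proportion):
    """PS = estimated effective sampling proportion / observed sampling proportion.

    `effective_proportion` is assumed to come from a maximum likelihood
    fit that is not implemented here.
    """
    if not 0 < observed_proportion <= 1:
        raise ValueError("observed_proportion must be in (0, 1]")
    if effective_proportion <= 0:
        raise ValueError("effective_proportion must be positive")
    return effective_proportion / observed_proportion


def interpret(ps, tol=0.05):
    """Map a PS value to the reading given in the abstract.

    `tol` is an arbitrary illustrative threshold, not part of the paper.
    """
    if abs(ps - 1.0) <= tol:
        return "consistent with random sampling"
    if ps > 1.0:
        return "phylogenetically skewed"
    return "systematic thinning"


# Hypothetical community that contains 30% of the meta-community species.
print(interpret(phylogenetic_skew(0.30, 0.30)))
print(interpret(phylogenetic_skew(0.60, 0.30)))
print(interpret(phylogenetic_skew(0.15, 0.30)))
```

A real analysis would replace the tolerance with a proper assessment of uncertainty in the estimate.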

Abstract
"With the availability of genomic sequence data, there is increasing interest in using genes with a possible history of duplication and loss for species tree inference. Here we assess the performance of both non-probabilistic and probabilistic species tree inference approaches using gene duplication and loss and coalescence simulations. We evaluated the performance of gene tree parsimony (GTP) based on duplication (Only-dup), duplication and loss (Dup-loss), and deep coalescence (Deep-c) costs, the NJst distance method, the MulRF supertree method, and PHYLDOG, which jointly estimates gene trees and species tree using a hierarchical probabilistic model. We examined the effects of gene tree and species sampling, gene tree error, and duplication and loss rates on the accuracy of phylogenetic estimates.

In the 10-taxon duplication and loss simulation experiments, MulRF is more accurate than the other methods when the duplication and loss rates are low, and Dup-loss is generally the most accurate when the duplication and loss rates are high. PHYLDOG performs well in 10-taxon duplication and loss simulations, but its run time is prohibitively long on larger data sets. In the larger duplication and loss simulation experiments, MulRF outperforms all other methods in experiments with at most 100 taxa; however, in the larger simulation, Dup-loss generally performs best. In all duplication and loss simulation experiments with more than 10 taxa, all methods perform better with more gene trees and fewer missing sequences, and they are all affected by gene tree error.

Our results also highlight high levels of error in estimates of duplications and losses from GTP methods and demonstrate the usefulness of methods based on generic tree distances for large analyses."

Abstract

Most eukaryotic lineages are microbial, and many have only recently been sampled for phylogenetic studies or remain in the ‘dark area’ of the tree of life where there are no molecular data. To assess relationships among eukaryotic lineages, we perform a taxon-rich phylogenomic analysis including 232 eukaryotes selected to maximize taxonomic diversity and up to 1554 genes chosen as vertically inherited based on their broad distribution among eukaryotes. We also include sequences from 486 bacteria and 84 archaea to assess the impact of endosymbiotic gene transfer (EGT) from plastids and to detect contamination. Overall, our analyses are consistent with other less taxon-rich estimates of the eukaryotic tree of life and we recover strong support for five major clades: Amoebozoa, Excavata (without the genus Malawimonas), Opisthokonta, Archaeplastida and SAR (Stramenopila, Alveolata and Rhizaria). Our analyses also highlight the existence of ‘orphan’ lineages, lineages that lack robust placement in the eukaryotic tree of life and indicate the possibility of as yet undiscovered diversity. In analyses including bacteria and archaea, we find that ~10% of the 1554 genes, which we choose because they are found in four or five of the five major eukaryotic clades and hence may be more likely to be inherited vertically, appear to have been acquired from cyanobacteria through EGT in photosynthetic lineages. Removing these EGT genes places the green algae as sister to the glaucophytes instead of the red algae, suggesting that unknowingly including genes of plastid origin, and combining them with genes of nuclear origin, may mislead phylogenetic estimates. Finally, the large size of our dataset allows comparative analyses of subsets of data; alignments built from randomly sampled sites provide greater support, particularly for deep relationships, than do equivalent sized datasets built from randomly sampled genes.