Phylogenetic analysis of large genomic datasets has been key to elucidating the evolutionary and transmission dynamics of pathogens. The results of such phylodynamic analyses featured prominently in the decision-making process during the pandemic, leading to the dynamic enactment of effective surveillance and intervention strategies in real time. Bayesian phylodynamic methods stand out for their ability to employ complex models, express uncertainty, and incorporate various sources of information. However, the computational complexity of these methods often hinders their application to modern large-scale viral datasets. Here, we provide empirical evidence demonstrating that: 1) the computational difficulties largely stem from inefficient exploration of phylogenetic tree space; 2) despite good performance in estimating the continuous parameters, convergence and mixing issues are widespread in tree inference; 3) these tree-inference issues are frequently caused by a small number of clades that are challenging to sample under the conventional scheme; 4) a limited number of sites in the genome alignment, which frequently exhibit conflicting phylogenetic signal, can significantly exacerbate the tree-inference issues; and, fortunately, 5) the inferred molecular evolutionary and demographic processes are minimally affected by the poor exploration of tree space, whereas impacts on the estimated origin time and introduction history of particular clades appear to be more pronounced. We offer theoretical explanations for the observed difficulties in exploring tree space, identify specific biological properties of viral datasets that may impede this exploration, and propose new sampling mechanisms that target these properties to improve performance.
Quantifying and improving Bayesian phylogenetic inference of large viral datasets