Audio versions of bioRxiv paper abstracts
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.21.392621v1?rss=1 Authors: Chibani, C. M., Mahnert, A., Borrel, G., Almeida, A., Werner, A., Brugere, J.-F., Gribaldo, S., Finn, R. D., Schmitz, R. A., Moissl-Eichinger, C. Abstract: The human gut microbiome plays an important role in health and disease, but the archaeal diversity therein remains largely unexplored. Here we report the pioneering analysis of 1,167 non-redundant archaeal genomes recovered from human gastrointestinal tract microbiomes across countries and populations. We identified three novel genera and 15 novel species including 52 previously unknown archaeal strains. Based on distinct genomic features, we warrant the split of the Methanobrevibacter smithii clade into two separate species, with one represented by the novel Candidatus M. intestini. Patterns derived from 1.8 million proteins and 28,851 protein clusters coded in these genomes showed a substantial correlation with socio-demographic characteristics such as age and lifestyle. We infer that archaea are actively replicating in the human gastrointestinal tract and are characterized by specific genomic and functional adaptations to the host. We further demonstrate that the human gut archaeome carries a complex virome, with some viral species showing unexpected host flexibility. Our work furthers our current understanding of the human archaeome, and provides a large genome catalogue for future analyses to decipher its role and impact on human physiology. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.21.392613v1?rss=1 Authors: Moussa, M. M. R., Mandoiu, I. I. Abstract: The variation in gene expression profiles of cells captured in different phases of the cell cycle can interfere with cell type identification and functional analysis of single cell RNA-Seq (scRNA-Seq) data. In this paper, we introduce SC1CC (SC1 - Cell Cycle analysis tool), a computational approach for clustering and ordering single cell transcriptional profiles according to their progression along cell cycle phases. We also introduce a new robust metric, GSS (Gene Smoothness Score) for assessing the cell cycle based order of the cells. SC1CC is available as part of the SC1 web-based scRNA-Seq analysis pipeline, publicly accessible at https://sc1.engr.uconn.edu/. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.21.392761v1?rss=1 Authors: Barthelson, K., Pederson, S. M., Newman, M., Jiang, H., Lardelli, M. Abstract: Background: Mutations in PRESENILIN 2 (PSEN2) cause early disease onset familial Alzheimer's disease (EOfAD) but their mode of action remains elusive. One consistent observation for all PRESENILIN gene mutations causing EOfAD is that a transcript is produced with a reading frame terminated by the normal stop codon: the 'reading frame preservation rule'. Mutations that do not obey this rule do not cause the disease. The reasons for this are debated. Methods: A frameshift mutation (psen2N140fs) and a reading frame-preserving mutation (psen2T141_L142delinsMISLISV) were previously isolated during genome editing directed at the N140 codon of zebrafish psen2 (equivalent to N141 of human PSEN2). We mated a pair of fish heterozygous for each mutation to generate a family of siblings including wild type and heterozygous mutant genotypes. Transcriptomes from young adult (6 months) brains of these genotypes were analysed. Bioinformatics techniques were used to predict cellular functions affected by heterozygosity for each mutation. Results: The reading frame-preserving mutation uniquely caused subtle, but statistically significant, changes to expression of genes involved in oxidative phosphorylation, long-term potentiation and the cell cycle. The frameshift mutation uniquely affected genes involved in Notch and MAPK signalling, extracellular matrix receptor interactions and focal adhesion. Both mutations affected ribosomal protein gene expression but in opposite directions. Conclusion: A frameshift and frame-preserving mutation at the same position in zebrafish psen2 cause discrete effects. Changes in oxidative phosphorylation, long Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.21.392878v1?rss=1 Authors: Wojtowicz, D., Hoinka, J., Amgalan, B., Kim, Y.-A., Przytycka, T. M. Abstract: Many mutagenic processes leave characteristic imprints on cancer genomes known as mutational signatures. These signatures have been of recent interest regarding their applicability in studying processes shaping the mutational landscape of cancer. In particular, pinpointing the presence of altered DNA repair pathways can have important therapeutic implications. However, mutational signatures of DNA repair deficiencies are often hard to infer. This challenge emerges as a result of deficient DNA repair processes acting by modifying the outcome of other mutagens. Thus, they exhibit non-additive effects that are not depicted by the current paradigm for modeling mutational processes as independent signatures. To close this gap, we present RepairSig, a method that accounts for interactions between DNA damage and repair and is able to uncover unbiased signatures of deficient DNA repair processes. In particular, RepairSig was able to replace three MMR deficiency signatures previously proposed to be active in breast cancer, with just one signature strikingly similar to the experimentally derived signature. As the first method to model interactions between mutagenic processes, RepairSig is an important step towards biologically more realistic modeling of mutational processes in cancer. The source code for RepairSig is publicly available at https://github.com/ncbi/RepairSig. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.22.393108v1?rss=1 Authors: Loewenthal, G., Rapoport, D., Avram, O., Moshe, A., Itzkovitch, A., Israeli, O., Azouri, D., Cartwright, R. A., Mayrose, I., Pupko, T. Abstract: Insertions and deletions (indels) are common molecular evolutionary events. However, probabilistic models for indel evolution are under-developed due to their computational complexity. Here we introduce several improvements to indel modeling: (1) while previous models for indel evolution assumed that the rates and length distributions of insertions and deletions are equal, here, we propose a richer model that explicitly distinguishes between the two; (2) We introduce numerous summary statistics that allow Approximate Bayesian Computation (ABC) based parameter estimation; (3) We develop a neural-network model-selection scheme to test whether the richer model better fits biological data compared to the simpler model. Our analyses suggest that both our inference scheme and the model-selection procedure achieve high accuracy on simulated data. We further demonstrate that our proposed indel model better fits a large number of empirical datasets and that, for the majority of these datasets, the deletion rate is higher than the insertion rate. Finally, we demonstrate that indel rates are negatively correlated to the effective population size across various phylogenomic clades. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.22.392217v1?rss=1 Authors: Armingol, E., Joshi, C. J., Baghdassarian, H., Shamie, I., Ghaddar, A., Chan, J., Her, H.-L., O'Rourke, E. J., Lewis, N. E. Abstract: Cell-cell interactions are crucial for multicellular organisms as they shape cellular function and ultimately organismal phenotype. However, the spatial code embedded in the molecular interactions that drive and sustain spatial organization, and in the organization that in turn drives intercellular interactions across a living animal, remains to be elucidated. Here we use the expression of ligand-receptor pairs obtained from a whole-body single-cell transcriptome of Caenorhabditis elegans larvae to compute the potential for intercellular interactions through a Bray-Curtis-like metric. Leveraging a 3D atlas of C. elegans' cells, we implement a genetic algorithm to select the ligand-receptor pairs most informative of the spatial organization of cells. Validating the strategy, the selected ligand-receptor pairs are involved in known cell-migration and morphogenesis processes and we confirm a negative correlation between cell-cell distances and interactions. Thus, our computational framework helps identify cell-cell interactions and their relationship with intercellular distances, and decipher molecular bases encoding spatial information in a whole animal. Furthermore, it can also be used to elucidate associations with any other intercellular phenotype and applied to other multicellular organisms. Copy rights belong to original authors. Visit the link for more info
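Note on the entry above: the abstract mentions a "Bray-Curtis-like metric" for scoring interaction potential from ligand-receptor expression, but does not give its exact form. As a rough illustration only, the sketch below computes the standard Bray-Curtis similarity between two non-negative vectors. The function name and the idea of feeding it one cell's ligand levels and another cell's receptor levels are assumptions made for this example, not the authors' implementation.

```python
def bray_curtis_similarity(u, v):
    """Standard Bray-Curtis similarity (1 - dissimilarity) for non-negative vectors.

    Hypothetical use: u holds ligand expression in one cell and v holds the
    matching receptor expression in another cell, one entry per ligand-receptor pair.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        # Neither cell expresses anything: treat as no interaction potential.
        return 0.0
    # Twice the overlap divided by the total signal in both vectors.
    shared = sum(min(a, b) for a, b in zip(u, v))
    return 2.0 * shared / total


# Two cells whose ligand and receptor levels partly overlap.
print(bray_curtis_similarity([2, 0], [1, 1]))
```

The score is 1 when the two profiles are identical and 0 when they share no expressed pair, which is the property that makes this family of metrics convenient for comparing against cell-cell distances.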
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.22.393074v1?rss=1 Authors: Roy, N., Kabir, A. H., Zahan, N., Mouna, S. T., Chakravarty, S., Rahman, A. H., Bayzid, M. S. Abstract: Rice genetic diversity is regulated by multiple genes and is largely dependent on various environmental factors. Uncovering the genetic variations associated with the diversity in rice populations is the key to breeding stable and high-yielding rice varieties. We performed Genome Wide Association Studies (GWAS) on 7 rice yield-related traits (grain length, grain width, grain weight, panicle length, leaf length, leaf width, and leaf angle) based on 3,940,165 single nucleotide polymorphisms (SNPs) in a population of 183 rice landraces of Bangladesh. Our studies reveal various chromosomal regions that are significantly associated with different traits in Bangladeshi rice varieties. We also identified various candidate genes, which are associated with these traits. This study reveals multiple candidate genes within short intervals. We also identified SNP loci, which are significantly associated with multiple yield-related traits. The results of these association studies support previous findings as well as provide additional insights into the genetic diversity of rice. This is the first known GWAS on various yield-related traits in the varieties of Oryza sativa available in Bangladesh, the fourth largest rice producing country. We believe this study will accelerate rice genetics research and the breeding of stable high-yielding rice in Bangladesh. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.22.393165v1?rss=1 Authors: Yan, H., Song, Q., Lee, J., Schiefelbein, J., Li, S. Abstract: An essential step of single-cell RNA sequencing analysis is to classify specific cell types with marker genes in order to dissect the biological functions of each individual cell. In this study, we integrated five published scRNA-seq datasets from the Arabidopsis root containing over 25,000 cells and 17 cell clusters. We have compared the performance of seven machine learning methods in classifying these cell types, and determined that the random forest and support vector machine methods performed best. Using feature selection with these two methods and a correlation method, we have identified 600 new marker genes for 10 root cell types, and more than 70% of these machine learning-derived marker genes were not identified before. We found that these new markers not only assign cell types as consistently as the previously known cell markers, but also perform better than existing markers in several evaluation metrics including accuracy and sensitivity. Markers derived by the random forest method, in particular, were expressed in 89-98% of cells in endodermis, trichoblast, and cortex clusters, which is a 29-67% improvement over known markers. Finally, we have found 111 new orthologous marker genes for the trichoblast in five plant species, which expands the number of marker genes by 58-170% in non-Arabidopsis plants. Our results represent a new approach to identify cell-type marker genes from scRNA-seq data and pave the way for cross-species mapping of scRNA-seq data in plants. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.19.390773v1?rss=1 Authors: Ge, X., Chen, Y. E., Song, D., McDermott, M., Woyshner, K., Manousopoulou, A., Wang, L. D., Li, W., Li, J. J. Abstract: High-throughput biological data analysis commonly involves the identification of "interesting" features (e.g., genes, genomic regions, and proteins), whose values differ between two conditions, from numerous features measured simultaneously. To ensure the reliability of such analysis, the most widely-used criterion is the false discovery rate (FDR), the expected proportion of uninteresting features among the identified ones. Existing bioinformatics tools primarily control the FDR based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions, two requirements that are often unmet in biological studies. To address this issue, we propose Clipper, a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper is applicable to identifying both enriched and differential features from high-throughput biological data of diverse types. In comprehensive simulation and real-data benchmarking, Clipper outperforms existing generic FDR control methods and specific bioinformatics tools designed for various tasks, including peak calling from ChIP-seq data, differentially expressed gene identification from RNA-seq data, differentially interacting chromatin region identification from Hi-C data, and peptide identification from mass spectrometry data. Notably, our benchmarking results for peptide identification are based on the first mass spectrometry data standard that has a realistic dynamic range. Our results demonstrate Clipper's flexibility and reliability for FDR control, as well as its broad applications in high-throughput data analysis. Copy rights belong to original authors. Visit the link for more info
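Note on the entry above: the abstract contrasts Clipper with existing tools that control the FDR from p-values. For readers unfamiliar with that baseline, here is a minimal sketch of the classic Benjamini-Hochberg step-up procedure, a widely used p-value-based FDR method. It is not Clipper and does not reflect how Clipper works; the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of features rejected by the Benjamini-Hochberg procedure.

    Assumes the p-values are valid, which is the requirement the abstract
    says is often unmet in practice.
    """
    m = len(p_values)
    # Indices of the features, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value clears its threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is called a discovery.
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for five features.
print(benjamini_hochberg([0.001, 0.8, 0.015, 0.04, 0.5], alpha=0.05))
```

Note that the feature with p = 0.04 is not selected even though it is below 0.05, because its rank-adjusted threshold is 0.03.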
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.391912v1?rss=1 Authors: Choudhary, K. S., Fahy, E., Coakley, K., Sud, M., Maurya, M. R., Subramaniam, S. Abstract: With the advent of high throughput mass spectrometric methods, metabolomics has emerged as an essential area of research in biomedicine with the potential to provide deep biological insights into normal and diseased functions in physiology. However, to achieve the potential offered by metabolomics measures, there is a need for biologist-friendly integrative analysis tools that can transform data into mechanisms that relate to phenotypes. Here, we describe MetENP, an R package, and a user-friendly web application deployed at the Metabolomics Workbench site extending the metabolomics enrichment analysis to include species-specific pathway analysis, pathway enrichment scores, gene-enzyme information, and enzymatic activities of the significantly altered metabolites. MetENP provides a highly customizable workflow through various user-specified options and includes support for all metabolite species with available KEGG pathways. MetENPweb is a web application for calculating metabolite and pathway enrichment analysis. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.19.390542v1?rss=1 Authors: Burstein, D., Fullard, J., Roussos, P. Abstract: Prior to identifying clusters in single cell gene expression experiments, selecting the top principal components is a critical step for filtering out noise in the data set. Identifying these top principal components typically focuses on the total variance explained, and principal components that explain small clusters from rare populations will not necessarily capture a large percentage of variance in the data. We present a computationally efficient alternative for identifying the optimal principal components based on the tails of the distribution of variance explained for each observation. We then evaluate the efficacy of our approach in three different single cell RNA-sequencing data sets and find that our method matches, or outperforms, other selection criteria that are typically employed in the literature. Availability and implementation: pcqc is written in Python and available at github.com/RoussosLab/pcqc Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.367490v1?rss=1 Authors: Rizwan, S., Pike, D., Poudel, S., Nanda, V. Abstract: Cofactor binding sites in proteins often are composed of favorable interactions of specific cofactors with the sidechains and/or backbone protein fold motifs. In many cases these motifs contain left-handed conformations which enable tight turns of the backbone that present backbone amide protons in direct interactions with cofactors termed 'cationic nests'. Here, we defined alternating handedness of secondary structure as a search constraint within the PDB to systematically identify these cofactor binding nests. We identify unique alternating handedness structural motifs which are specific to the cofactors they bind. These motifs can guide the design of engineered folds that utilize specific cofactors and also enable us to gain a deeper insight into the evolution of the structure of cofactor binding sites. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.386813v1?rss=1 Authors: Dragan, I., Sparso, T., Kuznetsov, D., Slieker, R., Ibberson, M. Abstract: Summary: dsSwissKnife is an R package that enables several powerful analyses to be performed on federated datasets. The package works alongside DataSHIELD and extends its functionality. We have developed and implemented dsSwissKnife in a large IMI project on type 2 diabetes, RHAPSODY, where data from 10 observational cohorts have been harmonised and federated in CDISC SDTM format and made available for biomarker discovery. Availability and implementation: dsSwissKnife is freely available online at https://github.com/sib-swiss/dsSwissKnife. The package is distributed under the GNU General Public License version 3 and is accompanied by example files and data. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.391045v1?rss=1 Authors: Planell, N., Lagani, V., Sebastian-Leon, P., van der Kloet, F., Ewing, E., Karathanasis, N., Urdangarin, A., Arozarena, I., Jagodic, M., Tsamardinos, I., Tarazona, S., Conesa, A., Tegner, J., Gomez-Cabrero, D. Abstract: Technologies for profiling samples using different omics platforms have been at the forefront since the human genome project. Large-scale multi-omics data hold the promise of deciphering different regulatory layers. Yet, while there is a myriad of bioinformatics tools, each multi-omics analysis appears to start from scratch with an arbitrary decision over which tools to use and how to combine them. It is therefore an unmet need to conceptualize how to integrate such data and to implement and validate pipelines in different cases. We have designed a conceptual framework (STATegra), intended to be as generic as possible for multi-omics analysis, combining machine learning component analysis, non-parametric data combination and a multi-omics exploratory analysis in a step-wise manner. While in several studies we have previously combined those integrative tools, here we provide a systematic description of the STATegra framework and its validation using two TCGA case studies. For both the Glioblastoma and the Skin Cutaneous Melanoma cases, we demonstrate an enhanced capacity to identify features in comparison to single-omics analysis. Such an integrative multi-omics analysis framework for the identification of features and components facilitates the discovery of new biology. Finally, we provide several options for applying the STATegra framework when parametric assumptions are fulfilled, and for the case when not all the samples are profiled for all omics.
The STATegra framework is built using several tools, which are being integrated step by step as open source in the STATegRa Bioconductor package https://bioconductor.org/packages/release/bioc/html/STATegra.html. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.390930v1?rss=1 Authors: Das Roy, R., Hallikas, O., Christensen, M. M., Renvoise, E., Jernvall, J. Abstract: Exploration of genetically modified organisms, developmental processes, diseases or responses to various treatments requires accurate measurement of changes in gene expression. This can be done for thousands of genes using high throughput technologies such as microarray and RNAseq. However, identification of differentially expressed (DE) genes poses technical challenges due to limited sample size, few replicates, or simply very small changes in expression levels. Consequently, several methods have been developed to determine DE genes, such as Limma, RankProd, SAM, and DESeq2. These methods identify DE genes based on the expression levels alone. As genomic co-localization of genes is generally not linked to co-expression, we deduced that DE genes could be detected with the help of genes from their chromosomal neighbourhood. Here, we present a new method, DELocal, which identifies DE genes by comparing their expression changes to changes in adjacent genes in their chromosomal regions. Our results show that DELocal provides distinct benefits in the identification of DE genes. Furthermore, our comparative analysis of the dispersal of genes with related functions suggests that DELocal is applicable to a wide range of developmental systems. With increasing availability of genomic data, gene neighbourhood can become a powerful tool to detect differential expression. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.391011v1?rss=1 Authors: Laws, R. L., Paul, P., Mosites, E., Scobie, H., Clarke, K. E. N., Slayton, R. B. Abstract: Background: Congregate settings are at risk for coronavirus disease 2019 (COVID-19) outbreaks. Diagnostic testing can be used as a tool in these settings to identify outbreaks and to control transmission. Methods: We used transmission modeling to estimate the minimum number of persons to test and the optimal frequency to detect small outbreaks of COVID-19 in a congregate facility. We also estimated the frequency of testing needed to interrupt transmission within a facility. Results: The number of people to test and frequency of testing needed depended on turnaround time, facility size, and test characteristics. Parameters are calculated for a variety of scenarios. In a facility of 100 people, 26 randomly selected individuals would need to be tested at least every 6 days to identify a true underlying prevalence of at least 5%, with test sensitivity of 85%, and greater than 95% outbreak detection sensitivity. Disease transmission could be interrupted with universal, facility-wide testing with rapid turnaround every three days. Conclusions: Testing a subset of individuals in congregate settings can improve early detection of small outbreaks of COVID-19. Frequent universal diagnostic testing can be used to interrupt transmission within a facility, but its efficacy is reliant on rapid turnaround of results for isolation of infected individuals. Copy rights belong to original authors. Visit the link for more info
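Note on the entry above: the reported figures come from the authors' transmission model, which accounts for repeated testing and spread over time. The sketch below is a much simpler, static calculation included only to show how facility size, sample size and test sensitivity interact in a single testing round. It assumes a fixed number of infected people, random sampling without replacement, and independent test results. It is not the authors' model and is not expected to reproduce their numbers; the function name is hypothetical.

```python
from math import comb


def detection_probability(facility_size, infected, tested, sensitivity):
    """Chance that one round of random testing yields at least one positive.

    Static simplification: no transmission, no turnaround delay, and no
    repeated rounds, all of which the abstract's model does consider.
    """
    if not 0 <= infected <= facility_size or not 0 <= tested <= facility_size:
        raise ValueError("infected and tested must lie between 0 and facility_size")
    total_samples = comb(facility_size, tested)
    miss = 0.0
    for k in range(min(infected, tested) + 1):
        # Hypergeometric chance that exactly k infected people are sampled.
        p_k = comb(infected, k) * comb(facility_size - infected, tested - k) / total_samples
        # The outbreak is missed only if all k sampled cases test falsely negative.
        miss += p_k * (1.0 - sensitivity) ** k
    return 1.0 - miss


# Single-round detection chance when more or fewer residents are sampled.
for n in (10, 26, 50, 100):
    print(n, round(detection_probability(100, 5, n, 0.85), 3))
```

A single round with a partial sample leaves a substantial chance of missing a small outbreak, which is consistent with the abstract's emphasis on testing frequency as well as sample size.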
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.391029v1?rss=1 Authors: Covell, D. G. Abstract: A joint analysis of NCI60 small molecule screening data, their genetically defective genes and mechanisms of action (MOA) of FDA-approved cancer drugs screened in the NCI60 is proposed for identifying links between chemosensitivity, genomic defects and MOA. Self-organizing maps (SOMs) are used to organize the chemosensitivity data. Student's t-tests are used to identify SOM clusters with chemosensitivity for tumor cells harboring genetically defective genes. Fisher's exact tests are used to reveal instances where defective gene to chemosensitivity associations have enriched MOAs. The results of this analysis find a relatively small set of defective genes, inclusive of ABL1, AXL, BRAF, CDC25A, CDKN2A, IGF1R, KRAS, MECOM, MMP1, MYC, NOTCH1, NRAS, PIK3CG, PTK2, RPTOR, SPTBN1, STAT2, TNKS and ZHX2, as possible candidates for roles in chemosensitivity for compound MOAs that target primarily, but not exclusively, kinases, nucleic acid synthesis, protein synthesis, apoptosis and tubulin. This analysis may contribute towards the goals of cancer drug discovery, development decision making, and explanation of mechanisms. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.391300v1?rss=1 Authors: Neely, B. A., Stemmer, P., Searle, B. C., Herring, L. E., Martin, L., Midha, M. K., Phinney, B. S., Shan, B., Palmblad, M., Wang, Y., Jagtap, P. D., Kirkpatrick, J. M. Abstract: Despite the advantages of fewer missing values by collecting fragment ion data on all analytes in the sample, as well as the potential for deeper coverage, the adoption of data-independent acquisition (DIA) in core facility settings has been slow. The Association of Biomolecular Resource Facilities conducted a large interlaboratory study to evaluate DIA performance in laboratories with various instrumentation. Participants were supplied with generic methods and a uniform set of test samples. The resulting 49 DIA datasets act as benchmarks and have utility in education and tool development. The sample set consisted of a tryptic HeLa digest spiked with high or low levels of four exogenous proteins. Data are available in MassIVE MSV000086479. Additionally, we demonstrate how the data can be analysed by focusing on two datasets using different library approaches and show the utility of select summary statistics. These data can be used by DIA newcomers, software developers, or DIA experts evaluating performance with different platforms, acquisition settings and skill levels. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.19.389981v1?rss=1 Authors: Ness-Cohn, E., Braun, R. Abstract: The circadian rhythm drives the oscillatory expression of thousands of genes across all tissues. The recent revolution in high-throughput transcriptomics, coupled with the significant implications of the circadian clock for human health, has sparked an interest in circadian profiling studies to discover genes under circadian control. Here we present TimeCycle: a topology-based rhythm detection method designed to identify cycling transcripts. For a given time-series, the method reconstructs the state space using time-delay embedding, a data transformation technique from dynamical systems. In the embedded space, Takens' theorem proves that the dynamics of a rhythmic signal will exhibit circular patterns. The degree of circularity of the embedding is calculated as a persistence score using persistent homology, an algebraic method for discerning the topological features of data. By comparing the persistence scores to a bootstrapped null distribution, cycling genes are identified. Results in both synthetic and biological data highlight TimeCycle's ability to identify cycling genes across a range of sampling schemes, number of replicates, and missing data. Comparison to competing methods highlights their relative strengths, providing guidance as to the optimal choice of cycling detection method. Copy rights belong to original authors. Visit the link for more info
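Note on the entry above: the abstract describes reconstructing the state space of a time series with time-delay embedding, where a rhythmic signal traces out a loop. The sketch below illustrates only that embedding step on a synthetic sine wave. The function name, the choice of two dimensions, and the quarter-period delay are assumptions made for this example; the persistent homology scoring and bootstrapped null distribution that TimeCycle applies afterwards are not shown.

```python
import math


def delay_embed(series, dimension=2, delay=1):
    """Map a 1-D series to points (x[i], x[i + delay], ..., x[i + (dimension - 1) * delay])."""
    if dimension < 1 or delay < 1:
        raise ValueError("dimension and delay must be positive integers")
    span = (dimension - 1) * delay
    # Each point pairs a sample with its delayed copies; the last `span` samples
    # have no complete set of delayed partners and are dropped.
    return [
        tuple(series[i + k * delay] for k in range(dimension))
        for i in range(len(series) - span)
    ]


# Synthetic rhythmic signal: 48 samples of a sine wave with a period of 24 samples.
signal = [math.sin(2 * math.pi * t / 24) for t in range(48)]

# With a delay of a quarter period, each point is (sin, cos), so the
# embedded points fall on the unit circle.
points = delay_embed(signal, dimension=2, delay=6)
print(len(points), max(abs(x * x + y * y - 1.0) for x, y in points))
```

A flat or purely noisy series would not form such a loop, which is the geometric difference a circularity score is meant to pick up.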
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.391318v1?rss=1 Authors: Karatzas, E., Baltoumas, F. A., Panayiotou, N. A., Schneider, R., Pavlopoulos, G. A. Abstract: Efficient integration and visualization of heterogeneous biomedical information in a single view is a key challenge. In this study, we present Arena3Dweb, the first, fully interactive and dependency-free, web application which allows the visualization of multilayered graphs in 3D space. With Arena3Dweb, users can integrate multiple networks in a single view along with their intra- and inter-layer connections. For clearer and more informative views, users can choose between a plethora of layout algorithms and apply them on a set of selected layers either individually or in combination. Users can align networks and highlight node topological features, whereas each layer as well as the whole scene can be translated, rotated and scaled in 3D space. User-selected edge colors can be used to highlight important paths, while node positioning, coloring and resizing can be adjusted on-the-fly. In its current version, Arena3Dweb supports weighted and unweighted undirected graphs and is written in R, Shiny and JavaScript. We demonstrate the functionality of Arena3Dweb using two different use-case scenarios; one regarding drug repurposing for SARS-CoV-2 and one related to GPCR signaling pathways implicated in melanoma. Arena3Dweb is available at http://bib.fleming.gr:3838/Arena3D Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.387795v1?rss=1 Authors: Sun, T., Song, D., Li, W. V., Li, J. J. Abstract: In the burgeoning field of single-cell transcriptomics, a pressing challenge is to benchmark various experimental protocols and numerous computational methods in an unbiased manner. Although dozens of simulators have been developed for single-cell RNA-seq (scRNA-seq) data, they lack the capacity to simultaneously achieve all the three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill in this gap, here we propose scDesign2, an interpretable simulator that achieves all the three goals and generates high-fidelity synthetic data for multiple scRNA-seq protocols and other single-cell gene expression count-based technologies. Compared with existing simulators, scDesign2 is advantageous in its transparent use of probabilistic models and is unique in its ability to capture gene correlations via copula. We verify that scDesign2 generates more realistic synthetic data for four scRNA-seq protocols (10x Genomics, CEL-Seq2, Fluidigm C1, and Smart-Seq2) and two single-cell spatial transcriptomics protocols (MERFISH and pciSeq) than existing simulators do. Under two typical computational tasks, cell clustering and rare cell type detection, we demonstrate that scDesign2 provides informative guidance on deciding the optimal sequencing depth and cell number in single-cell RNA-seq experimental design, and that scDesign2 can effectively benchmark computational methods under varying sequencing depths and cell numbers. With these advantages, scDesign2 is a powerful tool for single-cell researchers to design experiments, develop computational methods, and choose appropriate methods for specific data analysis needs. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.387738v1?rss=1 Authors: Arslan, A. Abstract: Motivation: The transport of proteins plays a crucial role in the cellular phenotype. Changes in the protein targeting sequence can prevent a protein from being delivered to the right destination at the right time, and can disrupt various cellular pathways. Given the importance of single residue change(s) in the protein targeting sequence, we developed a computational method to fill this gap. Results: By taking into account various protein features like conservation, protein modifications, charge, isoelectric effect and biochemical properties of peptides, the method, TransSite, assesses the impact of mutations on protein transportation. We applied this method to human cancer proteins and discovered that several cancer proteins harbour recurring mutations in their targeting sequences. Availability: https://github.com/AhmedArslan/TransSite Contact: aarslan@staford.edu Supplementary information: Supplementary data are available online. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.18.389213v1?rss=1 Authors: Srinivas, R., Verma, N., Larson, E. C., Kraka, E. Abstract: In their previous work, Srinivas et al. have shown that implicit fingerprints capture ligands and proteins in a shared latent space, typically for the purposes of virtual screening with collaborative filtering models applied on known bioactivity data. In this work, we extend these implicit fingerprints/descriptors using deep learning techniques to translate latent descriptors into discrete representations of molecules (SMILES), without explicitly optimizing for chemical properties. This allows the design of new compounds based upon the latent representation of nearby proteins, thereby encoding drug-like properties including binding affinities to known proteins. The implicit descriptor method does not require any fingerprint similarity search, which makes the method free of any bias arising from the empirical nature of the fingerprint models (Srinivas et al., 2018). We evaluate the properties of the novel drugs generated by our approach using physical properties of drug-like molecules and chemical complexity. Additionally, we analyze the reliability of the biological activity of the new compounds generated using this method by employing models of protein-ligand interaction, which assists in assessing the potential binding affinity of the designed compounds. We find that the generated compounds exhibit properties of chemically feasible compounds and are likely to be excellent binders to known proteins. Furthermore, we also analyze the diversity of compounds created using the Tanimoto distance and conclude that there is a wide diversity in the generated compounds. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.19.390401v1?rss=1 Authors: Geread, R. S., Sivanandarajah, A., Brouwer, E., Wood, G., Androutsos, D., Faragalla, H., Khademi, A. Abstract: In this work, a novel proliferation index (PI) calculator for Ki67 images called piNET is proposed. It is successfully tested on four datasets, from three scanners, comprised of patches, tissue microarrays (TMAs) and whole-slide images (WSI), representing a diverse multicentre dataset for evaluating Ki67 quantification. Compared to state-of-the-art methods, piNET consistently performs the best over all datasets with an average PI difference of 5.603%, PI accuracy rate of 86% and correlation coefficient R = 0.927. The success of the system can be attributed to a number of innovations. Firstly, this tool is built on deep learning, which can adapt to the wide variability of medical images, and the task is posed as a detection problem to mimic the pathologists' workflow, which improves accuracy and efficiency. Secondly, the system is trained purely on tumour cells, which reduces false positives from non-tumour cells without needing the usual pre-requisite tumour segmentation step for Ki67 quantification. Thirdly, the concept of learning background regions through weak supervision is introduced, by providing the system with ideal and non-ideal (artifact) patches, which further reduces false positives. Lastly, a novel hotspot analysis is proposed to allow automated methods to score patches from WSI that contain significant activity. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.391573v1?rss=1 Authors: Mastrantonio, G., Bibbona, E., Furlan, M. Abstract: We propose a hierarchical Bayesian approach to infer the RNA synthesis, processing, and degradation rates from sequencing data. We parametrise kinetic rates with novel functional forms and estimate the parameters through a Dirichlet process defined at a low level of hierarchy. Despite the complexity of this approach, we manage to perform inference, clusterisation and model selection simultaneously. We apply our method to investigate transcriptional and post-transcriptional responses of murine fibroblasts to the activation of proto-oncogene MYC. We uncover a widespread choral regulation of the three rates, which was not previously observed in this biological system. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.18.389189v1?rss=1 Authors: Kang, J. B., Nathan, A., Millard, N., Rumker, L., Moody, D. B., Korsunsky, I., Raychaudhuri, S. Abstract: Recent advances in single-cell technologies and integration algorithms make it possible to construct large, comprehensive reference atlases from multiple datasets encompassing many donors, studies, disease states, and sequencing platforms. Much like mapping sequencing reads to a reference genome, it is essential to be able to map new query cells onto complex, multimillion-cell reference atlases to rapidly identify relevant cell states and phenotypes. We present Symphony, a novel algorithm for building compressed, integrated reference atlases of ≥10^6 cells and enabling efficient query mapping within seconds. Based on a linear mixture model framework, Symphony precisely localizes query cells within a low-dimensional reference embedding without the need to reintegrate the reference cells, facilitating the downstream transfer of many types of reference-defined annotations to the query cells. We demonstrate the power of Symphony by (1) mapping a query containing multiple levels of experimental design to predict pancreatic cell types in human and mouse, (2) localizing query cells along a smooth developmental trajectory of human fetal liver hematopoiesis, and (3) harnessing a multimodal CITE-seq reference atlas to infer query surface protein expression in memory T cells. Symphony will enable the sharing of comprehensive integrated reference atlases in a convenient, portable format that powers fast, reproducible querying and downstream analyses. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.19.386565v1?rss=1 Authors: Haase, R., Jain, A., Rigaud, S., Vorkel, D., Rajasekhar, P., Suckert, T., Lambert, T. J., Nunez-Iglesias, J., Poole, D. P., Tomancak, P., Myers, E. W. Abstract: Modern life science relies heavily on fluorescent microscopy and subsequent quantitative bio-image analysis. The current rise of graphics processing units (GPUs) in the context of image processing enables batch processing of large amounts of image data at unprecedented speed. In order to facilitate adoption of this technology in daily practice, we present an expert system based on the GPU-accelerated image processing library CLIJ: The CLIJ-assistant keeps track of which operations formed an image and suggests subsequent operations. It enables new ways of interaction with image data and image processing operations because its underlying GPU-accelerated image data flow graphs (IDFGs) allow changes to parameters of early processing steps and instantaneous visualization of their final results. Operations, their parameters and connections in the IDFG are stored at any point in time, enabling the CLIJ-assistant to offer an undo-function for virtually unlimited rewinding of parameter changes. Furthermore, to improve reproducibility of image data analysis workflows and interoperability with established image analysis platforms, the CLIJ-assistant can generate code from IDFGs in programming languages such as ImageJ Macro, Java, Jython, JavaScript, Groovy, Python and C++ for later use in ImageJ, Fiji, Icy, Matlab, QuPath, Jupyter Notebooks and Napari. We demonstrate the CLIJ-assistant for processing image data in multiple scenarios to highlight its general applicability. The CLIJ-assistant is open source and available online: https://clij.github.io/assistant/ Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.391904v1?rss=1 Authors: Mohanty, S. D., Lekan, D., McCoy, T. P., Jenkins, M., Manda, P. Abstract: Healthcare costs that can be attributed to unplanned readmissions are staggeringly high and negatively impact health and wellness of patients. In the United States, hospital systems and care providers have strong financial motivations to reduce readmissions in accordance with several government guidelines. One of the critical steps to reducing readmissions is to recognize the factors that lead to readmission and correspondingly identify at-risk patients based on these factors. The availability of large volumes of electronic health care records makes it possible to develop and deploy automated machine learning models that can predict unplanned readmissions and pinpoint the most important factors of readmission risk. While hospital readmission is an undesirable outcome for any patient, it is more so for medically frail patients. Here, we develop and compare four machine learning models (Random Forest, XGBoost, CatBoost, and Logistic Regression) for predicting 30-day unplanned readmission for patients deemed frail (Age ≥ 50). Variables that indicate frailty, comorbidities, high-risk medication use, demographic, hospital and insurance were incorporated in the models for prediction of unplanned 30-day readmission. Our findings indicate that CatBoost outperforms the other three models (AUC 0.80) and prior work in this area. We find that constructs of frailty, certain categories of high-risk medications, and comorbidity are all strong predictors of readmission for elderly patients. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.19.390500v1?rss=1 Authors: Hou, Q., Stringer, B., Waury, K., Capel, H., Haydarlou, R., Abeln, S., Heringa, J., Feenstra, K. A. Abstract: Motivation: Antibodies play an important role in clinical research and biotechnology, with their specificity determined by the interaction with the antigen's epitope region, as a special type of protein-protein interaction (PPI) interface. The ubiquitous availability of sequence data allows us to predict epitopes from sequence in order to focus time-consuming wet-lab experiments onto the most promising epitope regions. Here, we extend our previously developed sequence-based predictors for homodimer and heterodimer PPI interfaces to predict epitope residues that have the potential to bind an antibody. Results: We collected and curated a high quality epitope dataset from the SAbDab database. Our generic PPI heterodimer predictor obtained an AUC-ROC of 0.666 when evaluated on the epitope test set. We then trained a random forest model specifically on the epitope dataset, reaching AUC 0.694. Further training on the combined heterodimer and epitope datasets improves our final predictor to AUC 0.703 on the epitope test set. This is better than the best state-of-the-art sequence-based epitope predictor BepiPred-2.0. On one solved antibody-antigen structure of the COVID19 virus spike RNA binding domain, our predictor reaches AUC 0.778. We added the SeRenDIP-CE Conformational Epitope predictors to our webserver, which is simple to use and only requires a single antigen sequence as input, which will help make the method immediately applicable in a wide range of biomedical and biomolecular research. Availability: Webserver, source code and datasets are available at www.ibi.vu.nl/programs/serendipwww/ Contact: k.a.feenstra@vu.nl Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.18.388579v1?rss=1 Authors: Valcarcel, L. V., San Jose-Eneriz, E., Cendoya, X., Rubio, A., Agirre, X., Prosper, F., Planes, F. J. Abstract: Motivation: With the frenetic growth of high-dimensional datasets in different biomedical domains, there is an urgent need to develop predictive methods able to deal with this complexity. Feature selection is a relevant strategy in machine learning to address this challenge. Results: We introduce a novel feature selection algorithm for linear regression called BOSO (Bilevel Optimization Selector Operator). We conducted a benchmark of BOSO with key algorithms in the literature, finding a superior performance in high-dimensional datasets. Proof-of-concept of BOSO for predicting drug sensitivity in cancer is presented. A detailed analysis is carried out for methotrexate, a well-studied drug targeting cancer metabolism. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.19.390617v1?rss=1 Authors: Cao, C., Kwok, D., Li, Q., He, J., Guo, X., Zhang, Q., Long, Q. Abstract: The success of transcriptome-wide association studies (TWAS) has led to substantial research towards improving its core component of genetically regulated expression (GReX). GReX links expression information with phenotype by serving as both the outcome of genotype-based expression models and the predictor for downstream association testing. In this work, we demonstrate that current linear models of GReX inadvertently combine two separable steps of machine learning - feature selection and aggregation - which can be independently replaced to improve overall power. We show that the monolithic approach of GReX limits the adaptability of TWAS methodology and practice, especially given low expression heritability. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.385757v1?rss=1 Authors: Jiang, Y., Rensi, S., Wang, S., Altman, R. B. Abstract: Massively accumulated pharmacogenomics, chemogenomics, and side effect datasets offer an unprecedented opportunity for drug response prediction, drug target identification and drug side effect prediction. Existing computational approaches limit their scope to only one of these three tasks, inevitably overlooking the rich connection among them. Here, we propose DrugOrchestra, a deep multi-task learning framework that jointly predicts drug response, targets and side effects. DrugOrchestra leverages pre-trained molecular structure-based drug representation to bridge these three tasks. Instead of directly fine-tuning on an individual task, DrugOrchestra uses deep multi-task learning to obtain a phenotype-based drug representation by simultaneously fine-tuning on drug response, target and side effect prediction. By coupling these three tasks together, DrugOrchestra is able to make predictions for unseen drugs by only knowing their molecular structures. We constructed a heterogeneous drug discovery dataset of over 21k drugs by integrating 8 datasets across three tasks. Our method obtained significant improvements in comparison to methods that were trained on a single task or a single dataset. We further revealed the transferability across 8 datasets and 3 tasks, providing novel insights for understanding drug mechanisms. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.387811v1?rss=1 Authors: Kang, K., Chong, H., Ning, K. Abstract: Motivation: Microbial community samples and sequencing data have been accumulated at a speed faster than ever, with tens of thousands of samples being sequenced each year. Mining such a huge amount of multi-source heterogeneous data is becoming more and more difficult. Among several sample mining bottlenecks, efficient and accurate search of samples is one of the most prominent: Faced with millions of samples in the data repository, traditional sample comparison and search approaches fall short in speed and accuracy. Results: Here we proposed Meta-Prism 2.0, a microbial community sample search method based on smart pair-wise sample comparison, which pushed the time and memory efficiency to a new limit without compromising accuracy. Based on a memory-saving data structure, a time-saving instruction pipeline, and boost scheme optimization, Meta-Prism 2.0 has enabled ultra-fast, accurate and memory-efficient search among millions of samples. Meta-Prism 2.0 has been put to the test on several datasets, with the largest containing one million samples. Results have shown that, firstly, as a distance-based method, Meta-Prism 2.0 is not only faster than other distance-based methods, but also faster than unsupervised methods. Its 0.00001s per sample pair search speed, as well as 8GB memory needs for searching against one million samples, have enabled it to be the most efficient method for sample comparison. Secondly, Meta-Prism 2.0 could achieve comparison accuracy and search precision that are comparable to or better than other contemporary methods. Thirdly, Meta-Prism 2.0 can precisely identify the original biome for samples, thus enabling sample source tracking.
Conclusion: In summary, Meta-Prism 2.0 can perform accurate searches among millions of samples with very low memory cost and fast speed, enabling knowledge discovery from samples at a massive scale. It has changed the traditional resource-intensive sample comparison and search scheme to a cheap and effective procedure, which could be conducted by researchers every day even on a laptop, for insightful sample search and knowledge discovery. Meta-Prism 2.0 could be accessed at: https://github.com/HUST-NingKang-Lab/Meta-Prism-2.0. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.18.386102v1?rss=1 Authors: Zhang, R., Luo, Y., Ma, J., Zhang, M., Wang, S. Abstract: Rapidly generated scRNA-seq datasets enable us to understand cellular differences and the function of each individual cell at single-cell resolution. Cell type classification, which aims at characterizing and labeling groups of cells according to their gene expression, is one of the most important steps for single-cell analysis. To facilitate the manual curation process, supervised learning methods have been used to automatically classify cells. Most of the existing supervised learning approaches only utilize annotated cells in the training step while ignoring the more abundant unannotated cells. In this paper, we proposed scPretrain, a multi-task self-supervised learning approach that jointly considers annotated and unannotated cells for cell type classification. scPretrain consists of a pre-training step and a fine-tuning step. In the pre-training step, scPretrain uses a multi-task learning framework to train a feature extraction encoder based on each dataset's pseudo-labels, where only unannotated cells are used. In the fine-tuning step, scPretrain fine-tunes this feature extraction encoder using the limited annotated cells in a new dataset. We evaluated scPretrain on 60 diverse datasets from different technologies, species and organs, and obtained a significant improvement on both cell type classification and cell clustering. Moreover, the representations obtained by scPretrain in the pre-training step also enhanced the performance of conventional classifiers such as random forest, logistic regression and support vector machines. scPretrain is able to effectively utilize the massive amount of unlabelled data and be applied to annotating increasingly generated scRNA-seq datasets. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.387860v1?rss=1 Authors: Gao, B., Luo, Y., Ma, J., Wang, S. Abstract: Tumor stratification, which aims at clustering tumors into biologically meaningful subtypes, is the key step towards personalized treatment. Large-scale profiled cancer genomics data enables us to develop computational methods for tumor stratification. However, most of the existing approaches only considered tumors from an individual cancer type during clustering, leading to the overlook of common patterns across cancer types and the vulnerability to the noise within that cancer type. To address these challenges, we proposed cancerAlign to map tumors of the target cancer type into latent spaces of other source cancer types. These tumors were then clustered in each latent space rather than the original space in order to exploit shared patterns across cancer types. Due to the lack of aligned tumor samples across cancer types, cancerAlign used adversarial learning to learn the mapping at the population level. It then used consensus clustering to integrate cluster labels from different source cancer types. We evaluated cancerAlign on 7,134 tumors spanning 24 cancer types from TCGA and observed substantial improvement on tumor stratification and cancer gene prioritization. We further revealed the transferability across cancer types, which reflected the similarity among them based on the somatic mutation profile. cancerAlign is an unsupervised approach that provides deeper insights into the heterogeneous and rapidly accumulating somatic mutation profile and can be also applied to other genome-scale molecular information. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.18.042887v1?rss=1 Authors: Vishwakarma, P., Meena, N. K., Prasad, R., Lynn, A. M., Banerjee, A. Abstract: In view of the multiple clinical and physiological implications of ABC transporter proteins, there is a considerable interest among researchers to characterize them functionally. However, such characterizations are based on the premise that ABC proteins are accurately identified in the genome, and their topology is correctly predicted. With this objective, we have developed ABC-finder, i.e., a Docker-based package for the identification of ABC proteins in all organisms, and visualization of the topology of ABC proteins using an interactive web browser. ABC-finder is built and deployed in a Linux container, making it scalable for many concurrent users on our servers and enabling users to download and run it locally. Overall, ABC-finder is a convenient, portable, and platform-independent tool for the identification and topology prediction of ABC proteins. ABC-finder is accessible at http://abc-finder.osdd.jnu.ac.in Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.387548v1?rss=1 Authors: Maden, S. K., Thompson, R. F., Hansen, K. D., Nellore, A. Abstract: While DNA methylation (DNAm) is the most-studied epigenetic mark, few recent studies probe the breadth of publicly available DNAm array samples. We collectively analyzed 35,360 Illumina Infinium HumanMethylation450K DNAm array samples published on the Gene Expression Omnibus (GEO). We learned a controlled vocabulary of sample labels by applying regular expressions to metadata and used existing models to predict various sample properties including epigenetic age. We found approximately two-thirds of samples were from blood, one-quarter were from brain, and one-third were from cancer patients. 19% of samples failed at least one of Illumina's 17 prescribed quality assessments; signal distributions across samples suggest modifying manufacturer-recommended thresholds for failure would make these assessments more informative. We further analyzed DNAm variances in seven tissues (adipose, nasal, blood, brain, buccal, sperm, and liver) and characterized specific probes distinguishing them. Finally, we compiled DNAm array data and metadata, including our learned and predicted sample labels, into database files accessible via the recountmethylation R/Bioconductor companion package. Its vignettes walk the user through some analyses contained in this paper. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.387134v1?rss=1 Authors: Ernst, J., Vu, H. T. Abstract: Genome-wide maps of chromatin marks such as histone modifications and open chromatin sites provide valuable information for annotating the non-coding genome, including identifying regulatory elements. Computational approaches such as ChromHMM have been applied to discover and annotate chromatin states defined by combinatorial and spatial patterns of chromatin marks within the same cell type. An alternative stacked modeling approach was previously suggested, where chromatin states are defined jointly from datasets of multiple cell types to produce a single universal genome annotation based on all datasets. Despite its potential benefits for applications that are not specific to one cell type, such an approach was previously applied only for small-scale specialized purposes. Large-scale applications of stacked modeling have previously posed scalability challenges. In this paper, using a version of ChromHMM enhanced for large-scale applications, we applied the stacked modeling approach to produce a universal chromatin state annotation of the human genome using over 1000 datasets from more than 100 cell types, denoted the full-stack model. The full-stack model states show distinct enrichments for external genomic annotations, which we used in characterizing each state. Compared to cell-type-specific annotations, the full-stack annotation directly differentiates constitutive from cell-type-specific activity and is more predictive of locations of external genomic annotations. Overall, the full-stack ChromHMM model provides a universal chromatin state annotation of the genome and a unified global view of over 1000 datasets. We expect this to be a useful resource that complements existing cell-type-specific annotations for studying the non-coding human genome. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.386649v1?rss=1 Authors: Danciu, D., Karasikov, M., Mustafa, H., Kahles, A., Ratsch, G. Abstract: Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. In this paper, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of nodes adjacent in the graph. RowDiff can be constructed in linear time relative to the number of nodes and labels in the graph, and the construction can be efficiently parallelized and distributed, significantly reducing construction time. RowDiff can be viewed as an intermediary sparsification step of the initial annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrix representation. Our experiments on the Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a Multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST, the previously known smallest annotation representation. In addition, experiments on 10,000 RNA-seq datasets show that RowDiff combined with Multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.385864v1?rss=1 Authors: Chevalier, A., Yang, S., Yajima, M., Campbell, J. D. Abstract: Mutational signatures are patterns of somatic alterations in the genome caused by carcinogenic exposures or aberrant cellular processes. We created the musicatk package to provide a comprehensive workflow for preprocessing, analysis, and visualization of mutational signatures. The musicatk package enables users to select different schemas for counting mutation types and easily combine count tables from different schemas. Several methods can be used to discover new signatures or infer the exposures for a given pre-existing set of signatures. Several visualizations are provided to facilitate exploratory analysis of signatures and exposures. These include comparison of discovered signatures to those in the COSMIC database, using UMAP to embed tumors in two dimensions, and plotting of exposure distributions across user-defined annotations. Overall, musicatk can be used to gain novel insights into the patterns of mutational signatures observed in cancer. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.18.388504v1?rss=1 Authors: Koch, H., Keller, C. A., Giardine, B., Xiang, G., Zhang, F., Wang, Y., Hardison, R. C., Li, Q. Abstract: Joint analyses of genomic datasets obtained in multiple different conditions are essential for understanding the biological mechanism that drives tissue-specificity and cell differentiation, but they still remain computationally challenging. To address this we introduce CLIMB (Composite LIkelihood eMpirical Bayes), a statistical methodology that learns patterns of condition-specificity present in genomic data. CLIMB provides a generic framework facilitating a host of analyses, such as clustering genomic features sharing similar condition-specific patterns and identifying which of these features are involved in cell fate commitment. Our approach improves upon existing methods by boosting statistical power to identify meaningful signals while retaining interpretability and computational tractability. We illustrate CLIMB's value on two sets of hematopoietic data: one studying CTCF ChIP-seq measured in 17 different cell populations, and another examining RNA-seq measured across constituent cell populations in three committed lineages. These analyses demonstrate that CLIMB captures biologically relevant clusters in the data and improves upon commonly-used pairwise comparisons and unsupervised clusterings typical of genomic analyses. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.16.385633v1?rss=1 Authors: Chen, L., Jiang, Y., Yao, B., Huang, K., Liu, Y., Wang, Y., Qin, X., Saykin, A. J., Wang, Y. Abstract: Understanding the functional consequence of noncoding variants is of great interest. Though genome-wide association studies (GWAS) or quantitative trait locus (QTL) analyses have identified variants associated with traits or molecular phenotypes, most of them are located in the noncoding regions, making the identification of causal variants a particular challenge. Existing computational approaches developed for prioritizing noncoding variants produce inconsistent and even conflicting results. To address these challenges, we propose a novel statistical learning framework, which directly integrates the precomputed functional scores from representative scoring methods. It will maximize the usage of integrated methods by automatically learning the relative contribution of each method and produce an ensemble score as the final prediction. The framework consists of two modes. The first "context-free" mode is trained using curated causal regulatory variants from a wide range of contexts and is applicable to predict noncoding variants of unknown and diverse contexts. The second "context-dependent" mode further improves the prediction when the training and testing variants are from the same context. By evaluating the framework via both simulation and empirical studies, we demonstrate that it outperforms integrated scoring methods and that the ensemble score successfully prioritizes experimentally validated regulatory variants in multiple risk loci. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.387779v1?rss=1 Authors: Song, D., Li, J. J. Abstract: In the investigation of molecular mechanisms underlying cell state changes, a crucial analysis is to identify differentially expressed (DE) genes along a continuous cell trajectory, which can be estimated by pseudotime inference from single-cell RNA-sequencing (scRNA-seq) data. However, existing methods that identify DE genes based on inferred pseudotime do not account for the uncertainty in pseudotime inference. Also, they either have ill-posed p-values that hinder the control of false discovery rate (FDR) or have restrictive models that reduce the power of DE gene identification. To overcome these drawbacks, we propose PseudotimeDE, a robust method that accounts for the uncertainty in pseudotime inference and thus identifies DE genes along cell pseudotime with well-calibrated p-values. PseudotimeDE is flexible in allowing users to specify the pseudotime inference method and to choose the appropriate model for scRNA-seq data. Comprehensive simulations and real-data applications verify that PseudotimeDE provides well-calibrated p-values essential for controlling FDR and downstream analysis and that PseudotimeDE is more powerful than existing methods to identify DE genes. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.16.384834v1?rss=1 Authors: Verschaffelt, P., Van Den Bossche, T., Gabriel, W., Burdukiewicz, M., Soggiu, A., Martens, L., Renard, B. Y., Schiebenhoefer, H., Mesuere, B. Abstract: The study of microbiomes has gained in importance over the past few years, and has led to the fields of metagenomics, metatranscriptomics and metaproteomics. While initially focused on the study of biodiversity within these communities, the emphasis has increasingly shifted to the study of (changes in) the complete set of functions available in these communities. A key tool to study this functional complement of a microbiome is Gene Ontology (GO) term analysis. However, comparing large sets of GO terms is not an easy task due to the deeply branched nature of GO, which limits the utility of exact term matching. To solve this problem, we here present MegaGO, a user-friendly tool that relies on semantic similarity between GO terms to compute functional similarity between two data sets. MegaGO is highly performant: each set can contain thousands of GO terms, and results are calculated in a matter of seconds. MegaGO is available as a web application at https://megago.ugent.be and installable via pip as a standalone command line tool and reusable software library. All code is open source under the MIT license, and is available at https://github.com/MEGA-GO/. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.16.384479v1?rss=1 Authors: Vuong, H., Truong, T., Phan, T., Pham, S. Abstract: Most widely used tools for finding marker genes in single-cell data (SeuratT/NegBinom/Poisson, CellRanger, EdgeR, limmatrend) use a conventional definition of differentially expressed genes: genes with different mean expression values. However, in single-cell data, a cell population can be a mixture of many cell types/cell states, hence the mean expression of genes cannot represent the whole population. In addition, these tools assume that the gene expression of a population belongs to a specific family of distribution. This assumption is often violated in single-cell data. In this work, we define marker genes of a cell population as genes that can be used to distinguish cells in the population from cells in other populations. Besides log-fold change, we devise a new metric to classify genes into up-regulated, down-regulated, and transitional states. In a benchmark for finding up-regulated and down-regulated genes, our tool outperforms all compared methods, including Seurat, ROTS, scDD, edgeR, MAST, limma, normal t test, Wilcoxon and Kolmogorov-Smirnov test. Our method is much faster than all compared methods, therefore, enables interactive analysis for large single cell data sets in BioTuring Browser. Venice algorithm is available within Signac package: https://github.com/bioturing/signac Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.15.383448v1?rss=1 Authors: Lin, F. P. Abstract: BACKGROUND: The advances in genome sequencing technologies have provided new opportunities for delivering targeted therapy to patients with advanced cancer. However, these high-throughput assays have also created a multitude of challenges for oncologists in treatment selection, demanding a new approach to support decision-making in clinics. METHODS: To address this unmet need, this paper describes the design of a symbolic reasoning framework using the method of hierarchical task analysis. Based on this framework, an evidence-based treatment recommendation system was implemented for supporting decision-making based on a patient's clinicopathologic and biomarker profiles. RESULTS: This intelligent framework captures a six-step sequential decision process: (1) concept expansion by ontology matching, (2) evidence matching, (3) evidence grading and value-based prioritisation, (4) clinical hypothesis generation, (5) recommendation ranking, and (6) recommendation filtering. The importance of balancing evidence-based and hypothesis-driven treatment recommendations is also highlighted. Of note, tracking history of inference has emerged to be a critical step to allow rational prioritisation of recommendations. The concept of inference tracking also enables the derivation of a novel measure -- level of matching -- that helps to convey whether a treatment recommendation is drawn from incomplete knowledge during the reasoning process. CONCLUSIONS: This framework systematically encapsulates oncologist's treatment decision-making process. Further evaluations in prospective clinical studies are warranted to demonstrate how this computational pipeline can be integrated into oncology practice to improve outcomes. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.15.363259v1?rss=1 Authors: Leonard, R. R., Leleu, M., Van Vlierberghe, M., Kerff, F., BAURAIN, D. Abstract: TQMD is a tool which downloads, stores and produces lists of dereplicated prokaryotic genomes. It has been developed to counter the ever-growing number of prokaryotic genomes and their uneven taxonomic distribution. It is based on word-based alignment-free methods (k-mers), an iterative single-linkage approach and a divide-and-conquer strategy to remain both efficient and scalable. We studied the performance of TQMD by verifying the influence of its parameters and heuristics on the clustering outcome. We further compared TQMD to two other dereplication tools (dRep and Assembly-Dereplicator). Our results showed that TQMD is optimized to dereplicate at high taxonomic levels (phylum/class), whereas the other dereplication tools are optimized for lower taxonomic levels (species/strain), making TQMD complementary to the existing dereplicating tools. TQMD is available at [https://bitbucket.org/phylogeno/tqmd]. Copy rights belong to original authors. Visit the link for more info
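The combination of k-mer comparison and single-linkage clustering described in the TQMD abstract above can be illustrated with a minimal Python sketch. This is not TQMD's code: the function names, the Jaccard distance, the threshold `max_dist`, the tiny `k`, and the rule of keeping the alphabetically first genome as the representative are all assumptions chosen for illustration, and the sketch leaves out the iterative and divide-and-conquer parts entirely.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two k-mer sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dereplicate(genomes, k=4, max_dist=0.5):
    """Single-linkage clustering of genomes by k-mer Jaccard distance.

    genomes: dict mapping a genome name to its sequence string.
    Returns one representative name per cluster (assumption: the
    alphabetically first name in each cluster).
    """
    names = sorted(genomes)
    sets = {n: kmers(genomes[n], k) for n in names}
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Single linkage: one close pair is enough to merge two clusters.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard_distance(sets[a], sets[b]) <= max_dist:
                ra, rb = find(a), find(b)
                if ra != rb:
                    # Keep the alphabetically smaller root as representative.
                    parent[max(ra, rb)] = min(ra, rb)
    return sorted({find(n) for n in names})


toy = {
    "g1": "ACGTACGTACGT",
    "g2": "ACGTACGTACGA",  # near-duplicate of g1
    "g3": "TTTTGGGGCCCC",  # unrelated
}
print(dereplicate(toy))
```

The all-pairs loop is quadratic in the number of genomes, which is the kind of cost a divide-and-conquer strategy is meant to avoid on large collections.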
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.15.383273v1?rss=1 Authors: Ren, J., Chaisson, M. Abstract: It is computationally challenging to detect variation by aligning long reads from single-molecule sequencing (SMS) instruments, or megabase-scale contigs from SMS assemblies. One approach to efficiently align long sequences is sparse dynamic programming (SDP), where exact matches are found between the sequence and the genome, and optimal chains of matches are found representing a rough alignment. Sequence variation is more accurately modeled when alignments are scored with a gap penalty that is a convex function of the gap length. Because previous implementations of SDP used a linear-cost gap function that does not accurately model variation, and implementations of alignment that have a convex gap penalty are either inefficient or use heuristics, we developed a method, lra, that uses SDP with a convex-cost gap penalty. We use lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. Across all data types, the runtime of lra is between 52-168% of the state of the art aligner minimap2 when generating SAM alignment, and 9-15% of an alternative method, ngmlr. This alignment approach may be used to provide additional evidence of SV calls in PacBio datasets, and an increase in sensitivity and specificity on ONT data with current SV detection algorithms. The number of calls discovered using pbsv with lra alignments are within 98.3-98.6% of calls made from minimap2 alignments on the same data, and give a nominal 0.2-0.4% increase in F1 score by Truvari analysis. On ONT data with SV called using Sniffles, the number of calls made from lra alignments is 3% greater than minimap2-based calls, and 30% greater than ngmlr based calls, with a 4.6-5.5% increase in Truvari F1 score. When applied to calling variation from de novo assembly contigs, there is a 5.8% increase in SV calls compared to minimap2+paftools, with a 4.3% increase in Truvari F1 score. Copy rights belong to original authors. Visit the link for more info
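To make the contrast between linear-cost and convex-cost gap penalties in the lra abstract above concrete, here is a small Python sketch that chains exact-match anchors under either gap function. It is an illustration only, not lra's algorithm: the chaining is a naive quadratic dynamic program rather than sparse dynamic programming, and the logarithmic gap cost, its constants, and all function names are assumptions. The logarithmic shape is used because each additional gap base costs less than the previous one, which is the behaviour usually meant by a convex gap penalty in the alignment literature.

```python
import math


def linear_gap_cost(gap_len, per_base=1.0):
    """Gap cost that grows in proportion to gap length."""
    return per_base * gap_len


def convex_gap_cost(gap_len, open_cost=2.0, scale=4.0):
    """Assumed log-shaped cost: long gaps are penalised sub-linearly."""
    if gap_len <= 0:
        return 0.0
    return open_cost + scale * math.log(gap_len)


def best_chain_score(anchors, gap_cost):
    """Score of the best chain of anchors under the given gap cost.

    anchors: list of (query_start, target_start, length) exact matches.
    Naive O(n^2) chaining; the gap between two anchors is the difference
    between the query distance and the target distance separating them.
    """
    anchors = sorted(anchors)
    best = []
    for i, (qi, ti, li) in enumerate(anchors):
        score = float(li)  # chain consisting of this anchor alone
        for j in range(i):
            qj, tj, lj = anchors[j]
            # Anchor j must end before anchor i starts on both sequences.
            if qj + lj <= qi and tj + lj <= ti:
                gap = abs((qi - (qj + lj)) - (ti - (tj + lj)))
                score = max(score, best[j] + li - gap_cost(gap))
        best.append(score)
    return max(best) if best else 0.0


# Two collinear anchors, then a third one after a 100-base insertion
# in the target (hypothetical coordinates).
toy_anchors = [(0, 0, 20), (30, 30, 20), (60, 160, 30)]
print(best_chain_score(toy_anchors, linear_gap_cost))
print(best_chain_score(toy_anchors, convex_gap_cost))
```

With the linear cost, the 100-base gap outweighs the third anchor and the chain stops after the second one; with the assumed convex cost, the chain extends across the gap, so the large indel stays inside a single alignment.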
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.15.383661v1?rss=1 Authors: Kuchroo, M., Huang, J., Wong, P., Grenier, J.-C., Shung, D., Tong, A., Lucas, C., Klein, J., Burkhardt, D., Gigante, S., Godavarthi, A., Israelow, B., Mao, T., Oh, J. E., Silva, J., Takahashi, T., Odio, C. D., Casanovas-Massana, A., Fournier, J., IMPACT Team, Y., Farhadian, S., Dela Cruz, C. S., Ko, A. I., Wilson, F. P., Hussin, J., Wolf, G., Iwasaki, A., Krishnaswamy, S. Abstract: The biomedical community is producing increasingly high dimensional datasets, integrated from hundreds of patient samples, which current computational techniques struggle to explore. To uncover biological meaning from these complex datasets, we present an approach called Multiscale PHATE, which learns abstracted biological features from data that can be directly predictive of disease. Built on a continuous coarse graining process called diffusion condensation, Multiscale PHATE creates a tree of data granularities that can be cut at coarse levels for high level summarizations of data, as well as at fine levels for detailed representations on subsets. We apply Multiscale PHATE to study the immune response to COVID-19 in 54 million cells from 168 hospitalized patients. Through our analysis of patient samples, we identify CD16hiCD66lo neutrophil and IFNγ+GranzymeB+ Th17 cell responses enriched in patients who die. Further, we show that population groupings Multiscale PHATE discovers can be directly fed into a classifier to predict disease outcome. We also use Multiscale PHATE-derived features to construct two different manifolds of patients, one from abstracted flow cytometry features and another directly on patient clinical features, both associating immune subsets and clinical markers with outcome. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.17.387068v1?rss=1 Authors: Malhotra, S., Joseph, A. P., Thiyagalingam, J., Topf, M. Abstract: Structures of macromolecular assemblies derived from cryo-EM maps often contain errors that become more abundant with decreasing resolution. Despite efforts in the cryo-EM community to develop metrics for the map and atomistic model validation, thus far, no specific scoring metrics have been applied systematically to assess the interface between the assembly subunits. Here, we have assessed protein-protein interfaces in macromolecular assemblies derived by cryo-EM. To this end, we developed PI-score, a density-independent machine learning-based metric, trained using protein-protein interfaces features in high-resolution crystal structures. Using PI-score, we were able to identify errors at interfaces in the PDB-deposited cryo-EM structures (including SARS-CoV-2 complexes) and in the models submitted for cryo-EM targets in CASP13 and the EM model challenge. Some of the identified errors, especially at medium-to-low resolution structures, were not captured by density-based assessment scores. Our method can therefore provide a powerful complementary assessment tool for the increasing number of complexes solved by cryo-EM. Copy rights belong to original authors. Visit the link for more info
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.16.385328v1?rss=1 Authors: Hong, R., Koga, Y., Bandyadka, S., Leshchyk, A., Wang, Z., Alabdullatif, S., Wang, Y., Akavoor, V., Cao, X., Sarfraz, I., Jansen, F., Johnson, W. E., Yajima, M., Campbell, J. D. Abstract: Performing comprehensive quality control is necessary to remove technical or biological artifacts in single-cell RNA sequencing (scRNA-seq) data. Artifacts in the scRNA-seq data, such as doublets or ambient RNA, can also hinder downstream clustering and marker selection and need to be assessed. While several algorithms have been developed to perform various quality control tasks, they are only available in different packages across various programming environments. No standardized workflow has been developed to streamline the generation and reporting of all quality control metrics from these tools. We have built an easy-to-use pipeline, named SCTK-QC, in the singleCellTK package that generates a comprehensive set of quality control metrics from a plethora of packages for quality control. We are able to import data from several preprocessing tools including CellRanger, STARSolo, BUSTools, dropEST, Optimus, and SEQC. Standard quality control metrics for each cell are calculated including the total number of UMIs, total number of genes detected, and the percentage of counts mapping to predefined gene sets such as mitochondrial genes. Doublet detection algorithms employed include scrublet, scds, doubletCells, and doubletFinder. DecontX is used to identify contamination in each individual cell. To make the data accessible in downstream analysis workflows, the results can be exported to common data structures in R and Python or to text files for use in any generic workflow. Overall, this pipeline will streamline and standardize quality control analyses for single cell RNA-seq data across different platforms. Copy rights belong to original authors. Visit the link for more info
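The standard per-cell metrics named in the SCTK-QC abstract above (total UMIs, number of genes detected, and percentage of counts in a predefined gene set such as mitochondrial genes) can be sketched in a few lines of Python. This is not the singleCellTK API: the function name, the dictionary-based input, and the use of the "MT-" gene-symbol prefix to select mitochondrial genes (a common convention for human gene symbols) are assumptions made for illustration.

```python
def per_cell_qc(counts, gene_names, mito_prefix="MT-"):
    """Compute basic quality control metrics for each cell.

    counts: dict mapping a cell barcode to a list of UMI counts,
            one entry per gene, in the same order as gene_names.
    mito_prefix: assumed naming convention for mitochondrial genes.
    """
    mito_idx = [i for i, g in enumerate(gene_names) if g.startswith(mito_prefix)]
    metrics = {}
    for cell, row in counts.items():
        total = sum(row)
        detected = sum(1 for c in row if c > 0)
        mito = sum(row[i] for i in mito_idx)
        # Guard against empty droplets with zero counts.
        pct_mito = 100.0 * mito / total if total else 0.0
        metrics[cell] = {
            "total_umis": total,
            "genes_detected": detected,
            "pct_mito": pct_mito,
        }
    return metrics


genes = ["MT-CO1", "ACTB", "GAPDH", "MT-ND1"]
toy_counts = {
    "cellA": [5, 10, 0, 5],
    "cellB": [0, 0, 0, 0],
}
for cell, m in per_cell_qc(toy_counts, genes).items():
    print(cell, m)
```

Cells with few detected genes or a high mitochondrial percentage are the usual candidates for filtering, although the thresholds depend on the tissue and protocol.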