Explore every episode of the podcast PaperPlayer biorxiv bioinformatics
Dive into the complete episode list for PaperPlayer biorxiv bioinformatics. Each episode is cataloged with detailed descriptions, making it easy to find and explore specific topics. Keep track of all episodes from your favorite podcast and never miss a moment of insightful content.
| Title | Pub. Date | Duration | |
|---|---|---|---|
| Performance Evaluation Of Prediction On Molecular Graphs With Graph Neural Networks | 21 Oct 2022 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2022.10.21.513175v1?rss=1
Authors: Li, H.
Abstract:
Machine learning and deep learning are novel and trending approaches to solving real-world scientific problems. Graph machine learning is dedicated to applying learning methods, such as graph neural networks, to non-Euclidean data such as graphs. Molecules, with their natural graph structures, can be analyzed by such methods. In this work, we evaluate performance in terms of learning results as well as time consumed, speedup, and efficiency, using different types of neural network structures and distributed training pipeline implementations. In addition, the reasons that lead to a less-than-ideal performance improvement are investigated. Code is available at https://github.com/htlee6/perf-analysis-dist-training-gnn.
Copyright belongs to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC | |||
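The speedup and efficiency metrics named in the abstract above are conventionally computed from wall-clock training times. The sketch below uses those textbook definitions (speedup as single-worker time over multi-worker time, efficiency as speedup per worker). The function names and the timings are illustrative assumptions, not values or code from the paper.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup: how many times faster the distributed run is than the single-worker run."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_single / t_parallel


def efficiency(t_single: float, t_parallel: float, n_workers: int) -> float:
    """Parallel efficiency: speedup divided by the number of workers (1.0 is ideal)."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return speedup(t_single, t_parallel) / n_workers


# Hypothetical timings (seconds per epoch), chosen only for illustration.
t_one_gpu = 120.0
t_four_gpus = 40.0
print("speedup:", speedup(t_one_gpu, t_four_gpus))
print("efficiency:", efficiency(t_one_gpu, t_four_gpus, n_workers=4))
```

An efficiency well below 1.0 is the kind of "less-than-ideal" scaling that communication and data-loading overhead typically cause in distributed training.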
| Graph Regularized Probabilistic Matrix Factorization for Drug-Drug Interactions Prediction | 21 Oct 2022 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2022.10.18.512676v1?rss=1
Authors: Jain, S., Chouzenoux, E., Kumar, K., Majumdar, A.
Abstract:
Co-administration of two or more drugs can result in adverse drug reactions. Identifying drug-drug interactions (DDIs) is necessary, especially for drug development and for repurposing old drugs. DDI prediction can be viewed as a matrix completion task, for which matrix factorization (MF) appears as a suitable solution. This paper presents a novel Graph Regularized Probabilistic Matrix Factorization (GRPMF) method, which incorporates expert knowledge through a novel graph-based regularization strategy within an MF framework. An efficient and sound optimization algorithm is proposed to solve the resulting non-convex problem in an alternating fashion. The performance of the proposed method is evaluated on the DrugBank dataset, and comparisons are provided against state-of-the-art techniques. The results demonstrate the superior performance of GRPMF when compared to its counterparts.
Podcast created by Paper Player, LLC | |||
| Multi-histone ChIP-Seq Analysis with DecoDen | 21 Oct 2022 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2022.10.18.512665v1?rss=1
Authors: Narendra, T., Visona, G., Cardona, C. d. J., Schweikert, G.
Abstract:
Epigenetic mechanisms coordinate packaging, accessibility and read-out of the DNA sequence within the chromatin context. They significantly contribute to the regulation of gene expression. Thus, they play fundamental roles during differentiation on the one hand and maintenance and propagation of cell identity on the other. Epigenetic malfunctioning is associated with a large range of diseases, from neurodevelopmental disorders to cancer progression. In humans, hundreds of known epigenetic factors and complexes are involved in establishing covalent modifications on the DNA sequence itself and on associated histone proteins. Within the cellular context, the resulting combinatorial epigenomic patterns are neither established nor interpreted independently of each other and therefore exhibit high correlations in a region-specific manner. Post-translational modifications of histone proteins can be analysed using Chromatin Immunoprecipitation followed by sequencing (ChIP-Seq). Often, several assays for a number of different histone modifications are performed as part of the same experimental design. These measurements are, however, confounded by shared biases including chromatin accessibility, PCR amplification and mappability. Existing computational methods analyse each histone modification separately, while often also merging biological or technical replicates. We introduce DecoDen, a new approach that leverages replicates and multi-histone ChIP-Seq experiments for a fixed cell type to learn and remove shared biases. DecoDen (Deconvolve and Denoise) consists of two major steps: We use non-negative matrix factorisation (NMF) to learn a joint cell-type specific signal. Half-sibling regression (HSR) is then used to correct for the cell-type specific biases in the histone modification signals. 
We demonstrate that DecoDen is a robust and interpretable method that enables the unbiased discovery of subtle peaks, which are particularly important in an individual-specific context.
Podcast created by Paper Player, LLC | |||
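The first DecoDen step described above relies on non-negative matrix factorisation (NMF). As a rough illustration only, the sketch below implements generic NMF with the classic multiplicative update rules on a toy matrix. It is not the DecoDen implementation; the function name, the rank, and the matrix layout (rows as assays, columns as genomic bins) are assumptions made for the example, and the half-sibling regression step is not shown.

```python
import numpy as np


def nmf(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorise a non-negative matrix X (n x m) as W (n x rank) @ H (rank x m).

    Uses multiplicative updates for the Frobenius-norm objective, which keep
    W and H non-negative as long as X and the initial factors are non-negative.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Update H with W fixed, then W with H fixed.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy data: 6 hypothetical assays over 8 bins, generated from 2 hidden signals.
rng = np.random.default_rng(1)
X = rng.random((6, 2)) @ rng.random((2, 8))
W, H = nmf(X, rank=2)
print("relative reconstruction error:", np.linalg.norm(X - W @ H) / np.linalg.norm(X))
```

In this toy setting the rows of `H` play the role of shared signals across assays and `W` holds each assay's loading on them.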
| The BRAIN Initiative Cell Census Data Ecosystem: A User's Guide | 30 Oct 2022 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2022.10.26.513573v1?rss=1
Authors: BICCN Data Ecosystem Collaboration, Hawrylycz, M. J., Martone, M. E., Hof, P. R., Lein, E. S., Regev, A., Ascoli, G. A. A., Bjaalie, J. G., Dong, H.-W., Ghosh, S. S., Gillis, J., Hertzano, R., Haynor, D. R., Kim, Y., Liu, Y., Miller, J. A., Mitra, P. P., Mukamel, E., Osumi-Sutherland, D., Peng, H., Ray, P. L., Sanchez, R., Ropelewski, A., Scheuermann, R. H., Tan, S. Z. K., Tickle, T., Tilgner, H., Varghese, M., Wester, B., White, O., Aevermann, B., Allemang, D., Ament, S., Athey, T. L., Baker, P. M., Baker, C., Baker, K. S., Bandrowski, A., Bishwakarma, P., Carr, A., Chen, M., Choudhury, R.,
Abstract:
Characterizing cellular diversity at different levels of biological organization across data modalities is a prerequisite to understanding the function of cell types in the brain. Classification of neurons is also required to manipulate cell types in controlled ways, and to understand their variation and vulnerability in brain disorders. The BRAIN Initiative Cell Census Network (BICCN) is an integrated network of data generating centers, data archives and data standards developers, with the goal of systematic multimodal brain cell type profiling and characterization. Emphasis of the BICCN is on the whole mouse brain and demonstration of prototypes for human and non-human primate (NHP) brains. Here, we provide a guide to the cellular and spatial approaches employed, and to accessing and using the BICCN data and its extensive resources, including the BRAIN Cell Data Center (BCDC) which serves to manage and integrate data across the ecosystem. We illustrate the power of the BICCN data ecosystem through vignettes highlighting several BICCN analysis and visualization tools. Finally, we present emerging standards that have been developed or adopted by the BICCN toward FAIR (Wilkinson et al. 2016a) neuroscience. The combined BICCN ecosystem provides a comprehensive resource for the exploration and analysis of cell types in the brain.
Podcast created by Paper Player, LLC | |||
| Phased nanopore assembly with Shasta and modular graph phasing with GFAse | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.21.529152v1?rss=1
Authors: Lorig-Roach, R., Meredith, M., Monlong, J., Jain, M., Olsen, H., McNulty, B., Porubsky, D., Montague, T. G., Lucas, J., Condon, C., Eizenga, J., Juul, S., McKenzie, S., Simmonds, S., Park, J., Asri, M., Koren, S., Eichler, E., Axel, R., Martin, B., Carnevali, P., Miga, K., Paten, B.
Abstract:
As a step towards simplifying and reducing the cost of haplotype resolved de novo assembly, we describe new methods for accurately phasing nanopore data with the Shasta genome assembler and a modular tool for extending phasing to the chromosome scale called GFAse. We test using new variants of Oxford Nanopore Technologies' (ONT) PromethION sequencing, including those using proximity ligation and show that newer, higher accuracy ONT reads substantially improve assembly quality.
Podcast created by Paper Player, LLC | |||
| Identification of novel prognostic targets in coronary artery disease and related complications using bioinformatics and next generation sequencing data analysis | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.22.529500v1?rss=1
Authors: Vastrad, B. M., Vastrad, C. M.
Abstract:
Coronary artery disease (CAD) is the most common cardiovascular disorder and the leading cause of heart-related deaths in the world. An increasing number of molecular targets have been discovered for the prognosis and therapy of CAD and CAD-related complications. However, there is still an urgent need to identify novel biomarkers. Therefore, we evaluated biomarkers that might help the diagnosis and treatment of CAD and CAD-related complications. We searched a next generation sequencing (NGS) dataset (GSE202625) and identified differentially expressed genes (DEGs) by comparing CAD and normal control samples using DESeq2. Gene ontology (GO) and pathway enrichment analyses of the DEGs were performed using the g:Profiler online database. The protein-protein interaction (PPI) network was plotted with the IMEx interactome and visualized using Cytoscape. Module analysis of the PPI network was done using PEWCC1. MiRNA hub gene regulatory network and TF hub gene regulatory network analyses were performed to identify the hub genes, miRNAs and TFs. Receiver operating characteristic (ROC) curve analysis was used to predict the diagnostic effectiveness of the hub genes. A total of 118 DEGs (479 up regulated genes and 479 down regulated genes) were detected. The GO enrichment analysis indicated that the DEGs were most significantly enriched in cellular response to stimulus and biosynthetic process. The REACTOME pathway enrichment analysis revealed that the DEGs were most significantly enriched in immune system and eukaryotic translation elongation. PPI network, modules, miRNA hub gene regulatory network and TF hub gene regulatory network analysis demonstrated that EGR1, SIRT1, STAT1, LRRK2, HIF1A, CSNK2B, RPS3, RPS2, RPS4X and HDAC11 were the hub genes. On the whole, the findings of this study enhance our understanding of the potential molecular mechanisms of CAD and CAD-related complications, and provide potential targets for further research.
Podcast created by Paper Player, LLC | |||
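The ROC curve analysis mentioned in the abstract above is usually summarised by the area under the curve (AUC). A minimal sketch of that quantity follows, computed as the fraction of (case, control) pairs in which the case receives the higher score, with ties counted as one half. The function name and the example scores are hypothetical and are not data from the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: numeric values where higher means "more likely positive".
    labels: 1 for positive samples (e.g. CAD), 0 for negative samples (e.g. control).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical expression-based scores for four samples (two controls, two cases).
print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

The pairwise form is quadratic in the number of samples, which is fine for small cohorts; rank-based formulas give the same value more efficiently for large ones.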
| Reconstruction of TrkB complex assemblies and localizing antidepressant targets using Artificial Intelligence | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.21.529454v1?rss=1
Authors: Qian, C., Xiang, X., Yao, H., Li, P., Cheng, B., Wei, D., An, W., Lu, Y., Chu, M., Wei, L., Asakawa, T., Xu, J., Xia, F., Liu, X., Liu, B.-F.
Abstract:
Since Major Depressive Disorder (MDD) represents a neurological pathology caused by inter-synaptic messaging errors, membrane receptors, the source of signal cascades, constitute appealing drug targets. G protein-coupled receptors (GPCRs) and ion channel receptors chelated antidepressants (ADs) high-resolution architectures were reported to realize receptors' physical mechanism and design prototype compounds with minimal side effects. Tyrosine kinase receptor 2 (TrkB), a receptor that directly modulates synaptic plasticity, has a finite three-dimensional chart due to its high molecular mass and intrinsically disordered regions (IDRs). Leveraging breakthroughs in deep learning, the meticulous architecture of TrkB was projected employing AlphaFold 2 (AF2). Furthermore, the AlphaFold-Multimer algorithm (AF-M) models the coupling of intra- and extra-membrane topologies to chaperones: mBDNF, SHP2, etc. Conjugating firmly dimeric transmembrane helix with novel compounds like 2R,6R-hydroxynorketamine (2R,6R-HNK) expands scopes of drug screening to encompass all coding sequences throughout genomes. The operational implementation of TrkB kinase-SHP2, PLCγ1, and SHC1 ensembles has paved the path for machine learning in which it can forecast structural transitions in the self-assembly and self-dissociation of molecules during trillions of cellular mechanisms. In silico, the cornerstone of the alteration will be artificial intelligence (AI), empowering signal networks to operate at the atomic level and picosecond timescales.
Podcast created by Paper Player, LLC | |||
| DR-BERT: A Protein Language Model to Annotate Disordered Regions | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.22.529574v1?rss=1
Authors: Nambiar, A., Forsyth, J. M., Liu, S., Maslov, S.
Abstract:
Despite their lack of a rigid structure, intrinsically disordered regions in proteins play important roles in cellular functions, including mediating protein-protein interactions. Therefore, it is important to computationally annotate disordered regions of proteins with high accuracy. Most popular tools use evolutionary or biophysical features to make predictions of disordered regions. In this study, we present DR-BERT, a compact protein language model that is first pretrained on a large number of unannotated proteins before being trained to predict disordered regions. Although it does not use any evolutionary or biophysical information, DR-BERT shows a statistically significant improvement when compared to several existing methods on a gold standard dataset. We show that this performance is due to the information learned during pre-training and DR-BERT's ability to use contextual information. A web application for using DR-BERT is available at https://huggingface.co/spaces/nambiar4/DR-BERT and the code to run the model can be found at https://github.com/maslov-group/DR-BERT.
Podcast created by Paper Player, LLC | |||
| Utilizing Pre-trained Network Medicine Models for Generating Biomarkers, Targets, Re-purposing Drugs, and Personalized Therapeutic Regimes: COVID-19 Applications | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.21.527754v1?rss=1
Authors: Xiong, J.
Abstract:
In this paper, we present a novel pre-trained network medicine model called Selective Remodeling of Protein Networks by Chemicals (SEMO). We divide the global human protein-protein interaction (PPI) network into smaller sub-networks, and quantify the potential effects of chemicals by statistically comparing their target and non-target gene sets. By combining 9607 PPI gene sets with 2658 chemicals, we created a pre-trained pool of SEMOs, which we then used to identify SEMOs related to Covid-19 severity using DNA methylation profiling data from two clinical cohorts. The nutraceutical-derived SEMO features provided an effective model for predicting Covid-19 severity, with an AUC score of 81% in the training data and 80% in the independent validation data. Our findings suggest that Vitamin D3, Lipoic Acid, Citrulline, and Niacin, along with their associated protein networks, particularly STAT1, MMP2, CD8A, and CXCL8 as hub nodes, could be used to effectively predict Covid-19 severity. Furthermore, the severity-associated SEMOs were found to be significantly correlated with CD4+ and monocyte cell proportions. These insights can be used to generate personalized nutraceutical regimes by ranking the relative severity risk associated with each SEMO. Thus, our pre-trained SEMO model can serve as a fundamental knowledge map when coupled with DNA methylation measurements, allowing us to simultaneously generate biomarkers, targets, re-purposing drugs, and nutraceutical interventions.
Podcast created by Paper Player, LLC | |||
| Fast Identification of Optimal Monotonic Classifiers | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.22.529510v1?rss=1
Authors: Fourquet, O., Krejca, M. S., Doerr, C., Schwikowski, B.
Abstract:
Motivation: Monotonic bivariate classifiers can describe simple patterns in high-dimensional data that may not be discernible using only elementary linear decision boundaries. Such classifiers are relatively simple, easy to interpret, and do not require large amounts of data to be effective. A challenge is that finding optimal pairs of features from a vast number of possible pairs tends to be computationally intensive, limiting the applicability of these classifiers. Results: We prove a simple mathematical inequality and show how it can be exploited for the faster identification of optimal feature combinations. Our empirical results suggest speedups of 10x-20x, relative to the previous, naive, approach in applications. This result thus greatly extends the range of possible applications for bivariate monotonic classifiers. In addition, we provide the first open-source code to identify optimal monotonic bivariate classifiers. Availability: https://gitlab.pasteur.fr/ofourque/mem_python
Podcast created by Paper Player, LLC | |||
| choros: correction of sequence-based biases for accurate quantification of ribosome profiling data | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.21.529452v1?rss=1
Authors: Mok, A., Tunney, R., Benegas, G., Wallace, E. W. J., Lareau, L. F.
Abstract:
Ribosome profiling quantifies translation genome-wide by sequencing ribosome-protected fragments, or footprints. Its single-codon resolution allows identification of translation regulation, such as ribosome stalls or pauses, on individual genes. However, enzyme preferences during library preparation lead to pervasive sequence artifacts that obscure translation dynamics. Widespread over- and under-representation of ribosome footprints can dominate local footprint densities and skew estimates of elongation rates by up to five fold. To address these biases and uncover true patterns of translation, we present choros, a computational method that models ribosome footprint distributions to provide bias-corrected footprint counts. choros uses negative binomial regression to accurately estimate two sets of parameters: (i) biological contributions from codon-specific translation elongation rates; and (ii) technical contributions from nuclease digestion and ligation efficiencies. We use these parameter estimates to generate bias correction factors that eliminate sequence artifacts. Applying choros to multiple ribosome profiling datasets, we are able to accurately quantify and attenuate ligation biases to provide more faithful measurements of ribosome distribution. We show that a pattern interpreted as pervasive ribosome pausing near the beginning of coding regions is likely to arise from technical biases. Incorporating choros into standard analysis pipelines will improve biological discovery from measurements of translation.
Podcast created by Paper Player, LLC | |||
| Dimensionality reduction methods for extracting functional networks from large-scale CRISPR screens | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.22.529573v1?rss=1
Authors: Hassan, A. Z., Ward, H. N., Rahman, M., Billmann, M., Lee, Y., Myers, C. L.
Abstract:
CRISPR-Cas9 screens facilitate the discovery of gene functional relationships and phenotype-specific dependencies. The Cancer Dependency Map (DepMap) is the largest compendium of whole-genome CRISPR screens aimed at identifying cancer-specific genetic dependencies across human cell lines. A mitochondria-associated bias has been previously reported to mask signals for genes involved in other functions, and thus, methods for normalizing this dominant signal to improve co-essentiality networks are of interest. In this study, we explore three unsupervised dimensionality reduction methods - autoencoders, robust, and classical principal component analyses (PCA) - for normalizing the DepMap to improve functional networks extracted from these data. We propose a novel onion normalization technique to combine several normalized data layers into a single network. Benchmarking analyses reveal that robust PCA combined with onion normalization outperforms existing methods for normalizing the DepMap. Our work demonstrates the value of removing low-dimensional signals from the DepMap before constructing functional gene networks and provides generalizable dimensionality reduction-based normalization tools.
Podcast created by Paper Player, LLC | |||
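The idea of removing a dominant low-dimensional signal before building a network, as described in the abstract above, can be illustrated with classical PCA in its simplest form: centre the matrix and subtract its top principal components. The sketch below is only that baseline idea; the function name, the number of components removed, and the toy matrix are assumptions, and the paper's robust PCA, autoencoder, and onion normalization steps are not reproduced.

```python
import numpy as np


def remove_top_components(X, k):
    """Centre the columns of X and subtract its first k principal components.

    X: 2-D array (e.g. genes x cell lines in a dependency matrix).
    k: number of leading components treated as dominant signal to remove.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rank-k reconstruction of the centred data from the leading components.
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k]
    return Xc - low_rank


# Toy matrix: one strong shared signal plus small noise (hypothetical data).
rng = np.random.default_rng(0)
dominant = np.outer(rng.standard_normal(50), rng.standard_normal(10))
X = 5.0 * dominant + 0.1 * rng.standard_normal((50, 10))
X_norm = remove_top_components(X, k=1)
print("norm before:", np.linalg.norm(X), "norm after:", np.linalg.norm(X_norm))
```

A co-essentiality network would then be built from correlations between rows of the normalized matrix rather than the raw one.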
| Flame (v2.0): advanced integration and interpretation of functional enrichment results from multiple sources | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.21.529389v1?rss=1
Authors: Karatzas, E., Baltoumas, F., Aplakidou, E., Kontou, P., Stathopoulos, P., Stefanis, L., Bagos, P., Pavlopoulos, G. A.
Abstract:
Functional enrichment is the process of identifying implicated functional terms from a given input list of genes or proteins. In this article, we present Flame (v2.0), a web tool which offers a combinatorial approach through merging and visualizing results from widely-used functional enrichment applications while also allowing various flexible input options. In this version, Flame utilizes the aGOtool, g:Profiler, WebGestalt and Enrichr pipelines and presents their outputs separately or in combination following a visual analytics approach. For intuitive representations and easier interpretation, it uses interactive plots such as parameterizable networks, heatmaps, barcharts and scatter plots. Users can also: (i) handle multiple protein/gene lists and analyze union and intersection sets simultaneously through interactive UpSet plots, (ii) automatically extract genes and proteins from free text through text-mining and Named Entity Recognition (NER) techniques, (iii) upload single nucleotide polymorphisms (SNPs) and extract their relative genes or (iv) analyze multiple lists of differentially-expressed proteins/genes after selecting them interactively from a parameterizable volcano plot. Compared to the previous version of 197 supported organisms, Flame (v2.0) currently allows enrichment for 14,436 organisms.
Podcast created by Paper Player, LLC | |||
| Investigating racial disparities in carcinomas through TCGA transcriptomic and proteomic database | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.21.529336v1?rss=1
Authors: Lei, B., Jiang, X., Saxena, A.
Abstract:
Epidemiological studies highlight a disparity in cancer incidence and outcome rates between racial groups in the United States. In our study, we investigated molecular differences among racial groups in 10 carcinoma types. We used publicly available data from The Cancer Genome Atlas to identify patterns of differential gene expression in tumors obtained from 4,112 White, Black/African American, and Asian patients. We identified race-dependent expression of numerous genes whose mRNA transcript levels were significantly correlated with patient survival. A small subset of these genes was differentially expressed in multiple carcinomas, including genes involved in cell cycle progression such as CCNB1, CCNE1, CCNE2, and FOXM1. In contrast, genes such as transcriptional factor ETS1 and apoptotic gene BAK1 were differentially expressed and clinically significant only in specific cancer types. Our analyses also revealed race-dependent regulation of relevant pathways. Importantly, homology directed repair and ERBB4-mediated nuclear signaling were both upregulated in Black patients compared to Whites in four carcinoma types. This large-scale pan-cancer study refines our understanding of the cancer health disparity and can help inform the use of novel biomarkers in clinical settings as well as the future development of precision therapies.
Podcast created by Paper Player, LLC | |||
| Mining impactful discoveries from the biomedical literature | 30 Oct 2022 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2022.10.28.514184v1?rss=1
Authors: Moreau, E., Hardiman, O., Heverin, M., O'Sullivan, D.
Abstract:
Motivation: Literature-Based Discovery (LBD) aims to help researchers to identify relations between concepts which are worthy of further investigation by text-mining the biomedical literature. While the LBD literature is rich and the field is considered mature, standard practice in the evaluation of LBD methods is methodologically poor and has not progressed on par with the domain. The lack of a properly designed and decent-sized benchmark dataset hinders the progress of the field and its development into applications usable by biomedical experts. Results: This work presents a method for mining past discoveries from the biomedical literature. It leverages the impact made by a discovery, using descriptive statistics to detect surges in the prevalence of a relation across time. This method allows the collection of a large number of time-stamped discoveries which can be used for LBD evaluation or other applications. The validity of the method is tested against a baseline representing the state-of-the-art "time sliced" method. Availability: The source data used in this article are publicly available. The implementation and the resulting data are published under an open-source license (code: https://github.com/erwanm/medline-discoveries; datasets: https://zenodo.org/record/5888572). An online exploration tool is also provided at https://brainmend.adaptcentre.ie/.
Podcast created by Paper Player, LLC | |||
| SPA-STOCSY: An Automated Tool for Identification of Annotated and Non-Annotated Metabolites in High-Throughput NMR Spectra | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.22.529564v1?rss=1
Authors: Han, X., Wang, W., Ma, L., Ramahi, I. A., Botas, J., MacKenzie, K., Allen, G. I., Young, D. W., Liu, Z., Maletic-Savatic, M.
Abstract:
Nuclear Magnetic Resonance (NMR) spectroscopy is widely used to analyze metabolites in biological samples, but the analysis can be cumbersome and inaccurate. Here, we present a powerful automated tool, SPA-STOCSY (Spatial Clustering Algorithm - Statistical Total Correlation Spectroscopy), which overcomes the challenges by identifying metabolites in each sample with high accuracy. As a data-driven method, SPA-STOCSY estimates all parameters from the input dataset, first investigating the covariance pattern and then calculating the optimal threshold with which to cluster data points belonging to the same structural unit, i.e. metabolite. The generated clusters are then automatically linked to a compound library to identify candidates. To assess SPA-STOCSY efficiency and accuracy, we applied it to synthesized and real NMR data obtained from Drosophila melanogaster brains and human embryonic stem cells. In the synthesized spectra, SPA outperforms Statistical Recoupling of Variables, an existing method for clustering spectral peaks, by capturing a higher percentage of the signal regions and the close-to-zero noise regions. In the real spectra, SPA-STOCSY performs comparably to operator-based Chenomx analysis but avoids operator bias and performs the analyses in less than seven minutes of total computation time. Overall, SPA-STOCSY is a fast, accurate, and unbiased tool for untargeted analysis of metabolites in the NMR spectra. As such, it might accelerate the utilization of NMR for scientific discoveries, medical diagnostics, and patient-specific decision-making.
Podcast created by Paper Player, LLC | |||
| CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking | 22 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.21.529443v1?rss=1
Authors: Dhakal, A., Gyawali, R., Wang, L., Cheng, J.
Abstract:
Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Though the emerging machine learning-based particle picking can potentially automate the process, its development is severely hindered by lack of large, high-quality, manually labelled training data. Here, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for single protein particle picking and analysis to address this bottleneck. It consists of manually labelled cryo-EM micrographs of 32 non-redundant, representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). It includes 9,089 diverse, high-resolution micrographs (~300 cryo-EM images per EMPIAR dataset) in which the coordinates of protein particles were labelled by human experts. The protein particle labelling process was rigorously validated by both 2D particle class validation and 3D density map validation with the gold standard. The dataset is expected to greatly facilitate the development of machine learning and artificial intelligence methods for automated cryo-EM protein particle picking. The dataset and data processing scripts are available at https://github.com/BioinfoMachineLearning/cryoppp
Podcast created by Paper Player, LLC | |||
| Pacybara: Accurate long-read sequencing for barcoded mutagenized allelic libraries | 23 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.22.529427v1?rss=1
Authors: Weile, J., Cote, A. G., Kishore, N., Tabet, D., van Loggerenberg, W., Rayhan, A., Roth, F. P.
Abstract:
Long read sequencing technologies, an attractive solution for many applications, usually suffer from higher error rates. Alignment of multiple reads can improve base-calling accuracy, but some applications, e.g. the sequencing of mutagenized libraries where multiple distinct clones differ by one or few variants, require the use of barcodes or unique molecular identifiers. Unfortunately, not only can sequencing errors interfere with correct barcode identification, but a given barcode sequence may be linked to multiple independent clones within a given library. Here we focus on the target application of sequencing mutagenized libraries in the context of multiplexed assays of variant effects (MAVEs). MAVEs are increasingly used to create comprehensive genotype-phenotype maps that can aid clinical variant interpretation. Many MAVE methods use barcoded mutant libraries and thus require the accurate association of barcode with genotype, e.g. using long-read sequencing. Existing pipelines do not account for inaccurate sequencing or non-unique barcodes. Here, we describe Pacybara, which handles these issues by clustering long reads based on the similarities of (error-prone) barcodes while detecting the association of a single barcode with multiple genotypes. Pacybara also detects recombinant (chimeric) clones and reduces false positive indel calls. In an example application, we show that Pacybara increases the sensitivity of a MAVE-derived missense variant effect map.
Podcast created by Paper Player, LLC | |||
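The barcode clustering idea in the Pacybara abstract above can be illustrated with a minimal Python sketch. This is not Pacybara's actual algorithm: the greedy Hamming-distance clustering, the function names `cluster_barcodes` and `flag_multi_genotype`, and the thresholds `max_dist` and `min_fraction` are all assumptions chosen for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Count mismatching positions; barcodes of unequal length count as fully different."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def cluster_barcodes(reads, max_dist=1):
    """Greedily group (barcode, genotype) reads whose barcode lies within max_dist of a seed."""
    counts = Counter(bc for bc, _ in reads)
    clusters = []
    # Visit frequent barcodes first, so seeds are more likely to be error-free (assumption).
    for bc, genotype in sorted(reads, key=lambda r: -counts[r[0]]):
        for cluster in clusters:
            if hamming(bc, cluster["seed"]) <= max_dist:
                cluster["members"].append((bc, genotype))
                break
        else:
            clusters.append({"seed": bc, "members": [(bc, genotype)]})
    return clusters


def flag_multi_genotype(clusters, min_fraction=0.25):
    """Return seeds of clusters in which a second genotype has substantial read support."""
    flagged = []
    for cluster in clusters:
        tally = Counter(g for _, g in cluster["members"]).most_common()
        if len(tally) > 1 and tally[1][1] / len(cluster["members"]) >= min_fraction:
            flagged.append(cluster["seed"])
    return flagged


# Hypothetical toy reads: (barcode, genotype) pairs.
reads = [
    ("AAAA", "wt"), ("AAAA", "wt"), ("AAAT", "wt"),
    ("CCCC", "A12V"), ("CCCC", "G5D"), ("CCCC", "A12V"), ("CCCC", "G5D"),
]
clusters = cluster_barcodes(reads)
ambiguous = flag_multi_genotype(clusters)
```

In this toy input the error-containing barcode `AAAT` is absorbed into the `AAAA` cluster, while the `CCCC` barcode is flagged because two genotypes share it. A real pipeline would also need to handle indels in barcodes, which Hamming distance ignores.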
| Genetic dependencies associated with transcription factor activities in human cancer cell lines | 23 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.23.529701v1?rss=1
Authors: Thatikonda, V., Supper, V., C. Ravichandran, M., J. Lipp, J., S. Boghossian, A., G. Rees, M., M. Ronan, M., A. Roth, J., Grosche, S., A. Neumüller, R., Mair, B., Mauri, F., Popa, A.
Abstract:
Transcription factors (TFs) are key components of the aberrant transcriptional programs in cancer cells. In this study, we used TF activity (TFa), inferred from the downstream regulons as a potential biomarker to identify associated genetic vulnerabilities in cancer cells. Our linear model framework, integrating TFa and genome-wide CRISPR knockout datasets identified 1,770 candidate TFa target pairs across different cancer types and assessed their survival impact in patient data. As a proof of concept, through inhibitor screens and genetic depletion assays in cell lines, we validated the dependency of cell lines on predicted targets linked to TEAD1, the most prominent TF from our analysis. Overall, these candidate pairs represent an attractive resource for early-stage targets and drug discovery programs in oncology.
Podcast created by Paper Player, LLC | |||
| Retrieved Sequence Augmentation for Protein Representation Learning | 23 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.22.529597v1?rss=1
Authors: Ma, C., Zhao, H., Zheng, L., Xin, J., Li, Q., Wu, L., Deng, Z., Lu, Y., Liu, Q., Kong, L.
Abstract:
Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in functions and structures, and current state-of-the-art models including the latest version of AlphaFold rely on Multiple Sequence Alignments (MSA) to feed in the evolutionary knowledge. Despite their success, heavy computational overheads, as well as the de novo and orphan proteins remain great challenges in protein representation learning. In this work, we show that MSA-augmented models inherently belong to retrieval-augmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein representation learning without additional alignment or pre-processing. RSA links query protein sequences to a set of sequences with similar structures or properties in the database and combines these sequences for downstream prediction. We show that protein language models benefit from the retrieval enhancement on both structure prediction and property prediction tasks, with a 5% improvement on MSA Transformer on average while being 373 times faster. In addition, we show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction. Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences. Code is available on https://github.com/HKUNLP/RSA.
Podcast created by Paper Player, LLC | |||
| A mechanistic simulation of molecular cell states over time | 23 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.23.529720v1?rss=1
Authors: Erbe, R., Stein-O'Brien, G., Fertig, E. J.
Abstract:
Computer simulations of cell behaviors and dynamics allow for investigation of aspects of cellular biology with a ground truth that is currently difficult or impossible to generate from experimentally generated profiling data. Here, we present a mechanistic simulation of cell states that models the stochastic interactions of molecules revealing the DNA accessibility, RNA expression, and protein expression state of a simulated cell and how these states evolve over time. By designing each component to correspond to a specific biological molecule or parameter, the simulation becomes highly interpretable. From the simulated cells generated, we explore the importance of parameters such as splicing and degradation rates of genes on RNA and protein expression, demonstrating that perturbing these parameters leads to changes in long term gene and protein expression levels. We observe that the expression levels of corresponding RNA and proteins are not necessarily well correlated and identify mechanistic explanations that may help explain the similar phenomenon that has been observed in real cells. We evaluate whether the RNA data output from the simulation provides sufficient information to reconstruct the underlying regulatory relationships between genes. While predictive relationships can be inferred, direct causal regulatory relationships between genes cannot be reliably distinguished from other predictive relationships between genes arising independently from a direct regulatory mechanism. We observe the same inability to robustly distinguish causal gene regulatory relationships using simulated data from the simpler BoolODE model, suggesting this may be a limitation to the identifiability of network inference.
Podcast created by Paper Player, LLC | |||
| Dual-modality imaging of immunofluorescence and imaging mass cytometry for whole slide imaging with accurate single-cell segmentation | 23 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.23.529718v1?rss=1
Authors: Kim, E. N., Chen, P. Z., Bressan, D., Tripathi, M., Miremadi, A., di Pietro, M., Coussens, L. M., Hannon, G. J., Fitzgerald, R. C., Zhuang, L., Chang, Y. H.
Abstract:
Imaging mass cytometry (IMC) is a powerful multiplexed tissue imaging technology that allows simultaneous detection of more than 30 markers on a single slide. It has been increasingly used for single-cell-based spatial phenotyping in a wide range of samples. However, it only acquires a small, rectangular field of view (FOV) with a low image resolution that hinders downstream analysis. Here, we report a highly practical dual-modality imaging method that combines high-resolution immunofluorescence (IF) and high-dimensional IMC on the same tissue slide. Our computational pipeline uses the whole slide image (WSI) of IF as a spatial reference and integrates the small IMC FOVs into a WSI of IMC. The high-resolution IF images enable accurate single-cell segmentation to extract robust high-dimensional IMC features for downstream analysis. We applied this method in esophageal adenocarcinoma of different stages, identified the single-cell pathology landscape via reconstruction of WSI IMC images, and demonstrated the advantage of the dual-modality imaging strategy.
Podcast created by Paper Player, LLC | |||
| Ensemble deep learning of embeddings for clustering multimodal single-cell omics data | 23 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.22.529627v1?rss=1
Authors: Yu, L., Liu, C., Yang, J. Y. H., Yang, P.
Abstract:
Motivation: Recent advances in multimodal single-cell omics technologies enable multiple modalities of molecular attributes, such as gene expression, chromatin accessibility, and protein abundance, to be profiled simultaneously at a global level in individual cells. While the increasing availability of multiple data modalities is expected to provide a more accurate clustering and characterisation of cells, the development of computational methods that are capable of extracting information embedded across data modalities is still in its infancy. Results: We propose SnapCCESS for clustering cells by integrating data modalities in multimodal single-cell omics data using an unsupervised ensemble deep learning framework. By creating snapshots of embeddings of multimodality using variational autoencoders, SnapCCESS can be coupled with various clustering algorithms for generating consensus clustering of cells. We applied SnapCCESS with several clustering algorithms to various datasets generated from popular multimodal single-cell omics technologies. Our results demonstrate that SnapCCESS is effective and more efficient than conventional ensemble deep learning-based clustering methods and outperforms other state-of-the-art multimodal embedding generation methods in integrating data modalities for clustering cells. The improved clustering of cells from SnapCCESS will pave the way for more accurate characterisation of cell identity and types, an essential step for various downstream analyses of multimodal single-cell omics data. Availability and implementation: SnapCCESS is implemented as a Python package and is freely available from https://github.com/yulijia/SnapCCESS.
Podcast created by Paper Player, LLC | |||
| Elucidation of Genome-wide Understudied Proteins targeted by PROTAC-induced degradation using Interpretable Machine Learning | 24 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.23.529828v1?rss=1
Authors: Xie, L., Xie, L.
Abstract:
Proteolysis-targeting chimeras (PROTACs) are hetero-bifunctional molecules. They induce the degradation of a target protein by recruiting an E3 ligase to the target. PROTACs can inactivate disease-related genes that are considered understudied, and thus have great potential as a new type of therapy for incurable diseases. However, only hundreds of proteins have been experimentally tested for amenability to PROTACs. It remains unclear which other proteins in the human genome can be targeted by PROTACs. For the first time, we have developed an interpretable machine learning model, PrePROTAC, which is based on a transformer-based protein sequence descriptor and random forest classification, to predict genome-wide PROTAC-induced targets degradable by CRBN, one of the E3 ligases. In the benchmark studies, PrePROTAC achieved a ROC-AUC of 0.81, a PR-AUC of 0.84, and over 40% sensitivity at a false positive rate of 0.05. Furthermore, we developed an embedding SHapley Additive exPlanations (eSHAP) method to identify positions in the protein structure that play key roles in the PROTAC activity. The key residues identified were consistent with our existing knowledge. We applied PrePROTAC to identify more than 600 novel understudied proteins that are potentially degradable by CRBN, and proposed PROTAC compounds for three novel drug targets associated with Alzheimer's disease.
Podcast created by Paper Player, LLC | |||
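The PrePROTAC abstract above reports sensitivity at a fixed false positive rate of 0.05. As a rough illustration of how such a figure can be computed from classifier scores, here is a minimal Python sketch. The function name `sensitivity_at_fpr`, the strict-threshold rule, and the toy scores are assumptions for illustration and are not taken from the paper.

```python
def sensitivity_at_fpr(scores, labels, max_fpr=0.05):
    """True positive rate at the loosest threshold whose false positive rate stays <= max_fpr.

    scores: classifier scores, higher means more likely positive.
    labels: 1 for positives, 0 for negatives.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative example")
    # Number of negatives we may misclassify without exceeding max_fpr.
    allowed_fp = int(max_fpr * len(negatives))
    # Calling "positive" only strictly above this score admits at most allowed_fp negatives.
    threshold = negatives[allowed_fp]
    return sum(s > threshold for s in positives) / len(positives)


# Hypothetical toy data: 20 negatives with scores 0.00 to 0.95, 4 positives.
neg_scores = [i / 20 for i in range(20)]
pos_scores = [0.99, 0.93, 0.50, 0.20]
scores = neg_scores + pos_scores
labels = [0] * 20 + [1] * 4
tpr = sensitivity_at_fpr(scores, labels)
```

With 20 negatives and `max_fpr=0.05`, one false positive is tolerated, so the threshold sits at the second-highest negative score. Ties with the threshold are treated as negative calls, which keeps the false positive rate at or below the limit.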
| Predicting S. aureus antimicrobial resistance with interpretable genomic space maps | 24 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.24.529878v1?rss=1
Authors: Pikalyova, K., Orlov, A., Horvath, D., Marcou, G., Varnek, A.
Abstract:
Increasing antimicrobial resistance (AMR) represents a global healthcare threat. Methods for rapid selection of optimal antibiotic treatment are urgently needed to decrease the spread of AMR and associated mortality. The use of machine learning (ML) techniques based on genomic data to predict resistance phenotypes serves as a solution for the acceleration of the clinical response prior to phenotypic testing. Nonetheless, many existing ML methods lack interpretability and do not implicitly incorporate visualization of the sequence space that can be useful for extracting insightful patterns from genomic data. Herein, we present a methodology for AMR prediction and visualization of sequence space based on the non-linear dimensionality reduction method - generative topographic mapping (GTM). This approach applied to data on AMR of greater than 5000 S. aureus isolates retrieved from the PATRIC database yielded GTM models with reasonable accuracy for all drugs (balanced accuracy values greater than or equal to 0.75). The GTMs represent data in the form of illustrative 2D maps of the genomic space and allow for antibiotic-wise comparison of resistance phenotypes. In addition to that, the maps were found to be useful for the analysis of genetic determinants responsible for drug resistance based on the data from the PATRIC database. Overall, the GTM-based methodology is a useful tool for the illustrative exploration of the genomic sequence space and modelling AMR and can be used as a tool complementary to the existing ML methods for AMR prediction.
Podcast created by Paper Player, LLC | |||
| Deep Learning vs Gradient Boosting in age prediction on immunology profile | 31 Oct 2022 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2022.10.28.514283v1?rss=1
Authors: Kalyakulina, A., Yusipov, I., Kondakova, E., Bacalini, M. G., Franceschi, C., Vedunova, M., Ivanchenko, M.
Abstract:
Background. The aging process affects all systems of the human body, and the observed increase in inflammatory components affecting the immune system in old age can lead to the development of age-associated diseases and systemic inflammation. Results. We propose a small clock model, SImAge, using a limited number of immunological biomarkers. To solve the problem of regressing chronological age from cytokine data, we first use a baseline Elastic Net model, gradient-boosted decision tree models, and several deep neural network architectures. On the full dataset of 46 immunological parameters, the LightGBM, DANet, and TabNet models showed the best results. Dimensionality reduction of these 3 models with SHAP values revealed the 10 most age-associated immunological parameters, which formed the basis of the SImAge small immunological clock. The best SImAge result, a mean absolute error of 6.28 years, was achieved by the DANet deep neural network model. Explainable artificial intelligence methods were used to explain the model solution for each individual participant. Conclusions. We proposed an approach to construct a small model of immunological age, SImAge, using the DANet deep neural network model, which showed the smallest error on the 10 immunological parameters. The resulting model shows the best result among all published studies on immunological profiles. Since gradient-boosted decision trees and neural networks show similar results in this case, we can consider parity between these types of models for immunological profiles.
Podcast created by Paper Player, LLC | |||
| PhageDPO: Phage Depolymerase Finder | 24 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.24.529883v1?rss=1
Authors: Vieira, M. F., Duarte, J., Domingues, R., Oliveira, H., Dias, O.
Abstract:
Bacteriophages are the most predominant and genetically diverse biological entities on Earth. They are bacterial viruses which encode numerous proteins with potential antibacterial activity. However, most bacteriophage-encoded proteins have no assigned function, hindering the discovery of novel antibacterial agents. In particular, there has been a growing interest in exploring recombinant bacteriophage depolymerases from a fundamental standpoint, but mostly for biotechnological applications to control bacterial pathogens. Due to the lack of efficient identification tools, we developed PhageDPO, the first tool that predicts depolymerases in bacteriophage genomes using machine learning methods.
Podcast created by Paper Player, LLC | |||
| SEG: Segmentation Evaluation in absence of Ground truth labels | 24 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.23.529809v1?rss=1
Authors: Sims, Z., Strgar, L., Thirumalaisamy, D., Heussner, R., Thibault, G., Chang, Y. H.
Abstract:
Identifying individual cells or nuclei is often the first step in the analysis of multiplex tissue imaging (MTI) data. Recent efforts to produce plug-and-play, end-to-end MTI analysis tools such as MCMICRO [1] - though groundbreaking in their usability and extensibility - are often unable to provide users guidance regarding the most appropriate models for their segmentation task among an endless proliferation of novel segmentation methods. Unfortunately, evaluating segmentation results on a user's dataset without ground truth labels is either purely subjective or eventually amounts to the task of performing the original, time-intensive annotation. As a consequence, researchers rely on models pre-trained on other large datasets for their unique tasks. Here, we propose a methodological approach for evaluating MTI nuclei segmentation methods in the absence of ground truth labels by scoring them relative to a larger ensemble of segmentations. To avoid potential sensitivity to collective bias from the ensemble approach, we refine the ensemble via a weighted average across segmentation methods, which we derive from a systematic model ablation study. First, we demonstrate a proof-of-concept and the feasibility of the proposed approach to evaluate segmentation performance in a small dataset with ground truth annotation. To validate the ensemble and demonstrate the importance of our method-specific weighting, we compare the ensemble's detection and pixel-level predictions - derived without supervision - with the data's ground truth labels. Second, we apply the methodology to a larger unlabeled tissue microarray (TMA) dataset, which includes a diverse set of breast cancer phenotypes, and provide decision guidelines for the general user to more easily choose the most suitable segmentation methods for their own dataset by systematically evaluating the performance of individual segmentation approaches in the entire dataset.
Podcast created by Paper Player, LLC | |||
| Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes | 24 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.24.529942v1?rss=1
Authors: Alanko, J. N., Vuohtoniemi, J., Maklin, T., Puglisi, S. J.
Abstract:
Motivation: Huge data sets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. In order to efficiently make use of these data sets, efficient indexing data structures - that are both scalable and provide rapid query throughput - are paramount. Results: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 hours. The resulting index takes 142 gigabytes, and Themisto pseudoaligns reads from a Salmonella enterica isolate sample against the index at a rate of 2 million base pairs per second on 48 threads. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11 thousand genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets. Availability and implementation: Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license.
Podcast created by Paper Player, LLC | |||
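To make the notion of a colored k-mer index in the Themisto abstract above more concrete, the following Python sketch maps each k-mer to the set of reference genomes ("colors") that contain it and pseudoaligns a read by intersecting those sets. It is a toy model only: Themisto itself is a C++ package with far more compact data structures, and the dictionary-based index, the function names, and the rule of skipping k-mers absent from the index are assumptions made here for illustration.

```python
from collections import defaultdict


def build_colored_index(references, k):
    """Map every k-mer to the set of reference names (colors) in which it occurs."""
    index = defaultdict(set)
    for color, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(color)
    return index


def pseudoalign(read, index, k):
    """Intersect the color sets of the read's k-mers, skipping k-mers not in the index."""
    result = None
    for i in range(len(read) - k + 1):
        colors = index.get(read[i:i + k])
        if not colors:
            continue  # k-mer absent from every reference, e.g. due to a sequencing error
        result = set(colors) if result is None else result & colors
    return result or set()


# Hypothetical toy references; real inputs would be whole bacterial genomes.
references = {"A": "ACGTACGTGG", "B": "ACGTTTTTGG"}
index = build_colored_index(references, k=4)
hits = pseudoalign("ACGTAC", index, k=4)
```

Here the read `ACGTAC` shares its first k-mer with both references, but its remaining k-mers occur only in reference `A`, so the intersection narrows the result to `A`.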
| GENEvaRX: A Novel AI-Driven Method and Web Tool Can Identify Critical Genes and Effective Drugs for Lichen Planus | 24 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.23.529678v1?rss=1
Authors: Turki, T., Taguchi, Y.-h.
Abstract:
Lichen planus (LP) is an autoimmune disorder diagnosed based on physical symptoms and lab tests. Examples of symptoms include flat bumps, and itchy and purplish skin, while lab tests include a shave biopsy of the lesion. When the pathology report shows consistency with LP and is negative for potential triggers for an allergy test and hepatitis C, a dermatologist typically prescribes corticosteroid in the form of pills or injection into the lesion to treat the symptoms. To understand the molecular mechanism of the disease and thereby overcome issues associated with disease treatment, there is a need to identify potential effective drugs, drug targets, and therapeutic targets associated with LP. Hence, we propose a novel computational framework based on new constrained optimization to support vector machines coupled with enrichment analysis. First, we downloaded three gene expression datasets (GSE63741, GSE193351, GSE52130) pertaining to healthy and LP patients from the Gene Expression Omnibus (GEO) database. We then processed each dataset and entered it into our computational framework to select important genes. Finally, we performed enrichment analysis of selected genes, reporting the following results. Our methods outperformed baseline methods in terms of identifying disease and skin tissue. Moreover, we report 5 drugs (including dexamethasone, retinoic acid, and quercetin), 45 unique genes (including PSMB8, KRT31, KRT16, KRT19, KRT17, COL3A1, LCE2D, LCE2A), and 23 unique TFs (including NFKB1, STAT1, STAT3) reportedly related to LP pathogenesis, treatments, and therapeutic targets. Our methods are publicly available in the GENEvaRX web server at https://aibio.shinyapps.io/GENEvaRX/.
Podcast created by Paper Player, LLC | |||
| Single cell and spatial alternative splicing analysis with long read sequencing | 24 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.23.529769v1?rss=1
Authors: Fu, Y., Kim, H., Adams, J. I., Grimes, S. M., Huang, S., Lau, B., Sathe, A., Hess, P., Ji, H., Zhang, N.
Abstract:
Long-read sequencing has become a powerful tool for alternative splicing analysis. However, technical and computational challenges have limited our ability to explore alternative splicing at single cell and spatial resolution. The higher sequencing error of long reads, especially high indel rates, has limited the accuracy of cell barcode and unique molecular identifier (UMI) recovery. Read truncation and mapping errors, the latter exacerbated by the higher sequencing error rates, can cause the false detection of spurious new isoforms. Downstream, there is as yet no rigorous statistical framework to quantify splicing variation within and between cells/spots. In light of these challenges, we developed Longcell, a statistical framework and computational pipeline for accurate isoform quantification for single cell and spatial spot barcoded long read sequencing data. Longcell performs computationally efficient cell/spot barcode extraction, UMI recovery, and UMI-based truncation- and mapping-error correction. Through a statistical model that accounts for varying read coverage across cells/spots, Longcell rigorously quantifies the level of inter-cell/spot versus intra-cell/spot diversity in exon-usage and detects changes in splicing distributions between cell populations. Applying Longcell to single cell long-read data from multiple contexts, we found that intra-cell splicing heterogeneity, where multiple isoforms co-exist within the same cell, is ubiquitous for highly expressed genes. On matched single cell and Visium long read sequencing for a tissue of colorectal cancer metastasis to the liver, Longcell found concordant signals between the two data modalities. Finally, on a perturbation experiment for 9 splicing factors, Longcell identified regulatory targets that are validated by targeted sequencing.
Podcast created by Paper Player, LLC | |||
| A generalizable Cas9/sgRNA prediction model using machine transfer learning with small high-quality datasets | 26 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.26.530100v1?rss=1
Authors: Ham, D. T., Browne, T. S., Bangalorewala, P. N., Wilson, T., Michael, R., Gloor, G. B., Edgell, D. R.
Abstract:
The CRISPR/Cas9 nuclease from Streptococcus pyogenes (SpCas9) can be used with single guide RNAs (sgRNAs) as a sequence-specific antimicrobial agent and as a genome-engineering tool. However, current bacterial sgRNA activity models poorly predict SpCas9/sgRNA activity and are not generalizable, possibly because the underlying datasets used to train the models do not accurately measure SpCas9/sgRNA cleavage activity and cannot distinguish cleavage activity from toxicity. We solved this problem by using a two-plasmid positive selection system to generate high-quality biologically-relevant data that more accurately reports on SpCas9/sgRNA cleavage activity and that separates activity from toxicity. We developed a new machine transfer learning architecture (crisprHAL) that can be trained on existing datasets and that shows marked improvements in sgRNA activity prediction accuracy when transfer learning is used with small amounts of high-quality data. The crisprHAL model recapitulates known SpCas9/sgRNA-target DNA interactions and provides a pathway to a generalizable sgRNA bacterial activity prediction tool.
Podcast created by Paper Player, LLC | |||
| Bactabolize: A tool for high-throughput generation of bacterial strain-specific metabolic models | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.26.530115v1?rss=1
Authors: Vezina, B., Watts, S. C., Hawkey, J., Cooper, H. B., Judd, L. M., Jenney, A., Monk, J. M., Holt, K. E., Wyres, K. L.
Abstract:
Metabolic capacity can vary substantially within a bacterial species, leading to ecological niche separation, as well as differences in virulence and antimicrobial susceptibility. Genome-scale metabolic models are useful tools for studying the metabolic potential of individuals, and with the rapid expansion of genomic sequencing there is a wealth of data that can be leveraged for comparative analysis. However, there exist few tools to construct strain-specific metabolic models at scale. Here we describe Bactabolize (github.com/kelwyres/Bactabolize), a reference-based tool which rapidly produces strain-specific metabolic models and growth phenotype predictions. We describe a pan reference model for the priority antimicrobial-resistant pathogen, Klebsiella pneumoniae (github.com/kelwyres/KpSC-pan-metabolic-model), and a quality control framework for using draft genome assemblies as input for Bactabolize. The Bactabolize-derived model for K. pneumoniae reference strain KPPR1 outperformed the CarveMe-derived model across ≥201 substrate and ≥1220 knockout mutant growth predictions. Novel draft genomes passing our systematically-defined quality control criteria resulted in models with a high degree of completeness (≥99% genes and reactions captured) and high accuracy (mean 0.97, n=10). We anticipate the tools and framework described herein will facilitate large-scale metabolic modelling analyses that broaden our understanding of diversity within bacterial species and inform novel control strategies for priority pathogens.
Podcast created by Paper Player, LLC | |||
| Protein embeddings improve phage-host interaction prediction | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.26.530154v1?rss=1
Authors: Gonzales, M. E. M., Ureta, J. C., Shrestha, A. M. S.
Abstract:
With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem, which takes as input the embeddings of a phage's receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase of weighted F1 scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.
Podcast created by Paper Player, LLC | |||
| Influence of Demographic, Socio-economic, and Brain Structural Factors on Adolescent Neurocognition: A Correlation Analysis in the ABCD Initiative | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.24.529930v1?rss=1
Authors: Hussain, M. A., Li, G., Grant, E., Ou, Y.
Abstract:
The Adolescent Brain Cognitive Development (ABCD) initiative is a longitudinal study aimed at characterizing brain development from childhood through adolescence and identifying key biological and environmental factors that influence this development. The study measures neurocognitive abilities across a multidimensional array of functions, with a focus on the critical period of adolescence during which physical and socio-emotional changes occur and the structure of the cortical and white matter changes. In this study, we perform a correlation analysis to examine the linear relation of adolescent neurocognitive functions with the demographic, socio-economic, and magnetic resonance imaging-based brain structural factors. The overall goal is to obtain a comprehensive understanding of how natural and nurtural factors influence adolescent neurocognition. Our results on more than 10,000 adolescents show many statistically significant positive and negative interrelations of different neurocognitive functions with the demographic, socio-economic, and brain structural factors, and also open up questions inviting further studies.
Podcast created by Paper Player, LLC | |||
| MDTOMO: Continuous conformational variability analysis in cryo electron subtomogram data using flexible fitting based on Molecular Dynamics simulations | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.25.529934v1?rss=1
Authors: Vuillemot, R., Rouiller, I., Jonic, S.
Abstract:
Cryo electron tomography (cryo-ET) allows observing macromolecular complexes in their native environment. The common routine of subtomogram averaging (STA) allows obtaining the three-dimensional (3D) structure of abundant macromolecular complexes, and can be coupled with discrete classification to reveal conformational heterogeneity of the sample. However, the number of complexes extracted from cryo-ET data is usually small, which restricts the discrete-classification results to a small number of sufficiently populated states and, thus, results in a largely incomplete conformational landscape. Alternative approaches are currently being investigated to explore the continuity of the conformational landscapes that in situ cryo-ET studies could provide. In this article, we present MDTOMO, a method for analyzing continuous conformational variability in cryo-ET subtomograms based on Molecular Dynamics (MD) simulations. MDTOMO allows obtaining an atomic-scale model of conformational variability and the corresponding free-energy landscape, from a given set of cryo-ET subtomograms. The article presents the performance of MDTOMO on a synthetic ABC exporter dataset and an in situ SARS-CoV-2 spike dataset. MDTOMO allows analyzing dynamic properties of molecular complexes to understand their biological functions, which could also be useful for structure-based drug discovery.
Podcast created by Paper Player, LLC | |||
| Expanding the stdpopsim species catalog, and lessons learned for realistic genome simulations | 31 Oct 2022 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2022.10.29.514266v1?rss=1
Authors: Lauterbur, M. E., Cavassim, M. I. A., Gladstein, A. L., Gower, G., Pope, N. S., Tsambos, G., Adrion, J., Belsare, S., Biddanda, A., Caudill, V., Cury, J., Echevarria, I., Haller, B. C., Hasan, A. R., Huang, X., Iasi, L. N. M., Noskova, E., Obsteter, J., Pavinato, V. A. C., Pearson, A., Peede, D., Perez, M. F., Rodrigues, M. F., Smith, C. C. R., Spence, J. P., Teterina, A., Tittes, S., Unneberg, P., Vazquez, J. M., Waples, R. K., Wohns, A. W., Wong, Y., Baumdicker, F., Cartwright, R. A., Gorjanc, G., Gutenkunst, R. N., Kelleher, J., Kern, A. D., Ragsdale, A. P., Ralph, P. L., Schrider, D. R., G
Abstract:
Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic data sets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and to the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than three-fold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. 
These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| Microsnoop: a generalist tool for the unbiased representation of heterogeneous microscopy images | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.25.530004v1?rss=1
Authors: Xun, D., Wang, R., Wang, Y.
Abstract:
Accurate and automated representation of microscopy images from small-scale to high-throughput is becoming an essential procedure in basic and applied biological research. Here, we present Microsnoop, a novel deep learning-based representation tool trained on large-scale microscopy images using masked self-supervised learning, which eliminates the need for manual annotation. Microsnoop is able to unbiasedly profile a wide range of complex and heterogeneous images, including single-cell, fully-imaged and batch-experiment data. We evaluated the performance of Microsnoop using seven high-quality datasets, containing over 358,000 images and 1,270,000 single cells with varying resolutions and channels from cellular organelles to tissues. Our results demonstrate Microsnoop's robustness and state-of-the-art performance in all biological applications, outperforming previous generalist and even custom algorithms. Furthermore, we present its potential contribution to multi-modal studies. Microsnoop is highly inclusive of GPU and CPU capabilities, and can be freely and easily deployed on local or cloud computing platforms.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| A robust model for cell type-specific interindividual variation in single-cell RNA sequencing data | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.24.529987v1?rss=1
Authors: Chen, M., Dahl, A.
Abstract:
The development of single-cell RNA sequencing (scRNA-seq) offers opportunities to characterize cellular heterogeneity at unprecedented resolution. Although scRNA-seq has been widely used to identify and characterize gene expression variation across cell types and cell states based on their average gene expression profiles, most studies ignore variation across individual donors. Modelling this inter-individual variation could improve statistical power to detect cell type-specific biology and inform the genes and cell types that underlie complex traits. We therefore develop a new model to detect and quantify cell type-specific variation across individuals called CTMM (Cell Type-specific linear Mixed Model). CTMM operates on cell type-specific pseudobulk expression and is fit with efficient methods that scale to hundreds of samples. We use extensive simulations to show that CTMM is powerful and unbiased in realistic settings. We also derive calibrated tests for cell type-specific interindividual variation, which is challenging given the modest sample sizes in scRNA-seq data. We apply CTMM to scRNA-seq data from human induced pluripotent stem cells to characterize the transcriptomic variation across donors as cells differentiate into endoderm. We find that almost 100% of transcriptome-wide variability between donors is differentiation stage-specific. CTMM also identifies individual genes with statistically significant stage-specific variability across samples, including 61 genes that do not have significant stage-specific mean expression. Finally, we extend CTMM to partition interindividual covariance between stages, which recapitulates the overall differentiation trajectory. Overall, CTMM is a powerful tool to characterize a novel dimension of cell type-specific biology in scRNA-seq.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| Cellograph: A Semi-supervised Approach to Analyzing Multi-condition Single-cell RNA-sequencing Data Using Graph Neural Networks | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.24.528672v1?rss=1
Authors: Shahir, J. A., Stanley, N., Purvis, J. E.
Abstract:
With the growing number of single-cell datasets collected under more complex experimental conditions, there is an opportunity to leverage single-cell variability to reveal deeper insights into how cells respond to perturbations. Many existing approaches rely on discretizing the data into clusters for differential gene expression (DGE), effectively ironing out any information unveiled by the single-cell variability across cell-types. In addition, DGE often assumes a statistical distribution that, if erroneous, can lead to false positive differentially expressed genes. Here, we present Cellograph: a semi-supervised framework that uses graph neural networks to quantify the effects of perturbations at single-cell granularity. Cellograph not only measures how prototypical cells are of each condition but also learns a latent space that is amenable to interpretable data visualization and clustering. The learned gene weight matrix from training reveals pertinent genes driving the differences between conditions. We demonstrate the utility of our approach on publicly-available datasets including cancer drug therapy, stem cell reprogramming, and organoid differentiation. Cellograph outperforms existing methods for quantifying the effects of experimental perturbations and offers a novel framework to analyze single-cell data using deep learning.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| PIFiA: Self-supervised Approach for Protein Functional Annotation from Single-Cell Imaging Data | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.24.529975v1?rss=1
Authors: Razdaibiedina, A., Brechalov, A. V., Friesen, H., Mattiazzi Usaj, M., Masinas, M. P. D., Garadi Suresh, H., Wang, K., Boone, C., Ba, J., Andrews, B. J.
Abstract:
Fluorescence microscopy data describe protein localization patterns at single-cell resolution and have the potential to reveal whole-proteome functional information with remarkable precision. Yet, extracting biologically meaningful representations from cell micrographs remains a major challenge. Existing approaches often fail to learn robust and noise-invariant features or rely on supervised labels for accurate annotations. We developed PIFiA (Protein Image-based Functional Annotation), a self-supervised approach for protein functional annotation from single-cell imaging data. We imaged the global yeast ORF-GFP collection and applied PIFiA to generate protein feature profiles from single-cell images of fluorescently tagged proteins. We show that PIFiA outperforms existing approaches for molecular representation learning and describe a range of downstream analysis tasks to explore the information content of the feature profiles. Specifically, we cluster extracted features into a hierarchy of functional organization, study cell population heterogeneity, and develop techniques to distinguish multi-localizing proteins and identify functional modules. Finally, we confirm new PIFiA predictions using a colocalization assay, suggesting previously unappreciated biological roles for several proteins. Paired with a fully interactive website (https://thecellvision.org/pifia/), PIFiA is a resource for the quantitative analysis of protein organization within the cell.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| InjectionDesign: Plate Design with Optimized Stratified Block Randomization for Modern LC/GC-MS-Based Sample Preparation | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.26.530140v1?rss=1
Authors: Lu, M., Jiang, H., Wang, R., An, S., Wang, J., Yu, C.
Abstract:
Plate design is a necessary and time-consuming operation for GC/LC-MS-based sample preparation. The implementation of the inter-batch balancing algorithm and the intra-batch randomization algorithm can have a significant impact on the final analysis results. For researchers without programming skills, a stable and efficient online service for plate design is necessary. However, most existing products do not currently have online services, and are not optimized for GC/LC-MS instruments with custom injection capabilities. Here we describe InjectionDesign, a free online plate design service focused on GC/LC-MS-based multi-omics experiment design. It offers the ability to separate the position design from the sequence design, making the output more compatible with the requirements of a modern mass spectrometer-based laboratory. In addition, it has implemented an optimized block randomization algorithm, which can be better applied to sample stratification with block randomization for unbalanced distribution. It is easy to use, with built-in support for common instrument models and quick export to worksheet. InjectionDesign is an open-source project based on Java. Researchers can get the source code of the project from GitHub: https://github.com/CSi-Studio/InjectionDesign. A free web service is also provided: http://www.injection.design. Demo Project: http://www.injection.design/#/project/detail?projectId=Test Keywords: InjectionDesign; Plate Design; Mass Spectrometry; Block Randomization; Stratified Balancing; Metabolomics; Proteomics; Web Service
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| Benchmarking differential abundance methods for finding condition-specific prototypical cells in multi-sample single-cell datasets | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.24.529894v1?rss=1
Authors: Yi, H., Plotkin, A., Stanley, N.
Abstract:
Modern single-cell data analysis relies on statistical testing (e.g. differential expression testing) to identify genes or proteins that are up- or down-regulated in relation to cell-types or clinical outcomes. However, existing algorithms for such statistical testing are often limited by technical noise and cellular heterogeneity, which lead to false-positive results. To constrain the analysis to a compact and phenotype-related cell population, differential abundance (DA) testing methods were employed to identify subgroups of cells whose abundance changed significantly in response to disease progression or experimental perturbation. Despite the effectiveness of DA testing algorithms in identifying critical cell-states, there are no systematic benchmarking or comparative studies of their use in practice. Herein, we performed the first comprehensive benchmarking study to objectively evaluate and compare the benefits and potential downsides of current state-of-the-art DA testing methods. We benchmarked six DA testing methods on several practical tasks, using both synthetic and real single-cell datasets. The tasks evaluated include recognizing true DA subpopulations, appropriate handling of batch effects, runtime efficiency, and hyperparameter usability and robustness. Based on various evaluation results, this paper gives dataset-specific suggestions for the usage of DA testing methods.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| Automated quantification of lipophagy in Saccharomyces cerevisiae from fluorescence and cryo-soft X-ray microscopy data using deep learning | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.27.530171v1?rss=1
Authors: Egebjerg, J., Szomek, M., Thaysen, K., Dupont Juhl, A., Pratsch, C., Werner, S., Schneider, G., Rottger, R., Wustner, D.
Abstract:
Lipophagy is a form of autophagy by which lipid droplets (LDs) become digested to provide nutrients as a cellular response to starvation. Lipophagy is often studied in yeast, Saccharomyces cerevisiae, in which LDs become internalized into the vacuole. There is a lack of tools to quantitatively assess lipophagy in intact cells with high resolution and throughput. Here, we combine soft X-ray tomography (SXT) with fluorescence microscopy and use a deep learning computational approach to visualize and quantify lipophagy in yeast. We focus on yeast homologs of mammalian Niemann Pick type C proteins, whose dysfunction leads to Niemann Pick type C disease in humans, i.e., NPC1 (named NCR1 in yeast) and NPC2. We developed a convolutional neural network (CNN) model which classifies ring-shaped versus lipid-filled or fragmented vacuoles containing ingested LDs in fluorescence images from wild-type yeast and from cells lacking NCR1 (Δncr1 cells) or NPC2 (Δnpc2 cells). Using a second CNN model, which performs automated segmentation of LDs and vacuoles from high-resolution reconstructions of X-ray tomograms, we can obtain 3D renderings of LDs inside and outside of the vacuole in a fully automated manner and additionally measure droplet volume, number, and distribution. We find that cells lacking functional NPC proteins can ingest LDs into vacuoles normally but show compromised degradation of LDs and accumulation of lipid vesicles inside vacuoles. This phenotype is most severe in Δnpc2 cells. Our new method is versatile and allows for automated high-throughput 3D visualization and quantification of lipophagy in intact cells.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| MsPBRsP: Multi-scale Protein Binding Residues Prediction Using Language Model | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.26.528265v1?rss=1
Authors: Li, Y., Lu, S., Nan, X., Zhang, S., Zhou, Q.
Abstract:
Accurate prediction of protein binding residues (PBRs) from sequence is important for the understanding of cellular activity and helpful for the design of novel drugs. However, experimental methods are time-consuming and expensive. In recent years, many computational predictors based on machine learning and deep learning models have been proposed to reduce such consumption. But those methods often use MSA tools such as PSI-BLAST or NetSurfP to generate some statistical features and enter them into predictive models as necessary supplementary input. The input generation process normally takes a long time, and there is no standard to specify which and how many statistical results should be provided to a prediction model. In addition, prediction of PBRs relies on residue local context, but the most appropriate scale is undetermined. Most works pre-select certain residue features as input and a scale size based on expertise for a certain type of PBR. In this study, we propose a general tool-free end-to-end framework that can be applied to all types of PBRs, Multiscale Protein Binding Residues Prediction using language model (MsPBRsP). We adopt a pre-trained language model, ProtTrans, to save the large consumption caused by MSA tools, and use protein sequence alone as input to our model. To ease scale size uncertainty, we construct multi-size windows in the attention layer and multi-size kernels in the convolutional layer. We test our framework on various benchmark datasets including PBRs from protein-protein, protein-nucleotide, protein-small ligand, heterodimer, homodimer and antibody-antigen interactions. Compared with existing state-of-the-art methods, MsPBRsP achieves superior performance with less running time and higher prediction rates on every PBR prediction task. Specifically, we boost F1 score by 27.1% and AUPRC score by 7.6% on the NSP448 dataset and decrease running time from over 10 minutes to under 0.1s on average.
The source code and datasets are available at https://github.com/biolushuai/MsPBRsP-for-multiple-PBRsprediction.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| Pan-cancer genetic analysis of disulfidptosis-related gene set | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.25.529997v1?rss=1
Authors: Liu, H., Tang, T.
Abstract:
Background: A recent study has identified a novel programmed cell death pathway, termed disulfidptosis, which is based on disulfide proteins. This discovery provides new insight into the mechanisms of cell death and may have implications for therapeutic strategies targeting cell death pathways. This study aimed to evaluate the pan-cancer genomics and clinical association of disulfidptosis and disulfidptosis-related cell death genes, including SLC7A11, NADPH, INF2, CD2AP, PDLIM1, ACTN4, MYH9, MYH10, IQGAP1, FLNA, FLNB, TLN1, MYL6, ACTB, DSTN, and CAPZB. Methods: Using multi-omics profiling data, this study provides a comprehensive and systematic characterization of disulfidptosis genes across more than 9000 samples of over 30 types of cancer. Results: FLNA and FLNB were the two most frequently mutated disulfidptosis cell death genes in cancer. UCEC and SKCM were the two cancer types with the highest mutation rates, while the mutation of ACTN4 was associated with worse survival in CESC and ESCA. Breast cancer was potentially affected by disulfidptosis because its subtypes differ in disulfidptosis gene expression. Similarly, KIRC might also be associated with disulfidptosis. Additionally, the association of disulfidptosis-related cell death genes with survival was analyzed, with MESO and LGG as the top cancer types with survival associated with disulfidptosis cell death genes. The correlation between CNV and survival across multiple cancer types found that UCEC, KIRP, LGG, and KIRC were the top cancer types where the CNV level was associated with survival. There was a negative correlation between expression and methylation for most of the genes and there was only a slight correlation between methylation levels and survival of cancer in LGG. About half of the disulfidptosis-related cell death proteins were associated with the activation of EMT. Disulfidptosis genes were correlated with immune cell infiltration levels in cancers.
Multiple compounds were identified as potential drugs that might be affected by disulfidptosis-related cell death for future study. Conclusion: Disulfidptosis cell death genes are potentially involved in many cancer types and can be developed as candidates for cancer diagnosis, prognosis, and therapeutic biomarkers.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| Adversarial and variational autoencoders improve metagenomic binning | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.27.527078v1?rss=1
Authors: Piera Lindez, P., Johansen, J., Sigurdsson, A. I., Nissen, J. N., Rasmussen, S.
Abstract:
Assembly of reads from metagenomic samples is a hard problem, often resulting in highly fragmented genome assemblies. Metagenomic binning allows us to reconstruct genomes by re-grouping the sequences by their organism of origin, thus representing a crucial processing step when exploring the biological diversity of metagenomic samples. Here we present Adversarial Autoencoders for Metagenomics Binning (AAMB), an ensemble deep learning approach that integrates sequence co-abundances and tetranucleotide frequencies into a common denoised space that enables precise clustering of sequences into microbial genomes. When benchmarked, AAMB presented similar or better results compared with the state-of-the-art reference-free binner VAMB, reconstructing ~7% more near-complete (NC) genomes across simulated and real data. In addition, genomes reconstructed using AAMB had higher completeness and greater taxonomic diversity compared with VAMB. Finally, we implemented a pipeline integrating VAMB and AAMB that enabled improved binning, recovering 20% and 29% more simulated and real NC genomes, respectively, compared to VAMB with moderate additional runtime. AAMB is freely available at https://github.com/RasmussenLab/VAMB.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| One-stop analysis of DIA proteomics data using MSFragger-DIA and FragPipe computational platform | 31 Oct 2022 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2022.10.28.514272v1?rss=1
Authors: Yu, F., Teo, G. C., Kong, A. T., Li, G. X., Demichev, V., Nesvizhskii, A. I.
Abstract:
Liquid chromatography (LC) coupled with data-independent acquisition (DIA) mass spectrometry (MS) has been increasingly used in quantitative proteomics studies. Here, we present a fast and sensitive approach for direct peptide identification from DIA data, MSFragger-DIA, which leverages the unmatched speed of the fragment ion indexing-based search engine MSFragger. MSFragger-DIA conducts a database search of the DIA tandem mass (MS/MS) spectra prior to spectral feature detection and peak tracing across the LC dimension. We have integrated MSFragger-DIA into the FragPipe computational platform for seamless support of peptide identification and spectral library building from DIA, data-dependent acquisition (DDA), or both data types combined. We compared MSFragger-DIA with other DIA tools, such as the DIA-Umpire-based workflow in FragPipe, Spectronaut, and the in silico library-based DIA-NN and MaxDIA. We demonstrated the fast and sensitive performance of MSFragger-DIA across a variety of sample types and data acquisition schemes, including single-cell proteomics, phosphoproteomics, and large-scale tumor proteome profiling studies.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| An associative transcriptomics study on rice bean (Vigna umbellata) provides new insights into genetic basis and candidate genes governing flowering, maturity and seed weight | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.25.530014v1?rss=1
Authors: Sahu, T. K., Verma, S. K., Gayacharan,, Singh, N. P., Joshi, D. C., Wankhede, D. P., Singh, M., Bhardwaj, R., Parida, S. K., Chattopadhyay, D., Singh, G. P., Singh, A. K.
Abstract:
Rice bean is an underrated legume with significant potential to support food and nutritional security worldwide, being a rich source of proteins, minerals, and essential fatty acids. Therefore, we considered three pivotal production traits of rice bean (flowering, maturity and seed weight) to identify associated candidate genes. One hundred diverse genotypes out of 1800 evaluated rice bean accessions from the Indian National Genebank were considered for phenotypic data collection and genotyping by a transcriptome sequencing approach. Association analysis involving various GWAS models was conducted to identify significant marker-trait associations. The results revealed association of 82 markers on 48 transcripts for flowering, 26 markers on 22 transcripts for maturity and 22 markers on 21 transcripts for seed weight. The annotation of associated transcripts unraveled the functional genes related to the considered traits. Among the significant candidate genes identified, HSC80, P-II PsbX, phospholipid-transporting-ATPase-9, pectin-acetylesterase-8 and E3-ubiquitin-protein-ligase-RHG1A were found associated with flowering. Further, associations of WRKY1 and DEAD-box-RH27 with seed weight, PIF3 and pentatricopeptide-repeat-containing-gene with maturity and seed weight, and aldo-keto-reductase with flowering and maturity have been revealed. The present investigation provides insights into the genetic mechanisms governing economically essential traits like flowering, maturity and seed weight that can be potentially utilized for rice bean improvement.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
| EnCPdock: a web-interface for direct conjoint comparative analyses of complementarity and binding energetics in inter-protein associations | 27 Feb 2023 | ||
Link to bioRxiv paper:
http://biorxiv.org/cgi/content/short/2023.02.26.530084v1?rss=1
Authors: Biswas, G., Mukherjee, D., Dutta, N., Ghosh, P., Basu, S.
Abstract:
Protein-protein interaction (PPI) is a key component linked to virtually all cellular processes. Be it an enzyme catalysis (classic type functions of proteins) or a signal transduction (non-classic), proteins generally function involving stable or quasi-stable multi-protein associations. The physical basis for such associations is inherent in the combined effect of shape and electrostatic complementarities (Sc, EC) of the interacting protein partners at their interface. While Sc is a necessary criterion for inter-protein associations, EC can be favorable as well as disfavored (e.g., in transient interactions). Estimating equilibrium thermodynamic parameters (ΔG_binding, Kd) by experimental means is costly and time consuming, thereby opening windows for computational structural interventions. Attempts to empirically probe ΔG_binding from coarse-grain structural descriptors (primarily, surface area based terms) have lately been overtaken by physics-based, knowledge-based and their hybrid approaches (MM/PBSA, FoldX etc.) that directly compute ΔG_binding without involving intermediate structural descriptors. Here we present EnCPdock (www.scinetmol.in/EnCPdock/), a user-friendly web-interface for the direct conjoint comparative analyses of complementarity and binding energetics in proteins. EnCPdock returns an AI-predicted ΔG_binding computed by combining complementarity (Sc, EC) and other high-level structural terms, and renders a prediction accuracy comparable to the state-of-the-art. EnCPdock further locates a PPI complex in terms of its {Sc, EC} values (taken as an ordered pair) in the two-dimensional Complementarity Plot (CP). In addition, it also generates mobile molecular graphics of the interfacial atomic contact network. Combining all its features, EnCPdock presents a unique online tool that should be beneficial to structural biologists and researchers across related fraternities.
Copy rights belong to original authors. Visit the link for more info
Podcast created by Paper Player, LLC | |||
© My Podcast Data