Software project for the analysis of genomic data
Daniel Sabanés Bové is a senior principal data scientist at Roche. In our conversation, we discuss the need for better software in biotech and his career in data science.

Daniel studied statistics at Ludwig Maximilian University of Munich in Germany and earned his PhD in 2013 from the University of Zurich in Switzerland, with doctoral research focused on Bayesian model selection. After completing his PhD, Daniel began his career at Roche as a biostatistician, applying statistical principles to clinical trials and research in areas like oncology, immunology, and neuroscience. In 2018, Daniel joined Google as a data scientist, where he worked on ranking systems, developing models to optimize search results. In 2020, Daniel returned to Roche to lead a specialized team focused on statistical engineering. Throughout his career, Daniel has co-authored multiple R packages published on CRAN and Bioconductor, and he co-wrote the book 'Likelihood and Bayesian Inference: With Applications in Biology and Medicine'. Currently, he serves as co-chair of an ASA working group called openstatsware that promotes software engineering in biostatistics.

According to Daniel, software engineering principles are often neglected in biostatistics. Most biostatisticians know a programming language like R but lack formal training in writing reusable, reliable code. Daniel argues this is problematic for several reasons. First, without code reviews, we risk making erroneous analytical decisions based on buggy statistical software. Second, code passed from statistician to statistician without documentation makes reproducibility impossible. Third, in regulated fields like pharmaceuticals, validation protocols are needed to verify analyses, and these require well-engineered code. Finally, without sufficient testing, even modifying poorly written software can introduce unexpected behaviors.

To address these problems, Daniel calls on the biostatistics community to prioritize software engineering skills. Change starts with awareness: we must recognize the value of good engineering. Next, software engineering concepts need integration across statistics curricula, in both academia and industry. Dedicated software engineering teams play a key role; they can catalyze adoption of engineering best practices within research teams and provide training. Providing attractive career growth for software-oriented roles aids retention of technical talent. Cross-organizational collaboration also helps: by sharing insights and contributing to open-source tools, we make better use of resources, and following modern engineering practices facilitates building reusable components. Daniel points to projects like Mediana (for clinical trial simulations) as examples of successful collaborative open-source biostatistics software.

What could improved software engineering mean for biostatistical analyses? Daniel foresees greater efficiency and integrity. With robust code-review protocols, analyses have higher accuracy. Well-documented software enhances reproducibility. A strong testing culture provides a safety net against inadvertent bugs. Modular, reusable code makes implementing new analyses faster. Validation frameworks give regulators the necessary confidence in results. Daniel also notes how high-quality software enables faster innovation: by encapsulating complex methods in packages, researchers can build on previous work rather than recoding from scratch, and reliable software tools empower statisticians to operate at higher levels of abstraction.
Ultimately, Daniel argues that pursuing excellence in software engineering serves both ethical and practical ends. Ethically, biostatisticians have an obligation to provide sound statistical guidance, and pursuing engineering excellence helps fulfill this duty. Practically, improved software engineering makes biostatisticians more effective in their work, accelerating discoveries and powering data-driven decisions.
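To make the testing point concrete, here is a minimal sketch of the kind of unit test Daniel advocates, written in R with the testthat package. The function summarize_trial() is a hypothetical example for illustration, not code from the episode.

    # A small analysis function that might otherwise live, untested, in a script.
    # summarize_trial() is a hypothetical example.
    summarize_trial <- function(x) {
      stopifnot(is.numeric(x), length(x) > 0)
      c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
    }

    # A unit test that guards against regressions when the function is modified.
    library(testthat)
    test_that("summarize_trial returns correct mean and sd", {
      res <- summarize_trial(c(1, 2, 3))
      expect_equal(unname(res["mean"]), 2)
      expect_equal(unname(res["sd"]), 1)
      expect_error(summarize_trial(character(0)))
    })

Tests like this are what make a validation protocol feasible: a reviewer can re-run them to verify the analysis code before it touches trial data.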
Consistency, Impact, and Versatility. In this episode of The Outspoken Podcast, host Shana Cosgrove talks to Benjamin Harvey, Founder & CEO of AI Squared, Research Professor, and Data Science Leader. He takes us on his multifaceted journey, spanning from Harvard to the NSA to Johns Hopkins University to his most recent endeavor, AI Squared. Benjamin gets technical, discussing everything from Hadoop to Apache Spark to ACID transactions. We hear how Benjamin's new journey as founder of AI Squared is rooted in his discipline and the simplicity of consistent hard work over time. Finally, Benjamin's NCAA basketball dominance is revealed, as he impresses us with his performance against Kevin Love and Russell Westbrook.

QUOTES
"Having that mindset of being focused on helping others, and not having to go back to that environment is really what kept me focused and gave me that grit, those habits to be able to be successful" - Benjamin Harvey [20:12]
"General Nakasone is really focused on figuring out how to even enable folks that are in mission spaces to be able to do tours in industry and then come back. Because the point of the matter is that a guy like myself - I was interested in doing a tour in industry, but I didn't come back because programs like that don't exist." - Benjamin Harvey [30:21]
"How can I leave my mark for the next generation of Ben Harveys that come from similar backgrounds, face similar challenges - how can I take the experiences and the knowledge of navigating through some of these challenges and make it easier for that next generation to be successful?" - Benjamin Harvey [47:40]

TIMESTAMPS
[00:04] Intro [01:39] How Benjamin Knows Stephanie Beben [03:38] Benjamin's Day Job [05:33] Discussing JHU COVID-19 Risk Tools [09:19] Benjamin's Family [11:17] Was Benjamin This Driven as a Kid? [14:01] High School Experience [16:35] Benjamin's Scholarship [19:05] Benjamin's Discipline [21:24] From Harvard to Bowie [25:07] Highlights of Benjamin's Time at NSA [28:24] Why Benjamin Left the NSA [32:56] Advantages of Spark Over Hadoop [35:55] ACID Transactions [38:56] Moving on From Databricks [41:25] Creating AI Squared [44:47] Benjamin's Experience as Founder [46:16] Advice Benjamin Would Give His Younger Self [47:32] What Does Success Look Like? [48:05] Benjamin's Favorite Books [49:13] A Surprising Fact About Benjamin [50:05] Outro

RESOURCES
https://www.linkedin.com/in/stephanie-beben/ (Stephanie Beben) https://www.nsa.gov/ (National Security Agency (NSA)) https://www.boozallen.com/ (Booz Allen Hamilton) https://www.nsa.gov/Signals-Intelligence/Overview/#:~:text=SIGINT%20is%20intelligence%20derived%20from,capabilities%2C%20actions%2C%20and%20intentions. (SIGINT) https://www.linkedin.com/in/jaysha-camacho/ (Jaysha Camacho Irizarry) https://www.nea.com/ (New Enterprise Associates) https://www.jhu.edu/ (Johns Hopkins University (JHU)) https://www.cdc.gov/ (Centers for Disease Control and Prevention (CDC)) https://covid19risktools.com:8443/ (COVID-19 Risk Tools) https://www.army.mil/ (United States Army) https://www.american.edu/ (American University) https://www.att.com/ (AT&T) https://www.mvsu.edu/ (Mississippi Valley State University) https://www.fldoe.org/accountability/assessments/k-12-student-assessment/archive/fcat/ (FCAT) https://www.gwu.edu/ (George Washington University (GWU)) https://www.harvard.edu/ (Harvard University) https://connects.catalyst.harvard.edu/Profiles/display/Person/42421 (Vincent James Carey, Ph.D.)
https://www.bioconductor.org/ (Bioconductor) https://www.bowiestate.edu/ (Bowie State University) https://www.imdb.com/title/tt0119217/ (Good Will Hunting) https://www.imdb.com/title/tt0120660/ (Enemy of the State) https://www.theguardian.com/world/2013/jun/09/edward-snowden-nsa-whistleblower-surveillance (Edward Snowden) https://www.dni.gov/index.php?option=com_content&view=article&id=572&Itemid=991...
We shared listener letters from the past two years.

Show notes
Researchat.fm letter form … We accept your letters here. Have you finished your gastrulation? Lewis Wolpert: "It is not birth, marriage, or death, but gastrulation which is truly the most important time in your life." Have you finished your V(D)J recombination? You take Allegra and still run experiments, right? THIS WEEK IN VIROLOGY … Thank you, Ishihara-san, for the recommendation. The TW series NEB podcast eLife podcast The Lonely Pipette: helping scientists do better science … an overseas podcast in the spirit of Researchat.fm that SHIOTA-san told us about Frieren: Beyond Journey's End Yugami-kun Has No Friends … Thank you, Ayane-san, for the recommendation. Artiste Bouei Mangadama Nikki (防衛漫玉日記) Kobe Zaijuu (神戸在住) 76. The Chimeric RNA (Researchat.fm) … the episode where dessan explained CRISPR 79. Connecting Dots (Researchat.fm) … the episode where pome-san explained biotech startups Zhu et al., Science Advances (2019) … "A prokaryotic-eukaryotic hybrid viral vector for delivery of large cargos of genes and proteins into human cells" Thank you, Tauchi-san, for the recommendation; it is an interesting paper. We will introduce one paper this year for every like we receive Papers we introduced in 2020 Papers we introduced in 2019 Nobori-san's blog Nobori-san's appearance on Jounetsu Tairiku "Abandonment of Logical Thinking" Phase separation Mentions of LLPS … from the history of LLPS research to its open problems chromatin/DNA loop extrusion … if you are interested, please google it. Cytoscape Bioconductor Fiji/ImageJ Chan Zuckerberg Initiative BioPerl Biopython BioRuby 74. Imaging-by-Sequencing (Researchat.fm) … includes a brief mention of DNA origami 37. Biological Enigma (Researchat.fm) … an introduction to molecular biology by dessan

Editorial notes
Believe it or not, we are trying our best to come up with answers (soh) Sorry for the slow reply, and thank you for your letters. (tadasu) Apologies that our pace of answering letters on the podcast is slow. We read every letter we receive, and they really keep us going! (coela)
Guest: Rodrigo Mendoza
Panelists: Richard Littauer | Justin Dorfman | Ben Nichols | Eric Berry

Show Notes
Hello and welcome to Sustain! The podcast where we talk about sustaining open source for the long haul. Today, we have a really awesome guest to talk about some really cool stuff. Rodrigo Mendoza is the Founder and CEO of Quine, a data-driven professional network for software creators, as well as GitNFT, an NFT minting platform for GitHub commits. Rodrigo dives deep into Quine and tells us why he's focusing on open source and software developers, and what is so different about his platform. We also learn more about GitNFT, which is a part of Quine but a different product, and he talks about some of the issues he's had with GitNFT and why some people get so riled up towards NFTs and Web3. Go ahead and download this episode now to find out much more!
[00:02:09] We start off with Rodrigo telling us what Quine is. [00:04:12] Richard wonders why Rodrigo is focusing on open source and software developers in particular. [00:05:26] Richard asks Rodrigo how Quine is not a subset of LinkedIn, and he tells us what's different about his platform. [00:09:17] Ben wonders if Rodrigo has any pathways he could create to bring more people into open source to distribute more opportunities to people. [00:12:33] Another thing Rodrigo works on is GitNFT, so we find out more about that and how it works. [00:16:22] Justin asks Rodrigo his thoughts on why some people in this industry or our community are so hostile towards NFTs and Web3 as a whole. [00:21:28] Richard wonders how Rodrigo deals with the internal conflict. [00:23:36] Ben shares his thoughts on NFTs, and Rodrigo talks about some of the issues he's had with GitNFT. [00:29:17] Eric shares some closing thoughts on GitNFT, NFTs overall, and what he loves about this project. [00:32:35] Find out where you can follow Rodrigo and learn more about Quine.

Quotes
[00:02:28] "We think that coding is a super-power and if you can code, then you should be able to monetize your skill in a way that is very easy and very fluid." [00:06:48] "We think of open source as the professional network of the future." [00:07:21] "We think that open source contributions are going to be micro-certificates of skill." [00:08:35] "Open source contributions are proof-of-work for skills." [00:10:07] "I like to think that open source is still in its early stages." [00:11:09] "Open Source has the same problems that a creator economy has: It has issues all around attention, monetization, content creation, content consumption, etc." [00:15:40] "We think that having an NFT of a commit can have value based on historical significance." [00:19:44] "We want to flip the script on how we monetize open source contributions."

Spotlight
[00:33:33] Ben's spotlight is Exercism.org. [00:34:07] Eric's spotlight is the Firefox Browser Developer Edition. [00:34:27] Justin's spotlight is a GitNFT discussion on the Sustain discourse. [00:34:45] Richard's spotlight is ADHD medication. [00:35:16] Rodrigo's spotlight is Bioconductor.
Links SustainOSS (https://sustainoss.org/) SustainOSS Twitter (https://twitter.com/SustainOSS?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor) SustainOSS Discourse (https://discourse.sustainoss.org/) Rodrigo Mendoza Twitter (https://twitter.com/r0dms) Rodrigo Mendoza LinkedIn (https://www.linkedin.com/in/r0dms/) Quine (https://quine.sh/) Quine Twitter (https://twitter.com/quine_sh) GitNFT (https://gitnft.quine.sh/) GitNFT Twitter (https://mobile.twitter.com/gitnft) “Devs have eaten the world,” by Rodrigo Mendoza (https://medium.com/quine/devs-have-eaten-the-world-523c8a1d1da2) Exercism (https://exercism.org/) Firefox Browser Developer Edition (https://www.mozilla.org/en-US/firefox/developer/) Sustain Discourse-GitNFT discussion (https://discourse.sustainoss.org/t/gitnft-autograph-and-sell-your-github-commits/885) Bioconductor (https://www.bioconductor.org/) Credits Produced by Richard Littauer (https://www.burntfen.com/) Edited by Paul M. Bahr at Peachtree Sound (https://www.peachtreesound.com/) Show notes by DeAnn Bahr Peachtree Sound (https://www.peachtreesound.com/) Special Guest: Rodrigo Mendoza-Smith.
Installing packages, {distill} for personal websites, and Shiny app stories

Episode Links
This week's curator: Jonathan Carroll (@carroll_jono (https://twitter.com/carroll_jono)) The Comprehensive Guide to Installing R Packages from CRAN, Bioconductor, GitHub and Co. (https://thomasadventure.blog/posts/install-r-packages/) Distill it down (https://education.rstudio.com/blog/2021/02/distill-it-down/) Introducing Shiny App Stories (https://blog.rstudio.com/2021/02/12/shiny-app-stories/)

Supplemental Resources
20 Years of R (https://github.com/revodavid/20-years-of-R) Package management basics (Mozilla Developer Network Web Docs) (https://developer.mozilla.org/en-US/docs/Learn/Tools_and_testing/Understanding_client-side_tools/Package_management) Sharing on Short Notice: How to Get Your Materials Online with R Markdown (https://rstudio.com/resources/webinars/sharing-on-short-notice-how-to-get-your-materials-online-with-r-markdown/) Building a {distill} website (https://lisalendway.netlify.app/posts/2020-12-09-buildingdistill) Shiny Developer Series Episode 1: Shiny Development Past and Future (https://shinydevseries.com/post/episode-1-shiny-development-past-and-future/) Shiny Developer Series Episode 5: shinysense and custom javascript visualizations (https://shinydevseries.com/post/episode-5-shinysense/)
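To make the first link above concrete, here is a minimal sketch of the three common installation routes in R; the package names are arbitrary examples.

    # From CRAN
    install.packages("ggplot2")

    # From Bioconductor, via the BiocManager helper package on CRAN
    install.packages("BiocManager")
    BiocManager::install("limma")

    # From GitHub, via the remotes package
    install.packages("remotes")
    remotes::install_github("rstudio/distill")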
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.16.385211v1?rss=1 Authors: Jalili, V., Clements, D., Gruning, B., Blankenberg, D., Goecks, J. Abstract: A growing number of biomedical methods and protocols are being disseminated as open-source software packages. When put in concert with other packages, they can execute in-depth and comprehensive computational pipelines. Their integration with other software packages therefore plays a prominent role in their adoption, in addition to their availability. Accordingly, package management systems have been developed to standardize the discovery and integration of software packages. Here we study the impact of package management systems on software dissemination and scholarly recognition. We study the citation pattern of more than 18,000 scholarly papers referenced by more than 23,000 software packages hosted by Bioconda, Bioconductor, BioTools, and ToolShed, the package management systems primarily used by the bioinformatics community. Our results provide significant evidence that a scholarly paper's citation count increases after its software is published to a package management system, and they show that the impact of the different package management systems on scholarly recognition is of the same magnitude. These results may motivate scientists to distribute their software via package management systems, facilitating the composition of computational pipelines and helping reduce redundancy in package development.
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.09.08.287516v1?rss=1 Authors: Eling, N., Damond, N., Hoch, T., Bodenmiller, B. Abstract: Highly multiplexed imaging technologies enable spatial profiling of dozens of biomarkers in situ. Standard data processing pipelines quantify cell-specific features and generate object segmentation masks as well as multi-channel images. Multiplexed imaging data can therefore be visualised across two layers of information: pixel intensities represent the spatial expression of biomarkers across an image, while segmented objects visualise cellular morphology, interactions and cell phenotypes in their microenvironment. Here we describe cytomapper, a computational tool that enables visualisation of the pixel- and cell-level information obtained by multiplexed imaging. The package is written in the statistical programming language R, integrates with the image and single-cell analysis infrastructure of the Bioconductor project, and allows visualisation of anywhere from a single image to hundreds of images in parallel. Using cytomapper, expression of multiple markers is displayed as composite images, segmentation masks are coloured based on cellular features, and selected cells can be outlined in images based on their cell type, among other functions. We illustrate the utility of cytomapper by analysing 100 images obtained by imaging mass cytometry from a cohort of type 1 diabetes patients and healthy individuals. In addition, cytomapper includes a Shiny application that allows hierarchical gating of cells based on marker expression and visualisation of the selected cells in the corresponding images. Together, cytomapper offers tools for diverse image and single-cell visualisation approaches and supports robust cell phenotyping via gating.
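As a hedged sketch of the two visualisation layers described above, using the example data shipped with cytomapper; function and argument names follow the package vignette as we recall it and may differ between versions.

    library(cytomapper)
    data("pancreasImages", "pancreasMasks", "pancreasSCE")

    # Pixel level: composite image of selected marker channels
    plotPixels(pancreasImages, colour_by = c("H3", "CD99"))

    # Cell level: colour segmentation masks by cell phenotype
    plotCells(pancreasMasks, object = pancreasSCE,
              img_id = "ImageNb", cell_id = "CellNb",
              colour_by = "CellType")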
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.09.04.282731v1?rss=1 Authors: Munz, M., Khodaygani, M., Aherrahrou, Z., Busch, H., Wohlers, I. Abstract: Mice are the most widely used animal model for studying genotype-phenotype relationships. Inbred mice are genetically identical, which eliminates genetic heterogeneity and makes them particularly useful for genetic studies. Many different strains have been bred over decades and a vast amount of phenotypic data has been generated. Recently, whole-genome-sequencing-based genome-wide genotype data for many widely used inbred strains has also been released. Here, we present an approach for in silico fine mapping that uses genotypic data of 37 inbred mouse strains together with phenotypic data provided by the user to propose candidate variants and genes for the phenotype under study. Public genome-wide genotype data covering more than 74 million variant sites are queried efficiently in real time to provide those variants that are compatible with the observed phenotype differences between strains. Variants can be filtered by molecular consequences and by corresponding molecular impact, and candidate gene lists can be generated from variant lists on the fly. Fine mapping, together with annotation and filtering of results, is provided in a Bioconductor package called MouseFM. For albinism, MouseFM reports only one variant allele of moderate or high molecular impact that only albino mice share: a missense variant in the Tyr gene, previously reported to be causal for this phenotype. Performing in silico fine mapping for interfrontal bone formation in mice using four strains with and five strains without an interfrontal bone yields 12 genes; of these, three are related to skull-shaping abnormalities. Finally, performing fine mapping for dystrophic cardiac calcification by comparing 8 strains showing the phenotype with 8 strains lacking it, we identify only one moderate-impact variant, in the known causal gene Abcc6. In summary, this illustrates the benefit of using MouseFM for candidate variant and gene identification.
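A hedged sketch of what an in silico fine-mapping query with MouseFM might look like; the finemap() interface is recalled from the package documentation, and the strain lists below are hypothetical placeholders rather than the contrasts used in the paper.

    library(MouseFM)

    # Contrast strains showing a phenotype (strain1) against strains lacking
    # it (strain2) on chromosome 1. Strain names are hypothetical examples,
    # and argument formats may differ by package version.
    res <- finemap(chr = 1,
                   strain1 = c("C57BL_6J", "AKR_J"),
                   strain2 = c("BALB_cJ", "DBA_2J"))
    head(res)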
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.09.03.282186v1?rss=1 Authors: Stilianoudakis, S., Dozmorov, M. G. Abstract: High-throughput chromosome conformation capture technology (Hi-C) revealed extensive DNA looping and folding into discrete 3D domains. These include Topologically Associating Domains (TADs) and chromatin loops, the 3D domains critical for cellular processes like gene regulation and cell differentiation. The relatively low resolution of Hi-C data (regions several kilobases in size) prevents precise mapping of domain boundaries. However, the high resolution of genomic annotations associated with boundaries, such as CTCF and members of the cohesin complex, suggests they can inform the precise location of domain boundaries. Several methods have attempted to leverage genome annotation data for predicting domain boundaries; however, they overlooked key characteristics of the data, such as the spatial association between an annotation and a boundary and the much smaller number of boundaries relative to the rest of the genome (class imbalance). We developed preciseTAD, an optimized random forest model for improved localization of domain boundaries. Trained on high-resolution genome annotation data and boundaries from low-resolution Hi-C data, the model predicts the location of boundaries at base-level resolution. We investigated several feature engineering and resampling techniques (random over- and under-sampling, the Synthetic Minority Over-sampling TEchnique (SMOTE)) to select the most optimal data characteristics and address class imbalance. Density-based clustering and scalable partitioning techniques were used to identify the precise location of boundary regions and summit points. We benchmarked our method against the Arrowhead domain caller and a novel chromatin loop prediction algorithm, Peakachu, on the two most annotated cell lines. We found that the spatial relationship (distance in the linear genome) between boundaries and annotations has the most predictive power. Transcription factor binding sites outperformed other genome annotation types. Random under-sampling significantly improved model performance. Boundaries predicted by preciseTAD were more enriched for CTCF, RAD21, SMC3, and ZNF143 signal and more conserved across cell lines, highlighting their higher biological significance. The model pre-trained in one cell line performs well in predicting boundaries in another cell line using only genomic annotations, enabling the detection of domain boundaries in cells without Hi-C data. Our study implements the method and the pre-trained models for precise domain boundary prediction using genome annotation data; the precise identification of domain boundaries will improve our understanding of how genomic regulators shape the 3D structure of the genome. The preciseTAD R package is available at https://dozmorovlab.github.io/preciseTAD/ and on Bioconductor (submitted).
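preciseTAD's own interface is not shown in the abstract, but the under-sampling result can be illustrated generically. The sketch below is not the authors' code; it shows balanced random under-sampling combined with a random forest via the randomForest package's strata/sampsize arguments, on toy data standing in for annotation features.

    library(randomForest)

    # Toy imbalanced data: y is a binary factor (boundary vs. background),
    # X holds hypothetical distance-to-annotation features.
    set.seed(1)
    n <- 2000
    X <- data.frame(dist_ctcf = rexp(n), dist_rad21 = rexp(n))
    y <- factor(c(rep("boundary", 100), rep("background", n - 100)))

    # Randomly under-sample the majority class within each tree by drawing
    # the same number of cases from both classes.
    n_min <- min(table(y))
    fit <- randomForest(X, y, strata = y, sampsize = c(n_min, n_min))
    print(fit)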
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.18.254680v1?rss=1 Authors: Bhattacharya, S., Barseghyan, H., Délot, E. C., Vilain, E. Abstract: Whole genome sequencing is effective at identifying small variants but, because it is based on short reads, its assessment of structural variants (SVs) is limited. The advent of Optical Genome Mapping (OGM), which utilizes long, fluorescently labeled DNA molecules for de novo genome assembly and SV calling, has allowed for increased sensitivity and specificity in SV detection. However, compared to small-variant annotation tools, OGM-based SV annotation software has seen little development, and currently available SV annotation tools do not provide sufficient information for determining variant pathogenicity. We developed an R-based package, nanotatoR, which provides comprehensive annotation as a tool for SV classification. nanotatoR uses both external (DGV; DECIPHER; Bionano Genomics BNDB) and internal (user-defined) databases to estimate SV frequency. BED files based on the human genome references GRCh37/38 are used to annotate SVs with overlapping, upstream, and downstream genes. Overlap percentages and distances for the nearest genes are calculated and can be used for filtration. A primary gene list is extracted from public databases based on the patient's phenotype and used to filter genes overlapping SVs, providing the analyst with an easy way to prioritize variants. If available, expression of overlapping or nearby genes of interest is extracted (e.g. from an RNA-Seq dataset), allowing the user to assess the effects of SVs on the transcriptome. Most quality-control filtration parameters are customizable by the user. The output is given in an Excel file format, subdivided into multiple sheets based on SV type and inheritance pattern (INDELs, inversions, translocations, de novo, etc.). nanotatoR passed all quality and run-time criteria of Bioconductor and was accepted in the April 2019 release. We evaluated nanotatoR's annotation capabilities using publicly available reference datasets: the singleton sample NA12878, mapped with two types of enzyme labeling, and the NA24143 trio. nanotatoR was also able to accurately filter the known pathogenic variants in a cohort of patients with Duchenne Muscular Dystrophy for which we had previously demonstrated the diagnostic ability of OGM. The extensive annotation enables users to rapidly identify potentially pathogenic SVs, a critical step toward use of OGM in the clinical setting.
Interview with Stuart Lee, a PhD candidate from Monash University. We discuss software development, data visualization, and Bioconductor. - Stephanie's Google slides: https://docs.google.com/presentation/d/1AlwLGTlc3ZFxY8PLpCZ5cwD3ktQyQqb1V6qub2I8360/edit?usp=sharing - Stuart's blog post on making rookie mistakes and how to fix them when making plots: https://www.stuartlee.org/post/content/post/2018-04-14-rookie-mistakes/ - Bioconductor conference: http://bioc2019.bioconductor.org
Seth is the VP of Engineering at Chef. He is a product-focused engineering leader who believes the essential elements for a high-performing team are shared purpose, clear communication, commitment to learning, and mechanisms for measuring outcomes. Seth began his career melding data analysis, software engineering, and open source project management at the Fred Hutchinson Cancer Research Center where he worked on the Bioconductor project and contributed to R. In 2010, he joined Chef Software, an IT company that helps enterprises increase their velocity through automation of infrastructure, compliance, and applications. At Chef, his roles have spanned software development engineer to VP of Engineering. LinkedIn: https://www.linkedin.com/in/sethfalcon/ Twitter: https://twitter.com/sfalcon
Faculty of Mathematics, Computer Science and Statistics - Digital University Publications of the LMU - Part 02/02
The methods of molecular biology for the quantitative measurement of gene expression have undergone rapid development in the past two decades. High-throughput assays with microarray and RNA-seq technology now enable whole-genome studies in which several thousand genes can be measured at a time. However, this has also imposed serious challenges on data storage and analysis, which are the subject of the young but rapidly developing field of computational biology. Explaining observations made on such a large scale requires suitable, accordingly scaled models of gene regulation. Detailed models, as available for single genes, need to be extended and assembled into larger networks of regulatory interactions between genes and gene products. Incorporating such networks into methods for data analysis is crucial to identify the molecular mechanisms that drive the observed expression. As methods for this purpose emerge in parallel to each other and without a known standard of truth, results need to be critically checked in a competitive setup and in the context of the available rich literature corpus. This work is centered on and contributes to the following subjects, each of which represents an important and distinct research topic in the field of computational biology: (i) construction of realistic gene regulatory network models; (ii) detection of subnetworks that are significantly altered in the data under investigation; and (iii) systematic biological interpretation of detected subnetworks. For the construction of regulatory networks, I review existing methods with a focus on curation and inference approaches. I first describe how literature curation can be used to construct a regulatory network for a specific process, using the well-studied diauxic shift in yeast as an example. In particular, I address the question of how a detailed understanding, as available for the regulation of single genes, can be scaled up to the level of larger systems. I subsequently inspect methods for large-scale network inference, showing that they are significantly skewed towards master regulators. A recalibration strategy is introduced and applied, yielding an improved genome-wide regulatory network for yeast. To detect significantly altered subnetworks, I introduce GGEA as a method for network-based enrichment analysis. The key idea is to score regulatory interactions within functional gene sets for consistency with the observed expression. Compared to other recently published methods, GGEA yields results that consistently and coherently align expression changes with known regulation types and that are thus easier to explain. I also suggest and discuss several significant enhancements to the original method that improve its applicability, outcome, and runtime. For the systematic detection and interpretation of subnetworks, I have developed the EnrichmentBrowser software package. It implements several state-of-the-art methods besides GGEA and allows combining and exploring results across methods. As part of the Bioconductor repository, the package provides unified access to the different methods and thus greatly simplifies their usage for biologists. Extensions to this framework that support automating biological interpretation routines are also presented. In conclusion, this work contributes substantially to the research field of network-based analysis of gene expression data with respect to regulatory network construction, subnetwork detection, and their biological interpretation.
The thesis also covers recent developments and areas of ongoing research, which are discussed in the context of current and future questions arising from the new generation of genomic data.
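A hedged sketch of running GGEA through the EnrichmentBrowser package; the function names follow the package vignette as we recall it and may differ between versions, and 'se' is assumed to be a SummarizedExperiment with sample groups already assigned.

    library(EnrichmentBrowser)

    # Differential expression analysis on the assumed input 'se'
    se <- deAna(se)

    # Gene sets and a gene regulatory network compiled from KEGG
    gs  <- getGenesets(org = "hsa", db = "kegg")
    grn <- compileGRN(org = "hsa", db = "kegg")

    # Network-based enrichment with GGEA, then interactive exploration
    res <- nbea(method = "ggea", se = se, gs = gs, grn = grn)
    eaBrowse(res)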
We are in the midst of a renaissance in the biological sciences, which is spurring the growth of brand new fields like functional and comparative genomics. These new fields are revealing novel insights into evolutionary biology, medicine, developmental biology and many other areas, transforming the way scientists look at life. Join the California Academy of Sciences to learn about genomics, hear about compelling current research, and explore the future of this rapidly advancing field. Katherine Pollard received her Ph.D. and M.A. from the UC Berkeley Division of Biostatistics under the supervision of Mark van der Laan. Her research at Berkeley included developing computationally intensive statistical methods for analysis of microarray data with applications in cancer biology. After graduating, she did a postdoc at UC Berkeley with Sandrine Dudoit. She developed Bioconductor open source software packages for clustering and multiple hypothesis testing. In 2003, she began a comparative genomics NIH Postdoctoral Fellowship in the labs of David Haussler and Todd Lowe in the Center for Biomolecular Science & Engineering at UC Santa Cruz. She was part of the Chimpanzee Sequencing and Analysis Consortium that published the sequence of the Chimp Genome, and she used this sequence to identify the fastest evolving regions in the human genome. In 2005, she joined the faculty at the UC Davis Genome Center and Department of Statistics. She moved to UCSF in Fall 2008.
Background: Chromatin immunoprecipitation combined with DNA microarrays (ChIP-chip) is an assay used for investigating DNA-protein binding and post-translational chromatin/histone modifications. As with all high-throughput technologies, it requires thorough bioinformatic processing of the data, for which there is no standard yet. The primary goal is to reliably identify and localize genomic regions that bind a specific protein. Further investigation compares binding profiles of functionally related proteins, or binding profiles of the same proteins in different genetic backgrounds or experimental conditions. Ultimately, the goal is to gain a mechanistic understanding of the effects of DNA-binding events on gene expression. Results: We present Starr, a free, open-source R/Bioconductor package that facilitates comparative analysis of ChIP-chip data across experiments and across different microarray platforms. The package provides functions for data import, quality assessment, data visualization and exploration. Starr includes high-level analysis tools such as the alignment of ChIP signals along annotated features, correlation analysis of ChIP signals with complementary genomic data, peak-finding, and comparative display of multiple clusters of binding profiles. It uses standard Bioconductor classes for maximum compatibility with other software. Moreover, Starr automatically updates microarray probe annotation files by a highly efficient remapping of microarray probe sequences to an arbitrary genome. Conclusion: Starr is an R package that covers the complete ChIP-chip workflow from data processing to binding pattern detection. It focuses on high-level data analysis, e.g., it provides methods for the integration and combined statistical analysis of binding profiles and complementary functional genomics data. Starr enables systematic assessment of binding behaviour for groups of genes that are aligned along arbitrary genomic features.
Faculty of Mathematics, Computer Science and Statistics - Digital University Publications of the LMU - Part 01/02
In the 1990s, a number of technological innovations appeared that revolutionized biology, and 'Bioinformatics' became a new scientific discipline. Microarrays can measure the abundance of tens of thousands of mRNA species, data on the complete genomic sequences of many different organisms are available, and other technologies make it possible to study various processes at the molecular level. In bioinformatics and biostatistics, current research and computations are limited by the available computer hardware. However, this problem can be solved using high-performance computing resources. There are several reasons for the increased focus on high-performance computing: larger data sets, increased computational requirements stemming from more sophisticated methodologies, and the latest developments in computer chip production. The open-source programming language 'R' was developed to provide a powerful and extensible environment for statistical and graphical techniques. There are many good reasons for preferring R to other software or programming languages for scientific computations (in statistics and biology). However, the development of the R language was not aimed at providing software for parallel or high-performance computing. Nonetheless, during the last decade, a great deal of research has been conducted on using parallel computing techniques with R. This PhD thesis demonstrates the usefulness of the R language and parallel computing for biological research. It introduces parallel computing with R, and reviews and evaluates existing techniques and R packages for parallel computing on computer clusters, on multi-core systems, and in grid computing. From a computer-scientific point of view, the packages were examined as to their reusability in biological applications, and some upgrades were proposed. Furthermore, parallel applications for next-generation sequencing data and for the preprocessing of microarray data were developed. Microarray data are characterized by high levels of noise and bias. As these perturbations have to be removed, preprocessing of raw data has been a high-priority research topic over the past few years. A new Bioconductor package called affyPara for parallelized preprocessing of high-density oligonucleotide microarray data was developed and published. The partitioning of data across arrays can be performed using a block-cyclic partition, and, as a result, parallelization of algorithms becomes directly possible. Existing statistical algorithms and data structures had to be adjusted and reformulated for use in parallel computing. Using the new parallel infrastructure, normalization methods can be enhanced and new methods become available. The partitioning of data and its distribution to several nodes or processors solves the main-memory problem and accelerates the methods by up to a factor of fifteen for 300 arrays or more. The final part of the thesis contains a large cancer study analysing more than 7000 microarrays from a publicly available database and estimating gene interaction networks. For this purpose, a new R package for microarray data management was developed, and various challenges regarding the analysis of this amount of data are discussed. The comparison of gene networks for different pathways and different cancer entities in this large data set partly confirms already established forms of gene interaction.
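As a generic illustration of the cluster-style parallelism the thesis evaluates (not code from affyPara itself), here is a minimal sketch using the parallel package that ships with R; the per-array work is a placeholder.

    library(parallel)

    # Placeholder standing in for a per-array preprocessing step
    normalize_one <- function(i) mean(rnorm(1e5, mean = i))

    # Partition 300 "arrays" across local worker processes
    cl <- makeCluster(max(1, detectCores() - 1))
    res <- parLapply(cl, 1:300, normalize_one)
    stopCluster(cl)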
Background: For the last eight years, microarray-based classification has been a major topic in statistics, bioinformatics and biomedical research. Traditional methods often yield unsatisfactory results or may even be inapplicable in the so-called "p >> n" setting, where the number of predictors p by far exceeds the number of observations n, hence the term "ill-posed problem". Careful model selection and evaluation satisfying accepted good-practice standards is a very complex task for statisticians without experience in this area or for scientists with limited statistical background. The multiplicity of available methods for class prediction based on high-dimensional data is an additional practical challenge for inexperienced researchers. Results: In this article, we introduce a new Bioconductor package called CMA (standing for "Classification for MicroArrays") for automatically performing variable selection, parameter tuning, classifier construction, and unbiased evaluation of the constructed classifiers using a large number of standard methods. Without much time and effort, users are provided with an overview of the unbiased accuracy of most top-performing classifiers. Furthermore, the standardized evaluation framework underlying CMA can also be beneficial in statistical research for comparison purposes, for instance if a new classifier has to be compared to existing approaches. Conclusion: CMA is a user-friendly, comprehensive package for classifier construction and evaluation implementing most of the usual approaches. It is freely available from the Bioconductor website at http://bioconductor.org/packages/2.3/bioc/html/CMA.html.
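A hedged sketch of the CMA workflow described above; the function names are recalled from the package vignette and may differ between versions, and X (expression matrix, samples x genes) and y (class labels) are assumed inputs.

    library(CMA)

    # Resampling scheme, gene selection, classification, and evaluation
    ls  <- GenerateLearningsets(y = y, method = "CV", fold = 5, strat = TRUE)
    sel <- GeneSelection(X = X, y = y, learningsets = ls, method = "t.test")
    cls <- classification(X = X, y = y, learningsets = ls,
                          genesel = sel, nbgene = 50, classifier = dldaCMA)
    ev  <- evaluation(cls, measure = "misclassification")
    show(ev)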