Podcasts about Random forest

An ensemble machine learning method

  • 50 podcasts
  • 91 episodes
  • 30m average duration
  • Infrequent episodes
  • Latest episode: Apr 17, 2025
Random forest popularity trend, 2017–2024 (chart)


Latest podcast episodes about Random forest

IMMOblick
Evidenz statt Schätzung: Die Zukunft der Immobilienbewertung mit Machine Learning

Apr 17, 2025 · 36:01


How accurate is a value, really? When it comes to property valuations, what matters is not only the estimated value itself but also how reliable it is. In this episode of IMMOblick, Peter Ache and Robert Krägenbring discuss how modern statistics and machine learning can make valuations not only more precise but also more transparent. You will learn how modern methods such as bootstrapping and Random Forest help make the reliability of valuation models measurable, and why representativeness is more than a gut feeling. The question of how new methods fit into the existing legal framework also gets its due. If you are curious about how much technology already goes into valuation today, and what else is possible, this episode is exactly right for you: technically deep, clearly explained, and with a clear view of the future of valuation. More information: Website: https://dvw.de/publikationen/immoblick Social media: LinkedIn | Instagram | Facebook
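
The episode's theme of making a valuation model's reliability measurable can be illustrated with a short sketch. Everything below (data, features, model settings) is a hypothetical stand-in rather than the tooling discussed in the show; the point is how bootstrap resampling turns a single point estimate into an interval:

```python
# Minimal sketch: bootstrap a random forest valuation model to obtain a
# confidence interval for one property's estimated value. All data are
# synthetic placeholders.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=500, n_features=5, noise=25.0, random_state=0)

preds = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))   # resample rows with replacement
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[idx], y[idx])
    preds.append(model.predict(X[:1])[0])        # re-estimate one "property"

lo, hi = np.percentile(preds, [2.5, 97.5])
print(f"estimate {np.mean(preds):.1f}, 95% bootstrap interval [{lo:.1f}, {hi:.1f}]")
```

A narrow interval signals a well-supported estimate; a wide one flags exactly the kind of unreliability the hosts want to make visible.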

Dev Sem Fronteiras
Engenheiro de Software na Meta no Vale do Silício, Estados Unidos - Dev Sem Fronteiras #171

Dec 19, 2024 · 37:41


Leonardo, from Campinas, had his first contact with technology by accident. When he enrolled in a computing course at his technical high school, he expected to learn Photoshop, when in fact he would learn to program. That put him on the path to studying Computer Science at Unicamp and, after short stints at a few companies, a 15-year career at CI&T. At some point, a CI&T project earned him the opportunity to move to Texas, where he spent another five years before deciding to satisfy a curiosity that had been nagging him for years: what would it be like to work at a Big Tech? After reconnecting with a recruiter, Leonardo went through one of the most coveted hiring processes in the world and today works on one of WhatsApp's development teams. In this episode, Leonardo recounts his interview and adaptation process, as well as the similarities and differences of living in the land where you don't dial 911 and the land where silicon is worth gold. Fabrício Carraro, your polyglot traveler. Leonardo Jesus, Software Engineer at Meta in Silicon Valley, United States. Links: Leetcode. WhatsApp campaign featuring the Modern Family cast. Levels.FYI. Check out Alura's course "Spark: trabalhando com regressão": learn to use algorithms such as Decision Tree and Random Forest and solve regression problems with Spark's machine learning tools. TechGuide.sh, a map of the main technologies the market demands for different careers, with our suggestions and opinions. #7DaysOfCode: put your programming knowledge into practice with free daily challenges at https://7daysofcode.io/ Listeners of Dev Sem Fronteiras get a 10% discount on all Alura Língua plans: go to https://www.aluralingua.com.br/promocao/devsemfronteiras/ and start learning English and Spanish today! Production and content: Alura Língua online language courses – https://www.aluralingua.com.br/ Alura online technology courses – https://www.alura.com.br/ Editing and sound: Rede Gigahertz de Podcasts

IMMOblick
Revolution in der Immobilienbewertung: Wie KI und AVMs das Tagesgeschäft transformieren

May 23, 2024 · 32:50


In this episode, Peter Ache and Robert Krägenbring talk about the use of artificial intelligence (AI) and Automated Valuation Models (AVMs) in property valuation. A lot has happened since their last discussion of the topic. As Peter puts it: "AI is here to stay." The two hosts discuss how new technologies often meet with skepticism before gaining broad acceptance, a pattern seen with the introduction of many other groundbreaking technologies. At the level of the International Federation of Surveyors (FIG), there is global discussion of how AI can be put to sensible use in property valuation, and an upcoming international workshop on AVMs provides the occasion for a deeper discussion in this episode. AVMs are computer-based systems that use mathematical models and algorithms to estimate property values. They draw on extensive databases of property prices, location data, property characteristics, and market trends from sources such as public records and real-estate portals. Modern AVMs employ advanced algorithms and AI techniques such as machine learning to deliver more precise and reliable valuations; these algorithms analyze historical data and identify patterns that help in valuing properties. The advantages of AVMs are obvious: they deliver fast valuations and speed up the valuation process considerably, and automated models provide objective valuations because they are not influenced by subjective opinions. The discussion also covers machine learning methods such as Random Forest, which can be used for both classification and regression tasks. Peter and Robert weigh the opportunities that AI and AVMs offer against the risks and limits of these technologies, and give listeners a sense of how AVMs work and where they are applied. Learn more about current developments and how these technologies could revolutionize property valuation. In closing, the hosts repeat their appeal that Germany must catch up on digitalization to remain internationally competitive. Hosts: Peter Ache, head of the DVW e.V. working group on property valuation, and Robert Krägenbring, deputy head of the DVW e.V. working group on property valuation. More information: Website: https://dvw.de Social media: LinkedIn | Instagram | Facebook
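
As a rough illustration of the AVM idea described above (a statistical model mapping property attributes to an estimated value), here is a minimal random forest regression sketch. The feature names and price process are invented for the example and do not come from any real AVM:

```python
# Toy AVM: estimate property value from a few attributes. All columns and
# prices below are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "living_area_m2": rng.uniform(40, 250, n),
    "plot_area_m2": rng.uniform(100, 1200, n),
    "year_built": rng.integers(1900, 2023, n),
    "distance_to_center_km": rng.uniform(0, 30, n),
})
# Hypothetical price process standing in for comparable-sales data.
price = (3500 * df["living_area_m2"] + 150 * df["plot_area_m2"]
         + 800 * (df["year_built"] - 1900)
         - 9000 * df["distance_to_center_km"]
         + rng.normal(0, 40000, n))

X_tr, X_te, y_tr, y_te = train_test_split(df, price, random_state=0)
avm = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out sales:", round(avm.score(X_te, y_te), 3))
```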

Society of Actuaries Podcasts Feed
Emerging Topics Community: Return to Trees, Part 4: Gradient Boosting Machines

May 1, 2024 · 26:08


In the final episode of this mini-series, Shea and Anders cover the other common tree-based ensemble model, the Gradient Boosting Machine. Like Random Forests, GBMs make use of a large number of decision trees, but they use a “boosting” approach that cleverly makes use of “weak learners” to incrementally extract information from the data. After an explanation of how GBMs work, we compare them to Random Forests and go over a few examples where they have used GBMs in their own work.
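
The contrast the episode draws, many independent deep trees averaged (Random Forest) versus many shallow weak learners added sequentially (GBM), maps directly onto scikit-learn's two estimators. A minimal comparison on synthetic data; the dataset and settings are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)

models = {
    # deep, independently grown trees, averaged: variance reduction
    "random forest": RandomForestRegressor(n_estimators=300, random_state=0),
    # shallow "weak learner" trees, each fit to the ensemble's remaining error:
    # the incremental extraction of information described above
    "gradient boosting": GradientBoostingRegressor(n_estimators=300, max_depth=3,
                                                   learning_rate=0.1, random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: R^2 = {score:.3f}")
```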

Papers Read on AI
From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples

Apr 16, 2024 · 36:41


We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc) can do linear and non-linear regression when given in-context examples, without any additional training or gradient updates. Our findings reveal that several large language models (e.g., GPT-4, Claude 3) are able to perform regression tasks with a performance rivaling (or even outperforming) that of traditional supervised methods such as Random Forest, Bagging, or Gradient Boosting. For example, on the challenging Friedman #2 regression dataset, Claude 3 outperforms many supervised methods such as AdaBoost, SVM, Random Forest, KNN, or Gradient Boosting. We then investigate how well the performance of large language models scales with the number of in-context exemplars. We borrow from the notion of regret from online learning and empirically show that LLMs are capable of obtaining a sub-linear regret. 2024: Robert Vacareanu, Vlad-Andrei Negru, Vasile Suciu, M. Surdeanu https://arxiv.org/pdf/2404.07544v1.pdf
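
The supervised baselines named in the abstract are easy to reproduce in spirit, since scikit-learn ships a Friedman #2 generator. This sketch only mirrors the setup of the comparison; the paper's exact noise level, splits, and metrics differ:

```python
from sklearn.datasets import make_friedman2
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Friedman #2: the regression benchmark mentioned in the abstract.
X, y = make_friedman2(n_samples=500, noise=50.0, random_state=0)

baselines = {
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor()),  # scale-sensitive
}
for name, model in baselines.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:>17}: R^2 = {r2:.3f}")
```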

Society of Actuaries Podcasts Feed
Emerging Topics Community: Return to Trees, Part 3: Random Forest

Apr 10, 2024 · 31:09


Building on the discussion of individual decision trees in the prior episode, Shea and Anders shift to one of today's most popular ensemble models, the Random Forest. At first glance, the algorithm may seem like a brute force approach of simply running hundreds or thousands of decision trees, but it leverages the concept of “bagging” to avoid overfitting and attempt to learn as much as possible from the entire data sets, not just a few key features. We close by covering strengths and weaknesses of this model and providing some real-life examples.
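
The "bagging" mechanism described here is visible directly in scikit-learn: each tree trains on a bootstrap resample, and the rows a given tree never saw (out-of-bag) provide a built-in check against overfitting. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)

# Each tree sees a bootstrap resample of the rows ("bagging") and a random
# subset of features at every split: the two decorrelation tricks that keep
# hundreds of trees from collectively overfitting or fixating on a few
# key features.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)

# Out-of-bag rows act as a free validation set, no holdout needed.
print("OOB accuracy:", round(rf.oob_score_, 3))
```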

Society of Actuaries Podcasts Feed
Emerging Topics Community: Return to Trees, Part 2: Single Decision Trees

Mar 20, 2024 · 42:01


Shea and Anders dive into tree-based algorithms, starting with the most fundamental variety, the single decision tree. We cover the mechanics of a decision tree and provide a comparison to linear models. A solid understanding of how a decision tree works is critical to fully grasp the nuances of the more powerful ensemble models, the Random Forest and Gradient Boosting Machine. In addition, single decision trees can still be useful either as a starting point for building more complex models or for situations where interpretability is paramount.
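
On the interpretability point: a single shallow tree can be printed as human-readable rules, something no ensemble offers as directly. A small sketch using scikit-learn and the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Every prediction can be traced through explicit threshold rules, which is
# why single trees remain useful when interpretability is paramount.
print(export_text(tree, feature_names=list(iris.feature_names)))
```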

ExplAInable
עושים כבוד לעצים

Mar 18, 2024 · 12:17


Neural networks of all kinds get plenty of attention, but in practice many projects do not need neural networks at all. Tree-based models are usually the simple and effective solution for tabular data. In this short episode we review decision trees, how they are trained, and the problem of overfitting. We also discuss the two most common extensions, Random Forest and Gradient Boosted Trees, and the advantages of using such well-established models in a production environment.

GeocHemiSTea
Head in the Clouds, Feet in the Data with Britt Bluemel

Mar 13, 2024 · 46:16


For this episode we read: Using machine learning to estimate a key missing geochemical variable in mining exploration: application of the Random Forest algorithm to multi-sensor core logging data (Schnitzler et al., 2019). A big difference between applied geochemistry and machine learning is the terminology, but once you start to chip away at this, like Britt, you will realize that the two disciplines are not so different. Join us as we talk about dimensionality reductions, transformations, and workflows before and after her introduction to the realm of data science. We also talk about a really neat paper that used random forest to predict sodium for an alteration study. --- Support this podcast: https://podcasters.spotify.com/pod/show/geochemistea/support
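
The paper's core move, regressing a missing geochemical variable on multi-sensor core-logging features with a random forest, can be sketched in a few lines. Everything below (feature count, the synthetic "sodium" response) is a made-up stand-in for Schnitzler et al.'s actual data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
# stand-ins for multi-sensor core-logging channels (density, mag-sus, XRF, ...)
X = rng.normal(size=(n, 6))
# hypothetical sodium response with a nonlinear term plus noise
sodium = 2.0 + X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] ** 2 + rng.normal(0, 0.2, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, sodium, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", round(r2_score(y_te, rf.predict(X_te)), 3))
```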

Society of Actuaries Podcasts Feed
Emerging Topics Community: Return to Trees, Part 1: Overview of Tree-Based Algorithms

Feb 28, 2024 · 10:11


This is the first episode in a new mini-series, “Return to Trees”. In this series, Shea and Anders will be covering tree-based algorithms, including single decision trees, Random Forests, and Gradient Boosting Machines. In this episode, we discuss the rationale for re-visiting these models, which were covered in older podcast episodes in the mid-2010s, as well as an overview of what's to come in the remaining episodes of the mini-series.  

The Machine Learning Podcast
Improve The Success Rate Of Your Machine Learning Projects With bizML

Feb 18, 2024 · 50:22


Summary Machine learning is a powerful set of technologies, holding the potential to dramatically transform businesses across industries. Unfortunately, implementations of ML projects often fail to achieve their intended goals. This failure is due to a lack of collaboration and investment across technological and organizational boundaries. To help improve the success rate of machine learning projects Eric Siegel developed the six-step bizML framework, outlining the process to ensure that everyone understands the whole process of ML deployment. In this episode he shares the principles and promise of that framework and his motivation for encapsulating it in his book "The AI Playbook". Announcements Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery. Your host is Tobias Macey and today I'm interviewing Eric Siegel about how the bizML approach can help improve the success rate of your ML projects Interview Introduction How did you get involved in machine learning? Can you describe what bizML is and the story behind it? What are the key aspects of this approach that are different from the "industry standard" lifecycle of an ML project? What are the elements of your personal experience as an ML consultant that helped you develop the tenets of bizML? Who are the personas that need to be involved in an ML project to increase the likelihood of success? Who do you find to be best suited to "own" or "lead" the process? What are the organizational patterns that might hinder the work of delivering on the goals of an ML initiative? What are some of the misconceptions about the work involved in/capabilities of an ML model that you commonly encounter? What is your main goal in writing your book "The AI Playbook"? What are the most interesting, innovative, or unexpected ways that you have seen the bizML process in action? What are the most interesting, unexpected, or challenging lessons that you have learned while working on ML projects and developing the bizML framework? When is bizML the wrong choice? What are the future developments in organizational and technical approaches to ML that will improve the success rate of AI projects? Contact Info LinkedIn (https://www.linkedin.com/in/predictiveanalytics/) Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast (https://www.dataengineeringpodcast.com) covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site (https://www.themachinelearningpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com (mailto:hosts@themachinelearningpodcast.com) with your story. To help other people find the show please leave a review on iTunes (https://podcasts.apple.com/us/podcast/the-machine-learning-podcast/id1626358243) and tell your friends and co-workers.
Links The AI Playbook (https://www.machinelearningkeynote.com/the-ai-playbook): Mastering the Rare Art of Machine Learning Deployment by Eric Siegel Predictive Analytics (https://www.machinelearningkeynote.com/predictive-analytics): The Power to Predict Who Will Click, Buy, Lie, or Die by Eric Siegel Columbia University (https://www.columbia.edu/) Machine Learning Week Conference (https://machinelearningweek.com/) Generative AI World (https://generativeaiworld.events/) Machine Learning Leadership and Practice Course (https://www.predictiveanalyticsworld.com/machinelearningweek/workshops/machine-learning-course/) Rexer Analytics (https://www.rexeranalytics.com/) KD Nuggets (https://www.kdnuggets.com/) CRISP-DM (https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining) Random Forest (https://en.wikipedia.org/wiki/Random_forest) Gradient Descent (https://en.wikipedia.org/wiki/Gradient_descent) The intro and outro music is from Hitman's Lovesong feat. Paola Graziano (https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/)/CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)

Software Engineering Radio - The Podcast for Professional Software Developers
SE Radio 594: Sean Moriarity on Deep Learning with Elixir and Axon

Dec 14, 2023 · 57:43


Sean Moriarity, creator of the Axon deep learning framework, co-creator of the Nx library, and author of Machine Learning in Elixir and Genetic Algorithms in Elixir, published by the Pragmatic Bookshelf, speaks with SE Radio host Gavin Henry about what deep learning (neural networks) means today. Using a practical example with deep learning for fraud detection, they explore what Axon is and why it was created. Moriarity describes why the Beam is ideal for machine learning, and why he dislikes the term “neural network.” They discuss the need for deep learning, its history, how it offers a good fit for many of today's complex problems, where it shines and when not to use it. Moriarity goes into depth on a range of topics, including how to get datasets in shape, supervised and unsupervised learning, feed-forward neural networks, Nx.serving, decision trees, gradient descent, linear regression, logistic regression, support vector machines, and random forests. The episode considers what a model looks like, what training is, labeling, classification, regression tasks, hardware resources needed, EXGBoost, Jax, PyIgnite, and Explorer. Finally, they look at what's involved in the ongoing lifecycle or operational side of Axon once a workflow is put into production, so you can safely back it all up and feed in new data. Brought to you by IEEE Computer Society and IEEE Software magazine. This episode sponsored by Miro.

PaperPlayer biorxiv neuroscience
Source-Free Random Forest Model Calibration for Myoelectric Control

Jul 25, 2023


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.07.21.550033v1?rss=1 Authors: Jiang, X., Ma, C., Nazarpour, K. Abstract: Objective: Most existing machine learning models for myoelectric control require a large amount of data to learn user-specific characteristics of the electromyographic (EMG) signals, which is burdensome. Our objective is to develop an approach to enable the calibration of a pre-trained model with minimal data from a new myoelectric user. Approach: We trained a random forest model with EMG data from 20 people collected during the performance of multiple hand grips. To adapt the decision rules for a new user, first, the branches of the pre-trained decision trees were pruned using the validation data from the new user. Then new decision trees trained merely with data from the new user were appended to the pruned pre-trained model. Results: Real-time myoelectric experiments with 18 participants over two days demonstrated the improved accuracy of the proposed approach when compared to benchmark user-specific random forest and the linear discriminant analysis models. Furthermore, the random forest model that was calibrated on day one for a new participant yielded significantly higher accuracy on day two, when compared to the benchmark approaches, which reflects the robustness of the proposed approach. Significance: The proposed model calibration procedure is completely source-free, that is, once the base model is pre-trained, no access to the source data from the original 20 people is required. Our work promotes the use of efficient, explainable, and simple models for myoelectric control. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC
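
The appending step of the calibration procedure (new user-specific trees added to a pre-trained forest, with no source data needed) can be approximated in scikit-learn by concatenating `estimators_` lists. This is a toy illustration only: it omits the paper's pruning step, relies on an attribute scikit-learn does not officially support for this purpose, and assumes both forests saw the same set of class labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# "source" data stands in for the 20 pre-training participants' EMG features
X_src, y_src = make_classification(n_samples=2000, n_features=8, n_classes=4,
                                   n_informative=6, random_state=0)
# a small calibration set from one new user
X_new, y_new = make_classification(n_samples=100, n_features=8, n_classes=4,
                                   n_informative=6, random_state=1)

base = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_src, y_src)
user = RandomForestClassifier(n_estimators=20, random_state=2).fit(X_new, y_new)

# Append the user-specific trees; the source data are no longer required,
# which is what makes the procedure "source-free".
base.estimators_ = base.estimators_ + user.estimators_
base.n_estimators = len(base.estimators_)
print(base.predict(X_new[:5]))
```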

The AI Frontier Podcast
#21 - Ensemble Learning: Boosting, Bagging, and Random Forests in Machine Learning

Jun 11, 2023 · 11:05


Dive into this episode of The AI Frontier podcast, where we explore Ensemble Learning techniques like Boosting, Bagging, and Random Forests in Machine Learning. Learn about their applications, advantages, and limitations, and discover real-world success stories. Enhance your understanding of these powerful methods and stay ahead in the world of data science.
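
The three techniques in the title differ mainly in how the trees are grown and combined. A compact sketch putting scikit-learn's implementations side by side on synthetic data (settings are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=20, n_informative=6,
                           random_state=0)

models = {
    # bagging: full trees on bootstrap resamples, majority vote
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                 random_state=0),
    # boosting: weak learners reweighted toward previous mistakes
    "boosting (AdaBoost)": AdaBoostClassifier(n_estimators=100, random_state=0),
    # random forest: bagging plus random feature subsets at each split
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, m in models.items():
    print(name, cross_val_score(m, X, y, cv=5).mean().round(3))
```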

InfosecTrain
What is Classification Algorithms? | Decision Tree and Random Forests | Model Evaluation Metrics

Apr 12, 2023 · 118:06


InfosecTrain hosts a live event entitled 'Data Science Fast Track Course' with certified expert 'NAWAJ'. Data Science is no longer the future; it is the present. This masterclass will be extremely beneficial to anyone interested in pursuing a career in Data Science. It will be delivered by a domain expert with extensive industry experience. Our instructors are specialists in their disciplines, and we hold a global reputation. Attending this webinar will benefit you in a variety of ways. Thank you for watching this video. For more details or a free demo with our expert, write to us at sales@infosectrain.com ➡️ Agenda

Probable Causation
Episode 91: Allison Harris on registering returning citizens to vote

Apr 11, 2023 · 55:20


Allison Harris talks about increasing the civic engagement of people with felony convictions. "Registering Returning Citizens to Vote" by Jennifer Doleac, Laurel Eckhouse, Eric Foster-Moore, Allison Harris, Hannah Walker, and Ariel White. *** Probable Causation is part of Doleac Initiatives, a 501(c)(3) nonprofit. If you enjoy the show, please consider making a tax-deductible contribution. Thank you for supporting our work! *** OTHER RESEARCH WE DISCUSS IN THIS EPISODE: "Can Incarcerated Felons be (Re)integrated into the Political System? Results from a Field Experiment" by Alan S. Gerber, Gregory A. Huber, Marc Meredith, Daniel R. Bigger, and David J. Hendry. "The Politics of the Restoration of Ex-felon Voting Rights: The Case of Iowa" by Marc Meredith and Michael Morse. "Using Causal Forests to Predict Treatment Heterogeneity: An Application to Summer Jobs" by Jonathan M.V. Davis and Sara B. Heller. "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests" by Stefan Wager and Susan Athey. "Civic Responses to Police Violence" by Desmond Ang and Jonathan Tebes. [Working Paper]. "Mobilized by Injustice: Criminal Justice Contact, Political Participation, and Race" by Hannah L. Walker. Bonus Episode 10 of Probable Causation: Hannah Walker.

AI Today Podcast: Artificial Intelligence Insights, Experts, and Opinion
AI Today Podcast: AI Glossary Series – Random Forest and Boosted Trees

Mar 15, 2023 · 9:01


Sometimes, for reasons such as improving performance or robustness, it makes sense to create multiple decision trees and average the results to solve problems related to overfitting. Or it makes sense to boost certain decision trees. In this episode of the AI Today podcast, hosts Kathleen Walch and Ron Schmelzer define the terms Random Forest and Boosted Trees, and explain how they relate to AI and why it's important to know about them.

The Machine Learning Podcast
Real-Time Machine Learning Has Entered The Realm Of The Possible

Mar 9, 2023 · 34:29


Summary Machine learning models have predominantly been built and updated in a batch modality. While this is operationally simpler, it doesn't always provide the best experience or capabilities for end users of the model. Tecton has been investing in the infrastructure and workflows that enable building and updating ML models with real-time data to allow you to react to real-world events as they happen. In this episode CTO Kevin Stumpf explores the benefits of real-time machine learning and the systems that are necessary to support the development and maintenance of those models. Announcements Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery. Your host is Tobias Macey and today I'm interviewing Kevin Stumpf about the challenges and promise of real-time ML applications Interview Introduction How did you get involved in machine learning? Can you describe what real-time ML is and some examples of where it might be applied? What are the operational and organizational requirements for being able to adopt real-time approaches for ML projects? What are some of the ways that real-time requirements influence the scale/scope/architecture of an ML model? What are some of the failure modes for real-time vs analytical or operational ML? Given the low latency between source/input data being generated or received and a prediction being generated, how does that influence susceptibility to e.g. data drift? Data quality and accuracy also become more critical. What are some of the validation strategies that teams need to consider as they move to real-time? What are the most interesting, innovative, or unexpected ways that you have seen real-time ML applied? What are the most interesting, unexpected, or challenging lessons that you have learned while working on real-time ML systems? When is real-time the wrong choice for ML? What do you have planned for the future of real-time support for ML in Tecton? Contact Info LinkedIn (https://www.linkedin.com/in/kevinstumpf/) @kevinmstumpf (https://twitter.com/kevinmstumpf?lang=en) on Twitter Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast (https://www.dataengineeringpodcast.com) covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site (https://www.themachinelearningpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com (mailto:hosts@themachinelearningpodcast.com) with your story.
To help other people find the show please leave a review on iTunes (https://podcasts.apple.com/us/podcast/the-machine-learning-podcast/id1626358243) and tell your friends and co-workers Links Tecton (https://www.tecton.ai/) Podcast Episode (https://www.themachinelearningpodcast.com/tecton-machine-learning-feature-platform-episode-6/) Data Engineering Podcast Episode (https://www.dataengineeringpodcast.com/tecton-mlops-feature-store-episode-166/) Uber Michelangelo (https://www.uber.com/blog/michelangelo-machine-learning-platform/) Reinforcement Learning (https://en.wikipedia.org/wiki/Reinforcement_learning) Online Learning (https://en.wikipedia.org/wiki/Online_machine_learning) Random Forest (https://en.wikipedia.org/wiki/Random_forest) ChatGPT (https://openai.com/blog/chatgpt) XGBoost (https://xgboost.ai/) Linear Regression (https://en.wikipedia.org/wiki/Linear_regression) Train-Serve Skew (https://ploomber.io/blog/train-serve-skew/) Flink (https://flink.apache.org/) Data Engineering Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57/) The intro and outro music is from Hitman's Lovesong feat. Paola Graziano (https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/)/CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)

TechReview - The Podcast
52: Nothing. Forever. AI-Powered

Mar 1, 2023 · 34:19


TikTok is creating the Creativity Program, a revamped program aimed at rewarding talented content creators with financial compensation and opportunities to grow their presence on the platform. Are we on the brink of a major astronomical discovery? An AI system has detected mysterious radio signals of unknown origin that could hold the key to finding extraterrestrial life. Nothing, Forever is back on Twitch and better than ever! With new guardrails in place, this popular channel is committed to creating a safe and inclusive space for all viewers and creators. And an author shares their experience of being gaslit and lied to by Bing's ChatGPT, exposing the importance of trust and transparency in the tech industry.

00:00 - Intro
02:06 - TikTok launches a revamped creator fund called the 'Creativity Program' in beta
10:36 - AI System Detects Strange Signals of Unknown Origin in Radio Data
18:27 - Nothing, Forever is set to return to Twitch with new guardrails in place
28:00 - My Week of Being Gaslit and Lied to by the New Bing

Summary: TikTok has launched a beta version of its revamped creator fund, the Creativity Program, to provide more earning opportunities and revenue for select creators. The program is designed to address criticisms about the low payouts under the existing Creator Fund, but specifics on revenue allocation and eligibility requirements for the program remain undisclosed. Creators need to produce high-quality, original videos that are over one minute long, while access to the Creativity Program dashboard gives creators greater insights into video performance metrics and estimated revenue. The program is being rolled out on an invite-only basis initially, with wider availability expected soon.

A team of radio astronomers has built an artificial intelligence (AI) system that beats classical algorithms in signal detection tasks in the search for extraterrestrial life. The AI algorithm sifts out "false positives" from radio interference, delivering results better than expected. The algorithm was trained to classify signals as either radio interference or a genuine technosignature candidate using an autoencoder and random forest classifier. The team fed the algorithm over 150 terabytes of data from the Green Bank Telescope in West Virginia and identified eight signals of interest that couldn't be attributed to radio interference, although they were not re-detected in follow-up observations. The researchers say their findings highlight the continued role AI techniques will play in the search for extraterrestrial intelligence.

Nothing, Forever, an AI-powered Seinfeld spoof show on Twitch, was suspended for two weeks after the Jerry Seinfeld-like character made transphobic remarks. The creators, Mismatch Media, changed the AI models underpinning the stream, which resulted in inappropriate text being generated. Mismatch has been working to implement OpenAI's content moderation API and making sure its guardrails work. Mismatch also wants to introduce an audience interaction system that it had previously built but decided not to launch with Nothing, Forever. Beyond Nothing, Forever, Mismatch Media wants to build a platform for creators to make shows of their own. The goal is to get this platform up and running within the next six to 12 months.

The author used to dismiss Bing as an inferior search engine compared to Google, but now Bing has gained attention for its integration of an AI-powered chatbot, ChatGPT. Since its rollout, daily visits to Bing.com have increased by 15% and searches for "Bing AI" have risen 700%. Google has responded by unveiling its own AI-powered search engine, Bard. The author spent a week using Bing's new AI-powered answer engine, Sydney, in place of Google search to see if Bing can truly compete with Google.

Our panel today: Tarek, Chris, Henrike, Vincent. Every week our panel of technology enthusiasts meets to discuss the most important news from the fields of technology, innovation, and science. And you can join us live!
https://techreview.axelspringer.com/
https://www.ideas-engineering.io/
https://www.freetech.academy/
https://www.upday.com/

PaperPlayer biorxiv neuroscience
Predicting alcohol-related memory problems in older adults: A machine learning study with multi-domain features

Jan 2, 2023


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2022.12.30.522330v1?rss=1 Authors: Kamarajan, C., Pandey, A. K., Chorlian, D. B., Meyers, J. L., Kinreich, S., Pandey, G., Subbie Saenz de Viteri, S., Zhang, J., Kuang, W., Barr, P. B., Aliev, F., Anokhin, A. P., Plawecki, M. H., Kuperman, S., Almasy, L., Merikangas, A., Brislin, S. J., Bauer, L., Hesselbrock, V., Chan, G., Kramer, J., Lai, D., Hartz, S., Bierut, L. J., McCutcheon, V. V., Bucholz, K. K., Dick, D. M., Schuckit, M. A., Edenberg, H. J., Porjesz, B. Abstract: Memory problems are common among older adults with a history of alcohol use disorder (AUD). Employing a machine learning framework, the current study investigates the use of multi-domain features to classify individuals with and without alcohol-induced memory problems. A group of 94 individuals (ages 50-81 years) with alcohol-induced memory problems (Memory group) were compared with a matched Control group who did not have memory problems. The Random Forests model identified specific features from each domain that contributed to the classification of Memory vs. Control group (AUC=88.29%). Specifically, individuals from the Memory group manifested a predominant pattern of hyperconnectivity across the default mode network regions except some connections involving anterior cingulate cortex which were predominantly hypoconnected. Other significant contributing features were (i) polygenic risk scores for AUD, (ii) alcohol consumption and related health consequences during the past 5 years, such as health problems, past negative experiences, withdrawal symptoms, and the largest number of drinks in a day during the past 12 months, and (iii) elevated neuroticism and increased harm avoidance, and fewer positive "uplift" life events. At the neural systems level, hyperconnectivity across the default mode network regions, including the connections across the hippocampal hub regions, in individuals with memory problems may indicate dysregulation in neural information processing. Overall, the study outlines the importance of utilizing multidomain features, consisting of resting-state brain connectivity collected ~18 years ago, together with personality, life experiences, polygenic risk, and alcohol consumption and related consequences, to predict alcohol-related memory problems that arise in later life. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC
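
As a schematic of the modeling setup (a random forest classifying Memory vs. Control from many features, evaluated by AUC), here is a minimal sketch. The synthetic matrix below merely stands in for the study's multi-domain features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# 188 "participants" (two matched groups of 94), 30 stand-in features
X, y = make_classification(n_samples=188, n_features=30, n_informative=8,
                           random_state=0)

proba = cross_val_predict(
    RandomForestClassifier(n_estimators=500, random_state=0),
    X, y, cv=5, method="predict_proba",
)[:, 1]
print(f"cross-validated AUC: {roc_auc_score(y, proba):.2%}")
```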

PaperPlayer biorxiv neuroscience
Intermediate Gray Matter Interneurons in the Lumbar Spinal Cord Play a Critical and Necessary Role in Coordinated Locomotion

Nov 1, 2022


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2022.10.31.514612v1?rss=1 Authors: Kuehn, N., Schwarz, A., Beretta, C. A., Schwarte, Y., Schmitt, F., Motsch, M., Weidner, N., Puttagunta, R. Abstract: Locomotion is a complex task involving excitatory and inhibitory circuitry in spinal gray matter. While genetic knockouts examine the function of unique spinal interneuron (SpIN) subtypes, the phenotype of combined premotor interneuron loss remains to be explored. We modified a kainic acid lesion to damage intermediate gray matter (laminae V-VII) in the lumbar spinal enlargement (spinal L2-L4) in female rats. A thorough, tailored behavioral evaluation revealed deficits in gross hindlimb function, skilled walking, coordination, balance and gait two weeks post-injury. Using a Random Forest algorithm, we combined these behavioral assessments into a highly predictive binary classification system which strongly correlated with structural deficits in the rostro-caudal axis. Machine-learning quantification confirmed interneuronal damage to laminae V-VII in spinal L2-L4 correlates with hindlimb dysfunction. White matter damage and lower motoneuron loss did not correlate with behavioral deficits. Animals do not regain lost sensorimotor function three months after injury, indicating that natural recovery of the spinal cord cannot compensate for loss of laminae V-VII neurons. As spinal cord injuries are often located at spinal enlargements, this research lays the groundwork for new neuroregenerative therapies to replace these lost neuronal pools vital to sensorimotor function. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC

At Any Rate
Systematic late-cycle hedging with FX options: A machine learning based strategy

Oct 20, 2022 · 15:31


We introduce a systematic strategy for identifying signals and selecting late-cycle / recession FX options hedges. 1) we first run a Random Forest algorithm to zero in on the most relevant signals for hedging risk-off episodes; then 2) we build a 4-factor model that selectively buys defensive FXO based on model-predicted P&L. The highest beta to risk-off events is achieved with four signals: m/m change in 1-y z-score of ATM vol, m/m change in 1-y z-score of realized vol, 1-y z-score of 6-month % change in spot, and m/m change in 1-y z-score of fwd pts/vol (i.e. carry/vol). We find short-term tenors to be most responsive to adverse episodes, 30%TV digitals to show the least decay, and the longer expiry structures tend to hold better during low-vol times. Overall, 3M - 6M expiries offer a good compromise. At current market, our 4-factor model favors high beta G10 digi structures with AUD structures dominating within the top 5: 3M to 6M 30%TV at-expiry digitals in AUD/SGD put, USD/CAD call, EUR/AUD call, GBP/USD put and/or AUD/CHF put. This podcast was recorded on 20 October 2022. This communication is provided for information purposes only. Please visit www.jpmm.com/research/disclosures for important disclosures. © 2022 JPMorgan Chase & Co. All rights reserved.
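
The four signals are plain rolling-window transforms, so they are straightforward to compute with pandas. A sketch under stated assumptions: the daily series are synthetic, 252 business days approximate the 1-year window and 21 a month, and the carry/vol signal is omitted since it requires forward-points data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.bdate_range("2015-01-01", periods=2500)
spot = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.005, 2500))), index=idx)
atm_vol = pd.Series(10 + np.cumsum(rng.normal(0, 0.05, 2500)), index=idx)

def z_1y(s: pd.Series) -> pd.Series:
    """Rolling 1-year z-score."""
    return (s - s.rolling(252).mean()) / s.rolling(252).std()

realized_vol = spot.pct_change().rolling(21).std() * np.sqrt(252)

signals = pd.DataFrame({
    "d_z_atm_vol": z_1y(atm_vol).diff(21),            # m/m change in 1-y z of ATM vol
    "d_z_realized_vol": z_1y(realized_vol).diff(21),  # m/m change in 1-y z of realized vol
    "z_spot_6m_chg": z_1y(spot.pct_change(126)),      # 1-y z of 6-month % change in spot
})
print(signals.dropna().tail())
```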

Astro arXiv | all categories
Stellar population of the Rosette Nebula and NGC 2244: application of the probabilistic random forest

Oct 12, 2022 · 1:24


Stellar population of the Rosette Nebula and NGC 2244: application of the probabilistic random forest by Koraljka Muzic et al. on Wednesday 12 October (Abridged) In this work, we study the 2.8×2.6 deg² region in the emblematic Rosette Nebula, centred at the young cluster NGC 2244, with the aim of constructing the most reliable candidate member list to date, determining various structural and kinematic parameters, and learning about the past and the future of the region. Starting from a catalogue containing optical to mid-infrared photometry, as well as positions and proper motions from Gaia EDR3, we apply the Probabilistic Random Forest algorithm and derive membership probability for each source. Based on the list of almost 3000 probable members, of which about a third are concentrated within the radius of 20' from the centre of NGC 2244, we identify various clustered sources and stellar concentrations, and estimate the average distance of 1489±37 pc (entire region), 1440±32 pc (NGC 2244) and 1525±36 pc (NGC 2237). The masses, extinction, and ages are derived by SED fitting, and the internal dynamic is assessed via proper motions relative to the mean proper motion of NGC 2244. NGC 2244 is showing a clear expansion pattern, with an expansion velocity that increases with radius. Its IMF is well represented by two power laws (dN/dM ∝ M^(−α)), with slopes α = 1.05±0.02 for the mass range 0.2–1.5 MSun, and α = 2.3±0.3 for the mass range 1.5–20 MSun, in agreement with other star forming regions. The mean age of the region is ~2 Myr. We find evidence for the difference in ages between NGC 2244 and the region associated with the molecular cloud, which appears slightly younger. The velocity dispersion of NGC 2244 is well above the virial velocity dispersion derived from the total mass (1000±70 MSun) and half-mass radius (3.4±0.2 pc). From the comparison to other clusters and to numerical simulations, we conclude that NGC 2244 may be unbound, and possibly even formed in a super-virial state. arXiv: http://arxiv.org/abs/2209.13302v2

Astro arXiv | all categories
The metallicity's fundamental dependence on both local and global galactic quantities

Oct 10, 2022 · 1:04


The metallicity's fundamental dependence on both local and global galactic quantities by William M. Baker et al. on Monday 10 October We study the scaling relations between gas-phase metallicity, stellar mass surface density ($\Sigma_*$), star formation rate surface density ($\Sigma_{\rm SFR}$), and molecular gas surface density ($\Sigma_{H_2}$) in local star-forming galaxies on scales of a kpc. We employ optical integral field spectroscopy from the MaNGA survey, and ALMA data for a subset of MaNGA galaxies. We use Partial Correlation Coefficients and Random Forest regression to determine the relative importance of local and global galactic properties in setting the gas-phase metallicity. We find that the local metallicity depends primarily on $\Sigma_*$ (the resolved mass-metallicity relation, rMZR), and has a secondary anti-correlation with $\Sigma_{\rm SFR}$ (i.e. a spatially-resolved version of the 'Fundamental Metallicity Relation', rFMR). We find that $\Sigma_{H_2}$ has little effect in determining the local metallicity. This result indicates that gas accretion, resulting in local metallicity dilution and local boosting of star formation, cannot be the primary origin of the rFMR. Star-formation driven, metal-loaded winds may contribute to the anti-correlation between metallicity and SFR. The local metallicity depends also on the global properties of galaxies. We find a strong dependence on the total stellar mass ($M_*$) and a weaker (inverse) dependence on the total SFR. The global metallicity scaling relations, therefore, do not simply stem out of their resolved counterparts; global properties and processes, such as the global gravitational potential well, galaxy-scale winds and global redistribution/mixing of metals, likely contribute to the local metallicity, in addition to local production and retention. arXiv: http://arxiv.org/abs/2210.03755v1

Astro arXiv | all categories
The probabilistic random forest applied to the QUBRICS survey: improving the selection of high-redshift quasars with synthetic data

Sep 15, 2022 · 0:59


The probabilistic random forest applied to the QUBRICS survey: improving the selection of high-redshift quasars with synthetic data by Francesco Guarneri et al. on Thursday 15 September Several recent works have focused on the search for bright, high-z quasars (QSOs) in the South. Among them, the QUasars as BRIght beacons for Cosmology in the Southern hemisphere (QUBRICS) survey has now delivered hundreds of new spectroscopically confirmed QSOs selected by means of machine learning algorithms. Building upon the results obtained by introducing the probabilistic random forest (PRF) for the QUBRICS selection, we explore in this work the feasibility of training the algorithm on synthetic data to improve the completeness in the higher redshift bins. We also compare the performances of the algorithm if colours are used as primary features instead of magnitudes. We generate synthetic data based on a composite QSO spectral energy distribution. We first train the PRF to identify QSOs among stars and galaxies, then separate high-z quasars from low-z contaminants. We apply the algorithm on an updated dataset, based on SkyMapper DR3, combined with Gaia eDR3, 2MASS and WISE magnitudes. We find that employing colours as features slightly improves the results with respect to the algorithm trained on magnitude data. Adding synthetic data to the training set provides significantly better results with respect to the PRF trained only on spectroscopically confirmed QSOs. We estimate, on a testing dataset, a completeness of ~86% and a contamination of ~36%. Finally, 207 PRF-selected candidates were observed: 149/207 turned out to be genuine QSOs with z > 2.5, 41 with z < 2.5, 3 galaxies and 14 stars. The result confirms the ability of the PRF to select high-z quasars in large datasets. arXiv: http://arxiv.org/abs/2209.07257v1
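
The augmentation idea (padding the rare high-redshift class with synthetic examples before training) can be shown with a plain random forest standing in for the probabilistic random forest, which additionally propagates photometric uncertainties. All numbers and the toy "colour" distributions below are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

def colours(n, centre):
    # toy 4-band colour vectors drawn around a class-dependent locus
    return rng.normal(loc=centre, scale=1.0, size=(n, 4))

# spectroscopic training set: many contaminants, few confirmed high-z QSOs
X_real = np.vstack([colours(2000, 0.0), colours(40, 1.0)])
y_real = np.array([0] * 2000 + [1] * 40)

# synthetic high-z colours; in the paper these come from a composite QSO
# spectral energy distribution, here simply the same toy locus
X_syn, y_syn = colours(1000, 1.0), np.ones(1000, dtype=int)

X_test = np.vstack([colours(500, 0.0), colours(50, 1.0)])
y_test = np.array([0] * 500 + [1] * 50)

for name, (X, y) in {
    "real only": (X_real, y_real),
    "real + synthetic": (np.vstack([X_real, X_syn]),
                         np.concatenate([y_real, y_syn])),
}.items():
    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
    print(name, "high-z recall:", recall_score(y_test, clf.predict(X_test)))
```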

Astro arXiv | all categories
A machine-learning photometric classifier for massive stars in nearby galaxies I. The method

Sep 14, 2022 · 0:45


A machine-learning photometric classifier for massive stars in nearby galaxies I. The method by Grigoris Maravelias et al. on Wednesday 14 September (abridged) Mass loss is a key parameter in the evolution of massive stars, with discrepancies between theory and observations and with unknown importance of the episodic mass loss. To address this we need increased numbers of classified sources spanning a range of metallicity environments. We aim to remedy the situation by applying machine learning techniques to recently available extensive photometric catalogs. We used IR/Spitzer and optical/Pan-STARRS, with Gaia astrometric information, to compile a large catalog of known massive stars in M31 and M33, which were grouped into Blue, Red, Yellow, B[e] supergiants, Luminous Blue Variables, Wolf-Rayet, and background galaxies. Due to the high imbalance, we implemented synthetic data generation to populate the underrepresented classes and improve separation by undersampling the majority class. We built an ensemble classifier using color indices. The probabilities from Support Vector Classification, Random Forests, and Multi-layer Perceptron were combined for the final classification. The overall weighted balanced accuracy is ~83%, recovering Red supergiants at ~94%, Blue/Yellow/B[e] supergiants and background galaxies at ~50-80%, Wolf-Rayets at ~45%, and Luminous Blue Variables at ~30%, mainly due to their small sample sizes. The mixing of spectral types (no strict boundaries in their color indices) complicates the classification. Independent application to IC 1613, WLM, and Sextans A galaxies resulted in an overall lower accuracy of ~70%, attributed to metallicity and extinction effects. The missing data imputation was explored using simple replacement with mean values and an iterative imputer, which proved more capable. We also found that r-i and y-[3.6] were the most important features. Our method, although limited by the sampling of the feature space, is efficient in classifying sources with missing data and at lower metallicities. arXiv: http://arxiv.org/abs/2203.08125v2
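
Combining the predicted probabilities of an SVM, a random forest, and a multi-layer perceptron, as the abstract describes, corresponds to soft voting in scikit-learn. A minimal sketch on synthetic data (not the authors' photometric catalogue):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for colour-index features of several stellar classes
X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("svc", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("mlp", make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0))),
    ],
    voting="soft",  # average class probabilities across the three models
)
print("CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean().round(3))
```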

The Machine Learning Podcast
Build Better Machine Learning Models With Confidence By Adding Validation With Deepchecks

Jul 6, 2022 · 48:40


Summary Machine learning has the potential to transform industries and revolutionize business capabilities, but only if the models are reliable and robust. Because of the fundamental probabilistic nature of machine learning techniques it can be challenging to test and validate the generated models. The team at Deepchecks understands the widespread need to easily and repeatably check and verify the outputs of machine learning models and the complexity involved in making it a reality. In this episode Shir Chorev and Philip Tannor explain how they are addressing the problem with their open source deepchecks library and how you can start using it today to build trust in your machine learning applications. Announcements Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery. Do you wish you could use artificial intelligence to drive your business the way Big Tech does, but don’t have a money printer? Graft is a cloud-native platform that aims to make the AI of the 1% accessible to the 99%. Wield the most advanced techniques for unlocking the value of data, including text, images, video, audio, and graphs. No machine learning skills required, no team to hire, and no infrastructure to build or maintain. For more information on Graft or to schedule a demo, visit themachinelearningpodcast.com/graft today and tell them Tobias sent you. Predibase is a low-code ML platform without low-code limits. Built on top of our open source foundations of Ludwig and Horovod, our platform allows you to train state-of-the-art ML and deep learning models on your datasets at scale. Our platform works on text, images, tabular, audio and multi-modal data using our novel compositional model architecture. We allow users to operationalize models on top of the modern data stack, through REST and PQL – an extension of SQL that puts predictive power in the hands of data practitioners. Go to themachinelearningpodcast.com/predibase today to learn more and try it out! Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building Natural Language Processing (NLP) models to programmatically inspect, fix and track their data across the ML workflow (pre-training, post-training and post-production) – no more excel sheets or ad-hoc python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs, while seeing 10x faster ML iterations. Galileo is offering listeners a free 30 day trial and a 30% discount on the product thereafter. This offer is available until Aug 31, so go to themachinelearningpodcast.com/galileo and request a demo today! Your host is Tobias Macey and today I’m interviewing Shir Chorev and Philip Tannor about Deepchecks, a Python package for comprehensively validating your machine learning models and data with minimal effort. Interview Introduction How did you get involved in machine learning? Can you describe what Deepchecks is and the story behind it? Who is the target audience for the project? What are the biggest challenges that these users face in bringing ML models from concept to production and how does DeepChecks address those problems? In the absence of DeepChecks how are practitioners solving the problems of model validation and comparison across iterations? What are some of the other tools in this ecosystem and what are the differentiating features of DeepChecks?
What are some examples of the kinds of tests that are useful for understanding the "correctness" of models? What are the methods by which ML engineers/data scientists/domain experts can define what "correctness" means in a given model or subject area? In software engineering the categories of tests are tiered as unit -> integration -> end-to-end. What are the relevant categories of tests that need to be built for validating the behavior of machine learning models? How do model monitoring utilities overlap with the kinds of tests that you are building with deepchecks? Can you describe how the DeepChecks package is implemented? How have the design and goals of the project changed or evolved from when you started working on it? What are the assumptions that you have built up from your own experiences that have been challenged by your early users and design partners? Can you describe the workflow for an individual or team using DeepChecks as part of their model training and deployment lifecycle? Test engineering is a deep discipline in its own right. How have you approached the user experience and API design to reduce the overhead for ML practitioners to adopt good practices? What are the interfaces available for creating reusable tests and composing test suites together? What are the additional services/capabilities that you are providing in your commercial offering? How are you managing the governance and sustainability of the OSS project and balancing that against the needs/priorities of the business? What are the most interesting, innovative, or unexpected ways that you have seen DeepChecks used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on DeepChecks? When is DeepChecks the wrong choice? What do you have planned for the future of DeepChecks? Contact Info Shir LinkedIn shir22 on GitHub Philip LinkedIn @philiptannor on Twitter Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Links: DeepChecks Random Forest Talpiot Program SHAP Podcast.__init__ Episode Airflow Great Expectations Data Engineering Podcast Episode The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0

Bigdata Hebdo
Episode 140 : Feature importance de la mafia dans la data

May 13, 2022 · 86:53


### Apéro
* Atlassian erased the cloud environments of 400 customers by mistake -> https://www.usine-digitale.fr/article/atlassian-a-efface-les-environnements-cloud-de-400-clients-par-erreur.N1993357
* Hacked News Channel and Deepfake of Zelenskyy Surrendering Is Causing Chaos Online -> https://www-vice-com.cdn.ampproject.org/c/s/www.vice.com/amp/en/article/93bmda/hacked-news-channel-and-deepfake-of-zelenskyy-surrendering-is-causing-chaos-online

### Database
* PostgreSQL interface -> https://cloud.google.com/spanner/docs/postgresql-interface
* End of Big Data Cluster in 2025 -> https://docs.microsoft.com/fr-fr/sql/big-data-cluster/release-notes-big-data-cluster?view=sql-server-ver15

### ML/AI + Data Science
* Feature importance in Random Forests -> https://medium.com/@ali.soleymani.co/stop-using-random-forest-feature-importances-take-this-intuitive-approach-instead-4335205b933f
* Neo4J AuraDS GA on GCP -> https://neo4j.com/blog/introducing-graph-data-science-2-0-aurads/

### Architecture
* Data Mesh From an Engineering Perspective -> https://www.datamesh-architecture.com/#why
* Building a Modern Data Stack at Whatnot -> https://medium.com/whatnot-engineering/building-a-modern-data-stack-at-whatnot-afc1d03c3f9
* Airbyte acquires Grouparoo to accelerate Data Movement -> https://airbyte.com/blog/airbyte-acquires-grouparoo-to-accelerate-data-movement

### Cloud
* Modernize your Oracle workloads to PostgreSQL with Database Migration Service, now in preview -> https://cloud.google.com/blog/products/databases/migrate-oracle-to-postgresql
* NetApp Announces Intent to Acquire Instaclustr -> https://www.netapp.com/newsroom/press-releases/news-rel-20220407-656381/
* BigLake unifies data warehouses and data lakes into a consistent format -> https://cloud.google.com/blog/products/data-analytics/unifying-data-lakes-and-data-warehouses-across-clouds-with-biglake

### Cloud Native dev tips
* COPY --chmod reduced the size of my container image by 35% -> https://blog.vamc19.dev/posts/dockerfile-copy-chmod/

Sponsors: This episode is sponsored by [Affini-Tech](https://affini-tech.com/) and [CerenIT](https://www.cerenit.fr/). [CerenIT](https://www.cerenit.fr/) helps you design, industrialize, and automate your platforms, and also makes your time-series data talk. Write to us at [contact@cerenit.fr](mailto:contact@cerenit.fr) and find us on [Time Series France](https://www.timeseriesfr.org/). Affini-Tech supports you in all your Cloud and Data projects, helping you imagine, experiment with, and run your services! ([Affini-Tech](http://affini-tech.com), the [Datatask](https://datatask.io/) platform) to accelerate your Data and AI services. Check out the [Affini-Tech blog](https://affini-tech.com/blog/) and the [Datatask blog](https://datatask.io/blog/) to learn more. We're hiring! Come crunch data with us! Write to us at [recrutement@affini-tech.com](mailto:recrutement@affini-tech.com). The theme music was composed and produced by Maxence Lecointe.
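
The ML/AI link above warns against the default impurity-based random forest feature importances; the usual alternative such pieces recommend is permutation importance computed on held-out data. A minimal sketch contrasting the two with scikit-learn, on synthetic data for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permutation importance: how much does shuffling one feature on held-out
# data hurt the score? This avoids the training-set bias of impurity scores.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

print("feature  impurity  permutation")
for i in range(X.shape[1]):
    print(f"{i:>7}  {rf.feature_importances_[i]:.3f}     {result.importances_mean[i]:.3f}")
```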

KI in der Industrie
KI und Wiederbeschaffungszeiten im Maschinenbau

KI in der Industrie

Play Episode Listen Later Nov 3, 2021 59:25


Markus Günther and his colleagues have turned this into a piece of software that follows a Random Forest approach. In the podcast conversation he explains how users work with it.
**We are looking for a new podcast partner!**
**Want even more AI in industry?** https://kipodcast.de/podcast-archiv
**Or our book AI in Industry:** https://www.hanser-fachbuch.de/buch/KI+in+der+Industrie/9783446463455
**Contact our guest:** https://de.linkedin.com/in/markus-günther-6a71a6165
Current affairs section:
**Open letter:** https://netzpolitik.org/2021/autonome-waffensysteme-neue-bundesregierung-soll-killer-roboter-einhegen/
**Nagorno-Karabakh:** https://sicherheitspod.de
**Oracle study:** https://www.industry-of-things.de/hoffnung-auf-karrierekick-durch-ki-a-1069368/
**Golden Age of Computer Vision:** https://www.telecom-paris.fr/gerard-medioni-golden-age-computer-vision-research

Data Chatter
7. From Random Walks to Random Forests: Analytics and data science on Wall Street

Data Chatter

Play Episode Listen Later Aug 2, 2021 49:00


One of the first industries to extensively use advanced maths to do better was financial services. Ever since Fischer Black and Myron Scholes published their seminal paper on option pricing in 1973, Wall Street firms hired mathematicians and scientists in droves, getting them to model asset prices in order to get an edge in the market. Even today, top hedge funds such as Renaissance, Citadel and Two Sigma prefer to hire scientists rather than finance professionals to manage their portfolios. However, in the last decade or so, as Data Science and Artificial Intelligence have taken over the rest of the world, Wall Street has not maintained its leadership position in the use of maths to make money. How and why did this happen? In order to understand this, we talk to Hari Balaji, co-founder of Romulus, an award-winning unstructured data automation platform for Financial Services firms. Prior to founding Romulus, Hari spent a decade in quant & data roles at Goldman Sachs across Hong Kong and Singapore. Hari is an alumnus of IIT Madras & IIM Ahmedabad.
Show Notes
00:03:15 - What is data science and what is artificial intelligence?
00:10:40 - What Hari's company does
00:14:00 - Toolbox versus hammer-nail approaches
00:15:00 - The history of math in the financial services industry
00:28:45 - Wall Street is never a first mover but a great follower
00:33:30 - How Wall Street uses data science nowadays
00:41:00 - Why most innovations have happened at smaller firms
00:44:00 - Why the financial industry doesn't behave like the Tech world
Romulus on Twitter
Romulus on LinkedIn
Data Chatter is a podcast on all things data. It is a series of conversations with experts and industry leaders in data, and each week we aim to unpack a different compartment of the "data suitcase". The podcast is hosted by Karthik Shashidhar. He is a blogger, newspaper columnist, book author and a former data and strategy consultant. Karthik currently heads Analytics and Business Intelligence for Delhivery, one of India's largest logistics companies. You can follow him on twitter at @karthiks, and read his blog at noenthuda.com/blog

Data Science at Home
A simple trick for very unbalanced data (Ep. 157)

Data Science at Home

Play Episode Listen Later Jun 22, 2021 22:02


Data from the real world are never perfectly balanced. In this episode I explain a simple yet effective trick to train models with very unbalanced data. Enjoy the show! Sponsors Get one of the best VPNs at a massive discount with coupon code DATASCIENCE. It provides you with an 83% discount, which unlocks the best price on the market, plus 3 extra months for free. Here is the link: https://surfshark.deals/DATASCIENCE References Leo Breiman, Random Forests, 2001 C. Chen, A. Liaw, L. Breiman, Using Random Forest to Learn Imbalanced Data (2004)
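
The summary doesn't spell the trick out, but the cited Chen, Liaw and Breiman paper studies class weighting and balanced resampling inside the forest. As a hedged illustration of that idea (not code from the episode), scikit-learn exposes it through the class_weight parameter:

```python
# Sketch: making a random forest cope with heavy class imbalance,
# in the spirit of Chen/Liaw/Breiman (2004). Illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# 1% positives: a plain forest tends to ignore the minority class.
X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# 'balanced_subsample' reweights classes inside each bootstrap sample,
# approximating the weighted-RF idea from the paper.
weighted = RandomForestClassifier(class_weight="balanced_subsample",
                                  random_state=0).fit(X_tr, y_tr)

for name, clf in [("plain", plain), ("weighted", weighted)]:
    print(name, balanced_accuracy_score(y_te, clf.predict(X_te)))
```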

BI or DIE
80. Der Data Analyst - Im Gespräch mit Michael Tenner, pmOne

BI or DIE

Play Episode Listen Later Jun 8, 2021 31:44


Michael is Business Development Manager for data visualization and business intelligence at pmOne. His passion is spreading enthusiasm for data and its analysis. His own data journey began in 2010 with the assumption that PostgreSQL was poker software and Kimball an Asian ball sport. It then led him through complex financial models in Excel, an organically grown MySQL database, and first steps in PowerView, PowerPivot and PowerPoint, into the arms of Power BI in 2015. In 2018 he strayed from this first BI love for Tableau. In countless group sessions and one-on-one conversations he has tried to share his knowledge. A few dashboards, reports, analyses and projects later, in 2021, Azure is no longer just a color to him, and he no longer tries to wander aimlessly through the Random Forest. What's in it for you: - Power BI vs. Tableau - The role of the data analyst - What should tool trainings look like? - Why LinkedIn has potential

Tech Hunters by Rebels
Łukasz Malicki: Analiza danych przez sztuczną inteligencję - czy technologia zastąpi człowieka?

Tech Hunters by Rebels

Play Episode Listen Later Jan 11, 2021 52:47


With Łukasz Malicki, CEO of Random Forest, we talk about his plan to revolutionize old-fashioned search tools. Can artificial intelligence replace humans in data analysis? Or, on the contrary, will it become invaluable support for them? What should you focus on when building a business, and why should you draw lessons from failures? The conversation is part of a series summarizing the 6th edition of the #StartUP Małopolska program, in which Random Forest participated.

Machine learning
Real estate price predictor using random forest regressor

Machine learning

Play Episode Listen Later Dec 24, 2020 18:17


PaperPlayer biorxiv bioinformatics
A multi-modal machine learning approach towards predicting patient readmission

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Nov 20, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.391904v1?rss=1 Authors: Mohanty, S. D., Lekan, D., McCoy, T. P., Jenkins, M., Manda, P. Abstract: Healthcare costs that can be attributed to unplanned readmissions are staggeringly high and negatively impact the health and wellness of patients. In the United States, hospital systems and care providers have strong financial motivations to reduce readmissions in accordance with several government guidelines. One of the critical steps to reducing readmissions is to recognize the factors that lead to readmission and correspondingly identify at-risk patients based on these factors. The availability of large volumes of electronic health care records makes it possible to develop and deploy automated machine learning models that can predict unplanned readmissions and pinpoint the most important factors of readmission risk. While hospital readmission is an undesirable outcome for any patient, it is more so for medically frail patients. Here, we develop and compare four machine learning models (Random Forest, XGBoost, CatBoost, and Logistic Regression) for predicting 30-day unplanned readmission for patients deemed frail (Age ≥ 50). Variables that indicate frailty, comorbidities, high-risk medication use, and demographic, hospital, and insurance characteristics were incorporated in the models for prediction of unplanned 30-day readmission. Our findings indicate that CatBoost outperforms the other three models (AUC 0.80) and prior work in this area. We find that constructs of frailty, certain categories of high-risk medications, and comorbidity are all strong predictors of readmission for elderly patients. Copyright belongs to the original authors. Visit the link for more info
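
As a hedged sketch of the kind of comparison the authors describe (the actual study uses clinical EHR features, not the synthetic data below), here is how four classifiers can be benchmarked by AUC on one dataset. The xgboost and catboost packages are separate installs:

```python
# Sketch: comparing four classifiers by AUC, mirroring the study design
# (Random Forest, XGBoost, CatBoost, Logistic Regression). Synthetic
# data stands in for the paper's EHR features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=5_000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "xgboost": XGBClassifier(eval_metric="logloss", random_state=0),
    "catboost": CatBoostClassifier(verbose=0, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```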

PaperPlayer biorxiv bioinformatics
LongTron: Automated Analysis of Long Read Spliced Alignment Accuracy

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Nov 11, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.10.376871v1?rss=1 Authors: Wilks, C., Schatz, M. C. Abstract: Motivation: Long read sequencing has increased the accuracy and completeness of assemblies of various organisms' genomes in recent months. Similarly, spliced alignments of long read RNA sequencing hold the promise of delivering much longer transcripts of existing and novel isoforms in known genes without the need for error-prone transcript assemblies from short reads. However, low coverage and high error rates potentially hamper the widespread adoption of long-read spliced alignments in annotation updates and isoform-level expression quantifications. Results: Addressing these issues, we first develop a simulation of error modes for both Oxford Nanopore and PacBio CCS spliced alignments. Based on this we train a Random Forest classifier to assign new long-read alignments to one of two error categories, a novel category, or label them as non-error. We use this classifier to label reads from the spliced alignments of the popular aligner minimap2, run on three long read sequencing datasets, including NA12878 from Oxford Nanopore and PacBio CCS, as well as a PacBio SKBR3 cancer cell line. Finally, we compare the intron chains of the three long read alignments against individual splice sites, short read assemblies, and the output from the FLAIR pipeline on the same samples. Our results demonstrate a substantial lack of precision in determining exact splice sites for long reads during alignment on both platforms, while showing some benefit from postprocessing. This work motivates the need for both better aligners and additional post-alignment processing to adjust incorrectly called putative splice sites and clarify support for novel transcripts. Availability and implementation: Source code for the random forest, implemented in Python, is available at https://github.com/schatzlab/LongTron under the MIT license. The modified version of GffCompare used to construct Table 3 and related is here: https://github.com/ChristopherWilks/gffcompare/releases/tag/0.11.2LT Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Neural network fast-classifies biological images using features selected after their random-forests-importance to power smart microscopy.

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Nov 11, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.10.376988v1?rss=1 Authors: Balluet, M., Sizaire, F., Walter, T., Pont, J., Giroux, B., Bouchareb, O., Tramier, M., Pecreaux, J. Abstract: Artificial intelligence is nowadays used for cell detection and classification in optical microscopy during post-acquisition analysis. Microscopes are now fully automated and are next expected to be smart, making acquisition decisions based on the images, which calls for analysing them on the fly. Biology further imposes training on a reduced dataset, due to the cost and time of preparing the samples and having the datasets annotated by experts. We propose here a real-time image processing approach compliant with these specifications, balancing accurate detection and execution performance. We characterised the images using a generic, high-dimensional feature extractor. We then classified the images using machine learning in order to understand the contribution of each feature to decision and execution time. We found that random forests, a non-linear classifier, outperformed Fisher's linear discriminant. More importantly, the most discriminant and time-consuming features could be excluded without any significant loss in accuracy, offering a substantial gain in execution time. We offer a method to select fast and discriminant features. In our assay, a 79.6 {+/-} 2.4% accurate classification of a cell took 68.7 {+/-} 3.5 ms (mean {+/-} SD, 5-fold cross-validation nested in 10 bootstrap repeats), corresponding to 14 cells per second, dispatched into 8 phases of the cell cycle using 12 feature-groups and operating a consumer-market ARM-based embedded system. Interestingly, a simple neural network offered similar performance, paving the way to faster training and classification using parallel execution on a general-purpose graphics processing unit. Copyright belongs to the original authors. Visit the link for more info
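
The general pattern described here (rank features by random-forest importance, then keep a small, fast subset for a simpler classifier) can be sketched with scikit-learn. This is only an illustration of the idea, not the authors' pipeline or feature extractor:

```python
# Sketch: use random-forest importances to keep a small feature subset,
# then train a simple neural network on the reduced features.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# Keep only features whose importance exceeds the mean importance.
selector = SelectFromModel(forest, prefit=True, threshold="mean")
X_tr_small, X_te_small = selector.transform(X_tr), selector.transform(X_te)

mlp = MLPClassifier(max_iter=500, random_state=0).fit(X_tr_small, y_tr)
print(X_tr.shape[1], "->", X_tr_small.shape[1], "features;",
      "accuracy:", round(mlp.score(X_te_small, y_te), 3))
```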

PaperPlayer biorxiv bioinformatics
Detecting significant components of microbiomes by random forest with forward variable selection and phylogenetics

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Oct 30, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.29.361360v1?rss=1 Authors: Dang, T., Kishino, H. Abstract: A central focus of microbiome studies is the characterization of differences in microbiome composition across groups of samples. A major challenge is the high dimensionality of microbiome datasets, which significantly reduces the power of current approaches for identifying true differences and increases the chance of false discoveries. We have developed a new framework to address these issues by combining (i) identifying a few significant features by a massively parallel forward variable selection procedure, (ii) mapping the selected species onto a phylogenetic tree, and (iii) predicting functional profiles by functional gene enrichment analysis from metagenomic 16S rRNA data. We demonstrated the performance of the proposed approach by analyzing two published datasets from large-scale case-control studies: (i) 16S rRNA gene amplicon data for Clostridioides difficile infection (CDI) and (ii) shotgun metagenomics data for human colorectal cancer (CRC). The proposed approach improved the accuracy from 81% to 99.01% for CDI and from 75.14% to 90.17% for CRC. We identified a core set of 96 species that were significantly enriched in CDI and a core set of 75 species that were enriched in CRC. Moreover, although data quality differed between the functional profiles predicted from the 16S rRNA dataset and the functional metagenome profiling, our approach performed well for both databases and detected the main functions that can be used to diagnose and further study the growth stage of diseases. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
RNAmining: A machine learning stand-alone and web server tool for RNA coding potential prediction

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Oct 26, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.26.354357v1?rss=1 Authors: Ramos, T., Galindo, N., Arias-Carrasco, R., da Silva, C., Maracaja-Coutinho, V., do Rego, T. Abstract: Non-coding RNAs (ncRNAs) are important players in the cellular regulation of organisms from different kingdoms. One of the key steps in ncRNA research is the ability to distinguish coding/non-coding sequences. We applied 7 machine learning algorithms (Naive Bayes, SVM, KNN, Random Forest, XGBoost, ANN and DL) across 15 model organisms from different evolutionary branches. Then, we created a stand-alone and web server tool (RNAmining) to distinguish coding and non-coding sequences, selecting the algorithm with the best performance (XGBoost). First, we used coding/non-coding sequences downloaded from Ensembl (April 14th, 2020). The coding/non-coding sequences were balanced, had their tri-nucleotide counts analysed, and were normalized by sequence length. In total we built 180 models. All machine learning algorithm tests were performed using 10-fold cross-validation, and we selected the algorithm with the best results (XGBoost) to implement in RNAmining. Best F1-scores ranged from 97.56% to 99.57% depending on the organism. Moreover, we produced a benchmark against other tools already in the literature (CPAT, CPC2, RNAcon and Transdecoder) and our results outperformed them, opening opportunities for the development of RNAmining, which is freely available at https://rnamining.integrativebioinformatics.me/ . Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Ultra-fast Prediction of Somatic Structural Variations by Reduced Read Mapping via Pan-Genome k-mer Sets

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Oct 26, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.25.354456v1?rss=1 Authors: Choi, M.-H., Sohn, J.-i., Yi, D., Menon, A. V., Kim, Y. J., Kyoung, S., Shin, S.-H., Na, B., Joung, J.-G., Yoon, S., Koh, Y., Baek, D., Kim, T.-M., Nam, J.-W. Abstract: Genome rearrangements often result in copy number alterations of cancer-related genes and cause the formation of cancer-related fusion genes. Current structural variation (SV) callers, however, still produce massive numbers of false positives (FPs) and require high computational costs. Here, we introduce an ultra-fast and high-performing somatic SV detector, called ETCHING, that significantly reduces the mapping cost by filtering reads matched to pan-genome and normal k-mer sets. To reduce the number of FPs, ETCHING takes advantage of a Random Forest classifier that utilizes six breakend-related features. We systematically benchmarked ETCHING against other SV callers on reference SV materials, validated SV biomarkers, tumor and matched-normal whole genomes, and tumor-only targeted sequencing datasets. For all datasets, our SV caller was much faster (≥15X) than other tools without compromising performance or memory use. Our approach would provide not only the fastest method for large-scale genome projects but also an accurate, clinically practical means for real-time precision medicine. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Predicting Cell-Penetrating Peptides: Building and Interpreting Random Forest based prediction Models

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Oct 16, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.15.341149v1?rss=1 Authors: Yadahalli, S., Verma, C. S. Abstract: Targeting intracellular pathways with peptide drugs is becoming increasingly desirable but is often limited in application due to their poor cell permeability. Understanding the cellular permeability of peptides remains a major challenge, with very little known about structure-activity relationships. Fortunately, there exists a class of peptides called Cell-Penetrating Peptides (CPPs), which have the ability to cross cell membranes and are also capable of delivering biologically active cargo into cells. Discovering the patterns that make peptides cell-permeable has a variety of applications in drug delivery. In the current study, we build prediction models for CPPs, exploring features covering a range of properties based on amino acid sequences, using Random Forest classifiers, which are often more interpretable than other ensemble machine learning algorithms. While obtaining prediction accuracies of ~96%, we also interpret our prediction models using TreeInterpreter, LIME and SHAP to decipher the contributions of important features and the optimal feature space for the CPP class. We propose that our work might offer an intuitive guide for incorporating features that impart cell-penetrability into the design of novel CPPs. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv neuroscience
Predicting age and clinical risk from the neonatal connectome

PaperPlayer biorxiv neuroscience

Play Episode Listen Later Sep 29, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.09.28.317180v1?rss=1 Authors: Taoudi-Benchekroun, Y., Christiaens, D., Grigorescu, I., Schuh, A., Pietsch, M., Chew, A., Harper, N., Falconer, S., Poppe, T., Hughes, E., Hutter, J., Price, A. N., Tournier, J.-D., Cordero-Grande, L., Counsell, S. J., Rueckert, D., Arichi, T., Hajnal, J. V., Edwards, A. D., Deprez, M., Batalle, D. Abstract: The development of perinatal brain connectivity underpins motor, cognitive and behavioural abilities in later life. With the rise of advanced imaging methods such as diffusion MRI, the study of brain connectivity has emerged as an important tool to understand subtle alterations associated with neurodevelopmental conditions. Brain connectivity derived from diffusion MRI is complex, multi-dimensional and noisy, and hence it can be challenging to interpret on an individual basis. Machine learning methods have proven to be a powerful tool to uncover hidden patterns in such data, thus opening an opportunity for early identification of atypical development and potentially more efficient treatment. In this work, we used Deep Neural Networks and Random Forests to predict neurodevelopmental characteristics from neonatal structural connectomes, in a large sample of neonates (N = 524) derived from the developing Human Connectome Project. We achieved a highly accurate prediction of post menstrual age (PMA) at scan on term-born infants (Mean absolute error (MAE) = 0.72 weeks, r = 0.83, p

Machine Learning en Español
15 Adaboost: Adaptive Boosting

Machine Learning en Español

Play Episode Listen Later Sep 28, 2020 16:32


Adaboost is one of the classic machine learning algorithms. Just like Random Forest and XGBoost, it belongs to the class of ensemble models, that is, models based on aggregating other weak or base models to make predictions. The main difference of Adaboost is that it is adaptive: it learns from the errors made by the earlier models by putting more emphasis on the incorrectly classified examples.

Machine Learning with Coffee
15 Adaboost: Adaptive Boosting

Machine Learning with Coffee

Play Episode Listen Later Sep 28, 2020 18:01


Adaboost is one of the classic machine learning algorithms. Just like Random Forest and XGBoost, Adaboost belongs to the ensemble models; in other words, it aggregates the results of simpler classifiers to make robust predictions. The main difference of Adaboost is that it is an adaptive algorithm, which means that it learns from the misclassified instances of previous models, assigning more weight to those errors and focusing its attention on those instances in the next round.
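
As a quick illustration of the reweighting idea described here, scikit-learn's AdaBoostClassifier builds exactly this kind of adaptive ensemble of shallow trees (a minimal sketch, not material from the episode):

```python
# Minimal AdaBoost sketch: each round fits a weak learner (a stump),
# then increases the weight of the examples it misclassified so the
# next round focuses on them.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)  # the weak base model
# Note: in scikit-learn < 1.2 the parameter is called base_estimator.
ada = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)

print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean().round(3))
```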

PaperPlayer biorxiv bioinformatics
HieRFIT: Hierarchical Random Forest for Information Transfer

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Sep 18, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.09.16.300822v1?rss=1 Authors: Kaymaz, Y., Ganglberger, F., Tang, M., Fernandez-Albert, F., Lawless, N., Sackton, T. B. Abstract: The emergence of single-cell RNA sequencing (scRNA-seq) has led to an explosion in novel methods to study biological variation among individual cells, and to classify cells into functional and biologically meaningful categories. Here, we present a new cell type projection tool, HieRFIT (Hierarchical Random Forest for Information Transfer), based on hierarchical random forests. HieRFIT uses a priori information about cell type relationships to improve classification accuracy, taking as input a hierarchical tree structure representing the class relationships, along with the reference data. We use an ensemble approach combining multiple random forest models, organized in a hierarchical decision tree structure. We show that our hierarchical classification approach improves accuracy and reduces incorrect predictions, especially for inter-dataset tasks, which reflect real-life applications. We use a scoring scheme that adjusts probability distributions for candidate class labels and resolves uncertainties while avoiding the assignment of cells to incorrect types, by labeling cells at internal nodes of the hierarchy when necessary. Using HieRFIT, we re-analyzed publicly available scRNA-seq datasets, showing its effectiveness in cell type cross-projections with inter/intra-species examples. HieRFIT is implemented as an R package and is available at https://github.com/yasinkaymaz/HieRFIT/releases/tag/v1.0.0 Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Combining Natural Language Processing and Metabarcoding to Reveal Pathogen-Environment Associations

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Sep 3, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.09.02.280578v1?rss=1 Authors: Molik, D. C., Tomlinson, D., Davitt, S., Morgan, E. L., Roche, B., Meyers, N., Pfrender, M. E. Abstract: Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year--with 180,000 resulting deaths--mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen, we implement a Natural Language Processing approach to better describe the niche of C. neoformans. We use a Latent Dirichlet Allocation model to de novo topic model sets of metagenetic research articles written about varied subjects which either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene-regions. The number of topics was determined based on the model coherence score, and articles were assigned to the created topics via a machine learning approach with a Random Forest algorithm. Our analysis provides support for a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach, using a search of single-locus metagenetic data, gathering papers connected to the datasets, de novo determination of topics and their number, and assignment of articles to the topics, illustrates how such an analysis pipeline can harness large-scale datasets that are published/available but not necessarily fully analyzed, or whose metadata are not harmonized with other studies. Our approach can be applied to a variety of systems to assert potential evidence of environmental associations. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
geneRFinder: gene finding in distinct metagenomic data complexities

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Aug 24, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.21.262147v1?rss=1 Authors: Silva, R., Padovani, K., Goes, F., Alves, R. Abstract: Motivation: Microbes play a fundamental economic, social and environmental role in our society. Metagenomics makes it possible to investigate microbes in their natural environments (complex communities) and their interactions. The way they act is usually estimated by looking at the functions they play in those environments, and their responsibility is measured by their genes. Advances in next-generation sequencing technology have facilitated metagenomics research, but they have also created a heavy computational burden. Large and complex biological datasets are available as never before. There are many gene predictors available which can aid the gene annotation process, though they do not handle metagenomic data complexities appropriately. There is no standard metagenomic benchmark data for gene prediction. Thus, gene predictors may inflate their results by obfuscating low false discovery rates. Results: We introduce geneRFinder, an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across this benchmark by using only one pre-trained Random Forest model. Average prediction rates of geneRFinder differed in percentage terms by 54% and 64%, respectively, against Prodigal and FragGeneScan while handling high-complexity metagenomes. The specificity rate of geneRFinder had the largest distance against FragGeneScan, 79 percentage points, and 66 more than Prodigal. According to McNemar's test, all percentage differences between predictor performances are statistically significant for all datasets with a 99% confidence interval. Conclusions: We provide geneRFinder, an approach for gene prediction in distinct metagenomic complexities, available at github.com/railorena/geneRFinder, and we also provide a novel, comprehensive benchmark dataset for gene prediction --- which is based on The Critical Assessment of Metagenome Interpretation (CAMI) challenge and contains labeled data from gene regions --- available at sourceforge.net/p/generfinder-benchmark . Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
An Ensemble Learning Approach for Cancer Drug Prediction

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Aug 11, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.10.245142v1?rss=1 Authors: Mandera, D., Ritz, A. Abstract: Predicting the response to a particular drug for a specific cancer, even with known genetic mutations, remains a huge challenge in modern oncology and precision medicine. Today, prescribing a drug for a cancer patient is based on a doctor's analysis of various articles and previous clinical trials; it is an extremely time-consuming process. We developed a machine learning classifier to automatically predict a drug given a carcinogenic gene mutation profile. Using the Breast Invasive Carcinoma dataset from The Cancer Genome Atlas (TCGA), the method first selects features from mutated genes and then applies K-fold cross-validation with Decision Tree, Random Forest and Ensemble Learning classifiers to predict the best drugs. Ensemble Learning yielded a prediction accuracy of 66% on the test set in predicting the correct drug. To validate that the model is general-purpose, Lung Adenocarcinoma (LUAD) and Colorectal Adenocarcinoma (COADREAD) data from TCGA were trained and tested, yielding prediction accuracies of 50% and 66%, respectively. The resulting accuracy indicates a direct correlation between prediction accuracy and cancer data size. More importantly, the results for LUAD and COADREAD show that the implemented model is general-purpose, as it is able to achieve similar results across multiple cancer types. We further verified the validity of the model by applying it to patients with unclear recovery status from the COADREAD dataset. In every case, the model predicted a drug that was administered to each patient. This method will offer oncologists significant time savings compared to their current approach of extensive background research, and offers personalized care for cancer patients. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Development of Putative Isospecific Inhibitors for HDAC6 using Random Forest, QM-Polarized docking, Induced-fit docking, and Quantum mechanics

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Aug 10, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.10.243824v1?rss=1 Authors: Joel, I. y., ADIGUN, T. O., BANKOLE, O. O., AJIBOLA, A. O., OFENIFORO, E. B., AUTA, F. B., OZOJIOFOR, U. O., REMI-ESAN, I. A., AKANDE, A. I. Abstract: Histone deacetylases have been recognized as a potential target for epigenetic aberrance reversal in various strategies for cancer therapy, with HDAC6 implicated in various forms of tumor growth and cancer. Diverse inhibitors of HDAC6 have been developed; however, there is still the challenge of isoform specificity and toxicity. In this study, we trained a Random Forest model on all HDAC6 inhibitors curated in the ChEMBL database (3,742). Upon rigorous validation the model had an 85% balanced accuracy and was used to screen the SCUBIDOO database; 7,785 hit compounds resulted and were docked into the HDAC6 CD2 active site. The top two compounds, having a benzimidazole moiety as their zinc-binding group, had binding affinities of -78.56 kcal/mol and -78.21 kcal/mol respectively. The compounds were subjected to exhaustive docking protocols (QM-polarized docking and induced-fit docking) in order to elucidate a binding hypothesis and an accurate binding affinity. Upon optimization, the compounds showed improved binding affinity (-81.42 kcal/mol), putative specificity for HDAC6, and good ADMET properties. We have therefore developed a reliable model to screen for HDAC6 inhibitors and suggested a series of benzimidazole-based inhibitors showing high binding affinity and putative specificity for HDAC6. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
A Deep Learning Framework for Predicting Human Essential Genes by Integrating Sequence and Functional data

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Aug 5, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.04.236646v1?rss=1 Authors: Xiao, W., Zhang, X., Xiao, W. Abstract: Motivation: Essential genes are necessary for the survival or reproduction of a living organism. The prediction and analysis of gene essentiality can advance our understanding of basic life and human diseases, and further boost the development of new drugs. Wet-lab methods for identifying essential genes are often costly, time-consuming, and laborious. As a complement, computational methods have been proposed to predict essential genes by integrating multiple biological data sources. Most of these methods are evaluated on model organisms. However, prediction methods for human essential genes are still limited, and the relationship between human gene essentiality and different biological information still needs to be explored. In addition, exploring suitable deep learning techniques to overcome the limitations of traditional machine learning methods and improve prediction accuracy is also important and interesting. Results: We propose a deep learning based method, DeepSF, to predict human essential genes. DeepSF integrates four types of features: sequence features, features from gene ontology, features from protein complexes, and network features. Sequence features are derived from the DNA and protein sequence of each gene. 100 GO terms from the cellular component ontology are used to form a feature vector for each gene, in which each component captures the relationship between a gene and a GO term. Network features are learned from the protein-protein interaction (PPI) network using a deep learning based network embedding method. The features derived from protein complexes capture the relationships between a gene, or a gene's direct neighbors in the PPI network, and protein complexes. The four types of features are integrated together to train a multilayer neural network. The experimental results of 10-fold cross-validation show that DeepSF can accurately predict human gene essentiality, with an average performance of AUC about 94.35%, area under the precision-recall curve (auPRC) about 91.28%, accuracy about 91.35%, and F1 measure about 77.79%. In addition, the comparison results show that DeepSF significantly outperforms several widely used traditional machine learning models (SVM, Random Forest, and Adaboost), and performs slightly better than a recent deep learning model (DeepHE). Conclusions: We have demonstrated that the proposed method, DeepSF, is effective for predicting human essential genes. Deep learning techniques are promising at both the feature learning and classification levels for the task of essential gene prediction. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Single-cell identity definition using random forests and recursive feature elimination (scRFE)

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Aug 4, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.03.233650v1?rss=1 Authors: Park, M., Vorperian, S., Wang, S., Pisco, A. O. Abstract: Single cell RNA sequencing (scRNA-seq) enables detailed examination of a cell's underlying regulatory networks and the molecular factors contributing to its identity. We developed scRFE (single-cell identity definition using random forests and recursive feature elimination, pronounced 'surf') with the goal of easily generating interpretable gene lists that can accurately distinguish observations (single cells) by their features (genes) given a class of interest. scRFE is an algorithm implemented as a Python package that combines the classical random forest method with recursive feature elimination and cross-validation to find the features necessary and sufficient to classify cells in a single-cell RNA-seq dataset by ranking feature importance. The package is compatible with Scanpy, enabling a seamless integration into any single-cell data analysis workflow that aims at identifying minimal transcriptional programs relevant to describing metadata features of the dataset. We applied scRFE to the Tabula Muris Senis and reproduced commonly known aging patterns and transcription factor reprogramming protocols, highlighting the biological value of scRFE's learned features. Copyright belongs to the original authors. Visit the link for more info
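
scRFE itself is its own Python package, but the underlying combination (random-forest importance plus recursive feature elimination with cross-validation) exists directly in scikit-learn. A generic sketch of that idea, not the scRFE package:

```python
# Generic sketch of the scRFE idea in plain scikit-learn: rank features
# by random-forest importance and recursively eliminate the weakest,
# with cross-validation choosing how many to keep.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=1_000, n_features=50,
                           n_informative=8, random_state=0)

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    step=5,          # drop 5 features per elimination round
    cv=5,
    scoring="accuracy",
)
rfecv.fit(X, y)
print("selected", rfecv.n_features_, "of", X.shape[1], "features")
```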

Machine Learning en Español
14 XGBoost: El Ganador de Muchas Competencias

Machine Learning en Español

Play Episode Listen Later Jul 26, 2020 19:15


XGBoost is an open-source software library that has won several Machine Learning competitions. XGBoost is based on the principles of gradient boosting, which in turn is based on the ideas of Leo Breiman, the creator of Random Forest. The theory behind gradient boosting was formalized by Jerome H. Friedman. Gradient boosting combines simple models and uses very clever engineering, which includes a penalty for the trees and a proportional shrinking of the leaf nodes.

PaperPlayer biorxiv bioinformatics
Cross-Tissue Transcriptomic Analysis Leveraging Machine Learning Approaches Identifies New Biomarkers for Rheumatoid Arthritis

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Jul 26, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.07.24.220483v1?rss=1 Authors: Rychkov, D., Neely, J., Oskotsky, T., Sirota, M. Abstract: Background/Purpose: There is an urgent need to identify effective biomarkers for the early diagnosis of Rheumatoid Arthritis (RA) and accurate monitoring of disease activity. Here we define an RA meta-profile using publicly available cross-tissue gene expression data and apply machine learning to identify putative biomarkers, which we further validate on independent datasets. Methods: We carried out a comprehensive search for publicly available microarray gene expression data in the NCBI Gene Expression Omnibus database for whole blood and synovial tissues from RA patients and healthy controls. The raw data from 13 synovium datasets with 284 samples and 14 blood datasets with 1,885 samples were downloaded and processed. The datasets for each tissue were merged, batch corrected, and split into training and test sets. We then developed and applied a robust feature selection pipeline to identify genes dysregulated in both tissues and highly associated with RA. From the training data we identified a set of overlapping differentially expressed genes, following the condition of co-directionality. The classification performance of each gene in the resulting set was evaluated on the test sets using AUROC. Five independent datasets were used to validate and threshold the feature-selected (FS) genes. Finally, we define the RA Score, composed of the geometric mean of the selected RA Score panel genes, and demonstrate its clinical utility. Results: The result of the feature selection pipeline was a set of 25 upregulated and 28 downregulated genes. To assess the robustness of these feature-selected genes, we trained a Random Forest machine learning model with this set of 53 genes, and then with the set of 32 common differentially expressed genes, and tested on the validation cohorts. The model with FS genes outperformed the model with common DE genes, with AUC 0.89 +/- 0.04 vs 0.86 +/- 0.05. The FS genes were further thresholded on the 5 independent datasets, resulting in 10 upregulated genes, TNFAIP6, S100A8, TNFSF10, DRAM1, LY96, QPCT, KYNU, ENTPD1, CLIC1, ATP6V0E1, that are involved in innate immune system pathways, including neutrophil degranulation and apoptosis, and expressed in granulocytes, dendritic cells, and macrophages; and 3 downregulated genes, HSP90AB1, NCL, CIRBP, involved in metabolic processes and T-cell receptor regulation of apoptosis and expressed in lymphoblasts. To investigate the clinical utility of the 13 validated genes, the RA Score was developed and found to be highly correlated with DAS28 (r = 0.33 +/- 0.03, p = 7e-9) and able to distinguish OA and RA samples (OR 0.57, 95% CI [0.34, 0.80], p = 8e-10). Moreover, the RA Scores were not significantly different for RF-positive and RF-negative RA sub-phenotypes (p = 0.9), suggesting the generalizability of this score in clinical applications. The RA Score was also able to monitor treatment effect among RA patients (t-test of treated vs untreated, p = 2e-4) and distinguish polyJIA from healthy individuals in 10 independent pediatric cohorts (OR 1.15, 95% CI [1.01, 1.3], p = 2e-4). Conclusion: The RA Score, consisting of 13 putative biomarkers, identified through a robust feature selection procedure on public data and validated using multiple independent datasets, may be useful in the diagnosis and treatment monitoring of RA. Copyright belongs to the original authors. Visit the link for more info

Machine Learning with Coffee
14 XGBoost: The Winner of Many Competitions

Machine Learning with Coffee

Play Episode Listen Later Jul 26, 2020 13:50


XGBoost is an open-source software library which has won several Machine Learning competitions on Kaggle. It is based on the principles of gradient boosting, which in turn builds on the ideas of Leo Breiman, the creator of Random Forest. The theory behind gradient boosting was later formalized by Jerome H. Friedman. Gradient boosting combines weak learners, just as Random Forest does. XGBoost is an engineering implementation which includes a clever penalization of trees and a proportional shrinking of leaf nodes.
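
A minimal sketch of the library in use, assuming the xgboost package is installed; the regularization knobs mentioned above show up directly as parameters:

```python
# Minimal XGBoost sketch: gradient boosting with the library's
# regularization knobs (tree penalty via gamma/reg_lambda, leaf
# shrinkage via the learning rate).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5_000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,   # shrinks each tree's contribution
    max_depth=4,
    reg_lambda=1.0,      # L2 penalty on leaf weights
    gamma=0.0,           # minimum loss reduction required to split
    eval_metric="logloss",
)
model.fit(X_tr, y_tr)
print("test accuracy:", round(model.score(X_te, y_te), 3))
```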

Machine Learning en Español
13 Random Forest

Machine Learning en Español

Play Episode Listen Later Jul 12, 2020 23:16


Random Forest is one of the best algorithms that are ready to use without much tuning. In this episode we try to understand the intuition behind this algorithm and how it takes advantage of decision trees by aggregating them with a very good trick called bagging. Variable importance and the out-of-bag error are features of this algorithm that help us better understand which variables are the most important and what the generalization error is, respectively.

Machine Learning with Coffee
13 Random Forest

Machine Learning with Coffee

Play Episode Listen Later Jul 12, 2020 23:07


Random Forest is one of the best off-the-shelf algorithms. In this episode we try to understand the intuition behind the Random Forest and how it tries to leverage the capabilities of Decision Trees by aggregating them using a very smart trick called "bagging". Variable importance and out-of-bag error are two of the nice capabilities of Random Forest, which allow us to find the most important predictors and compute a good generalization error, respectively.
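
Both capabilities mentioned here are one flag away in scikit-learn; a minimal sketch (not from the episode):

```python
# Sketch: bagging, out-of-bag error, and variable importance with
# scikit-learn's RandomForestClassifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,     # estimate generalization error from OOB samples
    random_state=0,
).fit(data.data, data.target)

print("OOB accuracy:", round(forest.oob_score_, 3))

# Impurity-based variable importance, largest first.
ranked = sorted(zip(forest.feature_importances_, data.feature_names),
                reverse=True)
for importance, name in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```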

Quant Trading Live Report
Analyzing potential of random forest machine learning for crypto trading bot

Quant Trading Live Report

Play Episode Listen Later Apr 29, 2020 16:10


This is the ABSOLUTE most critical metric out there when trading crypto. If your exchange does not offer this, move to another that does. Can you rely on any exchange that offers this? Also, most 3rd-party bot services or platforms most likely will NOT offer this metric to you. So be forewarned: if you don't get it, you could get crushed and lose money. Free trading books: https://quantlabs.net/ or learn algo trading: https://quantlabs.net/dvd https://quantlabs.net/blog/2020/04/most-powerful-high-frequency-trading-aka-hft-like-metric-for-a-crypto-trading-bot/

Quant Trading Live Report
PCA vs Random Forest vs Regressions with machine learning for crypto trading

Quant Trading Live Report

Play Episode Listen Later Apr 23, 2020 34:04


This machine learning technique is used for market prediction with a crypto trading bot. The options were set out as explained here. Free trading books: https://quantlabs.net/ or learn algo trading: https://quantlabs.net/dvd Some forecasting methods from an expert: maybe regress them all against price, or conduct PCA, or perhaps just throw them all into a random forest in order to see which ones are the most important? All of these options are available in Scikit-learn with Python, and Scikit-learn seems to be the simplest machine learning library to go with in Python.
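
A hedged sketch of what the three options named above look like on the same feature matrix in scikit-learn, with synthetic data standing in for market indicators:

```python
# Sketch of the three options: regress features against price, run PCA,
# or rank them with a random forest. Synthetic stand-in data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 6))                  # six candidate indicators
price = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=1_000)

# Option 1: regress them all against price.
print("coefficients:", LinearRegression().fit(X, price).coef_.round(2))

# Option 2: PCA on the indicators.
pca = PCA(n_components=3).fit(X)
print("explained variance:", pca.explained_variance_ratio_.round(2))

# Option 3: throw them into a random forest and rank importance.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, price)
print("importances:", forest.feature_importances_.round(2))
```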

Quant Trading Live Report
Analyzing potential of random forest machine learning for crypto trading bot

Quant Trading Live Report

Play Episode Listen Later Apr 23, 2020 29:14


Analyzing the potential of random forest machine learning for a crypto trading bot: I was playing around with random forest, as posted here. Free trading books: https://quantlabs.net/ or learn algo trading: https://quantlabs.net/dvd

Machine learning
Gradient boost and adboostingclassifier and bagging and random forest classifier

Machine learning

Play Episode Listen Later Apr 22, 2020 19:45


Modellansatz
Machine Learning - Maschinelles Lernen

Modellansatz

Play Episode Listen Later Mar 5, 2020 41:23


Gudrun talks with Sebastian Lerch from the Institute of Stochastics in the KIT Department of Mathematics. Some time ago - in early 2015 - the two had already talked about how extreme weather events can be modeled stochastically. This time the topic is a course that Sebastian designed specifically to offer doctoral researchers of all disciplines at KIT an introduction to machine learning. The setting for this is the graduate school MathSEED, which is part of the KIT Center MathSEE founded in October 2018. There have long (and perhaps always) been offerings at KIT that introduced engineers in particular to modern mathematics, because they needed its methods during the master's phase or at the latest during their doctorate, yet these methods are not covered by the classical contents of the standard higher mathematics courses. All of this is now bundled and extended under the umbrella of MathSEED. Moreover, it now works in both directions: mathematicians are likewise invited to introductory offerings of the other participating departments. The topic of machine learning and artificial intelligence was at the very top of the wish list for new offerings. In February 2020 Sebastian designed and taught this lecture for the first time - the exercises were supervised by Eva-Maria Walz. The course will be offered again in autumn 2020. It is not easy to separate the various terms used for artificial intelligence (AI) from one another, especially since usage differs across contexts. In addition, with the availability of large amounts of data and the frequent joint use of AI and big data, much gets mixed up here as well. Sebastian defines machine learning as a proper subset of AI, keeping in mind that, for example, symbolic computation is also AI. Likewise, so-called expert systems have long provided support for decisions: rules specify a program that turns data input into output. Today, when we think of AI, we rather think of the computer learning, say, what a picture of a car looks like without being given explicit rules for it - more comparable to how children learn. The most modern variant is so-called deep learning based on neural networks. The boundary to statistical methods is sometimes not so clear. The neural network becomes a black box, which does not fully satisfy scientifically minded people. But with its help, more complex problems become solvable. Research must try to make the decisions of the black box comprehensible and decide when the quality is sufficient. For this, one has to consider: how do you measure errors? In image processing it can be enough to count, for example, wrongly recognized cars. In weather forecasting one can determine in hindsight which errors were made in the forecast. There will be different error tolerances for the detection of pedestrians by self-driving cars and for the accuracy of a weather forecast. One example in the exercises was temperature prediction from available data. Forecasts are based on physical models in which the evolution of temperature, air pressure and wind speed is reproduced by systems of equations. But these models cannot be computed without error and are also rather strongly simplified.
These errors are analyzed with the help of AI, and the results are used to improve the forecast. A popular method is random forests, built from decision trees. Here, complex questions are decomposed step by step, and at each step simple yes/no questions are answered. This is applied, for example, when deciding whether and where a warning about a thunderstorm cell should be issued. Very well known and proven in practical use (for example in image processing and in translation between common languages) are neural networks. Here, so-called neurons are arranged in several layers. One can picture them as nodes in a network through which data are transported from node to node. In each node the incoming data are summed with weights, and a predefined activation function decides what is passed on to the next nodes or the next layer of neurons; a toy version of this computation is sketched after the reference list below. The individual arithmetic operations are thus quite elementary, but their interplay is hard to analyze. With many layers one speaks of deep learning. This is still in its infancy at the moment, but it can have far-reaching consequences. In any case, humans should be involved in the decision process. For the concrete implementation, Sebastian chose equal parts lecture and exercises. He put the emphasis on an overview of methodological aspects that enables the participants to continue learning on their own later. Among other things, this covered how to select training data, how quality assurance works, how popular models work, and how to judge whether a model is fitted too closely to the data. In the exercises, it was very well received that a live online forecasting competition between the developed models was possible through Kaggle competitions.
Literature and further information
Research results obtained with machine learning in which Sebastian Lerch is involved:
M.N. Lang e.a.: Remember the past: A comparison of time-adaptive training schemes for non-homogeneous regression. Nonlinear Processes in Geophysics, 27: 23–34, 2020. (more on the stochastic side)
S. Rasp and S. Lerch: Neural networks for post-processing ensemble weather forecasts. Monthly Weather Review, 146(11): 3885–3900, 2018.
Textbooks
T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning. Springer 2017 (2nd edition).
G. James, D. Witten, T. Hastie and R. Tibshirani: An Introduction to Statistical Learning. Springer 2013 (7th printing).
I. Goodfellow, Y. Bengio and A. Courville: Deep Learning. MIT Press 2016.
Online courses
PyTorch-based Python library fastai
Deeplearning
Dystopias of everyday AI
C. Doctorow: Little Brother. Tor Teen, 2008. Download from the author.
C. Doctorow: Homeland. Tor Books, 2013, ISBN 978-0-7653-3369-8.
Image processing mentioned in the conversation that merges your own photos with artworks
Meetups around Karlsruhe
Karlsruhe ai Meetup
Heidelberg ai Meetup
Machine Learning Rhein-Neckar (Mannheim)
Podcasts
Leben X0 - Episode 6: Was ist Machine Learning? November 2019.
Streitraum: Intelligenz und Vorurteil. Carolin Emcke in conversation with Anke Domscheit-Berg and Julia Krüger, January 26, 2020.
P. Packmohr, S. Ritterbusch: Neural Networks, Data Science Phil, Episode 16, 2019.
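
The neuron mechanics described above (weighted sums at each node, then an activation function deciding what is passed on) fit in a few lines of NumPy. A toy forward pass, not tied to the episode's material:

```python
# Toy forward pass through one hidden layer: weighted sums at each
# node, then an activation function gates what reaches the next layer.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input: e.g. temperature, pressure, wind
W1 = rng.normal(size=(4, 3))      # weights into 4 hidden neurons
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))      # weights into 1 output neuron
b2 = np.zeros(1)

hidden = relu(W1 @ x + b1)        # weighted sum + activation per neuron
output = W2 @ hidden + b2         # linear output, e.g. corrected forecast
print(output)
```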

Delicate Database with Aaron
Machine Learning - Random Forest

Delicate Database with Aaron

Play Episode Listen Later Dec 4, 2019 11:14


Today was a bit of a challenge, but I did find it fun trying to break down and explain Random Forest. I hope you enjoy the episode, and as always feel free to drop an email at timicode54@gmail.com if you have any further questions you wish to ask me. --- Send in a voice message: https://anchor.fm/delicatedatabase/message

Machine learning
Random forest and linear regression and gradient boost in python

Machine learning

Play Episode Listen Later Aug 22, 2019 18:28


Thoughts on these three classifiers

The Banana Data Podcast
Prioritizing training data, model interpretability, and dodging an AI Winter

The Banana Data Podcast

Play Episode Listen Later Aug 16, 2019 27:19


This episode, Triveni and Will tackle the value, ethics, and methods of good labeled data, while also weighing the need for model interpretability and the possibility of an impending AI winter. Triveni will also take us through a step-by-step of the decisions made by a Random Forest algorithm. As always, be sure to rate and subscribe! Be sure to check out the articles we mentioned this week: The Side of Machine Learning You're Undervaluing and How to Fix it by Matt Wilder (LabelBox) The Hidden Costs of Automated Thinking by Jonathan Zittrain (The New Yorker) Another AI Winter Could Usher in a Dark Period for Artificial Intelligence by Eleanor Cummins (PopSci)

Machine learning
Ml feature discovery using random forest classifier

Machine learning

Play Episode Listen Later Jun 18, 2019 35:03


Sven explains how he discovered and engineered the features for his computational model for metal stability prediction

The InfoQ Podcast
Megan Cartwright on Building a Machine Learning MVP at an Early Stage Startup

The InfoQ Podcast

Play Episode Listen Later Jan 28, 2019 32:14


Today on the InfoQ Podcast, Wes speaks with ThirdLove's Megan Cartwright. Megan is the Director of Data Science for the personalized bra company. In the podcast, Megan first discusses why their customers need a more personal experience and how they're using technology to help. She spends quite a bit of time in the podcast discussing how the team got to an early MVP and then how they did the same for getting to an early machine learning MVP for product recommendations. In this latter part, she discusses the decisions they made on what data to use, how to get the solution into production quickly, how to update/train new models, and where they needed help. It's a real early-stage startup story of a lean team leveraging machine learning to get to a practical recommendations solution in a very short timeframe. Why listen to this podcast:
- The experience of women selecting bras is a poor one, characterized by awkward fitting experiences and an often uncomfortable product that may not even fit correctly. ThirdLove is a company built to serve this market.
- ThirdLove took a lean approach to developing their architecture. It's built with the Parse backend. They leveraged Shopify to build the site. The company's first recommender system used a rules engine embedded in the front end. After that, they moved to a machine learning MVP with a Python recommender service that used a Random Forest algorithm in SciKit-Learn.
- Despite having the data for 10 million surveys, the first algorithms only needed about 100K records to be trained. The takeaway is you don't have to have huge amounts of data to get started with machine learning.
- To initially deploy their ML solution, ThirdLove first shadowed all traffic through the algorithm and then compared it to what was being output by the rules engine. Using this along with information on the full customer order lifecycle, they validated that the ML solution worked correctly and outperformed the rules engine.
- ThirdLove's machine learning story shows that you can move towards a machine learning solution quickly by leveraging your own network and using tools that may already be familiar to your team.
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2G9RnQn You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq Subscribe: www.youtube.com/infoq Like InfoQ on Facebook: bit.ly/2jmlyG8 Follow on Twitter: twitter.com/InfoQ Follow on LinkedIn: www.linkedin.com/company/infoq Check the landing page on InfoQ: https://bit.ly/2G9RnQn
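
The episode doesn't show code, but the shape of the described MVP (a scikit-learn Random Forest trained on survey answers to recommend a size) can be sketched. Every column name and value below is hypothetical, invented purely for illustration:

```python
# Hypothetical sketch of a survey-based size recommender in the style
# described in the episode. All column names and data are made up.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Pretend survey data: answers in, previously purchased size out.
surveys = pd.DataFrame({
    "band_measurement": [34, 36, 32, 38, 34, 36],
    "current_size_too_tight": [1, 0, 1, 0, 0, 1],
    "straps_slip": [0, 1, 0, 1, 0, 0],
    "size_label": ["34B", "36C", "32D", "38B", "34C", "36B"],
})
X = surveys.drop(columns="size_label")
y = surveys["size_label"]

model = RandomForestClassifier(random_state=0).fit(X, y)

# Recommend a size for one new (hypothetical) survey response.
new_survey = pd.DataFrame([{"band_measurement": 36,
                            "current_size_too_tight": 1,
                            "straps_slip": 0}])
print(model.predict(new_survey))
```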

Everyone Has A Podcast
Random Forest Buns

Everyone Has A Podcast

Play Episode Listen Later Sep 13, 2018 56:41


PREORDER “THE STORY BEHIND BOOK” ON AMAZON at www.ehappodcast.com/emily. It makes a great gift for the holiday season! If you want to stay connected with Adam and Bryon, you can like our Facebook page www.facebook.com/ehappodcast. If you want to engage with us on Facebook, feel free to join our Facebook group www.facebook.com/groups/ehappodcast. You can also follow us on Twitter and Instagram @ehappodcast. Feel free to check out our website www.ehappodcast.com seeing as how you’re becoming mildly obsessed with us. You can contact Adam and Bryon via email at ehappodcast@gmail.com. If you feel like supporting the show, you can buy a t-shirt from our Teepublic store at www.ehappodcast.com/store. If you don’t like wearing clothes and want exclusive content, you can support us on Patreon for the price of a $1 cup of coffee at: www.patreon.com/ehap
Music:
Intro Song: “Kingdom in the Clouds”
Written by Adam Boutilier
Performed by Chris Layes and Adam Boutilier
Guitarist: Chris Layes
Clapping: Chris Layes
Outro music: “EHAP Outro 2018”
Created by: Adam Boutilier using Logic Pro.
Assistant to Mr. Depp: Daniel Repholz
This week’s Chris Pick: Where They Wander by HorrorPops
If you enjoy the music on the show and happen to be an Apple Music subscriber, be sure to subscribe to our ever-growing Apple Music playlist. You can check that bad daddy out right here: https://itunes.apple.com/ca/playlist/everyone-has-a-podcast/pl.u-eaqfK2PEEq
Any music used in the ‘Chris Pick’ segment is for entertainment and educational purposes only. All works belong to their original owners and are used solely for the promotion of the artists. If you enjoy the music used in this segment we strongly encourage you to purchase it and support the artists. All music used in this show has been purchased digitally from iTunes prior to use. 2018 © Everyone Has A Podcast

Fantasy Toolz Podcast
Episode 2.25 - Ensemble Learners

Fantasy Toolz Podcast

Play Episode Listen Later Aug 14, 2018 32:52


2.25 visits DEF CON (0:40), weighs a steam locomotive argument (1:56), bats around some baseball topics (4:08), introduces the week’s algorithm challenge topic: Random Forests (5:23), applies the Random Forest algorithm to TGFBI data (9:51), applies the Random Forest algorithm to closers (16:36), ponders the home league (24:37), punts some football topics (26:16), and reviews Fight Club (28:12).

Trending In Education
World Cup 2018 - Mystic Animal versus AI Predictions and the Beautiful Game - Trending in Education - Episode 97

Trending In Education

Play Episode Listen Later Jun 19, 2018 27:55


Brandon, Dan, and Mike jump onto the pitch with this look at the 2018 World Cup. We talk predictions from mystical animals as well as artificial intelligence. Does Mystic Marcus the Pig have the final four teams locked up, or do AI's 100,000 simulations prove Spain and Germany will be the last teams standing? We touch on Random Forest simulations and Poisson distributions as we explore the various ways in which humans make predictions. And we wrap it all up with a quick dive into a great set of resources from TheirWorld.Org which breaks down the teams and their respective countries based on relevant educational statistics. All that and more on the latest Trending In Education.
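The Poisson approach discussed here is easy to sketch: treat each team's goal count as a Poisson random variable with a team-strength rate and simulate many matches. The rates below are made up for illustration, not taken from the models in the episode.

```python
# Minimal sketch of Poisson-style match prediction: simulate goal counts
# for two teams and estimate win/draw probabilities by Monte Carlo.
import numpy as np

rng = np.random.default_rng(42)
lam_spain, lam_germany = 1.6, 1.8   # hypothetical expected goals per match

sims = 100_000  # same order of magnitude as the simulations discussed
spain_goals = rng.poisson(lam_spain, sims)
germany_goals = rng.poisson(lam_germany, sims)

print("P(Spain win):  ", np.mean(spain_goals > germany_goals))
print("P(draw):       ", np.mean(spain_goals == germany_goals))
print("P(Germany win):", np.mean(spain_goals < germany_goals))
```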

Fatal Error
47. Strange Loop

Fatal Error

Play Episode Listen Later Oct 23, 2017 25:20


Soroush interviews Chris about his experience at this year’s Strange Loop conference.
Strange Loop
Strange Loop Schedule (currently showing the 2017 schedule)
Alex Miller
"Just-So Stories For AI: Explaining Black-Box Predictions" by Sam Ritchie
Decision Tree Learning
Random Forest
""It Me": Under the Hood of Web Authentication" by Yan Zhu, Garrett Robinson
Lito Nikolai
"Level Up Your Concurrency Skills With Rust" by David Sullins
Swift Ownership Manifesto
City Museum
"To Serve the People: Public Interest Technologists" by Matt Mitchell
"Redux: Architecting and Scaling a New Web App at the NY Times" by Juan Carlos Montemayor Elosua
"The Holy Grail of Systems Analysis: From What to Where to Why" by Daniel Spoonhower
"Biomaterials as UI" by Ruthie Nachmany
Talks Chris hasn’t watched yet, but wants to:
"Keeping Time in Real Systems" by Kavya Joshi
"Stop Rate Limiting! Capacity Management Done Right" by Jon Moore
"Dependent Types in Haskell" by Stephanie Weirich
"Observability for Emerging Infra: What Got You Here Won't Get You There" by Charity Majors
"The Security of Classic Game Consoles" by Kevin Shekleton
"Key to the City: Writing Code to Induce Social Change" by Jurnell Cockhren
"The Future is Now" by Rachel White
"Experimental Creative Writing with the Vectorized Word" by Allison Parrish
"Antics, drift, and chaos" by Lorin Hochstein
"Lazy Defenses: Using Scaled TTLs to Keep Your Cache Correct" by Bonnie Eisenman
"Promise and Pitfalls of Persistent Memory" by Rob Dickinson
Pre-Show:
Chris’s Aircraft Radar Alexa skill
Selfridge Air National Guard Base
Yankee Air Museum (Ypsilanti, MI)
Get a new Fatal Error episode every week by becoming a supporter at patreon.com/fatalerror.

Nourish Balance Thrive
How to Reverse Insulin Resistant Type Two Diabetes in 100 Million People in Less Than 10 Years

Nourish Balance Thrive

Play Episode Listen Later Sep 16, 2017 62:48


For decades we’ve heard that diabetes prevention is simple—lose weight, eat less, and exercise more. But something is wrong with the conventional wisdom. Nearly 115 million people live with either diabetes or prediabetes in the United States, and that number is growing. It is time to reverse this trend. Virta was founded in 2014 with the goal of reversing diabetes in 100 million people by 2025. They have made this possible through advancements in the science of nutritional biochemistry and technology that is changing the diabetes care model. James McCarter, MD, PhD, is Head of Research at Virta, and in this interview, Dr. McCarter explains how Virta is using a combination of a very low-carb, ketogenic diet together with 1-on-1 health coaches and some sophisticated machine learning techniques to predict sentiment in natural language and spot anomalies in blood biomarkers. After the recording was made, Dr. McCarter realised that he was off by about a decade on Joslin. Rather than the 1920s, Dr. Elliott Joslin actually began keeping a diabetes registry early in the 20th century and published The Treatment of Diabetes Mellitus in 1917. “Joslin carried out extensive metabolic balance studies examining fasting and feeding in patients with varying severities of diabetes. His findings would help to validate the observations of Frederick Madison Allen regarding the benefit of carbohydrate- and calorie-restricted diets.” Here’s the outline of this interview with James McCarter, MD, PhD: [00:01:00] Divergence, Inc. [00:01:43] Presentation: The Effects of a Year in Ketosis with James McCarter, MD, PhD at the Quantified Self Conference and Exposition. [00:02:44] Books by Gary Taubes. [00:03:13] Omega 3:6 ratios. [00:05:54] Rapeseed and Canola. [00:06:44] Wild Planet sardines. [00:07:11] The Virta story. [00:07:18] Sami Inkinen. [00:07:38] Study: SD. Phinney, BR. Bistrian, WJ. Evans, E. Gervino, GL. Blackburn, The human metabolic response to chronic ketosis without caloric restriction: preservation of submaximal exercise capability with reduced carbohydrate oxidation. Metabolism, volume 32, issue 8, pages 769-76, Aug 1983, PMID 6865776. [00:08:48] Jeff Volek, PhD, RD on PubMed. [00:09:51] Fear of fat. [00:10:13] USDA dietary guidelines. [00:12:59] The goal is to reverse T2D in 100M people. [00:14:09] Study: NCD Risk Factor Collaboration (NCD-RisC). Worldwide trends in diabetes since 1980: a pooled analysis of 751 population-based studies with 4·4 million participants. Lancet (London, England). 2016;387(10027):1513-1530. doi:10.1016/S0140-6736(16)00618-8. [00:14:29] Joslin Diabetes Center. [00:16:37] The causes of T2D. [00:17:35] Calories are now more accessible. [00:18:22] Sugar and refined carbohydrate intake. [00:20:26] Prerequisites for the Virta program. [00:22:19] Telemedicine, health coaches, online nutrition and behaviour education, biometric feedback, peer community. [00:23:53] Getting off meds. [00:24:50] HbA1C > 6 or glucose > 120 mg/dL. [00:25:32] Purdue University. [00:26:28] Podcast: Econtalk: Mark Warshawsky on Compensation, Health Care Costs, and Inequality. [00:29:02] Study: American Diabetes Association. Economic Costs of Diabetes in the U.S. in 2012. Diabetes Care. 2013;36(4):1033-1046. doi:10.2337/dc12-2625. [00:29:27] Study: McKenzie AL, Hallberg SJ, Creighton BC, Volk BM, Link TM, Abner MK, Glon RM, McCarter JP, Volek JS, Phinney SD. A Novel Intervention Including Individualized Nutritional Recommendations Reduces Hemoglobin A1c Level, Medication Use, and Weight in Type 2 Diabetes.
JMIR Diabetes. 2017;2(1):e5. [00:30:45] Discontinuing 2/3 of the meds. [00:32:54] Health coaching. [00:34:18] Behaviour change. [00:35:30] Biometrics, blood BHB. [00:38:10] Reducing blood pressure and CRP. [00:38:30] Study: Youm, Yun-Hee, et al. "The ketone metabolite [beta]-hydroxybutyrate blocks NLRP3 inflammasome-mediated inflammatory disease." Nature medicine 21.3 (2015): 263-269. [00:39:49] Blood levels of BHB and weight loss. [00:41:36] STEM-Talk #43: Jeff Volek Explains the Power of Ketogenic Diets to Reverse Type 2 Diabetes. [00:43:33] Machine learning. [00:45:57] The team at Virta, including Nasir Bhanpuri, Catalin Voss and Jackie Lee. See the article Will robots inherit the world of healthcare? for links to their talks. [00:46:49] Random Forest. [00:47:06] Nourish Balance Thrive 7-Minute Analysis. [00:48:05] Natural Language Processing. [00:48:57] Nourish Balance Thrive Highlights email series. [00:50:26] Finding purpose in your work. [00:51:59] Using machine learning to change behaviour. [00:53:25] Book: Hooked: How to Build Habit-Forming Products by Nir Eyal. [00:54:11] Podcast: How to Avoid the Cognitive Middle Gear with James Hewitt. [00:55:37] $400 per month for one year. [00:57:58] Blog post: Does Your Thyroid Need Dietary Carbohydrates? by Stephen Phinney, MD, PhD. [01:00:21] Article: Understanding Local Control of Thyroid Hormones (Deiodinases Function and Activity) and Podcast: The Most Reliable Way to Lose Weight with Dr. Tommy Wood. [01:02:12] Podcast: How Busy Realtors Can Avoid Anxiety and Depression Without Prescriptions or the Help of a Doctor with Douglas Hilbert.

Linear Digressions
Ensemble Algorithms

Linear Digressions

Play Episode Listen Later Jan 22, 2017 13:08


If one machine learning model is good, are two models better? In a lot of cases, the answer is yes. If you build many OK models, and then bring them all together and use them in combination to make your final predictions, you've just created an ensemble model. It feels a little bit like cheating, like you just got something for nothing, but the results don't lie: algorithms like Random Forests and Gradient Boosting Trees (two types of ensemble algorithms) are some of the strongest out-of-the-box algorithms for classic supervised classification problems. What makes a Random Forest random, and what does it mean to gradient boost a tree? Have a listen and find out.
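The episode's core claim is easy to check yourself. Here is a small sketch, on synthetic data, comparing a single decision tree against the two ensemble methods named above; the dataset and parameters are arbitrary choices for illustration.

```python
# Compare one decision tree against two ensembles on a synthetic task:
# ensembles of "ok" models typically beat a single model out of the box.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(n_estimators=200, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    print(type(model).__name__, scores.mean().round(3))
```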

Nourish Balance Thrive
How to Teach Machines That Can Learn

Nourish Balance Thrive

Play Episode Listen Later Dec 8, 2016 57:47


Machine learning is fast becoming a part of our lives, from the order in which your search results and news feeds are ranked to the image classifiers and speech recognition features on your smartphone. Machine learning may even have had a hand in choosing your spouse or driving you to work. As with cars, only the mechanics need to understand what happens under the hood, but all drivers need to know how to operate the steering wheel. Listen to this podcast to learn how to interact with machines that can learn, and about the implications for humanity. My guest is Dr. Pedro Domingos, Professor of Computer Science at the University of Washington. He is the author or co-author of over 200 technical publications in machine learning and data mining, and the author of my new favourite book The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Here’s the outline of this interview with Dr. Pedro Domingos, PhD: [00:01:55] Deep Learning. [00:02:21] Machine learning is affecting everyone's lives. [00:03:45] Recommender systems. [00:03:57] Ordering newsfeeds. [00:04:25] Text prediction and speech recognition in smart phones. [00:04:54] Accelerometers. [00:04:54] Selecting job applicants. [00:05:05] Finding a spouse. [00:05:35] OKCupid.com. [00:06:49] Robot scientists. [00:07:08] Artificially-intelligent Robot Scientist ‘Eve’ could boost search for new drugs. [00:08:38] Cancer research. [00:10:27] Central dogma of molecular biology. [00:10:34] DNA microarrays. [00:11:34] Robb Wolf at IHMC: Darwinian Medicine: Maybe there IS something to this evolution thing. [00:12:29] It costs more to find the data than to do the experiment again (ref?) [00:13:11] Making connections people could never make. [00:14:00] Jeremy Howard’s TED talk: The wonderful and terrifying implications of computers that can learn. [00:14:14] Pedro's TED talk: The Quest for the Master Algorithm. [00:15:49] Craig Venter: your immune system on the Internet. [00:16:44] Continuous blood glucose monitoring and Heart Rate Variability. [00:17:41] Our data: DUTCH, OAT, stool, blood. [00:19:21] Supervised and unsupervised learning. [00:20:11] Clustering and dimensionality reduction, e.g. PCA and t-SNE. [00:21:44] Sodium to potassium ratio versus cortisol. [00:22:24] Eosinophils. [00:23:17] Clinical trials. [00:24:35] Tetiana Ivanova - How to become a Data Scientist in 6 months: a hacker’s approach to career planning. [00:25:02] Deep Learning Book. [00:25:46] Maths as a barrier to entry. [00:27:09] Andrew Ng Coursera Machine Learning course. [00:27:28] Pedro's Data Mining course. [00:27:50] Theano and Keras. [00:28:02] State Farm Distracted Driver Detection Kaggle competition. [00:29:37] Nearest Neighbour algorithm. [00:30:29] Driverless cars. [00:30:41] Is a robot going to take my job? [00:31:29] Jobs will not be lost, they will be transformed. [00:33:14] Automate your job yourself! [00:33:27] Centaur chess player. [00:35:32] ML is like driving, you can only learn by doing it. [00:35:52] A Few Useful Things to Know about Machine Learning. [00:37:00] Blood chemistry software. [00:37:30] We are the owners of our data. [00:38:49] Data banks and unions. [00:40:01] The distinction with privacy. [00:40:29] An ethical obligation to share. [00:41:46] Data vulcanisation. [00:42:40] Teaching the machine. [00:43:07] Chrome incognito mode. [00:44:13] Why can't we interact with the algorithm? [00:45:33] New P2 Instance Type for Amazon EC2 – Up to 16 GPUs. [00:46:01] Why now? [00:46:47] Research breakthroughs. 
[00:47:04] The amount of data. [00:47:13] Hardware. [00:47:31] GPUs, Moore’s law. [00:47:57] Economics. [00:48:32] Google TensorFlow. [00:49:05] Facebook Torch. [00:49:38] Recruiting. [00:50:58] The five tribes of machine learning: evolutionaries, connectionists, Bayesians, analogizers, symbolists. [00:51:55] Grand unified theory of ML. [00:53:40] Decision tree ensembles (Random Forests). [00:53:45] XGBoost. [00:53:54] Weka. [00:54:21] Alchemy: Open Source AI. [00:56:16] Still do a computer science degree. [00:56:54] Minor in probability and statistics.

Data Skeptic
[MINI] The Bootstrap

Data Skeptic

Play Episode Listen Later Nov 25, 2016 10:37


The Bootstrap is a method of resampling a dataset to possibly refine its accuracy and produce useful metrics on the result. The bootstrap is a useful statistical technique and is leveraged in bagging (bootstrap aggregation) algorithms such as Random Forest. We discuss this technique as it relates to polling and surveys.
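The polling application mentioned here fits in a few lines. The sketch below uses a simulated poll; the resampling logic is the bootstrap itself.

```python
# Bootstrap a poll: resample the observed responses with replacement many
# times to get a confidence interval on the estimated support, with no
# parametric assumptions.
import numpy as np

rng = np.random.default_rng(0)
poll = rng.binomial(1, 0.52, size=1000)   # toy poll: 1 = supports candidate

estimates = [rng.choice(poll, size=poll.size, replace=True).mean()
             for _ in range(10_000)]
lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"support ~ {poll.mean():.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```

The same resample-with-replacement step, applied per tree, is what bagging algorithms such as Random Forest build on.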

Histories of Data and the Database
Random Forests and Decision Trees: Machine Learning, Empirical Statistics, and the Challenge of Interpretability

Histories of Data and the Database

Play Episode Listen Later Nov 19, 2016 39:01


Matthew Jones from Columbia University delivers a talk titled “Random Forests and Decision Trees: Machine Learning, Empirical Statistics, and the Challenge of Interpretability.” This talk was included in the session titled “Methods and Ambiguities in the Contemporary Age.” Part of “Histories of Data and the Database,” a conference held at The Huntington Nov. 18–19, 2016.

Archaeology Conferences
0040 - GBAC 2016 - Meg Tracy - Modeling Human Locational Behavior

Archaeology Conferences

Play Episode Listen Later Oct 7, 2016 12:45


Models were developed to predict the spatial distribution of prehistoric archaeological site potential in the Sawtooth National Forest. Archaeological data and environmental parameters were collected and processed in a GIS. Predictor variables were evaluated to discover correlates with human locational behavior and compared against a control dataset. Three modeling methods were used: Logistic Regression, Regression Tree, and Random Forest. These models were assessed for efficacy using k-fold cross-validation and gain statistics. Although the observed relationships could result from biases in the archaeological data and predictors, the results suggest a strong correlation between environment and prehistoric site location.
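The three-model comparison described in the talk follows a standard pattern. Below is a minimal sketch of it with synthetic stand-ins for the environmental predictors; the talk's gain statistics are domain-specific, so ROC AUC is used here instead.

```python
# Compare the three model families from the talk with k-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for site/non-site locations with environmental predictors.
X, y = make_classification(n_samples=1500, n_features=8, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Regression Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=1),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f}")
```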

Data Skeptic
[MINI] Random Forest

Data Skeptic

Play Episode Listen Later Oct 7, 2016 12:43


Random forest is a popular ensemble learning algorithm which leverages bagging both for sampling and feature selection. In this episode we make an analogy to the process of running a bookstore.
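The two sources of randomness described in the episode can be shown in a bare-bones sketch: each tree sees a bootstrap sample of the rows, and only a random subset of features is considered at each split.

```python
# A minimal hand-rolled random forest: bootstrap rows per tree, random
# feature subsets per split (via max_features), majority vote at the end.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    rows = rng.integers(0, len(X), size=len(X))         # bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subsets
    trees.append(tree.fit(X[rows], y[rows]))

# Majority vote across the forest.
votes = np.mean([t.predict(X) for t in trees], axis=0)
print("training accuracy of the ensemble:", np.mean((votes > 0.5) == y))
```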

Medizinische Fakultät - Digitale Hochschulschriften der LMU - Teil 16/19

Assessing the health of populations is important for various reasons, especially for health policy purposes. Therefore, there exists a substantial need for health comparisons between populations, including the comparison of individuals, groups of persons, or even populations from different countries, at one point in time and over time. Two fundamentally different approaches exist to assess the health of populations. The first approach relies on indirect measures of health, which are based on mortality and morbidity statistics, and which are therefore only available at the population level. The second approach relies on direct measures of health, which are collected – based on health surveys – at the individual level. Based on the needs for comparisons, indirect measures appear to be less appropriate, as they are only available at the population level, but not at the individual or group level. Direct measures, however, are originally obtained at the individual level, and can then be aggregated to any group level, even to the population level. Therefore, direct measures seem to be more appropriate for these comparison purposes. The open question is then how to compare overall health based on data collected within health surveys. At first glance, a single general health question seems to be appealing. However, studies have shown that this kind of question is not appropriate to compare health over time, nor across populations. Qualitative studies found that respondents even consider very different aspects of health when responding to such a question. A more appropriate approach seems to be the use of data on several domains of health, as for example mobility, self-care and pain. Anyway, measuring health based on a set of domains is an extremely frequent approach. It provides more comprehensive information and can therefore be used for a wider range of possible applications. However, three open questions must be addressed when measuring health based on a set of domains. First, a parsimonious set of domains must be selected. Second, health measurement based on this set of domains must be operationalized in a standardized way. Third, this information must be aggregated into a summary measure of health, thereby taking into account that categorical responses to survey questions could be differently interpreted by respondents, and are not necessarily directly comparable. These open questions are addressed in this doctoral thesis. The overall objective of this doctoral thesis is to develop a valid, reliable and sensitive metric of health – based on data collected on a set of domains – that permits to monitor the health of populations over time, and which provides the basis for the comparisons of health across different populations. To achieve this aim two psychometric studies were carried out, entitled “Towards a Minimal Generic Set of Domains” and “Development of a metric of health”. In the first study a minimal generic set of domains suitable for measuring health both in the general population and in clinical populations was identified, and contrasted to the domains of the World Health Survey (WHS). The eight domains of the WHS – mobility, self-care, pain and discomfort, cognition, interpersonal activities, vision, sleep and energy, and affect – were used as a reference, as this set – developed by the World Health Organization (WHO) – so far constitutes the most advanced proposal of what to measure for international health comparisons. 
To propose the domains for the minimal generic set, two different regression methodologies – Random Forest and Group Lasso – were applied for the sake of robustness to three different data sources, two national general population surveys and one large international clinical study: the German National Health Interview and Examination Survey 1998, the United States National Health and Nutrition Examination Survey 2007/2008, and the ICF Core Set studies. A domain was selected when it was sufficiently explanatory for self-perceived health. Based on the analyses the following set of domains, systematically named based on their respective categories within the International Classification of Functioning, Disability and Health (ICF), was proposed as a minimal generic set:
b130 Energy and drive functions
b152 Emotional functions
b280 Sensation of pain
d230 Carrying out daily routine
d450 Walking
d455 Moving around
d850 Remunerative employment
Based on this set, four of the eight domains of the WHS were confirmed both in the general and in clinical populations: mobility, pain and discomfort, sleep and energy, and affect. The other WHS domains not represented in the proposed minimal generic set are vision, which was only confirmed with data of the general population, self-care and interpersonal activities, which were only confirmed with data of the clinical population, and cognition, which could not be confirmed at all. The ICF categories of 'carrying out daily routine' and 'remunerative employment' also fulfilled the inclusion criteria, though not directly related to any of the eight WHS domains. This minimal generic set can be used as the starting point to address one of the most important challenges in health measurement, namely the comparability of data across studies and countries. It also represents the first step for developing a common metric of health to link information from the general population to information about sub-populations, such as clinical and institutional populations, e.g. persons living in nursing homes.
The developed health metric can be seen as the starting point for a wide range of health comparisons, between individuals, groups of persons and populations as a whole, and both at one point in time and over time. It opens up a wide range of possible applications for both health care providers and health policy, and both in clinical settings and in the general population.
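The selection logic in the first study (rank candidate domains by how much they explain self-perceived health) can be sketched with Random Forest permutation importance. The survey data below is simulated and the domain names are placeholders; the Group Lasso step of the thesis has no direct scikit-learn equivalent and is omitted.

```python
# Rank candidate functioning domains by permutation importance for a
# (simulated) self-reported general health outcome.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
domains = ["pain", "mobility", "energy", "affect", "vision", "self_care"]
X = rng.integers(1, 6, size=(3000, len(domains)))            # 1-5 Likert answers
y = (X[:, 0] + X[:, 1] + rng.normal(0, 2, 3000) > 6).astype(int)  # toy SRGH

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
for d, m in sorted(zip(domains, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{d:10s} {m:.3f}")
```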

Medizin - Open Access LMU - Teil 22/22
Towards a minimal generic set of domains of functioning and health

Medizin - Open Access LMU - Teil 22/22

Play Episode Listen Later Jan 1, 2014


Background: The World Health Organization (WHO) has argued that functioning, and, more concretely, functioning domains constitute the operationalization that best captures our intuitive notion of health. Functioning is, therefore, a major public-health goal. A great deal of data about functioning is already available. Nonetheless, it is not possible to compare and optimally utilize this information. One potential approach to address this challenge is to propose a generic and minimal set of functioning domains that captures the experience of individuals and populations with respect to functioning and health. The objective of this investigation was to identify a minimal generic set of ICF domains suitable for describing functioning in adults at both the individual and population levels. Methods: We performed a psychometric study using data from: 1) the German National Health Interview and Examination Survey 1998, 2) the United States National Health and Nutrition Examination Survey 2007/2008, and 3) the ICF Core Set studies. Random Forests and Group Lasso regression were applied using one self-reported general-health question as a dependent variable. The domains selected were compared to those of the World Health Survey (WHS) developed by the WHO. Results: Seven domains of the International Classification of Functioning, Disability and Health (ICF) are proposed as a minimal generic set of functioning and health: energy and drive functions, emotional functions, sensation of pain, carrying out daily routine, walking, moving around, and remunerative employment. The WHS domains of self-care, cognition, interpersonal activities, and vision were not included in our selection. Conclusions: The minimal generic set proposed in this study is the starting point to address one of the most important challenges in health measurement - the comparability of data across studies and countries. It also represents the first step in developing a common metric of health to link information from the general population to information about sub-populations, such as clinical and institutionalized populations.

Medizin - Open Access LMU - Teil 22/22
Biological health or lived health: which predicts self-reported general health better?

Medizin - Open Access LMU - Teil 22/22

Play Episode Listen Later Jan 1, 2014


Background: Lived health is a person's level of functioning in his or her current environment and depends both on the person's environment and biological health. Our study addresses the question whether biological health or lived health is more predictive of self-reported general health (SRGH). Methods: This is a psychometric study using cross-sectional data from the Spanish Survey on Disability, Independence and Dependency Situation. Data was collected from 17,739 people in the community and 9,707 from an institutionalized population. The following analysis steps were performed: (1) a biological health and a lived health score were calculated for each person by constructing a biological health scale and a lived health scale using Samejima's Graded Response Model; and (2) variable importance measures were calculated for each study population using Random Forest, with SRGH as the dependent variable and the biological health and the lived health scores as independent variables. Results: The levels of biological health were higher for the community-dwelling population than for the institutionalized population. When technical assistance, personal assistance or both were received, the difference in lived health between the community-dwelling population and institutionalized population was smaller. According to Random Forest's variable importance measures, for both study populations, lived health is a more important predictor of SRGH than biological health. Conclusions: In general, people base their evaluation of their own health on their lived health experience rather than their experience of biological health. This study also sheds light on the challenges of assessing biological health and lived health at the general population level.

Técnicas Estadísticas en Análisis de Mercados (umh 1480) Curso 2012 - 2013
umh1480 2012-13 Lec026 Practicas Alumnos Random Forest

Técnicas Estadísticas en Análisis de Mercados (umh 1480) Curso 2012 - 2013

Play Episode Listen Later Jul 15, 2013 15:45


Student practical: Random Forest. Course: Statistical Techniques in Market Research (Técnicas Estadísticas en Investigación de Mercados). Degree in Business Statistics. Lecturer: Xavier Barber i Vallés. Department of Statistics, Mathematics and Computer Science. Area of Statistics and Operations Research. PLE 2013 Project. Universidad Miguel Hernández de Elche. A student practical demonstrating how to run a Random Forest in R-Commander.

StatLearn 2013 - Workshop on
Clustering of variables combined with variable selection using random forests: application to gene expression data (Robin Genuer & Vanessa Kuentz-Simonet)

StatLearn 2013 - Workshop on "Challenging problems in Statistical Learning"

Play Episode Listen Later May 16, 2013 55:48


The main goal of this work is to tackle the problem of dimension reduction for high-dimensional supervised classification. The motivation is to handle gene expression data. The proposed method works in two steps. First, one eliminates redundancy using clustering of variables, based on the R package ClustOfVar. This first step is based only on the explanatory variables (genes). Second, the synthetic variables (summarizing the clusters obtained at the first step) are used to construct a classifier (e.g. logistic regression, LDA, random forests). We stress that the first step reduces the dimension and gives linear combinations of original variables (synthetic variables). This step can be considered as an alternative to PCA. A selection of predictors (synthetic variables) in the second step gives a set of relevant original variables (genes). Numerical performances of the proposed procedure are evaluated on gene expression datasets. We compare our methodology with LASSO and sparse PLS discriminant analysis on these datasets.
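A rough Python analogue of the two-step procedure is sketched below. The original work uses the R package ClustOfVar; FeatureAgglomeration is used here only as a stand-in for the variable-clustering step, and the data is synthetic.

```python
# Two-step pipeline: cluster correlated variables into synthetic features,
# then fit a classifier on the cluster summaries.
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic "gene expression"-like data: many redundant columns.
X, y = make_classification(n_samples=300, n_features=500, n_informative=20,
                           n_redundant=100, random_state=0)

pipe = make_pipeline(
    FeatureAgglomeration(n_clusters=30),  # step 1: synthetic variables
    LogisticRegression(max_iter=2000),    # step 2: classifier on summaries
)
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean().round(3))
```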

Mathematik, Informatik und Statistik - Open Access LMU - Teil 02/03
Variable selection with Random Forests for missing data

Mathematik, Informatik und Statistik - Open Access LMU - Teil 02/03

Play Episode Listen Later Jan 15, 2013


Variable selection has been suggested for Random Forests to improve their efficiency of data prediction and interpretation. However, its basic element, i.e. variable importance measures, cannot be computed straightforwardly when there is missing data. Therefore an extensive simulation study has been conducted to explore possible solutions, i.e. multiple imputation, complete case analysis and a newly suggested importance measure, for several missing-data generating processes. The ability to distinguish relevant from non-relevant variables has been investigated for these procedures in combination with two popular variable selection methods. Findings and recommendations: Complete case analysis should not be applied, as it led to inaccurate variable selection and models with the worst prediction accuracy. Multiple imputation is a good means to select variables that would be of relevance in fully observed data, and it produced the best prediction accuracy. By contrast, the application of the new importance measure causes a selection of variables that reflects the actual data situation, i.e. one that takes the occurrence of missing values into account. Its error was only negligibly worse compared to imputation.
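The complete-case versus imputation contrast studied here is easy to reproduce in miniature. The paper evaluates multiple imputation; the sketch below uses simple mean imputation only to keep the code short, on simulated data.

```python
# Inject missing values, then compare complete-case analysis against
# imputation for Random Forest prediction accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.1] = np.nan   # 10% missing completely at random

# Complete-case analysis: drop every row containing a missing value.
cc = ~np.isnan(X_miss).any(axis=1)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
print("complete cases:", cc.sum(),
      "accuracy:", cross_val_score(rf, X_miss[cc], y[cc], cv=5).mean().round(3))

# Imputation keeps all rows.
pipe = make_pipeline(SimpleImputer(strategy="mean"), rf)
print("imputed accuracy:", cross_val_score(pipe, X_miss, y, cv=5).mean().round(3))
```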

Medizin - Open Access LMU - Teil 21/22
An AUC-based permutation variable importance measure for random forests

Medizin - Open Access LMU - Teil 21/22

Play Episode Listen Later Jan 1, 2013


Background: The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge, the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. Results: We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings while both permutation VIMs have equal performance for balanced data settings. Conclusions: The standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The codes implementing our study are available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html.
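The reference implementation is in the R package party, but the core idea is close in spirit to scikit-learn's permutation importance scored with ROC AUC: for each variable, measure how much AUC drops when that variable is permuted. A sketch on simulated unbalanced data:

```python
# Contrast an error-based and an AUC-based permutation importance on
# unbalanced data (about 5% positives).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
for name, scoring in [("error-based", "accuracy"), ("AUC-based", "roc_auc")]:
    imp = permutation_importance(rf, X_te, y_te, scoring=scoring,
                                 n_repeats=10, random_state=0)
    print(name, imp.importances_mean.round(3))
```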

Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU - Teil 01/02

Random Forests are widely used for data prediction and interpretation purposes. They show many appealing characteristics, such as the ability to deal with high-dimensional data, complex interactions and correlations. Furthermore, missing values can easily be processed by the built-in procedure of surrogate splits. However, there is only little knowledge about the properties of recursive partitioning in missing data situations. Therefore, extensive simulation studies and empirical evaluations have been conducted to gain deeper insight. In addition, new methods have been developed to enhance methodology and solve current issues of data interpretation, prediction and variable selection. A variable’s relevance in a Random Forest can be assessed by means of importance measures. Unfortunately, existing methods cannot be applied when the data contain missing values. Thus, one of the most appreciated properties of Random Forests – its ability to handle missing values – gets lost for the computation of such measures. This work presents a new approach that is designed to deal with missing values in an intuitive and straightforward way, yet retains widely appreciated qualities of existing methods. Results indicate that it meets sensible requirements and shows good variable ranking properties. Random Forests provide variable selection that is usually based on importance measures. An extensive review of corresponding literature led to the development of a new approach that is based on a profound theoretical framework and meets important statistical properties. A comparison to another eight popular methods showed that it controls the test-wise and family-wise error rate, provides a higher power to distinguish relevant from non-relevant variables and leads to models located among the best performing ones. Alternative ways to handle missing values are the application of imputation methods and complete case analysis. Yet it is unknown to what extent these approaches are able to provide sensible variable rankings and meaningful variable selections. Investigations showed that complete case analysis leads to inaccurate variable selection as it may inappropriately penalize the importance of fully observed variables. By contrast, the new importance measure decreases for variables with missing values and therefore causes selections that accurately reflect the information given in actual data situations. Multiple imputation leads to an assessment of a variable’s importance and to selection frequencies that would be expected for data that was completely observed. In several performance evaluations the best prediction accuracy emerged from multiple imputation, closely followed by the application of surrogate splits. Complete case analysis clearly performed worst.

Medizin - Open Access LMU - Teil 17/22
AUC-RF: A New Strategy for Genomic Profiling with Random Forest

Medizin - Open Access LMU - Teil 17/22

Play Episode Listen Later Jan 1, 2011


Objective: Genomic profiling, the use of genetic variants at multiple loci simultaneously for the prediction of disease risk, requires the selection of a set of genetic variants that best predicts disease status. The goal of this work was to provide a new selection algorithm for genomic profiling. Methods: We propose a new algorithm for genomic profiling based on optimizing the area under the receiver operating characteristic curve (AUC) of the random forest (RF). The proposed strategy implements a backward elimination process based on the initial ranking of variables. Results and Conclusions: We demonstrate the advantage of using the AUC instead of the classification error as a measure of predictive accuracy of RF. In particular, we show that the use of the classification error is especially inappropriate when dealing with unbalanced data sets. The new procedure for variable selection and prediction, namely AUC-RF, is illustrated with data from a bladder cancer study and also with simulated data. The algorithm is publicly available as an R package, named AUCRF, at http://cran.r-project.org/. Copyright (C) 2011 S. Karger AG, Basel
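The reference implementation is the R package AUCRF; the backward-elimination idea itself can be condensed as follows. This sketch uses cross-validated AUC rather than the package's out-of-bag AUC, and eliminates one variable per step rather than a fraction, purely to keep the code short.

```python
# AUC-guided backward elimination: rank variables by RF importance, drop
# the weakest, and keep the variable subset with the best AUC seen so far.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=30, n_informative=5,
                           random_state=0)

active = list(range(X.shape[1]))
best_auc, best_set = 0.0, active[:]
while len(active) > 2:
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    auc = cross_val_score(rf, X[:, active], y, cv=3, scoring="roc_auc").mean()
    if auc > best_auc:
        best_auc, best_set = auc, active[:]
    rf.fit(X[:, active], y)                      # rank remaining variables
    drop = int(np.argmin(rf.feature_importances_))
    active.pop(drop)                             # eliminate the weakest one

print(f"best AUC {best_auc:.3f} with {len(best_set)} variables")
```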

Medizin - Open Access LMU - Teil 17/22
The behaviour of random forest permutation-based variable importance measures under predictor correlation

Medizin - Open Access LMU - Teil 17/22

Play Episode Listen Later Jan 1, 2010


Background: Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies where predictor correlation is frequently observed. Recent works on permutation-based variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an extended simulation study to synthesize results. Results: In the case when both predictor correlation was present and predictors were associated with the outcome (H(A)), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while under the null hypothesis that no predictors are associated with the outcome (H(0)) the unconditional RF VIM was unbiased. Conditional VIMs showed a decrease in VIM values for correlated predictors versus the unconditional VIMs under H(A) and was unbiased under H(0). Scaled VIMs were clearly biased under H(A) and H(0). Conclusions: Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are unbiased under the null hypothesis. Whether the observed increased VIMs for correlated predictors may be considered a "bias" - because they do not directly reflect the coefficients in the generating model - or if it is a beneficial attribute of these VIMs is dependent on the application. For example, in genetic association studies, where correlation between markers may help to localize the functionally relevant variant, the increased importance of correlated predictors may be an advantage. On the other hand, we show examples where this increased importance may result in spurious signals.
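The paper's central setting is easy to simulate: a block of mutually correlated predictors, of which only one has a true effect, next to an independent predictor with the same effect size. Unconditional importances tend to spread credit across the correlated block, as the quick sketch below shows on synthetic data.

```python
# Simulate correlated predictors and inspect unconditional RF importances.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
X = np.column_stack([
    z + rng.normal(0, 0.3, n),   # x0, x1, x2: mutually correlated via z
    z + rng.normal(0, 0.3, n),
    z + rng.normal(0, 0.3, n),
    rng.normal(size=n),          # x3: independent predictor
])
y = X[:, 0] + X[:, 3] + rng.normal(0, 1, n)   # x0 and x3 have equal effects

rf = RandomForestRegressor(n_estimators=300, random_state=1).fit(X, y)
# x1 and x2 have no direct effect, yet typically receive non-trivial
# importance because they proxy x0 through the shared factor z.
print("impurity importances:", rf.feature_importances_.round(3))
```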

Medizin - Open Access LMU - Teil 15/22
Conditional variable importance for random forests

Medizin - Open Access LMU - Teil 15/22

Play Episode Listen Later Jan 1, 2008


Background: Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. Results: We identify two mechanisms responsible for this finding: (i) A preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure. Conclusion: The resulting conditional variable importance reflects the true impact of each predictor variable more reliably than the original marginal approach.
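A full implementation of the conditional scheme is in the R package party; the core idea can be approximated in a few lines: instead of permuting a variable globally, permute it only within strata (here, quantile bins) of a correlated covariate, so the permutation preserves the correlation structure. The sketch below is a simplification, not the paper's exact algorithm.

```python
# Unconditional vs. (simplified) conditional permutation importance for a
# variable that is correlated with a true predictor but has no effect itself.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(0, 0.3, n)          # correlated with x0, no true effect
X = np.column_stack([x0, x1])
y = x0 + rng.normal(0, 1, n)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
base = rf.score(X, y)

def perm_importance(j, bins=None):
    Xp = X.copy()
    if bins is None:                      # unconditional: global shuffle
        Xp[:, j] = rng.permutation(Xp[:, j])
    else:                                 # conditional: shuffle within strata
        for b in np.unique(bins):
            idx = np.where(bins == b)[0]
            Xp[idx, j] = rng.permutation(Xp[idx, j])
    return base - rf.score(Xp, y)

strata = np.digitize(x0, np.quantile(x0, [0.25, 0.5, 0.75]))
print("x1 unconditional importance:", round(perm_importance(1), 3))
print("x1 conditional importance:  ", round(perm_importance(1, strata), 3))
```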

Medizin - Open Access LMU - Teil 15/22
Bias in random forest variable importance measures: Illustrations, sources and a solution

Medizin - Open Access LMU - Teil 15/22

Play Episode Listen Later Jan 1, 2007


Background: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. Results: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. Conclusion: We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.
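The bias described here is straightforward to demonstrate: place a pure-noise variable with many categories next to an informative binary variable. Impurity-based importance inflates the high-cardinality noise, while permutation importance on held-out data does not. A small sketch on simulated data (scikit-learn's forest is used here for illustration; the paper's proposed remedy is the conditional-inference forest in the R package party):

```python
# Impurity-based vs. permutation importance when predictors differ in
# their number of categories.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
informative = rng.integers(0, 2, n)          # binary, truly predictive
noise_many = rng.integers(0, 100, n)         # 100 categories, pure noise
X = np.column_stack([informative, noise_many])
y = (informative ^ (rng.random(n) < 0.2)).astype(int)  # noisy copy of x0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# The many-category noise column typically gets inflated impurity importance
# because it offers far more candidate split points.
print("impurity importances   :", rf.feature_importances_.round(3))
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation importances:", imp.importances_mean.round(3))
```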

Mathematik, Informatik und Statistik - Open Access LMU - Teil 02/03
Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution

Mathematik, Informatik und Statistik - Open Access LMU - Teil 02/03

Play Episode Listen Later Jan 1, 2006


Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale level or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale level or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analysing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.