Podcasts about Random forest

An ensemble machine learning method

  • 50 podcasts
  • 91 episodes
  • 30m average duration
  • Infrequent episodes
  • Latest episode: Apr 17, 2025
Random forest popularity trend, 2017–2024 (chart)


Latest podcast episodes about Random forest

IMMOblick
Evidenz statt Schätzung: Die Zukunft der Immobilienbewertung mit Machine Learning

Apr 17, 2025 · 36:01


How accurate is a value, really? When it comes to property valuations, what matters is not only the estimated value itself but also how reliable it is. In this episode of IMMOblick, Peter Ache and Robert Krägenbring discuss how modern statistics and machine learning can make valuations not only more precise but also more transparent. You will learn how modern methods such as bootstrapping and Random Forest help make the reliability of valuation models measurable, and why representativeness is more than a gut feeling. The question of how new methods fit into the existing legal framework also gets its due. If you are curious about how much technology already goes into valuation today, and what else is possible, this episode is exactly right for you: technically deep, clearly explained, and with a clear view of the future of valuation. More information: Website: https://dvw.de/publikationen/immoblick Social media: LinkedIn | Instagram | Facebook
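
The episode's theme of making a valuation model's reliability measurable can be illustrated with a short sketch. Everything below (data, features, model settings) is a hypothetical stand-in rather than the tooling discussed in the show; the point is how bootstrap resampling turns a single point estimate into an interval:

```python
# Minimal sketch: bootstrap a random forest valuation model to obtain a
# confidence interval for one property's estimated value. All data are
# synthetic placeholders.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=500, n_features=5, noise=25.0, random_state=0)

preds = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))   # resample rows with replacement
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[idx], y[idx])
    preds.append(model.predict(X[:1])[0])        # re-estimate one "property"

lo, hi = np.percentile(preds, [2.5, 97.5])
print(f"estimate {np.mean(preds):.1f}, 95% bootstrap interval [{lo:.1f}, {hi:.1f}]")
```

A narrow interval signals a well-supported estimate; a wide one flags exactly the kind of unreliability the hosts want to make visible.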

Dev Sem Fronteiras
Engenheiro de Software na Meta no Vale do Silício, Estados Unidos - Dev Sem Fronteiras #171

Dec 19, 2024 · 37:41


Leonardo, from Campinas, had his first contact with technology by accident. When he enrolled in a computing course at his technical high school, he expected to learn Photoshop, when in fact he would learn to program. That put him on the path to studying Computer Science at Unicamp and, after short stints at a few companies, a 15-year career at CI&T. At some point, a CI&T project earned him the opportunity to move to Texas, where he spent another five years before deciding to satisfy a curiosity that had been nagging him for years: what would it be like to work at a Big Tech? After reconnecting with a recruiter, Leonardo went through one of the most coveted hiring processes in the world and today works on one of WhatsApp's development teams. In this episode, Leonardo recounts his interview and adaptation process, as well as the similarities and differences of living in the land where you don't dial 911 and the land where silicon is worth gold. Fabrício Carraro, your polyglot traveler. Leonardo Jesus, Software Engineer at Meta in Silicon Valley, United States. Links: Leetcode. WhatsApp campaign featuring the Modern Family cast. Levels.FYI. Check out Alura's course "Spark: trabalhando com regressão": learn to use algorithms such as Decision Tree and Random Forest and solve regression problems with Spark's machine learning tools. TechGuide.sh, a map of the main technologies the market demands for different careers, with our suggestions and opinions. #7DaysOfCode: put your programming knowledge into practice with free daily challenges at https://7daysofcode.io/ Listeners of Dev Sem Fronteiras get a 10% discount on all Alura Língua plans: go to https://www.aluralingua.com.br/promocao/devsemfronteiras/ and start learning English and Spanish today! Production and content: Alura Língua online language courses – https://www.aluralingua.com.br/ Alura online technology courses – https://www.alura.com.br/ Editing and sound: Rede Gigahertz de Podcasts

IMMOblick
Revolution in der Immobilienbewertung: Wie KI und AVMs das Tagesgeschäft transformieren

May 23, 2024 · 32:50


In this episode, Peter Ache and Robert Krägenbring talk about the use of artificial intelligence (AI) and Automated Valuation Models (AVMs) in property valuation. A lot has happened since their last discussion of the topic. As Peter puts it: "AI is here to stay." The two hosts discuss how new technologies often meet with skepticism before gaining broad acceptance, a pattern seen with the introduction of many other groundbreaking technologies. At the level of the International Federation of Surveyors (FIG), there is global discussion of how AI can be put to sensible use in property valuation, and an upcoming international workshop on AVMs provides the occasion for a deeper discussion in this episode. AVMs are computer-based systems that use mathematical models and algorithms to estimate property values. They draw on extensive databases of property prices, location data, property characteristics, and market trends from sources such as public records and real-estate portals. Modern AVMs employ advanced algorithms and AI techniques such as machine learning to deliver more precise and reliable valuations; these algorithms analyze historical data and identify patterns that help in valuing properties. The advantages of AVMs are obvious: they deliver fast valuations and speed up the valuation process considerably, and automated models provide objective valuations because they are not influenced by subjective opinions. The discussion also covers machine learning methods such as Random Forest, which can be used for both classification and regression tasks. Peter and Robert weigh the opportunities that AI and AVMs offer against the risks and limits of these technologies, and give listeners a sense of how AVMs work and where they are applied. Learn more about current developments and how these technologies could revolutionize property valuation. In closing, the hosts repeat their appeal that Germany must catch up on digitalization to remain internationally competitive. Hosts: Peter Ache, head of the DVW e.V. working group on property valuation, and Robert Krägenbring, deputy head of the DVW e.V. working group on property valuation. More information: Website: https://dvw.de Social media: LinkedIn | Instagram | Facebook
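
As a rough illustration of the AVM idea described above (a statistical model mapping property attributes to an estimated value), here is a minimal random forest regression sketch. The feature names and price process are invented for the example and do not come from any real AVM:

```python
# Toy AVM: estimate property value from a few attributes. All columns and
# prices below are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "living_area_m2": rng.uniform(40, 250, n),
    "plot_area_m2": rng.uniform(100, 1200, n),
    "year_built": rng.integers(1900, 2023, n),
    "distance_to_center_km": rng.uniform(0, 30, n),
})
# Hypothetical price process standing in for comparable-sales data.
price = (3500 * df["living_area_m2"] + 150 * df["plot_area_m2"]
         + 800 * (df["year_built"] - 1900)
         - 9000 * df["distance_to_center_km"]
         + rng.normal(0, 40000, n))

X_tr, X_te, y_tr, y_te = train_test_split(df, price, random_state=0)
avm = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out sales:", round(avm.score(X_te, y_te), 3))
```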

Society of Actuaries Podcasts Feed
Emerging Topics Community: Return to Trees, Part 4: Gradient Boosting Machines

May 1, 2024 · 26:08


In the final episode of this mini-series, Shea and Anders cover the other common tree-based ensemble model, the Gradient Boosting Machine. Like Random Forests, GBMs make use of a large number of decision trees, but they use a “boosting” approach that cleverly makes use of “weak learners” to incrementally extract information from the data. After an explanation of how GBMs work, we compare them to Random Forests and go over a few examples where they have used GBMs in their own work.
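
The contrast the episode draws, many independent deep trees averaged (Random Forest) versus many shallow weak learners added sequentially (GBM), maps directly onto scikit-learn's two estimators. A minimal comparison on synthetic data; the dataset and settings are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)

models = {
    # deep, independently grown trees, averaged: variance reduction
    "random forest": RandomForestRegressor(n_estimators=300, random_state=0),
    # shallow "weak learner" trees, each fit to the ensemble's remaining error:
    # the incremental extraction of information described above
    "gradient boosting": GradientBoostingRegressor(n_estimators=300, max_depth=3,
                                                   learning_rate=0.1, random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: R^2 = {score:.3f}")
```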

Papers Read on AI
From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples

Apr 16, 2024 · 36:41


We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc) can do linear and non-linear regression when given in-context examples, without any additional training or gradient updates. Our findings reveal that several large language models (e.g., GPT-4, Claude 3) are able to perform regression tasks with a performance rivaling (or even outperforming) that of traditional supervised methods such as Random Forest, Bagging, or Gradient Boosting. For example, on the challenging Friedman #2 regression dataset, Claude 3 outperforms many supervised methods such as AdaBoost, SVM, Random Forest, KNN, or Gradient Boosting. We then investigate how well the performance of large language models scales with the number of in-context exemplars. We borrow from the notion of regret from online learning and empirically show that LLMs are capable of obtaining a sub-linear regret. 2024: Robert Vacareanu, Vlad-Andrei Negru, Vasile Suciu, M. Surdeanu https://arxiv.org/pdf/2404.07544v1.pdf
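
The supervised baselines named in the abstract are easy to reproduce in spirit, since scikit-learn ships a Friedman #2 generator. This sketch only mirrors the setup of the comparison; the paper's exact noise level, splits, and metrics differ:

```python
from sklearn.datasets import make_friedman2
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Friedman #2: the regression benchmark mentioned in the abstract.
X, y = make_friedman2(n_samples=500, noise=50.0, random_state=0)

baselines = {
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor()),  # scale-sensitive
}
for name, model in baselines.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:>17}: R^2 = {r2:.3f}")
```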

Society of Actuaries Podcasts Feed
Emerging Topics Community: Return to Trees, Part 3: Random Forest

Apr 10, 2024 · 31:09


Building on the discussion of individual decision trees in the prior episode, Shea and Anders shift to one of today's most popular ensemble models, the Random Forest. At first glance, the algorithm may seem like a brute force approach of simply running hundreds or thousands of decision trees, but it leverages the concept of “bagging” to avoid overfitting and attempt to learn as much as possible from the entire data sets, not just a few key features. We close by covering strengths and weaknesses of this model and providing some real-life examples.
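
The "bagging" mechanism described here is visible directly in scikit-learn: each tree trains on a bootstrap resample, and the rows a given tree never saw (out-of-bag) provide a built-in check against overfitting. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)

# Each tree sees a bootstrap resample of the rows ("bagging") and a random
# subset of features at every split: the two decorrelation tricks that keep
# hundreds of trees from collectively overfitting or fixating on a few
# key features.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)

# Out-of-bag rows act as a free validation set, no holdout needed.
print("OOB accuracy:", round(rf.oob_score_, 3))
```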

Society of Actuaries Podcasts Feed
Emerging Topics Community: Return to Trees, Part 2: Single Decision Trees

Mar 20, 2024 · 42:01


Shea and Anders dive into tree-based algorithms, starting with the most fundamental variety, the single decision tree. We cover the mechanics of a decision tree and provide a comparison to linear models. A solid understanding of how a decision tree works is critical to fully grasp the nuances of the more powerful ensemble models, the Random Forest and Gradient Boosting Machine. In addition, single decision trees can still be useful either as a starting point for building more complex models or for situations where interpretability is paramount.
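
On the interpretability point: a single shallow tree can be printed as human-readable rules, something no ensemble offers as directly. A small sketch using scikit-learn and the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Every prediction can be traced through explicit threshold rules, which is
# why single trees remain useful when interpretability is paramount.
print(export_text(tree, feature_names=list(iris.feature_names)))
```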

ExplAInable
עושים כבוד לעצים

Mar 18, 2024 · 12:17


Neural networks of all kinds get plenty of attention, but in practice many projects do not need neural networks at all. Tree-based models are usually the simple and effective solution for tabular data. In this short episode we review decision trees, how they are trained, and the problem of overfitting. We also discuss the two most common extensions, Random Forest and Gradient Boosted Trees, and the advantages of using such well-established models in a production environment.

GeocHemiSTea
Head in the Clouds, Feet in the Data with Britt Bluemel

Mar 13, 2024 · 46:16


For this episode we read: Using machine learning to estimate a key missing geochemical variable in mining exploration: application of the Random Forest algorithm to multi-sensor core logging data (Schnitzler et al., 2019). A big difference between applied geochemistry and machine learning is the terminology, but once you start to chip away at this, like Britt, you will realize that the two disciplines are not so different. Join us as we talk about dimensionality reductions, transformations, and workflows before and after her introduction to the realm of data science. We also talk about a really neat paper that used random forest to predict sodium for an alteration study. --- Support this podcast: https://podcasters.spotify.com/pod/show/geochemistea/support
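
The paper's core move, regressing a missing geochemical variable on multi-sensor core-logging features with a random forest, can be sketched in a few lines. Everything below (feature count, the synthetic "sodium" response) is a made-up stand-in for Schnitzler et al.'s actual data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
# stand-ins for multi-sensor core-logging channels (density, mag-sus, XRF, ...)
X = rng.normal(size=(n, 6))
# hypothetical sodium response with a nonlinear term plus noise
sodium = 2.0 + X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] ** 2 + rng.normal(0, 0.2, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, sodium, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", round(r2_score(y_te, rf.predict(X_te)), 3))
```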

Society of Actuaries Podcasts Feed
Emerging Topics Community: Return to Trees, Part 1: Overview of Tree-Based Algorithms

Feb 28, 2024 · 10:11


This is the first episode in a new mini-series, “Return to Trees”. In this series, Shea and Anders will be covering tree-based algorithms, including single decision trees, Random Forests, and Gradient Boosting Machines. In this episode, we discuss the rationale for re-visiting these models, which were covered in older podcast episodes in the mid-2010s, as well as an overview of what's to come in the remaining episodes of the mini-series.  

The Machine Learning Podcast
Improve The Success Rate Of Your Machine Learning Projects With bizML

Feb 18, 2024 · 50:22


Summary Machine learning is a powerful set of technologies, holding the potential to dramatically transform businesses across industries. Unfortunately, implementations of ML projects often fail to achieve their intended goals. This failure is due to a lack of collaboration and investment across technological and organizational boundaries. To help improve the success rate of machine learning projects Eric Siegel developed the six-step bizML framework, outlining the process to ensure that everyone understands the whole process of ML deployment. In this episode he shares the principles and promise of that framework and his motivation for encapsulating it in his book "The AI Playbook". Announcements Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery. Your host is Tobias Macey and today I'm interviewing Eric Siegel about how the bizML approach can help improve the success rate of your ML projects Interview Introduction How did you get involved in machine learning? Can you describe what bizML is and the story behind it? What are the key aspects of this approach that are different from the "industry standard" lifecycle of an ML project? What are the elements of your personal experience as an ML consultant that helped you develop the tenets of bizML? Who are the personas that need to be involved in an ML project to increase the likelihood of success? Who do you find to be best suited to "own" or "lead" the process? What are the organizational patterns that might hinder the work of delivering on the goals of an ML initiative? What are some of the misconceptions about the work involved in/capabilities of an ML model that you commonly encounter? What is your main goal in writing your book "The AI Playbook"? What are the most interesting, innovative, or unexpected ways that you have seen the bizML process in action? What are the most interesting, unexpected, or challenging lessons that you have learned while working on ML projects and developing the bizML framework? When is bizML the wrong choice? What are the future developments in organizational and technical approaches to ML that will improve the success rate of AI projects? Contact Info LinkedIn (https://www.linkedin.com/in/predictiveanalytics/) Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast (https://www.dataengineeringpodcast.com) covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site (https://www.themachinelearningpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com (mailto:hosts@themachinelearningpodcast.com) with your story. To help other people find the show please leave a review on iTunes (https://podcasts.apple.com/us/podcast/the-machine-learning-podcast/id1626358243) and tell your friends and co-workers.
Links The AI Playbook (https://www.machinelearningkeynote.com/the-ai-playbook): Mastering the Rare Art of Machine Learning Deployment by Eric Siegel Predictive Analytics (https://www.machinelearningkeynote.com/predictive-analytics): The Power to Predict Who Will Click, Buy, Lie, or Die by Eric Siegel Columbia University (https://www.columbia.edu/) Machine Learning Week Conference (https://machinelearningweek.com/) Generative AI World (https://generativeaiworld.events/) Machine Learning Leadership and Practice Course (https://www.predictiveanalyticsworld.com/machinelearningweek/workshops/machine-learning-course/) Rexer Analytics (https://www.rexeranalytics.com/) KD Nuggets (https://www.kdnuggets.com/) CRISP-DM (https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining) Random Forest (https://en.wikipedia.org/wiki/Random_forest) Gradient Descent (https://en.wikipedia.org/wiki/Gradient_descent) The intro and outro music is from Hitman's Lovesong feat. Paola Graziano (https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/)/CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)

Software Engineering Radio - The Podcast for Professional Software Developers
SE Radio 594: Sean Moriarity on Deep Learning with Elixir and Axon

Dec 14, 2023 · 57:43


Sean Moriarity, creator of the Axon deep learning framework, co-creator of the Nx library, and author of Machine Learning in Elixir and Genetic Algorithms in Elixir, published by the Pragmatic Bookshelf, speaks with SE Radio host Gavin Henry about what deep learning (neural networks) means today. Using a practical example with deep learning for fraud detection, they explore what Axon is and why it was created. Moriarity describes why the Beam is ideal for machine learning, and why he dislikes the term “neural network.” They discuss the need for deep learning, its history, how it offers a good fit for many of today's complex problems, where it shines and when not to use it. Moriarity goes into depth on a range of topics, including how to get datasets in shape, supervised and unsupervised learning, feed-forward neural networks, Nx.serving, decision trees, gradient descent, linear regression, logistic regression, support vector machines, and random forests. The episode considers what a model looks like, what training is, labeling, classification, regression tasks, hardware resources needed, EXGBoost, Jax, PyIgnite, and Explorer. Finally, they look at what's involved in the ongoing lifecycle or operational side of Axon once a workflow is put into production, so you can safely back it all up and feed in new data. Brought to you by IEEE Computer Society and IEEE Software magazine. This episode sponsored by Miro.

PaperPlayer biorxiv neuroscience
Source-Free Random Forest Model Calibration for Myoelectric Control

Jul 25, 2023


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.07.21.550033v1?rss=1 Authors: Jiang, X., Ma, C., Nazarpour, K. Abstract: Objective: Most existing machine learning models for myoelectric control require a large amount of data to learn user-specific characteristics of the electromyographic (EMG) signals, which is burdensome. Our objective is to develop an approach to enable the calibration of a pre-trained model with minimal data from a new myoelectric user. Approach: We trained a random forest model with EMG data from 20 people collected during the performance of multiple hand grips. To adapt the decision rules for a new user, first, the branches of the pre-trained decision trees were pruned using the validation data from the new user. Then new decision trees trained merely with data from the new user were appended to the pruned pre-trained model. Results: Real-time myoelectric experiments with 18 participants over two days demonstrated the improved accuracy of the proposed approach when compared to benchmark user-specific random forest and the linear discriminant analysis models. Furthermore, the random forest model that was calibrated on day one for a new participant yielded significantly higher accuracy on day two, when compared to the benchmark approaches, which reflects the robustness of the proposed approach. Significance: The proposed model calibration procedure is completely source-free, that is, once the base model is pre-trained, no access to the source data from the original 20 people is required. Our work promotes the use of efficient, explainable, and simple models for myoelectric control. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC
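
The appending step of the calibration procedure (new user-specific trees added to a pre-trained forest, with no source data needed) can be approximated in scikit-learn by concatenating `estimators_` lists. This is a toy illustration only: it omits the paper's pruning step, relies on an attribute scikit-learn does not officially support for this purpose, and assumes both forests saw the same set of class labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# "source" data stands in for the 20 pre-training participants' EMG features
X_src, y_src = make_classification(n_samples=2000, n_features=8, n_classes=4,
                                   n_informative=6, random_state=0)
# a small calibration set from one new user
X_new, y_new = make_classification(n_samples=100, n_features=8, n_classes=4,
                                   n_informative=6, random_state=1)

base = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_src, y_src)
user = RandomForestClassifier(n_estimators=20, random_state=2).fit(X_new, y_new)

# Append the user-specific trees; the source data are no longer required,
# which is what makes the procedure "source-free".
base.estimators_ = base.estimators_ + user.estimators_
base.n_estimators = len(base.estimators_)
print(base.predict(X_new[:5]))
```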

The AI Frontier Podcast
#21 - Ensemble Learning: Boosting, Bagging, and Random Forests in Machine Learning

Jun 11, 2023 · 11:05


Dive into this episode of The AI Frontier podcast, where we explore Ensemble Learning techniques like Boosting, Bagging, and Random Forests in Machine Learning. Learn about their applications, advantages, and limitations, and discover real-world success stories. Enhance your understanding of these powerful methods and stay ahead in the world of data science.
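
The three techniques in the title differ mainly in how the trees are grown and combined. A compact sketch putting scikit-learn's implementations side by side on synthetic data (settings are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=20, n_informative=6,
                           random_state=0)

models = {
    # bagging: full trees on bootstrap resamples, majority vote
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                 random_state=0),
    # boosting: weak learners reweighted toward previous mistakes
    "boosting (AdaBoost)": AdaBoostClassifier(n_estimators=100, random_state=0),
    # random forest: bagging plus random feature subsets at each split
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, m in models.items():
    print(name, cross_val_score(m, X, y, cv=5).mean().round(3))
```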

InfosecTrain
What is Classification Algorithms? | Decision Tree and Random Forests | Model Evaluation Metrics

Apr 12, 2023 · 118:06


InfosecTrain hosts a live event entitled 'Data Science Fast Track Course' with certified expert 'NAWAJ'. Data Science is no longer the future; it is the present. This masterclass will be extremely beneficial to anyone interested in pursuing a career in Data Science. It will be delivered by a domain expert with extensive industry experience. Our instructors are specialists in their disciplines, and we hold a global reputation. Attending this webinar will benefit you in a variety of ways. Thank you for watching this video. For more details or a free demo with our expert, write to us at sales@infosectrain.com ➡️ Agenda

Probable Causation
Episode 91: Allison Harris on registering returning citizens to vote

Apr 11, 2023 · 55:20


Allison Harris talks about increasing the civic engagement of people with felony convictions. "Registering Returning Citizens to Vote" by Jennifer Doleac, Laurel Eckhouse, Eric Foster-Moore, Allison Harris, Hannah Walker, and Ariel White. *** Probable Causation is part of Doleac Initiatives, a 501(c)(3) nonprofit. If you enjoy the show, please consider making a tax-deductible contribution. Thank you for supporting our work! *** OTHER RESEARCH WE DISCUSS IN THIS EPISODE: "Can Incarcerated Felons be (Re)integrated into the Political System? Results from a Field Experiment" by Alan S. Gerber, Gregory A. Huber, Marc Meredith, Daniel R. Bigger, and David J. Hendry. "The Politics of the Restoration of Ex-felon Voting Rights: The Case of Iowa" by Marc Meredith and Michael Morse. "Using Causal Forests to Predict Treatment Heterogeneity: An Application to Summer Jobs" by Jonathan M.V. Davis and Sara B. Heller. "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests" by Stefan Wager and Susan Athey. "Civic Responses to Police Violence" by Desmond Ang and Jonathan Tebes. [Working Paper]. "Mobilized by Injustice: Criminal Justice Contact, Political Participation, and Race" by Hannah L. Walker. Bonus Episode 10 of Probable Causation: Hannah Walker.

AI Today Podcast: Artificial Intelligence Insights, Experts, and Opinion
AI Today Podcast: AI Glossary Series – Random Forest and Boosted Trees

Mar 15, 2023 · 9:01


Sometimes, for reasons such as improving performance or robustness, it makes sense to create multiple decision trees and average the results to solve problems related to overfitting. Or it makes sense to boost certain decision trees. In this episode of the AI Today podcast, hosts Kathleen Walch and Ron Schmelzer define the terms Random Forest and Boosted Trees, and explain how they relate to AI and why it's important to know about them.

The Machine Learning Podcast
Real-Time Machine Learning Has Entered The Realm Of The Possible

Mar 9, 2023 · 34:29


Summary Machine learning models have predominantly been built and updated in a batch modality. While this is operationally simpler, it doesn't always provide the best experience or capabilities for end users of the model. Tecton has been investing in the infrastructure and workflows that enable building and updating ML models with real-time data to allow you to react to real-world events as they happen. In this episode CTO Kevin Stumpf explores the benefits of real-time machine learning and the systems that are necessary to support the development and maintenance of those models. Announcements Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery. Your host is Tobias Macey and today I'm interviewing Kevin Stumpf about the challenges and promise of real-time ML applications Interview Introduction How did you get involved in machine learning? Can you describe what real-time ML is and some examples of where it might be applied? What are the operational and organizational requirements for being able to adopt real-time approaches for ML projects? What are some of the ways that real-time requirements influence the scale/scope/architecture of an ML model? What are some of the failure modes for real-time vs analytical or operational ML? Given the low latency between source/input data being generated or received and a prediction being generated, how does that influence susceptibility to e.g. data drift? Data quality and accuracy also become more critical. What are some of the validation strategies that teams need to consider as they move to real-time? What are the most interesting, innovative, or unexpected ways that you have seen real-time ML applied? What are the most interesting, unexpected, or challenging lessons that you have learned while working on real-time ML systems? When is real-time the wrong choice for ML? What do you have planned for the future of real-time support for ML in Tecton? Contact Info LinkedIn (https://www.linkedin.com/in/kevinstumpf/) @kevinmstumpf (https://twitter.com/kevinmstumpf?lang=en) on Twitter Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast (https://www.dataengineeringpodcast.com) covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site (https://www.themachinelearningpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com (mailto:hosts@themachinelearningpodcast.com) with your story.
To help other people find the show please leave a review on iTunes (https://podcasts.apple.com/us/podcast/the-machine-learning-podcast/id1626358243) and tell your friends and co-workers Links Tecton (https://www.tecton.ai/) Podcast Episode (https://www.themachinelearningpodcast.com/tecton-machine-learning-feature-platform-episode-6/) Data Engineering Podcast Episode (https://www.dataengineeringpodcast.com/tecton-mlops-feature-store-episode-166/) Uber Michelangelo (https://www.uber.com/blog/michelangelo-machine-learning-platform/) Reinforcement Learning (https://en.wikipedia.org/wiki/Reinforcement_learning) Online Learning (https://en.wikipedia.org/wiki/Online_machine_learning) Random Forest (https://en.wikipedia.org/wiki/Random_forest) ChatGPT (https://openai.com/blog/chatgpt) XGBoost (https://xgboost.ai/) Linear Regression (https://en.wikipedia.org/wiki/Linear_regression) Train-Serve Skew (https://ploomber.io/blog/train-serve-skew/) Flink (https://flink.apache.org/) Data Engineering Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57/) The intro and outro music is from Hitman's Lovesong feat. Paola Graziano (https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/)/CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)

TechReview - The Podcast
52: Nothing. Forever. AI-Powered

Mar 1, 2023 · 34:19


TikTok is creating the Creativity Program, a revamped program aimed at rewarding talented content creators with financial compensation and opportunities to grow their presence on the platform. Are we on the brink of a major astronomical discovery? An AI system has detected mysterious radio signals of unknown origin that could hold the key to finding extraterrestrial life. Nothing, Forever is back on Twitch and better than ever! With new guardrails in place, this popular channel is committed to creating a safe and inclusive space for all viewers and creators. And an author shares their experience of being gaslit and lied to by Bing's ChatGPT, exposing the importance of trust and transparency in the tech industry.

00:00 - Intro
02:06 - TikTok launches a revamped creator fund called the 'Creativity Program' in beta
10:36 - AI System Detects Strange Signals of Unknown Origin in Radio Data
18:27 - Nothing, Forever is set to return to Twitch with new guardrails in place
28:00 - My Week of Being Gaslit and Lied to by the New Bing

Summary: TikTok has launched a beta version of its revamped creator fund, the Creativity Program, to provide more earning opportunities and revenue for select creators. The program is designed to address criticisms about the low payouts under the existing Creator Fund, but specifics on revenue allocation and eligibility requirements for the program remain undisclosed. Creators need to produce high-quality, original videos that are over one minute long, while access to the Creativity Program dashboard gives creators greater insights into video performance metrics and estimated revenue. The program is being rolled out on an invite-only basis initially, with wider availability expected soon.

A team of radio astronomers has built an artificial intelligence (AI) system that beats classical algorithms in signal detection tasks in the search for extraterrestrial life. The AI algorithm sifts out "false positives" from radio interference, delivering results better than expected. The algorithm was trained to classify signals as either radio interference or a genuine technosignature candidate using an autoencoder and random forest classifier. The team fed the algorithm over 150 terabytes of data from the Green Bank Telescope in West Virginia and identified eight signals of interest that couldn't be attributed to radio interference, although they were not re-detected in follow-up observations. The researchers say their findings highlight the continued role AI techniques will play in the search for extraterrestrial intelligence.

Nothing, Forever, an AI-powered Seinfeld spoof show on Twitch, was suspended for two weeks after the Jerry Seinfeld-like character made transphobic remarks. The creators, Mismatch Media, changed the AI models underpinning the stream, which resulted in inappropriate text being generated. Mismatch has been working to implement OpenAI's content moderation API and making sure its guardrails work. Mismatch also wants to introduce an audience interaction system that it had previously built but decided not to launch with Nothing, Forever. Beyond Nothing, Forever, Mismatch Media wants to build a platform for creators to make shows of their own. The goal is to get this platform up and running within the next six to 12 months.

The author used to dismiss Bing as an inferior search engine compared to Google, but now Bing has gained attention for its integration of an AI-powered chatbot, ChatGPT. Since its rollout, daily visits to Bing.com have increased by 15% and searches for "Bing AI" have risen 700%. Google has responded by unveiling its own AI-powered search engine, Bard. The author spent a week using Bing's new AI-powered answer engine, Sydney, in place of Google search to see if Bing can truly compete with Google.

Our panel today: Tarek, Chris, Henrike, Vincent. Every week our panel of technology enthusiasts meets to discuss the most important news from the fields of technology, innovation, and science. And you can join us live!
https://techreview.axelspringer.com/
https://www.ideas-engineering.io/
https://www.freetech.academy/
https://www.upday.com/

PaperPlayer biorxiv neuroscience
Predicting alcohol-related memory problems in older adults: A machine learning study with multi-domain features

Jan 2, 2023


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2022.12.30.522330v1?rss=1 Authors: Kamarajan, C., Pandey, A. K., Chorlian, D. B., Meyers, J. L., Kinreich, S., Pandey, G., Subbie Saenz de Viteri, S., Zhang, J., Kuang, W., Barr, P. B., Aliev, F., Anokhin, A. P., Plawecki, M. H., Kuperman, S., Almasy, L., Merikangas, A., Brislin, S. J., Bauer, L., Hesselbrock, V., Chan, G., Kramer, J., Lai, D., Hartz, S., Bierut, L. J., McCutcheon, V. V., Bucholz, K. K., Dick, D. M., Schuckit, M. A., Edenberg, H. J., Porjesz, B. Abstract: Memory problems are common among older adults with a history of alcohol use disorder (AUD). Employing a machine learning framework, the current study investigates the use of multi-domain features to classify individuals with and without alcohol-induced memory problems. A group of 94 individuals (ages 50-81 years) with alcohol-induced memory problems (Memory group) were compared with a matched Control group who did not have memory problems. The Random Forests model identified specific features from each domain that contributed to the classification of Memory vs. Control group (AUC=88.29%). Specifically, individuals from the Memory group manifested a predominant pattern of hyperconnectivity across the default mode network regions except some connections involving anterior cingulate cortex which were predominantly hypoconnected. Other significant contributing features were (i) polygenic risk scores for AUD, (ii) alcohol consumption and related health consequences during the past 5 years, such as health problems, past negative experiences, withdrawal symptoms, and the largest number of drinks in a day during the past 12 months, and (iii) elevated neuroticism and increased harm avoidance, and fewer positive "uplift" life events. At the neural systems level, hyperconnectivity across the default mode network regions, including the connections across the hippocampal hub regions, in individuals with memory problems may indicate dysregulation in neural information processing. Overall, the study outlines the importance of utilizing multidomain features, consisting of resting-state brain connectivity collected ~18 years ago, together with personality, life experiences, polygenic risk, and alcohol consumption and related consequences, to predict alcohol-related memory problems that arise in later life. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC
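
As a schematic of the modeling setup (a random forest classifying Memory vs. Control from many features, evaluated by AUC), here is a minimal sketch. The synthetic matrix below merely stands in for the study's multi-domain features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# 188 "participants" (two matched groups of 94), 30 stand-in features
X, y = make_classification(n_samples=188, n_features=30, n_informative=8,
                           random_state=0)

proba = cross_val_predict(
    RandomForestClassifier(n_estimators=500, random_state=0),
    X, y, cv=5, method="predict_proba",
)[:, 1]
print(f"cross-validated AUC: {roc_auc_score(y, proba):.2%}")
```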

PaperPlayer biorxiv neuroscience
Intermediate Gray Matter Interneurons in the Lumbar Spinal Cord Play a Critical and Necessary Role in Coordinated Locomotion

Nov 1, 2022


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2022.10.31.514612v1?rss=1 Authors: Kuehn, N., Schwarz, A., Beretta, C. A., Schwarte, Y., Schmitt, F., Motsch, M., Weidner, N., Puttagunta, R. Abstract: Locomotion is a complex task involving excitatory and inhibitory circuitry in spinal gray matter. While genetic knockouts examine the function of unique spinal interneuron (SpIN) subtypes, the phenotype of combined premotor interneuron loss remains to be explored. We modified a kainic acid lesion to damage intermediate gray matter (laminae V-VII) in the lumbar spinal enlargement (spinal L2-L4) in female rats. A thorough, tailored behavioral evaluation revealed deficits in gross hindlimb function, skilled walking, coordination, balance and gait two weeks post-injury. Using a Random Forest algorithm, we combined these behavioral assessments into a highly predictive binary classification system which strongly correlated with structural deficits in the rostro-caudal axis. Machine-learning quantification confirmed interneuronal damage to laminae V-VII in spinal L2-L4 correlates with hindlimb dysfunction. White matter damage and lower motoneuron loss did not correlate with behavioral deficits. Animals do not regain lost sensorimotor function three months after injury, indicating that natural recovery of the spinal cord cannot compensate for loss of laminae V-VII neurons. As spinal cord injuries are often located at spinal enlargements, this research lays the groundwork for new neuroregenerative therapies to replace these lost neuronal pools vital to sensorimotor function. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC

At Any Rate
Systematic late-cycle hedging with FX options: A machine learning based strategy

Oct 20, 2022 · 15:31


We introduce a systematic strategy for identifying signals and selecting late-cycle / recession FX options hedges. 1) we first run a Random Forest algorithm to zero in on the most relevant signals for hedging risk-off episodes; then 2) we build a 4-factor model that selectively buys defensive FXO based on model-predicted P&L. The highest beta to risk-off events is achieved with four signals: m/m change in 1-y z-score of ATM vol, m/m change in 1-y z-score of realized vol, 1-y z-score of 6-month % change in spot, and m/m change in 1-y z-score of fwd pts/vol (i.e. carry/vol). We find short-term tenors to be most responsive to adverse episodes, 30%TV digitals to show the least decay, and the longer expiry structures tend to hold better during low-vol times. Overall, 3M - 6M expiries offer a good compromise. At current market, our 4-factor model favors high beta G10 digi structures with AUD structures dominating within the top 5: 3M to 6M 30%TV at-expiry digitals in AUD/SGD put, USD/CAD call, EUR/AUD call, GBP/USD put and/or AUD/CHF put. This podcast was recorded on 20 October 2022. This communication is provided for information purposes only. Please visit www.jpmm.com/research/disclosures for important disclosures. © 2022 JPMorgan Chase & Co. All rights reserved.
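
The four signals are plain rolling-window transforms, so they are straightforward to compute with pandas. A sketch under stated assumptions: the daily series are synthetic, 252 business days approximate the 1-year window and 21 a month, and the carry/vol signal is omitted since it requires forward-points data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.bdate_range("2015-01-01", periods=2500)
spot = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.005, 2500))), index=idx)
atm_vol = pd.Series(10 + np.cumsum(rng.normal(0, 0.05, 2500)), index=idx)

def z_1y(s: pd.Series) -> pd.Series:
    """Rolling 1-year z-score."""
    return (s - s.rolling(252).mean()) / s.rolling(252).std()

realized_vol = spot.pct_change().rolling(21).std() * np.sqrt(252)

signals = pd.DataFrame({
    "d_z_atm_vol": z_1y(atm_vol).diff(21),            # m/m change in 1-y z of ATM vol
    "d_z_realized_vol": z_1y(realized_vol).diff(21),  # m/m change in 1-y z of realized vol
    "z_spot_6m_chg": z_1y(spot.pct_change(126)),      # 1-y z of 6-month % change in spot
})
print(signals.dropna().tail())
```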

Astro arXiv | all categories
Stellar population of the Rosette Nebula and NGC 2244: application of the probabilistic random forest

Oct 12, 2022 · 1:24


Stellar population of the Rosette Nebula and NGC 2244: application of the probabilistic random forest by Koraljka Muzic et al. on Wednesday 12 October (Abridged) In this work, we study the 2.8×2.6 deg² region in the emblematic Rosette Nebula, centred at the young cluster NGC 2244, with the aim of constructing the most reliable candidate member list to date, determining various structural and kinematic parameters, and learning about the past and the future of the region. Starting from a catalogue containing optical to mid-infrared photometry, as well as positions and proper motions from Gaia EDR3, we apply the Probabilistic Random Forest algorithm and derive membership probability for each source. Based on the list of almost 3000 probable members, of which about a third are concentrated within the radius of 20' from the centre of NGC 2244, we identify various clustered sources and stellar concentrations, and estimate the average distance of 1489±37 pc (entire region), 1440±32 pc (NGC 2244) and 1525±36 pc (NGC 2237). The masses, extinction, and ages are derived by SED fitting, and the internal dynamic is assessed via proper motions relative to the mean proper motion of NGC 2244. NGC 2244 is showing a clear expansion pattern, with an expansion velocity that increases with radius. Its IMF is well represented by two power laws (dN/dM ∝ M^(−α)), with slopes α = 1.05±0.02 for the mass range 0.2–1.5 MSun, and α = 2.3±0.3 for the mass range 1.5–20 MSun, in agreement with other star forming regions. The mean age of the region is ~2 Myr. We find evidence for the difference in ages between NGC 2244 and the region associated with the molecular cloud, which appears slightly younger. The velocity dispersion of NGC 2244 is well above the virial velocity dispersion derived from the total mass (1000±70 MSun) and half-mass radius (3.4±0.2 pc). From the comparison to other clusters and to numerical simulations, we conclude that NGC 2244 may be unbound, and possibly even formed in a super-virial state. arXiv: http://arxiv.org/abs/2209.13302v2

Astro arXiv | all categories
The metallicity's fundamental dependence on both local and global galactic quantities

Oct 10, 2022 · 1:04


The metallicity's fundamental dependence on both local and global galactic quantities by William M. Baker et al. on Monday 10 October We study the scaling relations between gas-phase metallicity, stellar mass surface density ($\Sigma_*$), star formation rate surface density ($\Sigma_{\rm SFR}$), and molecular gas surface density ($\Sigma_{H_2}$) in local star-forming galaxies on scales of a kpc. We employ optical integral field spectroscopy from the MaNGA survey, and ALMA data for a subset of MaNGA galaxies. We use Partial Correlation Coefficients and Random Forest regression to determine the relative importance of local and global galactic properties in setting the gas-phase metallicity. We find that the local metallicity depends primarily on $\Sigma_*$ (the resolved mass-metallicity relation, rMZR), and has a secondary anti-correlation with $\Sigma_{\rm SFR}$ (i.e. a spatially-resolved version of the 'Fundamental Metallicity Relation', rFMR). We find that $\Sigma_{H_2}$ has little effect in determining the local metallicity. This result indicates that gas accretion, resulting in local metallicity dilution and local boosting of star formation, cannot be the primary origin of the rFMR. Star-formation driven, metal-loaded winds may contribute to the anti-correlation between metallicity and SFR. The local metallicity depends also on the global properties of galaxies. We find a strong dependence on the total stellar mass ($M_*$) and a weaker (inverse) dependence on the total SFR. The global metallicity scaling relations, therefore, do not simply stem out of their resolved counterparts; global properties and processes, such as the global gravitational potential well, galaxy-scale winds and global redistribution/mixing of metals, likely contribute to the local metallicity, in addition to local production and retention. arXiv: http://arxiv.org/abs/2210.03755v1

Astro arXiv | all categories
The probabilistic random forest applied to the QUBRICS survey: improving the selection of high-redshift quasars with synthetic data

Sep 15, 2022 · 0:59


The probabilistic random forest applied to the QUBRICS survey: improving the selection of high-redshift quasars with synthetic data by Francesco Guarneri et al. on Thursday 15 September Several recent works have focused on the search for bright, high-z quasars (QSOs) in the South. Among them, the QUasars as BRIght beacons for Cosmology in the Southern hemisphere (QUBRICS) survey has now delivered hundreds of new spectroscopically confirmed QSOs selected by means of machine learning algorithms. Building upon the results obtained by introducing the probabilistic random forest (PRF) for the QUBRICS selection, we explore in this work the feasibility of training the algorithm on synthetic data to improve the completeness in the higher redshift bins. We also compare the performances of the algorithm if colours are used as primary features instead of magnitudes. We generate synthetic data based on a composite QSO spectral energy distribution. We first train the PRF to identify QSOs among stars and galaxies, then separate high-z quasars from low-z contaminants. We apply the algorithm on an updated dataset, based on SkyMapper DR3, combined with Gaia eDR3, 2MASS and WISE magnitudes. We find that employing colours as features slightly improves the results with respect to the algorithm trained on magnitude data. Adding synthetic data to the training set provides significantly better results with respect to the PRF trained only on spectroscopically confirmed QSOs. We estimate, on a testing dataset, a completeness of ~86% and a contamination of ~36%. Finally, 207 PRF-selected candidates were observed: 149/207 turned out to be genuine QSOs with z > 2.5, 41 with z < 2.5, 3 galaxies and 14 stars. The result confirms the ability of the PRF to select high-z quasars in large datasets. arXiv: http://arxiv.org/abs/2209.07257v1
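
The augmentation idea (padding the rare high-redshift class with synthetic examples before training) can be shown with a plain random forest standing in for the probabilistic random forest, which additionally propagates photometric uncertainties. All numbers and the toy "colour" distributions below are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

def colours(n, centre):
    # toy 4-band colour vectors drawn around a class-dependent locus
    return rng.normal(loc=centre, scale=1.0, size=(n, 4))

# spectroscopic training set: many contaminants, few confirmed high-z QSOs
X_real = np.vstack([colours(2000, 0.0), colours(40, 1.0)])
y_real = np.array([0] * 2000 + [1] * 40)

# synthetic high-z colours; in the paper these come from a composite QSO
# spectral energy distribution, here simply the same toy locus
X_syn, y_syn = colours(1000, 1.0), np.ones(1000, dtype=int)

X_test = np.vstack([colours(500, 0.0), colours(50, 1.0)])
y_test = np.array([0] * 500 + [1] * 50)

for name, (X, y) in {
    "real only": (X_real, y_real),
    "real + synthetic": (np.vstack([X_real, X_syn]),
                         np.concatenate([y_real, y_syn])),
}.items():
    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
    print(name, "high-z recall:", recall_score(y_test, clf.predict(X_test)))
```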

Astro arXiv | all categories
A machine-learning photometric classifier for massive stars in nearby galaxies I. The method

Sep 14, 2022 · 0:45


A machine-learning photometric classifier for massive stars in nearby galaxies I. The method by Grigoris Maravelias et al. on Wednesday 14 September (abridged) Mass loss is a key parameter in the evolution of massive stars, with discrepancies between theory and observations and with unknown importance of the episodic mass loss. To address this we need increased numbers of classified sources spanning a range of metallicity environments. We aim to remedy the situation by applying machine learning techniques to recently available extensive photometric catalogs. We used IR/Spitzer and optical/Pan-STARRS, with Gaia astrometric information, to compile a large catalog of known massive stars in M31 and M33, which were grouped into Blue, Red, Yellow, B[e] supergiants, Luminous Blue Variables, Wolf-Rayet, and background galaxies. Due to the high imbalance, we implemented synthetic data generation to populate the underrepresented classes and improve separation by undersampling the majority class. We built an ensemble classifier using color indices. The probabilities from Support Vector Classification, Random Forests, and Multi-layer Perceptron were combined for the final classification. The overall weighted balanced accuracy is ~83%, recovering Red supergiants at ~94%, Blue/Yellow/B[e] supergiants and background galaxies at ~50-80%, Wolf-Rayets at ~45%, and Luminous Blue Variables at ~30%, mainly due to their small sample sizes. The mixing of spectral types (no strict boundaries in their color indices) complicates the classification. Independent application to IC 1613, WLM, and Sextans A galaxies resulted in an overall lower accuracy of ~70%, attributed to metallicity and extinction effects. The missing data imputation was explored using simple replacement with mean values and an iterative imputer, which proved more capable. We also found that r-i and y-[3.6] were the most important features. Our method, although limited by the sampling of the feature space, is efficient in classifying sources with missing data and at lower metallicities. arXiv: http://arxiv.org/abs/2203.08125v2
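
Combining the predicted probabilities of an SVM, a random forest, and a multi-layer perceptron, as the abstract describes, corresponds to soft voting in scikit-learn. A minimal sketch on synthetic data (not the authors' photometric catalogue):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for colour-index features of several stellar classes
X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("svc", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("mlp", make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0))),
    ],
    voting="soft",  # average class probabilities across the three models
)
print("CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean().round(3))
```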

The Machine Learning Podcast
Build Better Machine Learning Models With Confidence By Adding Validation With Deepchecks

Jul 6, 2022 · 48:40


Summary Machine learning has the potential to transform industries and revolutionize business capabilities, but only if the models are reliable and robust. Because of the fundamental probabilistic nature of machine learning techniques it can be challenging to test and validate the generated models. The team at Deepchecks understands the widespread need to easily and repeatably check and verify the outputs of machine learning models and the complexity involved in making it a reality. In this episode Shir Chorev and Philip Tannor explain how they are addressing the problem with their open source deepchecks library and how you can start using it today to build trust in your machine learning applications. Announcements Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery. Do you wish you could use artificial intelligence to drive your business the way Big Tech does, but don’t have a money printer? Graft is a cloud-native platform that aims to make the AI of the 1% accessible to the 99%. Wield the most advanced techniques for unlocking the value of data, including text, images, video, audio, and graphs. No machine learning skills required, no team to hire, and no infrastructure to build or maintain. For more information on Graft or to schedule a demo, visit themachinelearningpodcast.com/graft today and tell them Tobias sent you. Predibase is a low-code ML platform without low-code limits. Built on top of our open source foundations of Ludwig and Horovod, our platform allows you to train state-of-the-art ML and deep learning models on your datasets at scale. Our platform works on text, images, tabular, audio and multi-modal data using our novel compositional model architecture. We allow users to operationalize models on top of the modern data stack, through REST and PQL – an extension of SQL that puts predictive power in the hands of data practitioners. Go to themachinelearningpodcast.com/predibase today to learn more and try it out! Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building Natural Language Processing (NLP) models to programmatically inspect, fix and track their data across the ML workflow (pre-training, post-training and post-production) – no more excel sheets or ad-hoc python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs, while seeing 10x faster ML iterations. Galileo is offering listeners a free 30 day trial and a 30% discount on the product thereafter. This offer is available until Aug 31, so go to themachinelearningpodcast.com/galileo and request a demo today! Your host is Tobias Macey and today I’m interviewing Shir Chorev and Philip Tannor about Deepchecks, a Python package for comprehensively validating your machine learning models and data with minimal effort. Interview Introduction How did you get involved in machine learning? Can you describe what Deepchecks is and the story behind it? Who is the target audience for the project? What are the biggest challenges that these users face in bringing ML models from concept to production and how does DeepChecks address those problems? In the absence of DeepChecks how are practitioners solving the problems of model validation and comparison across iterations? What are some of the other tools in this ecosystem and what are the differentiating features of DeepChecks?
What are some examples of the kinds of tests that are useful for understanding the "correctness" of models? What are the methods by which ML engineers/data scientists/domain experts can define what "correctness" means in a given model or subject area? In software engineering the categories of tests are tiered as unit -> integration -> end-to-end. What are the relevant categories of tests that need to be built for validating the behavior of machine learning models? How do model monitoring utilities overlap with the kinds of tests that you are building with deepchecks? Can you describe how the DeepChecks package is implemented? How have the design and goals of the project changed or evolved from when you started working on it? What are the assumptions that you have built up from your own experiences that have been challenged by your early users and design partners? Can you describe the workflow for an individual or team using DeepChecks as part of their model training and deployment lifecycle? Test engineering is a deep discipline in its own right. How have you approached the user experience and API design to reduce the overhead for ML practitioners to adopt good practices? What are the interfaces available for creating reusable tests and composing test suites together? What are the additional services/capabilities that you are providing in your commercial offering? How are you managing the governance and sustainability of the OSS project and balancing that against the needs/priorities of the business? What are the most interesting, innovative, or unexpected ways that you have seen DeepChecks used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on DeepChecks? When is DeepChecks the wrong choice? What do you have planned for the future of DeepChecks? Contact Info Shir LinkedIn shir22 on GitHub Philip LinkedIn @philiptannor on Twitter Parting Question From your perspective, what is the biggest barrier to adoption of machine learning today? Closing Announcements Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Links: DeepChecks Random Forest Talpiot Program SHAP Podcast.__init__ Episode Airflow Great Expectations Data Engineering Podcast Episode The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0

Bigdata Hebdo
Episode 140 : Feature importance de la mafia dans la data

May 13, 2022 · 86:53


### Apéro
* Atlassian erased the cloud environments of 400 customers by mistake -> https://www.usine-digitale.fr/article/atlassian-a-efface-les-environnements-cloud-de-400-clients-par-erreur.N1993357
* Hacked News Channel and Deepfake of Zelenskyy Surrendering Is Causing Chaos Online -> https://www-vice-com.cdn.ampproject.org/c/s/www.vice.com/amp/en/article/93bmda/hacked-news-channel-and-deepfake-of-zelenskyy-surrendering-is-causing-chaos-online

### Database
* PostgreSQL interface -> https://cloud.google.com/spanner/docs/postgresql-interface
* End of Big Data Cluster in 2025 -> https://docs.microsoft.com/fr-fr/sql/big-data-cluster/release-notes-big-data-cluster?view=sql-server-ver15

### ML/AI + Data Science
* Feature importance in Random Forests -> https://medium.com/@ali.soleymani.co/stop-using-random-forest-feature-importances-take-this-intuitive-approach-instead-4335205b933f
* Neo4J AuraDS GA on GCP -> https://neo4j.com/blog/introducing-graph-data-science-2-0-aurads/

### Architecture
* Data Mesh From an Engineering Perspective -> https://www.datamesh-architecture.com/#why
* Building a Modern Data Stack at Whatnot -> https://medium.com/whatnot-engineering/building-a-modern-data-stack-at-whatnot-afc1d03c3f9
* Airbyte acquires Grouparoo to accelerate Data Movement -> https://airbyte.com/blog/airbyte-acquires-grouparoo-to-accelerate-data-movement

### Cloud
* Modernize your Oracle workloads to PostgreSQL with Database Migration Service, now in preview -> https://cloud.google.com/blog/products/databases/migrate-oracle-to-postgresql
* NetApp Announces Intent to Acquire Instaclustr -> https://www.netapp.com/newsroom/press-releases/news-rel-20220407-656381/
* BigLake unifies data warehouses and data lakes into a consistent format -> https://cloud.google.com/blog/products/data-analytics/unifying-data-lakes-and-data-warehouses-across-clouds-with-biglake

### Cloud Native dev tips
* COPY --chmod reduced the size of my container image by 35% -> https://blog.vamc19.dev/posts/dockerfile-copy-chmod/

Sponsors: This episode is sponsored by [Affini-Tech](https://affini-tech.com/) and [CerenIT](https://www.cerenit.fr/). [CerenIT](https://www.cerenit.fr/) helps you design, industrialize, and automate your platforms, and also makes your time-series data talk. Write to us at [contact@cerenit.fr](mailto:contact@cerenit.fr) and find us on [Time Series France](https://www.timeseriesfr.org/). Affini-Tech supports you in all your Cloud and Data projects, helping you imagine, experiment with, and run your services! ([Affini-Tech](http://affini-tech.com), the [Datatask](https://datatask.io/) platform) to accelerate your Data and AI services. Check out the [Affini-Tech blog](https://affini-tech.com/blog/) and the [Datatask blog](https://datatask.io/blog/) to learn more. We're hiring! Come crunch data with us! Write to us at [recrutement@affini-tech.com](mailto:recrutement@affini-tech.com). The theme music was composed and produced by Maxence Lecointe.
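
The ML/AI link above warns against the default impurity-based random forest feature importances; the usual alternative such pieces recommend is permutation importance computed on held-out data. A minimal sketch contrasting the two with scikit-learn, on synthetic data for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permutation importance: how much does shuffling one feature on held-out
# data hurt the score? This avoids the training-set bias of impurity scores.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

print("feature  impurity  permutation")
for i in range(X.shape[1]):
    print(f"{i:>7}  {rf.feature_importances_[i]:.3f}     {result.importances_mean[i]:.3f}")
```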

KI in der Industrie
KI und Wiederbeschaffungszeiten im Maschinenbau

KI in der Industrie

Play Episode Listen Later Nov 3, 2021 59:25


Markus Günther and his colleagues have turned this into a piece of software that follows a Random Forest approach. In the podcast conversation he explains how users work with it.
**We are looking for a new podcast partner!**
**Want even more AI in industry?** https://kipodcast.de/podcast-archiv
**Or our book AI in Industry:** https://www.hanser-fachbuch.de/buch/KI+in+der+Industrie/9783446463455
**Contact our guest:** https://de.linkedin.com/in/markus-günther-6a71a6165
Current affairs section:
**Open letter:** https://netzpolitik.org/2021/autonome-waffensysteme-neue-bundesregierung-soll-killer-roboter-einhegen/
**Nagorno-Karabakh:** https://sicherheitspod.de
**Oracle study:** https://www.industry-of-things.de/hoffnung-auf-karrierekick-durch-ki-a-1069368/
**Golden Age of Computer Vision:** https://www.telecom-paris.fr/gerard-medioni-golden-age-computer-vision-research

Data Chatter
7. From Random Walks to Random Forests: Analytics and data science on Wall Street

Data Chatter

Play Episode Listen Later Aug 2, 2021 49:00


One of the first industries to extensively use advanced maths to do better was financial services. Ever since Fischer Black and Myron Scholes published their seminal paper on option pricing in 1973, Wall Street firms hired mathematicians and scientists in droves, getting them to model asset prices in order to get an edge in the market. Even today, top hedge funds such as Renaissance, Citadel and Two Sigma prefer to hire scientists rather than finance professionals to manage their portfolios. However, in the last decade or so, as Data Science and Artificial Intelligence have taken over the rest of the world, Wall Street has not maintained its leadership position in the use of maths to make money. How and why did this happen? In order to understand this, we talk to Hari Balaji, co-founder of Romulus, an award-winning unstructured data automation platform for Financial Services firms. Prior to founding Romulus, Hari spent a decade in quant & data roles at Goldman Sachs across Hong Kong and Singapore. Hari is an alumnus of IIT Madras & IIM Ahmedabad.
Show Notes
00:03:15 - What is data science and what is artificial intelligence?
00:10:40 - What Hari's company does
00:14:00 - Toolbox versus hammer-nail approaches
00:15:00 - The history of math in the financial services industry
00:28:45 - Wall Street is never a first mover but a great follower
00:33:30 - How Wall Street uses data science nowadays
00:41:00 - Why most innovations have happened at smaller firms
00:44:00 - Why the financial industry doesn't behave like the Tech world
Romulus on Twitter
Romulus on LinkedIn
Data Chatter is a podcast on all things data. It is a series of conversations with experts and industry leaders in data, and each week we aim to unpack a different compartment of the "data suitcase". The podcast is hosted by Karthik Shashidhar. He is a blogger, newspaper columnist, book author and a former data and strategy consultant. Karthik currently heads Analytics and Business Intelligence for Delhivery, one of India's largest logistics companies. You can follow him on twitter at @karthiks, and read his blog at noenthuda.com/blog

Data Science at Home
A simple trick for very unbalanced data (Ep. 157)

Data Science at Home

Play Episode Listen Later Jun 22, 2021 22:02


Data from the real world are never perfectly balanced. In this episode I explain a simple yet effective trick to train models with very unbalanced data. Enjoy the show! Sponsors Get one of the best VPNs at a massive discount with coupon code DATASCIENCE. It provides you with an 83% discount, which unlocks the best price on the market, plus 3 extra months for free. Here is the link: https://surfshark.deals/DATASCIENCE References Leo Breiman, Random Forests, 2001 C. Chen, A. Liaw, L. Breiman, Using Random Forest to Learn Imbalanced Data (2004)
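
The summary doesn't spell the trick out, but the cited Chen, Liaw and Breiman paper studies class weighting and balanced resampling inside the forest. As a hedged illustration of that idea (not code from the episode), scikit-learn exposes it through the class_weight parameter:

```python
# Sketch: making a random forest cope with heavy class imbalance,
# in the spirit of Chen/Liaw/Breiman (2004). Illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# 1% positives: a plain forest tends to ignore the minority class.
X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# 'balanced_subsample' reweights classes inside each bootstrap sample,
# approximating the weighted-RF idea from the paper.
weighted = RandomForestClassifier(class_weight="balanced_subsample",
                                  random_state=0).fit(X_tr, y_tr)

for name, clf in [("plain", plain), ("weighted", weighted)]:
    print(name, balanced_accuracy_score(y_te, clf.predict(X_te)))
```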

BI or DIE
80. Der Data Analyst - Im Gespräch mit Michael Tenner, pmOne

BI or DIE

Play Episode Listen Later Jun 8, 2021 31:44


Michael is Business Development Manager for data visualization and business intelligence at pmOne. His passion is spreading enthusiasm for data and its analysis. His own data journey began in 2010 with the assumption that PostgreSQL was poker software and Kimball an Asian ball sport. It then led him through complex financial models in Excel, an organically grown MySQL database, and first steps in PowerView, PowerPivot and PowerPoint, into the arms of Power BI in 2015. In 2018 he strayed from this first BI love for Tableau. In countless group sessions and one-on-one conversations he has tried to share his knowledge. A few dashboards, reports, analyses and projects later, in 2021, Azure is no longer just a color to him, and he no longer tries to wander aimlessly through the Random Forest. What's in it for you: - Power BI vs. Tableau - The role of the data analyst - What should tool trainings look like? - Why LinkedIn has potential

Tech Hunters by Rebels
Łukasz Malicki: Analiza danych przez sztuczną inteligencję - czy technologia zastąpi człowieka?

Tech Hunters by Rebels

Play Episode Listen Later Jan 11, 2021 52:47


With Łukasz Malicki, CEO of Random Forest, we talk about his plan to revolutionize old-fashioned search tools. Can artificial intelligence replace humans in data analysis? Or, on the contrary, will it become invaluable support for them? What should you focus on when building a business, and why should you draw lessons from failures? The conversation is part of a series summarizing the 6th edition of the #StartUP Małopolska program, in which Random Forest participated.

Machine learning
Real estate price predictor using random forest regressor

Machine learning

Play Episode Listen Later Dec 24, 2020 18:17


PaperPlayer biorxiv bioinformatics
A multi-modal machine learning approach towards predicting patient readmission

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Nov 20, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.20.391904v1?rss=1 Authors: Mohanty, S. D., Lekan, D., McCoy, T. P., Jenkins, M., Manda, P. Abstract: Healthcare costs that can be attributed to unplanned readmissions are staggeringly high and negatively impact the health and wellness of patients. In the United States, hospital systems and care providers have strong financial motivations to reduce readmissions in accordance with several government guidelines. One of the critical steps to reducing readmissions is to recognize the factors that lead to readmission and correspondingly identify at-risk patients based on these factors. The availability of large volumes of electronic health care records makes it possible to develop and deploy automated machine learning models that can predict unplanned readmissions and pinpoint the most important factors of readmission risk. While hospital readmission is an undesirable outcome for any patient, it is more so for medically frail patients. Here, we develop and compare four machine learning models (Random Forest, XGBoost, CatBoost, and Logistic Regression) for predicting 30-day unplanned readmission for patients deemed frail (Age ≥ 50). Variables that indicate frailty, comorbidities, high-risk medication use, and demographic, hospital, and insurance characteristics were incorporated in the models for prediction of unplanned 30-day readmission. Our findings indicate that CatBoost outperforms the other three models (AUC 0.80) and prior work in this area. We find that constructs of frailty, certain categories of high-risk medications, and comorbidity are all strong predictors of readmission for elderly patients. Copyright belongs to the original authors. Visit the link for more info
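
As a hedged sketch of the kind of comparison the authors describe (the actual study uses clinical EHR features, not the synthetic data below), here is how four classifiers can be benchmarked by AUC on one dataset. The xgboost and catboost packages are separate installs:

```python
# Sketch: comparing four classifiers by AUC, mirroring the study design
# (Random Forest, XGBoost, CatBoost, Logistic Regression). Synthetic
# data stands in for the paper's EHR features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=5_000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "xgboost": XGBClassifier(eval_metric="logloss", random_state=0),
    "catboost": CatBoostClassifier(verbose=0, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```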

PaperPlayer biorxiv bioinformatics
LongTron: Automated Analysis of Long Read Spliced Alignment Accuracy

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Nov 11, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.10.376871v1?rss=1 Authors: Wilks, C., Schatz, M. C. Abstract: Motivation: Long read sequencing has increased the accuracy and completeness of assemblies of various organisms' genomes in recent months. Similarly, spliced alignments of long read RNA sequencing hold the promise of delivering much longer transcripts of existing and novel isoforms in known genes without the need for error-prone transcript assemblies from short reads. However, low coverage and high error rates potentially hamper the widespread adoption of long-read spliced alignments in annotation updates and isoform-level expression quantifications. Results: Addressing these issues, we first develop a simulation of error modes for both Oxford Nanopore and PacBio CCS spliced alignments. Based on this we train a Random Forest classifier to assign new long-read alignments to one of two error categories, a novel category, or label them as non-error. We use this classifier to label reads from the spliced alignments of the popular aligner minimap2, run on three long read sequencing datasets, including NA12878 from Oxford Nanopore and PacBio CCS, as well as a PacBio SKBR3 cancer cell line. Finally, we compare the intron chains of the three long read alignments against individual splice sites, short read assemblies, and the output from the FLAIR pipeline on the same samples. Our results demonstrate a substantial lack of precision in determining exact splice sites for long reads during alignment on both platforms, while showing some benefit from postprocessing. This work motivates the need for both better aligners and additional post-alignment processing to adjust incorrectly called putative splice sites and clarify support for novel transcripts. Availability and implementation: Source code for the random forest, implemented in Python, is available at https://github.com/schatzlab/LongTron under the MIT license. The modified version of GffCompare used to construct Table 3 and related is here: https://github.com/ChristopherWilks/gffcompare/releases/tag/0.11.2LT Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Neural network fast-classifies biological images using features selected after their random-forests-importance to power smart microscopy.

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Nov 11, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.10.376988v1?rss=1 Authors: Balluet, M., Sizaire, F., Walter, T., Pont, J., Giroux, B., Bouchareb, O., Tramier, M., Pecreaux, J. Abstract: Artificial intelligence is nowadays used for cell detection and classification in optical microscopy during post-acquisition analysis. Microscopes are now fully automated and are next expected to be smart, making acquisition decisions based on the images, which calls for analysing them on the fly. Biology further imposes training on a reduced dataset, due to the cost and time of preparing the samples and having the datasets annotated by experts. We propose here a real-time image processing approach compliant with these specifications, balancing accurate detection and execution performance. We characterised the images using a generic, high-dimensional feature extractor. We then classified the images using machine learning in order to understand the contribution of each feature to decision and execution time. We found that random forests, a non-linear classifier, outperformed Fisher's linear discriminant. More importantly, the most discriminant and time-consuming features could be excluded without any significant loss in accuracy, offering a substantial gain in execution time. We offer a method to select fast and discriminant features. In our assay, a 79.6 {+/-} 2.4% accurate classification of a cell took 68.7 {+/-} 3.5 ms (mean {+/-} SD, 5-fold cross-validation nested in 10 bootstrap repeats), corresponding to 14 cells per second, dispatched into 8 phases of the cell cycle using 12 feature-groups and operating a consumer-market ARM-based embedded system. Interestingly, a simple neural network offered similar performance, paving the way to faster training and classification using parallel execution on a general-purpose graphics processing unit. Copyright belongs to the original authors. Visit the link for more info
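
The general pattern described here (rank features by random-forest importance, then keep a small, fast subset for a simpler classifier) can be sketched with scikit-learn. This is only an illustration of the idea, not the authors' pipeline or feature extractor:

```python
# Sketch: use random-forest importances to keep a small feature subset,
# then train a simple neural network on the reduced features.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# Keep only features whose importance exceeds the mean importance.
selector = SelectFromModel(forest, prefit=True, threshold="mean")
X_tr_small, X_te_small = selector.transform(X_tr), selector.transform(X_te)

mlp = MLPClassifier(max_iter=500, random_state=0).fit(X_tr_small, y_tr)
print(X_tr.shape[1], "->", X_tr_small.shape[1], "features;",
      "accuracy:", round(mlp.score(X_te_small, y_te), 3))
```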

PaperPlayer biorxiv bioinformatics
Detecting significant components of microbiomes by random forest with forward variable selection and phylogenetics

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Oct 30, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.29.361360v1?rss=1 Authors: Dang, T., Kishino, H. Abstract: A central focus of microbiome studies is the characterization of differences in microbiome composition across groups of samples. A major challenge is the high dimensionality of microbiome datasets, which significantly reduces the power of current approaches for identifying true differences and increases the chance of false discoveries. We have developed a new framework to address these issues by combining (i) identifying a few significant features by a massively parallel forward variable selection procedure, (ii) mapping the selected species onto a phylogenetic tree, and (iii) predicting functional profiles by functional gene enrichment analysis from metagenomic 16S rRNA data. We demonstrated the performance of the proposed approach by analyzing two published datasets from large-scale case-control studies: (i) 16S rRNA gene amplicon data for Clostridioides difficile infection (CDI) and (ii) shotgun metagenomics data for human colorectal cancer (CRC). The proposed approach improved the accuracy from 81% to 99.01% for CDI and from 75.14% to 90.17% for CRC. We identified a core set of 96 species that were significantly enriched in CDI and a core set of 75 species that were enriched in CRC. Moreover, although data quality differed between the functional profiles predicted from the 16S rRNA dataset and the functional metagenome profiling, our approach performed well for both databases and detected the main functions that can be used to diagnose and further study the growth stage of diseases. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
RNAmining: A machine learning stand-alone and web server tool for RNA coding potential prediction

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Oct 26, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.26.354357v1?rss=1 Authors: Ramos, T., Galindo, N., Arias-Carrasco, R., da Silva, C., Maracaja-Coutinho, V., do Rego, T. Abstract: Non-coding RNAs (ncRNAs) are important players in the cellular regulation of organisms from different kingdoms. One of the key steps in ncRNA research is the ability to distinguish coding/non-coding sequences. We applied 7 machine learning algorithms (Naive Bayes, SVM, KNN, Random Forest, XGBoost, ANN and DL) across 15 model organisms from different evolutionary branches. Then, we created a stand-alone and web server tool (RNAmining) to distinguish coding and non-coding sequences, selecting the algorithm with the best performance (XGBoost). First, we used coding/non-coding sequences downloaded from Ensembl (April 14th, 2020). The coding/non-coding sequences were balanced, had their tri-nucleotide counts analysed, and were normalized by sequence length. In total we built 180 models. All machine learning algorithm tests were performed using 10-fold cross-validation, and we selected the algorithm with the best results (XGBoost) to implement in RNAmining. Best F1-scores ranged from 97.56% to 99.57% depending on the organism. Moreover, we produced a benchmark against other tools already in the literature (CPAT, CPC2, RNAcon and Transdecoder) and our results outperformed them, opening opportunities for the development of RNAmining, which is freely available at https://rnamining.integrativebioinformatics.me/ . Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Ultra-fast Prediction of Somatic Structural Variations by Reduced Read Mapping via Pan-Genome k-mer Sets

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Oct 26, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.25.354456v1?rss=1 Authors: Choi, M.-H., Sohn, J.-i., Yi, D., Menon, A. V., Kim, Y. J., Kyoung, S., Shin, S.-H., Na, B., Joung, J.-G., Yoon, S., Koh, Y., Baek, D., Kim, T.-M., Nam, J.-W. Abstract: Genome rearrangements often result in copy number alterations of cancer-related genes and cause the formation of cancer-related fusion genes. Current structural variation (SV) callers, however, still produce massive numbers of false positives (FPs) and require high computational costs. Here, we introduce an ultra-fast and high-performing somatic SV detector, called ETCHING, that significantly reduces the mapping cost by filtering reads matched to pan-genome and normal k-mer sets. To reduce the number of FPs, ETCHING takes advantage of a Random Forest classifier that utilizes six breakend-related features. We systematically benchmarked ETCHING against other SV callers on reference SV materials, validated SV biomarkers, tumor and matched-normal whole genomes, and tumor-only targeted sequencing datasets. For all datasets, our SV caller was much faster (≥15X) than other tools without compromising performance or memory use. Our approach would provide not only the fastest method for large-scale genome projects but also an accurate, clinically practical means for real-time precision medicine. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Predicting Cell-Penetrating Peptides: Building and Interpreting Random Forest based prediction Models

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Oct 16, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.15.341149v1?rss=1 Authors: Yadahalli, S., Verma, C. S. Abstract: Targeting intracellular pathways with peptide drugs is becoming increasingly desirable but is often limited in application due to their poor cell permeability. Understanding the cellular permeability of peptides remains a major challenge, with very little known about structure-activity relationships. Fortunately, there exists a class of peptides called Cell-Penetrating Peptides (CPPs), which have the ability to cross cell membranes and are also capable of delivering biologically active cargo into cells. Discovering the patterns that make peptides cell-permeable has a variety of applications in drug delivery. In the current study, we build prediction models for CPPs, exploring features covering a range of properties based on amino acid sequences, using Random Forest classifiers, which are often more interpretable than other ensemble machine learning algorithms. While obtaining prediction accuracies of ~96%, we also interpret our prediction models using TreeInterpreter, LIME and SHAP to decipher the contributions of important features and the optimal feature space for the CPP class. We propose that our work might offer an intuitive guide for incorporating features that impart cell-penetrability into the design of novel CPPs. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv neuroscience
Predicting age and clinical risk from the neonatal connectome

PaperPlayer biorxiv neuroscience

Play Episode Listen Later Sep 29, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.09.28.317180v1?rss=1 Authors: Taoudi-Benchekroun, Y., Christiaens, D., Grigorescu, I., Schuh, A., Pietsch, M., Chew, A., Harper, N., Falconer, S., Poppe, T., Hughes, E., Hutter, J., Price, A. N., Tournier, J.-D., Cordero-Grande, L., Counsell, S. J., Rueckert, D., Arichi, T., Hajnal, J. V., Edwards, A. D., Deprez, M., Batalle, D. Abstract: The development of perinatal brain connectivity underpins motor, cognitive and behavioural abilities in later life. With the rise of advanced imaging methods such as diffusion MRI, the study of brain connectivity has emerged as an important tool to understand subtle alterations associated with neurodevelopmental conditions. Brain connectivity derived from diffusion MRI is complex, multi-dimensional and noisy, and hence it can be challenging to interpret on an individual basis. Machine learning methods have proven to be a powerful tool to uncover hidden patterns in such data, thus opening an opportunity for early identification of atypical development and potentially more efficient treatment. In this work, we used Deep Neural Networks and Random Forests to predict neurodevelopmental characteristics from neonatal structural connectomes, in a large sample of neonates (N = 524) derived from the developing Human Connectome Project. We achieved a highly accurate prediction of post menstrual age (PMA) at scan on term-born infants (Mean absolute error (MAE) = 0.72 weeks, r = 0.83, p

Machine Learning en Español
15 Adaboost: Adaptive Boosting

Machine Learning en Español

Play Episode Listen Later Sep 28, 2020 16:32


Adaboost is one of the classic machine learning algorithms. Just like Random Forest and XGBoost, it belongs to the class of ensemble models, that is, models based on aggregating other weak or base models to make predictions. The main difference of Adaboost is that it is adaptive: it learns from the errors made by the earlier models by putting more emphasis on the incorrectly classified examples.

Machine Learning with Coffee
15 Adaboost: Adaptive Boosting

Machine Learning with Coffee

Play Episode Listen Later Sep 28, 2020 18:01


Adaboost is one of the classic machine learning algorithms. Just like Random Forest and XGBoost, Adaboost belongs to the ensemble models; in other words, it aggregates the results of simpler classifiers to make robust predictions. The main difference of Adaboost is that it is an adaptive algorithm, which means that it learns from the misclassified instances of previous models, assigning more weight to those errors and focusing its attention on those instances in the next round.
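
As a quick illustration of the reweighting idea described here, scikit-learn's AdaBoostClassifier builds exactly this kind of adaptive ensemble of shallow trees (a minimal sketch, not material from the episode):

```python
# Minimal AdaBoost sketch: each round fits a weak learner (a stump),
# then increases the weight of the examples it misclassified so the
# next round focuses on them.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)  # the weak base model
# Note: in scikit-learn < 1.2 the parameter is called base_estimator.
ada = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)

print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean().round(3))
```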

PaperPlayer biorxiv bioinformatics
HieRFIT: Hierarchical Random Forest for Information Transfer

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Sep 18, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.09.16.300822v1?rss=1 Authors: Kaymaz, Y., Ganglberger, F., Tang, M., Fernandez-Albert, F., Lawless, N., Sackton, T. B. Abstract: The emergence of single-cell RNA sequencing (scRNA-seq) has led to an explosion in novel methods to study biological variation among individual cells, and to classify cells into functional and biologically meaningful categories. Here, we present a new cell type projection tool, HieRFIT (Hierarchical Random Forest for Information Transfer), based on hierarchical random forests. HieRFIT uses a priori information about cell type relationships to improve classification accuracy, taking as input a hierarchical tree structure representing the class relationships, along with the reference data. We use an ensemble approach combining multiple random forest models, organized in a hierarchical decision tree structure. We show that our hierarchical classification approach improves accuracy and reduces incorrect predictions, especially for inter-dataset tasks, which reflect real-life applications. We use a scoring scheme that adjusts probability distributions for candidate class labels and resolves uncertainties while avoiding the assignment of cells to incorrect types, by labeling cells at internal nodes of the hierarchy when necessary. Using HieRFIT, we re-analyzed publicly available scRNA-seq datasets, showing its effectiveness in cell type cross-projections with inter/intra-species examples. HieRFIT is implemented as an R package and is available at https://github.com/yasinkaymaz/HieRFIT/releases/tag/v1.0.0 Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Combining Natural Language Processing and Metabarcoding to Reveal Pathogen-Environment Associations

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Sep 3, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.09.02.280578v1?rss=1 Authors: Molik, D. C., Tomlinson, D., Davitt, S., Morgan, E. L., Roche, B., Meyers, N., Pfrender, M. E. Abstract: Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year--with 180,000 resulting deaths--mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen, we implement a Natural Language Processing approach to better describe the niche of C. neoformans. We use a Latent Dirichlet Allocation model to de novo topic model sets of metagenetic research articles written about varied subjects which either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene-regions. The number of topics was determined based on the model coherence score, and articles were assigned to the created topics via a machine learning approach with a Random Forest algorithm. Our analysis provides support for a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach, using a search of single-locus metagenetic data, gathering papers connected to the datasets, de novo determination of topics and their number, and assignment of articles to the topics, illustrates how such an analysis pipeline can harness large-scale datasets that are published/available but not necessarily fully analyzed, or whose metadata are not harmonized with other studies. Our approach can be applied to a variety of systems to assert potential evidence of environmental associations. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
geneRFinder: gene finding in distinct metagenomic data complexities

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Aug 24, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.21.262147v1?rss=1 Authors: Silva, R., Padovani, K., Goes, F., Alves, R. Abstract: Motivation: Microbes play a fundamental economic, social and environmental role in our society. Metagenomics makes it possible to investigate microbes in their natural environments (complex communities) and their interactions. The way they act is usually estimated by looking at the functions they play in those environments, and their responsibility is measured by their genes. Advances in next-generation sequencing technology have facilitated metagenomics research, but they have also created a heavy computational burden. Large and complex biological datasets are available as never before. There are many gene predictors available which can aid the gene annotation process, though they do not handle metagenomic data complexities appropriately. There is no standard metagenomic benchmark data for gene prediction. Thus, gene predictors may inflate their results by obfuscating low false discovery rates. Results: We introduce geneRFinder, an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across this benchmark by using only one pre-trained Random Forest model. Average prediction rates of geneRFinder differed in percentage terms by 54% and 64%, respectively, against Prodigal and FragGeneScan while handling high-complexity metagenomes. The specificity rate of geneRFinder had the largest distance against FragGeneScan, 79 percentage points, and 66 more than Prodigal. According to McNemar's test, all percentage differences between predictor performances are statistically significant for all datasets with a 99% confidence interval. Conclusions: We provide geneRFinder, an approach for gene prediction in distinct metagenomic complexities, available at github.com/railorena/geneRFinder, and we also provide a novel, comprehensive benchmark dataset for gene prediction --- which is based on The Critical Assessment of Metagenome Interpretation (CAMI) challenge and contains labeled data from gene regions --- available at sourceforge.net/p/generfinder-benchmark . Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
An Ensemble Learning Approach for Cancer Drug Prediction

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Aug 11, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.10.245142v1?rss=1 Authors: Mandera, D., Ritz, A. Abstract: Predicting the response to a particular drug for a specific cancer, even with known genetic mutations, remains a huge challenge in modern oncology and precision medicine. Today, prescribing a drug for a cancer patient is based on a doctor's analysis of various articles and previous clinical trials; it is an extremely time-consuming process. We developed a machine learning classifier to automatically predict a drug given a carcinogenic gene mutation profile. Using the Breast Invasive Carcinoma dataset from The Cancer Genome Atlas (TCGA), the method first selects features from mutated genes and then applies K-fold cross-validation with Decision Tree, Random Forest and Ensemble Learning classifiers to predict the best drugs. Ensemble Learning yielded a prediction accuracy of 66% on the test set in predicting the correct drug. To validate that the model is general-purpose, Lung Adenocarcinoma (LUAD) and Colorectal Adenocarcinoma (COADREAD) data from TCGA were trained and tested, yielding prediction accuracies of 50% and 66%, respectively. The resulting accuracy indicates a direct correlation between prediction accuracy and cancer data size. More importantly, the results for LUAD and COADREAD show that the implemented model is general-purpose, as it is able to achieve similar results across multiple cancer types. We further verified the validity of the model by applying it to patients with unclear recovery status from the COADREAD dataset. In every case, the model predicted a drug that was administered to each patient. This method will offer oncologists significant time savings compared to their current approach of extensive background research, and offers personalized care for cancer patients. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Development of Putative Isospecific Inhibitors for HDAC6 using Random Forest, QM-Polarized docking, Induced-fit docking, and Quantum mechanics

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Aug 10, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.10.243824v1?rss=1 Authors: Joel, I. y., ADIGUN, T. O., BANKOLE, O. O., AJIBOLA, A. O., OFENIFORO, E. B., AUTA, F. B., OZOJIOFOR, U. O., REMI-ESAN, I. A., AKANDE, A. I. Abstract: Histone deacetylases have been recognized as a potential target for epigenetic aberrance reversal in various strategies for cancer therapy, with HDAC6 implicated in various forms of tumor growth and cancer. Diverse inhibitors of HDAC6 have been developed; however, there is still the challenge of isoform specificity and toxicity. In this study, we trained a Random Forest model on all HDAC6 inhibitors curated in the ChEMBL database (3,742). Upon rigorous validation the model had an 85% balanced accuracy and was used to screen the SCUBIDOO database; 7,785 hit compounds resulted and were docked into the HDAC6 CD2 active site. The top two compounds, having a benzimidazole moiety as their zinc-binding group, had binding affinities of -78.56 kcal/mol and -78.21 kcal/mol respectively. The compounds were subjected to exhaustive docking protocols (QM-polarized docking and induced-fit docking) in order to elucidate a binding hypothesis and an accurate binding affinity. Upon optimization, the compounds showed improved binding affinity (-81.42 kcal/mol), putative specificity for HDAC6, and good ADMET properties. We have therefore developed a reliable model to screen for HDAC6 inhibitors and suggested a series of benzimidazole-based inhibitors showing high binding affinity and putative specificity for HDAC6. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
A Deep Learning Framework for Predicting Human Essential Genes by Integrating Sequence and Functional data

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Aug 5, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.04.236646v1?rss=1 Authors: Xiao, W., Zhang, X., Xiao, W. Abstract: Motivation: Essential genes are necessary for the survival or reproduction of a living organism. The prediction and analysis of gene essentiality can advance our understanding of basic life and human diseases, and further boost the development of new drugs. Wet-lab methods for identifying essential genes are often costly, time-consuming, and laborious. As a complement, computational methods have been proposed to predict essential genes by integrating multiple biological data sources. Most of these methods are evaluated on model organisms. However, prediction methods for human essential genes are still limited, and the relationship between human gene essentiality and different biological information still needs to be explored. In addition, exploring suitable deep learning techniques to overcome the limitations of traditional machine learning methods and improve prediction accuracy is also important and interesting. Results: We propose a deep learning based method, DeepSF, to predict human essential genes. DeepSF integrates four types of features: sequence features, features from gene ontology, features from protein complexes, and network features. Sequence features are derived from the DNA and protein sequence of each gene. 100 GO terms from the cellular component ontology are used to form a feature vector for each gene, in which each component captures the relationship between a gene and a GO term. Network features are learned from the protein-protein interaction (PPI) network using a deep learning based network embedding method. The features derived from protein complexes capture the relationships between a gene, or a gene's direct neighbors in the PPI network, and protein complexes. The four types of features are integrated together to train a multilayer neural network. The experimental results of 10-fold cross-validation show that DeepSF can accurately predict human gene essentiality, with an average performance of AUC about 94.35%, area under the precision-recall curve (auPRC) about 91.28%, accuracy about 91.35%, and F1 measure about 77.79%. In addition, the comparison results show that DeepSF significantly outperforms several widely used traditional machine learning models (SVM, Random Forest, and Adaboost), and performs slightly better than a recent deep learning model (DeepHE). Conclusions: We have demonstrated that the proposed method, DeepSF, is effective for predicting human essential genes. Deep learning techniques are promising at both the feature learning and classification levels for the task of essential gene prediction. Copyright belongs to the original authors. Visit the link for more info

PaperPlayer biorxiv bioinformatics
Single-cell identity definition using random forests and recursive feature elimination (scRFE)

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Aug 4, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.08.03.233650v1?rss=1 Authors: Park, M., Vorperian, S., Wang, S., Pisco, A. O. Abstract: Single cell RNA sequencing (scRNA-seq) enables detailed examination of a cell's underlying regulatory networks and the molecular factors contributing to its identity. We developed scRFE (single-cell identity definition using random forests and recursive feature elimination, pronounced 'surf') with the goal of easily generating interpretable gene lists that can accurately distinguish observations (single cells) by their features (genes) given a class of interest. scRFE is an algorithm implemented as a Python package that combines the classical random forest method with recursive feature elimination and cross-validation to find the features necessary and sufficient to classify cells in a single-cell RNA-seq dataset by ranking feature importance. The package is compatible with Scanpy, enabling a seamless integration into any single-cell data analysis workflow that aims at identifying minimal transcriptional programs relevant to describing metadata features of the dataset. We applied scRFE to the Tabula Muris Senis and reproduced commonly known aging patterns and transcription factor reprogramming protocols, highlighting the biological value of scRFE's learned features. Copyright belongs to the original authors. Visit the link for more info
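
scRFE itself is its own Python package, but the underlying combination (random-forest importance plus recursive feature elimination with cross-validation) exists directly in scikit-learn. A generic sketch of that idea, not the scRFE package:

```python
# Generic sketch of the scRFE idea in plain scikit-learn: rank features
# by random-forest importance and recursively eliminate the weakest,
# with cross-validation choosing how many to keep.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=1_000, n_features=50,
                           n_informative=8, random_state=0)

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    step=5,          # drop 5 features per elimination round
    cv=5,
    scoring="accuracy",
)
rfecv.fit(X, y)
print("selected", rfecv.n_features_, "of", X.shape[1], "features")
```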

Machine Learning en Español
14 XGBoost: El Ganador de Muchas Competencias

Machine Learning en Español

Play Episode Listen Later Jul 26, 2020 19:15


XGBoost is an open-source software library that has won several Machine Learning competitions. XGBoost is based on the principles of gradient boosting, which in turn is based on the ideas of Leo Breiman, the creator of Random Forest. The theory behind gradient boosting was formalized by Jerome H. Friedman. Gradient boosting combines simple models and uses very clever engineering, which includes a penalty for the trees and a proportional shrinking of the leaf nodes.

PaperPlayer biorxiv bioinformatics
Cross-Tissue Transcriptomic Analysis Leveraging Machine Learning Approaches Identifies New Biomarkers for Rheumatoid Arthritis

PaperPlayer biorxiv bioinformatics

Play Episode Listen Later Jul 26, 2020


Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.07.24.220483v1?rss=1 Authors: Rychkov, D., Neely, J., Oskotsky, T., Sirota, M. Abstract: Background/Purpose: There is an urgent need to identify effective biomarkers for the early diagnosis of Rheumatoid Arthritis (RA) and accurate monitoring of disease activity. Here we define an RA meta-profile using publicly available cross-tissue gene expression data and apply machine learning to identify putative biomarkers, which we further validate on independent datasets. Methods: We carried out a comprehensive search for publicly available microarray gene expression data in the NCBI Gene Expression Omnibus database for whole blood and synovial tissues from RA patients and healthy controls. The raw data from 13 synovium datasets with 284 samples and 14 blood datasets with 1,885 samples were downloaded and processed. The datasets for each tissue were merged, batch corrected, and split into training and test sets. We then developed and applied a robust feature selection pipeline to identify genes dysregulated in both tissues and highly associated with RA. From the training data we identified a set of overlapping differentially expressed genes, following the condition of co-directionality. The classification performance of each gene in the resulting set was evaluated on the test sets using AUROC. Five independent datasets were used to validate and threshold the feature-selected (FS) genes. Finally, we define the RA Score, composed of the geometric mean of the selected RA Score panel genes, and demonstrate its clinical utility. Results: The result of the feature selection pipeline was a set of 25 upregulated and 28 downregulated genes. To assess the robustness of these feature-selected genes, we trained a Random Forest machine learning model with this set of 53 genes, and then with the set of 32 common differentially expressed genes, and tested on the validation cohorts. The model with FS genes outperformed the model with common DE genes, with AUC 0.89 +/- 0.04 vs 0.86 +/- 0.05. The FS genes were further thresholded on the 5 independent datasets, resulting in 10 upregulated genes, TNFAIP6, S100A8, TNFSF10, DRAM1, LY96, QPCT, KYNU, ENTPD1, CLIC1, ATP6V0E1, that are involved in innate immune system pathways, including neutrophil degranulation and apoptosis, and expressed in granulocytes, dendritic cells, and macrophages; and 3 downregulated genes, HSP90AB1, NCL, CIRBP, involved in metabolic processes and T-cell receptor regulation of apoptosis and expressed in lymphoblasts. To investigate the clinical utility of the 13 validated genes, the RA Score was developed and found to be highly correlated with DAS28 (r = 0.33 +/- 0.03, p = 7e-9) and able to distinguish OA and RA samples (OR 0.57, 95% CI [0.34, 0.80], p = 8e-10). Moreover, the RA Scores were not significantly different for RF-positive and RF-negative RA sub-phenotypes (p = 0.9), suggesting the generalizability of this score in clinical applications. The RA Score was also able to monitor treatment effect among RA patients (t-test of treated vs untreated, p = 2e-4) and distinguish polyJIA from healthy individuals in 10 independent pediatric cohorts (OR 1.15, 95% CI [1.01, 1.3], p = 2e-4). Conclusion: The RA Score, consisting of 13 putative biomarkers, identified through a robust feature selection procedure on public data and validated using multiple independent datasets, may be useful in the diagnosis and treatment monitoring of RA. Copyright belongs to the original authors. Visit the link for more info

Machine Learning with Coffee
14 XGBoost: The Winner of Many Competitions

Machine Learning with Coffee

Play Episode Listen Later Jul 26, 2020 13:50


XGBoost is an open-source software library which has won several Machine Learning competitions on Kaggle. It is based on the principles of gradient boosting, which in turn builds on the ideas of Leo Breiman, the creator of Random Forest. The theory behind gradient boosting was later formalized by Jerome H. Friedman. Gradient boosting combines weak learners, just as Random Forest does. XGBoost is an engineering implementation which includes a clever penalization of trees and a proportional shrinking of leaf nodes.
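
A minimal sketch of the library in use, assuming the xgboost package is installed; the regularization knobs mentioned above show up directly as parameters:

```python
# Minimal XGBoost sketch: gradient boosting with the library's
# regularization knobs (tree penalty via gamma/reg_lambda, leaf
# shrinkage via the learning rate).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5_000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,   # shrinks each tree's contribution
    max_depth=4,
    reg_lambda=1.0,      # L2 penalty on leaf weights
    gamma=0.0,           # minimum loss reduction required to split
    eval_metric="logloss",
)
model.fit(X_tr, y_tr)
print("test accuracy:", round(model.score(X_te, y_te), 3))
```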

Machine Learning en Español
13 Random Forest

Machine Learning en Español

Play Episode Listen Later Jul 12, 2020 23:16


Random Forest is one of the best algorithms that are ready to use without much tuning. In this episode we try to understand the intuition behind this algorithm and how it takes advantage of decision trees by aggregating them with a very good trick called bagging. Variable importance and the out-of-bag error are features of this algorithm that help us better understand which variables are the most important and what the generalization error is, respectively.

Machine Learning with Coffee
13 Random Forest

Machine Learning with Coffee

Play Episode Listen Later Jul 12, 2020 23:07


Random Forest is one of the best off-the-shelf algorithms. In this episode we try to understand the intuition behind the Random Forest and how it tries to leverage the capabilities of Decision Trees by aggregating them using a very smart trick called "bagging". Variable importance and out-of-bag error are two of the nice capabilities of Random Forest, which allow us to find the most important predictors and compute a good generalization error, respectively.
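
Both capabilities mentioned here are one flag away in scikit-learn; a minimal sketch (not from the episode):

```python
# Sketch: bagging, out-of-bag error, and variable importance with
# scikit-learn's RandomForestClassifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,     # estimate generalization error from OOB samples
    random_state=0,
).fit(data.data, data.target)

print("OOB accuracy:", round(forest.oob_score_, 3))

# Impurity-based variable importance, largest first.
ranked = sorted(zip(forest.feature_importances_, data.feature_names),
                reverse=True)
for importance, name in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```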

Quant Trading Live Report
Analyzing potential of random forest machine learning for crypto trading bot

Quant Trading Live Report

Play Episode Listen Later Apr 29, 2020 16:10


This is the ABSOLUTE most critical metric out there when trading crypto. If your exchange does not offer this, move to another that does. Can you rely on any exchange that offers this? Also, most 3rd-party bot services or platforms most likely will NOT offer this metric to you. So be forewarned: if you don't get it, you could get crushed and lose money. Free trading books: https://quantlabs.net/ or learn algo trading: https://quantlabs.net/dvd https://quantlabs.net/blog/2020/04/most-powerful-high-frequency-trading-aka-hft-like-metric-for-a-crypto-trading-bot/

Quant Trading Live Report
PCA vs Random Forest vs Regressions with machine learning for crypto trading

Quant Trading Live Report

Play Episode Listen Later Apr 23, 2020 34:04


This machine learning technique is used for market prediction with a crypto trading bot. The options were set out as explained here. Free trading books: https://quantlabs.net/ or learn algo trading: https://quantlabs.net/dvd Some forecasting methods from an expert: maybe regress them all against price, or conduct PCA, or perhaps just throw them all into a random forest in order to see which ones are the most important? All of these options are available in Scikit-learn with Python, and Scikit-learn seems to be the simplest machine learning library to go with in Python.
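
A hedged sketch of what the three options named above look like on the same feature matrix in scikit-learn, with synthetic data standing in for market indicators:

```python
# Sketch of the three options: regress features against price, run PCA,
# or rank them with a random forest. Synthetic stand-in data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 6))                  # six candidate indicators
price = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=1_000)

# Option 1: regress them all against price.
print("coefficients:", LinearRegression().fit(X, price).coef_.round(2))

# Option 2: PCA on the indicators.
pca = PCA(n_components=3).fit(X)
print("explained variance:", pca.explained_variance_ratio_.round(2))

# Option 3: throw them into a random forest and rank importance.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, price)
print("importances:", forest.feature_importances_.round(2))
```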

Quant Trading Live Report
Analyzing potential of random forest machine learning for crypto trading bot

Quant Trading Live Report

Play Episode Listen Later Apr 23, 2020 29:14


Analyzing the potential of random forest machine learning for a crypto trading bot: I was playing around with random forest, as posted here. Free trading books: https://quantlabs.net/ or learn algo trading: https://quantlabs.net/dvd

Machine learning
Gradient boost and adboostingclassifier and bagging and random forest classifier

Machine learning

Play Episode Listen Later Apr 22, 2020 19:45


Modellansatz
Machine Learning - Maschinelles Lernen

Modellansatz

Play Episode Listen Later Mar 5, 2020 41:23


Gudrun talks with Sebastian Lerch from the Institute of Stochastics in the KIT Department of Mathematics. Some time ago - in early 2015 - the two had already talked about how extreme weather events can be modeled stochastically. This time the topic is a course that Sebastian designed specifically to offer doctoral researchers of all disciplines at KIT an introduction to machine learning. The setting for this is the graduate school MathSEED, which is part of the KIT Center MathSEE founded in October 2018. There have long (and perhaps always) been offerings at KIT that introduced engineers in particular to modern mathematics, because they needed its methods during the master's phase or at the latest during their doctorate, yet these methods are not covered by the classical contents of the standard higher mathematics courses. All of this is now bundled and extended under the umbrella of MathSEED. Moreover, it now works in both directions: mathematicians are likewise invited to introductory offerings of the other participating departments. The topic of machine learning and artificial intelligence was at the very top of the wish list for new offerings. In February 2020 Sebastian designed and taught this lecture for the first time - the exercises were supervised by Eva-Maria Walz. The course will be offered again in autumn 2020. It is not easy to separate the various terms used for artificial intelligence (AI) from one another, especially since usage differs across contexts. In addition, with the availability of large amounts of data and the frequent joint use of AI and big data, much gets mixed up here as well. Sebastian defines machine learning as a proper subset of AI, keeping in mind that, for example, symbolic computation is also AI. Likewise, so-called expert systems have long provided support for decisions: rules specify a program that turns data input into output. Today, when we think of AI, we rather think of the computer learning, say, what a picture of a car looks like without being given explicit rules for it - more comparable to how children learn. The most modern variant is so-called deep learning based on neural networks. The boundary to statistical methods is sometimes not so clear. The neural network becomes a black box, which does not fully satisfy scientifically minded people. But with its help, more complex problems become solvable. Research must try to make the decisions of the black box comprehensible and decide when the quality is sufficient. For this, one has to consider: how do you measure errors? In image processing it can be enough to count, for example, wrongly recognized cars. In weather forecasting one can determine in hindsight which errors were made in the forecast. There will be different error tolerances for the detection of pedestrians by self-driving cars and for the accuracy of a weather forecast. One example in the exercises was temperature prediction from available data. Forecasts are based on physical models in which the evolution of temperature, air pressure and wind speed is reproduced by systems of equations. But these models cannot be computed without error and are also rather strongly simplified.
These errors are analyzed with the help of AI, and the results are used to improve the forecast. A popular method is random forests, built from decision trees. Here, complex questions are decomposed step by step, and at each step simple yes/no questions are answered. This is applied, for example, when deciding whether and where a warning about a thunderstorm cell should be issued. Very well known and proven in practical use (for example in image processing and in translation between common languages) are neural networks. Here, so-called neurons are arranged in several layers. One can picture them as nodes in a network through which data are transported from node to node. In each node the incoming data are summed with weights, and a predefined activation function decides what is passed on to the next nodes or the next layer of neurons; a toy version of this computation is sketched after the reference list below. The individual arithmetic operations are thus quite elementary, but their interplay is hard to analyze. With many layers one speaks of deep learning. This is still in its infancy at the moment, but it can have far-reaching consequences. In any case, humans should be involved in the decision process. For the concrete implementation, Sebastian chose equal parts lecture and exercises. He put the emphasis on an overview of methodological aspects that enables the participants to continue learning on their own later. Among other things, this covered how to select training data, how quality assurance works, how popular models work, and how to judge whether a model is fitted too closely to the data. In the exercises, it was very well received that a live online forecasting competition between the developed models was possible through Kaggle competitions.
Literature and further information
Research results obtained with machine learning in which Sebastian Lerch is involved:
M.N. Lang e.a.: Remember the past: A comparison of time-adaptive training schemes for non-homogeneous regression. Nonlinear Processes in Geophysics, 27: 23–34, 2020. (more on the stochastic side)
S. Rasp and S. Lerch: Neural networks for post-processing ensemble weather forecasts. Monthly Weather Review, 146(11): 3885–3900, 2018.
Textbooks
T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning. Springer 2017 (2nd edition).
G. James, D. Witten, T. Hastie and R. Tibshirani: An Introduction to Statistical Learning. Springer 2013 (7th printing).
I. Goodfellow, Y. Bengio and A. Courville: Deep Learning. MIT Press 2016.
Online courses
PyTorch-based Python library fastai
Deeplearning
Dystopias of everyday AI
C. Doctorow: Little Brother. Tor Teen, 2008. Download from the author.
C. Doctorow: Homeland. Tor Books, 2013, ISBN 978-0-7653-3369-8.
Image processing mentioned in the conversation that merges your own photos with artworks
Meetups around Karlsruhe
Karlsruhe ai Meetup
Heidelberg ai Meetup
Machine Learning Rhein-Neckar (Mannheim)
Podcasts
Leben X0 - Episode 6: Was ist Machine Learning? November 2019.
Streitraum: Intelligenz und Vorurteil. Carolin Emcke in conversation with Anke Domscheit-Berg and Julia Krüger, January 26, 2020.
P. Packmohr, S. Ritterbusch: Neural Networks, Data Science Phil, Episode 16, 2019.
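
The neuron mechanics described above (weighted sums at each node, then an activation function deciding what is passed on) fit in a few lines of NumPy. A toy forward pass, not tied to the episode's material:

```python
# Toy forward pass through one hidden layer: weighted sums at each
# node, then an activation function gates what reaches the next layer.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input: e.g. temperature, pressure, wind
W1 = rng.normal(size=(4, 3))      # weights into 4 hidden neurons
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))      # weights into 1 output neuron
b2 = np.zeros(1)

hidden = relu(W1 @ x + b1)        # weighted sum + activation per neuron
output = W2 @ hidden + b2         # linear output, e.g. corrected forecast
print(output)
```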

Delicate Database with Aaron
Machine Learning - Random Forest

Delicate Database with Aaron

Play Episode Listen Later Dec 4, 2019 11:14


Today was a bit of a challenge, but I did find it fun trying to break down and explain Random Forest. I hope you enjoy the episode, and as always feel free to drop an email at timicode54@gmail.com if you have any further questions you wish to ask me. --- Send in a voice message: https://anchor.fm/delicatedatabase/message

Machine learning
Random forest and linear regression and gradient boost in python

Machine learning

Play Episode Listen Later Aug 22, 2019 18:28


Thoughts on these three classifiers

The Banana Data Podcast
Prioritizing training data, model interpretability, and dodging an AI Winter

The Banana Data Podcast

Play Episode Listen Later Aug 16, 2019 27:19


This episode, Triveni and Will tackle the value, ethics, and methods of good labeled data, while also weighing the need for model interpretability and the possibility of an impending AI winter. Triveni will also take us through a step-by-step of the decisions made by a Random Forest algorithm. As always, be sure to rate and subscribe! Be sure to check out the articles we mentioned this week: The Side of Machine Learning You're Undervaluing and How to Fix it by Matt Wilder (LabelBox) The Hidden Costs of Automated Thinking by Jonathan Zittrain (The New Yorker) Another AI Winter Could Usher in a Dark Period for Artificial Intelligence by Eleanor Cummins (PopSci)

Machine learning
Ml feature discovery using random forest classifier

Machine learning

Play Episode Listen Later Jun 18, 2019 35:03


Sven explains how he discovered and engineered the features for his computational model for metal stability prediction

The InfoQ Podcast
Megan Cartwright on Building a Machine Learning MVP at an Early Stage Startup

The InfoQ Podcast

Play Episode Listen Later Jan 28, 2019 32:14


Today on the InfoQ Podcast, Wes speaks with ThirdLove's Megan Cartwright. Megan is the Director of Data Science for the personalized bra company. In the podcast, Megan first discusses why their customers need a more personal experience and how they're using technology to help. She spends quite a bit of time in the podcast discussing how the team got to an early MVP and then how they did the same for getting to an early machine learning MVP for product recommendations. In this latter part, she discusses the decisions they made on what data to use, how to get the solution into production quickly, how to update/train new models, and where they needed help. It's a real early-stage startup story of a lean team leveraging machine learning to get to a practical recommendations solution in a very short timeframe. Why listen to this podcast:
- The experience of women selecting bras is a poor one, characterized by awkward fitting experiences and an often uncomfortable product that may not even fit correctly. ThirdLove is a company built to serve this market.
- ThirdLove took a lean approach to developing their architecture. It's built with the Parse backend. They leveraged Shopify to build the site. The company's first recommender system used a rules engine embedded in the front end. After that, they moved to a machine learning MVP with a Python recommender service that used a Random Forest algorithm in SciKit-Learn.
- Despite having the data for 10 million surveys, the first algorithms only needed about 100K records to be trained. The takeaway is you don't have to have huge amounts of data to get started with machine learning.
- To initially deploy their ML solution, ThirdLove first shadowed all traffic through the algorithm and then compared it to what was being output by the rules engine. Using this along with information on the full customer order lifecycle, they validated that the ML solution worked correctly and outperformed the rules engine.
- ThirdLove's machine learning story shows that you can move towards a machine learning solution quickly by leveraging your own network and using tools that may already be familiar to your team.
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2G9RnQn You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq Subscribe: www.youtube.com/infoq Like InfoQ on Facebook: bit.ly/2jmlyG8 Follow on Twitter: twitter.com/InfoQ Follow on LinkedIn: www.linkedin.com/company/infoq Check the landing page on InfoQ: https://bit.ly/2G9RnQn
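
The episode doesn't show code, but the shape of the described MVP (a scikit-learn Random Forest trained on survey answers to recommend a size) can be sketched. Every column name and value below is hypothetical, invented purely for illustration:

```python
# Hypothetical sketch of a survey-based size recommender in the style
# described in the episode. All column names and data are made up.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Pretend survey data: answers in, previously purchased size out.
surveys = pd.DataFrame({
    "band_measurement": [34, 36, 32, 38, 34, 36],
    "current_size_too_tight": [1, 0, 1, 0, 0, 1],
    "straps_slip": [0, 1, 0, 1, 0, 0],
    "size_label": ["34B", "36C", "32D", "38B", "34C", "36B"],
})
X = surveys.drop(columns="size_label")
y = surveys["size_label"]

model = RandomForestClassifier(random_state=0).fit(X, y)

# Recommend a size for one new (hypothetical) survey response.
new_survey = pd.DataFrame([{"band_measurement": 36,
                            "current_size_too_tight": 1,
                            "straps_slip": 0}])
print(model.predict(new_survey))
```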

Everyone Has A Podcast
Random Forest Buns

Everyone Has A Podcast

Play Episode Listen Later Sep 13, 2018 56:41


PREORDER “THE STORY BEHIND BOOK” ON AMAZON at www.ehappodcast.com/emily. It makes a great gift for the holiday season! If you want to stay connected with Adam and Bryon, you can like our Facebook page www.facebook.com/ehappodcast. If you want to engage with us on Facebook, feel free to join our Facebook group www.facebook.com/groups/ehappodcast. You can also follow us on Twitter and Instagram @ehappodcast. Feel free to check out our website www.ehappodcast.com seeing as how you’re becoming mildly obsessed with us. You can contact Adam and Bryon via email at ehappodcast@gmail.com. If you feel like supporting the show, you can buy a t-shirt from our Teepublic store at www.ehappodcast.com/store. If you don’t like wearing clothes and want exclusive content, you can support us on Patreon for the price of a $1 cup of coffee at: www.patreon.com/ehap
Music:
Intro Song: “Kingdom in the Clouds”
Written by Adam Boutilier
Performed by Chris Layes and Adam Boutilier
Guitarist: Chris Layes
Clapping: Chris Layes
Outro music: “EHAP Outro 2018”
Created by: Adam Boutilier using Logic Pro.
Assistant to Mr. Depp: Daniel Repholz
This week’s Chris Pick: Where They Wander by HorrorPops
If you enjoy the music on the show and happen to be an Apple Music subscriber, be sure to subscribe to our ever-growing Apple Music playlist. You can check that bad daddy out right here: https://itunes.apple.com/ca/playlist/everyone-has-a-podcast/pl.u-eaqfK2PEEq
Any music used in the ‘Chris Pick’ segment is for entertainment and educational purposes only. All works belong to their original owners and are used solely for the promotion of the artists. If you enjoy the music used in this segment we strongly encourage you to purchase it and support the artists. All music used in this show has been purchased digitally from iTunes prior to use. 2018 © Everyone Has A Podcast

Fantasy Toolz Podcast
Episode 2.25 - Ensemble Learners

Fantasy Toolz Podcast

Play Episode Listen Later Aug 14, 2018 32:52


2.25 visits DEF CON (0:40), weighs a steam locomotive argument (1:56), bats around some baseball topics (4:08), introduces the week’s algorithm challenge topic: Random Forests (5:23), applies the Random Forest algorithm to TGFBI data (9:51), applies the Random Forest algorithm to closers (16:36), ponders the home league (24:37), punts some football topics (26:16), and reviews Fight Club (28:12).

Trending In Education
World Cup 2018 - Mystic Animal versus AI Predictions and the Beautiful Game - Trending in Education - Episode 97

Trending In Education

Play Episode Listen Later Jun 19, 2018 27:55


Brandon, Dan, and Mike jump onto the pitch with this look at the 2018 World Cup. We talk predictions from mystical animals as well as artificial intelligence. Does Mystic Marcus the Pig have the final four teams locked up, or do AI's 100,000 simulations prove Spain and Germany will be the last teams standing? We touch on Random Forest simulations and Poisson distributions as we explore the various ways in which humans make predictions. And we wrap it all up with a quick dive into a great set of resources from TheirWorld.Org which breaks down the teams and their respective countries based on relevant educational statistics. All that and more on the latest Trending In Education.
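The Poisson approach discussed here is easy to sketch: treat each team's goal count as a Poisson random variable with a team-strength rate and simulate many matches. The rates below are made up for illustration, not taken from the models in the episode.

```python
# Minimal sketch of Poisson-style match prediction: simulate goal counts
# for two teams and estimate win/draw probabilities by Monte Carlo.
import numpy as np

rng = np.random.default_rng(42)
lam_spain, lam_germany = 1.6, 1.8   # hypothetical expected goals per match

sims = 100_000  # same order of magnitude as the simulations discussed
spain_goals = rng.poisson(lam_spain, sims)
germany_goals = rng.poisson(lam_germany, sims)

print("P(Spain win):  ", np.mean(spain_goals > germany_goals))
print("P(draw):       ", np.mean(spain_goals == germany_goals))
print("P(Germany win):", np.mean(spain_goals < germany_goals))
```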

Fatal Error
47. Strange Loop

Fatal Error

Play Episode Listen Later Oct 23, 2017 25:20


Soroush interviews Chris about his experience at this year’s Strange Loop conference.
Strange Loop
Strange Loop Schedule (currently showing the 2017 schedule)
Alex Miller
"Just-So Stories For AI: Explaining Black-Box Predictions" by Sam Ritchie
Decision Tree Learning
Random Forest
""It Me": Under the Hood of Web Authentication" by Yan Zhu, Garrett Robinson
Lito Nikolai
"Level Up Your Concurrency Skills With Rust" by David Sullins
Swift Ownership Manifesto
City Museum
"To Serve the People: Public Interest Technologists" by Matt Mitchell
"Redux: Architecting and Scaling a New Web App at the NY Times" by Juan Carlos Montemayor Elosua
"The Holy Grail of Systems Analysis: From What to Where to Why" by Daniel Spoonhower
"Biomaterials as UI" by Ruthie Nachmany
Talks Chris hasn’t watched yet, but wants to:
"Keeping Time in Real Systems" by Kavya Joshi
"Stop Rate Limiting! Capacity Management Done Right" by Jon Moore
"Dependent Types in Haskell" by Stephanie Weirich
"Observability for Emerging Infra: What Got You Here Won't Get You There" by Charity Majors
"The Security of Classic Game Consoles" by Kevin Shekleton
"Key to the City: Writing Code to Induce Social Change" by Jurnell Cockhren
"The Future is Now" by Rachel White
"Experimental Creative Writing with the Vectorized Word" by Allison Parrish
"Antics, drift, and chaos" by Lorin Hochstein
"Lazy Defenses: Using Scaled TTLs to Keep Your Cache Correct" by Bonnie Eisenman
"Promise and Pitfalls of Persistent Memory" by Rob Dickinson
Pre-Show:
Chris’s Aircraft Radar Alexa skill
Selfridge Air National Guard Base
Yankee Air Museum (Ypsilanti, MI)
Get a new Fatal Error episode every week by becoming a supporter at patreon.com/fatalerror.

Nourish Balance Thrive
How to Reverse Insulin Resistant Type Two Diabetes in 100 Million People in Less Than 10 Years

Nourish Balance Thrive

Play Episode Listen Later Sep 16, 2017 62:48


For decades we’ve heard that diabetes prevention is simple—lose weight, eat less, and exercise more. But something is wrong with the conventional wisdom. Nearly 115 million people live with either diabetes or prediabetes in the United States, and that number is growing. It is time to reverse this trend. Virta was founded in 2014 with the goal of reversing diabetes in 100 million people by 2025. They have made this possible through advancements in the science of nutritional biochemistry and technology that is changing the diabetes care model. James McCarter, MD, PhD, is Head of Research at Virta, and in this interview, Dr. McCarter explains how Virta is using a combination of a very low-carb, ketogenic diet together with 1-on-1 health coaches and some sophisticated machine learning techniques to predict sentiment in natural language and spot anomalies in blood biomarkers. After the recording was made, Dr. McCarter realised that he was off by about a decade on Joslin. Rather than the 1920s, Dr. Elliott Joslin actually began keeping a diabetes registry early in the 20th century and published The Treatment of Diabetes Mellitus in 1917. “Joslin carried out extensive metabolic balance studies examining fasting and feeding in patients with varying severities of diabetes. His findings would help to validate the observations of Frederick Madison Allen regarding the benefit of carbohydrate- and calorie-restricted diets.” Here’s the outline of this interview with James McCarter, MD, PhD: [00:01:00] Divergence, Inc. [00:01:43] Presentation: The Effects of a Year in Ketosis with James McCarter, MD, PhD at the Quantified Self Conference and Exposition. [00:02:44] Books by Gary Taubes. [00:03:13] Omega 3:6 ratios. [00:05:54] Rapeseed and Canola. [00:06:44] Wild Planet sardines. [00:07:11] The Virta story. [00:07:18] Sami Inkinen. [00:07:38] Study: SD. Phinney, BR. Bistrian, WJ. Evans, E. Gervino, GL. Blackburn, The human metabolic response to chronic ketosis without caloric restriction: preservation of submaximal exercise capability with reduced carbohydrate oxidation. Metabolism, volume 32, issue 8, pages 769-76, Aug 1983, PMID 6865776. [00:08:48] Jeff Volek, PhD, RD on PubMed. [00:09:51] Fear of fat. [00:10:13] USDA dietary guidelines. [00:12:59] The goal is to reverse T2D in 100M people. [00:14:09] Study: NCD Risk Factor Collaboration (NCD-RisC). Worldwide trends in diabetes since 1980: a pooled analysis of 751 population-based studies with 4·4 million participants. Lancet (London, England). 2016;387(10027):1513-1530. doi:10.1016/S0140-6736(16)00618-8. [00:14:29] Joslin Diabetes Center. [00:16:37] The causes of T2D. [00:17:35] Calories are now more accessible. [00:18:22] Sugar and refined carbohydrate intake. [00:20:26] Prerequisites for the Virta program. [00:22:19] Telemedicine, health coaches, online nutrition and behaviour education, biometric feedback, peer community. [00:23:53] Getting off meds. [00:24:50] HbA1C > 6 or glucose > 120 mg/dL. [00:25:32] Purdue University. [00:26:28] Podcast: Econtalk: Mark Warshawsky on Compensation, Health Care Costs, and Inequality. [00:29:02] Study: American Diabetes Association. Economic Costs of Diabetes in the U.S. in 2012. Diabetes Care. 2013;36(4):1033-1046. doi:10.2337/dc12-2625. [00:29:27] Study: McKenzie AL, Hallberg SJ, Creighton BC, Volk BM, Link TM, Abner MK, Glon RM, McCarter JP, Volek JS, Phinney SD. A Novel Intervention Including Individualized Nutritional Recommendations Reduces Hemoglobin A1c Level, Medication Use, and Weight in Type 2 Diabetes.
JMIR Diabetes. 2017;2(1):e5. [00:30:45] Discontinuing 2/3 of the meds. [00:32:54] Health coaching. [00:34:18] Behaviour change. [00:35:30] Biometrics, blood BHB. [00:38:10] Reducing blood pressure and CRP. [00:38:30] Study: Youm, Yun-Hee, et al. "The ketone metabolite [beta]-hydroxybutyrate blocks NLRP3 inflammasome-mediated inflammatory disease." Nature medicine 21.3 (2015): 263-269. [00:39:49] Blood levels of BHB and weight loss. [00:41:36] STEM-Talk #43: Jeff Volek Explains the Power of Ketogenic Diets to Reverse Type 2 Diabetes. [00:43:33] Machine learning. [00:45:57] The team at Virta, including Nasir Bhanpuri, Catalin Voss and Jackie Lee. See the article Will robots inherit the world of healthcare? for links to their talks. [00:46:49] Random Forest. [00:47:06] Nourish Balance Thrive 7-Minute Analysis. [00:48:05] Natural Language Processing. [00:48:57] Nourish Balance Thrive Highlights email series. [00:50:26] Finding purpose in your work. [00:51:59] Using machine learning to change behaviour. [00:53:25] Book: Hooked: How to Build Habit-Forming Products by Nir Eyal. [00:54:11] Podcast: How to Avoid the Cognitive Middle Gear with James Hewitt. [00:55:37] $400 per month for one year. [00:57:58] Blog post: Does Your Thyroid Need Dietary Carbohydrates? by Stephen Phinney, MD, PhD. [01:00:21] Article: Understanding Local Control of Thyroid Hormones (Deiodinases Function and Activity) and Podcast: The Most Reliable Way to Lose Weight with Dr. Tommy Wood. [01:02:12] Podcast: How Busy Realtors Can Avoid Anxiety and Depression Without Prescriptions or the Help of a Doctor with Douglas Hilbert.

Linear Digressions
Ensemble Algorithms

Linear Digressions

Play Episode Listen Later Jan 22, 2017 13:08


If one machine learning model is good, are two models better? In a lot of cases, the answer is yes. If you build many OK models, and then bring them all together and use them in combination to make your final predictions, you've just created an ensemble model. It feels a little bit like cheating, like you just got something for nothing, but the results don't lie: algorithms like Random Forests and Gradient Boosting Trees (two types of ensemble algorithms) are some of the strongest out-of-the-box algorithms for classic supervised classification problems. What makes a Random Forest random, and what does it mean to gradient boost a tree? Have a listen and find out.
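The episode's core claim is easy to check yourself. Here is a small sketch, on synthetic data, comparing a single decision tree against the two ensemble methods named above; the dataset and parameters are arbitrary choices for illustration.

```python
# Compare one decision tree against two ensembles on a synthetic task:
# ensembles of "ok" models typically beat a single model out of the box.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(n_estimators=200, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    print(type(model).__name__, scores.mean().round(3))
```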

Nourish Balance Thrive
How to Teach Machines That Can Learn

Nourish Balance Thrive

Play Episode Listen Later Dec 8, 2016 57:47


Machine learning is fast becoming a part of our lives, from the order in which your search results and news feeds are ranked to the image classifiers and speech recognition features on your smartphone. Machine learning may even have had a hand in choosing your spouse or driving you to work. As with cars, only the mechanics need to understand what happens under the hood, but all drivers need to know how to operate the steering wheel. Listen to this podcast to learn how to interact with machines that can learn, and about the implications for humanity. My guest is Dr. Pedro Domingos, Professor of Computer Science at the University of Washington. He is the author or co-author of over 200 technical publications in machine learning and data mining, and the author of my new favourite book The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Here’s the outline of this interview with Dr. Pedro Domingos, PhD: [00:01:55] Deep Learning. [00:02:21] Machine learning is affecting everyone's lives. [00:03:45] Recommender systems. [00:03:57] Ordering newsfeeds. [00:04:25] Text prediction and speech recognition in smart phones. [00:04:54] Accelerometers. [00:04:54] Selecting job applicants. [00:05:05] Finding a spouse. [00:05:35] OKCupid.com. [00:06:49] Robot scientists. [00:07:08] Artificially-intelligent Robot Scientist ‘Eve’ could boost search for new drugs. [00:08:38] Cancer research. [00:10:27] Central dogma of molecular biology. [00:10:34] DNA microarrays. [00:11:34] Robb Wolf at IHMC: Darwinian Medicine: Maybe there IS something to this evolution thing. [00:12:29] It costs more to find the data than to do the experiment again (ref?) [00:13:11] Making connections people could never make. [00:14:00] Jeremy Howard’s TED talk: The wonderful and terrifying implications of computers that can learn. [00:14:14] Pedro's TED talk: The Quest for the Master Algorithm. [00:15:49] Craig Venter: your immune system on the Internet. [00:16:44] Continuous blood glucose monitoring and Heart Rate Variability. [00:17:41] Our data: DUTCH, OAT, stool, blood. [00:19:21] Supervised and unsupervised learning. [00:20:11] Clustering and dimensionality reduction, e.g. PCA and t-SNE. [00:21:44] Sodium to potassium ratio versus cortisol. [00:22:24] Eosinophils. [00:23:17] Clinical trials. [00:24:35] Tetiana Ivanova - How to become a Data Scientist in 6 months: a hacker’s approach to career planning. [00:25:02] Deep Learning Book. [00:25:46] Maths as a barrier to entry. [00:27:09] Andrew Ng Coursera Machine Learning course. [00:27:28] Pedro's Data Mining course. [00:27:50] Theano and Keras. [00:28:02] State Farm Distracted Driver Detection Kaggle competition. [00:29:37] Nearest Neighbour algorithm. [00:30:29] Driverless cars. [00:30:41] Is a robot going to take my job? [00:31:29] Jobs will not be lost, they will be transformed. [00:33:14] Automate your job yourself! [00:33:27] Centaur chess player. [00:35:32] ML is like driving, you can only learn by doing it. [00:35:52] A Few Useful Things to Know about Machine Learning. [00:37:00] Blood chemistry software. [00:37:30] We are the owners of our data. [00:38:49] Data banks and unions. [00:40:01] The distinction with privacy. [00:40:29] An ethical obligation to share. [00:41:46] Data vulcanisation. [00:42:40] Teaching the machine. [00:43:07] Chrome incognito mode. [00:44:13] Why can't we interact with the algorithm? [00:45:33] New P2 Instance Type for Amazon EC2 – Up to 16 GPUs. [00:46:01] Why now? [00:46:47] Research breakthroughs. 
[00:47:04] The amount of data. [00:47:13] Hardware. [00:47:31] GPUs, Moore’s law. [00:47:57] Economics. [00:48:32] Google TensorFlow. [00:49:05] Facebook Torch. [00:49:38] Recruiting. [00:50:58] The five tribes of machine learning: evolutionaries, connectionists, Bayesians, analogizers, symbolists. [00:51:55] Grand unified theory of ML. [00:53:40] Decision tree ensembles (Random Forests). [00:53:45] XGBoost. [00:53:54] Weka. [00:54:21] Alchemy: Open Source AI. [00:56:16] Still do a computer science degree. [00:56:54] Minor in probability and statistics.

Data Skeptic
[MINI] The Bootstrap

Data Skeptic

Play Episode Listen Later Nov 25, 2016 10:37


The Bootstrap is a method of resampling a dataset to possibly refine its accuracy and produce useful metrics on the result. The bootstrap is a useful statistical technique and is leveraged in bagging (bootstrap aggregation) algorithms such as Random Forest. We discuss this technique as it relates to polling and surveys.
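The polling application mentioned here fits in a few lines. The sketch below uses a simulated poll; the resampling logic is the bootstrap itself.

```python
# Bootstrap a poll: resample the observed responses with replacement many
# times to get a confidence interval on the estimated support, with no
# parametric assumptions.
import numpy as np

rng = np.random.default_rng(0)
poll = rng.binomial(1, 0.52, size=1000)   # toy poll: 1 = supports candidate

estimates = [rng.choice(poll, size=poll.size, replace=True).mean()
             for _ in range(10_000)]
lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"support ~ {poll.mean():.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```

The same resample-with-replacement step, applied per tree, is what bagging algorithms such as Random Forest build on.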

Histories of Data and the Database
Random Forests and Decision Trees: Machine Learning, Empirical Statistics, and the Challenge of Interpretability

Histories of Data and the Database

Play Episode Listen Later Nov 19, 2016 39:01


Matthew Jones from Columbia University delivers a talk titled “Random Forests and Decision Trees: Machine Learning, Empirical Statistics, and the Challenge of Interpretability.” This talk was included in the session titled “Methods and Ambiguities in the Contemporary Age.” Part of “Histories of Data and the Database,” a conference held at The Huntington Nov. 18–19, 2016.

Archaeology Conferences
0040 - GBAC 2016 - Meg Tracy - Modeling Human Locational Behavior

Archaeology Conferences

Play Episode Listen Later Oct 7, 2016 12:45


Models were developed to predict the spatial distribution of prehistoric archaeological site potential in the Sawtooth National Forest. Archaeological data and environmental parameters were collected and processed in a GIS. Predictor variables were evaluated to discover correlates with human locational behavior and compared against a control dataset. Three modeling methods were used: Logistic Regression, Regression Tree, and Random Forest. These models were assessed for efficacy using k-fold cross-validation and gain statistics. Although the observed relationships could result from biases in the archaeological data and predictors, the results suggest a strong correlation between environment and prehistoric site location.
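The three-model comparison described in the talk follows a standard pattern. Below is a minimal sketch of it with synthetic stand-ins for the environmental predictors; the talk's gain statistics are domain-specific, so ROC AUC is used here instead.

```python
# Compare the three model families from the talk with k-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for site/non-site locations with environmental predictors.
X, y = make_classification(n_samples=1500, n_features=8, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Regression Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=1),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f}")
```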

Data Skeptic
[MINI] Random Forest

Data Skeptic

Play Episode Listen Later Oct 7, 2016 12:43


Random forest is a popular ensemble learning algorithm which leverages bagging both for sampling and feature selection. In this episode we make an analogy to the process of running a bookstore.
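The two sources of randomness described in the episode can be shown in a bare-bones sketch: each tree sees a bootstrap sample of the rows, and only a random subset of features is considered at each split.

```python
# A minimal hand-rolled random forest: bootstrap rows per tree, random
# feature subsets per split (via max_features), majority vote at the end.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    rows = rng.integers(0, len(X), size=len(X))         # bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subsets
    trees.append(tree.fit(X[rows], y[rows]))

# Majority vote across the forest.
votes = np.mean([t.predict(X) for t in trees], axis=0)
print("training accuracy of the ensemble:", np.mean((votes > 0.5) == y))
```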

Medizinische Fakultät - Digitale Hochschulschriften der LMU - Teil 16/19

Assessing the health of populations is important for various reasons, especially for health policy purposes. Therefore, there exists a substantial need for health comparisons between populations, including the comparison of individuals, groups of persons, or even populations from different countries, at one point in time and over time. Two fundamentally different approaches exist to assess the health of populations. The first approach relies on indirect measures of health, which are based on mortality and morbidity statistics, and which are therefore only available at the population level. The second approach relies on direct measures of health, which are collected – based on health surveys – at the individual level. Based on the needs for comparisons, indirect measures appear to be less appropriate, as they are only available at the population level, but not at the individual or group level. Direct measures, however, are originally obtained at the individual level, and can then be aggregated to any group level, even to the population level. Therefore, direct measures seem to be more appropriate for these comparison purposes. The open question is then how to compare overall health based on data collected within health surveys. At first glance, a single general health question seems to be appealing. However, studies have shown that this kind of question is not appropriate to compare health over time, nor across populations. Qualitative studies found that respondents even consider very different aspects of health when responding to such a question. A more appropriate approach seems to be the use of data on several domains of health, as for example mobility, self-care and pain. Anyway, measuring health based on a set of domains is an extremely frequent approach. It provides more comprehensive information and can therefore be used for a wider range of possible applications. However, three open questions must be addressed when measuring health based on a set of domains. First, a parsimonious set of domains must be selected. Second, health measurement based on this set of domains must be operationalized in a standardized way. Third, this information must be aggregated into a summary measure of health, thereby taking into account that categorical responses to survey questions could be differently interpreted by respondents, and are not necessarily directly comparable. These open questions are addressed in this doctoral thesis. The overall objective of this doctoral thesis is to develop a valid, reliable and sensitive metric of health – based on data collected on a set of domains – that permits to monitor the health of populations over time, and which provides the basis for the comparisons of health across different populations. To achieve this aim two psychometric studies were carried out, entitled “Towards a Minimal Generic Set of Domains” and “Development of a metric of health”. In the first study a minimal generic set of domains suitable for measuring health both in the general population and in clinical populations was identified, and contrasted to the domains of the World Health Survey (WHS). The eight domains of the WHS – mobility, self-care, pain and discomfort, cognition, interpersonal activities, vision, sleep and energy, and affect – were used as a reference, as this set – developed by the World Health Organization (WHO) – so far constitutes the most advanced proposal of what to measure for international health comparisons. 
To propose the domains for the minimal generic set, two different regression methodologies – Random Forest and Group Lasso – were applied for the sake of robustness to three different data sources, two national general population surveys and one large international clinical study: the German National Health Interview and Examination Survey 1998, the United States National Health and Nutrition Examination Survey 2007/2008, and the ICF Core Set studies. A domain was selected when it was sufficiently explanatory for self-perceived health. Based on the analyses the following set of domains, systematically named based on their respective categories within the International Classification of Functioning, Disability and Health (ICF), was proposed as a minimal generic set:
b130 Energy and drive functions
b152 Emotional functions
b280 Sensation of pain
d230 Carrying out daily routine
d450 Walking
d455 Moving around
d850 Remunerative employment
Based on this set, four of the eight domains of the WHS were confirmed both in the general and in clinical populations: mobility, pain and discomfort, sleep and energy, and affect. The other WHS domains not represented in the proposed minimal generic set are vision, which was only confirmed with data of the general population, self-care and interpersonal activities, which were only confirmed with data of the clinical population, and cognition, which could not be confirmed at all. The ICF categories of 'carrying out daily routine' and 'remunerative employment' also fulfilled the inclusion criteria, though not directly related to any of the eight WHS domains. This minimal generic set can be used as the starting point to address one of the most important challenges in health measurement, namely the comparability of data across studies and countries. It also represents the first step for developing a common metric of health to link information from the general population to information about sub-populations, such as clinical and institutional populations, e.g. persons living in nursing homes.
The developed health metric can be seen as the starting point for a wide range of health comparisons, between individuals, groups of persons and populations as a whole, and both at one point in time and over time. It opens up a wide range of possible applications for both health care providers and health policy, and both in clinical settings and in the general population.
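The selection logic in the first study (rank candidate domains by how much they explain self-perceived health) can be sketched with Random Forest permutation importance. The survey data below is simulated and the domain names are placeholders; the Group Lasso step of the thesis has no direct scikit-learn equivalent and is omitted.

```python
# Rank candidate functioning domains by permutation importance for a
# (simulated) self-reported general health outcome.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
domains = ["pain", "mobility", "energy", "affect", "vision", "self_care"]
X = rng.integers(1, 6, size=(3000, len(domains)))            # 1-5 Likert answers
y = (X[:, 0] + X[:, 1] + rng.normal(0, 2, 3000) > 6).astype(int)  # toy SRGH

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
for d, m in sorted(zip(domains, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{d:10s} {m:.3f}")
```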

Medizin - Open Access LMU - Teil 22/22
Towards a minimal generic set of domains of functioning and health

Medizin - Open Access LMU - Teil 22/22

Play Episode Listen Later Jan 1, 2014


Background: The World Health Organization (WHO) has argued that functioning, and, more concretely, functioning domains constitute the operationalization that best captures our intuitive notion of health. Functioning is, therefore, a major public-health goal. A great deal of data about functioning is already available. Nonetheless, it is not possible to compare and optimally utilize this information. One potential approach to address this challenge is to propose a generic and minimal set of functioning domains that captures the experience of individuals and populations with respect to functioning and health. The objective of this investigation was to identify a minimal generic set of ICF domains suitable for describing functioning in adults at both the individual and population levels. Methods: We performed a psychometric study using data from: 1) the German National Health Interview and Examination Survey 1998, 2) the United States National Health and Nutrition Examination Survey 2007/2008, and 3) the ICF Core Set studies. Random Forests and Group Lasso regression were applied using one self-reported general-health question as a dependent variable. The domains selected were compared to those of the World Health Survey (WHS) developed by the WHO. Results: Seven domains of the International Classification of Functioning, Disability and Health (ICF) are proposed as a minimal generic set of functioning and health: energy and drive functions, emotional functions, sensation of pain, carrying out daily routine, walking, moving around, and remunerative employment. The WHS domains of self-care, cognition, interpersonal activities, and vision were not included in our selection. Conclusions: The minimal generic set proposed in this study is the starting point to address one of the most important challenges in health measurement - the comparability of data across studies and countries. It also represents the first step in developing a common metric of health to link information from the general population to information about sub-populations, such as clinical and institutionalized populations.

Medizin - Open Access LMU - Teil 22/22
Biological health or lived health: which predicts self-reported general health better?

Medizin - Open Access LMU - Teil 22/22

Play Episode Listen Later Jan 1, 2014


Background: Lived health is a person's level of functioning in his or her current environment and depends both on the person's environment and biological health. Our study addresses the question whether biological health or lived health is more predictive of self-reported general health (SRGH). Methods: This is a psychometric study using cross-sectional data from the Spanish Survey on Disability, Independence and Dependency Situation. Data was collected from 17,739 people in the community and 9,707 from an institutionalized population. The following analysis steps were performed: (1) a biological health and a lived health score were calculated for each person by constructing a biological health scale and a lived health scale using Samejima's Graded Response Model; and (2) variable importance measures were calculated for each study population using Random Forest, with SRGH as the dependent variable and the biological health and the lived health scores as independent variables. Results: The levels of biological health were higher for the community-dwelling population than for the institutionalized population. When technical assistance, personal assistance or both were received, the difference in lived health between the community-dwelling population and institutionalized population was smaller. According to Random Forest's variable importance measures, for both study populations, lived health is a more important predictor of SRGH than biological health. Conclusions: In general, people base their evaluation of their own health on their lived health experience rather than their experience of biological health. This study also sheds light on the challenges of assessing biological health and lived health at the general population level.

Técnicas Estadísticas en Análisis de Mercados (umh 1480) Curso 2012 - 2013
umh1480 2012-13 Lec026 Practicas Alumnos Random Forest

Técnicas Estadísticas en Análisis de Mercados (umh 1480) Curso 2012 - 2013

Play Episode Listen Later Jul 15, 2013 15:45


Student practical: Random Forest. Course: Statistical Techniques in Market Research (Técnicas Estadísticas en Investigación de Mercados). Degree in Business Statistics. Lecturer: Xavier Barber i Vallés. Department of Statistics, Mathematics and Computer Science. Area of Statistics and Operations Research. PLE 2013 Project. Universidad Miguel Hernández de Elche. A student practical demonstrating how to run a Random Forest in R-Commander.

StatLearn 2013 - Workshop on
Clustering of variables combined with variable selection using random forests: application to gene expression data (Robin Genuer & Vanessa Kuentz-Simonet)

StatLearn 2013 - Workshop on "Challenging problems in Statistical Learning"

Play Episode Listen Later May 16, 2013 55:48


The main goal of this work is to tackle the problem of dimension reduction for high-dimensional supervised classification. The motivation is to handle gene expression data. The proposed method works in two steps. First, one eliminates redundancy using clustering of variables, based on the R package ClustOfVar. This first step is based only on the explanatory variables (genes). Second, the synthetic variables (summarizing the clusters obtained at the first step) are used to construct a classifier (e.g. logistic regression, LDA, random forests). We stress that the first step reduces the dimension and gives linear combinations of original variables (synthetic variables). This step can be considered as an alternative to PCA. A selection of predictors (synthetic variables) in the second step gives a set of relevant original variables (genes). Numerical performances of the proposed procedure are evaluated on gene expression datasets. We compare our methodology with LASSO and sparse PLS discriminant analysis on these datasets.
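A rough Python analogue of the two-step procedure is sketched below. The original work uses the R package ClustOfVar; FeatureAgglomeration is used here only as a stand-in for the variable-clustering step, and the data is synthetic.

```python
# Two-step pipeline: cluster correlated variables into synthetic features,
# then fit a classifier on the cluster summaries.
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic "gene expression"-like data: many redundant columns.
X, y = make_classification(n_samples=300, n_features=500, n_informative=20,
                           n_redundant=100, random_state=0)

pipe = make_pipeline(
    FeatureAgglomeration(n_clusters=30),  # step 1: synthetic variables
    LogisticRegression(max_iter=2000),    # step 2: classifier on summaries
)
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean().round(3))
```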

Mathematik, Informatik und Statistik - Open Access LMU - Teil 02/03
Variable selection with Random Forests for missing data

Mathematik, Informatik und Statistik - Open Access LMU - Teil 02/03

Play Episode Listen Later Jan 15, 2013


Variable selection has been suggested for Random Forests to improve their efficiency of data prediction and interpretation. However, its basic element, i.e. variable importance measures, cannot be computed straightforwardly when there is missing data. Therefore an extensive simulation study has been conducted to explore possible solutions, i.e. multiple imputation, complete case analysis and a newly suggested importance measure, for several missing-data generating processes. The ability to distinguish relevant from non-relevant variables has been investigated for these procedures in combination with two popular variable selection methods. Findings and recommendations: Complete case analysis should not be applied, as it led to inaccurate variable selection and models with the worst prediction accuracy. Multiple imputation is a good means to select variables that would be of relevance in fully observed data, and it produced the best prediction accuracy. By contrast, the application of the new importance measure causes a selection of variables that reflects the actual data situation, i.e. one that takes the occurrence of missing values into account. Its error was only negligibly worse compared to imputation.
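The complete-case versus imputation contrast studied here is easy to reproduce in miniature. The paper evaluates multiple imputation; the sketch below uses simple mean imputation only to keep the code short, on simulated data.

```python
# Inject missing values, then compare complete-case analysis against
# imputation for Random Forest prediction accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.1] = np.nan   # 10% missing completely at random

# Complete-case analysis: drop every row containing a missing value.
cc = ~np.isnan(X_miss).any(axis=1)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
print("complete cases:", cc.sum(),
      "accuracy:", cross_val_score(rf, X_miss[cc], y[cc], cv=5).mean().round(3))

# Imputation keeps all rows.
pipe = make_pipeline(SimpleImputer(strategy="mean"), rf)
print("imputed accuracy:", cross_val_score(pipe, X_miss, y, cv=5).mean().round(3))
```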

Medizin - Open Access LMU - Teil 21/22
An AUC-based permutation variable importance measure for random forests

Medizin - Open Access LMU - Teil 21/22

Play Episode Listen Later Jan 1, 2013


Background: The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge, the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. Results: We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings while both permutation VIMs have equal performance for balanced data settings. Conclusions: The standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The codes implementing our study are available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html.
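The reference implementation is in the R package party, but the core idea is close in spirit to scikit-learn's permutation importance scored with ROC AUC: for each variable, measure how much AUC drops when that variable is permuted. A sketch on simulated unbalanced data:

```python
# Contrast an error-based and an AUC-based permutation importance on
# unbalanced data (about 5% positives).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
for name, scoring in [("error-based", "accuracy"), ("AUC-based", "roc_auc")]:
    imp = permutation_importance(rf, X_te, y_te, scoring=scoring,
                                 n_repeats=10, random_state=0)
    print(name, imp.importances_mean.round(3))
```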

Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU - Teil 01/02

Random Forests are widely used for data prediction and interpretation purposes. They show many appealing characteristics, such as the ability to deal with high-dimensional data, complex interactions and correlations. Furthermore, missing values can easily be processed by the built-in procedure of surrogate splits. However, there is only little knowledge about the properties of recursive partitioning in missing data situations. Therefore, extensive simulation studies and empirical evaluations have been conducted to gain deeper insight. In addition, new methods have been developed to enhance methodology and solve current issues of data interpretation, prediction and variable selection. A variable’s relevance in a Random Forest can be assessed by means of importance measures. Unfortunately, existing methods cannot be applied when the data contain missing values. Thus, one of the most appreciated properties of Random Forests – its ability to handle missing values – gets lost for the computation of such measures. This work presents a new approach that is designed to deal with missing values in an intuitive and straightforward way, yet retains widely appreciated qualities of existing methods. Results indicate that it meets sensible requirements and shows good variable ranking properties. Random Forests provide variable selection that is usually based on importance measures. An extensive review of corresponding literature led to the development of a new approach that is based on a profound theoretical framework and meets important statistical properties. A comparison to another eight popular methods showed that it controls the test-wise and family-wise error rate, provides a higher power to distinguish relevant from non-relevant variables and leads to models located among the best performing ones. Alternative ways to handle missing values are the application of imputation methods and complete case analysis. Yet it is unknown to what extent these approaches are able to provide sensible variable rankings and meaningful variable selections. Investigations showed that complete case analysis leads to inaccurate variable selection as it may inappropriately penalize the importance of fully observed variables. By contrast, the new importance measure decreases for variables with missing values and therefore causes selections that accurately reflect the information given in actual data situations. Multiple imputation leads to an assessment of a variable’s importance and to selection frequencies that would be expected for data that was completely observed. In several performance evaluations the best prediction accuracy emerged from multiple imputation, closely followed by the application of surrogate splits. Complete case analysis clearly performed worst.

Medizin - Open Access LMU - Teil 17/22
AUC-RF: A New Strategy for Genomic Profiling with Random Forest

Medizin - Open Access LMU - Teil 17/22

Play Episode Listen Later Jan 1, 2011


Objective: Genomic profiling, the use of genetic variants at multiple loci simultaneously for the prediction of disease risk, requires the selection of a set of genetic variants that best predicts disease status. The goal of this work was to provide a new selection algorithm for genomic profiling. Methods: We propose a new algorithm for genomic profiling based on optimizing the area under the receiver operating characteristic curve (AUC) of the random forest (RF). The proposed strategy implements a backward elimination process based on the initial ranking of variables. Results and Conclusions: We demonstrate the advantage of using the AUC instead of the classification error as a measure of predictive accuracy of RF. In particular, we show that the use of the classification error is especially inappropriate when dealing with unbalanced data sets. The new procedure for variable selection and prediction, namely AUC-RF, is illustrated with data from a bladder cancer study and also with simulated data. The algorithm is publicly available as an R package, named AUCRF, at http://cran.r-project.org/. Copyright (C) 2011 S. Karger AG, Basel
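The reference implementation is the R package AUCRF; the backward-elimination idea itself can be condensed as follows. This sketch uses cross-validated AUC rather than the package's out-of-bag AUC, and eliminates one variable per step rather than a fraction, purely to keep the code short.

```python
# AUC-guided backward elimination: rank variables by RF importance, drop
# the weakest, and keep the variable subset with the best AUC seen so far.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=30, n_informative=5,
                           random_state=0)

active = list(range(X.shape[1]))
best_auc, best_set = 0.0, active[:]
while len(active) > 2:
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    auc = cross_val_score(rf, X[:, active], y, cv=3, scoring="roc_auc").mean()
    if auc > best_auc:
        best_auc, best_set = auc, active[:]
    rf.fit(X[:, active], y)                      # rank remaining variables
    drop = int(np.argmin(rf.feature_importances_))
    active.pop(drop)                             # eliminate the weakest one

print(f"best AUC {best_auc:.3f} with {len(best_set)} variables")
```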

Medizin - Open Access LMU - Teil 17/22
The behaviour of random forest permutation-based variable importance measures under predictor correlation

Medizin - Open Access LMU - Teil 17/22

Play Episode Listen Later Jan 1, 2010


Background: Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies where predictor correlation is frequently observed. Recent works on permutation-based variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an extended simulation study to synthesize results. Results: In the case when both predictor correlation was present and predictors were associated with the outcome (H(A)), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while under the null hypothesis that no predictors are associated with the outcome (H(0)) the unconditional RF VIM was unbiased. Conditional VIMs showed a decrease in VIM values for correlated predictors versus the unconditional VIMs under H(A) and was unbiased under H(0). Scaled VIMs were clearly biased under H(A) and H(0). Conclusions: Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are unbiased under the null hypothesis. Whether the observed increased VIMs for correlated predictors may be considered a "bias" - because they do not directly reflect the coefficients in the generating model - or if it is a beneficial attribute of these VIMs is dependent on the application. For example, in genetic association studies, where correlation between markers may help to localize the functionally relevant variant, the increased importance of correlated predictors may be an advantage. On the other hand, we show examples where this increased importance may result in spurious signals.
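The paper's central setting is easy to simulate: a block of mutually correlated predictors, of which only one has a true effect, next to an independent predictor with the same effect size. Unconditional importances tend to spread credit across the correlated block, as the quick sketch below shows on synthetic data.

```python
# Simulate correlated predictors and inspect unconditional RF importances.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
X = np.column_stack([
    z + rng.normal(0, 0.3, n),   # x0, x1, x2: mutually correlated via z
    z + rng.normal(0, 0.3, n),
    z + rng.normal(0, 0.3, n),
    rng.normal(size=n),          # x3: independent predictor
])
y = X[:, 0] + X[:, 3] + rng.normal(0, 1, n)   # x0 and x3 have equal effects

rf = RandomForestRegressor(n_estimators=300, random_state=1).fit(X, y)
# x1 and x2 have no direct effect, yet typically receive non-trivial
# importance because they proxy x0 through the shared factor z.
print("impurity importances:", rf.feature_importances_.round(3))
```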

Medizin - Open Access LMU - Teil 15/22
Conditional variable importance for random forests

Medizin - Open Access LMU - Teil 15/22

Play Episode Listen Later Jan 1, 2008


Background: Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. Results: We identify two mechanisms responsible for this finding: (i) A preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure. Conclusion: The resulting conditional variable importance reflects the true impact of each predictor variable more reliably than the original marginal approach.
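A full implementation of the conditional scheme is in the R package party; the core idea can be approximated in a few lines: instead of permuting a variable globally, permute it only within strata (here, quantile bins) of a correlated covariate, so the permutation preserves the correlation structure. The sketch below is a simplification, not the paper's exact algorithm.

```python
# Unconditional vs. (simplified) conditional permutation importance for a
# variable that is correlated with a true predictor but has no effect itself.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(0, 0.3, n)          # correlated with x0, no true effect
X = np.column_stack([x0, x1])
y = x0 + rng.normal(0, 1, n)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
base = rf.score(X, y)

def perm_importance(j, bins=None):
    Xp = X.copy()
    if bins is None:                      # unconditional: global shuffle
        Xp[:, j] = rng.permutation(Xp[:, j])
    else:                                 # conditional: shuffle within strata
        for b in np.unique(bins):
            idx = np.where(bins == b)[0]
            Xp[idx, j] = rng.permutation(Xp[idx, j])
    return base - rf.score(Xp, y)

strata = np.digitize(x0, np.quantile(x0, [0.25, 0.5, 0.75]))
print("x1 unconditional importance:", round(perm_importance(1), 3))
print("x1 conditional importance:  ", round(perm_importance(1, strata), 3))
```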

Medizin - Open Access LMU - Teil 15/22
Bias in random forest variable importance measures: Illustrations, sources and a solution

Medizin - Open Access LMU - Teil 15/22

Play Episode Listen Later Jan 1, 2007


Background: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. Results: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. Conclusion: We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.
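The bias described here is straightforward to demonstrate: place a pure-noise variable with many categories next to an informative binary variable. Impurity-based importance inflates the high-cardinality noise, while permutation importance on held-out data does not. A small sketch on simulated data (scikit-learn's forest is used here for illustration; the paper's proposed remedy is the conditional-inference forest in the R package party):

```python
# Impurity-based vs. permutation importance when predictors differ in
# their number of categories.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
informative = rng.integers(0, 2, n)          # binary, truly predictive
noise_many = rng.integers(0, 100, n)         # 100 categories, pure noise
X = np.column_stack([informative, noise_many])
y = (informative ^ (rng.random(n) < 0.2)).astype(int)  # noisy copy of x0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# The many-category noise column typically gets inflated impurity importance
# because it offers far more candidate split points.
print("impurity importances   :", rf.feature_importances_.round(3))
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation importances:", imp.importances_mean.round(3))
```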

Mathematik, Informatik und Statistik - Open Access LMU - Teil 02/03
Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution

Mathematik, Informatik und Statistik - Open Access LMU - Teil 02/03

Play Episode Listen Later Jan 1, 2006


Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale level or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale level or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analysing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.