POPULARITY
Paco Nathan is the Managing Partner at Derwen, Inc., and the author of Latent Space and other books, as well as popular videos and tutorials on machine learning, natural language, graph technologies, and related topics. MLOps podcast #201 with Paco Nathan, Managing Partner at Derwen, Inc.: Language, Graphs, and AI in Industry.
// Abstract
Let's talk about key findings from these conferences, specifically summarizing teams that have ROI on machine learning in production: what do they have in common, and what are the most important caveats they urge other teams to consider when getting started? These key takeaways aren't found in the current AI news cycle.
// MLOps Jobs board
https://mlops.pallet.xyz/jobs
// MLOps Swag/Merch
https://mlops-community.myshopify.com/
// Related Links
AI Conference: https://aiconference.com/
K1st World: https://www.k1st.world/
Corunna Innovation Summit: https://corunna.dataspartan.com/
"Cloud Computing on Amazon AWS EC2", UC Berkeley EECS guest lecture (2009): https://vimeo.com/manage/videos/3616394
"Hardware - Software - Process: Data Science in a Post-Moore's Law World": https://www.nvidia.com/en-us/ai-data-science/resources/hardware-software-process-book/
"LLMs in Production: Learning from Experience" by Waleed Kadous @ Anyscale: https://www.youtube.com/watch?v=xa7k9MUeIdk
"Supercharging Industrial Operations with Problem-Solving GenAI & Domain Knowledge" by Christopher Nguyen @ Aitomatic: https://www.k1st.world/2023-program/supercharging-industrial-operations-with-problem-solving-genai-domain-knowledge
"The Next Million AI Systems" by Mark Huang @ Gradient: https://www.youtube.com/watch?v=lA0Npe4PqFw
"AI in a Box" by Useful Sensors: https://usefulsensors.com/#products
"Opportunities in AI - 2023" by Andrew Ng: https://www.youtube.com/watch?v=5p248yoa3oE
"Advancing the Marine Industry Through the Harmony of Fishermen Knowledge and AI" by Akinori Kasai @ Furuno: https://www.k1st.world/2023-program/advancing-the-marine-industry-through-the-harmony-of-fishermen-knowledge-and-al
Macy conferences (1941-1960):
https://en.wikipedia.org/wiki/Macy_conferences
https://www.asc-cybernetics.org/foundations/history/MacySummary.htm
https://press.uchicago.edu/ucp/books/book/distributed/C/bo23348570.html
Second-order cybernetics:
https://pangaro.com/designconversation/wp-content/uploads/dubberly-pangaro-chk-journal-2015.pdf
https://en.wikipedia.org/wiki/Second-order_cybernetics
Project Cybersyn:
https://jacobin.com/2015/04/allende-chile-beer-medina-cybersyn/
https://thereader.mitpress.mit.edu/project-cybersyn-chiles-radical-experiment-in-cybernetic-socialism/
https://99percentinvisible.org/episode/project-cybersyn/
https://medium.com/@rjog/project-cybersyn-an-early-attempt-at-iot-governance-and-how-we-can-apply-its-learnings-5164be850413
https://www.sustema.com/post/project-cybersyn-how-a-chilean-government-almost-controlled-the-economy-from-a-control-room
https://transform-social.org/en/texts/cybersyn/
Humberto Maturana and Francisco Varela: Autopoiesis, "De Maquinas y Seres Vivos", "Everything said is said by an observer":
https://proyectos.yura.website/wp-content/uploads/2021/06/de_maquinas_y_seres_vivos_-_maturana.pdf
https://en.wikipedia.org/wiki/Autopoiesis_and_Cognition:_The_Realization_of_the_Living
Fernando Flores (led Project Cybersyn, was imprisoned, and later worked with Prof. Terry Winograd @ Stanford, who was the graduate advisor for the research that became Google):
https://lorenabarba.com/gallery/prof-barba-gave-keynote-at-pycon-2016/
https://conversationsforaction.com/fernando-flores
"Navigating the Risk Landscape: A Deep Dive into Generative AI" by Ben Lorica and Andrew Burt: https://thedataexchange.media/mitigating-generative-ai-risks/
"SpanMarker" by Tom Aarsen @ Hugging Face: https://tomaarsen.github.io/SpanMarkerNER/
Examples of "the math catching up with the machine learning":
Guy Van den Broeck @ UCLA: https://web.cs.ucla.edu/~guyvdb/talks/
Charles Martin @ Calculations Consulting: https://weightwatcher.ai/
This episode features an interview with Ben Lorica, Co-founder and Principal of Gradient Flow, a company that provides a wide range of content on data and technology. Ben is an industry expert on data, machine learning, and AI. He is a Technical Advisor for Databricks, a program chair for several data conferences, and the host of The Data Exchange podcast. In this episode, Sam and Ben discuss Big Data and the improvements and future opportunities of AI and machine learning.
"The reason I use the word decentralize is because when you try to explain it to someone, let's say you want to train a different model for each user, or region, or sensor, or device. So you can't necessarily just use personalized, because recommenders can be personalized, but they're still centralized models." – Ben Lorica
Episode Timestamps:
(01:17): What open source data means to Ben
(05:54): What intrigued Ben about Big Data
(12:07): What brought Ben to working on Ray
(16:15): Ben's opinion on how far AI and ML have come in the last 5 years
(26:38): What Ben sees happening in this space in the next 5 years
(39:06): What challenges Ben sees in the next 5 years
(43:51): One question Ben's always wanted to be asked
(44:55): Ben's advice for those starting their open source data adventure
(46:34): Executive producer Audra Montenegro's backstage takeaways
Links:
LinkedIn - Connect with Ben
Gradient Flow's Newsletter
Gradient Flow's 2023 Trends Report
Visit Sky Labs
Orchestrate all the Things podcast: Connecting the Dots with George Anadiotis
As Ben Lorica will readily admit, at the risk of dating himself, he belongs to the first generation of data scientists. In addition to having served as Chief Data Scientist for the likes of Databricks and O'Reilly, Lorica advises and works with a number of venture capital firms, startups, and enterprises, conducts surveys, and chairs some of the top data and AI events in the world. That gives him a unique vantage point for identifying developments in this space. Having worked in academia teaching applied mathematics and statistics for years, at some point Lorica realized that he wanted his work to have more practical implications. At that point the term "data science" had not yet been coined, and Lorica's exit strategy was to become a quant. Fast-forward to today: Lorica still has friends in the venture capital world. That includes Intel Capital's Assaf Araki, with whom Lorica co-authored two recent posts on data management and AI trends. We caught up with Lorica to discuss those, as well as new areas for growth, the trouble with unicorns, and what to do about it.
Show Notes
(01:41) Mars walked through his education studying Computer Systems Engineering at The University of Auckland in New Zealand.
(03:16) Mars reflected on his overall Ph.D. experience in Computer Science at UCLA.
(05:55) Mars discussed his early research paper on a robust and scalable lane departure warning system for smartphones.
(07:13) Mars described his work on SmartFall, an automatic fall detection system to help prevent the elderly from falling.
(08:34) Mars explained his project WANDA, an end-to-end remote health monitoring and analytics system designed for heart failure patients.
(10:06) Mars recalled learnings from interning as a software engineer at Google during his Ph.D.
(14:54) Mars discussed engineering challenges while working on PHP for Google App Engine and Gboard personalization during his subsequent four years at Google.
(19:05) Mars rationalized his decision to join LinkedIn to lead an engineering team that builds the core metadata infrastructure for the entire organization.
(21:15) Mars discussed the motivation behind the creation of LinkedIn's generalized metadata search and discovery tool, DataHub, later open-sourced in 2020.
(25:21) Mars dissected the key architecture of DataHub, which is designed to address scalability challenges in four different forms: modeling, ingestion, serving, and indexing.
(28:50) Mars expressed the challenges of finding DataHub's early adopters, internally at LinkedIn and later externally at other companies.
(35:22) Mars shared the story behind the founding of Metaphor Data, which he co-founded with Pardhu Gunnam and Seyi Adebajo and where he currently serves as CTO.
(41:55) Mars unpacked how Metaphor's modern metadata platform serves as a system of record for any organization's data ecosystem.
(48:07) Mars described new challenges with metadata management since the introduction of the modern data stack, and key features of a great modern metadata platform (as brought up in his in-depth blog post with Ben Lorica).
(53:55) Mars explained how a modern metadata platform fits within the broader data ecosystem.
(58:30) Mars shared the hurdles to finding Metaphor Data's early design partners and lighthouse customers.
(01:04:33) Mars shared valuable hiring lessons for attracting the right people who are excited about Metaphor's mission.
(01:07:28) Mars shared important culture-building lessons for building out a high-performing team at Metaphor.
(01:10:45) Mars shared fundraising advice for founders currently seeking the right investors for their startups.
(01:13:22) Closing segment.
Mars' Contact Info: Twitter | LinkedIn | Google Scholar | GitHub
Metaphor Data: Website | Twitter | LinkedIn | Careers | About Page | Data Documentation | Data Collaboration
Mentioned Content
Articles:
DataHub: A generalized metadata search and discovery tool (Aug 2019)
Open-sourcing DataHub: LinkedIn's metadata search and discovery platform (Feb 2020)
Founding Metaphor Data (Dec 2020)
Metaphor and Soda partner to unify the modern data stack with trusted data (Dec 2021)
Introducing Metaphor: The Modern Metadata Platform (Nov 2021)
The Modern Metadata Platform: What, Why, and How? (Jan 2022)
Papers:
SmartLDWS: A robust and scalable lane departure warning system for the smartphones (Oct 2009)
SmartFall: An automatic fall detection system based on subsequence matching for the SmartCane (April 2009)
WANDA: An end-to-end remote health monitoring and analytics system for heart failure patients (Oct 2012)
People:
Benn Stancil (Chief Analytics Officer at Mode Analytics, well-known Substack writer)
Tristan Handy (Co-Founder and CEO of dbt Labs, writer of The Analytics Engineering Roundup)
Andy Pavlo (Associate Professor of Databases at Carnegie Mellon University)
Books:
"Working In Public" (by Nadia Eghbal)
"The Mom Test" (by Rob Fitzpatrick)
"A Thousand Brains" (by Jeff Hawkins)
"The Scout Mindset" (by Julia Galef)
Notes:
My conversation with Mars was recorded back in January 2022. Since then, many things have happened at Metaphor Data. I'd recommend: visiting their brand new website, reading the 3-part "Data Documentation" series on their blog (part 1, part 2, and part 3), and looking over the Trusted Data landing page.
About the show
Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models ("the WHY and the HOW") behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.
Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.
Subscribe by searching for Datacast wherever you get podcasts, or click one of the links below: Listen on Spotify | Listen on Apple Podcasts | Listen on Google Podcasts. If you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.
On this episode, we discuss what it means for a technology to be transformational, why artificial intelligence (AI) matters so much to us, what AI is, some good examples of AI, how AI works, who the key people and organizations to follow are, how media narratives distort reality, our own personal experiences with AI, and the limitations of AI, and we finally answer the question, "Will AI transform our lives?"
Links to people, organizations, and other material mentioned in the episode:
People: Andrew Ng, Andrej Karpathy, Ben Lorica, Gary Marcus, Peter Diamandis, Sam Altman, Fei-Fei Li, Pieter Abbeel
Companies/organizations: Singularity University, Google, Covariant.ai
Tweets and articles: GPT-3; Uber/Lyft getting out of autonomous driving; the AlphaFold breakthrough; Boston Dynamics robots; BlenderBot 2.0
Show info: Website, Twitter
Co-hosts & Creators: Kapil, Ravi
Ben Lorica and Dhruba talk about the top data trends that data officers, data scientists, and CTOs are investigating and planning around, DataOps in ML and AI, and Gradient Flow.
Orchestrate all the Things podcast: Connecting the Dots with George Anadiotis
What do you get when you juxtapose two of the hottest domains today, AI and healthcare? A peek into the future, potentially. In 2020, few things went well and saw growth. Artificial intelligence was one of them, and healthcare was another. Artificial intelligence remained on a steady course of growth and further exploration, perhaps because of the Covid-19 crisis. Healthcare was a big area for AI investment. Today, the results of a new survey focusing precisely on the adoption of AI in healthcare are being unveiled. We caught up with two of its architects, Gradient Flow Principal Ben Lorica and John Snow Labs CTO David Talby, to discuss the findings and the state of AI in healthcare. Article published on ZDNet.
Data infrastructure has been transformed over the last fifteen years. The open source Hadoop project led to the creation of multiple companies based around commercializing the MapReduce algorithm and the Hadoop distributed file system. Cheap cloud storage popularized the usage of data lakes. Cheap cloud servers led to wide experimentation with data tools. Apache Spark emerged… (The post The Data Exchange with Ben Lorica appeared first on Software Engineering Daily.)
Listen in as Kelli Lapointe is joined by Roger Magoulas, VP of Radar at O'Reilly, to share stories, laughs, and expectations for the AI Conference. From technology adoption to the way the conference came to be, Roger addresses some big ideas and shares how a paper he wrote in tandem with Ben Lorica back in 2008 inspired the conference it is today. Both Ben and Roger knew that something was going on in the realm of sophisticated analytics and took the chance to dive in deeper. After starting the conference and realizing how important machine learning had become, they learned how to integrate this new-found methodology across applications and operations. Roger shares stories and experiences from O'Reilly's AI Conference, describes the learning taking place as deep and fundamental, and explains how this social learning environment goes beyond the topics presented during the conference, allowing attendees to absorb information from all around them. Learn more about this great social network and hear Roger's insights, such as why this AI conference is much like meditation, today!
The trend towards model deployment, engineering and just generally building “stuff that works” is just the latest step in the evolution of the now-maturing world of data science. It’s almost guaranteed not to be the last one though, and staying ahead of the data science curve means keeping an eye on what trends might be just around the corner. That’s why we asked Ben Lorica, O’Reilly Media’s Chief Data Scientist, to join us on the podcast. Not only does Ben have a mile-high view of the data science world (he advises about a dozen startups and organizes multiple world-class conferences), but he also has a perspective that spans two decades of data science evolution.
Cory and Brett get to sit down with Mr. @BigData, Ben Lorica, Chief Data Scientist at O'Reilly Media, to get a glimpse into the future of AI, ML, and data science. Ben, who also happens to be the Program Chair for Strata Data, the AI Conference, and TensorFlow World, talks about the common challenges he is seeing organizations have in starting and scaling machine learning projects, and provides some best practices for avoiding the traps others have fallen into. The team also gets to hear about some of the trends, tools, and technologies on the bleeding edge that are driving huge advancements in machine and deep learning.
In this week’s episode of the Data Show, we’re featuring an interview Data Show host Ben Lorica participated in for the Software Engineering Daily Podcast, where he was interviewed by Jeff Meyerson. Their conversation mainly centered around data engineering, data architecture and infrastructure, and machine learning (ML). Here are a few highlights: Tools for productive […]
Upcoming events: A Conversation with Haseeb Qureshi at Cloudflare on April 3, 2019; FindCollabs Hackathon at App Academy on April 6, 2019.
Ben Lorica is the chief data scientist at O'Reilly Media and the program director of the Strata Data Conference. In his work, Ben spends time with people across the software industry, giving him… (The post Data with Ben Lorica appeared first on Software Engineering Daily.)
For the end-of-year holiday episode of the Data Show, I turned the tables on Data Show host Ben Lorica to talk about trends in big data, machine learning, and AI, and what to look for in 2019. Lorica also showcased some highlights from our upcoming Strata Data and Artificial Intelligence conferences. Here are some highlights […]
When contemplating a new venture into AI or machine learning, companies need to take on a number of important considerations that relate to talent, existing data, and limitations. One way executives can judge how successful or appropriate an AI project would be for their company is to examine use cases of businesses that have previously done something similar. With AI and machine learning news increasing in tech media, a business leader may find it challenging to cut through the hype and identify valid, useful case studies. We talked to Ben Lorica, the Chief Data Scientist at O'Reilly Media, to get his insights on what key details executives should be looking for within a case study. To read our interview article, visit https://www.techemergence.com/what-executives-should-be-asking-about-ai-use-cases-in-business
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
This week I’ve invited my friend Ben Lorica onto the show. Ben is Chief Data Scientist for O’Reilly Media, and Program Director of Strata Data & the O'Reilly A.I. conference. Ben has worked on analytics and machine learning in the finance and retail industries, and serves as an advisor for nearly a dozen startups. In his role at O’Reilly he’s responsible for the content for 7 major conferences around the world each year. In the show we discuss all of that, touching on how publishers can take advantage of machine learning and data mining, how the role of “data scientist” is evolving and the emergence of the machine learning engineer, and a few of the hot technologies, trends and companies that he’s seeing arise around the world. The notes for this show can be found at twimlai.com/talk/26
The O'Reilly Security Podcast: Scaling machine learning for security, the evolving nature of security data, and how adversaries can use machine learning against us.
In this special episode of the Security Podcast, O'Reilly's Ben Lorica talks with Parvez Ahammad, who leads the data science and machine learning efforts at Instart Logic. He has applied machine learning in a variety of domains, most recently to computational neuroscience and security. Lorica and Ahammad discuss the challenges of using machine learning in information security. Here are some highlights:
Scaling machine learning for security
If you look at a day's worth of logs, even for a mid-size company, it's billions of rows of logs. The scale of the problem is actually incredibly large. Typically, people are working to somehow curate a small data set and convince themselves that using only a small subset of the data is reasonable, and then go to work on that small subset—mostly because they're unsure how to build a scalable system. They've perhaps already signed up for doing a particular machine learning method without strategically thinking about what their situation really requires. Within my company, I have a colleague from a hardcore security background and I come from a more traditional machine learning background. We butt heads, and we essentially help each other learn about the other's paradigm and how to think about it.
The evolving nature of security data and the exploitation of machine learning by adversaries
Many times, if you take a survey and see that most of the machine learning applications are supervised, what you're assuming is that you collected the data and you think the underlying distribution of your data collection is true. In statistics, this is called the stationarity assumption. You assume that this batch is representative of what you're going to see later. You are going to split your data into two parts; you train on one part and you test on the other part. The issue is, especially in security, there is an adversary. Any time you settle down and build a classifier, there is somebody actively working to break it. There is no assumption of stationarity that is going to hold. Also, there are people and botnets that are actively trying to get around whatever model you constructed. There is an adversarial nature to the problem. These dual-sided problems are typically dealt with in a game-theoretic framework. Basically, you assume there's an adversary. We've recently seen research papers on this topic. One approach we've seen is that you can poison a machine learning classifier to act maliciously by messing with how the samples are being constructed or adjusting the distribution that the classifier is looking at. Alternatively, you can try to construct safe machine learning approaches that go in with the assumption that there is going to be an adversary, then reason through what you can do to thwart said adversary.
Building interpretable and accessible machine learning
I think companies like Google or Facebook probably have access to large-scale resources, where they can curate and generate really good quality ground truth. In such a scenario, it's probably wise to try deep learning. On a philosophical level, I also feel that deep learning is like proving there is a Nash equilibrium. You know that it can be done. How it's exactly getting done is a separate problem. As a scientist, I am interested in understanding what, exactly, is making this work. For example, if you throw deep learning at this problem and the thing comes back, and the classification rates are very small, then we probably need to look at a different problem, because you just threw the kitchen sink at it. However, if we found that it is doing a good job, then what we need to do is to start from there and figure out an explainable model that we can train. We are an enterprise, and in the enterprise industry, it's not sufficient to have an answer; we need to be able to explain why. For that, there are issues in simply applying deep learning as it is. What I'm really interested in these days is the idea of explainable machine learning. It's not enough that we build machine learning systems that can do a certain classification or segmentation job very well. I'm starting to be really interested in the idea of how to build systems that are interpretable, that are explainable—where you can have faith in the outcome of the system by inspecting something about the system that allows you to say, 'Hey, this was actually a trustworthy result.'
Related resources: Applying Machine Learning in Security: A recent survey paper co-written by Parvez Ahammad
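Ahammad's point about the stationarity assumption breaking under an adversary can be made concrete with a small sketch. This is an illustrative toy example, not code from the episode: the features, class centers, and shift amounts below are hypothetical. A classifier is trained and tested on one distribution, then scored again after the "attacker" drifts the malicious samples toward the benign region, showing how held-out accuracy overstates real-world performance once stationarity fails.

```python
# Toy illustration (not from the episode) of the stationarity assumption
# breaking when an adversary shifts the data distribution after training.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical features: benign traffic centered at 0, malicious at 3.
X_benign = rng.normal(loc=0.0, scale=1.0, size=(2000, 5))
X_malicious = rng.normal(loc=3.0, scale=1.0, size=(2000, 5))
X = np.vstack([X_benign, X_malicious])
y = np.array([0] * 2000 + [1] * 2000)

# Standard split: train on one part, test on the other (assumes stationarity).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# The adversary adapts: new malicious samples drift toward the benign region,
# so the classifier now sees a distribution it was never trained on.
X_drifted = rng.normal(loc=1.0, scale=1.0, size=(2000, 5))
y_drifted = np.ones(2000, dtype=int)
print("accuracy after adversarial drift:", clf.score(X_drifted, y_drifted))
```

The gap between the two printed scores is the practical cost of assuming a fixed distribution when someone is actively working to move it.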
The O'Reilly Radar Podcast: Emerging themes in the data space.
This week, O'Reilly's Mac Slocum chats with Ben Lorica, O'Reilly's chief data scientist and host of the O'Reilly Data Show Podcast. Lorica talks about emerging themes in the data space, from machine learning to deep learning to artificial intelligence, how those technologies relate to one another, and how they're fueling real-time data applications. Lorica also talks about how the concept of a data center is evolving, the importance of open source big data components, and the rise in interest in big data ethics. Here are a few highlights:
Human-in-the-loop recommendations
Stitch Fix is a company that I like to talk about. They use machine learning recommendations. This is a company that basically recommends clothing and fashion apparel to women. They use machine learning to generate a series of recommendations, but then human fashion experts actually take those recommendations and filter them further. In many ways, it's the true example of augmentation: humans are always in the loop of the decision-making process.
Deep learning inspiration
The developments in deep learning have inspired ideas in other parts of machine learning as well. ... People have realized it's really a sequence of steps in the pipeline, and in each step, you get a better and better representation of your data, culminating in some kind of predictive task. I think people have realized that if they can automate some of these machine learning pipelines in the way that deep learning does, then they can provide alternative approaches. In fact, one of the things that has happened a lot recently is that people will use deep learning particularly for the feature engineering—the feature representation step in the machine learning task—and then apply another algorithm at the end to do the actual prediction. There's a group out of UC Berkeley, AMPLab, that produced Apache Spark. They recently built a machine learning pipeline on top of Spark. Some of the examples that they ship with are pipelines that you normally associate with deep learning, such as images and speech. They built a series of primitives that you can understand—you can understand how each of these primitive components works—and then you just piece them together in the pipeline. Then they optimize the pipeline for you, so in many ways they mimic what a deep learning architecture does, but maybe they provide more transparency, because you know exactly what's happening in each step of the pipeline.
Learning to build a cake
Deep learning people—and Yann LeCun in particular—have this distinction where, if you think of machine learning as playing a role in AI, there are really three parts to it: unsupervised learning, supervised learning, and reinforcement learning. He talks about unsupervised learning as being the cake, supervised learning being the icing on the cake, and reinforcement learning as being the cherry on the cake. He says the problem is that we don't know how to build the cake. There are a lot of unresolved problems in unsupervised learning.
Structuring the unstructured
At the end of the day, I think AI, and machine learning in general, will require the ability to basically do feature extraction intelligently. In technical terms, if you think of machine learning as discovering some kind of functional mapping from one space to another—here's an image, map it into a category—then what you are really talking about is a function that requires variables, and these variables are features. One of the areas I'm excited about is people who are able to take unstructured information like text or images and turn it into structured information. Because once you go from unstructured information to structured information, the structured information can be used as features in machine learning algorithms. There's a company that came out of Stanford's DeepDive project called Lattice.io. It's doing interesting things in this area, where they are taking text and images and extracting structured information from these unstructured data sources. Basically, human-level accuracy, but, obviously, since they are doing it using computers, they can scale as machines scale. I think this will unlock a lot of data sources that normally people would not use for predictive purposes.
The rise of the mini data center
The other interesting thing I've noticed over the past few months is the notion of a data center. What is a data center? Well, a data center is a huge warehouse near a hydroelectric plant, right? That's the usual notion. But as some of our environments generate more data—think of a self-driving car, or a smart building, or an airplane—once they are generating lots and lots of data, you could consider them mini data centers in many ways, right? Some of these platforms need to look ahead into the future, where they have to be simpler. Simple enough so you can stick a mini data center inside a car. Slim down your big data architecture enough so you can stick it somewhere and you don't have to rely too much on network communication to do all of your data crunching. I think that's an interesting concept; there are companies that are already designing their architectures around this notion that there will be a proliferation of these data centers, so to speak.
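The pattern Lorica describes, deep learning for the representation step and a separate algorithm for the final prediction, can be sketched roughly as follows. This is an illustrative outline rather than code from the episode: it assumes a pretrained torchvision ResNet as the feature extractor, and the random tensors stand in for real, preprocessed images.

```python
# Rough sketch of the pattern above: a deep network supplies the feature
# representation, and a classical, transparent model does the prediction.
# Illustrative only; random tensors stand in for real, preprocessed images.
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# Pretrained backbone with its classification head removed -> 512-d embeddings.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

def extract_features(batch: torch.Tensor) -> torch.Tensor:
    """Run images through the frozen backbone to get feature vectors."""
    with torch.no_grad():
        return backbone(batch)

# Stand-ins for a labeled batch of images (hypothetical data).
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 2, (64,))

features = extract_features(images).numpy()

# A separate, inspectable algorithm handles the actual prediction step.
clf = LogisticRegression(max_iter=1000).fit(features, labels.numpy())
print("training accuracy:", clf.score(features, labels.numpy()))
```

The same split also mirrors the "structuring the unstructured" idea: the network turns raw pixels (or text) into structured feature vectors that any downstream model can consume.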
The O'Reilly Radar Podcast: A special holiday cross-over of the O'Reilly Data Show Podcast.
O'Reilly's Ben Lorica chats with Apache Spark release manager and Databricks co-founder Patrick Wendell about Spark's roadmap and interesting applications he's seeing in the growing Spark ecosystem. Here are some highlights from their chat:
We were really trying to solve research problems, so we were trying to work with the early users of Spark, getting feedback on what issues it had and what types of problems they were trying to solve with Spark, and then use that to influence the roadmap. It was definitely a more informal process, but from the very beginning, we were expressly user driven in the way we thought about building Spark, which is quite different than a lot of other open source projects. … From the beginning, we were focused on empowering other people and building platforms for other developers.
One of the early users was Conviva, a company that does analytics for real-time video distribution. They were a very early user of Spark, they continue to use it today, and a lot of their feedback was incorporated into our roadmap, especially around the types of APIs they wanted to have that would make data processing really simple for them. Of course, performance was a big issue for them very early on, because in the business of optimizing real-time video streams, you want to be able to react really quickly when conditions change. ... Early on, things like latency and performance were pretty important.
In general in Spark, we are trying to make every release of Spark accessible to more users, and that means giving people super easy-to-use APIs—APIs in familiar languages like Python and APIs that are codable without a lot of effort. I remember when we started Spark, we were super excited because you can write k-means clustering in like 10 lines of code; to do the same thing in Hadoop, you have to write 300 lines.
The next major API for Spark is SparkR, which was merged into the master branch [early in 2015] and is going to be present in the 1.4 release of Spark. This is what we saw as a very important part of embracing the data science community. R is already very popular and actually growing rather quickly in terms of popularity for statistical processing, and we wanted to give people a really nice first-class way of using R with Spark.
We have an exploration into [deep learning] going on, actually, by Reza Zadeh, who's a Stanford professor who's been working on Spark for a long time and works at Databricks as well, as a consultant. He's starting to look into it; I think the initial deliverable was just some support for standard neural nets, but deep learning is definitely on the horizon. That may be more of the Spark 1.6 time frame, but we are definitely deciding which subsets of functionality we can support nicely inside of Spark, and we've heard a very clear user demand for that.
Subscribe to the O'Reilly Radar Podcast: Stitcher, TuneIn, iTunes, SoundCloud, RSS
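Wendell's "k-means in about 10 lines" point still holds with today's PySpark API. A minimal sketch, not from the episode, using a handful of hypothetical two-dimensional points rather than a real dataset:

```python
# Minimal k-means in PySpark (illustrative; the points below are made up).
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

# A few hypothetical 2-D points forming two loose clusters.
points = [(Vectors.dense([0.0, 0.1]),), (Vectors.dense([0.2, 0.0]),),
          (Vectors.dense([9.0, 9.1]),), (Vectors.dense([9.2, 8.9]),)]
df = spark.createDataFrame(points, ["features"])

model = KMeans(k=2, seed=1).fit(df)                          # fit two clusters
print(model.clusterCenters())                                # learned centers
print(model.transform(df).select("prediction").collect())    # cluster assignments

spark.stop()
```

The brevity comes from the DataFrame-based ML API doing the distribution and iteration for you, which is exactly the contrast with hand-written MapReduce that Wendell is drawing.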