Tony Baer, Principal at dbInsight, joins Corey on Screaming in the Cloud to discuss his definition of what is and isn't a database, and the trends he's seeing in the industry. Tony explains why it's important to try and have an outsider's perspective when evaluating new ideas, and the growing awareness of the impact data has on our daily lives. Corey and Tony discuss the importance of working towards true operational simplicity in the cloud, and Tony also shares why explainability in generative AI is so crucial as the technology advances.

About Tony

Tony Baer, the founder and CEO of dbInsight, is a recognized industry expert in extending data management practices, governance, and advanced analytics to address the desire of enterprises to generate meaningful value from data-driven transformation. His combined expertise in both legacy database technologies and emerging cloud and analytics technologies shapes how clients go to market in an industry undergoing significant transformation. During his 10 years as a principal analyst at Ovum, he established successful research practices in the firm's fastest growing categories, including big data, cloud data management, and product lifecycle management. He advised Ovum clients regarding product roadmap, positioning, and messaging and helped them understand how to evolve data management and analytic strategies as the cloud, big data, and AI moved the goal posts. Baer was one of Ovum's most heavily-billed analysts and provided strategic counsel to enterprises spanning the Fortune 100 to fast-growing privately held companies.

With the cloud transforming the competitive landscape for database and analytics providers, Baer led deep dive research on the data platform portfolios of AWS, Microsoft Azure, and Google Cloud, and on how cloud transformation changed the roadmaps for incumbents such as Oracle, IBM, SAP, and Teradata.
While at Ovum, he originated the term “Fast Data” which has since become synonymous with real-time streaming analytics.

Baer's thought leadership and broad market influence in big data and analytics have been formally recognized on numerous occasions. Analytics Insight named him one of the 2019 Top 100 Artificial Intelligence and Big Data Influencers. Previous citations include Onalytica, which named Baer as one of the world's Top 20 thought leaders and influencers on Data Science; Analytics Week, which named him as one of 200 top thought leaders in Big Data and Analytics; and KDnuggets, which listed Baer as one of the Top 12 data analytics thought leaders on Twitter. While at Ovum, Baer was Ovum IT's most visible and publicly quoted analyst, and was cited by Ovum's parent company Informa as Brand Ambassador in 2017. In raw numbers, Baer has 14,000 followers on Twitter, and his ZDNet “Big on Data” posts are read 20,000 – 30,000 times monthly. He is also a frequent speaker at industry conferences such as Strata Data and Spark Summit.

Links Referenced:

dbInsight: https://dbinsight.io/

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is brought to us in part by our friends at Red Hat. As your organization grows, so does the complexity of your IT resources. You need a flexible solution that lets you deploy, manage, and scale workloads throughout your entire ecosystem. The Red Hat Ansible Automation Platform simplifies the management of applications and services across your hybrid infrastructure with one platform. Look for it on the AWS Marketplace.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn.
Back in my early formative years, I was an SRE sysadmin type, and one of the areas I always avoided was databases, or frankly, anything stateful because I am clumsy and unlucky and that's a bad combination to bring within spitting distance of anything that, you know, can't be spun back up intact, like databases. So, as a result, I tend not to spend a lot of time historically living in that world. It's time to expand horizons and think about this a little bit differently. My guest today is Tony Baer, principal at dbInsight. Tony, thank you for joining me.

Tony: Oh, Corey, thanks for having me. And by the way, we'll try and basically knock down your primal fear of databases today. That's my mission.

Corey: We're going to instill new fears in you. Because I was looking through a lot of your work over the years, and the criticism I have—and always the best place to deliver criticism is massively in public—is that you take a very conservative, stodgy approach to defining a database, whereas I'm on the opposite side of the world. I contain information. You can ask me about it, which we'll call querying. That's right. I'm a database.

But I've never yet found myself listed in any of your analyses around various database options. So, what is your definition of databases these days? Where do they start and stop?

Tony: Oh, gosh.

Corey: Because anything can be a database if you hold it wrong.

Tony: [laugh]. I think one of the last things I've ever been called is conservative and stodgy, so this is certainly a way to basically put the thumbtack on my chair.

Corey: Exactly. I'm trying to normalize my own brand of lunacy, so we'll see how it goes.

Tony: Exactly, because that's the role I normally play with my clients. So, now the shoe is on the other foot.
What I view a database is, is basically a managed collection of data, and it's managed to the point where essentially, a database should be transactional—in other words, when I basically put some data in, I should have some positive information, I should hopefully, depending on the type of database, have some sort of guidelines or schema or model for how I structure the data. So, I mean, database, you know, even though you keep hearing about unstructured data, the fact is—

Corey: Schemaless databases and data stores. Yeah, it was all the rage for a few years.

Tony: Yeah, except that they all have schemas, just that those schemaless databases just have very variable schema. They're still schema.

Corey: A question that I have is you obviously think deeply about these things, which should not come as a surprise to anyone. It's like, “Well, this is where I spend my entire career. Imagine that. I might think about the problem space a little bit.” But you have, to my understanding, never worked with databases in anger yourself. You don't have a history as a DBA or as an engineer—

Tony: No.

Corey: —but what I find very odd is that unlike a whole bunch of other analysts that I'm not going to name, but people know who I'm talking about regardless, you bring actual insights into this that I find useful and compelling, instead of reverting to the mean of well, I don't actually understand how any of these things work in reality, so I'm just going to believe whoever sounds the most confident when I ask a bunch of people about these things. Are you just asking the right people who also happen to sound confident? But how do you get away from that very common analyst trap?

Tony: Well, a couple of things. One is I purposely play the role of outside observer. In other words, like, the idea is that if basically an idea is supposed to stand on its own legs, it has to make sense. If I've been working inside the industry, I might take too many things for granted.
And a good example of this goes back, actually, to my early days—actually this goes back to my freshman year in college where I was taking an organic chem course for non-majors, and it was taught as a logic course not as a memorization course.

And we were given the option at the end of the term to either, basically, take a final or do a paper. So, of course, me being a writer I thought, I can BS my way through this. But what I found—and this is what fascinated me—is that as long as certain technical terms were defined for me, I found a logic to the way things work. And so, that really informs how I approach databases, how I approach technology today is I look at the logic on how things work. That being said, in order for me to understand that, I need to know twice as much as the next guy in order to be able to speak that because I just don't do this in my sleep.

Corey: That goes a big step toward, I guess, addressing a lot of these things, but it also feels like—and maybe this is just me paying closer attention—that the world of databases and data and analytics have really coalesced or emerged in a very different way over the past decade-ish. It used to be, at least from my perspective, that oh, that the actual, all the data we store, that's a storage admin problem. And that was about managing NetApps and SANs and the rest. And then you had the database side of it, which functionally from the storage side of the world was just a big file or series of files that are the backing store for the database. And okay, there's not a lot of cross-communication going on there.

Then with the rise of object store, it started being a little bit different. And even the way that everyone is talking about getting meaning from data has really seemed to be evolving at an incredibly intense clip lately. Is that an accurate perception, or have I just been asleep at the wheel for a while and finally woke up?

Tony: No, I think you're onto something there.
And the reason is that, one, data is touching us all around ourselves, and the fact is, I mean, you can see it in the same way that all of a sudden people know how to spell AI. They may not know what it means, but the thing is, there is an awareness the data that we work with, the data that is about us, it follows us, and with the cloud, this data has—well, I should say not just with the cloud but with smart mobile devices—we'll blame that—we are all each founts of data, and rich founts of data. And people in all walks of life, not just in the industry, are now becoming aware of it and there's a lot of concern about can we have any control, any ownership over the data that should be ours? So, I think that phenomenon has also happened in the enterprise, where essentially where we used to think that the data was the DBAs' issue, it's become the app developers' issue, it's become the business analysts' issue. Because the answers that we get, we're ultimately accountable for. It all comes from the data.

Corey: It also feels like there's this idea of databases themselves becoming more contextually aware of the data contained within them. Originally, this used to be in the realm of, “Oh, we know what's been accessed recently and we can tier out where it lives for storage optimization purposes.” Okay, great, but what I'm seeing now almost seems to be a sense of, people like to talk about pouring ML into their database offerings. And I'm not able to tell whether that is something that adds actual value, or if it's marketing-ware.

Tony: Okay. First off, let me kind of spill a couple of things. First of all, it's not a question of the database becoming aware. A database is not sentient.

Corey: Neither are some engineers, but that's neither here nor there.

Tony: That would be true, but then again, I don't want anyone with shotguns lining up at my door after this—

Corey: [laugh].

Tony: —after this interview is published.
But [laugh] more to the point, though, is that I can see a couple roles for machine learning in databases. One is a database itself, the logs, are an incredible font of data, of operational data. And you can look at trends in terms of when this—when the pattern of these logs goes this way, that is likely to happen. So, the thing is that I could very easily say we're already seeing it: machine learning being used to help optimize the operation of databases, if you're Oracle, and say, “Hey, we can have a database that runs itself.”

The other side of the coin is being able to run your own machine-learning models in database as opposed to having to go out into a separate cluster and move the data, and that's becoming more and more of a checkbox feature. However, that's going to be for essentially, probably, like, the low-hanging fruit, like the 80/20 rule. It'll be like the 20% of an ana—of relatively rudimentary, you know, let's say, predictive analyses that we can do inside the database. If you're going to be doing something more ambitious, such as a, you know, a large language model, you probably do not want to run that in database itself. So, there's a difference there.

Corey: One would hope. I mean, one of the inappropriate uses of technology that I go for all the time is finding ways to—as directed or otherwise—in off-label uses find ways of tricking different services into running containers for me. It's kind of a problem; this is probably why everyone is very grateful I no longer write production code for anyone.

But it does seem that there's been an awful lot of noise lately. I'm lazy. I take shortcuts very often, and one of those is that whenever AWS talks about something extensively through multiple marketing cycles, it becomes usually a pretty good indicator that they're on their back foot on that area.
And for a long time, they were doing that about data and how it's very important to gather data, it unlocks the key to your business, but it always felt a little hollow-slash-hypocritical to me because you're going to some of the same events that I have that AWS throws on. You notice how you have to fill out the exact same form with a whole bunch of mandatory fields every single time, but there never seems to be anything that gets spat back out to you that demonstrates that any human or system has ever read—

Tony: Right.

Corey: Any of that? It's basically a, “Do what we say, not what we do,” style of story. And I always found that to be a little bit disingenuous.

Tony: I don't want to just harp on AWS here. Of course, we can always talk about the two-pizza box rule and the fact that you have lots of small teams there, but I'd rather generalize this. And I think you really—what you're just describing has been my trip through the healthcare system. I had some sports-related injuries this summer, so I've been through a couple of surgeries to repair sports injuries. And it's amazing that every time you go to the doctor's office, you're filling the same HIPAA information over and over again, even with healthcare systems that use the same electronic health records software. So, it's more a function of that it's not just that the technologies are siloed, it's that the organizations are siloed. That's what you're saying.

Corey: That is fair. And I think at some level—I don't know if this is a weird extension of Conway's Law or whatnot—but these things all have different backing stores as far as data goes. And there's a—the hard part, it seems, in a lot of companies once they hit a certain point of maturity is not just getting the data in—because they've already done that to some extent—but it's also then making it actionable and helping various data stores internal to the company reconcile with one another and start surfacing things that are useful.
It increasingly feels like it's less of a technology problem and more of a people problem.

Tony: It is. I mean, put it this way, I spent a lot of time last year, I burned a lot of brain cells working on data fabrics, which is an idea that's in the eye of the beholder. But the ideal of a data fabric is that it's not the tool that necessarily governs your data or secures your data or moves your data or transforms your data, but it's supposed to be the master orchestrator that brings all that stuff together. And maybe sometime 50 years in the future, we might see that.

I think the problem here is both technical and organizational. [unintelligible 00:11:58] a promise, you have all these what we used to call island silos. We still call them silos or islands of information. And actually, ironically, even though in the cloud we have technologies where we can integrate this, the cloud has actually exacerbated this issue because there's so many islands of information, you know, coming up, and there's so many different little parts of the organization that have their hands on that. That's also a large part of why there's such a big discussion about, for instance, data mesh last year: everybody is concerned about owning their own little piece of the pie, and there's a lot of question in terms of how do we get some consistency there? How do we all read from the same sheet of music? That's going to be an ongoing problem. You and I are going to get very old before that ever gets solved.

Corey: Yeah, there are certain things that I am content to die knowing that they will not get solved. If they ever get solved, I will not live to see it, and there's a certain comfort in that, on some level.

Tony: Yeah.

Corey: But it feels like this stuff is also getting more and more complicated than it used to be, and terms aren't being used in quite the same way as they once were.
Something that a number of companies have been saying for a while now has been that customers overwhelmingly are preferring open-source. Open source is important to them when it comes to their database selection. And I feel like that's a conflation of a couple of things. I've never yet found an ideological, purity-driven customer decision around that sort of thing.

What they care about is, are there multiple vendors who can provide this thing so I'm not going to be using a commercially licensed database that can arbitrarily start playing games with seat licenses and wind up distorting my cost structure massively with very little notice. Does that align with your—

Tony: Yeah.

Corey: Understanding of what people are talking about when they say that, or am I missing something fundamental? Which is again, always possible?

Tony: No, I think you're onto something there. Open-source is a whole other can of worms, and I've burned many, many brain cells over this one as well. And today, you're seeing a lot of pieces about the, you know, the—that are basically giving eulogies for open-source. It's—you know, like HashiCorp just finally changed its license and a bunch of others have in the database world. What open-source has meant has been—and I think for practitioners, for DBAs and developers—here's a platform that's been implemented by many different vendors, which means my skills are portable.

And so, I think that's really been the key to why, for instance, like, you know, MySQL and especially PostgreSQL have really exploded, you know, in popularity. Especially Postgres, you know, of late. And it's like, you look at Postgres, it's a very unglamorous database. If you're talking about stodgy, it was born to be stodgy because they wanted to be an adult database from the start.
They weren't the LAMP stack like MySQL.

And the secret of success with Postgres was that it had a very permissive open-source license, which meant that as long as you don't hold University of California at Berkeley liable, have at it, kids. And so, you see, like, a lot of different flavors of Postgres out there, which means that a lot of customers are attracted to that because if I get up to speed on this Postgres—on one Postgres database, my skills should be transferable, should be portable to another. So, I think that's a lot of what's happening there.

Corey: Well, I do want to call that out in particular because when I was coming up in the naughts, the mid-2000s decade, the lingua franca on everything I used was MySQL, or as I insist on mispronouncing it, my-squeal. And lately, in the same vein, Postgres-squeal seems to have taken over the entire universe, when it comes to the de facto database of choice. And I'm old and grumpy and learning new things is always challenging, so I don't understand a lot of the ways that thing gets managed from the context coming from where I did before, but what has driven the massive growth of mindshare among the Postgres-squeal set?

Tony: Well, I think it's a matter of it's 30 years old and it's—number one, Postgres always positioned itself as an Oracle alternative. And the early years, you know, this is a new database, how are you going to be able to match? At that point, Oracle had about a 15-year head start on it. And so, it was a gradual climb to respectability. And I have huge respect for Oracle, don't get me wrong on that, but you take a look at Postgres today and they have basically filled in a lot of the blanks.

And so, it now is a very cre—in many cases, it's a credible alternative to Oracle. Can it do all the things Oracle can do? No. But for a lot of organizations, it's the 80/20 rule. And so, I think it's more just a matter of, like, Postgres coming of age.
And the fact is, as a result of it coming of age, there's a huge marketplace out there and so much choice, and so much opportunity for skills portability. So, it's really one of those things where its time has come.

Corey: I think that a lot of my own biases are simply a product of the era in which I learned how a lot of these things work on. I am terrible at Node, for example, but I would be hard-pressed not to suggest JavaScript as the default language that people should pick up if they're just entering tech today. It does front-end, it does back-end—

Tony: Sure.

Corey: —it even makes fries, apparently. There's a—that is the lingua franca of the modern internet in a bunch of different ways. That doesn't mean I'm any good at it, and it doesn't mean at this stage, I'm likely to improve massively at it, but it is the right move, even if it is inconvenient for me personally.

Tony: Right. Right. Put it this way, we've seen—and as I said, I'm not an expert in programming languages, but we've seen a huge profusion of programming languages and frameworks. But the fact is that there's always been a draw towards critical mass. At the turn of the millennium, we thought it was between Java and .NET. Little did we know that basically JavaScript—which at that point was just a web scripting language—[laugh] we didn't know that it could work on the server; we thought it was just a client. Who knew?

Corey: That's like using something inappropriately as a database. I mean, good heavens.

Tony: [laugh]. That would be true. I mean, when I could have, you know, easily just used a spreadsheet or something like that. But so, I mean, who knew? I mean, just like for instance, Java itself was originally conceived for a set-top box. You never know how this stuff is going to turn out. The same thing happened with Python. Python was also a web scripting language. Oh, by the way, it happens to be really powerful and flexible for data science.
And whoa, you know, now Python is—in terms of data science languages—has become the new SAS.

Corey: It really took over in a bunch of different ways. Before that, Perl was great, and I go, “Why would I use—why write in Python when Perl is available?” It's like, “Okay, you know how to write Perl, right?” “Yeah.” “Have you ever read anything a month later?” “Oh…” it's very much a write-only language. It is inscrutable after the fact. And Python at least makes that a lot more approachable, which is never a bad thing.

Tony: Yeah.

Corey: Speaking of what you touched on toward the beginning of this episode, the idea of databases not being sentient, which I equate to being self-aware, you just came out very recently with a report on generative AI and a trip that you wound up taking on this. Which I've read; I love it. In fact, we've both been independently using the phrase [unintelligible 00:19:09] to, “English is the new most common programming language once a lot of this stuff takes off.” But what have you seen? What have you witnessed as far as both the ground truth reality as well as the grandiose statements that companies are making as they trip over themselves trying to position as the forefront leader in all of this thing that didn't really exist five months ago?

Tony: Well, what's funny is—and that's a perfect question because if on January 1st you asked “what's going to happen this year?” I don't think any of us would have thought about generative AI or large language models. And I will not identify the vendors, but I did some that had—was on some advance briefing calls back around the January, February timeframe. They were talking about things like serverless, they were talking about in-database machine learning and so on and so forth. They weren't saying anything about generative.

And all of a sudden, April, it changed. And it's essentially just another case of the tail wagging the dog. Consumers were flocking to ChatGPT and enterprises had to take notice.
And so, what I saw, in the spring was—and I was at a conference from SAS, I'm [unintelligible 00:20:21] SAP, Oracle, IBM, Mongo, Snowflake, Databricks and others—that they all very quickly changed their tune to talk about generative AI. What we were seeing was for the most part, position statements, but we also saw, I think, the early emphasis was, as you say, it's basically English as the new default programming language or API, so basically, coding assistance, what I'll call conversational query.

I don't want to call it natural language query because we had stuff like Tableau Ask Data, which was very robotic. So, we're seeing a lot of that. And we're also seeing a lot of attention towards foundation models because I mean, what organization is going to have the resources of a Google or an OpenAI to develop their own foundation model? Yes, some of the Wall Street houses might, but I think most of them are just going to say, “Look, let's just use this as a starting point.”

I also saw a very big theme for your models with your data. And where I got a hint of that—it was a throwaway LinkedIn post. It was back in, I think like, February, Databricks had announced Dolly, which was kind of an experimental foundation model, just to use with your own data. And I just wrote three lines in a LinkedIn post, it was on Friday afternoon. By Monday, it had 65,000 hits.

I've never seen anything—I mean, yes, I had a lot—I used to say ‘data mesh’ last year, and it would—but didn't get anywhere near that. So, I mean, that really hit a nerve. And other things that I saw, was the, you know, the starting to look with vector storage and how that was going to be supported: was it going to be a new type of database, and hey, let's have AWS come up with, like, an, you know, an [ADF 00:21:41] database here or is this going to be a feature? I think for the most part, it's going to be a feature.
And of course, under all this, everybody's just falling in love, falling all over themselves to get in the good graces of Nvidia. In capsule, that's kind of like what I saw.

Corey: That feels directionally accurate. And I think databases are a great area to point out one thing that's always been a little more disconcerting for me. The way that I've always viewed databases has been, unless I'm calling a RAND function or something like it and I don't change the underlying data structure, I should be able to run a query twice in a row and receive the same result deterministically both times.

Tony: Mm-hm.

Corey: Generative AI is effectively non-deterministic for all realistic measures of that term. Yes, I'm sure there's a deterministic reason things are under the hood. I am not smart enough or learned enough to get there. But it just feels like sometimes we're going to give you the answer you think you're going to get, sometimes we're going to give you a different answer. And sometimes, in generative AI space, we're going to be supremely confident and also completely wrong. That feels dangerous to me.

Tony: [laugh]. Oh gosh, yes. I mean, I take a look at ChatGPT and to me, the responses are essentially, it's a high school senior coming out with an essay response without any footnotes. It's the exact opposite of an ACID database. The reason why we're very—in the database world, we're very strongly drawn towards ACID is because we want our data to be consistent and to get—if we ask the same query, we're going to get the same answer.

And the problem is, is that with generative, you know, based on large language models, computers sound sentient, but they're not. Large language models are basically just a series of probabilities, and so hopefully those probabilities will line up and you'll get something similar. That to me, kind of scares me quite a bit.
And I think as we start to look at implementing this in an enterprise setting, we need to take a look at what kind of guardrails can we put on there. And the thing is, that what this led me to was that missing piece that I saw this spring with generative AI, at least in the data and analytics world, is nobody had a clue in terms of how to extend AI governance to this, how to make these models explainable. And I think that's still—that's a large problem. That's a huge nut that it's going to take the industry a while to crack.

Corey: Yeah, but it's incredibly important that it does get cracked.

Tony: Oh, gosh, yes.

Corey: One last topic that I want to get into. I know you said you don't want to over-index on AWS, which, fair enough. It is where I spend the bulk of my professional time and energy—

Tony: [laugh].

Corey: Focusing on, but I think this one's fair because it is a microcosm of a broader industry question. And that is, I don't know what the DBA job of the future is going to look like, but increasingly, it feels like it's going to primarily be picking which purpose-built AWS database—or larger [story 00:24:56] purpose database is appropriate for a given workload. Even without my inappropriate misuse of things that are not databases as databases, there are legitimately 15 or 16 different AWS services that they position as database offerings. And it really feels like you're spiraling down a well of analysis paralysis, trying to pick between all these things. Do you think the future looks more like general-purpose databases, or very purpose-built and each one is this beautiful, bespoke unicorn?

Tony: [laugh]. Well, this is basically a hit on a theme that I've been—you know, we've all been thinking about for years. And the thing is, there are arguments to be made for multi-model databases, you know, versus a for-purpose database. That being said, okay, two things.
One is that what I've been saying, in general, is that—and I wrote about this way, way back; I actually did a talk at the [unintelligible 00:25:50]; it was a throwaway talk, or [unintelligible 00:25:52] one of those conferences—I threw it together and it's basically looking at the emergence of all these specialized databases.

But how I saw, also, there's going to be kind of an overlapping. Not that we're going to come back to Pangea per se, but that, for instance, like, a relational database will be able to support JSON. And Oracle, for instance, does have some fairly brilliant ideas up the sleeve, what they call a JSON duality, which sounds kind of scary, which basically says, “We can store data relationally, but superimpose GraphQL on top of all of this and this is going to look really JSON-y.” So, I think on one hand, you are going to be seeing databases that do overlap. Would I use Oracle for a MongoDB use case? No, but would I use Oracle for a case where I might have some document data? I could certainly see that.

The other point, though, and this is really one I want to hammer on here—it's kind of a major concern I've had—is I think the cloud vendors, for all their talk that we give you operational simplicity and agility are making things very complex with their expanding cornucopia of services. And what they need to do—I'm not saying, you know, let's close down the patent office—what I think we do is we need to provide some guided experiences that says, “Tell us the use case. We will now blend these particular services together and this is the package that we would suggest.” I think cloud vendors really need to go back to the drawing board from that standpoint and look at, how do we bring this all together? How do we really simplify the life of the customer?

Corey: That is, honestly, I think the biggest challenge that the cloud providers have across the board. There are hundreds of services available at this point from every hyperscaler out there.
And some of them are brand new and effectively feel like they're there for three or four different customers and that's about it and others are universal services that most people are probably going to use. And most things fall in between those two extremes, but it becomes such an analysis paralysis moment of trying to figure out what do I do here? What is the golden path?
And what that means is that when you start talking to other people and asking their opinion and getting their guidance on how to do something when you get stuck, it's, “Oh, you're using that service? Don't do it. Use this other thing instead.” And if you listen to that, you get midway through every problem for them to start over again because, “Oh, I'm going to pick a different selection of underlying components.” It becomes confusing and complicated, and I think it does customers largely a disservice. What I think we really need, on some level, is a simplified golden path with easy on-ramps and easy off-ramps where, in the absence of a compelling reason, this is what you should be using.
Tony: Believe it or not, I think this would be a golden case for machine learning.
Corey: [laugh].
Tony: No, but submit to us the characteristics of your workload, and here's a recipe that we would propose. Obviously, we can't trust AI to make our decisions for us, but it can provide some guardrails.
Corey: “Yeah. Use a graph database. Trust me, it'll be fine.” That's your general purpose—
Tony: [laugh].
Corey: —approach. Yeah, that'll end well.
Tony: [laugh]. I would hope that the AI would basically be trained on a better set of training data to not come out with that conclusion.
Corey: One could sure hope.
Tony: Yeah, exactly.
Corey: I really want to thank you for taking the time to catch up with me around what you're doing. If people want to learn more, where's the best place for them to find you?
Tony: My website is dbinsight.io. And on my homepage, I list my latest research.
So, you just have to go to the homepage where you can basically click on the links to the latest and greatest. And I will, as I said, after Labor Day, I'll be publishing my take on my generative AI journey from the spring.
Corey: And we will, of course, put links to this in the [show notes 00:29:39]. Thank you so much for your time. I appreciate it.
Tony: Hey, it's been a pleasure, Corey. Good seeing you again.
Corey: Tony Baer, principal at dbInsight. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that we will eventually stitch together with all those different platforms to create—that's right—a large-scale distributed database.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Roger and DJ share some of the history behind data science as we know it today, and reflect on their experiences working on California's COVID-19 response. --- Roger Magoulas is Senior Director of Data Strategy at Astronomer, where he works on data infrastructure, analytics, and community development. Previously, he was VP of Research at O'Reilly and co-chair of O'Reilly's Strata Data and AI Conference. DJ Patil is a board member and former CTO of Devoted Health, a healthcare company for seniors. He was also Chief Data Scientist under the Obama administration and the Head of Data Science at LinkedIn. Roger and DJ recently volunteered for the California COVID-19 response, and worked with data to understand case counts, bed capacities and the impact of intervention.
SPONSORED CONTENT
This episode is brought to you by Cloudera.
For this sponsored Cloud Wars Live conversation, I spoke with Mick Hollison, CMO of Cloudera. Mick just came back from a Cloudera customer event in New York City called Strata Data, in which they unveiled the new Cloudera Data Platform to the world. He said customers wanted it to be open-source, open APIs, open compute, and open storage.
Mick quotes Peter Levine of Andreessen Horowitz, who says the early phase is dictated by convincing developers and technologists to start programming; the next phase is to get users; and the third phase is how to extract meaningful analytics and insight from the data.
He goes on to say that they have a customer, Komatsu, which makes massive mining equipment costing hundreds of millions of dollars. Cloudera put sensors on the devices to ingest the data, analyze it, and then predict what was going to happen to the machines. The machines, by the way, literally sink into the earth.
Cloudera recently announced that the Cloudera Data Platform is available on AWS – and coming up shortly, on Microsoft Azure. And early next year, Google Cloud Platform.
Cory and Brett get to sit down with Mr. @BigData, Ben Lorica, Chief Data Scientist at O’Reilly Media, to get a glimpse into the future of AI, ML and Data Science. Ben, who also happens to be the Program Chair for Strata Data, the AI Conference, and TensorFlow World, talks about the common challenges he is seeing organizations have in starting and scaling machine learning projects and provides some best practices to avoid falling into the same traps as others. The team also gets to hear about some of the trends, tools and technologies on the bleeding edge that are driving huge advancements in machine and deep learning.
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
Today we’re joined by Kelley Rivoire, engineering manager working on machine learning infrastructure at Stripe. Kelley and I caught up at a recent Strata Data conference to discuss: • Her talk "Scaling model training: From flexible training APIs to resource management with Kubernetes." • Stripe’s machine learning infrastructure journey, including their start from a production focus. • Internal tools used at Stripe, including Railyard, an API built to manage model training at scale & more! The complete show notes can be found at twimlai.com/talk/272. Visit twimlcon.com to learn more about the TWIMLcon: AI Platforms conference! The first 10 listeners who register get their ticket for 75% off using the discount code TWIMLFIRST! Follow along with the entire AI Platforms Vol 2 series at twimlai.com/aiplatforms2. Thanks to SigOpt for their continued support of the podcast, and their sponsorship of this episode! Check out their machine learning experimentation and optimization suite, and get a free trial at twimlai.com/sigopt.
At the recent O’Reilly AI Conference in New York City, Chris met up with O’Reilly Chief Data Scientist Ben Lorica, the Program Chair for Strata Data, the AI Conference, and TensorFlow World. O’Reilly’s ‘AI Adoption in the Enterprise’ report had just been released, so naturally Ben and Chris wanted to do a deep dive into enterprise AI adoption to discuss strategy, execution, and implications.
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
For the final episode of our Strata Data series, we’re joined by Eric Colson, Chief Algorithms Officer at Stitch Fix, whose presentation at the conference explored “How to make fewer bad decisions.” Our discussion focuses in on the three key organizational principles for data science teams that he’s developed at Stitch Fix. Along the way, we also talk through the various roles data science plays at the company, and explore a few of the 800+ algorithms in use at the company spanning recommendations, inventory management, demand forecasting, and clothing design. We discuss the roles of Stitch Fix’s platforms team in supporting the data science organization, and his unique perspective on how to identify platform features. The complete show notes for this episode can be found at https://twimlai.com/talk/257. For more from the Strata Data conference series, visit twimlai.com/stratasf19. I want to send a quick thanks to our friends at Cloudera for their sponsorship of this series of podcasts from the Strata Data Conference, which they present along with O’Reilly Media. Cloudera’s long been a supporter of the podcast; in fact, they sponsored the very first episode of TWiML Talk, recorded back in 2016. Since that time Cloudera has continued to invest in and build out its platform, which already securely hosts huge volumes of enterprise data, to provide enterprise customers with a modern environment for machine learning and analytics that works both in the cloud as well as the data center. In addition, Cloudera Fast Forward Labs provides research and expert guidance that helps enterprises understand the realities of building with AI technologies without needing to hire an in-house research team. To learn more about what the company is up to and how they can help, visit Cloudera’s Machine Learning resource center at cloudera.com/ml.
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
In this episode of our Strata Data conference series, we’re joined by Burcu Baran, Senior Data Scientist at LinkedIn. At Strata, Burcu, along with a few members of her team, delivered the presentation “Using the full spectrum of data science to drive business decisions,” which outlines how LinkedIn manages their entire machine learning production process. In our conversation, Burcu details each phase of the process, including problem formulation, monitoring features, A/B testing and more. We also discuss how her “horizontal” team works with other more “vertical” teams within LinkedIn, various challenges that arise when training and modeling such as data leakage and interpretability, best practices when trying to deal with data partitioning at scale, and of course, the need for a platform that reduces the manual pieces of this process, promoting efficiency. The complete show notes for this episode can be found at https://twimlai.com/talk/256. For more from the Strata Data conference series, visit twimlai.com/stratasf19. I want to send a quick thanks to our friends at Cloudera for their sponsorship of this series of podcasts from the Strata Data Conference, which they present along with O’Reilly Media. Cloudera’s long been a supporter of the podcast; in fact, they sponsored the very first episode of TWiML Talk, recorded back in 2016. Since that time Cloudera has continued to invest in and build out its platform, which already securely hosts huge volumes of enterprise data, to provide enterprise customers with a modern environment for machine learning and analytics that works both in the cloud as well as the data center. In addition, Cloudera Fast Forward Labs provides research and expert guidance that helps enterprises understand the realities of building with AI technologies without needing to hire an in-house research team. To learn more about what the company is up to and how they can help, visit Cloudera’s Machine Learning resource center at cloudera.com/ml. 
I’d also like to send a huge thanks to LinkedIn for their continued support and sponsorship of the show! Now that I’ve had a chance to interview several of the folks on LinkedIn’s Data Science and Engineering teams, it’s really put into context the complexity and scale of the problems that they get to work on in their efforts to create enhanced economic opportunities for every member of the global workforce. AI and ML are integral aspects of almost every product LinkedIn builds for its members and customers and their massive, highly structured dataset gives their data scientists and researchers the ability to conduct applied research to improve member experiences. To learn more about the work of LinkedIn Engineering, please visit engineering.linkedin.com/blog.
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
Today, in the first episode of our Strata Data conference series, we’re joined by Shioulin Sam, Research Engineer with Cloudera Fast Forward Labs. Shioulin and I caught up to discuss the newest report to come out of CFFL, “Learning with Limited Label Data,” which explores active learning as a means to build applications requiring only a relatively small set of labeled data. We start our conversation with a review of active learning and some of the reasons why it’s recently become an interesting technology for folks building systems based on deep learning. We then discuss some of the differences between active learning approaches or implementations, and some of the common requirements of an active learning system. Finally, we touch on some packaged offerings in the marketplace that include active learning, including Amazon’s SageMaker Ground Truth, and review Shioulin’s tips for getting started with the technology. The complete show notes for this episode can be found at https://twimlai.com/talk/255. For more from the Strata Data conference series, visit twimlai.com/stratasf19. I want to send a quick thanks to our friends at Cloudera for their sponsorship of this series of podcasts from the Strata Data Conference, which they present along with O’Reilly Media. Cloudera’s long been a supporter of the podcast; in fact, they sponsored the very first episode of TWiML Talk, recorded back in 2016. Since that time Cloudera has continued to invest in and build out its platform, which already securely hosts huge volumes of enterprise data, to provide enterprise customers with a modern environment for machine learning and analytics that works both in the cloud as well as the data center. In addition, Cloudera Fast Forward Labs provides research and expert guidance that helps enterprises understand the realities of building with AI technologies without needing to hire an in-house research team.
To learn more about what the company is up to and how they can help, visit Cloudera’s Machine Learning resource center at cloudera.com/ml.
Cory Minton and Brett Roberts sit down with their friend from across the pond, Paul Brook, to talk about the book he wrote, titled “Life of AI: AI today, tomorrow, and in the future and what this may mean for human kind.” Paul talks about what inspired him to write a book (spoiler: it was at a bar) about Artificial Intelligence and what went into creating it. Paul and the team then explore different aspects of Artificial Intelligence, from bias and regulation to hardware, and discuss how AI will evolve from man-made AI to autonomous AI in the future. Tune in to get some brilliant insights into all things AI. O’Reilly Media’s Strata Data is coming to San Francisco on March 25 - 28, 2019 and YOU have a chance to win a free pass to this fantastic conference. All you have to do is subscribe to our BrightTalk channel between now and February 1st and you are entered to win. You can also use Promo Code PCBEARD for 20% off your conference pass. Music from this episode is by Andrew Belle. Please go check him out...you'll thank us!
For the end-of-year holiday episode of the Data Show, I turned the tables on Data Show host Ben Lorica to talk about trends in big data, machine learning, and AI, and what to look for in 2019. Lorica also showcased some highlights from our upcoming Strata Data and Artificial Intelligence conferences. Here are some highlights […]
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
In this episode of our AI Platforms series, we’re joined by Atul Kale, Engineering Manager on the machine learning infrastructure team at Airbnb. Atul and I met at the Strata Data conference a while back to discuss Airbnb’s internal machine learning platform, Bighead. In our conversation, Atul outlines the ML lifecycle at Airbnb and how the various components of Bighead support it. We then dig into the major components of Bighead, which include Redspot, their supercharged Jupyter notebook service, Deep Thought, their real-time inference environment, Zipline, their data management platform, and quite a few others. We also take a look at some of Atul’s best practices for scaling machine learning, and discuss a special announcement that Atul and his team made at Strata. For the complete show notes, visit twimlai.com/talk/198. For more information on the AI Platforms series, visit twimlai.com/aiplatforms.
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
In this episode, we’re joined by Garrett Hoffman, Director of Data Science at Stocktwits. Garrett and I caught up at last month’s Strata Data conference, where he presented a tutorial on “Deep Learning Methods for NLP with Emphasis on Financial Services.” Stocktwits is a social network for the investing community which has its roots in the use of the $cashtag on Twitter. In our conversation, we discuss applications such as Stocktwits’ own use of “social sentiment graphs” built on multilayer LSTM networks to gauge community sentiment about certain stocks in real time, as well as the more general use of natural language processing for generating trading ideas. I’d also like to send a huge thanks to our friends at IBM for their sponsorship of this episode. Are you interested in exploring code patterns leveraging multiple technologies, including ML and AI? Then check out IBM Developer. With more than 100 open source programs, a library of knowledge resources, developer advocates ready to help, and a global community of developers, what in the world will you create? Dive in at https://ibm.biz/mlaipodcast, and be sure to let them know that TWiML sent you! For the complete show notes for this episode, visit https://twimlai.com/talk/194.
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
In this episode of our Strata Data conference series, we’re joined by Ahsan Ashraf, data scientist at Pinterest. In our conversation, Ahsan and I discuss his presentation from the conference, “Diversification in recommender systems: Using topical variety to increase user satisfaction.” We cover the experiments his team ran to explore the impact of diversification in user’s boards, the methodology his team used to incorporate variety into the Pinterest recommendation system, the metrics they monitored through the process, and how they performed sensitivity sanity testing. The show notes for this episode can be found at https://twimlai.com/talk/187.
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
In today's episode we’ll be taking a break from our Strata Data conference series and presenting a special conversation with Jeremy Howard, founder and researcher at Fast.ai. Fast.ai is a company many of our listeners are quite familiar with due to their popular deep learning course. This episode is being released today in conjunction with the company’s announcement of version 1.0 of their fastai library at the inaugural PyTorch Devcon in San Francisco. Jeremy and I cover a ton of ground in this conversation. Of course, we dive into the new library and explore why it’s important and what’s changed. We also explore the unique way in which it was developed and what it means for the future of the fast.ai courses. Jeremy shares a ton of great insights and lessons learned in this conversation, and also mentions a bunch of really interesting-sounding papers. The complete show notes, and links to the fastai library can be found here.
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
In this episode of our Strata Data conference series, we’re joined by Justin Norman, Director of Research and Data Science Services at Cloudera Fast Forward Labs. Fast Forward Labs was an Applied AI research firm and consultancy founded by Hilary Mason, whose TWiML Talk episode remains an all-time fan favorite. My chat with Justin took place on the 1 year anniversary of Fast Forward Labs’ acquisition by Cloudera, so we start with an update on the company before diving into a look at some of their recent and upcoming research projects. Specifically, we discuss their recent report on Multi-Task Learning and their upcoming research into Federated Machine Learning for AI at the edge. To learn more about Cloudera and CFFL, visit Cloudera's Machine Learning resource center at cloudera.com/ml. For the complete show notes, visit https://twimlai.com/talk/185.
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
In today’s episode of our Strata Data series, we’re joined by Viviana Acquaviva, Associate Professor at City Tech, the New York City College of Technology. Viviana led a tutorial at the conference, titled “Learning Machine Learning using Astronomy data sets.” In our conversation, we begin by discussing an ongoing project she’s a part of called the “Hobby-Eberly Telescope Dark Energy eXperiment,” or HETDEX. In this project, Viviana tackles the challenge of understanding how and why the expansion of the universe is accelerating, which is directly contrary to the principles of gravity. We discuss her motivation for undertaking this project, how she gets her data, the models she uses, and how she evaluates their performance. The complete show notes can be found at https://twimlai.com/talk/184.
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
In this episode of our Strata Data series we’re joined by James Dreiss, Senior Data Scientist at international news syndicate Reuters. James and I sat down to discuss his talk from the conference “Document vectors in the wild, building a content recommendation system,” in which he details how Reuters implemented document vectors to recommend content to users of their new “infinite scroll” page layout. In our conversation we take a look at what document vectors are and how they’re created, how they tested the accuracy of their models, and the future of embeddings for natural language processing. The complete show notes for this episode can be found at https://twimlai.com/talk/183. For more info on the Strata Data Conference Series, visit https://twimlai.com/stratany2018.
“Ink or pixels, journalism doesn’t change.” The team sits down with Greg Doufas, CTO of The Globe and Mail, the national Canadian newspaper and a 170-year-old media company, to explore how data analytics is being used to enhance customer experiences, drive engagement and fuel growth. Erin K. Banks & Cory Minton dive into the pages of this brilliant company’s history and understand how The Globe and Mail is using analytics to write their future by retaining and capturing new subscribers through their digital transformation. Strata Data London is right around the corner and as with all Strata Data and AI conferences, use promo code PCBEARD to get 20% off conference passes! Full show notes can be found at http://bit.ly/BDB_EP27. Music from this episode is by Andrew Belle.
This week's episode is for data scientists, sure, but also for data science managers and executives at companies with data science teams. These folks all think very differently about the same question: what should a data science team be working on? And how should that decision be made? That's the subject of a talk that I (Katie) gave at Strata Data in early March, about how my co-department head and I select projects for our team to work on. We have several goals in data science project selection at Civis Analytics (where I work), which can be summarized under "balance the best attributes of bottom-up and top-down decision-making." We achieve this balance, or at least get pretty close, using a process we've come to call the Idea Factory (after a great book about Bell Labs). This talk is about that process, how it works in the real world of a data science company and how we see it working in the data science programs of other companies. Relevant links: https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63905
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
This week I’ve invited my friend Ben Lorica onto the show. Ben is Chief Data Scientist for O’Reilly Media, and Program Director of Strata Data & the O'Reilly A.I. conference. Ben has worked on analytics and machine learning in the finance and retail industries, and serves as an advisor for nearly a dozen startups. In his role at O’Reilly he’s responsible for the content for 7 major conferences around the world each year. In the show we discuss all of that, touching on how publishers can take advantage of machine learning and data mining, how the role of “data scientist” is evolving and the emergence of the machine learning engineer, and a few of the hot technologies, trends and companies that he’s seeing arise around the world. The notes for this show can be found at twimlai.com/talk/26