Podcasts about OctoML

  • 25 podcasts
  • 32 episodes
  • 41m average duration
  • Infrequent episodes
  • Latest episode: Oct 28, 2024

POPULARITY

[Popularity chart, 2017–2024]


Best podcasts about OctoML

Latest podcast episodes about OctoML

The Thesis Review
[48] Tianqi Chen - Scalable and Intelligent Learning Systems

The Thesis Review

Oct 28, 2024 · 46:29


Tianqi Chen is an Assistant Professor in the Machine Learning Department and Computer Science Department at Carnegie Mellon University and the Chief Technologist of OctoML. His research focuses on the intersection of machine learning and systems. Tianqi's PhD thesis is titled "Scalable and Intelligent Learning Systems," which he completed in 2019 at the University of Washington. We discuss his influential work on machine learning systems, starting with the development of XGBoost, an optimized distributed gradient boosting library that has had an enormous impact in the field. We also cover his contributions to deep learning frameworks like MXNet and machine learning compilation with TVM, and connect these to modern generative AI. - Episode notes: www.wellecks.com/thesisreview/episode48.html - Follow the Thesis Review (@thesisreview) and Sean Welleck (@wellecks) on Twitter - Follow Tianqi Chen on Twitter (@tqchenml) - Support The Thesis Review at www.patreon.com/thesisreview or www.buymeacoffee.com/thesisreview
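For readers who have not used the library discussed in this episode, here is a minimal, hedged sketch of XGBoost's scikit-learn-style API; the synthetic dataset and hyperparameters are purely illustrative, not anything from the episode:

```python
# Illustrative XGBoost usage; dataset and hyperparameters are arbitrary choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient-boosted decision trees with a modest ensemble size for the example.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```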

Open Source Startup Podcast
E114: How OctoML Helps Developers Build with Llama 2 & Stable Diffusion

Open Source Startup Podcast

Nov 7, 2023 · 42:50


Tianqi Chen is Co-Founder and Chief Technologist of OctoML, the compute infrastructure platform for tuning and running generative models in the cloud. OctoML was founded by the creators of Apache TVM, the machine learning compiler framework for CPUs, GPUs, and accelerators. OctoML has raised $132M from investors including Amplify, Addition, Madrona, and Tiger. In this episode, we discuss the importance of supporting multiple models, the advancements from LLaMA and Stable Diffusion this year, building the TVM and OctoML communities, predictions on GenAI in the enterprise (hybrid ML, for example), whether GenAI is over-invested in, and more!

The Cloudcast
Economics & Optimization of AI/ML

The Cloudcast

Aug 30, 2023 · 35:55


Luis Ceze (@luisceze, Founder/CEO @OctoML) talks about barriers to entry for AI & ML, and the economics of funding, training, fine-tuning, inferencing, and optimization.

SHOW: 749

CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw

NEW TO CLOUD? CHECK OUT - "CLOUDCAST BASICS"

SHOW SPONSORS:
CloudZero – Cloud Cost Visibility and Savings. CloudZero provides immediate and ongoing savings with 100% visibility into your total cloud spend.
Reduce the complexities of protecting your workloads and applications in a multi-cloud environment. Panoptica provides comprehensive cloud workload protection integrated with API security to protect the entire application lifecycle. Learn more about Panoptica at panoptica.app

SHOW NOTES:
OctoML (homepage)
OctoML makes it easier to put AI/ML models into production
OctoML launches OctoAI

Topic 1 - Welcome to the show. You have an interesting background with roots in both VC markets and academia. Tell us a little bit about your background.

Topic 2 - Generative AI is now all the rage. But as more people dig into AI/ML in general, they quickly find there are a few barriers to entry. Let's address some of them, as you have an extensive history here. The first barrier I believe most people hit is complexity. The tools to ingest data into models and to deploy models have improved, but what about the challenges of implementing that in production applications? How do folks overcome this first hurdle?

Topic 3 - The next hurdle I think most organizations hit is where to place the models. Where to train them, where to fine-tune them, and where to run them could be the same or different places. Can you talk a bit about placement of models? Also, as a follow-up, how do GPU shortages play into this, and can models be fine-tuned to work around this?

Topic 4 - Do you see the AI/ML dependence on GPUs continuing into the future? Will there be an abstraction layer or another technology coming that will allow the industry to move away from GPUs for more mainstream applications?

Topic 5 - The next barrier, closely related to the previous one, is cost. There are some very real-world trade-offs between cost and performance when it comes to AI/ML. What cost factors need to be considered besides hardware costs? Data ingestion and data gravity come to mind as hidden costs that can add up quickly if not properly considered. Another one is latency. Maybe you arrive at an answer, but at a slower rate that is more economical. How do organizations optimize for cost?

Topic 6 - Do most organizations tend to use an "off the shelf" model today? Maybe an open source model that they train with their private data? I would expect this to be the fastest way to production; why build your own model when the difference is in your data? How does data privacy factor into this scenario?

FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @thecloudcastnet

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
LLMs Everywhere: Running 70B models in browsers and iPhones using MLC — with Tianqi Chen of CMU / OctoML

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Aug 10, 2023 · 52:10


We have just announced our first set of speakers at AI Engineer Summit! Sign up for the livestream or email sponsors@ai.engineer if you'd like to support.

We are facing a massive GPU crunch. As both startups and VCs hoard Nvidia GPUs like countries count nuclear stockpiles, tweets about GPU shortages have become increasingly common. But what if we could run LLMs with AMD cards, or without a GPU at all? There's just one weird trick: compilation. And there's one person uniquely qualified to do it.

We had the pleasure to sit down with Tianqi Chen, who's an Assistant Professor at CMU, where he both teaches the MLC course and runs the MLC group. You might also know him as the creator of XGBoost, Apache TVM, and MXNet, as well as the co-founder of OctoML. The MLC (short for Machine Learning Compilation) group has released a lot of interesting projects:

* MLC Chat: an iPhone app that lets you run models like RedPajama-3B and Vicuna-7B on-device. It gets up to 30 tok/s!
* Web LLM: Run models like LLaMA-70B in your browser (!!) to offer local inference in your product.
* MLC LLM: a framework that allows any language models to be deployed natively on different hardware and software stacks.

The MLC group has just announced new support for AMD cards; we previously talked about the shortcomings of ROCm, but using MLC you can get performance very close to NVIDIA's counterparts. This is great news for founders and builders, as AMD cards are more readily available. Here are their latest results on AMD's 7900s vs some of the top NVIDIA consumer cards.

If you just can't get a GPU at all, MLC LLM also supports ARM and x86 CPU architectures as targets by leveraging LLVM. While speed performance isn't comparable, it allows for non-time-sensitive inference to be run on commodity hardware.

We also enjoyed getting a peek into TQ's process, which involves a lot of sketching:

With all the other work going on in this space with projects like ggml and Ollama, we're excited to see GPUs becoming less and less of an issue to get models in the hands of more people, and innovative software solutions to hardware problems!

Show Notes
* TQ's Projects:
* XGBoost
* Apache TVM
* MXNet
* MLC
* OctoML
* CMU Catalyst
* ONNX
* GGML
* Mojo
* WebLLM
* RWKV
* HiPPO
* Tri Dao's Episode
* George Hotz Episode

People:
* Carlos Guestrin
* Albert Gu

Timestamps
* [00:00:00] Intros
* [00:03:41] The creation of XGBoost and its surprising popularity
* [00:06:01] Comparing tree-based models vs deep learning
* [00:10:33] Overview of TVM and how it works with ONNX
* [00:17:18] MLC deep dive
* [00:28:10] Using int4 quantization for inference of language models
* [00:30:32] Comparison of MLC to other model optimization projects
* [00:35:02] Running large language models in the browser with WebLLM
* [00:37:47] Integrating browser models into applications
* [00:41:15] OctoAI and self-optimizing compute
* [00:45:45] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, writer and editor of Latent Space. [00:00:20]Swyx: Okay, and we are here with Tianqi Chen, or TQ as people call him, who is Assistant Professor in ML and computer science at CMU, Carnegie Mellon University, also helping to run the Catalyst Group, also Chief Technologist of OctoML. You wear many hats. Are those, you know, your primary identities these days? Of course, of course. [00:00:42]Tianqi: I'm also, you know, very enthusiastic about open source. 
So I'm also a VP and PMC member of the Apache TVM project and so on. But yeah, these are the things I've been up to so far. [00:00:53]Swyx: Yeah. So you did Apache TVM, XGBoost, and MXNet, and we can cover any of those in any amount of detail. But maybe what's one thing about you that people might not learn from your official bio or LinkedIn, you know, on the personal side? [00:01:08]Tianqi: Let me say, yeah, so normally when I do, I really love coding, even though like I'm trying to run all those things. So one thing that I keep a habit on is I try to do sketchbooks. I have a book, like real sketchbooks to draw down the design diagrams and the sketchbooks I keep sketching over the years, and now I have like three or four of them. And it's kind of a usually a fun experience of thinking the design through and also seeing how open source project evolves and also looking back at the sketches that we had in the past to say, you know, all these ideas really turn into code nowadays. [00:01:43]Alessio: How many sketchbooks did you get through to build all this stuff? I mean, if one person alone built one of those projects, he'll be a very accomplished engineer. Like you built like three of these. What's that process like for you? Like it's the sketchbook, like the start, and then you think about the code or like. [00:01:59]Swyx: Yeah. [00:02:00]Tianqi: So, so usually I start sketching on high level architectures and also in a project that works for over years, we also start to think about, you know, new directions, like of course generative AI language model comes in, how it's going to evolve. So normally I would say it takes like one book a year, roughly at that rate. It's usually fun to, I find it's much easier to sketch things out and then gives a more like a high level architectural guide for some of the future items. Yeah. [00:02:28]Swyx: Have you ever published these sketchbooks? Cause I think people would be very interested on, at least on a historical basis. Like this is the time where XGBoost was born, you know? Yeah, not really. [00:02:37]Tianqi: I started sketching like after XGBoost. So that's a kind of missing piece, but a lot of design details in TVM are actually part of the books that I try to keep a record of. [00:02:48]Swyx: Yeah, we'll try to publish them and publish something in the journals. Maybe you can grab a little snapshot for visual aid. Sounds good. [00:02:57]Alessio: Yeah. And yeah, talking about XGBoost, so a lot of people in the audience might know it's a gradient boosting library, probably the most popular out there. And it became super popular because many people started using them in machine learning competitions. And I think there's like a whole Wikipedia page of like all state-of-the-art models. They use XGBoost and like, it's a really long list. When you were working on it, so we just had Tri Dao, who's the creator of FlashAttention on the podcast. And I asked him this question, it's like, when you were building FlashAttention, did you know that like almost any transformer-based model will use it? And so I asked the same question to you when you were coming up with XGBoost, like, could you predict it would be so popular or like, what was the creation process? And when you published it, what did you expect? We have no idea. [00:03:41]Tianqi: Like, actually, the original reason that we built that library is that at that time, deep learning just came out. Like that was the time where AlexNet just came out. 
And one of the ambitious missions that myself and my advisor, Carlos Guestrin, had then is we want to think about, you know, try to test the hypothesis. Can we find alternatives to deep learning models? Because then, you know, there are other alternatives like, you know, support vector machines, linear models, and of course, tree-based models. And our question was, if you build those models and feed them with big enough data, because usually like one of the key characteristics of deep learning is that it's taking a lot [00:04:22]Swyx: of data, right? [00:04:23]Tianqi: So we will be able to get the same amount of performance. That's a hypothesis we're setting out to test. Of course, if you look at now, right, that's a wrong hypothesis, but as a byproduct, what we find out is that, you know, most of the gradient boosting library out there is not efficient enough for us to test that hypothesis. So I happen to have quite a bit of experience in the past of building gradient boosting trees and their variants. So XGBoost was kind of like a byproduct of that hypothesis testing. At that time, I'm also competing a bit in data science challenges, like I worked on KDDCup and then Kaggle kind of become bigger, right? So I kind of think maybe it's becoming useful to others. One of my friends convinced me to try to do a Python binding of it. That turned out to be a very good decision, right, to be effective. Usually when I build it, we feel like maybe a command line interface is okay. And now we have a Python binding, we have R bindings. And then we realized, you know, it started getting interesting. People started contributing different perspectives, like visualization and so on. So we started to push a bit more on to building distributed support to make sure it works on any platform and so on. And even at that time point, when I talked to Carlos, my advisor, later, he said he never anticipated that we'll get to that level of success. And actually, why I pushed for gradient boosting trees, interestingly, at that time, he also disagreed. He thinks that maybe we should go for kernel machines then. And it turns out, you know, actually, we are both wrong in some sense, and Deep Neural Network was the king of the hill. But at least the gradient boosting direction got into something fruitful. [00:06:01]Swyx: Interesting. [00:06:02]Alessio: I'm always curious when it comes to these improvements, like, what's the design process in terms of like coming up with it? And how much of it is a collaborative with like other people that you're working with versus like trying to be, you know, obviously, in academia, it's like very paper-driven kind of research driven. [00:06:19]Tianqi: I would say the XGBoost improvement at that time point was more on like, you know, I'm trying to figure out, right. But it's combining lessons. Before that, I did work on some of the other libraries on matrix factorization. That was like my first open source experience. Nobody knew about it, because you'll find, likely, if you go and try to search for the package SVDFeature, you'll find some SVN repo somewhere. But it's actually being used for some of the recommender system packages. So I'm trying to apply some of the previous lessons there and trying to combine them. The later projects like MXNet and then TVM is much, much more collaborative in a sense that... But, of course, XGBoost has become bigger, right? So when we started that project myself, and then we have, it's really amazing to see people come in. 
Michael, who was a lawyer, and now he works on the AI space as well, on contributing visualizations. Now we have people from our community contributing different things. So XGBoost even today, right, it's a community of committers driving the project. So it's definitely something collaborative and moving forward on getting some of the things continuously improved for our community. [00:07:37]Alessio: Let's talk a bit about TVM too, because we got a lot of things to run through in this episode. [00:07:42]Swyx: I would say that at some point, I'd love to talk about this comparison between XGBoost or tree-based type AI or machine learning compared to deep learning, because I think there is a lot of interest around, I guess, merging the two disciplines, right? And we can talk more about that. I don't know where to insert that, by the way, so we can come back to it later. Yeah. [00:08:04]Tianqi: Actually, what I said, when we test the hypothesis, the hypothesis is kind of, I would say it's partially wrong, because the hypothesis we want to test now is, can you run tree-based models on image classification tasks, where deep learning is certainly a no-brainer right [00:08:17]Swyx: now today, right? [00:08:18]Tianqi: But if you try to run it on tabular data, still, you'll find that most people opt for tree-based models. And there's a reason for that, in the sense that when you are looking at tree-based models, the decision boundaries are naturally rules that you're looking at, right? And they also have nice properties, like being able to be agnostic to scale of input and be able to automatically compose features together. And I know there are attempts on building neural network models that work for tabular data, and I also sometimes follow them. I do feel like it's good to have a bit of diversity in the modeling space. Actually, when we're building TVM, we build cost models for the programs, and actually we are using XGBoost for that as well. I still think tree-based models are going to be quite relevant, because first of all, it's really easy to get it to work out of the box. And also, you will be able to get a bit of interpretability and control monotonicity [00:09:18]Swyx: and so on. [00:09:19]Tianqi: So yes, it's still going to be relevant. I also sometimes keep coming back to think about, are there possible improvements that we can build on top of these models? And definitely, I feel like it's a space that can have some potential in the future. [00:09:34]Swyx: Are there any current projects that you would call out as promising in terms of merging the two directions? [00:09:41]Tianqi: I think there are projects that try to bring a transformer-type model for tabular data. I don't remember specifics of them, but I think even nowadays, if you look at what people are using, tree-based models are still one of their toolkits. So I think maybe eventually it's not even a replacement, it will be just an ensemble of models that you can call. Perfect. [00:10:07]Alessio: Next up, about three years after XGBoost, you built this thing called TVM, which is now a very popular compiler framework for models. Let's talk about, so this came out about at the same time as ONNX. So I think it would be great if you could maybe give a little bit of an overview of how the two things work together. Because it's kind of like the model, then goes to ONNX, then goes to the TVM. But I think a lot of people don't understand the nuances. Can we get a bit of a backstory on that? 
[00:10:33]Tianqi: So actually, that's kind of an ancient history. Before XGBoost, I worked on deep learning for two years or three years. I got a master's before I started my PhD. And during my master's, my thesis focused on applying convolutional restricted Boltzmann machine for ImageNet classification. That is the thing I'm working on. And that was before AlexNet moment. So effectively, I had to handcraft NVIDIA CUDA kernels on, I think, a GTX 2070 card. It took me about six months to get one model working. And eventually, that model is not so good, and we should have picked a better model. But that was like an ancient history that really got me into this deep learning field. And of course, eventually, we find it didn't work out. So in my master's, I ended up working on recommender system, which got me a paper, and I applied and got a PhD. But I always want to come back to work on the deep learning field. So after XGBoost, I think I started to work with some folks on this particular MXNet. At that time, it was like the frameworks of Caffe, Theano, PyTorch haven't yet come out. And we're really working hard to optimize for performance on GPUs. At that time, I found it's really hard, even for NVIDIA GPU. It took me six months. And then it's amazing to see on different hardwares how hard it is to go and optimize code for the platforms that are interesting. So that gets me thinking, can we build something more generic and automatic? So that I don't need an entire team of so many people to go and build those frameworks. So that's the motivation of starting working on TVM. There is really too much machine learning engineering needed to support deep learning models on the platforms that we're interested in. I think it started a bit earlier than ONNX, but once it got announced, I think it's in a similar time period at that time. So overall, how it works is that TVM, you will be able to take a subset of machine learning programs that are represented in what we call a computational graph. Nowadays, we can also represent a loop-level program ingest from your machine learning models. Usually, you have model formats ONNX, or in PyTorch, they have FX Tracer that allows you to trace the FX graph. And then it goes through TVM. We also realized that, well, yes, it needs to be more customizable, so it will be able to perform some of the compilation optimizations like fusing operators together, doing smart memory planning, and more importantly, generate low-level code. So that works for NVIDIA and also is portable to other GPU backends, even non-GPU backends [00:13:36]Swyx: out there. [00:13:37]Tianqi: So that's a project that actually has been my primary focus over the past few years. And it's great to see how it started from where I think we are the very early initiator of machine learning compilation. I remember there was a visit one day, one of the students asked me, are you still working on deep learning frameworks? I tell them that I'm working on ML compilation. And they said, okay, compilation, that sounds very ancient. It sounds like a very old field. And why are you working on this? And now it's starting to get more traction, like if you say Torch Compile and other things. I'm really glad to see this field starting to pick up. And also we have to continue innovating here. [00:14:17]Alessio: I think the other thing that I noticed is, it's kind of like a big jump in terms of area of focus to go from XGBoost to TVM, it's kind of like a different part of the stack. 
Why did you decide to do that? And I think the other thing about compiling to different GPUs and eventually CPUs too, did you already see some of the strain that models could have just being focused on one runtime, only being on CUDA and that, and how much of that went into it? [00:14:50]Tianqi: I think it's less about trying to get impact, more about wanting to have fun. I like to hack code, I had great fun hacking CUDA code. Of course, being able to generate CUDA code is cool, right? But now, after being able to generate CUDA code, okay, by the way, you can do it on other platforms, isn't that amazing? So it's more of that attitude to get me started on this. And also, I think when we look at different researchers, myself is more like a problem solver type. So I like to look at a problem and say, okay, what kind of tools we need to solve that problem? So regardless, it could be building better models. For example, while we build XGBoost, we build certain regularizations into it so that it's more robust. It also means building system optimizations, writing low-level code, maybe trying to write assembly and build compilers and so on. So as long as they solve the problem, definitely go and try to do them together. And I also see it's a common trend right now. Like if you want to be able to solve machine learning problems, it's no longer at the algorithm layer, right? You kind of need to solve it from both the algorithm, data, and systems angles. And this entire field of machine learning system, I think it's kind of emerging. And there's now a conference around it. And it's really good to see a lot more people are starting to look into this. [00:16:10]Swyx: Yeah. Are you talking about ICML or something else? [00:16:13]Tianqi: So machine learning and systems, right? So not only machine learning, but machine learning and system. So there's a conference called MLSys. It's definitely a smaller community than ICML, but I think it's also an emerging and growing community where people are talking about what are the implications of building systems for machine learning, right? And how do you go and optimize things around that and co-design models and systems together? [00:16:37]Swyx: Yeah. And you were area chair for ICML and NeurIPS as well. So you've just had a lot of conference and community organization experience. Is that also an important part of your work? Well, it's kind of expected for academic. [00:16:48]Tianqi: If I hold an academic job, I need to do services for the community. Okay, great. [00:16:53]Swyx: Your most recent venture in MLSys is going to the phone with MLC LLM. You announced this in April. I have it on my phone. It's great. I'm running Llama 2, Vicuna. I don't know what other models that you offer. But maybe just kind of describe your journey into MLC. And I don't know how this coincides with your work at CMU. Is that some kind of outgrowth? [00:17:18]Tianqi: I think it's more like a focused effort that we want in the area of machine learning compilation. So it's kind of related to what we built in TVM. So when we built TVM was five years ago, right? And a lot of things happened. We built the end-to-end machine learning compiler that works, the first one that works. But then we captured a lot of lessons there. So then we are building a second iteration called TVM Unity. That allows us to be able to allow ML engineers to be able to quickly capture the new model and how we demand building optimizations for them. And MLC LLM is kind of like an MLC. 
It's more like a vertical driven organization that we go and build tutorials and go and build projects like LLM to solutions. So that to really show like, okay, you can take machine learning compilation technology and apply it and bring something fun forward. Yeah. So yes, it runs on phones, which is really cool. But the goal here is not only making it run on phones, right? The goal is making it deploy universally. So we do run on Apple M2 Macs, the 70 billion models. Actually, on a single batch inference, more recently on CUDA, we get, I think, the best performance you can get out there already on the 4-bit inference. Actually, as I alluded earlier before the podcast, we just had a result on AMD. And on a single batch, actually, we can get the latest AMD GPU. This is a consumer card. It can get to about 80% of the 4090, so NVIDIA's best consumer card out there. So it's not yet on par, but thinking about how diversity and what you can enable and the previous things you can get on that card, it's really amazing that what you can do with this kind of technology. [00:19:10]Swyx: So one thing I'm a little bit confused by is that most of these models are in PyTorch, but you're running this inside a TVM. I don't know. Was there any fundamental change that you needed to do, or was this basically the fundamental design of TVM? [00:19:25]Tianqi: So the idea is that, of course, it comes back to program representation, right? So effectively, TVM has this program representation called TVM script that contains more like computational graph and operational representation. So yes, initially, we do need to take a bit of effort of bringing those models onto the program representation that TVM supports. Usually, there are a mix of ways, depending on the kind of model you're looking at. For example, for vision models and stable diffusion models, usually we can just do tracing that takes PyTorch model onto TVM. That part is still being robustified so that we can bring more models in. On language model tasks, actually what we do is we directly build some of the model constructors and try to directly map from Hugging Face models. The goal is if you have a Hugging Face configuration, we will be able to bring that in and apply optimization on them. So one fun thing about model compilation is that your optimization doesn't happen only at the source-language level, right? For example, if you're writing PyTorch code, you just go and try to use a better fused operator at a source code level. Torch compile might help you do a bit of things in there. In most of the model compilations, it not only happens at the beginning stage, but we also apply generic transformations in between, also through a Python API. So you can tweak some of that. So that part of optimization helps a lot of uplifting in getting both performance and also portability on the environment. And another thing that we do have is what we call universal deployment. So if you get the ML program into this TVM script format, where there are functions that takes in tensor and output tensor, we will be able to have a way to compile it. So they will be able to load the function in any of the language runtime that TVM supports. So if you could load it in JavaScript, and that's a JavaScript function that you can take in tensors and output tensors. If you're loading Python, of course, and C++ and Java. So the goal there is really bring the ML model to the language that people care about and be able to run it on a platform they like. 
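For readers who want to see what "tracing a PyTorch model onto TVM" looks like in practice, here is a hedged sketch using TVM's older Relay front end; the model, input shape, and target are placeholder choices, and the newer TVM Unity / Relax path that MLC LLM uses differs in its APIs, though the tensor-in, tensor-out flavor is the same:

```python
# Hedged sketch: compiling a traced PyTorch vision model with TVM's Relay front end.
import torch
import torchvision
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Trace an example model (placeholder choice of network and input shape).
model = torchvision.models.resnet18(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(model, example_input)

# Import the traced graph into TVM's intermediate representation.
mod, params = relay.frontend.from_pytorch(scripted, [("input0", (1, 3, 224, 224))])

# Compile for a CPU target; "cuda", "metal", "vulkan", or "webgpu" are other targets.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

# Run the compiled module: a function that takes tensors in and gives tensors out.
dev = tvm.cpu()
rt = graph_executor.GraphModule(lib["default"](dev))
rt.set_input("input0", tvm.nd.array(example_input.numpy()))
rt.run()
print(rt.get_output(0).shape)
```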
[00:21:37]Swyx: It strikes me that I've talked to a lot of compiler people, but you don't have a traditional compiler background. You're inventing your own discipline called machine learning compilation, or MLC. Do you think that this will be a bigger field going forward? [00:21:52]Tianqi: First of all, I do work with people working on compilation as well. So we're also taking inspirations from a lot of early innovations in the field. Like for example, TVM initially, we take a lot of inspirations from Halide, which is just an image processing compiler. And of course, since then, we have evolved quite a bit to focus on the machine learning related compilations. If you look at some of our conference publications, you'll find that machine learning compilation is already kind of a subfield. So if you look at papers in both machine learning venues, the MLSys conference, of course, and also system venues, every year there will be papers around machine learning compilation. And in the compiler conference called CGO, there's a C4ML workshop that also kind of trying to focus on this area. So definitely it's already starting to gain traction and becoming a field. I wouldn't claim that I invented this field, but definitely I helped to work with a lot of folks there. And I try to bring a perspective, of course, trying to learn a lot from the compiler optimizations as well as trying to bring in knowledge in machine learning and systems together. [00:23:07]Alessio: So we had George Hotz on the podcast a few episodes ago, and he had a lot to say about AMD and their software. So when you think about TVM, are you still restricted in a way by the performance of the underlying kernel, so to speak? So if your target is like a CUDA runtime, you still get better performance, no matter like TVM kind of helps you get there, but then that level you don't take care of, right? [00:23:34]Swyx: There are two parts in here, right? [00:23:35]Tianqi: So first of all, there is the lower level runtime, like CUDA runtime. And then actually for NVIDIA, a lot of the moat came from their libraries, like CUTLASS, cuDNN, right? Those library optimizations. And also for specialized workloads, actually you can specialize them. Because a lot of cases you'll find that if you go and do benchmarks, it's very interesting. Like two years ago, if you try to benchmark ResNet, for example, usually the NVIDIA library [00:24:04]Swyx: gives you the best performance. [00:24:06]Tianqi: It's really hard to beat them. But as soon as you start to change the model to something, maybe a bit of a variation of ResNet, not for the traditional ImageNet detections, but for latent detection and so on, there will be some room for optimization because people sometimes overfit to benchmarks. These are people who go and optimize things, right? So people overfit the benchmarks. So that's the largest barrier, like being able to get a low level kernel libraries, right? In that sense, the goal of TVM is actually we try to have a generic layer to both, of course, leverage libraries when available, but also be able to automatically generate [00:24:45]Swyx: libraries when possible. [00:24:46]Tianqi: So in that sense, we are not restricted by the libraries that they have to offer. That's why we will be able to run Apple M2 or WebGPU where there's no library available because we are kind of like automatically generating libraries. That makes it easier to support less well-supported hardware, right? For example, WebGPU is one example. 
From a runtime perspective, AMD, I think before their Vulkan driver was not very well supported. Recently, they are getting good. But even before that, we'll be able to support AMD through this GPU graphics backend called Vulkan, which is not as performant, but it gives you a decent portability across those [00:25:29]Swyx: hardware. [00:25:29]Alessio: And I know we got other MLC stuff to talk about, like WebLLM, but I want to wrap up on the optimization that you're doing. So there's kind of four core things, right? Kernel fusion, which we talked a bit about in the FlashAttention episode and the tinygrad one, memory planning, and loop optimization. I think those are like pretty, you know, self-explanatory. I think the one that people have the most questions, can you can you quickly explain [00:25:53]Swyx: those? [00:25:54]Tianqi: So there are kind of a different things, right? Kernel fusion means that, you know, if you have an operator like Convolutions or in the case of a transformer like an MLP, you have other operators that follow that, right? You don't want to launch two GPU kernels. You want to be able to put them together in a smart way, right? And as a memory planning, it's more about, you know, hey, if you run like Python code, every time when you generate a new array, you are effectively allocating a new piece of memory, right? Of course, PyTorch and other frameworks try to optimize for you. So there is a smart memory allocator behind the scene. But actually, in a lot of cases, it's much better to statically allocate and plan everything ahead of time. And that's where like a compiler can come in. We need to, first of all, actually for language model, it's much harder because dynamic shape. So you need to be able to what we call symbolic shape tracing. So we have like a symbolic variable that tells you like the shape of the first tensor is n by 12. And the shape of the third tensor is also n by 12. Or maybe it's n times 2 by 12. Although you don't know what n is, right? But you will be able to know that relation and be able to use that to reason about like fusion and other decisions. So besides this, I think loop transformation is quite important. And it's actually non-traditional. Originally, if you simply write a code and you want to get a performance, it's very hard. For example, you know, if you write a matrix multiplier, the simplest thing you can do is you do for i, j, k, C[i][j] plus equals, you know, A[i][k] times B[k][j]. But that code is 100 times slower than the best available code that you can get. So we do a lot of transformation, like being able to take the original code, trying to put things into shared memory, and making use of tensor cores, making use of memory copies, and all this. Actually, all these things, we also realize that, you know, we cannot do all of them. So we also make the ML compilation framework as a Python package, so that people will be able to continuously improve that part of engineering in a more transparent way. So we find that's very useful, actually, for us to be able to get good performance very quickly on some of the new models. Like when Llama 2 came out, we'll be able to go and look at the whole, here's the bottleneck, and we can go and optimize those. [00:28:10]Alessio: And then the fourth one being weight quantization. So everybody wants to know about that. And just to give people an idea of the memory saving, if you're doing FP32, it's like four bytes per parameter. Int8 is like one byte per parameter. 
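To make the loop-transformation point concrete, here is a hedged, minimal illustration of the naive triple-loop matrix multiply described above, plus a simple blocked variant of the kind a compiler schedule would introduce; the sizes and block factor are arbitrary, and real TVM schedules go much further with shared-memory staging, tensorization, and vectorization:

```python
import numpy as np

def matmul_naive(A, B):
    """The simplest schedule: for i, j, k: C[i][j] += A[i][k] * B[k][j].
    Correct, but far slower than an optimized kernel."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def matmul_blocked(A, B, block=32):
    """One classic loop transformation: tile the loops so each small block of A
    and B is reused while it sits in fast memory (cache or shared memory)."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                # Multiply one tile; NumPy's matmul stands in for a tuned inner kernel.
                C[i0:i0+block, j0:j0+block] += (
                    A[i0:i0+block, p0:p0+block] @ B[p0:p0+block, j0:j0+block]
                )
    return C

A = np.random.rand(128, 128).astype(np.float32)
B = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(matmul_naive(A, B), A @ B, atol=1e-3)
assert np.allclose(matmul_blocked(A, B), A @ B, atol=1e-3)
```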
So you can really shrink down the memory footprint. What are some of the trade-offs there? How do you figure out what the right target is? And what are the precision trade-offs, too? [00:28:37]Tianqi: Right now, a lot of people also mostly use int4 now for language models. So that really shrinks things down a lot. And more recently, actually, we started to think that, at least in MLC, we don't want to have a strong opinion on what kind of quantization we want to bring, because there are so many researchers in the field. So what we can do is we can allow developers to customize the quantization they want, but we still bring the optimum code for them. So we are working on this item called bring your own quantization. In fact, hopefully MLC will be able to support more quantization formats. And definitely, I think there's an open field that's being explored. Can you bring more sparsities? Can you quantize activations as much as possible, and so on? And it's going to be something that's going to be relevant for quite a while. [00:29:27]Swyx: You mentioned something I wanted to double back on, which is most people use int4 for language models. This is actually not obvious to me. Are you talking about the GGML type people, or even the researchers who are training the models also using int4? [00:29:40]Tianqi: Sorry, so I'm mainly talking about inference, not training, right? So when you're doing training, of course, int4 is harder, right? Maybe you could do some form of mixed type precision for inference. I think int4 is kind of like, in a lot of cases, you will be able to get away with int4. And actually, that does bring a lot of savings in terms of the memory overhead, and so on. [00:30:09]Alessio: Yeah, that's great. Let's talk a bit about maybe the GGML, then there's Mojo. How should people think about MLC? How do all these things play together? I think GGML is focused on model level re-implementation and improvements. Mojo is a language, a superset of Python. You're more at the compiler level. Do you all work together? Do people choose between them? [00:30:32]Tianqi: So I think in this case, I think it's great to say the ecosystem becomes so rich with so many different ways. So in our case, GGML is more like you're implementing something from scratch in C, right? So that gives you the ability to go and customize each of a particular hardware backend. But then you will need to write from CUDA kernels, and you write optimally from AMD, and so on. So the kind of engineering effort is a bit more broadened in that sense. Mojo, I have not looked at specific details yet. I think it's good to start to say, it's a language, right? I believe there will also be machine learning compilation technologies behind it. So it's good to say, interesting place in there. In the case of MLC, our case is that we do not want to have an opinion on how, where, which language people want to develop, deploy, and so on. And we also realize that actually there are two phases. We want to be able to develop and optimize your model. By optimization, I mean, really bring in the best CUDA kernels and do some of the machine learning engineering in there. And then there's a phase where you want to deploy it as a part of the app. So if you look at the space, you'll find that GGML is more like, I'm going to develop and optimize in the C language, right? And then most of the low-level languages they have. And Mojo is that you want to develop and optimize in Mojo, right? And you deploy in Mojo. 
In fact, that's the philosophy they want to push for. In the MLC case, we find that actually if you want to develop models, the machine learning community likes Python. Python is a language that you should focus on. So in the case of MLC, we really want to be able to enable, not only be able to just define your model in Python, that's very common, right? But also do ML optimization, like engineering optimization, CUDA kernel optimization, memory planning, all those things in Python that makes you customizable and so on. But when you do deployment, we realize that people want a bit of a universal flavor. If you are a web developer, you want JavaScript, right? If you're maybe an embedded system person, maybe you would prefer C++ or C or Rust. And people sometimes do like Python in a lot of cases. So in the case of MLC, we really want to have this vision of, you optimize, build a generic optimization in Python, then you deploy that universally onto the environments that people like. [00:32:54]Swyx: That's a great perspective and comparison, I guess. One thing I wanted to make sure that we cover is that I think you are one of these emerging set of academics that also very much focus on your artifacts of delivery. Of course. Something we talked about for three years, that he was very focused on his GitHub. And obviously you treated XGBoost like a product, you know? And then now you're publishing an iPhone app. Okay. Yeah. Yeah. What is your thinking about academics getting involved in shipping products? [00:33:24]Tianqi: I think there are different ways of making impact, right? Definitely, you know, there are academics that are writing papers and building insights for people so that people can build product on top of them. In my case, I think the particular field I'm working on, machine learning systems, I feel like really we need to be able to get it to the hand of people so that really we see the problem, right? And we show that we can solve a problem. And it's a different way of making impact. And there are academics that are doing similar things. Like, you know, if you look at some of the people from Berkeley, right? A few years, they will come up with big open source projects. Certainly, I think it's just a healthy ecosystem to have different ways of making impacts. And I feel like really be able to do open source and work with open source community is really rewarding because we have a real problem to work on when we build our research. Actually, those research bring together and people will be able to make use of them. And we also start to see interesting research challenges that we wouldn't otherwise say, right, if you're just trying to do a prototype and so on. So I feel like it's something that is one interesting way of making impact, making contributions. [00:34:40]Swyx: Yeah, you definitely have a lot of impact there. And having experience publishing Mac stuff before, the Apple App Store is no joke. It is the hardest compilation, human compilation effort. So one thing that we definitely wanted to cover is running in the browser. You have a 70 billion parameter model running in the browser. That's right. Can you just talk about how? Yeah, of course. [00:35:02]Tianqi: So I think that there are a few elements that need to come in, right? First of all, you know, we do need a MacBook, the latest one, like M2 Max, because you need the memory to be big enough to cover that. So for a 70 billion model, it takes you about, I think, 50 gigabytes of RAM. 
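The memory figures in this exchange are easy to sanity-check. A hedged back-of-the-envelope sketch follows, counting weights only; activations, the KV cache, and runtime overhead push real requirements higher:

```python
# Rough weight-memory footprint of a 70B-parameter model at different precisions.
# Weights only; real deployments also need memory for activations and the KV cache.
params = 70e9

bytes_per_param = {
    "fp32": 4.0,   # 4 bytes per parameter, as mentioned in the episode
    "fp16": 2.0,
    "int8": 1.0,   # 1 byte per parameter
    "int4": 0.5,   # 4 bits per parameter
}

for name, b in bytes_per_param.items():
    gib = params * b / (1024 ** 3)
    print(f"{name}: ~{gib:.0f} GiB of weights")

# fp32 ~261 GiB, fp16 ~130 GiB, int8 ~65 GiB, int4 ~33 GiB; consistent with
# needing a high-memory machine (on the order of 50 GB free) even at 4-bit.
```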
So the M2 Max, the upper version, will be able to run it, right? And it also leverages machine learning compilation. Again, what we are doing is the same, whether it's running on iPhone, on server cloud GPUs, on AMDs, or on MacBook, we all go through that same MLC pipeline. Of course, in certain cases, maybe we'll do a bit of customization iteration for either ones. And then it runs on the browser runtime, this package of WebLLM. So that will effectively... So what we do is we will take that original model and compile to what we call WebGPU. And then the WebLLM will be able to pick it up. And the WebGPU is this latest GPU technology that major browsers are shipping right now. So you can get it in Chrome for them already. It allows you to be able to access your native GPUs from a browser. And then effectively, that language model is just invoking the WebGPU kernels through there. So actually, when Llama 2 came out, initially, we asked the question about, can you run 70 billion on a MacBook? That was the question we're asking. So first, we actually... Jin Lu, who is the engineer pushing this, he got 70 billion on a MacBook. We had a CLI version. So in MLC, you will be able to... That runs through a Metal accelerator. So effectively, you use the Metal programming language to get the GPU acceleration. So we find, okay, it works for the MacBook. Then we asked, we had a WebGPU backend. Why not try it there? So we just tried it out. And it's really amazing to see everything up and running. And actually, it runs smoothly in that case. So I do think there are some kind of interesting use cases already in this, because everybody has a browser. You don't need to install anything. I think it doesn't make sense yet to really run a 70 billion model on a browser, because you kind of need to be able to download the weight and so on. But I think we're getting there. Effectively, the most powerful models you will be able to run on a consumer device. It's kind of really amazing. And also, in a lot of cases, there might be use cases. For example, if I'm going to build a chatbot that I talk to it and answer questions, maybe some of the components, like the voice to text, could run on the client side. And so there are a lot of possibilities of being able to have something hybrid that contains the edge component or something that runs on a server. [00:37:47]Alessio: Do these browser models have a way for applications to hook into them? So if I'm using, say, you can use OpenAI or you can use the local model. Of course. [00:37:56]Tianqi: Right now, actually, we are building... So there's an NPM package called WebLLM, right? So that you will be able to, if you want to embed it onto your web app, you will be able to directly depend on WebLLM and you will be able to use it. We are also having a REST API that's OpenAI compatible. So that REST API, I think, right now, it's actually running on native backend. So that if a CUDA server is faster to run on native backend. But also we have a WebGPU version of it that you can go and run. So yeah, we do want to be able to have easier integrations with existing applications. And OpenAI API is certainly one way to do that. Yeah, this is great. [00:38:37]Swyx: I actually did not know there's an NPM package that makes it very, very easy to try out and use. I want to actually... One thing I'm unclear about is the chronology. Because as far as I know, Chrome shipped WebGPU the same time that you shipped WebLLM. Okay, yeah. So did you have some kind of secret chat with Chrome? 
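Since the WebLLM package itself is JavaScript, a browser example would naturally be TypeScript; to keep one language in this write-up, here is a hedged Python sketch of what "an OpenAI-compatible REST API" means in practice. The base URL, port, and model name below are placeholders, not the actual defaults of any particular local server:

```python
# Hedged sketch: calling a locally served, OpenAI-compatible chat endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # hypothetical local endpoint
    api_key="not-needed-for-local",       # many local servers ignore the key
)

response = client.chat.completions.create(
    model="local-llama-2-7b-chat",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize what machine learning compilation does."}],
)
print(response.choices[0].message.content)
```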
[00:38:57]Tianqi: The good news is that Chrome is doing a very good job of trying to have early release. So although the official shipment of the Chrome WebGPU is the same time as WebLLM, actually, you will be able to try out WebGPU technology in Chrome. There is an unstable version called Canary. I think as early as two years ago, there was a WebGPU version. Of course, it's getting better. So we had a TVM-based WebGPU backend two years ago. Of course, at that time, there were no language models. It was running on less interesting, well, still quite interesting models. And then this year, we really started to see it getting matured and performance keeping up. So we have a more serious push of bringing the language model compatible runtime onto the WebGPU. [00:39:45]Swyx: I think you agree that the hardest part is the model download. Has there been conversations about a one-time model download and sharing between all the apps that might use this API? That is a great point. [00:39:58]Tianqi: I think it's already supported in some sense. When we download the model, WebLLM will cache it onto a special Chrome cache. So if a different web app uses the same WebLLM JavaScript package, you don't need to redownload the model again. So there is already something there. But of course, you have to download the model once at least to be able to use it. [00:40:19]Swyx: Okay. One more thing just in general before we're about to zoom out to OctoAI. Just the last question is, you're not the only project working on, I guess, local models. That's right. Alternative models. There's GPT4All, there's Ollama that just recently came out, and there's a bunch of these. What would be your advice to them on what's a valuable problem to work on? And what is just thin wrappers around ggml? Like, what are the interesting problems in this space, basically? [00:40:45]Tianqi: I think making API better is certainly something useful, right? In general, one thing that we do try to push very hard on is this idea of easier universal deployment. So we are also looking forward to actually have more integration with MLC. That's why we're trying to build API like WebLLM and other things. So we're also looking forward to collaborate with all those ecosystems and working support to bring in models more universally and be able to also keep up the best performance when possible in a more push-button way. [00:41:15]Alessio: So as we mentioned in the beginning, you're also the co-founder of OctoML. Recently, OctoML released OctoAI, which is a compute service, basically focuses on optimizing model runtimes and acceleration and compilation. What has been the evolution there? So Octo started as kind of like a traditional MLOps tool, where people were building their own models and you help them on that side. And then it seems like now most of the market is shifting to starting from pre-trained generative models. Yeah, what has been that experience for you and what you've seen the market evolve? And how did you decide to release OctoAI? [00:41:52]Tianqi: One thing that we found out is that on one hand, it's really easy to go and get something up and running, right? So if you start to consider there's so many possible availabilities and scalability issues and even integration issues, it becomes kind of interesting and complicated. So we really want to make sure to help people to get that part easy, right? 
And now a lot of things, if we look at the customers we talk to and the market, certainly generative AI is something that is very interesting. So that is something that we really hope to help elevate. And also building on top of technology we build to enable things like portability across hardwares. And you will be able to not worry about the specific details, right? Just focus on getting the model out. We'll try to work on infrastructure and other things that helps on the other end. [00:42:45]Alessio: And when it comes to getting optimization on the runtime, I see, when we run our early adopters community, that most enterprises' issue is how to actually run these models. Do you see that as one of the big bottlenecks now? I think a few years ago it was like, well, we don't have a lot of machine learning talent. We cannot develop our own models. Versus now it's like, there's these great models you can use, but I don't know how to run them efficiently. [00:43:12]Tianqi: That depends on how you define running, right? On one hand, it's easy to download your MLC, like you download it, you run on a laptop, but then there's also different decisions, right? What if you are trying to serve a larger user request? What if that request changes? What if the availability of hardware changes? Right now it's really hard to get the latest hardware from NVIDIA, unfortunately, because everybody's trying to work on the things using the hardware that's out there. So I think when the definition of run changes, there are a lot more questions around things. And also in a lot of cases, it's not only about running models, it's also about being able to solve problems around them. How do you manage your model locations and how do you make sure that you get your model close to your execution environment more efficiently? So definitely a lot of engineering challenges out there. That we hope to elevate, yeah. And also, if you think about our future, definitely I feel like right now the technology, given the technology and the kind of hardware availability we have today, we will need to make use of all the possible hardware available out there. That will include a mechanism for cutting down costs, bringing something to the edge and cloud in a more natural way. So I feel like still this is a very early stage of where we are, but it's already good to see a lot of interesting progress. [00:44:35]Alessio: Yeah, that's awesome. I would love, I don't know how much we're going to go in depth into it, but what does it take to actually abstract all of this from the end user? You know, like they don't need to know what GPUs you run, what cloud you're running them on. You take all of that away. What was that like as an engineering challenge? [00:44:51]Tianqi: So I think that there are engineering challenges on. In fact, first of all, you will need to be able to support all the kind of hardware backends you have, right? On one hand, if you look at the NVIDIA libraries, you'll find very surprisingly, not too surprisingly, most of the latest libraries works well on the latest GPU. But there are other GPUs out there in the cloud as well. So certainly being able to have know-hows and being able to do model optimization is one thing, right? Also infrastructures on being able to scale things up, locate models. And in a lot of cases, we do find that on typical models, it also requires kind of vertical iterations. So it's not about, you know, build a silver bullet and that silver bullet is going to solve all the problems. 
It's more about, you know, we're building a product, we'll work with the users and we find out there are interesting opportunities in a certain point. And when our engineer will go and solve that, and it will automatically reflect it in a service. [00:45:45]Swyx: Awesome. [00:45:46]Alessio: We can jump into the lightning round until, I don't know, Sean, if you have more questions or TQ, if you have more stuff you wanted to talk about that we didn't get a chance to [00:45:54]Swyx: touch on. [00:45:54]Alessio: Yeah, we have talked a lot. [00:45:55]Swyx: So, yeah. We always would like to ask, you know, do you have a commentary on other parts of AI and ML that is interesting to you? [00:46:03]Tianqi: So right now, I think one thing that we are really pushing hard for is this question about how far can we bring open source, right? I'm kind of like a hacker and I really like to put things together. So I think it's unclear in the future of what the future of AI looks like. On one hand, it could be possible that, you know, you just have a few big players, you just try to talk to those bigger language models and that can do everything, right? On the other hand, one of the things that Wailing Academic is really excited and pushing for, that's one reason why I'm pushing for MLC, is that can we build something where you have different models? You have personal models that know the best movie you like, but you also have bigger models that maybe know more, and you get those models to interact with each other, right? And be able to have a wide ecosystem of AI agents that helps each person while still being able to do things like personalization. Some of them can run locally, some of them, of course, running on a cloud, and how do they interact with each other? So I think that is a very exciting time where the future is yet undecided, but I feel like there is something we can do to shape that future as well. [00:47:18]Swyx: One more thing, which is something I'm also pursuing, which is, and this kind of goes back into predictions, but also back in your history, do you have any idea, or are you looking out for anything post-transformers as far as architecture is concerned? [00:47:32]Tianqi: I think, you know, in a lot of these cases, you can find there are already promising models for long contexts, right? There are space-based models, where like, you know, a lot of some of our colleagues from Albert, who he worked on this HIPPO models, right? And then there is an open source version called RWKV. It's like a recurrent models that allows you to summarize things. Actually, we are bringing RWKV to MOC as well, so maybe you will be able to see one of the models. [00:48:00]Swyx: We actually recorded an episode with one of the RWKV core members. It's unclear because there's no academic backing. It's just open source people. Oh, I see. So you like the merging of recurrent networks and transformers? [00:48:13]Tianqi: I do love to see this model space continue growing, right? And I feel like in a lot of cases, it's just that attention mechanism is getting changed in some sense. So I feel like definitely there are still a lot of things to be explored here. And that is also one reason why we want to keep pushing machine learning compilation, because one of the things we are trying to push in was productivity. So that for machine learning engineering, so that as soon as some of the models came out, we will be able to, you know, empower them onto those environments that's out there. 
[00:48:43]Swyx: Yeah, it's a really good mission. Okay. Very excited to see that RWKV and state space model stuff. I'm hearing increasing chatter about that stuff. Okay. Lightning round, as always fun. I'll take the first one. Acceleration. What has already happened in AI that you thought would take much longer? [00:48:59]Tianqi: The emergence of, like, conversational chatbot ability is something that kind of surprised me before it came out. This is like one piece that I feel originally I thought would take much longer, but yeah, [00:49:11]Swyx: it happens. And it's funny because like the original, like Eliza chatbot was something that goes all the way back in time. Right. And then it just suddenly came back again. Yeah. [00:49:21]Tianqi: It's always interesting to think about, but with kind of a different technology [00:49:25]Swyx: in some sense. [00:49:25]Alessio: What about the most interesting unsolved question in AI? [00:49:31]Swyx: That's a hard one, right? [00:49:32]Tianqi: So I can tell you what kind of thing I'm excited about. So I think that I have always been excited about this idea of continuous learning and lifelong learning in some sense. So how AI continues to evolve with the knowledge that has been there. It seems that we're getting much closer with all those recent technologies. So being able to develop systems support for that, and being able to think about how AI continues to evolve, is something that I'm really excited about. [00:50:01]Swyx: So specifically, just to double click on this, are you talking about continuous training? That's like a training. [00:50:06]Tianqi: I feel like, you know, training, adaptation, it's all similar things, right? You want to think about the entire life cycle, right? The life cycle of collecting data, training, fine tuning, and maybe having your local context getting continuously curated and fed into models. So I think all these things are interesting and relevant here. [00:50:29]Swyx: Yeah. I think this is something that people are really asking, you know, right now we have moved a lot into the sort of pre-training phase and off the shelf, you know, the model downloads and stuff like that, which seems very counterintuitive compared to the continuous training paradigm that people want. So I guess the last question would be for takeaways. What's basically one message that you want every listener, every person to remember today? [00:50:54]Tianqi: I think it's getting more obvious now, but I think one of the things that I always want to mention in my talks is that, you know, when you're thinking about AI applications, originally people think about algorithms a lot more, right? Our algorithms and models, they are still very important. But usually when you build AI applications, it takes, you know, the algorithm side, the system optimizations, and the data curation, right? So it takes a combination of so many facets to bring together an AI system, and being able to look at it from that holistic perspective is really useful when we start to build modern applications. I think it's going to continue to be more important in the future. [00:51:35]Swyx: Yeah. Thank you for showing the way on this. And honestly, just making things possible that I thought would take a lot longer. So thanks for everything you've done. [00:51:46]Tianqi: Thank you for having me. [00:51:47]Swyx: Yeah. [00:51:47]Alessio: Thanks for coming on TQ. [00:51:49]Swyx: Have a good one. [00:51:49] Get full access to Latent Space at www.latent.space/subscribe
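The hardware-portability idea Tianqi describes in the conversation above, compiling the same model for whatever backend is available instead of hand-tuning per device, is what Apache TVM (the project he co-created) exposes through its Relay build API. A minimal sketch, assuming a recent TVM build with the Relay flow, an ONNX file named model.onnx, and a placeholder input name and shape; each target also needs its own toolchain (for example, CUDA for the "cuda" target) installed:

```python
import onnx
import tvm
from tvm import relay

# Import a framework-agnostic model (ONNX here) into TVM's Relay IR.
onnx_model = onnx.load("model.onnx")  # placeholder path
mod, params = relay.frontend.from_onnx(
    onnx_model, shape={"input": (1, 3, 224, 224)}  # placeholder input name and shape
)

# The same module can be compiled for different hardware backends just by
# swapping the target string -- this is the portability point.
for target_str in ["llvm", "cuda"]:
    target = tvm.target.Target(target_str)
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    lib.export_library(f"model_{target_str}.so")
    print(f"built artifact for {target_str}")
```

Each exported artifact can then be loaded by TVM's lightweight runtime on the corresponding device, which is the "just focus on getting the model out" experience described above.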

Infinite Machine Learning
Speeding up Generative AI models | Luis Ceze, cofounder and CEO of OctoML

Infinite Machine Learning

Play Episode Listen Later Jul 31, 2023 39:56


Luis Ceze is the cofounder and CEO of OctoML, a platform that offers compute infrastructure to fine-tune, run, and scale your AI models. He's a professor at the University of Washington and a venture partner at Madrona. He was previously the cofounder of Corensic. He has a PhD in Computer Science from the University of Illinois Urbana-Champaign. In this episode, we cover a range of topics including: - OctoAI product announcement - How to make LLMs faster and cheaper - Training your own LLMs - The perceived shortage of AI compute - Enterprise spend on AI compute - Applications being built using OctoML - Domain specific models. Luis's favorite books: - Thinking, Fast and Slow (Author: Daniel Kahneman) - Blindness (Author: Jose Saramago) -------- Where to find Prateek Joshi: Newsletter: https://prateekjoshi.substack.com Website: https://prateekj.com LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 Twitter: https://twitter.com/prateekvjoshi 

Infinite Machine Learning
Generative AI stack, Android moment of AI, Data infrastructure | Jon Turow, Partner at Madrona

Infinite Machine Learning

Play Episode Listen Later Jul 10, 2023 41:11


Jon Turow is a partner at Madrona, a VC firm that has invested in amazing companies like OctoML, HighSpot, Fixie, Clari, Runway, UiPath, and many more. He holds 26 patents! Most recently, he led the product teams for AWS Computer Vision AI services, including Amazon Textract and Amazon Rekognition. He wrote the original product and business plans for AWS IoT and AWS Greengrass, which extend AWS services to run locally on edge devices. Prior to Amazon, he co-founded a cloud telephony startup. He holds a bachelor's from Wharton and an MBA from Kellogg. In this episode, we cover a range of topics including: - The Generative AI stack - Application frameworks for developers - Using a combination of multiple foundation models - Data tooling for AI applications - Making LLMs faster/better/cheaper - The Android moment of AI - Open source AI opportunities - AI copilots for software development - What use cases within AI infrastructure are exciting to you. Jon's favorite book: Night Flight (Author: Antoine de Saint-Exupéry) -------- Where to find Prateek Joshi: Newsletter: https://prateekjoshi.substack.com Website: https://prateekj.com LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 Twitter: https://twitter.com/prateekvjoshi 

Science in Parallel
Season 3, Episode 5 -- Beyond Exascale: Exploring Emerging Hardware

Science in Parallel

Play Episode Listen Later Jun 21, 2023 41:16


The exascale era in computing has arrived, and that brings up the question of what's next. We'll discuss some emerging processor technologies (molecular storage and computing, quantum computing, and neuromorphic chips) with an expert from each of those fields. Learn more about these technologies' strengths and challenges and how they might be incorporated into tomorrow's systems. You'll meet: Luis Ceze, professor of computer science at the University of Washington and CEO of the AI startup OctoML; Bert de Jong, senior scientist and department head for computational sciences at Lawrence Berkeley National Laboratory and deputy director of the Quantum Systems Accelerator; and Catherine (Katie) Schuman, a neuromorphic computing researcher and an assistant professor of computer science at the University of Tennessee, Knoxville.

AI Unraveled: Latest AI News & Trends, Master GPT, Gemini, Generative AI, LLMs, Prompting, GPT Store
Latest AI trends: Top Python AI and Machine Learning Libraries; Meta develops method for teaching image models common sense; OctoAI; We are all AI's free data workers; AI resurrects The Beatles; First regulatory framework for AI;

AI Unraveled: Latest AI News & Trends, Master GPT, Gemini, Generative AI, LLMs, Prompting, GPT Store

Play Episode Listen Later Jun 15, 2023 12:11


Top Python AI and Machine Learning Libraries
Meta develops method for teaching image models common sense
OctoML launches OctoAI, a self-optimizing compute service for AI
AI resurrects The Beatles: AI helps make 'final' Beatles song
Daily AI Update News from Meta, Google, OpenAI, AMD, Adobe, Hugging Face, and Accenture
We are all AI's free data workers
The EU Parliament has adopted the world's first regulatory framework for AI
DreamGPT turns a weakness of large language models into a strength
How to Use The GPT-4 API With Function Calling | Your Own ChatGPT Plugins | TypeScript
This podcast is generated using the Wondercraft AI platform, a tool that makes it super easy to start your own podcast, by enabling you to use hyper-realistic AI voices as your host. Like mine! Are you eager to expand your understanding of artificial intelligence? Look no further than the essential book "AI Unraveled: Demystifying Frequently Asked Questions on Artificial Intelligence (OpenAI, ChatGPT, Google Bard, Generative AI, LLM, Palm 2)," now available on Amazon, Google and Apple Book Stores. Get your copy at Google, Apple or Amazon today! AWS Certified Machine Learning Specialty (MLS-C01) Practice Exams: 3 Practice Exams, Data Engineering, Exploratory Data Analysis, Modeling, Machine Learning Implementation and Operations, NLP

Infinite Machine Learning
LLM Agents, Few-Shot Learning | Matt Welsh, cofounder and CEO of Fixie

Infinite Machine Learning

Play Episode Listen Later May 30, 2023 37:38


Matt Welsh is the cofounder and CEO of Fixie, an automation platform for LLMs. It allows developers to build natural language agents that connect to your data, talk to APIs, and solve complex problems. They've raised $17M from investors such as Redpoint, Madrona, Zetta, SignalFire, Bloomberg Beta, and more. He has previously held roles at OctoML, Apple, Xnor, and Google. He was a Professor of Computer Science at Harvard and has a PhD in Computer Science from UC Berkeley. In this episode, we cover a range of topics including: - LLMs as the new computational engine - What can LLMs do well and where are the gaps - Fine-tuning vs In-context learning - Smart Agents - Few shot learning - Use cases of Fixie. Matt's favorite book: The Amazing Adventures of Kavalier & Clay (Author: Michael Chabon) -------- Where to find Prateek Joshi: Newsletter: https://prateekjoshi.substack.com Website: https://prateekj.com LinkedIn: https://www.linkedin.com/in/prateek-joshi-91047b19 Twitter: https://twitter.com/prateekvjoshi 
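One topic listed above, in-context (few-shot) learning, is easy to make concrete: instead of updating any weights through fine-tuning, you place labeled examples directly in the prompt. A rough sketch, assuming the v1+ OpenAI Python SDK; the model name, task, and examples are placeholders for illustration, not anything from the episode:

```python
# Few-shot (in-context) learning: the "training" is just examples placed in
# the context window; no gradient updates or fine-tuning are involved.
from openai import OpenAI  # assumes the v1+ OpenAI Python SDK is installed

few_shot_examples = [
    ("The movie was a masterpiece.", "positive"),
    ("I want my two hours back.", "negative"),
    ("It was fine, nothing special.", "neutral"),
]

messages = [{"role": "system",
             "content": "Classify the sentiment of each review as positive, negative, or neutral."}]
for review, label in few_shot_examples:
    messages.append({"role": "user", "content": review})
    messages.append({"role": "assistant", "content": label})
messages.append({"role": "user", "content": "Great cast, but the plot went nowhere."})

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)  # model name is a placeholder
print(resp.choices[0].message.content)
```

The same message list works against any chat-completion style endpoint; swapping the examples changes the behavior without touching the model, which is the contrast with fine-tuning discussed in the episode.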

The Jim Rutt Show
Currents 095: Matt Welsh on the End of Programming

The Jim Rutt Show

Play Episode Listen Later May 9, 2023 61:24


Jim talks with Matt Welsh about the ideas in his essay "The End of Programming," arguing that coding as we know it will soon be obsolete. They discuss ChatGPT's ability to perform logical reasoning, whether it thinks, its utility as a programming aid, skipping code entirely, using language models as computational engines, problem decomposition, streamlining the interface between models and databases, complex customer service, the accessibility of fine-tuning, Jim's LLM scriptwriting project, custom hardware for language models, learning to speak with aliens, democratizing computing abilities, moral conundrums & value-laden choices, training introspection, avoiding an erosion of trust, short-term opportunities for small dev teams, advice for recent college grads, and much more. Episode Transcript "The End of Programming," by Matt Welsh (Communications of the ACM) Fixie.ai Etched.ai GitHub Copilot Matt Welsh is CEO and Co-Founder at Fixie.ai, a startup building a new computing platform based on Large Language Models. Prior to Fixie, Matt was the SVP of Engineering at OctoML, and spent time as an engineering leader at Apple, Xnor.ai, and Google. He was previously a Professor of Computer Science at Harvard, and did his PhD in Computer Science at UC Berkeley.

MLOps.community
Cost/Performance Optimization with LLMs [Panel]

MLOps.community

Play Episode Listen Later May 6, 2023 35:57


Sign up for the next LLM in production conference here: https://go.mlops.community/LLMinprod Watch all the talks from the first conference: https://go.mlops.community/llmconfpart1 // Abstract In this panel discussion, the topic of the cost of running large language models (LLMs) is explored, along with potential solutions. The benefits of bringing LLMs in-house, such as latency optimization and greater control, are also discussed. The panelists explore methods such as structured pruning and knowledge distillation for optimizing LLMs (a minimal distillation sketch follows below). OctoML's platform is mentioned as a tool for the automatic deployment of custom models and for selecting the most appropriate hardware for them. Overall, the discussion provides insights into the challenges of managing LLMs and potential strategies for overcoming them. // Bio Lina Weichbrodt Lina is a pragmatic freelancer and machine learning consultant who likes to solve business problems end-to-end and make machine learning or a simple, fast heuristic work in the real world. In her spare time, Lina likes to exchange with other people on how they can implement best practices in machine learning; talk to her at the Machine Learning Ops Slack: shorturl.at/swxIN. Luis Ceze Luis Ceze is Co-Founder and CEO of OctoML, which enables businesses to seamlessly deploy ML models to production making the most out of the hardware. OctoML is backed by Tiger Global, Addition, Amplify Partners, and Madrona Venture Group. Ceze is the Lazowska Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, where he has taught for 15 years. Luis co-directs the Systems and Architectures for Machine Learning lab (sampl.ai), which co-authored Apache TVM, a leading open-source ML stack for performance and portability that is used in widely deployed AI applications. Luis is also co-director of the Molecular Information Systems Lab (misl.bio), which led pioneering research in the intersection of computing and biology for IT applications such as DNA data storage. His research has been featured prominently in the media including New York Times, Popular Science, MIT Technology Review, and the Wall Street Journal. Ceze is a Venture Partner at Madrona Venture Group and leads their technical advisory board. Jared Zoneraich Co-Founder of PromptLayer, enabling data-driven prompt engineering. Compulsive builder. Jersey native, with a brief stint in California (UC Berkeley '20) and now residing in NYC. Daniel Campos Hailing from Mexico, Daniel started his NLP journey with his BS in CS from RPI. He then worked at Microsoft on Ranking at Bing with LLMs (back when they had 2 commas) and helped build out popular datasets like MSMARCO and TREC Deep Learning. While at Microsoft, he got his MS in Computational Linguistics from the University of Washington with a focus on Curriculum Learning for Language Models. Most recently, he has been pursuing his Ph.D. at the University of Illinois Urbana Champaign focusing on efficient inference for LLMs and robust dense retrieval. During his Ph.D., he worked for companies like Neural Magic, Walmart, Qualtrics, and Mendel.AI and now works on bringing LLMs to search at Neeva. Mario Kostelac Currently building AI-powered products in Intercom in a small, highly effective team. I roam between practical research and engineering but lean more towards engineering and challenges around running reliable, safe, and predictable ML systems. You can imagine how fun it is in the LLM era :). 
Generally interested in the intersection of product and tech, and in building differentiation by solving hard challenges (technical or non-technical). Software engineer turned Machine Learning engineer 5 years ago.
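Knowledge distillation, one of the optimization methods the panel mentions, trains a small "student" model to match a larger "teacher" model's output distribution. A minimal PyTorch sketch of the classic soft-target objective; the temperature T and mixing weight alpha are illustrative defaults, not values from the discussion:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a softened KL term (scaled by T^2) with ordinary cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Inside a training loop (teacher frozen, student being trained):
# with torch.no_grad():
#     teacher_logits = teacher(inputs)
# student_logits = student(inputs)
# loss = distillation_loss(student_logits, teacher_logits, labels)
# loss.backward()
```

In an LLM setting the same idea applies per token, with the teacher's logits computed under torch.no_grad() so that only the student's parameters receive gradients.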

Gradient Dissent - A Machine Learning Podcast by W&B
Sarah Catanzaro — Remembering the Lessons of the Last AI Renaissance

Gradient Dissent - A Machine Learning Podcast by W&B

Play Episode Listen Later Feb 2, 2023 76:24


Sarah Catanzaro is a General Partner at Amplify Partners, and one of the leading investors in AI and ML. Her investments include RunwayML, OctoML, and Gantry. Sarah and Lukas discuss lessons learned from the "AI renaissance" of the mid 2010s and compare the general perception of ML back then to now. Sarah also provides insights from her perspective as an investor, from selling into tech-forward companies vs. traditional enterprises, to the current state of MLOps/developer tools, to large language models and hype bubbles. Show notes (transcript and links): http://wandb.me/gd-sarah-catanzaro
---
⏳ Timestamps:
0:00 Intro
1:10 Lessons learned from previous AI hype cycles
11:46 Maintaining technical knowledge as an investor
19:05 Selling into tech-forward companies vs. traditional enterprises
25:09 Building point solutions vs. end-to-end platforms
36:27 LLMs, new tooling, and commoditization
44:39 Failing fast and how startups can compete with large cloud vendors
52:31 The gap between research and industry, and vice versa
1:00:01 Advice for ML practitioners during hype bubbles
1:03:17 Sarah's thoughts on Rust and bottlenecks in deployment
1:11:23 The importance of aligning technology with people
1:15:58 Outro
---

MLOps.community
Bringing DevOps Agility to ML// Luis Ceze // Coffee Sessions #121

MLOps.community

Play Episode Listen Later Sep 6, 2022 64:35


MLOps Coffee Sessions #121 with Luis Ceze, CEO and Co-founder of OctoML, Bringing DevOps Agility to ML co-hosted by Mihail Eric. // Abstract There's something about this idea where people see a future where you don't need to think about infrastructure. You should just be able to do what you do and infrastructure happens. People understand that there is a lot of complexity underneath the hood and most data scientists or machine learning engineers start deploying things and shouldn't have to worry about the most efficient way of doing this. // Bio Luis Ceze is Co-Founder and CEO of OctoML, which enables businesses to seamlessly deploy ML models to production making the most out of the hardware. OctoML is backed by Tiger Global, Addition, Amplify Partners, and Madrona Venture Group. Ceze is the Lazowska Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, where he has taught for 15 years. Luis co-directs the Systems and Architectures for Machine Learning lab (sampl.ai), which co-authored Apache TVM, a leading open-source ML stack for performance and portability that is used in widely deployed AI applications. Luis is also co-director of the Molecular Information Systems Lab (misl.bio), which led pioneering research in the intersection of computing and biology for IT applications such as DNA data storage. His research has been featured prominently in the media including New York Times, Popular Science, MIT Technology Review, and the Wall Street Journal. Ceze is a Venture Partner at Madrona Venture Group and leads their technical advisory board. // MLOps Jobs board https://mlops.pallet.xyz/jobs MLOps Swag/Merch https://mlops-community.myshopify.com/ // Related Links Landing page: https://octoml.ai/ The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics by Daniel James Brown: https://www.amazon.com/Boys-Boat-Americans-Berlin-Olympics/dp/0143125478 --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Mihail on LinkedIn: https://www.linkedin.com/in/mihaileric/ Connect with Luis on LinkedIn: https://www.linkedin.com/in/luis-ceze-50b2314/ Timestamps: [00:00] Introduction to Luis Ceze [06:28] MLOps does not exist [10:41] Semantics argument [16:25] Parallel programming standpoint [18:09] TVM [22:51] Optimizations [24:18] TVM in the ecosystem [27:10] OctoML's further step [30:42] Value chain [33:58] Mature players [35:48] Talking to SRE's and Machine Learning Engineers [36:32] Building OctoML [40:20] My Octopus Teacher [42:15] Environmental effects of Sustainable Machine Learning [44:50] Bridging the gap from OctoML to biological mechanisms [50:02] Programmability [57:13] Academia making the impact [59:40] Rapid fire questions [1:03:39] Wrap up

Founded and Funded
Founded and Funded IA40 Winner Spotlight: Hugging Face CEO Clem Delangue and OctoML CEO Luis Ceze on foundation models, open source, and transparency

Founded and Funded

Play Episode Listen Later May 5, 2022 45:59


This week on Founded and Funded, we spotlight our next IA40 winners – Hugging Face and OctoML. Managing Director Matt McIlwain talks to Hugging Face Co-founder and CEO Clem Delangue and OctoML Co-founder and CEO Luis Ceze all about foundation models, diving deep into the importance of detecting biases in the data being used to train models as well as the importance of transparency and the ability for researchers to share their models. They discuss open source, business models, the role of cloud providers and debate DevOps versus MLOps, something that Luis feels particularly passionate about. Clem even explains how large models are to machine learning what Formula 1 is to the car industry.

Practical AI
MLOps is NOT Real

Practical AI

Play Episode Listen Later Apr 26, 2022 45:57 Transcription Available


We all hear a lot about MLOps these days, but where does MLOps end and DevOps begin? Our friend Luis from OctoML joins us in this episode to discuss treating AI/ML models as regular software components (once they are trained and ready for deployment). We get into topics including optimization on various kinds of hardware and deployment of models at the edge.

Changelog Master Feed
MLOps is NOT Real (Practical AI #176)

Changelog Master Feed

Play Episode Listen Later Apr 26, 2022 45:57 Transcription Available


We all hear a lot about MLOps these days, but where does MLOps end and DevOps begin? Our friend Luis from OctoML joins us in this episode to discuss treating AI/ML models as regular software components (once they are trained and ready for deployment). We get into topics including optimization on various kinds of hardware and deployment of models at the edge.

Secrets of the Middle Market with Tony Lystra
How UW research spawned a fast-growing Seattle startup

Secrets of the Middle Market with Tony Lystra

Play Episode Listen Later Jan 19, 2022 15:59


The New Stack Podcast
Deploying Scalable Machine Learning Models for Long-Term Sustainability

The New Stack Podcast

Play Episode Listen Later Jan 11, 2022 15:48


As machine learning models proliferate and become more sophisticated, deploying them to the cloud becomes increasingly expensive. Optimizing a model at scale also requires the flexibility to move it to different hardware, such as graphics processing units (GPUs) or central processing units (CPUs), to gain more advantage. The ability to accelerate the deployment of machine learning models to the cloud or edge at scale is shifting the way organizations build next-generation AI models and applications. And being able to optimize these models quickly to save costs and sustain them over time is moving to the forefront for many developers. In this episode of The New Stack Makers podcast, recorded at AWS re:Invent, Luis Ceze, co-founder and CEO of OctoML, talks about how to optimize and deploy machine learning models on any hardware, cloud, or edge device. Alex Williams, founder and publisher of The New Stack, hosted this podcast.
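The CPU-versus-GPU trade-off Luis discusses is straightforward to measure directly. A rough sketch using ONNX Runtime rather than OctoML's own tooling; the model path and input shape are placeholders, and the CUDA provider requires a GPU-enabled onnxruntime build:

```python
import time
import numpy as np
import onnxruntime as ort

MODEL = "model.onnx"                                     # placeholder path
x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # placeholder input shape

def bench(providers, runs=50):
    """Average latency in milliseconds for one inference with the given providers."""
    sess = ort.InferenceSession(MODEL, providers=providers)
    name = sess.get_inputs()[0].name
    sess.run(None, {name: x})                            # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {name: x})
    return (time.perf_counter() - start) / runs * 1e3

print(f"CPU: {bench(['CPUExecutionProvider']):.2f} ms")
print(f"GPU: {bench(['CUDAExecutionProvider', 'CPUExecutionProvider']):.2f} ms")
```

Dividing each measured latency into the hourly price of the corresponding instance gives a first-order cost-per-inference comparison, which is the kind of decision this episode is about.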

Intel on AI
Computing with DNA – Intel on AI Season 3, Episode 6

Intel on AI

Play Episode Listen Later Dec 22, 2021 38:30


In this episode of Intel on AI, host Amir Khosrowshahi and Luis Ceze talk about building better computer architectures, molecular biology, and synthetic DNA. Luis Ceze is the Lazowska Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, Co-founder and CEO at OctoML, and Venture Partner at Madrona Venture Group. His research focuses on the intersection between computer architecture, programming languages, machine learning, and biology. His current research focus is on approximate computing for efficient machine learning and DNA-based data storage. He co-directs the Molecular Information Systems Lab (misl.bio) and the Systems and Architectures for Machine Learning lab (sampl.ai). He has co-authored over 100 papers in these areas, and had several papers selected as IEEE Micro Top Picks and CACM Research Highlights. His research has been featured prominently in the media including New York Times, Popular Science, MIT Technology Review, Wall Street Journal, among others. He is a recipient of an NSF CAREER Award, a Sloan Research Fellowship, a Microsoft Research Faculty Fellowship, the 2013 IEEE TCCA Young Computer Architect Award, the 2020 ACM SIGARCH Maurice Wilkes Award and UIUC Distinguished Alumni Award. In the episode, Amir and Luis talk about DNA storage, which has the potential to be a million times denser than solid state storage today. Luis goes into detail about the process he and fellow researchers at the University of Washington along with a team from Microsoft went through in order to store the high-definition music video "This Too Shall Pass" by the band OK Go onto DNA. Luis also discusses why enzymatic synthesis of DNA might potentially be environmentally sustainable, the advancements being made in similarity searches, and his role in creating the open source Apache TVM project that aims to use machine learning to find the most efficient hardware and software combination optimizations. Amir and Luis end the episode talking about why multi-technology systems with electronics, photonics, molecular systems, and even quantum components could be the future of compute. Academic research discussed in the podcast episode:
The biologic synthesis of deoxyribonucleic acid
Towards practical, high-capacity, low-maintenance information storage in synthesized DNA
DNA Hybridization Catalysts and Catalyst Circuits
A simple DNA gate motif for synthesizing large-scale circuits
A DNA-Based Archival Storage System
Random access in large-scale DNA data storage
Landscape of Next-Generation Sequencing Technologies
Clustering Billions of Reads for DNA Data Storage
Demonstration of End-to-End Automation of DNA Data Storage
High density DNA data storage library via dehydration with digital microfluidic retrieval
Probing the physical limits of reliable DNA data retrieval
Stabilizing synthetic DNA for long-term data storage with earth alkaline salts
Molecular-level similarity search brings computing to DNA data storage
DNA Data Storage and Near-Molecule Processing for the Yottabyte Era
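The density claim becomes concrete with even the naive two-bits-per-nucleotide mapping. A toy sketch for illustration only; real pipelines like the one described here add addressing, error-correcting codes, and sequence constraints (for example, avoiding long homopolymer runs), none of which is shown:

```python
# Naive illustration: pack every 2 bits of data into one nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {v: k for k, v in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Turn raw bytes into a DNA-like strand, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Invert the mapping back to the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"OK Go")
print(strand)                       # CATTCAGTAGAACACTCGTT
assert decode(strand) == b"OK Go"
```

Even this toy scheme stores four bases per byte; the work discussed in the episode layers synthesis, sequencing, and error correction on top to make retrieval reliable at scale.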

Orchestrate all the Things podcast: Connecting the Dots with George Anadiotis
OctoML announces the latest release of its platform, exemplifies growth in MLOps. Featuring CEO & Co-founder Luis Ceze

Orchestrate all the Things podcast: Connecting the Dots with George Anadiotis

Play Episode Listen Later Dec 16, 2021 29:04


OctoML is announcing the latest release of its platform to automate deployment of production-ready models across the broadest array of clouds, hardware devices and machine learning acceleration engines. Article published on ZDNet

TechCrunch Startups – Spoken Edition
OctoML raises $85M for its machine learning acceleration platform

TechCrunch Startups – Spoken Edition

Play Episode Listen Later Nov 2, 2021 3:53


OctoML, a Seattle-based startup that helps enterprises optimize and deploy their machine learning models, today announced that it has raised an $85 million Series C round led by Tiger Global Management.

Software Engineering Radio - The Podcast for Professional Software Developers
Episode 479: Luis Ceze on the Apache TVM Machine Learning Compiler

Software Engineering Radio - The Podcast for Professional Software Developers

Play Episode Listen Later Sep 29, 2021 51:29


Luis Ceze of OctoML discusses Apache TVM, an open source machine learning model compiler for a variety of different hardware architectures, with host Akshay Manchale. Luis talks about the challenges in deploying models on specialized hardware and how TVM addresses them.
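One concrete instance of the "specialized hardware" challenge covered in the episode is simply producing a binary the target device can run. A hedged sketch of cross-compiling with TVM's Relay API for an ARM64 edge board; the tiny one-layer network, the target triple, and the cross-compiler name are placeholders that depend on your model and toolchain:

```python
import numpy as np
import tvm
from tvm import relay

# A tiny stand-in network (one dense layer); in practice mod/params come from
# a frontend importer such as relay.frontend.from_onnx.
data = relay.var("data", shape=(1, 16), dtype="float32")
weight = relay.var("weight", shape=(8, 16), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([data, weight], relay.nn.dense(data, weight)))
params = {"weight": np.random.rand(8, 16).astype("float32")}

# Target an ARM64 edge board instead of the host machine.
target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mattr=+neon")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Link with a cross-compiler so the shared object runs on the device.
lib.export_library("model_aarch64.so", cc="aarch64-linux-gnu-gcc")
```

On the device, the resulting model_aarch64.so needs only TVM's small runtime to load and run, with no compiler stack present.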

Hanselminutes - Fresh Talk and Tech for Developers
Maximizing machine learning performance with OctoML and Luis Ceze

Hanselminutes - Fresh Talk and Tech for Developers

Play Episode Listen Later Aug 5, 2021 31:51


DataCast
Episode 68: Threat Intelligence, Venture Stamina, and Data Investing with Sarah Catanzaro

DataCast

Play Episode Listen Later Jul 14, 2021 76:06


Show Notes
(01:48) Sarah talked about the formative experiences of her upbringing: growing up interested in the natural sciences and switching focus on terrorism analysis after experiencing the 9/11 tragedy with her own eyes.
(04:07) Sarah discussed her experience studying International Security Studies at Stanford and working at the Center for International Security and Cooperation.
(07:15) Sarah recalled her first job out of college as a Program Director at the Center for Advanced Defense Studies — collaborating with academic researchers to develop computational approaches that counter terrorism and piracy.
(09:48) Sarah went over her time as a cyber-intelligence analyst at Cyveillance, which provided threat intelligence services to enterprises worldwide.
(12:22) Sarah walked over her time at Palantir as an embedded analyst, where she observed the struggles that many agencies had with data integration and modeling challenges.
(15:26) Sarah unpacked the challenges of building out the data team and applying the data work at Mattermark.
(20:15) Sarah shared her opinion on the career trajectory for data analysts and data scientists, given her experience as a manager for these roles.
(23:43) Sarah shared the power of having a peer group and building a team culture that she was proud of at Mattermark.
(26:41) Sarah joined Canvas Ventures as a Data Partner in 2016 and shared her motivation for getting into venture capital.
(29:47) Sarah revealed the secret sauce to succeed in venture — stamina.
(32:00) Sarah has been an investor at Amplify Partners since 2017 and shared what attracted her about the firm's investment thesis and the team.
(35:28) Sarah walked through the framework she used to prove her value upfront as the new investor at Amplify.
(38:35) Sarah shared the details behind her investment on the Series A round for OctoML, a Seattle-based startup that leverages Apache TVM to enable their clients to simply, securely, and efficiently deploy any model on any hardware backend.
(44:39) Sarah dissected her investment on the seed round for Einblick, a Boston-based startup that builds a visual computing platform for BI and analytics use cases.
(48:45) Sarah mentioned the key factors inspiring her investment in the seed round for Metaphor Data, a meta-data platform that grew out of the DataHub open-source project developed at LinkedIn.
(53:57) Sarah discussed what triggered her investment in the Series A round for Runway, a New York-based team building the next-generation creative toolkit powered by machine learning.
(58:36) Sarah unpacked the advice she has been giving her portfolio companies in hiring decisions and expanding their founding team (and advice they should ignore).
(01:01:29) Sarah went over the process of curating her weekly newsletter called Projects To Know (active since 2019).
(01:05:00) Sarah predicted the 3 trends in the data ecosystem that will have a disproportionately huge impact in the future.
(01:11:15) Closing segment.
Sarah's Contact Info: Amplify Page, Twitter, LinkedIn, Medium
Amplify Partners' Resources: Website, Team, Portfolio, Blog
Mentioned Content (Blog Posts): Our Investment in OctoML; Announcing Our Investment in Einblick; Our Investment in Metaphor Data; Our Series A Investment in Runway
Mentioned Content (People): Sunil Dhaliwal (General Partner at Amplify Partners), Mike Dauber (General Partner at Amplify Partners), Lenny Pruss (General Partner at Amplify Partners), Mike Volpi (Co-Founder and Partner at Index Ventures), Gary Little (Co-Founder and General Partner at Canvas Ventures)
Book: "Zen and the Art of Motorcycle Maintenance" (by Robert Pirsig)
New Updates: Since the podcast was recorded, Sarah has been keeping her stamina high! Her investments in Hex (data workspace for teams) and Meroxa (real-time data platform) have been made public. She has also spoken at various panels, including SIGMOD, REWORK, University of Chicago, and Utah Nerd Nights. Be sure to follow @sarahcat21 on Twitter to subscribe to her brain on the intersection of data, VC, and startups!

Datacast
Episode 68: Threat Intelligence, Venture Stamina, and Data Investing with Sarah Catanzaro

Datacast

Play Episode Listen Later Jul 14, 2021 76:06


Show Notes
(01:48) Sarah talked about the formative experiences of her upbringing: growing up interested in the natural sciences and switching focus on terrorism analysis after experiencing the 9/11 tragedy with her own eyes.
(04:07) Sarah discussed her experience studying International Security Studies at Stanford and working at the Center for International Security and Cooperation.
(07:15) Sarah recalled her first job out of college as a Program Director at the Center for Advanced Defense Studies — collaborating with academic researchers to develop computational approaches that counter terrorism and piracy.
(09:48) Sarah went over her time as a cyber-intelligence analyst at Cyveillance, which provided threat intelligence services to enterprises worldwide.
(12:22) Sarah walked over her time at Palantir as an embedded analyst, where she observed the struggles that many agencies had with data integration and modeling challenges.
(15:26) Sarah unpacked the challenges of building out the data team and applying the data work at Mattermark.
(20:15) Sarah shared her opinion on the career trajectory for data analysts and data scientists, given her experience as a manager for these roles.
(23:43) Sarah shared the power of having a peer group and building a team culture that she was proud of at Mattermark.
(26:41) Sarah joined Canvas Ventures as a Data Partner in 2016 and shared her motivation for getting into venture capital.
(29:47) Sarah revealed the secret sauce to succeed in venture — stamina.
(32:00) Sarah has been an investor at Amplify Partners since 2017 and shared what attracted her about the firm's investment thesis and the team.
(35:28) Sarah walked through the framework she used to prove her value upfront as the new investor at Amplify.
(38:35) Sarah shared the details behind her investment on the Series A round for OctoML, a Seattle-based startup that leverages Apache TVM to enable their clients to simply, securely, and efficiently deploy any model on any hardware backend.
(44:39) Sarah dissected her investment on the seed round for Einblick, a Boston-based startup that builds a visual computing platform for BI and analytics use cases.
(48:45) Sarah mentioned the key factors inspiring her investment in the seed round for Metaphor Data, a meta-data platform that grew out of the DataHub open-source project developed at LinkedIn.
(53:57) Sarah discussed what triggered her investment in the Series A round for Runway, a New York-based team building the next-generation creative toolkit powered by machine learning.
(58:36) Sarah unpacked the advice she has been giving her portfolio companies in hiring decisions and expanding their founding team (and advice they should ignore).
(01:01:29) Sarah went over the process of curating her weekly newsletter called Projects To Know (active since 2019).
(01:05:00) Sarah predicted the 3 trends in the data ecosystem that will have a disproportionately huge impact in the future.
(01:11:15) Closing segment.
Sarah's Contact Info: Amplify Page, Twitter, LinkedIn, Medium
Amplify Partners' Resources: Website, Team, Portfolio, Blog
Mentioned Content (Blog Posts): Our Investment in OctoML; Announcing Our Investment in Einblick; Our Investment in Metaphor Data; Our Series A Investment in Runway
Mentioned Content (People): Sunil Dhaliwal (General Partner at Amplify Partners), Mike Dauber (General Partner at Amplify Partners), Lenny Pruss (General Partner at Amplify Partners), Mike Volpi (Co-Founder and Partner at Index Ventures), Gary Little (Co-Founder and General Partner at Canvas Ventures)
Book: "Zen and the Art of Motorcycle Maintenance" (by Robert Pirsig)
New Updates: Since the podcast was recorded, Sarah has been keeping her stamina high! Her investments in Hex (data workspace for teams) and Meroxa (real-time data platform) have been made public. She has also spoken at various panels, including SIGMOD, REWORK, University of Chicago, and Utah Nerd Nights. Be sure to follow @sarahcat21 on Twitter to subscribe to her brain on the intersection of data, VC, and startups!

Gradient Dissent - A Machine Learning Podcast by W&B
OctoML CEO Luis Ceze on accelerating machine learning systems

Gradient Dissent - A Machine Learning Podcast by W&B

Play Episode Listen Later Jun 24, 2021 48:28


From Apache TVM to OctoML, Luis gives direct insight into the world of ML hardware optimization, and where systems optimization is heading. --- Luis Ceze is co-founder and CEO of OctoML, co-author of the Apache TVM Project, and Professor of Computer Science and Engineering at the University of Washington. His research focuses on the intersection of computer architecture, programming languages, machine learning, and molecular biology. Connect with Luis:

Changelog Master Feed
Apache TVM and OctoML (Practical AI #134)

Changelog Master Feed

Play Episode Listen Later May 18, 2021 49:06 Transcription Available


90% of AI / ML applications never make it to market, because fine-tuning models for maximum performance across disparate ML software solutions and hardware backends requires a ton of manual labor and is cost-prohibitive. Luis Ceze and his team created Apache TVM at the University of Washington, then left to found OctoML and bring the project to market.

Practical AI
Apache TVM and OctoML

Practical AI

Play Episode Listen Later May 18, 2021 49:06 Transcription Available


90% of AI / ML applications never make it to market, because fine-tuning models for maximum performance across disparate ML software solutions and hardware backends requires a ton of manual labor and is cost-prohibitive. Luis Ceze and his team created Apache TVM at the University of Washington, then left to found OctoML and bring the project to market.

What the Dev?
Making AI more accessible to developers with OctoML CEO Luis Ceze - Episode 102

What the Dev?

Play Episode Listen Later May 11, 2021 17:06


In this week's episode we spoke with Luis Ceze, professor at the University of Washington, co-creator of the Apache TVM project, and co-founder and CEO of OctoML. He spoke about how open source projects like Apache TVM make AI and machine learning more accessible to developers, as well as the foundations that need to be in place to increase AI adoption. 

Orchestrate all the Things podcast: Connecting the Dots with George Anadiotis
OctoML scores $28M to go to market with open source Apache TVM, a de facto standard for MLOps. Backstage chat with CEO Luis Ceze

Orchestrate all the Things podcast: Connecting the Dots with George Anadiotis

Play Episode Listen Later Mar 17, 2021 32:57


Machine learning operations, or MLOps, is the art and science of taking machine learning models from the data science lab to production. It's been a hot topic for the last couple of years, and for good reason. Going from innovation to scalability and repeatability are the hallmarks of generating business value, and MLOps represents precisely that for machine learning. Apache TVM has become a de facto standard in MLOps, and OctoML is the company gearing its commercialization and scale up.  As OctoML secured a $28 million Series B funding round, we caught up with its CEO and co-founder Luis Ceze to discuss TVM, OctoML, and MLOps. Article published on ZDNet

Software Daily
OctoML: Automated Deep Learning Engineering with Jason Knight and Luis Ceze

Software Daily

Play Episode Listen Later Feb 9, 2021


The incredible advances in machine learning research in recent years often take time to propagate out into usage in the field. One reason for this is that such “state-of-the-art” results for machine learning performance rely on the use of handwritten, idiosyncratic optimizations for specific hardware models or operating contexts. When developers are building ML-powered systems to deploy in the cloud and at the edge, their goal is to ensure the model delivers the best possible functionality and end-user experience, and importantly, their hardware and software stack may require different optimizations to achieve that goal. OctoML provides a SaaS product called the Octomizer to help developers and AIOps teams deploy ML models most efficiently on any hardware, in any context. The Octomizer deploys its own ML models to analyze your model topology, and optimize, benchmark, and package the model for deployment. The Octomizer generates insights about model performance over different hardware stacks and helps you choose the deployment format that works best for your organization. Luis Ceze is the Co-Founder and CEO of OctoML. Luis is a founder of the Apache TVM project, which is the basis for OctoML's technology. He is also a professor of Computer Science at the University of Washington. Jason Knight is co-founder and CPO at OctoML. Luis and Jason join the show today to talk about how OctoML is automating deep learning engineering, why it's so important to consider hardware when building deep learning systems, and how the field of deep learning is evolving.
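The "package the model for deployment" step described here has a concrete downstream counterpart: the packaged artifact is a shared library that a thin runtime loads. A minimal sketch using plain Apache TVM runtime calls rather than the Octomizer SDK; the artifact name, input name, and shape are placeholders and assume a graph-executor build like the compilation sketches earlier on this page:

```python
import numpy as np
import tvm
from tvm.contrib import graph_executor

dev = tvm.cpu()  # or tvm.cuda(0) for a GPU-targeted artifact
lib = tvm.runtime.load_module("model_llvm.so")            # placeholder artifact name
module = graph_executor.GraphModule(lib["default"](dev))  # standard packaged-module entry point

x = np.random.rand(1, 3, 224, 224).astype("float32")      # placeholder input
module.set_input("input", tvm.nd.array(x, device=dev))    # placeholder input name
module.run()
print(module.get_output(0).numpy().shape)
```

The same loading code works whether the .so was produced locally or by a hosted optimization service, which is the separation of concerns this episode describes.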

AI Buzz
Fujitsu can spot confusion, OctoML raises funding for on-device ML, and Swarm AI

AI Buzz

Play Episode Listen Later Oct 25, 2019 18:20


In this episode, I will discuss impressive performance of a model from Fujitsu laboratories that can detect nervousness and confusion, how on-device machine learning startup OctoML is getting funding and has an all-star team to extend AI to edge devices, and how Unanimous AI is using their Swarm Platform to crowdsource features of their machine learning model.