Catch us at Modular's ModCon next week with Chris Lattner, and join our community!

Due to Bryan's very wide-ranging experience in data science and AI across Blue Bottle (!), Stitch Fix, Weights & Biases, and now Hex Magic, this episode can be considered a two-parter.

Notebooks = Chat++

We've talked a lot about AI UX (in our meetups, writeups, and guest posts), and today we're excited to dive into a new old player in AI interfaces: notebooks! Depending on your background, you either Don't Like or you Like notebooks — they are the most popular example of Knuth's Literate Programming concept: basically a collection of cells, where each cell can execute code, display its output, and share its state with all the other cells in a notebook. Cells can also simply be Markdown cells that add commentary to the analysis. Notebooks have a long history, but most recently became popular as IPython evolved into Project Jupyter, and a wave of notebook-based startups from Observable to Deepnote and Databricks sprang up for the modern data stack.

The first wave of AI applications has been very chat focused (ChatGPT, Character.ai, Perplexity, etc). Chat as a user interface has a few shortcomings, the major one being the inability to edit previous messages. We enjoyed Bryan's takes on why notebooks feel like "Chat++" and how they are building Hex Magic:

* Atomic actions vs. stream of consciousness: in a chat interface, you make corrections by adding more messages to the conversation (i.e. "Can you try again by doing X instead?" or "I actually meant XYZ"). The context can easily get messy and confusing for models (and humans!) to follow. Notebooks' cell structure, on the other hand, lets users go back to any previous cell and make edits without having to add new ones at the bottom.
* "Airlocks" for repeatability: one of the ideas they came up with at Hex is "airlocks", a collection of cells that depend on each other and keep each other in sync.
If you have a task like "Create a summary of my customers' recent purchases", there are many sub-tasks to be done (look up the data, sum the amounts, write the text, etc). Each sub-task will be in its own cell, and the airlock will keep them all in sync together.
* Technical + non-technical users: previously you had to use Python / R / Julia to write notebook code, but with models like GPT-4, natural language is usually enough. Hex is also working on lowering the barrier of entry for non-technical users into notebooks, similar to what Code Interpreter is doing in ChatGPT. Obviously notebooks aren't new for developers (the OpenAI Cookbooks are a good example), but they haven't had much adoption in less technical spheres. The shortcomings of chat UIs, plus LLMs lowering the barrier of entry to creating code cells, might make notebooks a much more popular UX going forward.

RAG = RecSys!

We also talked about the LLMOps landscape and why it's an "iron mine" rather than a "gold rush":

"I'll shamelessly steal [this] from a friend, Adam Azzam from Prefect. He says that [LLMOps] is more of like an iron mine than a gold mine in the sense of there is a lot of work to extract this precious, precious resource. Don't expect to just go down to the stream and do a little panning. There's a lot of work to be done. And frankly, the steps to go from this resource to something valuable are significant."

Some of my favorite takeaways:

* RAG as RecSys for LLMs: at its core, the goal of a RAG pipeline is finding the most relevant documents for a task. This isn't very different from traditional recommendation system products that surface things for users. How can we apply old lessons to this new problem?
Bryan cites fellow AIE Summit speaker and Latent Space Paper Club host Eugene Yan in decomposing the retrieval problem into retrieval, filtering, and scoring/ranking/ordering. As AI Engineers increasingly find that long context has tradeoffs, they will also have to relearn the age-old lessons that vector search is NOT all you need and that a good "systems, not models" approach is essential to scalable, debuggable RAG. Good thing Bryan has just written the first O'Reilly book about modern RecSys, eh?
* Narrowing down evaluation: while "hallucination" is an easy term to throw around, the reality is more nuanced. A lot of the time, model errors can be automatically fixed: is this JSON valid? If not, why? Is it just missing a closing brace? These smaller issues can be checked and fixed before returning the response to the user, which is easier than fixing the model.
* Fine-tuning isn't all you need: when they first started building Magic, one of the discussions was around fine-tuning a model. In our episode with Jeremy Howard we talked about how fine-tuning leads to loss of capabilities as well. In notebooks, you are often dealing with domain-specific data (i.e. purchases, orders, wardrobe composition, household items, etc); the fact that the model understands that "items" are probably part of an "order" is really helpful. They found that GPT-4 + 3.5-turbo were everything they needed to ship a great product, rather than having to fine-tune on notebooks specifically.

Definitely recommend listening to this one if you are interested in getting a better understanding of how to think about AI, data, and how we can use traditional machine learning lessons with large language models.
The AI Pivot

For more Bryan, don't miss his fireside chat at the AI Engineer Summit:

Show Notes

* Hex Magic
* Bryan's new book: Building Recommendation Systems in Python and JAX
* Bryan's whitepaper about MLOps
* "Kitbashing in ML", slides from his talk on building on top of foundation models
* "Bayesian Statistics The Fun Way" by Will Kurt
* Bryan's Twitter
* "Berkeley man determined to walk every street in his city"
* People:
  * Adam Azzam
  * Graham Neubig
  * Eugene Yan
  * Even Oldridge

Timestamps

* [00:00:00] Bryan's background
* [00:02:34] Overview of Hex and the Magic product
* [00:05:57] How Magic handles the complex notebook format to integrate cleanly with Hex
* [00:08:37] Discussion of whether to build vs. buy models - why Hex uses GPT-4 vs. fine-tuning
* [00:13:06] UX design for Magic with Hex's notebook format (aka "Chat++")
* [00:18:37] Expanding notebooks to less technical users
* [00:23:46] The "Memex" as an exciting underexplored area - personal knowledge graph and memory augmentation
* [00:27:02] What makes for good LLMOps vs. MLOps
* [00:34:53] Building rigorous evaluators for Magic and best practices
* [00:36:52] Different types of metrics for LLM evaluation beyond just end-task accuracy
* [00:39:19] Evaluation strategy when you don't own the core model that's being evaluated
* [00:41:49] All the places you can make improvements outside of retraining the core LLM
* [00:45:00] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence of Decibel Partners, and today I'm joined by Bryan Bischof. [00:00:15]

Bryan: Hey, nice to meet you. [00:00:17]

Alessio: So Bryan has one of the most thorough and impressive backgrounds we've had on the show so far. Lead software engineer at Blue Bottle Coffee, which if you live in San Francisco, you know a lot about. And maybe you'll tell us 30 seconds on what that actually means.
You worked as a data scientist at Stitch Fix, which used to be one of the premier data science teams out there. [00:00:38]

Bryan: It used to be. Ouch. [00:00:39]

Alessio: Well, no, no. Well, you left, you know, so how good can it still be? Then head of data science at Weights and Biases. You're also a professor at Rutgers and you're just wrapping up a new O'Reilly book as well. So a lot, a lot going on. Yeah. [00:00:52]

Bryan: And currently head of AI at Hex. [00:00:54]

Alessio: Let's do the Blue Bottle thing because I definitely want to hear what's the, what's that like? [00:00:58]

Bryan: So I was leading data at Blue Bottle. I was the first data hire. I came in to kind of get the data warehouse in order and then see what we could build on top of it. But ultimately I mostly focused on demand forecasting, a little bit of recsys, a little bit of sort of like website optimization and analytics. But ultimately anything that you could imagine sort of like a retail company needing to do with their data, we had to do. I sort of like led that team, hired a few people, expanded it out. One interesting thing was I was part of the Nestle acquisition. So there was a period of time where we were sort of preparing for that and didn't know, which was a really interesting dynamic. Being acquired is a very not necessarily fun experience for the data team. [00:01:37]

Alessio: I build a lot of internal tools for sourcing at the firm, and we have a small VCs-and-data community of like other people doing it. And I feel like if you had a data feed into like the Blue Bottle in South Park, the Blue Bottle at the HanaHaus in Palo Alto, you can get a lot of secondhand information on the state of VC funding. [00:01:54]

Bryan: Oh yeah. I feel like the real source of alpha is just bugging a Blue Bottle. [00:01:58]

Alessio: Exactly. And what's your latest book about? [00:02:02]

Bryan: I just wrapped up a book with a coauthor Hector Yee called Building Production Recommendation Systems.
I'll give you the rest of the title because it's fun. It's in Python and JAX. And so for those of you that are like eagerly awaiting the first O'Reilly book that focuses on JAX, here you go. [00:02:17]

Alessio: Awesome. And we'll chat about that later on. But let's maybe talk about Hex and Magic before. I've known Hex for a while, I've used it as a notebook provider and you've been working on a lot of amazing AI enabled experiences. So maybe run us through that. [00:02:34]

Bryan: So I too, before I sort of like joined Hex, saw it as this like really incredible notebook platform, sort of a great place to do data science workflows, quite complicated, quite ad hoc interactive ones. And before I joined, I thought it was the best place to do data science workflows. And so when I heard about the possibility of building AI tools on top of that platform, that seemed like a huge opportunity. In particular, I lead the product called Magic. Magic is really like a suite of sort of capabilities as opposed to its own independent product. What I mean by that is they are sort of AI enhancements to the existing product. And that's a really important difference from sort of building something totally new that just uses AI. It's really important to us to enhance the already incredible platform with AI capabilities. So these are things like the sort of obvious like co-pilot-esque vibes, but also more interesting and dynamic ways of integrating AI into the product. And ultimately the goal is just to make people even more effective with the platform. [00:03:38]

Alessio: How do you think about the evolution of the product and the AI component? You know, even if you think about 10 months ago, some of these models were not really good on very math based tasks. Now they're getting a lot better. I'm guessing a lot of your workloads and use cases is data analysis and whatnot. [00:03:53]

Bryan: When I joined, it was pre-4 and it was pre the sort of like new chat API and all that.
But when I joined, it was already clear that GPT was pretty good at writing code. And so when I joined, they had already executed on the vision of: what if we allowed the user to ask a natural language prompt to an AI and have the AI assist them with writing code? So what that looked like when I first joined was it had some capability of writing SQL and it had some capability of writing Python and it had the ability to explain and describe code that was already written. Those very, what feel like now, primitive capabilities, believe it or not, were already quite cool. It's easy to look back and think, oh, it's like kind of like the Stone Age in these timelines. But to be clear, when you're building on such an incredible platform, adding a little bit of these capabilities feels really effective. And so almost immediately I started noticing how it affected my own workflow, because ultimately, as sort of like an engineering lead, a lot of my responsibility is to be doing analytics to make data-driven decisions about what products we build. And so I'm actually using Hex quite a bit in the process of like iterating on our product. When I'm using Hex to do that, I'm using Magic all the time. And even in those early days, the amount that it sped me up, that it enabled me to very quickly like execute, was really impressive. And so even though the models weren't that good at certain things back then, that capability was not to be underestimated. But to your point, the models have evolved between 3.5 Turbo and 4. We've actually seen quite a big enhancement in the kinds of tasks that we can ask Magic, and even more so with things like function calling and understanding a little bit more of the landscape of agent workflows, we've been able to really accelerate. [00:05:57]

Alessio: You know, I tried using some of the early models in notebooks and they actually didn't like the .ipynb formatting, kind of like a JSON plus XML plus all these weird things. How have you kind of tackled that?
Do you have some magic behind the scenes to make it easier for models? Like, are you still using completely off-the-shelf models? Do you have some proprietary ones? [00:06:19]

Bryan: We are using, at the moment in production, 3.5 Turbo and GPT-4. I would say for a large number of our applications, GPT-4 is pretty much required. To your question about, does it understand the structure of the notebook? And does it understand all of these somewhat complicated wrappers around the content that you want to show? We do our very best to abstract that away from the model and make sure that the model doesn't have to think about what the cell wrapper code looks like. Or for our Magic charts, it doesn't have to speak the language of Vega. These are things that we put a lot of work in on the engineering side, to the AI engineer profile. This is the AI engineering work: to get all of that out of the way so that the model can speak in the languages that it's best at. The model is quite good at SQL. So let's ensure that it's speaking the language of SQL and that we are doing the engineering work to get the output of that model, the generations, into our notebook format. So too for other cell types that we support, including charts, and just in general, understanding the flow of different cells, understanding what a notebook is: all of that is hard work that we've done to ensure that the model doesn't have to learn anything like that. I remember early on, people asked the question, are you going to fine-tune a model to understand Hex cells? And almost immediately, my answer was no. No, we're not. Having used fine-tuned models in 2022, I was already aware that there are some limitations to that approach, and frankly, even using GPT-3 and GPT-2 back in the day at Stitch Fix, I had already seen a lot of instances where putting more effort into pre- and post-processing can avoid some of these larger lifts. [00:08:14]

Alessio: You mentioned Stitch Fix and GPT-2.
How has the balance between build versus buy, so to speak, evolved? GPT-2 was a model that was not super advanced, so for a lot of use cases it was worth building your own thing. With GPT-4 and the like, is there a reason to still build your own models for a lot of this stuff? Or should most people be fine-tuning? How do you think about that? [00:08:37]

Bryan: Sometimes people ask, why are you using GPT-4 and why aren't you going down the avenue of fine-tuning today? I can get into fine-tuning specifically, but I do want to talk a little bit about the good old days of GPT-2. Shout out to Reza. Reza introduced me to GPT-2. I still remember him explaining the difference between general transformers and GPT. I remember one of the tasks that we wanted to solve with transformer-based generative models at Stitch Fix was writing descriptions of clothing. You might think, ooh, that's a multi-modal problem. The answer is, not necessarily. We actually have a lot of features about the clothes that are almost already enough to generate some reasonable text. I remember at that time, that was one of the first applications that we had considered. There was a really great team of NLP scientists at Stitch Fix who worked on a lot of applications like this. I still remember being exposed to the GPT endpoint back in the days of 2. If I'm not mistaken, and feel free to fact-check this, I'm pretty sure Stitch Fix was the first OpenAI customer, as like their true enterprise application. Long story short, I ultimately think that depending on your task, using the most cutting-edge general model has some advantages. If those are advantages that you can reap, then go for it. So at Hex, why GPT-4? Why do we need such a general model for writing code, writing SQL, doing data analysis? Shouldn't a fine-tuned model just on Kaggle notebooks be good enough? I'd argue no.
And ultimately, because we don't have one specific sphere of data that we need to write great data analysis workbooks for, we actually want to provide a platform for anyone to do data analysis about their business. To do that, you actually need to entertain an extremely general universe of concepts. So as an example, if you work at Hex and you want to do data analysis, our projects are called Hexes. That's relatively straightforward to teach it. There's a concept of a notebook. These are data science notebooks, and you want to ask analytics questions about notebooks. Maybe if you trained on notebooks, you could answer those questions, but let's come back to Blue Bottle. If I'm at Blue Bottle and I have data science work to do, I have to ask it questions about coffee. I have to ask it questions about pastries, doing demand forecasting. And so very quickly, you can see that just by serving just those two customers, a model purely fine-tuned on like Kaggle competitions may not actually fit the bill. And so the more and more that you want to build a platform that is sufficiently general for your customer base, the more I think that these large general models really pack a lot of additional opportunity in. [00:11:21]

Alessio: With a lot of our companies, we talked about stuff that you used to have to extract features for that now you get out of the box. So say you're a travel company, you want to do a query like: show me all the hotels and places that are warm during spring break. It would be just literally like impossible to do before these models, you know? But now the model knows, okay, spring break is like usually these dates and like these locations are usually warm. So you get so much out of it for free. And in terms of Magic integrating into Hex, I think AI UX is one of our favorite topics and how do you actually make that seamless.
In traditional code editors, the line of code is like kind of the atomic unit, and in Hex, you have the code, but then you have the cell also. [00:12:04]

Bryan: I think the first time I saw Copilot and really like fell in love with Copilot, I thought: finally, fancy auto-complete. And that felt so good. It felt so elegant. It felt so right-sized for the task. But as a data scientist, a lot of the work that you do previous to the ML engineering part of the house, you're working in these cells and these cells are atomic. They're expressing one idea. And so ultimately, if you want to make the transition from something like VS Code, where you've got like a large amount of code and there's a large number of files and they kind of need to have awareness of one another, that's a long story and we can talk about that. But in this atomic, somewhat linear flow through the notebook, what you ultimately want to do is you want to reason with the agent at the level of these individual thoughts, these atomic ideas. Usually it's good practice in, say, a Jupyter notebook to not let your cells get too big. If your cell doesn't fit on one page, that's like kind of a code smell, like why is it so damn big? What are you doing in this cell? That also lends some hints as to what the UI should feel like. I want to ask questions about this one atomic thing. So you ask the agent: take this data frame and strip out this prefix from all the strings in this column. That's an atomic task. It's probably about two lines of pandas. I can write it, but it's actually very natural to ask Magic to do that for me. And what I promise you is that it is faster to ask Magic to do that for me. At this point, that kind of code, I never write. And so then you ask the next question, which is: what should the UI be to do chains, to do multiple cells that work together? Because ultimately a notebook is a chain of cells, and actually it's a first-class citizen for Hex.
So we have a DAG, and the DAG is the execution DAG for the individual cells. This is one of the reasons that Hex is reactive and kind of dynamic in that way. And so the very next question is: what is the sort of like AI UI for these collections of cells? And back in June and July, we thought really hard about what it feels like to ask Magic a question and get a short chain of cells back that execute on that task. And so we've thought a lot about sort of like how that breaks down into individual atomic units and how those are tied together. We introduced something which is kind of an internal name, but it's called the airlock. And the airlock is exactly a sequence of cells that refer to one another, understand one another, use things that are happening in other cells. And it gives you a chance to sort of preview what Magic has generated for you. Then you can accept or reject it as an entire group. And that's one of the reasons we call it an airlock, because at any time you can sort of eject the airlock and see it in the space. But to come back to your question about how the AI UX fits into this notebook: ultimately a notebook is very conversational in its structure. I've got a series of thoughts that I'm going to express as a series of cells. And sometimes if I'm a kind data scientist, I'll put some text in between them too, explaining what on earth I'm doing. And that feels, in my opinion, and I think this is quite shared amongst folks at Hex, like a really nice refinement of the chat UI. I've been saying for several months now: please stop building chat UIs. There is some irony, because I think what the notebook allows is like chat plus plus. [00:15:36]

Alessio: Yeah, I think the first wave of everything was like chat with X. So it was like chat with your data, chat with your documents and all of this. But people want to code, you know, at the end of the day. And I think that goes into the end user.
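As a toy illustration of the execution DAG Bryan describes, here is a sketch using Python's standard-library graphlib. The cell names and dependencies are invented; Hex's actual runtime is, of course, far more involved:

```python
from graphlib import TopologicalSorter

# Each cell declares which cells it reads from; the runtime derives an
# execution order from this DAG. (Cell names are invented for illustration.)
deps = {
    "load_orders": set(),
    "clean_orders": {"load_orders"},
    "summarize": {"clean_orders"},
    "chart": {"summarize"},
}

# Upstream cells always execute before the cells that read from them.
order = list(TopologicalSorter(deps).static_order())
print(order)

def downstream(cell: str, deps: dict[str, set[str]]) -> set[str]:
    """Cells that must re-run when `cell` is edited: the cell itself plus
    everything that transitively reads from it (reactive invalidation)."""
    dirty, changed = {cell}, True
    while changed:
        changed = False
        for node, preds in deps.items():
            if node not in dirty and preds & dirty:
                dirty.add(node)
                changed = True
    return dirty

print(sorted(downstream("clean_orders", deps)))
```

This is also roughly why an airlock can be accepted or rejected as a group: the generated cells form a connected piece of the DAG, so they stand or fall together.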
I think most people that use notebooks are software engineers or data scientists. I think the cool thing about these models is that people who are not traditionally technical can do a lot of very advanced things. And that's why people like Code Interpreter and ChatGPT. How do you think about the evolution of that persona? Do you see a lot of non-technical people also now coming to Hex to like collaborate with their technical folks? [00:16:13]

Bryan: Yeah, I would say there might even be more enthusiasm than we're prepared for. We're obviously like very excited to bring what we call the like low-floor user into this world and give more people the opportunity to self-serve on their data. We wanted to start by focusing on users who are already familiar with Hex and really make Magic fantastic for them. One of the sort of like internal, I would say, almost North Stars is our team's charter: to make Hex feel more magical. That is true for all of our users, but that's easiest to do for users that are already able to use Hex in a great way. What we're hearing from some customers in particular is sort of like: I'm excited for some of my less technical stakeholders to get in there and start asking questions. And so that raises a lot of really deep questions. If you immediately enable self-service for data, which has been almost like a joke over the last maybe eight years, what challenges does that bring with it? What risks does that bring with it? And so it has given us the opportunity to think about things like governance and to think about things like alignment with the data team and making sure that the data team has clear visibility into what the self-service looks like. Having been leading a data team, trying to provide answers for stakeholders and hearing that they really want to self-serve, a question that we often found ourselves asking is: what is the easiest way that we can keep them on the rails?
What is the easiest way that we can set up the data warehouse and set up our tools such that they can ask and answer their own questions without coming away with false answers? Because that is such a priority for data teams, it becomes an important focus of my team, which is: okay, Magic may be an enabler. And if it is, what do we also have to respect? We recently introduced the Data Manager, and the Data Manager is an auxiliary sort of like tool on the Hex platform to allow people to write more relevant metadata about their data warehouse, to make sure that Magic has access to the best information. And there are some things coming to kind of further that story around governance and understanding. [00:18:37]

Alessio: You know, you mentioned self-serve data and how it was like a joke. The whole rush to the modern data stack was something to behold. Do you think AI is in a similar space, where it's a bit of a gold rush? [00:18:51]

Bryan: I have like sort of two comments here. One I'll shamelessly steal from a friend, Adam Azzam from Prefect. He says that this is more of like an iron mine than a gold mine, in the sense that there is a lot of work to extract this precious, precious resource. And that's the first one: I think, don't expect to just go down to the stream and do a little panning. There's a lot of work to be done. And frankly, the steps to go from this resource to something valuable are significant. I think people have gotten a little carried away with the old maxim of: don't go pan for gold, sell pickaxes and shovels. It's a much stronger business model. At this point, I feel like I look around and I see more pickaxe salesmen and shovel salesmen than I do prospectors. And that scares me a little bit. There's a metagame where people are starting to think about how they can build tools for people building tools for AI.
And that starts to give me a little bit of like pause in terms of like, how confident are we that we can even extract this resource into something valuable? I got a text message from a VC earlier today, and I won't name the VC or the fund, but the question was: what are some medium or large size companies that have integrated AI into their platform in a way that you're really impressed by? And I looked at the text message for a few minutes and I was finding myself thinking and thinking, and I responded: maybe only Copilot. It's been a couple hours now, and I don't think I've thought of another one. And I think that's where I reflect again on this, like, iron versus gold. If it was really gold, I feel like I'd be more blown away by other AI integrations. And I'm not yet. [00:20:40]

Alessio: I feel like all the people finding gold are the ones building things that traditionally we didn't focus on. So like Midjourney. I talked to a company yesterday, which I'm not going to name, but they do agents for some use case, let's call it. They are 11 months old. They're making like 8 million a month in revenue, but in a space that you wouldn't even think about selling to. If you were like a shovel builder, you wouldn't even go sell to those people. And Swyx talks about this a bunch, about like actually trying to go application-first for some things. Let's actually see what people want to use and what works. What do you think are the most maybe underexplored areas in AI? Is there anything that you wish people were actually trying to shovel? [00:21:23]

Bryan: I've been saying for a couple of months now, if I had unlimited resources and I was just sort of like truly like, you know, on my own building whatever I wanted, I think the thing that I'd be most excited about is building sort of like the personal Memex. The Memex is something that I've wanted since I was a kid. And are you familiar with the Memex? It's the memory extender.
And it's this idea that sort of like human memory is quite weak. And so if we can extend that, then that's a big opportunity. So I think one of the things that I've always found to be one of the limiting cases here is access: how do you access that data? Even if you did build that data out, how would you quickly access it? And I think there's a constellation of technologies that have come together in the last couple of years that now make this quite feasible. One, information retrieval has really improved, and we have a lot more simple systems for getting started with information retrieval. Two, natural language is ultimately the interface that you'd really like these systems to work on, both in terms of sort of like structuring the data and preparing the data, but also on the retrieval side. So what keys off the query for retrieval? Probably ultimately natural language. And third, if you really want to go into like the purely futuristic aspect of this, it is latent voice-to-text. And that is also something that has quite recently become possible. I did talk to a company recently called Gather, which seems to have some cool ideas in this direction, but I haven't seen yet what I really want, which is: I want something such that every time I listen to a podcast or I watch a movie or I read a book, it sort of like has a great vector index built on top of all that information that's contained within. And then when I'm having my next conversation and I can't quite remember the name of this person who did this amazing thing, for example, if we're talking about the Memex, it'd be really nice to have Vannevar Bush like pop up on my, you know, on my Memex display, because I always forget Vannevar Bush's name. This is one time that I didn't, but I often do.
This is something that I think is only recently enabled, and maybe we're still five years out before it can be good, but I think it's one of the most exciting projects that has become possible in the last three years that I think generally wasn't possible before. [00:23:46]

Alessio: Would you wear one of those AI pendants that record everything? [00:23:50]

Bryan: I think I'm just going to do it, because I just like support the idea. I'm also admittedly someone who, when Google Glass first came out, thought: that seems awesome. I know that there's like a lot of like challenges about the privacy aspect of it, but it is something that I did feel was like a disappointment to lose some of that technology. Fun fact: one of the early Google Glass developers was this MIT computer scientist who basically built the first wearable computer while he was at MIT. And he like took notes about all of his conversations in real time on his wearable, and then he would have real-time access to them. Ended up being kind of a scandal because he wanted to use a computer during his defense and they like tried to prevent him from doing it. So pretty interesting story. [00:24:35]

Alessio: I don't know, but the future is going to be weird. I can tell you that much. Talking about pickaxes, what do you think about the pickaxes that people built before? Like the whole MLOps space, which has its own like startup graveyard in there. How are those products evolving? You know, you were at Weights and Biases before, which is now doing a big AI push as well. [00:24:57]

Bryan: If you really want to like sort of like rub my face in it, you can go look at my white paper on MLOps from 2022. It's interesting. I don't think there's many things in that that I would these days think are like wrong or even sort of like naive. But what I would say is there are both a lot of analogies between MLOps and LLMOps, but there are also a lot of like key differences.
So, leading an engineering team at the moment, I think a lot more about good engineering practices than I do about good ML practices. That being said, it's been very convenient to be able to see around corners in a few of the ML places. One of the first things I did at Hex was work on evals. This was in February; I hadn't yet been overwhelmed by people talking about evals, which happened around May. The reason I was able to be a couple of months early on that is because I've been building evals for ML systems for years. I don't know how else to build an ML system other than to start with the evals. I teach my students at Rutgers that objective framing is one of the most important steps in starting a new data science project. If you can't clearly state what your objective function is, and you can't clearly state how that relates to the problem framing, you've got no hope. And I think that is a very shared reality with LLM applications. Coming back to one thing you mentioned earlier about the applications of these LLMs: to that end, the pickaxes I think are still very valuable are the ones for understanding systems that are inherently less predictable, inherently experimental. On my engineering team, we have an experimentalist. One of the AI engineers' entire focus is experiments. That's something that you wouldn't normally expect to see on an engineering team, but it's important on an AI engineering team to have one person whose entire focus is just experimenting: okay, this is a hypothesis that we have about how the model will behave, or this is a hypothesis we have about how we can improve the model's performance on this, and then going in, running experiments, augmenting our evals to test it, et cetera. What I really respect are pickaxes that recognize the hybrid nature of these engineering tasks. They are ultimately engineering tasks with a flavor of ML.
And so when systems respect that, I tend to have a very high opinion. One thing that I was very aligned with Weights & Biases on is composability. ML systems need to be extremely composable to make them much more iterative. If you don't build these systems in composable ways, then your integration hell is just magnified. When you're trying to iterate as fast as people need to be iterating these days, I think integration hell is a tax not worth paying. [00:27:51]Alessio: Let's talk about some of the LLM-native pickaxes, so to speak. So RAG is one. One thing is doing RAG on text data; another is doing RAG on tabular data. We're releasing tomorrow our episode with Cube, the semantic layer company. Curious to hear your thoughts on it. How are you doing RAG? Pros, cons? [00:28:11]Bryan: It became pretty obvious to me almost immediately that RAG was going to be important, because ultimately you never expect your model to have access to all of the things necessary to respond to a user's request. So as an example, Magic users would like to write SQL that's relevant to their business, and it's important then to have the right data objects that they need to query. We can't expect any LLM to understand our users' data warehouse topology. So what we can expect is that we can build a RAG system that is data warehouse aware, data topology aware, and use that to provide really great information to the model. If you ask the model, "How are my customers trending over time?" and you ask it to write SQL to do that, what is it going to do? Well, ultimately it's going to hallucinate the structure of the data warehouse that it needs to write a general query. Most likely it's going to look in its memory of Stack Overflow responses to customer queries and say, oh, it's probably a customers table, and we're in the age of dbt, so it might even be called, you know, dim_customers or something like that.
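The data-warehouse-aware retrieval Bryan describes can be sketched as a toy ranking step over schema "documents" (a hedged sketch, not Hex's implementation: the table names and descriptions are made up, and a real system would use learned embeddings instead of bag-of-words cosine):

```python
import math
from collections import Counter

def tokens(text):
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical schema "documents" describing warehouse objects.
schema_docs = {
    "dim_customers": "customer name signup date region dimension table",
    "fct_orders": "order revenue customer id order date fact table",
    "dim_products": "product sku category price dimension table",
}

def retrieve(query, k=2):
    # Rank schema objects against the user's question; feed winners to the LLM as context.
    q = tokens(query)
    ranked = sorted(schema_docs, key=lambda t: cosine(q, tokens(schema_docs[t])), reverse=True)
    return ranked[:k]
```

The retrieved table descriptions then get packed into the prompt, so the model writes SQL against tables that actually exist rather than hallucinated ones.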
And what's interesting is, and I encourage you to try it, ChatGPT will do an okay job of hallucinating up some tables. It might even hallucinate up some columns. But what it won't do is understand the joins in that data warehouse that it needs, and it won't understand the data caveats or the WHERE clauses that need to be there. So how do you get it to understand those things? Well, this is textbook RAG. This is the exact kind of thing that you expect RAG to be good at augmenting. But people who have done a lot of thinking about RAG for the document case think of it as chunking and sort of MapReduce-style approaches, and I think people haven't followed this train of thought quite far enough yet. Jerry Liu was on the show and he talked a little bit about thinking of this as information retrieval, and I would push that even further. I would say that ultimately RAG is just RecSys for LLMs. As I already mentioned, I'm a little bit recommendation systems heavy, and so from the beginning, RAG has always felt like RecSys to me. It has always felt like you're building a recommendation system. And what are you trying to recommend? The best possible resources for the LLM to execute on a task. So most of my approach to RAG, and the way that we've improved Magic via retrieval, is by building a recommendation system. [00:30:49]Alessio: It's funny, as you mentioned that you spent three years writing the book, the O'Reilly book. Things must have changed as you wrote it. I don't want to bring out any nightmares from there, but what are the tips for people who want to stay on top of this stuff? Do you have any favorite newsletters, Twitter accounts that you follow, communities you spend time in? [00:31:10]Bryan: I am an aggressive reader of technical books. I think I'm almost never disappointed by time that I've invested in reading technical manuscripts.
I find that most people write O'Reilly or similar books because they've got this itch that they need to scratch: I have some ideas, I have some hard-won understanding, I need to tell other people. And from my experience, something correlates between that itch and useful information. As an example, one of the people on my team, Will Kurt, wrote a book called Bayesian Statistics the Fun Way. I knew some Bayesian statistics, but I read his book anyway, and the reason was: if someone feels motivated to write a book called Bayesian Statistics the Fun Way, they've got something to say about Bayesian statistics. I learned so much from that book. That book is technically targeted at someone with less knowledge and experience than me, and boy, did it humble me about my understanding of Bayesian statistics. So I think this is a very boring answer, but ultimately I read a lot of books and I think they're a really valuable way to learn these things. I also regrettably still read a lot of Twitter. There is plenty of noise in that signal, but it is still usually one of the first places to get an instinct for what's valuable. The other comment I want to make is that we are in this age of arXiv becoming more of an ad platform. I think that makes it a little challenging right now to use it the way that I used to, which is for higher signal. I've chatted a lot with a CMU professor, Graham Neubig, and he's been doing LLM evaluation and LLM enhancements for about five years, and no, I didn't misspeak. Talking to him has provided me a lot of directionality toward more believable sources. Trying to cut through the hype.
I know that there's a lot of other things that I could mention in terms of channels, but ultimately right now I think there's almost an abundance of channels, and I'm a little bit more keen on high signal. [00:33:18]Alessio: The other side of it is, I see so many people say, "Oh, I just wrote a paper on X," and it's an article. And I'm like, an article is not a paper. But it's just funny; I know we were chatting before about terms being reinvented, and people that are not from this space getting into AI engineering now. [00:33:36]Bryan: I also don't want to be gatekeepy. Actually, I used to say a lot to people, don't be shy about putting your ideas down on paper. I think it's okay to just go for it. And I myself have something on arXiv that is comically naive. It's intentionally naive. Right now I'm less concerned by more naive approaches to things than I am by the purely advertising approach to writing these short notes and articles. I think blogging still has a good place. I remember getting feedback during my PhD that my thesis sounded more like a long blog post, and I now feel like that curmudgeonly professor who's like, yeah, maybe just keep this to the blogs. That's funny. Alessio: Yeah, I think one of the things that Swyx said when he was opening the AI Engineer Summit a couple of weeks ago was, look, most people here don't know much about the space because it's so new, and being open and welcoming is one of the goals. That's why we try to keep every episode at a level where the experts can understand and learn something, but the novices can also follow along. You mentioned evals before. I think that's one of the hottest topics out there right now. What are evals? How do we know if they work? What are some of the fun learnings from building them into Hex?
[00:34:53]Bryan: I said something at the AI Engineer Summit that I think a few people have already called out, which is: if you can't get your evals to be objective, then you're not trying hard enough. I stand by that statement. I'm not going to walk it back. I know that doesn't feel super good, because people want to think that their unique snowflake of a problem is too nuanced. But I think this is actually one area where, in this dichotomy of who can do AI engineering, and the answer is kind of everybody, software engineering can become AI engineering and ML engineering can become AI engineering, the more data-science-minded folk have an advantage: we've gotten more practice in taking very vague notions and trying to put an objective function around them. So ultimately I would just encourage everybody who wants to build evals: work incredibly hard on codifying what is good and bad in terms of these objective metrics. As far as how you go about turning those into evals, I think it's kind of sweat equity. I told the CEO of Gantry several months ago, I think it's been like six months now, that I was looking at every single internal Hex request to Magic by hand, with my eyes, and thinking, how can I turn this into an eval? Is there a way that I can take this real request, during this dogfooding, not-very-developed stage, and make it into an evaluation? That was a lot of sweat equity and a lot of boring evenings, but I do think it ultimately gave me a lot of understanding of the ways the model was misbehaving. Another thing is, how can you start to understand these misbehaviors as auxiliary evaluation metrics? There's not just one evaluation that you want to do for every request. It's easy to say, did this work? Did this not work? Did the response satisfy the task?
But there's a lot of other metrics that you can pull off these questions. Let me give you an example. If it writes SQL that doesn't reference a table in the database that it's supposed to be querying against, we would think of that as a hallucination. You could separately consider "is it a hallucination?" as a valuable metric. You could separately consider, does it get the right answer? The right answer is the all-in-one-shot evaluation that I think people jump to, but these intermediary steps are really important. I remember hearing that GitHub had thousands of lines of post-processing code around Copilot to make sure that its responses were correct or in the right place. And that kind of defensive programming against bad responses is the kind of thing that you can build by looking at many different types of evaluation metrics. Because you can say, oh, you know, the Copilot completion here is mostly right, but it doesn't close the brace. Well, that's a thing you can check for. Or, oh, this completion is quite good, but it defines a variable that was already defined in the file. That's going to be a problem, and that's an evaluation you can check separately. This is where it's easy to convince yourself that all that matters is "does it get the right answer?", but the more you think about production use cases of these things, the more you find this kind of stuff. One simple example: sometimes the model names the output of a cell a variable that's already in scope. Okay, we can just detect that and fix it. And as you build these evaluations over time, you really can expand the robustness with which you trust these models. For a company like Hex, we need to put this stuff in GA; we can't just get to demo stage or even private beta stage.
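The hallucinated-table check described above can be codified as a small auxiliary eval (an illustrative sketch: the regex is not a real SQL parser, and the schema and queries are made up):

```python
import re

# Auxiliary eval: does the generated SQL only reference tables that exist?
def referenced_tables(sql):
    # Crude extraction of names following FROM/JOIN; not a real SQL parser.
    return {m.group(2).lower() for m in re.finditer(r"\b(from|join)\s+([\w.]+)", sql, re.I)}

def hallucinated_tables(sql, schema_tables):
    # Tables the model used that are not in the warehouse: a hallucination signal.
    return referenced_tables(sql) - {t.lower() for t in schema_tables}

SCHEMA = {"dim_customers", "fct_orders"}
ok_sql = "SELECT count(*) FROM dim_customers"
bad_sql = "SELECT * FROM customers JOIN orders ON customers.id = orders.customer_id"
```

Run over a corpus of real requests, a check like this becomes one of many auxiliary metrics tracked alongside "did it get the right answer?".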
We're really hunting GA on all of these capabilities. "Did it get the right answer on some cases" is not good enough. [00:38:57]Alessio: I think the follow-up question to that is: in your past roles, you owned the model that you were evaluating against. Here you don't actually have control over how the model evolves. How do you think about "the model will just need to improve," or "we'll use another model," versus "we can build engineering post-processing on top of it"? How do you make the choice? [00:39:19]Bryan: So I want to say two things here. One, Jerry Liu talked a little bit in his episode about how you don't always want to retrain the weights to serve certain use cases; RAG is another tool that you can use to kind of soft-tune. I think that's right. And I want to go back to my favorite analogy here, which is recommendation systems. When you build a recommendation system, you build the objective function. You think about what kind of recs you want to provide, what kind of features you're allowed to use, et cetera. But there's always another step. There's this really wonderful collection of blog posts from Eugene Yan, and ultimately Even Oldridge iterated on that for the Merlin project, where there's this multi-stage recommender. The multi-stage recommender says the first step is to do great retrieval. Once you've done great retrieval, you then need to do great ranking. Once you've done great ranking, you need to do a good job serving. So what's the analogy here? RAG is retrieval. You can build different embedding models to encode different features in your latent space to ensure that your ranking model has the best opportunity. Now you might say, oh, well, my ranking model is something that I've got a lot of capability to adjust. I've got full access to my ranking model. I'm going to retrain it. And that's great. And you should. And over time you will.
But there's one more step, and that's downstream: the serving. Serving often sounds like "I just show the s**t to the user," but ultimately serving is things like: did I provide diverse recommendations? Going back to Stitch Fix days, I can't just recommend them five shirts of the same silhouette and cut; I need to serve them a diversity of recommendations. Have I respected their requirements? They clicked on something that got them to this place; are the recommendations relevant to that query? Are there any hard rules? Do we maybe not have this in stock? These are all things that you put downstream. And so, much like the recommendations use case, there's a lot of knobs to pull outside of retraining the model. Even in recommendation systems, when do you retrain your ranking model? Not nearly as much as you do other s**t. And even the embedding model you might fiddle with more often than the true ranking model. So I think the only piece of the puzzle that you don't have access to in the LLM case is that middle step. That's okay. We've got plenty of other work to do. So right now I feel pretty enabled. [00:41:56]Alessio: That's great. You obviously wrote a book on RecSys. What are some of the key concepts that people who don't have a data science or ML background should keep in mind as they work in this area? [00:42:07]Bryan: It's easy to first think, these models are stochastic, they're unpredictable, oh well, what are we going to do? I think of this almost as a gaseous-type question: if you've got this entropy, where can you put the entropy? Where can you let it be entropic, and where can you constrain it? What I want to say here is: think about the cases where you need it to be really tightly constrained. Why are people so excited about function calling? Because function calling feels like a way to constrict it. Where can you let it be more gaseous?
Well, maybe in the way that it talks about what it wants to do. Maybe for planning, if you're building agents and you want to do something chain-of-thought-y. Well, that's a place where the entropy can happily live. When you're building applications of these models, I think it's really important as part of the problem framing to be super clear upfront: these are the things that can be entropic; these are the things that cannot be; these are the things that need to be super rigid and really, really aligned to a particular schema. We've had a lot of success in making specific the parts that need to be precise and tightly schemified, and that has really paid dividends. Another analogy from data science that I think is very valuable is the human-in-the-loop analogy, which has been around for quite a while. I have gone on record a couple of times saying that I don't really love human in the loop. One of the things that we can learn from human in the loop is that the user is the best judge of what is good, and the user is pretty motivated to interact and give you additional nudges in the direction that you want. What I'd like to flip, though, is instead of human in the loop, I'd like it to be AI in the loop. I'd rather center the user. I'd rather keep the user as the core item at the center of this universe, and the AI is a tool. By switching that analogy a little bit, it allows you to think about the places where the user can reach for this as a tool, execute some task with it, and then go back to their workflow. It still gets this back and forth between things that computers are good at and things that humans are good at, which has been valuable in the human-in-the-loop paradigm. But it allows us to be a little bit more, as the designers say, user-centered. And I think that's really powerful for AI applications.
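Constraining "where the entropy lives" often comes down to validating the model's output against a rigid schema and rejecting anything else. A minimal sketch (the "run_query" function-call schema here is hypothetical, not a real Hex or OpenAI schema):

```python
import json

# Hypothetical rigid schema for a "run_query" function call the model must emit.
# Free-form text is entropic and gets rejected; only conforming JSON passes.
REQUIRED = {"function": str, "table": str, "limit": int}

def validate_call(raw):
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # not even JSON: reject
    if not isinstance(call, dict) or set(call) != set(REQUIRED):
        return None                      # wrong shape: reject
    if any(not isinstance(call[key], typ) for key, typ in REQUIRED.items()):
        return None                      # wrong types: reject
    return call                          # the tightly constrained part, verified
```

The model can be as "gaseous" as it likes in its chain of thought, as long as the part your system executes has to pass a gate like this.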
And it's one of the things that I've been trying really hard with Magic: to make the workflow feel like the AI is right there. It's right where you're doing your work. It's ready for you anytime you need it. But ultimately you're in charge at all times, and your workflow is what we care the most about. [00:44:56]Alessio: Awesome. Let's jump into the lightning round. What's something that is not on your LinkedIn that you're passionate about, or, you know, what's something you would give a TED talk on that is not work related? [00:45:05]Bryan: So I walk a lot. [00:45:07]Bryan: I have walked every road in Berkeley. And I mean every part of every road even, not just the binary question of, have you been on this road? I have this little app that I use called Wandrer, which just lets me keep track of everywhere I've been. And so I'm a little bit obsessed. My wife would say a lot a bit obsessed with what I call new roads. I'm actually more motivated by trails even than roads, but I'm a maximalist, so kind of everything and anything. Yeah. Believe it or not, I was even in the local Berkeley paper just talking about walking every road. So yeah, that's something that I'm surprisingly passionate about. [00:45:45]Alessio: Is there a most underrated road in Berkeley? [00:45:49]Bryan: What I would say is underrated is Kensington. Kensington is a little town just a teeny bit north of Berkeley, but still in the Berkeley hills. And Kensington is so quirky and beautiful. It's really, you know, don't sleep on Kensington. That being said, one of my original motivations for doing all this walking was that people always tell me, Berkeley's so quirky. And I was like, how quirky is Berkeley? Turns out, it's quite, quite quirky. It's also hard to say quirky and Berkeley in the same sentence, I've learned as of now. [00:46:20]Alessio: That's a good podcast warmup for our next guests. All right.
The actual lightning round. So we usually have three questions: acceleration, exploration, then a takeaway. Acceleration: what's something that's already here today that you thought would take much longer to arrive in AI and machine learning? [00:46:39]Bryan: So I invited the CEO of Hugging Face to my seminar when I worked at Stitch Fix, and his talk at the time, honestly, really annoyed me. The talk was titled something to the effect of "LLMs are going to be the technology advancement of the next decade." It's on YouTube. You can find it. I don't remember exactly the title, but regardless, it was something like LLMs for the next decade. And I was like, okay, they're one modality of model, whatever. His talk was fine. I don't think it was particularly amazing or particularly poor, but what I will say is, damn, he was right. I don't think I was quite on board during that talk. I was like, ah, maybe, you know, there's a lot of other modalities that are moving pretty quick. I thought things like RL were going to be the real breakout success, and there's a little pun with Atari and Breakout there, but yeah, man, I was sleeping on LLMs, and I feel a little embarrassed. [00:47:44]Alessio: Yeah. No, I mean, that's a good point. We just had Jeremy Howard on the podcast, and he was saying when he was talking about fine-tuning, everybody thought it was dumb, you know, and then later people realized. And there's something to be said about messaging, especially in technical audiences, where there's kind of a metagame, you know, which is like, oh, these are the cool ideas people are exploring; I don't know where I want to align myself yet. So, exploration. It's kind of the opposite of that. You mentioned RL, right? That's something that was kind of up and up and up.
And then now people are like, oh, I don't know. Are there any other areas, if you weren't working on Magic, that you'd want to go work on? [00:48:25]Bryan: Well, I did mention that I think this Memex product is just incredibly exciting to me, and I think it's very, very feasible. But I would maybe even extend that a little bit: I don't see enough people getting really enthusiastic about hardware with advanced AI built in. You're hearing whisperings of it here and there, pun on Whisper intended, but you're starting to see people putting Whisper into pieces of hardware and making that really powerful. I joked with, I can't think of her name. Oh, Sasha, who I know is a friend of the pod. I joked with Sasha that I wanted to make the Big Mouth Billy Bass a Babel fish, because at this point it's pretty easy to connect that up to Whisper and talk to it in one language and have it talk back in the other language. And I was like, this is the kind of s**t I want people building: silly integrations between hardware and these new capabilities. And as much as I'm starting to hear whisperings here and there, it's not enough. I want to see more people going down this track, because I think ultimately these things need to be in our physical space. And even though the margins are good on software, I want to see more integration into my daily life. Awesome. [00:49:47]Alessio: And then, yeah, a takeaway. What's one message or idea you want everyone to remember and think about? [00:49:54]Bryan: Even though earlier I was talking about maybe not reinventing things and being respectful of the existing ML and data science ideas, I do want to say that I think everybody should be experimenting with these tools as much as they possibly can. I've heard a lot of professors, frankly, express concern about their students using GPT to do their homework.
And I took a completely opposite approach, which is, in the first 15 minutes of the first class of my semester this year, I brought up ChatGPT on screen and we talked about what it was good at. And we talked about how the students can use it. I showed them an example of it doing data analysis work quite well, and then I showed them an example of it doing quite poorly. However much you're integrating with these tools or interacting with these tools, and this audience is probably going to be pretty high on that distribution, I would really encourage you to push this to the other people in your life. My wife is very technical. She's a product manager, and she's using ChatGPT almost every day for communication or for understanding concepts that are outside of her sphere of expertise. And recently my mom and my sister have been onboarded onto the ChatGPT train. So ultimately, I think it is our duty to help other people see how much of a paradigm shift this is. We should really be preparing people for what life is going to be like when these are everywhere. [00:51:25]Alessio: Awesome. Thank you so much for coming on, Bryan. This was fun. [00:51:29]Bryan: Yeah. Thanks for having me. And use Hex Magic. [00:51:31] Get full access to Latent Space at www.latent.space/subscribe
What are all the different versions of Python? You may have heard of Cython, Brython, PyPy, or others and wondered where they fit into the Python landscape. This week on the show, Christopher Trudeau is here, bringing another batch of PyCoder's Weekly articles and projects.
Watch on YouTube About the show Sponsored by us! Support our work through: Our courses at Talk Python Training Test & Code Podcast Patreon Supporters Connect with the hosts Michael: @mkennedy@fosstodon.org Brian: @brianokken@fosstodon.org Show: @pythonbytes@fosstodon.org Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Tuesdays at 11am PT. Older video versions available there too. Brian #1: Plumbum: Shell Combinators and More Suggested by Henry Schreiner last week. (Also, thanks Michael for the awesome search tool on PythonBytes.fm that includes transcripts, so I can find stuff discussed and not just stuff listed in the show notes.) Plumbum is “a small yet feature-rich library for shell script-like programs in Python. The motto of the library is 'Never write shell scripts again', and thus it attempts to mimic the shell syntax (shell combinators) where it makes sense, while keeping it all Pythonic and cross-platform.” Supports local commands, piping, redirection, and working directory changes in a with block. So cool. Lots more fun features. Michael #2: Our plan for Python 3.13 The big difference is that we have now finished the foundational work that we need: Low-impact monitoring (PEP 669) is implemented. The bytecode compiler is in a much better state. The interpreter generator is working. Experiments on the register machine are complete. We have a viable approach to create a low-overhead, maintainable machine code generator, based on copy-and-patch. We plan three parallelizable pieces of work for 3.13: The tier 2 optimizer. Enabling subinterpreters from Python code (PEP 554).
Memory management. Details on superblocks. Brian #3: Some blogging myths Julia Evans' myths (more info on each in the blog post): you need to be original; you need to be an expert; posts need to be 100% correct; writing boring posts is bad; you need to explain every concept; page views matter; more material is always better; everyone should blog. I'd add: Write posts to help yourself remember something. Write posts to help future prospective employers know what topics you care about. You know when you find a post that is outdated and now wrong, and the code doesn't work, but the topic is interesting to you? Go ahead and try to write a better post with code that works. Michael #4: Jupyter AI A generative AI extension for JupyterLab. An %%ai magic that turns the Jupyter notebook into a reproducible generative AI playground. This works anywhere the IPython kernel runs (JupyterLab, Jupyter Notebook, Google Colab, VSCode, etc.). A native chat UI in JupyterLab that enables you to work with generative AI as a conversational assistant. Support for a wide range of generative model providers and models (AI21, Anthropic, Cohere, Hugging Face, OpenAI, SageMaker, etc.). Official project from Jupyter. Provides code insights. Debugs failing code. Provides a general interface for interaction and experimentation with currently available LLMs. Lets you collaborate with peers and an AI in JupyterLab. Lets you ask questions about local files. Video presentation: David Qiu - Jupyter AI — Bringing Generative AI to Jupyter | PyData Seattle 2023 Extras Brian: Textual has had some fun releases recently. Textualize YouTube channel with 3 tutorials so far. trogon to turn Click-based command line apps into TUIs; video example of it working with sqlite-utils. Python in VSCode June Release includes revamped test discovery and execution. You have to turn it on though, as the changes are experimental: "python.experiments.optInto": [ "pythonTestAdapter" ] I just turned it on, so I haven't formed an opinion yet.
Michael: Michael's take on the MacBook Air 15” (black one) Joke: Phishing
We talked about: Christiaan's background Usual ways of collecting and curating data Getting the buy-in from experts and executives Starting an annotation booklet Pre-labeling Dataset collection Human level baseline and feedback Using the annotation booklet to boost annotation productivity Putting yourself in the shoes of annotators (and measuring performance) Active learning Distance supervision Weak labeling Dataset collection in career positioning and project portfolios IPython widgets GDPR compliance and non-English NLP Finding Christiaan online Links: My personal blog: https://useml.net/ Comtura, my company: https://comtura.ai/ LI: https://www.linkedin.com/in/christiaan-swart-51a68967/ Twitter: https://twitter.com/swartchris8/ ML Zoomcamp: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp Join DataTalks.Club: https://datatalks.club/slack.html Our events: https://datatalks.club/events.html
Watch the live stream: Watch on YouTube About the show Sponsored by Microsoft for Startups Founders Hub. Michael #1: Specialist: Python 3.11 perf highlighter via Alex Waygood Visualize CPython 3.11's specializing, adaptive interpreter.
We were fixing servers all night, but at least we have a great story. A special guest joins us to help make a big show announcement. Special Guest: Tim Canham.
We developed an open-source Python package, Gradio, which allows researchers to rapidly generate a visual interface for their ML models. Gradio makes accessing any ML model as easy as sharing a URL. Our development of Gradio is informed by interviews with a number of machine learning researchers who participate in interdisciplinary collaborations. Their feedback identified that Gradio should support a variety of interfaces and frameworks, allow for easy sharing of the interface, allow for input manipulation and interactive inference by the domain expert, as well as allow embedding the interface in IPython notebooks. 2019: Abubakar Abid, Ali Abdalla, Ali Abid, Dawood Khan, Abdulrahman Alfozan, James Y. Zou Machine learning, Accessibility, IPython, Subject-matter expert, Open-source software, Usability, As-Easy-As, Communication endpoint, Python https://arxiv.org/pdf/1906.02569v1.pdf
R is the 18th letter of the Latin alphabet. It represents the rhotic consonant, or the r sound. It goes back to the Greek rho, before that the Phoenician resh, and before that the Egyptian rêš, which was also the Egyptian word for head. R appears in about 7 and a half percent of the words in the English dictionary. And R is probably the best language out there for programming around various statistical and machine learning tasks. We may prototype with tools like TensorFlow imported into languages like Python, but R is incredibly performant for all the maths. And so it has become an essential piece of software for data scientists. The R programming language was created in 1993 by two statisticians, Robert Gentleman and Ross Ihaka, at the University of Auckland, New Zealand. It has since been ported to practically every operating system and is available at r-project.org. Modeled on a language initially called "S," the name became "R" partly to avoid a trademark issue with a commercial software package that we'll discuss in a bit. R is primarily written in C, with parts in Fortran and, increasingly, R itself. And there have been statistical packages since the very first computers were used for math. IBM in fact packaged up BMDP when they first started working on the idea at the UCLA Health Computing Facility. That was 1957. Then came SPSS out of the University of Chicago in 1968. And the same year, John Sall and others gave us SAS (or Statistical Analysis System) out of North Carolina State University. And those evolved from those early days through into the 80s with the advent of object-oriented everything, and thus got not only windowing interfaces but also extensibility, code sharing, and, as we moved into the 90s, acquisitions. BMDP was acquired by SPSS, which was then acquired by IBM, and the products were getting more expensive but not getting a ton of key updates for the same scientific and medical communities. And so we saw the upstarts in the 80s, Data Desk and JMP and others.
Tools built for windowing operating systems and in object-oriented languages. We got the ability to interactively manipulate data, zoom in and spin three-dimensional representations of data, and all kinds of pretty aspects. But they were not a programmer's tool. S was begun in the seventies at Bell Labs and was supposed to be a statistical MATLAB, a language specifically designed for number crunching. And the statistical techniques were far beyond where SPSS and SAS had stopped. And with the breakup of Ma Bell, parts of Bell became Lucent, which sold S to Insightful Corporation, who released S-PLUS and would later get bought by TIBCO. Keep in mind, Bell was testing line quality and statistics, and going back to World War II employed some of the top scientists in those fields, ones who would later create large chunks of the quality movement and implementations like Six Sigma. Once S basically went to a standalone software company, it became less about the statistics and more about porting to different computers to make more money. Private equity and portfolio conglomerates are, by nature, after improving the multiples on a line of business. But sometimes statisticians in various fields might feel left behind. And this is where R comes into the picture. R gained popularity among statisticians because it made it easier to write complicated statistical algorithms without learning a general-purpose programming language. Its popularity has grown significantly since then. R has been described as a cross between MATLAB and SPSS, but much faster. R was initially designed to be a language that could handle statistical analysis and other types of data mining, an offshoot of which we now call machine learning. R is also an open-source language and, as with a number of other languages, has plenty of packages available through a package repository - which they call CRAN (Comprehensive R Archive Network).
This allows R to be used in fields outside of statistics and data science, or just to get new methods to do math that doesn't belong in the main language. There are over 18,000 packages for R. One of the more popular is ggplot2, an open-source data visualization package. data.table is another that performs programmatic data manipulation operations. dplyr provides functions designed to enable data frame manipulation in an intuitive manner. tidyr helps create tidier data. Shiny generates interactive web apps. And there are plenty of packages to make R easier, faster, and more extensible. By 2015, more than 10 million people used R every month, and it's now the 13th most popular language in use. And the needs have expanded. We can drop R scripts into other programs and tools for processing. And some of the workloads are huge. This led to support for parallel computing, specifically using MPI (Message Passing Interface). R is one of the most popular languages used for statistical analysis, statistical graphics generation, and data science projects. There are other languages or tools for specific uses, but R has even started being used in those. The latest version, R 4.1.2, was released on November 1, 2021. R development, as with most thriving open-source solutions, is guided by a group of core developers supported by contributions from the broader community. It became popular because it provides all the essential features for data mining and graphics needed for academic research and industry applications, and because of its pluggable, robust, and versatile nature. And projects like TensorFlow, NumPy, and scikit-learn have evolved for other languages. And there are services from companies like Amazon that can host and process assets from both, whether using unstructured NoSQL databases or Jupyter notebooks.
A Jupyter Notebook is a JSON document, following a versioned schema, that contains an ordered list of input/output cells which can contain code, text (using Markdown), formulas, algorithms, plots, and even media like audio or video. Project Jupyter was a spin-off of IPython, but the goal was to create a language-agnostic tool where we could execute aspects in Ruby or Haskell or Python or even R. This gives us so many ways to get our data into the notebook, in batches or deep learning environments or whatever pipeline needs to be built based on an organization's stack - especially with hosted frontends like Amazon SageMaker Notebooks, Google's Colaboratory, and Microsoft's Azure Notebooks. Think about this: about 25% of the world's languages lack a rhotic consonant. Sometimes it seems like we've got languages that do everything or that we've built products that do everything. But I bet no matter the industry or focus or sub-specialty, there's still 25% more automation or investigation into our own data to be done. Because there always will be.
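The notebook-on-disk format described above is easy to see for yourself. A minimal sketch, using field names from the nbformat 4 schema (the cell contents here are made up), of building and round-tripping an .ipynb-style document as plain JSON:

```python
import json

# A minimal notebook: a format version, metadata, and an ordered list of cells.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('hello')\n"]},
    ],
}

# An .ipynb file is just this structure serialized to JSON on disk.
text = json.dumps(notebook, indent=1)
cells = json.loads(text)["cells"]
print([c["cell_type"] for c in cells])  # ['markdown', 'code']
```

Real notebooks carry more metadata (kernel spec, language info), but the ordered cell list is the heart of it.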
Watch the live stream: Watch on YouTube About the show Sponsored by FusionAuth: pythonbytes.fm/fusionauth Special guest: Ian Hellen Brian #1: gensim.parsing.preprocessing Problem I'm working on Turn a blog title into a possible url example: "Twisted and Testing Event Driven / Asynchronous Applications - Glyph" would like, perhaps: "twisted-testing-event-driven-asynchronous-applications" Sub-problem: remove stop words ← this is the hard part I started with an article called Removing Stop Words from Strings in Python It covered how to do this with NLTK, Gensim, and SpaCy I was most successful with remove_stopwords() from Gensim from gensim.parsing.preprocessing import remove_stopwords It's part of the gensim.parsing.preprocessing package I wonder what's all in there? a treasure trove gensim.parsing.preprocessing.preprocess_string is one this function applies filters to a string, with the defaults almost being just what I want: strip_tags() strip_punctuation() strip_multiple_whitespaces() strip_numeric() remove_stopwords() strip_short() stem_text() ← I think I want everything except this: this one turns "Twisted" into "Twist", not good. There's lots of other text processing goodies in there also. Oh, yeah, and Gensim is also cool. topic modeling for training semantic NLP models So, I think I found a really big hammer for my little problem. But I'm good with that Michael #2: DevDocs via Loic Thomson Gather and search a bunch of technology docs together at once For example: Python + Flask + JavaScript + Vue + CSS Has an offline mode for laptops / tablets Installs as a PWA (sadly not on Firefox) Ian #3: MSTICPy MSTICPy is a toolset for CyberSecurity investigations and hunting in Jupyter notebooks. What is CyberSec hunting/investigating? - responding to security alerts and threat intelligence reports, trawling through security logs from cloud services and hosts to determine if it's a real threat or not. Why Jupyter notebooks?
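A minimal sketch of the title-to-slug pipeline discussed above, hedged: it skips Gensim entirely and uses a small, made-up stop-word list rather than Gensim's full one:

```python
import re

# A hypothetical, tiny stop-word list -- Gensim's real list is much larger.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "or"}

def slugify(title):
    # Lowercase, keep only alphanumeric runs, drop stop words, hyphenate.
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(w for w in words if w not in STOPWORDS)

print(slugify("Twisted and Testing Event Driven / Asynchronous Applications - Glyph"))
# twisted-testing-event-driven-asynchronous-applications-glyph
```

Unlike preprocess_string's default filter chain, this does no stemming, which matches the "everything except stem_text()" preference above.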
SOC (Security Ops Center) tools can be excellent, but all have limitations You can get data from anywhere Use custom analysis and visualizations Control the workflow…. workflow is repeatable Open source pkg - created originally to support MS Sentinel Notebooks but now supports lots of providers. When I started this 3+ years ago I thought a lot of this would already be on PyPI - but no
We didn't really have one single clear topic this time: Dominik and Jochen talk about all sorts of things :). It started with a bit about the new exception groups for Python 3.11, then how best to set up Django projects, then CSS, software architecture and microservices, and then a bit about machine learning. Oh well. Shownotes Our email for questions, suggestions & comments: hallo@python-podcast.de News from the scene Ultraschall 5 / Reaper / Auphonic PEP 654 -- Exception Groups and except* / Twitter Thread / trio Notes on structured concurrency, or: Go statement considered harmful Closure (wikipedia) PEP 3134 -- Exception Chaining and Embedded Tracebacks asyncpg -- A fast PostgreSQL Database Client Library for Python/asyncio IPython 8 Release Advertising Exclusive deal + a gift
Watch the live stream: Watch on YouTube About the show Sponsored by Datadog: pythonbytes.fm/datadog Special guest: Dean Langsam Brian #1: A Better Pygame Mainloop Glyph Doing some game programming is a great way to work on coding for early devs (and experienced devs). pygame is a popular package for writing games in Python But… the normal example of a main loop, which listens for events and dispatches actions based on events, has some problems: it's got a while 1: that wastes power with too much busy waiting it looks bad, due to "screen tearing", which is writing to a screen while you're in the middle of drawing it This post discusses the problems, and walks through to an async main loop that creates a better gaming experience. Michael #2: awesome sqlalchemy A few notable ones SQLAlchemy-Continuum: Versioning and auditing extension for SQLAlchemy. SQLAlchemy-Utc: SQLAlchemy type to store aware datetime.datetime values. SQLAlchemy-Utils: Various utility functions, new data types and helpers for SQLAlchemy filedepot: DEPOT is a framework for easily storing and serving files in web applications. SQLAlchemy-ImageAttach: SQLAlchemy-ImageAttach is a SQLAlchemy extension for attaching images to entity objects. SQLAlchemy-Searchable: Full-text searchable models for SQLAlchemy. sqlalchemy_schemadisplay: This module generates images from SQLAlchemy models. Can we also get a shoutout to SQLModel? Dean #3: ThreadPoolExecutor in Python: The Complete Guide Long, but worth it (80-120 minutes). Could be consumed in parts. It's mostly a collection of other blog posts on superfastpython Many examples LifeCycle Usage patterns map() and as_completed() vs. sequential execution callbacks IO-bound vs CPU-bound Common Questions Comparison vs. ProcessPoolExecutor vs. threading.Thread vs. AsyncIO Brian #4: Chaining comparison operators Rodrigo Girão Serrão I use chained expressions all the time, mostly with ranges: min
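As a taste of the submission patterns that guide compares, here's a minimal sketch (the square task is made up, standing in for real work): Executor.map yields results in input order, while as_completed yields each future as soon as it finishes:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def square(n):
    # A stand-in for a slow, IO-bound task.
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order regardless of which worker finishes first.
    ordered = list(pool.map(square, [3, 1, 4, 2]))

    # as_completed yields each future as soon as its result is ready.
    futures = [pool.submit(square, n) for n in [3, 1, 4, 2]]
    as_done = [f.result() for f in as_completed(futures)]

print(ordered)          # [9, 1, 16, 4]
print(sorted(as_done))  # [1, 4, 9, 16]
```

The trade-off: map keeps results aligned with inputs; as_completed lets you react to the fastest tasks first.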
An overview of the pytest flags that help with debugging. From Chapter 13, Debugging Test Failures, of Python Testing with pytest, 2nd edition (https://pythontest.com/pytest-book/). pytest includes quite a few command-line flags that are useful for debugging. We talk about these flags in this episode. Flags for selecting which tests to run, in which order, and when to stop: * -lf / --last-failed: Runs just the tests that failed last. * -ff / --failed-first: Runs all the tests, starting with the last failed. * -x / --exitfirst: Stops the test session after the first failure. * --maxfail=num: Stops the tests after num failures. * -nf / --new-first: Runs all the tests, ordered by file modification time. * --sw / --stepwise: Stops the tests at the first failure. Starts the tests at the last failure next time. * --sw-skip / --stepwise-skip: Same as --sw, but skips the first failure. Flags to control pytest output: * -v / --verbose Displays all the test names, passing or failing. * --tb=[auto/long/short/line/native/no] Controls the traceback style. * -l / --showlocals Displays local variables alongside the stacktrace. Flags to start a command-line debugger: * --pdb Starts an interactive debugging session at the point of failure. * --trace Starts the pdb source-code debugger immediately when running each test. * --pdbcls Uses alternatives to pdb, such as IPython's debugger with --pdbcls=IPython.terminal.debugger:TerminalPdb. This list is also found in Chapter 13 of Python Testing with pytest, 2nd edition (https://pythontest.com/pytest-book/). The chapter is "Debugging Test Failures" and covers way more than just debug flags, while walking through debugging 2 test failures.
Text blocks are a new beta feature for Muse. Mark and Adam use the opportunity to discuss the origins and philosophy of text in computing, including text as a datum in environments like wikis, REPLs, and social media; a writing workflow for collapsing spatially-arranged ideas down to a linear text buffer; and company memo culture. Plus: Mark shares his vision for how the Pencil could become an X-Acto knife for editing text. @MuseAppHQ hello@museapp.com Show notes Review Metamuse on Apple Podcasts Podstatus Cortex, Accidental Tech Podcast “going viral slowly” text blocks beta Notion, Roam, Craft plain text ASCII art logograms The Humane Representation of Thought William Playfair Literature & Latte, Scrivener, Scapple terminal, REPL Man-Computer Symbiosis TTY = teletype Roam backlinks and knowledge graph view source Sublime Text Twitter was 140 characters for SMS episode about iPad emacs Org Mode, WorkFlowy Miro, FigJam, GoodNotes uncanny valley IPython, Jupyter Markdown Atlassian's wiki “turn my ideas into our ideas” responsive design folio keyboard, Magic Keyboard iOS voice input Scribble infinite canvas beta → flex boards kill your darlings
Can anything make writing scientific papers easier? What holds back the spread of new technologies? How did Google kill its best social network?
Talk Python To Me - Python conversations for passionate developers
When we talk about scaling software, threading and async get all the buzz. And while they are powerful, using asynchronous queues can often be much more effective. You might think this means creating a Celery server, maybe running RabbitMQ or Redis as well. But what if you wanted this async ability, and many more message exchange patterns like pub/sub, while doing zero of that server work? Then you should check out ZeroMQ. ZeroMQ is to queuing what Flask is to web apps: a powerful and simple framework for you to build just what you need. You're almost certain to learn some new networking patterns and capabilities in this episode with our guest Min Ragan-Kelley, as we discuss using ZeroMQ from Python as well as how ZeroMQ is central to the internals of Jupyter Notebooks. Links from the show Min on Twitter: @minrk Simula Lab: simula.no Talk Python Binder episode: talkpython.fm/256 The ZeroMQ Guide: zguide.zeromq.org Binder: mybinder.org IPython for parallel computing: ipyparallel.readthedocs.io Messaging in Jupyter: jupyter-client.readthedocs.io DevWheel Package: pypi.org cibuildwheel: pypi.org YouTube Live Stream: youtube.com PyCon Ticket Contest: talkpython.fm/pycon2021 Sponsors Linode Mito Talk Python Training
It's funny how powerful symbols are, right? The Eiffel Tower makes you think of Paris, the Statue of Liberty is New York, and the Trevi Fountain… is Rome, of course! Just with one symbol, you can invoke multiple concepts and ideas. You probably know that symbols are omnipresent in mathematics - but did you know that they are also very important in statistics, especially probabilistic programming? Rest assured, I didn't really know either… until I talked with Brandon Willard! Brandon is indeed a big proponent of relational programming and symbolic computation, and he often promotes their use in research and industry. Actually, a few weeks after our recording, Brandon started spearheading the revival of Theano through the JAX backend that we're currently working on for the future version of PyMC3! As you guessed, Brandon is a core developer of PyMC, and also a contributor to Airflow and IPython, just to name a few. His interests revolve around the means and methods of mathematical modeling and its automation. In a nutshell, he's a Bayesian statistician: he likes to use the language and logic of probability to quantify uncertainty and frame problems. After a Bachelor's in physics and mathematics, Brandon got a Master's degree in statistics from the University of Chicago. He's worked in different areas in his career - from finance, transportation and energy to start-ups, gov-tech and academia. Brandon particularly loves projects where popular statistical libraries are inadequate, where sophisticated models must be combined in non-trivial ways, or when you have to deal with high-dimensional and discrete processes. Our theme music is « Good Bayesian », by Baba Brinkman (feat. MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ (https://bababrinkman.com/)! Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Brian Huey, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, Adam Bartonicek, William Benton, Alan O'Donnell, Mark Ormsby, Demetri Pananos, James Ahloy, Jon Berezowski, Robin Taylor, Thomas Wiecki, Chad Scherrer, Vincent Arel-Bundock, Nathaniel Neitzke, Zwelithini Tunyiswa, Elea McDonnell Feit, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Joshua Duncan, Ian Moran, Paul Oreto, Colin Caprani, George Ho and Colin Carroll. Visit https://www.patreon.com/learnbayesstats (https://www.patreon.com/learnbayesstats) to unlock exclusive Bayesian swag ;) Links from the show: Brandon's website: https://brandonwillard.github.io/ (https://brandonwillard.github.io/) Brandon on GitHub: https://github.com/brandonwillard (https://github.com/brandonwillard) The Future of PyMC3, or "Theano is Dead, Long Live Theano": https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-is-dead-long-live-theano-d8005f8a0e9b (https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-is-dead-long-live-theano-d8005f8a0e9b) New Theano-PyMC library: https://github.com/pymc-devs/Theano-PyMC (https://github.com/pymc-devs/Theano-PyMC) Symbolic PyMC: https://pymc-devs.github.io/symbolic-pymc/ (https://pymc-devs.github.io/symbolic-pymc/) A Role for Symbolic Computation in the General Estimation of Statistical Models: https://brandonwillard.github.io/a-role-for-symbolic-computation-in-the-general-estimation-of-statistical-models.html (https://brandonwillard.github.io/a-role-for-symbolic-computation-in-the-general-estimation-of-statistical-models.html) Symbolic Math in PyMC3: https://brandonwillard.github.io/symbolic-math-in-pymc3.html (https://brandonwillard.github.io/symbolic-math-in-pymc3.html) Dynamic Linear Models in Theano: https://brandonwillard.github.io/dynamic-linear-models-in-theano.html (https://brandonwillard.github.io/dynamic-linear-models-in-theano.html) Symbolic 
PyMC Radon Example in PyMC4: https://brandonwillard.github.io/symbolic-pymc-radon-example-in-pymc4.html Support this podcast
Brian and Don welcome a much-anticipated guest for this episode: Professor Fernando Perez joins us for an episode of the 26.1 AI Podcast. Dr. Perez speaks about his journey, the community, and all the challenges along the way. Fernando shares, in his inimitable style, how he journeyed from straight-laced physicist in pursuit of an academic career to doggedly ignoring naysayers and creating one of the most important components of the modern PyData stack. One personal challenge during this journey was losing his friend Dr. John Hunter. John also influenced your host Brian Ray. Though John missed collaborating when Fernando first set out on IPython because of conflicts with prior commitments, the two later joined forces to advance the tools data scientists use every day now. Sit back and enjoy Fernando's dexterity on multiple topics as hosts Brian Ray and Don Sheu hold on for the ride for your benefit, listener.
Sponsored by us! Support our work through: Our courses at Talk Python Training Python Testing with pytest Brian #1: How to be helpful online Ned Batchelder When answering questions. Lots of great advice. We’ll focus on just a few here. Answer the question first. There may be other problems with their code that they are not asking about that you want to point out. But keep that for after you’ve helped them and built up trust. No third rails. “It should be OK for someone to ask for help with a program using sockets, and not have to defend using sockets, especially if the specific question has nothing to do with sockets.” Same for pickle, threads, globals, singletons, etc. Don’t let your strong opinions derail the conversation. The goal is to help people. Strong reactions can make the asker feel attacked. No dog-piling. Meet their level. “Try to determine what they know, and give them a reasonable next step, not the ultimate solution. A suboptimal solution they understand is better than a gold standard they can’t make use of.” Say yes. Avoid absolutes. Step back. Take some blame. Use more words. “IRC and other online mediums encourage quick short responses, which are exactly the kinds of responses that will be easy to misinterpret. Try to use more words, especially encouraging optimistic words.” Understand your motivations. Humility. Make connections. Finally: It’s hard. All of Ned’s advice is great. Good meditations for when you read a question and your mouth drops open and your eyes stare in shock. Michael #2: blackcellmagic IPython magic command to format python code in cell using black. Has a great animated gif ;) Just do: %load_ext blackcellmagic Then in any cell %%black and magic! Accepts “arguments” like %%black -l 79 Tobin Jones has been kind enough to develop a NPM package over blackcellmagic to format all cells at once which can be found here. But it’s archived so no idea whether it’s current. 
Brian #3: Test smarter, not harder Luke Plant There’s lots of great advice in here, but I want to highlight two parts that are often overlooked. “Write your test code with the functions/methods/classes you wish existed, not the ones you’ve been given.” “If the API you want to use doesn’t exist yet, you still use it, and then make it exist.” This is huge. People tend to think like this while coding, but forget to do it while testing. Also. Your tests are often the first client for your API, so if the API in question is under your control and you need an easier API for testing, consider adding it to the real API. If it’s easier for testing, it may be easier for other clients of the API as well. “Only write necessary tests — specifically, tests whose estimated value is greater than their estimated cost. This is a hard judgement call, of course, but it does mean that at least some of the time you should be saying “it’s not worth it”.” Michael #4: US: The Greatest Package in the World by Jeremy Carbaugh A package for easily working with US and state metadata: all US states and territories postal abbreviations Associated Press style abbreviations FIPS codes capitals years of statehood time zones phonetic state name lookup is contiguous or continental URLs to shapefiles for state, census, congressional districts, counties, and census tracts The state lookup method allows matching by FIPS code, abbreviation, and name Even a CLI: $ states md Brian #5: Think Like A Coder Part of TED-Ed “… a 10-episode series that will challenge viewers with programming puzzles as the main characters— a girl and her robot companion— attempt to save a world that has been plunged into turmoil.” Although, I only count 9 episodes, I was 4 episodes in and hooked. 
Main cool thing, I think, is introducing terms and topics so they will be familiar when someone really does start coding: loops, for loops, until loops, while loops conditionals variables path logic permutations searches tables recursion Big O Also highly recommended for getting excited about coding: Girls Who Code: Learn to Code and Change the World TED-Ed has tons of other cool series on lots of subjects. CodeCombat Michael #6: Costs of running a Python web app for 55k monthly users How much does running a web app in production actually cost? KeepTheScore is an online software for scorekeeping. Create your own scoreboard for up to 150 players and start tracking points. It's mostly free and requires no user account. Keepthescore.co is a Python flask application running on DigitalOcean and Firebase. It currently has around 55k unique visitors per month, per day it’s around 3.4k. Servers and database on DigitalOcean: Costs per month: $95, the servers are oversized for the load they’re currently seeing. Amazon Web Services: Costs per month: $60, use a reporting tool called Metabase to generate insights and reports from the database Google Cloud, costs per month: $1.32, for Firebase DNS hosting, costs per month: $5 Disqus, costs per month: $10 Is it worth it? Is there revenue? In total that’s around $171 USD per month. If you’re running a company with employees that would be peanuts, but in this case the cost is being borne by a single indie-developer out of his own pocket. The bigger issue is that on the revenue side there’s a big fat zero. This is the reason why we are currently working on monetization. Some Talk Python stats: Maybe 40k monthly visitors, but oh, the podcast clients 3M requests / month just RSS, resulting in 320 GB / mo of XML traffic. We run on two prod servers: $10 & $5 as well as a dedicated MongoDB server @ $10. Total $25/mo. On the other hand, Talk Python Training's AWS bill last month was over $1,000 USD. 
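The ~$171 figure above is just the quoted line items summed; a quick tally, with the numbers as given in the episode:

```python
# Monthly costs for keepthescore.co as quoted above.
costs = {
    "DigitalOcean servers + database": 95.00,
    "Amazon Web Services": 60.00,
    "Google Cloud (Firebase)": 1.32,
    "DNS hosting": 5.00,
    "Disqus": 10.00,
}

total = sum(costs.values())
print(f"${total:.2f} per month")  # $171.32 per month
```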
You can hear a bunch about this on Talk Python 215. Joke: From twitter, originally from Netlify: "Oh no! We lost the hackers! Where did they go?" "I don't know! They just ransomware!” Number of days since I have encountered an array index error: -1.
Sponsored by Datadog: pythonbytes.fm/datadog Michael #1: PSF / JetBrains Survey via Jose Nario Let's talk results: 84% of people who use Python do so as their primary language [unchanged] Other languages: JavaScript (down), Bash (down), HTML (down), C++ (down) Web vs Data Science languages: More C++ / Java / R / C# on Data Science side More SQL / JavaScript / HTML Why do you mainly use Python? 58% work and personal What do you use Python for? The average number of answers was 3.9 Data analysis [59% / 59% — now vs. last year] Web Development [51% / 55%] ML [40% / 39%] DevOps [39% / 43%] What do you use Python for the most? Web [28% / 29%] Data analysis [18% / 17%] Machine Learning [13% / 11%] Python 3 vs Python 2: 90% Python 3, 10% Python 2 Widest disparity of versions (pro 3) is in data science. Web Frameworks: Flask [48%] Django [44%] Data Science NumPy 63% Pandas 55% Matplotlib 46% Testing pytest 49% unittest 30% none 34% Cloud AWS 55% Google 33% DigitalOcean 22% Heroku 20% Azure 19% How do you run code in the cloud (in the production environment) Containers 47% VMs 46% PAAS 25% Editors PyCharm 33% VS Code 24% Vim 9% tool use version control 90% write tests 80% code linting 80% use type hints 65% code coverage 52% Brian #2: Hypermodern Python Claudio Jolowicz, @cjolowicz An opinionated and fun tour of Python development practices. Chapter 1: Setup Set up a project with pyenv and Poetry, src layout, virtual environments, dependency management, click for CLI, using requests for a REST API. Chapter 2: Testing Unit testing with pytest, using coverage.py, nox for automation, pytest-mock. Plus refactoring, handling exceptions, fakes, end-to-end testing opinions. Chapter 3: Linting Flake8, Black, import-order, bugbear, bandit, Safety. Plus more on managing dependencies, and using pre-commit for git hooks.
Chapter 4: Typing mypy and pytype, adding annotations, data validation with Desert & Marshmallow, Typeguard, flake8-annotations, adding checks to test suite Chapter 5: Documentation docstrings, linting docstrings, docstrings in nox sessions and test suites, darglint, xdoctest, Sphinx, reStructuredText, and autodoc Chapter 6: CI/CD CI with GitHub Actions, reporting coverage with Codecov, uploading to PyPI, Release Drafter for release documentation, single-sourcing the package version, using TestPyPI, docs on RTD The series is worth it even for just the artwork. Lots of fun tools to try, lots to learn. Michael #3: Open AI Jukebox via Dan Bader Listen to the songs under "Curated samples." A neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. Code is available on GitHub. Dataset: To train this model, we crawled the web to curate a new dataset of 1.2 million songs (600,000 of which are in English), paired with the corresponding lyrics and metadata from LyricWiki. The top-level transformer is trained on the task of predicting compressed audio tokens. We can provide additional information, such as the artist and genre for each song. Two advantages: first, it reduces the entropy of the audio prediction, so the model is able to achieve better quality in any particular style; second, at generation time, we are able to steer the model to generate in a style of our choosing. Brian #4: The Curious Case of Python's Context Manager Redowan Delowar, @rednafi A quick tour of context managers that goes deeper than most introductions. Writing custom context managers with __init__, __enter__, __exit__. Using the decorator contextlib.contextmanager Then it gets even more fun Context managers as decorators Nesting contexts within one with statement.
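A quick sketch of those ideas (the Tag and logged examples are made up, not from the article): a class-based manager with __enter__/__exit__, and a contextlib.contextmanager generator that also works directly as a decorator:

```python
from contextlib import contextmanager

class Tag:
    # Class-based context manager: __enter__ on entry, __exit__ on exit.
    def __init__(self, name):
        self.name = name
        self.out = []

    def __enter__(self):
        self.out.append(f"<{self.name}>")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.out.append(f"</{self.name}>")
        return False  # don't swallow exceptions

events = []

@contextmanager
def logged(name):
    # Generator-based: code before yield is setup, code after is teardown.
    events.append(f"enter {name}")
    try:
        yield
    finally:
        events.append(f"exit {name}")

@logged("job")  # a context manager used directly as a decorator
def work():
    events.append("working")

with Tag("b") as t:
    pass

work()
print(t.out, events)  # ['<b>', '</b>'] ['enter job', 'working', 'exit job']
```

The decorator form works because contextmanager-produced objects inherit from contextlib.ContextDecorator.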
Combining context managers into new ones Examples Context managers for SQLAlchemy sessions Context managers for exception handling Persistent parameters across http requests Michael #5: nbstripout via Clément Robert In the latest episode, you praised NBDev for having a git hook that strips out notebook outputs. strip output from Jupyter and IPython notebooks Opens a notebook, strips its output, and writes the outputless version to the original file. Useful mainly as a git filter or pre-commit hook for users who don't want to track output in VCS. This does mostly the same thing as the Clear All Output command in the notebook UI. Has a nice YouTube tutorial right in the PyPI listing Just do nbstripout --install in a git repo! Brian #6: Write-ups for The 2020 Python Language Summit Guido talked about this in episode 179 But these write-ups are excellent and really interesting. Should All Strings Become f-strings?, Eric V. Smith Replacing CPython's Parser with a PEG-based parser, Pablo Galindo, Lysandros Nikolaou, Guido van Rossum A Formal Specification for the (C)Python Virtual Machine, Mark Shannon HPy: a Future-Proof Way of Extending Python?, Antonio Cuni CPython Documentation: The Next 5 Years, Carol Willing, Ned Batchelder Lightning talks (pre-selected) What do you need from pip, PyPI, and packaging?, Sumana Harihareswara A Retrospective on My "Multi-Core Python" Project, Eric Snow The Path Forward for Typing, Guido van Rossum Property-Based Testing for Python Builtins and the Standard Library, Zac Hatfield-Dodds Core Workflow Updates, Mariatta Wijaya CPython on Mobile Platforms, Russell Keith-Magee Wanted to bring this up because Python is a living language and it's important to pay attention and get involved, or at least pay attention to where Python might be going. Also, another way to get involved is to become a member of the PSF board of directors What's a PSF board of directors member do?
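One standard-library way to combine several context managers into a new one, as mentioned above, is contextlib.ExitStack. This sketch (the resource helper is hypothetical) enters a list of resources and guarantees they all close in reverse order:

```python
from contextlib import ExitStack, contextmanager

trace = []

@contextmanager
def resource(name):
    # A made-up resource that records when it opens and closes.
    trace.append(f"open {name}")
    try:
        yield name
    finally:
        trace.append(f"close {name}")

@contextmanager
def all_resources(names):
    # A single combined context manager: enter each resource, and let
    # ExitStack unwind them all (in reverse order) on exit.
    with ExitStack() as stack:
        yield [stack.enter_context(resource(n)) for n in names]

with all_resources(["db", "cache"]) as handles:
    trace.append("using " + ",".join(handles))

print(trace)
# ['open db', 'open cache', 'using db,cache', 'close cache', 'close db']
```

ExitStack is handy precisely when the number of managers isn't known until runtime, which nesting with-statements can't express.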
There's a video. There are some open seats; nominations are open until May 31.

Extras:

Michael:
- Updated search engine for better result ranking.
- Windel Bouwman wrote a nice little script for speedscope: https://github.com/windelbouwman/pyspeedscope (follow-up from the Austin profiler).

Jokes:
"Due to social distancing, I wonder how many projects are migrating to UDP and away from TLS to avoid all the handshakes?" - From Sviatoslav Sydorenko
"A chef and a vagrant walk into a bar. Within a few seconds, it was identical to the last bar they went to." - From Benjamin Jones, crediting @lufcraft
Understanding both of these jokes is left as an exercise for the reader.
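To make Brian's item #4 concrete, here is a minimal, hypothetical sketch (my own illustration, not code from Redowan's article) of the patterns it covers: a class-based manager with __enter__/__exit__, @contextmanager, a context manager used as a decorator, nesting in one with statement, and contextlib.ExitStack for combining an arbitrary number of contexts:

```python
from contextlib import contextmanager, ExitStack

events = []  # records enter/exit order so we can see what happened


@contextmanager
def tag(name):
    # Code before the yield runs on entry; code after it runs on exit.
    events.append(f"<{name}>")
    try:
        yield name
    finally:
        events.append(f"</{name}>")


# 1. A class-based context manager: __enter__/__exit__ spelled out.
class Managed:
    def __enter__(self):
        events.append("acquire")
        return self

    def __exit__(self, exc_type, exc, tb):
        events.append("release")
        return False  # returning False re-raises any exception


# 2. Nesting several contexts in a single with statement.
with tag("html"), tag("body"):
    events.append("hello")

# 3. A @contextmanager object doubles as a decorator:
#    the context is freshly entered around every call.
@tag("log")
def greet():
    events.append("hi")

greet()

# 4. Combining an arbitrary number of contexts with ExitStack;
#    they are exited in reverse order, like nested with blocks.
with ExitStack() as stack:
    for name in ("a", "b"):
        stack.enter_context(tag(name))
    events.append("inside both")

# 5. The class-based manager in action.
with Managed():
    events.append("working")

print(events)
```

The decorator form works because objects returned by @contextmanager inherit from ContextDecorator, which re-creates and enters the context on each call of the wrapped function.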
We are joined by Ellen Körbes for this episode, where we focus on Kubernetes and its tooling. Ellen works in developer relations at Tilt. Before Tilt, they were doing closely related work at Garden, a similar company! Both companies work directly with Kubernetes, and Ellen is here to talk to us about why Kubernetes does not have to be the difficult thing it is made out to be. According to Ellen, this mostly comes down to tooling. Ellen believes that with the right set of tools at your disposal, it is not actually necessary to completely understand all of Kubernetes or even be familiar with a lot of its functions. You do not have to start from the bottom every time you start a new project, and developers who are new to Kubernetes need not become experts in it in order to take advantage of its benefits.

The major goal for Ellen and Tilt is to get developers' code up, running, and live in as short a time as possible. When the system is standing in the way, this process can take much longer, whereas with Tilt, Ellen believes the process should take around two seconds! Ellen comments on who should be using Kubernetes and who it would most benefit. We also discuss where Kubernetes should be run, either locally or externally, for best results, and Tilt's part in the process of unit testing and feedback. We finish off peering into the future of Kubernetes, so make sure to join us for this highly informative and empowering chat!

Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback: info@thepodlets.io
https://www.notion.so/thepodlets/The-Podlets-Guest-Central-9cec18726e924863b559ef278cf695c9
Guest: Ellen Körbes https://twitter.com/ellenkorbes
Hosts: Carlisia Campos, Bryan Liles, Olive Power

Key Points From This Episode:
- Ellen's work at Tilt and the jumping-off point for today's discussion.
- The projects and companies that Ellen and Tilt work with, that they are allowed to mention!
- Who Ellen is referring to when they say 'developers' in this context.
- Tilt's goal of getting all developers' code up and running in the two-seconds range.
- Who should be using Kubernetes? Is it necessary in development if it is used in production?
- Operating and deploying Kubernetes — who is it that does this?
- Where developers seem to be running Kubernetes; considerations around space and speed.
- Possible security concerns using Tilt; avoiding damage through Kubernetes options.
- Allowing greater possibilities for developers through useful shortcuts.
- VS Code extensions and IDE integrations that are possible with Kubernetes at present.
- Where to start with Kubernetes and getting a handle on tooling like Tilt.
- Using unit testing for feedback and Tilt's part in this process.
- The future of Kubernetes tooling and looking across possible developments in the space.

Quotes:
"You're not meant to edit Kubernetes YAML by hand." — @ellenkorbes [0:07:43]
"I think from the point of view of a developer, you should try and stay away from Kubernetes for as long as you can." — @ellenkorbes [0:11:50]
"I've heard from many companies that the main reason they decided to use Kubernetes in development is that they wanted to mimic production as closely as possible." — @ellenkorbes [0:13:21]

Links Mentioned in Today's Episode:
Ellen Körbes — http://ellenkorbes.com/
Ellen Körbes on Twitter — https://twitter.com/ellenkorbes?lang=en
Tilt — https://tilt.dev/
Garden — https://garden.io/
Cluster API — https://cluster-api.sigs.k8s.io/
Lyft — https://www.lyft.com/
KubeCon — https://events19.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2019/
Unu Motors — https://unumotors.com/en
Mindspace — https://www.mindspace.me/
Docker — https://www.docker.com/
Netflix — https://www.netflix.com/
GCP — https://cloud.google.com/
Azure — https://azure.microsoft.com/en-us/
AWS — https://aws.amazon.com/
ksonnet — https://ksonnet.io/
Ruby on Rails — https://rubyonrails.org/
Lambda —
https://aws.amazon.com/lambda/
DynamoDB — https://aws.amazon.com/dynamodb/
Telepresence — https://www.telepresence.io/
Skaffold (Google) — https://cloud.google.com/blog/products/application-development/kubernetes-development-simplified-skaffold-is-now-ga
Python — https://www.python.org/
REPL — https://repl.it/
Spring — https://spring.io/community
Go — https://golang.org/
Helm — https://helm.sh/
Pulumi — https://www.pulumi.com/
Starlark — https://github.com/bazelbuild/starlark

Transcript:

EPISODE 22

[ANNOUNCER] Welcome to The Podlets Podcast, a weekly show that explores cloud native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn't reinvent the wheel. If you're an engineer, operator or technically minded decision-maker, this podcast is for you.

[EPISODE]

[0:00:41.8] CC: Hi, everybody. This is The Podlets. We are back this week with a special guest, Ellen Körbes. Ellen will introduce themselves in a little bit. Also on the show, it's myself, Carlisia Campos, Michael Gasch and Duffie Cooley.

[0:00:57.9] DC: Hey, everybody.

[0:00:59.2] CC: Today's topic is Kubernetes Sucks for Developers, right? No. Ellen is going to introduce themselves now and tell us all about what that even means.

[0:01:11.7] EK: Hi. I'm L. I do developer relations at Tilt. Tilt is a company whose main focus is development experience when it comes to Kubernetes and multi-service development. Before Tilt, I used to work at Garden. They basically do the same thing, it's just a different approach. That is basically the topic that we're going to discuss, the fact that Kubernetes does not have to suck for developers. You just need to – you need some hacks and fixes and tools and then things get better.

[0:01:46.4] DC: Yeah, I'm looking forward to this one.
I've actually seen Tilt being used in some pretty high-profile open source projects. I've seen it being used in Cluster API and some of the work we've seen there and some of the other ones. What are some of the larger projects that you are aware of that are using it today?

[0:02:02.6] EK: Oh, boy. That's complicated, because every company has a different policy as to whether I can name them publicly or not. Let's go around that question a little bit. You might notice that Lyft has a talk at KubeCon, where they're going to talk about Tilt. I can't tell you right now that they use Tilt, but there's that. Hopefully, I found a legal loophole here. I think they're the biggest name that you can find right now.

Cluster API is of course huge and Cluster API is fun, because the way they're doing things is very different. We're used to seeing mostly companies that do apps in some way or another, like websites, phone apps, etc. Then Cluster API is completely insane. It's something else totally.

There's tons of other companies. I'm not sure which ones that are large I can name specifically. There are smaller companies. Unu Motors, they do electric motorcycles. It's a company here in Berlin. They have 25 developers. They're using Tilt. We have very tiny companies, like Mindspace, a studio in Tucson, Arizona. They also use Tilt and it's a three-person team. We have the whole spectrum, from very, very tiny companies that are using Docker for Mac and pretty happy with it, all the way up to huge companies with their own fleet of development clusters and all of that, and they're using Tilt as well.

[0:03:38.2] DC: That field is awesome.

[0:03:39.3] MG: Quick question, Ellen. The title says 'developers'. Developers is a pretty broad term. I have people saying that, okay, Kubernetes is too raw. It's more like a Linux kernel, and we want this PaaS experience for our business developers, our application developers who are developing on there.
How would you describe a developer interfacing with Kubernetes using the tools that you just mentioned? Is it the traditional enterprise developer, or more Kubernetes developers developing on Kubernetes?

[0:04:10.4] EK: No. I specifically mean not Kubernetes developers. You have people who work on Kubernetes. For example, the Cluster API folks, they're doing stuff that is Kubernetes specific. That is not my focus. The focus is you're a back-end developer, you're a front-end developer, you're the person configuring, I don't know, the databases or whatever. Basically, you work at a company, you have your own business logic, you have your own product, your own app, your own internal stuff, all of that, but you're not a Kubernetes developer. It just so happens that if the stuff you are working on is going to be pointing at Kubernetes, it's going to target Kubernetes, then one, you're the target developer for me, for my work. Two, usually you're going to have a hard time doing your job. We can talk a bit about why.

One issue is development clusters. If you're using Kubernetes in prod, rule of thumb, you should be using Kubernetes in dev, because you don't want completely separate environments where things work in your environment as a developer and then you push them and they break. You don't want that. You need some development cluster. The type of cluster that that's going to be is going to vary according to the level of complexity that you want and that you can deal with.

Like I said, some people are pretty happy with Docker for Mac. I hear all the time these complaints that, "Oh, you're running Kubernetes on your machine. It's going to catch fire." Okay, there's some truth to that, but also it depends on what you're doing. No one tries to run Netflix, let's say the whole Netflix, on their laptop, because we all know that's not reasonable. People try to do similar things on their minikube, or Docker for Mac.
Then it doesn't work and they say, "Oh, Kubernetes on the laptop doesn't work." No. Yeah, it does. Just not for you. That's a complaint I particularly dislike, because it comes from a – it's a blanket statement that has no – let's say, no facts behind it. Yeah, if you're a small company, Docker for Mac is going to work fine for you. Let's say you have a beefy laptop with 30 gigs of RAM, you can put a lot of software in 30 gigs. You can put a lot of microservices in 30 gigs. That's going to work up to a point and then it's going to break.

When it breaks, you're going to need to move to a cloud, you're going to need to do remote development, and then you're going to go to GCP, or Azure, or Amazon. You're going to set up a cluster there. Some people use the managed Kubernetes options. Some people just spin up a bunch of machines and wire up Kubernetes by themselves. That's going to depend on basically how much you have in terms of resources and in terms of needs. Usually, keeping up a remote cluster that works is going to demand more infrastructure work. You're going to need people who know how to do that, to keep an eye on that. There's also the billing aspect, which is: you can run Docker for Mac all day and you're not going to pay extra. If you leave a bunch of stuff running on Google, you're going to have a bill at the end of the month that you need to pay attention to. That is one thing for people to think about.

Another aspect that I see very often that people don't know what to do with is config files. You scroll Twitter, you scroll Kubernetes Twitter for five minutes, and there's a joke about YAML. We all hate editing YAML. Again, the same way people make jokes about Kubernetes setting your laptop on fire, I would argue that you're not meant to edit Kubernetes YAML by hand. The tooling for that is arguably not as mature as the tooling when it comes to Kubernetes clusters to run on your laptop. You have stuff like YAML templates, you have ksonnet.
I think there's one called kustomize, but I haven't used it myself. What I see in every company, from the two-person team to the 600-person team, is no one writes Kubernetes YAML by hand. Everyone uses a templating solution of some sort. That is the first thing that I always tell people when they start making jokes about YAML: if you're editing YAML by hand, you're doing it wrong. You shouldn't do that in the first place. It's something that you set up once at some point and you look at it whenever you need to. On your day-to-day when you're writing your code, you should not touch those files, not by hand.

[0:08:40.6] CC: We're five minutes in and you threw so much at us. We need to start breaking some of this stuff down.

[0:08:45.9] EK: Okay. Let me throw you one last thing then, because that is what I do personally. One more thing that we can discuss is the development feedback loop. You're writing your code, you're working on your application, you make a change to your code. How much work is it for you to see that new line of code that you just wrote live and running? For most people, it's a very long journey. I asked that on Twitter; a lot of people said it was over half an hour. A very tiny amount of people said it was between five minutes and half an hour, and only a very tiny fraction of people said it was two seconds or less.

The goal of my job, of my work, the goal of Tilt, the tool, which is made by the company I work for, also called Tilt, is to get everyone in that two-seconds range. I've done that on stage in talks, where we take an application and we go from, "Okay, every time you make a change, you need to build a new Docker image. You need to push it to a registry. You need to update your cluster, blah, blah, blah, and that's going to take minutes, whole minutes." We take that from all that long and we dial it down to a couple seconds.
You make a change, or save your file, snap your fingers and poof, it's up and running, the new version of your app. It's basically real-time, perceptually real-time, just like back when everyone was doing Ruby on Rails and you would just save your file and see if it worked, basically instantly. That is the part of this discussion that personally I focus more on.

[0:10:20.7] CC: I would love to jump to the how in a little bit. I want to circle back to the beginning. I love the question that Michael asked at the beginning, what is considered a developer, because that really makes a difference, first to understand who we are talking about. I think this conversation can go in circles, and not that I'm saying we are going in circles, but this conversation out in the wild can go in circles, until we have an understanding of the difference between can you as a developer use Kubernetes in a somewhat not painful way, but also should you?

I'm very interested to get your take, and Michael and Duffie's take as well, as far as should we be doing this, and should all developers be using Kubernetes through the development process? Then we also have to consider people who are not using Kubernetes, because a lot of people out there are not using Kubernetes. Developers especially hear that Kubernetes is painful, and obviously that is not going to qualify Kubernetes as a tool that they're going to look into. It's just not motivating. If there is anything that would make people motivated to look into Kubernetes, that would be beneficial for them; not just using Kubernetes for Kubernetes' sake, but would it be useful? Basically, why? Why would it be useful?

[0:11:50.7] EK: I think from the point of view of a developer, you should try and stay away from Kubernetes for as long as you can. Kubernetes comes in when you start having issues of scale. It's a production matter, it's not a development matter.
I don't know, like a DevOps issue, an operations issue. Ideally, you put off moving your application to Kubernetes as long as possible. This is an opinion; we can argue about this forever. Just because it introduces a lot of complexity, and if you don't need that complexity, you should probably stay away from it.

To get to the other half of the question, which is: if you're using Kubernetes in production, should you use Kubernetes in development? Here, I'm going to say yes, 100% of the time. Blanket statement of course, we can argue about minutiae, but I think so. Because if you don't, you end up having separate environments. Let's say you're using Docker Compose, because you don't like Kubernetes. You're using Kubernetes in production, so in development you are going to need containers of some sort. Let's say you're using Docker Compose. Now you're maintaining two different environments. You update something here, you have to update it there. One day, it's going to be Friday, you're going to be tired, you're going to update something here, you're going to forget to update something there, or you're going to update something there and it's going to be slightly different. Or maybe you're doing something that has no equivalent between what you're using locally and what you're using in production. Then basically, you're in trouble.

I've heard from many companies that the main reason they decided to use Kubernetes in development is that they wanted to mimic production as closely as possible. One argument we can have here is that – oh, but if you're using Kubernetes in development, that's going to add a lot of overhead and you're not going to be able to do your job right. I agree that that was true for a while, but right now we have enough tooling that you can basically make Kubernetes disappear and you just focus on being a developer, writing your code, doing all of that stuff. Kubernetes is sitting there in the background.
You don't have to think about it, and you can just go on about your business with the advantage that now your development environment and your production environment are going to very closely mimic each other, so you're not going to have issues with those potential disparities.

[0:14:10.0] CC: All right. Another thing too is that I think we're making an assumption that the developers we are talking about are the developers that are also responsible for deployment. Sometimes that's the case, sometimes that's not the case, and I'm going to shut up now. It would be interesting to talk about that too between all of us. Is that what we see? Is it the case that now developers are responsible? It's like, developer DevOps is just so ubiquitous that we don't even consider differentiating between developers and ops people? All right?

[0:14:45.2] DC: I think I have a different spin on that. I think that it's not necessarily that developers are the ones operating the infrastructure. The problem is that if your infrastructure is operated by a platform that may require some integration at the application layer to really hit its stride, then the question becomes, how do you as a developer become more familiar? What is the user experience, or I should say, the developer experience around that integration? What can you do to improve that, so that the developer can understand better, or play with, how service discovery works, or understand better, or play with, how the different services in their application will be able to interact without having to redefine that in different environments? Which is, I think, what Ellen's point was.

[0:15:33.0] EK: Yeah. At the most basic level, you have issues such as: you made a change to a service here, let's say on your local Docker Compose. Now you need to update your Kubernetes manifest on your cluster for things to make sense. Let's say, I don't know, you change the name of a service, something as simple as that.
Even those kinds of things that sound silly to even describe, when you're doing that every day, one day you're going to forget it, things are going to explode, you're not going to know why, and you're going to lose hours trying to figure out where things went wrong.

[0:16:08.7] MG: Also the same with [inaudible] maybe. Even if you use Kubernetes locally, you might run a later version of Kubernetes. Maybe you use kind for local development, but then your remote cluster is three or four versions behind. It shouldn't be, because of the version support policy, but it might happen, right? Then APIs might be deprecated, or you're using a different API. I totally agree with you, Ellen, that your development environment should reflect production as closely as possible. Even there, you have to make sure that your APIs match, API types match and all that stuff, because they could also break.

[0:16:42.4] EK: You were definitely right that bugs are not going away anytime soon.

[0:16:47.1] MG: Yeah. I think this discussion also reminds me of the discussion that the folks in the cloud world have with AWS Lambda, for example, because it's similar. Even though there are tools to simulate or mimic these platforms, like serverless platforms, locally, the general recommendation there is to embrace the cloud and develop natively in the cloud, because that's something you cannot resemble locally. You cannot run DynamoDB locally. You could mimic it. You could mimic Lambda runtimes locally. Essentially, it's going to be different. That's also a common complaint in the world of AWS and cloud development: it's really not that easy to develop locally, where you're supposed to develop on the platform that the code is being shipped and run on, because you cannot run the cloud locally. It sounds crazy, but it is. I think the same is true with Kubernetes, even though we have the tools. I don't think that every developer runs Kubernetes locally.
Most of them maybe don't even have Docker locally, so they use some Spring tools, and then they have some pipeline and eventually it gets shipped as a container pod in Kubernetes. That's what I wanted to throw in here as more of a question about experience, especially for you, Ellen, with these customers that you work with: what are the different profiles that you see from a maturity perspective? These customers, large enterprises, might be different from the smaller ones that you mentioned. How do you see them having different requirements? And, as Carlisia said, do they do ops, or DevOps, or is it strictly separated there, especially in large enterprises?

[0:18:21.9] EK: What I see the most – let's get the last part first.

[0:18:24.6] MG: Yeah, it was a lot of questions. Sorry for that.

[0:18:27.7] EK: Yeah. When it comes to who operates Kubernetes, who deploys Kubernetes, definitely most developers push their code to Kubernetes themselves. Of course, this involves CI and testing and PRs and all of that, so it's not like you can just go crazy and break everything. When it comes to operating the production cluster, then that's separate. Usually, you have someone writing code and someone else operating clusters and infrastructure. Sometimes it's the same person, but they're clearly separate roles, even if it's the same person doing it. Usually, you go from your IDE to PR, and that goes straight into production once the whole process is done.

Now, we were talking about workflows and Lambda and all of that. I don't see a good solution for Lambda, a good development experience for Lambda, just yet. It feels a bit like it needs some refinement still. When it comes to Kubernetes, you asked, do most developers run Kubernetes locally? Do they not? I don't know about the absolute numbers. Is it most doing this, or most doing that? I'm not sure. I only know the companies I'm in touch with.
Definitely not all developers run Kubernetes on their laptops, because it's a problem of scale. Right now, we are basically stuck with 30 gigs of RAM on our laptops. If your app is bigger than that, tough luck, you're not going to run it on the laptop. What most developers do is they still maintain a local development environment, where they can do development without going to CI. I think that is the main question: they maintain agility in their development process.

What we usually see when you don't have Kubernetes on your laptop and you're using remote Kubernetes, so a remote development cluster in some cloud provider – and this is not just the companies I talk to, this is basically everyone else – what most people do is they make their development environment be the same, or work the same way, as their production environment. You make a change to your code, you have to push a PR, it has to get tested by CI, it has to get approved. Then it ends up in the Kubernetes cluster. Your feedback loop as a developer is insanely slow, because there's so much red tape between you changing a line of code and you getting a new process running in your cluster.

Now, when you use tools – I call the category MDX; I basically coined that category name myself. MDX is multi-service development experience tooling. When you use MDX tools, and that's not just Tilt; it's Tilt, it's Garden where I used to work, people use Telepresence like that, there's Skaffold from Google and so on. There's a bunch of tools. When you use a tool like that, you can have your feedback loop down to a second, like I said before. I think that is the major improvement developers can make if they're using Kubernetes remotely, and even if they're using Kubernetes locally. I would guess most people do not run Kubernetes locally. They use it remotely.
We have clients who have – we have users who don't even have Docker on their local machines, because if you have the right tooling, you can change the files on your machine, you have tooling running that detects those file changes, and it syncs those file changes to your cluster. The cluster then rebuilds images, or restarts containers, or syncs live code that's already running. Then you can see those changes reflected in your development cluster right away, even though you don't even have Docker on your machine. There are all of those possibilities.

[0:22:28.4] MG: Do you see security issues with that approach, without knowing the architecture of Tilt? Even though it's just the development clusters, there might be stuff that could break, or that you could break by bypassing the red tape, as you said?

[0:22:42.3] EK: Usually, we assign one user per namespace. Usually, every developer has a namespace. Kubernetes itself has enough options that if that's a concern to you, you can make it secure. Most people don't worry about it that much, because these are development clusters. They're not accessible to the public. Usually, you can only access them through a VPN or something of that sort. We haven't heard about security issues so far. I'm sure they're going to pop up at some point. I'm not sure how severe they're going to be, or how hard they're going to be to fix. I am assuming, because none of this stuff is meant to be accessible to the wider Internet, that it's not going to be a hard problem to tackle.

[0:23:26.7] DC: I would like to back up for a second, because I feel we're pretty far down the road on what the value of this particular pattern is without really explaining what it is. I want to back this up for just a minute and talk about some of the things that tooling like this is trying to solve, in a use-case model, right? Back in the day when I was learning Python, I remember really struggling with the idea of being able to debug Python live.
I came across IPython, which is a REPL, and that was hugely eye-opening, because it gave me the ability to interact with my code live. It also opened me up to the idea that this was an improvement over things like having to commit a new log line against a particular function, push that new function up to a place where it would actually get some use, and then go look at that log line and see what's coming out of it, or wonder whether I actually have enough logs to even put together what went wrong. That whole set of use cases is, I think, somewhat addressed by tooling like this.

I do think we should talk about how we got here, how tooling like this addresses some of those things, and which use cases specifically it is looking to address. I guess where I'm going with this is, to your point, tooling like Tilt, for example, is the idea that you can, as far as I understand it, inject into a running application a new instance that you would have local development control over. If you make a change to that code, then the instance running inside of your target environment would be represented by that new code change very quickly, right? Basically, solving the problem of making sure that you have some very quick feedback loop. I mean, functionally, that's the killer feature here.

I think it's really interesting to see tooling like that start to develop, right? Another example of that tooling would be the REPL thing, wherein instead of writing your code and compiling your code and seeing the output, you could do a thing where you're actually inside, running as a thread inside of the code, and you can dump a data structure, you can modify that data structure, and you can see if your function actually does the right thing, without having to go back and write that code while imagining all those data structures in your head. Basic tooling like this, I think, is pretty killer.

[0:25:56.8] EK: Yeah.
I think one area where that is still partially untapped right now, where this tooling could go – and I'm pushing for it, but it's a process, it's not something we can do overnight – is to have very high-level patterns, let's say, codified. For example, everyone's copying Dockerfiles and Kubernetes manifests and Terraform files, which I forgot what they're called. Everyone's copying that stuff from the Internet, from other websites. That's cool. Oh, you need a container that does such-and-such and sets up this environment and provides these tools? Just download this image and everything is all set up for you.

One area where I see things going is for us to have that same portability, but for development environments. For example, I did this whole talk about how to take your Go application from, I don't know, a 30-second feedback loop where you're rebuilding an image every time you make a code change and all of that, down to 1 second. There's a lot of hacks in there that span all kinds of stuff, like should you use Go vendoring, or should you keep your dependencies cached inside a Docker layer? Those kinds of things. I went down a bunch of those things and eventually came up with a workflow that was basically the best I could find in terms of development experience. What is the snappiest workflow?

Or, for example, you could have: what is a workflow that makes it really easy to debug my Go app? You would use applications like Squash, a debugger that you can connect to a process running inside a container. Those kinds of things. If we can prepackage those and offer those to users, and not just for Go and not just for debugging, but for all kinds of development workflows, I think that would be really great. We can offer those types of experiences to people who don't necessarily have the inclination to develop those workflows themselves.

[0:28:06.8] DC: Yeah, I agree. I mean, it is interesting.
I've had a few conversations lately about the fact that the abstraction layer of coding, in the way that we think about it, really hasn't changed over time, right? It's the same thing. That's actually a really relevant point. It's also interesting to think about, with these sorts of frameworks and this tooling, what else we can enable the developer to have a feedback loop on more quickly, right?

To your point, we talked about how these different environments, your development environment and your production environment – the general consensus is they should be as close as you can reasonably get them, so that the behavior in one should somewhat mimic the behavior in the other. At least that's the story we tell ourselves. Given that, it would also be interesting if the developer was getting feedback on how the security posture of that particular cluster might affect the work that they're doing. You do actually have to define network policy. Maybe you don't necessarily have to think about it if we can provide tooling that can abstract that away, but at least you should be aware that it's happening, so that you understand, if it's not working correctly, this is where you might be able to see the sharp edges pop up, you know what I mean? That sort of thing.

[0:29:26.0] EK: Yeah. At the last KubeCon – where was it? In San Diego. There was this running joke. I was running around with the security crowd and there was this joke about "kubectl apply security.yaml". It was in a mocking tone. I'm not disparaging their joke; it was a good joke. Then I was thinking, "What if we can make this real?" I mean, maybe it is real. I don't know. I don't do security myself.
What if we could apply a comprehensive enough set of security measures (security monitoring, security scanning, all of that stuff), prepackage it, and let users access all of it with one command, or even less than that? Maybe you pre-configure it as a team lead and then everyone else on your team can just use it without even knowing it's there. Then it just lets you know, "Oh, hey. This thing you just did is a potential security issue you should know about." Yeah, coming up with these developer shortcuts is my hobby. [0:30:38.4] MG: That's cool. What you just mentioned, Ellen and Duffie, reminds me of the Spring community, the Spring framework, where a lot of the boilerplate (be it security stuff, or connections, integrations, etc.) is abstracted away and you just annotate your code a bit and then some framework, in that case obviously the Spring framework, handles it. In your case, Ellen, what you were hinting at is maybe a build environment that gives me these integration hooks where I just annotate. Or even those annotations could be enforced; standards could be enforced if I don't annotate at all, right? I could maybe override them. Then this build environment would just pick it up, because it scans the code, right? It has access to the source code, so it could just scan it, hook into it, and then apply security policies, lock it down, see which ports are being used, maybe open them up just to the application while the other ones automatically get blocked, etc. It just came to my mind. I haven't done any research on whether there's already some project or activity there. [0:31:42.2] EK: Yeah. Because I won't shut up about this stuff, because I just love it: we are doing a thing at Tilt, at a very early stage right now, that we're calling extensions. Very creative name, I suppose. It's basically like Go imports, but for Tilt configs. It's still at a very early stage.
We still have some road ahead of us. For example, let's say one user did some very special integration of Helm and Tilt, so you don't have to use Helm by hand anymore; you can just make all of your Helm stuff happen automatically when you're using Tilt. Usually, you would have to copy, I don't know, a hundred lines from your Tilt config file and pass that around for other developers to be able to use it. Now we have this thing that's basically like Go imports, where you can just say "load extension", give it a name, and it fetches it from a repository and it's running. I think that is basically an early stage of what you just described with Spring, but more geared towards, let's say, infra and Kubernetes: how do you tie that infra-and-Kubernetes stuff to higher-level functionality that you might want to use? [0:33:07.5] MG: Cool. I have another one. Speaking of which, are there any integrations for IDEs with Tilt? Because I know that VS Code, for example, has Kubernetes integrations, and there's the fabric8 Maven plugin, which handles some stuff under the covers. [0:33:24.3] EK: Usually, Tilt watches your code files and it doesn't care which IDE you use. It has its own dashboard, which is just a page that you open in your browser. Just this week I heard someone mention on Slack that they wrote an extension for Tilt. I'm not sure if it was for VS Code or one of the other VS Code-like .NET editors; I don't remember what it's called, but there's a family of those. I just heard that someone wrote one of those and shared the repo. We have someone looking into that. I haven't seen it myself. The idea came up when I was working at Garden, which is in the same area as Tilt, so I think it's pertinent. We also had the idea of a VS Code extension. I think the question is: what do you put in the extension? What do you make the VS Code extension do? Because both Tilt and Garden.
They have their own web dashboards that show users what we think should be shown, in the manner we think it should be shown. If you're going to create a VS Code extension, you either replicate that completely, basically taking the stuff that was in the browser and putting it in the IDE (I don't particularly see much benefit in that; if enough people ask, maybe we'll do it, but it's not something I find particularly useful), or you come up with new functionality. In both cases, I just don't see a very strong case for what different, IDE-specific functionality you would want. [0:35:09.0] MG: Yes. The reason I was asking is that we see all these Pulumi and AWS CDKs coming up, where you basically use a programming language in your IDE to write your application and application-infrastructure code, and all the templating, that YAML stuff, etc., gets generated under the covers. Just this week, AWS announced the CDK for Kubernetes. I was thinking, with this happening, some of these providers abstract the scaffolding as well, including the build; you don't even have to build, because it's abstracted away under the covers. I was seeing this trend. Then obviously, we still have Helm and the templating and Kustomize, and then you still have the manual way, as I mentioned in the beginning. I do like IDE integration, because that's where I spend most of my time. Whenever I have to leave the IDE, it's a context switch I have to go through, even if it's just to open another file that I need to edit somewhere. That's why I think IDE integration is useful for developers, because that's where they spend most of their time. As you said, there might be reasons not to do it in an IDE, because it's just replicating functionality that might not be useful there. [0:36:29.8] EK: Yeah.
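The Tilt config and extension mechanism described in this conversation can be sketched roughly as below. This is a hedged illustration: the image name, manifest path, chart, and test script are hypothetical, though `docker_build`, `k8s_yaml`, `local_resource`, and `load('ext://...')` are part of Tilt's documented Starlark API.

```python
# Tiltfile (Starlark, a Python-like language). A minimal sketch with
# made-up names; not a definitive setup.

# Ordinary Tilt config: build the image, deploy the manifests.
docker_build('my-app', '.')
k8s_yaml('deploy/app.yaml')

# Instead of copy-pasting a hundred lines of config between projects,
# pull a shared extension in by name; Tilt fetches it from the
# extensions repository.
load('ext://helm_remote', 'helm_remote')
helm_remote('redis', repo_url='https://charts.bitnami.com/bitnami')

# One-off task of the kind mentioned later in the episode: run a test
# script as a resource; exit code 0 means it passed.
local_resource('integration-tests', cmd='./run_tests.sh',
               resource_deps=['my-app'])
```

Running `tilt up` then watches the source tree and keeps these resources in sync as you edit.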
In the case of Tilt, all the config is written in Starlark, which is a language that's basically Python. If your IDE can syntax-highlight Python, it can syntax-highlight the Tilt config files. As for Pulumi and that stuff, I'm not that familiar; I know how it works, but I haven't used it myself, and I'm not familiar with the browser and IDE integration side of it. The thing about tools like Tilt is that usually, if you set them up right, you can just write your code all day and you don't have to look at the tool. You just switch from your IDE to, let's say, your browser where your app is running, so you get feedback, that kind of thing. Once you configure it, you don't really spend much time looking at it. You're going to look at it when there are errors: you try to refresh your application and it fails, and you need to find that error. By the time that happened, you already lost focus from your code anyway, so whether you look for your error in a terminal or on the Tilt dashboard is not much of an issue. [0:37:37.7] MG: That's right. That's right. I agree. [0:37:39.8] CC: All this talk about tooling and IDEs is making me think to ask you, Ellen. Let's say I'm a developer and my company decides we're going to use Kubernetes. What we are advocating with this episode is: if you're going to deploy to Kubernetes in production, you should consider running Kubernetes as a local development environment. Now, for those developers who haven't even worked with Kubernetes, where do you suggest they jump in? Because it's so many things; Kubernetes already is so big, and there are so many tools around just for operating Kubernetes itself. For a developer who says, "Okay, I like this idea of having my own local Kubernetes development environment, or maybe one in the cloud," should they start with a tool like Tilt, or something similar?
Would that make it easier for them to wrap their head around Kubernetes and what Kubernetes does? Or should they first get a handle on Kubernetes and then look at a tool like this? [0:38:56.2] EK: Okay. There are a few sides to this question. If you have a very large team, ideally you should get one or a few people to actually really learn Kubernetes and then make it so that everyone else doesn't have to. Something we have seen is a very large company that's going to do Kubernetes in development. They set up a developer-experience team, and then, for example, they have their own wrapper around kubectl and basically automate a bunch of stuff so that everyone on the team doesn't have to take the Certified Kubernetes Application Developer certificate. For people who don't know that certificate, it's basically "how much kubectl can you do off the top of your head?", because kubectl is an insanely huge and powerful tool. On the one hand, you should do that. If you have a big team: take a few people, learn all you can about Kubernetes, and write some wrappers so that people don't have to do kubectl something-something by hand. Just make very easy commands, say, your wrapper's name plus "context" plus a name, and that switches you to the namespace where some version of your app is running, that kind of thing. Now, about the tooling. Once you have your development environment set up (and you're going to need someone with some Kubernetes experience to set it up in the first place), if you have the right tooling, you don't really have to know everything that Kubernetes does. You should have at least a conceptual overview.
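A team wrapper of the kind described here is a small script. This is a hypothetical sketch (the wrapper name and namespace are made up); it only builds and shells out to a standard `kubectl config set-context` invocation, so kubectl itself stays hidden from the developer:

```python
import subprocess
import sys

def ns_command(namespace):
    """Build the kubectl invocation that points the current
    kubeconfig context at the given namespace."""
    return ["kubectl", "config", "set-context", "--current",
            "--namespace", namespace]

def switch(namespace):
    """Run the command; requires kubectl on the PATH."""
    subprocess.run(ns_command(namespace), check=True)

if __name__ == "__main__":
    # e.g.  mywrap context feature-x
    if len(sys.argv) == 3 and sys.argv[1] == "context":
        switch(sys.argv[2])
```

The point is not the three lines of kubectl; it's that developers type `mywrap context feature-x` and never need to learn the underlying flags.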
I can tell you for sure that there are hundreds of developers out there writing code that is going to be deployed to Kubernetes, code that goes to a Kubernetes development cluster whenever they make a change, and they don't have the first, well, I'm not going to say the first clue, but they are not experienced Kubernetes users. That's because of all the tooling you can put around it. [0:41:10.5] CC: Yeah, that makes sense. [0:41:12.2] EK: Yeah. You can abstract a bunch of stuff away with basically good sense: you know the common operations your team needs, and you abstract them away so that people don't have to become kubectl experts. You can also abstract a bunch of stuff away with tooling. Basically, as long as your developers have a basic grasp of containers and the basics of Kubernetes, they don't need to know how to operate it in any depth. [0:41:44.0] MG: Hey Ellen, in the beginning you said it's all about this feedback loop and iterating fast. Part of the feedback loop for a developer is unit testing, integration testing, all sorts of testing. How do you see that changing, or benefiting from tools like Tilt, especially when it comes to integration testing? Unit tests usually run locally, but what about integration testing? [0:42:05.8] EK: One thing people can do with Tilt is, once you have Tilt running, you basically have all of your application running, and you can set up one-off tasks. You could basically set up a script that does a bunch of stuff, which would be what your test does. If it returns zero, it succeeded; if it doesn't, it failed. It's not something we have right now in a prepackaged form that you can use right away. You would basically just say, "Hey Tilt, run this thing for me," and then you would see if it worked or not. I have to make a plug for the competition right now.
Garden has more of that part of things set up. They have tests as a separate primitive, right next to building and deploying, which is what you usually see; they also have testing. It does basically what I just said about Tilt, but they have a special little framework around it. With Garden, you would say, "Oh, here's a test. Here's how you run the test. Here's what the test depends on," etc. Then it runs it and tells you if it failed or not. With Tilt, it would be a more generic approach where you would just say, "Hey Tilt, run this and tell me if it fails or not," but without the little wrapping around it that's specific to testing. When it comes to pushing to production, let's say you did a bunch of stuff locally, you're happy with it, now it's time to push to production. Then there's all that headache with CI and waiting for tests to run and flaky tests and all of that. That is a big open question that everyone's unhappy about and no one really knows which way to run. [0:43:57.5] DC: That's awesome. Where do you see this space going in the future? As you look at the tooling that's out there, maybe not specifically the Tilt service or capability, but where do you see other people exploring that space? We were talking about AWS dropping a CDK, and there are different people trying to solve the YAML problem, but from the developer-experience tooling side: where do you see that space going? [0:44:23.9] EK: For me, it's all about higher-level abstractions and well-defined best practices. Right now, everyone is fumbling around in the dark, not knowing what to do, trying to figure out what works and what doesn't. The main thing I see changing is that, given enough time, best practices are going to emerge and it's going to be clear for everyone. If you're doing this thing, you should use this workflow.
If you're doing that thing, you should use that workflow. Basically, it's what happened when IDEs emerged and became a thing; that's the best-practices side. [0:44:57.1] DC: That's a great example. [0:44:58.4] EK: Yeah. What I see being offered tomorrow, in terms of prepackaged higher-level abstractions: I don't think every developer should have to know how to deal with Kubernetes at a deeper level, the same way I don't know how to build the Linux kernel even though I use Linux every day. I think things should be wrapped up in a way that lets developers focus on what matters to them, which right now is basically writing code. Developers should be able to get to the office in the morning, open up their computer, start writing code or doing whatever else they want to do, and not worry about Kubernetes, not worry about Lambda, not worry about how this is getting built and how this is getting deployed and how this is getting tested and what the underlying mechanism is. I'd love for higher-level patterns for those to emerge and be very easy for everyone to use. [0:45:53.3] CC: Yeah, it's going to be very interesting. Best practices are such an interesting thing to think about, because somebody could sit down and write, "Oh, these are the best practices we should be following in this space." In my opinion, it's really going to come out of what worked historically, once we have enough data to look at over the years. As far as tooling goes, I think it's going to be survival of the fittest: whatever tool has been used the most, that's going to be the best-practice way to do things. We know today there are so many tools, but I think we'll probably get to a point where we know what to use for what. With that, we have to wrap up, because we are at the top of the hour. It was so great to have Ellen, or L, as I think they prefer to be called, on the show. Thank you so much, Elle. I mean, L.
See, I can't even follow my own. You're very active on Twitter. We're going to have all the information for how to reach you on the show notes. We're going to have a transcript. As always people, subscribe, follow us on Twitter, so you can be up to date with what we are doing and suggest episodes too on our github repo. With that, thank you everybody. Thank you L. [0:47:23.1] DC: Thank you, everybody. [0:47:23.3] CC: Thank you, Michael and – [0:47:24.3] MG: Thank you. [0:47:24.8] CC: - thank you, Duffie. [0:47:26.2] EK: Thank you. It was great. [0:47:26.8] MG: Until next time. [0:47:27.0] CC: Until next week. [0:47:27.7] MG: Bye-bye. [0:47:28.5] EK: Bye. [0:47:28.6] CC: It really was. [END OF EPISODE] [0:47:31.0] ANNOUNCER: Thank you for listening to the Podlets Cloud Native Podcast. Find us on Twitter @thepodlets and on thepodlets.io website. That is ThePodlets, altogether, where you will find transcripts and show notes. We’ll be back next week. Stay tuned by subscribing. [END]See omnystudio.com/listener for privacy information.
Sergio #1: Challenge to predict the weather, from Pangeo. Repo with data and instructions. Thread on Twitter: https://twitter.com/raspstephan/status/1229272564729614336?s=21 Rodo #2: HiPlot - Discover relationships in high-dimensional data. HiPlot is a lightweight interactive visualization tool that helps discover correlations and patterns in high-dimensional data, using parallel plots and other graphical forms to represent the information. HiPlot can be used with IPython notebooks and through a web server. Sergio #3: Unsupervised Learning Demystified. A translation by Carlos Secada of the English original by Cassie Kozyrkov. Rodo #4: Loves me, loves me not: classify text with TensorFlow and Twilio. The post provides a step-by-step tutorial that helps you train an ML model and serve it through a Flask application. If you're an R user, this tutorial shouldn't be hard to extend using NLTK4R and TensorFlow for R. Sergio #5: All the talks from rstudio::conf 2020. Journalism with RStudio, R, and the tidyverse. Talks on R Markdown (by Yihui Xie, creator of Blogdown and Bookdown) and "Rmarkdown Driven Development". "Datos", the R4DS package in Spanish. Rodo #6: The PyCon Colombia 2020 talks are going up! Starting with the keynote by Andrew Godwin, creator of Django Channels and Django core developer, the PyCon Colombia team has begun uploading videos from the event, so don't miss all the incredible content they'll be sharing! Extras: Sergio: Santander scholarships for MIT: https://www.becas-santander.com/es/program/becas-santander-for-mit-leading-digital-transformation TensorFlow user group in Sucre, Bolivia (greetings to Lesly Zerna; regards, Rodo): https://www.meetup.com/TensorFlow-User-Group-Bolivia/ The PyCon US schedule is out!
https://us.pycon.org/2020/schedule/talks/ and Denny Perez's talk: https://us.pycon.org/2020/schedule/presentation/84/ - elDevShow: https://anchor.fm/eldevshow/episodes/Cmo-ser-pap-luchn-y-mudarte-a-Canad-como-desarrollador-mvil-con-el-Pinedax-e9angg Rodo: Meme of the week: https://www.reddit.com/r/mathmemes/comments/f6e5vb/the_battle_of_titans/ Second meme of the week: https://www.reddit.com/r/mathmemes/comments/f6g43c/society/ Open Data Day in CDMX, Morelia (Michoacán), and León (Guanajuato). Again, a shout-out to Lesly; may she help us start a TF group in MX! --- This episode is sponsored by · Anchor: The easiest way to make a podcast. https://anchor.fm/app --- Send in a voice message: https://anchor.fm/quaildata/message Support this podcast: https://anchor.fm/quaildata/support
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
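A minimal example of the library in action, rendering a figure straight to a hardcopy file with the non-interactive Agg backend (the sine data and output filename here are arbitrary illustrations):

```python
import math
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no GUI
import matplotlib.pyplot as plt

# Plot one period of a sine wave.
x = [i / 100 * 2 * math.pi for i in range(101)]
y = [math.sin(v) for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x (radians)")
ax.set_ylabel("sin(x)")
ax.legend()

out_path = os.path.join(tempfile.gettempdir(), "sine.png")
fig.savefig(out_path)  # PNG is one of many supported hardcopy formats
```

Swapping the backend (or running the same code in a Jupyter notebook) is how the same script targets the different environments the description mentions.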
Part 1: Paul and Sara chat about what language is best to choose as your first when you're just getting started on your journey as a programmer. Probably not Mathematica, but it's a neat one. Jupyter Notebooks - an in-browser notebook for working with Python. You can write your words, have your code right next to it, and see how things play out. Or as Tom Butterworth put it on DEV: "Jupyter Notebook is an interactive web application that interfaces with the iPython repl, allowing you to run structured but iterative Python scripts. It is the de facto choice for data scientists to rapidly prototype pipelines, visualise data, or perform ad hoc analysis." Interview: Jess Lee. Jess Lee had some great perspectives to share on what it means to balance being an entrepreneur and a coder. Isaac Lyman kicked off a community project on DEV to create a book that would help guide readers through their first year in code. 15 contributors ended up writing chapters for the book, which is available for free here. DEV is open source, and they have decided it can be a software platform other organizations can use to build their own communities. As Ben Halpern writes, "The future of our company will be based on delivering the DEV open-source software to power new standalone communities. We will work with a network of partners both inside and outside of the software ecosystem." Part 2: We dig into D3.js. Stack Overflow has a lot to teach folks on this subject. What's the best way to make a d3.js visualization layout responsive? Just don't ask about a good book for learning the subject! And finally, what's the difference between d3.js and jQuery? It's a silly question with some interesting answers and a nice history of the web in the background.
No more excuses. Network Engineers are now developers. Plus, learn how to open CSV files and manipulate CSV data using Python. Hank Preston continues to teach us core Python skills. Menu: 00:16 - Hank Conferences 00:34 - Global PyCon conference 01:10 - Is the Python PyCon conference worthwhile for Network Engineers? 03:00 - Network Automation Open Space at PyCon 05:25 - Are Network Engineers Programmers / Developers? 09:25 - Imposter Syndrome! 12:50 - Sharing is caring 15:50 - Python Libraries & Packages / Don't reinvent the wheel! 17:08 - CSV Libraries 19:15 - Atom (IDE) and iPython and other code editors 21:33 - Working with CSV libraries 25:00 - Importing and opening CSV 32:48 - Python for loop 42:33 - Using the Python With statement 50:51 - Using the built in help capabilities 1:00:35 - Prompting the user to add more data 1:05:47 - Using a Shebang line 1:08:20 - Linux Permissions (add execute permissions) Links: Start Now Page on Devnet: http://bit.ly/2WBjb5v NetDevOps Live Episode on Python Libraries including CSV: http://bit.ly/2VjbGP4 Coding Fundamentals Learning Labs: http://bit.ly/2vU1BO8 Network programmability Basics Video Series covering CSV: http://bit.ly/2JzlbY4 David's details: YouTube: www.youtube.com/davidbombal Twitter: twitter.com/davidbombal Instagram: www.instagram.com/davidbombal/ LinkedIn: www.linkedin.com/in/davidbombal/ #python #devnet #ciscopython
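The CSV segments in the menu above (importing the csv library, opening a file with the `with` statement, looping over rows) boil down to a few lines of standard-library Python. The device data here is made up for illustration:

```python
import csv

# Write a small sample file so the example is self-contained.
rows = [
    {"hostname": "sw01", "ip": "10.0.0.1"},
    {"hostname": "sw02", "ip": "10.0.0.2"},
]
with open("devices.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["hostname", "ip"])
    writer.writeheader()
    writer.writerows(rows)

# Re-open the file ("with" closes it automatically) and loop over
# each device, as in the episode's for-loop segment.
with open("devices.csv", newline="") as f:
    devices = list(csv.DictReader(f))

for device in devices:
    print(device["hostname"], device["ip"])
```

`csv.DictReader` maps each row to a dict keyed by the header line, which is usually friendlier for network inventories than positional indexing.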
Support these videos: http://pgbovine.net/support.htm http://pgbovine.net/PG-Vlog-145-python-r-data-analysis-setup.htm - [Forcing Functions](forcing-functions.htm) - [The IPython shell](https://ipython.readthedocs.io/en/stable/) - [.jsonl file format](http://jsonlines.org/) Recorded: 2018-05-10
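The .jsonl format linked above is simply one JSON object per line, which makes it easy to append to and stream through. A quick stdlib sketch (the filename and records are arbitrary):

```python
import json

records = [{"id": 1, "msg": "hello"}, {"id": 2, "msg": "world"}]

# Write: one json.dumps per line.
with open("log.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read: one json.loads per line; no need to hold the whole
# file in memory for large logs.
with open("log.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

This line-at-a-time property is why .jsonl shows up so often in data-analysis setups like the one in the vlog.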
Fernando Perez is best known as the creator of IPython and co-founder of Project Jupyter: a set of open-source data science tools that some may consider the equivalent of the bat and ball to the sport of baseball. Today, you really can’t play the game of data science without Jupyter Notebooks, and our guest today is one of Jupyter's leads and originators (see here for the rest of the amazing team). Fernando is also an Assistant Professor in Statistics at UC Berkeley, Researcher at the Berkeley Institute for Data Science, and Founding Board Member of the NumFOCUS foundation — the community that creates the SciPy stack, along with virtually every other notable open-source data science tool out there. This conversation was recorded in person with Fernando in his office on UC Berkeley’s campus, and it turned out to be the most humanizing, energizing, and down-to-earth interview I’ve had so far. Some of the many topics we covered include: what Fernando wanted to be while growing up in Medellín (pronounced "Me-de-jean"), Colombia; the function that formal education played in his learning of data science; the story behind IPython and Project Jupyter and its evolution over the past 10 years; lessons learned about technical competence and human character from his mentors over the years; what a “computational narrative” means to him and why its principles are key to data storytelling; Fernando’s experience teaching a 650-student course (part of a pair of courses that are the largest of their kind) at the Berkeley Institute for Data Science. Enjoy the show! Show Notes: https://ajgoldstein.com/podcast/ep7/ Fernando’s Twitter: https://twitter.com/fperez_org AJ’s Twitter: www.twitter.com/ajgoldstein393/
Jessica Forde, Yuvi Panda and Chris Holdgraf join Melanie and Mark to discuss Project Jupyter, from its interactive notebook origin story to the various open-source modular projects it has grown into, supporting data research and applications. We dive specifically into JupyterHub, which uses Kubernetes to enable a multi-user server. We also talk about Binder, an interactive development environment that makes work easily reproducible. Jessica Forde: Jessica Forde is a Project Jupyter Maintainer with a background in reinforcement learning and Bayesian statistics. At Project Jupyter, she works primarily on JupyterHub, Binder, and JupyterLab to improve access to scientific computing and scientific research. Her previous open-source projects include datamicroscopes, a DARPA-funded Bayesian nonparametrics library in Python, and density, a wireless device data tool at Columbia University. Jessica has also worked as a machine learning researcher and data scientist in a variety of applications including healthcare, energy, and human capital. Yuvi Panda: Yuvi Panda is the Project Jupyter Technical Operations Architect in the UC Berkeley Data Sciences Division. He works on making it easy for people who don’t traditionally consider themselves “programmers” to do things with code. He builds tools (e.g., Quarry, PAWS, etc.) to sidestep the list of historical accidents that constitute the “command line tax” that people have to pay before doing productive things with computing. Chris Holdgraf: Chris Holdgraf is a Project Jupyter Maintainer and Data Science Fellow at the Berkeley Institute for Data Science and a Community Architect at the Data Science Education Program at UC Berkeley. His background is in cognitive and computational neuroscience, where he used predictive models to understand the auditory system in the human brain.
He’s interested in the boundary between technology, open-source software, and scientific workflows, as well as creating new pathways for this kind of work in science and the academy. He’s a core member of Project Jupyter, specifically working with JupyterHub and Binder, two open-source projects that make it easier for researchers and educators to do their work in the cloud. He works on these core tools, along with research and educational projects that use these tools at Berkeley and in the broader open science community. Cool things of the week Dragonball hosted on GC / powered by Spanner blog and GDC presentation at Developer Day Cloud Text-to-Speech API powered by DeepMind WaveNet blog and docs Now you can deploy to Kubernetes Engine from GitLab blog Interview Jupyter site JupyterHub github Binder site and docs JupyterLab site Kubernetes site github Jupyter Notebook github LIGO (Laser Interferometer Gravitational-Wave Observatory) site and binder Paul Romer, World Bank Chief Economist blog and jupyter notebook The Scientific Paper is Obsolete article Large Scale Teaching Infrastructure with Kubernetes - Yuvi Panda, Berkeley University video Data 8: The Foundations of Data Science site Zero to JupyterHub site JupyterHub Deploy Docker github Jupyter Gitter channels Jupyter Pop-Up, May 15th site JupyterCon, Aug 21-24 site Question of the week How did Google’s predictions do during March Madness? How to build a real-time prediction model: Architecting live NCAA predictions Final Four halftime - fed data from first half to create prediction on second half and created a 30-second spot that ran on CBS before game play sample prediction ad Kaggle Competition site Where can you find us next? Melanie is speaking about AI at Techtonica today, and April 14th will be participating in a panel on Diversity and Inclusion at the Harker Research Symposium
Brian Granger is an associate professor of physics and data science at Cal Poly State University in San Luis Obispo, CA. His research focuses on building open-source tools for interactive computing, data science, and data visualization. Brian is a leader of the IPython project, co-founder of Project Jupyter, co-founder of the Altair project for statistical visualization, and an active contributor to a number of other open-source projects focused on data science in Python. He is an advisory board member of NumFOCUS and a faculty fellow of the Cal Poly Center for Innovation and Entrepreneurship.
Live stream to http://twitch.tv/adafruit looking at Jupyter/IPython as a tool for programming Python on the Raspberry Pi. Looks at how to install and run Jupyter on the Pi, and how to use it to do simple tasks like graph sensor data. Links mentioned in the video: - Jupyter homepage: http://jupyter.org/ - Jupyter installation: http://jupyter.readthedocs.io/en/latest/install.html - Introducing IPython: http://ipython.readthedocs.io/en/stable/interactive/tutorial.html - MCP9808 temperature sensor guide: https://learn.adafruit.com/mcp9808-temperature-sensor-python-library/overview - IPython Plotting: http://ipython.readthedocs.io/en/stable/interactive/plotting.html - Plotting with Matplotlib: http://nbviewer.jupyter.org/github/ipython/ipython/blob/1.x/examples/notebooks/Part%203%20-%20Plotting%20with%20Matplotlib.ipynb Acknowledgements: - Music: bartlebeats - Intro shuttle footage: NASA - Intro fonts: Typodermic - Intro inspiration: Mr. Wizard's World - Matrix background: cool-retro-term & cmatrix ----------------------------------------- Visit the Adafruit shop online - http://www.adafruit.com Subscribe to Adafruit on YouTube: http://adafru.it/subscribe Join our weekly Show & Tell on G+ Hangouts On Air: http://adafru.it/showtell Watch our latest project videos: http://adafru.it/latest New tutorials on the Adafruit Learning System: http://learn.adafruit.com/ Music by bartlebeats: http://soundcloud.com/bartlebeats -----------------------------------------
Talk Python To Me - Python conversations for passionate developers
See the full show notes for this episode on the website at talkpython.fm/44.
In this episode we explore the local world of open data and civic hacking. Kenneth, Kevin and Len are joined by Jason Norwood-Young (@j_norwood_young) from Code4SA (@Code4SA). Jason was a tech journalist in a previous life and "converted to the dark side" to become a developer. He's currently working with Code4SA on various open data and civic hacking initiatives and took us for a whirlwind tour of the space. We touch on a range of topics, from acquiring data from government and civil society, to the tools used to clean and interview the data, to publishing the data and building impactful projects that better the lives of people on the ground. We wrap up with some ways curious folks can get involved in the civic hacking movement, chatting local and global efforts to disseminate information and empower people. Follow Jason (https://twitter.com/j_norwood_young) and Code4SA (https://twitter.com/Code4SA) on Twitter. Code4SA has an exhaustive list of projects on GitHub (https://github.com/Code4SA), and can be found online at http://code4sa.org This show was packed with a lot of resources, all listed below: * What is open data?
- http://opendatahandbook.org/guide/en/what-is-open-data/ * Section 32: Access to information - http://www.acts.co.za/constitution-of-the-republic-of-south-africa-act-1996/index.html?32_access_to_information.php Non-exhaustive list of sources: * STATS SA Data - http://www.statssa.gov.za/?page_id=1417 * IEC Elections API - https://api.elections.org.za * Code4SA data repository - https://data.code4sa.org * City of Cape Town Open Data Portal - https://web1.capetown.gov.za/web1/OpenDataPortal/ * Municipal Demarcation Board - http://www.demarcation.org.za/index.php/downloads/boundary-data * Code4SA Maps API - http://maps.code4sa.org * Parliamentary Monitoring Group - https://pmg.org.za A few tools of the trade: * IPython - http://ipython.org * Requests - http://docs.python-requests.org * lxml - http://lxml.de/ * Beautiful Soup - http://www.crummy.com/software/BeautifulSoup/ * cheerio - https://github.com/cheeriojs/cheerio * Google Spreadsheets - https://docs.google.com/spreadsheets * OpenRefine - http://openrefine.org * Infogr.am - https://infogr.am/ * Mapbox - https://www.mapbox.com * Datawrapper - https://datawrapper.de * DocumentCloud - https://www.documentcloud.org Spotlight on Code4SA: * Data quests - http://scibraai.co.za/join-the-data-quest-to-tell-science-stories-with-sa-data/ * Naked Data Newsletter - http://code4sa.org/newsletter/ * Wazimap - http://wazimap.co.za * Know your 'hood - http://mg.co.za/page/know-your-hood * South Africa's Protest Map - http://protest-map.code4sa.org/ * Hospital ratings - http://hospitals.code4sa.org * Medicine Price Index - http://mpr.code4sa.org * Living on the edge - http://livingwage.code4sa.org * Open By-laws - http://openbylaws.org.za * Black Sash MAVC - https://www.youtube.com/watch?v=Bdi2kDt4Ieo & http://www.blacksash.org.za/index.php/sash-in-action/stories-from-the-field/1657-mavc-dialogue-with-tshedza-development-project * Data Journalism School - http://code4sa.org/school/ * Code4SA on GitHub - https://github.com/Code4SA 
Other, non-affiliated, code* groups: * Code for Africa - http://www.codeforafrica.org * Code for America - http://www.codeforamerica.org * Hacks/Hackers - http://hackshackers.com * Hacks/Hackers Johannesburg - http://www.meetup.com/HacksHackersAfrica/ Our spot in the global arena * Open Data Index - http://index.okfn.org * South Africa's ranking - http://index.okfn.org/place/south-africa/ After we stopped recording we also chatted about Adrian Frith's Dotmap - http://dotmap.adrianfrith.com/ And finally, our picks: Kenneth: * Bounce - http://bounceinc.co.za * The Revenant - https://www.youtube.com/watch?v=LoebZZ8K5N0 Len: * WolfenGo - https://github.com/gdm85/wolfengo Jason: * Wolfenstein 1-D Kevin: * Pocket - https://getpocket.com
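Much of the civic-hacking workflow described above starts with pulling structured data out of messy HTML. The episode's tools of choice are requests, lxml, and Beautiful Soup; as a self-contained sketch of the same idea using only the standard library's `html.parser` (the page snippet here is made up for illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from anchor tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# A made-up snippet standing in for a fetched page (requests would fetch the real one).
page = ('<ul><li><a href="http://wazimap.co.za">Wazimap</a></li>'
        '<li><a href="http://openbylaws.org.za">Open By-laws</a></li></ul>')
parser = LinkExtractor()
parser.feed(page)
# parser.links == [('http://wazimap.co.za', 'Wazimap'),
#                  ('http://openbylaws.org.za', 'Open By-laws')]
```

Beautiful Soup or lxml would shorten this considerably; the point is only the scrape-then-clean shape of the work.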
Materials Available here: https://media.defcon.org/DEF%20CON%2023/DEF%20CON%2023%20presentations/DEFCON-23-Yan-Shoshitaishvili-Fish-Wang-Angry-Hacking.pdf Angry Hacking - the next generation of binary analysis Yan Shoshitaishvili PhD Student, UC Santa Barbara Fish Wang PhD Student, UC Santa Barbara Security has gone from a curiosity to a phenomenon in the last decade. Fortunately for us, despite the rise of memory-safe, interpreted, lame languages, the security of binaries is as relevant as ever. On top of that, (computer security) Capture the Flag competitions have skyrocketed in popularity, with new and exciting binaries on offer for hacking every weekend. This all sounds great, and it is. Unfortunately, the more time goes by, the older we get, and the more our skills fade. Whereas we were happy to stare at objdump a decade ago, today, we find the menial parts of reversing and pwning more and more tiring and more and more difficult. Worse, while security analysis tools have been evolving to make life easier for us hackers, the core tools that we use (like IDA Pro) have remained mostly stagnant. And on top of that, the term "binaries" has expanded to regularly include ARM, MIPS, PPC, MSP430, and every other crazy architecture you can think of, rather than the nice, comfortable x86 of yesteryear. New tools are required, and we're here to deliver. Over the last two years, we have been working on a next-generation binary analysis framework in an attempt to turn back the tide and reduce our mounting noobness. The result is called angr. angr assists in binary analysis by providing extremely powerful, state-of-the-art analyses, and making them as straightforward to use as possible. Ever wanted to know *what freaking value* some variable could take on in a function (say, can the target of a computed write point to the return address)? angr can tell you! Want to know what input you need to trigger a certain code path and extract a flag? Ask angr!
In the talk, we'll cover three of the analyses that angr provides: a powerful static analysis engine (able to, among other things, automatically identify potential memory corruption in binaries through the use of Value-Set Analysis), its symbolic execution engine, and dynamic emulation of various architectures (*super* useful for debugging shellcode). On top of that, angr is designed to make the life of a hacker as easy as possible -- for example, the whole system is 98% Python, and is designed to be a breeze to interact with through iPython. Plus, it comes with a nifty GUI with nice visualizations for symbolically exploring a program, tracking differences between different program paths, and understanding value ranges of variables and registers. Finally, angr is designed to be easily extensible and embeddable in other applications. We'll show off a semantic-aware ROP gadget finder ("are there any gadgets that write to a positive offset of rax but don't clobber rbx" or "given this program state, what are the gadgets that won't cause a segfault") and a binary diffing engine, both built on angr. We've used angr to solve CTF binaries, analyze embedded devices, debug shellcode, and even dabble in the DARPA Cyber Grand Challenge. We'll talk about our experiences with all of that and will release angr to the world, hopefully revolutionizing binary analysis and making everyone ANGRY! Yan and Fish are two members of Shellphish, a pretty badass hacking team famous for low SLA and getting the freaking exploit JUST A FREAKING MINUTE LATE. Their secret identities are those of PhD students in the security lab of UC Santa Barbara. When they're not CTFing or surfing, they're doing next-generation (what does that even mean?) security research. Their works have been published in numerous academic venues. For example, in 2013, they created an automatic tool, called MovieStealer, a tool to automatically break the DRM of streaming media services [1]. 
After taking 2014 to work on angr, in 2015, they followed this up with an analysis of backdoors in embedded devices [2]. Now, they've set their sights on helping the world analyze binaries faster, better, stronger, by revolutionizing the analysis tool landscape! [1] https://www.usenix.org/conference/usenixsecurity13/technical-sessions/paper/wang_ruoyu [2] http://www.internetsociety.org/doc/firmalice-automatic-detection-authentication-bypass-vulnerabilities-binary-firmware Twitter: @zardus
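To make the "ask angr" idea concrete without a binary or the angr toolchain at hand, here is a deliberately tiny, hypothetical stand-in. Where angr symbolically executes machine code and solves path constraints with an SMT solver to answer "what input drives execution down this code path?", this toy just brute-forces a byte-sized input space. It illustrates the question angr answers, not angr's actual API:

```python
def check_password(x):
    """Toy 'binary': returns which path execution takes for input byte x."""
    if (x ^ 0x5A) == 0x2F:
        return "win"
    return "lose"

def find_input(reaching, domain=range(256)):
    """Brute-force stand-in for symbolic execution: search for an input
    that makes check_password take the desired path."""
    for x in domain:
        if check_password(x) == reaching:
            return x
    return None

x = find_input("win")
# The constraint x ^ 0x5A == 0x2F has the unique solution x == 0x2F ^ 0x5A == 0x75.
```

Real symbolic execution solves such constraints directly instead of enumerating inputs, which is what makes it tractable on real programs with large input spaces.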
Episode 10 - Brian Granger and Fernando Perez of the IPython Project
This week the team talks about Jonathon's new ISIS analysis, IPython 3, Indian Food, Algorithm Aversion, and more!
Introduction to Python for scientists and engineers (2nd ed.) - Online course
Show notes: http://betweenscreens.fm/episodes/9
Introduction [0:38] Django ActionScript PyConSG .net Python in a Nutshell TurboGears Bitmazk Developing with Python [6:27] Guido van Rossum Pull request Bitmazk Github Fabric Less Solr Search Django Haystack PostgreSQL Full Text Search pip PyPI Python 2 vs 3 Django Shop Django CMS RaspberryPi Bitcoin Scratch PyGame MongoDB IPython Notebook Python for Data Analysis Organising Talks in a Tech Conference [20:45] Dropbox PyConSG Call for Proposals Symposion PyCon US I/O Polling audience questions [28:27] David Cramer Kenneth Reitz Daniel Greenfeld Learn Python the Hard Way Pandas Pelican Rapid Fire questions [38:20] Two Scoops of Django Daniel Greenfeld Go Python User Group Singapore Ubuntu Broken Age Monkey Island Doubly Linked List - New releases [40:07] IPython 2.0 Django 1.7 South WinJS Event Loop - Local events [48:05] Samsung Developer Day PyCon SG Electric Plug – Connect with Martin! [50:25] Martin's Twitter Martin's Facebook Martin's Github Bitmazk Github
Episode 2! In this episode of the Data Driven Security Podcast, Bob and Jay review the DDS coverage of Harvard's "Weathering the Data Storm" symposium including some specific focus on the IPython talk by Fernando Pérez, Cynthia Rudin's "Manhole Event" paper and the pretty consistent theme of "need to prove your models in little data before driving them to scale". Then, they execute a whirlwind review of recent blog posts, give a preview of an upcoming talk at RSA by Jay & Wade Baker, plus give a preview of upcoming DDS blog and podcast topics. NOTE: An enhanced, video version of Episode 2 is available on YouTube. Resources mentioned in the episode: Weathering the Data Storm symposium DDS Tweetscription of the symposium with links to resources covered in the talks openPERT The new DDS Data Set Collection DDS' new short domain Review of recent DDS blog posts including the "marx" data set, malicious cartography and data-driven risk analysis SolvoMediocris - "FAIR"-like risk analysis tools built by DDS Jay & Bob's ZeroAccess collaboration More ZeroAccess machinations Facebook/Princeton Article with mixed ggplot and Excel graphics
Horst Jens, Florian Schweikert, and Gregor Pridun chat about free software and other nerd topics. Show notes at http://goo.gl/VNL2i1 or http://biertaucher.at Please click this Flattr link if possible: http://flattr.com/thing/1771380/Biertaucher-Podcast-Folge-116
Discuss this episode in the Muse community Follow @MuseAppHQ on Twitter Show notes 00:00:00 - Speaker 1: There's been very little innovation and research more generally into what is a good interface for inputting equations. So I think most people are probably familiar with how Microsoft Word or Excel have these equation editors where you basically open this palette and there is a preview and there is a button for every possible mathematical symbol or operator you can imagine. 00:00:28 - Speaker 2: Hello and welcome to Meta Muse. Muse is a tool for thought on iPad and Mac. This podcast isn't about Muse the product, it's about Muse the company and the small team behind it. I'm Adam Wiggins here today with my colleague Mark McGranaghan. Hey, Adam. And joined by our guest Sarah Lim, who goes by Slim. Hello, hello, and Slim, you've got various interesting affiliations including UC Berkeley, Notion, and Ink & Switch, but what I'm interested in right now is the lessons you've learned from playing classic video games. Tell me about that. 00:01:01 - Speaker 1: So this arose when I was deciding whether to get the 14 inch or 16 inch M1 MacBook Pro, a critical question of our age, let's be honest. 00:01:10 - Speaker 1: Exactly, exactly. I couldn't decide. I posted a request for comments on Twitter, and then I had this realization that when I was 6 years old playing Oregon Trail 5, which is a remake of Oregon Trail 2, which is itself a remake of the original, I was in the initial outfitting stage, and you have 3 choices for your farm wagon. You can get the small farm wagon, the large farm wagon, and the Conestoga wagon. I actually don't know if I'm pronouncing that correctly, but let's assume I am. So I just naively chose the Conestoga wagon because as a 6 year old, I figured that bigger must be better and being able to store more supplies for your expedition would make it more successful.
I eventually learned that the fact that the wagon is much larger and can store a lot more weight means that it's a lot easier to overload it. Among other things, this requires constantly abandoning supplies to cut weight. It makes the river fording minigame much more perilous. It's a lot harder to control the wagon. And yeah, I never chose that wagon again on subsequent playthroughs, and I decided to get the 14-inch laptop. 00:02:12 - Speaker 2: Makes perfect sense to me, and what a great lesson for a six year old: trade-offs. I feel like it's one of the most important kind of fundamental concepts to understand as a human in this world, and I think many folks struggle with that well into adulthood. At least I feel like I've often been in business conversations where trying to explain trade-offs is met with confusion. 00:02:35 - Speaker 1: They should just play Oregon Trail. 00:02:37 - Speaker 2: Clearly that's the solution. And tell us a little bit about your background. 00:02:42 - Speaker 1: Yeah, so I've been interested in basically all permutations really of user interfaces and programming languages for a really long time, so this includes programming languages as user interfaces and programming languages for user interfaces, and then, you know, the combination of the two. So right now I'm doing a PhD in programming languages, interested in more of like the theoretical perspective, but in the past, I've worked on, I guess, end user computing, which is really the broader vision of Notion, and I was at Khan Academy for a while on the long term research team. 00:03:18 - Speaker 2: Yeah, and there I think you worked with Andy Matuschak, who's a good friend of ours and a previous guest on the podcast. 00:03:24 - Speaker 1: Yes, definitely. That was the first time I worked with Andy in real depth, and I still really enjoy talking to him and occasionally collaborating with him today.
So, I guess, prior to that, I was doing a lot of research at the intersection of HCI, or human-computer interaction, and programming tools, programming systems, I guess. So, one of the big projects that I worked on as an undergrad was focused on inspecting CSS on a webpage, or more generally trying to understand what are the properties of, like, the code that influence how the page looks or a visual outcome of interest, and there I was really motivated by the fact that these software tools have their own mental model, I guess, or just model of how code works and how different parts of the program interact to produce some output, and then you have the user, who often has this entirely different intuitive model of what matters, what's important. So they don't care if this line of code is or isn't evaluated, they care whether it actually has a visible effect on the output. So trying to reconcile those two paradigms, I think, is a recurring theme in a lot of my work. 00:04:30 - Speaker 2: And I remember seeing a little demo maybe of some of the, I don't know if it was a prototype or a full open source tool, but essentially a visualizer that helps you better understand which CSS rules are being applied. Am I remembering that right? 00:04:43 - Speaker 1: Yeah, so that was both part of the prototype and the eventual implementation in Firefox, but the idea there is: the syntax of CSS really elides the complexity, I think, because syntactically it looks like you have all of these independent properties, like color: red, you know, font-size: 16 pixels, and they seem to be all equally important and at the same level of nesting, I guess, and what that really hides is the fact that there are a lot of dependencies between properties. So a certain property like z-index, you know, the perennial favorite z-index: 9999999, doesn't take effect unless the element has, like, position: relative, for example, and it's not at all apparent if you're writing those two properties that there is a dependency between them. So I was working on visualizing kind of what those dependencies were. This actually arose because I wrote to Bert, who is one of the co-creators of CSS, and was like, hi, I'm interested in building a tool that visualizes these dependencies. Where can I find the computer-readable list of all such dependencies? And he was like, oh, we don't have one, you know, we have this SVG that tries to map out the dependencies between CSS 2.1 modules, and even there you can see all these circular dependencies, but we don't have anything like what you're looking for. That to me was totally bananas, because it was the basic blocker to most people being able to go from writing really trivial CSS to more complicated layouts. So I was like, well, I guess this thing doesn't exist, so I'd better go invent it. 00:06:12 - Speaker 2: Perfect way to find good research problems. Now, more recently, two projects I wanted to make sure we reference because they connect to what we'll talk about today: you recently worked on the equation editor at Notion, and then you worked on a rich text CRDT called Peritext at Ink & Switch. Uh, would love to hear a little bit about those projects. 00:06:34 - Speaker 1: Yeah, definitely. So I guess the Peritext project, which was the most recent one, was a collaboration with Geoffrey Litt, Martin Kleppmann, and Peter van Hardenberg, and that one was really exciting because we were trying to build a CRDT that could handle rich text formatting, and traditionally, you have all of these CRDTs that are designed for fairly bespoke applications.
They're things like a counter data type or a set data type that has certain behavior when you combine two sets, and we're still at the stage of CRDT development where, aside from things like JSON CRDTs like Automerge, we don't really have a one-size-fits-all CRDT framework or solution. You still mostly have to hand-design and implement the CRDT for a given application. And it turns out that in the case of something like rich text, it's a lot harder than just saying, oh, you know, we'll store annotations in an array and call it a day, because the semantics for how you want different types of formatting to combine when people split and rejoin sessions and things like that are all very complex, and it turns out that we have a lot of learned behaviors that arise even from, like, design decisions in Microsoft Word, where you expect certain annotations to be able to extend, certain annotations to not extend, things like that. Capturing all of the nuance in that behavior turns out to be really difficult and requires a lot of domain-specific thinking. But we think we have an approach that works, and I would really encourage everyone to read the essay that we published and try to poke holes in it too. This was like the 5th version of the algorithm, right? So like months ago, we were like, all right, let's start writing, and then Martin, who has just an incredible talent for these things, is like, hey, everyone, you know, I found some issues with the approach, and, you know, "oh no," and so we fix those, we're like, all right, you know, this one's good, and just repeat this like week after week. So I really have to give him a ton of credit for both coming up with a lot of these problems and also figuring out ways to work around it.
00:08:33 - Speaker 2: We talked with Peter a little bit recently, Peter van Hardenberg, about the pencils-down element of the lab, but also just research generally, which is: there's always more to solve, you know, it's the classic XKCD, "more research needed" is always the end of every paper ever written, which is indeed the pursuit of the unknown. That's part of what makes science and seeking new knowledge exciting and interesting, but at some point you do have to say we have a new quantum of knowledge and it's worth publishing that. But then I think if it's just straight up wrong or you see major problems that you feel embarrassed by, then you want to invest more. 00:09:09 - Speaker 1: Right, exactly. I think in this case there was a distinction between there's always more we can tack on versus we wanted to get it right, you know, and in particular, the history of both operational transforms, or OT, and CRDTs for rich text, just text in general, is such that it's this minefield of, I guess to use kind of a gruesome visual metaphor, just dead bodies everywhere. You're like, oh, you know, such-and-such algorithm was published at such-and-such time and it was the new hotness for a while, and then we realized, oh, it was actually wrong, and this new paper came out which proved like 4 of the algorithms were wrong, and so on. And so with correctness being such an important part of any algorithm, of course, but also kind of this white whale in the rich text field, we thought it was important to at least make a credible effort to have a correct algorithm. 00:09:57 - Speaker 2: Yeah, makes sense. Yeah, I can highly recommend the Peritext essay. One of the things I found interesting about it, maybe just for anyone who's listening whose head is spinning from all the specialized jargon here: CRDTs are a data structure for doing collaborative software, collaborative documents, and then, yeah, rich text, Microsoft Word is the canonical example there.
You can bold things, you can italic things, you can make things bigger and smaller. Well, part of what I enjoyed about this paper was actually that I felt, even if you have no interest in CRDTs, it has these lovely visualizations that show kind of the data representation of a sentence like "the quick brown fox", and then if you bold "quick", and then later someone else bolds "fox", you know, how do those things merge together. But even aside from the merging and the collaborative aspect, which obviously is the research, the novel research here, I felt it gave me a greater understanding of just how rich text editing works under the hood, which I guess I had a vague idea of, but hadn't thought about it so deeply. So, highly recommend that paper. Just read the figures, even if you don't want to read the thousands of words. 00:11:05 - Speaker 1: I'm glad you like the figures. They were a real labor of love. 00:11:08 - Speaker 2: Perfect, yeah, so. 00:11:10 - Speaker 1: The one thing I would add is that CRDTs are a technology for collaboration, but the way they differ from operational transforms, or OTs, is that a CRDT is basically designed to operate in a decentralized setting, so you don't need a persistent network connection to all the parties, you don't need a centralized server. The idea is you can fluidly recover from network partitions by merging all of the data and operations that happened while you were offline, and this turns out to be really important to our vision of how collaborative editing should work, because we think it's really important for people to be able to do things like not always be editing in the same document at the same time as everyone. Maybe I want to take some space for myself to write in private and then have my changes sync up with everyone else thereafter. Maybe I'm, you know, self-conscious about other people seeing my work in progress, but I think that it would be interesting and helpful to look at what the main document looks like and how that's evolving while I'm working in private, and you can have that kind of one-way visibility with something like a CRDT, versus with something like Google Docs, where it's just sort of always online, or you're off editing in your own personal editor. Conversely, maybe I'm OK with everyone else seeing the work that I'm doing in progress, but I just find it really visually jarring to have all these cursors and different colors jumping around and people inserting text, bumping my paragraphs down the page. I've definitely been there. I'm not particularly precious about people seeing my work in progress, but I just cannot focus on writing when the page is just changing all around me. So in that situation, maybe I would want to allow other people to see my work in progress, so that we don't duplicate effort or something like that, but I just have like a focus mode where incoming changes don't disrupt my writing environment, and these kinds of fork-join, one-way-window, micro-git-style branching paradigms are really only enabled by a technology like CRDTs, where you have the flexibility to separate and then come back together. 00:13:12 - Speaker 2: And I'm incredibly excited by the design research that needs to go into that. Now at this point, I think we're still on the technology level, you know, one way to think of it is Google Docs came along, I don't know, 15, it's almost 20 years ago now, I can't even remember, let's say 15 years ago, and this novel idea that we could both have a shared document, or several people could have a shared document, all see the up-to-date version and type into it and get, you know, a reasonable response or have that be coherent, was an amazing breakthrough at the time and has since been kind of widely copied: Notion, Figma, many others.
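The quick/fox merge described earlier can be sketched as a union of per-character mark sets, which is commutative and therefore order-independent. Peritext's real algorithm additionally handles concurrent text edits, stable character IDs, and per-mark expansion rules; this toy assumes a fixed string:

```python
text = "the quick brown fox"

def apply_bold(marks, start, end):
    """One replica's local edit: add a 'bold' mark to characters [start, end)."""
    out = {i: set(s) for i, s in marks.items()}
    for i in range(start, end):
        out.setdefault(i, set()).add("bold")
    return out

def merge(a, b):
    """Merge two replicas by unioning per-character mark sets (commutative)."""
    return {i: a.get(i, set()) | b.get(i, set()) for i in set(a) | set(b)}

alice = apply_bold({}, 4, 9)    # Alice bolds "quick"
bob = apply_bold({}, 16, 19)    # Bob, concurrently, bolds "fox"
merged = merge(alice, bob)
# merge(alice, bob) == merge(bob, alice): both replicas converge to the same state
```

Because the merge is a set union, replicas can apply each other's edits in any order and still converge, which is the essential CRDT property the interview is describing.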
But now maybe we can go beyond that, much more granularity, like you said, maybe borrowing from developer version control workflows a little bit in a lightweight way, giving a lot more control and flexibility, and giving us a lot more choices about how we want to work most effectively. But before we can even get onto those design decisions, how do we present all these different things to the user, what are the different options, we need this, like, fundamental underlying merge technology, hence the endless fascination that we at the lab, and increasingly the technology industry generally, have with CRDTs, because it has the potential to enable all that. 00:14:23 - Speaker 1: Yeah, when we were working on the Peritext project, Peter was pushing really hard for: don't make this just a technology project. It's a socio-technical endeavor and we need to invest a lot of time in the design component, also just doing user interviews, identifying how people interact with and collaborate on text in the status quo, and Geoffrey and I actually did do a bunch of user interviews with people from all kinds of backgrounds. We've talked to people who write plays, people who produce a dramatic podcast kind of in the style of Night Vale. I love Night Vale. Yeah, people who are in the writer's room kind of working together with their collaborators on that, people who write lessons, video lessons, for educational platforms. And there was a ton of really interesting insights into user behavior around collaborative text. We ended up just torn, because we had this 12 week project and we were like, how should we best spend our time? Clearly, this is not just a technical area and we need to invest a lot in getting the design right, understanding what the design space even looks like, since it hasn't really been explored. I really want to avoid, and this is a recurring theme in my work, I really want to avoid publishing or shipping something and having it be this, like, very broad, very shallow exploration into all the things that are possible. I think that this kind of work plays an important role, and there are a lot of people who do this well, just fermenting the space of possibilities and getting these ideas in a lot of people's heads, who can then go on and do really cool things with them. My personal style, I never want to feel like something is half baked, I guess; I would much rather ship this cohesive contribution, like: here is an algorithm for building rich text. We think that this is a technical prerequisite to all of these interesting design choices, but we had a 12 week period, and in fact, you know, the correctness and revision phase extended way over that. So thanks a lot to Martin and Geoffrey for leading during that part. But it's just already so hard to get it correct that, trying to tack on a really substantive design exploration that does the area justice on top of that, I was just really worried it would be stretched too thin. So absolutely lots of room for future work in this particular project. It's very much a challenge in any area where you have simultaneously this rich design space that's just asking to be explored with tons of prototypes and things like that, and then also, to even realize the simplest of those prototypes, you require fundamentally new technology. 00:16:53 - Speaker 2: Yeah, I've been down that same path on many research projects as well, and often it's that I'm excited for what the technology will enable, but also that in many cases it's a combination, you know, some kind of peer-to-peer networking thing, but with that will enable us to provide a certain benefit to the user, and I want to explore both of those things, but then that's too much, and then the whole thing is half baked, exactly as you said. I've never found a perfect or even a good way to really manage that tradeoff. You just kind of pick your battles and hope for the best.
Yeah, definitely. Well, I do want to hear about the equation editor project, but first I feel I should introduce our topic here, which I think folks could probably have gleaned is going to be rich text and rich text editing, and maybe we could just step back a moment and define that a little bit. I think we know that text, you know, the symbolic representation of language, is a pretty key thing, writing and the printing press and all that sort of thing. We wrote about that a little bit in our text blocks memo, which I'll link in the show notes. But typically, I think for computers, for a lot of their early time and even now with something like computer code, it's typically plain text; the .txt file is kind of almost the native style of text that you have, and then rich text typically layers something on top of that. I don't know, so maybe you could better define rich text for us to have a more concrete discussion about it. 00:18:21 - Speaker 1: Yeah, I think rich text for most people basically evokes things like bold, italic, underline, the ability to augment plain text with annotations that are useful in formatting. Actually, I think Notepad to WordPad is the archetypal jump in software, if you're thinking about it from the old Windows perspective. In the past few years, I think we've started to see a real expansion of what rich text can look like. So, of course, we started out with something like Markdown, which is, of course, a plain text representation, but it's designed to be able to capture more nuance in plain text and be rendered to something like HTML, which very much supports rich text. So in Markdown, you have not only these kinds of inline formatting elements, like bold and italic and hyperlinks as well; you also have support for images, which you could think of as more block-level rich text elements, I guess, and I don't think there's a real clear consensus across editors on how block-level rich text elements should be displayed.
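The inline half of that Markdown-to-HTML rendering can be sketched with a few regular expressions. Real Markdown (see the CommonMark spec) is much subtler about nesting and edge cases, so treat this as a toy subset only:

```python
import re

def render_inline(md):
    """Render a toy subset of Markdown inline formatting to HTML:
    **bold**, *italic*, and [text](url) links."""
    md = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", md)  # bold first
    md = re.sub(r"\*(.+?)\*", r"<em>\1</em>", md)              # then italic
    md = re.sub(r"\[(.+?)\]\((.+?)\)", r'<a href="\2">\1</a>', md)
    return md

render_inline("the **quick** brown *fox*")
# -> 'the <strong>quick</strong> brown <em>fox</em>'
```

Block-level elements (lists, images, the editors' custom blocks) are where, as the interview notes, conventions diverge far more than for these inline annotations.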
Of course, in between you have things like bulleted lists, and those tend to be handled in a fairly standard manner with nested lists and so on, but it quickly becomes a question of taste which kinds of annotations you support. So in editors like Coda or Notion, you have all these different block types, where the block is really the atom of collaboration and editing, and then you can have things like, you know, file embeds or even database views, things like that. So I think we're at a point now where block-based editors are appearing (I'm using "block-based editors" in the text or writing sense, not the structured-editors-for-programming sense, although I have other things to say about that), and I think that there are a lot of really interesting patterns this permits that the paragraph as a linear sequence of characters, including newlines and whitespace, does not permit, or at least doesn't allow you to build as structured tooling around. 00:20:30 - Speaker 2: I'm trying to think what is actually the core of the difference between a block-based editor (that's a Notion, a Roam; Muse is working on its own block text implementation) and a flow of characters (that's Microsoft Word, Google Docs, maybe even text editors). I guess it's sort of like paragraphs are separated by these sort of nested elements, or have a parent in the document, versus two newlines embedded in the stream of characters, but I don't know, that seems too unsophisticated; maybe you have a better definition for us. 00:21:03 - Speaker 1: So, I actually think about this very similarly to how, in the programming languages and editor tools space, there is a distinction between structured editors and regular plain-text editors for programs.
The idea is that you might have a text-based programming language, and you can write that perfectly fine in any buffer that allows you to put in sequential characters; often ASCII is sufficient for some languages. And then, on the other hand, these programs might have a lot of inherent structure. A simple example is with Lisps, which are built out of these parenthesized S-expressions: everything is, you know, an S-expression. You can think about the structure of the tree, or I guess the forest, formed by having these S-expressions with sub-elements and so on, and then you can do manipulations directly on the structure in a way that allows you to always have a syntactically correct program, or at least a partially syntactically correct program, by doing things like: I'm just going to take this subtree, which is a sub-expression, and move it somewhere else where there's room for another sub-expression. So, I think of block-based editors as capturing a very similar zeitgeist to structured editors for code, because instead of just having this linear buffer of characters that can have, you know, formatting and newlines and things like that, you actually have more of a forest structure where you have lots of individual blocks, and then you can have blocks that are children of other blocks, and so on. And that allows you to do things like move an entire subtree representing an outline to another position in the document without selecting all of the characters, cutting them, and then pasting them somewhere else. So things like reparenting become a lot easier, things like setting the background of an entire subtree become a lot easier. Just in general, you have more structure, and there are more things you can do with that structure, I guess is how I would phrase it. One of my favorite things that you can do with this model in Notion is you can change the type of a block very easily.
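To make that structured-editing idea concrete, here's a minimal sketch (mine, not from the episode) of moving a sub-expression in an S-expression represented as nested Python lists; the helper names are made up for illustration:

```python
# Sketch: structured editing on S-expressions represented as nested
# Python lists. Moving a subtree keeps the program tree well-formed,
# unlike cut-and-paste on raw characters.

def get_node(tree, path):
    """Follow a path of child indices down to a subtree."""
    for i in path:
        tree = tree[i]
    return tree

def move_subtree(tree, src_path, dst_path, dst_index):
    """Detach the subtree at src_path and insert it as a child of the
    node at dst_path, at position dst_index."""
    parent = get_node(tree, src_path[:-1])
    subtree = parent.pop(src_path[-1])
    get_node(tree, dst_path).insert(dst_index, subtree)
    return tree

# (let ((x 1)) (+ x 2)) as a nested list:
program = ["let", [["x", 1]], ["+", "x", 2]]
# Move the binding list to the end; the result is still a tree,
# never a half-cut fragment of characters.
move_subtree(program, [1], [], 3)
print(program)  # ['let', ['+', 'x', 2], [['x', 1]]]
```

The same move expressed on the raw character string would require carefully balancing parentheses by hand; on the tree, it cannot produce an unbalanced program.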
So let's say I have a bullet list item, and then I hit enter and enter these sub-notes or something like that as children of the initial bullet list item. I can turn the bullet list item into a page, and then all of a sudden it's just a subpage in the document, and the sub-bullets that were there before are just top-level bullets in that page. And this is particularly important for my workflow, because I care a lot about starting out with something really rough and sketchy and then progressively improving it, moving up and down the ladder of fidelity into something more polished. So you might, for instance, start off with just an outline list, or even a one-dimensional list of to-do blocks, when you're trying to do project planning or something. And then later on, let's say I want to put these into a tasks database with support for a kanban view or something like that. I don't actually want to sit there and recreate all of these tasks in Jira. I've been there, you know, I've been the person making all the tasks in Jira after the meeting and then assigning them to people. The workflow that I think Notion is poised to enable, and could certainly do a better job in this regard but already offers some benefits on, is: can I just highlight all of these blocks, because everything is a block, move them into some existing database, and have them match the schema? That kind of thing, allowing people to do fast and loose prototyping with very unstructured primitives and then promote them into something more structured, like in a relational-database setting or similar, I think is the sweet spot. Structured editing provides the sweet spot between completely unstructured text and these very high-fidelity, high-effort interfaces, by allowing you to move between them.
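A toy version of that block model (my own sketch, not Notion's actual schema) shows why "turn into" is cheap: the type is just a field on a node, and the children come along for free:

```python
# Toy block-tree model: every block has a type and children, so
# changing a block's type leaves its subtree of children intact.
# (Illustrative only; this is not Notion's real data model.)
from dataclasses import dataclass, field

@dataclass
class Block:
    type: str            # e.g. "bullet", "page", "to_do"
    text: str = ""
    children: list = field(default_factory=list)

def turn_into(block, new_type):
    """Change a block's type in place; children are untouched."""
    block.type = new_type
    return block

plan = Block("bullet", "Project plan", children=[
    Block("bullet", "Ship editor"),
    Block("bullet", "Write docs"),
])

turn_into(plan, "page")   # the bullet item becomes a subpage...
print(plan.type)          # page
# ...and its sub-bullets are now that page's top-level bullets:
print([c.text for c in plan.children])
```

Promoting a list of such blocks into a database is then a matter of mapping block fields onto the database's schema, rather than re-entering the content by hand.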
00:24:47 - Speaker 3: Yeah, I really like that direction and framing, and if I can extend it a little bit, I think we can also look at a continuum of richness in terms of the content itself. So you have plain text, then what you might classically call rich text with links and bold and underline. And then you maybe start to throw a few images in, and then what if you can put in videos, and what if you have a whole table, and that table is actually a database query, and you can embed a Figma document, and in this way you can see that there's sort of a continuum on the richness of the document. One reason I think Notion has been so successful is they've been pushing along that continuum while maintaining a sort of foundation of rich-textness, which is very familiar and the important basic use case for a lot of people. A related idea is that I think we're seeing a lot of the classic document types converge. So if you look at rich text like a Microsoft Word, and a PowerPoint, and increasingly spreadsheets, those all used to be 3 distinct Microsoft Office applications, and we're seeing the value of them being in, or being, the same document. This is actually one of the motivating ideas behind Muse and a lot of the research we've done in the lab, and kind of like something Slim was saying: you want to take your idea continuously through different media and different modalities and different degrees of fidelity, and you don't want to jump between different applications to do that. You want to be able to do it on the same canvas. That's, by the way, one of the reasons I like canvases. It's not only because it's a free multimedia surface, but also it evokes this idea of flexibility and potentiality, and I think that's one of the things that's really exciting about these mixed-media documents.
00:26:16 - Speaker 2: And I know if Jeffrey were here, he might jump in and say that one downside to our current application-silo world is that the only way to have this deeply rich text, where it's images, video, a table, a database query, something like that, is to have the uber-application, to have the everything app, and certainly Notion has probably gotten pretty far on that, but others in some ways are forced to do that too; we have to do some of that in Muse as well. People come in and ask for all these different types here as well. And there's more of an OpenDoc-inspired or Unix-inspired future that maybe Jeffrey and others, including me, would hope for, which would be more that applications could be these individual data types, and you could put them all together through some kind of operating-system connection. But that is so completely reversed from how all our computing devices work today, it's hard to see how we might get to that. 00:27:14 - Speaker 3: Yeah, I'm certainly sympathetic to that concern, although I suspect the way out is through, and you get platforms from working killer apps. And so the way we got the whole Unix ecosystem was they wanted to build a computer for, you know, writing and running programs, and then eventually got all this generalized text-processing stuff, but it's not like they started in, like, oh, I'm gonna make a generalized text-processing machine. I don't think that was really the way that they approached it and found success. So, I'm still hopeful we could do this, but I think you've got to extract it from something that's already working as an app. It always helps to have an eye towards that, though, and I think we've done some of that with Muse. 00:27:46 - Speaker 1: I was just going to say that it's not me talking about text unless I bring up my favorite piece of software of all time, which is Pandoc. And I think that Pandoc actually is very relevant to this discussion.
So for those who aren't as familiar with it, Pandoc brands itself as this Swiss Army knife for document formats, and its sort of headline contribution is that it allows you to convert between all kinds of documents. For instance, I can take a Word document and convert it to a PDF, a Word document to something like, I don't know, a Jupyter notebook, back and forth across this incredible bipartite graph of formats. But I think that the subtler contribution that Pandoc makes, which is extremely significant, is that Pandoc has this form of Markdown called Pandoc Markdown that essentially aligns and supersedes all of the different fragments of Markdown that we've seen before. So the problem with Markdown basically is that the original specification is sort of ill-defined. There are several cases in which the behavior is not super clear, and then on top of that, it's not very expressive. There aren't very many constructs. So things like fenced code blocks, which many people associate very closely with Markdown today, that was only added by GitHub Flavored Markdown, which is certainly widely used among the programming community, but not everyone is on GitHub, of course. And then you have things like table formatting, or even strikethrough: strikethrough wasn't defined in the original Markdown specification either. And so you have Markdown, and then you have GitHub Flavored Markdown, CommonMark, which is sort of this unifying effort, R Markdown, all these different flavors. It's the Markdown Cinematic Universe. I tried to make a joke about this. I had this joke ready for the Markdown Cinematic Universe when the last Marvel movie came out, but then it didn't get nearly the traction in my timeline as the Dune ones did, perhaps understandably. So really, I'm just going to have to wait till the next movie comes out. It's a real, real tragedy.
No, but like, I guess you have this real pluralism of forms, and it becomes very difficult to use Markdown truly as a portable format, because the way it renders in one editor, or even parses, can very much differ from editor to editor. So, Pandoc provides this format that essentially serves as an IR, or intermediate representation, between all these kinds of documents, using a Markdown superset that somehow magically encapsulates everything. 00:30:18 - Speaker 2: And that includes not just Markdown, but also like PDFs or Microsoft Word, it seems. 00:30:24 - Speaker 1: Well, so the way it works is it's this compilation pipeline, I guess, that allows you to go from a Markdown document. It compiles it to PDF using pdflatex or something. It outputs LaTeX, it outputs HTML, various things, and you can think of it as being this intermediate representation because you start with this Word document, you can turn that into Markdown, and you can go from that Markdown format into any of these output formats, which turns out to be really powerful, because the main issue with these kinds of conversions is that they're often lossy. There are features that are supported by LaTeX, for instance, that aren't supported by the web natively; there are features that are part of Word documents that aren't necessarily supported by HTML, and so on and so forth. So Pandoc serves this role of basically saying, OK, what is an intermediate language that can encapsulate all the different implementations of the same concept across different input and output formats? And what I think is so remarkable about it is that oftentimes when you are using a piece of software and you're like, oh darn, you know, now I need to support this other thing too, you quickly end up in a situation where things snowball and start to feel tacked on. So you're like, oh man, it's very clear that they just glommed on this additional syntax for this feature.
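As a rough illustration of that pipeline (my own example invocations, assuming a recent Pandoc install; these are not commands from the episode):

```shell
# Word document -> Pandoc Markdown (into the "IR")
pandoc report.docx -o report.md

# Markdown -> standalone HTML
pandoc report.md -s -o report.html

# Markdown -> PDF; Pandoc drives a LaTeX engine under the hood
pandoc report.md -o report.pdf --pdf-engine=pdflatex

# Markdown -> Jupyter notebook
pandoc report.md -o report.ipynb
```

Each conversion goes through Pandoc's internal document model, which is what makes the Markdown dialect act as the intermediate representation between formats.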
And with Pandoc, everything feels very principled in its inclusion. And at the same time, whenever I'm using Pandoc and I'm like, darn, I really wish there was a construct that I could use to express this particular thing, I look it up in the documentation and it's always supported. So, as one of my favorite examples, one of the output formats that Pandoc supports is various slideshow frameworks. So Beamer for people who use LaTeX, and reveal.js for people who use HTML and CSS, and these slideshow frameworks basically allow you to replace something like PowerPoint, Keynote, or Google Slides with essentially a text-based format. I really like doing slideshows in Pandoc Markdown. There are a few reasons for that. The first reason is that it's really useful to be able to reuse some of the same content from, like, my blog post or essay, even in the slideshow. Then there are some really minor and almost petty, but really significant reasons. Like, I like to have equations or code blocks with syntax highlighting in my slideshows, and there's not really a good solution to putting a syntax-highlighted code block in Keynote right now. 00:32:39 - Speaker 2: Last I remember, the gold standard at the Ruby conferences I used to frequent was to take a screenshot of TextMate and paste that in. 00:32:47 - Speaker 1: Yeah, it's awful. I don't want to see your Monokai editor with the weird background that contrasts weirdly with the slide background. I just, ah, and it doesn't scale on a huge conference display anyway. I digress. But the other reason why I really like doing my slideshows in text is actually that there is often a hierarchical structure to my presentations, right? I'll have these main top-level sections, and then I'll have subsections, and then I'll have sub-subsections, and all of these manifest in slides.
But in the GUI thumbnail view of most of these existing slideshow editors like PowerPoint or Google Slides, it reduces it all to this linear list. It's like, here are all of your thumbnails in order. And it makes it very hard, as soon as I have an hour-long conference talk: how do I jump to this subsection that I know exists, aside from scrolling past 117 thumbnails and trying to find the right one, right? And moreover, let's say I want to reorder a certain part of the talk because I think it better fits the narrative structure. Now I have to figure out which thumbnails I need to drag to which other place, or worse, go into the individual slide, select the text from that, move that somewhere else, and it's just way, way clunkier than reordering some text in a bullet-list outline in my editor. And then the other part is that I was talking about how Pandoc has really great, expressive support for the idioms of different formats, and one thing you often have in slideshows is that I have some element on the screen, and then I press, you know, the next button again, and then another element will appear. So in Pandoc you can denote this with just an ellipsis, basically, so like dot dot dot, and then if I have a slide where I have a paragraph and then the dot dot dot and then another paragraph, it will render with just the first paragraph visible, and then I press next, and then the subsequent paragraph comes in. And that's just a very lightweight way to handle these stepped animations, compared to going to the animation pane and then clicking the element that I want to animate in, and so on and so forth. So it started off with me being like, I'll just prototype in this format, but then it ended up supporting columns, it supports all these things that you actually want. And I was like, this is in many ways a more ergonomic way to handle long technical slideshows.
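For a concrete picture, a slide written in Pandoc Markdown might look like this (my own sketch based on Pandoc's documented slide syntax, compiled with something like `pandoc slides.md -t revealjs -s -o slides.html`; the `. . .` line is Pandoc's pause marker, and the fenced divs are its columns syntax):

```markdown
## Why text-based slides?

This paragraph is visible immediately.

. . .

This paragraph appears on the next "click".

:::: {.columns}
::: {.column width="50%"}
Left column content.
:::
::: {.column width="50%"}
Right column content.
:::
::::
```

The same source compiles to Beamer PDF or reveal.js HTML just by changing the output target.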
Anyway, I have to shill for Pandoc anytime I talk about rich text; I'm contractually obligated to do so. 00:35:08 - Speaker 2: Yeah, it's a great piece of software, I use it here and there. I think I was doing some AsciiDoc kind of manuals many years ago, and yeah, just in general, it's also worth looking at the homepage. You mentioned the plot they have where it shows all the different formats it can convert between; it's quite fun. You click on that, you can zoom in. 00:35:26 - Speaker 1: Yeah, I had this really elaborate plan when I decided to go to Berkeley, that I was going to print out a door-sized poster of that graph that shows all the formats it converts between, and then show up at John MacFarlane's door and ask him to sign it. But then the pandemic interfered with some of those plans. Nonetheless, it remains on my list. 00:35:48 - Speaker 2: Good bucket list item, pretty unique one at that. 00:35:51 - Speaker 1: Also, I found my tweet, or I found the draft of my tweet, which is about Eternals, and I said: directed by Chloé Zhao, the latest entry in the Markdown Cinematic Universe features an ensemble cast of MultiMarkdown, GitHub Flavored Markdown, PHP Markdown Extra, R Markdown, and CommonMark as they join forces in battle against mankind's ancient enemy, DOCX. Nice. 00:36:12 - Speaker 2: Wow. You would have gotten the like from me. 00:36:16 - Speaker 1: Yeah, we'll see if it ever sees the light of Twitter.com. 00:36:20 - Speaker 2: You briefly mentioned there equations and LaTeX, and maybe that's a good chance to talk about the equation project you did for Notion. And part of what I thought was so interesting, or what I think in general is interesting, about equations is that they are obviously an extremely important symbolic format, but in many ways extremely different from the prose we've been talking about.
So English or other languages, even languages that are right-to-left or something like that, they all have the same kind of basic flow and the same way that we represent sound with these little squiggly symbols. Even though the symbols themselves and sounds vary, and how we put them together into words varies across languages, that's a common thing. If you go to the mathematical realm, you have symbolic representation, but equations are their own whole beast, and I think one that has gotten a lot less attention from the software and editing world. So tell us about that rabbit hole. 00:37:16 - Speaker 1: Yeah, so just as context for people, Notion and many other applications have actually long supported block equations: an equation that basically takes up, you know, most of the page horizontally. What is much more uncommon in editors is support for inline equations, and so this can be something as simple as saying: you want to type "let X be a variable," and X should be formatted or stylized mathematically. Being able to refer to elements of a block-level equation in inline text is a prerequisite for being able to do any kind of serious mathematical writing, yet because this is kind of a niche area that has historically been the purview of Overleaf and other LaTeX editors, it's really not implemented in most editors. So I pushed really hard to add inline equations and inline math to Notion, because I was like, there's a huge opportunity for people to write scientific or mathematical documents that take advantage of all of Notion's other features, like being able to embed Figma or embed illustrations and things like that, right? So, it turns out that it's kind of difficult, exactly as you're describing, to do this equation format. There's been very little innovation and research more generally into what is a good interface for inputting equations.
So I think most people are probably familiar with how Microsoft Word or Excel have these equation editors, or even sometimes at the operating system level, where you basically open this palette, and there is a preview, and there is a button for every possible mathematical symbol or operator you can imagine. And then for composite symbols like the fraction bar or integral or something like that, you find the button for that, you click it, and then you click into the little sub-boxes, and then you find whatever symbol you want and you put those there too. So it's kind of a structured editor, but in an unimaginably cumbersome interface. This is what I used to do my lab reports in, in high school, for example. And then at the other end of the spectrum, you have things like LaTeX. LaTeX is basically how everyone, at least in computer science and mathematics, chooses to typeset complex mathematics. One of the real selling points of LaTeX, I think, is that it turns out that operator spacing is really important, and there's a big difference between, say, a dash character that's used as a hyphen in text, and a hyphen or dash character that's used as a minus sign in an equation; the spacing is subtly different. And one of the big things that LaTeX does is it basically allows you to declare certain operations in certain contexts as a math operator, versus just a symbol, versus just a tagged group of characters, and it correctly handles the spacing depending on what kinds of characters are around the operator in question. And so LaTeX basically produces really nice-looking mathematics, at the cost of this markup which looks like I kind of smashed my keyboard, a keyboard that only had 3 characters. It's the exact opposite of the equation editors: instead of having a button for every imaginable character, you only have 3 buttons.
The buttons are backslash, open curly brace, and close curly brace, and somehow permuting those characters is supposed to get you any possible mathematical output. Those are just two ends of the spectrum. 00:40:41 - Speaker 3: Yeah, I used to do my analysis homework in college in LaTeX, and I remember when I first looked up how you would input these formulas in LaTeX, thinking, that can't be right. This is not the best way in the world to do this. In fact, that's it, that's the one and only way. 00:40:53 - Speaker 1: It really is, it's terrifying. It's the one and only way, and the wild part is there are people who are super, super good at LaTeX. They can live-TeX their lecture notes. I was never nearly that fast, but some people can do it, usually with extensive use of macros. Macros are another selling point of LaTeX, as you can define this kind of custom shorthand for operators you use a lot. But anyway, yeah, so you have LaTeX sort of at the other end of the spectrum: really quite unreadable oftentimes. It's like a write-only format, many times. 00:41:23 - Speaker 2: Ah, of course, regular expressions come to mind on that as well, yeah. 00:41:26 - Speaker 1: It's exactly the same zeitgeist, I think. It turns out that figuring out how to have a combination GUI and plain-text interface, that allows you to be in a rich text editor like Notion, then go into an inline equation field to have an inline symbol, and then go back into the GUI editor, was just very unexplored territory. And it kind of makes sense that lots of people don't prioritize this, because many people at Notion rightfully had the question, like, oh, is this something we should be working on? But first of all, it turned out that if you actually tallied up our user requests, inline math was near the top of editor feature requests.
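A small LaTeX illustration of the operator-spacing point (my example, not from the episode; `\lse` is a made-up operator name, and `\DeclareMathOperator` needs the amsmath package):

```latex
% The same dash gets different spacing depending on context:
$a - b$   % binary minus: spaced as an operator
$x = -b$  % unary minus: no space after it
x-ray     % text-mode hyphen: no math spacing at all

% Declaring your own operator so it is typeset upright and
% spaced like \log or \sin (requires \usepackage{amsmath}):
\DeclareMathOperator{\lse}{logsumexp}
% ...then $\lse(x)$ in the body gets operator spacing for free.
```

This classification of symbols into operators, relations, and ordinary characters is what those three buttons (backslash and the curly braces) are controlling.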
And then more generally, it turns out that because this is a prerequisite for many researchers and for students, you can get a lot of people on your platform who rely on it, you know, as a student taking notes or something like that, because there's literally no alternative. And then they are able to stick around and use the platform for all kinds of other things. So this is just kind of a plug that more editors should implement this. But yeah, I thought that this project was really interesting because in the interaction paradigm, you want to capture a lot of the things that are very fluid about editing regular text. So for instance, we knew it was important that you should be able to use the arrow keys to move left and right, kind of straight through a token without editing it if you wanted, or if you wanted to be able to go into a token and edit it using the arrow keys, you shouldn't have to use the mouse to click, although, of course, you should also be able to use the mouse to click. And when you have this formatted equation, we made the decision that the rendered equation would be represented as this atomic token. So if you were highlighting text to copy and paste and move around, it would be like highlighting a single character that would just be the whole equation. But of course, you could go in and edit the equation any way you wanted in kind of this pop-up text-editing interface. I think another thing that's a subtle interface challenge here is that, like Mark was saying, there is often a disproportionately large number of characters used to represent the equivalent of one character in the formatted output. And so that's something you don't really take into account. The output is like x with a hat in a sans-serif font, and then there are like 25 characters of markup that go into that, and you just need to scale the interface appropriately to take that into account.
But I think that it's really interesting because it shows the power of combining different input and output formats in the same atom, right? So you have a single line of text, and you want to have rich text that's formatted and stylized and so on, hyperlinks, and then also equations, or whatever inline rendered output of another input format that you have. I think that that's really where GUI editors and WYSIWYG editors can shine: being able to combine these input formats and output formats in the same line, in situ. Yeah, I guess you can't really do that at all with the terminal or something like that, and I say this as someone who uses the CLI and Vim for everything. 00:44:34 - Speaker 3: This is bringing back so many memories. I wish I had Notion with equation support back when I was a math undergrad. It's so nice. 00:44:41 - Speaker 1: I'm like the Notion math stan, guardian, I don't know, something like that. And I'm always keeping track of all the cool things people are doing using equations in Notion. A lot of people are doing math blogs in Notion, which is really awesome for me to see. Also, I just feel like, having tried lots of other things, there really isn't a good alternative short of actually writing LaTeX for your blog, which no one really likes. And yeah, I mean, certainly it's the kind of thing that I implemented originally, kind of, I was like, I'm gonna do this for myself, and then realized that lots of people would be able to benefit from it. It's been really cool to see the reception it gets. Like, the inline math tweets on the Notion Twitter account overwhelmingly get the most engagement and interaction. And initially the marketing team was shocked. They thought this would be this super-niche feature, but no, it turns out that people love math, and, like, they may not be the most vocal proponents, or they're used to no one caring about math typesetting, things like that.
For a while, I think it was the case that when I did find an editor that had support for equations of some kind, to me it was overwhelmingly obvious that the people who implemented it did not regularly use equations for writing. I think you can often tell that with different features. So I think that having that kind of... representation is not quite the right word, but being able to see a feature that was designed by someone who really cares about using it themselves, is really cool for students, researchers, people who are interested in typesetting more mathematical text. 00:46:11 - Speaker 3: Yeah, and I think it's really important, like you were saying, that it's mixed media, because you're combining the equations, the inline equation and the block equation, by the way, in the world-class form, which is LaTeX-based, with a world-class rich text editor with text and images and stuff. It's really nice. I do think there's still one frontier here, especially for math, which is the fully gradual process from: you're taking handwritten notes and you're working out a problem and you're drawing squiggly diagrams, all the way up through your finished homework. I remember when I was a math undergrad, I would basically have to do the homework twice. You do it once on paper, and nobody could read that, including myself, so then, you know, you do it in LaTeX again. And I always wished there was a way to do it incrementally, where you sort of change equation by equation and diagram by diagram into the final product. And I know there has been some research on turning equations into LaTeX formulas with machine learning. I don't know if it can do handwriting, but perhaps someday we'll get Muse support for equations, and you can go all the way to the end.
00:47:02 - Speaker 1: Yeah, like you, I share exactly the same frustration that you have to essentially do lots of things twice, and the relative position of everything is ambiguous. LaTeX is what allows you to do things like have subscripts of subscripts, which would be really inscrutable in most people's handwriting, including my own, and, you know, subscripts of subscripts along with superscripts and things like that. There are just so many ambiguous details, and it turns out, in my experience with anything that tries to automate the transition, that I always end up going through and really rewriting all of the details to be structured in a readable way. You have this other problem, which, back in the days of WYSIWYG web editors like Dreamweaver and Microsoft FrontPage and things like that, you would often end up with: you try to do any edit on the WYSIWYG side, and then you look at the generated HTML, and it's ridiculous. There are just 16 nested empty span tags, and no one would ever be able to maintain that. And my worry is basically that when you automatically create markup for something that has a very complex graphical representation, it's really one-way. You know, maybe it will help you produce a compiled output, but it doesn't actually help you go back in and edit and tweak the representation later, or it's just so inscrutable if you do that. It's kind of also a regex-type situation. I think we really need to get to some kind of good intermediate representation that allows you to flexibly go both ways.
And that goes back to something that I think Adam and I were chatting about earlier, which is that a lot of people gripe and complain that LaTeX is the best we have, and, you know, I'm one of them. But it really is the case that LaTeX was just this monumental effort by really a few people, an amount of effort that would be considered really impressive if I were to try to do the same thing but better today. And not a lot of people just have spare time to do this all-in-one text formatting, packaging, document representation project, even though it would have huge impact on the way people write and publish these kinds of documents. And so in many ways we're sort of just bottlenecked on the fact that it's hard to do incremental improvements to this particular area. We really depend on these software monoliths to keep us afloat. 00:49:19 - Speaker 2: I'm not nearly as mathy as either of you, but I can't help but make the comparison of this equation editing to what you mentioned earlier with structured editors and programming, where there's either lightweight help from your text editor, things like code folding, syntax highlighting, and autocomplete, or full structured editing, some of the visual programming stuff we talked about with Maggie Appleton, like Scratch, for example, or these flow-based systems that are fully graphical, and you sort of can't have it in a bad state. And I can't help but think there might be some direction like that, that is not necessarily the write-only, inscrutable TeX, but is also not the Microsoft Word one-button-for-literally-every-symbol-you-might-ever-want. It does seem like there might be some other path, and yeah, I agree it's a monumental effort, but mathematics is so important and foundational to so much of human endeavor that it certainly seems like one worth investing in, although perhaps hard to reap a profit from, and that makes it harder to put concentrated capital behind it.
00:50:20 - Speaker 1: Yeah, I think there’s definitely very clear demand for something exactly like what you’re describing, somewhere in between the two extremes, and it is really relevant because the ACM, the Association for Computing Machinery, the academic and professional body for computer science, is currently undergoing this fiasco. Maybe I shouldn’t go on the record as calling it a fiasco. The ACM is currently undergoing this initiative called TAPS, The ACM Publishing System, where they are attempting to revise the template by which all computer science research is published and disseminated. The idea behind this is that right now, computer science research is published as these PDFs. Initially they were all two-column PDFs; now I think there are some one-column PDFs. They want to output HTML as the archival format for various reasons, including that it offers a much better reading experience on different screen widths, like phones or tablets, which are increasingly how people are reading papers, not just printed out. And HTML is much more accessible than PDFs. PDFs are really quite inaccessible, especially to screen readers and other assistive technologies that are trying to parse out all the different math or whatever arbitrary formatting you’ve decided to use. The upshot of this, I guess, is that there is currently a group of very smart people trying to figure out how in the world we’re going to get people to start writing all of their papers and outputting them in a different format, in a world where everyone is already used to preparing their publications and preprints in LaTeX. And it turns out that even if you solve the problem of what the input syntax should be, rendering math in the browser is an extremely unsolved problem. 00:52:05 - Speaker 3: Yeah, isn’t the state of the art that it generates a PNG and sticks it in the web page?
00:52:09 - Speaker 1: Not exactly, but almost. OK. So MathML, which is an XML dialect, the Mathematical Markup Language, was this effort to build an HTML/XML-style syntax for typesetting mathematics. Naturally, it is only implemented in Firefox, so that’s really unfortunate. So in terms of the state of the art, there are basically two libraries that you can use to typeset mathematics: MathJax and KaTeX. MathJax supports basically all valid LaTeX, including, you know, different environments and equations and things like that. The problem is that MathJax is very slow. So if you ever go on Math Overflow or another related Stack Exchange and you see all of these answers with weird gaps, and then as you watch, the page starts to load all of the rendered equations, bumping everything down one level at a time, that’s MathJax in action. And oftentimes it is doing what you’re describing, where it is outputting an SVG or a PNG or something like that, and it’s just reflowing the page with every equation. So then you have KaTeX, which was a library developed at Khan Academy, where they realized that MathJax’s performance was basically just not satisfactory for their exercises and things like that. So KaTeX supports a much more limited subset of LaTeX syntax, but it does it all using CSS, basically, and it doesn’t reflow the page for every equation; it’s basically rendered instantly. KaTeX is what we use at Notion. It’s also what’s used in Facebook Messenger, which supports equations, if you’ve ever tried that, and many other websites. And basically it means that your options, if you want to render math, are: only target Firefox; use the limited subset of math that’s supported by KaTeX; or consign yourself to extremely slow, dozens-of-reflows, full-expressive-power rendering to inline PNGs.
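To make the MathML point concrete: this is roughly what the quadratic formula looks like in presentation MathML (a hand-written sketch; element names such as `mfrac`, `msqrt`, and `msup` are real MathML, but the exact markup a tool would emit will differ). Compare it to the one-line LaTeX source `x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}`:

```html
<!-- x = (-b ± √(b² − 4ac)) / 2a in presentation MathML -->
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>x</mi><mo>=</mo>
  <mfrac>
    <mrow>
      <mo>&#x2212;</mo><mi>b</mi><mo>&#xB1;</mo>
      <msqrt>
        <msup><mi>b</mi><mn>2</mn></msup>
        <mo>&#x2212;</mo><mn>4</mn><mi>a</mi><mi>c</mi>
      </msqrt>
    </mrow>
    <mrow><mn>2</mn><mi>a</mi></mrow>
  </mfrac>
</math>
```

That verbosity is part of why almost nobody writes MathML by hand, and why it shows up mostly as a compile target rather than an authoring format.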
And so that’s just not a great situation to be in, and we haven’t even gotten to the question of how people write math. So I would say that people underestimate how open this problem space is. 00:54:17 - Speaker 3: Yeah, man. 00:54:19 - Speaker 1: Just take a moment of silence to recognize the gravity of the situation. 00:54:23 - Speaker 3: This is an aside, I don’t know if you want to put this in the episode, but now I’m curious. It sounds like both of those are interpreted, in the sense that the equations are rendered at load time instead of being compiled down to some HTML and CSS that you can render without JavaScript. Like, basically, do you need JavaScript to render these pages? 00:54:39 - Speaker 1: Yeah, basically. I should say you always need JavaScript, unless you pre-compile to MathML and then hope that people are using Firefox. 00:54:47 - Speaker 3: Man, I feel like there’s no way that that stuff loads in 10 years, but we’ll see. 00:54:52 - Speaker 1: I actually had this exact argument, again, I don’t know if you want to put this in the episode. I had this exact argument with Jonathan Aldrich, who’s on the TAPS committee, when we were talking about this, and I think the point was not so much that you can guarantee that the artifact loads exactly the same way in 10 years, but that the representation is rich enough that one could feasibly build software that renders it the same way in 10 years. So it’s more about the fidelity of the underlying representation, where a team of, I guess, digital archaeologists could recover the work that we were doing, and not so much that we trust the vendors to keep everything stable, which is obviously never going to happen. You know, the only reason PDFs are stable is because of how many trillions of dollars of IP depend on being able to load a PDF the same way as it was written 30 years ago. 00:55:45 - Speaker 3: Yeah, interesting.
00:55:46 - Speaker 1: Nice. Going back to this idea earlier that Mark mentioned of the spectrum of plain text, rich text, and WYSIWYG editors: one recurring theme for me is thinking about decoupling this spectrum into what is the format, and then what are the editors and tools that we can use to interact with that format, be they structured, unstructured, etc. I want to call out Bear, a native application for macOS and iOS that does a really great job with this. Bear is basically something in between a WYSIWYG and a plain text editor, in that you’re always editing Markdown documents, and indeed, when you have something that’s bold, you can see the asterisks around it that delimit that text. But all of the standard Ctrl-B, Ctrl-U editor shortcuts work as you would expect. And more importantly, you can see the formatting applied in real time, so that when you type star-star, hello, star-star, it suddenly becomes boldface in the GUI. And so in many ways it combines the fluidity and the real-time preview of a rich text editor or previewer with the flexibility of ultimately just writing plain text characters. And I think this is a really underexplored area. I don’t just mean something like: open VS Code or Vim, type characters, and then see different formatting labels attached to the results. I mean a native application that’s really designed for end users, that doesn’t fully obscure the input syntax but does real-time rendering in place. It’s not even in a monospace font, right? It makes it feel much more like this is actually the output that you’re targeting, and not just an input step that needs to be pre-processed.
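The in-place styling described for Bear can be sketched in a few lines. This is a toy illustration, not Bear’s actual implementation: given a Markdown string, find each `**bold**` span and return the character ranges, asterisks included, that an editor could render in boldface without hiding the markup:

```python
import re

# Non-greedy match for "**...**" spans; a real Markdown parser handles
# many more edge cases, but this is enough to show the idea.
BOLD = re.compile(r"\*\*(.+?)\*\*")

def bold_ranges(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of each **bold** span, delimiters
    included, so an editor can style the span in place while the
    asterisks stay visible."""
    return [(m.start(), m.end()) for m in BOLD.finditer(text)]

# Each tuple covers the full "**...**" span in the source string.
ranges = bold_ranges("say **hello** to **the world**")
```

A real editor would run something like this on every keystroke and apply the styles to the visible text, which is what makes the `**` delimiters feel like live formatting rather than a preprocessing step.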
I think that there is a lot of room for applications that are kind of in between, in that same space, where it doesn’t entirely obscure what you are writing, but it does give you a lot of the benefits of previewing things and having a GUI application outside of the terminal, in terms of capturing the richness of the possible results. 00:57:52 - Speaker 3: Yeah, I like the Bear approach a lot. Now, are there particular domains or types of documents that you think would be amenable to this approach, or is it just for rich text specifically? 00:58:01 - Speaker 1: So I was making a list of all of the different traditionally graphical outputs that have corresponding plain text representations, and one I was thinking about, for example, is engraving sheet music. Traditionally you would use a desktop program like Finale or Sibelius; nowadays you have options like MuseScore and Flat, which are more web-based editors, but you see the staff and you click notes onto the staff corresponding to where you want the note, and you use the quarter-note or the eighth-note cursor to pick the duration and so on. And then at the other end of the spectrum you have LilyPond, which is kind of like LaTeX, I guess, for engraving sheet music, where you type a very LaTeX-esque syntax and out comes, you know, beautifully typeset sheet music. For me this