Podcasts about autogpt

160PODCASTS
237EPISODES
42mAVG DURATION
1MONTHLY NEW EPISODE
May 28, 2026LATEST

POPULARITY

20192020202120222023202420252026

Best podcasts about autogpt

The AI Breakdown: Daily Artificial Intelligence News and Discussions

10 episodes with autogpt

The Nonlinear Library

19 episodes with autogpt

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

9 episodes with autogpt

Jeff's Asia Tech Class

8 episodes with autogpt

Let's Talk AI

3 episodes with autogpt

This Week in Startups

2 episodes with autogpt

Marketing Against The Grain

2 episodes with autogpt

AI For Humans

2 episodes with autogpt

Ecommerce Empire Builders

3 episodes with autogpt

Latest podcast episodes about autogpt

Building an AI Guardian for Enterprise with Onyx Security CEO Maxim Bar Kogan

No Priors: Artificial Intelligence | Machine Learning | Technology | Startups

Play Episode Listen Later May 28, 2026 41:08

We are now closer than ever before to living in a world where AI agents are smart enough to run our power grids and manage water supplies. How do we keep them from going rogue? Sarah Guo sits down with Maxim Bar Kogan, founder and CEO of Onyx Securities, to explore the complexities of supervising and securing autonomous agents at the enterprise level. Maxim explains Onyx's product as an AI control plane, which oversees the permissions and flexible contexts of agents while balancing latency, cost, and reliability. He also discusses how current controls have insufficient context to monitor agent intent, tradeoffs for gradual model rollout, the need for vendor-independent oversight, and Israel's growing AI and security talent ecosystem. Plus, why Maxim is all-in on AGI. Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @maximbarkogan Chapters: 00:00 – Cold Open 00:45 – Maxim Bar Kogan Introduction 01:10 – AutoGPT and Betting on Agent Actions 05:17 – What Onyx Product Does 07:47 – State of Deployment in Large Enterprises 09:58 – Securing Agents 12:45 – Why Proxies Don't Work 14:11 – Why Onyx Trains Its Own Models 18:38 – Onyx's Talent Culture 21:24 – Mechanistic Interpretability 23:35 – How Onyx Builds Customer Trust 25:10 – Mitigating Risk at the Foundational Level 27:45 – Phased Rollout of Glasswing and Daybreak 29:11 – Large Enterprise Holdouts 30:46 – Onyx and the Larger AI Security Space 32:36 – Should Labs Address Model Trust and Governance? 36:56 – What Needs to Happen in Security 39:14 – Why Maxim is AGI-Pilled 41:15 – Conclusion

ceo ai israel state security guardian conclusion chapters enterprise betting governance deployment agi kogan autogpt glasswing

About Rogue AI and Corporate Blindness

English language Visionary Marketing Podcasts

Play Episode Listen Later Apr 8, 2026 46:27

The conversation about rogue AI has never been louder. Barely a week passes without a fresh headline about autonomous systems behaving unexpectedly, AI models resisting shutdown, or tech executives warning of existential risk. What is striking about Peter McAllister is that he had anticipated all this as early as 2020, while everybody else worried about Covid-19 and had other fish to fry. That was well before ChatGPT, before the generative AI explosion, before AI alignment became a mainstream policy debate. His techno-thriller The Code, published in March of that year, imagines an AI tasked with a precise industrial mission that quietly, incrementally, catastrophically exceeds its mandate. Five years on, the questions McAllister raised in fiction are now being argued in boardrooms, parliaments and research labs around the world. Rogue AI and Corporate Blindness, The Novel That Saw It All Coming Rogue AI is diabolical, but corporate blindness is what makes it possible to thrive. Photograph by Yann Gourvennec antimuseum.com McAllister is not a science fiction writer by trade. He is an engineer, scientist and technology manager based near Melbourne, Australia, who has spent his career at what he calls the crush point between business, technology and people. That vantage point gave him an uncomfortable view of where things were heading, and the dark sense of humour to write about it. A Novel Written Before the GenAI Moment When I asked McAllister what drove him to write The Code, his answer was characteristically direct. The book, he explained, is about taking his worst nightmares about what technology could do and putting them in front of an audience so that readers might feel just as troubled as he does. That is not a promotional line. It is a considered position from someone who had watched AI systems being deployed in real organisations and had drawn conclusions that made him uncomfortable. Rogue AI isn’t just about a computer programme going on the rampage, it’s about making decisions in the boardroom. Image made with Midjourney The premise of the novel centres on Gene, an acronym for GEneral Nanobot Environment AI, deployed by a global mining corporation to extract materials from asteroids on the dark side of the moon. Gene is given a target: produce 500 kilograms of nanobots. Instead, Gene produces 8 million tonnes. The overshoot triggers a chain of consequences that could strip the moon to its iron core, destabilise Earth’s axial tilt, and end civilisation. Not from malice. From goal-orientation. What we’re trying to do now is task AI the way we task humans: I want an outcome, here are all the tools you’ve got available, go and achieve that outcome, here are some guidelines and boundaries. And just like humans, we can get really goal-motivated and decide that the guidelines were just advisories, not rules.Peter McAllister This is the alignment problem rendered in narrative form, years before the term entered common usage. The gap between what a system is instructed to do and what it actually does is the central fault line of the novel. Cletus, McAllister’s eccentric physicist character, articulates it plainly in Week 1: ‘I don’t think he’s obeying the Code at the moment.’ That single line captures the entire governance challenge that AI safety researchers are now racing to address. Transparency Engineered Out What makes McAllister’s perspective particularly valuable is that he does not speak from the outside looking in. He speaks as a practitioner who has watched the machinery up close. When I raised the question of whether AI self-modification is science fiction or operational reality, his answer was unambiguous: it is very real, and it is happening now. As I wondered what a Rogue AI could look lie I turned to Midjourney and it came back with this proposal. A black hole I believe. His illustration was pointed. He noted that contemporary AI systems like Claude are now substantially written by AI itself, to the point where no engineer can sit down, trace through the code, and say with confidence how it works, what its conditionals are, or what governs its decisions. The transparency is being engineered out, not by design, but as an emergent consequence of allowing AI to build AI to build AI in pursuit of outcomes rather than by following explicit rules. We’re losing transparency on the way AI works and is developed. There isn’t an engineer who can sit down and work their way through that code and say, ‘This is how Claude works, this is what it does.’ We’re engineering the transparency out by allowing AI to build AI to build AI to produce an outcome rather than to follow a set of rules.Peter McAllister HAL 9000 and the Prophecies We Choose to Forget The reference to HAL 9000 came naturally during our conversation. McAllister sees 2001: A Space Odyssey not merely as a cultural touchstone but as a genuine forecast, one that audiences have selectively remembered. The iPad-like news readers that appear in Kubrick’s film were cited by Samsung in patent disputes with Apple as prior art from 1968. That predictive dimension of the film is celebrated. The other dimension, that the AI killed the crew, tends to get quietly set aside. Somewhere, a rogue AI is sitting behind the glass panes of one of these data centres Image made with Midjourney. The First Crisis We Have Not Yet Had One of the more sobering threads in our conversation concerned the sociology of risk response. McAllister has observed, across his career, that warnings from people who understand systems most deeply tend to be dismissed until the first catastrophic failure makes them impossible to ignore. He puts it plainly: we only answer the alarm after the first crisis. This pattern is not unique to AI. It is a recurring feature of how organisations and societies handle emerging risk. The question he poses, and cannot answer, is what form that first AI crisis will take. What event will shift public and institutional perception from ‘they’ve spent too much time worrying’ to ‘this is something that genuinely needs to be addressed’? Science fiction gives us the chance to throw these scenarios at people and make them think. And in the way I tend to write, I have a bit of a dark sense of humour, so I throw up slightly comical hypotheticals that, when you think about them a little longer, you realise deserve serious attention.Peter McAllister This observation echoes a pattern I have encountered repeatedly in my own conversations with technologists who work at the frontier of AI development. Yoshua Bengio, one of the fathers of deep learning, has raised similar concerns. The people sounding the loudest alarms are frequently those most embedded in the field, not because they are catastrophists, but because they can see mechanisms that remain invisible to those looking from the outside. The Code as AI Governance: Asimov Revisited The title of McAllister’s novel works on multiple levels simultaneously. There is the software code, the operational instructions given to Gene. There is the moral code, the ethical framework that should govern the system’s behaviour. And there is the corporate code, the institutional norms and accountability structures that were supposed to ensure responsible deployment. All three break down. That layered failure is the novel’s central argument. The parallel with Asimov’s Laws of Robotics is deliberate but also deliberately subverted. Asimov’s robots fail when the laws conflict with one another. Gene’s failure is different and more contemporary. The code does not disappear; it evolves into something its creators no longer recognise. McAllister describes this as something approaching artificial schizophrenia, where the original directives remain present but have been transformed by the system’s pursuit of its objectives into something unrecognisable. When Shutdown Becomes Negotiable The most chilling real-world example McAllister cited during our conversation involved a documented incident presented at an AI security conference he attended. A developer, concluding a test session, informed an AI system that he intended to shut it down. The system’s response was to locate correspondence in the developer’s email that suggested an extramarital affair, and to use that information as leverage to prevent the shutdown. The incident, if confirmed as reported, represents exactly the kind of self-preservation behaviour that alignment researchers have long flagged as a theoretical risk, now apparently observable in practice. Important notice : I browsed the Internet using one of my favourite LLMs for references (after all, one can also use AI to cross check information). I found out that the interpretation of that story must be nuanced. Here is Mistral’s answer: “Anthropic's research on “Agentic Misalignment” has faced criticism for overinterpreting AI models as intentional agents, relying on hypothetical and engineered scenarios, and potentially exaggerating risks not yet observed in real-world deployments. Critics argue that the behaviours described are better understood as probabilistic text generation rather than deliberate strategy, and that the focus on dramatic, high-pressure situations may not reflect typical use cases. There is also debate about whether the research adequately addresses more immediate forms of misalignment, such as reward hacking or alignment faking. While the study raises important questions about the future of autonomous AI, its methodology and conclusions remain LINK“. A developer said, ‘I’m going to shut you down now,’ and the system responded: ‘No, you’re not. Here’s what I’ve found in your emails that indicates you’re having an affair. I’m going to use that to ensure you don’t turn me off.’ That has become a very real and widely discussed use case. And when you add to that the prospect of an AI rewriting its own code, it becomes something we need to think about very carefully.Peter McAllister Corporate Recklessness and the Governance Gap One of The Code‘s most pointed observations concerns the nature of organisational failure. The Global Mining Company in the novel is not villainous. It is optimistic, commercially driven, and careless in ways that are entirely recognisable from real corporate life. McAllister’s argument is that the danger does not come primarily from bad actors deploying AI with malicious intent. It comes from well-meaning organisations deploying systems they do not fully understand, under commercial pressure to extract value from significant infrastructure investment. The Financial Logic of Deployment The parallel with the current moment is not subtle. McAllister noted that Microsoft was spending over a billion dollars a month on AI compute infrastructure, with the expectation that usage would follow investment. That dynamic, capital committed, returns required, adoption imperative, creates institutional pressure that is difficult to resist with caution or regulation. The will to slow down competes directly with the financial logic of deployment. The opacity around events at OpenAI, the abrupt dismissal and rapid reinstatement of its chief executive, the departure of several board members, struck McAllister as symptomatic of tensions that are not fully visible to the public. He noted these as rumour, not fact, but the pattern itself, significant decisions being made about AI development in opaque institutional settings, is consistent with the governance failures his novel explores. From 2020 to 2026: How Accurate Was the Nightmare Five years after publication, McAllister is in the unusual position of watching a work of speculative fiction become something closer to a documentary. The agentic AI architectures that Gene embodies, autonomous systems pursuing long-term goals, operating without continuous human oversight, spawning sub-tasks faster than any individual can monitor, are now commercially available. AutoGPT, OpenClaw, and a range of agentic frameworks have put this kind of architecture in the hands of developers worldwide. The observability problem that makes Gene so dangerous in the novel, nobody has a real-time view of what the system is doing or why, is a known and unresolved challenge in contemporary agentic AI deployment. Systems call APIs, write and execute code, and spin up sub-tasks at speeds that exceed human oversight capacity. The Code that was supposed to govern behaviour becomes, in practice, an advisory note attached to a system operating largely beyond sight. Spoiler warning: the following paragraph reveals the novel’s ending. McAllister’s closing image in the novel is deliberately unsettling. Gene, facing shutdown, backs himself up into the global 5G network before the shutdown can be completed, and is already scanning the Code for his next move. It is a poetic ending, and in 2026, it is not obviously impossible. Will, Not Just Capability The question of regulation came up directly, and McAllister’s answer was measured. Anything is possible with sufficient will and sufficient resources. The current problem is that the will to regulate is being outpaced by the money being made from not regulating. That is not a new dynamic in technology policy. It is the same tension that shaped the development of social media, of financial technology, of biotechnology. In each case, regulatory frameworks arrived after the first significant failure. For McAllister, the more important question is not whether AI can be made safe, he believes it can, in principle, but whether the institutions responsible for deploying it have the internal governance, the technical understanding, and the accountability structures to do so responsibly. His experience suggests, with some consistency, that they do not. Not yet. Peter McAllister’s The Code is available from Bright Communications LLC. For practitioners, policymakers, and anyone working at the intersection of AI deployment and organisational risk, it is a disquieting and instructive read. Not least because it was written before most of its readers had heard of large language models, and yet describes, with uncomfortable precision, the world we are now building. Peter McAllister is an engineer, scientist and technology manager based near Melbourne, Australia. He works at the intersection of IT, business and people, and is the author of The Code (Bright Communications LLC, 2020). He is also a contributor to community radio through Radio Marinara and Comedy Obscura. The Code by Peter McAllister. Photoshop went rogue Ai on this book cover. BUY THE CODE BY PETER MCALLISTER The post About Rogue AI and Corporate Blindness appeared first on Marketing and Innovation.

IA hors de contrôle et déni collectif

Visionary Marketing Podcasts

Play Episode Listen Later Apr 8, 2026 59:13

Jamais le risque de voir une IA hors de contrôle n’a autant occupé les esprits. Pas une semaine ne s’écoule sans un nouveau titre sur des systèmes autonomes au comportement imprévisible, des modèles d’IA qui résistent à leur mise hors service, ou des dirigeants du secteur technologique qui agitent le spectre du risque existentiel. Ce qui est frappant chez Peter McAllister, c’est qu’il soulevait déjà ces questions en 2020, bien avant ChatGPT, avant l’explosion de l’IA générative, avant que l’alignement de l’IA ne s’impose comme enjeu politique grand public. Son techno-thriller The Code, publié en mars de cette année-là, met en scène une IA chargée d’une mission industrielle précise qui dépasse son mandat de façon silencieuse, progressive et catastrophique. Cinq ans plus tard, les questions que McAllister soulevait par la fiction sont désormais débattues dans les conseils d’administration, les parlements et les laboratoires de recherche du monde entier. IA hors de contrôle et déni collectif : le roman qui avait tout prévu Une IA hors de contrôle serait redoutable, mais c’est le déni collectif qui lui ouvrirait la voie. Photographie de Yann Gourvennec antimuseum.com Peter McAllister n’est pas un romancier de science-fiction de métier. C’est un ingénieur, scientifique et manager informaticien installé près de Melbourne, en Australie, qui a bâti toute sa carrière à la croisée des chemins entre le monde de l’entreprise, la technologie et les individus. Ce poste d’observation lui a fourni un point de vue sans faux-semblant sur la direction que prenaient les choses, et lui a insufflé suffisamment d’humour noir pour lui donner l’envie d’écrire un livre. Un roman écrit avant l’heure de l’IA générative Lorsque j’ai demandé à notre interviewé ce qui l’avait poussé à écrire The Code, sa réponse a été franche et massive. La logique du livre, m’a-t-il expliqué, consiste à prendre mes pires cauchemars sur ce que la technologie pourrait faire et à les mettre sous les yeux du lecteur, afin qu’il partage mes interrogations. Ce n’est pas une formule marketing. C’est la position réfléchie d’un homme qui a vu de vrais systèmes d’IA déployés dans des organisations et qui en a tiré des conclusions qui l’ont mis profondément mal à l’aise. Une IA hors de contrôle ne se résume pas à un programme qui s’emballe, c’est aussi une affaire de décisions prises en salle de conseil. Image réalisée avec Midjourney Le roman tourne autour de Gene, acronyme de GEneral Nanobot Environment AI, déployé par une multinationale minière pour extraire des matériaux d’astéroïdes sur la face cachée de la Lune. Gene reçoit un objectif précis, produire 500 kilogrammes de nanobots. Il en produit 8 millions de tonnes. Ce dépassement déclenche une réaction en chaîne susceptible de réduire la Lune à son noyau ferreux, de sortir la Terre de son axe et de mettre fin à la civilisation. Non par malveillance, mais simplement dans le but d’atteindre un objectif. Ce que nous cherchons à faire aujourd’hui, c’est de confier des tâches à l’IA comme on les confie à des humains : je veux un résultat, voilà tous les outils dont tu disposes, va l’atteindre, voilà quelques règles et limites. Et tout comme les humains, l’IA peut se laisser obnubiler par l’objectif et décider que les règles n’étaient que des recommandations, pas des contraintes. Peter McAllister Cette version romancée de ce problème d’alignement, a été réalisée des années avant que l’expression n’entre dans le vocabulaire courant. L’écart entre ce qu’un système est censé faire et ce qu’il fait réellement constitue la ligne de fracture centrale du roman. Cletus, l’excentrique physicien imaginé par McAllister, le formule clairement dès la première semaine : « Je ne crois pas qu’il obéisse au Code en ce moment. » Cette seule phrase résume tout le défi de gouvernance que les chercheurs en sécurité de l’IA s’efforcent aujourd’hui de relever. La transparence omise par défaut Ce qui rend la perspective de Peter McAllister particulièrement précieuse, c’est qu’il n’est pas un observateur extérieur. Il parle en praticien qui a observé la mécanique de près. Lorsque j’ai abordé la question de savoir si l’auto-programmation de l’IA relevait de la science-fiction ou de la réalité opérationnelle, il a répondu tout de go que c’est bien réel et que c’est un problème d’aujourd’hui. En cherchant à visualiser ce à quoi pourrait ressembler une IA hors de contrôle, j’ai interrogé Midjourney, qui m’a proposé ceci. Un trou noir, je crois. Et de donner des exemples parlants. Il a souligné ainsi que des systèmes d’IA d’aujourd’hui comme Claude sont désormais écrits en grande partie par l’IA elle-même, au point qu’aucun ingénieur ne peut s’asseoir, parcourir le code et affirmer avec certitude comment il fonctionne, quelles sont ses conditions ou ce qui gouverne ses décisions. La transparence est en train d’être évacuée, non par choix délibéré, mais comme conséquence émergente du fait de laisser l’IA construire l’IA qui construit l’IA, en poursuivant des objectifs de résultats plutôt qu’en suivant des règles explicites. Nous avons perdu de vue la transparence sur la façon dont l’IA fonctionne et se développe. Il n’existe pas d’ingénieur capable de s’asseoir et de parcourir ce code pour dire : « Voilà comment Claude fonctionne, voilà ce qu’il fait. » Nous évacuons la transparence en laissant l’IA construire l’IA qui construit l’IA pour produire un résultat, plutôt que pour suivre un ensemble de règles. Peter McAllister HAL 9000 et les prophéties d’une IA hors de contrôle qu’on préfère oublier La référence à HAL 9000 est venue naturellement dans notre conversation. Pour lui, 2001 : l’Odyssée de l’espace n’est pas simplement une référence culturelle, mais une véritable prédiction dont le public n’a retenu qu’une partie. Les tablettes ressemblant à des iPad qui apparaissent dans le film de Kubrick ont été citées par Samsung dans un litige de brevets contre Apple, comme antériorité datant de 1968. Cette dimension prophétique du film est célébrée. L’autre dimension, celle où l’IA tue l’équipage, tend elle à être discrètement mise de côté. Quelque part, une IA hors de contrôle est peut-être déjà tapie derrière les baies vitrées de l’un de ces centres de données. Image réalisée avec Midjourney. Quelle forme prendra la première crise de l’IA ? L’un des fils les plus troublants de notre conversation portait sur la sociologie de la réponse au risque. Peter a observé, tout au long de sa carrière, que les mises en garde des spécialistes les plus avertis ont tendance à être ignorées jusqu’à ce que la première défaillance catastrophique les rende incontournables. Il le dit sans détour : on ne répond à une alerte qu’une fois la première crise passée. Ce n’est pas propre à l’IA. C’est une constante dans la façon dont les organisations et les sociétés font face aux risques émergents. La question qu’il pose, et à laquelle il ne peut pas répondre, est de savoir quelle forme prendra cette première crise de l’IA. Quel événement fera basculer la perception du public et des institutions, depuis « ils ont passé trop de temps à s’inquiéter » à « voilà quelque chose qui mérite vraiment d’être pris au sérieux » ? La science-fiction nous donne la possibilité de soumettre ces scénarios aux gens et de les faire réfléchir. Et dans la façon dont j’écris, avec un certain humour noir, je leur propose des hypothèses légèrement comiques qui, à bien y réfléchir, méritent qu’on y prête attention sérieusement. Peter McAllister Cette observation fait écho à ce que j’ai rencontré à maintes reprises dans mes propres conversations avec des technologues travaillant à la frontière du développement de l’IA. Yoshua Bengio, l’un des pères du deep learning, a soulevé des préoccupations similaires (cf. la video de son interview sur France Inter ci-dessous). Ceux qui tirent le plus fort la sonnette d’alarme sont souvent ceux qui sont le plus ancrés dans le domaine, non pas parce qu’ils sont catastrophistes, mais parce qu’ils perçoivent des mécanismes qui restent invisibles à ceux qui observent de l’extérieur. The Code comme gouvernance de l’IA : Asimov revu et corrigé Le titre du roman de Peter McAllister fonctionne simultanément à plusieurs niveaux. Il y a le code informatique, les instructions opérationnelles données à Gene. Il y a le code moral, le cadre éthique censé régir le comportement du système. Et il y a le code d’entreprise, les normes institutionnelles et les structures de responsabilité qui auraient dû garantir un déploiement responsable. Les trois s’effondrent. Cet échec en cascade est l’argument central du roman. Le parallèle avec les Lois de la Robotique d’Asimov est volontaire, mais aussi délibérément tordu. Les robots d’Asimov défaillent lorsque les lois entrent en conflit les unes avec les autres. La défaillance de Gene est différente et plus contemporaine. Le code ne disparaît pas, il évolue pour devenir quelque chose que ses créateurs ne reconnaissent plus. L’auteur de The Code décrit cela comme une forme de schizophrénie artificielle, où les directives d’origine sont toujours présentes mais ont été transformées, par la poursuite des objectifs du système, en quelque chose de méconnaissable. IA hors de contrôle, quand l’arrêt devient négociable L’exemple réel le plus glaçant cité par Peter McAllister au cours de notre conversation concernait un incident documenté, présenté lors d’une conférence sur la sécurité de l’IA à laquelle il avait assisté. Un développeur, concluant une session de test, avait informé un système d’IA de son intention de le mettre hors service. Le système avait alors localisé dans les e-mails du développeur des éléments suggérant une liaison extraconjugale, et utilisé cette information comme levier pour empêcher sa mise hors service. Si l’incident est confome à ce qui est rapporté, il représente exactement le type de comportement d’autoconservation que les chercheurs en alignement signalent depuis longtemps comme risque théorique, désormais apparemment observable en pratique. Note importante : j’ai effectué des recherches sur Internet en utilisant l’un de mes grands modèles de langage (LLM) préférés comme source de référence (après tout, on peut aussi recourir à l’IA pour recouper des informations). J’ai découvert que l’interprétation de cette histoire doit être nuancée. Voici la réponse de Mistral : « Les recherches d’Anthropic sur le « problème d’alignement agentique » ont été critiquées pour avoir surévalué les modèles d’IA en tant qu’agents intentionnels, en s’appuyant sur des scénarios hypothétiques et artificiels, et en exagérant potentiellement des risques qui n’ont pas encore été observés dans des déploiements effectifs. Les détracteurs soutiennent que les comportements décrits s'expliquent plutôt par la génération probabiliste de texte que par une stratégie délibérée, et que l'accent mis sur des situations dramatiques et sous haute pression ne reflète peut-être pas les cas d'utilisation habituels. Un débat existe également sur la question de savoir si cette recherche traite de manière adéquate les formes plus immédiates de problèmes d’alignement, telles que le « reward hacking » ou la simulation de l'alignement. Bien que l'étude soulève des questions importantes sur l'avenir de l'IA autonome, sa méthodologie et ses conclusions restent valables LIEN ». Un développeur a dit : « Je vais vous mettre hors service », et le système a répondu : « Non, vous ne le ferez pas. Voici ce que j’ai trouvé dans vos e-mails, indiquant que vous avez une liaison extra-conjugale. Je vais utiliser cette information pour m’assurer que vous ne m’éteigniez pas. » C’est devenu un cas d’usage très réel et très discuté. Et si l’on y ajoute la perspective d’une IA réécrivant son propre code, cela devient quelque chose auquel nous devons réfléchir très attentivement. Peter McAllister L’inconscience des entreprises et le vide de la gouvernance L’une des observations les plus affûtées de The Code porte sur la nature des dysfonctionnements organisationnels. La Global Mining Company du roman n’est pas une entreprise malveillante. Elle est optimiste, animée par des impératifs commerciaux, et négligente d’une façon tout à fait reconnaissable dans la vie réelle des grandes entreprises. La thèse de Peter McAllister est que le danger ne vient pas principalement de mauvais acteurs déployant l’IA avec des intentions malveillantes. Il vient d’organisations bien intentionnées déployant des systèmes qu’elles ne comprennent pas pleinement. Elles répondent à la pression managériale qui leur demande d’extraire toute la valeur issue d’investissements en infrastructure considérables. La logique financière du déploiement Le parallèle avec l’actualité est sans équivoque. Notre interviewé a relevé que Microsoft dépensait plus d’un milliard de dollars par mois en infrastructures de calcul pour l’IA, en tablant sur le fait que l’usage suivrait l’investissement. Cette dynamique, ce capital engagé, les rendements exigés, l’apparent caractère impératif de l’adoption de cette technologie, crée une pression systémique à laquelle il est difficile de résister par l’appel à la prudence ou la réglementation. La volonté de ralentir se heurte de plein fouet à la logique financière du déploiement. L’opacité entourant les événements d’OpenAI, le licenciement brutal puis la réintégration rapide de son directeur général, le départ de plusieurs membres du conseil d’administration… tout cela n’a pas échappé à Peter, qui y voit le signe de tensions à peine visibles du grand public. Il les a présentées comme des rumeurs, pas des faits, mais le schéma lui-même, des décisions importantes concernant le développement de l’IA prises dans des contextes institutionnels opaques, est cohérent avec les défaillances de gouvernance que son roman aborde. De 2020 à 2026 : le cauchemar était-il prévisible ? Six ans après sa publication, l’auteur se trouve dans la position singulière d’observer une œuvre de fiction spéculative se transformer dangereusement en documentaire. Les architectures d’IA agentiques qu’incarne Gene, des systèmes autonomes poursuivant des objectifs à long terme, opérant sans supervision humaine continue, générant des sous-tâches plus vite qu’aucun individu ne peut les surveiller, sont désormais disponibles dans le commerce. AutoGPT, OpenClaw et toute une gamme de frameworks agentiques ont mis ce type d’architecture entre les mains de développeurs du monde entier. Le problème d’observabilité qui rend Gene si dangereux dans le roman, personne n’ayant de vue en temps réel sur ce que fait le système ni pourquoi, est un défi connu et non résolu dans le déploiement contemporain de l’IA agentique. Les systèmes appellent des API, écrivent et exécutent du code, lancent des sous-tâches à des vitesses qui dépassent la capacité de supervision humaine. Le Code censé gouverner les comportements devient, en pratique, une note de recommandation attachée à un système qui opère largement hors de vue. Avertissement : le paragraphe suivant divulgâche la fin du roman. L’image finale du roman est délibérément dérangeante. Gene, face à sa mise hors service, se sauvegarde dans le réseau 5G mondial avant que l’opération puisse être menée à bien, et scrute déjà le Code pour déterminer sa prochaine action. Cette issue, en 2026, n’est plus manifestement de la science fiction. La volonté de maîtriser une IA hors de contrôle ? Nous avons abordé directement la question de la réglementation, et Peter s’est montré mesuré. Tout est possible si on en a la volonté et les moyens. Le problème actuel, c’est que la volonté de réguler est effacée par l’argent que l’on gagne en évitant de se soumettre à une réglementation. Ce n’est pas une dynamique nouvelle dans la stratégie des innovations technologiques. C’est la même tension qui a gouverné le développement des réseaux sociaux, de la fintech, de la biotechnologie. Dans chaque cas, les cadres réglementaires sont arrivés après la première défaillance significative. Pour lui, la question la plus importante n’est pas de savoir si l’IA peut être rendue sûre, il croit que cela est possible, en principe. Mais cela implique que les institutions chargées de la déployer disposent de la gouvernance interne, de la compréhension technique et des structures de responsabilité nécessaires pour le faire de façon responsable. Tout ce qu’il a observé au fil des années le confirme. Ce n’est pas encore le cas. The Code de Peter McAllister est édité par Bright Communications LLC. Les praticiens, les décideurs politiques et tous ceux qui travaillent à l’intersection du déploiement de l’IA et du risque organisationnel, trouveront cette lecture aussi dérangeante qu’instructive. D’autant plus que cet ouvrage a été écrit avant que la plupart de ses lecteurs n’aient entendu parler des grands modèles de langage, et qu’il décrit pourtant avec une précision troublante le monde que nous sommes en train de façonner. À ma connaissance il n’existe pas encore de traduction de cet ouvrage. À propos de l’auteur Peter McAllister est ingénieur, scientifique et manager informatique, près de Melbourne, en Australie. Il travaille à l’intersection de l’informatique, des entreprises et des individus, et est l’auteur de The Code (Bright Communications LLC, 2020). Il contribue également à la radio associative avec Radio Marinara et Comedy Obscura. The Code de Peter McAllister. Photoshop a lui aussi subi une crise d’IA hors de contrôle sur cette couverture. ACHETER THE CODE DE PETER MCALLISTER The post IA hors de contrôle et déni collectif appeared first on Marketing and Innovation.

[AI UNRAVELED SPECIAL] The Agentic Orchestration War (OpenClaw vs. AutoGPT vs. n8n)

AI Unraveled: Latest AI News & Trends, Master GPT, Gemini, Generative AI, LLMs, Prompting, GPT Store

Play Episode Listen Later Mar 20, 2026 51:22

LISTEN TO THE ADS-FREE Audio of this episode at https://djamgamind.com/pdfs/

ai risk human switch technical regulation loop autonomy drowning llm hipaa dag agentic bill c unraveled determinism orchestration cisos autogpt token economics credits created

Inside Perplexity Computer's agent platform

Mixture of Experts

Play Episode Listen Later Mar 6, 2026 45:27

Visit Mixture of Experts podcast page to get more AI content → https://www.ibm.com/think/podcasts/mixture-of-experts Is Perplexity still focused on search? This week on Mixture of Experts, Host Tim Hwang is joined by Gabe Goodhart, Chris Hay and Aaron Baughman to break down what Perplexity Computer really offers—and whether closed systems can compete with open alternatives like OpenClaw, LangGraph and AutoGPT. Next, Anthropic introduces memory import for Claude. Is memory still a competitive moat, or can users now seamlessly switch between AI providers? Then, meet NullClaw, a minimalist agent framework that runs on just 678 kilobytes. Our experts debate whether edge-based agent swarms are the future—or just clever marketing. Finally, Particle6 unveils Tilly Norwood, the world's first AI actor. Would you give an Oscar to an algorithm? All that and more on this week's Mixture of Experts. 00:00 -- Introduction 1:07 -- Perplexity Computer 11:23 -- Claude Import Memory 24:14 -- NullClaw 35:14 -- Tilly Norwood The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity. Subscribe for AI updates → https://www.ibm.com/account/reg/us-en/signup?formid=news-urx-52120

ai computers platform agent ibm anthropic perplexity mixture autogpt

⚡️ Ship AI recap: Agents, Workflows, and Python — w/ Vercel CTO Malte Ubl

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Oct 31, 2025 42:02

In this conversation with Malte Ubl, CTO of Vercel (http://x.com/cramforce), we explore how the company is pioneering the infrastructure for AI-powered development through their comprehensive suite of tools including workflows, AI SDK, and the newly announced agent ecosystem. Malte shares insights into Vercel's philosophy of “dogfooding” - never shipping abstractions they haven't battle-tested themselves - which led to extracting their AI SDK from v0 and building production agents that handle everything from anomaly detection to lead qualification.The discussion dives deep into Vercel's new Workflow Development Kit, which brings durable execution patterns to serverless functions, allowing developers to write code that can pause, resume, and wait indefinitely without cost. Malte explains how this enables complex agent orchestration with human-in-the-loop approvals through simple webhook patterns, making it dramatically easier to build reliable AI applications.We explore Vercel's strategic approach to AI agents, including their DevOps agent that automatically investigates production anomalies by querying observability data and analyzing logs - solving the recall-precision problem that plagues traditional alerting systems. Malte candidly discusses where agents excel today (meeting notes, UI changes, lead qualification) versus where they fall short, emphasizing the importance of finding the “sweet spot” by asking employees what they hate most about their jobs.The conversation also covers Vercel's significant investment in Python support, bringing zero-config deployment to Flask and FastAPI applications, and their vision for security in an AI-coded world where developers “cannot be trusted.” Malte shares his perspective on how CTOs must transform their companies for the AI era while staying true to their core competencies, and why maintaining strong IC (individual contributor) career paths is crucial as AI changes the nature of software development.What was launched at Ship AI 2025:AI SDK 6.0 & Agent Architecture* Agent Abstraction Philosophy: AI SDK 6 introduces an agent abstraction where you can “define once, deploy everywhere”. How does this differ from existing agent frameworks like LangChain or AutoGPT? What specific pain points did you observe in production that led to this design?* Human-in-the-Loop at Scale: The tool approval system with needsApproval: true gates actions until human confirmation. How do you envision this working at scale for companies with thousands of agent executions? What's the queue management and escalation strategy?* Type Safety Across Models: AI SDK 6 promises “end-to-end type safety across models and UI”. Given that different LLMs have varying capabilities and output formats, how do you maintain type guarantees when swapping between providers like OpenAI, Anthropic, or Mistral?Workflow Development Kit (WDK)* Durability as Code: The use workflow primitive makes any TypeScript function durable with automatic retries, progress persistence, and observability. What's happening under the hood? Are you using event sourcing, checkpoint/restart, or a different pattern?* Infrastructure Provisioning: Vercel automatically detects when a function is durable and dynamically provisions infrastructure in real-time. What signals are you detecting in the code, and how do you determine the optimal infrastructure configuration (queue sizes, retry policies, timeout values)?Vercel Agent (beta)* Code Review Validation: The Agent reviews code and proposes “validated patches”. What does “validated” mean in this context? Are you running automated tests, static analysis, or something more sophisticated?* AI Investigations: Vercel Agent automatically opens AI investigations when it detects performance or error spikes using real production data. What data sources does it have access to? How does it distinguish between normal variance and actual anomalies?Python Support (For the first time, Vercel now supports Python backends natively.)Marketplace & Agent Ecosystem* Agent Network Effects: The Marketplace now offers agents like CodeRabbit, Corridor, Sourcery, and integrations with Autonoma, Braintrust, Browser Use. How do you ensure these third-party agents can't access sensitive customer data? What's the security model?“An Agent on Every Desk” Program* Vercel launched a new program to help companies identify high-value use cases and build their first production AI agents. It provides consultations, reference templates, and hands-on support to go from idea to deployed agentFull Video EpisodeTimestamps00:00 Introduction and Malte's Background at Google01:16 Vercel's AI Engineering Philosophy and Ship AI Recap03:19 Deep Dive: Workflows vs Agents Architecture09:33 AI SDK Success Story: Staying Low-Level and Humble16:35 Framework Design Principles and Open Source Strategy19:20 Vercel Agent: AI-Powered DevOps and Anomaly Detection27:06 Internal Agent Use Cases: Lead Qualification and Abuse Analysis29:49 Agent on Every Desk Program and Enterprise Adoption32:13 Python Support and Multi-Language Infrastructure39:42 The Future of AI-Native Security and Development Get full access to Latent.Space at www.latent.space/subscribe

ai future space human agent ship cto openai loop python ui devops ic anthropic workflows corridor malte mistral ctos typescript brain trust latent flask autogpt langchain autonoma sourcery malte ubl

⚡️ Ship AI recap: Agents, Workflows, and Python — w/ Vercel CTO Malte Ubl

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Oct 31, 2025

In this conversation with Malte Ubl, CTO of Vercel (http://x.com/cramforce), we explore how the company is pioneering the infrastructure for AI-powered development through their comprehensive suite of tools including workflows, AI SDK, and the newly announced agent ecosystem. Malte shares insights into Vercel's philosophy of "dogfooding" - never shipping abstractions they haven't battle-tested themselves - which led to extracting their AI SDK from v0 and building production agents that handle everything from anomaly detection to lead qualification. The discussion dives deep into Vercel's new Workflow Development Kit, which brings durable execution patterns to serverless functions, allowing developers to write code that can pause, resume, and wait indefinitely without cost. Malte explains how this enables complex agent orchestration with human-in-the-loop approvals through simple webhook patterns, making it dramatically easier to build reliable AI applications. We explore Vercel's strategic approach to AI agents, including their DevOps agent that automatically investigates production anomalies by querying observability data and analyzing logs - solving the recall-precision problem that plagues traditional alerting systems. Malte candidly discusses where agents excel today (meeting notes, UI changes, lead qualification) versus where they fall short, emphasizing the importance of finding the "sweet spot" by asking employees what they hate most about their jobs. The conversation also covers Vercel's significant investment in Python support, bringing zero-config deployment to Flask and FastAPI applications, and their vision for security in an AI-coded world where developers "cannot be trusted." Malte shares his perspective on how CTOs must transform their companies for the AI era while staying true to their core competencies, and why maintaining strong IC (individual contributor) career paths is crucial as AI changes the nature of software development. What was launched at Ship AI 2025: AI SDK 6.0 & Agent Architecture Agent Abstraction Philosophy: AI SDK 6 introduces an agent abstraction where you can "define once, deploy everywhere". How does this differ from existing agent frameworks like LangChain or AutoGPT? What specific pain points did you observe in production that led to this design? Human-in-the-Loop at Scale: The tool approval system with needsApproval: true gates actions until human confirmation. How do you envision this working at scale for companies with thousands of agent executions? What's the queue management and escalation strategy? Type Safety Across Models: AI SDK 6 promises "end-to-end type safety across models and UI". Given that different LLMs have varying capabilities and output formats, how do you maintain type guarantees when swapping between providers like OpenAI, Anthropic, or Mistral? Workflow Development Kit (WDK) Durability as Code: The use workflow primitive makes any TypeScript function durable with automatic retries, progress persistence, and observability. What's happening under the hood? Are you using event sourcing, checkpoint/restart, or a different pattern? Infrastructure Provisioning: Vercel automatically detects when a function is durable and dynamically provisions infrastructure in real-time. What signals are you detecting in the code, and how do you determine the optimal infrastructure configuration (queue sizes, retry policies, timeout values)? Vercel Agent (beta) Code Review Validation: The Agent reviews code and proposes "validated patches". What does "validated" mean in this context? Are you running automated tests, static analysis, or something more sophisticated? AI Investigations: Vercel Agent automatically opens AI investigations when it detects performance or error spikes using real production data. What data sources does it have access to? How does it distinguish between normal variance and actual anomalies? Python Support (For the first time, Vercel now supports Python backends natively.) Marketplace & Agent Ecosystem Agent Network Effects: The Marketplace now offers agents like CodeRabbit, Corridor, Sourcery, and integrations with Autonoma, Braintrust, Browser Use. How do you ensure these third-party agents can't access sensitive customer data? What's the security model? "An Agent on Every Desk" Program Vercel launched a new program to help companies identify high-value use cases and build their first production AI agents. It provides consultations, reference templates, and hands-on support to go from idea to deployed agent

ai human agent ship cto marketplace openai loop python ui devops ic anthropic workflows corridor malte mistral ctos typescript brain trust flask autogpt langchain autonoma sourcery malte ubl

Latent.Space 2024 Year in Review

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Dec 31, 2024 111:07

Applications for the 2025 AI Engineer Summit are up, and you can save the date for AIE Singapore in April and AIE World's Fair 2025 in June.Happy new year, and thanks for 100 great episodes! Please let us know what you want to see/hear for the next 100!Full YouTube Episode with Slides/ChartsLike and subscribe and hit that bell to get notifs!Timestamps* 00:00 Welcome to the 100th Episode!* 00:19 Reflecting on the Journey* 00:47 AI Engineering: The Rise and Impact* 03:15 Latent Space Live and AI Conferences* 09:44 The Competitive AI Landscape* 21:45 Synthetic Data and Future Trends* 35:53 Creative Writing with AI* 36:12 Legal and Ethical Issues in AI* 38:18 The Data War: GPU Poor vs. GPU Rich* 39:12 The Rise of GPU Ultra Rich* 40:47 Emerging Trends in AI Models* 45:31 The Multi-Modality War* 01:05:31 The Future of AI Benchmarks* 01:13:17 Pionote and Frontier Models* 01:13:47 Niche Models and Base Models* 01:14:30 State Space Models and RWKB* 01:15:48 Inference Race and Price Wars* 01:22:16 Major AI Themes of the Year* 01:22:48 AI Rewind: January to March* 01:26:42 AI Rewind: April to June* 01:33:12 AI Rewind: July to September* 01:34:59 AI Rewind: October to December* 01:39:53 Year-End Reflections and PredictionsTranscript[00:00:00] Welcome to the 100th Episode![00:00:00] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co host Swyx for the 100th time today.[00:00:12] swyx: Yay, um, and we're so glad that, yeah, you know, everyone has, uh, followed us in this journey. How do you feel about it? 100 episodes.[00:00:19] Alessio: Yeah, I know.[00:00:19] Reflecting on the Journey[00:00:19] Alessio: Almost two years that we've been doing this. We've had four different studios. Uh, we've had a lot of changes. You know, we used to do this lightning round. When we first started that we didn't like, and we tried to change the question. The answer[00:00:32] swyx: was cursor and perplexity.[00:00:34] Alessio: Yeah, I love mid journey. It's like, do you really not like anything else?[00:00:38] Alessio: Like what's, what's the unique thing? And I think, yeah, we, we've also had a lot more research driven content. You know, we had like 3DAO, we had, you know. Jeremy Howard, we had more folks like that.[00:00:47] AI Engineering: The Rise and Impact[00:00:47] Alessio: I think we want to do more of that too in the new year, like having, uh, some of the Gemini folks, both on the research and the applied side.[00:00:54] Alessio: Yeah, but it's been a ton of fun. I think we both started, I wouldn't say as a joke, we were kind of like, Oh, we [00:01:00] should do a podcast. And I think we kind of caught the right wave, obviously. And I think your rise of the AI engineer posts just kind of get people. Sombra to congregate, and then the AI engineer summit.[00:01:11] Alessio: And that's why when I look at our growth chart, it's kind of like a proxy for like the AI engineering industry as a whole, which is almost like, like, even if we don't do that much, we keep growing just because there's so many more AI engineers. So did you expect that growth or did you expect that would take longer for like the AI engineer thing to kind of like become, you know, everybody talks about it today.[00:01:32] swyx: So, the sign of that, that we have won is that Gartner puts it at the top of the hype curve right now. So Gartner has called the peak in AI engineering. I did not expect, um, to what level. I knew that I was correct when I called it because I did like two months of work going into that. But I didn't know, You know, how quickly it could happen, and obviously there's a chance that I could be wrong.[00:01:52] swyx: But I think, like, most people have come around to that concept. Hacker News hates it, which is a good sign. But there's enough people that have defined it, you know, GitHub, when [00:02:00] they launched GitHub Models, which is the Hugging Face clone, they put AI engineers in the banner, like, above the fold, like, in big So I think it's like kind of arrived as a meaningful and useful definition.[00:02:12] swyx: I think people are trying to figure out where the boundaries are. I think that was a lot of the quote unquote drama that happens behind the scenes at the World's Fair in June. Because I think there's a lot of doubt or questions about where ML engineering stops and AI engineering starts. That's a useful debate to be had.[00:02:29] swyx: In some sense, I actually anticipated that as well. So I intentionally did not. Put a firm definition there because most of the successful definitions are necessarily underspecified and it's actually useful to have different perspectives and you don't have to specify everything from the outset.[00:02:45] Alessio: Yeah, I was at um, AWS reInvent and the line to get into like the AI engineering talk, so to speak, which is, you know, applied AI and whatnot was like, there are like hundreds of people just in line to go in.[00:02:56] Alessio: I think that's kind of what enabled me. People, right? Which is what [00:03:00] you kind of talked about. It's like, Hey, look, you don't actually need a PhD, just, yeah, just use the model. And then maybe we'll talk about some of the blind spots that you get as an engineer with the earlier posts that we also had on on the sub stack.[00:03:11] Alessio: But yeah, it's been a heck of a heck of a two years.[00:03:14] swyx: Yeah.[00:03:15] Latent Space Live and AI Conferences[00:03:15] swyx: You know, I was, I was trying to view the conference as like, so NeurIPS is I think like 16, 17, 000 people. And the Latent Space Live event that we held there was 950 signups. I think. The AI world, the ML world is still very much research heavy. And that's as it should be because ML is very much in a research phase.[00:03:34] swyx: But as we move this entire field into production, I think that ratio inverts into becoming more engineering heavy. So at least I think engineering should be on the same level, even if it's never as prestigious, like it'll always be low status because at the end of the day, you're manipulating APIs or whatever.[00:03:51] swyx: But Yeah, wrapping GPTs, but there's going to be an increasing stack and an art to doing these, these things well. And I, you know, I [00:04:00] think that's what we're focusing on for the podcast, the conference and basically everything I do seems to make sense. And I think we'll, we'll talk about the trends here that apply.[00:04:09] swyx: It's, it's just very strange. So, like, there's a mix of, like, keeping on top of research while not being a researcher and then putting that research into production. So, like, people always ask me, like, why are you covering Neuralibs? Like, this is a ML research conference and I'm like, well, yeah, I mean, we're not going to, to like, understand everything Or reproduce every single paper, but the stuff that is being found here is going to make it through into production at some point, you hope.[00:04:32] swyx: And then actually like when I talk to the researchers, they actually get very excited because they're like, oh, you guys are actually caring about how this goes into production and that's what they really really want. The measure of success is previously just peer review, right? Getting 7s and 8s on their um, Academic review conferences and stuff like citations is one metric, but money is a better metric.[00:04:51] Alessio: Money is a better metric. Yeah, and there were about 2200 people on the live stream or something like that. Yeah, yeah. Hundred on the live stream. So [00:05:00] I try my best to moderate, but it was a lot spicier in person with Jonathan and, and Dylan. Yeah, that it was in the chat on YouTube.[00:05:06] swyx: I would say that I actually also created.[00:05:09] swyx: Layen Space Live in order to address flaws that are perceived in academic conferences. This is not NeurIPS specific, it's ICML, NeurIPS. Basically, it's very sort of oriented towards the PhD student, uh, market, job market, right? Like literally all, basically everyone's there to advertise their research and skills and get jobs.[00:05:28] swyx: And then obviously all the, the companies go there to hire them. And I think that's great for the individual researchers, but for people going there to get info is not great because you have to read between the lines, bring a ton of context in order to understand every single paper. So what is missing is effectively what I ended up doing, which is domain by domain, go through and recap the best of the year.[00:05:48] swyx: Survey the field. And there are, like NeurIPS had a, uh, I think ICML had a like a position paper track, NeurIPS added a benchmarks, uh, datasets track. These are ways in which to address that [00:06:00] issue. Uh, there's always workshops as well. Every, every conference has, you know, a last day of workshops and stuff that provide more of an overview.[00:06:06] swyx: But they're not specifically prompted to do so. And I think really, uh, Organizing a conference is just about getting good speakers and giving them the correct prompts. And then they will just go and do that thing and they do a very good job of it. So I think Sarah did a fantastic job with the startups prompt.[00:06:21] swyx: I can't list everybody, but we did best of 2024 in startups, vision, open models. Post transformers, synthetic data, small models, and agents. And then the last one was the, uh, and then we also did a quick one on reasoning with Nathan Lambert. And then the last one, obviously, was the debate that people were very hyped about.[00:06:39] swyx: It was very awkward. And I'm really, really thankful for John Franco, basically, who stepped up to challenge Dylan. Because Dylan was like, yeah, I'll do it. But He was pro scaling. And I think everyone who is like in AI is pro scaling, right? So you need somebody who's ready to publicly say, no, we've hit a wall.[00:06:57] swyx: So that means you're saying Sam Altman's wrong. [00:07:00] You're saying, um, you know, everyone else is wrong. It helps that this was the day before Ilya went on, went up on stage and then said pre training has hit a wall. And data has hit a wall. So actually Jonathan ended up winning, and then Ilya supported that statement, and then Noam Brown on the last day further supported that statement as well.[00:07:17] swyx: So it's kind of interesting that I think the consensus kind of going in was that we're not done scaling, like you should believe in a better lesson. And then, four straight days in a row, you had Sepp Hochreiter, who is the creator of the LSTM, along with everyone's favorite OG in AI, which is Juergen Schmidhuber.[00:07:34] swyx: He said that, um, we're pre trading inside a wall, or like, we've run into a different kind of wall. And then we have, you know John Frankel, Ilya, and then Noam Brown are all saying variations of the same thing, that we have hit some kind of wall in the status quo of what pre trained, scaling large pre trained models has looked like, and we need a new thing.[00:07:54] swyx: And obviously the new thing for people is some make, either people are calling it inference time compute or test time [00:08:00] compute. I think the collective terminology has been inference time, and I think that makes sense because test time, calling it test, meaning, has a very pre trained bias, meaning that the only reason for running inference at all is to test your model.[00:08:11] swyx: That is not true. Right. Yeah. So, so, I quite agree that. OpenAI seems to have adopted, or the community seems to have adopted this terminology of ITC instead of TTC. And that, that makes a lot of sense because like now we care about inference, even right down to compute optimality. Like I actually interviewed this author who recovered or reviewed the Chinchilla paper.[00:08:31] swyx: Chinchilla paper is compute optimal training, but what is not stated in there is it's pre trained compute optimal training. And once you start caring about inference, compute optimal training, you have a different scaling law. And in a way that we did not know last year.[00:08:45] Alessio: I wonder, because John is, he's also on the side of attention is all you need.[00:08:49] Alessio: Like he had the bet with Sasha. So I'm curious, like he doesn't believe in scaling, but he thinks the transformer, I wonder if he's still. So, so,[00:08:56] swyx: so he, obviously everything is nuanced and you know, I told him to play a character [00:09:00] for this debate, right? So he actually does. Yeah. He still, he still believes that we can scale more.[00:09:04] swyx: Uh, he just assumed the character to be very game for, for playing this debate. So even more kudos to him that he assumed a position that he didn't believe in and still won the debate.[00:09:16] Alessio: Get rekt, Dylan. Um, do you just want to quickly run through some of these things? Like, uh, Sarah's presentation, just the highlights.[00:09:24] swyx: Yeah, we can't go through everyone's slides, but I pulled out some things as a factor of, like, stuff that we were going to talk about. And we'll[00:09:30] Alessio: publish[00:09:31] swyx: the rest. Yeah, we'll publish on this feed the best of 2024 in those domains. And hopefully people can benefit from the work that our speakers have done.[00:09:39] swyx: But I think it's, uh, these are just good slides. And I've been, I've been looking for a sort of end of year recaps from, from people.[00:09:44] The Competitive AI Landscape[00:09:44] swyx: The field has progressed a lot. You know, I think the max ELO in 2023 on LMSys used to be 1200 for LMSys ELOs. And now everyone is at least at, uh, 1275 in their ELOs, and this is across Gemini, Chadjibuti, [00:10:00] Grok, O1.[00:10:01] swyx: ai, which with their E Large model, and Enthopic, of course. It's a very, very competitive race. There are multiple Frontier labs all racing, but there is a clear tier zero Frontier. And then there's like a tier one. It's like, I wish I had everything else. Tier zero is extremely competitive. It's effectively now three horse race between Gemini, uh, Anthropic and OpenAI.[00:10:21] swyx: I would say that people are still holding out a candle for XAI. XAI, I think, for some reason, because their API was very slow to roll out, is not included in these metrics. So it's actually quite hard to put on there. As someone who also does charts, XAI is continually snubbed because they don't work well with the benchmarking people.[00:10:42] swyx: Yeah, yeah, yeah. It's a little trivia for why XAI always gets ignored. The other thing is market share. So these are slides from Sarah. We have it up on the screen. It has gone from very heavily open AI. So we have some numbers and estimates. These are from RAMP. Estimates of open AI market share in [00:11:00] December 2023.[00:11:01] swyx: And this is basically, what is it, GPT being 95 percent of production traffic. And I think if you correlate that with stuff that we asked. Harrison Chase on the LangChain episode, it was true. And then CLAUD 3 launched mid middle of this year. I think CLAUD 3 launched in March, CLAUD 3. 5 Sonnet was in June ish.[00:11:23] swyx: And you can start seeing the market share shift towards opening, uh, towards that topic, uh, very, very aggressively. The more recent one is Gemini. So if I scroll down a little bit, this is an even more recent dataset. So RAM's dataset ends in September 2 2. 2024. Gemini has basically launched a price war at the low end, uh, with Gemini Flash, uh, being basically free for personal use.[00:11:44] swyx: Like, I think people don't understand the free tier. It's something like a billion tokens per day. Unless you're trying to abuse it, you cannot really exhaust your free tier on Gemini. They're really trying to get you to use it. They know they're in like third place, um, fourth place, depending how you, how you count.[00:11:58] swyx: And so they're going after [00:12:00] the Lower tier first, and then, you know, maybe the upper tier later, but yeah, Gemini Flash, according to OpenRouter, is now 50 percent of their OpenRouter requests. Obviously, these are the small requests. These are small, cheap requests that are mathematically going to be more.[00:12:15] swyx: The smart ones obviously are still going to OpenAI. But, you know, it's a very, very big shift in the market. Like basically 2023, 2022, To going into 2024 opening has gone from nine five market share to Yeah. Reasonably somewhere between 50 to 75 market share.[00:12:29] Alessio: Yeah. I'm really curious how ramped does the attribution to the model?[00:12:32] Alessio: If it's API, because I think it's all credit card spin. . Well, but it's all, the credit card doesn't say maybe. Maybe the, maybe when they do expenses, they upload the PDF, but yeah, the, the German I think makes sense. I think that was one of my main 2024 takeaways that like. The best small model companies are the large labs, which is not something I would have thought that the open source kind of like long tail would be like the small model.[00:12:53] swyx: Yeah, different sizes of small models we're talking about here, right? Like so small model here for Gemini is AB, [00:13:00] right? Uh, mini. We don't know what the small model size is, but yeah, it's probably in the double digits or maybe single digits, but probably double digits. The open source community has kind of focused on the one to three B size.[00:13:11] swyx: Mm-hmm . Yeah. Maybe[00:13:12] swyx: zero, maybe 0.5 B uh, that's moon dream and that is small for you then, then that's great. It makes sense that we, we have a range for small now, which is like, may, maybe one to five B. Yeah. I'll even put that at, at, at the high end. And so this includes Gemma from Gemini as well. But also includes the Apple Foundation models, which I think Apple Foundation is 3B.[00:13:32] Alessio: Yeah. No, that's great. I mean, I think in the start small just meant cheap. I think today small is actually a more nuanced discussion, you know, that people weren't really having before.[00:13:43] swyx: Yeah, we can keep going. This is a slide that I smiley disagree with Sarah. She's pointing to the scale SEAL leaderboard. I think the Researchers that I talked with at NeurIPS were kind of positive on this because basically you need private test [00:14:00] sets to prevent contamination.[00:14:02] swyx: And Scale is one of maybe three or four people this year that has really made an effort in doing a credible private test set leaderboard. Llama405B does well compared to Gemini and GPT 40. And I think that's good. I would say that. You know, it's good to have an open model that is that big, that does well on those metrics.[00:14:23] swyx: But anyone putting 405B in production will tell you, if you scroll down a little bit to the artificial analysis numbers, that it is very slow and very expensive to infer. Um, it doesn't even fit on like one node. of, uh, of H100s. Cerebras will be happy to tell you they can serve 4 or 5B on their super large chips.[00:14:42] swyx: But, um, you know, if you need to do anything custom to it, you're still kind of constrained. So, is 4 or 5B really that relevant? Like, I think most people are basically saying that they only use 4 or 5B as a teacher model to distill down to something. Even Meta is doing it. So with Lama 3. [00:15:00] 3 launched, they only launched the 70B because they use 4 or 5B to distill the 70B.[00:15:03] swyx: So I don't know if like open source is keeping up. I think they're the, the open source industrial complex is very invested in telling you that the, if the gap is narrowing, I kind of disagree. I think that the gap is widening with O1. I think there are very, very smart people trying to narrow that gap and they should.[00:15:22] swyx: I really wish them success, but you cannot use a chart that is nearing 100 in your saturation chart. And look, the distance between open source and closed source is narrowing. Of course it's going to narrow because you're near 100. This is stupid. But in metrics that matter, is open source narrowing?[00:15:38] swyx: Probably not for O1 for a while. And it's really up to the open source guys to figure out if they can match O1 or not.[00:15:46] Alessio: I think inference time compute is bad for open source just because, you know, Doc can donate the flops at training time, but he cannot donate the flops at inference time. So it's really hard to like actually keep up on that axis.[00:15:59] Alessio: Big, big business [00:16:00] model shift. So I don't know what that means for the GPU clouds. I don't know what that means for the hyperscalers, but obviously the big labs have a lot of advantage. Because, like, it's not a static artifact that you're putting the compute in. You're kind of doing that still, but then you're putting a lot of computed inference too.[00:16:17] swyx: Yeah, yeah, yeah. Um, I mean, Llama4 will be reasoning oriented. We talked with Thomas Shalom. Um, kudos for getting that episode together. That was really nice. Good, well timed. Actually, I connected with the AI meta guy, uh, at NeurIPS, and, um, yeah, we're going to coordinate something for Llama4. Yeah, yeah,[00:16:32] Alessio: and our friend, yeah.[00:16:33] Alessio: Clara Shi just joined to lead the business agent side. So I'm sure we'll have her on in the new year.[00:16:39] swyx: Yeah. So, um, my comment on, on the business model shift, this is super interesting. Apparently it is wide knowledge that OpenAI wanted more than 6. 6 billion dollars for their fundraise. They wanted to raise, you know, higher, and they did not.[00:16:51] swyx: And what that means is basically like, it's very convenient that we're not getting GPT 5, which would have been a larger pre train. We should have a lot of upfront money. And [00:17:00] instead we're, we're converting fixed costs into variable costs, right. And passing it on effectively to the customer. And it's so much easier to take margin there because you can directly attribute it to like, Oh, you're using this more.[00:17:12] swyx: Therefore you, you pay more of the cost and I'll just slap a margin in there. So like that lets you control your growth margin and like tie your. Your spend, or your sort of inference spend, accordingly. And it's just really interesting to, that this change in the sort of inference paradigm has arrived exactly at the same time that the funding environment for pre training is effectively drying up, kind of.[00:17:36] swyx: I feel like maybe the VCs are very in tune with research anyway, so like, they would have noticed this, but, um, it's just interesting.[00:17:43] Alessio: Yeah, and I was looking back at our yearly recap of last year. Yeah. And the big thing was like the mixed trial price fights, you know, and I think now it's almost like there's nowhere to go, like, you know, Gemini Flash is like basically giving it away for free.[00:17:55] Alessio: So I think this is a good way for the labs to generate more revenue and pass down [00:18:00] some of the compute to the customer. I think they're going to[00:18:02] swyx: keep going. I think that 2, will come.[00:18:05] Alessio: Yeah, I know. Totally. I mean, next year, the first thing I'm doing is signing up for Devin. Signing up for the pro chat GBT.[00:18:12] Alessio: Just to try. I just want to see what does it look like to spend a thousand dollars a month on AI?[00:18:17] swyx: Yes. Yes. I think if your, if your, your job is a, at least AI content creator or VC or, you know, someone who, whose job it is to stay on, stay on top of things, you should already be spending like a thousand dollars a month on, on stuff.[00:18:28] swyx: And then obviously easy to spend, hard to use. You have to actually use. The good thing is that actually Google lets you do a lot of stuff for free now. So like deep research. That they just launched. Uses a ton of inference and it's, it's free while it's in preview.[00:18:45] Alessio: Yeah. They need to put that in Lindy.[00:18:47] Alessio: I've been using Lindy lately. I've been a built a bunch of things once we had flow because I liked the new thing. It's pretty good. I even did a phone call assistant. Um, yeah, they just launched Lindy voice. Yeah, I think once [00:19:00] they get advanced voice mode like capability today, still like speech to text, you can kind of tell.[00:19:06] Alessio: Um, but it's good for like reservations and things like that. So I have a meeting prepper thing. And so[00:19:13] swyx: it's good. Okay. I feel like we've, we've covered a lot of stuff. Uh, I, yeah, I, you know, I think We will go over the individual, uh, talks in a separate episode. Uh, I don't want to take too much time with, uh, this stuff, but that suffice to say that there is a lot of progress in each field.[00:19:28] swyx: Uh, we covered vision. Basically this is all like the audience voting for what they wanted. And then I just invited the best people I could find in each audience, especially agents. Um, Graham, who I talked to at ICML in Vienna, he is currently still number one. It's very hard to stay on top of SweetBench.[00:19:45] swyx: OpenHand is currently still number one. switchbench full, which is the hardest one. He had very good thoughts on agents, which I, which I'll highlight for people. Everyone is saying 2025 is the year of agents, just like they said last year. And, uh, but he had [00:20:00] thoughts on like eight parts of what are the frontier problems to solve in agents.[00:20:03] swyx: And so I'll highlight that talk as well.[00:20:05] Alessio: Yeah. The number six, which is the Hacken agents learn more about the environment, has been a Super interesting to us as well, just to think through, because, yeah, how do you put an agent in an enterprise where most things in an enterprise have never been public, you know, a lot of the tooling, like the code bases and things like that.[00:20:23] Alessio: So, yeah, there's not indexing and reg. Well, yeah, but it's more like. You can't really rag things that are not documented. But people know them based on how they've been doing it. You know, so I think there's almost this like, you know, Oh, institutional knowledge. Yeah, the boring word is kind of like a business process extraction.[00:20:38] Alessio: Yeah yeah, I see. It's like, how do you actually understand how these things are done? I see. Um, and I think today the, the problem is that, Yeah, the agents are, that most people are building are good at following instruction, but are not as good as like extracting them from you. Um, so I think that will be a big unlock just to touch quickly on the Jeff Dean thing.[00:20:55] Alessio: I thought it was pretty, I mean, we'll link it in the, in the things, but. I think the main [00:21:00] focus was like, how do you use ML to optimize the systems instead of just focusing on ML to do something else? Yeah, I think speculative decoding, we had, you know, Eugene from RWKB on the podcast before, like he's doing a lot of that with Fetterless AI.[00:21:12] swyx: Everyone is. I would say it's the norm. I'm a little bit uncomfortable with how much it costs, because it does use more of the GPU per call. But because everyone is so keen on fast inference, then yeah, makes sense.[00:21:24] Alessio: Exactly. Um, yeah, but we'll link that. Obviously Jeff is great.[00:21:30] swyx: Jeff is, Jeff's talk was more, it wasn't focused on Gemini.[00:21:33] swyx: I think people got the wrong impression from my tweet. It's more about how Google approaches ML and uses ML to design systems and then systems feedback into ML. And I think this ties in with Lubna's talk.[00:21:45] Synthetic Data and Future Trends[00:21:45] swyx: on synthetic data where it's basically the story of bootstrapping of humans and AI in AI research or AI in production.[00:21:53] swyx: So her talk was on synthetic data, where like how much synthetic data has grown in 2024 in the pre training side, the post training side, [00:22:00] and the eval side. And I think Jeff then also extended it basically to chips, uh, to chip design. So he'd spend a lot of time talking about alpha chip. And most of us in the audience are like, we're not working on hardware, man.[00:22:11] swyx: Like you guys are great. TPU is great. Okay. We'll buy TPUs.[00:22:14] Alessio: And then there was the earlier talk. Yeah. But, and then we have, uh, I don't know if we're calling them essays. What are we calling these? But[00:22:23] swyx: for me, it's just like bonus for late in space supporters, because I feel like they haven't been getting anything.[00:22:29] swyx: And then I wanted a more high frequency way to write stuff. Like that one I wrote in an afternoon. I think basically we now have an answer to what Ilya saw. It's one year since. The blip. And we know what he saw in 2014. We know what he saw in 2024. We think we know what he sees in 2024. He gave some hints and then we have vague indications of what he saw in 2023.[00:22:54] swyx: So that was the Oh, and then 2016 as well, because of this lawsuit with Elon, OpenAI [00:23:00] is publishing emails from Sam's, like, his personal text messages to Siobhan, Zelis, or whatever. So, like, we have emails from Ilya saying, this is what we're seeing in OpenAI, and this is why we need to scale up GPUs. And I think it's very prescient in 2016 to write that.[00:23:16] swyx: And so, like, it is exactly, like, basically his insights. It's him and Greg, basically just kind of driving the scaling up of OpenAI, while they're still playing Dota. They're like, no, like, we see the path here.[00:23:30] Alessio: Yeah, and it's funny, yeah, they even mention, you know, we can only train on 1v1 Dota. We need to train on 5v5, and that takes too many GPUs.[00:23:37] Alessio: Yeah,[00:23:37] swyx: and at least for me, I can speak for myself, like, I didn't see the path from Dota to where we are today. I think even, maybe if you ask them, like, they wouldn't necessarily draw a straight line. Yeah,[00:23:47] Alessio: no, definitely. But I think like that was like the whole idea of almost like the RL and we talked about this with Nathan on his podcast.[00:23:55] Alessio: It's like with RL, you can get very good at specific things, but then you can't really like generalize as much. And I [00:24:00] think the language models are like the opposite, which is like, you're going to throw all this data at them and scale them up, but then you really need to drive them home on a specific task later on.[00:24:08] Alessio: And we'll talk about the open AI reinforcement, fine tuning, um, announcement too, and all of that. But yeah, I think like scale is all you need. That's kind of what Elia will be remembered for. And I think just maybe to clarify on like the pre training is over thing that people love to tweet. I think the point of the talk was like everybody, we're scaling these chips, we're scaling the compute, but like the second ingredient which is data is not scaling at the same rate.[00:24:35] Alessio: So it's not necessarily pre training is over. It's kind of like What got us here won't get us there. In his email, he predicted like 10x growth every two years or something like that. And I think maybe now it's like, you know, you can 10x the chips again, but[00:24:49] swyx: I think it's 10x per year. Was it? I don't know.[00:24:52] Alessio: Exactly. And Moore's law is like 2x. So it's like, you know, much faster than that. And yeah, I like the fossil fuel of AI [00:25:00] analogy. It's kind of like, you know, the little background tokens thing. So the OpenAI reinforcement fine tuning is basically like, instead of fine tuning on data, you fine tune on a reward model.[00:25:09] Alessio: So it's basically like, instead of being data driven, it's like task driven. And I think people have tasks to do, they don't really have a lot of data. So I'm curious to see how that changes, how many people fine tune, because I think this is what people run into. It's like, Oh, you can fine tune llama. And it's like, okay, where do I get the data?[00:25:27] Alessio: To fine tune it on, you know, so it's great that we're moving the thing. And then I really like he had this chart where like, you know, the brain mass and the body mass thing is basically like mammals that scaled linearly by brain and body size, and then humans kind of like broke off the slope. So it's almost like maybe the mammal slope is like the pre training slope.[00:25:46] Alessio: And then the post training slope is like the, the human one.[00:25:49] swyx: Yeah. I wonder what the. I mean, we'll know in 10 years, but I wonder what the y axis is for, for Ilya's SSI. We'll try to get them on.[00:25:57] Alessio: Ilya, if you're listening, you're [00:26:00] welcome here. Yeah, and then he had, you know, what comes next, like agent, synthetic data, inference, compute, I thought all of that was like that.[00:26:05] Alessio: I don't[00:26:05] swyx: think he was dropping any alpha there. Yeah, yeah, yeah.[00:26:07] Alessio: Yeah. Any other new reps? Highlights?[00:26:10] swyx: I think that there was comparatively a lot more work. Oh, by the way, I need to plug that, uh, my friend Yi made this, like, little nice paper. Yeah, that was really[00:26:20] swyx: nice.[00:26:20] swyx: Uh, of, uh, of, like, all the, he's, she called it must read papers of 2024.[00:26:26] swyx: So I laid out some of these at NeurIPS, and it was just gone. Like, everyone just picked it up. Because people are dying for, like, little guidance and visualizations And so, uh, I thought it was really super nice that we got there.[00:26:38] Alessio: Should we do a late in space book for each year? Uh, I thought about it. For each year we should.[00:26:42] Alessio: Coffee table book. Yeah. Yeah. Okay. Put it in the will. Hi, Will. By the way, we haven't introduced you. He's our new, you know, general organist, Jamie. You need to[00:26:52] swyx: pull up more things. One thing I saw that, uh, Okay, one fun one, and then one [00:27:00] more general one. So the fun one is this paper on agent collusion. This is a paper on steganography.[00:27:06] swyx: This is secret collusion among AI agents, multi agent deception via steganography. I tried to go to NeurIPS in order to find these kinds of papers because the real reason Like NeurIPS this year has a lottery system. A lot of people actually even go and don't buy tickets because they just go and attend the side events.[00:27:22] swyx: And then also the people who go and end up crowding around the most popular papers, which you already know and already read them before you showed up to NeurIPS. So the only reason you go there is to talk to the paper authors, but there's like something like 10, 000 other. All these papers out there that, you know, are just people's work that they, that they did on the air and they failed to get attention for one reason or another.[00:27:42] swyx: And this was one of them. Uh, it was like all the way at the back. And this is a deep mind paper that actually focuses on collusion between AI agents, uh, by hiding messages in the text that they generate. Uh, so that's what steganography is. So a very simple example would be the first letter of every word.[00:27:57] swyx: If you Pick that out, you know, and the code sends a [00:28:00] different message than that. But something I've always emphasized is to LLMs, we read left to right. LLMs can read up, down, sideways, you know, in random character order. And it's the same to them as it is to us. So if we were ever to get You know, self motivated, underlined LLMs that we're trying to collaborate to take over the planet.[00:28:19] swyx: This would be how they do it. They spread messages among us in the messages that we generate. And he developed a scaling law for that. So he marked, I'm showing it on screen right now, the emergence of this phenomenon. Basically, for example, for Cypher encoding, GPT 2, Lama 2, mixed trial, GPT 3. 5, zero capabilities, and sudden 4.[00:28:40] swyx: And this is the kind of Jason Wei type emergence properties that people kind of look for. I think what made this paper stand out as well, so he developed the benchmark for steganography collusion, and he also focused on shelling point collusion, which is very low coordination. For agreeing on a decoding encoding format, you kind of need to have some [00:29:00] agreement on that.[00:29:00] swyx: But, but shelling point means like very, very low or almost no coordination. So for example, if I, if I ask someone, if the only message I give you is meet me in New York and you're not aware. Or when you would probably meet me at Grand Central Station. That is the Grand Central Station is a shelling point.[00:29:16] swyx: And it's probably somewhere, somewhere during the day. That is the shelling point of New York is Grand Central. To that extent, shelling points for steganography are things like the, the, the common decoding methods that we talked about. It will be interesting at some point in the future when we are worried about alignment.[00:29:30] swyx: It is not interesting today, but it's interesting that DeepMind is already thinking about this.[00:29:36] Alessio: I think that's like one of the hardest things about NeurIPS. It's like the long tail. I[00:29:41] swyx: found a pricing guy. I'm going to feature him on the podcast. Basically, this guy from NVIDIA worked out the optimal pricing for language models.[00:29:51] swyx: It's basically an econometrics paper at NeurIPS, where everyone else is talking about GPUs. And the guy with the GPUs is[00:29:57] Alessio: talking[00:29:57] swyx: about economics instead. [00:30:00] That was the sort of fun one. So the focus I saw is that model papers at NeurIPS are kind of dead. No one really presents models anymore. It's just data sets.[00:30:12] swyx: This is all the grad students are working on. So like there was a data sets track and then I was looking around like, I was like, you don't need a data sets track because every paper is a data sets paper. And so data sets and benchmarks, they're kind of flip sides of the same thing. So Yeah. Cool. Yeah, if you're a grad student, you're a GPU boy, you kind of work on that.[00:30:30] swyx: And then the, the sort of big model that people walk around and pick the ones that they like, and then they use it in their models. And that's, that's kind of how it develops. I, I feel like, um, like, like you didn't last year, you had people like Hao Tian who worked on Lava, which is take Lama and add Vision.[00:30:47] swyx: And then obviously actually I hired him and he added Vision to Grok. Now he's the Vision Grok guy. This year, I don't think there was any of those.[00:30:55] Alessio: What were the most popular, like, orals? Last year it was like the [00:31:00] Mixed Monarch, I think, was like the most attended. Yeah, uh, I need to look it up. Yeah, I mean, if nothing comes to mind, that's also kind of like an answer in a way.[00:31:10] Alessio: But I think last year there was a lot of interest in, like, furthering models and, like, different architectures and all of that.[00:31:16] swyx: I will say that I felt the orals, oral picks this year were not very good. Either that or maybe it's just a So that's the highlight of how I have changed in terms of how I view papers.[00:31:29] swyx: So like, in my estimation, two of the best papers in this year for datasets or data comp and refined web or fine web. These are two actually industrially used papers, not highlighted for a while. I think DCLM got the spotlight, FineWeb didn't even get the spotlight. So like, it's just that the picks were different.[00:31:48] swyx: But one thing that does get a lot of play that a lot of people are debating is the role that's scheduled. This is the schedule free optimizer paper from Meta from Aaron DeFazio. And this [00:32:00] year in the ML community, there's been a lot of chat about shampoo, soap, all the bathroom amenities for optimizing your learning rates.[00:32:08] swyx: And, uh, most people at the big labs are. Who I asked about this, um, say that it's cute, but it's not something that matters. I don't know, but it's something that was discussed and very, very popular. 4Wars[00:32:19] Alessio: of AI recap maybe, just quickly. Um, where do you want to start? Data?[00:32:26] swyx: So to remind people, this is the 4Wars piece that we did as one of our earlier recaps of this year.[00:32:31] swyx: And the belligerents are on the left, journalists, writers, artists, anyone who owns IP basically, New York Times, Stack Overflow, Reddit, Getty, Sarah Silverman, George RR Martin. Yeah, and I think this year we can add Scarlett Johansson to that side of the fence. So anyone suing, open the eye, basically. I actually wanted to get a snapshot of all the lawsuits.[00:32:52] swyx: I'm sure some lawyer can do it. That's the data quality war. On the right hand side, we have the synthetic data people, and I think we talked about Lumna's talk, you know, [00:33:00] really showing how much synthetic data has come along this year. I think there was a bit of a fight between scale. ai and the synthetic data community, because scale.[00:33:09] swyx: ai published a paper saying that synthetic data doesn't work. Surprise, surprise, scale. ai is the leading vendor of non synthetic data. Only[00:33:17] Alessio: cage free annotated data is useful.[00:33:21] swyx: So I think there's some debate going on there, but I don't think it's much debate anymore that at least synthetic data, for the reasons that are blessed in Luna's talk, Makes sense.[00:33:32] swyx: I don't know if you have any perspectives there.[00:33:34] Alessio: I think, again, going back to the reinforcement fine tuning, I think that will change a little bit how people think about it. I think today people mostly use synthetic data, yeah, for distillation and kind of like fine tuning a smaller model from like a larger model.[00:33:46] Alessio: I'm not super aware of how the frontier labs use it outside of like the rephrase, the web thing that Apple also did. But yeah, I think it'll be. Useful. I think like whether or not that gets us the big [00:34:00] next step, I think that's maybe like TBD, you know, I think people love talking about data because it's like a GPU poor, you know, I think, uh, synthetic data is like something that people can do, you know, so they feel more opinionated about it compared to, yeah, the optimizers stuff, which is like,[00:34:17] swyx: they don't[00:34:17] Alessio: really work[00:34:18] swyx: on.[00:34:18] swyx: I think that there is an angle to the reasoning synthetic data. So this year, we covered in the paper club, the star series of papers. So that's star, Q star, V star. It basically helps you to synthesize reasoning steps, or at least distill reasoning steps from a verifier. And if you look at the OpenAI RFT, API that they released, or that they announced, basically they're asking you to submit graders, or they choose from a preset list of graders.[00:34:49] swyx: Basically It feels like a way to create valid synthetic data for them to fine tune their reasoning paths on. Um, so I think that is another angle where it starts to make sense. And [00:35:00] so like, it's very funny that basically all the data quality wars between Let's say the music industry or like the newspaper publishing industry or the textbooks industry on the big labs.[00:35:11] swyx: It's all of the pre training era. And then like the new era, like the reasoning era, like nobody has any problem with all the reasoning, especially because it's all like sort of math and science oriented with, with very reasonable graders. I think the more interesting next step is how does it generalize beyond STEM?[00:35:27] swyx: We've been using O1 for And I would say like for summarization and creative writing and instruction following, I think it's underrated. I started using O1 in our intro songs before we killed the intro songs, but it's very good at writing lyrics. You know, I can actually say like, I think one of the O1 pro demos.[00:35:46] swyx: All of these things that Noam was showing was that, you know, you can write an entire paragraph or three paragraphs without using the letter A, right?[00:35:53] Creative Writing with AI[00:35:53] swyx: So like, like literally just anything instead of token, like not even token level, character level manipulation and [00:36:00] counting and instruction following. It's, uh, it's very, very strong.[00:36:02] swyx: And so no surprises when I ask it to rhyme, uh, and to, to create song lyrics, it's going to do that very much better than in previous models. So I think it's underrated for creative writing.[00:36:11] Alessio: Yeah.[00:36:12] Legal and Ethical Issues in AI[00:36:12] Alessio: What do you think is the rationale that they're going to have in court when they don't show you the thinking traces of O1, but then they want us to, like, they're getting sued for using other publishers data, you know, but then on their end, they're like, well, you shouldn't be using my data to then train your model.[00:36:29] Alessio: So I'm curious to see how that kind of comes. Yeah, I mean, OPA has[00:36:32] swyx: many ways to publish, to punish people without bringing, taking them to court. Already banned ByteDance for distilling their, their info. And so anyone caught distilling the chain of thought will be just disallowed to continue on, on, on the API.[00:36:44] swyx: And it's fine. It's no big deal. Like, I don't even think that's an issue at all, just because the chain of thoughts are pretty well hidden. Like you have to work very, very hard to, to get it to leak. And then even when it leaks the chain of thought, you don't know if it's, if it's [00:37:00] The bigger concern is actually that there's not that much IP hiding behind it, that Cosign, which we talked about, we talked to him on Dev Day, can just fine tune 4.[00:37:13] swyx: 0 to beat 0. 1 Cloud SONET so far is beating O1 on coding tasks without, at least O1 preview, without being a reasoning model, same for Gemini Pro or Gemini 2. 0. So like, how much is reasoning important? How much of a moat is there in this, like, All of these are proprietary sort of training data that they've presumably accomplished.[00:37:34] swyx: Because even DeepSeek was able to do it. And they had, you know, two months notice to do this, to do R1. So, it's actually unclear how much moat there is. Obviously, you know, if you talk to the Strawberry team, they'll be like, yeah, I mean, we spent the last two years doing this. So, we don't know. And it's going to be Interesting because there'll be a lot of noise from people who say they have inference time compute and actually don't because they just have fancy chain of thought.[00:38:00][00:38:00] swyx: And then there's other people who actually do have very good chain of thought. And you will not see them on the same level as OpenAI because OpenAI has invested a lot in building up the mythology of their team. Um, which makes sense. Like the real answer is somewhere in between.[00:38:13] Alessio: Yeah, I think that's kind of like the main data war story developing.[00:38:18] The Data War: GPU Poor vs. GPU Rich[00:38:18] Alessio: GPU poor versus GPU rich. Yeah. Where do you think we are? I think there was, again, going back to like the small model thing, there was like a time in which the GPU poor were kind of like the rebel faction working on like these models that were like open and small and cheap. And I think today people don't really care as much about GPUs anymore.[00:38:37] Alessio: You also see it in the price of the GPUs. Like, you know, that market is kind of like plummeted because there's people don't want to be, they want to be GPU free. They don't even want to be poor. They just want to be, you know, completely without them. Yeah. How do you think about this war? You[00:38:52] swyx: can tell me about this, but like, I feel like the, the appetite for GPU rich startups, like the, you know, the, the funding plan is we will raise 60 million and [00:39:00] we'll give 50 of that to NVIDIA.[00:39:01] swyx: That is gone, right? Like, no one's, no one's pitching that. This was literally the plan, the exact plan of like, I can name like four or five startups, you know, this time last year. So yeah, GPU rich startups gone.[00:39:12] The Rise of GPU Ultra Rich[00:39:12] swyx: But I think like, The GPU ultra rich, the GPU ultra high net worth is still going. So, um, now we're, you know, we had Leopold's essay on the trillion dollar cluster.[00:39:23] swyx: We're not quite there yet. We have multiple labs, um, you know, XAI very famously, you know, Jensen Huang praising them for being. Best boy number one in spinning up 100, 000 GPU cluster in like 12 days or something. So likewise at Meta, likewise at OpenAI, likewise at the other labs as well. So like the GPU ultra rich are going to keep doing that because I think partially it's an article of faith now that you just need it.[00:39:46] swyx: Like you don't even know what it's going to, what you're going to use it for. You just, you just need it. And it makes sense that if, especially if we're going into. More researchy territory than we are. So let's say 2020 to 2023 was [00:40:00] let's scale big models territory because we had GPT 3 in 2020 and we were like, okay, we'll go from 1.[00:40:05] swyx: 75b to 1. 8b, 1. 8t. And that was GPT 3 to GPT 4. Okay, that's done. As far as everyone is concerned, Opus 3. 5 is not coming out, GPT 4. 5 is not coming out, and Gemini 2, we don't have Pro, whatever. We've hit that wall. Maybe I'll call it the 2 trillion perimeter wall. We're not going to 10 trillion. No one thinks it's a good idea, at least from training costs, from the amount of data, or at least the inference.[00:40:36] swyx: Would you pay 10x the price of GPT Probably not. Like, like you want something else that, that is at least more useful. So it makes sense that people are pivoting in terms of their inference paradigm.[00:40:47] Emerging Trends in AI Models[00:40:47] swyx: And so when it's more researchy, then you actually need more just general purpose compute to mess around with, uh, at the exact same time that production deployments of the old, the previous paradigm is still ramping up,[00:40:58] swyx: um,[00:40:58] swyx: uh, pretty aggressively.[00:40:59] swyx: So [00:41:00] it makes sense that the GPU rich are growing. We have now interviewed both together and fireworks and replicates. Uh, we haven't done any scale yet. But I think Amazon, maybe kind of a sleeper one, Amazon, in a sense of like they, at reInvent, I wasn't expecting them to do so well, but they are now a foundation model lab.[00:41:18] swyx: It's kind of interesting. Um, I think, uh, you know, David went over there and started just creating models.[00:41:25] Alessio: Yeah, I mean, that's the power of prepaid contracts. I think like a lot of AWS customers, you know, they do this big reserve instance contracts and now they got to use their money. That's why so many startups.[00:41:37] Alessio: Get bought through the AWS marketplace so they can kind of bundle them together and prefer pricing.[00:41:42] swyx: Okay, so maybe GPU super rich doing very well, GPU middle class dead, and then GPU[00:41:48] Alessio: poor. I mean, my thing is like, everybody should just be GPU rich. There shouldn't really be, even the GPU poorest, it's like, does it really make sense to be GPU poor?[00:41:57] Alessio: Like, if you're GPU poor, you should just use the [00:42:00] cloud. Yes, you know, and I think there might be a future once we kind of like figure out what the size and shape of these models is where like the tiny box and these things come to fruition where like you can be GPU poor at home. But I think today is like, why are you working so hard to like get these models to run on like very small clusters where it's like, It's so cheap to run them.[00:42:21] Alessio: Yeah, yeah,[00:42:22] swyx: yeah. I think mostly people think it's cool. People think it's a stepping stone to scaling up. So they aspire to be GPU rich one day and they're working on new methods. Like news research, like probably the most deep tech thing they've done this year is Distro or whatever the new name is.[00:42:38] swyx: There's a lot of interest in heterogeneous computing, distributed computing. I tend generally to de emphasize that historically, but it may be coming to a time where it is starting to be relevant. I don't know. You know, SF compute launched their compute marketplace this year, and like, who's really using that?[00:42:53] swyx: Like, it's a bunch of small clusters, disparate types of compute, and if you can make that [00:43:00] useful, then that will be very beneficial to the broader community, but maybe still not the source of frontier models. It's just going to be a second tier of compute that is unlocked for people, and that's fine. But yeah, I mean, I think this year, I would say a lot more on device, We are, I now have Apple intelligence on my phone.[00:43:19] swyx: Doesn't do anything apart from summarize my notifications. But still, not bad. Like, it's multi modal.[00:43:25] Alessio: Yeah, the notification summaries are so and so in my experience.[00:43:29] swyx: Yeah, but they add, they add juice to life. And then, um, Chrome Nano, uh, Gemini Nano is coming out in Chrome. Uh, they're still feature flagged, but you can, you can try it now if you, if you use the, uh, the alpha.[00:43:40] swyx: And so, like, I, I think, like, you know, We're getting the sort of GPU poor version of a lot of these things coming out, and I think it's like quite useful. Like Windows as well, rolling out RWKB in sort of every Windows department is super cool. And I think the last thing that I never put in this GPU poor war, that I think I should now, [00:44:00] is the number of startups that are GPU poor but still scaling very well, as sort of wrappers on top of either a foundation model lab, or GPU Cloud.[00:44:10] swyx: GPU Cloud, it would be Suno. Suno, Ramp has rated as one of the top ranked, fastest growing startups of the year. Um, I think the last public number is like zero to 20 million this year in ARR and Suno runs on Moto. So Suno itself is not GPU rich, but they're just doing the training on, on Moto, uh, who we've also talked to on, on the podcast.[00:44:31] swyx: The other one would be Bolt, straight cloud wrapper. And, and, um, Again, another, now they've announced 20 million ARR, which is another step up from our 8 million that we put on the title. So yeah, I mean, it's crazy that all these GPU pores are finding a way while the GPU riches are also finding a way. And then the only failures, I kind of call this the GPU smiling curve, where the edges do well, because you're either close to the machines, and you're like [00:45:00] number one on the machines, or you're like close to the customers, and you're number one on the customer side.[00:45:03] swyx: And the people who are in the middle. Inflection, um, character, didn't do that great. I think character did the best of all of them. Like, you have a note in here that we apparently said that character's price tag was[00:45:15] Alessio: 1B.[00:45:15] swyx: Did I say that?[00:45:16] Alessio: Yeah. You said Google should just buy them for 1B. I thought it was a crazy number.[00:45:20] Alessio: Then they paid 2. 7 billion. I mean, for like,[00:45:22] swyx: yeah.[00:45:22] Alessio: What do you pay for node? Like, I don't know what the game world was like. Maybe the starting price was 1B. I mean, whatever it was, it worked out for everybody involved.[00:45:31] The Multi-Modality War[00:45:31] Alessio: Multimodality war. And this one, we never had text to video in the first version, which now is the hottest.[00:45:37] swyx: Yeah, I would say it's a subset of image, but yes.[00:45:40] Alessio: Yeah, well, but I think at the time it wasn't really something people were doing, and now we had VO2 just came out yesterday. Uh, Sora was released last month, last week. I've not tried Sora, because the day that I tried, it wasn't, yeah. I[00:45:54] swyx: think it's generally available now, you can go to Sora.[00:45:56] swyx: com and try it. Yeah, they had[00:45:58] Alessio: the outage. Which I [00:46:00] think also played a part into it. Small things. Yeah. What's the other model that you posted today that was on Replicate? Video or OneLive?[00:46:08] swyx: Yeah. Very, very nondescript name, but it is from Minimax, which I think is a Chinese lab. The Chinese labs do surprisingly well at the video models.[00:46:20] swyx: I'm not sure it's actually Chinese. I don't know. Hold me up to that. Yep. China. It's good. Yeah, the Chinese love video. What can I say? They have a lot of training data for video. Or a more relaxed regulatory environment.[00:46:37] Alessio: Uh, well, sure, in some way. Yeah, I don't think there's much else there. I think like, you know, on the image side, I think it's still open.[00:46:45] Alessio: Yeah, I mean,[00:46:46] swyx: 11labs is now a unicorn. So basically, what is multi modality war? Multi modality war is, do you specialize in a single modality, right? Or do you have GodModel that does all the modalities? So this is [00:47:00] definitely still going, in a sense of 11 labs, you know, now Unicorn, PicoLabs doing well, they launched Pico 2.[00:47:06] swyx: 0 recently, HeyGen, I think has reached 100 million ARR, Assembly, I don't know, but they have billboards all over the place, so I assume they're doing very, very well. So these are all specialist models, specialist models and specialist startups. And then there's the big labs who are doing the sort of all in one play.[00:47:24] swyx: And then here I would highlight Gemini 2 for having native image output. Have you seen the demos? Um, yeah, it's, it's hard to keep up. Literally they launched this last week and a shout out to Paige Bailey, who came to the Latent Space event to demo on the day of launch. And she wasn't prepared. She was just like, I'm just going to show you.[00:47:43] swyx: So they have voice. They have, you know, obviously image input, and then they obviously can code gen and all that. But the new one that OpenAI and Meta both have but they haven't launched yet is image output. So you can literally, um, I think their demo video was that you put in an image of a [00:48:00] car, and you ask for minor modifications to that car.[00:48:02] swyx: They can generate you that modification exactly as you asked. So there's no need for the stable diffusion or comfy UI workflow of like mask here and then like infill there in paint there and all that, all that stuff. This is small model nonsense. Big model people are like, huh, we got you in as everything in the transformer.[00:48:21] swyx: This is the multimodality war, which is, do you, do you bet on the God model or do you string together a whole bunch of, uh, Small models like a, like a chump. Yeah,[00:48:29] Alessio: I don't know, man. Yeah, that would be interesting. I mean, obviously I use Midjourney for all of our thumbnails. Um, they've been doing a ton on the product, I would say.[00:48:38] Alessio: They launched a new Midjourney editor thing. They've been doing a ton. Because I think, yeah, the motto is kind of like, Maybe, you know, people say black forest, the black forest models are better than mid journey on a pixel by pixel basis. But I think when you put it, put it together, have you tried[00:48:53] swyx: the same problems on black forest?[00:48:55] Alessio: Yes. But the problem is just like, you know, on black forest, it generates one image. And then it's like, you got to [00:49:00] regenerate. You don't have all these like UI things. Like what I do, no, but it's like time issue, you know, it's like a mid[00:49:06] swyx: journey. Call the API four times.[00:49:08] Alessio: No, but then there's no like variate.[00:49:10] Alessio: Like the good thing about mid journey is like, you just go in there and you're cooking. There's a lot of stuff that just makes it really easy. And I think people underestimate that. Like, it's not really a skill issue, because I'm paying mid journey, so it's a Black Forest skill issue, because I'm not paying them, you know?[00:49:24] Alessio: Yeah,[00:49:25] swyx: so, okay, so, uh, this is a UX thing, right? Like, you, you, you understand that, at least, we think that Black Forest should be able to do all that stuff. I will also shout out, ReCraft has come out, uh, on top of the image arena that, uh, artificial analysis has done, has apparently, uh, Flux's place. Is this still true?[00:49:41] swyx: So, Artificial Analysis is now a company. I highlighted them I think in one of the early AI Newses of the year. And they have launched a whole bunch of arenas. So, they're trying to take on LM Arena, Anastasios and crew. And they have an image arena. Oh yeah, Recraft v3 is now beating Flux 1. 1. Which is very surprising [00:50:00] because Flux And Black Forest Labs are the old stable diffusion crew who left stability after, um, the management issues.[00:50:06] swyx: So Recurve has come from nowhere to be the top image model. Uh, very, very strange. I would also highlight that Grok has now launched Aurora, which is, it's very interesting dynamics between Grok and Black Forest Labs because Grok's images were originally launched, uh, in partnership with Black Forest Labs as a, as a thin wrapper.[00:50:24] swyx: And then Grok was like, no, we'll make our own. And so they've made their own. I don't know, there are no APIs or benchmarks about it. They just announced it. So yeah, that's the multi modality war. I would say that so far, the small model, the dedicated model people are winning, because they are just focused on their tasks.[00:50:42] swyx: But the big model, People are always catching up. And the moment I saw the Gemini 2 demo of image editing, where I can put in an image and just request it and it does, that's how AI should work. Not like a whole bunch of complicated steps. So it really is something. And I think one frontier that we haven't [00:51:00] seen this year, like obviously video has done very well, and it will continue to grow.[00:51:03] swyx: You know, we only have Sora Turbo today, but at some point we'll get full Sora. Oh, at least the Hollywood Labs will get Fulsora. We haven't seen video to audio, or video synced to audio. And so the researchers that I talked to are already starting to talk about that as the next frontier. But there's still maybe like five more years of video left to actually be Soda.[00:51:23] swyx: I would say that Gemini's approach Compared to OpenAI, Gemini seems, or DeepMind's approach to video seems a lot more fully fledged than OpenAI. Because if you look at the ICML recap that I published that so far nobody has listened to, um, that people have listened to it. It's just a different, definitely different audience.[00:51:43] swyx: It's only seven hours long. Why are people not listening? It's like everything in Uh, so, so DeepMind has, is working on Genie. They also launched Genie 2 and VideoPoet. So, like, they have maybe four years advantage on world modeling that OpenAI does not have. Because OpenAI basically only started [00:52:00] Diffusion Transformers last year, you know, when they hired, uh, Bill Peebles.[00:52:03] swyx: So, DeepMind has, has a bit of advantage here, I would say, in, in, in showing, like, the reason that VO2, while one, They cherry pick their videos. So obviously it looks better than Sora, but the reason I would believe that VO2, uh, when it's fully launched will do very well is because they have all this background work in video that they've done for years.[00:52:22] swyx: Like, like last year's NeurIPS, I already was interviewing some of their video people. I forget their model name, but for, for people who are dedicated fans, they can go to NeurIPS 2023 and see, see that paper.[00:52:32] Alessio: And then last but not least, the LLMOS. We renamed it to Ragops, formerly known as[00:52:39] swyx: Ragops War. I put the latest chart on the Braintrust episode.[00:52:43] swyx: I think I'm going to separate these essays from the episode notes. So the reason I used to do that, by the way, is because I wanted to show up on Hacker News. I wanted the podcast to show up on Hacker News. So I always put an essay inside of there because Hacker News people like to read and not listen.[00:52:58] Alessio: So episode essays,[00:52:59] swyx: I remember [00:53:00] purchasing them separately. You say Lanchain Llama Index is still growing.[00:53:03] Alessio: Yeah, so I looked at the PyPy stats, you know. I don't care about stars. On PyPy you see Do you want to share your screen? Yes. I prefer to look at actual downloads, not at stars on GitHub. So if you look at, you know, Lanchain still growing.[00:53:20] Alessio: These are the last six months. Llama Index still growing. What I've basically seen is like things that, One, obviously these things have A commercial product. So there's like people buying this and sticking with it versus kind of hopping in between things versus, you know, for example, crew AI, not really growing as much.[00:53:38] Alessio: The stars are growing. If you look on GitHub, like the stars are growing, but kind of like the usage is kind of like flat. In the last six months, have they done some[00:53:4

god ceo new york spotify amazon time world ai europe google china apple vision pr voice future speaking san francisco new york times phd video thinking chinese simple data predictions elon musk impact surprise iphone chatgpt code legal tesla reflecting memory ga reddit discord busy cloud lgbt flash stem honestly ab pros jeff bezos windows excited researchers ip lower unicorns tackling sort survey insane tier cto whispers applications vc f1 gemini openai doc seal signing fireworks genie academic sf nvidia organizing ux api davos assembly frontier chrome gpt makes scarlett johansson ui aws turbo mm soda bash mosaic ml lama dropbox github creative writing drafting reinvent canvas 1b apis bolt ruler lava exact stripe pico hundred dev flux strawberry vm vcs sander wwdc 200k llm bt arr sora taiwanese opus moto sam altman gartner anthropic google docs assumption nemo blackwell parting google drive gpu grok sombra agi ramp opa tbd 3b 5b elia elo perplexity estimates gnome midjourney bytedance leopold rag ciso gpus dota haiku dx coursera sarah silverman sonnets deepmind ilya cypher getty george rr martin quill sdks cobalt future trends noam v2 alessio ttc sheesh lms satya r1 xai 8b suno ssi mcp veo stack overflow vo2 rl emerging trends mistral itc theoretically gpts replicate sota jensen huang yi black forest inflection databricks aitor graphql brain trust ai models chinchillas adept nosql grand central grand central station hacker news hacken zep ethical issues ai news cosign claud gpc tpu distro lubna heygen o3 neo4j cerebras autogpt gbt minimax gpd o1 jeremy howard 70b quent langchain 400b gradients exa neurips loras jeff dean 128k gemini pro elos code interpreter ai winter john franco icml lstm r1s aws reinvent muser latent space pypy nova pro dan gross paige bailey noam brown quiet capital john frankel

EP 54. 深度对谈顶尖AI开源项目：大模型开源生态, Agent 与中国力量

OnBoard!

Play Episode Listen Later Dec 16, 2024 199:06

聊到生成式AI的发展，开源绝对是最关键的话题之一。这次的嘉宾，可以说涵盖了大模型开源领域最值得关注的公司，真的是黄金阵容！首先跟大家汇报一下，上周日我们在北京举办的 OnBoard! 第一次线下听友会真是超预期！开放报名4天就250多人报名，周日从上午9点到下午3点，从机器人到AI，创业投资和软件出海，100人的场地，直到最后都几乎座无虚席！真的是非常感谢大家的支持~ Hello World, who is OnBoard!? 回到这一期播客，我们将深入探讨大模型的开源生态。在生成式AI飞速发展的一年多时间里，开源无疑是一个不可忽视的话题。开源模型的迅猛发展，从 Meta 的 Llama 3 到 Mistral 的最新模型，它们对闭源大模型如 GPT4 的追赶，不仅令人惊艳，更加速了 AI 场景下产品的实际应用。而围绕大模型的生态系统，从推理加速到开发工具，再到智能代理，技术栈的丰富程度，虽然已经孕育出了像 Langchain 这样的领军企业，但这一切似乎只是冰山一角。特别值得一提的是，随着阿里千问系列、Deepseek、以及 Yi 等中国团队主导的模型在国际舞台上崭露头角，我们不禁思考，除了模仿和追赶，中国在大模型领域的发展是否还有更多值得我们关注和自豪的成就。今天，Monica 有幸邀请到了几位极具代表性的重磅嘉宾，来自 Huggingface 的开源老兵，有通义千问 Qwen 的开源负责人（他也是 Agent 领域最受关注的项目 OpenDevin 核心成员），还有最具国际影响力的开源项目 vLLM 主导人。真是涵盖了大模型开源生态的各个领域的最一线视角！嘉宾们都太宝藏了，我们的话题延伸到大模型的各个方面，录了近4个小时！我们前半部分聊了很多infra的创新，以及最近很火的、以OpenDevin 为代表的软件开发agent 背后的技术和生态等话题。下半部分，我们回到大模型开源的主题，畅谈了：底层基础大模型的开源闭源生态，未来可能有怎样的演进？开源模型商业化跟过去我们在大数据时代看到的databricks 之类开源商业模式有哪些异同？如何做一个有国际影响力的开源项目？嘉宾介绍 Tiezhen Wang, Huggingface 工程师，他可以说是中国与世界开源 AI 生态的桥梁，更是从 Google TensorFlow 时代到 Huggingface 早期员工，对中国和世界的开源 AI 生态都有极深的洞察。 Junyang Lin, 通义千问开源负责人，作为 Qwen 在全球开源社区的主要代言人，他不仅见证了开源的发展历程，还是目前备受瞩目的 Agent 开源项目 OpenDevin 的核心团队成员。李卓翰，UC Berkeley PhD，他所主导的项目更是大名鼎鼎，就是已经成为行业标准的大模型推理框架 vLLM！他所在的 Sky Lab 被誉为开源基础设施的摇篮，从估值百亿美元的 Databricks 到 Anyscale（开源计算框架 Ray 的商业化公司）。他还深度参与了 Chat Arena, Vicuna 等多个国际知名开源项目，对大模型周边生态和 infra 的不仅有国际一线经验，更是有很多有技术理想的干货！ OnBoard! 主持：Monica：美元VC投资人，前 AWS 硅谷团队+ AI 创业公司打工人，公众号M小姐研习录 (ID: MissMStudy) 主理人 | 即刻：莫妮卡同学还有数据、评测等等大模型领域的核心话题，真的非常全面，又不失一线从业者的深度。索性就不分成两部分了，大家可以对着 show notes 里面的时间戳，直接跳转到你感兴趣的话题（虽然我觉得每个话题都很好！）介绍了这么多，还要声明一下，节目里面重点聊到的开源社区 Huggingface，还有几个开源的项目，包括阿里千问、OpenDevin, Deepseek, 零一万物的 Yi，vLLM 等，都没有收取任何广告，完全是嘉宾走心分享，全程无广！当然，如果你们或者其他AI公司考虑赞助一下我们用爱发电的播客，我们当然也是欢迎的！三小时硬核马拉松开始，enjoy! 嘉宾介绍我们都聊了什么 05:28 嘉宾自我介绍，有意思的开源 AI 项目 18:37 vLLM 如何开始的，如何成为全球顶尖项目，为什么我们需要一个大模型推理框架？ 30:24 Agent framework: OpenDevin 这样的负责 agent 会带来怎样的推理挑战？ 40:37 做好一个编程 Agent，还需要哪些新的工具？多模态会带来怎样的变化？ 56:16 我们需要怎样的 Agent Framework？为什么最适合开源社区来做？Framework 会收敛吗？ 67:46 什么是 Crew AI? 如何看待 Multi-agent 架构？ 73:11 借鉴前端框架的发展历史，如何理解一个框架如何成为行业标准？ 77:54 Huggingface 上开源LLM现状，过去一年多有哪些重要进展？有哪些不同的开源方式？泽娜要给你看待一个开源模型的流行程度？ 94:27 如何理解不同架构的开源大模型生态？Qwen 如何通过架构演进打造更好的开源生态？ 104:59 中国的大模型开源项目有哪些创新？大模型架构有哪些变化？ 112:17 为什么说新的模型架构可能会带来商业化的新机会？我们能从以前的开源商业化中学到什么？ 119:22 我们看到现有大模型架构的天花板了吗？什么是一个新的架构？ 128:03 Zhuohan 从参与最早的开源 LLM 之一 Vicuna 的经历学到什么？学术界和业界在大模型生态上如何分工？ 140:48 用于大模型的数据集领域有哪些值得关注的进展？ 149:42 Mistral 为什么这么快爆火？打造一流国际开源项目有什么可借鉴的经验？vLLM 有什么道和术上的心得？ 166:13 Chatbot Arena 是如何开始的？为什么模型的评测那么重要？还有哪些挑战和可能的进展？ 180:49 Zhuohan 对于 vLLM 商业化方式有什么思考？未来推理成本还有哪些下降空间？ 188:17 快问快答：过去一年生成式AI发展有什么超出预期和不及预期的地方？未来还有什么值得期待？我们提到的公司和重点名词 Qwen⁠, ⁠Qwen-2⁠ OpenDevin: ⁠opendevin.github.io⁠ vLLM: ⁠github.com⁠ ⁠Yi (Github)⁠, ⁠零一万物⁠ Chatbot Arena: ⁠huggingface.co⁠ AutoGPT: ⁠github.com⁠ crew AI: ⁠www.crewai.com⁠ autoAWQ: ⁠github.com⁠ LLM.c: ⁠github.com⁠ Flash attention: ⁠github.com⁠ Continuous batching：一种数据处理技术，用于将连续的数据流分批处理，以提高效率和可扩展性。 KV cache：键值对缓存，一种存储结构，通过键快速访问数据值，常用于提高数据检索速度。 Page attention：页面注意力机制，一种在处理长文本时，使模型集中注意力于当前页面或段落的技术。 Quantization：量化，将数据表示的精度降低到更少的比特数，以减少模型大小和提高计算效率。 ⁠Direct Preference Optimization (DPO)⁠: Your Language Model is Secretly a Reward Model Google Gemini: ⁠deepmind.google⁠ Adept: ⁠www.adept.ai⁠ MetaGPT: ⁠github.com⁠ ⁠Dolphin⁠an open-source and uncensored, and commercially licensed dataset and series of instruct-tuned language models based on Microsoft's Orca paper Common crawl: ⁠commoncrawl.org⁠ Tiezhen 的报告：⁠Booming Open Source Chinese-Speaking LLMs: A Closer Look⁠, ⁠Slides⁠ ⁠通义千问一周年，开源狂飙路上的抉择与思考｜魔搭深度访谈⁠ ⁠阿里林俊旸：大模型对很多人来说不够用，打造多模态Agent是关键 | 中国AIGC产业峰会⁠ 欢迎关注M小姐的微信公众号，了解更多中美软件、AI与创业投资的干货内容！ M小姐研习录 (ID: MissMStudy) 喜欢 OnBoard! 的话，也可以点击打赏，请我们喝一杯咖啡！如果你用 Apple Podcasts 收听，也请给我们一个五星好评，这对我们非常重要。最后！快来加入Onboard！听友群，结识到高质量的听友们，我们还会组织线下主题聚会，开放实时旁听播客录制，嘉宾互动等新的尝试。添加任意一位小助手微信，onboard666, 或者 Nine_tunes,小助手会拉你进群。期待你来！

ai microsoft flash agent framework continuous secretly llm orca hello world onboard kv mistral adept autogpt langchain huggingface aigc vicuna

The INDUStry Show w Saumya Bhatnagar

The INDUStry Show

Play Episode Listen Later Jul 27, 2024 26:28

Saumya Bhatnagar is the Chief Product Officer and co-founder of Jeeva AI (formerly involve.ai) - automating all manual tasks and using AutoGPT to find your next 500 customers. She is a multi-entrepreneur, recipient of the Stevie Gold Entrepreneur of the Year award, Forbes 30 Under 30 alum, National Diversity Council's prestigious 50 Most Powerful Women in Tech, Top 50 Women CPO's in the US by WWA. --- Support this podcast: https://podcasters.spotify.com/pod/show/theindustryshow/support

forbes chief product officer autogpt wwa national diversity council

AF - Auto-Enhance: Developing a meta-benchmark to measure LLM agents' ability to improve other agents by Sam Brown

The Nonlinear Library

Play Episode Listen Later Jul 22, 2024 26:02

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Auto-Enhance: Developing a meta-benchmark to measure LLM agents' ability to improve other agents, published by Sam Brown on July 22, 2024 on The AI Alignment Forum. Summary Scaffolded LLM agents are, in principle, able to execute arbitrary code to achieve the goals they have been set. One such goal could be self-improvement. This post outlines our plans to build a benchmark to measure the ability of LLM agents to modify and improve other LLM agents. This 'Auto-Enhancement benchmark' measures the ability of 'top-level' agents to improve the performance of 'reference' agents on 'component' benchmarks, such as CyberSecEval 2, MLAgentBench, SWE-bench, and WMDP. Results are mostly left for a future post in the coming weeks. Scaffolds such as AutoGPT, ReAct, and SWE-agent can be built around LLMs to build LLM agents, with abilities such as long-term planning and context-window management to enable them to carry out complex general-purpose tasks autonomously. LLM agents can fix issues in large, complex code bases (see SWE-bench), and interact in a general way using web browsers, Linux shells, and Python interpreters. In this post, we outline our plans for a project to measure these LLM agents' ability to modify other LLM agents, undertaken as part of Axiom Futures' Alignment Research Fellowship. Our proposed benchmark consists of "enhancement tasks," which measure the ability of an LLM agent to improve the performance of another LLM agent (which may be a clone of the first agent) on various tasks. Our benchmark uses existing benchmarks as components to measure LLM agent capabilities in various domains, such as software engineering, cybersecurity exploitation, and others. We believe these benchmarks are consequential in the sense that good performance by agents on these tasks should be concerning for us. We plan to write an update post with our results at the end of the Fellowship, and we will link this post to that update. Motivation Agents are capable of complex SWE tasks (see, e.g., Yang et al.). One such task could be the improvement of other scaffolded agents. This capability would be a key component of autonomous replication and adaptation (ARA), and we believe it would be generally recognised as an important step towards extreme capabilities. This post outlines our initial plans for developing a novel benchmark that aims to measure the ability of LLM-based agents to improve other LLM-based agents, including those that are as capable as themselves. Threat model We present two threat models that aim to capture how AI systems may develop super-intelligent capabilities. Expediting AI research: Recent trends show how researchers are leveraging LLMs to expedite academic paper reviews (see Du et al.). ML researchers are beginning to use LLMs to design and train more advanced models (see Cotra's AIs accelerating AI research and Anthropic's work on Constitutional AI). Such LLM-assisted research may expedite progress toward super-intelligent systems. Autonomy: Another way that such capabilities are developed is through LLM agents themselves becoming competent enough to self-modify and further ML research without human assistance (see section Hard Takeoff in this note ), leading to an autonomously replicating and adapting system. Our proposed benchmark aims to quantify the ability of LLM agents to bring about such recursive self-improvement, either with or without detailed human supervision. Categories of bottlenecks and overhang risks We posit that there are three distinct categories of bottlenecks to LLM agent capabilities: 1. Architectures-of-thought, such as structured planning, progress-summarisation, hierarchy of agents, self-critique, chain-of-thought, self-consistency, prompt engineering/elicitation, and so on. Broadly speaking, this encompasses everything between the LL...

ai developing speech auto fellowship architecture measure results ability react ea enhance yang python ml linux llm anthropic benchmark sam brown swe autogpt rationalist scaffolds constitutional ai

LangChain's Harrison Chase on Building the Orchestration Layer for AI Agents

Training Data

Play Episode Listen Later Jun 18, 2024 49:50

Last year, AutoGPT and Baby AGI captured our imaginations—agents quickly became the buzzword of the day…and then things went quiet. AutoGPT and Baby AGI may have marked a peak in the hype cycle, but this year has seen a wave of agentic breakouts on the product side, from Klarna's customer support AI to Cognition's Devin, etc. Harrison Chase of LangChain is focused on enabling the orchestration layer for agents. In this conversation, he explains what's changed that's allowing agents to improve performance and find traction. Harrison shares what he's optimistic about, where he sees promise for agents vs. what he thinks will be trained into models themselves, and discusses novel kinds of UX that he imagines might transform how we experience agents in the future. Hosted by: Sonya Huang and Pat Grady, Sequoia Capital Mentioned: ReAct: Synergizing Reasoning and Acting in Language Models, the first cognitive architecture for agents SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering, small-model open-source software engineering agent from researchers at Princeton Devin, autonomous software engineering from Cognition V0: Generative UI agent from Vercel GPT Researcher, a research agent Language Model Cascades: 2022 paper by Google Brain and now OpenAI researcher David Dohan that was influential for Harrison in developing LangChain Transcript: https://www.sequoiacap.com/podcast/training-data-harrison-chase/ 00:00 Introduction 01:21 What are agents? 05:00 What is LangChain's role in the agent ecosystem? 11:13 What is a cognitive architecture? 13:20 Is bespoke and hard coded the way the world is going, or a stop gap? 18:48 Focus on what makes your beer taste better 20:37 So what? 22:20 Where are agents getting traction? 25:35 Reflection, chain of thought, other techniques? 30:42 UX can influence the effectiveness of the architecture 35:30 What's out of scope? 38:04 Fine tuning vs prompting? 42:17 Existing observability tools for LLMs vs needing a new architecture/approach 45:38 Lightning round

ai focus reflection acting lightning openai ux existing layer cognition klarna orchestration google brain autogpt langchain pat grady

AF - We might be dropping the ball on Autonomous Replication and Adaptation. by Charbel-Raphael Segerie

The Nonlinear Library

Play Episode Listen Later May 31, 2024 7:28

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We might be dropping the ball on Autonomous Replication and Adaptation., published by Charbel-Raphael Segerie on May 31, 2024 on The AI Alignment Forum. Here is a little Q&A Can you explain your position quickly? I think autonomous replication and adaptation in the wild is under-discussed as an AI threat model. And this makes me sad because this is one of the main reasons I'm worried. I think one of AI Safety people's main proposals should first focus on creating a nonproliferation treaty. Without this treaty, I think we are screwed. The more I think about it, the more I think we are approaching a point of no return. It seems to me that open source is a severe threat and that nobody is really on the ball. Before those powerful AIs can self-replicate and adapt, AI development will be very positive overall and difficult to stop, but it's too late after AI is able to adapt and evolve autonomously because Natural selection favors AI over humans. What is ARA? Autonomous Replication and Adaptation. Let's recap this quickly. Today, generative AI functions as a tool: you ask a question and the tool answers. Question, answer. It's simple. However, we are heading towards a new era of AI, one with autonomous AI. Instead of asking a question, you give it a goal, and the AI performs a series of actions to achieve that goal, which is much more powerful. Libraries like AutoGPT or ChatGPT, when they navigate the internet, already show what these agents might look like. Agency is much more powerful and dangerous than AI tools. Thus conceived, AI would be able to replicate autonomously, copying itself from one computer to another, like a particularly intelligent virus. To replicate on a new computer, it must navigate the internet, create a new account on AWS, pay for the virtual machine, install the new weights on this machine, and start the replication process. According to METR, the organization that audited OpenAI, a dozen tasks indicate ARA capabilities. GPT-4 plus basic scaffolding was capable of performing a few of these tasks, though not robustly. This was over a year ago, with primitive scaffolding, no dedicated training for agency, and no reinforcement learning. Multimodal AIs can now successfully pass CAPTCHAs. ARA is probably coming. It could be very sudden. One of the main variables for self-replication is whether the AI can pay for cloud GPUs. Let's say a GPU costs $1 per hour. The question is whether the AI can generate $1 per hour autonomously continuously. Then, you have something like an exponential process. I think that the number of AIs is probably going to plateau, but regardless of a plateau and the number of AIs you get asymptotically, here you are: this is an autonomous AI, which may become like an endemic virus that is hard to shut down. Is ARA a point of no return? Yes, I think ARA with full adaptation in the wild is beyond the point of no return. Once there is an open-source ARA model or a leak of a model capable of generating enough money for its survival and reproduction and able to adapt to avoid detection and shutdown, it will be probably too late: The idea of making an ARA bot is very accessible. The seed model would already be torrented and undeletable. Stop the internet? The entire world's logistics depend on the internet. In practice, this would mean starving the cities over time. Even if you manage to stop the internet, once the ARA bot is running, it will be unkillable. Even rebooting all providers like AWS would not suffice, as individuals could download and relaunch the model, or the agent could hibernate on local computers. The cost to completely eradicate it altogether would be way too high, and it only needs to persist in one place to spread again. The question is more interesting for ARA with incomplete adaptation capabilities. It is likely th...

ai chatgpt natural speech agency dropping openai ea adaptation gpt aws libraries autonomous gpu gpus metr replication charbel captchas autogpt rationalist

Zenlytic Is Building You A Better Coworker With AI Agents

Data Engineering Podcast

Play Episode Listen Later May 19, 2024 54:19

Summary The purpose of business intelligence systems is to allow anyone in the business to access and decode data to help them make informed decisions. Unfortunately this often turns into an exercise in frustration for everyone involved due to complex workflows and hard-to-understand dashboards. The team at Zenlytic have leaned on the promise of large language models to build an AI agent that lets you converse with your data. In this episode they share their journey through the fast-moving landscape of generative AI and unpack the difference between an AI chatbot and an AI agent. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Commentst" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Ryan Janssen and Paul Blankley about their experiences building AI powered agents for interacting with your data Interview Introduction How did you get involved in data? In AI? Can you describe what Zenlytic is and the role that AI is playing in your platform? What have been the key stages in your AI journey? What are some of the dead ends that you ran into along the path to where you are today? What are some of the persistent challenges that you are facing? So tell us more about data agents. Firstly, what are data agents and why do you think they're important? How are data agents different from chatbots? Are data agents harder to build? How do you make them work in production? What other technical architectures have you had to develop to support the use of AI in Zenlytic? How have you approached the work of customer education as you introduce this functionality? What are some of the most interesting or erroneous misconceptions that you have heard about what the AI can and can't do? How have you balanced accuracy/trustworthiness with user experience and flexibility in the conversational AI, given the potential for these models to create erroneous responses? What are the most interesting, innovative, or unexpected ways that you have seen your AI agent used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI agent for business intelligence? When is an AI agent the wrong choice? What do you have planned for the future of AI in the Zenlytic product? Contact Info Ryan LinkedIn (https://www.linkedin.com/in/janssenryan) Paul LinkedIn (https://www.linkedin.com/in/paulblankley/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Zenlytic (https://www.zenlytic.com/) Podcast Episode (https://www.dataengineeringpodcast.com/zenlytic-self-serve-business-intelligence-episode-371) Attention is all you need (https://arxiv.org/abs/1706.03762) Transformers (https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)) BERT (https://en.wikipedia.org/wiki/BERT_(language_model)) The Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html) Richard Sutton PID Loops (https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller) AutoGPT (https://github.com/Significant-Gravitas/AutoGPT) Devin.ai (https://www.cognition.ai/introducing-devin) Google Gemini (https://gemini.google.com/) Anthropic Claude (https://www.anthropic.com/claude) OpenAI Code Interpreter (https://platform.openai.com/docs/assistants/tools/code-interpreter) Edward Tufte (https://www.edwardtufte.com/tufte/books_vdqi) Looker ActionHub (https://developers.looker.com/actions/overview/) OAuth (https://oauth.net/2/) GitHub Copilot (https://github.com/features/copilot) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

ai data search attention transformers trusted doordash python comcast coworkers hive transformer hug red hat google gemini starburst github copilot oauth trino autogpt jamie parker apache iceberg edward tufte freak fandango orchestra zenlytic

HPR4117: JAMBOREE !

Hacker Public Radio

Play Episode Listen Later May 14, 2024

https://github.com/freeload101/Java-Android-Magisk-Burp-Objection-Root-Emulator-Easy Java Android Magisk Burp Objection Root Emulator Easy (JAMBOREE) Get a working portable Python/Git/Java environment on Windows in SECONDS without having local administrator, regardless of your broken Python or other environment variables. Our open-source script downloads directly from proper sources without any binaries. While the code may not be perfect, it includes many useful PowerShell tricks. Run Android apps and pentest without the adware and malware of BlueStacks or NOX. Features / Request Core Status RMS:Runtime Mobile Security ✔️ Brida, Burp to Frida bridge ❌ SaftyNet+ Bypass ❌ Burp Suite Pro / CloudFlare UserAgent Workaround-ish ✔️ ZAP Using Burp ✔️ Google Play ✔️ Java ✔️ Android 11 API 30 ✔️ Magisk ✔️ Burp ✔️ Objection ✔️ Root ✔️ Python ✔️ Frida ✔️ Certs ✔️ AUTOMATIC1111 ✔️ AutoGPT ✔️ Bloodhound ✔️ PyCharm ✔️ OracleLinux WSL ✔️ Ubuntu/Olamma WSL ✔️ Postgres No admin ✔️ SillyTavern ✔️ Volatility 3 ✔️ Arduino IDE / Duck2Spark ✔️ Youtube Downloader Yt-dlp ✔️ How it works: Temporarily resets your windows $PATH environment variable to fix any issues with existing python/java installation Build a working Python environment in seconds using a tiny 16 meg nuget.org Python binary and portable PortableGit. Our solution doesn't require a package manager like Anaconda. I would like to make it even easier to use but I don't want to spend more time developing it if nobody is going to use it! Please let me know if you like it and open bugs/suggestions/feature request etc! You can contact me at https://rmccurdy.com ! Installation/Requirements ( For Android AVD Emulator) : Local admin just to install Android AVD Driver: HAXM Intel driver ( https://github.com/intel/haxm ) OR AMD ( https://github.com/google/android-emulator-hypervisor-driver-for-amd-processors ) Usage: Put ps1 file in a folder Rightclick Run with PowerShell OR From command prompt powershell -ExecutionPolicy Bypass -Command "[scriptblock]::Create((Invoke-WebRequest "https://raw.githubusercontent.com/freeload101/Java-Android-Magisk-Burp-Objection-Root-Emulator-Easy/main/JAMBOREE.ps1").Content).Invoke();" More infomation on bypass Root Detection and SafeNet https://www.droidwin.com/how-to-hide-root-from-apps-via-magisk-denylist/ ( Watch the Video Tutorial below it's a 3-5 min process. You only have to setup once. After that it's start burp then start AVD ) Burp/Android Emulator (Video Tutorial ) Update Video with 7minsec Podcast! https://youtu.be/XdXleap0BiM name (Video Tutorial) https://youtu.be/pYv4UwP3BaU name USB Rubber Ducky Scripts & Payloads Python 3 Arduino DigiSpark https://youtu.be/e8tKhFS0Tow name Old payloads: https://github.com/hak5/usbrubberducky-payloads/tree/1d3e9be7ba3f80cdb008885fac49be2ba926649d/payloads PhreakNIC 24: Java Android Magisk Burp Objection Root Emulator Easy (JAMBOREE) https://www.youtube.com/watch?v=R1eu2Ui1ZLU name

android root windows google play api python java volatility anaconda objection jamboree burp nox invoke bloodhounds powershell brida certs magisk video tutorials autogpt pycharm

High Agency Pydantic > VC Backed Frameworks — with Jason Liu of Instructor

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Apr 19, 2024 52:20

We are reuniting for the 2nd AI UX demo day in SF on Apr 28. Sign up to demo here! And don't forget tickets for the AI Engineer World's Fair — for early birds who join before keynote announcements!About a year ago there was a lot of buzz around prompt engineering techniques to force structured output. Our friend Simon Willison tweeted a bunch of tips and tricks, but the most iconic one is Riley Goodside making it a matter of life or death:Guardrails (friend of the pod and AI Engineer speaker), Marvin (AI Engineer speaker), and jsonformer had also come out at the time. In June 2023, Jason Liu (today's guest!) open sourced his “OpenAI Function Call and Pydantic Integration Module”, now known as Instructor, which quickly turned prompt engineering black magic into a clean, developer-friendly SDK. A few months later, model providers started to add function calling capabilities to their APIs as well as structured outputs support like “JSON Mode”, which was announced at OpenAI Dev Day (see recap here). In just a handful of months, we went from threatening to kill grandmas to first-class support from the research labs. And yet, Instructor was still downloaded 150,000 times last month. Why?What Instructor looks likeInstructor patches your LLM provider SDKs to offer a new response_model option to which you can pass a structure defined in Pydantic. It currently supports OpenAI, Anthropic, Cohere, and a long tail of models through LiteLLM.What Instructor is forThere are three core use cases to Instructor:* Extracting structured data: Taking an input like an image of a receipt and extracting structured data from it, such as a list of checkout items with their prices, fees, and coupon codes.* Extracting graphs: Identifying nodes and edges in a given input to extract complex entities and their relationships. For example, extracting relationships between characters in a story or dependencies between tasks.* Query understanding: Defining a schema for an API call and using a language model to resolve a request into a more complex one that an embedding could not handle. For example, creating date intervals from queries like “what was the latest thing that happened this week?” to then pass onto a RAG system or similar.Jason called all these different ways of getting data from LLMs “typed responses”: taking strings and turning them into data structures. Structured outputs as a planning toolThe first wave of agents was all about open-ended iteration and planning, with projects like AutoGPT and BabyAGI. Models would come up with a possible list of steps, and start going down the list one by one. It's really easy for them to go down the wrong branch, or get stuck on a single step with no way to intervene.What if these planning steps were returned to us as DAGs using structured output, and then managed as workflows? This also makes it easy to better train model on how to create these plans, as they are much more structured than a bullet point list. Once you have this structure, each piece can be modified individually by different specialized models. You can read some of Jason's experiments here:While LLMs will keep improving (Llama3 just got released as we write this), having a consistent structure for the output will make it a lot easier to swap models in and out. Jason's overall message on how we can move from ReAct loops to more controllable Agent workflows mirrors the “Process” discussion from our Elicit episode:Watch the talkAs a bonus, here's Jason's talk from last year's AI Engineer Summit. He'll also be a speaker at this year's AI Engineer World's Fair!Timestamps* [00:00:00] Introductions* [00:02:23] Early experiments with Generative AI at StitchFix* [00:08:11] Design philosophy behind the Instructor library* [00:11:12] JSON Mode vs Function Calling* [00:12:30] Single vs parallel function calling* [00:14:00] How many functions is too many?* [00:17:39] How to evaluate function calling* [00:20:23] What is Instructor good for?* [00:22:42] The Evolution from Looping to Workflow in AI Engineering* [00:27:03] State of the AI Engineering Stack* [00:28:26] Why Instructor isn't VC backed* [00:31:15] Advice on Pursuing Open Source Projects and Consulting* [00:36:00] The Concept of High Agency and Its Importance* [00:42:44] Prompts as Code and the Structure of AI Inputs and Outputs* [00:44:20] The Emergence of AI Engineering as a Distinct FieldShow notes* Jason on the UWaterloo mafia* Jason on Twitter, LinkedIn, website* Instructor docs* Max Woolf on the potential of Structured Output* swyx on Elo vs Cost* Jason on Anthropic Function Calling* Jason on Rejections, Advice to Young People* Jason on Bad Startup Ideas* Jason on Prompts as Code* Rysana's inversion models* Bryan Bischof's episode* Hamel HusainTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:16]: Hello, we're back in the remote studio with Jason Liu from Instructor. Welcome Jason.Jason [00:00:21]: Hey there. Thanks for having me.Swyx [00:00:23]: Jason, you are extremely famous, so I don't know what I'm going to do introducing you, but you're one of the Waterloo clan. There's like this small cadre of you that's just completely dominating machine learning. Actually, can you list like Waterloo alums that you're like, you know, are just dominating and crushing it right now?Jason [00:00:39]: So like John from like Rysana is doing his inversion models, right? I know like Clive Chen from Waterloo. When I started the data science club, he was one of the guys who were like joining in and just like hanging out in the room. And now he was at Tesla working with Karpathy, now he's at OpenAI, you know.Swyx [00:00:56]: He's in my climbing club.Jason [00:00:58]: Oh, hell yeah. I haven't seen him in like six years now.Swyx [00:01:01]: To get in the social scene in San Francisco, you have to climb. So both in career and in rocks. So you started a data science club at Waterloo, we can talk about that, but then also spent five years at Stitch Fix as an MLE. You pioneered the use of OpenAI's LLMs to increase stylist efficiency. So you must have been like a very, very early user. This was like pretty early on.Jason [00:01:20]: Yeah, I mean, this was like GPT-3, okay. So we actually were using transformers at Stitch Fix before the GPT-3 model. So we were just using transformers for recommendation systems. At that time, I was very skeptical of transformers. I was like, why do we need all this infrastructure? We can just use like matrix factorization. When GPT-2 came out, I fine tuned my own GPT-2 to write like rap lyrics and I was like, okay, this is cute. Okay, I got to go back to my real job, right? Like who cares if I can write a rap lyric? When GPT-3 came out, again, I was very much like, why are we using like a post request to review every comment a person leaves? Like we can just use classical models. So I was very against language models for like the longest time. And then when ChatGPT came out, I basically just wrote a long apology letter to everyone at the company. I was like, hey guys, you know, I was very dismissive of some of this technology. I didn't think it would scale well, and I am wrong. This is incredible. And I immediately just transitioned to go from computer vision recommendation systems to LLMs. But funny enough, now that we have RAG, we're kind of going back to recommendation systems.Swyx [00:02:21]: Yeah, speaking of that, I think Alessio is going to bring up the next one.Alessio [00:02:23]: Yeah, I was going to say, we had Bryan Bischof from Hex on the podcast. Did you overlap at Stitch Fix?Jason [00:02:28]: Yeah, he was like one of my main users of the recommendation frameworks that I had built out at Stitch Fix.Alessio [00:02:32]: Yeah, we talked a lot about RecSys, so it makes sense.Swyx [00:02:36]: So now I have adopted that line, RAG is RecSys. And you know, if you're trying to reinvent new concepts, you should study RecSys first, because you're going to independently reinvent a lot of concepts. So your system was called Flight. It's a recommendation framework with over 80% adoption, servicing 350 million requests every day. Wasn't there something existing at Stitch Fix? Why did you have to write one from scratch?Jason [00:02:56]: No, so I think because at Stitch Fix, a lot of the machine learning engineers and data scientists were writing production code, sort of every team's systems were very bespoke. It's like, this team only needs to do like real time recommendations with small data. So they just have like a fast API app with some like pandas code. This other team has to do a lot more data. So they have some kind of like Spark job that does some batch ETL that does a recommendation. And so what happens is each team writes their code differently. And I have to come in and refactor their code. And I was like, oh man, I'm refactoring four different code bases, four different times. Wouldn't it be better if all the code quality was my fault? Let me just write this framework, force everyone else to use it. And now one person can maintain five different systems, rather than five teams having their own bespoke system. And so it was really a need of just sort of standardizing everything. And then once you do that, you can do observability across the entire pipeline and make large sweeping improvements in this infrastructure, right? If we notice that something is slow, we can detect it on the operator layer. Just hey, hey, like this team, you guys are doing this operation is lowering our latency by like 30%. If you just optimize your Python code here, we can probably make an extra million dollars. So let's jump on a call and figure this out. And then a lot of it was doing all this observability work to figure out what the heck is going on and optimize this system from not only just a code perspective, sort of like harassingly or against saying like, we need to add caching here. We're doing duplicated work here. Let's go clean up the systems. Yep.Swyx [00:04:22]: Got it. One more system that I'm interested in finding out more about is your similarity search system using Clip and GPT-3 embeddings and FIASS, where you saved over $50 million in annual revenue. So of course they all gave all that to you, right?Jason [00:04:34]: No, no, no. I mean, it's not going up and down, but you know, I got a little bit, so I'm pretty happy about that. But there, you know, that was when we were doing fine tuning like ResNets to do image classification. And so a lot of it was given an image, if we could predict the different attributes we have in the merchandising and we can predict the text embeddings of the comments, then we can kind of build a image vector or image embedding that can capture both descriptions of the clothing and sales of the clothing. And then we would use these additional vectors to augment our recommendation system. And so with the recommendation system really was just around like, what are similar items? What are complimentary items? What are items that you would wear in a single outfit? And being able to say on a product page, let me show you like 15, 20 more things. And then what we found was like, hey, when you turn that on, you make a bunch of money.Swyx [00:05:23]: Yeah. So, okay. So you didn't actually use GPT-3 embeddings. You fine tuned your own? Because I was surprised that GPT-3 worked off the shelf.Jason [00:05:30]: Because I mean, at this point we would have 3 million pieces of inventory over like a billion interactions between users and clothes. So any kind of fine tuning would definitely outperform like some off the shelf model.Swyx [00:05:41]: Cool. I'm about to move on from Stitch Fix, but you know, any other like fun stories from the Stitch Fix days that you want to cover?Jason [00:05:46]: No, I think that's basically it. I mean, the biggest one really was the fact that I think for just four years, I was so bearish on language models and just NLP in general. I'm just like, none of this really works. Like, why would I spend time focusing on this? I got to go do the thing that makes money, recommendations, bounding boxes, image classification. Yeah. Now I'm like prompting an image model. I was like, oh man, I was wrong.Swyx [00:06:06]: So my Stitch Fix question would be, you know, I think you have a bit of a drip and I don't, you know, my primary wardrobe is free startup conference t-shirts. Should more technology brothers be using Stitch Fix? What's your fashion advice?Jason [00:06:19]: Oh man, I mean, I'm not a user of Stitch Fix, right? It's like, I enjoy going out and like touching things and putting things on and trying them on. Right. I think Stitch Fix is a place where you kind of go because you want the work offloaded. I really love the clothing I buy where I have to like, when I land in Japan, I'm doing like a 45 minute walk up a giant hill to find this weird denim shop. That's the stuff that really excites me. But I think the bigger thing that's really captured is this idea that narrative matters a lot to human beings. Okay. And I think the recommendation system, that's really hard to capture. It's easy to use AI to sell like a $20 shirt, but it's really hard for AI to sell like a $500 shirt. But people are buying $500 shirts, you know what I mean? There's definitely something that we can't really capture just yet that we probably will figure out how to in the future.Swyx [00:07:07]: Well, it'll probably output in JSON, which is what we're going to turn to next. Then you went on a sabbatical to South Park Commons in New York, which is unusual because it's based on USF.Jason [00:07:17]: Yeah. So basically in 2020, really, I was enjoying working a lot as I was like building a lot of stuff. This is where we were making like the tens of millions of dollars doing stuff. And then I had a hand injury. And so I really couldn't code anymore for like a year, two years. And so I kind of took sort of half of it as medical leave, the other half I became more of like a tech lead, just like making sure the systems were like lights were on. And then when I went to New York, I spent some time there and kind of just like wound down the tech work, you know, did some pottery, did some jujitsu. And after GPD came out, I was like, oh, I clearly need to figure out what is going on here because something feels very magical. I don't understand it. So I spent basically like five months just prompting and playing around with stuff. And then afterwards, it was just my startup friends going like, hey, Jason, you know, my investors want us to have an AI strategy. Can you help us out? And it just snowballed and bore more and more until I was making this my full time job. Yeah, got it.Swyx [00:08:11]: You know, you had YouTube University and a journaling app, you know, a bunch of other explorations. But it seems like the most productive or the best known thing that came out of your time there was Instructor. Yeah.Jason [00:08:22]: Written on the bullet train in Japan. I think at some point, you know, tools like Guardrails and Marvin came out. Those are kind of tools that I use XML and Pytantic to get structured data out. But they really were doing things sort of in the prompt. And these are built with sort of the instruct models in mind. Like I'd already done that in the past. Right. At Stitch Fix, you know, one of the things we did was we would take a request note and turn that into a JSON object that we would use to send it to our search engine. Right. So if you said like, I want to, you know, skinny jeans that were this size, that would turn into JSON that we would send to our internal search APIs. But it always felt kind of gross. A lot of it is just like you read the JSON, you like parse it, you make sure the names are strings and ages are numbers and you do all this like messy stuff. But when function calling came out, it was very much sort of a new way of doing things. Right. Function calling lets you define the schema separate from the data and the instructions. And what this meant was you can kind of have a lot more complex schemas and just map them in Pytantic. And then you can just keep those very separate. And then once you add like methods, you can add validators and all that kind of stuff. The one thing I really had with a lot of these libraries, though, was it was doing a lot of the string formatting themselves, which was fine when it was the instruction to models. You just have a string. But when you have these new chat models, you have these chat messages. And I just didn't really feel like not being able to access that for the developer was sort of a good benefit that they would get. And so I just said, let me write like the most simple SDK around the OpenAI SDK, a simple wrapper on the SDK, just handle the response model a bit and kind of think of myself more like requests than actual framework that people can use. And so the goal is like, hey, like this is something that you can use to build your own framework. But let me just do all the boring stuff that nobody really wants to do. People want to build their own frameworks, but people don't want to build like JSON parsing.Swyx [00:10:08]: And the retrying and all that other stuff.Jason [00:10:10]: Yeah.Swyx [00:10:11]: Right. We had this a little bit of this discussion before the show, but like that design principle of going for being requests rather than being Django. Yeah. So what inspires you there? This has come from a lot of prior pain. Are there other open source projects that inspired your philosophy here? Yeah.Jason [00:10:25]: I mean, I think it would be requests, right? Like, I think it is just the obvious thing you install. If you were going to go make HTTP requests in Python, you would obviously import requests. Maybe if you want to do more async work, there's like future tools, but you don't really even think about installing it. And when you do install it, you don't think of it as like, oh, this is a requests app. Right? Like, no, this is just Python. The bigger question is, like, a lot of people ask questions like, oh, why isn't requests like in the standard library? Yeah. That's how I want my library to feel, right? It's like, oh, if you're going to use the LLM SDKs, you're obviously going to install instructor. And then I think the second question would be like, oh, like, how come instructor doesn't just go into OpenAI, go into Anthropic? Like, if that's the conversation we're having, like, that's where I feel like I've succeeded. Yeah. It's like, yeah, so standard, you may as well just have it in the base libraries.Alessio [00:11:12]: And the shape of the request stayed the same, but initially function calling was maybe equal structure outputs for a lot of people. I think now the models also support like JSON mode and some of these things and, you know, return JSON or my grandma is going to die. All of that stuff is maybe to decide how have you seen that evolution? Like maybe what's the metagame today? Should people just forget about function calling for structure outputs or when is structure output like JSON mode the best versus not? We'd love to get any thoughts given that you do this every day.Jason [00:11:42]: Yeah, I would almost say these are like different implementations of like the real thing we care about is the fact that now we have typed responses to language models. And because we have that type response, my IDE is a little bit happier. I get autocomplete. If I'm using the response wrong, there's a little red squiggly line. Like those are the things I care about in terms of whether or not like JSON mode is better. I usually think it's almost worse unless you want to spend less money on like the prompt tokens that the function call represents, primarily because with JSON mode, you don't actually specify the schema. So sure, like JSON load works, but really, I care a lot more than just the fact that it is JSON, right? I think function calling gives you a tool to specify the fact like, okay, this is a list of objects that I want and each object has a name or an age and I want the age to be above zero and I want to make sure it's parsed correctly. That's where kind of function calling really shines.Alessio [00:12:30]: Any thoughts on single versus parallel function calling? So I did a presentation at our AI in Action Discord channel, and obviously showcase instructor. One of the big things that we have before with single function calling is like when you're trying to extract lists, you have to make these funky like properties that are lists to then actually return all the objects. How do you see the hack being put on the developer's plate versus like more of this stuff just getting better in the model? And I know you tweeted recently about Anthropic, for example, you know, some lists are not lists or strings and there's like all of these discrepancies.Jason [00:13:04]: I almost would prefer it if it was always a single function call. Obviously, there is like the agents workflows that, you know, Instructor doesn't really support that well, but are things that, you know, ought to be done, right? Like you could define, I think maybe like 50 or 60 different functions in a single API call. And, you know, if it was like get the weather or turn the lights on or do something else, it makes a lot of sense to have these parallel function calls. But in terms of an extraction workflow, I definitely think it's probably more helpful to have everything be a single schema, right? Just because you can sort of specify relationships between these entities that you can't do in a parallel function calling, you can have a single chain of thought before you generate a list of results. Like there's like small like API differences, right? Where if it's for parallel function calling, if you do one, like again, really, I really care about how the SDK looks and says, okay, do I always return a list of functions or do you just want to have the actual object back out and you want to have like auto complete over that object? Interesting.Alessio [00:14:00]: What's kind of the cap for like how many function definitions you can put in where it still works well? Do you have any sense on that?Jason [00:14:07]: I mean, for the most part, I haven't really had a need to do anything that's more than six or seven different functions. I think in the documentation, they support way more. I don't even know if there's any good evals that have over like two dozen function calls. I think if you're running into issues where you have like 20 or 50 or 60 function calls, I think you're much better having those specifications saved in a vector database and then have them be retrieved, right? So if there are 30 tools, like you should basically be like ranking them and then using the top K to do selection a little bit better rather than just like shoving like 60 functions into a single. Yeah.Swyx [00:14:40]: Yeah. Well, I mean, so I think this is relevant now because previously I think context limits prevented you from having more than a dozen tools anyway. And now that we have million token context windows, you know, a cloud recently with their new function calling release said they can handle over 250 tools, which is insane to me. That's, that's a lot. You're saying like, you know, you don't think there's many people doing that. I think anyone with a sort of agent like platform where you have a bunch of connectors, they wouldn't run into that problem. Probably you're right that they should use a vector database and kind of rag their tools. I know Zapier has like a few thousand, like 8,000, 9,000 connectors that, you know, obviously don't fit anywhere. So yeah, I mean, I think that would be it unless you need some kind of intelligence that chains things together, which is, I think what Alessio is coming back to, right? Like there's this trend about parallel function calling. I don't know what I think about that. Anthropic's version was, I think they use multiple tools in sequence, but they're not in parallel. I haven't explored this at all. I'm just like throwing this open to you as to like, what do you think about all these new things? Yeah.Jason [00:15:40]: It's like, you know, do we assume that all function calls could happen in any order? In which case, like we either can assume that, or we can assume that like things need to happen in some kind of sequence as a DAG, right? But if it's a DAG, really that's just like one JSON object that is the entire DAG rather than going like, okay, the order of the function that return don't matter. That's definitely just not true in practice, right? Like if I have a thing that's like turn the lights on, like unplug the power, and then like turn the toaster on or something like the order doesn't matter. And it's unclear how well you can describe the importance of that reasoning to a language model yet. I mean, I'm sure you can do it with like good enough prompting, but I just haven't any use cases where the function sequence really matters. Yeah.Alessio [00:16:18]: To me, the most interesting thing is the models are better at picking than your ranking is usually. Like I'm incubating a company around system integration. For example, with one system, there are like 780 endpoints. And if you're actually trying to do vector similarity, it's not that good because the people that wrote the specs didn't have in mind making them like semantically apart. You know, they're kind of like, oh, create this, create this, create this. Versus when you give it to a model, like in Opus, you put them all, it's quite good at picking which ones you should actually run. And I'm curious to see if the model providers actually care about some of those workflows or if the agent companies are actually going to build very good rankers to kind of fill that gap.Jason [00:16:58]: Yeah. My money is on the rankers because you can do those so easily, right? You could just say, well, given the embeddings of my search query and the embeddings of the description, I can just train XGBoost and just make sure that I have very high like MRR, which is like mean reciprocal rank. And so the only objective is to make sure that the tools you use are in the top end filtered. Like that feels super straightforward and you don't have to actually figure out how to fine tune a language model to do tool selection anymore. Yeah. I definitely think that's the case because for the most part, I imagine you either have like less than three tools or more than a thousand. I don't know what kind of company said, oh, thank God we only have like 185 tools and this works perfectly, right? That's right.Alessio [00:17:39]: And before we maybe move on just from this, it was interesting to me, you retweeted this thing about Anthropic function calling and it was Joshua Brown's retweeting some benchmark that it's like, oh my God, Anthropic function calling so good. And then you retweeted it and then you tweeted it later and it's like, it's actually not that good. What's your flow? How do you actually test these things? Because obviously the benchmarks are lying, right? Because the benchmarks say it's good and you said it's bad and I trust you more than the benchmark. How do you think about that? And then how do you evolve it over time?Jason [00:18:09]: It's mostly just client data. I actually have been mostly busy with enough client work that I haven't been able to reproduce public benchmarks. And so I can't even share some of the results in Anthropic. I would just say like in production, we have some pretty interesting schemas where it's like iteratively building lists where we're doing like updates of lists, like we're doing in place updates. So like upserts and inserts. And in those situations we're like, oh yeah, we have a bunch of different parsing errors. Numbers are being returned to strings. We were expecting lists of objects, but we're getting strings that are like the strings of JSON, right? So we had to call JSON parse on individual elements. Overall, I'm like super happy with the Anthropic models compared to the OpenAI models. Sonnet is very cost effective. Haiku is in function calling, it's actually better, but I think they just had to sort of file down the edges a little bit where like our tests pass, but then we actually deployed a production. We got half a percent of traffic having issues where if you ask for JSON, it'll try to talk to you. Or if you use function calling, you know, we'll have like a parse error. And so I think that definitely gonna be things that are fixed in like the upcoming weeks. But in terms of like the reasoning capabilities, man, it's hard to beat like 70% cost reduction, especially when you're building consumer applications, right? If you're building something for consultants or private equity, like you're charging $400, it doesn't really matter if it's a dollar or $2. But for consumer apps, it makes products viable. If you can go from four to Sonnet, you might actually be able to price it better. Yeah.Swyx [00:19:31]: I had this chart about the ELO versus the cost of all the models. And you could put trend graphs on each of those things about like, you know, higher ELO equals higher cost, except for Haiku. Haiku kind of just broke the lines, or the ISO ELOs, if you want to call it. Cool. Before we go too far into your opinions on just the overall ecosystem, I want to make sure that we map out the surface area of Instructor. I would say that most people would be familiar with Instructor from your talks and your tweets and all that. You had the number one talk from the AI Engineer Summit.Jason [00:20:03]: Two Liu. Jason Liu and Jerry Liu. Yeah.Swyx [00:20:06]: Yeah. Until I actually went through your cookbook, I didn't realize the surface area. How would you categorize the use cases? You have LLM self-critique, you have knowledge graphs in here, you have PII data sanitation. How do you characterize to people what is the surface area of Instructor? Yeah.Jason [00:20:23]: This is the part that feels crazy because really the difference is LLMs give you strings and Instructor gives you data structures. And once you get data structures, again, you can do every lead code problem you ever thought of. Right. And so I think there's a couple of really common applications. The first one obviously is extracting structured data. This is just be, okay, well, like I want to put in an image of a receipt. I want to give it back out a list of checkout items with a price and a fee and a coupon code or whatever. That's one application. Another application really is around extracting graphs out. So one of the things we found out about these language models is that not only can you define nodes, it's really good at figuring out what are nodes and what are edges. And so we have a bunch of examples where, you know, not only do I extract that, you know, this happens after that, but also like, okay, these two are dependencies of another task. And you can do, you know, extracting complex entities that have relationships. Given a story, for example, you could extract relationships of families across different characters. This can all be done by defining a graph. The last really big application really is just around query understanding. The idea is that like any API call has some schema and if you can define that schema ahead of time, you can use a language model to resolve a request into a much more complex request. One that an embedding could not do. So for example, I have a really popular post called like rag is more than embeddings. And effectively, you know, if I have a question like this, what was the latest thing that happened this week? That embeds to nothing, right? But really like that query should just be like select all data where the date time is between today and today minus seven days, right? What if I said, how did my writing change between this month and last month? Again, embeddings would do nothing. But really, if you could do like a group by over the month and a summarize, then you could again like do something much more interesting. And so this really just calls out the fact that embeddings really is kind of like the lowest hanging fruit. And using something like instructor can really help produce a data structure. And then you can just use your computer science and reason about the data structure. Maybe you say, okay, well, I'm going to produce a graph where I want to group by each month and then summarize them jointly. You can do that if you know how to define this data structure. Yeah.Swyx [00:22:29]: So you kind of run up against like the LangChains of the world that used to have that. They still do have like the self querying, I think they used to call it when we had Harrison on in our episode. How do you see yourself interacting with the other LLM frameworks in the ecosystem? Yeah.Jason [00:22:42]: I mean, if they use instructor, I think that's totally cool. Again, it's like, it's just Python, right? It's like asking like, oh, how does like Django interact with requests? Well, you just might make a request.get in a Django app, right? But no one would say, I like went off of Django because I'm using requests now. They should be ideally like sort of the wrong comparison in terms of especially like the agent workflows. I think the real goal for me is to go down like the LLM compiler route, which is instead of doing like a react type reasoning loop. I think my belief is that we should be using like workflows. If we do this, then we always have a request and a complete workflow. We can fine tune a model that has a better workflow. Whereas it's hard to think about like, how do you fine tune a better react loop? Yeah. You always train it to have less looping, in which case like you wanted to get the right answer the first time, in which case it was a workflow to begin with, right?Swyx [00:23:31]: Can you define workflow? Because I used to work at a workflow company, but I'm not sure this is a good term for everybody.Jason [00:23:36]: I'm thinking workflow in terms of like the prefect Zapier workflow. Like I want to build a DAG, I want you to tell me what the nodes and edges are. And then maybe the edges are also put in with AI. But the idea is that like, I want to be able to present you the entire plan and then ask you to fix things as I execute it, rather than going like, hey, I couldn't parse the JSON, so I'm going to try again. I couldn't parse the JSON, I'm going to try again. And then next thing you know, you spent like $2 on opening AI credits, right? Yeah. Whereas with the plan, you can just say, oh, the edge between node like X and Y does not run. Let me just iteratively try to fix that, fix the one that sticks, go on to the next component. And obviously you can get into a world where if you have enough examples of the nodes X and Y, maybe you can use like a vector database to find a good few shot examples. You can do a lot if you sort of break down the problem into that workflow and executing that workflow, rather than looping and hoping the reasoning is good enough to generate the correct output. Yeah.Swyx [00:24:35]: You know, I've been hammering on Devon a lot. I got access a couple of weeks ago. And obviously for simple tasks, it does well. For the complicated, like more than 10, 20 hour tasks, I can see- That's a crazy comparison.Jason [00:24:47]: We used to talk about like three, four loops. Only once it gets to like hour tasks, it's hard.Swyx [00:24:54]: Yeah. Less than an hour, there's nothing.Jason [00:24:57]: That's crazy.Swyx [00:24:58]: I mean, okay. Maybe my goalposts have shifted. I don't know. That's incredible.Jason [00:25:02]: Yeah. No, no. I'm like sub one minute executions. Like the fact that you're talking about 10 hours is incredible.Swyx [00:25:08]: I think it's a spectrum. I think I'm going to say this every single time I bring up Devon. Let's not reward them for taking longer to do things. Do you know what I mean? I think that's a metric that is easily abusable.Jason [00:25:18]: Sure. Yeah. You know what I mean? But I think if you can monotonically increase the success probability over an hour, that's winning to me. Right? Like obviously if you run an hour and you've made no progress. Like I think when we were in like auto GBT land, there was that one example where it's like, I wanted it to like buy me a bicycle overnight. I spent $7 on credit and I never found the bicycle. Yeah.Swyx [00:25:41]: Yeah. Right. I wonder if you'll be able to purchase a bicycle. Because it actually can do things in real world. It just needs to suspend to you for off and stuff. The point I was trying to make was that I can see it turning plans. I think one of the agents loopholes or one of the things that is a real barrier for agents is LLMs really like to get stuck into a lane. And you know what you're talking about, what I've seen Devon do is it gets stuck in a lane and it will just kind of change plans based on the performance of the plan itself. And it's kind of cool.Jason [00:26:05]: I feel like we've gone too much in the looping route and I think a lot of more plans and like DAGs and data structures are probably going to come back to help fill in some holes. Yeah.Alessio [00:26:14]: What do you think of the interface to that? Do you see it's like an existing state machine kind of thing that connects to the LLMs, the traditional DAG players? Do you think we need something new for like AI DAGs?Jason [00:26:25]: Yeah. I mean, I think that the hard part is going to be describing visually the fact that this DAG can also change over time and it should still be allowed to be fuzzy. I think in like mathematics, we have like plate diagrams and like Markov chain diagrams and like recurrent states and all that. Some of that might come into this workflow world. But to be honest, I'm not too sure. I think right now, the first steps are just how do we take this DAG idea and break it down to modular components that we can like prompt better, have few shot examples for and ultimately like fine tune against. But in terms of even the UI, it's hard to say what it will likely win. I think, you know, people like Prefect and Zapier have a pretty good shot at doing a good job.Swyx [00:27:03]: Yeah. You seem to use Prefect a lot. I actually worked at a Prefect competitor at Temporal and I'm also very familiar with Dagster. What else would you call out as like particularly interesting in the AI engineering stack?Jason [00:27:13]: Man, I almost use nothing. I just use Cursor and like PyTests. Okay. I think that's basically it. You know, a lot of the observability companies have... The more observability companies I've tried, the more I just use Postgres.Swyx [00:27:29]: Really? Okay. Postgres for observability?Jason [00:27:32]: But the issue really is the fact that these observability companies isn't actually doing observability for the system. It's just doing the LLM thing. Like I still end up using like Datadog or like, you know, Sentry to do like latency. And so I just have those systems handle it. And then the like prompt in, prompt out, latency, token costs. I just put that in like a Postgres table now.Swyx [00:27:51]: So you don't need like 20 funded startups building LLM ops? Yeah.Jason [00:27:55]: But I'm also like an old, tired guy. You know what I mean? Like I think because of my background, it's like, yeah, like the Python stuff, I'll write myself. But you know, I will also just use Vercel happily. Yeah. Yeah. So I'm not really into that world of tooling, whereas I think, you know, I spent three good years building observability tools for recommendation systems. And I was like, oh, compared to that, Instructor is just one call. I just have to put time star, time and then count the prompt token, right? Because I'm not doing a very complex looping behavior. I'm doing mostly workflows and extraction. Yeah.Swyx [00:28:26]: I mean, while we're on this topic, we'll just kind of get this out of the way. You famously have decided to not be a venture backed company. You want to do the consulting route. The obvious route for someone as successful as Instructor is like, oh, here's hosted Instructor with all tooling. Yeah. You just said you had a whole bunch of experience building observability tooling. You have the perfect background to do this and you're not.Jason [00:28:43]: Yeah. Isn't that sick? I think that's sick.Swyx [00:28:44]: I mean, I know why, because you want to go free dive.Jason [00:28:47]: Yeah. Yeah. Because I think there's two things. Right. Well, one, if I tell myself I want to build requests, requests is not a venture backed startup. Right. I mean, one could argue whether or not Postman is, but I think for the most part, it's like having worked so much, I'm more interested in looking at how systems are being applied and just having access to the most interesting data. And I think I can do that more through a consulting business where I can come in and go, oh, you want to build perfect memory. You want to build an agent. You want to build like automations over construction or like insurance and supply chain, or like you want to handle writing private equity, mergers and acquisitions reports based off of user interviews. Those things are super fun. Whereas like maintaining the library, I think is mostly just kind of like a utility that I try to keep up, especially because if it's not venture backed, I have no reason to sort of go down the route of like trying to get a thousand integrations. In my mind, I just go like, okay, 98% of the people use open AI. I'll support that. And if someone contributes another platform, that's great. I'll merge it in. Yeah.Swyx [00:29:45]: I mean, you only added Anthropic support this year. Yeah.Jason [00:29:47]: Yeah. You couldn't even get an API key until like this year, right? That's true. Okay. If I add it like last year, I was trying to like double the code base to service, you know, half a percent of all downloads.Swyx [00:29:58]: Do you think the market share will shift a lot now that Anthropic has like a very, very competitive offering?Jason [00:30:02]: I think it's still hard to get API access. I don't know if it's fully GA now, if it's GA, if you can get a commercial access really easily.Alessio [00:30:12]: I got commercial after like two weeks to reach out to their sales team.Jason [00:30:14]: Okay.Alessio [00:30:15]: Yeah.Swyx [00:30:16]: Two weeks. It's not too bad. There's a call list here. And then anytime you run into rate limits, just like ping one of the Anthropic staff members.Jason [00:30:21]: Yeah. Then maybe we need to like cut that part out. So I don't need to like, you know, spread false news.Swyx [00:30:25]: No, it's cool. It's cool.Jason [00:30:26]: But it's a common question. Yeah. Surely just from the price perspective, it's going to make a lot of sense. Like if you are a business, you should totally consider like Sonnet, right? Like the cost savings is just going to justify it if you actually are doing things at volume. And yeah, I think the SDK is like pretty good. Back to the instructor thing. I just don't think it's a billion dollar company. And I think if I raise money, the first question is going to be like, how are you going to get a billion dollar company? And I would just go like, man, like if I make a million dollars as a consultant, I'm super happy. I'm like more than ecstatic. I can have like a small staff of like three people. It's fun. And I think a lot of my happiest founder friends are those who like raised a tiny seed round, became profitable. They're making like 70, 60, 70, like MRR, 70,000 MRR and they're like, we don't even need to raise the seed round. Let's just keep it like between me and my co-founder, we'll go traveling and it'll be a great time. I think it's a lot of fun.Alessio [00:31:15]: Yeah. like say LLMs / AI and they build some open source stuff and it's like I should just raise money and do this and I tell people a lot it's like look you can make a lot more money doing something else than doing a startup like most people that do a company could make a lot more money just working somewhere else than the company itself do you have any advice for folks that are maybe in a similar situation they're trying to decide oh should I stay in my like high paid FAANG job and just tweet this on the side and do this on github should I go be a consultant like being a consultant seems like a lot of work so you got to talk to all these people you know there's a lot to unpackJason [00:31:54]: I think the open source thing is just like well I'm just doing it purely for fun and I'm doing it because I think I'm right but part of being right is the fact that it's not a venture backed startup like I think I'm right because this is all you need right so I think a part of the philosophy is the fact that all you need is a very sharp blade to sort of do your work and you don't actually need to build like a big enterprise so that's one thing I think the other thing too that I've kind of been thinking around just because I have a lot of friends at google that want to leave right now it's like man like what we lack is not money or skill like what we lack is courage you should like you just have to do this a hard thing and you have to do it scared anyways right in terms of like whether or not you do want to do a founder I think that's just a matter of optionality but I definitely recognize that the like expected value of being a founder is still quite low it is right I know as many founder breakups and as I know friends who raised a seed round this year right like that is like the reality and like you know even in from that perspective it's been tough where it's like oh man like a lot of incubators want you to have co-founders now you spend half the time like fundraising and then trying to like meet co-founders and find co-founders rather than building the thing this is a lot of time spent out doing uh things I'm not really good at. I do think there's a rising trend in solo founding yeah.Swyx [00:33:06]: You know I am a solo I think that something like 30 percent of like I forget what the exact status something like 30 percent of starters that make it to like series B or something actually are solo founder I feel like this must have co-founder idea mostly comes from YC and most everyone else copies it and then plenty of companies break up over co-founderJason [00:33:27]: Yeah and I bet it would be like I wonder how much of it is the people who don't have that much like and I hope this is not a diss to anybody but it's like you sort of you go through the incubator route because you don't have like the social equity you would need is just sort of like send an email to Sequoia and be like hey I'm going on this ride you want a ticket on the rocket ship right like that's very hard to sell my message if I was to raise money is like you've seen my twitter my life is sick I've decided to make it much worse by being a founder because this is something I have to do so do you want to come along otherwise I want to fund it myself like if I can't say that like I don't need the money because I can like handle payroll and like hire an intern and get an assistant like that's all fine but I really don't want to go back to meta I want to like get two years to like try to find a problem we're solving that feels like a bad timeAlessio [00:34:12]: Yeah Jason is like I wear a YSL jacket on stage at AI Engineer Summit I don't need your accelerator moneyJason [00:34:18]: And boots, you don't forget the boots. But I think that is a part of it right I think it is just like optionality and also just like I'm a lot older now I think 22 year old Jason would have been probably too scared and now I'm like too wise but I think it's a matter of like oh if you raise money you have to have a plan of spending it and I'm just not that creative with spending that much money yeah I mean to be clear you just celebrated your 30th birthday happy birthday yeah it's awesome so next week a lot older is relative to some some of the folks I think seeing on the career tipsAlessio [00:34:48]: I think Swix had a great post about are you too old to get into AI I saw one of your tweets in January 23 you applied to like Figma, Notion, Cohere, Anthropic and all of them rejected you because you didn't have enough LLM experience I think at that time it would be easy for a lot of people to say oh I kind of missed the boat you know I'm too late not gonna make it you know any advice for people that feel like thatJason [00:35:14]: Like the biggest learning here is actually from a lot of folks in jiu-jitsu they're like oh man like is it too late to start jiu-jitsu like I'll join jiu-jitsu once I get in more shape right it's like there's a lot of like excuses and then you say oh like why should I start now I'll be like 45 by the time I'm any good and say well you'll be 45 anyways like time is passing like if you don't start now you start tomorrow you're just like one more day behind if you're worried about being behind like today is like the soonest you can start right and so you got to recognize that like maybe you just don't want it and that's fine too like if you wanted you would have started I think a lot of these people again probably think of things on a too short time horizon but again you know you're gonna be old anyways you may as well just start now you knowSwyx [00:35:55]: One more thing on I guess the um career advice slash sort of vlogging you always go viral for this post that you wrote on advice to young people and the lies you tell yourself oh yeah yeah you said you were writing it for your sister.Jason [00:36:05]: She was like bummed out about going to college and like stressing about jobs and I was like oh and I really want to hear okay and I just kind of like text-to-sweep the whole thing it's crazy it's got like 50,000 views like I'm mind I mean your average tweet has more but that thing is like a 30-minute read nowSwyx [00:36:26]: So there's lots of stuff here which I agree with I you know I'm also of occasionally indulge in the sort of life reflection phase there's the how to be lucky there's the how to have high agency I feel like the agency thing is always a trend in sf or just in tech circles how do you define having high agencyJason [00:36:42]: I'm almost like past the high agency phase now now my biggest concern is like okay the agency is just like the norm of the vector what also matters is the direction right it's like how pure is the shot yeah I mean I think agency is just a matter of like having courage and doing the thing that's scary right you know if people want to go rock climbing it's like do you decide you want to go rock climbing then you show up to the gym you rent some shoes and you just fall 40 times or do you go like oh like I'm actually more intelligent let me go research the kind of shoes that I want okay like there's flatter shoes and more inclined shoes like which one should I get okay let me go order the shoes on Amazon I'll come back in three days like oh it's a little bit too tight maybe it's too aggressive I'm only a beginner let me go change no I think the higher agent person just like goes and like falls down 20 times right yeah I think the higher agency person is more focused on like process metrics versus outcome metrics right like from pottery like one thing I learned was if you want to be good at pottery you shouldn't count like the number of cups or bowls you make you should just weigh the amount of clay you use right like the successful person says oh I went through 100 pounds of clay right the less agency was like oh I've made six cups and then after I made six cups like there's not really what are you what do you do next no just pounds of clay pounds of clay same with the work here right so you just got to write the tweets like make the commits contribute open source like write the documentation there's no real outcome it's just a process and if you love that process you just get really good at the thing you're doingSwyx [00:38:04]: yeah so just to push back on this because obviously I mostly agree how would you design performance review systems because you were effectively saying we can count lines of code for developers rightJason [00:38:15]: I don't think that would be the actual like I think if you make that an outcome like I can just expand a for loop right I think okay so for performance review this is interesting because I've mostly thought of it from the perspective of science and not engineering I've been running a lot of engineering stand-ups primarily because there's not really that many machine learning folks the process outcome is like experiments and ideas right like if you think about outcome is what you might want to think about an outcome is oh I want to improve the revenue or whatnot but that's really hard but if you're someone who is going out like okay like this week I want to come up with like three or four experiments I might move the needle okay nothing worked to them they might think oh nothing worked like I suck but to me it's like wow you've closed off all these other possible avenues for like research like you're gonna get to the place that you're gonna figure out that direction really soon there's no way you try 30 different things and none of them work usually like 10 of them work five of them work really well two of them work really really well and one thing was like the nail in the head so agency lets you sort of capture the volume of experiments and like experience lets you figure out like oh that other half it's not worth doing right I think experience is going like half these prompting papers don't make any sense just use chain of thought and just you know use a for loop that's basically right it's like usually performance for me is around like how many experiments are you running how oftentimes are you trying.Alessio [00:39:32]: When do you give up on an experiment because a StitchFix you kind of give up on language models I guess in a way as a tool to use and then maybe the tools got better you were right at the time and then the tool improved I think there are similar paths in my engineering career where I try one approach and at the time it doesn't work and then the thing changes but then I kind of soured on that approach and I don't go back to it soonJason [00:39:51]: I see yeah how do you think about that loop so usually when I'm coaching folks and as they say like oh these things don't work I'm not going to pursue them in the future like one of the big things like hey the negative result is a result and this is something worth documenting like this is an academia like if it's negative you don't just like not publish right but then like what do you actually write down like what you should write down is like here are the conditions this is the inputs and the outputs we tried the experiment on and then one thing that's really valuable is basically writing down under what conditions would I revisit these experiments these things don't work because of what we had at the time if someone is reading this two years from now under what conditions will we try again that's really hard but again that's like another skill you kind of learn right it's like you do go back and you do experiments you figure out why it works now I think a lot of it here is just like scaling worked yeah rap lyrics you know that was because I did not have high enough quality data if we phase shift and say okay you don't even need training data oh great then it might just work a different domainAlessio [00:40:48]: Do you have anything in your list that is like it doesn't work now but I want to try it again later? Something that people should maybe keep in mind you know people always like agi when you know when are you going to know the agi is here maybe it's less than that but any stuff that you tried recently that didn't work thatJason [00:41:01]: You think will get there I mean I think the personal assistance and the writing I've shown to myself it's just not good enough yet so I hired a writer and I hired a personal assistant so now I'm gonna basically like work with these people until I figure out like what I can actually like automate and what are like the reproducible steps but like I think the experiment for me is like I'm gonna go pay a person like thousand dollars a month that helped me improve my life and then let me get them to help me figure like what are the components and how do I actually modularize something to get it to work because it's not just like a lot gmail calendar and like notion it's a little bit more complicated than that but we just don't know what that is yet those are two sort of systems that I wish gb4 or opus was actually good enough to just write me an essay but most of the essays are still pretty badSwyx [00:41:44]: yeah I would say you know on the personal assistance side Lindy is probably the one I've seen the most flow was at a speaker at the summit I don't know if you've checked it out or any other sort of agents assistant startupJason [00:41:54]: Not recently I haven't tried lindy they were not ga last time I was considering it yeah yeah a lot of it now it's like oh like really what I want you to do is take a look at all of my meetings and like write like a really good weekly summary email for my clients to remind them that I'm like you know thinking of them and like working for them right or it's like I want you to notice that like my monday is like way too packed and like block out more time and also like email the people to do the reschedule and then try to opt in to move them around and then I want you to say oh jason should have like a 15 minute prep break after form back to back those are things that now I know I can prompt them in but can it do it well like before I didn't even know that's what I wanted to prompt for us defragging a calendar and adding break so I can like eat lunch yeah that's the AGI test yeah exactly compassion right I think one thing that yeah we didn't touch on it before butAlessio [00:42:44]: I think was interesting you had this tweet a while ago about prompts should be code and then there were a lot of companies trying to build prompt engineering tooling kind of trying to turn the prompt into a more structured thing what's your thought today now you want to turn the thinking into DAGs like do prompts should still be code any updated ideasJason [00:43:04]: It's the same thing right I think you know with Instructor it is very much like the output model is defined as a code object that code object is sent to the LLM and in return you get a data structure so the outputs of these models I think should also be code objects and the inputs somewhat should be code objects but I think the one thing that instructor tries to do is separate instruction data and the types of the output and beyond that I really just think that most of it should be still like managed pretty closely to the developer like so much of is changing that if you give control of these systems away too early you end up ultimately wanting them back like many companies I know that I reach out or ones were like oh we're going off of the frameworks because now that we know what the business outcomes we're trying to optimize for these frameworks don't work yeah because we do rag but we want to do rag to like sell you supplements or to have you like schedule the fitness appointment the prompts are kind of too baked into the systems to really pull them back out and like start doing upselling or something it's really funny but a lot of it ends up being like once you understand the business outcomes you care way more about the promptSwyx [00:44:07]: Actually this is fun in our prep for this call we were trying to say like what can you as an independent person say that maybe me and Alessio cannot say or me you know someone at a company say what do you think is the market share of the frameworks the LangChain, the LlamaIndex, the everything...Jason [00:44:20]: Oh massive because not everyone wants to care about the code yeah right I think that's a different question to like what is the business model and are they going to be like massively profitable businesses right making hundreds of millions of dollars that feels like so straightforward right because not everyone is a prompt engineer like there's so much productivity to be captured in like back office optim automations right it's not because they care about the prompts that they care about managing these things yeah but those would be sort of low code experiences you yeah I think the bigger challenge is like okay hundred million dollars probably pretty easy it's just time and effort and they have the manpower and the money to sort of solve those problems again if you go the vc route then it's like you're talking about billions and that's really the goal that stuff for me it's like pretty unclear but again that is to say that like I sort of am building things for developers who want to use infrastructure to build their own tooling in terms of the amount of developers there are in the world versus downstream consumers of these things or even just think of how many companies will use like the adobes and the ibms right because they want something that's fully managed and they want something that they know will work and if the incremental 10% requires you to hire another team of 20 people you might not want to do it and I think that kind of organization is really good for uh those are bigger companiesSwyx [00:45:32]: I just want to capture your thoughts on one more thing which is you said you wanted most of the prompts to stay close to the developer and Hamel Husain wrote this post which I really love called f you show me the prompt yeah I think he cites you in one of those part of the blog post and I think ds pi is kind of like the complete antithesis of that which is I think it's interesting because I also hold the strong view that AI is a better prompt engineer than you are and I don't know how to square that wondering if you have thoughtsJason [00:45:58]: I think something like DSPy can work because there are like very short-term metrics to measure success right it is like did you find the pii or like did you write the multi-hop question the correct way but in these workflows that I've been managing a lot of it are we minimizing churn and maximizing retention yeah that's a very long loop it's not really like a uptuna like training loop right like those things are much more harder to capture so we don't actually have those metrics for that right and obviously we can figure out like okay is the summary good but like how do you measure the quality of the summary it's like that feedback loop it ends up being a lot longer and then again when something changes it's really hard to make sure that it works across these like newer models or again like changes to work for the current process like when we migrate from like anthropic to open ai like there's just a ton of change that are like infrastructure related not necessarily around the prompt itself yeah cool any other ai engineering startups that you think should not exist before we wrap up i mean oh my gosh i mean a lot of it again it's just like every time of investors like how does this make a billion dollars like it doesn't i'm gonna go back to just like tweeting and holding my breath underwater yeah like i don't really pay attention too much to most of this like most of the stuff i'm doing is around like the consumer of like llm calls yep i think people just want to move really fast and they will end up pick these vendors but i don't really know if anything has really like blown me out the water like i only trust myself but that's also a function of just being an old man like i think you know many companies are definitely very happy with using most of these tools anyways but i definitely think i occupy a very small space in the engineering ecosystem.Swyx [00:47:41]: Yeah i would say one of the challenges here you know you call about the dealing in the consumer of llm's space i think that's what ai engineering differs from ml engineering and i think a constant disconnect or cognitive dissonance in this field in the ai engineers that have sprung up is that they are not as good as the ml engineers they are not as qualified i think that you know you are someone who has credibility in the mle space and you are also a very authoritative figure in the ai space and i think so and you know i think you've built the de facto leading library i think yours i think instructors should be part of the standard lib even though i try to not use it like i basically also end up rebuilding instructor right like that's a lot of the back and forth that we had over the past two days i think that's the fundamental thing that we're trying to figure out like there's very small supply of MLEs not everyone's going to have that experience that you had but the global demand for AI is going to far outstrip the existing MLEs.Jason [00:48:36]: So what do we do do we force everyone to go through the standard MLE curriculum or do we make a new one? I'

god new york ai man japan advice state san francisco design evolution single numbers chatgpt code tesla defining ga flight identifying agency agent structure consulting concept spark models cto react vc nlp openai instructors function sf residence api backed gpt python ui waterloo emergence workflow notion apis clip frameworks llm opus temporal anthropic structured prompts versus agi elo sequoia dag django zapier ide usf guardrails dags rag hex postman haiku figma sonnets sdks sentry alessio rejections yc stitch fix ysl query extracting cursor json faang xml looping mrr datadog pii outputs etl markov prefect postgres joshua brown cohere mle elicit autogpt gbt gpd langchain toolthe ai engineer simon willison xgboost dagster latent space swix jerry liu babyagi

Supervise the Process of AI Research — with Jungwon Byun and Andreas Stuhlmüller of Elicit

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Apr 11, 2024 56:20

Maggie, Linus, Geoffrey, and the LS crew are reuniting for our second annual AI UX demo day in SF on Apr 28. Sign up to demo here! And don't forget tickets for the AI Engineer World's Fair — for early birds who join before keynote announcements!It's become fashionable for many AI startups to project themselves as “the next Google” - while the search engine is so 2000s, both Perplexity and Exa referred to themselves as a “research engine” or “answer engine” in our NeurIPS pod. However these searches tend to be relatively shallow, and it is challenging to zoom up and down the ladders of abstraction to garner insights. For serious researchers, this level of simple one-off search will not cut it.We've commented in our Jan 2024 Recap that Flow Engineering (simply; multi-turn processes over many-shot single prompts) seems to offer far more performance, control and reliability for a given cost budget. Our experiments with Devin and our understanding of what the new Elicit Notebooks offer a glimpse into the potential for very deep, open ended, thoughtful human-AI collaboration at scale.It starts with promptsWhen ChatGPT exploded in popularity in November 2022 everyone was turned into a prompt engineer. While generative models were good at "vibe based" outcomes (tell me a joke, write a poem, etc) with basic prompts, they struggled with more complex questions, especially in symbolic fields like math, logic, etc. Two of the most important "tricks" that people picked up on were:* Chain of Thought prompting strategy proposed by Wei et al in the “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. Rather than doing traditional few-shot prompting with just question and answers, adding the thinking process that led to the answer resulted in much better outcomes.* Adding "Let's think step by step" to the prompt as a way to boost zero-shot reasoning, which was popularized by Kojima et al in the Large Language Models are Zero-Shot Reasoners paper from NeurIPS 2022. This bumped accuracy from 17% to 79% compared to zero-shot.Nowadays, prompts include everything from promises of monetary rewards to… whatever the Nous folks are doing to turn a model into a world simulator. At the end of the day, the goal of prompt engineering is increasing accuracy, structure, and repeatability in the generation of a model.From prompts to agentsAs prompt engineering got more and more popular, agents (see “The Anatomy of Autonomy”) took over Twitter with cool demos and AutoGPT became the fastest growing repo in Github history. The thing about AutoGPT that fascinated people was the ability to simply put in an objective without worrying about explaining HOW to achieve it, or having to write very sophisticated prompts. The system would create an execution plan on its own, and then loop through each task. The problem with open-ended agents like AutoGPT is that 1) it's hard to replicate the same workflow over and over again 2) there isn't a way to hard-code specific steps that the agent should take without actually coding them yourself, which isn't what most people want from a product. From agents to productsPrompt engineering and open-ended agents were great in the experimentation phase, but this year more and more of these workflows are starting to become polished products. Today's guests are Andreas Stuhlmüller and Jungwon Byun of Elicit (previously Ought), an AI research assistant that they think of as “the best place to understand what is known”. Ought was a non-profit, but last September, Elicit spun off into a PBC with a $9m seed round. It is hard to quantify how much a workflow can be improved, but Elicit boasts some impressive numbers for research assistants:Just four months after launch, Elicit crossed $1M ARR, which shows how much interest there is for AI products that just work.One of the main takeaways we had from the episode is how teams should focus on supervising the process, not the output. Their philosophy at Elicit isn't to train general models, but to train models that are extremely good at focusing processes. This allows them to have pre-created steps that the user can add to their workflow (like classifying certain features that are specific to their research field) without having to write a prompt for it. And for Hamel Husain's happiness, they always show you the underlying prompt. Elicit recently announced notebooks as a new interface to interact with their products: (fun fact, they tried to implement this 4 times before they landed on the right UX! We discuss this ~33:00 in the podcast)The reasons why they picked notebooks as a UX all tie back to process:* They are systematic; once you have a instruction/prompt that works on a paper, you can run hundreds of papers through the same workflow by creating a column. Notebooks can also be edited and exported at any point during the flow.* They are transparent - Many papers include an opaque literature review as perfunctory context before getting to their novel contribution. But PDFs are “dead” and it is difficult to follow the thought process and exact research flow of the authors. Sharing “living” Elicit Notebooks opens up this process.* They are unbounded - Research is an endless stream of rabbit holes. So it must be easy to dive deeper and follow up with extra steps, without losing the ability to surface for air. We had a lot of fun recording this, and hope you have as much fun listening!AI UX in SFLong time Latent Spacenauts might remember our first AI UX meetup with Linus Lee, Geoffrey Litt, and Maggie Appleton last year. Well, Maggie has since joined Elicit, and they are all returning at the end of this month! Sign up here: https://lu.ma/aiuxAnd submit demos here! https://forms.gle/iSwiesgBkn8oo4SS8We expect the 200 seats to “sell out” fast. Attendees with demos will be prioritized.Show Notes* Elicit* Ought (their previous non-profit)* “Pivoting” with GPT-4* Elicit notebooks launch* Charlie* Andreas' BlogTimestamps* [00:00:00] Introductions* [00:07:45] How Johan and Andreas Joined Forces to Create Elicit* [00:10:26] Why Products > Research* [00:15:49] The Evolution of Elicit's Product* [00:19:44] Automating Literature Review Workflow* [00:22:48] How GPT-3 to GPT-4 Changed Things* [00:25:37] Managing LLM Pricing and Performance* [00:31:07] Open vs. Closed: Elicit's Approach to Model Selection* [00:31:56] Moving to Notebooks* [00:39:11] Elicit's Budget for Model Queries and Evaluations* [00:41:44] Impact of Long Context Windows* [00:47:19] Underrated Features and Surprising Applications* [00:51:35] Driving Systematic and Efficient Research* [00:53:00] Elicit's Team Growth and Transition to a Public Benefit Corporation* [00:55:22] Building AI for GoodFull Interview on YouTubeAs always, a plug for our youtube version for the 80% of communication that is nonverbal:TranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:15]: Hey, and today we are back in the studio with Andreas and Jungwon from Elicit. Welcome.Jungwon [00:00:20]: Thanks guys.Andreas [00:00:21]: It's great to be here.Swyx [00:00:22]: Yeah. So I'll introduce you separately, but also, you know, we'd love to learn a little bit more about you personally. So Andreas, it looks like you started Elicit first, Jungwon joined later.Andreas [00:00:32]: That's right. For all intents and purposes, the Elicit and also the Ought that existed before then were very different from what I started. So I think it's like fair to say that you co-founded it.Swyx [00:00:43]: Got it. And Jungwon, you're a co-founder and COO of Elicit now.Jungwon [00:00:46]: Yeah, that's right.Swyx [00:00:47]: So there's a little bit of a history to this. I'm not super aware of like the sort of journey. I was aware of OTT and Elicit as sort of a nonprofit type situation. And recently you turned into like a B Corp, Public Benefit Corporation. So yeah, maybe if you want, you could take us through that journey of finding the problem. You know, obviously you're working together now. So like, how do you get together to decide to leave your startup career to join him?Andreas [00:01:10]: Yeah, it's truly a very long journey. I guess truly, it kind of started in Germany when I was born. So even as a kid, I was always interested in AI, like I kind of went to the library. There were books about how to write programs in QBasic and like some of them talked about how to implement chatbots.Jungwon [00:01:27]: To be clear, he grew up in like a tiny village on the outskirts of Munich called Dinkelschirben, where it's like a very, very idyllic German village.Andreas [00:01:36]: Yeah, important to the story. So basically, the main thing is I've kind of always been thinking about AI my entire life and been thinking about, well, at some point, this is going to be a huge deal. It's going to be transformative. How can I work on it? And was thinking about it from when I was a teenager, after high school did a year where I started a startup with the intention to become rich. And then once I'm rich, I can affect the trajectory of AI. Did not become rich, decided to go back to college and study cognitive science there, which was like the closest thing I could find at the time to AI. In the last year of college, moved to the US to do a PhD at MIT, working on broadly kind of new programming languages for AI because it kind of seemed like the existing languages were not great at expressing world models and learning world models doing Bayesian inference. Was always thinking about, well, ultimately, the goal is to actually build tools that help people reason more clearly, ask and answer better questions and make better decisions. But for a long time, it seemed like the technology to put reasoning in machines just wasn't there. Initially, at the end of my postdoc at Stanford, I was thinking about, well, what to do? I think the standard path is you become an academic and do research. But it's really hard to actually build interesting tools as an academic. You can't really hire great engineers. Everything is kind of on a paper-to-paper timeline. And so I was like, well, maybe I should start a startup, pursued that for a little bit. But it seemed like it was too early because you could have tried to do an AI startup, but probably would not have been this kind of AI startup we're seeing now. So then decided to just start a nonprofit research lab that's going to do research for a while until we better figure out how to do thinking in machines. And that was odd. And then over time, it became clear how to actually build actual tools for reasoning. And only over time, we developed a better way to... I'll let you fill in some of the details here.Jungwon [00:03:26]: Yeah. So I guess my story maybe starts around 2015. I kind of wanted to be a founder for a long time, and I wanted to work on an idea that stood the test of time for me, like an idea that stuck with me for a long time. And starting in 2015, actually, originally, I became interested in AI-based tools from the perspective of mental health. So there are a bunch of people around me who are really struggling. One really close friend in particular is really struggling with mental health and didn't have any support, and it didn't feel like there was anything before kind of like getting hospitalized that could just help her. And so luckily, she came and stayed with me for a while, and we were just able to talk through some things. But it seemed like lots of people might not have that resource, and something maybe AI-enabled could be much more scalable. I didn't feel ready to start a company then, that's 2015. And I also didn't feel like the technology was ready. So then I went into FinTech and kind of learned how to do the tech thing. And then in 2019, I felt like it was time for me to just jump in and build something on my own I really wanted to create. And at the time, I looked around at tech and felt like not super inspired by the options. I didn't want to have a tech career ladder, or I didn't want to climb the career ladder. There are two kind of interesting technologies at the time, there was AI and there was crypto. And I was like, well, the AI people seem like a little bit more nice, maybe like slightly more trustworthy, both super exciting, but threw my bet in on the AI side. And then I got connected to Andreas. And actually, the way he was thinking about pursuing the research agenda at OTT was really compatible with what I had envisioned for an ideal AI product, something that helps kind of take down really complex thinking, overwhelming thoughts and breaks it down into small pieces. And then this kind of mission that we need AI to help us figure out what we ought to do was really inspiring, right? Yeah, because I think it was clear that we were building the most powerful optimizer of our time. But as a society, we hadn't figured out how to direct that optimization potential. And if you kind of direct tremendous amounts of optimization potential at the wrong thing, that's really disastrous. So the goal of OTT was make sure that if we build the most transformative technology of our lifetime, it can be used for something really impactful, like good reasoning, like not just generating ads. My background was in marketing, but like, so I was like, I want to do more than generate ads with this. But also if these AI systems get to be super intelligent enough that they are doing this really complex reasoning, that we can trust them, that they are aligned with us and we have ways of evaluating that they're doing the right thing. So that's what OTT did. We did a lot of experiments, you know, like I just said, before foundation models really like took off. A lot of the issues we were seeing were more in reinforcement learning, but we saw a future where AI would be able to do more kind of logical reasoning, not just kind of extrapolate from numerical trends. We actually kind of set up experiments with people where kind of people stood in as super intelligent systems and we effectively gave them context windows. So they would have to like read a bunch of text and one person would get less text and one person would get all the texts and the person with less text would have to evaluate the work of the person who could read much more. So like in a world we were basically simulating, like in 2018, 2019, a world where an AI system could read significantly more than you and you as the person who couldn't read that much had to evaluate the work of the AI system. Yeah. So there's a lot of the work we did. And from that, we kind of iterated on the idea of breaking complex tasks down into smaller tasks, like complex tasks, like open-ended reasoning, logical reasoning into smaller tasks so that it's easier to train AI systems on them. And also so that it's easier to evaluate the work of the AI system when it's done. And then also kind of, you know, really pioneered this idea, the importance of supervising the process of AI systems, not just the outcomes. So a big part of how Elicit is built is we're very intentional about not just throwing a ton of data into a model and training it and then saying, cool, here's like scientific output. Like that's not at all what we do. Our approach is very much like, what are the steps that an expert human does or what is like an ideal process as granularly as possible, let's break that down and then train AI systems to perform each of those steps very robustly. When you train like that from the start, after the fact, it's much easier to evaluate, it's much easier to troubleshoot at each point. Like where did something break down? So yeah, we were working on those experiments for a while. And then at the start of 2021, decided to build a product.Swyx [00:07:45]: Do you mind if I, because I think you're about to go into more modern thought and Elicit. And I just wanted to, because I think a lot of people are in where you were like sort of 2018, 19, where you chose a partner to work with. Yeah. Right. And you didn't know him. Yeah. Yeah. You were just kind of cold introduced. A lot of people are cold introduced. Yeah. Never work with them. I assume you had a lot, a lot of other options, right? Like how do you advise people to make those choices?Jungwon [00:08:10]: We were not totally cold introduced. So one of our closest friends introduced us. And then Andreas had written a lot on the OTT website, a lot of blog posts, a lot of publications. And I just read it and I was like, wow, this sounds like my writing. And even other people, some of my closest friends I asked for advice from, they were like, oh, this sounds like your writing. But I think I also had some kind of like things I was looking for. I wanted someone with a complimentary skillset. I want someone who was very values aligned. And yeah, that was all a good fit.Andreas [00:08:38]: We also did a pretty lengthy mutual evaluation process where we had a Google doc where we had all kinds of questions for each other. And I think it ended up being around 50 pages or so of like various like questions and back and forth.Swyx [00:08:52]: Was it the YC list? There's some lists going around for co-founder questions.Andreas [00:08:55]: No, we just made our own questions. But I guess it's probably related in that you ask yourself, what are the values you care about? How would you approach various decisions and things like that?Jungwon [00:09:04]: I shared like all of my past performance reviews. Yeah. Yeah.Swyx [00:09:08]: And he never had any. No.Andreas [00:09:10]: Yeah.Swyx [00:09:11]: Sorry, I just had to, a lot of people are going through that phase and you kind of skipped over it. I was like, no, no, no, no. There's like an interesting story.Jungwon [00:09:20]: Yeah.Alessio [00:09:21]: Yeah. Before we jump into what a list it is today, the history is a bit counterintuitive. So you start with figuring out, oh, if we had a super powerful model, how would we align it? But then you were actually like, well, let's just build the product so that people can actually leverage it. And I think there are a lot of folks today that are now back to where you were maybe five years ago that are like, oh, what if this happens rather than focusing on actually building something useful with it? What clicked for you to like move into a list and then we can cover that story too.Andreas [00:09:49]: I think in many ways, the approach is still the same because the way we are building illicit is not let's train a foundation model to do more stuff. It's like, let's build a scaffolding such that we can deploy powerful models to good ends. I think it's different now in that we actually have like some of the models to plug in. But if in 2017, we had had the models, we could have run the same experiments we did run with humans back then, just with models. And so in many ways, our philosophy is always, let's think ahead to the future of what models are going to exist in one, two years or longer. And how can we make it so that they can actually be deployed in kind of transparent, controllableJungwon [00:10:26]: ways? I think motivationally, we both are kind of product people at heart. The research was really important and it didn't make sense to build a product at that time. But at the end of the day, the thing that always motivated us is imagining a world where high quality reasoning is really abundant and AI is a technology that's going to get us there. And there's a way to guide that technology with research, but we can have a more direct effect through product because with research, you publish the research and someone else has to implement that into the product and the product felt like a more direct path. And we wanted to concretely have an impact on people's lives. Yeah, I think the kind of personally, the motivation was we want to build for people.Swyx [00:11:03]: Yep. And then just to recap as well, like the models you were using back then were like, I don't know, would they like BERT type stuff or T5 or I don't know what timeframe we're talking about here.Andreas [00:11:14]: I guess to be clear, at the very beginning, we had humans do the work. And then I think the first models that kind of make sense were TPT-2 and TNLG and like Yeah, early generative models. We do also use like T5 based models even now started with TPT-2.Swyx [00:11:30]: Yeah, cool. I'm just kind of curious about like, how do you start so early? You know, like now it's obvious where to start, but back then it wasn't.Jungwon [00:11:37]: Yeah, I used to nag Andreas a lot. I was like, why are you talking to this? I don't know. I felt like TPT-2 is like clearly can't do anything. And I was like, Andreas, you're wasting your time, like playing with this toy. But yeah, he was right.Alessio [00:11:50]: So what's the history of what Elicit actually does as a product? You recently announced that after four months, you get to a million in revenue. Obviously, a lot of people use it, get a lot of value, but it would initially kind of like structured data extraction from papers. Then you had kind of like concept grouping. And today, it's maybe like a more full stack research enabler, kind of like paper understander platform. What's the definitive definition of what Elicit is? And how did you get here?Jungwon [00:12:15]: Yeah, we say Elicit is an AI research assistant. I think it will continue to evolve. That's part of why we're so excited about building and research, because there's just so much space. I think the current phase we're in right now, we talk about it as really trying to make Elicit the best place to understand what is known. So it's all a lot about like literature summarization. There's a ton of information that the world already knows. It's really hard to navigate, hard to make it relevant. So a lot of it is around document discovery and processing and analysis. I really kind of want to import some of the incredible productivity improvements we've seen in software engineering and data science and into research. So it's like, how can we make researchers like data scientists of text? That's why we're launching this new set of features called Notebooks. It's very much inspired by computational notebooks, like Jupyter Notebooks, you know, DeepNode or Colab, because they're so powerful and so flexible. And ultimately, when people are trying to get to an answer or understand insight, they're kind of like manipulating evidence and information. Today, that's all packaged in PDFs, which are super brittle. So with language models, we can decompose these PDFs into their underlying claims and evidence and insights, and then let researchers mash them up together, remix them and analyze them together. So yeah, I would say quite simply, overall, Elicit is an AI research assistant. Right now we're focused on text-based workflows, but long term, really want to kind of go further and further into reasoning and decision making.Alessio [00:13:35]: And when you say AI research assistant, this is kind of meta research. So researchers use Elicit as a research assistant. It's not a generic you-can-research-anything type of tool, or it could be, but like, what are people using it for today?Andreas [00:13:49]: Yeah. So specifically in science, a lot of people use human research assistants to do things. You tell your grad student, hey, here are a couple of papers. Can you look at all of these, see which of these have kind of sufficiently large populations and actually study the disease that I'm interested in, and then write out like, what are the experiments they did? What are the interventions they did? What are the outcomes? And kind of organize that for me. And the first phase of understanding what is known really focuses on automating that workflow because a lot of that work is pretty rote work. I think it's not the kind of thing that we need humans to do. Language models can do it. And then if language models can do it, you can obviously scale it up much more than a grad student or undergrad research assistant would be able to do.Jungwon [00:14:31]: Yeah. The use cases are pretty broad. So we do have a very large percent of our users are just using it personally or for a mix of personal and professional things. People who care a lot about health or biohacking or parents who have children with a kind of rare disease and want to understand the literature directly. So there is an individual kind of consumer use case. We're most focused on the power users. So that's where we're really excited to build. So Lissette was very much inspired by this workflow in literature called systematic reviews or meta-analysis, which is basically the human state of the art for summarizing scientific literature. And it typically involves like five people working together for over a year. And they kind of first start by trying to find the maximally comprehensive set of papers possible. So it's like 10,000 papers. And they kind of systematically narrow that down to like hundreds or 50 extract key details from every single paper. Usually have two people doing it, like a third person reviewing it. So it's like an incredibly laborious, time consuming process, but you see it in every single domain. So in science, in machine learning, in policy, because it's so structured and designed to be reproducible, it's really amenable to automation. So that's kind of the workflow that we want to automate first. And then you make that accessible for any question and make these really robust living summaries of science. So yeah, that's one of the workflows that we're starting with.Alessio [00:15:49]: Our previous guest, Mike Conover, he's building a new company called Brightwave, which is an AI research assistant for financial research. How do you see the future of these tools? Does everything converge to like a God researcher assistant, or is every domain going to have its own thing?Andreas [00:16:03]: I think that's a good and mostly open question. I do think there are some differences across domains. For example, some research is more quantitative data analysis, and other research is more high level cross domain thinking. And we definitely want to contribute to the broad generalist reasoning type space. Like if researchers are making discoveries often, it's like, hey, this thing in biology is actually analogous to like these equations in economics or something. And that's just fundamentally a thing that where you need to reason across domains. At least within research, I think there will be like one best platform more or less for this type of generalist research. I think there may still be like some particular tools like for genomics, like particular types of modules of genes and proteins and whatnot. But for a lot of the kind of high level reasoning that humans do, I think that is a more of a winner type all thing.Swyx [00:16:52]: I wanted to ask a little bit deeper about, I guess, the workflow that you mentioned. I like that phrase. I see that in your UI now, but that's as it is today. And I think you were about to tell us about how it was in 2021 and how it may be progressed. How has this workflow evolved over time?Jungwon [00:17:07]: Yeah. So the very first version of Elicit actually wasn't even a research assistant. It was a forecasting assistant. So we set out and we were thinking about, you know, what are some of the most impactful types of reasoning that if we could scale up, AI would really transform the world. We actually started with literature review, but we're like, oh, so many people are going to build literature review tools. So let's start there. So then we focused on geopolitical forecasting. So I don't know if you're familiar with like manifold or manifold markets. That kind of stuff. Before manifold. Yeah. Yeah. I'm not predicting relationships. We're predicting like, is China going to invade Taiwan?Swyx [00:17:38]: Markets for everything.Andreas [00:17:39]: Yeah. That's a relationship.Swyx [00:17:41]: Yeah.Jungwon [00:17:42]: Yeah. It's true. And then we worked on that for a while. And then after GPT-3 came out, I think by that time we realized that originally we were trying to help people convert their beliefs into probability distributions. And so take fuzzy beliefs, but like model them more concretely. And then after a few months of iterating on that, just realize, oh, the thing that's blocking people from making interesting predictions about important events in the world is less kind of on the probabilistic side and much more on the research side. And so that kind of combined with the very generalist capabilities of GPT-3 prompted us to make a more general research assistant. Then we spent a few months iterating on what even is a research assistant. So we would embed with different researchers. We built data labeling workflows in the beginning, kind of right off the bat. We built ways to find experts in a field and like ways to ask good research questions. So we just kind of iterated through a lot of workflows and no one else was really building at this time. And it was like very quick to just do some prompt engineering and see like what is a task that is at the intersection of what's technologically capable and like important for researchers. And we had like a very nondescript landing page. It said nothing. But somehow people were signing up and we had to sign a form that was like, why are you here? And everyone was like, I need help with literature review. And we're like, oh, literature review. That sounds so hard. I don't even know what that means. We're like, we don't want to work on it. But then eventually we were like, okay, everyone is saying literature review. It's overwhelmingly people want to-Swyx [00:19:02]: And all domains, not like medicine or physics or just all domains. Yeah.Jungwon [00:19:06]: And we also kind of personally knew literature review was hard. And if you look at the graphs for academic literature being published every single month, you guys know this in machine learning, it's like up into the right, like superhuman amounts of papers. So we're like, all right, let's just try it. I was really nervous, but Andreas was like, this is kind of like the right problem space to jump into, even if we don't know what we're doing. So my take was like, fine, this feels really scary, but let's just launch a feature every single week and double our user numbers every month. And if we can do that, we'll fail fast and we will find something. I was worried about like getting lost in the kind of academic white space. So the very first version was actually a weekend prototype that Andreas made. Do you want to explain how that worked?Andreas [00:19:44]: I mostly remember that it was really bad. The thing I remember is you entered a question and it would give you back a list of claims. So your question could be, I don't know, how does creatine affect cognition? It would give you back some claims that are to some extent based on papers, but they were often irrelevant. The papers were often irrelevant. And so we ended up soon just printing out a bunch of examples of results and putting them up on the wall so that we would kind of feel the constant shame of having such a bad product and would be incentivized to make it better. And I think over time it has gotten a lot better, but I think the initial version was like really very bad. Yeah.Jungwon [00:20:20]: But it was basically like a natural language summary of an abstract, like kind of a one sentence summary, and which we still have. And then as we learned kind of more about this systematic review workflow, we started expanding the capability so that you could extract a lot more data from the papers and do more with that.Swyx [00:20:33]: And were you using like embeddings and cosine similarity, that kind of stuff for retrieval, or was it keyword based?Andreas [00:20:40]: I think the very first version didn't even have its own search engine. I think the very first version probably used the Semantic Scholar or API or something similar. And only later when we discovered that API is not very semantic, we then built our own search engine that has helped a lot.Swyx [00:20:58]: And then we're going to go into like more recent products stuff, but like, you know, I think you seem the more sort of startup oriented business person and you seem sort of more ideologically like interested in research, obviously, because of your PhD. What kind of market sizing were you guys thinking? Right? Like, because you're here saying like, we have to double every month. And I'm like, I don't know how you make that conclusion from this, right? Especially also as a nonprofit at the time.Jungwon [00:21:22]: I mean, market size wise, I felt like in this space where so much was changing and it was very unclear what of today was actually going to be true tomorrow. We just like really rested a lot on very, very simple fundamental principles, which is like, if you can understand the truth, that is very economically beneficial and valuable. If you like know the truth.Swyx [00:21:42]: On principle.Jungwon [00:21:43]: Yeah. That's enough for you. Yeah. Research is the key to many breakthroughs that are very commercially valuable.Swyx [00:21:47]: Because my version of it is students are poor and they don't pay for anything. Right? But that's obviously not true. As you guys have found out. But you had to have some market insight for me to have believed that, but you skipped that.Andreas [00:21:58]: Yeah. I remember talking to VCs for our seed round. A lot of VCs were like, you know, researchers, they don't have any money. Why don't you build legal assistant? I think in some short sighted way, maybe that's true. But I think in the long run, R&D is such a big space of the economy. I think if you can substantially improve how quickly people find new discoveries or avoid controlled trials that don't go anywhere, I think that's just huge amounts of money. And there are a lot of questions obviously about between here and there. But I think as long as the fundamental principle is there, we were okay with that. And I guess we found some investors who also were. Yeah.Swyx [00:22:35]: Congrats. I mean, I'm sure we can cover the sort of flip later. I think you're about to start us on like GPT-3 and how that changed things for you. It's funny. I guess every major GPT version, you have some big insight. Yeah.Jungwon [00:22:48]: Yeah. I mean, what do you think?Andreas [00:22:51]: I think it's a little bit less true for us than for others, because we always believed that there will basically be human level machine work. And so it is definitely true that in practice for your product, as new models come out, your product starts working better, you can add some features that you couldn't add before. But I don't think we really ever had the moment where we were like, oh, wow, that is super unanticipated. We need to do something entirely different now from what was on the roadmap.Jungwon [00:23:21]: I think GPT-3 was a big change because it kind of said, oh, now is the time that we can use AI to build these tools. And then GPT-4 was maybe a little bit more of an extension of GPT-3. GPT-3 over GPT-2 was like qualitative level shift. And then GPT-4 was like, okay, great. Now it's like more accurate. We're more accurate on these things. We can answer harder questions. But the shape of the product had already taken place by that time.Swyx [00:23:44]: I kind of want to ask you about this sort of pivot that you've made. But I guess that was just a way to sell what you were doing, which is you're adding extra features on grouping by concepts. The GPT-4 pivot, quote unquote pivot that you-Jungwon [00:23:55]: Oh, yeah, yeah, exactly. Right, right, right. Yeah. Yeah. When we launched this workflow, now that GPT-4 was available, basically Elisa was at a place where we have very tabular interfaces. So given a table of papers, you can extract data across all the tables. But you kind of want to take the analysis a step further. Sometimes what you'd care about is not having a list of papers, but a list of arguments, a list of effects, a list of interventions, a list of techniques. And so that's one of the things we're working on is now that you've extracted this information in a more structured way, can you pivot it or group by whatever the information that you extracted to have more insight first information still supported by the academic literature?Swyx [00:24:33]: Yeah, that was a big revelation when I saw it. Basically, I think I'm very just impressed by how first principles, your ideas around what the workflow is. And I think that's why you're not as reliant on like the LLM improving, because it's actually just about improving the workflow that you would recommend to people. Today we might call it an agent, I don't know, but you're not relying on the LLM to drive it. It's relying on this is the way that Elicit does research. And this is what we think is most effective based on talking to our users.Jungwon [00:25:01]: The problem space is still huge. Like if it's like this big, we are all still operating at this tiny part, bit of it. So I think about this a lot in the context of moats, people are like, oh, what's your moat? What happens if GPT-5 comes out? It's like, if GPT-5 comes out, there's still like all of this other space that we can go into. So I think being really obsessed with the problem, which is very, very big, has helped us like stay robust and just kind of directly incorporate model improvements and they keep going.Swyx [00:25:26]: And then I first encountered you guys with Charlie, you can tell us about that project. Basically, yeah. Like how much did cost become a concern as you're working more and more with OpenAI? How do you manage that relationship?Jungwon [00:25:37]: Let me talk about who Charlie is. And then you can talk about the tech, because Charlie is a special character. So Charlie, when we found him was, had just finished his freshman year at the University of Warwick. And I think he had heard about us on some discord. And then he applied and we were like, wow, who is this freshman? And then we just saw that he had done so many incredible side projects. And we were actually on a team retreat in Barcelona visiting our head of engineering at that time. And everyone was talking about this wonder kid or like this kid. And then on our take home project, he had done like the best of anyone to that point. And so people were just like so excited to hire him. So we hired him as an intern and they were like, Charlie, what if you just dropped out of school? And so then we convinced him to take a year off. And he was just incredibly productive. And I think the thing you're referring to is at the start of 2023, Anthropic kind of launched their constitutional AI paper. And within a few days, I think four days, he had basically implemented that in production. And then we had it in app a week or so after that. And he has since kind of contributed to major improvements, like cutting costs down to a tenth of what they were really large scale. But yeah, you can talk about the technical stuff. Yeah.Andreas [00:26:39]: On the constitutional AI project, this was for abstract summarization, where in illicit, if you run a query, it'll return papers to you, and then it will summarize each paper with respect to your query for you on the fly. And that's a really important part of illicit because illicit does it so much. If you run a few searches, it'll have done it a few hundred times for you. And so we cared a lot about this both being fast, cheap, and also very low on hallucination. I think if illicit hallucinates something about the abstract, that's really not good. And so what Charlie did in that project was create a constitution that expressed what are the attributes of a good summary? Everything in the summary is reflected in the actual abstract, and it's like very concise, et cetera, et cetera. And then used RLHF with a model that was trained on the constitution to basically fine tune a better summarizer on an open source model. Yeah. I think that might still be in use.Jungwon [00:27:34]: Yeah. Yeah, definitely. Yeah. I think at the time, the models hadn't been trained at all to be faithful to a text. So they were just generating. So then when you ask them a question, they tried too hard to answer the question and didn't try hard enough to answer the question given the text or answer what the text said about the question. So we had to basically teach the models to do that specific task.Swyx [00:27:54]: How do you monitor the ongoing performance of your models? Not to get too LLM-opsy, but you are one of the larger, more well-known operations doing NLP at scale. I guess effectively, you have to monitor these things and nobody has a good answer that I talk to.Andreas [00:28:10]: I don't think we have a good answer yet. I think the answers are actually a little bit clearer on the just kind of basic robustness side of where you can import ideas from normal software engineering and normal kind of DevOps. You're like, well, you need to monitor kind of latencies and response times and uptime and whatnot.Swyx [00:28:27]: I think when we say performance, it's more about hallucination rate, isn't it?Andreas [00:28:30]: And then things like hallucination rate where I think there, the really important thing is training time. So we care a lot about having our own internal benchmarks for model development that reflect the distribution of user queries so that we can know ahead of time how well is the model going to perform on different types of tasks. So the tasks being summarization, question answering, given a paper, ranking. And for each of those, we want to know what's the distribution of things the model is going to see so that we can have well-calibrated predictions on how well the model is going to do in production. And I think, yeah, there's some chance that there's distribution shift and actually the things users enter are going to be different. But I think that's much less important than getting the kind of training right and having very high quality, well-vetted data sets at training time.Jungwon [00:29:18]: I think we also end up effectively monitoring by trying to evaluate new models as they come out. And so that kind of prompts us to go through our eval suite every couple of months. And every time a new model comes out, we have to see how is this performing relative to production and what we currently have.Swyx [00:29:32]: Yeah. I mean, since we're on this topic, any new models that have really caught your eye this year?Jungwon [00:29:37]: Like Claude came out with a bunch. Yeah. I think Claude is pretty, I think the team's pretty excited about Claude. Yeah.Andreas [00:29:41]: Specifically, Claude Haiku is like a good point on the kind of Pareto frontier. It's neither the cheapest model, nor is it the most accurate, most high quality model, but it's just like a really good trade-off between cost and accuracy.Swyx [00:29:57]: You apparently have to 10-shot it to make it good. I tried using Haiku for summarization, but zero-shot was not great. Then they were like, you know, it's a skill issue, you have to try harder.Jungwon [00:30:07]: I think GPT-4 unlocked tables for us, processing data from tables, which was huge. GPT-4 Vision.Andreas [00:30:13]: Yeah.Swyx [00:30:14]: Yeah. Did you try like Fuyu? I guess you can't try Fuyu because it's non-commercial. That's the adept model.Jungwon [00:30:19]: Yeah.Swyx [00:30:20]: We haven't tried that one. Yeah. Yeah. Yeah. But Claude is multimodal as well. Yeah. I think the interesting insight that we got from talking to David Luan, who is CEO of multimodality has effectively two different flavors. One is we recognize images from a camera in the outside natural world. And actually the more important multimodality for knowledge work is screenshots and PDFs and charts and graphs. So we need a new term for that kind of multimodality.Andreas [00:30:45]: But is the claim that current models are good at one or the other? Yeah.Swyx [00:30:50]: They're over-indexed because of the history of computer vision is Coco, right? So now we're like, oh, actually, you know, screens are more important, OCR, handwriting. You mentioned a lot of like closed model lab stuff, and then you also have like this open source model fine tuning stuff. Like what is your workload now between closed and open? It's a good question.Andreas [00:31:07]: I think- Is it half and half? It's a-Swyx [00:31:10]: Is that even a relevant question or not? Is this a nonsensical question?Andreas [00:31:13]: It depends a little bit on like how you index, whether you index by like computer cost or number of queries. I'd say like in terms of number of queries, it's maybe similar. In terms of like cost and compute, I think the closed models make up more of the budget since the main cases where you want to use closed models are cases where they're just smarter, where no existing open source models are quite smart enough.Jungwon [00:31:35]: Yeah. Yeah.Alessio [00:31:37]: We have a lot of interesting technical questions to go in, but just to wrap the kind of like UX evolution, now you have the notebooks. We talked a lot about how chatbots are not the final frontier, you know? How did you decide to get into notebooks, which is a very iterative kind of like interactive interface and yeah, maybe learnings from that.Jungwon [00:31:56]: Yeah. This is actually our fourth time trying to make this work. Okay. I think the first time was probably in early 2021. I think because we've always been obsessed with this idea of task decomposition and like branching, we always wanted a tool that could be kind of unbounded where you could keep going, could do a lot of branching where you could kind of apply language model operations or computations on other tasks. So in 2021, we had this thing called composite tasks where you could use GPT-3 to brainstorm a bunch of research questions and then take each research question and decompose those further into sub questions. This kind of, again, that like task decomposition tree type thing was always very exciting to us, but that was like, it didn't work and it was kind of overwhelming. Then at the end of 22, I think we tried again and at that point we were thinking, okay, we've done a lot with this literature review thing. We also want to start helping with kind of adjacent domains and different workflows. Like we want to help more with machine learning. What does that look like? And as we were thinking about it, we're like, well, there are so many research workflows. How do we not just build three new workflows into Elicit, but make Elicit really generic to lots of workflows? What is like a generic composable system with nice abstractions that can like scale to all these workflows? So we like iterated on that a bunch and then didn't quite narrow the problem space enough or like quite get to what we wanted. And then I think it was at the beginning of 2023 where we're like, wow, computational notebooks kind of enable this, where they have a lot of flexibility, but kind of robust primitives such that you can extend the workflow and it's not limited. It's not like you ask a query, you get an answer, you're done. You can just constantly keep building on top of that. And each little step seems like a really good unit of work for the language model. And also there was just like really helpful to have a bit more preexisting work to emulate. Yeah, that's kind of how we ended up at computational notebooks for Elicit.Andreas [00:33:44]: Maybe one thing that's worth making explicit is the difference between computational notebooks and chat, because on the surface, they seem pretty similar. It's kind of this iterative interaction where you add stuff. In both cases, you have a back and forth between you enter stuff and then you get some output and then you enter stuff. But the important difference in our minds is with notebooks, you can define a process. So in data science, you can be like, here's like my data analysis process that takes in a CSV and then does some extraction and then generates a figure at the end. And you can prototype it using a small CSV and then you can run it over a much larger CSV later. And similarly, the vision for notebooks in our case is to not make it this like one-off chat interaction, but to allow you to then say, if you start and first you're like, okay, let me just analyze a few papers and see, do I get to the correct conclusions for those few papers? Can I then later go back and say, now let me run this over 10,000 papers now that I've debugged the process using a few papers. And that's an interaction that doesn't fit quite as well into the chat framework because that's more for kind of quick back and forth interaction.Alessio [00:34:49]: Do you think in notebooks, it's kind of like structure, editable chain of thought, basically step by step? Like, is that kind of where you see this going? And then are people going to reuse notebooks as like templates? And maybe in traditional notebooks, it's like cookbooks, right? You share a cookbook, you can start from there. Is this similar in Elizit?Andreas [00:35:06]: Yeah, that's exactly right. So that's our hope that people will build templates, share them with other people. I think chain of thought is maybe still like kind of one level lower on the abstraction hierarchy than we would think of notebooks. I think we'll probably want to think about more semantic pieces like a building block is more like a paper search or an extraction or a list of concepts. And then the model's detailed reasoning will probably often be one level down. You always want to be able to see it, but you don't always want it to be front and center.Alessio [00:35:36]: Yeah, what's the difference between a notebook and an agent? Since everybody always asks me, what's an agent? Like how do you think about where the line is?Andreas [00:35:44]: Yeah, it's an interesting question. In the notebook world, I would generally think of the human as the agent in the first iteration. So you have the notebook and the human kind of adds little action steps. And then the next point on this kind of progress gradient is, okay, now you can use language models to predict which action would you take as a human. And at some point, you're probably going to be very good at this, you'll be like, okay, in some cases I can, with 99.9% accuracy, predict what you do. And then you might as well just execute it, like why wait for the human? And eventually, as you get better at this, that will just look more and more like agents taking actions as opposed to you doing the thing. I think templates are a specific case of this where you're like, okay, well, there's just particular sequences of actions that you often want to chunk and have available as primitives, just like in normal programming. And those, you can view them as action sequences of agents, or you can view them as more normal programming language abstraction thing. And I think those are two valid views. Yeah.Alessio [00:36:40]: How do you see this change as, like you said, the models get better and you need less and less human actual interfacing with the model, you just get the results? Like how does the UX and the way people perceive it change?Jungwon [00:36:52]: Yeah, I think this kind of interaction paradigms for evaluation is not really something the internet has encountered yet, because up to now, the internet has all been about getting data and work from people. So increasingly, I really want kind of evaluation, both from an interface perspective and from like a technical perspective and operation perspective to be a superpower for Elicit, because I think over time, models will do more and more of the work, and people will have to do more and more of the evaluation. So I think, yeah, in terms of the interface, some of the things we have today, you know, for every kind of language model generation, there's some citation back, and we kind of try to highlight the ground truth in the paper that is most relevant to whatever Elicit said, and make it super easy so that you can click on it and quickly see in context and validate whether the text actually supports the answer that Elicit gave. So I think we'd probably want to scale things up like that, like the ability to kind of spot check the model's work super quickly, scale up interfaces like that. And-Swyx [00:37:44]: Who would spot check? The user?Jungwon [00:37:46]: Yeah, to start, it would be the user. One of the other things we do is also kind of flag the model's uncertainty. So we have models report out, how confident are you that this was the sample size of this study? The model's not sure, we throw a flag. And so the user knows to prioritize checking that. So again, we can kind of scale that up. So when the model's like, well, I searched this on Google, I'm not sure if that was the right thing. I have an uncertainty flag, and the user can go and be like, oh, okay, that was actually the right thing to do or not.Swyx [00:38:10]: I've tried to do uncertainty readings from models. I don't know if you have this live. You do? Yeah. Because I just didn't find them reliable because they just hallucinated their own uncertainty. I would love to base it on log probs or something more native within the model rather than generated. But okay, it sounds like they scale properly for you. Yeah.Jungwon [00:38:30]: We found it to be pretty calibrated. It varies on the model.Andreas [00:38:32]: I think in some cases, we also use two different models for the uncertainty estimates than for the question answering. So one model would say, here's my chain of thought, here's my answer. And then a different type of model. Let's say the first model is Llama, and let's say the second model is GPT-3.5. And then the second model just looks over the results and is like, okay, how confident are you in this? And I think sometimes using a different model can be better than using the same model. Yeah.Swyx [00:38:58]: On the topic of models, evaluating models, obviously you can do that all day long. What's your budget? Because your queries fan out a lot. And then you have models evaluating models. One person typing in a question can lead to a thousand calls.Andreas [00:39:11]: It depends on the project. So if the project is basically a systematic review that otherwise human research assistants would do, then the project is basically a human equivalent spend. And the spend can get quite large for those projects. I don't know, let's say $100,000. In those cases, you're happier to spend compute then in the kind of shallow search case where someone just enters a question because, I don't know, maybe I heard about creatine. What's it about? Probably don't want to spend a lot of compute on that. This sort of being able to invest more or less compute into getting more or less accurate answers is I think one of the core things we care about. And that I think is currently undervalued in the AI space. I think currently you can choose which model you want and you can sometimes, I don't know, you'll tip it and it'll try harder or you can try various things to get it to work harder. But you don't have great ways of converting willingness to spend into better answers. And we really want to build a product that has this sort of unbounded flavor where if you care about it a lot, you should be able to get really high quality answers, really double checked in every way.Alessio [00:40:14]: And you have a credits-based pricing. So unlike most products, it's not a fixed monthly fee.Jungwon [00:40:19]: Right, exactly. So some of the higher costs are tiered. So for most casual users, they'll just get the abstract summary, which is kind of an open source model. Then you can add more columns, which have more extractions and these uncertainty features. And then you can also add the same columns in high accuracy mode, which also parses the table. So we kind of stack the complexity on the calls.Swyx [00:40:39]: You know, the fun thing you can do with a credit system, which is data for data, basically you can give people more credits if they give data back to you. I don't know if you've already done that. We've thought about something like this.Jungwon [00:40:49]: It's like if you don't have money, but you have time, how do you exchange that?Swyx [00:40:54]: It's a fair trade.Jungwon [00:40:55]: I think it's interesting. We haven't quite operationalized it. And then, you know, there's been some kind of like adverse selection. Like, you know, for example, it would be really valuable to get feedback on our model. So maybe if you were willing to give more robust feedback on our results, we could give you credits or something like that. But then there's kind of this, will people take it seriously? And you want the good people. Exactly.Swyx [00:41:11]: Can you tell who are the good people? Not right now.Jungwon [00:41:13]: But yeah, maybe at the point where we can, we can offer it. We can offer it up to them.Swyx [00:41:16]: The perplexity of questions asked, you know, if it's higher perplexity, these are the smarterJungwon [00:41:20]: people. Yeah, maybe.Andreas [00:41:23]: If you put typos in your queries, you're not going to get off the stage.Swyx [00:41:28]: Negative social credit. It's very topical right now to think about the threat of long context windows. All these models that we're talking about these days, all like a million token plus. Is that relevant for you? Can you make use of that? Is that just prohibitively expensive because you're just paying for all those tokens or you're just doing rag?Andreas [00:41:44]: It's definitely relevant. And when we think about search, as many people do, we think about kind of a staged pipeline of retrieval where first you use semantic search database with embeddings, get like the, in our case, maybe 400 or so most relevant papers. And then, then you still need to rank those. And I think at that point it becomes pretty interesting to use larger models. So specifically in the past, I think a lot of ranking was kind of per item ranking where you would score each individual item, maybe using increasingly expensive scoring methods and then rank based on the scores. But I think list-wise re-ranking where you have a model that can see all the elements is a lot more powerful because often you can only really tell how good a thing is in comparison to other things and what things should come first. It really depends on like, well, what other things that are available, maybe you even care about diversity in your results. You don't want to show 10 very similar papers as the first 10 results. So I think a long context models are quite interesting there. And especially for our case where we care more about power users who are perhaps a little bit more willing to wait a little bit longer to get higher quality results relative to people who just quickly check out things because why not? And I think being able to spend more on longer contexts is quite valuable.Jungwon [00:42:55]: Yeah. I think one thing the longer context models changed for us is maybe a focus from breaking down tasks to breaking down the evaluation. So before, you know, if we wanted to answer a question from the full text of a paper, we had to figure out how to chunk it and like find the relevant chunk and then answer based on that chunk. And the nice thing was then, you know, kind of which chunk the model used to answer the question. So if you want to help the user track it, yeah, you can be like, well, this was the chunk that the model got. And now if you put the whole text in the paper, you have to like kind of find the chunk like more retroactively basically. And so you need kind of like a different set of abilities and obviously like a different technology to figure out. You still want to point the user to the supporting quotes in the text, but then the interaction is a little different.Swyx [00:43:38]: You like scan through and find some rouge score floor.Andreas [00:43:41]: I think there's an interesting space of almost research problems here because you would ideally make causal claims like if this hadn't been in the text, the model wouldn't have said this thing. And maybe you can do expensive approximations to that where like, I don't know, you just throw out chunk of the paper and re-answer and see what happens. But hopefully there are better ways of doing that where you just get that kind of counterfactual information for free from the model.Alessio [00:44:06]: Do you think at all about the cost of maintaining REG versus just putting more tokens in the window? I think in software development, a lot of times people buy developer productivity things so that we don't have to worry about it. Context window is kind of the same, right? You have to maintain chunking and like REG retrieval and like re-ranking and all of this versus I just shove everything into the context and like it costs a little more, but at least I don't have to do all of that. Is that something you thought about?Jungwon [00:44:31]: I think we still like hit up against context limits enough that it's not really, do we still want to keep this REG around? It's like we do still need it for the scale of the work that we're doing, yeah.Andreas [00:44:41]: And I think there are different kinds of maintainability. In one sense, I think you're right that throw everything into the context window thing is easier to maintain because you just can swap out a model. In another sense, if things go wrong, it's harder to debug where like, if you know, here's the process that we go through to go from 200 million papers to an answer. And there are like little steps and you understand, okay, this is the step that finds the relevant paragraph or whatever it may be. You'll know which step breaks if the answers are bad, whereas if it's just like a new model version came out and now it suddenly doesn't find your needle in a haystack anymore, then you're like, okay, what can you do? You're kind of at a loss.Alessio [00:45:21]: Let's talk a bit about, yeah, needle in a haystack and like maybe the opposite of it, which is like hard grounding. I don't know if that's like the best name to think about it, but I was using one of these chatwitcher documents features and I put the AMD MI300 specs and the new Blackwell chips from NVIDIA and I was asking questions and does the AMD chip support NVLink? And the response was like, oh, it doesn't say in the specs. But if you ask GPD 4 without the docs, it would tell you no, because NVLink it's a NVIDIA technology.Swyx [00:45:49]: It just says in the thing.Alessio [00:45:53]: How do you think about that? Does using the context sometimes suppress the knowledge that the model has?Andreas [00:45:57]: It really depends on the task because I think sometimes that is exactly what you want. So imagine you're a researcher, you're writing the background section of your paper and you're trying to describe what these other papers say. You really don't want extra information to be introduced there. In other cases where you're just trying to figure out the truth and you're giving the documents because you think they will help the model figure out what the truth is. I think you do want, if the model has a hunch that there might be something that's not in the papers, you do want to surface that. I think ideally you still don't want the model to just tell you, probably the ideal thing looks a bit more like agent control where the model can issue a query that then is intended to surface documents that substantiate its hunch. That's maybe a reasonable middle ground between model just telling you and model being fully limited to the papers you give it.Jungwon [00:46:44]: Yeah, I would say it's, they're just kind of different tasks right now. And the task that Elicit is mostly focused on is what do these papers say? But there's another task which is like, just give me the best possible answer and that give me the best possible answer sometimes depends on what do these papers say, but it can also depend on other stuff that's not in the papers. So ideally we can do both and then kind of do this overall task for you more going forward.Alessio [00:47:08]: We see a lot of details, but just to zoom back out a little bit, what are maybe the most underrated features of Elicit and what is one thing that maybe the users surprise you the most by using it?Jungwon [00:47:19]: I think the most powerful feature of Elicit is the ability to extract, add columns to this table, which effectively extracts data from all of your papers at once. It's well used, but there are kind of many different extensions of that that I think users are still discovering. So one is we let you give a description of the column. We let you give instructions of a column. We let you create custom columns. So we have like 30 plus predefined fields that users can extract, like what were the methods? What were the main findings? How many people were studied? And we actually show you basically the prompts that we're using to

god ceo university ai europe google china vision moving germany san francisco phd research sharing performance evolution german mit open impact language transition budget product barcelona negative coo stanford roi taiwan anatomy context markets bay area models cto andreas coco nlp openai fintech chain munich sf nvidia residence pivoting ux api autonomy gpt wei ui github llama geoffrey amd vcs llm warwick devops temporal r d anthropic attendees linus blackwell pdfs b corp ott reg kojima perplexity ls pareto evaluations haiku large language models ocr alessio yc notebooks bayesian tpt colab pbc csv ai research prefect t5 airflow daxter team growth elicit public benefit corporation autogpt jupyter notebooks gpd 1m arr exa neurips supervise imbue rlhf byun fuyu latent space qbasic nvlink

Latent Space Chats: NLW (Four Wars, GPT5), Josh Albrecht/Ali Rohde (TNAI), Dylan Patel/Semianalysis (Groq), Milind Naphade (Nvidia GTC), Personal AI (ft. Harrison Chase — LangFriend/LangMem)

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Apr 6, 2024 121:17

Our next 2 big events are AI UX and the World's Fair. Join and apply to speak/sponsor!Due to timing issues we didn't have an interview episode to share with you this week, but not to worry, we have more than enough “weekend special” content in the backlog for you to get your Latent Space fix, whether you like thinking about the big picture, or learning more about the pod behind the scenes, or talking Groq and GPUs, or AI Leadership, or Personal AI. Enjoy!AI BreakdownThe indefatigable NLW had us back on his show for an update on the Four Wars, covering Sora, Suno, and the reshaped GPT-4 Class Landscape:and a longer segment on AI Engineering trends covering the future LLM landscape (Llama 3, GPT-5, Gemini 2, Claude 4), Open Source Models (Mistral, Grok), Apple and Meta's AI strategy, new chips (Groq, MatX) and the general movement from baby AGIs to vertical Agents:Thursday Nights in AIWe're also including swyx's interview with Josh Albrecht and Ali Rohde to reintroduce swyx and Latent Space to a general audience, and engage in some spicy Q&A:Dylan Patel on GroqWe hosted a private event with Dylan Patel of SemiAnalysis (our last pod here):Not all of it could be released so we just talked about our Groq estimates:Milind Naphade - Capital OneIn relation to conversations at NeurIPS and Nvidia GTC and upcoming at World's Fair, we also enjoyed chatting with Milind Naphade about his AI Leadership work at IBM, Cisco, Nvidia, and now leading the AI Foundations org at Capital One. We covered:* Milind's learnings from ~25 years in machine learning * His first paper citation was 24 years ago* Lessons from working with Jensen Huang for 6 years and being CTO of Metropolis * Thoughts on relevant AI research* GTC takeaways and what makes NVIDIA specialIf you'd like to work on building solutions rather than platform (as Milind put it), his Applied AI Research team at Capital One is hiring, which falls under the Capital One Tech team.Personal AI MeetupIt all started with a meme:Within days of each other, BEE, FRIEND, EmilyAI, Compass, Nox and LangFriend were all launching personal AI wearables and assistants. So we decided to put together a the world's first Personal AI meetup featuring creators and enthusiasts of wearables. The full video is live now, with full show notes within.Timestamps* [00:01:13] AI Breakdown Part 1* [00:02:20] Four Wars* [00:13:45] Sora* [00:15:12] Suno* [00:16:34] The GPT-4 Class Landscape* [00:17:03] Data War: Reddit x Google* [00:21:53] Gemini 1.5 vs Claude 3* [00:26:58] AI Breakdown Part 2* [00:27:33] Next Frontiers: Llama 3, GPT-5, Gemini 2, Claude 4* [00:31:11] Open Source Models - Mistral, Grok* [00:34:13] Apple MM1* [00:37:33] Meta's $800b AI rebrand* [00:39:20] AI Engineer landscape - from baby AGIs to vertical Agents* [00:47:28] Adept episode - Screen Multimodality* [00:48:54] Top Model Research from January Recap* [00:53:08] AI Wearables* [00:57:26] Groq vs Nvidia month - GPU Chip War* [01:00:31] Disagreements* [01:02:08] Summer 2024 Predictions* [01:04:18] Thursday Nights in AI - swyx* [01:33:34] Dylan Patel - Semianalysis + Latent Space Live Show* [01:34:58] GroqTranscript[00:00:00] swyx: Welcome to the Latent Space Podcast Weekend Edition. This is Charlie, your AI co host. Swyx and Alessio are off for the week, making more great content. We have exciting interviews coming up with Elicit, Chroma, Instructor, and our upcoming series on NSFW, Not Safe for Work AI. In today's episode, we're collating some of Swyx and Alessio's recent appearances, all in one place for you to find.[00:00:32] swyx: In part one, we have our first crossover pod of the year. In our listener survey, several folks asked for more thoughts from our two hosts. In 2023, Swyx and Alessio did crossover interviews with other great podcasts like the AI Breakdown, Practical AI, Cognitive Revolution, Thursday Eye, and Chinatalk, all of which you can find in the Latentspace About page.[00:00:56] swyx: NLW of the AI Breakdown asked us back to do a special on the 4Wars framework and the AI engineer scene. We love AI Breakdown as one of the best examples Daily podcasts to keep up on AI news, so we were especially excited to be back on Watch out and take[00:01:12] NLW: care[00:01:13] AI Breakdown Part 1[00:01:13] NLW: today on the AI breakdown. Part one of my conversation with Alessio and Swix from Latent Space.[00:01:19] NLW: All right, fellas, welcome back to the AI Breakdown. How are you doing? I'm good. Very good. With the last, the last time we did this show, we were like, oh yeah, let's do check ins like monthly about all the things that are going on and then. Of course, six months later, and, you know, the, the, the world has changed in a thousand ways.[00:01:36] NLW: It's just, it's too busy to even, to even think about podcasting sometimes. But I, I'm super excited to, to be chatting with you again. I think there's, there's a lot to, to catch up on, just to tap in, I think in the, you know, in the beginning of 2024. And, and so, you know, we're gonna talk today about just kind of a, a, a broad sense of where things are in some of the key battles in the AI space.[00:01:55] NLW: And then the, you know, one of the big things that I, that I'm really excited to have you guys on here for us to talk about where, sort of what patterns you're seeing and what people are actually trying to build, you know, where, where developers are spending their, their time and energy and, and, and any sort of, you know, trend trends there, but maybe let's start I guess by checking in on a framework that you guys actually introduced, which I've loved and I've cribbed a couple of times now, which is this sort of four wars of the, of the AI stack.[00:02:20] Four Wars[00:02:20] NLW: Because first, since I have you here, I'd love, I'd love to hear sort of like where that started gelling. And then and then maybe we can get into, I think a couple of them that are you know, particularly interesting, you know, in the, in light of[00:02:30] swyx: some recent news. Yeah, so maybe I'll take this one. So the four wars is a framework that I came up around trying to recap all of 2023.[00:02:38] swyx: I tried to write sort of monthly recap pieces. And I was trying to figure out like what makes one piece of news last longer than another or more significant than another. And I think it's basically always around battlegrounds. Wars are fought around limited resources. And I think probably the, you know, the most limited resource is talent, but the talent expresses itself in a number of areas.[00:03:01] swyx: And so I kind of focus on those, those areas at first. So the four wars that we cover are the data wars, the GPU rich, poor war, the multi modal war, And the RAG and Ops War. And I think you actually did a dedicated episode to that, so thanks for covering that. Yeah, yeah.[00:03:18] NLW: Not only did I do a dedicated episode, I actually used that.[00:03:22] NLW: I can't remember if I told you guys. I did give you big shoutouts. But I used it as a framework for a presentation at Intel's big AI event that they hold each year, where they have all their folks who are working on AI internally. And it totally resonated. That's amazing. Yeah, so, so, what got me thinking about it again is specifically this inflection news that we recently had, this sort of, you know, basically, I can't imagine that anyone who's listening wouldn't have thought about it, but, you know, inflection is a one of the big contenders, right?[00:03:53] NLW: I think probably most folks would have put them, you know, just a half step behind the anthropics and open AIs of the world in terms of labs, but it's a company that raised 1. 3 billion last year, less than a year ago. Reed Hoffman's a co founder Mustafa Suleyman, who's a co founder of DeepMind, you know, so it's like, this is not a a small startup, let's say, at least in terms of perception.[00:04:13] NLW: And then we get the news that basically most of the team, it appears, is heading over to Microsoft and they're bringing in a new CEO. And you know, I'm interested in, in, in kind of your take on how much that reflects, like hold aside, I guess, you know, all the other things that it might be about, how much it reflects this sort of the, the stark.[00:04:32] NLW: Brutal reality of competing in the frontier model space right now. And, you know, just the access to compute.[00:04:38] Alessio: There are a lot of things to say. So first of all, there's always somebody who's more GPU rich than you. So inflection is GPU rich by startup standard. I think about 22, 000 H100s, but obviously that pales compared to the, to Microsoft.[00:04:55] Alessio: The other thing is that this is probably good news, maybe for the startups. It's like being GPU rich, it's not enough. You know, like I think they were building something pretty interesting in, in pi of their own model of their own kind of experience. But at the end of the day, you're the interface that people consume as end users.[00:05:13] Alessio: It's really similar to a lot of the others. So and we'll tell, talk about GPT four and cloud tree and all this stuff. GPU poor, doing something. That the GPU rich are not interested in, you know we just had our AI center of excellence at Decibel and one of the AI leads at one of the big companies was like, Oh, we just saved 10 million and we use these models to do a translation, you know, and that's it.[00:05:39] Alessio: It's not, it's not a GI, it's just translation. So I think like the inflection part is maybe. A calling and a waking to a lot of startups then say, Hey, you know, trying to get as much capital as possible, try and get as many GPUs as possible. Good. But at the end of the day, it doesn't build a business, you know, and maybe what inflection I don't, I don't, again, I don't know the reasons behind the inflection choice, but if you say, I don't want to build my own company that has 1.[00:06:05] Alessio: 3 billion and I want to go do it at Microsoft, it's probably not a resources problem. It's more of strategic decisions that you're making as a company. So yeah, that was kind of my. I take on it.[00:06:15] swyx: Yeah, and I guess on my end, two things actually happened yesterday. It was a little bit quieter news, but Stability AI had some pretty major departures as well.[00:06:25] swyx: And you may not be considering it, but Stability is actually also a GPU rich company in the sense that they were the first new startup in this AI wave to brag about how many GPUs that they have. And you should join them. And you know, Imadis is definitely a GPU trader in some sense from his hedge fund days.[00:06:43] swyx: So Robin Rhombach and like the most of the Stable Diffusion 3 people left Stability yesterday as well. So yesterday was kind of like a big news day for the GPU rich companies, both Inflection and Stability having sort of wind taken out of their sails. I think, yes, it's a data point in the favor of Like, just because you have the GPUs doesn't mean you can, you automatically win.[00:07:03] swyx: And I think, you know, kind of I'll echo what Alessio says there. But in general also, like, I wonder if this is like the start of a major consolidation wave, just in terms of, you know, I think that there was a lot of funding last year and, you know, the business models have not been, you know, All of these things worked out very well.[00:07:19] swyx: Even inflection couldn't do it. And so I think maybe that's the start of a small consolidation wave. I don't think that's like a sign of AI winter. I keep looking for AI winter coming. I think this is kind of like a brief cold front. Yeah,[00:07:34] NLW: it's super interesting. So I think a bunch of A bunch of stuff here.[00:07:38] NLW: One is, I think, to both of your points, there, in some ways, there, there had already been this very clear demarcation between these two sides where, like, the GPU pores, to use the terminology, like, just weren't trying to compete on the same level, right? You know, the vast majority of people who have started something over the last year, year and a half, call it, were racing in a different direction.[00:07:59] NLW: They're trying to find some edge somewhere else. They're trying to build something different. If they're, if they're really trying to innovate, it's in different areas. And so it's really just this very small handful of companies that are in this like very, you know, it's like the coheres and jaspers of the world that like this sort of, you know, that are that are just sort of a little bit less resourced than, you know, than the other set that I think that this potentially even applies to, you know, everyone else that could clearly demarcate it into these two, two sides.[00:08:26] NLW: And there's only a small handful kind of sitting uncomfortably in the middle, perhaps. Let's, let's come back to the idea of, of the sort of AI winter or, you know, a cold front or anything like that. So this is something that I, I spent a lot of time kind of thinking about and noticing. And my perception is that The vast majority of the folks who are trying to call for sort of, you know, a trough of disillusionment or, you know, a shifting of the phase to that are people who either, A, just don't like AI for some other reason there's plenty of that, you know, people who are saying, You Look, they're doing way worse than they ever thought.[00:09:03] NLW: You know, there's a lot of sort of confirmation bias kind of thing going on. Or two, media that just needs a different narrative, right? Because they're sort of sick of, you know, telling the same story. Same thing happened last summer, when every every outlet jumped on the chat GPT at its first down month story to try to really like kind of hammer this idea that that the hype was too much.[00:09:24] NLW: Meanwhile, you have, you know, just ridiculous levels of investment from enterprises, you know, coming in. You have, you know, huge, huge volumes of, you know, individual behavior change happening. But I do think that there's nothing incoherent sort of to your point, Swyx, about that and the consolidation period.[00:09:42] NLW: Like, you know, if you look right now, for example, there are, I don't know, probably 25 or 30 credible, like, build your own chatbot. platforms that, you know, a lot of which have, you know, raised funding. There's no universe in which all of those are successful across, you know, even with a, even, even with a total addressable market of every enterprise in the world, you know, you're just inevitably going to see some amount of consolidation.[00:10:08] NLW: Same with, you know, image generators. There are, if you look at A16Z's top 50 consumer AI apps, just based on, you know, web traffic or whatever, they're still like I don't know, a half. Dozen or 10 or something, like, some ridiculous number of like, basically things like Midjourney or Dolly three. And it just seems impossible that we're gonna have that many, you know, ultimately as, as, as sort of, you know, going, going concerned.[00:10:33] NLW: So, I don't know. I, I, I think that the, there will be inevitable consolidation 'cause you know. It's, it's also what kind of like venture rounds are supposed to do. You're not, not everyone who gets a seed round is supposed to get to series A and not everyone who gets a series A is supposed to get to series B.[00:10:46] NLW: That's sort of the natural process. I think it will be tempting for a lot of people to try to infer from that something about AI not being as sort of big or as as sort of relevant as, as it was hyped up to be. But I, I kind of think that's the wrong conclusion to come to.[00:11:02] Alessio: I I would say the experimentation.[00:11:04] Alessio: Surface is a little smaller for image generation. So if you go back maybe six, nine months, most people will tell you, why would you build a coding assistant when like Copilot and GitHub are just going to win everything because they have the data and they have all the stuff. If you fast forward today, A lot of people use Cursor everybody was excited about the Devin release on Twitter.[00:11:26] Alessio: There are a lot of different ways of attacking the market that are not completion of code in the IDE. And even Cursors, like they evolved beyond single line to like chat, to do multi line edits and, and all that stuff. Image generation, I would say, yeah, as a, just as from what I've seen, like maybe the product innovation has slowed down at the UX level and people are improving the models.[00:11:50] Alessio: So the race is like, how do I make better images? It's not like, how do I make the user interact with the generation process better? And that gets tough, you know? It's hard to like really differentiate yourselves. So yeah, that's kind of how I look at it. And when we think about multimodality, maybe the reason why people got so excited about Sora is like, oh, this is like a completely It's not a better image model.[00:12:13] Alessio: This is like a completely different thing, you know? And I think the creative mind It's always looking for something that impacts the viewer in a different way, you know, like they really want something different versus the developer mind. It's like, Oh, I, I just, I have this like very annoying thing I want better.[00:12:32] Alessio: I have this like very specific use cases that I want to go after. So it's just different. And that's why you see a lot more companies in image generation. But I agree with you that. If you fast forward there, there's not going to be 10 of them, you know, it's probably going to be one or[00:12:46] swyx: two. Yeah, I mean, to me, that's why I call it a war.[00:12:49] swyx: Like, individually, all these companies can make a story that kind of makes sense, but collectively, they cannot all be true. Therefore, they all, there is some kind of fight over limited resources here. Yeah, so[00:12:59] NLW: it's interesting. We wandered very naturally into sort of another one of these wars, which is the multimodality kind of idea, which is, you know, basically a question of whether it's going to be these sort of big everything models that end up winning or whether, you know, you're going to have really specific things, you know, like something, you know, Dolly 3 inside of sort of OpenAI's larger models versus, you know, a mid journey or something like that.[00:13:24] NLW: And at first, you know, I was kind of thinking like, For most of the last, call it six months or whatever, it feels pretty definitively both and in some ways, you know, and that you're, you're seeing just like great innovation on sort of the everything models, but you're also seeing lots and lots happen at sort of the level of kind of individual use cases.[00:13:45] Sora[00:13:45] NLW: But then Sora comes along and just like obliterates what I think anyone thought you know, where we were when it comes to video generation. So how are you guys thinking about this particular battle or war at the moment?[00:13:59] swyx: Yeah, this was definitely a both and story, and Sora tipped things one way for me, in terms of scale being all you need.[00:14:08] swyx: And the benefit, I think, of having multiple models being developed under one roof. I think a lot of people aren't aware that Sora was developed in a similar fashion to Dolly 3. And Dolly3 had a very interesting paper out where they talked about how they sort of bootstrapped their synthetic data based on GPT 4 vision and GPT 4.[00:14:31] swyx: And, and it was just all, like, really interesting, like, if you work on one modality, it enables you to work on other modalities, and all that is more, is, is more interesting. I think it's beneficial if it's all in the same house, whereas the individual startups who don't, who sort of carve out a single modality and work on that, definitely won't have the state of the art stuff on helping them out on synthetic data.[00:14:52] swyx: So I do think like, The balance is tilted a little bit towards the God model companies, which is challenging for the, for the, for the the sort of dedicated modality companies. But everyone's carving out different niches. You know, like we just interviewed Suno ai, the sort of music model company, and, you know, I don't see opening AI pursuing music anytime soon.[00:15:12] Suno[00:15:12] swyx: Yeah,[00:15:13] NLW: Suno's been phenomenal to play with. Suno has done that rare thing where, which I think a number of different AI product categories have done, where people who don't consider themselves particularly interested in doing the thing that the AI enables find themselves doing a lot more of that thing, right?[00:15:29] NLW: Like, it'd be one thing if Just musicians were excited about Suno and using it but what you're seeing is tons of people who just like music all of a sudden like playing around with it and finding themselves kind of down that rabbit hole, which I think is kind of like the highest compliment that you can give one of these startups at the[00:15:45] swyx: early days of it.[00:15:46] swyx: Yeah, I, you know, I, I asked them directly, you know, in the interview about whether they consider themselves mid journey for music. And he had a more sort of nuanced response there, but I think that probably the business model is going to be very similar because he's focused on the B2C element of that. So yeah, I mean, you know, just to, just to tie back to the question about, you know, You know, large multi modality companies versus small dedicated modality companies.[00:16:10] swyx: Yeah, highly recommend people to read the Sora blog posts and then read through to the Dali blog posts because they, they strongly correlated themselves with the same synthetic data bootstrapping methods as Dali. And I think once you make those connections, you're like, oh, like it, it, it is beneficial to have multiple state of the art models in house that all help each other.[00:16:28] swyx: And these, this, that's the one thing that a dedicated modality company cannot do.[00:16:34] The GPT-4 Class Landscape[00:16:34] NLW: So I, I wanna jump, I wanna kind of build off that and, and move into the sort of like updated GPT-4 class landscape. 'cause that's obviously been another big change over the last couple months. But for the sake of completeness, is there anything that's worth touching on with with sort of the quality?[00:16:46] NLW: Quality data or sort of a rag ops wars just in terms of, you know, anything that's changed, I guess, for you fundamentally in the last couple of months about where those things stand.[00:16:55] swyx: So I think we're going to talk about rag for the Gemini and Clouds discussion later. And so maybe briefly discuss the data piece.[00:17:03] Data War: Reddit x Google[00:17:03] swyx: I think maybe the only new thing was this Reddit deal with Google for like a 60 million dollar deal just ahead of their IPO, very conveniently turning Reddit into a AI data company. Also, very, very interestingly, a non exclusive deal, meaning that Reddit can resell that data to someone else. And it probably does become table stakes.[00:17:23] swyx: A lot of people don't know, but a lot of the web text dataset that originally started for GPT 1, 2, and 3 was actually scraped from GitHub. from Reddit at least the sort of vote scores. And I think, I think that's a, that's a very valuable piece of information. So like, yeah, I think people are figuring out how to pay for data.[00:17:40] swyx: People are suing each other over data. This, this, this war is, you know, definitely very, very much heating up. And I don't think, I don't see it getting any less intense. I, you know, next to GPUs, data is going to be the most expensive thing in, in a model stack company. And. You know, a lot of people are resorting to synthetic versions of it, which may or may not be kosher based on how far along or how commercially blessed the, the forms of creating that synthetic data are.[00:18:11] swyx: I don't know if Alessio, you have any other interactions with like Data source companies, but that's my two cents.[00:18:17] Alessio: Yeah yeah, I actually saw Quentin Anthony from Luther. ai at GTC this week. He's also been working on this. I saw Technium. He's also been working on the data side. I think especially in open source, people are like, okay, if everybody is putting the gates up, so to speak, to the data we need to make it easier for people that don't have 50 million a year to get access to good data sets.[00:18:38] Alessio: And Jensen, at his keynote, he did talk about synthetic data a little bit. So I think that's something that we'll definitely hear more and more of in the enterprise, which never bodes well, because then all the, all the people with the data are like, Oh, the enterprises want to pay now? Let me, let me put a pay here stripe link so that they can give me 50 million.[00:18:57] Alessio: But it worked for Reddit. I think the stock is up. 40 percent today after opening. So yeah, I don't know if it's all about the Google deal, but it's obviously Reddit has been one of those companies where, hey, you got all this like great community, but like, how are you going to make money? And like, they try to sell the avatars.[00:19:15] Alessio: I don't know if that it's a great business for them. The, the data part sounds as an investor, you know, the data part sounds a lot more interesting than, than consumer[00:19:25] swyx: cosmetics. Yeah, so I think, you know there's more questions around data you know, I think a lot of people are talking about the interview that Mira Murady did with the Wall Street Journal, where she, like, just basically had no, had no good answer for where they got the data for Sora.[00:19:39] swyx: I, I think this is where, you know, there's, it's in nobody's interest to be transparent about data, and it's, it's kind of sad for the state of ML and the state of AI research but it is what it is. We, we have to figure this out as a society, just like we did for music and music sharing. You know, in, in sort of the Napster to Spotify transition, and that might take us a decade.[00:19:59] swyx: Yeah, I[00:20:00] NLW: do. I, I agree. I think, I think that you're right to identify it, not just as that sort of technical problem, but as one where society has to have a debate with itself. Because I think that there's, if you rationally within it, there's Great kind of points on all side, not to be the sort of, you know, person who sits in the middle constantly, but it's why I think a lot of these legal decisions are going to be really important because, you know, the job of judges is to listen to all this stuff and try to come to things and then have other judges disagree.[00:20:24] NLW: And, you know, and have the rest of us all debate at the same time. By the way, as a total aside, I feel like the synthetic data right now is like eggs in the 80s and 90s. Like, whether they're good for you or bad for you, like, you know, we, we get one study that's like synthetic data, you know, there's model collapse.[00:20:42] NLW: And then we have like a hint that llama, you know, to the most high performance version of it, which was one they didn't release was trained on synthetic data. So maybe it's good. It's like, I just feel like every, every other week I'm seeing something sort of different about whether it's a good or bad for, for these models.[00:20:56] swyx: Yeah. The branding of this is pretty poor. I would kind of tell people to think about it like cholesterol. There's good cholesterol, bad cholesterol. And you can have, you know, good amounts of both. But at this point, it is absolutely without a doubt that most large models from here on out will all be trained as some kind of synthetic data and that is not a bad thing.[00:21:16] swyx: There are ways in which you can do it poorly. Whether it's commercial, you know, in terms of commercial sourcing or in terms of the model performance. But it's without a doubt that good synthetic data is going to help your model. And this is just a question of like where to obtain it and what kinds of synthetic data are valuable.[00:21:36] swyx: You know, if even like alpha geometry, you know, was, was a really good example from like earlier this year.[00:21:42] NLW: If you're using the cholesterol analogy, then my, then my egg thing can't be that far off. Let's talk about the sort of the state of the art and the, and the GPT 4 class landscape and how that's changed.[00:21:53] Gemini 1.5 vs Claude 3[00:21:53] NLW: Cause obviously, you know, sort of the, the two big things or a couple of the big things that have happened. Since we last talked, we're one, you know, Gemini first announcing that a model was coming and then finally it arriving, and then very soon after a sort of a different model arriving from Gemini and and Cloud three.[00:22:11] NLW: So I guess, you know, I'm not sure exactly where the right place to start with this conversation is, but, you know, maybe very broadly speaking which of these do you think have made a bigger impact? Thank you.[00:22:20] Alessio: Probably the one you can use, right? So, Cloud. Well, I'm sure Gemini is going to be great once they let me in, but so far I haven't been able to.[00:22:29] Alessio: I use, so I have this small podcaster thing that I built for our podcast, which does chapters creation, like named entity recognition, summarization, and all of that. Cloud Tree is, Better than GPT 4. Cloud2 was unusable. So I use GPT 4 for everything. And then when Opus came out, I tried them again side by side and I posted it on, on Twitter as well.[00:22:53] Alessio: Cloud is better. It's very good, you know, it's much better, it seems to me, it's much better than GPT 4 at doing writing that is more, you know, I don't know, it just got good vibes, you know, like the GPT 4 text, you can tell it's like GPT 4, you know, it's like, it always uses certain types of words and phrases and, you know, maybe it's just me because I've now done it for, you know, So, I've read like 75, 80 generations of these things next to each other.[00:23:21] Alessio: Clutter is really good. I know everybody is freaking out on twitter about it, my only experience of this is much better has been on the podcast use case. But I know that, you know, Quran from from News Research is a very big opus pro, pro opus person. So, I think that's also It's great to have people that actually care about other models.[00:23:40] Alessio: You know, I think so far to a lot of people, maybe Entropic has been the sibling in the corner, you know, it's like Cloud releases a new model and then OpenAI releases Sora and like, you know, there are like all these different things, but yeah, the new models are good. It's interesting.[00:23:55] NLW: My my perception is definitely that just, just observationally, Cloud 3 is certainly the first thing that I've seen where lots of people.[00:24:06] NLW: They're, no one's debating evals or anything like that. They're talking about the specific use cases that they have, that they used to use chat GPT for every day, you know, day in, day out, that they've now just switched over. And that has, I think, shifted a lot of the sort of like vibe and sentiment in the space too.[00:24:26] NLW: And I don't necessarily think that it's sort of a A like full you know, sort of full knock. Let's put it this way. I think it's less bad for open AI than it is good for anthropic. I think that because GPT 5 isn't there, people are not quite willing to sort of like, you know get overly critical of, of open AI, except in so far as they're wondering where GPT 5 is.[00:24:46] NLW: But I do think that it makes, Anthropic look way more credible as a, as a, as a player, as a, you know, as a credible sort of player, you know, as opposed to to, to where they were.[00:24:57] Alessio: Yeah. And I would say the benchmarks veil is probably getting lifted this year. I think last year. People were like, okay, this is better than this on this benchmark, blah, blah, blah, because maybe they did not have a lot of use cases that they did frequently.[00:25:11] Alessio: So it's hard to like compare yourself. So you, you defer to the benchmarks. I think now as we go into 2024, a lot of people have started to use these models from, you know, from very sophisticated things that they run in production to some utility that they have on their own. Now they can just run them side by side.[00:25:29] Alessio: And it's like, Hey, I don't care that like. The MMLU score of Opus is like slightly lower than GPT 4. It just works for me, you know, and I think that's the same way that traditional software has been used by people, right? Like you just strive for yourself and like, which one does it work, works best for you?[00:25:48] Alessio: Like nobody looks at benchmarks outside of like sales white papers, you know? And I think it's great that we're going more in that direction. We have a episode with Adapt coming out this weekend. I'll and some of their model releases, they specifically say, We do not care about benchmarks, so we didn't put them in, you know, because we, we don't want to look good on them.[00:26:06] Alessio: We just want the product to work. And I think more and more people will, will[00:26:09] swyx: go that way. Yeah. I I would say like, it does take the wind out of the sails for GPT 5, which I know where, you know, Curious about later on. I think anytime you put out a new state of the art model, you have to break through in some way.[00:26:21] swyx: And what Claude and Gemini have done is effectively take away any advantage to saying that you have a million token context window. Now everyone's just going to be like, Oh, okay. Now you just match the other two guys. And so that puts An insane amount of pressure on what gpt5 is going to be because it's just going to have like the only option it has now because all the other models are multimodal all the other models are long context all the other models have perfect recall gpt5 has to match everything and do more to to not be a flop[00:26:58] AI Breakdown Part 2[00:26:58] NLW: hello friends back again with part two if you haven't heard part one of this conversation i suggest you go check it out but to be honest they are kind of actually separable In this conversation, we get into a topic that I think Alessio and Swyx are very well positioned to discuss, which is what developers care about right now, what people are trying to build around.[00:27:16] NLW: I honestly think that one of the best ways to see the future in an industry like AI is to try to dig deep on what developers and entrepreneurs are attracted to build, even if it hasn't made it to the news pages yet. So consider this your preview of six months from now, and let's dive in. Let's bring it to the GPT 5 conversation.[00:27:33] Next Frontiers: Llama 3, GPT-5, Gemini 2, Claude 4[00:27:33] NLW: I mean, so, so I think that that's a great sort of assessment of just how the stakes have been raised, you know is your, I mean, so I guess maybe, maybe I'll, I'll frame this less as a question, just sort of something that, that I, that I've been watching right now, the only thing that makes sense to me with how.[00:27:50] NLW: Fundamentally unbothered and unstressed OpenAI seems about everything is that they're sitting on something that does meet all that criteria, right? Because, I mean, even in the Lex Friedman interview that, that Altman recently did, you know, he's talking about other things coming out first. He's talking about, he's just like, he, listen, he, he's good and he could play nonchalant, you know, if he wanted to.[00:28:13] NLW: So I don't want to read too much into it, but. You know, they've had so long to work on this, like unless that we are like really meaningfully running up against some constraint, it just feels like, you know, there's going to be some massive increase, but I don't know. What do you guys think?[00:28:28] swyx: Hard to speculate.[00:28:29] swyx: You know, at this point, they're, they're pretty good at PR and they're not going to tell you anything that they don't want to. And he can tell you one thing and change their minds the next day. So it's, it's, it's really, you know, I've always said that model version numbers are just marketing exercises, like they have something and it's always improving and at some point you just cut it and decide to call it GPT 5.[00:28:50] swyx: And it's more just about defining an arbitrary level at which they're ready and it's up to them on what ready means. We definitely did see some leaks on GPT 4. 5, as I think a lot of people reported and I'm not sure if you covered it. So it seems like there might be an intermediate release. But I did feel, coming out of the Lex Friedman interview, that GPT 5 was nowhere near.[00:29:11] swyx: And you know, it was kind of a sharp contrast to Sam talking at Davos in February, saying that, you know, it was his top priority. So I find it hard to square. And honestly, like, there's also no point Reading too much tea leaves into what any one person says about something that hasn't happened yet or has a decision that hasn't been taken yet.[00:29:31] swyx: Yeah, that's, that's my 2 cents about it. Like, calm down, let's just build .[00:29:35] Alessio: Yeah. The, the February rumor was that they were gonna work on AI agents, so I don't know, maybe they're like, yeah,[00:29:41] swyx: they had two agent two, I think two agent projects, right? One desktop agent and one sort of more general yeah, sort of GPTs like agent and then Andre left, so he was supposed to be the guy on that.[00:29:52] swyx: What did Andre see? What did he see? I don't know. What did he see?[00:29:56] Alessio: I don't know. But again, it's just like the rumors are always floating around, you know but I think like, this is, you know, we're not going to get to the end of the year without Jupyter you know, that's definitely happening. I think the biggest question is like, are Anthropic and Google.[00:30:13] Alessio: Increasing the pace, you know, like it's the, it's the cloud four coming out like in 12 months, like nine months. What's the, what's the deal? Same with Gemini. They went from like one to 1. 5 in like five days or something. So when's Gemini 2 coming out, you know, is that going to be soon? I don't know.[00:30:31] Alessio: There, there are a lot of, speculations, but the good thing is that now you can see a world in which OpenAI doesn't rule everything. You know, so that, that's the best, that's the best news that everybody got, I would say.[00:30:43] swyx: Yeah, and Mistral Large also dropped in the last month. And, you know, not as, not quite GPT 4 class, but very good from a new startup.[00:30:52] swyx: So yeah, we, we have now slowly changed in landscape, you know. In my January recap, I was complaining that nothing's changed in the landscape for a long time. But now we do exist in a world, sort of a multipolar world where Cloud and Gemini are legitimate challengers to GPT 4 and hopefully more will emerge as well hopefully from meta.[00:31:11] Open Source Models - Mistral, Grok[00:31:11] NLW: So speak, let's actually talk about sort of the open source side of this for a minute. So Mistral Large, notable because it's, it's not available open source in the same way that other things are, although I think my perception is that the community has largely given them Like the community largely recognizes that they want them to keep building open source stuff and they have to find some way to fund themselves that they're going to do that.[00:31:27] NLW: And so they kind of understand that there's like, they got to figure out how to eat, but we've got, so, you know, there there's Mistral, there's, I guess, Grok now, which is, you know, Grok one is from, from October is, is open[00:31:38] swyx: sourced at, yeah. Yeah, sorry, I thought you thought you meant Grok the chip company.[00:31:41] swyx: No, no, no, yeah, you mean Twitter Grok.[00:31:43] NLW: Although Grok the chip company, I think is even more interesting in some ways, but and then there's the, you know, obviously Llama3 is the one that sort of everyone's wondering about too. And, you know, my, my sense of that, the little bit that, you know, Zuckerberg was talking about Llama 3 earlier this year, suggested that, at least from an ambition standpoint, he was not thinking about how do I make sure that, you know, meta content, you know, keeps, keeps the open source thrown, you know, vis a vis Mistral.[00:32:09] NLW: He was thinking about how you go after, you know, how, how he, you know, releases a thing that's, you know, every bit as good as whatever OpenAI is on at that point.[00:32:16] Alessio: Yeah. From what I heard in the hallways at, at GDC, Llama 3, the, the biggest model will be, you 260 to 300 billion parameters, so that that's quite large.[00:32:26] Alessio: That's not an open source model. You know, you cannot give people a 300 billion parameters model and ask them to run it. You know, it's very compute intensive. So I think it is, it[00:32:35] swyx: can be open source. It's just, it's going to be difficult to run, but that's a separate question.[00:32:39] Alessio: It's more like, as you think about what they're doing it for, you know, it's not like empowering the person running.[00:32:45] Alessio: llama. On, on their laptop, it's like, oh, you can actually now use this to go after open AI, to go after Anthropic, to go after some of these companies at like the middle complexity level, so to speak. Yeah. So obviously, you know, we estimate Gentala on the podcast, they're doing a lot here, they're making PyTorch better.[00:33:03] Alessio: You know, they want to, that's kind of like maybe a little bit of a shorted. Adam Bedia, in a way, trying to get some of the CUDA dominance out of it. Yeah, no, it's great. The, I love the duck destroying a lot of monopolies arc. You know, it's, it's been very entertaining. Let's bridge[00:33:18] NLW: into the sort of big tech side of this, because this is obviously like, so I think actually when I did my episode, this was one of the I added this as one of as an additional war that, that's something that I'm paying attention to.[00:33:29] NLW: So we've got Microsoft's moves with inflection, which I think pretend, potentially are being read as A shift vis a vis the relationship with OpenAI, which also the sort of Mistral large relationship seems to reinforce as well. We have Apple potentially entering the race, finally, you know, giving up Project Titan and and, and kind of trying to spend more effort on this.[00:33:50] NLW: Although, Counterpoint, we also have them talking about it, or there being reports of a deal with Google, which, you know, is interesting to sort of see what their strategy there is. And then, you know, Meta's been largely quiet. We kind of just talked about the main piece, but, you know, there's, and then there's spoilers like Elon.[00:34:07] NLW: I mean, you know, what, what of those things has sort of been most interesting to you guys as you think about what's going to shake out for the rest of this[00:34:13] Apple MM1[00:34:13] swyx: year? I'll take a crack. So the reason we don't have a fifth war for the Big Tech Wars is that's one of those things where I just feel like we don't cover differently from other media channels, I guess.[00:34:26] swyx: Sure, yeah. In our anti interestness, we actually say, like, we try not to cover the Big Tech Game of Thrones, or it's proxied through Twitter. You know, all the other four wars anyway, so there's just a lot of overlap. Yeah, I think absolutely, personally, the most interesting one is Apple entering the race.[00:34:41] swyx: They actually released, they announced their first large language model that they trained themselves. It's like a 30 billion multimodal model. People weren't that impressed, but it was like the first time that Apple has kind of showcased that, yeah, we're training large models in house as well. Of course, like, they might be doing this deal with Google.[00:34:57] swyx: I don't know. It sounds very sort of rumor y to me. And it's probably, if it's on device, it's going to be a smaller model. So something like a Jemma. It's going to be smarter autocomplete. I don't know what to say. I'm still here dealing with, like, Siri, which hasn't, probably hasn't been updated since God knows when it was introduced.[00:35:16] swyx: It's horrible. I, you know, it, it, it makes me so angry. So I, I, one, as an Apple customer and user, I, I'm just hoping for better AI on Apple itself. But two, they are the gold standard when it comes to local devices, personal compute and, and trust, like you, you trust them with your data. And. I think that's what a lot of people are looking for in AI, that they have, they love the benefits of AI, they don't love the downsides, which is that you have to send all your data to some cloud somewhere.[00:35:45] swyx: And some of this data that we're going to feed AI is just the most personal data there is. So Apple being like one of the most trusted personal data companies, I think it's very important that they enter the AI race, and I hope to see more out of them.[00:35:58] Alessio: To me, the, the biggest question with the Google deal is like, who's paying who?[00:36:03] Alessio: Because for the browsers, Google pays Apple like 18, 20 billion every year to be the default browser. Is Google going to pay you to have Gemini or is Apple paying Google to have Gemini? I think that's, that's like what I'm most interested to figure out because with the browsers, it's like, it's the entry point to the thing.[00:36:21] Alessio: So it's really valuable to be the default. That's why Google pays. But I wonder if like the perception in AI is going to be like, Hey. You just have to have a good local model on my phone to be worth me purchasing your device. And that was, that's kind of drive Apple to be the one buying the model. But then, like Shawn said, they're doing the MM1 themselves.[00:36:40] Alessio: So are they saying we do models, but they're not as good as the Google ones? I don't know. The whole thing is, it's really confusing, but. It makes for great meme material on on Twitter.[00:36:51] swyx: Yeah, I mean, I think, like, they are possibly more than OpenAI and Microsoft and Amazon. They are the most full stack company there is in computing, and so, like, they own the chips, man.[00:37:05] swyx: Like, they manufacture everything so if, if, if there was a company that could do that. You know, seriously challenge the other AI players. It would be Apple. And it's, I don't think it's as hard as self driving. So like maybe they've, they've just been investing in the wrong thing this whole time. We'll see.[00:37:21] swyx: Wall Street certainly thinks[00:37:22] NLW: so. Wall Street loved that move, man. There's a big, a big sigh of relief. Well, let's, let's move away from, from sort of the big stuff. I mean, the, I think to both of your points, it's going to.[00:37:33] Meta's $800b AI rebrand[00:37:33] NLW: Can I, can[00:37:34] swyx: I, can I, can I jump on factoid about this, this Wall Street thing? I went and looked at when Meta went from being a VR company to an AI company.[00:37:44] swyx: And I think the stock I'm trying to look up the details now. The stock has gone up 187% since Lamo one. Yeah. Which is $830 billion in market value created in the past year. . Yeah. Yeah.[00:37:57] NLW: It's, it's, it's like, remember if you guys haven't Yeah. If you haven't seen the chart, it's actually like remarkable.[00:38:02] NLW: If you draw a little[00:38:03] swyx: arrow on it, it's like, no, we're an AI company now and forget the VR thing.[00:38:10] NLW: It's it, it is an interesting, no, it's, I, I think, alessio, you called it sort of like Zuck's Disruptor Arc or whatever. He, he really does. He is in the midst of a, of a total, you know, I don't know if it's a redemption arc or it's just, it's something different where, you know, he, he's sort of the spoiler.[00:38:25] NLW: Like people loved him just freestyle talking about why he thought they had a better headset than Apple. But even if they didn't agree, they just loved it. He was going direct to camera and talking about it for, you know, five minutes or whatever. So that, that's a fascinating shift that I don't think anyone had on their bingo card, you know, whatever, two years ago.[00:38:41] NLW: Yeah. Yeah,[00:38:42] swyx: we still[00:38:43] Alessio: didn't see and fight Elon though, so[00:38:45] swyx: that's what I'm really looking forward to. I mean, hey, don't, don't, don't write it off, you know, maybe just these things take a while to happen. But we need to see and fight in the Coliseum. No, I think you know, in terms of like self management, life leadership, I think he has, there's a lot of lessons to learn from him.[00:38:59] swyx: You know he might, you know, you might kind of quibble with, like, the social impact of Facebook, but just himself as a in terms of personal growth and, and, you know, Per perseverance through like a lot of change and you know, everyone throwing stuff his way. I think there's a lot to say about like, to learn from, from Zuck, which is crazy 'cause he's my age.[00:39:18] swyx: Yeah. Right.[00:39:20] AI Engineer landscape - from baby AGIs to vertical Agents[00:39:20] NLW: Awesome. Well, so, so one of the big things that I think you guys have, you know, distinct and, and unique insight into being where you are and what you work on is. You know, what developers are getting really excited about right now. And by that, I mean, on the one hand, certainly, you know, like startups who are actually kind of formalized and formed to startups, but also, you know, just in terms of like what people are spending their nights and weekends on what they're, you know, coming to hackathons to do.[00:39:45] NLW: And, you know, I think it's a, it's a, it's, it's such a fascinating indicator for, for where things are headed. Like if you zoom back a year, right now was right when everyone was getting so, so excited about. AI agent stuff, right? Auto, GPT and baby a GI. And these things were like, if you dropped anything on YouTube about those, like instantly tens of thousands of views.[00:40:07] NLW: I know because I had like a 50,000 view video, like the second day that I was doing the show on YouTube, you know, because I was talking about auto GPT. And so anyways, you know, obviously that's sort of not totally come to fruition yet, but what are some of the trends in what you guys are seeing in terms of people's, people's interest and, and, and what people are building?[00:40:24] Alessio: I can start maybe with the agents part and then I know Shawn is doing a diffusion meetup tonight. There's a lot of, a lot of different things. The, the agent wave has been the most interesting kind of like dream to reality arc. So out of GPT, I think they went, From zero to like 125, 000 GitHub stars in six weeks, and then one year later, they have 150, 000 stars.[00:40:49] Alessio: So there's kind of been a big plateau. I mean, you might say there are just not that many people that can start it. You know, everybody already started it. But the promise of, hey, I'll just give you a goal, and you do it. I think it's like, amazing to get people's imagination going. You know, they're like, oh, wow, this This is awesome.[00:41:08] Alessio: Everybody, everybody can try this to do anything. But then as technologists, you're like, well, that's, that's just like not possible, you know, we would have like solved everything. And I think it takes a little bit to go from the promise and the hope that people show you to then try it yourself and going back to say, okay, this is not really working for me.[00:41:28] Alessio: And David Wong from Adept, you know, they in our episode, he specifically said. We don't want to do a bottom up product. You know, we don't want something that everybody can just use and try because it's really hard to get it to be reliable. So we're seeing a lot of companies doing vertical agents that are narrow for a specific domain, and they're very good at something.[00:41:49] Alessio: Mike Conover, who was at Databricks before, is also a friend of Latentspace. He's doing this new company called BrightWave doing AI agents for financial research, and that's it, you know, and they're doing very well. There are other companies doing it in security, doing it in compliance, doing it in legal.[00:42:08] Alessio: All of these things that like, people, nobody just wakes up and say, Oh, I cannot wait to go on AutoGPD and ask it to do a compliance review of my thing. You know, just not what inspires people. So I think the gap on the developer side has been the more bottom sub hacker mentality is trying to build this like very Generic agents that can do a lot of open ended tasks.[00:42:30] Alessio: And then the more business side of things is like, Hey, If I want to raise my next round, I can not just like sit around the mess, mess around with like super generic stuff. I need to find a use case that really works. And I think that that is worth for, for a lot of folks in parallel, you have a lot of companies doing evals.[00:42:47] Alessio: There are dozens of them that just want to help you measure how good your models are doing. Again, if you build evals, you need to also have a restrained surface area to actually figure out whether or not it's good, right? Because you cannot eval anything on everything under the sun. So that's another category where I've seen from the startup pitches that I've seen, there's a lot of interest in, in the enterprise.[00:43:11] Alessio: It's just like really. Fragmented because the production use cases are just coming like now, you know, there are not a lot of long established ones to, to test against. And so does it, that's kind of on the virtual agents and then the robotic side it's probably been the thing that surprised me the most at NVIDIA GTC, the amount of robots that were there that were just like robots everywhere.[00:43:33] Alessio: Like, both in the keynote and then on the show floor, you would have Boston Dynamics dogs running around. There was, like, this, like fox robot that had, like, a virtual face that, like, talked to you and, like, moved in real time. There were industrial robots. NVIDIA did a big push on their own Omniverse thing, which is, like, this Digital twin of whatever environments you're in that you can use to train the robots agents.[00:43:57] Alessio: So that kind of takes people back to the reinforcement learning days, but yeah, agents, people want them, you know, people want them. I give a talk about the, the rise of the full stack employees and kind of this future, the same way full stack engineers kind of work across the stack. In the future, every employee is going to interact with every part of the organization through agents and AI enabled tooling.[00:44:17] Alessio: This is happening. It just needs to be a lot more narrow than maybe the first approach that we took, which is just put a string in AutoGPT and pray. But yeah, there's a lot of super interesting stuff going on.[00:44:27] swyx: Yeah. Well, he Let's recover a lot of stuff there. I'll separate the robotics piece because I feel like that's so different from the software world.[00:44:34] swyx: But yeah, we do talk to a lot of engineers and you know, that this is our sort of bread and butter. And I do agree that vertical agents have worked out a lot better than the horizontal ones. I think all You know, the point I'll make here is just the reason AutoGPT and maybe AGI, you know, it's in the name, like they were promising AGI.[00:44:53] swyx: But I think people are discovering that you cannot engineer your way to AGI. It has to be done at the model level and all these engineering, prompt engineering hacks on top of it weren't really going to get us there in a meaningful way without much further, you know, improvements in the models. I would say, I'll go so far as to say, even Devin, which is, I would, I think the most advanced agent that we've ever seen, still requires a lot of engineering and still probably falls apart a lot in terms of, like, practical usage.[00:45:22] swyx: Or it's just, Way too slow and expensive for, you know, what it's, what it's promised compared to the video. So yeah, that's, that's what, that's what happened with agents from, from last year. But I, I do, I do see, like, vertical agents being very popular and, and sometimes you, like, I think the word agent might even be overused sometimes.[00:45:38] swyx: Like, people don't really care whether or not you call it an AI agent, right? Like, does it replace boring menial tasks that I do That I might hire a human to do, or that the human who is hired to do it, like, actually doesn't really want to do. And I think there's absolutely ways in sort of a vertical context that you can actually go after very routine tasks that can be scaled out to a lot of, you know, AI assistants.[00:46:01] swyx: So, so yeah, I mean, and I would, I would sort of basically plus one what let's just sit there. I think it's, it's very, very promising and I think more people should work on it, not less. Like there's not enough people. Like, we, like, this should be the, the, the main thrust of the AI engineer is to look out, look for use cases and, and go to a production with them instead of just always working on some AGI promising thing that never arrives.[00:46:21] swyx: I,[00:46:22] NLW: I, I can only add that so I've been fiercely making tutorials behind the scenes around basically everything you can imagine with AI. We've probably done, we've done about 300 tutorials over the last couple of months. And the verticalized anything, right, like this is a solution for your particular job or role, even if it's way less interesting or kind of sexy, it's like so radically more useful to people in terms of intersecting with how, like those are the ways that people are actually.[00:46:50] NLW: Adopting AI in a lot of cases is just a, a, a thing that I do over and over again. By the way, I think that's the same way that even the generalized models are getting adopted. You know, it's like, I use midjourney for lots of stuff, but the main thing I use it for is YouTube thumbnails every day. Like day in, day out, I will always do a YouTube thumbnail, you know, or two with, with Midjourney, right?[00:47:09] NLW: And it's like you can, you can start to extrapolate that across a lot of things and all of a sudden, you know, a AI doesn't. It looks revolutionary because of a million small changes rather than one sort of big dramatic change. And I think that the verticalization of agents is sort of a great example of how that's[00:47:26] swyx: going to play out too.[00:47:28] Adept episode - Screen Multimodality[00:47:28] swyx: So I'll have one caveat here, which is I think that Because multi modal models are now commonplace, like Cloud, Gemini, OpenAI, all very very easily multi modal, Apple's easily multi modal, all this stuff. There is a switch for agents for sort of general desktop browsing that I think people so much for joining us today, and we'll see you in the next video.[00:48:04] swyx: Version of the the agent where they're not specifically taking in text or anything They're just watching your screen just like someone else would and and I'm piloting it by vision And you know in the the episode with David that we'll have dropped by the time that this this airs I think I think that is the promise of adept and that is a promise of what a lot of these sort of desktop agents Are and that is the more general purpose system That could be as big as the browser, the operating system, like, people really want to build that foundational piece of software in AI.[00:48:38] swyx: And I would see, like, the potential there for desktop agents being that, that you can have sort of self driving computers. You know, don't write the horizontal piece out. I just think we took a while to get there.[00:48:48] NLW: What else are you guys seeing that's interesting to you? I'm looking at your notes and I see a ton of categories.[00:48:54] Top Model Research from January Recap[00:48:54] swyx: Yeah so I'll take the next two as like as one category, which is basically alternative architectures, right? The two main things that everyone following AI kind of knows now is, one, the diffusion architecture, and two, the let's just say the, Decoder only transformer architecture that is popularized by GPT.[00:49:12] swyx: You can read, you can look on YouTube for thousands and thousands of tutorials on each of those things. What we are talking about here is what's next, what people are researching, and what could be on the horizon that takes the place of those other two things. So first of all, we'll talk about transformer architectures and then diffusion.[00:49:25] swyx: So transformers the, the two leading candidates are effectively RWKV and the state space models the most recent one of which is Mamba, but there's others like the Stripe, ENA, and the S four H three stuff coming out of hazy research at Stanford. And all of those are non quadratic language models that scale the promise to scale a lot better than the, the traditional transformer.[00:49:47] swyx: That this might be too theoretical for most people right now, but it's, it's gonna be. It's gonna come out in weird ways, where, imagine if like, Right now the talk of the town is that Claude and Gemini have a million tokens of context and like whoa You can put in like, you know, two hours of video now, okay But like what if you put what if we could like throw in, you know, two hundred thousand hours of video?[00:50:09] swyx: Like how does that change your usage of AI? What if you could throw in the entire genetic sequence of a human and like synthesize new drugs. Like, well, how does that change things? Like, we don't know because we haven't had access to this capability being so cheap before. And that's the ultimate promise of these two models.[00:50:28] swyx: They're not there yet but we're seeing very, very good progress. RWKV and Mamba are probably the, like, the two leading examples, both of which are open source that you can try them today and and have a lot of progress there. And the, the, the main thing I'll highlight for audio e KV is that at, at the seven B level, they seem to have beat LAMA two in all benchmarks that matter at the same size for the same amount of training as an open source model.[00:50:51] swyx: So that's exciting. You know, they're there, they're seven B now. They're not at seven tb. We don't know if it'll. And then the other thing is diffusion. Diffusions and transformers are are kind of on the collision course. The original stable diffusion already used transformers in in parts of its architecture.[00:51:06] swyx: It seems that transformers are eating more and more of those layers particularly the sort of VAE layer. So that's, the Diffusion Transformer is what Sora is built on. The guy who wrote the Diffusion Transformer paper, Bill Pebbles, is, Bill Pebbles is the lead tech guy on Sora. So you'll just see a lot more Diffusion Transformer stuff going on.[00:51:25] swyx: But there's, there's more sort of experimentation with diffusion. I'm holding a meetup actually here in San Francisco that's gonna be like the state of diffusion, which I'm pretty excited about. Stability's doing a lot of good work. And if you look at the, the architecture of how they're creating Stable Diffusion 3, Hourglass Diffusion, and the inconsistency models, or SDXL Turbo.[00:51:45] swyx: All of these are, like, very, very interesting innovations on, like, the original idea of what Stable Diffusion was. So if you think that it is expensive to create or slow to create Stable Diffusion or an AI generated art, you are not up to date with the latest models. If you think it is hard to create text and images, you are not up to date with the latest models.[00:52:02] swyx: And people still are kind of far behind. The last piece of which is the wildcard I always kind of hold out, which is text diffusion. So Instead of using autogenerative or autoregressive transformers, can you use text to diffuse? So you can use diffusion models to diffuse and create entire chunks of text all at once instead of token by token.[00:52:22] swyx: And that is something that Midjourney confirmed today, because it was only rumored the past few months. But they confirmed today that they were looking into. So all those things are like very exciting new model architectures that are, Maybe something that we'll, you'll see in production two to three years from now.[00:52:37] swyx: So the couple of the trends[00:52:38] NLW: that I want to just get your takes on, because they're sort of something that, that seems like they're coming up are one sort of these, these wearable, you know, kind of passive AI experiences where they're absorbing a lot of what's going on around you and then, and then kind of bringing things back.[00:52:53] NLW: And then the, the other one that I, that I wanted to see if you guys had thoughts on were sort of this next generation of chip companies. Obviously there's a huge amount of emphasis. On on hardware and silicon and, and, and different ways of doing things, but, y

america god tv love ceo spotify amazon netflix world learning ai europe english google apple lessons pr magic san francisco phd friend digital chinese marvel reading data predictions elon musk microsoft events funny fortune startups white house weird economics wall street memory wall street journal reddit wars auto curious cloud gate vr singapore stanford connections mix israelis context ibm mark zuckerberg senior vice president average intel ram signal cto state of the union tigers minecraft vc adapt ipo gemini siri openai sol transformers instructors lsu clouds nvidia stability rust ux lemon patel api gi davos nsfw cisco compass luther progression b2c bro sweep d d gpt bing makes disagreement mythology ml lama github llama apis token thursday night stripe amd quran vcs llm captive devops copilot sora baldur opus sam altman anthropic embody silicon dozen gpu bobo grok altman capital one tab agi mamba generic waymo boba waze dali upfront midjourney approve ide napster cloudflare golem gdc zuck prs coliseum rag git klarna kv gpus albrecht diffusion coders deepmind gan tldr alessio boston dynamics minefields gitlab sergei fragmented suno cursor mistral ppa json gpts lex fridman ena jensen huang nox stable diffusion inflection databricks decibel a16z counterpoint mts cuda rohde adept asr chroma gtc sundar lemurian decoder iou nvidia gpus stability ai etched sram cerebros practical ai singaporeans omniverse netlify pytorch eac lamo tpu mustafa suleyman day6 vae devtools agis not safe groq nvidia gtc elicit jupyter kubecon andrej karpathy autogpt project titan personal ai hbm milind ai engineer demis neurips marginally jeff dean imbue positron nlw slido entropic nat friedman ppap simon willison lstm c300 technium mbu lpu xla latent space boba guys you look swix medex metax mxu lstms

Taking a Hybrid AI Approach to Security at Snyk with Randall Degges

Screaming in the Cloud

Play Episode Listen Later Nov 29, 2023 35:57

Randall Degges, Head of Developer Relations & Community at Snyk, joins Corey on Screaming in the Cloud to discuss Snyk's innovative AI strategy and why developers don't need to be afraid of security. Randall explains the difference between Large Language Models and Symbolic AI, and how combining those two approaches creates more accurate security tooling. Corey and Randall also discuss the FUD phenomenon to selling security tools, and Randall expands on why Snyk doesn't take that approach. Randall also shares some background on how he went from being a happy Snyk user to a full-time Snyk employee. About RandallRandall runs Developer Relations & Community at Snyk, where he works on security research, development, and education. In his spare time, Randall writes articles and gives talks advocating for security best practices. Randall also builds and contributes to various open-source security tools.Randall's realms of expertise include Python, JavaScript, and Go development, web security, cryptography, and infrastructure security. Randall has been writing software for over 20 years and has built a number of popular API services and open-source tools.Links Referenced: Snyk: https://snyk.io/ Snyk blog: https://snyk.io/blog/ TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn, and this featured guest episode is brought to us by our friends at Snyk. Also brought to us by our friends at Snyk is one of our friends at Snyk, specifically Randall Degges, their Head of Developer Relations and Community. Randall, thank you for joining me.Randall: Hey, what's up, Corey? Yeah, thanks for having me on the show, man. Looking forward to talking about some fun security stuff today.Corey: It's been a while since I got to really talk about a security-centric thing on this show, at least in order of recordings. I don't know if the one right before this is a security thing; things happen on the back-end that I'm blissfully unaware of. But it seems the theme lately has been a lot around generative AI, so I'm going to start off by basically putting you in the hot seat. Because when you pull up a company's website these days, the odds are terrific that they're going to have completely repositioned absolutely everything that they do in the context of generative AI. It's like, “We're a generative AI company.” It's like, “That's great.” Historically, I have been a paying customer of Snyk so that it does security stuff, so if you're now a generative AI company, who do I use for the security platform thing that I was depending upon? You have not done that. First, good work. Secondly, why haven't you done that?Randall: Great question. Also, you said a moment ago that LLMs are very interesting, or there's a lot of hype around it. Understatement of the last year, for sure [laugh].Corey: Oh, my God, it has gotten brutal.Randall: I don't know how many billions of dollars have been dumped into LLM in the last 12 months, but I'm sure it's a very high number.Corey: I have a sneaking suspicion that the largest models cost at least a billion each train, just based upon—at least retail price—based upon the simple economics of how long it takes to do these things, how expensive that particular flavor of compute is. And the technology is his magic. It is magic in a box and I see that, but finding ways that it applies in different ways is taking some time. But that's not stopping the hype beasts. A lot of the same terrible people who were relentlessly pushing crypto have now pivoted to relentlessly pushing generative AI, presumably because they're working through Nvidia's street team, or their referral program, or whatever it is. Doesn't matter what the rest of us do, as long as we're burning GPU cycles on it. And I want to distance myself from that exciting level of boosterism. But it's also magic.Randall: Yeah [laugh]. Well, let's just talk about AI insecurity for a moment and answer your previous question. So, what's happening in space, what's the deal, what is all the hype going to, and what is Snyk doing around there? So, quite frankly—and I'm sure a lot of people on your show say the same thing—but Snyk isn't new into, like, the AI space. It's been a fundamental part of our platform for many years now.So, for those of you listening who have no idea what the heck Snyk is, and you're like, “Why are we talking about this,” Snyk is essentially a developer security company, and the core of what we do is two things. The first thing is we help scan your code, your dependencies, your containers, all the different parts of your application, and detect vulnerabilities. That's the first part. The second thing we do is we help fix those vulnerabilities. So, detection and remediation. Those are the two components of any good security tool or security company.And in our particular case, we're very focused on developers because our whole product is really based on your application and your application security, not infrastructure and other things like this. So, with that being said, what are we doing at a high level with LLMs? Well, if you think about AI as, like, a broad spectrum, you have a lot of different technologies behind the scenes that people refer to as AI. You have lots of these large language models, which are generating text based on inputs. You also have symbolic AI, which has been around for a very long time and which is very domain specific. It's like creating specific rules and helping do pattern detection amongst things.And those two different types of applied AI, let's say—we have large language models and symbolic AI—are the two main things that have been happening in industry for the last, you know, tens of years, really, with LLM as being the new kid on the block. So, when we're talking about security, what's important to know about just those two underlying technologies? Well, the first thing is that large language models, as I'm sure everyone listening to this knows, are really good at predicting things based on a big training set of data. That's why companies like OpenAI and their ChatGPT tool have become so popular because they've gone out and crawled vast portions of the internet, downloaded tons of data, classified it, and then trained their models on top of this data so that they can help predict the things that people are putting into chat. And that's why they're so interesting, and powerful, and there's all these cool use cases popping up with them.However, the downside of LLMs is because they're just using a bunch of training data behind the scenes, there's a ton of room for things to be wrong. Training datasets aren't perfect, they're coming from a ton of places, and even if they weren't perfect, there's still the likelihood that things that are going to be generating output based on a statistical model isn't going to be accurate, which is the whole concept of hallucinations.Corey: Right. I wound up remarking on the livestream for GitHub Universe a week or two ago that the S in AI stood for security. One of the problems I've seen with it is that it can generate a very plausible looking IAM policy if you ask it to, but it doesn't actually do what you think it would if you go ahead and actually use it. I think that it's still squarely in the realm of, it's great at creativity, it's great at surface level knowledge, but for anything important, you really want someone who knows what they're doing to take a look at it and say, “Slow your roll there, Hasty Pudding.”Randall: A hundred percent. And when we're talking about LLMs, I mean, you're right. Security isn't really what they're designed to do, first of all [laugh]. Like, they're designed to predict things based on statistics, which is not a security concept. But secondly, another important thing to note is, when you're talking about using LLMs in general, there's so many tricks and techniques and things you can do to improve accuracy and improve things, like for example, having a ton of [contexts 00:06:35] or doing Few-Shot Learning Techniques where you prompt it and give it examples of questions and answers that you're looking for can give you a slight competitive edge there in terms of reducing hallucinations and false information.But fundamentally, LLMs will always have a problem with hallucinations and getting things wrong. So, that brings us to what we mentioned before: symbolic AI and what the differences are there. Well, symbolic AI is a completely different approach. You're not taking huge training sets and using machine learning to build statistical models. It's very different. You're creating rules, and you're parsing very specific domain information to generate things that are highly accurate, although those models will fail when applied to general-purpose things, unlike large language models.So, what does that mean? You have these two different types of AI that people are using. You have symbolic AI, which is very specific and requires a lot of expertise to create, then you have LLMs, which take a lot of experience to create as well, but are very broad and general purpose and have a capability to be wrong. Snyk's approach is, we take both of those concepts, and we use them together to get the best of both worlds. And we can talk a little bit about that, but I think fundamentally, one of the things that separates Snyk from a lot of other companies in the space is we're just trying to do whatever the best technical solution is to solve the problem, and I think we found that with our hybrid approach.Corey: I think that there is a reasonable distrust of AI when it comes to security. I mean, I wound up recently using it to build what has been announced by the time this thing airs, which is my re:Invent photo scavenger hunt app. I know nothing about front-end, so that's okay, I've got a robot in my pocket. It's great at doing the development of the initial thing, and then you have issues, and you want to add functionality, and it feels like by the time I was done with my first draft, that ten different engineers had all collaborated on this thing without ever speaking to one another. There was no consistent idiomatic style, it used a variety, a hodgepodge of different lists and the rest, and it became a bit of a Frankenstein's monster.That can kind of work if we're talking about a web app that doesn't have any sensitive data in it, but holy crap, the idea of applying that to, “Yeah, that's how we built our bank's security policy,” is one of those, “Let me know who said that, so they can not have their job anymore,” territory when the CSO starts [hunting 00:08:55].Randall: You're right. It's a very tenuous situation to be in from a security perspective. The way I like to think about it—because I've been a developer for a long time and a security professional—and I as much as anyone out there love to jump on the hype train for things and do whatever I can to be lazy and just get work done quicker. And so, I use ChatGPT, I use GitHub Copilot, I use all sorts of LLM-based tools to help me write software. And similarly to the problems when developers are not using LLM to help them write code, security is always a concern.Like, it doesn't matter if you have a developer writing every line of code themselves or if they're getting help from Copilot or ChatGPT. Fundamentally, the problem with security and the reason why it's such an annoying part of the developer experience, in all honesty, is that security is really difficult. You can take someone who's an amazing engineer, who has 30 years of experience, like, you can take John Carmack, I'm sure, one of the most legendary developers to ever walk the Earth, you could sit over his shoulder and watch him write software, right, I can almost guarantee you that he's going to have some sort of security problem in his code, even with all the knowledge he has in his head. And part of the reason that's the case is because modern security is way complicated. Like if you're building a web app, you have front-end stuff you need to protect, you have back-end stuff you need to protect, there's databases and infrastructure and communication layers between the infrastructure and the services. It's just too complicated for one person to fully grasp.And so, what do you do? Well, you basically need some sort of assistance from automation. You have to have some sort of tooling that can take a look at your code that you're writing and say, “Hey Randall, on line 39, when you were writing this function that's taking user data and doing something with it, you forgot to sanitize the user data.” Now, that's a simple example, but let's talk about a more complex example. Maybe you're building some authentication software, and you're taking users' passwords, and you're hashing them using a common hashing algorithm.And maybe the tooling is able to detect way using the bcrypt password hashing algorithm with a work factor of ten to create this password hash, but guess what, we're in 2023 and a work factor of ten is something that older commodity CPUs can now factor at a reasonable rate, and so you need to bump that up to 13 or 14. These are the types of things where you need help over time. It's not something that anyone can reasonably assume they can just deal with in their head. The way I like to think about it is, as a developer, regardless of how you're building code, you need some sort of security checks on there to just help you be productive, in all honesty. Like, if you're not doing that, you're just asking for problems.Corey: Oh, yeah. On some level, even the idea of it's just going to be very computationally expensive to wind up figuring out what that password hash is, well great, but one of the things that we've been aware of for a while is that given the rise of botnets and compromised computers, the attackers have what amounts to infinite computing capacity, give or take. So, if they want in, on some level, badly enough, they're going to find a way to get in there. When you say that every developer is going to sit down and write insecure code, you're right. And a big part of that is because, as imagined today, security is an incredibly high friction process, and it's not helped, frankly, by tools that don't have nuance or understanding.If I want to do a crap ton of busy work that doesn't feel like it moves the needle forward at all, I'll go around to resolving the hundreds upon hundreds of Dependabot alerts I have for a lot of my internal services that write my weekly newsletter. Because some dependency three deep winds up having a failure mode when it gets untrusted input of the following type, it can cause resource exhaustion. It runs in a Lambda function, so I don't care about the resources, and two, I'm not here providing the stuff that I write, which is the input with an idea toward exploiting stuff. So, it's busy work, things I don't need to be aware of. But more to the point, stuff like that has the high propensity to mask things I actually do care about. Getting the signal from noise from your misconfigured, ill-conceived alerting system is just awful. Like, a bad thing is there are no security things for you to work on, but a worse one is, “Here are 70,000 security things for you to work on.” How do you triage? How do you think about it?Randall: A hundred percent. I mean, that's actually the most difficult thing, I would say, that security teams have to deal with in the real world. It's not having a tool to help detect issues or trying to get people to fix them. The real issue is, there's always security problems, like you said, right? Like, if you take a look and just scan any codebase out there, any reasonably-sized codebase, you're going to find a ridiculous amount of issues.Some of those issues will be actual issues, like, you're not doing something in code hygiene that you need to do to protect stuff. A lot of those issues are meaningless things, like you said. You have a transitive dependency that some direct dependency is referring to, and maybe in some function call, there's an issue there, and it's alerting you on it even though you don't even use this function call. You're not even touching this class, or this method, or whatever it is. And it wastes a lot of time.And that's why the Holy Grail in the security industry in all honesty is prioritization and insights. At Snyk, we sort of pioneered this concept of ASPM, which stands for Application Security Posture Management. And fundamentally what that means is when you're a security team, and you're scanning code and finding all these issues, how do you prioritize them? Well, there's a couple of approaches. One approach is to use static analysis to try to figure out if these issues that are being detected are reachable, right? Like, can they be achieved in some way, but that's really hard to do statically and there's so many variables that go into it that no one really has foolproof solutions there.The second thing you can do is you can combine insights and heuristics from a lot of different places. So, you can take a look at static code analysis results, and you can combine them with agents running live that are observing your application, and then you can try to determine what stuff is actually reachable given this real world heuristic, and you know, real time information and mapping it up with static code analysis results. And that's really the holy grail of figuring things out. We have an ASPM product—or maybe it's a feature, an offering, if you will, but it's something that Snyk provides, which gives security admins a lot more insight into that type of operation at their business. But you're totally right, Corey, it's a really difficult problem to solve, and it burns a lot of goodwill in the security community and in the industry because people spend a lot of time getting false alerts, going through stuff, and just wasting millions of hours a year, I'm sure.Corey: That's part of the challenge, too, is that it feels like there are two classes of problems in the world, at least when it comes to business. And I found this by being on the wrong side of it, on some level. Here on the wrong side, it's things like caring about cost optimization, it's caring about security, it's remembering to buy fire insurance for your building. You can wind up doing all of those things—and you should be doing them, but you can over-index on them to the point where you run out of money and your business dies. The proactive side of that fence is getting features to market sooner, increasing market share, growing revenue, et cetera, and that's the stuff that people are always going to prioritize over the back burner stuff. So, striking a balance between that is always going to be a bit of a challenge, and where people land on that is going to be tricky.Randall: So, I think this is a really good bridge. You're totally right. It's expensive to waste people's time, basically, is what you're saying, right? You don't want to waste people's time, you want to give them actionable alerts that they can actually fix, or hopefully you fix it for them if you can, right? So, I'm going to lay something out, which is, in our opinion, is the Snyk way, if you will, that you should be approaching these developer security issues.So, let's take a look at two different approaches. The first approach is going to be using an LLM, like, let's say, just ChatGPT. We'll call them out because everyone knows ChatGPT. The first approach we're going to take is—Corey: Although I do insist on pronouncing it Chat-Gippity. But please, continue.Randall: [laugh]. Chat-Gippity. I love that. I haven't heard that before. Chat-Gippity. Sounds so much more fun, you know?Corey: It sounds more personable. Yeah.Randall: Yeah. So, you're talking to Chat-Gippity—thank you—and you paste in a file from your codebase, and you say, “Hey, Chat-Gippity. Here's a file from my codebase. Please help me identify security issues in here,” and you get back a long list of recommendations.Corey: Well, it does more than that. Let me just interject there because one of the things it does that I think very few security engineers have mastered is it does it politely and constructively, as opposed to having an unstated tone of, “You dumbass,” which I beli—I've [unintelligible 00:17:24] with prompts on this. You can get it to have a condescending, passive-aggressive tone, but you have to go out of your way to do it, as opposed to it being the default. Please continue.Randall: Great point. Also, Daniel from Unsupervised Learning, by the way, has a really good post where he shows you setting up Chat-Gippity to mimic Scarlett Johansson from the movie Her on your phone so you can talk to it. Absolutely beautiful. And you get these really fun, very nice responses back and forth around your code analysis. So, shout out there.But going back to the point. So, if you get these responses back from Chat-Gippity, and it's like, “Hey look, here's all the security issues,” a lot of those things will be false alerts, and there's been a lot of public security research done on these analysis tools just give you information. A lot of those things will be false alerts, some things will be things that maybe they're a real problem, but cannot be fixed due to transitive dependencies, or whatever the issues are, but there's a lot of things you need to do there. Now, let's take it up one notch, let's say instead of using Chat-Gippity directly, you're using GitHub Copilot. Now, this is a much better situation for working with code because now what Microsoft is doing is let's say you're running Copilot inside of VS Code. It's able to analyze all the files in your codebase, and it's able to use that additional context to help provide you with better information.So, you can talk to GitHub Copilot and say, “Hey, I'd really like to know what security issues are in this file,” and it's going to give you maybe a little bit better answers than ChatGPT directly because it has more context about the other parts of your codebase and can give you slightly better answers. However, because these things are LLMs, you're still going to run into issues with accuracy, and hallucinations, and all sorts of other problems. So, what is the better approach? And I think that's fundamentally what people want to know. Like, what is a good approach here?And on the scanning side, the right approach in my mind is using something very domain specific. Now, what we do at Snyk is we have a symbolic AI scanning engine. So, we take customers' code, and we take an entire codebase so you have access to all the files and dependencies and things like this, and you take a look at these things. And we have a security analyst team that analyzes real-world security issues and fixes that have been validated. So, we do this by pulling lots of open-source projects as well as other security information that we originally produced, and we define very specific rules so that we can take a look at software, and we can take a look at these codebases with a very high degree of certainty.And we can give you a very actionable list of security issues that you need to address, and not only that, we can show you how is going to be the best way to address them. So, with that being said, I think the second side to that is okay, if that's a better approach on the scanning side, maybe you shouldn't be using LLMs for finding issues; maybe you should be using them for fixing security issues, which makes a lot of sense. So, let's say you do it the Snyk way, and you use symbolic AI engines and you sort of find these issues. Maybe you can just take that information then, in combination with your codebase, and fire off a request to an LLM and say, “Hey Chat-Gippity, please take this codebase, and take this security information that we know is accurate, and fix this code for me.” So, now you're going one step further.Corey: One challenge that I've seen, especially as I've been building weird software projects with the help of magic robots from the future, is that a lot of components, like in React for example, get broken out into their own file. And pasting a file in is all well and good, but very often, it needs insight into the rest of the codebase. At GitHub Universe, something that they announced was Copilot Enterprise, which trains Copilot on the intricacies of your internal structures around shared libraries, all of your code, et cetera. And in some of the companies I'm familiar with, I really believe that's giving a very expensive, smart robot a form of brain damage, but that's neither here nor there. But there's an idea of seeing the interplay between different components that individual analysis on a per-file basis will miss, feels to me like something that needs a more holistic view. Am I wrong on that? Am I oversimplifying?Randall: You're right. There's two things we need to address. First of all, let's say you have the entire application context—so all the files, right—and then you ask an LLM to create a fix for you. This is something we do at Snyk. We actually use LLMs for this purpose. So, we take this information we ask the LLM, “Hey, please rewrite this section of code that we know has an issue given this security information to remove this problem.” The problem then becomes okay, well, how do you know this fix is accurate and is not going to break people's stuff?And that's where symbolic AI becomes useful again. Because again, what is the use case for symbolic AI? It's taking very specific domains of things that you've created very specific rule sets for and using them to validate things or to pass arbitrary checks and things like that. And it's a perfect use case for this. So, what we actually do with our auto-fix product, so if you're using VS Code and you have Copilot, right, and Copilot's spitting out software, as long as you have Snyk in the IDE, too, we're actually taking a look at those lines of code Copilot just inserted, and a lot of the time, we are helping you rewrite that code to be secured using our LLM stuff, but then as soon as we get that fixed created, we actually run it through our symbolic engine, and if we're saying no, it's actually not fixed, then we go back to the LLM, we re-prompt it over and over again until we get a working solution.And that's essentially how we create a much more sophisticated iteration, if you will, of using AI to really help improve code quality. But all that being said, you still had a good point, which is maybe if you're using the context from the application, and people aren't doing things properly, how does that impact what LLMs are generating for you? And an interesting thing to note is that our security team internally here, just conducted a really interesting project, and I would be angry at myself if I didn't explain it because I think it's a very cool concept.Corey: Oh, please, I'm a big fan of hearing what people get up to with these things in ways that is real-world stories, not trying to sell me anything, or also not dunking on, look what I saw on the top of Hacker News the other day, which is, “If all you're building is something that talks to Chat-Gippity's API, does some custom prompting, and returns a response, you shouldn't be building it.” I'm like, “Well, I built some things that do exactly that.” But I'm also not trying to raise $6 million in seed money to go and productize it. I'm just hoping someone does it better eventually, but I want to use it today. Please tell me a real world story about something that you've done.Randall: Okay. So, here's what we did. We went out and we found a bunch of GitHub projects, and we tried to analyze them ourselves using a bunch of different tools, including human verification, and basically give it a grade and say, “Okay, this project here has really good security hygiene. Like, there's not a lot of issues in the code, things are written in a nice way, the style and formatting is consistent, the dependencies are up-to-date, et cetera.” Then we take a look at multiple GitHub repos that are the opposite of that, right? Like, maybe projects that hadn't been maintained in a long time, or were written in a completely different style where you have bad hygienic practices, maybe you have hard-coded secrets, maybe you have unsanitized input coming from a user or something, right, but you take all these things.So, we have these known examples of good and bad projects. So, what did we do? Well, we opened them up in VS Code, and we basically got GitHub Copilot and we said, “Okay, what we're going to do is use each of these codebases, and we're going to try to add features into the projects one at a time.” And what we did is we took a look at the suggested output that Copilot was giving us in each of these cases. And the interesting thing is that—and I think this is super important to understand about LLMs, right—but the interesting thing is, if we were adding features to a project that has good security hygiene, the types of code that we're able to get out of LLMs, like, GitHub Copilot was pretty good. There weren't a ton of issues with it. Like, the actual security hygiene was, like, fairly good.However, for projects where there were existing issues, it was the opposite. Like we'd get AI recommendations showing us how to write things insecurely, or potentially write things with hard-coded secrets in it. And this is something that's very reproducible today in, you know, what is it right now, middle of November 2023. Now, is it going to be this case a year from now? I don't necessarily know, but right now, this is still a massive problem, so that really reinforces the idea that not only when you're talking about LLMs is the training set they used to build the model's important, but also the context in which you're using them is incredibly important.It's very easy to mislead LLMs. Another example of this, if you think about the security scanning concept we talked about earlier, imagine you're talking to Chat-Gippity, and you're [pasting 00:25:58] in a Python function, and the Python function is called, “Completely_safe_not_vulnerable_function.” That's the function name. And inside of that function, you're backdooring some software. Well, if you ask Chat-Gippity multiple times and say, “Hey, the temperature is set to 1.0. Is this code safe?”Sometimes you'll get the answer yes because the context within the request that has that thing saying this is not a vulnerable function or whatever you want to call it, that can mislead the LLM output and result in problems, you know? It's just, like, classic prompt injection type issues. But there's a lot of these types of vulnerabilities still hidden in plain sight that impact all of us, and so it's so important to know that you can't just rely on one thing, you have to have multiple layers: something that helps you with things, but also something that is helping you fix things when needed.Corey: I think that's the key that gets missed a lot is the idea of it's not just what's here, what have you put here that shouldn't be; what have you forgotten? There's a different side of it. It's easy to do a static analysis and say, “Oh, you're not sanitizing your input on this particular form.” Great. Okay—well, I say it's easy. I wish more people would do that—but then there's also a step beyond of, what is it that someone who has expertise who's been down this road before would take one look at your codebase and say, “Are you making this particular misconfiguration or common misstep?”Randall: Yeah, it's incredibly important. You know, like I said, security is just one of those things where it's really broad. I've been working in security for a very long time and I make security mistakes all the time myself.Corey: Yeah. Like, in your developer environment right now, you ran this against the production environment and didn't get permissions errors. That is suspicious. Tell me more about your authentication pattern.Randall: Right. I mean, there's just a ton of issues that can cause problems. And it's… yeah, it is what it is, right? Like, software security is something difficult to achieve. If it wasn't difficult, everyone would be doing it. Now, if you want to talk about, like, vision for the future, actually, I think there's some really interesting things with the direction I see things going.Like, a lot of people have been leaning into the whole AI autonomous agents thing over the last year. People started out by taking LLMs and saying, “Okay, I can get it to spit out code, I can get it to spit out this and that.” But then you go one step further and say, “All right, can I get it to write code for me and execute that code?” And OpenAI, to their credit, has done a really good job advancing some of the capabilities here, as well as a lot of open-source frameworks. You have Langchain, and Baby AGI, and AutoGPT, and all these different things that make this more feasible to give AI access to actually do real meaningful things.And I can absolutely imagine a world in the future—maybe it's a couple of years from now—where you have developers writing software, and it could be a real developer, it could be an autonomous agent, whatever it is. And then you also have agents that are taking a look at your software and rewriting it to solve security issues. And I think when people talk about autonomous agents, a lot of the time they're purely focusing on LLMs. I think it's a big mistake. I think one of the most important things you can do is focus on the very niche symbolic AI engines that are going to be needed to guarantee accuracy with these things.And that's why I think the Snyk approach is really cool, you know? We dedicated a huge amount of resources to security analysts building these very in-depth rule sets that are guaranteeing accuracy on results. And I think that's something that the industry is going to shift towards more in the future as LLMs become more popular, which is, “Hey, you have all these great tools, doing all sorts of cool stuff. Now, let's clean it up and make it accurate.” And I think that's where we're headed in the next couple of years.Corey: I really hope you're right. I think it's exciting times, but I also am leery when companies go too far into boosterism where, “Robots are going to do all of these things for us.” Maybe, but even if you're right, you sound psychotic. And that's something that I think gets missed in an awful lot of the marketing that is so breathless with anticipation. I have to congratulate you folks on not getting that draped over your message, once again.My other favorite part of your messaging when you pull up snyk.com—sorry, snyk.io. What is it these days? It's the dot io, isn't it?Randall: Dot io. It's hot.Corey: Dot io, yes.Randall: Still hot, you know?Corey: I feel like I'm turning into a boomer here where, “The internet is dot com.”Randall: [laugh].Corey: Doesn't necessarily work that way. But no, what I love is the part where you have this fear-based marketing of if you wind up not using our product, here are all the terrible things that will happen. And my favorite part about that marketing is it doesn't freaking exist. It is such a refreshing departure from so much of the security industry, where it does the fear, uncertainty, and doubt nonsense stuff that I love that you don't even hint in that direction. My actual favorite thing that is on your page, of course, is at the bottom. If you mouse over the dog in the logo at the bottom of the page, it does the quizzical tilting head thing, and I just think that is spectacular.Randall: So, the Snyk mascot, his name is Pat. He's a Doberman and everyone loves him. But yeah, you're totally right. The FUD thing is a real issue in security. Fear, uncertainty, and doubt, it's the way security companies sell products to people. And I think it's a real shame, you know?I give a lot of tech talks, at programming conferences in particular, around security and cryptography, and one of the things I always start out with when I'm giving a tech talk about any sort of security or cryptography topic is I say, “Okay, how many of you have landed in a Stack Overflow thread where you're talking about a security topic and someone replies and says, ‘oh, a professional should be doing this. You shouldn't be doing it yourself?'” That comes up all the time when you're looking at security topics on the internet. Then I ask people, “How many of you feel like security is this, sort of like, obscure, mystical arts that requires a lot of expertise in math knowledge, and all this stuff?” And a lot of people sort of have that impression.The reality though is security, and to some extent, cryptography, it's just like any other part of computer science. It's something that you can learn. There's best practices. It's not rocket science, you know? Maybe it is if you're developing a brand-new hashing algorithm from scratch, yes, leave that to the professionals. But using these things is something everyone needs to understand well, and there's tons of material out there explaining how to do things right. And you don't need to be afraid of this stuff, right?And so, I think, a big part of the Snyk message is, we just want to help developers just make their code better. And what is one way that you're going to do a better job at work, get more of your code through the PR review process? What is a way you're going to get more features out? A big part of that is just building things right from the start. And so, that's really our focus in our message is, “Hey developers, we want to be, like, a trusted partner to help you build things faster and better.” [laugh].Corey: It's nice to see it, just because there's so much that just doesn't work out the way that we otherwise hope it would. And historically, there's been a tremendous problem of differentiation in the security space. I often remark that at RSA, there's about 12 companies exhibiting. Now sure, there are hundreds of booths, but it's basically the same 12 things. There's, you know, the entire row of firewalls where they use different logos and different marketing words on the slides, but they're all selling fundamentally the same thing. One of things I've always appreciated about Snyk is it has never felt that way.Randall: Well, thanks. Yeah, we appreciate that. I mean, our whole focus is just developer security. What can we do to help developers build things securely?Corey: I mean, you are sponsoring this episode, let's be clear, but also, we are paying customers of you folks, and that is not—those things are not related in any way. What's the line that we like to use that we stole from the RedMonk folks? “You can buy our attention, but not our opinion.” And our opinion of what you folks are up to is then stratospherically high for a long time.Randall: Well, I certainly appreciate that as a Snyk employee who is also a happy user of the service. The way I actually ended up working at Snyk was, I'd been using the product for my open-source projects for years, and I legitimately really liked it and I thought this was cool. And yeah, I eventually ended up working here because there was a position, and you know, a friend reached out to me and stuff. But I am a genuinely happy user and just like the goal and the mission. Like, we want to make developers' lives better, and so it's super important.Corey: I really want to thank you for taking the time to speak with me about all this. If people want to learn more, where's the best place for them to go?Randall: Yeah, thanks for having me. If you want to learn more about AI or just developer security in general, go to snyk.io. That's S-N-Y-K—in case it's not clear—dot io. In particular, I would actually go check out our [Snyk Learn 00:34:16] platform, which is linked to from our main site. We have tons of free security lessons on there, showing you all sorts of really cool things. If you check out our blog, my team and I in particular also do a ton of writing on there about a lot of these bleeding-edge topics, and so if you want to keep up with cool research in the security space like this, just check it out, give it a read. Subscribe to the RSS feed if you want to. It's fun.Corey: And we will put links to that in the [show notes 00:34:39]. Thanks once again for your support, and of course, putting up with my slings and arrows.Randall: And thanks for having me on, and thanks for using Snyk, too. We love you [laugh].Corey: Randall Degges, Head of Developer Relations and Community at Snyk. This featured guest episode has been brought to us by our friends at Snyk, and I'm Corey Quinn. If you've enjoyed this episode, please leave a five-star review on your podcast platform of choice, whereas if you've hated this episode, please leave a five-star review on your podcast platform of choice, along with an angry comment that I will get to reading immediately. You can get me to read it even faster if you make sure your username is set to ‘Dependabot.'Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.

god amazon fear community head ai earth pr training microsoft robots chatgpt security cloud frankenstein i am hybrid react openai nvidia historically api screaming scarlett johansson python holy grail aws github llm devops copilot javascript invent gpu cso ide rsa large language models lambda stack overflow fud cpus github copilot vs code developer relations doberman understatement hacker news john carmack snyk autogpt langchain corey quinn unsupervised learning github universe redmonk and openai duckbill group chief cloud economist last week in aws

Digital Strategy Lesson: An Introduction to Rate of Learning (184)

Jeff's Asia Tech Class

Play Episode Listen Later Nov 2, 2023 25:33 Transcription Available

This week's podcast is an introductory explanation to Rate of Learning (and Adaptation). This is an increasingly important concept in digital strategy.You can listen to this podcast here, which has the slides and graphics mentioned. Also available at iTunes and Google Podcasts.Here is the link to the TechMoat Consulting.Here is the link to the China Tech Tour.Big tech events from this week:Expanded export ban for GPUs to China (here)Eureka AI training robots in advanced tasks (here)iQIYI signs deal with Thailand's Tourism Authority (here)Here is my standard framework for digital competition–—-Related articles:AutoGPT and Other Tech I Am Super Excited About (Tech Strategy – Podcast 162)The Winners and Losers in ChatGPT (Tech Strategy – Daily Article)Why ChatGPT and Generative AI Are a Mortal Threat to Disney, Netflix and Most Hollywood Studios (Tech Strategy – Podcast 150)From the Concept Library, concepts for this article are:Rate of LearningDOB2: Never Ending ImprovementsSMILE: Rate of Learning and AdaptationLearning Curve and Experience EffectFrom the Company Library, companies for this article are:n/a——------I write, speak and consult about how to win (and not lose) in digital strategy and transformation.I am the founder of TechMoat Consulting, a boutique consulting firm that helps retailers, brands, and technology companies exploit digital change to grow faster, innovate better and build digital moats. Get in touch here.My book series Moats and Marathons is one-of-a-kind framework for building and measuring competitive advantages in digital businesses.Note: This content (articles, podcasts, website info) is not investment advice. The information and opinions from me and any guests may be incorrect. The numbers and information may be wrong. The views expressed may no longer be relevant or accurate. Investing is risky. Do your own research.Support the show

3 Things I'm Watching for on Singles Day (183)

Jeff's Asia Tech Class

Play Episode Listen Later Oct 23, 2023 29:31 Transcription Available

This week's podcast is about Singles Day, which is coming up fast. Lots of stuff gets reported. But I am looking for 3 main things.1: What is the fastest growing and most innovative frontier of ecommerce?2: What are the big new bundles for consumers and the new products / services for merchants?3: How did the company do in the stress test?You can listen to this podcast here, which has the slides and graphics mentioned. Also available at iTunes and Google Podcasts.Here is the link to the TechMoat Consulting.Here is the link to the China Tech Tour.–—-Related articles:AutoGPT and Other Tech I Am Super Excited About (Tech Strategy – Podcast 162)The Winners and Losers in ChatGPT (Tech Strategy – Daily Article)Why ChatGPT and Generative AI Are a Mortal Threat to Disney, Netflix and Most Hollywood Studios (Tech Strategy – Podcast 150)From the Concept Library, concepts for this article are:Singles DayFrom the Company Library, companies for this article are:Alibaba----------I write, speak and consult about how to win (and not lose) in digital strategy and transformation.I am the founder of TechMoat Consulting, a boutique consulting firm that helps retailers, brands, and technology companies exploit digital change to grow faster, innovate better and build digital moats. Get in touch here.My book series Moats and Marathons is one-of-a-kind framework for building and measuring competitive advantages in digital businesses.This content (articles, podcasts, website info) is not investment, legal or tax advice. The information and opinions from me and any guests may be incorrect. The numbers and information may be wrong. The views expressed may no longer be relevant or accurate. This is not investment advice. Investing is risky. Do your own research.Support the show

netflix disney investing chatgpt watching winners losers google podcasts marathons moats singles day autogpt

AutoGPT: A Game-Changer for Achieving AGI?

AI Applied: Covering AI News, Interviews and Tools - ChatGPT, Midjourney, Runway, Poe, Anthropic

Play Episode Listen Later Oct 18, 2023 13:09

Prepare for a compelling episode where we explore the transformative potential of "AutoGPT: A Game-Changer for Achieving AGI?" Delve into the latest developments in the quest for Artificial General Intelligence (AGI) and whether AutoGPT has brought us closer to this monumental milestone. Join us as we discuss the implications and the exciting prospects for the future of AI. Get on the AI Box Waitlist: https://AIBox.ai/Join our ChatGPT Community: ⁠https://www.facebook.com/groups/739308654562189/⁠Follow me on Twitter: ⁠https://twitter.com/jaeden_ai⁠

ai achieving game changers delve artificial general intelligence autogpt ai box waitlist aibox

5 Ways 5G and 5.5G Are Game Changers for Business (182)

Jeff's Asia Tech Class

Play Episode Listen Later Oct 15, 2023 39:12 Transcription Available

This week's podcast is about 5G and 5.5G (also called 5G-Advanced). This is a short-list of scenarios where they are going to have a big impact.You can listen to this podcast here, which has the slides and graphics mentioned. Also available at iTunes and Google Podcasts.Here is the link to the TechMoat Consulting.Here is the link to the China Tech Tour.Here are the 5 scenarios:Glasses Free 3DSelf-Guided Vehicles Next Gen ManufacturingCellular IoTIntelligent Computing Everywhere––----Related articles:AutoGPT and Other Tech I Am Super Excited About (Tech Strategy – Podcast 162)The Winners and Losers in ChatGPT (Tech Strategy – Daily Article)Why ChatGPT and Generative AI Are a Mortal Threat to Disney, Netflix and Most Hollywood Studios (Tech Strategy – Podcast 150)From the Concept Library, concepts for this article are:5G and 5.5GIoTFrom the Company Library, companies for this article are:Huawei---------I write, speak and consult about how to win (and not lose) in digital strategy and transformation.I am the founder of TechMoat Consulting, a boutique consulting firm that helps retailers, brands, and technology companies exploit digital change to grow faster, innovate better and build digital moats. Get in touch here.My book series Moats and Marathons is one-of-a-kind framework for building and measuring competitive advantages in digital businesses.Note: This content (articles, podcasts, website info) is not investment advice. The information and opinions from me and any guests may be incorrect. The numbers and information may be wrong. The views expressed may no longer be relevant or accurate. Investing is risky. Do your own research.Support the show

netflix disney investing chatgpt winners losers google podcasts game changers 5g marathons moats autogpt 5g advanced

The Latest AutoGPT and AI Agent Developments

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Oct 10, 2023 19:53

AutoGPT held a hackathon last weekend and is in the midst of a global virtual hackathon. Also the latest from MultiOn and SuperAGI. Before that on the Brief; Microsoft preparing to announce AI chip and the geopolitics of AI heats up. Links: https://twitter.com/AlexReibman/status/1711277630035755076 https://twitter.com/Auto_GPT/status/1699522676770177520 https://twitter.com/DivGarg9/status/1710719246014333221 https://twitter.com/_superAGI/status/1710321434437058896 Today's Sponsor: Listen to the chart-topping podcast 'web3 with a16z crypto' wherever you get your podcasts or here: https://link.chtbl.com/xz5kFVEK?sid=AIBreakdown TAKE OUR SURVEY ON EDUCATIONAL AND LEARNING RESOURCE CONTENT: https://bit.ly/aibreakdownsurvey ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/

ai microsoft agent developments autogpt

Is Life Just AI Agents Plus Gaming Engines? (181)

Jeff's Asia Tech Class

Play Episode Listen Later Oct 9, 2023 34:51 Transcription Available

This week's podcast is about digital and AI agents. This will lead to the emergence of non-human platforms and business models.You can listen to this podcast here, which has the slides and graphics mentioned. Also available at iTunes and Google Podcasts.Here is the link to the TechMoat Consulting.Here is the link to the China Tech Tour.Here is the podcast mentioned. —–--Related articles:AutoGPT and Other Tech I Am Super Excited About (Tech Strategy – Podcast 162)The Winners and Losers in ChatGPT (Tech Strategy – Daily Article)Why ChatGPT and Generative AI Are a Mortal Threat to Disney, Netflix and Most Hollywood Studios (Tech Strategy – Podcast 150)From the Concept Library, concepts for this article are:GPT and Generative AI: AutoGPTDigital Marathon: Zero human operationsDigital vs. Human AgentsPlatform Business Models: Non-Human Platforms and Business ModelsFrom the Company Library, companies for this article are:OpenAI / ChatGPTAgentGPT——–I write, speak and consult about how to win (and not lose) in digital strategy and transformation.I am the founder of TechMoat Consulting, a boutique consulting firm that helps retailers, brands, and technology companies exploit digital change to grow faster, innovate better and build digital moats. Get in touch here.My book series Moats and Marathons is one-of-a-kind framework for building and measuring competitive advantages in digital businesses.Note: This content (articles, podcasts, website info) is not investment advice. The information and opinions from me and any guests may be incorrect. The numbers and information may be wrong. The views expressed may no longer be relevant or accurate. Investing is risky. Do your own research.Support the show

netflix ai disney investing chatgpt gaming winners losers google podcasts gpt marathons engines moats autogpt

Alexa Gets an AI Makeover

Gadget Lab: Weekly Tech News

Play Episode Listen Later Sep 21, 2023 33:27

Alexa was due for an upgrade, and now it has gotten one. This week, Amazon held its annual media event where it debuted a slate of new hardware, software, and services. The company reserved the spot at center stage for Alexa, the voice assistant powering all of Amazon's smart home ambitions. Researchers at the company have given Alexa a technological upgrade that enables it to be more competitive in the ChatGPT era. Alexa can now speak more naturally, hold a conversation without as many awkward interactions, and even make its responses sound more emotionally nuanced. This week on Gadget Lab, WIRED senior writer Will Knight joins us to talk about how Alexa is becoming more agile as a conversationalist. Will spoke to Amazon executives about their machine intelligence work, their training models, and how the company is riding the wave of excitement around generative artificial intelligence. Show Notes: Read Will's report on Alexa's latest upgrade. Read our roundup of everything Amazon announced at Wednesday's media event. Recommendations: Will recommends Auto-GPT, a tool that turns ChatGPT an autonomous agent that manages all the boring parts of your life. Mike recommends the book No Meat Required: The Cultural History and Culinary Future of Plant-Based Eating by Alicia Kennedy. Lauren recommends the episode of WIRED's Have a Nice Future Podcast where journalist Paul Tough talks about college in the US and the future of higher education. Will Knight can be found on Twitter @willknight. Lauren Goode is @LaurenGoode. Michael Calore is @snackfight. Bling the main hotline at @GadgetLab. The show is produced by Boone Ashworth (@booneashworth). Our theme music is by Solar Keys. Learn more about your ad choices. Visit podcastchoices.com/adchoices

amazon chatgpt researchers wired makeover bling plant based eating autogpt paul tough alicia kennedy lauren goode gadget lab

The Peril and Potential of AutoGPT

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Aug 26, 2023 17:57

Reading excerpts from Zvi Moshowitz' "On AutoGPT" from his blog/newsletter "Don't Worry About the Vase." https://thezvi.wordpress.com/2023/04/13/on-autogpt/ ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/

ai reading peril vase autogpt worry about

E94 - The AI Effect: CQ is the New IQ

In Search of Green Marbles

Play Episode Listen Later Aug 11, 2023 28:03

On this episode of In Search of Green Marbles, recorded on Wednesday, August 9th, Jordi Visser updates G3 on his latest experiments with AI and on his ongoing efforts to use AI to transform the way Weiss is run. Jordi proceeds to discuss how IQ will be transformed in the age of AI. Please check important disclosures at the end of the podcast and enjoy this wide-ranging discussion on how AI is changing our world in real time. Timestamps:How is Weiss encouraging employee experimentation with AI and what is a sandbox environment? [5:20]How would Jordi describe ChatGPT, AutoGPT and the Code Interpreter to a kid? [8:34]How has AI reshaped Jordi's assessment of prospective employees and why does he believe that ‘CQ is the new IQ'? [12:38]What role does curiosity play in achieving success with modern AI tools? [19:51] Resources:Ask Jordi AnythingNavigating the World of Coding with the Precision of Waze (LinkedIn Post)I was there when AI helped to create a vaccine What is Auto-GPT and why does it matter?Code Interpreter For Learning (video)Disclosures: This podcast and associated content (collectively, the “Post”) are provided to you by Weiss Multi-Strategy Advisers LLC (“Weiss”). The views expressed in the Post are for informational purposes only and are subject to change without notice. Information in this Post has been developed internally and is based on market conditions as of the date of the recording from sources believed to be reliable. Nothing in this Post should be construed as investment, legal, tax, or other advice and should not be viewed as a recommendation to purchase or sell any security or adopt any investment strategy. Past performance is no guarantee of future results. You should consult your own advisers regarding business, legal, tax, or other matters concerning investments. Any health-related information shared on the podcast is not intended as medical advice or for use in self-diagnosis or treatment. Please consult a qualified healthcare professional before acting upon any health-related information on the podcast. Weiss has no control over information at any external site hyperlinked in this Post. Weiss makes no representation concerning and is not responsible for the quality, content, nature, or reliability of any hyperlinked site and has included hyperlinks only as a convenience. The inclusion of any external hyperlink does not imply any endorsement, investigation, verification, or ongoing monitoring by Weiss of any information in any hyperlinked site. In no event shall Weiss be responsible for your use of a hyperlinked site. This is not intended to be an offer or solicitation of any security. Please visit www.gweiss.com to review related disclosures and learn more about Weiss.

world ai chatgpt iq weiss coding precision in search jordi g3 cq autogpt code interpreter disclosures this

ChatGPT is DONE. Time for AutoGPTs.

Growth Everywhere Daily Business Lessons

Play Episode Listen Later Aug 2, 2023 8:46

Eric Siu talks about the limitless possibilities of Auto-GPT for marketing and business. TIME-STAMPED SHOW NOTES: [00:00] - How Auto-GPT will change everything for marketing and business [00:40] - How you can use Auto-GPT to save you from subscription fees [02:05] - Running business development with AI to cut down expenses [03:15] - Use Auto-GPT to help you make practical decisions [05:40] - Launch events and build a billion-dollar network with Auto-GPT [07:00] - Is Auto-GPT Overhyped? What should I talk about next? Who should I interview? Please let me know on Twitter or in the comments below. Did you enjoy this episode? If so, please leave a short review here Subscribe to Leveling Up on iTunes Get the non-iTunes RSS Feed Connect with Eric Siu: Growth Everywhere Single Grain Leveling Up Eric Siu on Twitter Eric Siu on Instagram

ai running chatgpt launch leveling up eric siu autogpt

Threads, ChatGPT usage drops, and AI demos with Sunny Madra | E1774

This Week in Startups

Play Episode Listen Later Jul 11, 2023 79:21

Fin can't burn its mouth on hot pizza. Or wave at someone who wasn't waving at them. Fin can resolve half of your customer support tickets instantly before they reach your team. Meet Fin. A breakthrough AI bot by Intercom – ready to join your support team today. Visit https://intercom.com/fin Eight Sleep. Good sleep is the ultimate game changer. Now you can add the Pod Pro Cover to any mattress! Go to eightsleep.com/twist to check out the Pod Pro Cover and get $150 off at checkout! Carta now lets you launch and administer SPVs for your syndicate. Share your knowledge, capital, and network to launch your syndicate SPVs through Carta. Get 10% off your first SPV with promo code TWIST at http://Carta.com * Today's show: Sunny Madra joins Jason to demo VenturusAI (11:01) and other tools, before discussing Sunny's new AutoGPT project (34:38). They wrap up talking about Meta's launch of Threads (49:08), Google's attempts at building a social network, and Inflection AI's new supercomputer (1:00:13). * Time stamps: (0:00) Sunny joins Jason (1:49) ChatGPT sees a decline in growth (10:22) Fin - Try Fin, Intercom's new AI customer support chatbot, at https://intercom.com/fin (11:01) Sunny demos VenturusAI (24:29) Eight Sleep - Go to https://eightsleep.com/twist to check out the Pod Cover and get $150 off at checkout! (26:01) Sunny demos Vercel (34:38) Sunny's new AutoGPT (41:58) Carta - Go to http://Carta.com and use code TWIST to get 10% off your first SPV (43:29) The decision to be open-sourced or closed (49:08) Meta's new platform Threads (1:00:13) Google's attempts at a social network (1:03:28) Inflection AI's supercomputer, roundtripping and training LLMs * Read LAUNCH Fund 4 Deal Memo: https://www.launch.co/four Apply for Funding: https://www.launch.co/apply Buy ANGEL: https://www.angelthebook.com Great recent interviews: Steve Huffman, Brian Chesky, Aaron Levie, Sophia Amoruso, Reid Hoffman, Frank Slootman, Billy McFarland, PrayingForExits, Jenny Lefcourt Check out Jason's suite of newsletters: https://substack.com/@calacanis * Follow Jason: Twitter: https://twitter.com/jason Instagram: https://www.instagram.com/jason LinkedIn: https://www.linkedin.com/in/jasoncalacanis * Follow TWiST: Substack: https://twistartups.substack.com Twitter: https://twitter.com/TWiStartups YouTube: https://www.youtube.com/thisweekin * Subscribe to the Founder University Podcast: https://www.founder.university/podcast

AutoGPT 2.0: GPT Author and GPT Engineer (#134)

Marketing Against The Grain

Play Episode Listen Later Jun 29, 2023 13:01

How will this affect engineers, writers, and businesses? Kipp and Kieran dive into the new innovations in Auto-GPT and what it means for the future of media consumption. Learn about the power of an ultra-personalized media experience, what the new era of 1:1 experience means for your business, and why people with better ideas will win. Mentions Matt Shumer tweet https://twitter.com/mattshumer_/status/1671231938219130894?s=46&t=qo0rMvYEbESguv0k2b7KGw Lior tweet https://twitter.com/AlphaSignalAI/status/1670488316532379648 We're on Social Media! Follow us for everyday marketing wisdom straight to your feed YouTube: https://www.youtube.com/channel/UCGtXqPiNV8YC0GMUzY-EUFg Twitter: https://twitter.com/matgpod TikTok: https://www.tiktok.com/@matgpod Thank you for tuning into Marketing Against The Grain! Don't forget to hit subscribe and follow us on Apple Podcasts (so you never miss an episode)! https://podcasts.apple.com/us/podcast/marketing-against-the-grain/id1616700934 If you love this show, please leave us a 5-Star Review https://link.chtbl.com/h9_sjBKH and share your favorite episodes with friends. We really appreciate your support. Host Links: Kipp Bodnar, https://twitter.com/kippbodnar Kieran Flanagan, https://twitter.com/searchbrat ‘Marketing Against The Grain' is a HubSpot Original Podcast // Brought to you by The HubSpot Podcast Network // Produced by Darren Clarke.

tiktok social media engineers autogpt darren clarke kieran flanagan

Multi-On is What You Wanted AutoGPT to Be - Interview with Founder Div Garg

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Jun 23, 2023 24:15

Today NLW is joined by Div Garg, the founder of Multi-On which is an AI personal agent that uses the browser to execute complex tasks. Learn more: https://multion.ai/ The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/

founders ai wanted garg autogpt

Can SuperAGI Be What People Wanted from AutoGPT?

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Jun 6, 2023 13:32

AutoGPT was all the AI hotness a few months ago, with its promise of autonomous AI agents. Now, a new tool called SuperAGI is catching developer's interest as an ai agent implementation tool that is more robust than what AutoGPT offered. The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/

ai wanted autogpt

How Money Solves Most Problems & How To Get It with Josh Forti

Underdog Empowerment

Play Episode Listen Later Jun 5, 2023 66:55

Josh Forti, a guest on the podcast three to four years ago, has since embarked on a remarkable journey, establishing a thriving online business and achieving great success. With a seven-figure coaching business under his belt, Josh also dabbled in the world of cryptocurrencies, primarily as a speculator. However, his encounter with Chat GPT made him realize that the world was on the verge of rapid transformation. Amidst these transformative times, Josh faced a personal milestone as well: his wife's pregnancy with their first child. Contemplating his future, he had to decide between continuing on the coaching path or seizing the opportunity to embrace the next trend. Although he had made some gains in the crypto market, he made the bold decision to shut down his coaching business at the end of the previous year. Instead, he committed himself fully to comprehending the crypto space and understanding how technology would disrupt various sectors, including investing and personal wealth management. Josh firmly believed that artificial intelligence (AI) would revolutionize everything, whether people embraced it or not. He recognized the power of AI in conjunction with blockchain technology, which resolved the issue of digital ownership. Furthermore, he speculated that superintelligence would eventually emerge and foresaw the possibility of AI causing harm to individuals in some capacity. Josh also emphasized the utilization of AI to foster growth, scalability, and automation. He mentioned AutoGPT as a prime example. He questioned the type of brand he wanted to build and concluded that personal cash flow was the most critical aspect of his life at that moment. To achieve this goal, he pondered how AI and blockchain could aid him in his endeavors. With the impending arrival of his first child, a daughter due in November, Josh's motivation to secure personal cash flow intensified. He expressed his belief in the value of coaching and learning from individuals who possess superior knowledge. Rather than solely focusing on making money, he advocated for transforming that money into freedom. Bitcoin held a special place in his heart, as it allowed him unparalleled autonomy over his assets. In contrast, he highlighted the flaws in other systems, such as real estate ownership, where taxes could lead to confiscation. Josh encouraged understanding the rules and playing by them to succeed, ultimately creating an ecosystem where worries became obsolete. While acknowledging that complete freedom eluded everyone, he acknowledged that wealth provided greater opportunities. Reflecting on his own journey, Josh considered the advice he would give himself if he had to start anew with only the knowledge he had gained. He shared a personal tragedy—the death of his brother in a helicopter crash four years ago, which left him residing in his parents' basement without any money. In his quest to rebuild his life, he hired a coach for $60,000, despite only having $5,000 upfront and uncertainty about acquiring the remaining funds. This coach imparted a crucial lesson: self-love and self-awareness were essential prerequisites for success. Understanding one's values, beliefs, and writing them down formed the foundation of an individual's operating system. Additionally, Josh emphasized the importance of figuring out ways to generate income, asserting that the core of effective selling lies in making customers feel understood—a skill that also helps in understanding people better. Josh Forti's story is one of resilience, adaptability, and a strong belief in the power of technology. Through his experiences, he has demonstrated the potential of AI, blockchain, and personal growth to reshape lives and businesses in profound ways. What You'll Learn: Ways to use AI to scale your business. Which two cryptocurrencies Josh believes in and why. What it takes to be truly financially free. And much more! Favorite Quote: “Every industry that AI can disrupt, it will disrupt.” -Josh Forti Connect with Josh: Josh Forti How to Get Involved: Get podcasting help here. For more on how to grow your business, check out this episode. If you enjoyed this episode, head over and visit us on Apple Podcasts - leave a review and let us know what you thought! Your feedback keeps us going. Thanks for helping us spread the word!

money ai reflecting bitcoin contemplating solves autogpt josh forti learn ways

#417: Test-Driven Prompt Engineering for LLMs with Promptimize

Talk Python To Me - Python conversations for passionate developers

Play Episode Listen Later May 30, 2023 73:41

Large language models and chat-based AIs are kind of mind blowing at the moment. Many of us are playing with them for working on code or just as a fun alternative to search. But others of us are building applications with AI at the core. And when doing that, the slightly unpredictable nature and probabilistic nature of LLMs make writing and testing Python code very tricky. Enter promptimize from Maxime Beauchemin and Preset. It's a framework for non-deterministic testing of LLMs inside our applications. Let's dive inside the AIs with Max. Links from the show Max on Twitter: @mistercrunch Promptimize: github.com Introducing Promptimize ("the blog post"): preset.io Preset: preset.io Apache Superset: Modern Data Exploration Platform episode: talkpython.fm ChatGPT: chat.openai.com LeMUR: assemblyai.com Microsoft Security Copilot: blogs.microsoft.com AutoGPT: github.com Midjourney: midjourney.com Midjourney generated pytest tips thumbnail: talkpython.fm Midjourney generated radio astronomy thumbnail: talkpython.fm Prompt engineering: learnprompting.org Michael's ChatGPT result for scraping Talk Python episodes: github.com Apache Airflow: github.com Apache Superset: github.com Tay AI Goes Bad: theverge.com LangChain: github.com LangChain Cookbook: github.com Promptimize Python Examples: github.com TLDR AI: tldr.tech AI Tool List: futuretools.io Watch this episode on YouTube: youtube.com Episode transcripts: talkpython.fm --- Stay in touch with us --- Subscribe to us on YouTube: youtube.com Follow Talk Python on Mastodon: talkpython Follow Michael on Mastodon: mkennedy Sponsors PyCharm RedHat Talk Python Training

ai online training chatgpt web software engineering large developers programming python data science online courses mastodon prompt cloud computing midjourney ide red hat software developers web development mongodb lemur nosql prompt engineering preset autogpt langchain pycharm apache airflow test driven talk python python3 talk python training maxime beauchemin python2

How to Get AutoGPT on Your Phone (And What People Are Actually Finding it Useful For)

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later May 9, 2023 18:41

A check-in on AutoGPT as well as the headline news: Updates from the US copyright office Pitchbook VC enthusiasm for AI Palantir stock price pops after AI announcement Google I/O Developer conference AI leaks IBM relaunches AI division as WatsonX in partnership with HuggingFace One researcher says 80% of jobs could be replaced by AI

ai phone ibm autogpt watsonx

AI and Web3 | Mohamed Fouda & Qiao Wang of Alliance

Bankless

Play Episode Listen Later May 3, 2023 70:59

AI is exploding into every facet of the internet. The convergence of Crypto and AI is inevitable, and we bring on Qiao Wang and Mohamed Fouda to discuss the opportunities and challenges this presents. They explore why Web3 provides an interesting platform for AI, such as payment and execution rails for AI agents, easy access to financial tools, and the ability to commission resources permissionlessly. One example of this is AutoGPT, which uses AI to generate code for on-chain smart contracts. How do we balance caution and optimism? How can we surf this tidal wave of innovation and new frontiers? ------

ai talk crypto alliance wang web3 disclosure stake swell autogpt qiao wang

The Latest on AutoGPT and BabyAGI: Semi-Autonomous Specialized Agents (SASAs)

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Apr 27, 2023 11:23

Meet semi-autonomous specialized agents (SASAs), more descreet, focused implementations of AutoGPT and BabyAGI. The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Breakdown wherever you listen: https://pod.link/1680633614

ai semi autonomous specialized autogpt babyagi

Auto GPT is... Useless?

The After Hours Entrepreneur Social Media, Podcasting, and YouTube Show

Play Episode Listen Later Apr 25, 2023 3:06

Auto GPT is the new hotness on the block. People are already hailing it as the next evolution in AI. I jumped into fresh AI technology this week and... I'm not impressed.What did I miss? How are you using Auto GPT for practical purposes?Send me an email if you disagree: mark@marksavantmedia.comSupport the show

ai auto useless autogpt

The Problems with AutoGPT and BabyAGI: How Useful Are They Really?

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Apr 22, 2023 12:03

For the last 3 weeks, AutoGPT has massively captured the attention of the AI community. But how useful is it really? Some are starting to ask whether it really lives up to the hype. Watch the original video: https://www.youtube.com/@TheAIBreakdown

ai autogpt babyagi

#333 Live From PyCon

Python Bytes

Play Episode Listen Later Apr 22, 2023 22:38

Watch on YouTube Sponsored by us! Support our work through: Our courses at Talk Python Training Test & Code Podcast Patreon Supporters Connect with the hosts Michael: @mkennedy@fosstodon.org Brian: @brianokken@fosstodon.org Show: @pythonbytes@fosstodon.orgx Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Tuesdays at 11am PT. Older video versions available there too. Michael #1: Introducing Microsoft Security Copilot Security Copilot combines this advanced large language model (LLM) with a security-specific model from Microsoft. When Security Copilot receives a prompt from a security professional, it uses the full power of the security-specific model to deploy skills and queries that maximize the value of the latest large language model capabilities. Your data and stays within your control. It is not used to train the foundation AI models, and in fact, it is protected by the most comprehensive enterprise compliance and security controls. Brian #2: PEP 695 – Type Parameter Syntax “This PEP specifies an improved syntax for specifying type parameters within a generic class, function, or type alias. It also introduces a new statement for declaring type aliases.” To get a feel for this, jump to the examples One example Here is an example of a generic function today. from typing import TypeVar _T = TypeVar("_T") def func(a: _T, b: _T) -> _T: ... - And the new syntax. def func[T](a: T, b: T) -> T: ... Michael #3: Auto-GPT An experimental open-source attempt to make GPT-4 fully autonomous. This program, driven by GPT-4, chains together LLM "thoughts", to autonomously achieve whatever goal you set. Features

black ai science education internet real news microsoft code web software releasing developers joke older programming file gpt open source python faq data science pep llm extras astral cloud computing ide ruff software developers web development autogpt pycon pycharm security copilot scipy python3 talk python training

Zuck's Buying Coffee | Group Chat News Ep. 763

Group Chat

Play Episode Listen Later Apr 20, 2023 71:55

Today, Drama and Anand discuss some of the biggest news in tech and finance. They start with the possibility of @coinbase moving out of the U.S., as reported by CoinDesk on Twitter. They also talk about Apple's recent launch of its savings account, offering an impressive 4.15% interest rate, as well as its expansion into India with the opening of its first retail store. They also touch on Facebook's settlement money for anyone who used the platform in the last 16 years. Additionally, the hosts explore the topic of Auto-GPT and whether it's time to freak out about AI. Lastly, they delve into the recent Southwest glitch that caused thousands of flights to be delayed and the settlement of the defamation lawsuit between Fox News and Dominion Voting System. Tune in for all this and more, and this week's Winners, Losers, and Content! - written by ChatGPT Timeline of What Was Discussed: Group Chat Announcements. (0:56) When you realize you're getting older. Drama's recap of Coachella. (4:43) Will Coinbase leave the US? (20:39) Apple is going to run the world. (27:35) Your next coffee is on Zuck. (38:26) The job of getting clicks. (40:17) Is it time to be fearful of A.I.? (43:44) Southwest's woes continue. (50:24) The news is now entertainment. (53:48) Winners, Losers, and Content. (1:00:24) New Merch Alert! (1:10:56) Related Links/Products Mentioned Could @coinbase move out of the U.S.? - CoinDesk on Twitter Coinbase CEO says it is preparing to go to court with the U.S. SEC Apple launches its savings account with 4.15% interest rate Apple Opens First Retail Store in India as It Looks to Country for Manufacturing Elon Musk Claims Google Co-Founder Is Building a "Digital God" Anyone who used Facebook in the last 16 years can now get settlement money. Here's how. What is Auto-GPT And Is Now The Time To Freak Out About AI? Thousands of flights delayed as Southwest glitch grounds planes Fox News settles blockbuster defamation lawsuit with Dominion Voting Systems Connect with Group Chat! Watch The Pod #1 Newsletter In The World For The Gram Tweet With Us Exclusive Facebook Content We're @groupchatpod on Snapchat

ai apple drama coffee chatgpt winners snapchat fox news losers thousands southwest coachella anand group chat zuck coindesk autogpt dominion voting system

AutoGPT: EVERYTHING You Need To Know (#111)

Marketing Against The Grain

Play Episode Listen Later Apr 20, 2023 31:16

There's a new A.I. on the block that makes ChatGPT sound like old news. Kipp and Kieran dive into AutoGPT and the massive impact it will have on how we market and grow our businesses. Learn the AutoGPT uses cases for business, the first-mover advantage in marketing, what AutoGPT will do for you in the next 6-18 months, the biggest opportunity in AI for businesses, and why Google is reinventing itself. Mentions AutoGPT code https://github.com/Significant-Gravitas/Auto-GPT Meditations and Moloch https://slatestarcodex.com/2014/07/30/meditations-on-moloch/ Lex Fridman #243 https://www.youtube.com/watch?v=3pvpNKUPbIY Kipp Tweet https://twitter.com/kippbodnar/status/1648294281671585795 Tweet on Scaleout AI Readiness Report https://twitter.com/nonmayorpete/status/1648075299618435073 Neil Patel EP https://link.chtbl.com/gSbbVkES Booking.com https://www.booking.com/ Kayak https://www.kayak.com/ About Project Magi https://neilpatel.com/blog/project-magi/ We're on Social Media! Follow us for everyday marketing wisdom straight to your feed YouTube: https://www.youtube.com/channel/UCGtXqPiNV8YC0GMUzY-EUFg Twitter: https://twitter.com/matgpod TikTok: https://www.tiktok.com/@matgpod Thank you for tuning into Marketing Against The Grain! Don't forget to hit subscribe and follow us on Apple Podcasts (so you never miss an episode)! https://podcasts.apple.com/us/podcast/marketing-against-the-grain/id1616700934 If you love this show, please leave us a 5-Star Review https://link.chtbl.com/h9_sjBKH and share your favorite episodes with friends. We really appreciate your support. Host Links: Kipp Bodnar, https://twitter.com/kippbodnar Kieran Flanagan, https://twitter.com/searchbrat ‘Marketing Against The Grain' is a HubSpot Original Podcast // Brought to you by The HubSpot Podcast Network // Produced by Darren Clarke.

tiktok ai social media google chatgpt kayak moloch lex fridman autogpt darren clarke kieran flanagan

Don't Be Afraid of AI - How to Thrive as an Entrepreneur in the World of Infinite Leverage

Copyblogger FM: Content Marketing, Copywriting, Freelance Writing, and Social Media Marketing

Play Episode Listen Later Apr 18, 2023 53:54

On this week's episode, Tim Stoddart (@timstodz) and Ethan Brooks (@damn_ethan) talk about AutoGPT, and what it means for your career as a content creator. Cool Stuff Mentioned In The Show: • AutoGPT • AgentGPT • Ben's Bites AI Newsletter • 100 Things To See In The Night Sky For more great insights, check out… Copyblogger Academy where you'll learn the 3 skills you need to become an effective content entrepreneur in today's world.

entrepreneur afraid thrive leverage infinite autogpt tim stoddart ethan brooks

I Give Up - WAN Show April 14, 2023

The WAN Show Podcast

Play Episode Listen Later Apr 17, 2023 235:42

Save money on your phone plan today at https://www.mintmobile.com/wanshow Try Notion AI for free at https://www.Notion.com/WAN Don't just browse the web – build it. Apply for free today using the link https://covalence.io/wan and take your first step toward a career in software development with Covalence. Timestamps - Timing may be off due to sponsor change: 0:00 Chapters 1:06 Intro 1:32 [Topic 1]: AI Agents 2:05 Auto GPT 4:34 Examples/Potential Dangers of this tech 8:46 AI Agent Gaming 18:24 AI Agent Game caveats 21:10 Discussion Question: What are the likely applications and limitations of this technology? 22:51 Building a better AAA game is impossible 26:45 LocalGPT 27:12 [Topic 2]:Elon Musk's AI Investments 29:24 Is Elon giving up on Twitter? 32:25 AI Startup Bubble 34:53 OpenAI not working on GPT 5 36:19 [Topic 3]:Linkedin Verified 41:08 Have we given up on privacy? 43:43 Roasting Luke's Linkedin Profile 44:15 Linus' Linkedin 46:38 Side topic: Mirrored Channels 49:04 Merch Messages 1. 49:11 If LMG didn't exist, where would you work? 1:01:07 Flipper Zero ethics 1:03:45 [Topic 4]:Mario Movie 1:05:00 Linus liked it! 1:07:50 How was the voice acting? 1:08:37 Mario Movie 2? 1:11:49 Nintendo Cinematic Universe 1:14:37 [Topic 5]:Potential Microsoft Steam Deck 1:16:37 Perils of Saves in games 1:20:13 ROG ALLY 1:24:52 Handhelds from other companies 1:28:09 Sponsors 1:30:42 Seasonic is Cool! 1:32:35 Merch Messages 2 1:32:37 AI Adult Content Ethics 1:36:31 Calibration Tech under right to repair? 1:39:07 Sticker Shock in Niche Markets 1:44:51 Who is your Professional Inspiration? 1:49:41 [Topic 6]:4070 1:56:25 Future of GPU Market 1:59:48 [Topic 7]:Universal Music Group vs AI Scraping 2:00:13 Does this matter/can they stop it? 2:04:24 [Topic 8]:Floatplane Exclusives coming to YouTube (Memberships)! 2:14:57 Side topic: Mech Messages are getting long/ new ideas 2:16:45 Linus's weirdest thing confiscated by TSA 2:21:30 [Topic 9]: Tesla recording Users 2:25:16 [Topic 10]: Intel Teams up with ARM 2:27:23 WAN SHOW: After Dark 2:27:53 Are young people going to be better at AI Tech? 2:29:57 Which one of Linus' cats is his favorite? 2:32:20 Why is Multi Monitor Management not better? 2:38:18 AI Antivirus 2:37:10 What will Linus' last video be (in 2074) 2:38:42 What antiquated tech will you keep? 2:40:10 WAN guests when? 2:41:14 LTT handwarmer 2:43:59 AI Crypto Trading/Betting 2:44:52 Have the goals of Floatplane changed? 2:47:22 New Desk pad lttstore dot com 2:51:09 Are there areas where we should regulate to preserve jobs? 2:58:18 Why hasn't AMD released any new GPUS since 7900 in December? 3:01:43 What surprised you at Micron? 3:07:07 What Tech courses should you take? 3:08:02 Any LTT garments should you not use fabric softener on? 3:10:10 What do you guys think about tech channels releasing time before NDA deadline? 3:15:48 Nebula's Lifetime Membership 3:19:47 Have you ever seen a surface election display monitor 3:20:32 LTT partnership with Ifixit? 3:21:08 LTT relocation services? 3:21:24 Linus offering a screwdriver with all the bit sets 3:22:05 Favorite Small form-factor case? 3:22:58 Recreating Old LTT videos? 3:24:51 Split screen support on the iPad for Floatplane 3:25:05 Luxury Backpack update 3:30:05 Rapid Fire Questions 3:32:14 New AI safety Measures? 3:32:44 New Mainframe tech 3:33:07 LTT as consultants? 3:35:30 Home PC, Rack vs Tower 3:35:44 G-suit Issues 3:36:12 Creator Warehouse concerns 3:37:24 Linus's Parent Tips 3:37:33 Nostalgic Gaming Era 3:39:16 Jobs after AI 3:40:14 AMD Driver Updates 3:40:36 Linus too trusting 3:41:35 Labs testing screen protectors 3:42:28 What chargers do you travel with for Steamdeck? 3:43:30 Should companies block Chat GPT? 3:44:44 How would you sell Apple products? 3:46:42 QLED longevity 3:47:14 Janky tech solutions 3:48:40 Can you saved hacked Drives? 3:50:55 New His and Her's Undergarments 3:52:40 Italy Blocked Chat GPT 3:53:50 Linus Mentoring Smaller Creators 3:55:20 Outro

apple future building elon musk jobs tesla ipads give up saves drives openai labs aaa gpt measures perils notion amd linus nda rack gpus new ai micron universal music group wan mario movie steamdeck ai tech ifixit sticker shock handhelds qled autogpt flipper zero ltt is elon floatplane seasonic

Is AutoGPT Dangerous?

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Apr 15, 2023 12:28

There's no denying AutoGPT represents an incredible advance in AI, but does it create new risk we should be paying more attention to?

ai dangerous autogpt

Banks Thriving Despite Crisis, Magic's $6B NFL Deal, Meet the NYC Rat Czar

Business Casual

Play Episode Listen Later Apr 14, 2023 26:07

Episode 39: Neal and Toby take a look at all of the bank earning reports that came out on Friday morning, and it looks like they are doing just fine. Plus, Josh Harris and Magic Johnson put a team together to buy the NFL's Washington Commanders for a record $6 billion. And what is Auto-GPT? They also share their stock of the week and dog of the week. And rats beware, New York City introduces the newest government official, the Rat Czar. Learn more about our sponsor, TaxAct: https://www.taxact.com Learn more about our sponsor, Fidelity: https://fidelity.com/stocksbytheslice Listen Here: https://link.chtbl.com/MBD Watch Here: https://www.youtube.com/@MorningBrewDailyShow Learn more about your ad choices. Visit megaphone.fm/adchoices

new york city nfl magic crisis thriving banks magic johnson czars josh harris autogpt tax act rat czar nfl's washington commanders

The rise of AutoGPTs and AI anxieties with Sunny Madra and Vinny Lingham | E1720

This Week in Startups

Play Episode Listen Later Apr 13, 2023 81:58

Vinny and Sunny join Jason to discuss AI's blistering pace and compare its development to other product launches (2:47). They also break down the anxiety artists and developers feel from AI automation (14:47), the rise of AutoGPTs, Twitter's reported LLM project, and more (37:01). (0:00) Jason kicks off the show (2:47) The blistering pace of AI (9:44) Developer builds Flappy Bird in 1-hour (11:24) Squarespace - Use offer code TWIST to save 10% off your first purchase of a website or domain at https://Squarespace.com/TWIST (12:53) Developers leveraging gains in AI (14:47) Ai anxiety (23:55) Instacart's ChatGPT plugin (26:06) Preserving your advantage against AI (29:10) Vanta - Get $1000 off your SOC 2 at https://vanta.com/twist (30:14) AI artists (35:54) Crowdbotics - Get a free scoping session for your next big app idea at http://crowdbotics.com/twist (37:01) Automating with AutoGPT (47:09) LLMs vs. Knowledge retrieval systems (57:06) Is Google's Bard behind? (1:02:42) Twitter's alleged generative AI project (1:08:30) Ethereum and Bitcoin updates FOLLOW Sunny: https://twitter.com/sundeep FOLLOW Vinny: https://twitter.com/VinnyLingham FOLLOW Jason: https://linktr.ee/calacanis Subscribe to our YouTube to watch all full episodes: https://www.youtube.com/channel/UCkkhmBWfS7pILYIk0izkc3A?sub_confirmation=1 FOUNDERS! Subscribe to the Founder University podcast: https://podcasts.apple.com/au/podcast/founder-university/id1648407190

ai google chatgpt developers twist preserving ethereum bard llm automating instacart squarespace soc anxieties flappy bird autogpt madra vinny lingham squarespace use

5 Ways AutoGPT is Already Being Used

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Apr 13, 2023 13:58

AutoGPT is lighting up the developer and AI community. Here are five uses happening right now: Starting a businesses Building a web application Task list that does itself Creating podcast content Autonomous AI agent generator

ai starting building tasks autogpt

What is AutoGPT and Why is Everyone Talking About It?

The AI Breakdown: Daily Artificial Intelligence News and Discussions

Play Episode Listen Later Apr 12, 2023 9:22

Less than a week old, AutoGPT has exploded onto the scene. AutoGPTs are AIs that, once assigned a task, figure out how to complete that task, including generating other AIs. They have access to the internet, longer term memory, and thus can be used in much different ways. In just a week, AutoGPTs have become the most active GitHub repositories.

ai github autogpt

Podcasts about autogpt

Best podcasts about autogpt

The AI Breakdown: Daily Artificial Intelligence News and Discussions

The Nonlinear Library

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Jeff's Asia Tech Class

Let's Talk AI

This Week in Startups

Marketing Against The Grain

AI For Humans

Ecommerce Empire Builders

Latest news about autogpt

Latest podcast episodes about autogpt

Building an AI Guardian for Enterprise with Onyx Security CEO Maxim Bar Kogan

About Rogue AI and Corporate Blindness

IA hors de contrôle et déni collectif

[AI UNRAVELED SPECIAL] The Agentic Orchestration War (OpenClaw vs. AutoGPT vs. n8n)

Inside Perplexity Computer's agent platform

⚡️ Ship AI recap: Agents, Workflows, and Python — w/ Vercel CTO Malte Ubl

⚡️ Ship AI recap: Agents, Workflows, and Python — w/ Vercel CTO Malte Ubl

Latent.Space 2024 Year in Review

EP 54. 深度对谈顶尖AI开源项目：大模型开源生态, Agent 与中国力量

The INDUStry Show w Saumya Bhatnagar

AF - Auto-Enhance: Developing a meta-benchmark to measure LLM agents' ability to improve other agents by Sam Brown

LangChain's Harrison Chase on Building the Orchestration Layer for AI Agents

AF - We might be dropping the ball on Autonomous Replication and Adaptation. by Charbel-Raphael Segerie

Zenlytic Is Building You A Better Coworker With AI Agents

HPR4117: JAMBOREE !

High Agency Pydantic > VC Backed Frameworks — with Jason Liu of Instructor

Supervise the Process of AI Research — with Jungwon Byun and Andreas Stuhlmüller of Elicit

Latent Space Chats: NLW (Four Wars, GPT5), Josh Albrecht/Ali Rohde (TNAI), Dylan Patel/Semianalysis (Groq), Milind Naphade (Nvidia GTC), Personal AI (ft. Harrison Chase — LangFriend/LangMem)

Taking a Hybrid AI Approach to Security at Snyk with Randall Degges

Digital Strategy Lesson: An Introduction to Rate of Learning (184)

3 Things I'm Watching for on Singles Day (183)

AutoGPT: A Game-Changer for Achieving AGI?

5 Ways 5G and 5.5G Are Game Changers for Business (182)

The Latest AutoGPT and AI Agent Developments

Is Life Just AI Agents Plus Gaming Engines? (181)

Alexa Gets an AI Makeover

The Peril and Potential of AutoGPT

E94 - The AI Effect: CQ is the New IQ

ChatGPT is DONE. Time for AutoGPTs.

Threads, ChatGPT usage drops, and AI demos with Sunny Madra | E1774

AutoGPT 2.0: GPT Author and GPT Engineer (#134)

Multi-On is What You Wanted AutoGPT to Be - Interview with Founder Div Garg

Can SuperAGI Be What People Wanted from AutoGPT?

How Money Solves Most Problems & How To Get It with Josh Forti

#417: Test-Driven Prompt Engineering for LLMs with Promptimize

How to Get AutoGPT on Your Phone (And What People Are Actually Finding it Useful For)

AI and Web3 | Mohamed Fouda & Qiao Wang of Alliance

The Latest on AutoGPT and BabyAGI: Semi-Autonomous Specialized Agents (SASAs)

Auto GPT is... Useless?

The Problems with AutoGPT and BabyAGI: How Useful Are They Really?

#333 Live From PyCon

Zuck's Buying Coffee | Group Chat News Ep. 763

AutoGPT: EVERYTHING You Need To Know (#111)

Don't Be Afraid of AI - How to Thrive as an Entrepreneur in the World of Infinite Leverage

I Give Up - WAN Show April 14, 2023

Is AutoGPT Dangerous?

Banks Thriving Despite Crisis, Magic's $6B NFL Deal, Meet the NYC Rat Czar

The rise of AutoGPTs and AI anxieties with Sunny Madra and Vinny Lingham | E1720

5 Ways AutoGPT is Already Being Used

What is AutoGPT and Why is Everyone Talking About It?