Sample Space

Sample space is a podcast about tools, thoughts and techniques from machine learning practitioners. We talk to toolmakers and practitioners about interesting problems in the real world.

probabl


    • Latest episode: Jan 15, 2025
    • New episodes: every other week
    • Average duration: 57m
    • Episodes: 14


    Latest episodes from Sample Space

    Time for some (extreme) distillation with Thomas van Dongen - founder of the Minish Lab

    Jan 15, 2025 49:21


    Word embeddings might feel a little out of fashion. After all, we have attention mechanisms and transformer models now, right? Well, it turns out that if you apply distillation the right way, you can get highly performant word embeddings out. The technique is featured in the model2vec project from the Minish Lab, and in this episode we talk to the founder to learn more about it. We have a Discord these days, feel free to discuss the podcast with us there! https://discord.probabl.ai This podcast is part of the open efforts over at probabl. To learn more you can check out our website or reach out to us on social media. Website: https://probabl.ai/ Bluesky: https://bsky.app/profile/probabl.bsky.social LinkedIn: https://www.linkedin.com/company/probabl Twitter: https://x.com/probabl_ai

    Imbalanced learn: regrets and onwards

    Dec 6, 2024 54:06


    Imbalanced-learn is one of the most popular projects in the scikit-learn ecosystem. It supports resampling techniques, which have historically been the go-to approach for imbalanced classification use-cases. However, now that we are a few years down the line, it may be time to start rethinking the library. As it turns out, other techniques may be preferable. We talk to the maintainer, Guillaume Lemaitre, to discuss the lessons that have been learned over the last decade.

    You want to be in control of your own Copilot

    Nov 6, 2024 67:16


    There are many LLMs that you can use for programming these days. Some of them even go into your IDE, like Cursor or GitHub Copilot. But what if you want to tweak these LLMs to do what you want? Instead of being stuck with the tools that a vendor gives you, the goal of Continue.dev is to allow you to customise this yourself. In this episode we talk to Ty Dunn, co-founder of the project, to learn more about this. If you are curious to learn more about this effort, please check out https://continue.dev. You may also want to read the manifesto over at https://amplified.dev/.

    What it is like to maintain the scikit-learn docs

    Oct 31, 2024 55:01


    Scikit-learn's documentation pages are celebrated. But not everyone is aware that the project actually has somebody on payroll to take care of them. In this episode we talk to Arturo about stories from the scikit-learn documentation. In particular, the docs have a recommender that few folks are aware of. People just assume that it is manually curated, but there are a few base scikit-learn tools under the hood there. Link to the official scikit-learn MOOC: https://inria.github.io/scikit-learn-mooc/ You can follow the podcast on most podcast players, including Apple Podcasts, Spotify and rss.com. - https://podcasts.apple.com/us/podcast/sample-space/id1739598572 - https://open.spotify.com/show/0BnwEHuyOlHgeZfselpn1n - https://rss.com/podcasts/sample-space/

    Sqlite can totally do embeddings now

    Oct 23, 2024 59:20


    Vector databases are kind of everywhere these days. There is a big pool of VCs that are pouring money into the ecosystem too. But while all of that is happening, SQLite has also gotten support for it. In this episode we talk to Alex Garcia, the maintainer of this project, and discuss how the project got created and what the future has in store. Sqlite-vec GitHub repo: https://github.com/asg017/sqlite-vec Alex Garcia blog: https://alexgarcia.xyz/blog/2024/sqlite-vec-hybrid-search/index.html Datasette Discord: https://discord.com/invite/ktd74dm5mw Sqlite-vec channel on Mozilla Discord: https://discord.gg/Ve7WeCJFXk
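To give a feel for what storing and searching embeddings inside SQLite can look like, here is a minimal sketch that uses only the Python standard library. It does not use sqlite-vec or its API: the `docs` table, the `pack`/`unpack` helpers and the `cosine_distance` SQL function are all hypothetical names made up for this illustration, and the search is a plain brute-force scan rather than whatever the extension does internally.

```python
import math
import sqlite3
import struct


def pack(vector):
    """Serialize a list of floats into a BLOB of little-endian 32-bit floats."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack(blob):
    """Inverse of pack(): turn a BLOB back into a tuple of floats."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


def cosine_distance(blob_a, blob_b):
    """Hypothetical SQL function: 1 - cosine similarity of two stored vectors."""
    a, b = unpack(blob_a), unpack(blob_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL.
conn.create_function("cosine_distance", 2, cosine_distance)
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, label TEXT, embedding BLOB)")

# Toy 3-dimensional "embeddings"; real ones would come from a model.
rows = [
    ("cat", [1.0, 0.0, 0.0]),
    ("kitten", [0.9, 0.1, 0.0]),
    ("car", [0.0, 0.0, 1.0]),
]
conn.executemany(
    "INSERT INTO docs (label, embedding) VALUES (?, ?)",
    [(label, pack(vector)) for label, vector in rows],
)

# Brute-force nearest-neighbour query: score every row, keep the closest two.
query = pack([1.0, 0.05, 0.0])
nearest = [
    label
    for (label,) in conn.execute(
        "SELECT label FROM docs ORDER BY cosine_distance(embedding, ?) LIMIT 2",
        (query,),
    )
]
print(nearest)
```

A scan like this is fine for small tables; for the actual extension and its query syntax, the sqlite-vec repository linked above is the place to look.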

    How to rethink the notebook - with Akshay Agrawal, co-creator of Marimo

    Oct 16, 2024 72:04


    Jupyter has been a great environment to explore computational ideas, but that doesn't mean that it has to be the only environment for interactive coding in Python. It also comes with some downsides, which led Akshay Agrawal to create an alternative called Marimo. We discussed it in a previous livestream and figured that it was time to sit down with the creator to learn what led to the development of this exciting new tool. You can learn more about Marimo by going to their website over at https://marimo.io

    You are always dealing with many tables - with Madelon Hulsebos

    Sep 10, 2024 69:10


    When you are working on a data pipeline for ML, you are never dealing with a single table. It always demands different tables for different reasons, all of which have to be mashed together in order to have something that you can learn from. But if that is the case, why do we spend so much time talking about ML pipelines that only work on a single table? Madelon Hulsebos has a PhD on the topic, so we figured that we might ask her. As mentioned in the podcast, here is the link to Madelon's homepage: https://www.madelonhulsebos.com/ Some links to interesting articles from Madelon can be found below. https://www.madelonhulsebos.com/assets/dataset_search_survey.pdf https://dl.acm.org/doi/pdf/10.1145/3654975 https://dl.acm.org/doi/pdf/10.1145/3588710

    How Narwhals has many end users ... that never use it directly with Marco Gorelli

    Aug 21, 2024 60:53


    When you pip install a package you will for sure end up using it later. But often you will also install a bunch of dependencies, and it is very likely that you won't directly interact with all of them. That does not mean that such a package is not useful; it merely means that the package might be used directly by the maintainer of another package instead. This is interesting, because recently one such tool came into existence. It is called Narwhals and it seems to be on track to become critical infrastructure for data science projects. We have the maintainer of Narwhals on the show this week to talk about it. To learn more about Narwhals, you can check the repository here: https://github.com/narwhals-dev/narwhals

    Pragmatic data science checklists with Peter Bull - cofounder Drivendata

    Jul 17, 2024 65:38


    A lot of things can go (and have gone) wrong when folks try to apply data science in projects. So how might we prevent that? Maybe what we need to do is to look at the medical profession and their practice of checklists before surgery.

    Model safety, that's a pickle! with Adrin Jalali - scikit-learn maintainer

    Jun 27, 2024 61:47


    Historically it's always been the case that you would use a pickle file to store a trained scikit-learn model on disk for deployment. Pickles make sense because they are so flexible, but they do carry a security concern. Adrin has been working on a remedy called skops, which is the main topic of this podcast. To learn more about skops, make sure to check the documentation: https://skops.readthedocs.io/en/stable/
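The security concern with pickles can be shown with nothing but the Python standard library. The sketch below is an illustration only: `Payload` is a hypothetical class invented for this example, and the callable it smuggles in is the harmless built-in `sorted`. It says nothing about how skops works; it only shows why loading a pickle from an untrusted source is risky.

```python
import pickle


class Payload:
    """Hypothetical class whose pickle tells the loader to call a function."""

    def __reduce__(self):
        # pickle records "call this callable with these arguments" and replays
        # that call at load time. Here it is the harmless built-in sorted(),
        # but the author of the file could name any importable callable.
        return (sorted, ([3, 1, 2],))


blob = pickle.dumps(Payload())

# Loading the bytes does not rebuild a Payload: it runs the recorded call
# and hands back whatever that call returned.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

The point is that `pickle.loads` executes instructions chosen by whoever produced the file, so a model file from an untrusted source should be treated like untrusted code.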

    Moving Towards KDearestNeighbors with Leland McInnes - creator of UMAP

    May 30, 2024 57:19


    Leland McInnes is known for a lot of packages. There's UMAP, but also PyNNDescent and HDBSCAN. Recently he's also been working on tools to help visualise clusters of data, and he's cooking up something new that's related to nearest neighbor algorithms. This interview touches on all of these topics. If you're interested in learning more about the MoMA exhibition, it was by Refik Anadol: https://refikanadol.com/ and this was the work at MoMA: https://refikanadol.com/works/unsupervised/. The other artist was Kyle McDonald: https://kylemcdonald.net/ and the piece we mentioned was this one: https://www.youtube.com/watch?v=04DqdT0-NtI.

    Talk like a DataFrame, run like SQL with Phillip Cloud - core-committer on Ibis

    May 2, 2024 64:09


    Ibis is a Python library that offers a single dataframe API which can run your queries on many different backends. These include databases like Postgres, but also commercial vendors like BigQuery and Snowflake. This ability to control multiple backends from a single API has a lot of use-cases, as well as maintainer challenges, all of which are discussed in this episode. To learn more about Ibis, check out the docs here: https://ibis-project.org/ If you're attending PyCon US this year, you may be interested in Phillip's talk: https://us.pycon.org/2024/schedule/presentation/55/ During the podcast, Phillip also mentioned a blogpost about DuckDB, here: https://ibis-project.org/posts/why-duckdb/ There was also a dogfooding blogpost, which is this one: https://ibis-project.org/posts/ci-analysis/

    Enhancing Jupyter with Widgets with Trevor Manz - creator of anywidget.

    Apr 11, 2024 71:56


    In this (first!) episode of Sample Space we talk to Trevor Manz, the creator of anywidget. It's a (neat!) tool to help you build more interactive notebooks by giving you tools to apply just enough JavaScript to get bidirectional communication working in your favorite notebook environment. That means that Python can talk to widgets, but also that widgets can talk to Python. There's a lot to like about these widgets and we're doing a proper deep dive in this first episode. To learn more about anywidget, check out the docs. In particular you may want to glance at the gallery first, it has loads of nice examples. You can also find the project on GitHub and if you're eager to talk to folks involved with the project, consider joining the Discord here.

    Introducing Sample Space

    Apr 3, 2024 1:32


    We're starting a new podcast!
