In this episode, we dive into the world of MLOps and data orchestration with Stefano Bosisio, Senior Software Engineer at NVIDIA. Stefano shares his insights into popular frameworks such as Apache Beam, Kubeflow, and Dagster, highlighting their strengths and limitations. We also cover emerging trends in DataOps and the challenges teams face when choosing the orchestration tools best suited to their needs.
Stefano Bosisio is an accomplished MLOps Engineer with a solid background in biomedical engineering, focusing on cellular biology, genetics, and molecular simulations. Reinvent Yourself and Be Curious // MLOps Podcast #264 with Stefano Bosisio, MLOps Engineer at Synthesia. // Abstract This talk goes through Stefano's experience, serving as a source of inspiration for anyone who wants to start a career in the MLOps sector. Stefano will also introduce his MLOps course on the MLOps Community platform. // Bio Stefano Bosisio is an MLOps Engineer with a versatile background that ranges from biomedical engineering to computational chemistry and data science. Stefano earned an MSc in biomedical engineering from the Polytechnic of Milan, focusing on cellular biology, genetics, and molecular simulations. He then landed in Edinburgh, Scotland, to earn a PhD in chemistry from the University of Edinburgh, where he developed robust physical theories and simulation methods to understand and unlock the drug discovery problem. After completing his PhD, Stefano transitioned into data science, where he began his career as a data scientist. His interest in machine learning engineering grew, leading him to specialize in building ML platforms that drive business success. Stefano's expertise bridges the gap between complex scientific research and practical machine learning applications, making him a key figure in the MLOps field. Bonus points beyond data: Stefano, as a proper Italian, loves cooking and (mainly) baking, playing the piano, crocheting, and running half-marathons.
// MLOps Jobs board https://mlops.pallet.xyz/jobs // MLOps Swag/Merch https://mlops-community.myshopify.com/ // Related Links Website: https://medium.com/@stefanobosisio1 First MLOps Stack Course: https://learn.mlops.community/courses/languages/your-first-mlops-stack/ --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Stefano on LinkedIn: https://www.linkedin.com/in/stefano-bosisio1/ Timestamps: [00:00] Stefano's preferred coffee [00:12] Takeaways [01:06] Stefano's MLOps Course [01:47] From Academia to AI Industry [09:10] Data science and platforms [16:53] Persistent MLOps challenges [21:23] Internal evangelization for success [24:21] Adapt communication skills to diverse individual needs [29:43] Key components of ML pipelines are essential [33:47] Create a generalizable AI training pipeline with Kubeflow [35:44] Consider cost-effective algorithms and deployment methods [39:02] Agree with dream platform; LLMs require simple microservice [42:48] Auto-scaling: crucial, tricky, prone to issues [46:28] Auto-scaling issues with Apache Beam data pipelines [49:49] Guiding students through MLOps with practical experience [53:16] Bulletproof Problem Solving: Decision trees for problem analysis [55:03] Evaluate tools critically; appreciate educational opportunities [57:01] Wrap up
Summary
Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer, the process becomes more challenging. Sriram Panyam has been involved in several projects that required migration of large volumes of data in high-traffic environments. In this episode he shares some of the valuable lessons that he learned about how to make those projects successful.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts.
Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Your host is Tobias Macey and today I'm interviewing Sriram Panyam about his experiences conducting large scale data migrations and the useful strategies that he learned in the process.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing some of your experiences with data migration projects?
- As you have gone through successive migration projects, how has that influenced the ways that you think about architecting data systems?
- How would you categorize the different types and motivations of migrations?
- How does the motivation for a migration influence the ways that you plan for and execute that work?
- Can you talk us through one or two specific projects that you have taken part in?
Part 1: The Triggers
Section 1: Technical Limitations Triggering Data Migration
- Scaling bottlenecks: Performance issues with databases, storage, or network infrastructure
- Legacy compatibility: Difficulties integrating with modern tools and cloud platforms
- System upgrades: The need to migrate data during major software changes (e.g., SQL Server version upgrade)
Section 2: Types of Migrations for Infrastructure Focus
- Storage migration: Moving data between systems (HDD to SSD, SAN to NAS, etc.)
- Data center migration: Physical relocation or consolidation of data centers
- Virtualization migration: Moving from physical servers to virtual machines (or vice versa)
Section 3: Technical Decisions Driving Data Migrations
- End-of-life support: Forced migration when older software or hardware is sunsetted
- Security and compliance: Adopting new platforms with better security postures
- Cost optimization: Potential savings of cloud vs. on-premise data centers
Part 2: Challenges (and Anxieties)
Section 1: Technical Challenges
- Data transformation challenges: Schema changes, complex data mappings
- Network bandwidth and latency: Transferring large datasets efficiently
- Performance testing and load balancing: Ensuring new systems can handle the workload
- Live data consistency: Maintaining data integrity while updates occur in the source system
- Minimizing lag: Techniques to reduce delays in replicating changes to the new system
- Change data capture: Identifying and tracking changes to the source system during migration
Section 2: Operational Challenges
- Minimizing downtime: Strategies for service continuity during migration
- Change management and rollback plans: Dealing with unexpected issues
- Technical skills and resources: In-house expertise/data teams/external help
Section 3: Security & Compliance Challenges
- Data encryption and protection: Methods for both in-transit and at-rest data
- Meeting audit requirements: Documenting data lineage & the chain of custody
- Managing access controls: Adjusting identity and role-based access to the new systems
Part 3: Patterns
Section 1: Infrastructure Migration Strategies
- Lift and shift: Migrating as-is vs. modernization and re-architecting during the move
- Phased vs. big bang approaches: Tradeoffs in risk vs. disruption
- Tools and automation: Using specialized software to streamline the process
- Dual writes: Managing updates to both old and new systems for a time
- Change data capture (CDC) methods: Log-based vs. trigger-based approaches for tracking changes
- Data validation & reconciliation: Ensuring consistency between source and target
Section 2: Maintaining Performance and Reliability
- Disaster recovery planning: Failover mechanisms for the new environment
- Monitoring and alerting: Proactively identifying and addressing issues
- Capacity planning and forecasting growth to scale the new infrastructure
Section 3: Data Consistency and Replication
- Replication tools: strategies and specialized tooling
- Data synchronization techniques, e.g. pros and cons of different methods (incremental vs. full)
- Testing/verification strategies for validating data correctness in a live environment
- Implications of large scale systems/environments
- Comparison of interesting strategies: DBLog, Debezium, Databus, GoldenGate, etc.
- What are the most interesting, innovative, or unexpected approaches to data migrations that you have seen or participated in?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data migrations?
- When is a migration the wrong choice?
- What are the characteristics or features of data technologies and the overall ecosystem that can reduce the burden of data migration in the future?
Contact Info: LinkedIn (https://www.linkedin.com/in/srirampanyam/)
Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements: Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links DagKnows (https://dagknows.com) Google Cloud Dataflow (https://cloud.google.com/dataflow) Seinfeld Risk Management (https://www.youtube.com/watch) ACL == Access Control List (https://en.wikipedia.org/wiki/Access-control_list) LinkedIn Databus - Change Data Capture (https://github.com/linkedin/databus) Espresso Storage (https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system) HDFS (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) Kafka (https://kafka.apache.org/) Postgres Replication Slots (https://www.postgresql.org/docs/current/logical-replication.html) Queueing Theory (https://en.wikipedia.org/wiki/Queueing_theory) Apache Beam (https://beam.apache.org/) Debezium (https://debezium.io/) Airbyte (https://airbyte.com/) Fivetran (https://fivetran.com) Designing Data Intensive Applications (https://amzn.to/4aAztR1) by Martin Kleppmann (https://martin.kleppmann.com/) (affiliate link) Vector Databases (https://en.wikipedia.org/wiki/Vector_database) Pinecone (https://www.pinecone.io/) Weaviate (https://weaviate.io/) LAMP Stack (https://en.wikipedia.org/wiki/LAMP_(software_bundle)) Netflix DBLog (https://arxiv.org/abs/2010.12597) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
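The "data validation & reconciliation" pattern from the interview outline above is easy to demo in miniature. The sketch below compares per-row digests between a source and a target table; the table layout, the use of sqlite, and the helper names are illustrative assumptions rather than anything stated in the episode, and it assumes the primary key is the first column on both sides.

```python
import hashlib
import sqlite3

def row_digest(row):
    """Stable digest of a row's values, used to compare source vs. target."""
    return hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()

def reconcile(source_conn, target_conn, table, key_col):
    """Return primary keys that are missing or mismatched in the target.

    Assumes key_col is the first column of the table on both sides.
    """
    def snapshot(conn):
        rows = conn.execute(f"SELECT * FROM {table} ORDER BY {key_col}")
        return {row[0]: row_digest(row) for row in rows}

    src, tgt = snapshot(source_conn), snapshot(target_conn)
    missing = sorted(src.keys() - tgt.keys())
    mismatched = sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k])
    return missing, mismatched
```

In a real migration the same idea is usually run incrementally over key ranges (and often with aggregated checksums first), so that neither system has to be snapshotted in full.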
On Call with Insignia Ventures with Yinglan Tan and Paulo Joquino
Generalist Software Engineer. Tech Lead. Cloud Advocate. Podcast Host. Thought Leader. Henry Suryawirawan, VP of Engineering at Indonesia's largest and leading money movement platform (with more than 10 million users) Flip, goes on call with us to share his journey from startup to big tech and now scale up, his leadership approach as VP of Engineering at Flip, how his growth at Flip has impacted his own career and views, as well as his deep interest in Cloud and work as host of the Tech Lead Journal podcast. Join Henry at Flip in shaping the Indonesian financial culture and building the fairest tech company in the world! See open roles across departments. Timestamps (01:43) Paulo introduces Henry; (02:47) Venturing into Flip as a Scale Up; (07:37) Bringing Flip's Culture of Fairness to Fully Remote Engineering Teams; (14:01) How Flip Fosters Engineering Careers and Impact; (20:21) A Cloud Advocate's Perspective on the Impact of Cloud on Digitalization and Cybersecurity; (24:54) A Dev Podcast Host's Learnings; About our guest Henry Suryawirawan is a generalist software engineer, tech lead, cloud advocate, thought leader, and avid personal growth learner. He is the host of Tech Lead Journal, a podcast about technical leadership and excellence. He is also the creator of Apache Beam Katas, a learning platform for people to learn about Apache Beam. Henry's career spans multiple industries—insurance, banking, startup, consulting, government, cloud—which includes companies like Great Eastern, Barclays, JP Morgan, Einsights, ThoughtWorks, Singapore GovTech, Google Cloud, and currently Flip. Henry also has deep interests in cloud technology, software architecture, technical practices, building tech products, and forming high performing engineering teams. He is highly experienced in Agile, DevOps, and CI/CD. Henry has delivered talks on multiple occasions, ranging from Google Cloud Next, Google Cloud Summits, webinars, and community meetups.
Henry holds a Master of IT in Business (Financial Services track) from Singapore Management University. He also holds 5 GCP certifications, plus the CKAD and CKA certifications. In his spare time, Henry loves reading books, listening to podcasts, pursuing personal growth, running, and playing with his kids. Fun fact: Henry has finished the Standard Chartered Singapore Marathon 7 times. Music: Energetic and Upbeat Rock Background Music For Videos and Workouts The content of this podcast is for informational purposes only, should not be taken as legal, tax, or business advice or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any Insignia Ventures fund.
Eric Anderson (@ericmander) reunites with old colleagues Kenn Knowles (@KennKnowles) and Pablo Estrada (@polecitoem) for a conversation on Apache Beam, the open-source programming model for data processing. The trio once worked together at Google, and Beam was a turning point in the history of open-source there. Today, both Kenn and Pablo are members of the Beam PMC, and join the show with the inside scoop on Beam's past, present and future. In this episode we discuss: Transitioning Beam to the Apache Way How “inner source” works at Google Thoughts on the relationship between batch processing and streaming Some ways that community “power users” have contributed to Beam Information on Beam Summit 2022, the first onsite summit since COVID began. The first few people to register can use code BEAM_POD_INV for a discount on tickets! Links: Apache Beam Apache Spark Apache Flink Apache Nemo Apache Samza Apache Crunch MapReduce paper MillWheel paper FlumeJava paper Dataflow paper Beam Summit 2022 Website Other episodes: TensorFlow with Rajat Monga
In this episode we bring in specialist Lucas Magalhães to talk about Big Data and Analytics projects on Google GCP. We discuss projects that can be implemented easily, as well as the best approaches and technologies for handling massive-scale data processing. On YouTube we have a Data Engineering channel covering the most important topics in the field, with live streams every Wednesday: https://www.youtube.com/channel/UCnErAicaumKqIo4sanLo7vQ Want to stay on top of this field with weekly posts and updates? Then follow on LinkedIn so you don't miss any news: https://www.linkedin.com/in/luanmoreno/ Available on Spotify and Apple Podcasts: https://open.spotify.com/show/5n9mOmAcjra9KbhKYpOMqY https://podcasts.apple.com/br/podcast/engenharia-de-dados-cast/ Luan Moreno = https://www.linkedin.com/in/luanmoreno/
On the podcast this week, your hosts Stephanie Wong and Mark Mirchandani talk about the data processing tool Apache Beam with guests Pablo Estrada and Kenneth Knowles. Kenn starts us off with an overview of how Apache Beam began and how Cloud Dataflow was involved. The unique batch and stream method and emphasis on correctness garnered support from developers early on and continues to attract users. Pablo helps us understand why Beam is a better option for certain projects looking to process large amounts of data. Our guests describe how Beam may be a better fit than microservices that could become obsolete as company needs change. Next, we step back and take a look at why batch and stream is the gold standard of data processing because of its balance between low latency and ease of “being done” with data collection. Beam's focus on the correctness of data and correctness in processing that data is a core component. With good data, processing becomes easier, more reliable, and cheaper. Kenn gives examples of how things can go wrong with bad data processing. Beam strives for the perfect combination of low latency, correct data, and affordability. Users can choose where to run Beam pipelines, from other Apache software offerings to Dataflow, which means excellent flexibility. Our guests talk about the pros and cons of some of these options and we hear examples of how companies are using Beam along with supporting software to solve data processing challenges. To get started with Beam, check out Beam College or attend Beam Summit 2022. Kenneth Knowles Kenn Knowles is chair of the Apache Beam Project Management Committee. Kenn has been working on Google Cloud Dataflow—Google's Beam backend—since 2014. Kenn holds a PhD in programming languages from the University of California, Santa Cruz. Pablo Estrada Pablo is a Software Engineer at Google, and a management committee member for Apache Beam. 
Pablo is big on working in open source, and has worked all across the Apache Beam stack. Cool things of the week Under the sea: Building the world's fiber optic internet video Discovering Data Centers videos Google Data Cloud Summit site It's official—Google Distributed Cloud Edge is generally available blog GCP Podcast Episode 228: Fastly with Tyler McMullen podcast Save big by temporarily suspending unneeded Compute Engine VMs—now GA blog Interview Apache Beam site Apache Beam Documentation site Dataflow site Apache Flink site Apache Spark site Apache Samza site Apache Nemo site Spanner site BigQuery site Beam College site Beam College on Github site Beam Developer Mailing List email Beam User Mailing List email Beam Summit site What's something cool you're working on? Mark is working on a new Apache Beam video series Getting Started With Apache Beam Hosts Stephanie Wong and Mark Mirchandani
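A core idea behind the unified batch/stream model that Kenn and Pablo describe is that every element carries an event timestamp and is grouped into windows based on that timestamp, not on when it arrived. As a rough plain-Python illustration of fixed windowing (this is not Beam's actual API; in a Beam pipeline you would apply WindowInto with FixedWindows):

```python
from collections import defaultdict

def fixed_windows(events, window_size):
    """Group (timestamp, value) pairs into fixed event-time windows.

    Mirrors the idea behind Beam's FixedWindows: each element falls into
    exactly one window [start, start + window_size), chosen by its event
    timestamp rather than by its arrival order.
    """
    windows = defaultdict(list)
    for ts, value in events:
        start = ts - (ts % window_size)
        windows[(start, start + window_size)].append(value)
    return dict(windows)
```

Note that late or out-of-order elements still land in the window their timestamp dictates, which is the property that lets one pipeline definition serve both batch and streaming inputs.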
Stephanie Wong and Debi Cabrera host a special episode highlighting the amazing accomplishments of our guest Vidya Nagarajan Raman as we celebrate Women's History Month! With her more than 20 years of experience fostering growth and monetization in enterprise and education platforms, investing and working in the holistic lifestyle space, and earning her MBA while raising her two children, Vidya has certainly done a lot! Vidya tells us about her latest blog post stressing the importance of being an event-driven organization. In this business structure, reactions to events are planned in advance and developers consider how services are integrated for maximum efficiency. With synchronous extensions, projects retain flexibility in existing applications as they work with Cloud Functions to extend to new areas. Vidya gives our listeners examples of how this works. The journey from engineer to Head of Product Management was an interesting one for Vidya, and she describes how she got started in computer engineering. Her passion for connecting with users later pushed her to product management. She tells us about her contributions to Chromebooks for Education as well as other milestones during her time with Google. Vidya talks about the support system she credits with helping her along the way and gives our listeners advice for finding mentors in their fields. She touches on the challenges she faced, describes what it was like for a woman in the industry when she first started, and offers encouragement to women getting started now. Balancing work, continuing her education, and raising children was tough, but Vidya says that, along with her incredible professional and personal support systems, defining priorities is vital. Vidya offers our listeners the insights she's gained as she's watched Google and workplace teams change and adapt over the years. 
Building an inclusive team, encouraging diverse perspectives, and defining a framework for settling disagreements are some of the pieces of advice she shares. Don't be afraid to fail and be a risk-taker, Vidya says, because that promotes growth and learning. If you learn something new every day and have fun doing it, then you will be successful. In her spare time, Vidya leads a charitable foundation that partners with organizations in countries like India and Peru to further education, build orphanages and libraries, and provide medical care for women. She is an angel investor and runs workshops on creating a holistic lifestyle to help others lead well-rounded, fulfilling lives. Vidya Nagarajan Raman Vidya Nagarajan Raman is the Head of Product Management for Serverless at Google Cloud. She is also an angel investor, advisor, and co-founder of a holistic lifestyle platform that empowers people to grow and transform their lives. Cool things of the week Ready to solve for the future? Data Cloud Summit ‘22 is coming April 6 blog Visualizing Google Cloud: 101 Illustrated References for Cloud Engineers and Architects site Interview Evolving to a programmable cloud blog Cloud Functions site Cloud Run site Eventarc docs Workflows site Chromebook site What's something cool you're working on? Debi is working on Apache Beam series with Mark Mirchandani. Stephanie is working on scripts for a series about getting into a career in cloud. Hosts Stephanie Wong and Debi Cabrera
About Alexa Abbas
Alexandra Abbas is a Google Cloud Certified Data Engineer & Architect and an Apache Airflow contributor. She currently works as a Machine Learning Engineer at Wise. She has experience with large-scale data science and engineering projects. She spends her time building data pipelines using Apache Airflow and Apache Beam and creating production-ready machine learning pipelines with TensorFlow. Alexandra was a speaker at Serverless Days London 2019 and presented at the TensorFlow London meetup.
Personal links
Twitter: https://twitter.com/alexandraabbas
LinkedIn: https://www.linkedin.com/in/alexandraabbas
GitHub: https://github.com/alexandraabbas
datastack.tv's links
Web: https://datastack.tv
Twitter: https://twitter.com/datastacktv
YouTube: https://www.youtube.com/c/datastacktv
LinkedIn: https://www.linkedin.com/company/datastacktv
GitHub: https://github.com/datastacktv
Link to the Data Engineer Roadmap: https://github.com/datastacktv/data-engineer-roadmap
This episode is sponsored by CBT Nuggets: cbtnuggets.com/serverless and Stackery: https://www.stackery.io/
Watch this video on YouTube: https://youtu.be/SLJZPwfRLb8
Transcript
Jeremy: Hi, everyone. I'm Jeremy Daly, and this is Serverless Chats. Today I'm joined by Alexa Abbas. Hey, Alexa, thanks for joining me.
Alexa: Hey, everyone. Thanks for having me.
Jeremy: So you are a machine learning engineer at Wise and also the founder of datastack.tv. So I'd love it if you could tell the listeners a little bit about your background and what you do at Wise and what datastack.tv is all about.
Alexa: Yeah. So as you said, I'm a machine learning engineer at Wise. So Wise is an international money transfer service. We are aiming for very transparent fees and very low fees compared to banks.
So at Wise, I'm basically designing, maintaining, and developing the machine learning platform, which serves data scientists and analysts, so they can train their models and deploy their models easily.
Datastack.tv is, basically, a video platform for data engineers. So we create bite-sized educational videos for data engineers. We mostly cover open source topics, because we noticed that some of the open source tools in the data engineering world are quite underserved in terms of educational content. So we create videos about those.
Jeremy: Awesome. And then, what about your background?
Alexa: So I actually worked as a data engineer and machine learning engineer, so I've always been a data engineer or machine learning engineer in terms of roles. I also worked, for a small amount of time, as a data scientist. In terms of education, I did a big data engineering Master's, but my Bachelor's is in economics, so quite a mix.
Jeremy: Well, it's always good to have a ton of experience and that diverse perspective. Well, listen, I'm super excited to have you here, because machine learning is one of those things where it probably is more of a buzzword, I think, to a lot of people, where every startup puts it in their pitch deck, like, "Oh, we're doing machine learning and artificial intelligence ..." stuff like that. But I think it's important to understand, one, what exactly it is, because I think there's a huge confusion there in terms of what we think of as machine learning, and maybe we think it's more advanced than it is sometimes, as I think there are lower versions of machine learning that can be very helpful.
And obviously, this being a serverless podcast, I've heard you speak a number of times about the work that you've done with machine learning and some experiments you've done with serverless there.
So I'd love to just pick your brain about that and just see if we can educate the users here on what exactly machine learning is, how people are using it, and where it fits in with serverless, and some of the use cases and things like that. So first of all, I think one of the important things to start with anyway is this idea of MLOps. So can you explain what MLOps is?
Alexa: Yeah, sure. So really short, MLOps is DevOps for machine learning. In traditional software engineering projects, you have a streamlined process and you can release really often, really quickly, because you already have all these best practices that traditional software engineering projects implement. Machine learning is still at quite an early stage, and MLOps is at quite an early stage. But what we try to do in MLOps is streamline machine learning projects the way traditional software engineering projects are streamlined, so data scientists can train models really easily, and they can release models really frequently and really easily into production. So MLOps is all about streamlining the whole data science workflow, basically.
And I guess it's good to understand what the data science workflow is, so I'll talk a bit about that as well. Before actually starting any machine learning project, the first phase is an experimentation phase. It's a really iterative process where data scientists are looking at the data, trying to find features, and also training many different models; they are doing architecture search, trying different architectures, trying different hyperparameter settings with those models. So it's a really iterative process of trying many models, many features.
And then by the end, they probably find a model that they like and that hits the benchmark that they were looking for, and then they are ready to release that model into production. And this usually looks like ...
so sometimes they use shadow models, in the beginning, to check if the results are as expected in production as well, and then they actually release into production. So basically MLOps tries to create the infrastructure and the processes that streamline this whole process, the whole life cycle.
Jeremy: Right. So the question I have is, if you're an ML engineer or you're working on these models and you're going through these iterations and stuff, and now you're ready to release to production, why do you need something like an MLOps pipeline? Why can't you just move that into production? Where's the barrier?
Alexa: Well, I guess ... I mean, to be honest, the thing is there shouldn't be a barrier. That's the whole goal of MLOps. They shouldn't feel that they need to do any manual model artifact copying or anything like that. They just, I don't know, press a button and they can release to production. So that's what MLOps is about really, and we can version models, we can version the data, things like that. And we can create reproducible experiments. So right now, many bits in this whole lifecycle are really manual, and that could be automated. For example, releasing to production is sometimes a manual thing. You just copy a model artifact to a production bucket or whatever. So sometimes we would like to automate all these things.
Jeremy: Which makes a lot of sense. So then, in terms of actually implementing this stuff, because we hear all the time about CI/CD. If we're talking about DevOps, we know that there are all these tools being built and services being launched that allow us to quickly move code through some process and get it into production. So are there similar tools for deploying models and things like that?
Alexa: Well, I think this space is quite crowded. It's getting more and more crowded. I think there are many ...
So there are the cloud providers, who are trying to create tools that help these processes, and there are also many third-party platforms that are trying to create the ML platform that everybody uses. So I think there is no go-to thing that everybody uses; there are many tools that we can use.
Some examples: TensorFlow is a really popular machine learning library, and they created a package on top of TensorFlow, which is called TFX, TensorFlow Extended, which is exactly for streamlining this process and serving models easily. So I would say TFX is a really good example. There is Kubeflow, which is a machine learning toolkit for Kubernetes. I think there are many custom implementations in-house at many companies; they create their own machine learning platforms, their own model serving APIs, things like that. And from the cloud providers, on AWS we have SageMaker, which tries to cover many parts of the data science lifecycle. And on Google Cloud, we have AI Platform, which is really similar to SageMaker.
Jeremy: Right. And what are you doing at Wise? Are you using one of those tools? Are you building something custom?
Alexa: Yeah, it's a mix actually. We have some custom bits. We have a custom serving API for serving models. But for model training, we are using many things. We are using SageMaker Notebooks. And we are also experimenting with SageMaker endpoints, which are actually serverless model serving endpoints. And we are also using EMR for model training and data preparation, so some Spark-based things, a bit more traditional model training. So it's quite a mix.
Jeremy: Right. Right. So I am not well-versed in machine learning. I know just enough to be dangerous. And so I think that what would be really interesting, at least for me, and hopefully for listeners as well, is to just talk about some of these standard tools.
So you mentioned things like TensorFlow and then Kubeflow, which I guess is that end-to-end piece of it. Just how do you start? How do you go from, I guess, building and training a model to then productizing it and getting that out? What does that whole workflow look like?

Alexa: So, actually, in the data science workflow I mentioned, the first bit is that experimentation, which is really iterative, really free; you just try to find a good model. And then, when you have found a good model architecture and you know that you are going to receive new data, let's say every day or every week, then you need to build a retraining pipeline. And that is, I think, what productionization of a model really means: you build a retraining pipeline which can automatically pick up new data, prepare that new data, retrain the model on it, and release that model into production automatically.

Jeremy: Right. Yeah. And so by being able to build and train a model and then having that process where you're getting that feedback back in, is that something where you're just taking that data and assuming it's right and fits the model, or is there an ongoing testing process? Is there supervised learning? I know that's a buzzword; I'm not even sure what it means. What types of things go into that retraining of the models? Is it automatic, or does somebody need to be monitoring it on a regular basis? Babysitting is probably the wrong word.

Alexa: So monitoring is definitely necessary. When you have retrained your model in this retraining pipeline, you shouldn't release it into production automatically just because you've trained on new data. I mentioned this shadow model thing a bit.
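The retraining pipeline described here, pick up new data, prepare it, retrain, then hand the candidate off for release, can be sketched as a chain of plain functions. Everything below is an illustrative stand-in (a toy data source, a toy threshold "model", a fake release step), not any real platform's API:

```python
# Minimal sketch of an automated retraining pipeline.
# Each stage stands in for a real service (object store,
# training job, model registry); names are illustrative only.

def pick_up_new_data():
    # In production this would read from a bucket or warehouse.
    return [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

def prepare(rows):
    # Toy "preparation": split features and labels.
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    return xs, ys

def retrain(xs, ys):
    # Toy "model": a threshold halfway between the class means.
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return {"threshold": threshold}

def release_candidate(model):
    # In production this would push the artifact to a registry
    # or a shadow deployment, not return it directly.
    return {"status": "shadow", "model": model}

def run_pipeline():
    xs, ys = prepare(pick_up_new_data())
    return release_candidate(retrain(xs, ys))

result = run_pipeline()
print(result["status"])  # shadow
```

The key property is that every stage is automated, so a scheduler (Airflow, Cloud Composer, and the like come up later in the conversation) can run the whole chain unattended.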
Usually, after you retrain the model in this retraining pipeline, you release that model into shadow mode: you serve it in parallel to your actual production model, and then you check the results from your new model against your production model. That can be a manual thing, or you can automate it as well, actually. If it's comparable with your production model, or if it's even better, then you replace it.

And also, in terms of the data quality at the beginning, you should definitely monitor that. I think that's quite custom; it really depends on what kind of data you work with. So it's really important to test your data. This space is also quite crowded; there are many tools you can use to monitor the distribution of your data and check that the new data actually corresponds to your existing data set. So there are many bits in this whole retraining pipeline that you can monitor, and you should monitor them.

Jeremy: Right. Yeah. And so, I think of some machine learning use cases like sentiment analysis, for example: looking at tweets or customer service conversations and trying to rate those things. So when you say monitoring, or running them against a shadow model, how do you gauge what's better? If you've got a shadow model, what's the success metric there? Say X number were classified as positive versus negative sentiment: does that require human review, or some sampling, for you to figure out the quality and success of those models?

Alexa: Yeah. So actually, I think that really depends on the use case. For example, when you are trying to catch fraudsters, your false positive rate and true positive rate are really important. If your true positive rate is higher, that means you are catching more fraudsters.
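A shadow deployment like the one described reduces to a comparison gate: score both models on the same traffic and only promote the shadow model if its metric is at least as good. A minimal sketch; the accuracy metric and margin are illustrative choices, not any team's actual promotion logic:

```python
def accuracy(preds, labels):
    # Fraction of predictions matching the ground-truth labels.
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def should_promote(prod_preds, shadow_preds, labels, min_gain=0.0):
    # Promote the shadow model only if it does at least as well
    # as production on the same traffic (plus an optional margin).
    prod_acc = accuracy(prod_preds, labels)
    shadow_acc = accuracy(shadow_preds, labels)
    return shadow_acc >= prod_acc + min_gain

labels = [1, 0, 1, 1, 0]
prod = [1, 0, 0, 1, 0]     # 4/5 correct
shadow = [1, 0, 1, 1, 0]   # 5/5 correct
print(should_promote(prod, shadow, labels))  # True
```

In practice the gate would aggregate over a window of live traffic and usually demand a margin (`min_gain` above) before replacing a known-good production model.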
But let's say with your new model the false positive rate is also higher, which means you are catching more people who are actually not fraudsters, and you have more work, because I guess it's a manual process to actually check those people. So I think it really depends on the use case.

Jeremy: Right. Right. And you also said the market's a little bit flooded. I know of SageMaker, and then, of course, there are all these tools like Rekognition and a bunch of things at AWS, and Google has the Vision API and some of these things, and Watson's natural language processing over at IBM. So there are all these different tools available via an API, which is super simple and great for people like me who don't want to get into building TensorFlow models. So is there an advantage to building your own models beyond those things, or are we getting to a point where some of these models are just good enough off the shelf? I know SageMaker has a whole library of models that are already built for you, and I know there are probably some custom things. But do we still really need to be building our own models around that stuff?

Alexa: So to be honest, I think most data scientists are using off-the-shelf models, maybe not the serverless API type of models that Google has, but just off-the-shelf TensorFlow models, or SageMaker's built-in containers for some really popular model architectures like XGBoost. I think most people don't tweak these, as far as I know.
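The trade-off being described, true positive rate versus false positive rate, comes straight from the confusion matrix. A small sketch of the arithmetic on a toy fraud example:

```python
def rates(preds, labels):
    # True positive rate: share of actual fraudsters we caught.
    # False positive rate: share of legitimate users we flagged.
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    return tp / (tp + fn), fp / (fp + tn)

labels = [1, 1, 1, 0, 0, 0, 0, 0]   # 3 fraudsters, 5 legitimate users
preds  = [1, 1, 0, 1, 0, 0, 0, 0]   # a more aggressive new model
tpr, fpr = rates(preds, labels)
print(tpr, fpr)  # TPR = 2/3, FPR = 1/5
```

A model can raise both numbers at once, which is exactly the situation in the transcript: more fraudsters caught, but also more innocent people flagged for manual review.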
I think they just use them out of the box, and they really try to tweak the data instead, and feed these off-the-shelf models higher and higher quality data.

Jeremy: So shape the data to fit the model, as opposed to the model to fit the data.

Alexa: Yeah, exactly. So you don't actually have to know how those models work exactly. As long as you know what the input should be and what output you expect, then I think you're good to go.

Jeremy: Yeah, yeah. Well, I still think there's probably a lot of value in tuning the models, though, against your particular data sets.

Alexa: Yeah, right. But there are also services for hyperparameter tuning. There are even services for neural architecture search, where they try a lot of different architectures for your data specifically and then tell you the best model architecture that you should use; and the same for hyperparameter search. So these can be automated as well.

Jeremy: Yeah. Very cool. So if you are hosting your own version of this... maybe we'll go back to the MLOps piece of this. I would assume that a data scientist doesn't want to be responsible for maintaining the servers or the virtual machines or whatever it's running on. You want to have this workflow where you can get your models trained, get them into production, and then run them through this loop you talked about, and be able to tweak and retrain them as things go through. So on the other side of that wall, if we want to put it that way, you have your ops people running this stuff. Is there something specific that ops people need to know? How much do they need to know about ML? The data scientists, hopefully, know more. But in terms of running it, what do they need to know, or is it just a matter of keeping a server up and running?

Alexa: Well, I think ...
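Hyperparameter tuning services automate what is, at its core, a search loop: try candidate settings, score each one, keep the best. A stripped-down grid search, with a fake scoring function standing in for a real training-and-validation run:

```python
from itertools import product

def train_and_score(lr, depth):
    # Stand-in for a real training run that returns a validation
    # score; here it just peaks at lr=0.1, depth=3 for illustration.
    return 1.0 - abs(lr - 0.1) - 0.05 * abs(depth - 3)

def grid_search(lrs, depths):
    # Exhaustively try every combination and keep the best scorer.
    return max(product(lrs, depths),
               key=lambda cfg: train_and_score(*cfg))

best_lr, best_depth = grid_search([0.01, 0.1, 0.5], [2, 3, 5])
print(best_lr, best_depth)  # 0.1 3
```

Managed tuners replace the exhaustive loop with smarter strategies (random or Bayesian search) and run the candidates in parallel, but the contract is the same: configurations in, best configuration out.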
So I think machine learning pipelines are not yet as standardized as a traditional software engineering pipeline. So I would say you have to have some knowledge of machine learning, or at least some understanding of how this lifecycle works. You don't need to know about research and things like that, but you need to know how the whole lifecycle works in order to work as an ops person who can automate it. But I think software engineering and DevOps skills are the base, and you can build this knowledge on top of that. So I think it's actually quite easy to pick up.

Jeremy: Yeah. Okay. And what about... you mentioned this idea that a lot of data scientists aren't actually writing the models; they're just using preconfigured models. So I guess that begs the question: how much can a regular person do? Let's say I'm just a regular developer, and I say, "I want to start building machine learning tools." Is it as easy as pulling a model off the shelf and learning a little bit more about it? How much can the average person do with some of these tools out of the box?

Alexa: So I think most of the time it's that easy, because usually the use cases someone tries to tackle are not super edge cases. For those use cases, there are already models which perform really well. Especially if you are talking about, I don't know, supervised learning on tabular data, I think you can definitely find models that perform really well off the shelf on those types of datasets.

Jeremy: Right. And if you were advising somebody who wanted to get started... because I think where it might come down to is going to be things like pricing. If you're using the Vision API and you're maybe limited on your quota, and then you can ...
if you're paying however many cents per, I guess, lookup or inference, then that can get really expensive, as opposed to potentially running your own model on something else. But how would you suggest somebody get started? Would you point them at the APIs, or would you want to get them up and running on TensorFlow or something like that?

Alexa: So I think, actually, for a developer, just using an API would be super easy, and getting started with those APIs to understand the concepts is very useful. But I would definitely recommend getting started with TensorFlow itself, or just Keras, or scikit-learn, which is a more basic package for more basic machine learning. Those are really good starting points. And there are so many tutorials to get started with; if you have an idea of what you would like to build, you will definitely find tutorials similar to your own use case, and you can use those to build your custom pipeline or model. So for developers, I would definitely recommend jumping into TensorFlow or scikit-learn or XGBoost or things like that.

Jeremy: Right, right. And how many of these models exist? Are we talking 20 different models, or are we talking 20,000 models?

Alexa: Well, I think ... Wow. Good question. Maybe not 20,000, but definitely many thousands, I think. But there are popular models that most people use; there are maybe 50 or 100 models that are the most popular, and most companies use them, and you are probably fine just using those for most use cases.

Jeremy: Right. Now, speaking of use cases, again, I try to think of use cases for machine learning, whether it's classifying movies into genres, or sentiment analysis, like I said, or maybe trying to classify news stories, things like that. Fraud detection, you mentioned.
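The libraries recommended here share a small surface area: construct an estimator, fit it to data, predict on new inputs. As a toy illustration of that fit/predict shape, here is a pure-Python one-nearest-neighbor classifier; this is not scikit-learn itself, just the workflow pattern those libraries expose:

```python
class OneNearestNeighbor:
    """Toy classifier mimicking the scikit-learn fit/predict API."""

    def fit(self, X, y):
        # "Training" is just memorizing the data.
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        # Label each point like its closest training example.
        def closest_label(x):
            dists = [abs(x - xt) for xt in self.X_]
            return self.y_[dists.index(min(dists))]
        return [closest_label(x) for x in X]

clf = OneNearestNeighbor().fit([0.0, 0.2, 0.8, 1.0], [0, 0, 1, 1])
print(clf.predict([0.1, 0.9]))  # [0, 1]
```

Because real libraries keep to this same two-method contract, a developer can swap a toy model for an off-the-shelf one without restructuring the surrounding pipeline.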
Those are all great use cases, but I know you've worked on a bunch of projects. So what are some of the projects you've done, and what were the use cases being solved there? Because I find these really interesting.

Alexa: Yeah. So I think a nice project I worked on was with Lush, which is a cosmetics company. They manufacture soaps and bath bombs, and they have this nice mission to eliminate packaging from their shops. When I worked at Datatonic, we worked on a small project with them. They asked us to create an image recognition model, train one, and then create a retraining pipeline they could use afterwards. They provided us with many hundred thousand images of their products, photos from different angles with different lighting, so a really high-quality image data set of all their products.

And then we used a MobileNet model, because they wanted the model built into their mobile application. When users actually use this model, they download it with the mobile application. They created a service called Lush [inaudible], which you can use from within their app: people can just scan the products and see the ingredients, how-to-use guides, and things like that. So this is how they are trying to eliminate all kinds of packaging from their shops; they don't need to put papers there or packaging with ingredients and things like that.

In terms of what we did on the technical side, as I mentioned, we used a MobileNet model, because we needed to quantize the model in order to put it on a mobile device. We used TF Lite to do this. TF Lite is specifically for models that you want to run on an edge device, like a mobile phone. So that was already a constraint, and this is how we picked the model.
Back then, there were only a few model architectures supported by TF Lite, maybe only two. So we picked MobileNet, because it had a smaller size.

And then, in terms of the retraining, we automated the whole workflow with Cloud Composer on Google Cloud, which is a managed version of Apache Airflow, the open source scheduling package. The training happened on AI Platform, which is Google Cloud's equivalent of SageMaker.

Jeremy: Yeah.

Alexa: Yeah. And what else? We also had an image pre-processing step just before the training, which happened on Dataflow, an auto-scaling processing service on Google Cloud. After we trained the model, we saved the model artifact in a bucket. I think we also monitored the performance of the model, and if it was good enough, we shipped the model to the developers, who manually updated the model file that went into the application people download. So we didn't really see if they used any shadow model thing or anything like that.

Jeremy: Right. Right. And I think that is such a cool use case, because, if I'm hearing you right, there's just a bar of soap or something like that, with no packaging, no nothing, and you just hold your mobile phone camera up to it, it looks at it, determines which particular product it is, gives you all that. So no QR codes, no bar codes, none of that stuff. How did they ring them up, though? Do you know how that process worked? Did the employees just have to know what the products were, or did the employees use the app as well to figure out what they were billing people for?

Alexa: Good question. So I think they wanted the employees to use the app as well.

Jeremy: Nice.

Alexa: Yeah. But when the app was wrong, then I don't know what happened.

Jeremy: Just give them a discount or something. That's awesome. And that thing you mentioned there... was it Tensor Lite, was it called?

Alexa: TF Lite.
Yeah.

Jeremy: TF Lite. Yes. TensorFlow Lite or TF Lite. But, basically, that idea of being able to really package a model and get it to be super small, like you said. You said edge devices, and I'm thinking serverless compute at the edge, I'm thinking Lambda functions, I'm thinking other ways that, if you could get your models small enough in package, you could run them. That'd be a pretty cool way to do inference, right? Because even if you're using edge devices, if you're on an edge network or something like that, if you could do that at the edge, that'd be a pretty fast response time.

Alexa: Yeah, definitely.

Jeremy: Awesome. All right. So what about some other stuff you've done? You've mentioned some things about fraud detection.

Alexa: Yeah. So fraud detection is a use case at Wise. As I mentioned, international money transfer is one of Wise's services, and obviously, if you are doing anything with money, then a fraud use case is something you will have for sure. I don't actually develop models at Wise, so I don't know what models they use. I know they use H2O, which is a Spark-based library that you can use for model training. I think it's quite an advanced library, but I haven't used it much myself, so I can't talk about it too much.

But in terms of the workflow, it's quite similar. We also have Airflow to schedule the retraining of the models. They use EMR for data preparation, quite similar to Dataflow in a sense: a Spark-based auto-scaling cluster that processes the data. They train the models on EMR as well, but using this H2O library. And in the end, when they are happy with the model, we have a tool they can use for releasing shadow models into production. And if they are satisfied with the performance of the model, they can actually release it into production.
And at Wise, we have a custom microservice, a custom API, for serving models.

Jeremy: Right. Right. And that sounds like you need a really good MLOps flow to make all that stuff work, because you just have a lot of moving parts there, right?

Alexa: Yeah, definitely. Also, I think we have many bits that could be improved, many bits that are still a bit manual and not streamlined enough. But I think most companies struggle with the same thing. We just don't yet have those best practices that we can implement, so many people try many different things. I think it's still a work in progress.

Jeremy: Right. Right. And I'm curious if your economics background helps at all with the fraud and the money laundering stuff?

Alexa: No.

Jeremy: No. All right. So you also worked on a data engineering project for Vodafone, right?

Alexa: Yeah. So that was purely a data engineering project; we didn't do any machine learning. Vodafone has their own Google Analytics library that they use in all their websites and mobile apps, and that sends clickstream data to a server in a Google Cloud Platform project, and we consumed that data in a streaming manner from Dataflow. So, basically, the project was really about processing this data by writing an Apache Beam pipeline, which was always on and always expected messages to come in. Then we dumped all the data into BigQuery tables; BigQuery is the data warehouse in Google Cloud. And these BigQuery tables powered some of the dashboards they use to monitor uptime and different metrics for their websites and mobile apps.

Jeremy: Right. But collecting all of that data is a good source for doing machine learning on top of that, right?

Alexa: Yeah, exactly. I think they already had some use cases in mind.
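The shape of an always-on pipeline like that, read a stream, parse each message, drop bad records, write rows to a table, can be mimicked in plain Python. The real pipeline would use `apache_beam` with a Pub/Sub source and a BigQuery sink; everything below, including the event fields, is a simplified stand-in:

```python
import json

def read_stream():
    # Stand-in for the Pub/Sub subscription the pipeline listens on.
    yield '{"page": "/home", "ms": 120}'
    yield '{"page": "/pricing", "ms": 340}'
    yield 'not-json'  # malformed events happen in real clickstreams

def parse(raw):
    # Equivalent of a ParDo that parses and drops bad records.
    try:
        event = json.loads(raw)
        return {"page": event["page"], "latency_ms": event["ms"]}
    except (json.JSONDecodeError, KeyError):
        return None

def run(table):
    # The "sink": in the real pipeline this is WriteToBigQuery.
    for raw in read_stream():
        row = parse(raw)
        if row is not None:
            table.append(row)

rows = []
run(rows)
print(len(rows))  # 2 well-formed events land in the "table"
```

What Beam adds on top of this skeleton is the part that's hard to hand-roll: autoscaling workers, windowing, and exactly-once delivery into the sink.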
I'm not sure if they actually did those or not, but it's a really good base for machine learning, where we collected the data there in BigQuery, because that is an analytical data warehouse, so analysts can already start exploring the data as a first step of the machine learning process.

Jeremy: Right. I would think anomaly detection and things like that, right?

Alexa: Yeah, exactly.

Jeremy: Right. All right. Well, let's talk about serverless a little bit more, because I saw a talk where you ran some experiments with serverless. So I'm curious: where are the limitations that you see? We now have EFS integration, we've got 10 gigs of memory for Lambda functions, you've even got Cloud Run, which I don't know how much you could do with. Where are the limitations for running machine learning in a serverless way?

Alexa: So I think, actually, across this data science lifecycle, the cloud providers offer a lot of serverless options for many bits. For data preparation, there is Dataflow, which is a kind of serverless data processing service. For model training, there are SageMaker and AI Platform, which are kind of serverless, because you don't actually need to provision the clusters that you train your models on. And for model serving, in SageMaker there are the serverless model endpoints that you can deploy. So there are many options for serverless in the machine learning lifecycle.

In my experience, many times it's a cost thing. For example, at Wise we have this custom model-serving API where we serve all our models. If we used SageMaker endpoints, I think a single SageMaker endpoint is about $50 per month; that's the minimum price, and that's for a single model and a single endpoint.
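On the serving side, a serverless inference endpoint reduces to a stateless handler that loads a model once per container and scores each request. A minimal sketch in the shape of an AWS Lambda handler; the model, the event format, and the field names here are all hypothetical stand-ins, not a real deployment:

```python
import json

# Loaded once per container cold start, reused across invocations.
# Stand-in for deserializing a real artifact from S3 or the package.
MODEL = {"weights": [0.8, -0.3], "bias": 0.1}

def predict(features):
    # Tiny linear scorer standing in for a real model's inference.
    score = MODEL["bias"] + sum(
        w * x for w, x in zip(MODEL["weights"], features)
    )
    return 1 if score > 0 else 0

def handler(event, context):
    # Lambda-style entry point: parse the request, score, respond.
    features = json.loads(event["body"])["features"]
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": predict(features)}),
    }

resp = handler({"body": json.dumps({"features": [1.0, 0.5]})}, None)
print(resp["body"])  # {"prediction": 1}
```

The per-request cost model of a function like this is the flip side of the per-endpoint pricing discussed here: idle models cost nothing, but a model too large to load quickly doesn't fit the pattern.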
And if you have thousands of models, or even hundreds of models, your price can go up pretty quickly. So in my experience, the limitation could just be price.

But if I compare Dataflow with a Spark cluster that you program yourself, then I would definitely go with Dataflow. I think it's just much easier, and maybe cost-wise as well you might be better off, I'm not sure. But in terms of comfort and developer experience, it's a much better experience.

Jeremy: Right. Right. And so, we talked a little bit about TF Lite there. Maybe the training piece of it, running that on Functions as a Service, isn't the most efficient or cost-effective way to do it, but what about running models, running inference, on something like a Lambda function or a Google Cloud Function or an Azure Function? Is it possible to package those models in a way that's small enough that you could do that type of workload?

Alexa: I think so. Yeah. I think you can definitely make inference using a Lambda function. But in terms of model training, I think that's not a ... Maybe there have already been experiments, I'm sure there have. But it's not the kind of workload that fits Lambda functions. It's the typical parallelizable, really large-scale workload, you know, the MapReduce type of data processing workload, and I think those are not necessarily a fit for Lambda functions. So for model training and data preparation, maybe those are not the best options, but for model inference, definitely. And there are many examples using Lambda functions for inference.

Jeremy: Right. Now, do you think that ...
because this is always something I find with serverless, and I know you're more of a data scientist, an ML expert, but I look at serverless and I question whether or not it needs to handle some of these things. Especially with some of the endpoints that are out there now, we talked about the Vision API and some of the other NLP things: are we putting in too much effort trying to make serverless handle these things, or is there a really good way to handle them by hosting it yourself? Even if you're doing SageMaker, maybe not SageMaker endpoints, but just running SageMaker machines to do it, are we trying too hard to squeeze some of these things into a serverless environment?

Alexa: Well, I don't know. I think, as a developer, I definitely prefer the more managed versions of these products. The less I need to bother with "oh, my cluster died and now we need to rebuild the cluster," the better, and I think serverless can definitely solve that. Maybe not serverless for everything, because for some of the use cases, some of the bits of the lifecycle, serverless is not the best fit, but a managed product is definitely something I prefer over a non-managed product.

Jeremy: Right. And so, I guess one last question for you here, because this is something that always interests me. There are really relevant things we need machine learning for. I think fraud detection is a hugely important one. Sentiment analysis, again. Some of those other things are maybe, I don't know, I shouldn't call them toy things, but personalization and some of those things, they're all really great things to have, and it seems like you can't build an application now without somebody wanting some piece of that machine learning in there.
So do you see that as where we are going in the future? Are we just going to have more of these APIs? Out of AWS, because I'm more familiar with the AWS ecosystem, they have Personalize and they have Connect and all these other services, they have the recommendation engine thing, Lex, things that will read text, natural language processing, and all that kind of stuff. Is that where we're moving, to all these pre-trained, canned products that I can just access via an API? Or do you think that if you're somebody getting started and you really want to get into the ML world, you should start diving into the TensorFlows and some of those other things?

Alexa: So I think if you are building an app and your goal is not to become an ML engineer or a data scientist, then these canned models are really useful, because you can have a really good recommendation engine in your product, a really good personalization engine, things like that. Those are really useful, and you don't need to know any machine learning in order to use them. So I think we are definitely going in that direction, because most companies won't hire data scientists just to train a recommender model. It's just easier to use an API endpoint that is already really good.

But if you are someone who wants to become a data scientist, or wants to be more involved with MLOps or machine learning engineering, then jumping into TensorFlow and understanding, maybe not the model architectures, as we discussed, but the workflow, and being able to program a machine learning pipeline from end to end, I think that's definitely recommended.

Jeremy: All right.
So one last question: if you've ever used the Watson NLP API or the Google Vision API, can you put on your resume that you're a machine learning expert?

Alexa: Well, if you really want to do that, I would give it a go. Why not?

Jeremy: All right. Good. Good to know. Well, Alexa, thank you so much for sharing all this information. Again, I find the use cases here to be much more complex than some of the surface ones you sometimes hear about. Obviously, machine learning is here to stay, and it sounds like there are a lot of really good opportunities for people to start dabbling in it without having to become a machine learning expert. But, again, I appreciate your expertise. So if people want to find out more about you, or about the things you're working on and datastack.tv, how do they do that?

Alexa: So we have a Twitter account for datastack.tv, so feel free to follow that. I also have a Twitter account; feel free to follow me. There is a datastack.tv website, just datastack.tv. You can go there and check out the courses. We have also created a roadmap specifically for data engineers, because there was no good roadmap for data engineers. I definitely recommend checking that out, because we listed most of the tools that a data engineer, and also a machine learning engineer, should know about. So if you're interested in this career path, I would definitely recommend it. Under datastack.tv's GitHub, there is a roadmap that you can find.

Jeremy: Awesome. All right. And that's just, like you said, datastack.tv.

Alexa: Yes.

Jeremy: I will make sure we get your Twitter and LinkedIn and GitHub and all that stuff in there. Alexa, thank you so much.

Alexa: Thanks. Thank you.
Sponsored by us! Support our work through: our courses at Talk Python Training, the Test & Code Podcast, and our Patreon supporters.

Brian #1: pip-chill - Make requirements with only the packages you need
By Ricardo Bánffy. Like pip freeze, but lists only the packages that are not dependencies of other installed packages. Great for creating requirements.txt files that look like the ones you would write by hand. I wish it had an option to not list itself, but pip-chill | grep -v pip-chill works.
What do I have installed?
(foo) $ pip freeze
appdirs==1.4.4
black==20.8b1
click==7.1.2
mypy-extensions==0.4.3
...
No really, what did I myself install?
(foo) $ pip-chill
black==20.8b1
pip-chill==1.0.0
Without versions?
(foo) $ pip-chill --no-version
black
pip-chill
What did those things install as dependencies?
(foo) $ pip-chill -v --no-version
black
pip-chill
# appdirs # Installed as dependency for black
# click # Installed as dependency for black
...

Michael #2: Windows update broke NumPy
Sent in by Daniel Mulkey. A recent Windows update broke some behavior that OpenBLAS (used by NumPy) relied on. There's a Developer Community thread about it, from a NumPy developer: "We have been trying to track down a strange issue where, after updating to Windows 10 2004, code that worked suddenly no longer works. Here is the NumPy issue and here is the corresponding issue in OpenBLAS. The problem can be summarized: when calling fmod, something is changed so that, much later, calling an OpenBLAS assembly routine fails. The only difference I can see in the registers that Visual Studio exposes is that after the call to fmod, register ST(0) is set to NaN." Steve Dower and other Microsoft people have commented. The fix is slated for release in January 2021, though there are workarounds for some scenarios. Matt P. posted a workaround: for all those at home following along and looking for a quick fix, NumPy has released a bugfix, 1.19.3, to work around this issue.
The bugfix broke something else on Linux, so we had to revert it in release 1.19.4, but you can still install 1.19.3 via pip install numpy==1.19.3. Note this only works around the way this bug crashes NumPy (technically, in OpenBLAS, which is shipped with NumPy), and may not fix all your problems related to this bug; Microsoft's help is needed to do that.

Brian #3: Build Plugins with Pluggy
By kracekumar. Blog post related to talks given at PyGotham and PyCon India. Pluggy is the plugin library used by pytest. The article starts with a CLI application that has one output format; the need is for more formats, implemented as plugins. Quick look at the pluggy architecture of host/caller/core system and plugin/hook, plus the plugin manager, hook specs, and hook implementations. Walks through the changes the application needs to support plugins. I've been waiting for an article on pluggy, and this is nice. But I admit I'm still a little lost. I guess I need to watch one of the presentations and try to build something with pluggy.

Michael #4: LINQ in Python
Via Adam: I seem to recall that Michael had a C# background, so this might be of interest: bringing LINQ-like expressions to Python with linqit. Example:
last_hot_pizza_slice = programmers.where(lambda e: e.experience > 15)
                                  .except_for(elon_musk)
                                  .of_type(Avi)
                                  .take(3)
                                  .select(lambda avi: avi.lunch)
                                  .where(lambda p: p.is_hot() and p.origin != 'Pizza Hut')
                                  .last()
                                  .slices.last()
Also interesting: asq: https://github.com/sixty-north/asq

Brian #5: Klio: a framework for processing audio files, or any binary files, at large scale
Recently open sourced by Spotify; there is an article about it. Klio is based on Apache Beam and allows integration with cloud processing engines. Open graph of job dependencies; batch and streaming pipelines. Goals: large-file input/output, scalability, reproducibility, efficiency, and closer collaboration between researchers and engineers. Uses Python. Obviously useful for Spotify, but they are hoping it will help with other audio research and applications.

Michael #6: Collapsing code cells in Jupyter Notebooks
Via Marco Gorelli: "You mentioned in that episode that you'd like to have a way of collapsing code cells in Jupyter Notebooks so you can export them as reports. Incidentally, I wrote a little blog post about how to do that; in case it's useful or of interest to you, here it is!" Basically, you get a static HTML file that is the static notebook output, but it can start with the code cells collapsed and can toggle their visibility.

Extras
Michael: New Apple Silicon Macs? Bot tweets: twitter.com/MichelARenard/status/1324269474544029696

Joke: By Richard Cairns. Q: Why did the data scientist get in trouble with Animal Welfare? A: She was caught trying to import pandas. "10e engineeeeeeeeers are the future." - detahq
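The host/plugin split from the pluggy write-up in #3 fits in a few lines: the host declares a hook spec, plugins implement it, and the plugin manager fans a call out to every registered implementation. The project name "myapp" and the render hook below are made up for illustration:

```python
import json
import pluggy

# Markers tie specs and implementations to one host project name.
hookspec = pluggy.HookspecMarker("myapp")
hookimpl = pluggy.HookimplMarker("myapp")

class OutputSpec:
    """The host's contract: plugins provide output formats."""

    @hookspec
    def render(self, data):
        """Return a string rendering of data."""

class JsonPlugin:
    @hookimpl
    def render(self, data):
        return json.dumps(data, sort_keys=True)

class KeysPlugin:
    @hookimpl
    def render(self, data):
        return ",".join(sorted(data))

pm = pluggy.PluginManager("myapp")
pm.add_hookspecs(OutputSpec)
pm.register(JsonPlugin())
pm.register(KeysPlugin())

# Every registered plugin gets called; results come back as a list.
print(pm.hook.render(data={"b": 2, "a": 1}))
```

This is the same mechanism pytest uses for its own plugins: the core never imports the plugins directly, it only calls hooks and collects whatever the registered implementations return.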
Catch up on the latest news while doing other things! A radio-style broadcast: "Daily AWS!" (毎日AWS!) Good morning, this is Kato from Serverworks. Today we cover 8 updates released on 9/15. (There were many updates this time, so the broadcast is split into two parts.) Share your feedback on Twitter with the hashtag #サバワ! ■ UPDATE lineup: Amplify JavaScript now supports server-side rendering frameworks such as Next.js and Nuxt.js; Amazon Lex now supports British English; Amazon CloudFront now supports Brotli compression; Amazon Transcribe adds automatic language identification; Amazon Kinesis Data Analytics now supports streaming applications using Java-based Apache Beam; Amazon Kinesis Data Analytics now supports Apache Flink Kinesis Data Firehose Producer v2.0.0; a new digital course on deploying .NET applications on AWS is available; CloudFormation now supports Amazon Kendra. ■ Serverworks SNS: Twitter / Facebook ■ Serverworks blog: Serverworks engineer blog
An airhacks.fm conversation with Romain Manni-Bucau (@rmannibucau) about: PaintShop Pro, science fiction matte paintings, scene generation, short movies, 3D tool automation with scripting, starting C programming with GTK, programming a PaintShop Pro "clone" as "hello, world", linux over windows, image editing involves math, learning algorithms from the internet, building a winamp-like mp3 player with C++ and GTK, switching from C/C++ to Java, no memory management in Java, implementing problem-solvers with Java, developing "BigData" apps with Hazelcast, Talip Ozturk, implementing map-reduce algorithms for the banking sector with Hazelcast, using Apache OpenEJB, working with Jean-Louis Monteiro the OpenEJB committer, using OpenEJB for good start times and for testing, Java EE and standards do not impact your business code, working with friends at Tomitribe, implementing extensions for TomEE - the MicroProfile before MicroProfile, joining Talend to implement batch processes, joining the yupiik.com startup, Apache Spark, Apache Beam and ReactJS, using Apache Meecrowave, ReactJS vs. Custom Elements, WebComponents and Redux, deploying services on-the-fly with OSGi, integrating CDI with OSGi, working with Apache Aries, using OSGi to load machine learning models, hot-loading modules for "Fluid Logic", OSGi Alliance specs, Karaf OSGi, HTTP/2 with Felix, OSGi ConfigAdmin configuration, the OSGi whiteboard pattern, Aries CDI, Romain Manni-Bucau on twitter: @rmannibucau, Romain's blog: rmannibucau.metawerx.net
An airhacks.fm conversation with Alexis (@alexismp) about: java -jar glassfish.jar, Community Management at Sun, Developer Relations, how to talk to developers, Texas Instruments 4a, a circle qualifies as "Hello World", Prolog to Java Applets migration for the National French Space Agency, the Java Center of Excellence at Sun Microsystems, Sun / JavaSoft / IBM as dream jobs, Scott McNealy and the ability of predicting the future - a reference to airhacks.fm episode #19 - interview with Scott McNealy, starting at Sun in 1998, the Sun Netscape Alliance, iPlanet Appserver, moving a Reference Implementation to a product called "GlassFish", HK2, GlassFish started faster than Tomcat, moving the industry with GlassFish, fascination with modularity, NetBeans as platform, plugins as quality assurance, lightweight runtimes with 500 MB WARs, making servers bigger and deployables smaller, docker changed the conversation, dealing with boring technologies, different language communities at Google, Java is less ceremonial than people think, the popularity of Java at Google, AppEngine's 10th anniversary, Apache Beam and Google Dataflow, how Sun lost the engineers in the Java 5 timeframe, a huge amount of Google projects is based on Java, AppEngine is "serverless", Sun and Google have a lot in common, JAX-RS is Google Cloud Endpoints, Managed PubSub service, PubSub is like JMS, AppEngine as PubSub message listener, Cloud Spanner - a distributed, scalable persistence layer, DataStore supports versioning, is a document key-value store, canary deployments, Objectify, an ORM for DataStore, Cloud SQL and PostgreSQL, BigTable, exports to BigQuery, istio, Kubernetes, Helidon on Google Cloud, Kubernetes Engine, you can find Alexis at twitter: @alexismp, LinkedIn, medium: @alexismp and his blog.
In this episode we are visited by Ståle Heitmann from Hafslund Nett, and talk a bit about their cloud initiative. Hafslund Nett uses both Microsoft's Azure and Google Cloud Platform to get the best out of both platforms, and chose Kubernetes as the orchestration tool for their containers. But when you develop and deploy applications in the cloud, it is important to keep things tidy, e.g. with good naming conventions and structures, and by defining infrastructure as code. We also talked a bit about microservices, and why Hafslund has chosen not to bet on serverless (apart from Apache Beam, which is too convenient not to use).
Mark and Melanie are joined by Sarah Novotny, Head of Open Source Strategy for Google Cloud Platform, to talk all about Open Source, the Cloud Native Computing Foundation and their relationships to Google Cloud Platform. Sarah Novotny Sarah Novotny leads an Open Source Strategy group for Google Cloud Platform. She has long been an Open Source community champion in communities such as Kubernetes, NGINX and MySQL and ran large scale technology infrastructures at Amazon before web-scale had a name. In 2001, she co-founded Blue Gecko, which was sold to DatAvail in 2012. She is a program chair emeritus for O’Reilly Media’s OSCON. Cool things of the week Now live in Tokyo: using TensorFlow to predict taxi demand blog Kubernetes best practices: Organizing with Namespaces blog youtube Announcing Open Images V4 and the ECCV 2018 Open Images Challenge blog dataset challenge Introducing Kubernetes Service Catalog and Google Cloud Platform Service Broker: find and connect services to your cloud-native apps blog docs Julia Evans - zines store Interview Kubernetes site Node.js Foundation board of directors Tensorflow site gRPC site Apache Beam site Google Kubernetes Engine site Forseti site podcast Cloud Native Computing Foundation site Cloud Native Computing Foundation Announces Kubernetes® as First Graduated Project blog NTP’s Fate Hinges On ‘Father Time’ article Open Container Initiative site Fireside chat: building on and contributing to Google’s open source projects Google I/O Question of the week Mark broke SSH access to his Compute Engine instance by accidentally removing the GCP linux guest environment. How did he fix it? Installing the Linux Guest Environment via Clone Root Disk & Use Startup Script docs Where can you find us next? Mark can be found streaming Agones development on Twitch and finished his blog series on scaling game servers on Kubernetes.
Melanie will be speaking at the internet2 Global Summit, May 9th in San Diego, and will also be talking at the Understand Risk Forum on May 17th, in Mexico City.
In this episode Antonio, Audrey and Guillaume comment on the news of February: lots of new releases in libraries and on the front-end side, but also news of Java 10 and 11 and, of course, Kotlin! Recorded on March 1, 2018. Episode download: LesCastCodeurs-Episode-184.mp3 News Languages: First release candidate for JDK 10; JDK 11 in early access; Java 8 will no longer receive updates and security fixes from January 2019; JDBC Next: A New Asynchronous API for Connecting to a Database. Libraries: Introducing Kotlin Support in Spring Framework 5.0; Spring Boot 1.5.10; Spring Boot 2.0 GA; Vert.x 3.5.1; TensorFlow 1.5; Apache Beam 2.3.0; Elastic 6.2.0; Elastic open-sources X-Pack. Middleware: Java EE becomes Jakarta EE; Infinispan 9.2.0.CR3. Infrastructure: CloudBees acquires Codeship. Cloud: CoreOS agrees to join Red Hat; Debugging “FROM scratch” on Kubernetes. Web: Webpack 4; Parcel 1.5.0; NPM 5.7; JHipster 4.14.0; TypeScript 2.7; Angular-CLI 1.7; Angular CLI diff, Cédric Exbrayat's migration helper tool; AngularJS 1.7 LTS; Nuxt.js 1.0; Web Components Todo; Flutter beta 1. Tooling: Gradle 4.5.0. Methodologies: Effective Use of Slack. Security: Chrome will mark all HTTP sites “not secure” from July 2018. Law, society and organization: The unwinding of net neutrality will begin on April 23rd; Socle interministériel des logiciels libres 2018; Elon Musk leaves the board of his artificial intelligence center. Conferences: BreizhCamp, March 28-30, 2018; Devoxx France, April 18-20, 2018; MixIT, April 19-20, 2018 in Lyon; Riviera Dev, May 2-4, 2018 in Sophia Antipolis; NCrafts, May 18-19, 2018 - the CfP is open; Best Of Web, June 7-8, 2018; EclipseCon, June 13-14, 2018 - the CfP is open; JHipster Conf, June 21; DevFest Lille, June 21, 2018 - the CfP is open; Voxxed Luxembourg, June 22, 2018; Sunny Tech, June 28-29, 2018 - the CfP is open.
Jenkins User Conference, June 28, 2018 - the CfP is open. Contact us: submit a crowdcast or a crowdquestion; reach us via twitter https://twitter.com/lescastcodeurs, on the Google group https://groups.google.com/group/lescastcodeurs or on the website https://lescastcodeurs.com/ Flattr us (donations) at https://lescastcodeurs.com/ Want to know more about sponsoring? sponsors@lescastcodeurs.com
In this podcast, Dean Wampler discusses fast data, streaming, microservices, and the paradox of choice when it comes to the options available today for building data pipelines. Why listen to this podcast: * Apache Beam is fast becoming the de-facto standard API for stream processing * Spark is great for batch processing, but Flink is tackling the low-latency stream processing market * Avoid running blocking REST calls from within a stream processing system - have them launched asynchronously and communicate over Kafka queues * Visibility into telemetry of stream processing systems is still a new field and under active development * The fast data platform is easily launched on an existing or new Mesosphere DC/OS runtime More on this: Quick scan our curated show notes on InfoQ http://bit.ly/2BYTMbI You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq Subscribe: www.youtube.com/infoq Like InfoQ on Facebook: bit.ly/2jmlyG8 Follow on Twitter: twitter.com/InfoQ Follow on LinkedIn: www.linkedin.com/company/infoq Want to see extended show notes? Check the landing page on InfoQ: http://bit.ly/2BYTMbI
On this week's podcast, Eric Anderson shares how Dataprep helps summarize, transform, visualize and clean up data on the Google Cloud Platform. When doing data analysis, data munging can typically take up most of the time, and this serverless tool helps optimize the process. About Eric Anderson Eric is a Product Manager at Google working on Cloud Dataprep and recently Cloud Dataflow. Previously he was at Amazon Web Services, Harvard Business School, General Electric and the University of Utah. He's from Salt Lake City, Utah and lives in Mountain View, California with his wife and three kids. Cool things of the week Intel Performance Libraries and Python Distribution enhance performance and scaling of Intel Xeon Scalable ('Skylake') processors on GCP blog The hidden costs of cloud blog and Server Density podcast Monitor and manage your costs with Cloud Platform billing export to BigQuery blog and Public Datasets podcast Kaggle TensorFlow Speech Recognition Challenge site Interview Cloud Dataprep site docs Cloud Dataflow site docs 7 Steps to Mastering Data Preparation with Python blog Design Your Pipeline blog Apache Beam site Question of the week What is feature engineering? Intro to Feature Engineering with TensorFlow video Where can you find us next? Mark will be in Montreal in December to speak at the Montreal International Games Summit. Melanie will be at NIPS (Neural Information Processing Systems) in Long Beach in December.
http://beam.incubator.apache.org/
You can find Jean-Baptiste:
http://blog.nanthrax.net/
https://github.com/jbonofre
https://twitter.com/jbonofre
https://www.linkedin.com/in/jean-baptiste-onofr%C3%A9-a0739317
Read the Affini-Tech blog: http://blog.affini-tech.com
-------------------------------------------------------------
http://www.bigdatahebdo.com
https://twitter.com/bigdatahebdo
Vincent: https://twitter.com/vhe74
Alexander: https://twitter.com/alexanderdeja
This episode is sponsored by Affini-Tech (http://affini-tech.com https://twitter.com/affinitech)
We're hiring! Come crunch data with us! Write to us at recrutement@affini-tech.com
In this podcast, we are talking to Tyler Akidau, a senior engineer at Google who leads the technical infrastructure and data processing teams in Seattle, a founding member of the Apache Beam PMC, and a passionate voice in the streaming space. This podcast covers data streaming and the 2015 Dataflow Model streaming paper [http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf] and many of the concepts it introduced, such as why dealing with out-of-order data is important, event time versus processing time, windowing approaches, and finally previews the track he is hosting at QCon SF next week. Why listen to this podcast: - Batch processing and streaming aren’t two incompatible things; they are a function of different windowing options. - Event time and processing time are two different concepts, and may be out of step with each other. - Completeness is knowing that you have processed all the events for a particular window. - Windowing choice can be answered from the what, when, where, how questions. - Unbounded versus bounded data is a better dimension than stream or batch processing. More on this: Quick scan our curated show notes on InfoQ http://bit.ly/2AyBTAb You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq Subscribe: www.youtube.com/infoq Like InfoQ on Facebook: bit.ly/2jmlyG8 Follow on Twitter: twitter.com/InfoQ Follow on LinkedIn: www.linkedin.com/company/infoq Want to see extended show notes? Check the landing page on InfoQ: http://bit.ly/2AyBTAb
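To make the event-time versus processing-time distinction concrete, here is a minimal sketch in plain Python (not the Beam or Dataflow API; the event structure and 60-second window size are illustrative) showing that events arriving out of order still land in the fixed event-time window their timestamps belong to:

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; each fixed window covers [start, start + 60)


def window_start(event_time: int) -> int:
    """Map an event timestamp to the start of its fixed window."""
    return event_time - (event_time % WINDOW_SIZE)


def assign_windows(events):
    """Group (event_time, value) pairs by event-time window.

    Arrival order (processing time) does not matter: a late event
    is still placed in the window its *event* timestamp belongs to.
    """
    windows = defaultdict(list)
    for event_time, value in events:
        windows[window_start(event_time)].append(value)
    return dict(windows)


# Events arrive out of order relative to their event timestamps.
events = [(5, "a"), (130, "b"), (62, "c"), (10, "d")]
print(assign_windows(events))  # {0: ['a', 'd'], 120: ['b'], 60: ['c']}
```

Processing-time windowing, by contrast, would group by arrival time and would put "b" and "c" in whatever window happened to be open when they showed up, which is exactly the skew the Dataflow model addresses.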
Exactly-once Semantics are Possible: Here’s How Kafka Does it
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/?utm_content=buffer9b1b6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
https://blog.ippon.fr/2017/07/11/kafka-0-11-0-%E2%99%A5/
Confluent KSQL
https://www.confluent.io/blog/ksql-open-source-streaming-sql-for-apache-kafka/
https://www.youtube.com/watch?v=A45uRzJiv7I&feature=youtu.be
Kafka + Prestodb.io
https://prestodb.io/docs/current/connector/kafka.html
Streaming SQL in Apache Flink, KSQL, and Stream Processing for Everyone
https://data-artisans.com/blog/flink-streaming-sql-ksql-stream-processing
Kafka Wakes Up And Is Metamorphosed Into A Database
https://www.nextplatform.com/2017/08/30/kafka-wakes-metamorphosed-database/amp/
(Editor’s Note: It would have been far funnier, of course, if Kafka woke up one morning and had been turned into CockroachDB.)
Open sourcing Kafka cruise control
https://engineering.linkedin.com/blog/2017/08/open-sourcing-kafka-cruise-control
https://github.com/linkedin/cruise-control
Yahoo’s New Pulsar: A Kafka Competitor?
https://www.datanami.com/2016/09/07/yahoos-new-pulsar-kafka-competitor/
Apache Beam 2.1
https://beam.apache.org/get-started/downloads/
Apache Beam splittable DoFn
https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html
Instaclustr Dynamic Resizing for Apache Cassandra
https://www.instaclustr.com/instaclustr-dynamic-resizing-for-apache-cassandra/?utm_content=buffer624e7&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Riak devs giddy over gambling biz's vow to set code free
https://www.theregister.co.uk/2017/08/25/bet365_to_buy_basho_release_code/?mt=1503782778086
Spark Release 2.2.0
http://spark.apache.org/releases/spark-release-2-2-0.html
[mooc] Data Engineering on Google Cloud specialization on Coursera
https://fr.coursera.org/specializations/gcp-data-machine-learning
[podcast] Is there a brain in the machine? An interview with Yann Le Cun, director of FAIR
https://www.franceculture.fr/emissions/la-methode-scientifique/y-t-il-un-cerveau-dans-la-machine
[podcast] Dremel, Druid and Data Modeling on Google BigQuery
https://www.drilltodetail.com/podcast/2017/6/19/drill-to-detail-ep31-dremel-druid-and-data-modeling-on-google-bigquery-with-special-guest-dan-mcclary
[privacy] How the Figaro, L’Équipe and Closer apps take part in tracking 10 million French people
http://www.numerama.com/politique/282934-enquete-comment-les-apps-figaro-lequipe-ou-closer-participent-au-pistage-de-10-millions-de-francais.html
How artificial intelligence is disrupting the media industry
http://www.latribune.fr/opinions/tribunes/comment-l-intelligence-artificielle-bouleverse-l-industrie-des-medias-746917.html
Cédric Villani is put in charge of a parliamentary fact-finding mission on AI.
http://www.numerama.com/politique/286341-le-gouvernement-fait-appel-a-cedric-villani-pour-une-mission-sur-lia.html
-------------------------------------------------------------
http://www.bigdatahebdo.com
https://twitter.com/bigdatahebdo
Vincent: https://twitter.com/vhe74
Alexander: https://twitter.com/alexanderdeja
This episode is sponsored by Affini-Tech (http://affini-tech.com https://twitter.com/affinitech)
We're hiring! Come crunch data with us! Write to us at recrutement@affini-tech.com
Cloud Dataflow and its OSS counterpart Apache Beam are amazing tools for Big Data. So today your co-hosts Francesc and Mark interview Frances Perry, the Tech Lead and PMC for those projects, to join us and tell us more about it. About Frances Perry Frances Perry is a software engineer who likes to make big data processing easy, intuitive, and efficient. After many years working on Google's internal data processing stack, she joined the Cloud Dataflow team to make this technology available to external cloud customers. She led the early work on Dataflow's unified batch/streaming programming model and is now on the PMC for Apache Beam. Cool things of the week Rewriting moviegolf.com Medium App Engine users, now you can configure custom domains from the API or CLI announcement Join the gRPC/Istio community day June 26th at Google Sunnyvale twitter Interview Cloud Dataflow homepage Apache Beam homepage Java SDK Quickstart docs Python SDK Quickstart docs Cloud Dataflow, Apache Beam and you announcement Question of the week How can I connect all the instances in a Managed Instance Group to CloudSQL securely? Connecting MySQL Client from Compute Engine About the Cloud SQL Proxy CloudSQL Proxy GitHub repo Where can you find us next? Francesc just released a new #justforfunc episode where he explains how to use cgo. He will be running a workshop at QCon New York on Go tooling based on this video, after that he'll be at GopherCon in Denver! Mark is still on vacation - but don't worry, he'll be back soon!
Confluent raises $50M to continue growing commercial arm of Apache Kafka
https://techcrunch.com/2017/03/07/confluent-raises-50m-to-continue-growing-commercial-arm-of-apache-kafka/
How Kafka Redefined Data Processing for the Streaming Age
https://www.datanami.com/2017/03/07/kafka-redefined-data-processing-streaming-age/
Hoodie: Uber Engineering’s Incremental Processing Framework on Hadoop
https://eng.uber.com/hoodie/
Analysis: It’s Amazon Web Services’ world – Google Cloud is just living in it
http://siliconangle.com/blog/2017/03/08/analysis-amazon-web-services-world-google-cloud-just-living/
Welcome Kaggle to Google Cloud
https://cloudplatform.googleblog.com/2017/03/welcome-Kaggle-to-Google-Cloud.html
Google Next 17
https://blog.google/topics/google-cloud/100-announcements-google-cloud-next-17/
Google Cloud Dataprep
https://cloud.google.com/blog/big-data/2017/03/google-cloud-platform-adds-new-tools-for-easy-data-preparation-and-integration
Python SDK released in Apache Beam 0.6.0
https://beam.apache.org/blog/2017/03/16/python-sdk-release.html
ScyllaDB Raises $16M to Advance NoSQL Database Technology
http://www.enterpriseappstoday.com/data-management/scylladb-raises-16m-to-advance-nosql-database-technology.html
Hadoop Has Failed Us, Tech Experts Say
https://www.datanami.com/2017/03/13/hadoop-failed-us-tech-experts-say/
The impact of artificial intelligence on the economy - Laurent Alexandre at the French Senate
https://www.youtube.com/watch?v=rJowm24piM4&feature=youtu.be
GDPR General Data Protection Regulation
http://www.cil.cnrs.fr/CIL/spip.php?article2634
https://en.wikipedia.org/wiki/General_Data_Protection_Regulation
Events: The Devoxx France program is published
http://cfp.devoxx.fr/2017/byday/wed
-------------------------------------------------------------
http://www.bigdatahebdo.com
https://twitter.com/bigdatahebdo
Vincent: https://twitter.com/vhe74
Alexander: https://twitter.com/alexanderdeja
This episode is sponsored by Affini-Tech (http://affini-tech.com https://twitter.com/affinitech)
We're hiring! Come crunch data with us! Write to us at recrutement@affini-tech.com
Spanner
Spanner, the Google Database That Mastered Time, Is Now Open to Everyone
https://www.wired.com/2017/02/spanner-google-database-harnessed-time-now-open-everyone/
Google Spanner Inspires CockroachDB To Outrun It
https://www.nextplatform.com/2017/02/22/google-spanner-inspires-cockroachdb-outrun/
Spanner, TrueTime and the CAP Theorem
https://research.google.com/pubs/pub45855.html
Spanner quickstart
https://cloud.google.com/spanner/docs/quickstart-console
-------------------------------------------------------------
DB
The probability of data loss in large clusters
http://martin.kleppmann.com/2017/01/26/data-loss-in-large-clusters.html
The first release candidate of Redis 4.0 is out
http://antirez.com/news/110
MongoDB 3.4 Passes Jepsen – The Industry’s Toughest Database Test
https://www.mongodb.com/mongodb-3.4-passes-jepsen-test
http://jepsen.io/analyses/mongodb-3-4-0-rc3
-------------------------------------------------------------
Data science
10 Signs Of A Bad Data Scientist
http://www.kdnuggets.com/2016/04/10-signs-bad-data-scientist.html
The Rise of the Weaponized AI Propaganda Machine
https://medium.com/join-scout/the-rise-of-the-weaponized-ai-propaganda-machine-86dac61668b#.qvwftlojy
Announcing TensorFlow 1.0
https://developers.googleblog.com/2017/02/announcing-tensorflow-10.html
Learn TensorFlow and deep learning, without a Ph.D.
https://cloud.google.com/blog/big-data/2017/01/learn-tensorflow-and-deep-learning-without-a-phd
-------------------------------------------------------------
Misc
Hadoop projects: a 70% failure rate
http://www.silicon.fr/projets-hadoop-echec-70-cas-169110.html
Media recap of the Apache Beam graduation
https://beam.apache.org/blog/2017/02/01/graduation-media-recap.html
https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what
Spark Summit East 2017 - A summary
http://blog.ippon.tech/spark-summit-east-2017-a-summary/
https://spark-summit.org/east-2017/schedule/
Play it like Clever Cloud: how we survived receivership
https://medium.com/@waxzce/joue-la-comme-clever-cloud-comment-nous-avons-surv%C3%A9cu-%C3%A0-un-redressement-judiciaire-68a4b79c902#.s1g9bha19
-------------------------------------------------------------
http://www.bigdatahebdo.com
https://twitter.com/bigdatahebdo
Vincent: https://twitter.com/vhe74
Alexander: https://twitter.com/alexanderdeja
This episode is sponsored by Affini-Tech (http://affini-tech.com https://twitter.com/affinitech)
We're hiring! Come crunch data with us! Write to us at recrutement@affini-tech.com
Mark Rittman is joined by Alex Olivier from Qubit to talk about their platform journey from on-premise Hadoop to petabytes of data running in Google Cloud Platform, using Google Cloud Dataflow (aka Apache Beam), Google PubSub and Google BigQuery along with machine learning and analytics to deliver personalisation at scale for digital retailers around the world.
Software Engineering Radio - The Podcast for Professional Software Developers
Jeff Meyerson talks with Frances Perry about Apache Beam, a unified batch and stream processing model. Topics include a history of batch and stream processing, from MapReduce to the Lambda Architecture to the more recent Dataflow model, originally defined in a Google paper. Dataflow overcomes the problem of event-time skew by using watermarks and other methods discussed between Jeff and Frances. Apache Beam lets users define pipelines in a way that is agnostic of the underlying execution engine, much as SQL provides a unified language for databases. This aims to reduce the churn and repeated work that has occurred in the rapidly evolving stream processing ecosystem.
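The watermark idea discussed in the episode can be sketched in a few lines of plain Python. This is a toy illustration under assumed semantics (fixed-size windows, a watermark that trails the maximum observed event time by a fixed allowed lateness), not the actual Apache Beam API; `window_with_watermark` and its parameters are made up for the sketch.

```python
# Toy event-time windowing with a watermark. Events carry event-time
# timestamps and may arrive out of order; a window is emitted only once
# the watermark has passed the window's end.
from collections import defaultdict

def window_with_watermark(events, window_size=10, allowed_lateness=5):
    """events: iterable of (event_time, value) pairs, possibly out of order.
    Yields (window_start, values) once the watermark passes a window's end."""
    open_windows = defaultdict(list)
    watermark = float("-inf")
    for event_time, value in events:
        start = (event_time // window_size) * window_size
        open_windows[start].append(value)
        # The watermark trails the max event time seen by allowed_lateness,
        # giving out-of-order events a chance to land in their window.
        watermark = max(watermark, event_time - allowed_lateness)
        for s in sorted(w for w in open_windows if w + window_size <= watermark):
            yield s, open_windows.pop(s)
    for s in sorted(open_windows):  # input exhausted: flush remaining windows
        yield s, open_windows.pop(s)

# Event (3, "c") arrives after (12, "b") but still lands in window [0, 10).
events = [(1, "a"), (12, "b"), (3, "c"), (14, "e"), (25, "d")]
result = list(window_with_watermark(events))
# → [(0, ['a', 'c']), (10, ['b', 'e']), (20, ['d'])]
```

Because input is never "done" in a true streaming setting, the watermark, rather than end-of-input, is what lets the system decide a window is complete despite late, out-of-order events.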
An episode that took us a long time to release, and which suffers from poor audio quality, sorry. Lack of bandwidth is clearly the enemy of podcasting. http://beam.incubator.apache.org/ You can find Jean-Baptiste at: http://blog.nanthrax.net/ https://github.com/jbonofre https://twitter.com/jbonofre https://www.linkedin.com/in/jean-baptiste-onofr%C3%A9-a0739317
Batch and stream processing systems have been evolving for the past decade. From MapReduce to Apache Storm to Dataflow, the best practices for large-volume data processing have become more sophisticated as the industry and open source communities have iterated on them. Dataflow and Apache Beam are projects that present a unified batch and stream processing model. From the Software Engineering Daily episode "Cloud Dataflow with Eric Anderson".
Unbounded data streams create difficult challenges for our application architectures. The data never stops coming, and we are forced to assume that we will never know if or when we have seen all of our data. Some streaming systems give us the tools to deal partially with unbounded data streams, but we have to complement […] From the Software Engineering Daily episode "Apache Beam with Frances Perry".