POPULARITY
MapReduce: A Deep Dive
In 2004, processing large volumes of data was a real challenge. Some companies had so-called supercomputers for it. Others just shrugged and waited for their computations to finish. Google was one of the players that had large volumes of data and wanted to process them, but had no supercomputers available. Or, more precisely: did not want to spend the money on them. So what do you do when you have a problem? You look for a solution. That is what Jeffrey Dean and his team did. The result? A revolutionary paper on how MapReduce can process large volumes of data, distributed across simple commodity hardware.
In this podcast episode we take a closer look. We clarify what MapReduce is, how it works, why MapReduce was so revolutionary, how it dealt with hardware failures, which challenges it faced in practice and still faces, what the Google File System, Hadoop, and HDFS have to do with it, and we place MapReduce in the context of today's technologies, cloud and all.
Another "Papers We Love" episode.
Bonus: Hadoop really is the elephant in the room.
You can find our current advertising partners at https://engineeringkiosk.dev/partners
Quick feedback on the episode:
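To make the episode's core idea concrete: the word-count sketch below is a minimal, single-machine illustration of the map, shuffle, and reduce phases, not Google's implementation; the real system spreads input splits and reduce partitions across many commodity machines and re-runs failed tasks.

```python
from collections import defaultdict
from multiprocessing import Pool

def map_phase(document):
    # map: emit a (word, 1) pair for every word in one input split
    return [(word, 1) for word in document.split()]

def reduce_phase(word, counts):
    # reduce: combine all values emitted for one key
    return word, sum(counts)

def mapreduce(documents, workers=4):
    with Pool(workers) as pool:
        mapped = pool.map(map_phase, documents)  # map splits in parallel
    groups = defaultdict(list)                   # shuffle: group pairs by key
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return dict(reduce_phase(w, c) for w, c in groups.items())

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(mapreduce(docs))  # {'the': 3, 'quick': 1, 'brown': 1, ...}
```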
This week we talk to Sue Scheiner about her path from beautiful Cornell to the vibrant world of Sesame Street. We had an incredibly fun and nostalgic conversation about education, inspiration, and the magic of mentorship. As an HDFS major in Cornell Human Ecology, she was guided by the wonderful Professor John Condry, whose influence helped launch her 35-year career at Sesame Street. Sue shares how her love for learning, her incredible Cornell friendships, and her passion for children's media shaped her career. And after all these years, she still likes going to work every day! #Goals
Plus, we put our own Muppet creations to the test: did they get the Sesame Street seal of approval? And have you heard of Miss Rachel? We loved spending time with Sue; she's just the best.
Not sponsored by or affiliated with Cornell University
Summary
Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer, the process becomes more challenging. Sriram Panyam has been involved in several projects that required migration of large volumes of data in high-traffic environments. In this episode he shares some of the valuable lessons that he learned about how to make those projects successful.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journeys of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support.
Your host is Tobias Macey and today I'm interviewing Sriram Panyam about his experiences conducting large scale data migrations and the useful strategies that he learned in the process.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing some of your experiences with data migration projects?
- As you have gone through successive migration projects, how has that influenced the ways that you think about architecting data systems?
- How would you categorize the different types and motivations of migrations?
- How does the motivation for a migration influence the ways that you plan for and execute that work?
- Can you talk us through one or two specific projects that you have taken part in?
Part 1: The Triggers
Section 1: Technical Limitations triggering Data Migration
- Scaling bottlenecks: Performance issues with databases, storage, or network infrastructure
- Legacy compatibility: Difficulties integrating with modern tools and cloud platforms
- System upgrades: The need to migrate data during major software changes (e.g., SQL Server version upgrade)
Section 2: Types of Migrations for Infrastructure Focus
- Storage migration: Moving data between systems (HDD to SSD, SAN to NAS, etc.)
- Data center migration: Physical relocation or consolidation of data centers
- Virtualization migration: Moving from physical servers to virtual machines (or vice versa)
Section 3: Technical Decisions Driving Data Migrations
- End-of-life support: Forced migration when older software or hardware is sunsetted
- Security and compliance: Adopting new platforms with better security postures
- Cost optimization: Potential savings of cloud vs. on-premise data centers
Part 2: Challenges (and Anxieties)
Section 1: Technical Challenges
- Data transformation challenges: Schema changes, complex data mappings
- Network bandwidth and latency: Transferring large datasets efficiently
- Performance testing and load balancing: Ensuring new systems can handle the workload
- Live data consistency: Maintaining data integrity while updates occur in the source system
- Minimizing lag: Techniques to reduce delays in replicating changes to the new system
- Change data capture: Identifying and tracking changes to the source system during migration
Section 2: Operational Challenges
- Minimizing downtime: Strategies for service continuity during migration
- Change management and rollback plans: Dealing with unexpected issues
- Technical skills and resources: In-house expertise/data teams/external help
Section 3: Security & Compliance Challenges
- Data encryption and protection: Methods for both in-transit and at-rest data
- Meeting audit requirements: Documenting data lineage & the chain of custody
- Managing access controls: Adjusting identity and role-based access to the new systems
Part 3: Patterns
Section 1: Infrastructure Migration Strategies
- Lift and shift: Migrating as-is vs. modernization and re-architecting during the move
- Phased vs. big bang approaches: Tradeoffs in risk vs. disruption
- Tools and automation: Using specialized software to streamline the process
- Dual writes: Managing updates to both old and new systems for a time
- Change data capture (CDC) methods: Log-based vs. trigger-based approaches for tracking changes
- Data validation & reconciliation: Ensuring consistency between source and target (a minimal sketch follows the links below)
Section 2: Maintaining Performance and Reliability
- Disaster recovery planning: Failover mechanisms for the new environment
- Monitoring and alerting: Proactively identifying and addressing issues
- Capacity planning and forecasting growth to scale the new infrastructure
Section 3: Data Consistency and Replication
- Replication tools: strategies and specialized tooling
- Data synchronization techniques, e.g., pros and cons of different methods (incremental vs. full)
- Testing/verification strategies for validating data correctness in a live environment
- Implications of large scale systems/environments
- Comparison of interesting strategies: DBLog, Debezium, Databus, GoldenGate, etc.
- What are the most interesting, innovative, or unexpected approaches to data migrations that you have seen or participated in?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data migrations?
- When is a migration the wrong choice?
- What are the characteristics or features of data technologies and the overall ecosystem that can reduce the burden of data migration in the future?
Contact Info
LinkedIn (https://www.linkedin.com/in/srirampanyam/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows.
Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
Links
- DagKnows (https://dagknows.com)
- Google Cloud Dataflow (https://cloud.google.com/dataflow)
- Seinfeld Risk Management (https://www.youtube.com/watch)
- ACL == Access Control List (https://en.wikipedia.org/wiki/Access-control_list)
- LinkedIn Databus - Change Data Capture (https://github.com/linkedin/databus)
- Espresso Storage (https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system)
- HDFS (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
- Kafka (https://kafka.apache.org/)
- Postgres Replication Slots (https://www.postgresql.org/docs/current/logical-replication.html)
- Queueing Theory (https://en.wikipedia.org/wiki/Queueing_theory)
- Apache Beam (https://beam.apache.org/)
- Debezium (https://debezium.io/)
- Airbyte (https://airbyte.com/)
- Fivetran (https://fivetran.com)
- Designing Data-Intensive Applications (https://amzn.to/4aAztR1) by Martin Kleppmann (https://martin.kleppmann.com/) (affiliate link)
- Vector Databases (https://en.wikipedia.org/wiki/Vector_database)
- Pinecone (https://www.pinecone.io/)
- Weaviate (https://weaviate.io/)
- LAMP Stack (https://en.wikipedia.org/wiki/LAMP_(software_bundle))
- Netflix DBLog (https://arxiv.org/abs/2010.12597)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
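One pattern from the outline above, data validation and reconciliation, is worth a concrete sketch. The following is a minimal, hypothetical illustration of chunked count-and-checksum comparison between a source and a target; the table name, key column, chunk size, and connections are placeholders, and a real migration would use engine-native checksum functions and drivers.

```python
import hashlib
import sqlite3  # stand-in engine; a real migration would use the source/target drivers

def chunk_checksum(conn, table, key, lo, hi):
    # Deterministic digest of one key range; both sides must order rows identically.
    # Identifiers are interpolated directly, so they must come from trusted config.
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE {key} >= ? AND {key} < ? ORDER BY {key}",
        (lo, hi),
    ).fetchall()
    digest = hashlib.sha256(repr(rows).encode()).hexdigest()
    return len(rows), digest

def reconcile(source, target, table, key, max_key, chunk=10_000):
    # Compare (row count, digest) per chunk; mismatched ranges get re-copied
    # or inspected by hand instead of re-validating the whole table.
    mismatches = []
    for lo in range(0, max_key, chunk):
        s = chunk_checksum(source, table, key, lo, lo + chunk)
        t = chunk_checksum(target, table, key, lo, lo + chunk)
        if s != t:
            mismatches.append((lo, lo + chunk, s, t))
    return mismatches
```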
Join us for a heartfelt and insightful conversation with Katherine Velez, a compassionate Veterinary Social Worker at Cornell Veterinary Specialists in Stamford, CT, as she shares her experiences and wisdom in navigating the delicate journey of pet loss, quality of life assessments, and the importance of self-care in veterinary medicine. From comforting grieving pet owners to making difficult euthanasia decisions, Katherine sheds light on the emotional depth and professional challenges faced in veterinary practice. Discover the power of empathy, the role of humor in coping, and the profound impact of veterinary care on both animals and their human companions. Katherine Velez received her B.A. in HDFS with a minor in Women's Studies in Spring of 2010. She was involved in Zero Tolerance, a student activities organization, and participated in the production of The Vagina Monologues. During her time at UConn, she completed an internship at St. Luke's Lifeworks (now Inspirica), working at their women's shelter. She also participated in a mentorship program with middle school children through the Stamford Public Education Foundation. This experience helped cement her interest in working with vulnerable populations and encouraged her to think about plans for after graduation. Her advisor, Dr. Annamaria Csizmadia, and professors were integral and supportive of her decision to pursue a social work master's degree. After graduation, Katherine earned her master's in social work with a clinical concentration from Fordham University. During her time at Fordham, she started working as a case worker at a local nonprofit, Person to Person, providing emergency assistance programs part time. She completed an internship with the White Plains Youth Bureau, where she developed an after-school program for at-risk youth in an immigrant community. She also completed her clinical internship at Norwalk Community Health Center, where she provided individual psychotherapy to clinic patients. Upon completing her MSW, she started working full time for Person to Person and was promoted to Case Work Manager as the organization grew and branched out to a bigger catchment area. In 2016, Katherine began working as a Research Coordinator at Columbia University Medical Center (CUMC) in the Pediatrics Department. She had the privilege of working on several clinical trials within her department, including working with mothers and children in the NICU at Morgan Stanley Children's Hospital and running mother-child groups in preschool settings. At CUMC, she earned the required hours toward her clinical license and is now a fully licensed social worker in the state of CT. Today, Katherine is the Veterinary Social Worker at Cornell Veterinary Specialists in Stamford, CT, where she works with clients and supports staff in the day-to-day human issues that arise within the veterinary field and the human-animal bond. She is also in private practice. Katherine's time at UConn helped her realize her goals of becoming a social worker. The staff and professors she met along the way forever impacted her life and its trajectory. It is because of these relationships that she was able to succeed and was prepared to pursue a master's degree. She will forever hold the UConn community in her heart and is grateful for her time in the HDFS program. --- Send in a voice message: https://podcasters.spotify.com/pod/show/speakingofpets/message
Summary
A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Join in with the event for the global data community, Data Council Austin. From March 26th-28th 2024, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today.
Your host is Tobias Macey and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg.
Interview
- Introduction
- How did you get involved in the area of data management?
- To start, can you share your definition of what constitutes a "Data Lakehouse"?
- What are the technical/architectural/UX challenges that have hindered the progression of lakehouses?
- What are the notable advancements in recent months/years that make them a more viable platform choice?
- There are multiple tools and vendors that have adopted the "data lakehouse" terminology.
- What are the benefits offered by the combination of Trino and Iceberg?
- What are the key points of comparison for that combination in relation to other possible selections?
- What are the pain points that are still prevalent in lakehouse architectures as compared to warehouse or vertically integrated systems?
- What progress is being made (within or across the ecosystem) to address those sharp edges?
- For someone who is interested in building a data lakehouse with Trino and Iceberg, how does that influence their selection of other platform elements?
- What are the differences in terms of pipeline design/access and usage patterns when using a Trino/Iceberg lakehouse as compared to other popular warehouse/lakehouse structures?
- What are the most interesting, innovative, or unexpected ways that you have seen Trino lakehouses used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data lakehouse ecosystem?
- When is a lakehouse the wrong choice?
- What do you have planned for the future of Trino/Starburst?
Contact Info
LinkedIn (https://www.linkedin.com/in/dainsundstrom/)
dain (https://github.com/dain) on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
Links
- Trino (https://trino.io/)
- Starburst (https://www.starburst.io/)
- Presto (https://prestodb.io/)
- JBoss (https://en.wikipedia.org/wiki/JBoss_Enterprise_Application_Platform)
- Java EE (https://www.oracle.com/java/technologies/java-ee-glance.html)
- HDFS (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
- S3 (https://aws.amazon.com/s3/)
- GCS == Google Cloud Storage (https://cloud.google.com/storage?hl=en)
- Hive (https://hive.apache.org/)
- Hive ACID (https://cwiki.apache.org/confluence/display/hive/hive+transactions)
- Apache Ranger (https://ranger.apache.org/)
- OPA == Open Policy Agent (https://www.openpolicyagent.org/)
- Oso (https://www.osohq.com/)
- AWS Lake Formation (https://aws.amazon.com/lake-formation/)
- Tabular (https://tabular.io/)
- Iceberg (https://iceberg.apache.org/), Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/)
- Delta Lake (https://delta.io/), Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/)
- Debezium (https://debezium.io/), Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114)
- Materialized View (https://en.wikipedia.org/wiki/Materialized_view)
- Clickhouse (https://clickhouse.com/)
- Druid (https://druid.apache.org/)
- Hudi (https://hudi.apache.org/), Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
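For a feel of the Trino-plus-Iceberg combination discussed in the episode, here is a minimal sketch using the trino Python client; the host, catalog, schema, and table names are illustrative placeholders, not anything from the episode.

```python
from trino.dbapi import connect  # pip install trino

# Connection details are hypothetical placeholders.
conn = connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="iceberg",   # an Iceberg catalog configured in Trino
    schema="analytics",
)
cur = conn.cursor()

# Trino plans the scan; Iceberg table metadata lets it prune data files
# by partition and column statistics before reading anything.
cur.execute(
    "SELECT event_date, count(*) AS events "
    "FROM page_views "
    "WHERE event_date >= DATE '2024-01-01' "
    "GROUP BY event_date ORDER BY event_date"
)
for row in cur.fetchall():
    print(row)
```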
HDFS student Ashley Poghen joins the show this week! Ashley discusses her path to college, how she juggles all her roles, and her love for the outdoors. All the upcoming information you need for this week and beyond. Enjoy!
In this episode, we spoke with Dhruba Borthakur. Dhruba is the CTO and co-founder at Rockset, a search and analytics database hosted in the cloud. He was the founding engineer of the RocksDB project at Facebook and, for a while, the Principal Architect for HDFS. In this episode, we discuss RocksDB and compare it with LevelDB. We also discuss in detail the Aggregator Leaf Tailer architecture, which started at Facebook and now powers Rockset. Follow Dhruba: https://twitter.com/dhruba_rocks Follow Alex: https://twitter.com/alexbdebrie
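For context on the RocksDB discussion: RocksDB, like LevelDB before it, is an embedded log-structured merge-tree key-value store. Here is a minimal sketch using the python-rocksdb bindings; the database path and key layout are invented for illustration.

```python
import rocksdb  # pip install python-rocksdb (bindings over the C++ library)

opts = rocksdb.Options(create_if_missing=True)
# Writes land in an in-memory memtable plus a write-ahead log, then flush
# into immutable SST files that background compaction merges: the LSM design
# RocksDB inherits from LevelDB, tuned for fast storage and many cores.
db = rocksdb.DB("events.db", opts)

db.put(b"user:42:last_seen", b"2024-01-01T00:00:00Z")
print(db.get(b"user:42:last_seen"))  # b'2024-01-01T00:00:00Z'

# Keys are stored in sorted order, so prefix layouts enable cheap range scans.
it = db.iterkeys()
it.seek(b"user:42:")
for key in it:
    if not key.startswith(b"user:42:"):
        break
    print(key)
```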
In this episode, we bring you David Barton, an entrepreneur, a fitness genius, and a multi-talented man of the arts, who transformed the fitness landscape in NYC and the world. He discusses the challenges he faced and the innovative fitness philosophy that set his gyms apart. His newest gym, U, is state-of-the-art everything, with positive energy everywhere. PLUS, he lovingly critiques our pushups. It's a conversation you do not want to miss and an exploration of David Barton's extraordinary life, from his Ivy League roots to his enduring legacy as a fitness visionary.
Follow David on Instagram: ubydavidbarton
Website: www.gymunyc.com
Enjoy this special presentation from the first ever Whinypaluza Ultimate Marriage & Parenting Summit! Laura Froyen received her PhD in Human Development and Family Studies (HDFS) with an emphasis in Couple and Family Therapy from Michigan State University in 2014. While pursuing her doctorate she worked as a Couple and Family Therapist in the state of Michigan, helping families navigate difficult times. Her research focused on how marital and family relationships influence parenting and child development. She continued this research at the University of Wisconsin-Madison as an Assistant Professor of HDFS and at UW-Extension as an Early Childhood and Parenting Specialist. She helps individuals, partners, and co-parents become the parents they are longing to be. She helps overwhelmed and disconnected parents reconnect with themselves, each other, and their children. She helps parents bring ease, calm, and JOY back into their hectic and chaotic lives. She helps parents stop yelling and start feeling confident in themselves. She does this through her unique approach to parenting that is grounded in research and driven by the overarching goal of restoring balance and compassion to families. Listen to this insightful Whinypaluza episode with Laura Froyen about how everyday life gives us moments to shine light on our wounds that still need healing, creating opportunities for repair and growth.
Here is what to expect on this week's show:
- There is no such thing as perfection, only "good enough."
- There is a cycle of mistake/repair that creates growth.
- Acknowledge your triggers, and use them for growth, until they are no longer triggers.
- What are the different kinds of triggers?
- Repeating behavioral patterns is a way of your brain creating structure, efficiency, and making sense of things.
- Rules that you had as a child can become triggers for you as a parent if your children do things that would have been breaking those rules. It creates a sense of something being wrong.
- Addressing your triggers and healing yourself is about understanding we are all deserving of love and compassion.
- Awareness is the first step. Building in self-regulation is the second step.
- Learning mindfulness and breathwork is vastly beneficial in healing.
Connect with Laura:
Website https://www.laurafroyen.com/
Facebook https://www.facebook.com/laurafroyen/
Follow Rebecca Greene:
Blog https://www.whinypaluza.com/
Book 1 https://bit.ly/WhinypaluzaBook
Book 2 https://bit.ly/whinybook2
Facebook https://www.facebook.com/whinypaluzaparenting
Instagram https://www.instagram.com/becgreene5/ @becgreene5
TikTok https://www.tiktok.com/@whinypaluzamom?lang=en @whinypaluzamom
Learn more about your ad choices. Visit megaphone.fm/adchoices
Faction CTO Matt Wallace returns to the CTO Advisor podcast to talk data architecture with Keith Townsend. Matt walks Keith through the basics of Hadoop and HDFS to more modern concepts such as AWS EMR and where things break. The two discuss how Enterprise Architecture and Data Architecture intersect to ensure users can collaborate on [...]
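Since the episode starts from Hadoop and HDFS basics, a minimal sketch of client access to HDFS may help; this uses the hdfs Python package's WebHDFS client, and the namenode URL, user, and paths are placeholders.

```python
from hdfs import InsecureClient  # pip install hdfs (WebHDFS client)

# The namenode URL and paths are hypothetical, for illustration only.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# HDFS presents a filesystem view, but files are split into large blocks
# replicated across datanodes; the namenode only tracks the metadata.
for name in client.list("/data/logs"):
    status = client.status(f"/data/logs/{name}")
    print(name, status["length"], "bytes")

with client.read("/data/logs/part-00000.txt", encoding="utf-8") as reader:
    print(reader.read(200))  # first 200 characters of one file
```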
EP#66 Willis Nana: Data Engineer. Learn more about me: https://espresso-jobs.com/conseils-carriere/les-geeks-du-web-willis-nana/ "Creates simple solutions to complex problems" - My skills: Data Engineering
A new research paper was published on the cover of Aging (Aging-US) Volume 14, Issue 22, entitled, “Glutaminase inhibitors rejuvenate human skin via clearance of senescent cells: a study using a mouse/human chimeric model.” Skin aging caused by various endogenous and exogenous factors results in structural and functional changes to skin components. However, the role of senescent cells in skin aging has not been clarified. In this new study, researchers Kento Takaya, Tatsuyuki Ishii, Toru Asou, and Kazuo Kishi, from the Department of Plastic and Reconstructive Surgery at the Keio University School of Medicine, evaluated the effects of the glutaminase inhibitor BPTES (bis-2-(5-phenylacetamido-1, 3, 4-thiadiazol-2-yl)ethyl sulfide) on human senescent dermal fibroblasts and aged human skin to elucidate the function of senescent cells in skin aging. “[...] we utilized plastic surgery to create an experimental mouse/human chimeric model in which intraoperatively obtained human whole skin layers were transplanted into nude mice using previously described methods [25] and evaluated the anti-aging effects of BPTES on real human skin.” Primary human dermal fibroblasts (HDFs) were induced to senescence by long-term passaging, ionizing radiation, and treatment with doxorubicin, an anticancer drug. Cell viability of HDFs was assessed after BPTES treatment. A mouse/human chimeric model was created by subcutaneously transplanting whole skin grafts from aged humans into nude mice. The model was treated intraperitoneally with BPTES or vehicle for 30 days. Skin samples were collected and subjected to reverse transcription-quantitative polymerase chain reaction (RT-qPCR), western blotting, and histological analysis. BPTES selectively eliminated senescent dermal fibroblasts regardless of the method used to induce senescence; aged human skin grafts treated with BPTES exhibited increased collagen density, increased cell proliferation in the dermis, and decreased aging-related secretory phenotypes, such as matrix metalloprotease and interleukin. These effects were maintained in the grafts 1 month after termination of the treatment. In conclusion, selective removal of senescent dermal fibroblasts can improve the skin aging phenotype, indicating that BPTES may be an effective novel therapeutic agent for skin aging. “In summary, our results indicate that selective clearance of aging dermal fibroblasts by BPTES ameliorates skin senescence-related changes and that aging dermal fibroblasts may play an important role in the skin aging process. Therefore, senescent cell eliminators for aging skin cells may be an effective option for treating skin aging.” DOI: https://doi.org/10.18632/aging.204391 Corresponding Author: Kento Takaya - kento-takaya312@keio.jp Sign up for free Altmetric alerts about this article: https://aging.altmetric.com/details/email_updates?id=10.18632%2Faging.204391 About Aging-US Launched in 2009, Aging-US publishes papers of general interest and biological significance in all fields of aging research and age-related diseases, including cancer—and now, with a special focus on COVID-19 vulnerability as an age-dependent syndrome. Topics in Aging-US go beyond traditional gerontology, including, but not limited to, cellular and molecular biology, human age-related diseases, pathology in model organisms, signal transduction pathways (e.g., p53, sirtuins, and PI-3K/AKT/mTOR, among others), and approaches to modulating these signaling pathways. 
Please visit our website at https://www.Aging-US.com and connect with us: SoundCloud - https://soundcloud.com/Aging-Us Facebook - https://www.facebook.com/AgingUS/ Twitter - https://twitter.com/AgingJrnl Instagram - https://www.instagram.com/agingjrnl/ YouTube - https://www.youtube.com/agingus LinkedIn - https://www.linkedin.com/company/aging/ Pinterest - https://www.pinterest.com/AgingUS/ For media inquiries, please contact media@impactjournals.com
BUFFALO, NY- June 15, 2022 – A new research paper was published in Aging (Aging-US) on the cover of Volume 14, Issue 11, entitled, “Histone deacetylase 4 reverses cellular senescence via DDIT4 in dermal fibroblasts.” Researchers—from Seoul National University, Seoul National University College of Medicine, Seoul National University Graduate School, and Daegu Gyeongbuk Institute of Science and Technology (DGIST)—previously demonstrated that histone deacetylase 4 (HDAC4) is consistently downregulated in aged and ultraviolet (UV)-irradiated human skin. However, there is little research on how HDAC4 causes skin aging. “To elucidate the potential role of HDAC4 in the regulation of cellular senescence and skin aging, we established oxidative stress- and UV-induced cellular senescence models using primary human dermal fibroblasts (HDFs).” After overexpression or knockdown of HDAC4 in primary HDFs, RNA sequencing identified candidate molecular targets of HDAC4. “Integrative analyses of our current and public mRNA expression profiles identified DNA damage-inducible transcript 4 (DDIT4) as a critical senescence-associated factor regulated by HDAC4.” Full press release - https://aging-us.net/2022/06/15/aging-us-ddit4-identified-as-candidate-target-of-hdac4-associated-skin-aging/ DOI: https://doi.org/10.18632/aging.204118 Corresponding Authors: Daehee Hwang - daehee@snu.ac.kr, Dong Hun Lee - ivymed27@snu.ac.kr, Jin Ho Chung - jhchung@snu.ac.kr Keywords: cellular senescence, DNA damage-inducible transcript 4, histone deacetylase 4, oxidative stress, ultraviolet light Sign up for free Altmetric alerts about this article: https://aging.altmetric.com/details/email_updates?id=10.18632%2Faging.204118 About Aging-US: Launched in 2009, Aging (Aging-US) publishes papers of general interest and biological significance in all fields of aging research and age-related diseases, including cancer—and now, with a special focus on COVID-19 vulnerability as an age-dependent syndrome. Topics in Aging go beyond traditional gerontology, including, but not limited to, cellular and molecular biology, human age-related diseases, pathology in model organisms, signal transduction pathways (e.g., p53, sirtuins, and PI-3K/AKT/mTOR, among others), and approaches to modulating these signaling pathways. Follow Aging on social media: SoundCloud – https://soundcloud.com/Aging-Us Facebook – https://www.facebook.com/AgingUS/ Twitter – https://twitter.com/AgingJrnl Instagram – https://www.instagram.com/agingjrnl/ YouTube – https://www.youtube.com/agingus LinkedIn – https://www.linkedin.com/company/aging/ Pinterest – https://www.pinterest.com/AgingUS/ For media inquiries, please contact media@impactjournals.com.
In our Trino Community Broadcast episode 35 we are catching up on recent releases 375, 376, 377, and 378. We then talk about how Trino is packaged as a tarball, RPM, and Docker container, what some of the differences are, and how you can customize each of them. We also look for your feedback and input on usage of the different packages. As a next step we chat about adopting Java 17 as the standard for Trino, and then we get a demo of a new feature of the web UI.
- Intro song: 00:00
- Intro: 00:32
- Releases: 4:22
- Concept of the episode: Packaging Trino: 21:28
- Additional topic of the episode: Modernizing Trino with Java 17: 46:49
- Pull request of the episode: Worker stats in the Web UI: 55:25
- Question of the episode: Is HDFS supported by the Delta Lake connector?: 1:01:52
- Demo of the episode: Tarball installation and new Web UI feature: 1:05:58
Show Notes: https://trino.io/episodes/35.html
Show Page: https://trino.io/broadcast/
Listen to the Data Eng podcast: https://www.dataengineeringpodcast.com/snuba-event-data-warehouse-episode-108/ (11 mins in)
https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/
Transcript
James Cunningham
Yeah, so I'd say, as far as all the decisions that we made in order to go to this new platform, one of the biggest leaders was that we had a big push for having environments be kind of like a first-class filtration. We had to build a new dimensionality of data across all this denormalized data, which essentially doubled the storage that we had. And then we said to ourselves: all this is great, this looks cool, environments are dope. But what happens when we want to add another dimension, and another dimension? We're just going to continue to, I guess, extrapolate across this data set and eventually end up with 100 terabytes of, you know, five different dimensions of data. So we said to ourselves that we kind of needed a flat event model that we'd be able to search across. And, you know, there are a few other pieces that we wanted. On top of that, we want to be able to search across these arbitrary fields, whether those are custom tags or something that we kind of promote, whether that is like releases or traces, or searching across messages. We didn't want that to take as long as it did. And some of the other parts is that we have all this data stored in, you know, this tagstore, and all these searches that we have to go through. But we have, on a completely different side, time series data that again had to have that dimensionality in it. If we search across these arbitrary fields, the next thing that a customer would ask for is: hey, can I please see a pretty graph? So if we could boil down that search and that time series data into the same system, we'd be destroying two systems with one rewrite.
Ted Kaemming
And also, as part of that process, I mean, you kind of always have the standard checkpoints, you know: replication and durability is obviously really important for us, ease of maintenance is huge, low cost as well for us. So even that just kind of ruled out some of the hosted magic storage solutions, those kinds of pressures.
Tobias Macey
And as you were deciding how to architect this new system, can you talk through some of the initial list of possible components that you were evaluating, and what the process was for determining whether something was going to stay or go in the final cut?
James Cunningham
Yeah, of course. Um, so the first thing that we kind of crossed off was row orientation; Postgres, to serve us well, probably wouldn't work. You know, we hoped that we could engineer a good solution on top of it, but ultimately we decided we probably needed a different shape of database to get the query across. We kind of had, like, five major options. We had document stores, you know; we had some sort of Google proprietary blend, because we are completely on GCP; we had, you know, more generic distributed query stuff, a little bit of Spark, maybe a little bit of Presto; we took a look at other distributed databases; I ran a good amount of Cassandra at my old gig, so I know how to run that. And we also said, like: oh hey, we could just put data down on disk ourselves and not have to worry about this.
Some of the other seriously considered things that we had: it was a columnar store, some of these other ones that we actually kicked the tires on; we kicked the tires on Pinot and Druid. And ultimately, we found ClickHouse as a columnar store, and we kind of just started running it. And it was one of the easiest ones to kick the tires on. Some of these other, I guess, you know, columnar stores built on top of distributed file systems, it really did take a good amount of bricks to put down in order to get to your first query. And some of the things that we wanted was figuring out operational costs on that. We wanted to be able to iterate across queries; we wanted to be able to kind of pare down all the dependencies that the service had. You know, while we weren't afraid to run a few JVMs, or to run, you know, a little bit of HDFS, that was something that realistically I might not want to have, you know, an entire engineer dedicated to running. And on the antithesis of that, you know, we could choose some of this Google proprietary blend, but how would it feel to go from having Sentry only require Redis and Postgres to now saying you can only run the new version on Google? Yeah, a little bit silly. So we ended up really just getting through an MVP of, I think, both Kudu and ClickHouse, and one of the biggest ones that really did kick us, and for anyone listening, go ahead and correct me if I'm wrong, but one of my memories was that one of our engineers, you know, started loading data into Kudu, and you didn't really know when it was there. It was great for, you know, being able to crunch down on your numbers, but one of our biggest things, that you did kind of hint at, is that we do need real-time data, and to be able to write into this data store and then to be able to read it on a consistent basis. One of the things we needed: we have a feature called alert rules, where you say, hey, only tell me if any event with a given tag and value got in, maybe, like, 10 events in the last hour. And you want to be able to read that pretty quickly, so that when that 10th event comes in, you're not waiting minutes until that alert shows up, and ClickHouse is able to do that. And so that kind of just got its way up to number one.
Ted Kaemming
Yeah, I think also, in general, at Sentry we try and kind of bias a little bit towards relatively simple solutions. And it seemed like ClickHouse, there was, at least to us, based on our backgrounds, it seemed more straightforward to get running. And I think that as well appealed to us quite a bit. The documentation is pretty solid. It's also open source; you know, a lot of projects are, but ClickHouse has a pretty active repository. They've been very responsive when we've had questions or issues, and they're very public about their development plan. So I think a lot of these things just kind of worked out in its favor.
Tobias Macey
Yeah, it's definitely, from what I've been able to understand, a fairly new entrant into the overall database and data storage market. But I've heard a few different stories of people using it in fairly high-load environments. So I've heard about the work that you're doing with Snuba, and as far as I understand, Cloudflare is also using it for some of their use cases, and they definitely operate at some pretty massive scale with high data volume.
So it seems like a pretty impressive system that has a lot of different capabilities, and I was pretty impressed when I had some of the folks from Altinity on the podcast a while ago to talk about their experience of working on it and working with some of their clients on getting it deployed. And I'm curious what some of the other types of systems you were able to replace with ClickHouse were, given that, as you said, you had these four different systems that you had to be able to replicate event data to. Were you able to collapse them all down into this one storage engine?
Ted Kaemming
Yeah. So in our code base, those four different things, the TSDB, search, tagstore, and nodestore, all have kind of abstract service interfaces that really just sort of evolved from the fact that it's an open source project; people wanted to use these different backends for it. Three of those now are backed by the same data set in ClickHouse. So all the TSDB data comes directly out of ClickHouse; there's no pre-aggregation that happens anymore. We're just ripping over individual rows, computing those aggregates on demand, at least for now. Search: some of the data for search still lives in Postgres, but a lot of it now just comes from the event data in ClickHouse, essentially. Tagstore: we've removed... how many servers were we using for tags?
James Cunningham
We had, oh goodness, like 12, on highmem-32s, 32 cores and maybe 200-odd gigs each. But, you know, getting into some of these other stats that we have a little bit further down the list: we went from 52 terabytes of SSD to two terabytes. Which is a good number to break down from. Yeah.
Ted Kaemming
So we were able to, absolutely, yeah, we were able to decommission an entire Redis cluster, like, cluster in quotes, and this entire Postgres cluster, with drastically less hardware. And yeah, just the fact that it all reads from the same ClickHouse cluster, and there's none of this weird replication lag between all these systems, that's a huge positive.
Tobias Macey
Can you talk a bit more about the overall architecture of Snuba itself, and just some of the operational characteristics and experience that you've had in terms of ClickHouse itself, and maybe some of the early pain points and sharp edges that you ran into as you were getting used to this new system?
Ted Kaemming
Yeah, sure. So I guess just to give you kind of a brief overview of the architecture, because it's something that's really not particularly fancy: Snuba is just a relatively small Flask application, at least small when you compare it with the remainder of Sentry. So it's a, yeah, it's a Flask application and it just speaks HTTP. It's in Python. It's generally stateless. Writes, as they come in, go through a Kafka topic, published directly from the remainder of the Sentry codebase. The Sentry codebase and the Snuba codebase are actually completely independent, at least as far as the Git repositories go. So Sentry writes into this Kafka topic, the Snuba consumer picks them up, does some denormalization, some data munging, you know, kind of conventional Kafka consumer stuff, and writes large batches of events to ClickHouse. We don't use the ClickHouse Kafka engine or anything particularly special for that; we just use the confluent Kafka driver from Confluent, which is librdkafka based. And that's all in Python. Reads, similarly, come in over HTTP.
Not anything particularly fancy there either. We have some various optimizations that we do: kind of just a general query cache, and deduplication of queries, so that we don't have large queries with long run times executing concurrently on the cluster. We do some optimizations where we move some stuff from the WHERE clause in ClickHouse SQL to a PREWHERE clause, which is basically the closest thing you get to any sort of query optimization. And we do some other query rewriting stuff based on our domain model. There are other rate limits and quality-of-service metrics and logging type stuff that happens in there as well. As long as that all goes well, a response is returned to the caller with something that is almost identical to what you would get if you were just interacting with the HTTP interface of ClickHouse itself. If it doesn't go well, that ends up getting logged to Sentry, and we then kind of enter the system again to go look at it. So that's kind of a brief overview. It's nothing particularly fancy.
Tobias Macey
Yeah, sometimes simple is best, particularly when you're dealing with something that is as critical path as this.
James Cunningham
Yeah, for sure. Yeah, so to talk a little bit about the early engineering that you might have alluded to: I'd say one of our biggest early difficulties was that we had put a lot of eggs in the Postgres basket. So we turned this on and, you know, the queries that we had set up for a row-oriented database were just absolutely not meant for a columnar store, which is a crazy thing to say.
Ted Kaemming
It's so easy to type select star.
James Cunningham
So easy. Spelling is hard. But, you know, there are some things that just absolutely did not cut over to this columnar store that we kind of had to redesign in every query. You know, Sentry kind of had a quick application of: order by some arbitrary column and then limit 1000, to be able to explicitly hit a B-tree index in Postgres. And that didn't matter in ClickHouse; any sort of limit just kind of truncated what rows you were returning, and if you applied an order by, that would have taken your entire data set and ordered it. Among many other things, we had a lot of select stars everywhere, like Ted said, and that is honestly one of the worst ways to operate on a columnar store, because you're just reading from every literal file. So we changed that a little bit. Some of the other things that we kind of had: we didn't have a query planner, so there was a lot of taking a query and just kind of moving pieces around. One of the things that Ted alluded to was the notion of a PREWHERE: when you have multiple columns that you want to filter on in a WHERE clause, you kind of have the ability to give ClickHouse a little bit of a heuristic and say, this is the column that we believe has the highest selectivity. You put it in a PREWHERE clause, it will read through that column first and, you know, decide which block IDs it's going to read from for the rest of them. So if you have something along the lines of an event ID, which for us is globally unique, that might have a little bit higher selectivity than environment, or, you know, a release might have a little bit higher selectivity. So we were kind of working around these edges by just swapping variables around and saying: well, did that make it faster?
And then we said, yes! And we kind of threw some high fives around.
Ted Kaemming
Yeah. There was also just the integration into some of the query patterns we have in Sentry, which was a bit of a challenge. ClickHouse is really designed to do particularly well with inserts. It does not do particularly well with updates or deletes, to the point where they aren't actually syntactically valid in the ClickHouse-flavored SQL. Sentry as a whole is particularly insert heavy, but it's not insert only, and so we had to kind of work around, basically, the fact that ClickHouse is extremely oriented towards inserts. We kind of ended up with something that, actually, James mentioned he worked on Cassandra in a past life, I did as well, we ended up with an architecture that is fairly similar to Cassandra tombstones for how we delete data, where we kind of implement our own last-write-wins semantics on top of the ReplacingMergeTree in ClickHouse. There's a long blog post about how we do that as part of this field guide series that we've been working on, where we go into some of these weird things that we do with ClickHouse. Similarly, for things like those alerts that James mentioned earlier, we basically require sequential consistency to be able to execute those queries effectively. That becomes a problem when you're dealing with multi-master replication, like ClickHouse does. So we ended up having to do some kind of dodgy load balancing stuff, where we don't have a literal primary for all writes, but we kind of have this ad hoc primary that all writes go to as long as it is up. And some subset of queries are only allowed to evaluate on that primary. It's not guaranteed sequential consistency in a true distributed systems sense, but it's good enough for what we need. It's also particularly complicated because the system doing the querying is not Snuba; it lives in the Sentry codebase. And so we basically need to be able to notify the Sentry codebase that these rows have been written to ClickHouse from Snuba. As part of this, we ended up having to engineer this solution where we have a commit log coming out of the Snuba Kafka consumer, and the Sentry application is actually subscribed to that Kafka topic, the commit log Kafka topic, gating its own progress based on the progress of the Snuba writer. There's also a blog post on the Sentry blog that goes into more depth about how we specifically implemented that, as part of this field guide series. But yeah, things like that: we knew things like the mutations were going to be something that we had to manage, but we didn't particularly have a strategy around it, and the sequential consistency stuff probably caught us a little bit more by surprise than it should have, as we were doing some of our kind of integration testing in production with this and noticed that some of the queries weren't exactly returning what we thought they would have. So that was something we also had to solve.
Tobias Macey
And you mentioned that one of the reasons that you ended up going further forward with ClickHouse than any of the other systems is that it was pretty easy to get up and running with, and seemed fairly simple operationally. So I'm curious what you have found to be the case now that you're actually using it in production and putting it under heavier load in a clustered environment, and any sort of useful lessons that you've learned in the process
that you think anybody else who is evaluating ClickHouse needs to know about?
James Cunningham
Absolutely. So this is my time to shine. One of the things that I kind of had to make a concession on is that I'd never worked with a database that could possibly be bound by CPU. It's always been, you know: make sure that your disks are as fast as possible; the data is on the disks, you've got to read from the disks. And the reason that it very well could be bound by CPU is compression. You know, I'd seen compression in the past, but I didn't really understand what compression could actually give you until we turned ClickHouse on. Compression realistically brings our entire data set, you know, we kind of alluded to it earlier, from 52 terabytes of data to two terabytes, and about 800 gigs of those are surprisingly uncompressible because they're unique, you know, 32-character strings. If anyone can tell me an algorithm that helps compress that, I think we made a TV series around that or something like that. But for the rest of the data, it's so well compressed that being able to actually compute across it does so well. You know, we run a small number of servers to serve what is a large data set. If there was any advice to anyone out there: start by sharding, and never, never shard by two, because two is a cursed number in terms of distributed systems. We really just started with, you know, three shards, three replicas. And with that blessed number of nine, we haven't gone up yet. We kind of have a high watermark of a terabyte per machine. Google gives you a certain amount of read and write off that disk based on how much storage you have, and we've kind of unlocked a certain level at one terabyte per machine, if anyone else is somehow running ClickHouse on GCP, I guess on GCP that is. You know, we're about to apply our fourth shard. But realistically, some of the other things that are operationally sound: as much as we'd all love to, I guess, hammer on or praise XML, it is very explicit about what you have to write in. It's configured via XML. There's no runtime configuration that you're applying; there's no, you know, magic distribution of writing into an options store and watching that cascade into a cluster... auto-scaling. Yeah, I'm not, you know, crunching on any Kubernetes pods or anything like that. One of the things I'd be remiss to not say is that you did mention Cloudflare is running ClickHouse, and shout out to Cloudflare, they run real hardware, and I'll never do that again in my life. But one of the things that they alluded to in one of their kick-ass blogs about ClickHouse is that it replicates so fast that they found it more performant, when a disk in a RAID 10 dies, to just wipe all the data, rebuild the disk essentially empty, and have ClickHouse refill it itself. It is crazy fast in terms of replication; since all that data is compressed, it really just sends it across the wire. Some of the other stuff that we found completely great operationally is that since it is CPU bound, it's mostly bound by reads. When you are a write-heavy company and you're now bound by reads, in terms of cost of goods sold, I can throw around a million high fives after that.
It's great to just watch, you know, people log in and actually look at their data and watch our graphs tick up, instead of just saying: well, we spent a lot of money on this, and people are only reading, you know, 1% of their data. One other piece that I'd be remiss to not answer is some niceties about ClickHouse that kind of separate it from a few of the databases I've worked with: the ability to set some very quick, either throttling or kind of turbo-charging, settings on the client side. So some of the things that we might do: if we know that a query is going to be expensive, we could, you know, sacrifice a little bit of resources and kind of get it back fast. There is just a literal setting that is max_threads, where I say, you know what, I really want this to run faster, set max_threads to eight instead of four. And it does exactly what it says it does; it'll run twice as fast if you have twice as many threads. So there are pretty easy things that we kind of run around in terms of operations. I think that, as far as a database goes, you know, one of the hardest things to do is just kind of read all of the settings to figure out what they do. But after you kind of get versed in it, you'll understand, you know, what applying a setting might do, or at what threshold you might set something. And it's not very magical; some of these settings realistically are for very explicit types of queries that you'd only supply from the client side if you really needed them. So I wouldn't go so far as to call it simple; the configuration is almost, like, dumb and straightforward. Very straightforward. Yeah.
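To make the PREWHERE and max_threads tricks from this conversation concrete, here is a minimal sketch using the clickhouse-driver Python client; the host, table, and column names are invented for illustration and are not Snuba's actual schema.

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client("localhost")  # host is a placeholder

# PREWHERE: ask ClickHouse to scan the most selective column first, so it
# can discard data blocks before reading the remaining columns at all.
rows = client.execute(
    """
    SELECT project_id, message
    FROM events
    PREWHERE event_id = %(event_id)s  -- globally unique: highest selectivity
    WHERE environment = %(env)s
    """,
    {"event_id": "a" * 32, "env": "production"},
)

# max_threads: a per-query knob; for a CPU-bound scan, doubling the thread
# count roughly doubles speed, at the cost of cluster resources.
count = client.execute(
    "SELECT count() FROM events WHERE timestamp > yesterday()",
    settings={"max_threads": 8},
)
print(count)
```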
Bradley earned a master's degree in Public Administration and a bachelor's degree in Human Development and Family Studies with a concentration in Youth Development from Kent State University. He is currently the Senior Director of J Kids and Special Projects at the Jewish Community Center of Greater Baltimore. In this episode, he discusses how he found the field of HDFS and his professional experiences to date. As is true for all interviewees on this podcast, Bradley's views are his own as a private citizen and do not reflect the views of his current, former, or future employers.
Mary earned a bachelor's degree in Human Development and Family Studies with a concentration in Family Life Education from Kent State University. She is currently the Public Relations Director for Plains Local School in Ohio. She is also the founder and president of the Josette Beddell Memorial Foundation. In this episode, she discusses how she found the field of HDFS and her professional experiences to date. As is true for all interviewees on this podcast, Mary's views are her own as a private citizen and do not reflect the views of her current, former, or future employers. Without further ado, here is her interview.
Rashawn earned a master's degree in Education in School Counseling from Malone University. He also earned a bachelor's degree in Human Development and Family Studies with a concentration in Youth Development from Kent State University. He is currently a Professional School Counselor for a public school in Ohio. In this episode, he discusses how he found the field of HDFS and his professional experiences to date. As is true for all interviewees on this podcast, Rashawn's views are his own as a private citizen and do not reflect the views of his current, former, or future employers.
Links to additional information:
Video about working as a school counselor: https://youtu.be/U1rrTmpD728
Video about elementary and middle school counselors: https://youtube.com/playlist?list=PLmVWg2oiNxJyZiewS7DPGbW3QWxZBBvFo
Rashawn discussing helping students during the pandemic: https://www.ideastream.org/news/kids-returning-to-school-after-isolation-with-mental-health-concerns
If you have recommendations for HDFS or other family science alumni to interview, please reach out to me at HDFSCareers.com. Don't worry if they are not working in a job that would normally be considered "in the field." I am interested in hearing a variety of stories--especially if they are working outside of academia.
About AB
AB Periasamy is the co-founder and CEO of MinIO, an open-source provider of high-performance object storage software. In addition to this role, AB is an active investor and advisor to a wide range of technology companies, from H2O.ai and Manetu, where he serves on the board, to advisor or investor roles with Humio, Isovalent, Starburst, Yugabyte, Tetrate, Postman, Storj, Procurify, and Helpshift. Successful exits include Gitter.im (Gitlab), Treasure Data (ARM), and Fastor (SMART).
AB co-founded Gluster in 2005 to commoditize scalable storage systems. As CTO, he was the primary architect and strategist for the development of the Gluster file system, a pioneer in software-defined storage. After the company was acquired by Red Hat in 2011, AB joined Red Hat's Office of the CTO. Prior to Gluster, AB was CTO of California Digital Corporation, where his work led to the scaling of commodity cluster computing to supercomputing-class performance. His work there resulted in the development of Lawrence Livermore Laboratory's "Thunder" cluster, which, at the time, was the second fastest in the world. AB holds a Computer Science Engineering degree from Annamalai University, Tamil Nadu, India.
AB is one of the leading proponents and thinkers on the subject of open-source software, articulating the difference between the philosophy and the business model. An active contributor to a number of open-source projects, he is a board member of India's Free Software Foundation.
Links:
MinIO: https://min.io/
Twitter: https://twitter.com/abperiasamy
MinIO Slack channel: https://minio.slack.com/join/shared_invite/zt-11qsphhj7-HpmNOaIh14LHGrmndrhocA
LinkedIn: https://www.linkedin.com/in/abperiasamy/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They've also gone in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That's S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.
Corey: This episode is sponsored in part by our friends at Rising Cloud, which I hadn't heard of before, but they're doing something vaguely interesting here. They are using AI, which is usually where my eyes glaze over and I lose attention, but they're using it to help developers be more efficient by reducing repetitive tasks. So, the idea being that you can run stateless things without having to worry about scaling, placement, et cetera, and the rest. They claim significant cost savings, and they're able to wind up taking what you're running as it is, in AWS, with no changes, and run it inside of their data centers that span multiple regions.
I'm somewhat skeptical, but their customers seem to really like them, so that's one of those areas where I really have a hard time being too snarky about it because when you solve a customer's problem, and they get out there in public and say, "We're solving a problem," it's very hard to snark about that. Multus Medical, Construx.ai, and Stax have seen significant results by using them, and it's worth exploring. So, if you're looking for a smarter, faster, cheaper alternative to EC2, Lambda, or Batch, consider checking them out. Visit risingcloud.com/benefits. That's risingcloud.com/benefits, and be sure to tell them that I sent you, because watching people wince when you mention my name is one of the guilty pleasures of listening to this podcast.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by someone who's doing something a bit off the beaten path when we talk about cloud. I've often said that S3 is sort of a modern wonder of the world. It was the first AWS service brought into general availability. Today's promoted guest is the co-founder and CEO of MinIO, Anand Babu Periasamy, or AB as he often goes, depending upon who's talking to him. Thank you so much for taking the time to speak with me today.
AB: It's wonderful to be here, Corey. Thank you for having me.
Corey: So, I want to start with the obvious thing, where you take a look at what is the cloud and you can talk about AWS's ridiculous high-level managed services, like Amazon Chime. Great, we all see how that plays out. And those are the higher-level offerings, ideally aimed at problems customers have, but then they also have the baseline building-block services, and it's hard to think of a more baseline building block than an object store. That's something every cloud provider has, regardless of how many scare quotes there are around the word cloud; everyone offers the object store. And your solution is to look at this and say, "Ah, that's a market ripe for disruption. We're going to build, through an open-source community, software that emulates an object store." I would be sitting here, more or less poking fun at the idea, except for the fact that you're a billion-dollar company now.
AB: Yeah.
Corey: How did you get here?
AB: So, when we started, we did not actually think about cloud that way, right? "Cloud, it's a hot trend; let's go disrupt it. It will lead to a lot of opportunity." Certainly, it's true, it led to the M&A, right, but that's not how we looked at it. It's a bad idea to build startups for M&A.
When we looked at the problem, when we got back into this—my previous background, some may not know that it's actually a distributed file system background in the open-source space.
Corey: Yeah, you were one of the co-founders of Gluster—
AB: Yeah.
Corey: —which I have only begrudgingly forgiven you. But please continue.
AB: [laugh]. And back then we got the idea right, but the timing was wrong. And I had—while the data was beginning to grow at a crazy rate, end of the day, GlusterFS has to still look like an FS, it has to look like a file system like NetApp or EMC, and it was hugely limiting what we could do with it. The biggest problem for me was legacy systems. If I have to build a modern system that is compatible with a legacy architecture, I cannot innovate. And that is where Amazon introduced S3. Back then, when S3 came, cloud was not big at all, right?
When I look at it, the most important message of the cloud was that Amazon basically threw away everything that is legacy. It's not [iSCSI 00:03:21] as a Service; it's not even FTP as a Service, right? They came up with a simple, RESTful API to store your blobs, whether it's a JavaScript, Android, iOS, or [AAML 00:03:30] application, or even a Snowflake-type application.
Corey: Oh, we spent ten years rewriting our apps to speak object store, and then they released EFS, which is NFS in the cloud. It's—
AB: Yeah.
Corey: —I didn't realize I could have just been stubborn and waited, and the whole problem would solve itself. But here we are. You're quite right.
AB: Yeah. And even EFS and EBS are more for legacy stuff to come in, buy some time, but that's not how you should stay on AWS, right? When Amazon did that, for me, that was the opportunity. I saw that… while the world is going to continue to produce lots and lots of data, if I built a brand around that, I'm not going to go wrong.
The problem is data at scale. And what do I do there? The opportunity I saw was, Amazon solved one of the largest problems for a long time. All the legacy systems, legacy protocols: they convinced the industry to throw them away and then start all over from scratch with the new API. While it's not compatible, it's not standard, it is ridiculously simple compared to anything else. No fstabs, no [unintelligible 00:04:27], no [root 00:04:28], nothing, right? Being able to access it from any application, anywhere, was a big deal. When I saw that, I was like, "Thank you, Amazon." And I also knew Amazon would convince the industry that rewriting their application is going to be better and faster and cheaper than retrofitting legacy applications.
Corey: I wonder how much that's retconned because, talking to some of the people involved in the early days, they were not at all convinced they [laugh] would be able to convince the industry to do this.
AB: Actually, if you talk to the analysts, the IDCs and Gartners of the world, to the enterprise IT, the VMware community, they would say, "Hell no." But if you talk to the actual application developers, data infrastructure, data architects, the actual consumers of data, for them, it was so obvious. They actually did not know how to write an fstab. iSCSI and NFS you can't even access across the internet, and the modern applications ran across the globe, in JavaScript, and all kinds of apps on the device. From [Snap 00:05:21] to Snowflake, today is built on object store. It was more natural for the applications team, but not for the infrastructure team. So, who you asked mattered.
But nevertheless, Amazon convinced the rest of the world, and our bet was that if this is going to be the future, then this is also our opportunity. S3 is going to be limited because it only runs inside AWS. The bulk of the world's data is produced everywhere, and only a tiny fraction will go to AWS. And where will the rest of the data go? Not SAN, NAS, HDFS, or other blob stores, Azure Blob, or GCS; it's not going to be fragmented. And if we built a better object store, lightweight, faster, simpler, but fully compatible with the S3 API, we could sweep and consolidate the market. And that's what happened.
Corey: And there is a lot of validity to that. We take a look across the industry, when we look at various standards—I mean, one of the big problems with multi-cloud in many respects is the APIs are not quite similar enough.
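To make the "simple RESTful API" point concrete, the blob model AB describes really does reduce to a PUT and a GET. A minimal sketch using boto3; the bucket name is a placeholder, and credentials are assumed to come from the environment:

```python
# Sketch: the entire blob-storage model in two calls.
import boto3

s3 = boto3.client("s3")

s3.put_object(Bucket="example-bucket", Key="hello.txt", Body=b"hello, blob world")
obj = s3.get_object(Bucket="example-bucket", Key="hello.txt")
print(obj["Body"].read().decode())  # -> hello, blob world
```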
And worse, the failure patterns are very different: I don't just need to know how the load balancer works, I need to know how it breaks so I can detect and plan for that. And then you've got the whole identity problem as well, where you're trying to manage across different frames of reference as you go between providers, and it leads to a bit of a mess. What is it about MinIO that has not just endured since it was created, but clearly thrived?
AB: The real reason is actually not the multi-cloud compatibility and all that, right? While today it is a big deal for the users, because the deployments have grown into 10-plus petabytes and now the infrastructure team is taking it over and consolidating across the enterprise, now they are talking about which key management server to use for storing the encryption keys, which key management server should I talk to? Look at AWS, Google, or Azure: everyone has their own proprietary API. Outside, they have [YAML2 00:07:18], HashiCorp Vault, and, like, there is no standard here. It is supposed to be a [KMIP 00:07:23] standard, but in reality, it is not. Even between different versions of Vault, there are incompatibilities for us.
That is where, from key management server to identity management server, everything in the ecosystem you have to speak to, MinIO provides connectors for; having the large ecosystem support and large community, we are able to address all that. Once you bring MinIO into your application stack, like you would bring Elasticsearch or MongoDB or anything else as a container, your application stack is just a Kubernetes YAML file, and you roll it out on any cloud; it becomes easier for them, they're able to go to any cloud they want. But the real reason why it succeeded was not that. They actually wrote their applications as containers on Minikube, then they would push it to a CI/CD environment.
They never wrote code on EC2 or ECS writing objects on S3, and they don't like the idea of [past 00:08:15], where someone is telling you just—like you saw, Google App Engine never took off, right? They liked the idea: here are my building blocks, and I would stitch them together and build my application. We were part of their application development since the early days, and when the application matured, it was hard to remove. It is very much like Microsoft Windows when it grew: even though the desktop was Microsoft Windows, the server was NetWare, and NetWare lost the game, right?
We got the ecosystem, and it was actually developer productivity, convenience, that really helped. The simplicity of MinIO: today, they are arguing that deploying MinIO inside AWS is easier through their YAML and containers than going to the AWS Console and figuring out how to do it.
Corey: As you take a look at how customers are adopting this, it's clear that there is some shift in this because I could see the story for something like MinIO making an awful lot of sense in a data center environment because otherwise, it's, "Great. I need to make this app work with my SAN as well as an object store." And that's sort of a non-starter for obvious reasons. But now you're available through cloud marketplaces directly.
AB: Yeah.
Corey: How are you seeing adoption patterns and interactions from customers changing as the industry continues to evolve?
AB: Yeah, actually, that is how my thinking was when I started. If you are inside AWS, I would myself tell them, why don't you use AWS S3?
And it made a lot of sense if it's on a colo or your own infrastructure; then there is an object store. It even made a lot of sense if you are deploying on Google Cloud, Azure, Alibaba Cloud, or Oracle Cloud, because you wanted an S3-compatible object store. Inside AWS, why would you do it, if there is AWS S3?
Nowadays, I hear funny arguments, too. They're like, "Oh, I didn't know that I could use S3. Is S3 MinIO-compatible?" Because it came along with GitLab or GitHub Enterprise as a part of the application stack. They didn't even know that they could actually switch it over. And otherwise, most of the time, they developed it on MinIO and now they are too lazy to switch over. That also happens. But the real reason why it became serious for me—I ignored the public cloud commercialization; I encouraged the community adoption. And it grew to more than a million instances across the cloud, small and large, but when they started talking about paying us serious dollars, then I took it seriously. And when I started asking them why they would do it, I got to know the real reason: they want to be detached from the cloud infrastructure provider.
They want to look at cloud as CPU, network, and drive as a service. And running their own enterprise IT was more expensive than adopting public cloud; it was productivity for them, and reducing the infrastructure and people cost was a lot. It made economic sense.
Corey: Oh, people always cost more than the infrastructure itself does.
AB: Exactly right. 70, 80%, like, goes into people, right? And enterprise IT is too slow. They cannot innovate fast, and all of those problems. But what I found was, for us, while we actually built the community and customers, if you're on AWS and you're running MinIO on EBS, EBS is three times more expensive than S3.
Corey: Or a single copy of it, too, where if you're trying to go multi-AZ and you have the replication traffic, and not to mention you have to over-provision it, which is a bit of a different story as well. So, like, it winds up being something on the order of 30 times more expensive, in many cases, to do it right. So, I'm looking at this going, the economics of running this purely by itself in AWS don't make sense to me—long experience teaches me the next question of, "What am I missing?" Not, "That's ridiculous and you're doing it wrong." There's clearly something I'm not getting. What am I missing?
AB: I was telling them that until we made some changes, right—because we saw a couple of things happen. I was initially like, [unintelligible 00:12:00] does not make 30 copies. It makes, like, 1.4x, 1.6x. But still, the underlying block storage is not only three times more expensive than S3, it's also slow. It's network storage. Trying to put an object store on top of it, another software-defined SAN like EBS, made no sense to me. For smaller deployments it's okay, but you should never scale that on EBS. So, it did not make economic sense. I would never take it seriously because it would never help them grow to scale.
But what changed in recent times? Amazon saw that this was not only a problem for MinIO-type players. Every modern database out there today, even the message queues like Kafka, they have all gone scale-out. And they all depend on local block store, and putting a scale-out distributed database or data processing engine on top of EBS would not scale. And Amazon introduced storage-optimized instances.
Essentially, that removed the step of the data infrastructure guy, data engineer, or application developer asking IT, "I want a SuperMicro or Dell server, or even virtual machines." That's too slow, too inefficient. They can provision these storage machines on demand, and then do it through Kubernetes. These two changes: all the public cloud players have now adopted Kubernetes as the standard, and they have to stick to the Kubernetes API standard; if they are incompatible, they won't get adopted. And storage-optimized, that is, local-drive machines, like [I3 EN 00:13:23], like, 24 drives, they have SSDs and fast network, like 25-gigabit, 200-gigabit-type network. The availability of these machines, like what typically would run any database, HDFS cluster, or MinIO, all of them: those machines are now available just like any other EC2 instance.
They are efficient. You can actually put MinIO side by side with S3 and still be price-competitive. And Amazon wants to—like, just like their retail marketplace, they want to compete and be open. They have enabled it. In that sense, Amazon is actually helping us. And it turned out that now I can help customers build multi-petabyte infrastructure on Amazon and still stay efficient, still stay price-competitive.
Corey: I would have said for a long time that if you were to ask me to build out the lingua franca of all the different cloud providers into a common API, the S3 API would be one of them. Now, you are building this out, multi-cloud, you're in all three of the major cloud marketplaces, and the way that you do those deployments seems like it is the modern multi-cloud API of Kubernetes. When you first started building this, Kubernetes was very early on. What was the evolution of getting there? Or were you one of the early-adoption customers in the Kubernetes space?
AB: So, when we started, there was no Kubernetes. But we saw the problem very clearly. And there were containers, and then came Docker Compose and Swarm. Then there was Mesos, Cloud Foundry, you name it, right? Like, there were many solutions, all the way up to even VMware trying to get into that space.
And what did we do? Early on, I couldn't choose. It's not in our hands, right, who is going to be the winner, so we just simply embraced everybody. It was also tiring to implement native connectors to all the different orchestrators; Pivotal Cloud Foundry alone has its own standard, the Open Service Broker, that's only popular inside their system. Go outside, and everybody was incompatible. And beyond that, even Chef, Ansible, Puppet scripts, too. We just simply embraced everybody until the dust settled down. When it settled down, clearly the declarative model of Kubernetes became easier. Also, the Kubernetes developers understood the community well. And coming from Borg, I think they understood the right architecture. And it's written in Go, unlike Java, right?
It actually matters; these minute details resonate with the infrastructure community. It took off, and that helped us immensely. Now, not only is Kubernetes popular, it has become the standard, from VMware to OpenShift to all the public cloud providers, GKS, AKS, EKS, whatever, right—GKE. All of them now are basically Kubernetes standard. It made not only our life easier, it made life easier for every other [ISV 00:16:11] and open-source project; everybody can now finally write one codebase that can be operated portably.
It is a big shift.
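That portability argument is concrete at the storage-API level, too: because MinIO speaks the S3 API, the only thing that changes between AWS S3 and a self-hosted deployment is the client's endpoint. A hedged sketch with boto3; the endpoint URL and credentials are hypothetical placeholders:

```python
# Sketch: same application code, different S3-compatible endpoint.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.internal:9000",  # hypothetical self-hosted MinIO
    aws_access_key_id="EXAMPLEKEY",             # placeholder credentials
    aws_secret_access_key="EXAMPLESECRET",
)
# From here, the calls are identical to AWS S3:
# s3.put_object(...), s3.get_object(...), s3.list_objects_v2(...)
```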
It is not because we chose; we just watched all this, and we were riding along the way. And then we resonated with the infrastructure community; modern infrastructure is dominated by open-source. We were also the leading open-source object store, and as the Kubernetes community adopted us, we were naturally embraced by the community.
Corey: Back when AWS first launched with S3 as its first offering, there were a bunch of folks who were super excited, but object stores didn't make a lot of sense to them intrinsically, so they looked into this and said, "Ah, I can build a file system in userspace on top of S3." And the reaction was, "Holy God, don't do that." And the way that AWS decided to discourage that behavior is a per-request charge, which for most workloads is fine, whatever, but there are some where it causes a significant burden. With running something like MinIO in a self-hosted way, suddenly that costing doesn't exist in the same way. Does that open the door again to, so now I can use it as a file system again, in which case that just seems like using the local file system, only with extra steps?
AB: Yeah.
Corey: Do you see patterns that are emerging with customers' use of MinIO that you would not see with the quote-unquote, "provider's" quote-unquote, "native" object storage option, or do the patterns mostly look the same?
AB: Yeah, if you took an application that ran on file and block and brought it over to object storage, that makes sense. But something that is competing with object store or a layer below object store—at the end of the day, the drives are block devices, you have a block interface, right—trying to bring SAN or NAS on top of object store is actually a step backwards. They completely missed the message Amazon sent: if you bring a file system interface on top of object store, you've missed the point, because you are now bringing back the legacy things that Amazon intentionally removed from the infrastructure. Trying to bring them on top doesn't make it any better. If you are arguing for compatibility with some legacy applications, sure, but writing a file system on top of object store will never be better than NetApp, EMC, like EMC Isilon, or anything else. Or even GlusterFS, right?
But if you want a file system? The community asks us, "Why don't you add an FS option and do a multi-protocol system?" I tell them that the whole point of S3 is to remove all those legacy APIs. If I added POSIX, then I'd be a mediocre object storage and a terrible file system. I would never do that. But why not write a FUSE file system, right? Like, S3Fs is there.
In fact, initially, for legacy compatibility, we wrote MinFS, and I had to hide it. We actually archived the repository because immediately people started using it. Even simple things like, at the end of the day, can I use Unix [Coreutils 00:19:03] like [cp, ls 00:19:04], all these tools I'm familiar with? If it's not a file system, object storage tooling like the S3 [CMD 00:19:08] or AWS CLI is, like, bloatware. And it's not really a Unix-like feeling.
Then what I told them: "I'll give you a BusyBox-like single static binary, and it will give you all the Unix tools that work for the local filesystem as well as object store." That's where the [MC tool 00:19:23] came in; it gives you all the Unix-like programmability, all the core tools, and it's object-storage compatible, speaks native object store.
But if I have to make object store look like a file system so Unix tools would run, it would not only be inefficient; Unix tools never scaled for this kind of capacity. So, it would be a bad idea to take a step backwards and bring legacy stuff back inside. For some very small cases, simple POSIX calls using [ObjectiveFs 00:19:49], S3Fs, and a few others make sense for legacy compatibility reasons, but in general, I would tell the community: don't bring file and block. If you want file and block, leave those on virtual machines, leave that infrastructure in a silo, and gradually phase them out.
Corey: This episode is sponsored in part by our friends at Vultr. Spelled V-U-L-T-R because they're all about helping save money, including on things like, you know, vowels. So, what they do is they are a cloud provider that provides surprisingly high-performance cloud compute at a price that—while sure they claim it's better than AWS pricing—and when they say that, they mean it is less money. Sure, I don't dispute that, but what I find interesting is that it's predictable. They tell you in advance on a monthly basis what it's going to cost. They have a bunch of advanced networking features. They have nineteen global locations and scale things elastically. Not to be confused with openly, because apparently elastic and open can mean the same thing sometimes. They have had over a million users. Deployments take less than sixty seconds across twelve pre-selected operating systems. Or, if you're one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vultr cloud compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something to scale all on their own. Try Vultr today for free by visiting vultr.com/screaming, and you'll receive $100 in credit. That's v-u-l-t-r.com slash screaming.
Corey: So, my big problem, when I look at what S3 has done, is in its name because, of course, naming is hard. It's "Simple Storage Service." The problem I have is with the word simple because, over time, S3 has gotten more and more complex under the hood. It automatically tiers data the way that customers want. And integrated with things like Athena, you can now query it directly; whenever an object appears, you can wind up automatically firing off Lambda functions and the rest.
And this is increasingly looking a lot less like a place to just dump my unstructured data, and increasingly a lot like this is, in some respects, sort of a database. Now, understand my favorite database is Route 53; I have a long and storied history of misusing services as databases. Is this one of those scenarios, or is there some legitimacy to the idea of turning this into a database?
AB: Actually, there is now an S3 Select API: if you're storing unstructured data like CSV, JSON, or Parquet, without downloading even a compressed CSV, you can actually send a SQL query into the system. In MinIO particularly, S3 Select is [SIMD 00:21:16]-optimized. We can load, like, every 64k worth of CSV lines into registers and do SIMD operations. It's the fastest SQL filter out there. Now, bringing these kinds of capabilities, we are just a little bit away from a database; should we do a database? I would say definitely no.
The very strength of the S3 API is to actually limit all the mutations, right?
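The S3 Select API referred to here is a standard part of the S3 API surface: a SQL filter is pushed down to the store so that only matching rows come back over the wire. A minimal sketch with boto3; the bucket, key, and column names are hypothetical:

```python
# Sketch: push a SQL filter down to the object store with S3 Select.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="example-bucket",
    Key="events.csv",
    ExpressionType="SQL",
    Expression="SELECT s.user_id, s.status FROM s3object s WHERE s.status = 'error'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
# The response is an event stream; Records events carry the filtered bytes.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")
```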
Particularly if you look at databases, they're dealing with metadata and querying; the biggest value they bring is indexing the metadata. But if I'm dealing with that, then I'm dealing with really small blocks and lots of mutations. Object storage should be dealing with persistence, not mutations. Mutations are [AWS 00:21:57] problem. The separation of the database's work function and the persistence function is where object storage got it right.
Otherwise, they will make the mistake of doing POSIX-like behavior, and then not only bringing back all those capabilities, but doing IOPS-intensive workloads across HTTP; it wouldn't make sense, right? So, object storage got the API right. But should it be a database? It definitely should not be a database. In fact, I actually hate the idea of Amazon yielding to the file system developers and giving a [file three 00:22:29] hierarchical namespace so they can write nice file managers.
That was a terrible idea. A hierarchical namespace that's also sorted now puts a tax on how the metadata is indexed and organized. Amazon should have left the core API very simple and told them to solve these problems outside the object store. Many application developers don't need it. Amazon was trying to satisfy everybody's needs. Saying no to some of these file-system-type, file-manager-type users would have been the right way.
But nevertheless, adding those capabilities, eventually—now you can see, S3 is no longer simple. And we have to keep that compatibility, and I hate that part. I actually don't mind compatibility, but the wrong things that Amazon is adding, now I have to add because it's compatible. I kind of hate that, right?
But going to a database would be pushing it to a whole new level. Here is the simple reason why that's a bad idea. The right way to do database—in fact, the database industry is already going in the right direction. Unstructured data, key-value, graph, different types of data: you cannot possibly solve all that even in a single database. They are trying to be multimodal databases; even they are struggling with it.
You can never be a Redis, Cassandra, and SQL all-in-one. They try to say that, but in reality, you will never be better than any one of those focused database solutions out there. Trying to bring that into object store would be a mistake. Instead, let the databases focus on query language implementation and query computation, and leave the persistence to object store. So, object store can still focus on storing your database segments, the table segments, but the index is still in the memory of the database.
Even the index can be snapshotted once in a while to object store, but using object store for persistence and the database for query is the right architecture. And almost all the modern databases have now gone that route, from Elasticsearch to [unintelligible 00:24:21] to even Kafka, like, message queues. Even Microsoft SQL Server, Teradata, Vertica, name it, Splunk, they all have gone the object storage route, too. Snowflake itself is a prime example, BigQuery and all of them.
That's the right way. Databases can never be consolidated. There will be many different kinds of databases. Let them specialize on GraphQL or Graph API, or key-value, or SQL. Let them handle the indexing; for persistence, they cannot handle petabytes of data.
That [unintelligible 00:24:51] to object store is how the industry is shaping up, and it is going in the right direction.
Corey: One of the ways I learn the most about various services is by talking to customers. Every time I think I've seen something and think, this is amazing, this service is something I completely understand, all I have to do is talk to one more customer. And when I was doing a bill analysis project a couple of years ago, I looked into a customer's account and saw a bucket with, okay, that has 280 billion objects in it—and wait, was that billion with a B?
And I asked them, "So, what's going on over there?" And it was, "Well, we built our own columnar database on top of S3. This may not have been the best approach." It's, "I'm going to stop you there. With no further context, it was not, but please continue."
It's the sort of thing that would never have occurred to me to even try. Do you tend to see similar—I would say they're anti-patterns, except somehow they're made to work—in some of your customer environments, as they are using the service in ways that are very different from the ways encouraged or even allowed by the native object store options?
AB: Yeah, when I first started seeing the database-type workloads coming on to MinIO, I was surprised, too. That was exactly my reaction. In fact, they were storing these 256k, sometimes 64k, table segments because they need to index them, right, and the table segments were anywhere between 64k and 2MB. And when they started writing table segments, it was more often an [IOPS-type 00:26:22] I/O pattern than a throughput-type pattern. Throughput is an easier problem to solve, and MinIO always saturated these 100-gigabyte NVMe-type drives; they were I/O intensive, throughput optimized.
When I started seeing the database workloads, I had to optimize for small-object workloads, too. We actually did all that because eventually I got convinced that the right way to build a database was to actually leave the persistence out of the database; they made a compelling argument. Historically, I thought of metadata and data separately: data tends to be very big, so for it to come to object store makes sense; metadata should be stored in a database, and that's only the index pages. Take any book: the index pages are only a few, and the database can continue to run adjacent to the object store. It's a clean architecture.
But why would you put the database itself on object store? When I saw a transactional database like MySQL changing the [InnoDB 00:27:14] to [RocksDB 00:27:15], and making changes at that layer to write the SS tables [unintelligible 00:27:19] to MinIO, I was like, where do you store the memory, the journal? They said, "That will go to Kafka." And I thought that was insane when it started. But it continued to grow and grow.
Nowadays, I see most of the databases have gone to object store, but their argument is, the databases also saw explosive growth in data. And they couldn't scale the persistence part. That is where they realized that they were still very good at the indexing part, which object storage would never give. There is no API to do sophisticated queries of the data. You cannot peek inside the data; you can just do streaming reads and writes.
And that is where the databases were still necessary. But databases were also growing in data. One thing that triggered this was the use case moving from data generated by people to data generated by machines. Machines means applications, all kinds of devices.
Now, it's like, going from seven billion people to a trillion devices is how the industry is changing. And this led to lots of machine-generated, semi-structured, structured data at giant scale coming into databases. The databases need to handle the scale. There was no other way to solve this problem other than leaving the persistence out. [unintelligible 00:28:31] if you're looking at columnar data, most of it is machine-generated data; where else would you store it? If they tried to build their own object storage embedded into the database, it would make the database immensely complicated. Let them focus on what they are good at: indexing and mutations. Pull the data table segments, which are immutable, mutate in memory, and then commit them back; that gives the right mix. That is the shift that happened fastest, and we saw it consistently across the board. Now, it is actually the standard.
Corey: So, you started working on this in 2014, and here we are—what is it—eight years later now, and you've just announced a Series B of $100 million on a billion-dollar valuation. So, it turns out this is not just one of those things people are using for test labs; there is significant momentum behind using this. How did you get there from—because everything you're saying makes an awful lot of sense, but it feels, at least from where I sit, to be a little bit of a niche. It's a bit of an edge case that is not the common case. Obviously, I'm missing something because your investors are not the types of sophisticated investors who see something ridiculous and say, "Yep. That's the thing we're going to go for." They're right more than they're not.
AB: Yeah. The reason for that was they saw what we were set out to do. In fact, if you see the lead investor, Intel, they watched us grow. They came in at Series A, and they saw, every day, how we operated and grew. They believed in our message.
And it was actually not about object store, right? Object storage was a means for us to get into the market. When we started, our idea was: ten years from now, what will be a big problem? A lot of times, it's hard to see the future, but if you zoom out, it's hidden in plain sight.
These are simple trends. Every major trend pointed to the world producing more data. No one would argue with that. If I solved one important problem that everybody is suffering from, I won't go wrong. And when you solve the problem, it's about building a product with fine craftsmanship, attention to detail, connecting with the user, all of that standard stuff.
But I picked object storage as the problem because the industry was fragmented across many different data stores, and I knew that wouldn't be the case ten years from now. Applications are not going to adopt different APIs across different clouds, S3 to GCS to Azure Blob to HDFS, where everything is incompatible. I saw that if I built a data store for persistence, the industry would consolidate around the S3 API. When we started, Amazon S3 looked like the giant; there was only one cloud industry, and it believed in mono-cloud. Almost everyone was telling me AWS will be the world's data center.
I certainly saw that possibility, Amazon is capable of doing it, but my bet was the other way: that AWS S3 would be one of many solutions, because if it's all incompatible, it's not going to work; the industry will consolidate. Our bet was, if the world is producing so much data, and you build an object store that is S3 compatible but ends up as the leading data store of the world and owns the application ecosystem, you cannot go wrong.
We kept our heads down and focused, for the first six years, on massive adoption: build the ecosystem to a scale where we can say our ecosystem is equal to or larger than Amazon's, then we are in business. We didn't focus on commercialization; we focused on convincing the industry that this is the right technology for them to use. Once they are convinced, once you solve business problems, making money is not hard because they are already sold, they are in love with the product; then convincing them to pay is not a big deal, because data is such a critical, central part of their business.
We didn't worry about commercialization; we worried about adoption. And once we got the adoption, now customers are coming to us and they're like, "I don't want open-source license violations. I don't want data breaches or data loss." They are trying to sell to me, and it's an easy relationship game. And it's about long-term partnerships with customers.
And so the business started growing, accelerating. That was the reason that now is the time to fill up the gas tank, and investors were quite excited about the commercial traction as well. And all the intangibles, right, how big we grew in the last few years.
Corey: It really is an interesting segment; it has always been something that I've mostly ignored, like, "Oh, you want to run your own? Okay, great." I get it; some people want to cosplay as cloud providers themselves. Awesome. There's clearly a lot more to it than that, and I'm really interested to see what the future holds for you folks.
AB: Yeah, I'm excited. I think, end of the day, if I solve real problems: every organization is moving from compute-technology-centric to data-centric, and they're all looking at data warehouses, data lakes, and whatever name they give their data infrastructure. Data is now the centerpiece. Software is a commodity. That's how they are looking at it. And it is translating to each of these large organizations—actually, even the mid-size ones; even startups nowadays have petabytes of data—and I see a huge potential here. The timing is perfect for us.
Corey: I'm really excited to see this continue to grow. And I want to thank you for taking so much time to speak with me today. If people want to learn more, where can they find you?
AB: I'm always on the community, right? Twitter and, like, I think the Slack channel; it's quite easy to reach out to me. LinkedIn. I'm always excited to talk to our users or community.
Corey: And we will, of course, put links to this in the [show notes 00:33:58]. Thank you so much for your time. I really appreciate it.
AB: Again, wonderful to be here, Corey.
Corey: Anand Babu Periasamy, CEO and co-founder of MinIO. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with what starts out as an angry comment but eventually turns into you, in your position on the S3 product team, writing a thank-you note to MinIO for helping validate your market.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.
Dr. Laura Froyen – The Whinypaluza Podcast with Rebecca Greene, Episode 112: The Balanced Parent
Dr. Laura Froyen received her PhD in Human Development and Family Studies (HDFS) with an emphasis in Couple and Family Therapy from Michigan State University in 2014. While pursuing her doctorate, she worked as a couple and family therapist in the state of Michigan, helping families navigate difficult times. Her research focused on how marital and family relationships influence parenting and child development. She continued this research at the University of Wisconsin-Madison as an Assistant Professor of HDFS and at UW-Extension as an Early Childhood and Parenting Specialist. She helps individuals, partners, and co-parents become the parents they are longing to be. She helps overwhelmed and disconnected parents reconnect with themselves, each other, and their children. She helps parents bring ease, calm, and JOY back into their hectic and chaotic lives. She helps parents stop yelling and start feeling confident in themselves. She does this through her unique approach to parenting that is grounded in research and driven by the overarching goal of restoring balance and compassion to families. Listen to this insightful Whinypaluza episode with Dr. Laura Froyen about being a balanced parent.
Here is what to expect on this week's show:
How early literacy skills are influenced by the home environment and relationships.
Why finding a balance can be difficult for moms and dads.
How there are 3 components to living a balanced life and what they are.
Why it's important to practice self-compassion and not be so hard on ourselves.
How you can train your body to respond to triggers how you want.
Connect with Laura:
Links Mentioned:
https://www.laurafroyen.com/
https://laurafroyen.lpages.co/self-compassion-meditation-download/
Guest Contact Info:
Instagram: @laurafroyenphd
Facebook: https://www.facebook.com/laurafroyen
Follow Rebecca Greene:
Blog: whinypaluza.com
Book: bit.ly/WhinypaluzaBook
Facebook: facebook.com/whinypaluzaparenting
Instagram: @becgreene5
TikTok: @whinypaluzamom
Learn more about your ad choices. Visit megaphone.fm/adchoices
About Thomas
Thomas Hazel is Founder, CTO, and Chief Scientist of ChaosSearch. He is a serial entrepreneur at the forefront of communication, virtualization, and database technology, and the inventor of ChaosSearch's patented IP. Thomas has also patented several other technologies in the areas of distributed algorithms, virtualization, and database science. He holds a Bachelor of Science in Computer Science from the University of New Hampshire, is a Hall of Fame Alumni Inductee, and founded both student and professional chapters of the Association for Computing Machinery (ACM).
Links:
ChaosSearch: https://www.chaossearch.io
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by my friends at ThinkstCanary. Most companies find out way too late that they've been breached. ThinkstCanary changes this, and I love how they do it. Deploy canaries and canary tokens in minutes, and then forget about them. What's great is the attackers tip their hand by touching them, giving you one alert when it matters. I use it myself, and I only remember this when I get the weekly update with a "we're still here, so you're aware" from them. It's glorious! There is zero admin overhead to this; there are effectively no false positives unless I do something foolish. Canaries are deployed and loved on all seven continents. You can check out what people are saying at canary.love. And their Kubeconfig canary token is new and completely free as well. You can do an awful lot without paying them a dime, which is one of the things I love about them. It is useful stuff and not an "ohh, I wish I had money." It is spectacular! Take a look; that's canary.love, because it's genuinely rare to find a security product that people talk about in terms of love. It really is a unique thing to see. Canary.love. Thank you to ThinkstCanary for their support of my ridiculous, ridiculous nonsense.
Corey: This episode is sponsored in part by our friends at Vultr. Spelled V-U-L-T-R because they're all about helping save money, including on things like, you know, vowels. So, what they do is they are a cloud provider that provides surprisingly high-performance cloud compute at a price that—while sure they claim it's better than AWS pricing—and when they say that, they mean it is less money. Sure, I don't dispute that, but what I find interesting is that it's predictable. They tell you in advance on a monthly basis what it's going to cost. They have a bunch of advanced networking features. They have nineteen global locations and scale things elastically. Not to be confused with openly, because apparently elastic and open can mean the same thing sometimes. They have had over a million users. Deployments take less than sixty seconds across twelve pre-selected operating systems. Or, if you're one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vultr cloud compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something to scale all on their own.
Try Vultr today for free by visiting vultr.com/screaming, and you'll receive $100 in credit. That's v-u-l-t-r.com slash screaming.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted episode is brought to us by our friends at ChaosSearch.
We've been working with them for a long time; they've sponsored a bunch of our nonsense, and it turns out that we've been talking about them to our clients since long before they were a sponsor because it actually does what it says on the tin. Here to talk to us about that in a few minutes is Thomas Hazel, ChaosSearch's CTO and founder. First, Thomas, nice to talk to you again, and as always, thanks for humoring me.
Thomas: [laugh]. Hi, Corey. Always great to talk to you. And I enjoy these conversations that sometimes go up and down, left and right, but I look forward to all the fun we're going to have.
Corey: So, my understanding of ChaosSearch is probably a few years old because it turns out I don't spend a whole lot of time meticulously studying your company's roadmap in the same way that you presumably do. When last we checked in with what the service did-slash-does, you were effectively solving the problem of data movement and of querying that data. The idea behind data warehouses is generally something that's shoved onto us by cloud providers where, "Hey, this data is going to be valuable to you someday." Data science teams are big proponents of this because when you're storing that much data, their salaries look relatively reasonable by comparison. And the ChaosSearch vision was: instead of copying all this data out of an object store and storing it on expensive disks, and replicating it, et cetera, what if we queried it in place in a somewhat intelligent manner?
So, you take the data and you store it, in this case, in S3 or equivalent, and then just query it there, rather than having to move it around all over the place, which of course then incurs data transfer fees, you're storing it multiple times, and it's never in quite the format that you want it. That was the breakthrough revelation: you were Elasticsearch—now OpenSearch—API compatible, which was great. And that was, sort of, state of the art a year or two ago. Is that generally correct?
Thomas: No, you nailed our mission statement. You're exactly right. You know, the value of cloud object stores, S3, the elasticity, the durability, all these wonderful things—the problem was you couldn't get any value out of it, and you had to move it out to these siloed solutions, as you indicated. So, you know, our mission was exactly that: transform customers' cloud storage into an analytical database, a multi-model analytical database, where our first use case was search and log analytics, replacing the ELK stack and also replacing the data pipeline, the schema management, et cetera. We automate the entire step, raw data to insights.
Corey: It's funny we're having this conversation today. Earlier today, I was trying to get rid of a relatively paltry 200 gigs or so of small files on an EFS volume—you know, Amazon's version of NFS; it's like an NFS volume except you're paying Amazon for the privilege—great. And it turns out that it's a whole bunch of operations across a network on a whole bunch of tiny files, so I had to spin up other instances that were not getting whacked by spot terminations, and just fire up a whole bunch of threads.
So, now the load average on that box is approaching 300, but it's plowing through, getting rid of that data finally.
And I'm looking at this saying, this is a quarter of a terabyte. Data warehouses are in the petabyte range. Oh, I begin to see aspects of the problem. Even searching that kind of data using traditional tooling starts to break down, which is sort of the revelation that Google had 20-some-odd years ago, and other folks have since solved for, but this is the first time I've had significant data that wasn't just easily searched with a grep. For those of you in the Unix world who understand what that means, condolences. We're having a support group meeting at the bar.
Thomas: Yeah. And you know, I always thought, what if you could make cloud object storage like S3 high performance and really transform it into a database? And so that warehouse capability, that's great. We like that. However, to manage it, to scale it, to configure it, to get the data into it, was the problem.
That was the promise of a data lake, right? This simple in, and then this arbitrary schema-on-read, generic out. Then the problem came: it became swampy, it was really hard, and that promise was not delivered. And so what we're trying to do is get all the benefits of the data lake: simple in, since so many services naturally stream to cloud storage. Shoot, I would say every one of our customers is putting their data in cloud storage because their data pipeline to their warehousing solution or Elasticsearch may go down and they're worried they'll lose the data.
So, what we say is, what if you just activated that data lake and got that ELK use case, got that BI use case, without that data movement, as you indicated, without that ETL-ing, without that data pipeline that you're worried is going to fall over? So, that vision has been Chaos. Now, we haven't talked in, you know, a few years, but the idea is that we're growing beyond just going after logs; we're going into new use cases, new opportunities, and I'm looking forward to discussing them with you.
Corey: It's a great answer, though I have to call out that I am right there with you as far as inappropriately using things as databases. I know that someone is going to come back and say, "Oh, S3 is a database. You're dancing around it. Isn't that what Athena is?" Which is named, of course, after the Greek Goddess of spending money on AWS? And that is a fair question, but to my understanding, there's a schema story behind it that does not apply to what you're doing.
Thomas: Yeah, and what is so crucial is that we like the relational access. The time, cost, and complexity to get into that, as you mentioned, scaled access, I mean, it could take weeks, months to test it, to configure it, to provision it, and imagine if you got it wrong; you've got to redo it again. And so our unique service removes all that data pipeline and schema management. And because of our innovation, because of our service, you do all schema definition on the fly, virtually, what we call views on your indexed data, and you can publish an Elastic index pattern for that consumption, or a relational table for that consumption. And that's kind of leading the witness into things that we're coming out with this quarter, into 2022.
Corey: I have to deal with a little bit of, I guess, a shame here because, yeah, I'm doing exactly what you just described.
I'm using Athena to wind up querying our customers' Cost and Usage Reports, and we spend a couple hundred bucks a month on AWS Glue to wind up massaging those into the way that they expect it to be. And it's great. Ish. We hook it up to Tableau and can make those queries from it, and all right, it's great.
It just, burrr goes the money printer, and we somehow get access and insight to a lot of valuable data. But even that is knowing exactly what the format is going to look like. Ish. I mean, Cost and Usage Reports from Amazon are sort of aspirational when it comes to schema sometimes, but here we are. And that's been all well and good.
But now the idea of log files, even looking at the base case of sending logs from an application: great. Nginx, or Apache, or [unintelligible 00:07:24], or any of the various web servers out there all tend to use different logging formats just to describe the same exact things. Start spreading that across custom in-house applications, and getting signal from that is almost impossible. "Oh," people say, "so we'll use a structured data format." Now, you're putting logging and structuring requirements on application developers who don't care in the first place, and now you have a mess on your hands.
Thomas: And it really is a mess. And that challenge is so problematic. And schemas change. You know, we have customers, and one of the reasons why they go with us is their log data is changing; they didn't expect it. Well, in your data pipeline and your Athena database, that breaks. That brings the system down.
And so our system uniquely detects that and manages it for you, and then you can pick and choose how you want to export it in these views dynamically. So, you know, it's really not rocket science, but the problem is, a lot of the technology that we're using is designed for static, fixed thinking. And then to scale it is problematic and time-consuming. So, you know, Glue is a great idea, but it has a lot of sharp [pebbles 00:08:26]. Athena is a great idea, but it also has a lot of problems.
And so that data pipeline, you know, it's not for digitally native, active, new use cases, new workloads coming up hourly, daily. You think about this long-term: a lot of that data prep and pipelining is something we address so uniquely, but where the customer really cares is the value of that data, right? And so if you're spending toil trying to get the data into a database, you're not answering the questions, whether it's for security, for performance, for your business needs. That's the problem. And you know, that agility, that time-to-value is where we're very uniquely coming in, because we start where your data is raw and we automate the process all the way through.
Corey: So, when I look at the things that I have stuffed into S3, they generally fall into a couple of categories. There are a bunch of logs for things I never asked for nor particularly wanted, but AWS is aggressive about that, first routing them through CloudTrail so you can get charged 50 cents per gigabyte ingested. Awesome. And of course, large static assets, images I have done something to, colloquially now known as shitposts, which is great. Other than logs, what could you possibly be storing in S3 that lends itself to, effectively, the type of analysis that you built around this?
Thomas: Well, our first use case was the classic log use cases: app logs, web service logs.
I mean, CloudTrail, it's famous; we had customers that gave up on Elastic, and definitely gave up on relational, where you can do a couple of changes and your permutation of attributes for CloudTrail is going to bring you to your knees. And people just say, "I give up." Same thing with Kubernetes logs. And so it's the classic—whether it's CSV, whether it's JSON, whether it's log types—we auto-discover all that. We also allow you to override that and change the parsing capabilities through a UI wizard, and we do discover what's in your buckets. That term data swamp comes from not knowing what's in your bucket; we have a facility that will index that data and actually create a report for you so you know what's in there. Now, if you have text data, if you have log data, if you have BI data, we can bring it all together, but the real pain is at scale. So classically, app logs, system logs, many devices sending IoT-type streams is where we really come in—Kubernetes—where they're dealing with terabytes of data per day, and managing an ELK cluster at that scale, particularly on a Black Friday.
Shoot, some of our customers—Klarna is one of them; credit card payments—they're ramping up for Black Friday, and one of the reasons why they chose us is our ability to scale when maybe you're doing a terabyte or two a day and then it goes up to twenty, twenty-five. How do you test that scale? How do you manage that scale? And so for us, the data streams are, traditionally with our customers, the well-known log types, at least in the log use cases. And the challenge is scaling it, is getting access to it, and that's where we come in.
Corey: I will say the last time you were on the show a couple of years ago, you were talking about the initial logging use case and you were speaking, in many cases aspirationally, about where things were going. What a difference a couple of years makes. Instead of talking about what hypothetical customers might want, or what they might be able to do, you're just able to name-drop them off the top of your head. You have scaled to approximately ten times the number of employees you had back then. You've—
Thomas: Yep. Yep.
Corey: —raised, I think, a total of—what, 50 million?—since then.
Thomas: Uh, 60 now. Yeah.
Corey: Oh, 60? Fantastic.
Thomas: Yeah, yeah.
Corey: Congrats. And of course, how do you do it? By sponsoring Last Week in AWS, as everyone should. I'm taking clear credit for that every time someone announces a round; that's the game. But no, there is validity to it, because telling fun stories and sponsoring exciting things like this only carries you so far. At some point, customers have to say, yeah, this is solving a pain that I have; I'm willing to pay you money to solve it.
And you've clearly gotten to a point where you are addressing the needs of those customers at a pretty fascinating clip. It's bittersweet from my perspective because it seems like the majority of your customers have not come from my nonsense anymore. They're finding you through word of mouth, they're finding you through more traditional—read as boring—ad campaigns, et cetera, et cetera. But you've built a brand that extends beyond just me. I'm no longer viewed as the de facto ombudsperson for any issue someone might have with ChaosSearch on the Twitters. It's kind of, "Aww, the company grew up. What happened there?"
Thomas: No, [laugh] listen, this—you were great. We reached out to you to tell our story, and I've got to be honest. A lot of people came by and said, "I heard something on Corey Quinn's podcasts," et cetera.
And it's come a long way now. Now, we have, you know, companies like Equifax—multi-cloud, Amazon and Google. They love the data lake philosophy, the centralized approach, where use cases are now available within days, not weeks and months, whether it's logs and BI. Correlating across all those data streams, it's huge. We mentioned Klarna, [APM Performance 00:13:19], and, you know, we have Armor for SIEM, and Blackboard for [Observers 00:13:24].
So, it's funny—yeah, it's funny. When I first was talking to you, I was like, "What if? What if we had this customer, that customer?" And we were building the capabilities, but now that we have it, now that we have customers, yeah, I guess maybe we've grown up a little bit. But hey, listen, you're always near and dear to our heart, because we remember, you know, when you stopped by our booth at re:Invent several times. And we're coming to re:Invent this year, and I believe you are as well.
Corey: Oh, yeah. But for people listening to this—if they're listening the day it's released—this will be during re:Invent. So, by all means, come by the ChaosSearch booth and see what they have to say. For once they have people who aren't me who are going to be telling stories about these things. And it's fun. Like, I joke; it's nothing but positive here.
It's interesting from where I sit seeing the parallels here. For example, we have both had—how shall we say—adult supervision come in. You have a CEO, Ed, who came over from IBM Storage. I have Mike Julian, whose first love language is, of course, spreadsheets. And it's great, on some level, realizing that, wow, this company has eclipsed my ability to manage these things myself and put my hands on everything. And eventually, you have to start letting go. It's a weird growth stage, and it's a heck of a transition. But—
Thomas: No, I love it. You know, I mean, I think when we were talking, we were maybe 15 employees. Now, we're pushing 100. We brought on Ed Walsh, who's an amazing CEO. It's funny, I told him about this idea—I invented this technology roughly eight years ago—and he's like, "I love it. Let's do it." And I wasn't ready to do it. So, you know, five, six years ago, I started the company, always knowing that, you know, I'd give him a call once we got the plane up in the air. And it's been great to have him here, because he brings the next level up, right, of execution and growth and business development and sales and marketing. So, you're exactly right. I mean, we were a young pup several years ago when we were talking to you, and, you know, we're a little bit older, a little bit wiser. But no, it's great to have Ed here. And just the leadership in general; we've grown immensely.
Corey: Now, we are recording this in advance of re:Invent, so there's always the question of, "Wow, are we going to look really silly based upon what is being announced when this airs?" Because it's very hard to predict some things that AWS does. And let's be clear, I always stay away from predictions, just because, first, I have a bit of a knack for being right. But also, when I'm right, people will think, "Oh, Corey must have known about that and is leaking," whereas if I get it wrong, I just look like a fool.
There's no win for me if I start doing the predictive dance on stuff like that. But I have to level with you, I have been somewhat surprised that, at least as of this recording, AWS has not moved more in your direction, because storing data in S3 is kind of their whole thing, and querying that data through something that isn't Athena has been a bit of a reach for them that they're slowly starting to wrap their heads around. But their UltraWarm nonsense—which is just, okay, great naming there—what is the point of continually having a model where, oh yeah, we're just going to age out the stuff that isn't actively being used into S3, rather than coming up with a way to query it there? Because you've done exactly that, and—please don't take this as anything other than a statement of fact—they have better access to what S3 is doing than you do. You're forced to deal with this thing entirely from a public API standpoint, which is fine. They can theoretically change the behavior of aspects of S3 to unlock these use cases if they chose to do so. And they haven't. Why is it that you're the only folks that are doing this?
Thomas: No, it's a great question, and I'll give them props for continuing to push the data lake [unintelligible 00:17:09] to the cloud providers' S3, because it was really where I saw the world. Lakes, I believe in. I love them. They love them. However, they promote moving the data out to get access, and it seems so counterintuitive: why wouldn't you leave it in and make these services more intelligent? So, it's funny, I've trademarked 'Smart Object Storage,' and I actually trademarked—I think you [laugh] were a part of this—'UltraHot,' right? Because why would you want UltraWarm when you can have UltraHot?
And the reason, I feel, is that if you're using Parquet for the Athena [unintelligible 00:17:40] store, or Lucene for Elasticsearch, these two index technologies were not designed for cloud storage, for real-time streaming off of cloud storage. So, the trick is, you have to build UltraWarm, get it off of what they consider cold S3 into warmer memory or SSD-type access. What we did, the invention I created, was that first read is hot. That first read is fast.
Snowflake is a good example. They give you a ten-terabyte demo example, and if you have a big instance and you do that first query, maybe several orders or groups, it could take an hour to warm up. The second query is fast. Well, what if the first query is in seconds as well? And that's where we really spent the last five, six years building out the tech and the vision behind this. Because I like to say, you go to a doctor and say, "Hey, Doc, every single time I move my arm, it hurts." And the doctor says, "Well, don't move your arm."
It's things like that. To your point, it's like, why wouldn't they? I would argue, one, you have to believe it's possible—we're proving that it is—and two, you have to have the technology to do it. Not just the index, but the architecture. So, I believe they will go this direction. You know, little birdies always say that all these companies understand this need. Shoot, Snowflake is trying to be lake-y; Databricks is trying to really bring this warehouse-lake concept. But you still do all the pipelining; you still have to do all the data management in a way that you don't want to do. It's not a lake. And so my argument is that it's innovation on the why.
Now, they have money; they have time, but, you know, we have a big head start.Corey: I remembered last year at re:Invent they released a, shall we say, significant change to S3 that it enabled read after write consistency, which is awesome, for again, those of us in the business of misusing things as databases. But for some folks, the majority of folks I would say, it was a, “I don't know what that means and therefore I don't care.” And that's fine. I have no issue with that. There are other folks, some of my customers for example, who are suddenly, “Wait a minute. This means I can sunset this entire janky sidecar metadata system that is designed to make sure that we are consistent in our use of S3 because it now does it automatically under the hood?” And that's awesome. Does that change mean anything for ChaosSearch?Thomas: It doesn't because of our architecture. We're append-only, write-once scenario, so a lot of update-in-place viewpoints. My viewpoint is that if you're seeing S3 as the database and you need that type of consistency, it make sense of why you'd want it, but because of our distributive fabric, our stateless architecture, our append-only nature, it really doesn't affect us.Now, I talked to the S3 team, I said, “Please if you're coming up with this feature, it better not be slower.” I want S3 to be fast, right? And they said, “No, no. It won't affect performance.” I'm like, “Okay. Let's keep that up.”And so to us, any type of S3 capability, we'll take advantage of it if benefits us, whether it's consistency as you indicated, performance, functionality. But we really keep the constructs of S3 access to really limited features: list, put, get. [roll-on 00:20:49] policies to give us read-only access to your data, and a location to write our indices into your account, and then are distributed fabric, our service, acts as those indices and query them or searches them to resolve whatever analytics you need. So, we made it pretty simple, and that is allowed us to make it high performance.Corey: I'll take it a step further because you want to talk about changes since the last time we spoke, it used to be that this was on top of S3, you can store your data anywhere you want, as long as it's S3 in the customer's account. Now, you're also supporting one-click integration with Google Cloud's object storage, which, great. That does mean though, that you're not dependent upon provider-specific implementations of things like a consistency model for how you've built things. It really does use the lowest common denominator—to my understanding—of object stores. Is that something that you're seeing broad adoption of, or is this one of those areas where, well, you have one customer on a different provider, but almost everything lives on the primary? I'm curious what you're seeing for adoption models across multiple providers?Thomas: It's a great question. We built an architecture purposely to be cloud-agnostic. I mean, we use compute in a containerized way, we use object storage in a very simple construct—put, get, list—and we went over to Google because that made sense, right? We have customers on both sides. I would say Amazon is the gorilla, but Google's trying to get there and growing.We had a big customer, Equifax, that's on both Amazon and Google, but we offer the same service. To be frank, it looks like the exact same product. And it should, right? Whether it's Amazon Cloud, or Google Cloud, multi-select and I want to choose either one and get the other one. 
I would say that different business types are using each one, but the bulk of our business is in Amazon; we just this summer released our SaaS offerings, so it's growing. And you know, it's funny, you never know where it comes from. So, we have one customer—actually DigitalRiver—as one of our customers on Amazon for logs, but we're growing in working together to do BI on GCP, on Google. And so it's kind of funny; they have two departments on two different clouds with two different use cases. And so do they want unification? I'm not sure, but they definitely have their BI on Google and their operations in Amazon. It's interesting.
Corey: You know, it's important to me that people learn how to use the cloud effectively. That's why I'm so glad that Cloud Academy is sponsoring my ridiculous nonsense. They're a great way to build in-demand tech skills the way that, well, personally, I learn best—which is by doing, not by reading. They have live cloud labs that you can run in real environments that aren't going to blow up your own bill—I can't stress how important that is. Visit cloudacademy.com/corey. That's C-O-R-E-Y; don't drop the "E." Use Corey as a promo code as well. You're going to get a bunch of discounts on it with a lifetime deal—the price will not go up. It is limited-time; they assured me this is not one of those things that is going to wind up being a rug-pull scenario, oh no no. Talk to them, tell me what you think. Visit cloudacademy.com/corey, C-O-R-E-Y, and tell them that I sent you!
Corey: I know that I'm going to get letters for this, so let me just call it out right now. Because I've been a big advocate of pick a provider—I care not which one—and go all-in on it. And I'm sitting here congratulating you on extending to another provider, and people are going to say, "Ah, you're being inconsistent."
No. I'm suggesting that you as a provider have to meet your customers where they are, because if someone is sitting in GCP and your entire approach is, "Step one, migrate those four petabytes of data right on over here to AWS," they're going to call you the jackhole that you would be by making that suggestion and go immediately for option B, which is literally anything that is not ChaosSearch, just based upon that core misunderstanding of their business constraints. That is the way to think about these things. For the vendor position that you are in as an ISV—Independent Software Vendor, for those not up on the lingo of this ridiculous industry—you have to meet customers where they are. And it's the right move.
Thomas: Well, you just said it. Imagine moving terabytes and petabytes of data.
Corey: It sounds terrific if I'm a salesperson for one of these companies working on commission, but for the rest of us, it sounds awful.
Thomas: We really are a data fabric across clouds, within clouds. We're going to go where the data is and we're going to provide access to where that data lives. Our whole philosophy is the no-movement movement, right? Don't move your data. Leave it where it is and provide access at scale. And so you may have services in Google that naturally stream to GCS; let's do it there. Imagine moving that amount of data over to Amazon to analyze it, and vice versa. In 2022, we're going to be in Azure.
They're a totally different type of business, users, and personas, but you're getting asked, "Can you support Azure?" And the answer is, "Yes," and, "We will in 2022." So, to us, if you have cloud storage, if you have compute, and it's a big enough business opportunity in the market, we're there. We're going there. When we first started, we were talking to MinIO—remember that open-source object storage platform? We run on our laptops, we run—it's this [unintelligible 00:25:04] Dr. Seuss thing—"We run over here; we run over there; we run everywhere."
But the honest truth is, you're going to go with the big cloud providers where the business opportunity is, and offer the same solution, because the same solution is valued everywhere: simple in; value out; cost-effective; long retention; flexibility. That sounds so basic, but you mention this all the time with the Rube Goldberg Amazon diagrams we see time and time again. It's like, if you looked at that and you were from an alien planet, you'd be like, "These people don't know what they're doing. Why is it so complicated?" And the simple answer is, I don't know why people think it's complicated.
To your point about Amazon, why won't they do it? I don't know, but if they did, things would be different. And being honest, I think people are catching on. We do talk to Amazon and others. They see the need, but they also have to build it; they have to invent technology to address it. And using Parquet and Lucene is not the answer.
Corey: Yeah, it's too much of a demand on the producers of that data rather than the consumer. And yeah, I would love to be able to go upstream to application developers and demand they do things in certain ways. It turns out, as a consultant, you have zero authority to do that. As a DevOps team member, you have limited ability to influence it, but it turns out that being the 'department of no' quickly turns into being the 'department of unemployment insurance' because no one wants to work with you. And collaboration—contrary to what people wish to believe—is a key part of working in a modern workplace.
Thomas: Absolutely. And it's funny, the demands of IT are getting harder; actually getting the employees to build out the solutions is getting harder. And so a lot of that time is in the pipeline, the prep, the schema, the sharding, et cetera, et cetera, et cetera. My viewpoint is that should be automated away. More and more databases are being autotuned, right? All these knobs and this and that—to me, Glue is a means to an end. I mean, let's get rid of it. Why can't Athena know what to do? Why can't object storage be Athena, and vice versa? I mean, to me, it seems like all this moving through all these services—the classic Amazon viewpoint, even their diagrams of having this centralized repository of S3, move it all out to your services, get results, put it back in, then take it back out again, move it around—it just doesn't make much sense. And so to us, I love S3, love the service. I think it's brilliant—Amazon's first service, right?—but from there, get a little smarter. That's where ChaosSearch comes in.
Corey: I would argue that S3 is, in fact, a modern miracle. And when one of those companies says, "Oh, we have an object store; it's S3-compatible," it's like, "Yeah. We have S3 at home." Look at S3 at home, and it's just basically a series of failing Raspberry Pis. But you have this whole ecosystem of things that have built up and sprung up around S3. It is wildly understated just how scalable and massive it is.
There was an academic paper recently that won an award on how they use automated reasoning to validate what is going on in the S3 environment, and they talked about hundreds of petabytes in some cases. And folks are saying, ah, S3 is hundreds of petabytes. Yeah, I have clients storing hundreds of petabytes. There are larger companies out there. Steve Schmidt, Amazon's CISO, was recently at a Splunk keynote where he mentioned that in security info alone, AWS itself generates 500 petabytes a day that then gets reduced down to a bunch of stuff, and some of it gets loaded into Splunk. I think. I couldn't really hear the second half of that sentence because of the sound of all of the Splunk salespeople in that room becoming excited so quickly you could hear it.
Thomas: [laugh]. I love it. If I could be so bold, that S3 team, they're gods. They are amazing. They created such an amazing service, and when I started playing with S3—now, I guess, 2006 or 7—I mean, we were using it as a repository, URL access to get images; I was doing a virtualization [unintelligible 00:29:05] at the time—
Corey: Oh, the first time I played with it, it was, "This seems ridiculous and kind of dumb. Why would anyone use this?" Yeah, yeah. It turns out I'm really bad at predicting the future. Another reason I don't do the prediction thing.
Thomas: Yeah. And when I started this company officially, five, six years ago, I was thinking about S3 and I was thinking about HDFS not being a good answer. And I said, "I think S3 will actually achieve the goals and performance we need." It's a distributed file system. You can run parallel puts and parallel gets. And the performance that I was seeing when the data was a certain way, a certain size—"Wait, you can get high performance." And you know, when I first turned on the engine, now four or five years ago, I was like, "Wow. This is going to work. We're off to the races." And now, obviously, we're more than just the idea we were when we first talked to you. We're a service.
We deliver benefits to our customers both in logs—and shoot, this quarter alone we're coming out with new features, not just in the logs, which I'll talk about in a second, but in direct SQL access. But you know, one thing that you hear time and time again—we talked about it—is JSON, CloudTrail, and Kubernetes; this is a real nightmare. And so one thing that we've come out with this quarter is the ability to virtually flatten. Now, you've heard time and time again, "Okay. I'm going to pick and choose my data because my database can't handle it," whether it's elastic or, say, relational. And all of a sudden, "Shoot, I don't have that. I've got to reindex that."
And so what we've done is we've created an index technology—that we were always planning to come out with—that indexes the raw JSON blob, and in the data refinery, post-index, you can select how to unflatten it. Why is that important? Because all that tooling, whether it's elastic or SQL, is now available. You don't have to change anything. Why do Snowflake and BigQuery have these proprietary JSON APIs that none of these tools know how to use to get access to the data? Or you pick and choose. And so when you have a CloudTrail and you need to know what's going on, if you picked wrong, you're in trouble. So, this new feature we're calling 'Virtual Flattening'—or I don't know what we're calling it; we have to work with the marketing team on it. And we're also bringing—this is where I get kind of excited—to the elastic world, the ELK world, we're bringing correlations into Elasticsearch.
And like, how do you do that? They don't have the APIs? Well, our data refinery, again, has the ability to correlate index patterns into one view. A view is an index pattern, so all those same constructs that you had in Kibana, or Grafana, or the Elastic API still work. And so, no more denormalizing, no more trying to hodgepodge query over here, query over there. You're actually going to have correlations in Elastic, natively. And we're excited about that.
And one more push on the future, Q4 into 2022: we are giving early access to S3 SQL access. And, you know, as I mentioned, correlations in Elastic, but we're going full in on publishing our [TPCH 00:31:56] report; we're excited about publishing those numbers, as well as not just giving early access, but going GA in the first of the year, next year.
Corey: I look forward to it. It's impossible to have a conversation with you, even now, where you're not still forward-looking about what comes next. Which is natural; that is how we get excited about the things that we're building. But so much less of our conversation now has focused on what's coming, as opposed to the neat stuff you're already doing. I had to double-check when we were talking just now about, oh yeah, is that Google Cloud object store support still something that is roadmapped, or is that out in the real world?
No, it's very much here in the real world, available today. You can use it. Go click the button, have fun. It's neat to see at least some evidence that not all roadmaps are wishes and pixie dust. The things that you were talking to me about years ago are established parts of ChaosSearch now. It hasn't been just, sort of, frozen in amber for years, or months, or these giant periods of time. Because, again, there's—yeah, don't sell me vaporware; I know how this works. The things you have promised have come to fruition. It's nice to see that.
Thomas: No, I appreciate it. We talked a little while ago—now a few years ago—and it was a bit aspirational, right? We had a lot to do, we had more to do. But now we have big customers using our product, solving their problems, whether it's security, performance, operations—again, at scale, right? The real pain is, sure, you have a small ELK cluster or small Athena use case, but when you're dealing with terabytes to petabytes, trillions of rows—when you're dealing with trillions, billions are now small. Millions don't even exist, right? And you're graduating from computer science in college and you say the word, "Trillion," they're like, "Nah. No one does that." And like you were saying, people do petabytes and exabytes. That's the world we're living in, and that's something that we really went hard at, because these are challenging data problems and this is where we feel we uniquely sit. And again, we don't have to break the bank while doing it.
Corey: Oh, yeah. Or at least as of this recording, there's a meme going around, again, from an old internal Google video, of, "I just want to serve five terabytes of traffic," and it's an internal Google discussion of, "I don't know how to count that low." And, yeah. But there's also value in being able to address things at much larger volume.
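To make the 'virtual flattening' Thomas described a moment ago concrete, here is what the transformation itself does to a nested JSON event: nested objects become flat, dotted column names that ordinary elastic or SQL tooling can address. This standalone Python sketch flattens eagerly; the point of the feature, as described, is that the equivalent choice happens virtually, post-index, rather than in code like this. The CloudTrail-ish event below is invented.

    def flatten(event, prefix=""):
        """Flatten nested JSON: {"a": {"b": 1}} becomes {"a.b": 1}."""
        flat = {}
        for key, value in event.items():
            name = prefix + key
            if isinstance(value, dict):
                flat.update(flatten(value, name + "."))   # recurse into nested objects
            else:
                flat[name] = value
        return flat

    event = {"eventName": "PutObject",
             "userIdentity": {"type": "IAMUser", "userName": "corey"}}
    print(flatten(event))
    # {'eventName': 'PutObject', 'userIdentity.type': 'IAMUser', 'userIdentity.userName': 'corey'}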
I would love to see better responsiveness options around things like Deep Archive, because the idea of being able to query that—even if you can wait a day or two—becomes really interesting just from the perspective of: at that point, the current cost for one petabyte of data in Glacier Deep Archive is 1,000 bucks a month. That is 'why would I ever delete data again?' pricing.
Thomas: Yeah. You said it. And what's interesting about our technology is that unlike, let's say, Lucene—where when you index it, it could be 3, 4, or 5x the raw size—our representation is smaller than gzip. So, it is a full representation, so why don't you store it efficiently long-term in S3? Oh, by the way, with Glacier—we support Glacier too.
And so, I mean, it's amazing: the cost of data with cloud storage is dramatic, and if you can make it hot and activated, that's the real promise of a data lake. And, you know, it's funny, we use our own service to run our SaaS—we log our own data, we monitor, we alert, we have dashboards—and I can't tell you how cheap our service is to ourselves, right? Because it's so cost-effective for the long tail—not just, oh, a few weeks; we store a whole year's worth of our operational data so we can go back in time to debug something or figure something out. And a lot of that is savings. Actually, the huge savings is cloud storage with a distributed elastic compute fabric that is serverless. These are things that seem so obvious now, but if you have SSDs, and you're moving things around, you know, a team of IT professionals trying to manage it, it's not cheap.
Corey: Oh, yeah, that's the story. It's like, "Step one, start paying for using things in cloud." "Okay, great. When do I stop paying?" "That's the neat part. You don't." And it continues to grow and build.
And again, this is the thing I learned running a business that focuses on this: the people working on this, in almost every case, are more expensive than the infrastructure they're working on. And that's fine. I'd rather pay people than technologies. And it does help reaffirm, on some level, that—people don't like this reminder—you have to generate more value than you cost. So, when you're sitting there spending all your time trying to save money with, "Oh, I've listened to ChaosSearch talk about what they do a few times. I can probably build my own and roll it at home," it's—I've seen the kind of work that you folks have put into this. Again, you have something like 100 employees now; it is not just you building this. My belief has always been that if you can buy something that gets you 90, 95% of the way there, great. Buy it, and then yell at whoever's selling it to you for the rest of it, and that'll get you a lot further than, "We're going to do this ourselves from first principles." Which is great for a weekend project, for something that you have a passion for, but in production, mistakes show. I've always been a big proponent of buying wherever you can. It's cheaper, which sounds weird, but it's true.
Thomas: And we do the same thing. We have single sign-on support; we didn't build that ourselves, we use a service now. Auth0 is one of our providers now that owns that [crosstalk 00:37:12]—
Corey: Oh, you didn't roll your own authentication layer? Why ever not? Next, you're going to tell me that you didn't roll your own payment gateway when you wound up charging people on your website to sign up?
Thomas: You got it. And so, I mean, do what you do well. Focus on what you do well.
If you're repeating what everyone seems to do over and over again—the time, the costs, the complexity—a service makes sense. You know, I'm not trying to build storage; I'm using storage. I'm using a great, wonderful service: cloud object storage. Use what works, what works well, and do what you do well. And what we do well is make cloud object storage analytical and fast. So, call us up and we'll take away that 2 a.m. call you get when your cluster falls down, or when you have a new workload and you were going to go to the—I don't know, the beach house, and now the weekend's shot, right? Spin it up, stream it in. We'll take over.
Corey: Yeah. So, if you're listening to this and you happen to be at re:Invent—which is sort of an open question: why would you be at re:Invent while listening to a podcast? And then I remember how long the shuttle lines are likely to be, and yeah. So, if you're at re:Invent, make it on down to the show floor, visit the ChaosSearch booth, tell them I sent you, watch for the wince—that's always worth doing. Thomas, if people have better decision-making capability than the two of us do, where can they find you if they're not in Las Vegas this week?
Thomas: So, you can find us online at chaossearch.io. We have so much material: videos, use cases, testimonials. You can reach out to us, get a free trial. We have a self-service experience where you connect to your S3 bucket and you're up and running within five minutes. So, definitely chaossearch.io. Reach out if you want a hand-held, white-glove POV experience; if you have those types of needs, we can do that with you as well. And we'll have a booth at re:Invent—I don't know the booth number, but I'm sure either we've been assigned one or we'll find out.
Corey: Don't worry. This year, attendance is low enough that I'm projecting you will not be as hard to find as in recent years. For example, there's only one expo hall this year. What a concept. If only it hadn't taken a deadly pandemic to get us here.
Thomas: Yeah. But you know, we'll have the ability to demonstrate Chaos at the booth, and really, within a few minutes, you'll say, "Wow. How come I never heard of doing it this way?" Because it just makes so much sense why you do it this way versus the merry-go-round of data movement, and transformation, and schema management, let alone all the sharding that I know is a nightmare, more often than not.
Corey: And we'll, of course, put links to that in the [show notes 00:39:40]. Thomas, thank you so much for taking the time to speak with me today. As always, it's appreciated.
Thomas: Corey, thank you. Let's do this again.
Corey: We absolutely will. Thomas Hazel, CTO and Founder of ChaosSearch. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast episode, please leave a five-star review on your podcast platform of choice, whereas if you've hated this episode, please leave a five-star review on your podcast platform of choice along with an angry comment, because I have dared to besmirch the honor of your homebrewed object store, running on top of some trusty and reliable Raspberries Pie.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.
About Nick
Nick Heudecker leads market strategy and competitive intelligence at Cribl, the observability pipeline company. Prior to Cribl, Nick spent eight years as an industry analyst at Gartner, covering data and analytics. Before that, he led engineering and product teams at multiple startups, with a bias towards open source software and adoption, and served as a cryptologist in the US Navy. Join Corey and Nick as they discuss the differences between observability and monitoring, why organizations struggle to get value from observability data, why observability requires new data management approaches, how observability pipelines are creating opportunities for SRE and SecOps teams, the balance between budgets and insight, why goats are the world's best mammal, and more.
Links: Cribl: https://cribl.io/ Cribl Community: https://cribl.io/community Twitter: https://twitter.com/nheudecker Try Cribl hosted solution: https://cribl.cloud
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by Thinkst. This is going to take a minute to explain, so bear with me. I linked against an early version of their tool, canarytokens.org, in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, that sort of thing in various parts of your environment, wherever you want to; it gives you fake AWS API credentials, for example. And the only thing that these things do is alert you whenever someone attempts to use them. It's an awesome approach. I've used something similar for years. Check them out. But wait, there's more. They also have an enterprise option that you should be very much aware of: canary.tools. You can take a look at this, but what it does is it provides an enterprise approach to drive these things throughout your entire environment. You can get a physical device that hangs out on your network and impersonates whatever you want to. When it gets Nmap scanned, or someone attempts to log into it, or access files on it, you get instant alerts. It's awesome. If you don't do something like this, you're likely to find out that you've gotten breached the hard way. Take a look at this. It's one of those few things that I look at and say, "Wow, that is an amazing idea. I love it." That's canarytokens.org and canary.tools. The first one is free. The second one is enterprise-y. Take a look. I'm a big fan of this. More from them in the coming weeks.
Corey: This episode is sponsored in part by our friends at Jellyfish. So, you're sitting in your office chair, bleary-eyed, parked in front of a PowerPoint and—oh my sweet feathery Jesus, it's the night before the board meeting, because of course it is! As you slot that crappy screenshot of traffic-light-colored Excel tables into your deck, or sift through endless spreadsheets looking for just the right data set, have you ever wondered: why is it that sales and marketing get all these shiny, awesome analytics and insight tools, whereas engineering basically gets left with the dregs? Well, the founders of Jellyfish certainly did.
That's why they created the Jellyfish Engineering Management Platform—but don't you dare call it JEMP! Designed to make it simple to analyze your engineering organization, Jellyfish ingests signals from your tech stack, including JIRA, Git, and collaborative tools. Yes, it's depressing to think of those things as your tech stack, but this is 2021. They use that to create a model that accurately reflects just how the breakdown of engineering work aligns with your wider business objectives. In other words, it translates from code into spreadsheet. When you have to explain what you're doing from an engineering perspective to people whose primary IDE is Microsoft PowerPoint, consider Jellyfish. That's Jellyfish.co, and tell them Corey sent you! Watch for the wince—that's my favorite part.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted episode is a bit fun because I'm joined by someone that I have a fair bit in common with. Sure, I moonlight sometimes as an analyst because I don't really seem to know what that means, and he spent significant amounts of time as a VP analyst at Gartner. But more importantly than that, a lot of the reason that I am the way that I am is that I spent almost a decade growing up in Maine, and in Maine, there's not a lot to do other than sit inside for the nine months of winter every year and develop personality problems.
You've already seen what that looks like with me. Please welcome Nick Heudecker, who presumably will disprove that, but maybe not. He is currently a senior director of market strategy and competitive intelligence at Cribl. Nick, thanks for joining me.
Nick: Thanks for having me. Excited to be here.
Corey: So, let's start at the very beginning. I like playing with people's titles, and you certainly have a lofty one. 'Competitive intelligence' feels an awful lot like Jeopardy. What am I missing?
Nick: Well, I'm basically an internal analyst at the company. So, I spend a lot of time looking at the broader market, seeing what trends are happening out there, and looking at what kind of thought leadership content I can create to help people discover Cribl and get interested in the products and services that we offer. So, I'm mostly—you mentioned my time in Maine. I was a cryptologist in the Navy, and I spent almost all of my time focused on what the bad guys do. And in this job, I focus on what our potential competitors do in the market. So, I'm very externally focused. Does that help? Does that explain it?
Corey: No, it absolutely does. I mean, you folks have been sponsoring our nonsense, for which we thank you, but the biggest problem that I had with telling the story of Cribl was that initially it was, from my perspective, "What is this hokey nonsense?" And then I learned and got an answer, and then finished the sentence with, "And where can I buy it?" Because it seems that the big competitive threat that you have is something crappy that some rando sysadmin has cobbled together. And I say that as the rando sysadmin who has cobbled a lot of things like that together. And it's awful. I wasn't aware you folks had direct competitors.
Nick: Today we don't. There's a couple that might be emerging a little bit, but in general, no, it's mostly us, and that's what I analyze every day. Are there other emerging companies in the space? Are there open-source projects? But you're right, most of the things that we compete against are DIY today.
Absolutely.
Corey: In your previous role, which you were at for a very long time in tech terms—in a lot of other cases it's, "Okay, that doesn't seem that long," but seven and a half years is a respectable stint at a company—you were at Gartner doing a number of analyst-like activities. Let's start at the beginning, because I assure you, I'm asking this purely for the audience and not because I don't know the answer myself: what exactly is the purpose of an analyst firm, of which Gartner is the most broadly known, and, as a follow-up, why do companies care what Gartner thinks?
Nick: Yeah. It's a good question, one that I answer a lot. So, what is the purpose of an analyst firm? The purpose of an analyst firm is to get impartial information about something, whether that is supply chain technology, big data tech, or human resource management technologies. And it's often difficult, if you're an end-user interested in, say, acquiring a new piece of technology, to know what really works well and what doesn't. And so, at the analyst firm, in the course of a given year I would talk to nearly a thousand companies—both end-users and vendors, as well as investors—about what they're doing and what challenges they're having, and I would distill that down into 30-minute conversations with everyone else. And so we provided impartial information, in aggregate, to people who just wanted help. And that's the purpose of an analyst firm.
Your second question: why do people care? Well, I didn't get paid by vendors. I got paid by the company that I worked for, and so I got to be Tron; I fought for the users. And because I talked to so many different companies in different geographies, in different industries, and I shared that information with my colleagues and they shared theirs with me, we had a very robust understanding of what's actually happening in any technology market. And that's an uncommon kind of insight to have in any industry. So, that's the purpose, and that's why people care.
Corey: It's easy, from the engineering perspective that I used to inhabit, to make fun of it. It's, oh, it's purely justification when you're making a big decision, so if it goes sideways—because find me a technology project that doesn't eventually go sideways—I want to be able to make sure that I'm not the one that catches heat for it, because Gartner said it was good. They have an amazing credibility story going on there, and I used to have that very dismissive perspective. But the more I started talking to folks who are Gartner customers themselves, and doing some of the analyst-style things that I do with a variety of different companies, it's turned into, "No, no. They're after insight."
Because it turns out, from my perspective at least, the more that you are focused on building a product that solves a problem, the more you lose touch with the broader market, because the only people you're really talking to are either in your space, or have already acknowledged the problem, become your customer, and been jaded to see things from your point of view. Getting a more objective viewpoint from an impartial third party does have value.
Nick: Absolutely.
And I want you to succeed, I want you to be successful, I want to carry on a relationship with all the clients that I would speak with, and so one of the fun things I would always ask is, "Why are you asking me this question now?" Sometimes the questions would come in and they'd be very innocuous: "Compare these databases," or, "Compare these cloud services." "Well, why are you asking?" And that's when you get to, kind of like, the psychology of it. "Oh, we just hired a new CIO and he or she hates vendor X, so we have to get rid of it." "Well, all right. Let's figure out how we solve this problem for you." And so it wasn't always just technology comparisons. Technology is easy: you write a check and you hope for the best. But when you're dealing with large teams and maybe a globally distributed company, it really comes down to culture, and personality, and all the harder factors. And so those were always the most fun and certainly the most challenging conversations to have.
Corey: One challenge that I find in this space—in my narrow niche of the world where I focus on AWS bills, where things are extraordinarily yes or no, black or white, binary choices—is that I talk to companies, like during the pandemic, and they were super happy that, "Oh, yeah. Our infrastructure has auto-scaling and it works super well." And I look at the bill, and the spend graph over time is so flat you could basically play a game of pool on top of it. And I don't believe that I'm talking to people who are lying to me. I truly don't believe that people make that decision, but what they believe versus what is evidenced in reality are not necessarily congruent. How do you disambiguate the stories that people want to tell about themselves from what they're actually doing?
Nick: You have to unpack it. I think you have to ask a series of questions to figure out what their motivation is. Who else is on the call as well? I would sometimes drop into a phone call and there would be a dozen people on the line. Those inquiry calls would go the worst, because everyone wants to stake a claim, everyone wants to be heard, and no one's going to be honest with you or with anyone else on the call. So, you typically need to have a pretty personal conversation about what this person wants to accomplish, what the company wants to accomplish, and what the factors are that are pushing against those things.
It's like a novel, right? You have a character, the character wants to achieve something, and there are multiple obstacles in that person's way. And so by act five, ideally, everything wraps up and it's perfect. And my job is to get the character out of the tree that is on fire and onto the beach where the person can relax. So, you have to unpack a lot of different questions and answers to figure out, well, are they telling me what their boss wants to hear, or are they really looking for help? Sometimes you're successful, sometimes you're not. Not everyone does want to be open and honest. In other cases, you would have a team show up to a call with maybe a junior engineer, and they really just want you to tell them that the junior engineer's architecture is not a good idea. And so you do a lot of couples therapy as well. I don't know if this is really answering the question for you, but there are no easy answers. And people are defensive, they have biases, and companies overall are risk-averse.
I think you know this.
Corey: Oh, yeah.
Nick: And so it can be difficult to get to the bottom of what their real motivation is.
Corey: My approach has always been that if you want serious data, you go talk to Gartner. If you want [anec-data 00:09:48] and some understanding, well, maybe we can have that conversation, but they're empowering different decisions at different levels, and that's fine. To be clear, I do not consider Gartner to be a competitor to what I do in any respect. It turns out that I am not very good at drawing charts in varying shades of blue and positioning things just so with repeatable methodology, and they're not particularly good at having cartoon animals as their mascot that they put into ridiculous situations. We each have our portion of the universe, and that's working out reasonably well.
Nick: Well, and there's also something to unpack there, because I would say that people look at Gartner and they think they have a lot of data. To a certain degree they do, but a lot of it is not quantifiable data. If you look at a firm like IDC, they specialize in—like, they are a data house; that is what they do. And so their view of the world and how they advise their clients is different. So, even within analyst firms, there is differentiation in what approach they take and how consultative they might be with their clients, one versus another. So, there certainly are differences that you could find the more exposure you get into the industry.
Corey: For a while, I've been making a recurring joke that Route 53—Amazon's managed DNS service—is in fact a database. And then at some point, I saw a post on Reddit where someone said, "Yeah, I see the joke and it's great, but why should I actually not do this?" At which point I had to jump in and say, "Okay, look. Jokes are all well and good, but as soon as people start taking me seriously, it's very much time to come clean." Because I think that's the only ethical and responsible thing to do in this ecosystem.
Similarly, there was another great joke once upon a time. It was an April Fool's Day prank, and Google put out a paper about this thing they called MapReduce. Hilarious prank that Yahoo fell for hook, line, and sinker, and wound up building Hadoop out of it, and we're still paying the price for that, years later. You have a bit of a reputation from your time at Gartner as being—and I quote—"The man who killed Hadoop." What happened there? What's the story? And I appreciate your finally making clear to the rest of us that it was, in fact, a joke.
Nick: Well, one of the pieces of research that Gartner puts out every year is this thing called a Hype Cycle. And we've all seen it; it looks like a roller coaster in profile: a big mountain goes up really high, then comes down steeply, drops into a valley, and then—
Corey: 'The trough of disillusionment,' as I recall.
Nick: Yes, my favorite. And then it plateaus out. And one of the profiles on that curve was Hadoop distributions. And after years of taking inquiry calls, and writing documents, and speaking with everybody about what they were doing, we realized that this really wasn't taking off like everyone thought it was. Cluster sizes weren't getting bigger, people were having a lot of challenges with the complexity, and people couldn't find the skills to run it themselves if they wanted to.
And then the cloud providers came in and said, "Well, we'll make a lot of this really simple for you, and we'll get rid of HDFS," which was a good idea, but it didn't really scale well.
I think that the challenge of having to acquire computers with compute, storage, and memory again, and again, and again just was not sustainable for the majority of enterprises. And so we flagged it as: this will be obsolete before the plateau. And at that point we got a lot of hate mail, but it just seemed like the right decision to make, right? Once again, we're Tron; we fight for the users.
That seemed like the right advice and direction to provide to the end-users. And so I didn't make a lot of friends, but I think I was long-term right about what happened in the Hadoop space. Certainly, some fragments of it are left over, and we're still seeing—you know, Spark is going strong, there's a lot of Hive still around—but Hadoop as this amalgamation of open-source projects, I think, is effectively dead.
Corey: I sure hope you're right. I think it has a long tail like most things that are there. Legacy is the condescending engineering term for 'it makes money.' You were at Gartner for almost eight years and then you left to go work at Cribl. What triggered that? What was it that made you decide, "This is great. I've been here a long time. I've obviously made it work for me. I'm going to go work at a startup that apparently, even though it recently raised a $200 million funding round"—congratulations on that, by the way—"still apparently can't afford to buy a vowel in its name"? That's C-R-I-B-L, because of course it is. Maybe another consonant, while you're shopping. But okay, great. It's oddly spelled, and it is hard to explain, in some cases, to folks who are not already feeling pain in that space. What was it that made you decide to sit up and say, "All right, this is where I want to be"?
Nick: Well, I met the co-founders when I was an analyst. They were working at Splunk, and oddly enough—this is going to be an interesting transition compared to the previous thing we talked about—they were working on Hunk, which was: let's use HDFS to store Splunk data. Made a lot of sense, right? It could be much more cost-effective than high-cost infrastructure for Splunk. And so they told me about this; I was interested. And so I met the co-founders, and then I reconnected with them after they left and formed Cribl.
And I thought the story was really cool, because where they're sitting is between sources and destinations of observability data. And they were solving a problem that all of my customers had but couldn't resolve. They would try and build it themselves. They would look at—Kafka was a popular choice, but that had some challenges for observability data; it works fantastically well for application data. And they just had a very pragmatic view of the world that they were inhabiting and the problem that they were looking to solve. And it looked kind of like a no-brainer of a problem to solve. But when you double-click on it, when you really look down and say, "All right, what are the challenges with doing this?" they're really insurmountable for a lot of organizations. So, even though they may try a DIY approach, they often run into trouble after just a few weeks, because of all the protocols you have to support, all the different data formats, and all the destinations, and role-based access control, and everything else that goes along with it.
And so I really liked the team.
I thought the product inhabited a unique space in the market—we've already talked about the lack of competitors in the space—and I just felt like the company was on a rocket ship—or is a rocket ship—that basically had unbounded success potential. And so when the opportunity arose to join the team and do a lot of the things I like doing as an analyst—examining the market, talking to people, looking at competitive aspects—I jumped at it.
Corey: It's nice when you see those opportunities that show up in front of you, and the stars sort of align. It's like, this is not just something that I'm excited about and enthused about, but hey, they can use me. I can add something to where they're going and help them get there better, faster, sooner, et cetera, et cetera.
Nick: When you're an analyst, you look at dozens of companies a month, and I'd never seen an opportunity that looked like that. Everything kind of looked the same. There's a bunch of data integration companies, there's a bunch of companies with Spark and things like that, but this company was unique; the product was unique, and no one was really recognizing the opportunity. So, it was just a great set of things that all happened at the same time.
Corey: It's always fun to see stars align like that. So—
Nick: Yeah.
Corey: —help me understand, in a way that can be articulated to folks who don't have 15 years of grumpy sysadmin experience under their belts, what does Cribl do?
Nick: So, Cribl does a couple of things. Our flagship product is called LogStream, and the easiest way to describe that is as an abstraction between sources and destinations of data. And that doesn't sound very interesting, but coming from your sysadmin background, you're always dealing with events, logs—now there's traces, and metrics are also hanging around—
Corey: Oh, and of course, the time is never synchronized with anything either, so it's sort of a giant whodunit mystery where half the eyewitnesses lie.
Nick: Well, there's that. There's a lot of data silos. If you've got an agent deployed on a system, it's only going to talk to one destination platform. And you repeat this maybe a dozen times per server, and you might have 100,000 or 200,000 servers with all of these different agents running on them, each one locked into one destination. So, you might want to be able to mix and match that data; you can't. You're locked in.
One of the things LogStream does is it lets you do that exact mixing and matching. Another thing that this product does, that LogStream does, is it gives you the ability to manage that data. And what I mean by that is, you may want to reduce how much stuff you're sending into a given platform, because maybe that platform charges you by your daily ingest rate or some other kind of event-based charges. And so not all that data is valuable, so why pay to store it if it's not going to be valuable? Just dump it, or reduce the volume that you've got in that payload, like a Windows XML log.
And so that's another aspect that it allows you to do: better management of that stuff. You can redact sensitive fields, you can enrich the data with, say, GeoIPs, so you know what kind of data privacy laws you fall under, and so on. And so the story has always been: land the data in your destination platform first, then do all those things. Well, of course—because that's how they charge you; they charge you based on daily ingest.
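A toy sketch of the kind of pipeline pass Nick is describing—reduce, redact, enrich, then route—written in Python. This is not Cribl's actual API; the field names, the noisy-field list, and the GeoIP lookup are all invented for illustration.

    import hashlib

    NOISY_FIELDS = {"windows_xml_blob", "debug_context"}   # invented field names

    def lookup_region(ip):
        """Stand-in for a real GeoIP lookup."""
        return "eu-west" if ip and ip.startswith("10.") else "unknown"

    def process(event):
        """One pass over an event before it ever reaches a destination."""
        if event.get("level") == "DEBUG":
            return None                                    # reduce: don't pay to ingest noise
        slim = {k: v for k, v in event.items() if k not in NOISY_FIELDS}
        if "password" in slim:                             # redact sensitive fields
            slim["password"] = hashlib.sha256(slim["password"].encode()).hexdigest()[:12]
        slim["region"] = lookup_region(slim.get("client_ip"))  # enrich, e.g. GeoIP
        return slim                                        # route to any destination

    event = {"level": "INFO", "client_ip": "10.1.2.3", "password": "hunter2",
             "windows_xml_blob": "<Event>...</Event>"}
    print(process(event))

The point of the sketch is that the decision logic lives in one place, upstream, instead of being re-implemented per agent, per destination.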
And so now the story is: make those decisions upfront, in one place, without having to spread this logic all over, and then send the data where you want it to go. So, that's really the core product today: LogStream. We call ourselves an observability pipeline for observability data. The other thing we've got going on is this project called AppScope, and I think this is pretty cool. AppScope is a black-box instrumentation tool that basically resides between the application runtime and the kernel and any shared libraries. And so—without you having to go back and instrument code—it instruments the application for you, based on every call that it makes, and can then send that data through something like LogStream or to another destination. So, you don't have to go back and say, "Well, I'm going to try and find the source code for this 30-year-old C++ application." I can simply run AppScope against the process, find out exactly what that application is doing for me, and then relay that information to some other destination.

Corey: This episode is sponsored in part by Liquibase. If you're anything like me, you've screwed up the database part of a deployment so severely that you've been banned from touching anything that remotely sounds like SQL at at least three different companies. We've mostly got code deployments solved for, but when it comes to databases, we basically rely on desperate hope, with a rollback plan of keeping our resumes up to date. It doesn't have to be that way. Meet Liquibase. It is both an open source project and a commercial offering. Liquibase lets you track, modify, and automate database schema changes across almost any database, with guardrails to ensure you'll still have a company left after you deploy the change. No matter where your database lives, Liquibase can help you solve your database deployment issues. Check them out today at liquibase.com. Offer does not apply to Route 53.

Corey: I have to ask, because I love what you're doing, don't get me wrong. The counterargument that always comes up in this type of conversation is, "Who in their right mind looks at the state of the industry today and says, 'You know what we need? That's right; another observability tool.'" What differentiates what you folks are building from a lot of the existing names in the space? And to be clear, a lot of the existing names in the space are treating observability simply as hipster monitoring. I'm not entirely sure they're wrong, but that's a different fight for a different time.

Nick: Yeah, and I'm happy to come back and talk about that aspect of it, too. What's different about what we're doing is we don't care where the data goes. We don't have a dog in that fight. We want you to have better control over where it goes and what kind of shape it's in when it gets there. I'll give an example. One of our customers wanted to deploy a new SIEM—Security Information and Event Management—tool. But they didn't want to have to deploy a couple hundred thousand new agents to go along with it. They already had the data coming in from another agent; they just couldn't get the data to it. So, they used LogStream to send that data to their new desired platform. Worked great. They were able to go from zero to a brand-new platform in just a couple of days, versus fighting with rolling out agents and having to update them. Would they conflict with existing agents? How much performance impact would they have on the servers? And so on. So, we don't care about the destination. We like everybody.
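[Editor's note: for readers who want to picture the pattern Nick is describing, the following is a minimal, hypothetical Python sketch of shaping events once before fanning them out to multiple destinations. It illustrates the general idea only; it is not Cribl's configuration or API, and the field names and destination names are invented for illustration.]

```python
import re

# Hypothetical sketch: shape events once, in one place, then fan out.

def redact(event):
    # Mask anything that looks like a payment card number in the message.
    event["message"] = re.sub(r"\b\d{13,16}\b", "****", event["message"])
    return event

def trim(event):
    # Drop noisy fields you would otherwise pay to ingest and store.
    for field in ("verbose_xml", "debug_context"):
        event.pop(field, None)
    return event

def route(event):
    # Mix and match: one source can feed several destinations.
    return ["metrics_store"] if event["type"] == "metric" else ["siem", "archive"]

def process(event, destinations):
    event = trim(redact(event))
    for name in route(event):
        destinations[name].append(event)  # stand-in for a network send

destinations = {"siem": [], "archive": [], "metrics_store": []}
process(
    {"type": "log", "message": "card 4111111111111111 declined", "verbose_xml": "<...>"},
    destinations,
)
print(destinations["siem"])  # the shaped, redacted event, delivered to the SIEM
```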
We're agnostic when it comes to where that data goes. And—

Corey: Oh, it's not about the destination. It's about the journey. Everyone's been saying it, but you've turned it into a product.

Nick: It's very spiritual. So, we [laugh] send your observability data on a spiritual [laugh] journey to its destination, and we can do quite a bit with it on the way.

Corey: So, you said you'd come back to the "Oh, it's monitoring, but we're going to call it observability because otherwise we get yelled at on Twitter by Charity Majors" question. How do you view that?

Nick: Monitoring is the things you already know, right? You know what questions you want to ask; you get an alert if something goes out of bounds or goes from green to red. Think about monitoring as a data warehouse: you shape your data, you get it all into just the right condition, so you can ask the same question over and over again, over different time domains. That's how I think about monitoring. It's prepackaged; you know exactly what you want to do with it. Observability is more like a data lake: I have no idea what I'm going to do with this stuff, but I think there are going to be some signals in here that I can use, and I'm going to go explore that data. So, if monitoring is your known knowns, observability is your unknown unknowns. An ideal observability solution gives you an opportunity to discover what those are. Once you discover them, great. Now you can talk about how to get them into your monitoring system. So, for me, it's kind of a process of discovery.

Corey: Which makes an awful lot of sense. The problem I've always had with the monitoring approach is that it falls into this terrible pattern of enumerating the badness. In other words, "Imagine all the ways that this system can fail," and then build alerting that lets you know when any of those things happen. And what happens next is inevitable to anyone who's ever dealt with the tricksy devils known as computers: they find new ways to fail, and you generally get to add to the list of things to check for, usually at two o'clock in the morning.

Nick: On a Sunday.

Corey: Oh, absolutely. It almost doesn't matter when. The real problem when these things happen is, "What day, actually, is it?", and you have to check the calendar to figure it out, because it's your third time that week being woken up in the dead of night. It's like an infant, but less endearing. So, that has been the old-school approach, and there's unfortunately still an awful lot of, we'll just call it nonsense, in the industry that still does exactly the same thing, except now they call it observability because, hearkening back to earlier in our conversation, there's a certain point in the Gartner Hype Cycle that we are all existing within. What's the deal with that?

Nick: Well, I think that there are a lot of entrenched interests in the monitoring space, and so you always see this when a new term comes around. Vendors will say, "All right, well, there's a lot of confusion about this.
Let me back-fit my product into this term so that I can continue to look like I'm on the leading edge and not put any of my revenue in jeopardy." I know that's a cynical view, but I've seen it over and over again. And I think that's unfortunate, because there's a real opportunity to better understand your systems, to better understand what's happening in all the containers you're deploying and not tearing down the way that you should, to better understand what's happening in distributed systems. And it's going to be a real missed opportunity if that is what happens. If we just call this "Monitoring 2.0," it's going to leave a lot of unrealized potential in the market.

Corey: The big problem that I've seen in a lot of different areas is—I'll be direct—consolidation, where you have a company that starts to do a thing—and that's great—and then they start doing other things that are tied to it. And in turn, they start, I guess, gathering everything in the ecosystem. If you break observability down into its various constituent parts—I know, I know, the pillars thing is going to upset people; ignore that for now—and you have an offering that's weak in a particular area, then instead of building it organically into the product, or saying, "Yeah, that's not what we do," there's an instinct to acquire a company or build that functionality out. And it turns out that we're building what feels a lot to me like the SaaS equivalent of multifunction printers: they can print, they can scan, they can fax, and they do none of those three very well, so it winds up being something that dissatisfies everyone, rather than a best-of-breed solution with a very clear and narrow starting and stopping point. How do you view that?

Nick: Well, what you've described is a compromise, right? A compromise is something everyone can work with and no one is happy about. And I think that's where the advantage of LogStream comes in. The reality is best-of-breed. Most enterprises today have 30 or more different monitoring tools—call them observability tools if you want to—and you will never pry those tools from the dead hands of those sysadmins, DevOps engineers, SREs, et cetera. They have all integrated those tools into how they work and into their processes. So, we're living in a best-of-breed world. It's like that in data and analytics—my former beat—and it's like that in monitoring and observability. People really gravitate toward the tools they like; they gravitate toward the tools their friends are using. And so you need a way to be able to mix and match that stuff. And, just because I want to stay [laugh] on message, that's really where the LogStream story blends in, because we do that; we allow you to mix and match all those different pieces.

Corey: Joke's on you: I use Nagios, and I have no friends. I'm not convinced those two things are entirely unrelated, but here we are. So here's, I guess, the big burning question that a lot of folks—certainly not me, but other, undefined folks; "lots of people are saying"—have: you built something interesting that actually works. I want to be clear on this. I have spoken to customers of yours. They swear by it instead of swearing at it, which happens with other companies. Awesome. You have traction, you're moving forward, things are going great. "Here's $200 million" is the next part of that story, and on some level my immediate reaction—which does need updating, let's be clear here—is: all right, I'm trying to build a product. I can see how I could spend a few million bucks.
"Well, what can you do with, I don't know, 100 times that?" My easy answer is, "Something monstrous." I don't believe that is the case here. What is the growth plan? What are you doing that makes having that kind of war chest a useful and valuable thing?

Nick: Well, if you speak with the co-founders—and they've been open about this—we view ourselves as a generational company. We're not just building one product. We've been thinking about how we deliver on observability as this idea of discovery. What does that take? And it doesn't mean that we're going to be less agnostic about other destinations; we still think there's an incredible amount of value there, and that's not going away. But we think there's maybe an interim step to build out, potentially this idea of an observability data lake where you can explore these environments. Certainly, there are other options in the space today. Most of them are SQL-based, which is interesting, because the audience that uses monitoring and observability tools couldn't care less about SQL, right? They want search, they want regex, and so you've got to have the right tool for that audience. And so we're thinking about what that looks like going forward. We're also doubling down on people. Unsurprisingly, like anything else in software, this is people-intensive. So certainly those are other aspects we're exploring with the recent investment, but definitely, a multiproduct company is our future, along with continued expansion.

Corey: Expansion is always a fun one. Is it the idea of, great, you're looking at going deeper into the areas you're already active within, or is it more of a, "Ah, so we've solved the, effectively, log routing problem. That's great. Let's solve other problems, too"? Or is it more of a doubling down and focusing on what's working? And again, that probably sounds judgmental in a way I don't intend it to at all. I just have a hard time contextualizing that level of scale, coming from a small-company perspective the way that I do.

Nick: Yeah. Our plan is to focus more intently on the areas that we're in. We have a huge base of experience there. We don't want to be all things to all people; that dilutes the message down to nothing. So we want to be very specific about the audiences we talk to, the problems we're trying to solve, and how we try to solve them.

Corey: The problem I've always found with a lot of the acquisition growth thrashing of—let me call it what I think it is: companies in decline straining for relevancy—is that it feels almost like a, "We don't see a growth strategy, so we're going to try and acquire everything that holds still long enough," at some level trying to add more revenue to the pile, but also thrashing in the sense of, "Okay, they're going to teach us how to do things in creative, awesome ways." But it never works out that way. When you have a 50,000-person company acquiring a 200-person company, invariably the bigger culture is going to dominate. And I don't understand why that mistake seems to happen again, and again, and again. And people think I'm effectively alluding to—or whatever the spoken-word version of subtweeting is—a particular company or a particular acquisition. I'm absolutely not; there are probably 50 different companies listening right now who think, "Oh, God. He's talking about us." It's the common repeating trend. What is that?

Nick: It's hard to say. In some cases, these acquisitions might just be about talent: "We need to know how to do X.
They know how to do X. Let's do it." They may have very unique, niche technology or software that another company thinks it can apply more broadly. Also, at some of these big companies, these may not be board-level or CEO-level decisions. A business unit might decide, "Oh, I like what that company is doing. I'm going to go acquire it." And so it looks like MegaCorp bought TinyCorp, but really, it's this tiny business unit within MegaCorp that bought the tiny company. The reality is often different from what it looks like on the outside. So, that's one way. Another is, you know, the "they're going to teach us to be more effective with tech" idea. You're never going to beat culture. You're never going to beat the existing culture. If it's 50,000 against 200, obviously we know who wins there. And so I don't know if that's realistic. I don't know if the big companies are genuine when they say that; it could just be the messaging they use to make people happy and hopefully retain those new employees for as long as they can. Does that make sense?

Corey: No, it makes perfect sense. It's the right answer. It does articulate what is happening there, and I think I keep falling prey to the same failure. And it's hard. It's pernicious. But companies are not monolithic entities. There's no one person at each of these companies making these giant unilateral decisions. It's always some product manager or some particular person who has a vision and a strategy in the department. It's not as though the company board agrees on every little decision that gets made. They're distributed entities in many respects.

Nick: Absolutely. And that's only getting more pervasive as companies get larger [laugh] through acquisition. So, you're going to see more and more of that. It will look like one label, one brand, but internally, I think, that's often the exact opposite of what actually happened and of how that decision got made.

Corey: Nick, I want to thank you for taking so much time to speak with me about what you're up to over there, how your path has shaped how you view the world, and also what Cribl does these days. If people want to learn more about what you're up to, how you think about the world, or even possibly going to work at Cribl (which, having spoken to a number of people over there, I would endorse), how do they find you?

Nick: The best place to find us is by joining our community: cribl.io/community, and Cribl is spelled C-R-I-B-L. You can certainly reach out there; we've got about 2,300 people in our community Slack, so it's a great group. You can also reach out to me on Twitter; I'm @nheudecker, N-H-E-U-D-E-C-K-E-R. Tell me what you thought of the episode; I'd love to hear it. And beyond that, you can also sign up for our free cloud tier at cribl.cloud. It's pretty generous: one terabyte a day of processing, so you can start to send data in and route it wherever you'd like.

Corey: To be clear, this is free as in beer, not free as in AWS free tier?

Nick: This is free as in beer.

Corey: Excellent. Excellent.

Nick: I think I'm getting that right. I think it's free as in beer.
And the other thing you can try is our hosted solution on AWS, a fully managed cloud at cribl.cloud. We offer free processing of one terabyte per day, so you can start to send data into that environment and route it wherever you'd like it to go, in whatever shape that data needs to be in when it gets there.

Corey: And we will, of course, put links to that in the [show notes 00:35:21]. Thank you so much for your time today. I really appreciate it.

Nick: No, thank you for having me. This was a lot of fun.

Corey: Nick Heudecker, Senior Director of Market Strategy and Competitive Intelligence at Cribl. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with a comment explaining that the only real reason a startup should raise a $200 million funding round is to pay that month's AWS bill.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
It's our first double-interview episode on the podcast! Alex is currently an Academic Advisor for the College of Communication and Information at Kent State University. She earned a master's degree in Higher Education Administration and Student Affairs, a Certificate in International Education, and a bachelor's degree in Human Development and Family Studies with a concentration in Family Life Education from Kent State University. Ted is currently a Senior Anti-Money Laundering (AML) Compliance Analyst. He earned a bachelor's degree in Human Development and Family Studies with a concentration in Child and Youth Development from Kent State University. Alex and Ted met at Kent State while attending HDFS classes for their major and serving as Resident Advisors in the campus residence halls. In this episode, they share how they found the field of HDFS and their professional experiences to date. As is true for all interviewees on this podcast, Alex and Ted's views are their own as private citizens and do not reflect the views of their current, former, or future employers.
Ericca is currently a School Social Worker at a charter school in Columbus, Ohio. She earned a master's degree in social work from The Ohio State University. She also holds a bachelor's degree in Human Development and Family Studies from Kent State University. In this episode, she discusses how she found the field of HDFS and her professional experiences to date. Please note that this episode does briefly discuss domestic violence.
Software Engineering Radio - The Podcast for Professional Software Developers
Dhruba Borthakur, CTO and co-founder of Rockset, joins host Kanchan Shringi to discuss the use cases and core requirements of real-time analytics, the evolution from batch to real time, and the need for a new architecture.
Sharonda earned a master's degree in Social Work from the University of Tennessee, Knoxville. She also holds a bachelor's degree from Middle Tennessee State University with a major in Child Development and Family Studies and a minor in Business Administration. She also holds a Certified Family Life Educator-Provisional (CFLE-P) credential from the National Council on Family Relations. She is the author of a book titled Mental Makeover: Reclaiming Your Beauty from the Inside Out. She is currently a Licensed Master Social Worker and Outpatient Therapist working in the field of community mental health services in Tennessee. In this episode, she discusses how she found the field of HDFS and her professional experiences to date. As is true for all interviewees on this podcast, Sharonda's views are her own as a private citizen and do not reflect the views of her current, former, or future employers. Website: http://sharondastiggers.com/
Adrienne earned a master's degree in Human and Social Services specializing in Mental Health Facilitation from Walden University. She also holds a bachelor's degree from Middle Tennessee State University with a major in Child Development and Family Studies and a minor in Psychology. She is currently a Team Lead for the Care Management Department of a nonprofit mental health agency in Tennessee. In this episode, she discusses how she found the field of HDFS and her professional experiences to date. As is true for all interviewees on this podcast, Adrienne's views are her own as a private citizen and do not reflect the views of her current, former, or future employers.
Olivia is a mama; a pregnancy, birth, and postpartum coach; a holistic health educator; and a gentle parenting advocate. She is the founder of Healthy Balanced Motherhood, a platform for mothers, partners, new parents, caregivers, and birth professionals. She works one-on-one with clients and families to provide them with information about pregnancy, birth, postpartum, and parenthood that is specifically catered to their individual needs.
Website: Healthy Balanced Birth
Instagram: Olivia (@healthybalancedmotherhood)
The 411k podcast episode on birth + finances: 037. Healthy Financials and Healthy Births
RESOURCES:
Evidence Based Birth®
Documentaries: The Business of Being Born; Pregnant in America; Why Not Home? A Documentary Film Exploring the Surprising Birth Choices of Doctors and Nurses
Books: Birth Without Fear: The Judgment-Free Guide to Taking Charge of Your Pregnancy, Birth, and Postpartum by January Harshe; Ina May's Guide to Childbirth (Updated With New Material) by Ina May Gaskin; Spiritual Midwifery by Ina May Gaskin; The Birth Partner, 5th Edition: A Complete Guide to Childbirth for Dads, Partners, Doulas, and All Other Labor Companions by Penny Simkin; Birthing from Within: An Extra-Ordinary Guide to Childbirth Preparation
HEHE: Doula + Birth Prep, M.S. HDFS (@tranquilitybyhehe)
SARAH: Dr. Sarah, MS, DC | Birth Ed (@birthuprising)
This is another podcast episode for my HDFS 334 class at Colorado State University. This episode is about divorce and how it impacts parents and children. I applied these concepts to the movie Marriage Story.
***In this bonus blog post episode, I share 5 tips for starting the process of career exploration within the field of HDFS.*** This season, I am launching a new series of bonus episodes based on content from the blog on my HDFS Careers website--a blog in which I provide unsolicited advice about preparing for and finding jobs as a family science major. I decided to share the blog posts audibly on the podcast because I know first-hand that it can be hard to make time to sit down and read things on the internet. With these bonus episodes, you can listen as you drive, do chores, walk to class, whatever. I will literally just be reading them straight from the website, so don't expect any frills--just the information. The views on the blog, website, and podcast are entirely my own, based on my own experiences and opinions. They do not necessarily reflect the opinions of my current or former workplaces. Also, I make no guarantees about the outcomes of taking advice on the blog, website or podcast. It is simply my opinion, and everyone has to make their own choices that are best for them.
Amanda is currently an Associate Therapist at Grow Through Life Counseling, a group private practice providing mental health care to children and adolescents. She is also a Music and Fitness Teacher at San Carlos Preschool and a member of the California Association for Play Therapy, where she serves as President-Elect of the San Diego branch. She earned a master's degree in Early Childhood Mental Health from San Diego State University. She also holds a graduate certificate as an Early Childhood Socio-Emotional Behavior Regulation Intervention Specialist (EC-SEBRIS) from San Diego State University, and she has a bachelor's degree in Child and Family Development (specializing in Trauma-Informed Care) from San Diego State University. In this episode, she discusses how she found the field of HDFS and her professional experiences to date. As is true for all interviewees on this podcast, Amanda's views are her own as a private citizen and do not reflect the views of her current, former, or future employers. Organization Mentioned in this Episode: https://www.calplaytherapy.org/
Maddie earned a bachelor's degree in Human Development and Family Studies from the University of Houston. While completing her degree, she also earned an Early Childhood Interventionist Certificate and a minor in Education. She is currently an Outreach Specialist at the Children's Museum of Houston and the Owner and Founder of Maddie's Delights, her baking company. In this episode, she discusses how she found the field of HDFS and her professional experiences to date. As is true for all interviewees on this podcast, Maddie's views are her own as a private citizen and do not reflect the views of her current, former, or future employers. Maddie's Delights Facebook: https://www.facebook.com/maddiesdelights/ Maddie's Delights Instagram: https://www.instagram.com/maddiesdelights/
This is another podcast episode for my HDFS 334 class at Colorado State University. This episode focuses on how parents affect their child and how that influence can shape bullying behaviors.
Caroline is a Certified Child Life Specialist II at Le Bonheur Children's Hospital in Memphis, Tennessee. She earned a bachelor's degree in Human Development and Family Studies from the University of Alabama with a double concentration in Child Life and Early Childhood Education. She earned a master's degree in Human Development and Family Studies with a concentration in Parent-Family Life Education, also from the University of Alabama. In this episode, she discusses how she found the field of HDFS and her professional experiences to date. As is true for all interviewees on this podcast, Caroline's views are her own as a private citizen and do not reflect the views of her current, former, or future employers. Link Mentioned: Association of Child Life Professionals
This is a homework assignment for my HDFS 334 class. This episode describes the indifferent parenting style and applies it to the Netflix TV show "Shameless."
Join us this week as we sit down with two guests, Anna Whitaker and Meredith Byrd, to talk about spiritual disciplines. Anna is a new friend of ours and a senior Christian Studies major here at AU, and if her last name sounds familiar, it's because we had her husband on just a few episodes ago! Mere is an old friend of ours and a junior HDFS major at AU; if her name sounds familiar, it's because this is her fourth time on the podcast! In the first part of the episode (3:40-7:40), we have the usual friendly banter, hear a little bit about our guests, and introduce today's conversation! In the second part of the episode (8:05-33:06), we hear a little bit about the history of Christian monasticism and the idea of the "rule of life" from Anna, and look at a book about liturgies in everyday life, "The Liturgy of the Ordinary," with Meredith. We ask some questions about spiritual disciplines, like: what even are they? We focus in on spiritual disciplines as sustaining us through the mundane, strengthening our faith when we need it most. We talk about the de-spiritualization of Christianity and how the superficiality of some churches causes people to forget about the spiritual disciplines and lose the wonder of Christianity. We talk about how, if living in the Spirit were easy, everyone would do it; the spiritual disciplines are called disciplines for a reason. We discuss emotions and personal experience, and discerning between when it's the Spirit and when it's emotion, moving into the next part of our discussion. In the last part of the episode (33:36-1:00:36), we look at the misconceptions around being receptive to your emotions within the Church and Christianity, and juxtapose that against the example Christ set for us in His time on Earth. We take a look at the Catholic and Anglican Churches and discuss how they engage all of the senses using the spiritual disciplines. We talk about how we are spiritual creatures not trapped in physical bodies, but instead gifted them by God, able to use them to grow our spiritual selves. We need to exercise both the Spirit and the Body with spiritual disciplines. We close out by discussing the personal nuance of the spiritual disciplines. Your walk with God is just that—yours. We share our own experiences with the spiritual disciplines and discuss how what works for one person may not work for another. Lastly, we encourage our listeners to lean into the ways they connect with God and implement disciplines and liturgies into their own lives to exercise their whole self and grow in their spiritual walks. For more information on what we're all about here at The Audibility Podcast, go ahead and check out our website https://audibilitypodcast.com, and to get connected with us, follow us on Instagram, @audibilitypodcast. Resources: "The Liturgy of the Ordinary" by Tish Harrison Warren; "You Are What You Love" by James K. A. Smith; "The Celebration of Discipline" by Richard Foster; "Sacred Pathways" by Gary Thomas; "God in My Everything" by Ken Shigematsu
Dr. Brian Bishop-Wilkey is currently a Shopper & Category Insights Manager at Abbott Nutrition. He earned a Ph.D. in Human Development and Family Sciences from the University of Texas at Austin, a master's degree in Social Psychology from Texas A&M University, and a bachelor's degree in Psychology from Miami University. In this episode, he shares the story of how he found the field of HDFS and his professional experiences to date. As is true for all interviewees on this podcast, Dr. Bishop-Wilkey's views are his own as a private citizen and do not reflect the views of his current, former, or future employers. Book Mentioned in this Episode: Why We Buy by Paco Underhill
Dr. Catherine Cushinberry is currently the Executive Director/Vice President of City Year Memphis, where she maintains strategic relationships with Shelby County, Journey Community Schools, Believe Memphis Academy, Frayser Community Schools, City Year Memphis' Board of Directors, corporate and philanthropic partners, and the Memphis community. Dr. Cushinberry earned a Ph.D. in Human Development and Family Studies from the University of Missouri--Columbia. She also earned a master's degree in Organizational Communication from the University of Memphis and a bachelor's degree in Organizational Communication from Murray State University. In this episode, she shares the story of how she found the field of HDFS and her professional experiences to date. As is true for all interviewees on this podcast, Dr. Cushinberry's views are her own as a private citizen and do not reflect the views of her current, former, or future employers. Book Mentioned in this Episode: The Alchemist by Paulo Coelho
At the time of this recording, Allison was the Program Manager for Community Day Services. Since recording, she has earned a promotion. She begins her new role soon as the Director of Community Living and Community Day Service Programs. Congrats on the promotion, Allison! Allison earned a bachelor's degree in Human Environmental Sciences with a concentration in Human Development and Family Studies from the University of Missouri--Columbia. She also holds the Qualified Intellectual Disabilities Professional (QIDP) credential. In this episode, she discusses how she found the field of HDFS and her professional experiences to date. As is true for all interviewees on this podcast, Allison's views are her own as a private citizen and do not reflect the views of her current, former, or future employers. Books Mentioned in this Episode: Limbo: Blue-Collar Roots, White-Collar Dreams by Alfred Lubrano When Love Dies: How to Save a Hopeless Marriage by Judy Bodmer Listen at the player or subscribe to the podcast on iTunes.
Dr. Michalski is the Founder & CEO of Strategically Authentic. Her specialties include Strategic Planning, Leadership Development, and Project Management, to name just a few. She earned an Ed.D. in Instructional Leadership and Andragogy from Lindenwood University. She also earned a M.Ed. in Educational Leadership and Policy Analysis and a B.S. in Human Development and Family Studies from the University of Missouri--Columbia. In this episode, she shares the story of how she found the field of HDFS and her professional experiences to date. As is true for all interviewees on this podcast, Dr. Michalski's views are her own as a private citizen and do not reflect the views of her current, former, or future employers. Connect with Dr. Michalski on Instagram!
Coriell Haughton Weeks is currently a Preschool Director at a private school. She earned a bachelor's degree from the University of Alabama with a major in Human Development and Family Studies. In this episode, she shares the story of how she found the field of HDFS and her experience working in nonprofits, a school-based after-school program, faith-based organizations, early childhood education, and more! Coriell's views are her own as a private citizen and do not reflect the views of her current, former, or future employers.
Dr. Liz Keneski is currently the Head of Privacy Research for Facebook. She earned both a master's degree and a Ph.D. in Human Development and Family Sciences from the University of Texas at Austin. She also earned a bachelor's degree in Psychology from Colorado State University. In this episode, she shares the story of how she found the field of HDFS and her professional experiences to date. As is true for all interviewees on this podcast, Dr. Keneski's views are her own as a private citizen and do not reflect the views of her current, former, or future employers.
Dr. Natalie Hengstebeck is currently an American Association for the Advancement of Science (AAAS) Science and Technology Policy Fellow (STPF) at the National Science Foundation, where she works on strategic partnerships and evaluation within the Computer and Information Science and Engineering (CISE) Directorate. Dr. Hengstebeck earned both a M.S. and a Ph.D. in Human Development and Family Studies from the University of North Carolina at Greensboro. She also earned a bachelor's degree in Psychology from DePaul University with minors in Sociology, Professional Writing, and Communications and Media Studies. In this episode, she shares the story of how she found the field of HDFS and her professional experiences to date. As is true for all interviewees on this podcast, Dr. Hengstebeck's views are her own as a private citizen and do not reflect the views of her current, former, or future employers.
In this episode, we hear from Dennis Lynn, Senior Instructor of Human Development and Family Sciences, and from 3rd-year HDFS student Thelma Pruett. The HDFS program at OSU-Cascades incorporates a unique blend of compassion and critical thinking, encouraging students to lead with love and open hearts that are willing to learn and listen, in balance with the best thinking they can bring to it. Dennis shares the program's emphasis on understanding one's self through self-reflection--what are the things that you're drawn to, what fits you, what doesn't--and using those insights to find how to make the greatest difference in the world. As Dennis brilliantly summarizes, "Know yourself, serve your family, then let it ripple out in amazing ways." Thelma talks about the experiences that led her to the passion she found within human services: "What do you like to do, what makes you happy, what interests you? Volunteer in different places and know your decisions will lead you to what you want to do... Finding what you don't like to do is as important as finding what you like to do. So being mindful, paying attention to yourself and your own experiences--it will lead you to success." The HDFS program has ultimately helped her gain a better understanding of society in general and all the challenges that we as human beings go through in life.
April Shaver is currently an Advocacy Specialist and N.E.S.T. (Nurturing Educational Social Triumphs) Team Lead at Child Advocates of Fort Bend in Rosenberg, Texas. She earned a bachelor's degree in Human Development and Family Studies from the University of Houston. In this episode, she shares the story of how she found the field of HDFS and the professional experiences she has had so far, including working in child life and teaching in Thailand. (Please note that child abuse and neglect is discussed in this episode.)
Equiller Mahone is currently the Assistant Executive Director for Children's Advocacy Centers of Mississippi in Jackson, Mississippi. She earned a bachelor's degree and a master's degree in Human Development and Family Studies from the University of Alabama. In this episode, she shares the story of how she found the HDFS major and her professional experiences working in the areas of domestic violence and child advocacy. She also shares lots of career tips based on her experience interviewing and hiring. (Please note that various forms of family violence and neglect are discussed in this episode.)
Krystal Vann is currently a Program Innovation Manager at Child Advocates of Fort Bend in Rosenberg, Texas. She earned a bachelor's degree from the University of Houston with a major in Human Development and Family Studies and a minor in Studio Art. In this episode, she shares the story of how she found the field of HDFS after initially majoring in art and the professional experiences she has had teaching in a charter school and working in child advocacy to help children in foster care. (Please note that child abuse and neglect is discussed in this episode.)
Sepideh Nash is currently a Licensed Marriage and Family Therapist Associate in Houston, Texas. She earned a bachelor's degree in Human Development and Family Studies from the University of Houston and a master's degree in Marriage and Family Therapy from the University of Houston--Clear Lake. In this episode, she shares the story of how she found the field of HDFS and the professional experiences she has had so far.
Welcome, and thanks for taking a few minutes to explore this podcast! In this brief first episode, I will tell you a bit about who I am (Erica Jordan), why I decided to start this podcast, and what to expect from it. Resources Mentioned: Roadtrip Nation; HDFSCareers.com
This episode covers the coronavirus, cancelled conferences, the popularity of programming languages, GraphQL, Ghostcat, and plenty of other things. The intro is already a bit dated: news about the coronavirus is arriving even faster than new JavaScript frameworks. Recorded on March 13, 2020. Download the episode: [LesCastCodeurs-Episode-227.mp3](https://traffic.libsyn.com/lescastcodeurs/LesCastCodeurs-Episode-227.mp3) ## News ### Coronavirus What the big companies are doing: * no meetings * conferences cancelled * limits on working at the office ### Languages [RedMonk ranking - the top language is...](https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/) * JavaScript, Python, Java * TypeScript in the top 10 * R is climbing * Rust stable, like Go (+1) * Kotlin 19th, Scala 13th [InfoQ meta-survey](https://www.infoq.com/news/2020/02/developer-surveys/) * Java 8 is the most deployed in production, with 25% on Java 11 and non-LTS releases trailing * Spring 60-80% * IntelliJ 60-80%, Eclipse 20-25% * Maven vs Gradle: 66-33 or 50-50 [Survey on Scala](https://scalacenter.github.io/scala-developer-survey-2019/) [Scala.js 1.0.0](https://www.scala-js.org/news/2020/02/25/announcing-scalajs-1.0.0/) * 7 years of development * not binary compatible with 0.6 nor with 1.0RCx * write front-end applications in Scala * interop with JavaScript libraries [GraalVM gets an advisory board](https://jaxenter.com/graalvm-project-advisory-board-168885.html?utm_source=twitter&utm_medium=social&utm_campaign=1week) * Gluon, Red Hat, Amazon, Microdoc, Shopify, Twitter, OCI, Neo4J, Pivotal, ARM and, of course, Oracle [Big investment round in Azul](https://www.azul.com/press_release/azul-systems-announces-strategic-growth-equity-investment-by-vitruvian-partners/) * investment / acquisition: $340M ### Libraries [Eclipse MicroProfile GraphQL 1.0](https://microprofile.io/2020/02/25/microprofile-graphql-1-0-released/) * GraphQL: a spec that generalizes endpoints by giving clients flexibility in what they request and in the shape of the returned graph * makes the GraphQL schema available * executes GraphQL requests * code-first approach [Apache Camel 3.1, with 3.0 deprecated](https://camel.apache.org/blog/release-3-1-0.html) [The Camel migration guide](https://camel.apache.org/manual/latest/camel-3x-upgrade-guide.html) * memory improvements [Lightbend receives a $25M investment](https://www.benzinga.com/pressreleases/20/03/g15505587/lightbend-closes-25-million-investment-round-led-by-dell-technologies-capital#/.XmZXZGRo7yc.twitter) * from Dell Technologies Capital * for the reactive side of the business * specifically for "serverless" * no mention of Scala OPTIONAL [Lightbend - an article on why a reactive architecture matters for cloud native](https://www.lightbend.com/blog/stateful-cloud-native-applications-why-reactive-matters) * a good refresher * data localized per microservice * the advantages of event-based systems ### Middleware [ElasticSearch in production, the things to know](https://facinating.tech/2020/02/22/in-depth-guide-to-running-elasticsearch-in-production/) * the basic concepts (clusters, nodes, indices, and shards) * quorum * how nodes join the cluster * segments and merging * memory management: compressed pointers (careful, it is counterintuitive: keep the heap under 30 GB, with 2x the heap in machine RAM); see [this explanation of the JVM's compressed oops](https://stackoverflow.com/questions/25120546/trick-behind-jvms-compressed-oops#25120926) * options per workload (write-heavy vs read-heavy) * topology * monitoring ### Infrastructure [The M&A story of Have I Been Pwned: why the sale was abandoned](https://www.troyhunt.com/project-svalbard-have-i-been-pwned-and-its-ongoing-independence/) * the firm KPMG * due diligence * endless questions * the doubts * exclusivity * the risk of a change in strategy ### Cloud People are grumbling because GKE clusters will get a management fee of 10 cents/hour, which changes the developer's relationship to the cluster (how many clusters you run in parallel). [A comparison of cluster prices by size and hosting provider](https://devopsdirective.com/posts/2020/03/managed-kubernetes-comparison/) Amazon announces [Bottlerocket](https://aws.amazon.com/fr/bottlerocket/) * updates by rebuilding the image rather than updating packages * more immutable, and therefore easier to roll back * on the other hand, each host goes down and comes back up * fine if you have an orchestrator ### Tooling [IntelliJ Big Data Tools](https://blog.jetbrains.com/blog/2020/02/25/update-on-big-data-tools-plugin-spark-hdfs-parquet-and-more/) * an IDE for big data! * already integrates with Zeppelin and S3 * new: Spark, HDFS, Parquet ### Architecture [Simple systems have less downtime](https://www.gkogan.co/blog/simple-systems/?r=0) * easy to understand, easy to fix * faster to ramp people up * finding the root cause is quicker * simple solutions leave more alternatives available * rules: features do not justify complexity; complex ideas lead to complex implementations; modify before adding * the challenge: automating so you can do it with fewer people? OPTIONAL [11 reasons why you are going to fail with microservices](https://medium.com/xebia-engineering/11-reasons-why-you-are-going-to-fail-with-microservices-29b93876268b) * see the section titles OPTIONAL [A retrospective on the incorrect use of a tool: Bloom filters](https://blog.cloudflare.com/when-bloom-filters-dont-bloom/) * a problem seemingly ideal for Bloom filters * but suspiciously slower than expected * profilers * random memory access >> sequential reading (the structure was too big for L3) * a simpler alternative that reduces the number of memory loads, not the memory consumption ### Methodologies [Merge trains](https://about.gitlab.com/blog/2020/01/30/all-aboard-merge-trains/) * rebasing, the race against your colleagues * keeping master green for continuous delivery * you cannot do many merges in parallel without doing lots of rebases * a merge train sequentializes and batches the merges [A look back at the GitFlow model](https://georgestocker.com/2020/03/04/please-stop-recommending-git-flow/) * not intuitive (bidirectional merges over time between develop, feature branches, release branches, hotfixes, and master), with a high cognitive cost * increased risk of merge conflicts * you cannot rebase * at odds with continuous delivery: too many barriers * unmanageable with multiple repositories or a monorepo (microservices) * fine for quarterly release cycles with teams working on releases in parallel [Measuring code complexity: a better metric](https://empear.com/blog/bumpy-road-code-complexity-in-context/) * cyclomatic complexity is a poor oracle of code complexity * nested conditional logic taxes our working memory (roughly: indentation) * functions with multiple "bumps" of indentation are the worst * refactor to externalize each bump (see the sketch below) In SonarQube this is called Cognitive Complexity. Here is an example on XWiki code where you can see very clearly, visually, what it means: https://sonarcloud.io/project/issues?id=org.xwiki.commons%3Axwiki-commons&issues=AWzY6RXo8pMOHxUYvkyE&open=AWzY6RXo8pMOHxUYvkyE
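To make the "bumpy road" point concrete, here is a small hypothetical Python sketch (not from the article): both functions contain the same four branches, so their cyclomatic complexity is identical, but the nested version stacks them, which is exactly what cognitive complexity penalizes and cyclomatic complexity ignores.

```python
# Both functions have the same number of branches, but the flat version
# places far less load on working memory.

def ship_nested(order):
    # Deeply nested conditionals: each level is one more thing to hold in mind.
    if order is not None:
        if order["paid"]:
            if order["in_stock"]:
                if not order["flagged_fraud"]:
                    return "shipped"
    return "rejected"

def ship_flat(order):
    # Guard clauses: same logic, read top to bottom with no mental stack.
    if order is None:
        return "rejected"
    if not order["paid"]:
        return "rejected"
    if not order["in_stock"]:
        return "rejected"
    if order["flagged_fraud"]:
        return "rejected"
    return "shipped"

sample = {"paid": True, "in_stock": True, "flagged_fraud": False}
assert ship_nested(sample) == ship_flat(sample) == "shipped"
```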
### Security [Ghostcat: the flaw in Tomcat 6 through 9](https://snyk.io/blog/ghostcat-breach-affects-all-tomcat-versions/) * in the Apache JServ Protocol (implicitly trusted by Tomcat as a request source) * can read the contents of web apps * if the webapp allows uploads, it turns into remote code execution * upgrade Tomcat 7, 8, or 9; if you are on 6, you are out of luck * beware: Tomcat is embedded in lots of tools, such as Wildfly, Spring Boot, etc. [Let's Encrypt revokes 3 million multi-domain certificates](https://thehackernews.com/2020/03/lets-encrypt-certificate-revocation.html?m=1#click=https://t.co/zViFYyMIse) ### Law, society, and organizations [Amicus brief on API copyright by IBM and Red Hat](https://www.redhat.com/en/blog/red-hat-urges-us-supreme-court-support-unrestricted-use-software-interfaces) * computer interfaces are not copyrightable * they are an engine of the software economy * the case will be heard in the spring [Amicus brief by researchers, attacked by Oracle](https://twitter.com/joshbloch/status/1237507340514889729) * as being paid by Google OPTIONAL [The Equifax hackers charged with crimes](https://www.infoq.com/news/2020/02/equifax-charges/?utm_campaign=infoq_content&utm_source=twitter&utm_medium=feed&utm_term=java) * the DOJ charges four Chinese military officers * the Struts CVE ## Beginner's corner [Yak shaving, as practiced by Donald Knuth](https://yakshav.es/the-patron-saint-of-yakshaves/) * write a book * write a program to write the book * invent a programming language to write the program * invent a pagination mode * design a typeface * write a tool to build the typefaces * invent a version control system for the program * implement a home-grown abstraction language for printed documents ## Conferences CANCELLED - [Breizhcamp, March 25-27, 2020](https://www.breizhcamp.org/) CANCELLED - [MiXiT, April 29-30, 2020](https://mixitconf.org/) VIRTUAL - [GitHub Satellite, May 6-7](https://githubsatellite.com/) CANCELLED - [RivieraDev, May 13-15, 2020](https://rivieradev.fr/) [Devoxx UK, May 13-15, 2020](https://www.devoxx.co.uk/) [NewCrafts, May 28-29, 2020](http://ncrafts.io/) [AlpesCraft, June 4-5, 2020](https://www.alpescraft.fr/) CANCELLED - [Best of Web, June 4-5, 2020](http://bestofweb.paris/) [DevFest Lille, June 12, 2020](https://devfest.gdglille.org/) (the [CFP](https://conference-hall.io/public/event/4o1awYXIRayhu3vmOmiQ) is open) [Voxxed Days Luxembourg, June 17-19, 2020](https://luxembourg.voxxeddays.com/) CANCELLED - [Serverless Days Paris, July 1, 2020](https://paris.serverlessdays.io/en/) NEW DATE - [Devoxx France, July 1-3, 2020](https://www.devoxx.fr/) [Sunny Tech, July 2-3, 2020](https://sunny-tech.io/) And many more on the [Developers Conferences Agenda/List](https://github.com/scraly/developers-conferences-agenda/blob/master/README.md)... [Aurélie's list](https://github.com/scraly/developers-conferences-agenda) ## Contact us Support Les Cast Codeurs on Patreon: https://www.patreon.com/LesCastCodeurs [Do a crowdcast or ask a crowdquestion](https://lescastcodeurs.com/crowdcasting/) Reach us on Twitter (https://twitter.com/lescastcodeurs), on the Google group (https://groups.google.com/group/lescastcodeurs), or on the website (https://lescastcodeurs.com/)
Apache Hadoop. HDFS. Apache Hive. Apache Spark. Presto. Architecture of Giants: Data Stacks at Facebook, Netflix, Airbnb, and Pinterest. Data Wrangling. Null++ Docker Episode. Julia Language. Kaggle. SED Podcast episode: Slack Data Platform with Josh Wills. Article: Software 2.0. Aya's recommendations for learning: Towards Data Science. Statistics and Data Science MicroMasters. DataCamp. Udemy: Python for Data Science and Machine Learning Bootcamp. Coursera's Deep Learning Specialization. Lex Fridman's Artificial Intelligence Podcast & YouTube channel. Episode notes: Aya: How to Lie with Statistics (book). Luay: Great Expectations, a data pipeline testing framework. Alfy: JAM Stack.
In this video we are going through the high availability options you have for the mission-critical services running within the SQL Server Big Data Clusters. Find out more here: High Availability; High Availability for HDFS and Spark. [00:00] - Introduction [01:00] - Storage recap for Big Data Clusters [02:47] - High Availability options [04:34] - High Availability for SQL Server master instance [06:18] - Availability Groups in Big Data Clusters [07:07] - High Availability for HDFS [07:51] - High Availability for Spark [08:27] - Demo: configuring High Availability in Big Data Clusters [10:55] - Getting started [11:25] - Summary and wrap-up
Amazon EMR enables customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. We focus this session on running Apache Spark on Amazon EMR. We introduce design patterns such as using Amazon S3 instead of HDFS, running long- and short-lived clusters, using notebooks, and performance-related enhancements. We discuss lowering cost with auto scaling and Spot Instances, and security with encryption and fine-grained access control with AWS Lake Formation.
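As a concrete illustration of the "Amazon S3 instead of HDFS" design pattern, here is a minimal PySpark sketch; the bucket, prefixes, and column names are invented for illustration, and it assumes an EMR cluster where s3:// paths are served by EMRFS:

```python
from pyspark.sql import SparkSession

# On EMR, s3:// URIs go through EMRFS, so S3 can stand in for HDFS as the
# durable data layer and the cluster itself stays stateless and short-lived.
spark = SparkSession.builder.appName("s3-instead-of-hdfs").getOrCreate()

# Hypothetical bucket and prefix -- substitute your own.
events = spark.read.parquet("s3://my-data-lake/raw/events/")

daily = events.groupBy("event_date", "event_type").count()

# Results are written back to S3 as well; nothing important lives on HDFS,
# so the cluster can be terminated (or scaled in) as soon as the job ends.
daily.write.mode("overwrite").parquet("s3://my-data-lake/curated/daily_counts/")
```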
Big Data synthesis is a podcast about Big Data, by Big Data people, for Big Data people. Episode 6 looks at cluster security, and in particular the identity problems of a cluster (in short: anyone can be root on HDFS if you're not careful). I offer you two solutions: the classic one, expensive and tedious, and the BARBACANE one (cue the Hans Zimmer-style music)... Don't hesitate to contact me at benoit.petitpas@saphir-data.fr. Big Data synthesis is a Saphir-data podcast.
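To make the "anyone can be root on HDFS" point concrete: under Hadoop's default "simple" authentication, the client merely declares a username and the NameNode believes it. A hedged sketch, assuming a reachable cluster with the hdfs CLI installed and Kerberos not enabled:

```python
import os
import subprocess

# With hadoop.security.authentication=simple (the default), the declared
# identity is taken at face value. The HADOOP_USER_NAME environment
# variable lets any client claim to be the HDFS superuser:
env = dict(os.environ, HADOOP_USER_NAME="hdfs")  # impersonate the superuser

# Listing (or deleting!) anything as 'hdfs' -- no password ever asked.
subprocess.run(["hdfs", "dfs", "-ls", "/"], env=env, check=True)

# The classic, "expensive and tedious" fix alluded to in the episode is
# Kerberos: with hadoop.security.authentication=kerberos, identities come
# from Kerberos tickets and this declared username is no longer trusted.
```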
Starburst gives analysts the freedom to work with diverse data sets wherever their location, without compromising on performance. Built on Presto, the Starburst Distribution of Presto provides fast, interactive query performance across a wide variety of data sources including HDFS, Amazon S3, MySQL, SQL Server, PostgreSQL, Cassandra, MongoDB, Kafka, and Teradata, among others. Founded by the largest team of Presto committers outside of Facebook, Starburst is the only company providing enterprise support for the Presto project. Originally invented and open sourced by Facebook under the Apache License, Presto is the fastest-growing SQL engine, driven by a community of users --- from small companies to the Fortune 500. According to Justin Borgman, CEO of Starburst Data, the largest contributor to the Presto open source project outside of Facebook, there are three key items to keep in mind when creating a data architecture that can stand the test of time: Separate compute and storage Use open data formats Future-proof your architecture with abstraction wherever you can By adhering to these three items, enterprises preserve their optionality, enabling them to better control costs and gain the most from their architecture investments. But each item plays a specific role in creating optionality. I invited Justin Borgman onto my daily tech podcast to learn more about data architecture freedom and much more. Justin has spent the better part of a decade in senior executive roles building new businesses in the data warehousing and analytics space. Prior to co-founding Starburst, Justin was Vice President and General Manager at Teradata (NYSE: TDC), where he was responsible for the company's portfolio of Hadoop products. Prior to joining Teradata, Justin was co-founder and CEO of Hadapt, the pioneering "SQL-on-Hadoop" company that transformed Hadoop from a file system into an analytic database accessible to anyone with a BI tool. Hadapt was acquired by Teradata in 2014.
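The "separate compute and storage" idea is easiest to see in a query. Below is a minimal sketch using the presto-python-client package; the coordinator host, catalogs, schemas, and table names are all invented for illustration:

```python
import prestodb  # the presto-python-client package

# One Presto connection; catalogs map to different backends (Hive/S3, MySQL...).
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
# A single SQL statement joining data that lives in S3 (via the Hive catalog)
# with data that lives in MySQL -- compute runs in Presto, storage stays put.
cur.execute("""
    SELECT c.region, count(*) AS views
    FROM hive.web.page_views v
    JOIN mysql.crm.customers c ON v.customer_id = c.id
    GROUP BY c.region
""")
for region, views in cur.fetchall():
    print(region, views)
```

Because Presto owns none of the storage, either backend can be swapped out (or the Presto cluster resized) independently, which is exactly the optionality Borgman describes.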
In this episode of the Data Show, I spoke with Dhruba Borthakur (co-founder and CTO) and Shruti Bhat (SVP of Product) of Rockset, a startup focused on building solutions for interactive data science and live applications. Borthakur was the founding engineer of HDFS and creator of RocksDB, while Bhat is an experienced product and marketing […]
Amazon EMR is one of the largest Spark and Hadoop service providers in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long- and short-lived clusters, using notebooks, and other architectural best practices. We discuss lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. We showcase key improvements made to the service in 2017. We cover improvements to the Amazon EMR API, best practices for utilizing Spot Instances both on their own and with Auto Scaling, improvements to Amazon S3 performance on Amazon EMR, and security (authorization and authentication). We couple each of these with a demo or customer use case to illustrate the benefits. If you are an existing Amazon EMR user, you will walk away with a thorough understanding of the improvements made in 2018 and how they benefit you. If you are a new Amazon EMR user, you will get an understanding of common use cases and how other customers are using Amazon EMR.
In this last Big Data news episode for the month of November, we look forward to the H2O World event next week in London and we have articles on BI Maturity and the upcoming Apache Ozone project that will supplant HDFS in future Hadoop clusters soon(TM). BI Maturity: You can’t get there from here! http://makingdatameaningful.com/bi-maturity/ Introducing Apache Hadoop Ozone: An Object Store for Apache Hadoop https://hortonworks.com/blog/introducing-apache-hadoop-ozone-object-store-apache-hadoop/ Katacoda example down on this page https://hadoop.apache.org/ozone Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
Is ETL dead in Data Science and Big Data? In today's podcast I share my views on your questions regarding ETL (extract, transform, load): Data lakes & data warehouses: where is the difference? Is ETL still practiced, or have preprocessing & cleansing replaced it? What would replace ETL in data engineering? How do you become a data engineer? (check out my Facebook note) How do you get hands-on experience training at home? Real-time analytics: with an RDBMS or HDFS?
Since both Dave and Jhon were not able to attend the Dataworks Summit in San Jose a couple of weeks ago, we have a guest, Ward Bekker, who was happy to join and educate us on the subject. DataWorks Summit San Jose 2018 In this episode we discuss the daily keynotes and Ward's selection of sessions at the Summit, ranging from what's new in YARN 3.0 to materialized views in Hive, and much more. Ward Bekker (Linkedin) Pre-Sales Solutions Engineer II @ Hortonworks Some of the sessions and topics discussed are: Apache Hadoop state of the union https://dataworkssummit.com/san-jose-2018/session/apache-hadoop-yarn-state-of-the-union-2/ What is new in Apache Hive https://dataworkssummit.com/san-jose-2018/session/what-is-new-in-apache-hive/ Running distributed TensorFlow in production https://dataworkssummit.com/san-jose-2018/session/running-distributed-tensorflow-in-production-challenges-and-solutions-on-yarn-3-0-2/ Just the sketch: advanced streaming analytics in Apache Metron https://dataworkssummit.com/san-jose-2018/session/just-the-sketch-advanced-streaming-analytics-in-apache-metron/ Containers and Big Data https://dataworkssummit.com/san-jose-2018/session/containers-and-big-data/ Catch a hacker in realtime: live visuals of bots and bad guys https://dataworkssummit.com/san-jose-2018/session/catch-a-hacker-in-realtime-live-visuals-of-bots-and-bad-guys/ HDFS tiered storage https://dataworkssummit.com/san-jose-2018/session/hdfs-tiered-storage/ Geospatial data platform at Uber https://dataworkssummit.com/san-jose-2018/session/geospatial-data-platform-at-uber/ What's the Hadoop-la about Kubernetes? https://dataworkssummit.com/san-jose-2018/session/whats-the-hadoop-la-about-kubernetes/ Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
Yep! Hello there! How are you doing, folks? This week I want to give you a short introduction to HDFS (Hadoop Distributed File System), or in other words, Hadoop's distributed file system. What is HDFS and what are its characteristics? As I said, HDFS is Hadoop's distributed file system. As simple as that: a […]
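For a feel of what "Hadoop's distributed file system" looks like from client code, here is a minimal sketch using pyarrow's HDFS bindings; the NameNode host and port are assumptions, and it requires libhdfs plus a local Hadoop client configuration:

```python
from pyarrow import fs

# Connect to the NameNode (host/port are placeholders).
hdfs = fs.HadoopFileSystem("namenode.example.com", 8020)

# To the client, HDFS looks like an ordinary hierarchical file system...
hdfs.create_dir("/user/demo")
with hdfs.open_output_stream("/user/demo/hello.txt") as f:
    f.write(b"hello, distributed file system\n")

# ...but under the hood each file is split into large blocks (128 MB by
# default) that are replicated across several DataNodes for fault tolerance.
info = hdfs.get_file_info("/user/demo/hello.txt")
print(info.path, info.size)
```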
On this week's podcast, Danny Yuan, Uber's Real-time Streaming/Forecasting Lead, lays out a thorough recipe book for building a real-time streaming platform with a major focus on forecasting. Danny discusses everything from the scale Uber operates at, to the major steps for training and deploying models in an iterative (almost Darwinian) fashion, and wraps up with his advice for software engineers who want to begin applying machine learning in their day-to-day jobs. Why listen to this podcast: * Uber processes 850,000 - 1.3 million messages per second in their streaming platform, with about 12 TB of growth per day. The system's queries scan 100 million to 4 billion documents per second. * Uber's frontend is mobile. The frontend talks to an API layer. All services generate events that are shuffled into Kafka. The real-time forecasting pipeline taps into Kafka to process events and stores the data in Elasticsearch. * There is a federated query layer in front of Elasticsearch to provide OLAP query capabilities. * Apache Flink's advanced windowing features, programming model, and checkpointing convinced Uber to move away from the simplicity of Apache Samza. * The forecasting system allows Uber to remove the notion of delay by using recent signals plus historical data to project what is happening now and what will happen in the future. * Uber's pipeline for deploying ML models: HDFS, feature engineering, organizing the features into data structures (similar to data frames), mostly offline model training, and storing the trained models in a container-based model manager. * A model serving layer picks which model to use, forecasting results are stored in an OLAP data store, a validation layer compares real results against forecast results to verify the model is working as desired, and a rollback feature enables poorly performing models to be automatically replaced by the previous one. * "Without output, you don't have input." If you want to start leveraging machine learning, developers just need to start doing. Start with intuition and practice. Over time, ask questions and learn what you need, then apply a laser focus to gain that knowledge. You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq Subscribe: www.youtube.com/infoq Like InfoQ on Facebook: bit.ly/2jmlyG8 Follow on Twitter: twitter.com/InfoQ Follow on LinkedIn: www.linkedin.com/company/infoq Check the landing page on InfoQ: https://bit.ly/2GJQbUo
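The validation-and-rollback step described above lends itself to a short sketch. Everything here is hypothetical (Uber's actual model manager API is not public); it only illustrates the idea of comparing live outcomes to forecasts and automatically redeploying the previous model when quality degrades:

```python
# A minimal sketch of validate-then-maybe-rollback; `registry` stands in for
# a hypothetical container-based model manager like the one described above.

def mape(actuals, forecasts):
    """Mean absolute percentage error between observed and predicted values."""
    return sum(abs(a - f) / a for a, f in zip(actuals, forecasts)) / len(actuals)

def validate_and_maybe_rollback(registry, actuals, forecasts, threshold=0.15):
    """If the live model's error crosses the threshold, redeploy its predecessor."""
    error = mape(actuals, forecasts)
    if error > threshold:
        previous = registry.previous_version()  # hypothetical model manager API
        registry.deploy(previous)               # automatic rollback
        return f"rolled back to {previous} (MAPE={error:.2%})"
    return f"model healthy (MAPE={error:.2%})"
```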
Frank and Andy talked about doing a Deep Dive show where they take a deep look into a particular data science technology, term, or methodology. And now, they deliver! In this very first Deep Dive, Frank and Andy discuss the differences between Data Science and Data Engineering, where they overlap, where they differ, and why so many C-level execs can’t seem to figure out the deltas. Links Sponsor: Audible.com (http://thedatadrivenbook.com) – Get a free audio book when you sign up for a free trial! Sponsor: Enterprise Data & Analytics (https://entdna.com) Notable Quotes Frank’s new courses are up at WintellectNow ([01:30]) David Goggins (http://davidgoggins.com/) ([03:00]) Dive! Dive! Dive! It’s a deep dive on Data Science vs. Data Engineering ([06:00]) “Clean data” means different things to different people. ([09:30]) “Shaping the data.” ([11:00]) Our conversation with Buck Woody (http://datadriven.tv/buck_woody/) ([12:30]) Andy’s screed on managing NULLs ([14:00]) Andy’s screed on managing dupes ([17:00]) Frank, on aggregation and schema changes… ([21:21]) Attempted NoSQL definition ([23:45]) On MySQL… ([25:00]) Maybe “No” stands for “Not only” ([26:45]) “What sorcery is this?!” ([28:30]) Kevin Hazzard’s article on Database Design if we started today (https://blog.sqlauthority.com/2015/08/20/sql-server-rewriting-database-history-notes-from-the-field-094/) ([29:15]) Andy’s opinion: We’re not using the SSD-ness of SSD’s ([31:30]) “I don’t know how much simpler you can get.” – Andy ([33:00]) Denny Cherry’s company: Denny Cherry and Associates (https://www.dcac.co/) ([34:45]) “… somewhere between useless and lying…” ([35:45]) Frank on HDFS (https://hortonworks.com/apache/hdfs/) ([38:00]) ClearDB wiped out 13 years of Frank’s blog data, and we’re still bothered by that. ([40:30]) sklearn (http://scikit-learn.org/stable/) ([42:50]) Correlation is not causation. ([45:30]) How to Lie with Statistics (https://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728/) ([45:45]) Movie/TV Reference: Star Trek TNG (http://www.imdb.com/title/tt0092455/) ([46:15]) CNTK (Microsoft Cognitive Toolkit) (https://www.microsoft.com/en-us/cognitive-toolkit/) ([48:00]) Frank, on selling ice cream… ([49:25]) On over-fitting (https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/) ([55:30]) Training the model ([56:30]) Request for feedback! ([57:30])
Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.
Dave attended the Dataworks Summit in Sydney, and we go over the different sessions he attended there. In this first of two episodes, the focus lies on the new goodness that Hadoop 3.0 will bring us soon. Hadoop 3.0 – Sanjay Radia https://www.slideshare.net/Hadoop_Summit/apache-hadoop-30-community-update-79999467 JDK 8+ Port number changes Class-path isolation HDFS – 3-node NameNode, intra-DataNode balancer for balanced storage within a node, erasure coding A 10TB node recovering in a few hours on a large cluster (3000 nodes) Erasure coding 2012, 2013, 2014 Erasure coding methods, blocks or stripes Surprisingly little performance difference for EC; what's not shown is the network bandwidth cost, which is significantly higher Yarn 3.0 Scheduler, priorities within a queue Q – inter-queue priorities Long-running services, dynamic container configuration; CPU and IO are easy, memory is hard to do Service discovery in YARN via ZooKeeper, DNS Elastic resource model, gracefully decommissioning node managers Resource isolation for disk and network YARN UI YARN federation SparkR best practices – Casey Stella https://www.slideshare.net/Hadoop_Summit/sparkr-best-practices-for-r-data-scientist Benefits and pros/cons of R - legions of academics have built R packages over the years Where Spark + R came from Data science workflow, data wrangling with Spark and SparkR Kerberos troubleshooting - Vipin Rathor https://www.slideshare.net/Hadoop_Summit/troubleshooting-kerberos-in-hadoop-taming-the-beast Why Kerberos: it solves authentication Where Kerberos is used End-user and service auth mechanism Hadoop delegation tokens Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
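The erasure coding trade-off mentioned above is easy to quantify. A quick back-of-the-envelope comparison of storage overhead for 3x replication versus common Reed-Solomon schemes; the figures follow directly from the definitions:

```python
# Storage overhead: why HDFS 3.0 erasure coding matters. Replication keeps
# whole extra copies; Reed-Solomon keeps parity fragments instead.

def replication_overhead(copies):
    return copies - 1               # e.g. 3 copies = 2 extra = 200% overhead

def reed_solomon_overhead(data_units, parity_units):
    return parity_units / data_units

print(f"3x replication : {replication_overhead(3):.0%} overhead, tolerates 2 lost copies")
print(f"RS(6,3)        : {reed_solomon_overhead(6, 3):.0%} overhead, tolerates 3 lost units")
print(f"RS(10,4)       : {reed_solomon_overhead(10, 4):.0%} overhead, tolerates 4 lost units")
# Same or better durability at a fraction of the disk cost -- the trade-off,
# as noted above, is extra network traffic (and CPU) on writes and recovery.
```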
Breaking up our series of insights from Alan Gates, we switch gears to another really interesting topic (and guest!) where we talk about the new visualisation features coming in Apache Zeppelin and we get it straight from the brains behind the new code, Bernhard Walter. Recent events 03:03 Jhon: Churn Prediction with Apache Spark Machine Learning by Carol McDonald (@caroljmcdonald) @mapr https://mapr.com/blog/churn-prediction-sparkml/ 12:12 Dave: HDFS Maintenance State by Manoj Govindassamy @cloudera https://blog.cloudera.com/blog/2017/05/hdfs-maintenance-state/ https://issues.apache.org/jira/browse/HDFS-7877 https://issues.apache.org/jira/browse/HDFS-6729 https://issues.apache.org/jira/browse/HDFS-7541 30:50 Modern Day Airships Bernhard Walter talks about the new visualisation options in Zeppelin with some of the what, why and how. 01:09:00 End Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
In this episode of the SuperDataScience Podcast, I chat with global analytics consultant Scott King. You will predominantly hear talk about Hadoop, learn about and discuss HDFS, MapReduce, Hive, Kudu, Spark, YARN & Seahorse, find out what a Data Lake is and how it is useful, hear advice on public speaking, and get some tips and tricks on how to prepare Data Science presentations for your audiences. If you enjoyed this episode, check out show notes, resources, and more at https://www.superdatascience.com/51
How does McGraw-Hill Education use the AWS platform to scale and reliably receive 10,000 learning events per second? How do we provide near-real-time reporting and event-driven analytics for hundreds of thousands of concurrent learners in a reliable, secure, and auditable manner that is cost effective? MHE designed and implemented a robust solution that integrates AWS API Gateway, AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Elasticsearch Service, Amazon DynamoDB, HDFS, Amazon EMR, Amazon EC2, and other technologies to deliver this cloud-native platform across the US and soon the world. This session describes the challenges we faced, architecture considerations, how we gained confidence for a successful production roll-out, and the behind-the-scenes lessons we learned.
Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features. This session will feature Asurion, a provider of device protection and support services for over 280 million smartphones and other consumer electronics devices. Asurion will share how they architected their petabyte-scale data platform using Apache Hive, Apache Spark, and Presto on Amazon EMR.
You told us, and we listened. By popular demand, Dr Pete and Russ Nash serve up a mammoth update covering the latest technical updates and announcements from AWS, covering: • The 38th AWS availability zone opening in East Ohio, US • P2 and Grace Hopper + GPU specs • Deep Learning AMI • Aurora + Lambda • EMR, HDFS and Snowball • AWS Server Migration Service • DevOps with new CloudFormation extensions • PowerShell everywhere + AWS PowerShell Core on Linux/Mac • And a huge focus on IPv6: DNS, S3, WAF, including a tips and tricks segment for using IPv6
Dr. Barbara Fiese, Director of the Family Resiliency Center and Professor in HDFS, discussed the factors that contribute to childhood obesity and answered your questions on June 2 on Twitter using #askACES. Podcast is courtesy of the College of ACES
In this episode we'll cover the basics of Apache Spark, including typical deployment situations, architecture and usage. 00:00 Recent events Season's Greetings! Jhon shamelessly plugs his mini cluster build Apache Mesos Amazon IoT solution 05:28 Main Topic Who would use Apache Spark, why would you use it, where would you use it Apache Spark Architecture Apache Spark Components Apache Spark MLlib Apache Spark gotchas Typical use cases for Apache Spark 28:20 Questions from our Listeners: What happens if all my data does not fit in memory? What is the security like for Spark? Why Spark on Hadoop instead of standalone? Python, Scala, Java or something else for Spark? Can I access data on HDFS or local disk from my Spark script? 37:50 End Please use the Contact Form on this blog or our twitter feed to send us your questions, or to suggest future episode topics you would like us to cover.
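On the listener question about HDFS versus local disk: yes, both work, and the URI scheme selects the file system. A minimal PySpark sketch, with all paths invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-basics").getOrCreate()

# The scheme in the path picks the underlying file system:
from_hdfs = spark.read.text("hdfs://namenode:8020/data/logs/app.log")
from_local = spark.read.text("file:///tmp/app.log")  # must exist on every executor

print(from_hdfs.count(), from_local.count())
```

The caveat in the comment is the usual gotcha: a file:// path is resolved on each worker node, so local files must be present on every executor (fine for a single-machine setup, surprising on a cluster).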
In this episode, we chat with Sam Bessalah about this "new" profession of data scientist. We also explore the Apache Hadoop universe and the Apache Mesos universe. These places are full of projects with strange names; this interview helps you find your way around this mythology a little. Recorded December 16, 2014 Episode download LesCastCodeurs-Episode–115.mp3 Interview Your life, your work @samklr His talks, again here and there Data scientist: what on earth is that?! Is it new? We've always had data in our information systems, haven't we?! The sexiest job of the 21st century? Drew Conway's Data Science Venn diagram Processing the data, the platforms MapR, Hadoop, … What is it? Is it new? Where does it come from? How does it work? What is it for? Does it integrate with everything? And what about our legacy data sources (my good old mainframe and its EBCDIC)? Where did my EAI, ETL and other B2C/B2B integration tools go? EAI ETL EBCDIC BI (Business Intelligence) Hadoop MapReduce Doug Cutting Apache Lucene - full-text search engine Apache Hadoop - platform for distributed, scalable processing HDFS - distributed file system Apache Hive - data warehouse on top of Hadoop offering SQL-like queries Teradata Impala - analytic ("real time") SQL query database, etc. Apache Tez - directed acyclic graph of tasks Apache Shark, replaced by Spark SQL Apache Spark - Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing Apache Storm - scalable, distributed processing of data streams Data Flow Machine Learning - learning from the data GraphLab And what about the infrastructure in all this? From our good old servers filling the machine rooms to the cloud (IaaS, PaaS), via virtualization and containers (LXC, Docker, …). Resources galore are nice, but how do you manage them? YARN Apache Mesos Apache Mesos Getting started with Mesos Tutorials Data Center OS by Mesosphere Sam's talk on Mesos at Devoxx Mesos and docker containers Cluster Management and Containerization by Benjamin Hindman Continuous integration with Mesos at eBay Docker Docker Starting a Spark cluster with Docker Spark shell in Docker Docker and Kubernetes in Apache Hadoop YARN Hadoop cluster on Docker Docker, Kubernetes and Mesos cgroups LXC Docker vs LXC Marathon Chronos Chronos source code Aurora Kubernetes Kubernetes workshop Oscar Boykin Scalding A talk on Scala + Big Data and another one Apache Ambari How do I get started? How do you become a data scientist? (training, reference books, information sources, …) Mesosphere Andrew Ng's course on Machine Learning Introduction to Data Science on Coursera Kaggle MLlib Mahout R Scikit-learn (Python) Machine Learning for Hackers (book) Scala Typesafe Activator IPython Notebooks Other IPython Notebook references Temporary online notebooks - starts a docker container on Rackspace for free (for you) Some notebooks Parallel Machine Learning with scikit-learn and IPython Viewing notebooks online without downloading them Spark / Scala notebooks for web-based Spark development http://zeppelin-project.org/ Spark and Scala with an IPython notebook Contact us Contact us via twitter http://twitter.com/lescastcodeurs on the Google group http://groups.google.com/group/lescastcodeurs or on the website http://lescastcodeurs.com/ Flattr us (donations) at http://lescastcodeurs.com/ Want to know more about sponsoring? sponsors@lescastcodeurs.com
A simplified explanation of how Hadoop addresses the bottlenecks found in traditional databases and how these are overcome in HDFS. Watch the Video on YouTube Talk with a Specialist: intricity.com/intricity101 www.intricity.com youtube.com/intricity101
Hortonworks HDP1, Apache Hadoop 2.0, NextGen MapReduce (YARN), HDFS Federation and the future of Hadoop with Arun C. Murthy
Faculty of Biology - Digital University Publications of the LMU - Part 01/06
In human diploid fibroblasts (HDFs) the cell nucleus is oval in shape, quite large in xy-diameter (10-20µm), but flat in the z-direction (5µm). In these nuclei, chromosome territories typically lie side by side or slightly above one another. The question of whether these arrangements are ordered or variable has yielded conflicting answers. We hybridised an improved 7-fluorochrome M-FISH probe set on 3D-preserved cell nuclei fixed with buffered 4% paraformaldehyde. A Leica wide-field microscope with an 8-filter wheel and an automated z-step motor was used for imaging the 7 fluorochromes plus DAPI. Multicolor images of nuclei were taken as serial sections in the z-direction. After deconvolution, a specifically developed program, goldFISH (Saracoglu K. et al. 2001), was used to classify the images according to the labelling scheme. The classification algorithm corresponds to the procedure previously used for metaphase spreads, now adapted to 3D studies of chromosome territory arrangements in the cell nucleus. The analysis of 30 G0-fibroblast nuclei and 30 prometaphase rosettes revealed a pronounced variability of chromosome territory neighbourhoods, as described by Allison D. C. et al. (1999), but in contrast to Nagele R. et al. (1995). However, we noted a distinct radial order: small chromosomes were located close to the centre, while large chromosomes were positioned towards the nuclear rim. This non-random radial positioning could also be observed in prometaphase rosettes.