High-speed, low-latency computer networking bus used in supercomputing
 
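Welcome to Caijing You Shendu (财经有深度), produced by Xueqiu (雪球), a leading wealth-management platform in China that combines investment discussion and trading, where smart investors gather. Today's piece is titled "Nvidia's Moat" and comes from Gudongyu (古董鱼). I spent a whole evening reading about Nvidia's moat, brainwashing myself, and my conclusion is this: as long as Nvidia stands, I am not retreating, and I will stay in AI. If Nvidia is ever disrupted, don't ask me whether it is still worth holding, because by then I will already have sold.
Everyone assumes Nvidia's strength is its hardware, but its hidden moat is its computing platform and programming model, CUDA, plus networking. Start with first-mover advantage and maturity: CUDA launched in 2007 and, after nearly 20 years of development, has become the industry standard for GPU computing. It has accumulated more than 4 million developers, forming a huge community and strong network effects. In terms of full-stack optimization and toolchain, CUDA provides a complete set of tools, from compilers and debuggers to highly optimized core libraries. Nvidia has tuned these libraries deeply to exploit its hardware, so developers get top performance without writing low-level code. In terms of developer habits and migration cost, CUDA is widely taught in university courses and training programs, and engineers encounter it from their beginner days. Enterprises have accumulated large amounts of CUDA code and expertise; switching platforms means rewriting code, retraining staff, and accepting the risk of uncertain performance, so the switching cost is extremely high. One key advantage of CUDA is that new software updates keep improving the hardware over time. A recent benchmark comparison of AI training on the H100 and the new Blackwell GB200 NVL72 shows why these software improvements matter so much. Most recently, according to data from CoreWeave, which benchmarked the NVIDIA GB300 NVL72, a 4x GPU configuration ran AI workloads per unit of time 6 times faster than a 16x H100 configuration. That was not the ratio at launch; it was reached through continued optimization of Nvidia's software. CUDA translators have existed all along. However, people who have used them report that they convert about 80% of the CUDA code, and the remaining 20% has to be done by hand by kernel engineers, which is not cheap. It is also interesting that, although other companies are forming alliances to build alternatives to parts of Nvidia's full stack, no alliance that competes with Nvidia has emerged so far.
Next is Nvidia's networking moat. Networking is usually discussed in two parts, scale-up and scale-out; we will set aside the recently popular "scale across". Scale-up means the GPUs in a rack can connect to one another to form a single GPU node that is as powerful as possible. Scale-out networking then connects these GPU nodes to other GPU nodes to form one large GPU cluster. For scale-up, Nvidia uses its proprietary NVLink and NVSwitches; for scale-out, it uses InfiniBand, obtained through the Mellanox acquisition, with Ethernet as a secondary option. Nvidia's rivals have jointly formed the UALink consortium, whose members include almost every other company you can think of: AMD, Amazon, Google, Intel, Meta, Microsoft, Cisco, Apple, Astera Labs, and others. UALink matters most to AMD, because networking is one of its biggest weaknesses relative to Nvidia. Networking is important not only for training AI workloads but also for inference. As inference with reasoning models becomes more complex, good scale-up and scale-out are key. To address this challenge, AMD wants to support every available alternative, which is why it has flexible I/O lanes that let it support different standards. Although UALink is still young, it has already suffered a major setback. Broadcom was initially one of the key participants but later withdrew. This is a serious blow because AMD now has to rely on Astera Labs and Marvell to produce switches for the UALink consortium, and UALink switches will not be ready until 2027. That is why AMD's MI400x GPUs have UALink SerDes but do not yet form a complete scale-up network. Nvidia is not merely watching this development: one month after UALink 1.0 was released, it announced NVLink Fusion, which on paper opens up the NVLink ecosystem. This is a big step for Nvidia. A former senior Nvidia employee explained how hard this step was to push through internally: while he worked there, Meta wanted to use NVLink for its MTIA, and Nvidia's answer was a firm "no". The NVLink technology blocks move data on and off the chip in Nvidia's own proprietary way, and part of that technology remains exclusive to Nvidia. With this technology, customers are limited to Nvidia's chip-to-chip interconnect. Customers are now aware of this. As the former Nvidia employee noted, they worry that even with their own custom ASICs they would be tied further into Nvidia's ecosystem, so UALink remains an alternative to this day. Between Nvidia and UALink, one key player is Astera Labs, since Broadcom has gone its own way with its own technology roadmap. The UALink consortium now depends on Astera Labs for switches. Nvidia knows that Astera Labs is a core part of the UALink consortium and may try to get Astera Labs to order more NVLink Fusion; once Astera Labs adopts NVLink Fusion, its capacity to serve UALink would be limited. Whether this ultimately helps Nvidia remains to be seen. On the scale-out side, the alternative to Nvidia's InfiniBand is Ethernet with remote direct memory access (RDMA) support. Nvidia supports this alternative too, but only as a "secondary option". Nvidia even has its own Spectrum-X Ethernet platform, because the acquisition gave it the technology and capacity of the Spectrum switch family. Many big tech companies back Ethernet for practical reasons: it costs less, it is already widely deployed in data centers, and there are multiple vendors to choose from. RDMA-capable Ethernet has gained considerable adoption, because big tech companies such as Meta are willing to use it to reduce their dependence on Nvidia.
We have now covered the two core layers, software and scale-up and scale-out networking, but a new key layer is just emerging: HBM, high-bandwidth memory. As one of the core components of AI accelerators, HBM will become more important as AI models grow larger and more complex. SK Hynix and Micron are currently the main suppliers of HBM3, and Samsung is expected to complete qualification and join the HBM3 supply chain. The transition to HBM4 brings a key change: the HBM4 base die must be manufactured on an advanced logic process. This means SK Hynix and Micron cannot do it on their own and must outsource manufacturing to TSMC; the memory makers also need to work with logic chip design companies or IP licensors to complete the design. This change creates room for "custom HBM", but it also means that part of the HBM4 profit must be shared with TSMC, given how heavily manufacturing depends on it. HBM4 is also far more complex than HBM3, because it combines the memory makers' die-stacking technology with the foundry's advanced process. This situation actually favors Nvidia, which had already planned to design its own 3 nm HBM4 die.
In fact, I am not worried about ASICs taking too much market share. Most cloud providers develop their own chips mainly because of Nvidia's market dominance and the shortage of GPU supply. It is a move made out of necessity: they went down the in-house path to get usable compute sooner. The core goal of Nvidia's newly announced Rubin CPX is to improve context inference for AI. In my view, the real leader in inference is not the ASIC or other dedicated inference chips but still Nvidia's products. Another key issue cannot be ignored: the power available to data centers is limited, especially in North America, where power is a hard constraint. Why was xAI able to build the world's largest compute center in 122 days? On one hand, Musk has a world-class engineering team and execution; more importantly, the power supply xAI was able to secure is also among the best in the world. When you operate an existing data center or plan a new one, you agree a fixed power allocation with the utility, and that allocation has a clear cap: you cannot simply call the utility and ask for "an extra 10% of power". So when comparing Nvidia's current and next-generation servers, for example H100 against GB300, the core metric should be how much power is saved when processing the same number of tokens. With each product update, Nvidia is in effect advancing this power-efficiency work.
So my point is that Nvidia holds a lot of cards, and Jensen Huang is frighteningly capable. Even the ASICs and other GPU competitors appearing now are mostly following and imitating. That is good news for every hardware company in the supply chain, because total demand has grown and opportunities are appearing everywhere.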
In today's edition of Radar Empresarial, we look at the recent investigation into alleged antitrust violations that Nvidia faces in China. Chinese authorities have opened a preliminary investigation into the US semiconductor company, pointing to possible breaches of the country's competition rules. At the center of the investigation is Nvidia's 2019 purchase of Mellanox Technologies, an Israeli company that makes connectivity solutions for data centers. It was Nvidia's largest deal to date, valued at 6.9 billion dollars. Nvidia's interest in Mellanox stemmed from InfiniBand, a high-speed interconnect that is fundamental to artificial intelligence applications. By absorbing the company, Nvidia not only strengthened its position in data centers but also expanded its presence in the server market, an area where it still had weaknesses. Although Mellanox is an Israeli firm, the acquisition needed approval from Chinese regulators, because a significant share of its operations and sales were in China. So far, Chinese authorities have not specified exactly why they have reopened the case. However, one of the conditions imposed at the time was that all products sold in China be offered on fair, reasonable, and non-discriminatory terms. Although Nvidia says it has complied with local laws, uncertainty is growing over the possible implications of this regulatory review. If Nvidia were found to have broken the law, the company could face fines of between 1% and 10% of its revenue for the preceding year. In addition, regulators could impose restrictions, such as requiring Nvidia to sell its products in China without Mellanox technology.
This adds to existing tensions over its H20 chips, designed specifically for the Chinese market, which have also been the subject of controversy despite the partial lifting of restrictions by the United States.
Hive Digital Technologies subsidiary BUZZ HPC's President and COO Craig Tavares joined Steve Darling from Proactive to share news of a groundbreaking preferred partnership with Bell Canada, the nation's largest telecommunications provider. Signed on August 3, the agreement will see BUZZ HPC deliver one of Canada's largest sovereign AI ecosystems through Bell AI Fabric. BUZZ HPC will supply Bell's government and enterprise customers with access to NVIDIA's most advanced GPU clusters—including Ampere, Hopper, and Blackwell—scalable over NVIDIA Quantum-2 InfiniBand networking. By combining BUZZ HPC's large-scale accelerated computing infrastructure—purpose-built for AI, machine learning, and scientific computing—with Bell AI Fabric's national fibre network, advanced data centres, and partner ecosystem, the partnership creates a powerful foundation for AI development in Canada. The integrated platform will enable a wide range of use cases, from building foundational AI models to fine-tuning existing ones, all hosted in secure Canadian-owned facilities that meet stringent data residency and cybersecurity requirements. The collaboration ensures true nationwide reach by expanding GPU infrastructure across multiple provinces. The first phase—a 5 MW deployment in Manitoba—will launch later this year, followed by additional deployments in other Bell AI Fabric data centres. These investments are designed to advance Canadian innovation, support national objectives, and provide domestic enterprises with world-class AI computing power. Tavares emphasized that the initiative will give Canadian innovators unparalleled access to high-performance NVIDIA accelerated computing resources, ensuring a competitive advantage in the rapidly evolving global AI landscape. 
#proactiveinvestors #hivedigitaltechnologieslet #tsxv #hive #nasdaq #hive #CryptoMining #GreenEnergy #BuzzHPC #AIInfrastructure #NvidiaH200 #QuebecDataCenter #SustainableTech #GPUCluster #TorontoTech #AITraining #HiveDigital #LiquidCooling #Supercomputing #GreenEnergyAI #Exahash #HighPerformanceComputing
“AI is a distributed system, and the network is the computer,” says Ram Velaga, senior vice president of Broadcom's Core Switching Group. In this episode of Tech Disruptors, Velaga joins Bloomberg Intelligence's Kunjan Sobhani to explain how Ethernet is expanding from scale-out AI networking into the scale-up domain, challenging proprietary solutions like NVLink and InfiniBand. The discussion covers Broadcom's Tomahawk 6 and Ultra product lines, its open SUE spec, and why simplicity, bandwidth and vendor neutrality may shape the next generation of AI infrastructure.
Hive Digital Technologies subsidiary BUZZ HPC's President and COO Craig Tavares joined Steve Darling from Proactive to unveil a major expansion in the company's high-performance computing (HPC) infrastructure. BUZZ has launched a new NVIDIA Hopper GPU cluster in Quebec, the latest addition to its growing supercomputing network that now spans three clusters across Canada and Sweden. The newly deployed Quebec cluster is equipped with NVIDIA Hopper GPUs and powered by NVIDIA's Quantum-2 InfiniBand networking platform, delivering exceptional processing speed and data throughput. Tavares confirmed that the system is already operating near full utilization—a strong signal of market demand and operational readiness. “This cluster is a big step forward in our mission to provide accessible and scalable AI compute infrastructure across Canada,” Tavares noted. Since 2023, BUZZ HPC has focused on building a pan-Canadian HPC footprint, establishing state-of-the-art data centers in Quebec and New Brunswick. These facilities span multiple time zones and languages, enabling localized support and broader accessibility for Canadian businesses and institutions. In a move to lower entry barriers, BUZZ is offering free and subsidized HPC credits to Canadian startups, research institutions, and companies, empowering more organizations to harness advanced compute power for AI development and innovation. In parallel, Hive Digital Technologies has taken a significant step to scale its infrastructure by announcing a purchase and sale agreement to acquire a Toronto-based facility with 7.2 megawatts of installed capacity. The acquisition not only adds strategic geographic coverage but also supports Hive's broader vision to accelerate the development of a sovereign Canadian AI ecosystem through BUZZ HPC. These developments come as Hive Digital continues to make headlines in the digital infrastructure space. 
The company recently surpassed 11.4 Exahash per second (EH/s) in global Bitcoin mining hashrate, marking the early and successful completion of Phase 1 of its Yguazú project in Paraguay, a 100-megawatt operation that plays a pivotal role in Hive's scale-up strategy. The company has been increasing its hashrate at a rapid pace, approximately 1 EH/s per week, and is targeting 25 EH/s by Thanksgiving 2025. Together, these milestones reinforce Hive's position as both a global leader in Bitcoin mining and a fast-emerging force in high-performance AI computing infrastructure. With its dual focus on blockchain and HPC, the company is uniquely positioned to meet the demands of an increasingly digital and AI-driven economy.
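As a rough check on the hashrate timeline quoted above, the sketch below assumes growth stays linear at the stated pace of about 1 EH/s per week. The function name `weeks_to_target` is purely illustrative and not something Hive publishes; real deployment schedules are unlikely to be this smooth.

```python
def weeks_to_target(current_ehs: float, target_ehs: float, growth_per_week: float) -> float:
    """Weeks needed to reach target_ehs from current_ehs at a constant weekly gain.

    Assumption: linear growth (a fixed number of EH/s added each week).
    """
    if growth_per_week <= 0:
        raise ValueError("growth_per_week must be positive")
    # Never return a negative duration if the target is already met.
    return max(0.0, (target_ehs - current_ehs) / growth_per_week)


# Figures taken from the episode description: 11.4 EH/s now, 25 EH/s target,
# roughly 1 EH/s added per week.
weeks = weeks_to_target(11.4, 25.0, 1.0)
print(f"about {weeks:.1f} weeks at the quoted pace")
```

Under that linear assumption the gap of 13.6 EH/s takes about 13.6 weeks to close, so whether the Thanksgiving target is comfortable depends on when the 11.4 EH/s figure was measured.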
When automation fails, it fails spectacularly, and at scale. The recent Google Cloud outage that took down over 54 global services for more than seven hours demonstrates this perfectly. A simple error (blank fields in automated policy updates) cascaded into widespread failures affecting millions of users worldwide. This episode dives deep into what went wrong, how it happened, and what it means for cloud resilience in the AI era.
We also explore Cisco's dramatic pivot at Cisco Live 2025, where they've committed to refreshing their entire hardware stack and integrating AI throughout their ecosystem. Their new LLM called Deep Network suggests a future where networking infrastructure makes intelligent decisions autonomously. We discuss whether Cisco can deliver on these promises and what the unification of their Meraki and Catalyst lines might mean for customers.
The Ultra Ethernet Consortium has finally released their 1.0 specification, establishing a comprehensive standard for high-performance computing environments. This 600+ page document marks a significant milestone in creating viable alternatives to InfiniBand for AI workloads. Meanwhile, Network-as-a-Service pioneer Meter secured $170 million in Series C funding, raising questions about the actual size and sustainability of the NaaS market.
On the cybersecurity front, we examine two concerning developments: the mass exodus of leadership from CISA during heightened threat conditions, and a novel zero-click vulnerability in Microsoft 365 Copilot that can expose sensitive data without any user interaction. This "Echo Leak" vulnerability demonstrates how AI systems that automatically scan content create entirely new attack vectors that organizations must defend against.
Join us for a fast-paced discussion about these pivotal developments in cloud computing, networking technology, and cybersecurity. What does all this mean for your infrastructure strategy? Listen and find out.
Purchase Chris and Tim's new book on AWS Cloud Networking: https://www.amazon.com/Certified-Advanced-Networking-Certification-certification/dp/1835080839/
Check out the Fortnightly Cloud Networking News: https://docs.google.com/document/d/1fkBWCGwXDUX9OfZ9_MvSVup8tJJzJeqrauaE6VPT2b0/
Visit our website and subscribe: https://www.cables2clouds.com/
Follow us on BlueSky: https://bsky.app/profile/cables2clouds.com
Follow us on YouTube: https://www.youtube.com/@cables2clouds/
Follow us on TikTok: https://www.tiktok.com/@cables2clouds
Merch Store: https://store.cables2clouds.com/
Join the Discord Study group: https://artofneteng.com/iaatj
Much of what we take for granted in the IT industry was seeded from HPC and the national labs. This episode of Utilizing Tech features Gary Grider, HPC Division Leader at Los Alamos National Labs, discussing leading-edge technology with Scott Shadley of Solidigm and Stephen Foskett. The Efficient Mission Centric Computing Consortium (EMC3) is working to bring technologies like sparse memory access and computational storage to life. These technologies are designed for today's massive scale data sets, but Moore's Law suggests that this scale might be coming soon to AI applications and beyond. The goal of the national labs is to work 5-10 years ahead of the market to lay the foundations for what will be needed in the future. Specific products like InfiniBand, Lustre, pNFS, and more were driven forward by these labs as well. Some promising future directions include 3D chip scaling, analog and biological computing, and quantum chips.
Guest: Gary Grider, HPC Division Leader at Los Alamos National Labs
Hosts: Stephen Foskett, President of the Tech Field Day Business Unit and Organizer of the Tech Field Day Event Series
Jeniece Wnorowski, Head of Influencer Marketing at Solidigm
Scott Shadley, Leadership Narrative Director and Evangelist at Solidigm
Follow Tech Field Day on LinkedIn, on X/Twitter, on Bluesky, and on Mastodon. Visit the Tech Field Day website for more information on upcoming events. For more episodes of Utilizing Tech, head to the dedicated website and follow the show on X/Twitter, on Bluesky, and on Mastodon.
In this exciting episode of Solana Weekly, host Thomas Bahamas welcomes back Joshua from Solayer to discuss groundbreaking advancements in blockchain scalability through their latest innovation, InfiniSVM. Episode Highlights:
In this episode of Telemetry Now, we explore why AI workloads require a fundamentally different approach to data center networking. From Ethernet vs. InfiniBand to mitigating issues with packet loss, latency, and link flapping, along with emerging standards from the Ultra Ethernet Consortium, we unpack the technologies reshaping how AI infrastructure is built.
We finally chased down the smart and savvy Rob Quast, Principal Technologist at Pure, to discuss the latest trends in the tech world around AI, Networking and Cloud storage. Our conversation kicks off with Rob's journey, from his time at ePlus and working in sys admin roles, to a memorable experience at Pure's NYC Accelerate event where he saw the stock exchange firsthand. We touch on the various aspects of a PT's role, including Rob's experience showing demos and helping customers understand complex solutions. Our conversation shifts to some of Rob's most passionate work, including involvement in a large AI solution for CoreWeave. He walks through the intricacies of a massive deal and how it required close collaboration with the product management team, post-validation work, and strategic networking. Rob also shares insights from a hyperscaler win for Pure Storage, explaining what "hyperscaler" means and how the term is used in the industry. We dive into the technical details of large AI deals, such as GPUDirect, and discuss how Pure can better connect with networking professionals. A key theme here is the intersection of data storage and networking, with Rob revealing why NVMe TCP is a game-changer for the future of cloud infrastructure, particularly when compared to technologies like InfiniBand. We close with a conversation looking toward the future, exploring the realities of cloud storage and hybrid cloud solutions, what customers are trying to achieve, and the challenges they face. The episode wraps up with a focus on object storage, where Rob discusses how Pure Storage's Fusion platform helps manage global namespaces and facilitates multi-tenant cloud environments.
Hive Digital Technologies Executive Chairman Frank Holmes joined Steve Darling from Proactive to share significant updates, announcing the acquisition and deployment of NVIDIA H100 GPUs and the latest NVIDIA H200 GPUs. The company has invested $30 million into these cutting-edge supercomputing clusters, reinforcing its position as a leader in high-performance computing (HPC) and artificial intelligence (AI). The NVIDIA H200 cluster, consisting of 508 GPUs across 64 nodes connected via InfiniBand, has been shipped and is set to arrive in Quebec in early January 2025. Deployment and customer access are planned for Q1 2025, leveraging Quebec's 100% renewable energy infrastructure to align with Hive's sustainability mission. Holmes emphasized that the new supercomputing clusters will unlock significant revenue opportunities, enabling Hive to meet the growing global demand for AI computing power. The company projects an annualized run-rate revenue of over $20 million by Q2 2025 from its HPC services, focusing on high-margin applications such as AI model training and cloud computing services. This strategic expansion highlights Hive's pivot toward diversifying its operations to capitalize on the booming AI market while continuing its Bitcoin mining activities. Holmes noted, "Our investments underscore our commitment to sustainability, innovation, and positioning Hive as a key player in the gig economy and digital transformation." #proactiveinvestors #hivedigitaltechnologieslet #tsxv #hive #nasdaq #hive #NvidiaGPUs #AIChips #TechInvestment #HighPerformanceComputing #AIInfrastructure #HealthcareAI #AIForGood #FutureOfTech #DigitalTransformation #AI #ArtificialIntelligence #DigitalEconomy #GigEconomy #DataCenters #TechInnovation #HiveDigitalTechnologies #GreenEnergy #SustainableTech #AIRevolution #QuebecDataCenters #investment #investing #investor #stockmarket #stocks #stock #stockmarketnews
Today, Dell Technologies announced significant updates to its PowerScale unstructured data storage platform, aimed at AI-ready performance. As AI becomes the backbone of data-driven enterprises, PowerScale's advancements aim to eliminate barriers, optimise costs, and enable businesses to extract actionable insights from their data at speed. The updated PowerScale introduces faster 200Gb Ethernet and InfiniBand connectivity, delivering a 220% boost in streaming write performance and a 99% increase in streaming read speeds compared to last year's performance, along with higher storage density through new 61TB SSDs and 24TB HDDs, and enhanced searchability through the new MetadataIQ capabilities. These advancements matter because AI has become essential for businesses in Ireland that aim to thrive in a data-driven economy. PowerScale's updates help eliminate data limitations, optimise AI workflows, and support large-scale AI implementations like Generative AI applications. These improvements empower businesses to turn data into actionable insights and drive innovation. Key benefits of the updates include AI pipeline optimisation with ultra-fast connectivity, ensuring smooth data flow and enhanced GPU performance for quicker data processing and decision-making. The higher storage density reduces data centre expenses, and MetadataIQ improves data accessibility by enabling faster access to valuable information. This allows businesses to handle massive datasets efficiently while reducing costs. The increased density means organisations can achieve up to a 50% reduction in data centre footprints, resulting in significant savings on energy and colocation expenses. This innovation also supports the growing demands of Generative AI and Large Language Model (LLM) deployments on platforms such as NVIDIA's DGX SuperPOD, making high-performance AI implementations more accessible.
Speaking about the new updates, Saif Aly, Product Marketing Manager for Unstructured Data Solutions at Dell Technologies, said: "AI is no longer a distant vision for the future - it's the backbone of businesses seeking to compete in today's data-driven economy. At Dell Technologies, we understand that AI's game-changing potential hinges on not just cutting-edge algorithms but also the infrastructure that supports and powers them." The latest PowerScale enhancements solidify Dell Technologies' commitment to supporting Irish businesses as they scale their AI initiatives. From accelerating performance to simplifying data management, these updates are designed to help organisations focus on outcomes rather than infrastructure.
Ethernet competes with InfiniBand as a network fabric for AI workloads such as model training. One issue is that AI jobs don't tolerate latency, drops, and retransmits. In other words, AI workloads do best with a lossless network. And while Ethernet has kept up with increasing demands to support greater bandwidth and throughput, it was...
What are the requirements for running AI workloads over a data center fabric? Why is InfiniBand so popular for building AI networks? What about Ethernet for AI? Jeff Tantsura joins Tom Ammon and Russ White to discuss networks for AI workloads.
We have talked about InfiniBand before. It has a competitor in RoCE (RDMA over Converged Ethernet), which lets you build compute clusters without building separate, different networks for them. This time, together with engineers from Yandex, we dig into RoCE and other related technologies. Who: Roman Glebov, network engineer in the network operations group; Vitaly Venglovsky, network engineer in the R&D group. Topics: HPC networks and their role in the industry today; InfiniBand, its history and current generation; RoCE, its history and current generation; scheduled fabrics; hyperscaler clusters, their technologies, size, and topology; NVLink at Nvidia and ICI at Google; standards and the Ultra Ethernet Consortium. The post "telecom №138. RoCE" first appeared on linkmeup.
On today's Network Automation Nerds, we get into the infrastructure required to support AI workloads. We discuss key considerations including bandwidth, the substantial power and cooling requirements of AI infrastructure, and GPUs. We also talk about InfiniBand and Ethernet as network fabrics for AI workloads, cabling considerations, and more. This is a sponsored episode. Our...
Meta GenAI Infra Blog Review // Special MLOps Podcast episode by Demetrios.
// Abstract
Demetrios explores Meta's innovative infrastructure for large-scale AI operations, highlighting three blog posts on training large language models, maintaining AI capacity, and building Meta's GenAI infrastructure. The discussion reveals Meta's handling of hundreds of trillions of AI model executions daily, focusing on scalability, cost efficiency, and robust networking. Key elements include the Ops planner work orchestrator, safety protocols, and checkpointing challenges in AI training. Meta's efforts in hardware design, software solutions, and networking optimize GPU performance, with innovations like a custom Linux file system and advanced networking file systems like Hammerspace. The podcast also discusses advancements in PyTorch, network technologies like RoCE and Nvidia's Quantum 2 InfiniBand fabric, and Meta's commitment to open-source AGI.
// MLOps Jobs board
https://mlops.pallet.xyz/jobs
// MLOps Swag/Merch
https://mlops-community.myshopify.com/
// Related Links
Building Meta's GenAI Infrastructure blog: https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/
--------------- ✌️Connect With Us ✌️ -------------
Join our slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Timestamps:
[00:00] Meta handles trillions of AI model executions
[07:01] Meta creating AGI, ethical and sustainable
[08:13] Concerns about energy use in training models
[12:22] Network, hardware, and job optimization for reliability
[17:21] Highlights of Arista and Nvidia hardware architecture
[20:11] Meta's clusters optimized for efficient fabric
[24:40] Varied steps, careful checkpointing in AI training
[28:46] Meta is maintaining huge GPU clusters for AI
[29:47] AI training is faster and more demanding
[35:27] Ops planner orchestrates a million operations and reduces maintenance
[37:15] Ops planner ensures safety and well-tested changes
It's return guest season here at Latent Space! We last talked to Kanjun in October and Jonathan in May (and December post Databricks acquisition): Imbue and Databricks are back for a rare treat: a double-header interview talking about DBRX from Databricks and Imbue 70B, a new internal LLM that “outperforms GPT-4o” zero-shot on a range of reasoning and coding-related benchmarks and datasets, while using 7x less data than Llama 3 70B.
While Imbue, being an agents company rather than a model provider, are not releasing their models today, they are releasing almost everything else:
* Cleaned-up and extended versions of 11 of the most popular NLP reasoning benchmarks
* An entirely new code-focused reasoning benchmark
* A fine-tuned 70B model, built with Meta Llama 3, to identify ambiguity
* A new dataset of 450,000 human judgments about ambiguity
* Infrastructure scripts for bringing a cluster from bare metal to robust, high performance training
* Our cost-aware hyperparameter optimizer, CARBS, which automatically and systematically fine-tunes all hyperparameters to derive optimum performance for models of any size
As well as EXTREMELY detailed posts on the infrastructure needs, hyperparameter search, and clean versions of the sorry state of industry standard benchmarks. This means for the FIRST TIME (perhaps since Meta's OPT-175B in 2022?) you have this level of educational detail into the hardware and ML nitty gritty of training extremely large LLMs, and if you are in fact training LLMs of this scale you now have evals, optimizers, scripts, and human data/benchmarks you can use to move the industry forward together with Imbue.
We are busy running the sold-out AI Engineer World's Fair today, and so are unable to do our usual quality writeup, however, please enjoy our show notes and the excellent conversation!
Thanks also to Kanjun, Ashley, Tom and the rest of team Imbue for setting up this interview behind the scenes.
Timestamps
* [00:00:00] Introduction and catch up with guests
* [00:01:55] Databricks' text to image model release
* [00:03:46] Details about the DBRX model
* [00:05:26] Imbue's infrastructure, evaluation, and hyperparameter optimizer releases
* [00:09:18] Challenges of training foundation models and getting infrastructure to work
* [00:12:03] Details of Imbue's cluster setup
* [00:18:53] Process of bringing machines online and common failures
* [00:22:52] Health checks and monitoring for the cluster
* [00:25:06] Typical timelines and team composition for setting up a cluster
* [00:27:24] Monitoring GPU utilization and performance
* [00:29:39] Open source tools and libraries used
* [00:32:33] Reproducibility and portability of cluster setup
* [00:35:57] Infrastructure changes needed for different model architectures
* [00:40:49] Imbue's focus on text-only models for coding and reasoning
* [00:42:26] CARBS hyperparameter tuner and cost-aware optimization
* [00:51:01] Emergence and CARBS
* [00:53:18] Evaluation datasets and reproducing them with high quality
* [00:58:40] Challenges of evaluating on more realistic tasks
* [01:06:01] Abstract reasoning benchmarks like ARC
* [01:10:13] Long context evaluation and needle-in-a-haystack tasks
* [01:13:50] Function calling and tool use evaluation
* [01:19:19] Imbue's future plans for coding and reasoning applications
* [01:20:14] Databricks' future plans for useful applications and upcoming blog posts
Transcript
SWYX [00:00:00]: Welcome to the Latent Space Podcast, another super special edition. Today, we have sort of like a two-header. Jonathan Frankle from Mosaic Databricks, or Databricks Mosaic, and Josh Albrecht from Imbue. Welcome.
JOSH [00:00:12]: Hey, glad to be here.
SWYX [00:00:14]: Thank you for having us. Hey, so both of you are kind of past guests. Jonathan, you were actually one of the most popular episodes from last year talking about MPT7B. Remember the days when we trained large models and there was 7B?
JONATHAN [00:00:30]: Yeah, back when reproducing LLAMA1-7B was considered a huge accomplishment for the field. Those are the good old days. I miss that.
SWYX [00:00:38]: As the things have accelerated a lot. Actually, let's do a quick catch up and Josh, you can chime on in as well. So Databricks got acquired. I talked to you at New York.
JONATHAN [00:00:45]: Mosaic got acquired, although sometimes it feels like Mosaic acquired Databricks because, you know, we're having a lot of fun being here. But, you know, yeah.
SWYX [00:00:52]: Yeah. I mean, you are chief scientist now of Databricks.
JONATHAN [00:00:55]: Chief AI scientist. Careful with the title. As much as I would love to understand how Spark works, I'm going to have to defer that to much smarter people than me.
SWYX [00:01:03]: Got it. And I don't know about like what you would highlight so far as a post-acquisition, but the most recent news is that you guys released DBRX. Is that the thing that most people should be aware of?
JONATHAN [00:01:13]: Actually, that's no longer the most recent news. Honestly, the most recent news, we announced this, but it was at our Data and AI Summit last week. So it was announced among like 100,000 other things, is that we finally released our text to image model, which has been a year in the making through a collaboration directly with Shutterstock. There was a lot of work put into finding a dataset that we were comfortable with working on and trying to build a model that honestly, I felt like I could trust and that others might be able to trust to put out in the world. So that model was released last week. It's unfortunately just available via API due to the fact that the data is quite sensitive and quite valuable. It's Shutterstock's entire business in a lot of ways, but I'm still really excited that there's now a model that is trained on a dataset where the provenance of every single image is known, and it's a damn good model. So I'm really proud of the team on that.
SWYX [00:01:55]: Yeah, amazing. Josh, do you have any thoughts on image model questions?
JOSH [00:01:59]: That is not my area of expertise, but I was excited to see the release of it last week as well, and very happy that you guys did a nice job on the data side of everything there. So that was cool to see.
SWYX [00:02:09]: I think what's unusual is like, I think Shutterstock's doing multiple deals in multiple labs. So what is the Shutterstock model? Like, I guess, is this the house model for Shutterstock? Is this Databricks' version of the Shutterstock model? Like, what is this?
JONATHAN [00:02:22]: The way that I would think about it is that Shutterstock is doing an amazing business in AI across the board. Their dataset is kind of widely known to be the best stock photos dataset in the world, the most comprehensive, the biggest. When you think about like, what dataset am I going to train a multimodal model on? You call Shutterstock. And I, at least I've heard in the news, like OpenAI, Google, Meta, Apple have all called Shutterstock and made those deals. So a lot of models have had Shutterstock data incorporated into them. But this is the only model I know of so far where it was, you know, exclusively and specifically trained just on the vanilla Shutterstock data. There was nothing else mixed in. We didn't go and scrape the web and find other data or combined datasets or anything like that. And so this is, in some sense, the house blend. But the other piece is that it's just a dataset where the provenance of every image is known in public. Where did the data come from? It is the Shutterstock collection. That's it. You know, nothing less, nothing more. And certainly being at Databricks, if I've learned one thing, I've learned about enterprise customers and what they want out of AI. And one of the things they ask for most is just, what can you tell me about the data the model was trained on? And here, especially for text to image models, where images are just tricky subject matter, there's been a lot of kind of legal conversation about images, especially. It's nice to just have something where I can point to it and say, you know, if you want to know where the images came from, these are what they are and this is how they got there.
SWYX [00:03:36]: I will talk a little bit about Databricks because it's relevant to the rest of today's episode. So Databricks, sorry, I keep misspeaking. It's DBRX.
JONATHAN [00:03:46]: DBRX, actually, there's been a pronunciation update. It is now D-B-Rex. So we have decided to add a dinosaur mascot because what model doesn't like a mascot? So literally, I wish I could pull it up. There is a little plush dinosaur that we had made. It's like the world's cutest dinosaur, but it is the official mascot of D-B-Rex. And there's a little dinosaur logo that, you know, you'll probably see around a little bit more because DBRX is a mouthful, but D-B-Rex, like, you know, it's just kind of...
SWYX [00:04:13]: Rolls off the tongue. I love mascots. Like every company should have a mascot. And I think Hugging Face got it right. You need an emoji mascot because that's the minimal viable image.
JONATHAN [00:04:21]: I probably shouldn't talk at all about, you know, Velociraptor, but, you know, that's a, maybe that's something we can talk about later in the summer. I'll just leave it at that.
SWYX [00:04:28]: Okay. That's a hint to names. I feel like your names leak a lot of alpha.
So just to quickly cover the headline details: DBRX is a Mixture of Experts model that's fairly big, 132 billion total parameters, so 36 billion active on any input, pre-trained on 12 trillion tokens of text and code, and did really well on evals to the point where you had to dye your hair blue. That's my high level conclusion.

JONATHAN [00:04:53]: Never make a bet with your team two weeks out from model launch, even when, you know, HumanEval is looking quite bad. Because if you set some bar, even if it's arbitrary and you think there's no way in hell they're going to hit it, apparently money doesn't motivate people anymore. Humiliating their boss motivates people. So Josh, you should really take a hint from this. You know, you cannot pay someone enough money to make up for you dyeing your hair blue.

JOSH [00:05:15]: I'll keep that in mind for our next model.

SWYX [00:05:17]: It works. So speaking of Imbue's next model, perhaps Josh, you want to actually just say hi to the general sort of Latent Space audience and talk about what we're releasing today. Yeah.

JOSH [00:05:26]: I'm Josh, CTO of Imbue, and we're not releasing the model. We're not releasing the weights, but we are releasing a bunch of different things that should make it easier for other people to make their own models. So I think right now, training foundation models from scratch is like a very difficult, time-consuming, expensive, kind of risky endeavor, especially for smaller companies. And the things that we're releasing hopefully make that at least a little bit easier. So the things that we're releasing fall into kind of three different buckets. One is infrastructure and scripts for dealing with the kind of hardware and hardware failures and understanding how well the lowest level of things is actually working, so that you can actually do your training at all and at a reasonable speed without having to constantly restart, etc. So infrastructure and training scripts.
A second set of things is around the evaluation. So after you've trained it, like how well is this actually working and how do you know how well it's working? We're releasing a whole bunch of different data there, a new benchmark about code, reasoning, understanding, as well as our own private versions of 11 different open source benchmarks. So things like BoolQ or ANLI, where we've gone through and kind of cleaned up the data as much as possible by looking at all the ones that models get wrong or that are flagged for ambiguity, and also our own kind of private reproductions of those where we've done like a kind of clean room black box, like, okay, this is what the data set is supposed to be. Here are some examples. Let's make our own version of this to make sure that there is no data contamination, etc. To make sure that we're actually, you know, not testing on train. And then I think a final thing that we're releasing there is around 450,000 human judgments about ambiguity and question quality, which we used in the process of cleaning these evaluations and which we also hope will be helpful for other people training kind of similar models. And then the third thing is CARBS, our cost-aware hyperparameter optimizer, which was especially helpful for being able to experiment at much smaller scales and then scale those experiments up to the much larger scale kind of on the first try without having to retry it. You don't want to be training, you know, 10, 20 different 70B models. You really want to get these larger models right on the first try. And so the ability to kind of tune things very precisely and learn scaling laws, not just for, you know, the like data and flops, but also for learning rate and all the other hyperparameters and see like how should you scale these things up was extremely valuable to us as we were training the larger models. Yeah, that's a lot of stuff.

SWYX [00:07:49]: Yeah, exactly.
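To make the idea of a scaling law for a hyperparameter concrete, here is a minimal sketch that fits a power law `lr = a * compute**b` to a few small-scale results and extrapolates to a larger budget. This is not CARBS, which is a more involved cost-aware optimizer; the function names and all numbers below are made up for illustration, and the "measurements" are generated from a known law so the fit can be checked.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    b = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y)) / sum(
        (u - mean_x) ** 2 for u in log_x
    )
    # Intercept gives log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b

def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** b

# Synthetic (compute in FLOPs, best learning rate) pairs, NOT real results:
# they are generated from lr = 0.5 * compute**-0.15 purely to exercise the fit.
small_compute = [1e17, 1e18, 1e19]
best_lr = [0.5 * c ** -0.15 for c in small_compute]

a, b = fit_power_law(small_compute, best_lr)
lr_at_scale = predict(a, b, 1e22)  # extrapolated guess for a much larger run
```

The point of the sketch is only the workflow: tune cheaply at small scale, fit the trend, and use the extrapolation as the first guess for the expensive run.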
So there's a bunch of stuff we'll have to go through.

JONATHAN [00:07:52]: Yeah, I just want to throw in how excited I am about this. This is the stuff that nobody ever talks about. That is the difference between success and failure in this stuff. Like, can you get your cluster to run? Can you get software on your cluster? Can you figure out what broke? Because fault tolerance is still not really built into any of the fundamental primitives of training models. And so if something breaks, you have to go figure out what broke, your job stops, you have to restart your job. It is a nightmare just to get to the point where anything can train on the cluster. A basic MPI hello world that has the GPUs talk to each other is hard enough, let alone actually training a model, let alone getting good performance out of the GPUs, let alone actually getting a model that converges to anything interesting. There's so many levels of things you have to accomplish. This is the kind of stuff that matters. I think to a point that Josh made earlier, before we got on here, there are plenty of weights out there. Nobody's released this.

JOSH [00:08:46]: Yeah, that was part of the motivation actually, is that there are lots of other things that are complementary, but I have not seen nearly as much discussion about some of these other things that we think are pretty important.

SWYX [00:08:56]: I mean, in some sense, I'm very excited to have Jonathan on because this is a little bit your bread and butter with Mosaic. And I think you've released some part with Composer. And I think it's just really interesting to see like a different take, basically a full stack take that's kind of open source today.

JONATHAN [00:09:18]: Yeah, it's really kind of, it's been an ordeal to figure this out. And every time something changes, whether it's a new GPU or even a new driver update, you get new creative errors and new things go wrong.
And, you know, we've dealt with the weirdest things from, you know, our InfiniBand cables getting stolen from the data center twice, like in boxes before they arrived at the data center. Like, you know, Porch Pirate basically had stolen our InfiniBand cables back when those were hard to come by. To like, you know, weird recalls of switches to like the strangest stuff has happened. I have my favorite GPU failures I've seen, like ones where the GPU doesn't fail, it has a correctable memory issue and the memory correction causes the GPU to become a straggler and hold up the whole job. Like weird stuff happens and figuring out how to not just identify all of that, but then eventually productize it, is in some sense, the entire story of Mosaic and now Databricks in terms of our ML offering. Really, the thing we offer is we have gone through this suffering and figured out how to even productize that. It has been a pain in the butt.SWYX [00:10:20]: Yeah, it's a lot of work.JOSH [00:10:20]: I think my favorite failure was GPU is just giving wrong math. Like if they give errors, great, because you can see the errors, but if they just give you the wrong math back, not so fun.SWYX [00:10:30]: When did they give you wrong math?JOSH [00:10:32]: Like literally you could just, you know, add two things. For example, the numbers come back. They're not the numbers that they're supposed to be.JONATHAN [00:10:40]: I think it's important to say at this stage, just because like it, I think it goes without saying for Josh and I, but it's worth saying here, this isn't to say that like anything is wrong with us. It's not like NVIDIA did a bad job or, you know, Mellanox did a bad job or the like the server builder, the data center operator, the cloud provider, like the million other parties that are involved in building this. 
We are running these insane chips that are huge and complicated and built on tiny transistors at insane frequencies with insane heat in data centers that for the most part, were not built remotely for this kind of power or heat and have been retrofitted for this. Like failures happen on a good day with normal CPUs. And this is not a good day and not a normal CPU for the most part. It's fun to joke about all the weird things we see. This is not to say anybody's done anything wrong. This is just kind of part and parcel of working on a massive cluster running at multiple megawatts of power at a time.

SWYX [00:11:32]: It's crazy. Yeah.

JONATHAN [00:11:33]: So optical cables, like all sorts, like everything.

SWYX [00:11:37]: I'll take the opportunity to start going to the sort of infra piece. There's just like a description of the infra just to give people a sense of what we talk about when we talk about massive clusters. So I'm just going to read off the blog post here. This post is about one cluster that has 4,088 H100 GPUs spread across 511 computers. They use Unified Fabric Manager nodes, which manage the InfiniBand network. And you talk a little bit about your networking. Is there anything unusual about this setup that you'll call out to people?

JOSH [00:12:03]: Yeah, actually this particular cluster is a little bit non-standard. The normal, like vanilla setup for these large clusters, as vanilla as it can be, is what's normally like a 127 node cluster. So closer to like 1024 GPUs instead of 4,000. Here we have a larger cluster. As you start to get into the larger clusters, the networking becomes a little bit more custom. It's a little bit trickier. It's a little bit more difficult to get these things to all be able to talk to each other at the same speed. And so in this particular case, this is a three tier network architecture instead of two tiers, kind of the normal one. So most of the clusters are a little bit smaller.
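A rough way to see why roughly 1,000 GPUs fits a two-tier network while roughly 4,000 needs three tiers is the textbook capacity of a non-blocking fat tree. The sketch below assumes 64-port switches and one network endpoint per GPU; both are illustrative assumptions, not details confirmed in the conversation, and the helper names are hypothetical.

```python
def fat_tree_capacity(ports_per_switch: int, tiers: int) -> int:
    """Max endpoints of a non-blocking fat tree built from identical switches.

    One tier is a single switch (k endpoints). Every extra tier splits each
    switch's ports half down, half up, multiplying capacity by k/2:
    2 tiers -> k**2 / 2, 3 tiers -> k**3 / 4.
    """
    k = ports_per_switch
    if k < 2 or k % 2 != 0 or tiers < 1:
        raise ValueError("need an even port count >= 2 and at least one tier")
    return 2 * (k // 2) ** tiers

def tiers_needed(endpoints: int, ports_per_switch: int) -> int:
    """Smallest number of tiers whose capacity covers the endpoint count."""
    tiers = 1
    while fat_tree_capacity(ports_per_switch, tiers) < endpoints:
        tiers += 1
    return tiers

GPUS_PER_NODE = 8        # 511 computers * 8 = 4,088 GPUs, as in the post
ASSUMED_SWITCH_PORTS = 64  # assumption for illustration only

big_cluster = 511 * GPUS_PER_NODE      # 4088
vanilla_cluster = 127 * GPUS_PER_NODE  # 1016

tiers_big = tiers_needed(big_cluster, ASSUMED_SWITCH_PORTS)
tiers_vanilla = tiers_needed(vanilla_cluster, ASSUMED_SWITCH_PORTS)
```

Under these assumptions the two-tier ceiling is 2,048 endpoints, so the 127-node layout fits in two tiers and the 511-node cluster spills into a third. Real designs also trade off oversubscription and rail layout, which this ignores.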
As you get to even larger scales, then this becomes even much more complicated, much more expensive. So we chose this particular scale, kind of knowing our own workloads and kind of what we wanted to do. This was kind of the right size for us. But yeah, I think it's not exactly vanilla already. It's already getting into kind of the custom territory.

SWYX [00:12:54]: So is there any part of this that comes with the Voltage Park deal that you guys had? Is that part of the hardware that you got from the deal with them?

JOSH [00:13:04]: Yeah, so we worked really closely with Voltage Park to set up all their clusters and infrastructure and everything and kind of decide even like what to order, how should the networking work? Like we were very involved in kind of the construction and bring up of this. And that's what this post is about, is about that process of like bringing up all these, there's like different clusters in different places of different scales. So in this particular post, we're talking about this one 4096 GPU, but there are other clusters that they have as well. And we were very closely involved with figuring out the exact architecture and kind of the trade-offs that go along with picking, you know, those exact components. You really don't want to like place the wrong order because it takes months to get it and it's very expensive. So yeah, we were happy to help out with that.

JONATHAN [00:13:43]: And then your InfiniBand cables get stolen.

SWYX [00:13:44]: Yeah, yeah, exactly.

JOSH [00:13:47]: We wanted to make sure that we ended up with compute that would work for us and that would also work for their other customers. And so we kind of helped design something so that we would get exactly what we were looking for.
We knew that these kinds of details would be super important and that getting down to the level of the hardware and like having these good scripts and everything was going to be a core part of like actually getting this to work. I'm very glad that we did that. I don't think that most companies kind of take that full stack approach, but for us, it certainly paid off.SWYX [00:14:12]: Yeah, it's basically sort of built to spec. It's interesting that relationship because you usually, for the rest of us who don't operate at your scale, we take whatever we can get from cloud providers, but you are basically co-designing from the single machine up. And you described that a little bit. Do you want to take us through the process that you described here?JOSH [00:14:27]: Yeah, so for the actual, like the blog post and kind of bringing these machines online.SWYX [00:14:32]: Yeah.JOSH [00:14:32]: So yeah, I think the process, as we have it broken down in the blog post, there's kind of a few different layers. First is like getting the individual machines to work at all and then getting the machines to actually be able to talk to each other. So getting the InfiniBand networking to work and then getting to a point where, you know, not just the machines are working and they can talk to each other, but everything is actually working correctly. There's a big gap between like it's working at all to it's working perfectly correctly. And then after you have all this stuff working perfectly correctly, nice and healthy, then now you get into kind of the software data, like training issues. And then after that, you're still not done. Like now, even once you're training at full speed, things are going to fail over time. Things are going to change. There's going to be new, you know, firmware updates. 
Like how do you kind of deal with this change and flux over time without going crazySWYX [00:15:16]: and pulling your hair out,JOSH [00:15:16]: trying to like reproduce things or understand why there were regressions. And so there's a lot of work to kind of automate the infrastructure tooling as well. And kind of the first step, like bringing these things online in the first place, you know, you have hundreds of machines at this point. So you don't necessarily want to be like walking around with like a CD-ROM or a USB drive, like plugging it in with your keyboard, like hitting next, next, next on the OS install. That's not how this works. You do that for one machine. And then you use, we use this thing called Metal as a Service to bring up all the other machines. So it's a kind of server that can kind of install the operating system on these other machines. So most like when you're talking about these machines, like each machine is, you know, on the order of hundreds of thousands of dollars. So they usually come with a kind of out-of-band management interface as well. So they don't, they have their InfiniBand networking. They have their normal 100 gigabit per second Ethernet networking. These are like dual, redundant, et cetera. And then you also have this extra out-of-band management network. So you can log in and you can see like the boot screen or you can see the blue screen of death. You can like get in there and actually see what was wrong, which is pretty fun. And it makes it like possible to automate a lot of this work. So the beginning of that, and the blog post goes into much more detail about like exactly how we set these up and kind of the other errors that we ran into. When you're bringing these online, you'll definitely have failures. Even if they all worked in the factory, they get shipped, some parts come loose, something fails, something goes wrong. So when you're bringing them online, there'll be some that don't quite work for all sorts of reasons. 
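As a back-of-the-envelope check on why some failures during bring-up are close to guaranteed at this scale, the sketch below computes the chance of seeing a rare event at least once across many units. The one-in-a-thousand rate is a hypothetical figure, and the calculation assumes units fail independently, which real hardware batches often do not.

```python
def p_at_least_one(p_per_unit: float, units: int) -> float:
    """Probability that an event with per-unit probability p occurs at least
    once across `units` independent units: 1 - (1 - p)**units."""
    if not 0.0 <= p_per_unit <= 1.0 or units < 0:
        raise ValueError("need 0 <= p <= 1 and units >= 0")
    return 1.0 - (1.0 - p_per_unit) ** units

RARE = 1 / 1000  # hypothetical per-unit chance of some odd fault

across_machines = p_at_least_one(RARE, 511)   # one draw per computer
across_gpus = p_at_least_one(RARE, 511 * 8)   # one draw per GPU

print(f"511 machines: {across_machines:.0%}, 4088 GPUs: {across_gpus:.0%}")
```

With these made-up inputs, a fault that looks negligible on one box becomes roughly a coin flip across the machines and nearly certain across the GPUs.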
As you start to be working with machines at this scale, like if something happens one in a thousand times, you're like pretty likely to see it. And so you can get pretty rare, weird things, especially since we had fairly early builds and fairly early versions of this hardware. Like these are some of the like first machines that were ever produced, some of the first GPUs. So you've got some extra special things there. We definitely worked with Dell, for example, on making fixes in the firmware level to be like, okay, like this thing is wrong. Like we need to update this at the firmware to like actually fix this particular thing. So we worked pretty closely with Dell and Nvidia. Yeah, that's what I'm saying. Like this stuff gets complicated. And the thing is like, you know, taking a step back, the whole reason we're doing this, right, is that we knew that this was going to be complicated. There would be these kinds of failures. And if we're just using, you know, AWS or some other cloud provider, these errors are still gonna be there and you're gonna have no way to know and no way to debug this and no way to diagnose what's going wrong. And so we would much rather be able to like call up Dell and say, hey, this isn't working. And they're like, yep, okay, cool. Let's debug it together. Oh, I see. Yeah, cool. We'll ship a firmware update and actually fix this for you. That was a much better experience than like, great, just magically fails. I guess we restart and hope that that machine goes away. Like that's not a very good place to be. So yeah, that's kind of the first place is getting to a place where like GPU training is working on your single node machines. You can observe stuff. 
We have tons of tooling around like, you know, Prometheus and all sorts of other tools for understanding what's going on in these machines, because you don't want to be like logging into each one and looking at the temperature or something. You really need to have tooling to collect all these metrics, et cetera. Unfortunately, all of the scripts that we have for this entire cluster and for all this infrastructure are a little bit like special purpose for our particular thing. So it's not that every script that we have, it's not that you can just like take this and plug this in. Even if we did open source all the tooling that we have, you'd still have to do like a lot of work to open source it. What we are releasing is as many of the things that we can that are going to be useful for other people. You're still going to have to have some way of kind of managing these things, making your own like logging aggregators, et cetera, et cetera. So that's kind of bringing them up to the like, you know, the single nodes that are working. From there, it goes into, I'm happy to keep going if you want.

SWYX [00:18:53]: Well, I just want to leave the opportunity for Jonathan to comment if there's anything that's different from how he runs things.

JONATHAN [00:18:57]: Oh, I mean, all I'll say is I'll endorse this and say this s**t is hard. Like this is really, really hard. And, you know, special props to, you know, the folks at Imbue because they were building this from the ground up. You know, at Databricks and at Mosaic, we typically work with cloud providers because some of this stuff is just, there's too much to handle. It's complicated. There's a lot to deal with. And this doesn't even get into things like physical security, you know, securing power if you're the data center operator. Like this gets infinitely complicated and you have to abstract somewhere.
Like, you know, and then you get to the folks who are literally building their own custom chips and like, good God.SWYX [00:19:36]: Like, oh my God, that's, you know,JONATHAN [00:19:38]: if you're one of those folks, you're having, you know, pour one out for the infra people at some of the AI chip startups who are having a really, really interesting time right now. But this stuff is really hard. And I don't think we talk about it much because there's so many other things that are hard. But the other hard things, I think everybody's becoming pretty familiar with at this point. This is something that I don't think there's ever really been a comprehensive discussion of, at least not that I've seen.SWYX [00:20:00]: Yeah, so my impression is that you guys, Mosaic, have your own software for sort of spinning up and down machines, just like Imbue had to build. But Imbue probably, it sounds like Imbue, you guys went fuller stack. I don't know how to describe it. Like Mosaic is not working with Dell on like their firmware.JONATHAN [00:20:21]: No, no, we're typically working with like, you know, pick your cloud provider on their Dell firmware or what have you. Like, it's kind of, I think one of the things, I don't know, Josh, you can correct me on this. It's kind of impossible if you're doing training to not go all the way through the entire stack, regardless of what happens. Like somehow I'm still chatting with cloud providers about power contracts, even though the whole point of dealing with the cloud provider is not to have to think about power contracts. Somehow I'm still asking them about which InfiniBand provider they used this time to see if this is part of the bad batch of cables I encountered on that cloud provider or what have you. Or like, we're still talking about a firmware update from pick your provider. You can't not do this. 
It's convenient that they have data center staff who are worrying about what to send back to which provider when, and they have people who can go and wait for the InfiniBand cables so they don't get stolen outside. But, you know, it's kind of, it's impossible not to really go full stack if you're thinking about the infrastructure at all. I don't know, Josh, correct me.

JOSH [00:21:17]: No, I think that's right. That's what we expected from the beginning as well, is that we would inevitably have to get into the details here. And I'm glad that we kind of just planned for it. I think it made it a lot easier from our perspective to have direct control over this. Instead of having to go to the cloud provider that goes to the data center, that goes to the supplier, we could just go direct to NVIDIA or Dell or the data center, whoever was responsible, and be like, hey, this thing needs to change. And they're like, oh, okay. Yeah, that is our responsibility. Great, we can fix that. So it was just a lot easier for us to fix these bugs than if we had to go through an extra layer of email.

SWYX [00:21:48]: Something we discussed in the pre-show was that you had a rule of thumb for your cluster reliability. You say here in the post, by and large, you expect around 3% of your machines to break every week. So you're basically going to churn through all your machines in a year.

JOSH [00:22:04]: As it says in the post. So that would be true if it was a uniform failure like that. But as it says in the post, it's usually these kind of problematic nodes. And to be clear, that is the number that we've heard from other people, is like they're having about 3%. I don't think we're experiencing failure rates that are that high.
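The arithmetic behind the 3% rule of thumb can be sketched as follows. The model assumes every machine fails independently at the same weekly rate, which is exactly the uniform-failure assumption Josh says does not hold in practice (a few problematic nodes account for most failures), so treat it as an upper-level sanity check only.

```python
MACHINES = 511       # computers in the cluster described in the post
WEEKLY_RATE = 0.03   # the ~3% per week figure quoted from other operators
WEEKS_PER_YEAR = 52

def expected_weekly_failures(machines: int, weekly_rate: float) -> float:
    """Expected number of machines breaking in one week."""
    return machines * weekly_rate

def survival_probability(weekly_rate: float, weeks: int) -> float:
    """Chance a given machine gets through `weeks` weeks without failing,
    under the uniform, independent failure assumption."""
    return (1.0 - weekly_rate) ** weeks

per_week = expected_weekly_failures(MACHINES, WEEKLY_RATE)
# Failures per machine-year: above 1.0 means more failures than machines.
failures_per_machine_year = WEEKLY_RATE * WEEKS_PER_YEAR
untouched_after_year = survival_probability(WEEKLY_RATE, WEEKS_PER_YEAR)
```

Under the uniform model that is about fifteen broken machines a week and more failures per year than there are machines, yet about a fifth of machines would still never fail. If failures cluster on bad nodes, far more of the fleet stays healthy, which is why finding root causes pays off.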
I think ours is actually quite a bit lower than that, probably because we've taken the time to like dig into a large, maybe larger number than we should have, of these failures and get to the root cause of it and be like, oh, okay, like that's exactly what's going wrong. How do we fix this? How do we prevent this from happening? How do we make automated checks for this so that if it does happen, it just goes back to whoever owns that particular part of the process and they can fix it immediately.

SWYX [00:22:43]: And that's part of what you're also open sourcing, which is the health checks, right? You got the NIC health checks, GPU health check, disk space health check, Docker, dmesg. I don't know what that is.

JOSH [00:22:52]: That one is just a lot of stuff.

SWYX [00:22:54]: Yeah.

JOSH [00:22:55]: That one is one where we realized that actually like when these machines boot, sometimes they wouldn't actually boot cleanly all the way. Or when they rebooted, they had problems that they didn't have when they were working before, which was kind of frustrating. Like usually if you restart your computer, it gets better. Here you restart. It did not get better. It got worse. That was very frustrating. So this health check looks at every particular line we've ever seen from the boot, like in dmesg, like every single log line that your computer emits, and says like, have we ever seen this before? Is this expected? Is this in the right order? Or is there something out of place? If there's anything out of place, let me say, okay, great. Like now it goes into this, like longer, more triage list of like, all right, great. Like, is this acceptable? Should we flag this? Like, should someone take a look at this?
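The "have we ever seen this line before" part of that boot-log check can be sketched as an allowlist comparison. This is a minimal illustration, not Imbue's released health check: the function names are hypothetical, the log lines are invented, and the ordering check Josh mentions is left out. In practice the input would be the text printed by `dmesg`.

```python
import re

# Strip the leading "[   12.345678] " kernel timestamp, if present.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")
# Replace hex and decimal numbers so lines differing only in IDs compare equal.
_NUMBER = re.compile(r"0x[0-9a-fA-F]+|\d+")

def normalize(line: str) -> str:
    """Reduce a log line to a template: no timestamp, numbers masked."""
    line = _TIMESTAMP.sub("", line.strip())
    return _NUMBER.sub("<N>", line)

def unexpected_lines(boot_log, known_good):
    """Return the boot-log lines whose template was never seen on a good boot."""
    known = {normalize(line) for line in known_good}
    return [line for line in boot_log if normalize(line) not in known]

# Invented example lines, not real kernel or driver output.
known_good_boot = [
    "[    0.000000] example kernel starting",
    "[    1.500000] example_nic 3: link up",
]
todays_boot = [
    "[    0.000000] example kernel starting",
    "[    1.612345] example_nic 7: link up",
    "[    2.000000] example_nic 7: link training failed",
]

flagged = unexpected_lines(todays_boot, known_good_boot)
```

Here only the "link training failed" line is flagged, since the other two match known templates once timestamps and numbers are masked. Anything flagged would go to the triage list for a human to accept or investigate.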
So we're looking down at a very, very granular detail level, what's happening on these computers to make sure that nothing is out of place. And that's critical because without that, if you're running your training, as Jonathan said, and this thing is slow, like what are you supposed to do? Right?SWYX [00:23:49]: Like you really,JOSH [00:23:49]: you really want to be very certain that like all 4,000 of these GPUs are working like they're supposed to.SWYX [00:23:54]: We know that.JOSH [00:23:54]: And so if it's slow, it's because like we messed up the config or something else and not because of this earlier thing that's like really hard to detect in software later.JONATHAN [00:24:01]: Yeah. I think the, I'm just curious to ask,SWYX [00:24:03]: like, you know,JONATHAN [00:24:03]: suppose you were to set up another, let's say another H100 cluster and it were at a different data center. And instead of the vendor being Dell, it was super micro or what have you. How much of this would be repeatable? And how much of this would you have to redo? I, you know, I genuinely don't know.SWYX [00:24:18]: A decent amount.JOSH [00:24:19]: I think it would go a lot faster the second time. I think there's lots of learnings that we had. And also the blog post,SWYX [00:24:24]: you know, yes,JOSH [00:24:24]: we are releasing the health checks, releasing some scripts, but a lot of the valuable stuff is also in the blog post itself, in the details and kind of the, you know, the learnings that we've had and the sort of errors that we run into. We tried to as much as possible surface those to other peopleSWYX [00:24:36]: could learn from thoseJOSH [00:24:36]: and avoid the same mistakes or failures as well. But I think it would go a lot faster.SWYX [00:24:41]: Although, yes,JOSH [00:24:41]: there would certainly be some things that'd be a little bit different. 
I mean, there'd probably be different CPUs or whatever, but I think a lot of that stuff is less, that's less variable. I think most of it would apply the second time around. Although I'm sure next time we're building one, it'll probably be, you know, at a scale that's 10x as big with a different chip or something like this. And then who knows?

JONATHAN [00:25:01]: Yeah, with ConnectX-8, that will have its own fun behavior and all that good stuff. Yeah.

SWYX [00:25:06]: Perhaps there's something that people don't discuss, and you don't even talk about this in the blog, but I always wonder what is the timeline that's like kind of reasonable for this amount of work, at least the initial stages? And also what does the team composition look like for setting up a cluster, right? Like what are the mix of skills that you typically would require to get all this going?

JOSH [00:25:27]: I can't really speak to typical. One thing I am very proud of is how much we accomplished with such a ridiculously small team. Like our infrastructure team is like, you know, fluctuates from week to week, depending on like how many things are on fire and how much we need to build. But it's like between like three and six people, like it's small. It's not like some huge team of like tons and tons of engineers. But those people are very, very good at what they do. And so that has allowed us to get a lot of mileage out of these things. I think it's not that we're building everything, right? It's not that three to six people build this whole thing. I definitely want to like, you know, say thanks very much to Dell and H5 and NVIDIA and the other people that have done a lot of the work, like to bring up this cluster. You know, with 4000 GPUs and three-tier networking architecture, you have 12,000 cables.
So that's 24,000 things that need to be plugged in. That's just a lot of stuff to plug in, right? And you don't want to mess it up. Each one needs to be done correctly. If it's a little bit loose, it doesn't really work. If you break it, you need to replace it. There's a lot of work that goes into this. And that's if you were to do everything right the first time, and if you didn't have to fix anything. But inevitably, you know, you will have to replace something, which means taking all the wires out, pulling the thing out, taking all the GPUs out, going and fixing some cable, putting it all back correctly, putting it back in, and doing this every time. So there were a lot of people at Dell, NVIDIA and H5 that all helped a ton with this stuff. I don't know the exact size of the Dell team. It also fluctuated over time.

SWYX [00:26:55]: Yeah, excellent. And then, you know, so you have all the hardware set up and now you're firing it up for a single node. There's a long description that you guys have about just monitoring the MFU, right? And what each situation might be indicative of. One of the most interesting things to me that I saw from here is, you know, if training immediately starts off at 60 to 80% MFU, something's wrong. But what are some anecdotes or notable scenarios here that you might call out as maybe counterintuitive or super interesting?

JOSH [00:27:36]: There's just so many of them. I mean, one of them, which I think is probably pretty common knowledge by this point. But we did have a sort of, which one was this exactly? I think for the MFU gradually getting worse over time.
I think that one, when we saw that the first time we were like, what the heck is going on? Why does it get just a little bit worse? This is so strange. Is it getting lazy or tired or something? Is it heat? What's going on? And in this particular case, it was memory fragmentation. Because you have hundreds of machines, they're doing garbage collection at slightly different times. And then they get slightly further apart and slightly more and more jittered until eventually they're all happening kind of at random times, and just really messing up each one of your steps. So you just turn off garbage collection and call it a day, basically, to be honest. There's other things you can do if you want to be a little bit more sophisticated about it.

JONATHAN [00:28:25]: But you can also just manually have it all garbage collect on some interval. That's what we've done. We just have a garbage collection callback that just runs. But I've seen the exact same thing.

JOSH [00:28:33]: Yeah, yeah, exactly. So I thought that one was kind of funny. And we did trace that one down and we did find the actual call. Again, this goes to having good tools. We had really good tools where we could look at a bunch of actual traces in C and be like, OK, cool. This is the thing that's taking a lot of time. Or, you know, this is the thing that doesn't quite line up here. Oh, I guess it's garbage collection. OK, cool. Interesting. Yeah, let's just try taking it off. OK, great. That's what it was. Now we can fix it. So for each of them, basically bugs are not hard if you have good tools. But if you don't have good tools, bugs can be very, very hard. So similarly for heat, another thing that we saw was, oh, you know, the CPU is getting throttled.
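The garbage collection fix described here (turn off automatic collection, then collect on a fixed interval so every machine pauses at the same step) can be sketched in a few lines of Python. This is a minimal sketch, not the callback either team actually runs: the class name `ScheduledGC`, the `interval_steps` parameter and the loop standing in for a training loop are all hypothetical.

```python
import gc


class ScheduledGC:
    """Disable automatic garbage collection and collect on a fixed step interval.

    Hypothetical helper: every rank sees the same step counter, so every rank
    pays the collection pause on the same step instead of at random times.
    """

    def __init__(self, interval_steps=100):
        if interval_steps <= 0:
            raise ValueError("interval_steps must be positive")
        self.interval_steps = interval_steps
        self.collections = 0
        gc.disable()  # stop the interpreter from collecting whenever it likes

    def on_step_end(self, step):
        """Run a full collection when `step` is a multiple of the interval."""
        if step % self.interval_steps == 0:
            gc.collect()
            self.collections += 1
            return True
        return False

    def close(self):
        gc.enable()  # restore the default behaviour after training


scheduler = ScheduledGC(interval_steps=100)
for step in range(1, 301):
    # A real training step would run here (omitted in this sketch).
    scheduler.on_step_end(step)
scheduler.close()
```

The interval is a trade-off: collecting too rarely lets memory grow between collections, while collecting too often reintroduces the pause on many steps.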
OK, well, it's easy to see if you're monitoring the CPU throttling or monitoring the heat. If you're not monitoring that, it's really hard to know why suddenly one of them is going slower.

SWYX [00:29:17]: I noticed also in the piece that you mentioned FSDP with ZeRO-3. Actually, we met, I went to ICLR and Guanhua from the DeepSpeed team was there presenting ZeRO++. I was wondering if you want to make any call-outs to, you know, particular open source or open library or open whatever implementation teams that were super helpful in your process.

JOSH [00:29:39]: I think we ended up actually pulling from a whole bunch of different ones to pull things into our own particular pipeline. So we use things from NVIDIA's, you know, Megatron stuff. We use stuff from probably DeepSpeed. I think we pulled in a bunch of different pieces from a bunch of different places. So it was really nice to see all these working open source examples. I really appreciate all the effort that has gone into actually tuning these things, because you can tune them, but it's a lot of work to tune this stuff and do all this stuff from scratch. It's really nice to have a working example. I think those are probably the two biggest ones, DeepSpeed and Megatron alone, but there are probably other ones as well.

SWYX [00:30:13]: Is there a particular thing in the ecosystem that you would call out as, you know, there should be something here that is open source, but it's not really, everyone kind of builds it on their own? I want to say something with the file system because everyone talks about the file system eventually.

JOSH [00:30:28]: The file system actually was, I mean, we did something kind of dumb there.
We have our own sort of local mirror, you know, like a crappy version of S3 that's local, but it's just a pretty simple script, right? I think we run a little web server that just serves files, and then, you know, it can upload them and download them. Okay, great. And part of the reason we did that is that our internet connection in the beginning was not the full-speed one that we would eventually have. And so we were a little bit more bottlenecked in terms of internet bandwidth. And so we had this. I think we looked at a bunch of services out there like MinIO and some other ones, but a lot of these come with a lot of extra overhead and maintenance. And since we already have so much infrastructure to deal with, we kind of didn't want to, you know, bring in a whole other cloud provider, virtualize something, something. We just wanted something simple. So we went with that, which has been quite helpful. Our tools are usually quite simple. It's Bash and Python and SSH and Docker. We like to keep things simple so that it's easier to debug: fewer layers of infrastructure and fewer layers of abstraction make it a lot easier to work with. We don't use Kubernetes, for example, and we just directly launch these things. And it's just been much easier to debug this way. One tool actually that does come to mind that I will call out is Kraken from Uber. That was great. We love that tool. We were a little bit skeptical.

SWYX [00:31:44]: What is it? I'm sorry.
JOSH [00:31:45]: Yeah. So Kraken is a distributed Docker registry, basically, that uses BitTorrent to transfer things between the machines in a sort of nice optimal way. In the very beginning, the naive way is you have this one Docker registry, which was outside of the cluster. So every time we change an image, you know, there's many gigabytes that each of the 500 machines needs to download. So that just takes a really long time. What this thing does is just one of them downloads it, and then they all sort of broadcast all the pieces to each other. And it was just a really nice, fast way of getting these images down. And it was very robust. There's a lot going on under the hood, but I think it's a pretty cool tool, and we haven't really had any bugs with it at all.

SWYX [00:32:26]: Amazing. Yeah. I mean, that's all my questions, I guess, for the infra piece. I don't know if, John, you had something that you were sort of burning to ask or.

JONATHAN [00:32:33]: No, all I can say is just "same" in a lot of places. Like, you know, been there, done that, seen this, plus one. I think the one big difference, you know, perhaps in philosophies is we've tried to basically standardize on as much commodity stuff as possible, just because, you know, I think the reason I asked about trying to do this on multiple different pieces of infrastructure is, I think we're running on like six or seven different clouds right now. And everybody has done something slightly different. And my gosh, the little differences add up, as you know, you've seen. And so, you know, our philosophy has been: whatever the hell we can standardize, please let's standardize it.
Like vanilla off-the-shelf FSDP. And, you know, we wrote our own data loader, but we've tried to make that as much of a standard as we can across our infrastructure and in Databricks, because things just start getting really complicated. Or we use Kubernetes extensively because it at least gives us a uniform set of APIs. That's our hardware abstraction layer to a certain extent for everything else. So it's just, you know, a difference in philosophy there. But otherwise, yeah, this stuff is really, really hard. And I feel like we take for granted how much of this is done for us when you go and you just query ChatGPT, for example. Oh my God, everything going on underneath that. It's kind of a miracle that the machines boot up, let alone that you can query a giant language model that's probably doing inference across multiple machines and was trained across thousands of machines. You know, minor miracle.

SWYX [00:33:54]: Yeah, it is an awesome amount of power that we invoke with a single API call that we take for granted these days. It's absurd. Yeah, I mean, that point about Kubernetes, I will say as a former AWS employee, it seems like it would be ideal for Imbue to at some point make it more abstracted or agnostic, because you're going to want to, you know, replicate your setup.

JOSH [00:34:19]: We do have our own sort of replacement. It's just a much simpler version of Kubernetes. Kubernetes is really designed for running services, not for running experiments. That's not its main architecture. And so for us, we have everything that's like, cool, you're going to run an experiment. So you want it to run to completion, right? OK, great. The primitives are sort of built around a slightly different style.
And that makes it a lot easier, just a lot simpler, to fit the nature of: these machines are going to disappear. They will need to be rebooted for infrastructure upgrades. Something will happen to the GPUs. Failure is baked into this as a core part of our infrastructure. So it's not that we don't have an abstraction. It's that it's a sort of simpler, more tailored abstraction for the particular work that we're doing.

JONATHAN [00:34:58]: Yeah, I think it all depends on what your goals are. And I think the challenge in a lot of the deep learning stuff right now is that people often build things that are more complicated than necessary to get the job done. And complication is the enemy of everything. You know, don't use a fancier parallelism strategy than you have to. Don't use a fancier set of libraries than you have to. Don't do anything that you don't have to do, because it's hard enough as it is. Don't overcomplicate your own life. Don't try to bring in more tools or more fancy architecture tweaks if you absolutely don't have to. Get to the minimum necessary to get the job done. And it's really tempting to want to try to use everything. So, I totally understand that one.

SWYX [00:35:37]: I think the last piece I'll maybe call out is that I'm just going to weave this in just because I see the opportunity to do it. Are there any infrastructure shifts that need to arise because of changing architecture? So I think, for example, you're announcing a dense model, a 70B dense model, whereas John just worked on DBRX and the image-to-text model, which presumably has different bottlenecks.

JONATHAN [00:36:10]: That's correct for us. You know, we train both dense and mixture-of-experts models.
The one we happened to, you know, kind of get permission to open source was a mixture-of-experts model. And those models are very demanding when it comes to network bandwidth, at least if you're training them in kind of FSDP ZeRO-3 style, where there's just a lot of parameters getting shuffled back and forth. And your ratio of compute to the amount of data that you have to shuffle back and forth becomes a lot worse because you're only using a fraction of the parameters for every token instead of all the parameters. And so we had to really push the envelope on getting all the stuff to the right places on time. And so actually the networking part of DBRX was the single hardest thing, I think, of the entire process: just getting MoE training working at scale across a big cluster. We still managed to, I think, do it all with commodity parts, which was very exciting. You know, we were using FSDP and we eventually used HSDP. HSDP is a version of FSDP where you have multiple smaller replicas and you're doing data parallel within those replicas. And that helped a lot with network latency issues that we were running into just because we were transmitting so much data, you know, for every single part of the process. I think it was instructive for how Google designs their hardware and software together, personally. They're training, as far as I understand, using kind of a ZeRO-3 style of training and have been for a while. They also train mixture-of-experts models. TPUs have a very different network bandwidth to compute ratio. They have a lot more bandwidth just objectively. And TPUs per chip tend to be a little bit less compute intensive and have a little bit less memory. You know, it's just a different design choice. So the ratio of flops to bandwidth is very different.
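To make the HSDP idea concrete, here is a minimal sketch of how ranks might be grouped. It assumes the usual description of hybrid sharding: parameters are sharded inside each replica group, and gradients are then averaged across groups. The function name `hsdp_groups` and the rank layout (consecutive ranks share a shard group) are assumptions for illustration, not Databricks' actual configuration.

```python
def hsdp_groups(world_size, shard_group_size):
    """Split ranks into shard groups (one model replica each) and replicate groups.

    Assumed layout: consecutive ranks form a shard group, so heavy parameter
    all-gathers stay inside a group; ranks holding the same shard in different
    groups form a replicate group for the gradient all-reduce.
    """
    if shard_group_size <= 0 or world_size % shard_group_size != 0:
        raise ValueError("world_size must be a positive multiple of shard_group_size")
    num_replicas = world_size // shard_group_size
    shard_groups = [
        list(range(r * shard_group_size, (r + 1) * shard_group_size))
        for r in range(num_replicas)
    ]
    replicate_groups = [
        list(range(i, world_size, shard_group_size))
        for i in range(shard_group_size)
    ]
    return shard_groups, replicate_groups


# 16 GPUs split into two replicas of 8 GPUs each.
shard_groups, replicate_groups = hsdp_groups(world_size=16, shard_group_size=8)
print(shard_groups[0])      # ranks that together hold one full copy of the model
print(replicate_groups[0])  # ranks that hold the same shard in different replicas
```

Keeping each shard group small (for example, within one node or one well-connected block) is what limits how far the parameter traffic has to travel.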
And that means that it's much easier for Google to be able to pull off some of this stuff. They also have interesting, you know, torus-style network architecture, like, literal network architecture; not the model, but the network.

SWYX [00:38:02]: Is this the sort of block attention? I forgot what you call it.

JONATHAN [00:38:07]: Yeah, this is not the ring attention, but these are the ring all-reduces. You have three different dimensions of rings because they kind of put you in these three-dimensional toruses, from what I understand. And so, you know, Google's infrastructure in some sense is kind of, I wouldn't say built for this, but maybe the way that Google trains models is built for a slightly different bit of infrastructure they have. And it's kind of neat to think about that. You know, one thing that I think NVIDIA announced for both the GH200 and the GB200 is this hybrid networking where you'll have blocks of NVLink network chips. I think for the GB200, it's like groups of 72 GPUs will all have NVLink to each other. So higher bandwidth; then you'll have normal networking of some kind, InfiniBand or RoCE or what have you, between these blocks. And that's a change due to the fact that it's hard to build really high bandwidth networks over very large groups, but it is now a blocked networking. And you have to think about how you architect your model and your parallelism differently. You also have to think about fault tolerance differently because it now matters where you lose a GPU, whereas it didn't before.
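The fault-tolerance point can be put into rough numbers. The sketch below assumes GPUs fail independently at a constant rate, so the expected time between failures shrinks in proportion to cluster size. The per-GPU figure of 96,000 hours is made up to give round numbers, and both function names are hypothetical. The second helper shows why location matters in a blocked network: a failed GPU maps to one particular NVLink block.

```python
def hours_between_failures(num_gpus, per_gpu_mtbf_hours):
    """Expected hours between failures anywhere in the cluster.

    Assumes independent failures at a constant per-GPU rate, so the
    cluster-wide failure rate is num_gpus times the per-GPU rate.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return per_gpu_mtbf_hours / num_gpus


def nvlink_block(gpu_index, block_size=72):
    """Index of the NVLink block that a GPU belongs to (blocks of 72 by default)."""
    return gpu_index // block_size


PER_GPU_MTBF_HOURS = 96_000  # illustrative assumption, not a measured value

small = hours_between_failures(4_000, PER_GPU_MTBF_HOURS)
large = hours_between_failures(4_000 * 24, PER_GPU_MTBF_HOURS)
print(small, large)  # 24.0 hours at 4,000 GPUs, 1.0 hour at 24 times the size

# GPUs 71 and 72 are neighbours by index but sit in different NVLink blocks.
print(nvlink_block(71), nvlink_block(72))
```

At one failure a day, restarting from the last checkpoint is tolerable; at one failure an hour, the restart cost starts to dominate, which is where built-in redundancy becomes necessary.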
So, you know, it's just all really interesting and really fun, speaking personally, but it's going to mean new nightmares when we all move to that generation and have to think about, you know, new versions of these problems.

JOSH [00:39:20]: As you go up to larger scales, it gets quite different. Right now, let's say, for example, you experience a GPU failure every day. That's fine. Just restart. If you make your thing 24 times as big, now it's once an hour. Now it stops being quite as easy to just restart, right? So now you have to kind of bake in this sort of redundancy that you didn't have before. So I think as you go up in scale, you end up running into a lot of really interesting problems that also inform the actual design.

SWYX [00:39:52]: Yeah, I mean, as an orchestration guy, this is why I always emphasize very cheap storage or very fast storage, so you can checkpoint more. But that's probably not the best solution for fast training.

JONATHAN [00:40:05]: Which works fine when you're doing language, and then you move to vision or video. And then, you know, you have multi-petabyte datasets, and getting cheap, fast multi-petabyte storage starts to bite. I've certainly encountered issues where the literal data center where my GPUs were did not have enough object store to fit the datasets that people wanted to bring into that data center, from whichever users were trying to bring them in. And then you get to a whole different world of hurt where you have to keep your data in a different region because the region is just out of storage. So things get fun really fast.

SWYX [00:40:39]: Speaking of vision, Josh, actually, you know, Imbue is an agents company, but you're announcing a text-only model.
Where does the vision side come in?

JOSH [00:40:49]: I think we've actually done a lot of work in the past, and people can see our blog posts about self-supervised learning and some other vision-related stuff in the past as well. So we're very familiar with that stuff. But I think our main focus right now is on, as we say, coding and reasoning. And there's certainly a visual component to some problems. But, you know, it's not necessarily required for all problems. And actually we found that for most of the code writing and reasoning problems that we care about, the visual part isn't really a hugely important part of it. Sometimes if you really need to, you can maybe describe the thing. There are other multimodal models that you can use off the shelf to plug in for those particular pieces that you need, right? If something is driving a browser or whatever, you can sometimes get away with not having to have that baked into the original model. So our focus, you know, in a sense, we kind of do a lot across the stack. We're working on our own infrastructure and pre-training and RL and fine-tuning and products and everything. But in another sense, we're very narrowly focused on the application side. So all of the stuff across the stack is going toward a very particular purpose. And that particular purpose right now doesn't really need vision. So we think that people are going to make all sorts of really cool image models, like Jonathan, right? And all sorts of interesting multimodal models into the future. We'll let them go do that. That's great. We'll take advantage of that, partner with those people in the future.
And right now we're really focused on the core reasoning and coding capabilities and aspects of the model.

SWYX [00:42:14]: I wanted to go into CARBS since that's kind of the next layer of the stack. We talked about CARBS in the first episode with Kanjun because you've actually had a blog post about it like a couple of years ago. Maybe let's introduce it.

JONATHAN [00:42:26]: Has that been a couple of years now?

JOSH [00:42:28]: No, it must have been at least one year. Hopefully it's not multiple years.

SWYX [00:42:32]: Sorry, I'm counting AI time.

JONATHAN [00:42:35]: Yeah, yeah. I was going to say, you're making me feel really old right now.

SWYX [00:42:39]: I count everything before the Generally Intelligent rename as, you know, prehistory. And now sort of modernity, right? So I actually thought CARBS was more about hyperparameter optimization in the sense of hyperparameter search. Whereas, you know, when you introduced it, especially in this blog post, it's more about scaling laws and predictability of, are we in the right ballpark before we scale things up? Maybe recount the history of CARBS.

JOSH [00:43:10]: Yeah, so it really is a little bit of both. So CARBS is, it's maybe a backronym, but it's for cost-aware Pareto-region Bayesian search. So this is about technically how it works, but CARBS is like, you know, we like pastries and stuff. So great, why not? But the point is that it's a cost-aware hyperparameter tuner. With most hyperparameter tuners, you kind of say, OK, here's this objective function. I want you to make this number as big as possible or as small as possible, whichever direction you want to go. So yeah, just go make this number, you know, as small as possible.
OK, so it'll try a bunch of different hyperparameters, a bunch of different configurations, to figure out how do I tweak your network and architecture, et cetera, to get the best performance I possibly can. That's usually saying, you know, almost all of these hyperparameter configurations are, let's say they're all going to use the same number of GPUs or the same number of nodes. So it's going to run for the same amount of time. So you can do that. You can get a number out and that's great. But what CARBS does is it says, OK, actually, what if we relax that constraint? What if we say, for each of these different points, we're going to model how expensive it will be to sample this configuration? So what if we train with just one one-hundredth of the data? How well can we do? What if we train with one tenth of the data? What if we train with all the data? That way you can understand, as we get more and more data, as we spend more and more compute, as we make a bigger and bigger network, how does performance change with these things that change how expensive it is to even explore this data point? So by doing that, we can see the scaling laws for not just, you know, the scaling laws from the Chinchilla paper, but the scaling laws for all parameters. We can see how does the number of layers change with this? How does the learning rate change? How do the various types of regularization change? So you can see these nice scaling laws. And as you're going across costs, how should this be changing as you're scaling up your model?
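The cost-aware idea can be illustrated with a small Pareto-front computation over (cost, loss) pairs: a configuration is worth keeping only if no cheaper-or-equal run achieves a lower-or-equal loss. This is a simplified sketch of the trade-off Josh describes, not the CARBS algorithm itself, which also involves a Bayesian search that is not reproduced here. The function name and the sample numbers are made up for illustration.

```python
def pareto_front(runs):
    """Return the (cost, loss) pairs that are not dominated by any other run.

    A run is dominated if another run costs no more and reaches a loss that is
    no higher, with at least one of the two strictly better.
    """
    front = []
    best_loss = float("inf")
    # Sorting by cost (then loss) means each run only has to beat the best
    # loss seen so far among runs that are at least as cheap.
    for cost, loss in sorted(runs):
        if loss < best_loss:
            front.append((cost, loss))
            best_loss = loss
    return front


# Made-up results: cost in arbitrary compute units, loss on a held-out metric.
runs = [(1, 3.2), (1, 3.5), (10, 2.9), (10, 2.6), (100, 2.7), (100, 2.1)]
print(pareto_front(runs))
```

Reading the front from cheap to expensive shows how the best achievable loss moves as the budget grows, which is the curve one would extrapolate before committing to a much larger run.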
So that, coupled with the kind of metric that we chose, which is a very precise way of measuring performance, allowed us to really hone in on parameters that worked really well and understand how do we want to scale those up, especially as we're changing things about the network. One of the things that we did is we used a custom tokenizer. As we change this tokenizer, it changes a bunch of other things about the model. So how should we scale up this entirely new tokenizer? No one has ever made a model this large with this tokenizer before. And so how do we want to change all these things? CARBS kind of shows you, look, as you change these parameters, these other ones are kind of dependent on this. These are the relationships between them. So you can better understand, OK, if I'm going to scale this up 10x or 100x, where do I want to be? I can only go so far. And so, you know, we did run, I think maybe it was like a 14B one or something like that, to check. And so we had a bunch at 1B, or 14B, and then at 70B. I think we just did one at 14B. So we get to check, oh, is this on the curve? Is this where we expect? It was right there. So then great, go on to the next one.

SWYX [00:45:56]: Yeah, I mean, that makes a lot of sense. I wonder if, so one of the key questions, and correct me if I'm wrong, but usually people do search or do their evals just based on loss. But you actually evaluate based on, you know, the sort of end-state evals that people might expect, like HellaSwag and LAMBADA, whatever. What is the norm here? Is there a norm?

JOSH [00:46:20]: Yeah, I don't know if there's a hundred percent.

SWYX [00:46:21]: I don't know.
I only see loss on most people's reports.

JOSH [00:46:25]: I think it's easy to, like, loss is very nice because it's very precise. It will tell you very fine-grained differences between really small changes in your hyperparameters or network architecture. Whereas, especially at the smaller scales, if you're looking at accuracy, it's very noisy. It might be zero or a hundred or, you know, fluctuating by 10 or 20 percentage points, which makes it really hard to tell, did that change actually mean anything? So our loss is sort of a combination of these two. Instead of saying, let's just look at perplexity, we say, let's look at perplexity on the tasks that we care about, for multiple choice questions effectively. So we're saying, yes, this is formulated as a multiple choice question, and we're going to look at the, you know, the loss or perplexity for this particular answer token. And that ends up being something that's both targeted to what you actually care about and also very precise. The nice thing about this though is that it's independent of the data that you train on. One thing that's annoying about perplexity or about loss is that as you change your data set, this is really obnoxious because now it fundamentally changes your loss, right? And so you can't tell, how do I tweak my data set? But because we have this held out evaluation dat
Cutting-edge AI infrastructure needs all the performance it can get, but these environments must also be efficient and reliable. This episode of Utilizing Tech, brought to you by Solidigm, features Davide Villa of Xinnor discussing the value of modern software RAID and NVMe SSDs with Ace Stryker and Stephen Foskett. Xinnor xiRAID leverages the resources of the server, including the AVX instruction set found on modern CPUs, to combine NVMe SSDs, providing high performance and reliability inside the box. Modern servers have multiple internal drive slots, and all of these drives must be managed and protected in the event of failure. This is especially important in AI servers, since an ML training run can take weeks, amplifying the risk of failure. Software RAID can be used in many different implementations, with various file systems, including NFS and high-performance networks like InfiniBand. And it can be tuned to maximize performance for each workload. Xinnor can help customers to tune the software to maximize reliability of SSDs, especially with QLC flash, by adapting the chunk size and minimizing write amplification. Xinnor also produces a storage platform solution called xiSTORE that combines xiRAID with the Lustre FS clustered file system, which is already popular in HPC environments. Although many environments can benefit from a full-featured storage platform, others need a software RAID solution to combine NVMe SSDs for performance and reliability. 
Hosts:
Stephen Foskett, Organizer of Tech Field Day: https://www.linkedin.com/in/sfoskett/
Ace Stryker, Director of Product Marketing, AI Product Marketing at Solidigm: https://www.linkedin.com/in/acestryker/
Davide Villa, Chief Revenue Officer at Xinnor: https://www.linkedin.com/in/davide-villa-b1256a2/
Follow Utilizing Tech
Website: https://www.UtilizingTech.com/
X/Twitter: https://www.twitter.com/UtilizingTech
Tech Field Day
Website: https://www.TechFieldDay.com
LinkedIn: https://www.LinkedIn.com/company/Tech-Field-Day
X/Twitter: https://www.Twitter.com/TechFieldDay
Tags: #UtilizingTech, #Sponsored, #AIDataInfrastructure, #AI, @SFoskett, @TechFieldDay, @UtilizingTech, @Solidigm
Google DeepMind's new AI tool that generates video soundtracks by combining text prompts with visual content. Challenges of building large training AI clusters, including power, network topology, and reliability. How large language models acquire factual knowledge during pretraining and their probabilistic reasoning capabilities. LLARVA's vision-action instruction tuning that enhances robot learning.
Contact: sergi@earkind.com
Timestamps:
00:34 Introduction
01:47 Google DeepMind's new AI tool uses video pixels and text prompts to generate soundtracks
03:31 100,000 H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing
05:22 Large language model data pipelines and Common Crawl (WARC/WAT/WET)
06:47 Fake sponsor
08:20 How Do Large Language Models Acquire Factual Knowledge During Pretraining?
10:01 What Are the Odds? Language Models Are Capable of Probabilistic Reasoning
11:22 LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
13:06 Outro
On this episode of the Six Five On the Road, hosts Dave Nicholson and Lisa Martin are joined by Broadcom's Hasan Siraj, Head of Software Products / Ecosystem, for a conversation on how Ethernet is pivotal in scaling AI infrastructure. Broadcom, a global technology leader, is at the forefront of addressing the complexities and needs of AI networking to facilitate the growth and deployment of AI technologies.
Their discussion covers:
The crucial role of networking in AI deployments
Broadcom's solutions to the key challenges in AI networking
Reasons for choosing Ethernet over InfiniBand for AI cluster deployments
Anticipated evolution of AI infrastructure over the next 3-5 years and upcoming challenges
The demographic of customers deploying AI and their current stage in the AI adoption journey
Chip Stock Investor explores the implications of a recent stock price downgrade of Arista Networks (ANET), emphasizing the importance of a moderate viewpoint in evaluating businesses. The downgrade, which shifted Arista from a 'buy' to a 'sell' due to Nvidia's (NVDA) “rising competition,” spurred a discussion on the essential technologies of Ethernet and InfiniBand, their roles in data centers and AI, and how these technologies impact the rivalry between Nvidia and Arista Networks. Nick and Kasey explain why Ethernet and InfiniBand matter, NVIDIA's strategic acquisition that put them in the lead in InfiniBand, and the evolving landscape of AI and data center networking. Arista Networks is no slouch though. And Ethernet is still very much a part of the modern data center. Check out this episode of Chip Stock Investor for our take on Arista Networks business model, and whether the “new Nvidia risk” is valid or not.
Welcome to this week's edition of the “MI&S Datacenter Podcast.” I'm Patrick Moorhead with Moor Insights & Strategy, and I am joined by co-hosts Matt, Will, and Paul. We analyze the week's top datacenter and datacenter edge news. We talk compute, cloud, security, storage, networking, operations, data management, AI, and more!
* Has Fortinet Established Itself with Secure Networking? https://x.com/WillTownTech/status/1775709723154944204
* Has Broadcom Built the Semiconductor Development Model of the Future? https://www.forbes.com/sites/patrickmoorhead/2024/04/01/broadcom-scales-connectivity-and-performance-for-advanced-ai-workloads/?sh=304f0995631d
* You Can Make AI Dangerous https://www-cdn.anthropic.com/af5633c94ed2beb282f6a53c595eb437e8e7b630/Many_Shot_Jailbreaking__2024_04_02_0936.pdf
* The InfiniBand vs Ethernet Cage Fight for AI Workloads https://x.com/WillTownTech/status/1775150343657333125
* Nutanix and Cisco are Delivering the Cloud Experience of the Future - Today https://www.nutanix.com/solutions/ai
* Logical Qubit Breakthrough https://cloudblogs.microsoft.com/quantum/2024/04/03/how-microsoft-and-quantinuum-achieved-reliable-quantum-computing/ https://www.quantinuum.com/news/quantinuum-and-microsoft-announce-new-era-in-quantum-computing-with-breakthrough-
Disclaimer: This show is for information and entertainment purposes only. While we will discuss publicly traded companies on this show, the contents of this show should not be taken as investment advice.
Take a Network Break! Nvidia announces new 800G switches, one for Ethernet and one for InfiniBand, for building AI fabrics. Nvidia also announces an “AI supercomputer,” a rack-scale pre-built bundle of Nvidia GPUs and CPUs connected via InfiniBand switches. The NaaS startup Meter announces new campus switches and what it calls a “digital twin” capability,...
Dell Technologies is strengthening its collaboration with NVIDIA to help enterprises adopt AI technologies. By expanding the Dell Generative AI Solutions portfolio, including the new Dell AI Factory with NVIDIA, organizations can accelerate the integration of their data, AI tools and on-premises infrastructure to maximize their generative AI (GenAI) investments.
"Our enterprise customers are looking for an easy way to implement AI solutions - that is exactly what Dell Technologies and NVIDIA are delivering," said Michael Dell, founder and CEO of Dell Technologies. "Through our combined efforts, organizations can seamlessly integrate data with their own use cases and streamline the development of customized GenAI models."
"AI factories are central to creating intelligence on an industrial scale," said Jensen Huang, founder and CEO, NVIDIA. "Together, NVIDIA and Dell are helping enterprises create AI factories to turn their proprietary data into powerful insights."
High-quality results from high-quality data
Through close collaboration between Dell and NVIDIA, additions to the end-to-end Dell Generative AI Solutions portfolio help customers modernize with AI, accelerate business transformation and boost productivity:
* Dell AI Factory with NVIDIA is the industry's first end-to-end AI enterprise solution integrating Dell's compute, storage, client device, software and services capabilities with NVIDIA's advanced AI infrastructure and software suite, all underpinned by a high-speed networking fabric. Delivered as a fully integrated solution, Dell AI Factory with NVIDIA takes advantage of rack-level design, with rigorous testing and validation, to deliver a seamless solution for transforming data into valuable insights and outcomes. This solution also leverages existing offerings in enterprise data security with accompanying Dell services offerings in security and privacy. The Dell AI Factory with NVIDIA supports a wide array of AI use cases and applications to support the entire GenAI lifecycle, from model creation and tuning to augmentation and inferencing. Customers can also take advantage of enterprise-grade professional services that help organizations accelerate their strategy, data preparation, implementation and adoption of the AI Factory, advancing AI capabilities. The Dell AI Factory with NVIDIA is available via traditional channels and Dell APEX.
* Dell Technologies will collaborate with NVIDIA to introduce a rack scale, high-density, liquid-cooled architecture based on the NVIDIA Grace Blackwell Superchip. These systems will support the next-generation ecosystem aiming to provide the foundation for improvements in performance density for enterprise AI workloads.
* Dell PowerEdge XE9680 servers will support new NVIDIA GPU models, including the NVIDIA B200 Tensor Core GPU, expected to offer up to 15 times higher AI inference performance and lower total cost of ownership. Dell PowerEdge servers will also support other NVIDIA Blackwell architecture-based GPUs as well as H200 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand and Spectrum-X Ethernet networking platforms.
* Dell Generative AI Solutions with NVIDIA - Retrieval-Augmented Generation (RAG) leverages new microservices in NVIDIA AI Enterprise to offer a pre-validated, full-stack solution to speed enterprise AI adoption with RAG. This solution helps organizations improve GenAI model quality and increase results accuracy with proprietary business data and knowledge bases.
* Dell Generative AI Solutions with NVIDIA - Model Training offers a pre-validated, full-stack solution for organizations seeking to build their own custom, domain-specific models.
* Dell Data Lakehouse, now globally available, is an open, modern data lakehouse that helps organizations discover, process and analyze data in one place across hybrid and multicloud environments.
* Dell PowerScale is the world's first Ethernet storage solution validated with NVIDIA DGX SuperPOD with DGX H100 systems, helping custome...
Speaker CFPs and Sponsor Guides are now available for AIE World's Fair. Join us on June 25-27 for the biggest AI Engineer conference of 2024!
Soumith Chintala needs no introduction in the ML world; his insights are incredibly accessible across Twitter, LinkedIn, podcasts, and conference talks (in this pod we'll assume you'll have caught up on the History of PyTorch pod from last year and cover different topics). He's well known as the creator of PyTorch, but he's more broadly the Engineering Lead on AI Infra, PyTorch, and Generative AI at Meta.
Soumith was one of the earliest supporters of Latent Space (and more recently AI News), and we were overjoyed to catch up with him on his latest SF visit for a braindump of the latest AI topics, reactions to some of our past guests, and why Open Source AI is personally so important to him.
Life in the GPU-Rich Lane
Back in January, Zuck went on Instagram to announce their GPU wealth: by the end of 2024, Meta will have 350k H100s. By adding all their GPU clusters, you'd get to 600k H100-equivalents of compute. At FP16 precision, that's ~1,200,000 PFLOPS (roughly 2 PFLOPS per GPU, in line with NVIDIA's sparse FP16 Tensor Core rating for the H100). If we used George Hotz's (previous guest!) "Person of Compute" measure of about 20 PFLOPS per person, Meta now has 60k humans of compute in their clusters. Occasionally we get glimpses into the GPU-rich life; on a recent ThursdAI chat, swyx prompted PaLM tech lead Yi Tay to write down what he missed most from Google, and he commented that UL2 20B was trained by accidentally leaving the training job running for a month, because hardware failures are so rare in Google.
Meta AI's Epic LLM Run
Before Llama broke the internet, Meta released an open source LLM in May 2022, OPT-175B, which was notable for how “open” it was - right down to the logbook!
The release was pitched as needing only 16 NVIDIA V100 GPUs to train and deploy the model with its codebase (the 175B training run itself used 992 80GB A100s), and Soumith agrees that, with hindsight, it was likely under-trained for its parameter size.
In Feb 2023 (pre Latent Space pod), Llama was released, with a 7B version trained on 1T tokens alongside 65B and 33B versions trained on 1.4T tokens. The Llama authors included Guillaume Lample and Timothée Lacroix, who went on to start Mistral.
July 2023 was Llama2 time (which we covered!): 3 model sizes, 7B, 13B, and 70B, all trained on 2T tokens. Pre-training accounted for a grand total of 3,311,616 GPU hours, a figure from the Llama 2 paper that also counts the unreleased 34B model. CodeLlama followed shortly after, a fine-tune of Llama2 specifically focused on code generation use cases. The family had models in the 7B, 13B, 34B, and 70B size, all trained with 500B extra tokens of code and code-related data, except for 70B which is trained on 1T.
All of this came on top of other open sourced models like Segment Anything (one of our early hits!), Detectron, Detectron 2, DensePose, and Seamless. In one year, Meta transformed from a company people made fun of for its “metaverse” investments to one of the key players in the AI landscape, and its stock has almost tripled since (about $830B in market value created in the past year).
Why Open Source AI
The obvious question is why Meta would spend hundreds of millions on its AI efforts and then release them for free. Zuck has addressed this in public statements.
But for Soumith, the motivation is even more personal:
“I'm irrationally interested in open source. I think open source has that fundamental way to distribute opportunity in a way that is very powerful. Like, I grew up in India… And knowledge was very centralized, but I saw that evolution of knowledge slowly getting decentralized. And that ended up helping me learn quicker and faster for like zero dollars. And I think that was a strong reason why I ended up where I am.
So like that, like the open source side of things, I always push regardless of like what I get paid for, like I think I would do that as a passion project on the side…
…I think at a fundamental level, the most beneficial value of open source is that you make the distribution to be very wide. It's just available with no friction and people can do transformative things in a way that's very accessible. Maybe it's open source, but it has a commercial license and I'm a student in India. I don't care about the license. I just don't even understand the license. But like the fact that I can use it and do something with it is very transformative to me…
…Like, okay, I again always go back to like I'm a student in India with no money. What is my accessibility to any of these closed source models? At some scale I have to pay money. That makes it a non-starter and stuff. And there's also the control issue: I strongly believe if you want human aligned AI, you want all humans to give feedback. And you want all humans to have access to that technology in the first place. And I actually have seen, living in New York, whenever I come to Silicon Valley, I see a different cultural bubble.”
We like the way Soumith put it last year: Closed AI “rate-limits against people's imaginations and needs”!
What It Takes For Open Source AI to Win
However, Soumith doesn't think Open Source will simply win by popular demand. There is a tremendous coordination problem with the decentralized nature of open source AI development right now: nobody is collecting the valuable human feedback in the way that OpenAI or Midjourney are doing.
“Open source in general always has a coordination problem. If there's a vertically integrated provider with more resources, they will just be better coordinated than open source. And so now open source has to figure out how to have coordinated benefits. And the reason you want coordinated benefits is because these models are getting better based on human feedback.
And if you see with open source models, like if you go to the /r/localllama subreddit, like there's so many variations of models that are being produced from, say, Nous research. I mean, like there's like so many variations built by so many people. And one common theme is they're all using these fine-tuning or human preferences datasets that are very limited and they're not sufficiently diverse. And you look at the other side, say front-ends like Oobabooga or like Hugging Chat or Ollama, they don't really have feedback buttons. All the people using all these front-ends, they probably want to give feedback, but there's no way for them to give feedback… So we're just losing all of this feedback. Maybe open source models are being as used as GPT is at this point in like all kinds of, in a very fragmented way, like in aggregate all the open source models together are probably being used as much as GPT is, maybe close to that. But the amount of feedback that is driving back into the open source ecosystem is like negligible, maybe less than 1% of like the usage. 
So I think like some, like the blueprint here I think is you'd want someone to create a sinkhole for the feedback… I think if we do that, if that actually happens, I think that probably has a real chance of the open source models having a runaway effect against OpenAI, I think like there's a clear chance we can take at truly winning open source.”
If you're working on solving open source coordination, please get in touch!
Show Notes
* Soumith Chintala Twitter
* History of PyTorch episode on Gradient Podcast
* The Llama Ecosystem
* Apple's MLX
* Neural ODEs (Ordinary Differential Equations)
* AlphaGo
* LMSys arena
* Dan Pink's "Drive"
* Robotics projects:
  * Dobb-E
  * OK Robot
* Yann LeCun
* Yangqing Jia of Lepton AI
* Ed Catmull
* George Hotz on Latent Space
* Chris Lattner on Latent Space
* Guillaume Lample
* Yannic Kilcher of OpenAssistant
* LMSys
* Alex Atallah of OpenRouter
* Carlo Sferrazza's 3D tactile research
* Alex Wiltschko of Osmo
* Tangent by Alex Wiltschko
* Lerrel Pinto - Robotics
Timestamps
* [00:00:00] Introductions
* [00:00:51] Extrinsic vs Intrinsic Success
* [00:02:40] Importance of Open Source and Its Impact
* [00:03:46] PyTorch vs TinyGrad
* [00:08:33] Why PyTorch is the Switzerland of frameworks
* [00:10:27] Modular's Mojo + PyTorch?
* [00:13:32] PyTorch vs Apple's MLX
* [00:16:27] FAIR / PyTorch Alumni
* [00:18:50] How can AI inference providers differentiate?
* [00:21:41] How to build good benchmarks and learnings from AnyScale's
* [00:25:28] Most interesting unexplored ideas
* [00:28:18] What people get wrong about synthetic data
* [00:35:57] Meta AI's evolution
* [00:38:42] How do you allocate 600,000 GPUs?
* [00:42:05] Even the GPU Rich are GPU Poor
* [00:47:31] Meta's MTIA silicon
* [00:50:09] Why we need open source
* [00:59:00] Open source's coordination problem for feedback gathering
* [01:08:59] Beyond text generation
* [01:15:37] Osmo and the Future of Smell Recognition Technology
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast.
This is Alessio, partner and CTO in residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.
Swyx [00:00:15]: Hey, and today we have in the studio Soumith Chintala, welcome.
Soumith [00:00:17]: Thanks for having me.
Swyx [00:00:18]: On one of your rare visits from New York where you live. You got your start in computer vision at NYU with Yann LeCun. That was a very fortuitous start. I was actually listening to your interview on the Gradient podcast. So if people want to know more about the history of Soumith, history of PyTorch, they can go to that podcast. We won't spend that much time there, but I just was marveling at your luck, or I don't know if it's your luck or your drive to find AI early and then find the right quality mentor because I guess Yann really sort of introduced you to that world.
Soumith [00:00:51]: Yeah, I think you're talking about extrinsic success, right? A lot of people just have drive to do things that they think are fun, and a lot of those things might or might not be extrinsically perceived as good and successful. I think I just happened to like something that is now one of the coolest things in the world or whatever. But the first thing I tried to become was a 3D VFX artist, and I was really interested in doing that, but I turned out to be very bad at it. So I ended up not doing that further. But even if I was good at that, whatever, and I ended up going down that path, I probably would have been equally happy. It's just like maybe the perception of, oh, is this person successful or not might be different. I think like after a baseline, your happiness is probably more correlated with your intrinsic stuff.
Swyx [00:01:44]: Yes. I think Dan Pink has this book, Drive, that I often refer to about the power of intrinsic motivation versus extrinsic and how long extrinsic lasts. It's not very long at all. But anyway, now you are an investor in Runway, so in a way you're working on VFX.
Yes.
Soumith [00:02:01]: I mean, in a very convoluted way.
Swyx [00:02:03]: It reminds me of Ed Catmull. I don't know if you guys know, but he actually tried to become an animator in his early years and failed or didn't get accepted by Disney and then went and created Pixar and then got bought by Disney and created Toy Story. So you joined Facebook in 2014 and eventually became a creator and maintainer of PyTorch. And there's this long story there you can refer to on the Gradient podcast. I think maybe people don't know that you're also involved in more of the hardware and cluster decision side of things. And we can dive into more details there because we're all about hardware this month. Yeah. And then finally, I don't know what else, like what else should people know about you on a personal side or professional side?
Soumith [00:02:40]: I think open source is definitely a big passion of mine and probably forms a little bit of my identity at this point. I'm irrationally interested in open source. I think open source has that fundamental way to distribute opportunity in a way that is very powerful. Like, I grew up in India. I didn't have internet for a while. In college, actually, I didn't have internet except for GPRS or whatever. And knowledge was very centralized, but I saw that evolution of knowledge slowly getting decentralized. And that ended up helping me learn quicker and faster for zero dollars. And I think that was a strong reason why I ended up where I am. So the open source side of things, I always push regardless of what I get paid for, like I think I would do that as a passion project on the side.
Swyx [00:03:35]: Yeah, that's wonderful. Well, we'll talk about the challenges as well that open source has, open models versus closed models. Maybe you want to touch a little bit on PyTorch before we move on to the sort of Meta AI in general.
PyTorch vs Tinygrad tradeoffs
Alessio [00:03:46]: Yeah, we kind of touched on PyTorch in a lot of episodes.
So we had George Hotz from TinyGrad. He called PyTorch a CISC and TinyGrad a RISC. I would love to get your thoughts on PyTorch design direction as far as, I know you talk a lot about kind of having a happy path to start with and then making complexity hidden away but then available to the end user. One of the things that George mentioned is I think you have like 250 primitive operators in PyTorch, I think TinyGrad is four. So how do you think about some of the learnings that maybe he's going to run into that you already had in the past seven, eight years almost of running PyTorch?
Soumith [00:04:24]: Yeah, I think there's different models here, but I think it's two different models that people generally start with. Either they go like, I have a grand vision and I'm going to build a giant system that achieves this grand vision and maybe v1 is super feature complete or whatever. Or other people say they will get incrementally ambitious, right? And they say, oh, we'll start with something simple and then we'll slowly layer on complexity in a way that optimally applies Huffman coding or whatever. Like where the density of users is and what they're using, I would want to keep it in the easy, happy path, and for the more niche advanced use cases, I'll still want people to try them, but they need to take additional frictional steps. George, I think just like we started with PyTorch, George started with the incrementally ambitious thing. I remember TinyGrad used to be, like it would be limited to a thousand lines of code and I think now it's at 5,000. So I think there is no real magic to why PyTorch has the kind of complexity it has. I think it's probably partly necessitated and partly because we built with the technology available under us at that time. PyTorch is like 190,000 lines of code or something at this point. I think if you had to rewrite it, we would probably think about ways to rewrite it in a vastly simplified way for sure.
But a lot of that complexity comes from the fact that, in a very simple, explainable way, you have memory hierarchies. A CPU has three levels of caches and then you have DRAM and SSD and then you have network. Similarly, a GPU has several levels of memory and then you have different levels of network hierarchies, NVLink plus InfiniBand or RoCE or something like that, right? And the way the flops are available on your hardware, they are available in a certain way and your computation is in a certain way and you have to retrofit your computation onto both the memory hierarchy and the flops available. When you're doing this, it is actually a fairly hard mathematical problem to do this setup, to find the optimal thing. And what is optimal depends on the input variables themselves. So like, okay, what is the shape of your input tensors and what is the operation you're trying to do and various things like that. Finding that optimal configuration and writing it down in code is not the same for every input configuration you have. Like for example, just as the shape of the tensors change, let's say you have three input tensors into a sparse dot product or something like that. The shape of each of these input tensors will vastly change how you optimally place this operation onto the hardware in a way that will get you maximal throughput. So a lot of our complexity comes from writing out hundreds of configurations for each single PyTorch operator and templatizing these things and symbolically generating the final CUDA code or CPU code. There's no way to avoid it because mathematically we haven't found symbolic ways to do this that also keep compile time near zero. You can write a very simple framework, but then you also should be willing to eat the long compile time, searching for that optimal performance at runtime. That's the trade-off.
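As a rough illustration of the shape-dependence described here, the sketch below picks a tile shape for a tiled matrix multiply by minimizing a toy cost model. It is not how PyTorch selects kernels: the names `TILE_CHOICES`, `FAST_MEM_ELEMS`, `cost`, and `pick_config`, the fast-memory budget, and the assumed 8x penalty for a memory load are all hypothetical choices made for this example.

```python
from itertools import product
from math import ceil

# Hypothetical tile sizes a matmul kernel template could be instantiated with.
TILE_CHOICES = (16, 32, 64, 128)
# Assumed fast-memory budget, in elements, for one tile each of A, B and C.
FAST_MEM_ELEMS = 16 * 1024


def cost(m, n, k, tm, tn, tk):
    """Toy cost: padded multiply-adds plus weighted slow-memory traffic."""
    tiles_m, tiles_n, tiles_k = ceil(m / tm), ceil(n / tn), ceil(k / tk)
    # Work happens on whole tiles, so ragged shapes pay for padding.
    padded_flops = (tiles_m * tm) * (tiles_n * tn) * (tiles_k * tk)
    # Each tile step loads one tile of A (tm x tk) and one tile of B (tk x tn).
    traffic = tiles_m * tiles_n * tiles_k * (tm * tk + tk * tn)
    return padded_flops + 8 * traffic  # assumption: a load costs 8x a multiply-add


def pick_config(m, n, k):
    """Return the (tm, tn, tk) tile shape with the lowest modelled cost."""
    candidates = [
        (tm, tn, tk)
        for tm, tn, tk in product(TILE_CHOICES, repeat=3)
        # All three tiles must fit in the assumed fast memory at once.
        if tm * tk + tk * tn + tm * tn <= FAST_MEM_ELEMS
    ]
    # Ties on cost are broken by the smaller tuple so the result is deterministic.
    return min(candidates, key=lambda c: (cost(m, n, k, *c), c))


if __name__ == "__main__":
    for shape in [(4096, 4096, 4096), (1, 4096, 4096), (4096, 1, 4096)]:
        print(shape, "->", pick_config(*shape))
```

Even in this toy model, a square problem and a skinny one with the same inner dimension end up with different tile shapes. A real framework faces the same choice across many operators, dtypes, and devices, which is where the hundreds of templated configurations come from.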
I don't think, unless we have great breakthroughs, George's vision is achievable. He should be thinking about a narrower problem, such as I'm only going to make this work for self-driving car ConvNets, or I'm only going to make this work for LLM transformers of the Llama style. Like if you start narrowing the problem down, you can make a vastly simpler framework. But if you don't, if you need the generality to power all of the AI research that is happening and keep zero compile time and all these other factors, I think it's not easy to avoid the complexity.
PyTorch vs Mojo
Alessio [00:08:33]: That's interesting. And we kind of touched on this with Chris Lattner when he was on the podcast. If you think about frameworks, they have the model target. They have the hardware target. They have different things to think about. He mentioned when he was at Google, TensorFlow was trying to be optimized to make TPUs go brr, you know, and go as fast. I think George is trying to make especially the AMD stack be better than ROCm. How come PyTorch has been such a Switzerland versus just making Meta hardware go brr?
Soumith [00:09:00]: First, Meta is not in the business of selling hardware. Meta is not in the business of cloud compute. The way Meta thinks about funding PyTorch is we're funding it because it's net good for Meta to fund PyTorch because PyTorch has become a standard and a big open source project. And generally it gives us a timeline edge. It gives us leverage and all that within our own work. So why is PyTorch more of a Switzerland rather than being opinionated? I think the way we think about it is not in terms of Switzerland or not. The way we articulate it to all the hardware vendors and software vendors who come to us saying they want to build a backend in core for PyTorch and ship it by default is that we only look at the user side of things. Like if users are using a particular piece of hardware, then we want to support it.
We very much don't want to kingmake the hardware side of things. So as the MacBooks have GPUs and as that stuff started getting increasingly interesting, we pushed Apple to put some engineers on the MPS support, and we spent significant time from Meta-funded engineers on that as well, because a lot of people are using the Apple GPUs and there's demand. So we kind of mostly look at it from the demand side. We never look at it from like, oh, which hardware should we start taking opinions on.
Swyx [00:10:27]: Is there a future in which, because Mojo or Modular Mojo is kind of a superset of Python, is there a future in which PyTorch might use Mojo features optionally?
Soumith [00:10:36]: I think it depends on how well integrated it is into the Python ecosystem. So if Mojo is like a pip install and it's readily available and users feel like they can use Mojo so smoothly within their workflows in a way that just is low friction, we would definitely look into that. Like in the same way PyTorch now depends on Triton, OpenAI Triton, and we never had a conversation that was like, huh, that's like a dependency. Should we just build a Triton of our own or should we use Triton? Those conversations don't really come up for us. The conversations are more, well, does Triton have 10,000 dependencies and is it hard to install? We almost don't look at these things from a strategic leverage point of view. We look at these things from a user experience point of view, like is it easy to install? Is it smoothly integrated and does it give enough benefits for us to start depending on it? If so, yeah, we should consider it. That's how we think about it.
Swyx [00:11:37]: You're inclusive by default as long as it meets the minimum bar of, yeah, but like maybe I phrased it wrongly.
Maybe it's more like what problems would you look to solve that you have right now?
Soumith [00:11:48]: I think it depends on what problems Mojo will be useful at.
Swyx [00:11:52]: Mainly a performance pitch, some amount of cross compiling pitch.
Soumith [00:11:56]: Yeah, I think the performance pitch for Mojo was like, we're going to be performant even if you have a lot of custom stuff, you're going to write arbitrary custom things and we will be performant. And that value proposition is not clear to us from the PyTorch side to consider it for PyTorch. So PyTorch, it's actually not 250 operators, it's like a thousand operators. PyTorch exposes about a thousand operators and people kind of write their ideas in the thousand operators of PyTorch. Mojo is like, well, maybe it's okay to completely sidestep those thousand operators of PyTorch and just write it in a more natural form. Just write raw Python, write for loops or whatever, right? So from the consideration of how do we intersect PyTorch with Mojo, I can see one use case where you have custom stuff for some parts of your program, but mostly it's PyTorch. And so we can probably figure out how to make it easier for, say, torch.compile to smoothly also consume Mojo subgraphs and, you know, the interoperability being actually usable, that I think is valuable. But Mojo as a fundamental front end would be replacing PyTorch, not augmenting PyTorch. So in that sense, I don't see a synergy in more deeply integrating Mojo.
PyTorch vs MLX
Swyx [00:13:21]: So call out to Mojo whenever they have written something in Mojo and there's some performance related thing going on. And then since you mentioned Apple, what should people think of PyTorch versus MLX?
Soumith [00:13:32]: I mean, MLX is early and I know the folks well. Awni used to work at FAIR and I used to chat with him all the time. He used to be based out of New York as well. The way I think about MLX is that MLX is specialized for Apple right now.
It has a happy path because it's defined its product in a narrow way. At some point MLX either says we will only be supporting Apple and we will just focus on enabling, you know, there's a framework if you use your MacBook, but once you go server side or whatever, that's not my problem and I don't care. Or MLX enters the server side set of things as well. Like one of these two things will happen, right? If the first thing happens, MLX's overall addressable market will be small, but it will probably do well within that addressable market. If it enters the second phase, they're going to run into all the same complexities that we have to deal with. They will not have any magic wand and they will have more complex work to do. They probably wouldn't be able to move as fast.
Swyx [00:14:44]: Like having to deal with distributed compute?
Soumith [00:14:48]: Distributed, NVIDIA and AMD GPUs, just like having a generalization of the concept of a backend, how they treat compilation and its overheads. Right now they deeply assume the whole MPS graph thing. So they need to think about all these additional things if they end up expanding onto the server side, and they'll probably build something like PyTorch as well, right? Like eventually that's where it will land. And I think there they will kind of fail on the lack of differentiation. Like it wouldn't be obvious to people why they would want to use it.
Swyx [00:15:24]: I mean, there are some cloud companies offering M1 and M2 chips on servers. I feel like it might be interesting for Apple to pursue that market, but it's not their core strength.
Soumith [00:15:33]: Yeah. If Apple can figure out their interconnect story, maybe, like then it can become a thing.
Swyx [00:15:40]: Honestly, that's more interesting than the cars. Yes.
Soumith [00:15:43]: I think the moat that NVIDIA has right now, I feel is that they have the interconnect that no one else has, like AMD GPUs are pretty good.
I'm sure there's various silicon that is not bad at all, but the interconnect, like NVLink is uniquely awesome. I'm sure the other hardware providers are working on it, but-
Swyx [00:16:04]: I feel like when you say it's uniquely awesome, you have some appreciation of it that the rest of us don't. I mean, the rest of us just like, you know, we hear marketing lines, but what do you mean when you say NVIDIA is very good at networking? Obviously they made the acquisition maybe like 15 years ago.
Soumith [00:16:15]: Just the bandwidth it offers and the latency it offers. I mean, TPUs also have a good interconnect, but you can't buy them. So you have to go to Google to use it.
PyTorch Mafia
Alessio [00:16:27]: Who are some of the other FAIR PyTorch alumni that are building cool companies? I know you have Fireworks AI, Lightning AI, Lepton, and Yangqing, you knew since college when he was building Caffe?
Soumith [00:16:40]: Yeah, so Yangqing and I used to be framework rivals, PyTorch, I mean, we were all a very small close-knit community back then. Caffe, Torch, Theano, Chainer, Keras, various frameworks. I mean, it used to be more like 20 frameworks. I can't remember all the names. CCV by Liu Liu, who is also based out of SF. And I would actually like, you know, one of the ways it was interesting is you went into the framework guts and saw if someone wrote their own convolution kernel or they were just copying someone else's. There were four or five convolution kernels that were unique and interesting. There was one from this guy out of Russia, I forgot the name, but I remember he was awesome enough to have written his own kernel. And at some point there, I built out these benchmarks called ConvNet benchmarks. They were just benchmarking all the convolution kernels that were available at that time.
It hilariously became big enough, because at that time AI was getting important, but not important enough that industrial-strength players came in to do these kinds of benchmarking and standardization. Like we have MLPerf today. So a lot of the startups were using convnet-benchmarks in their pitch decks as like, oh, you know, on convnet-benchmarks, this is how we fare, so you should fund us. I remember Nervana actually was at the top of the pack because Scott Gray wrote amazingly fast convolution kernels at that time. Very interesting, but separate times. But to answer your question, Alessio, I think mainly Lepton and Fireworks are the two most obvious ones, but I'm sure the fingerprints are a lot wider. They're just people who worked within the PyTorch/Caffe2 cohort of things and now end up at various other places.

Swyx [00:18:50]: I think, both as an investor and as someone looking to build on top of their services, it's an uncomfortable, slash, "I don't know what I don't know" pitch. Because I've met Yangqing and I've met Lin Qiao. Yeah, I've met these folks and they're like, you know, we are deep in the PyTorch ecosystem and we serve billions of inferences a day or whatever at Facebook and now we can do it for you. And I'm like, okay, that's great. Like, what should I be wary of or cautious of when these things happen? Because I'm like, obviously this experience is extremely powerful and valuable. I just don't know what I don't know. Like, what should people know about these sort of new inference-as-a-service companies?

Soumith [00:19:32]: I think at that point you would be investing in them for their expertise of one kind. So if they've been at a large company, but they've been doing amazing work, you would be thinking about it as what these people bring to the table is that they're really good at like GPU programming or understanding the complexity of serving models once it hits a certain scale.
You know, various expertise like from the infra and AI and GPUs point of view. What you would obviously want to figure out is whether their understanding of the external markets is clear, whether they know and understand how to think about running a business, understanding how to be disciplined about making money or, you know, various things like that.

Swyx [00:20:23]: Maybe I'll put it like, actually I will de-emphasize the investing bit and just more as a potential customer. Oh, okay. Like, it's more okay, you know, you have PyTorch gods, of course. Like, what else should I know?

Soumith [00:20:37]: I mean, I would not care about who's building something. If I'm trying to be a customer, I would care about whether...

Swyx [00:20:44]: Benchmarks.

Soumith [00:20:44]: Yeah, I use it and its usability and reliability and speed, right?

Swyx [00:20:51]: Quality as well.

Soumith [00:20:51]: Yeah, if someone from some random unknown place came to me and said, use our stuff, it's great, and I have the bandwidth, I probably will give it a shot. And if it turns out to be great, like I'll just use it.

Benchmark drama

Swyx [00:21:07]: Okay, great. And then maybe one more thing about benchmarks, since we already brought it up and you brought up convnet-benchmarks. There was some recent drama around Anyscale. Anyscale released their own benchmarks and obviously they look great on their own benchmarks, but maybe didn't give the other... I feel there are two lines of criticism. One, which is they didn't test apples to apples on the kind of endpoints that the other providers, that they are competitors with, offer on their benchmarks, and that is the due diligence baseline. And then the second would be more just optimizing for the right thing. You had some commentary on it. I'll just kind of let you riff.

Soumith [00:21:41]: Yeah, I mean, in summary, basically my criticism of that was Anyscale built these benchmarks for end users to just understand what they should pick, right?
And that's a very good thing to do. I think what they didn't do a good job of is give that end user a full understanding of what they should pick. Like they just gave them a very narrow slice of understanding. I think they just gave them latency numbers and that's not sufficient, right? You need to understand your total cost of ownership at some reasonable scale. Not oh, one API call is one cent, but a thousand API calls are 10 cents. Like people can misprice to cheat on those benchmarks. So you want to understand, okay, like how much is it going to cost me if I actually subscribe to you and do like a million API calls a month or something? And then you want to understand the latency and reliability, not just from one call you made, but an aggregate of calls you've made over various times of the day and times of the week. And the nature of the workloads, is it just some generic single paragraph that you're sending that is cacheable? Or is it like testing of real-world workload? I think that kind of rigor, like in presenting that benchmark, wasn't there. It was a much more narrow sliver of what should have been a good benchmark. That was my main criticism. And I'm pretty sure if before they released it, they showed it to their other stakeholders who would be caring about this benchmark because they are present in it, they would have easily just pointed out these gaps. And I think they didn't do that and they just released it. So I think those were the two main criticisms. I think they were fair and Robert took it well.

Swyx [00:23:40]: And he took it very well. And we'll have him on at some point and we'll discuss it. But I think it's important, I think the market maturing enough that people start caring and competing on these kinds of things means that we need to establish what best practice is, because otherwise everyone's going to play dirty.

Soumith [00:23:55]: Yeah, absolutely.
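As an editorial aside, the two measurements Soumith asks for here (cost at a realistic monthly volume, and latency aggregated over many calls at different times) can be sketched in a few lines. This is only an illustration under stated assumptions: the tiered price list, the window names, and the helper names `monthly_cost`, `percentile`, and `summarize` are all hypothetical and do not come from Anyscale's benchmark or any provider's API.

```python
import math
import statistics


def monthly_cost(calls, tiers):
    """Total cost of `calls` API calls under tiered per-call pricing.

    `tiers` is a list of (tier_size, price_per_call) pairs; a tier_size of
    None means "all remaining calls". The tiers are hypothetical.
    """
    remaining = calls
    total = 0.0
    for size, price in tiers:
        n = remaining if size is None else min(remaining, size)
        total += n * price
        remaining -= n
        if remaining == 0:
            break
    return total


def percentile(samples, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_by_window):
    """Aggregate latency samples gathered at different times of day or week."""
    all_samples = [x for w in latencies_by_window.values() for x in w]
    return {
        "p50": percentile(all_samples, 50),
        "p99": percentile(all_samples, 99),
        # The slowest window's median exposes time-of-day effects
        # that a single call would hide.
        "worst_window_median": max(
            statistics.median(w) for w in latencies_by_window.values()
        ),
    }


if __name__ == "__main__":
    # A teaser price on the first 1,000 calls, then the real price.
    tiers = [(1000, 0.0001), (None, 0.002)]
    print(monthly_cost(1, tiers), monthly_cost(1_000_000, tiers))
    print(summarize({"mon_am": [100, 110, 120], "fri_pm": [300, 320, 900]}))
```

With the made-up tiers above, a single call looks almost free while a million calls a month costs roughly two thousand dollars, which is the kind of mispricing a one-call benchmark cannot reveal.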
My view of the LLM inference market in general is that it's the laundromat model. Like the margins are going to drive down towards the bare minimum. It's going to be all kinds of arbitrage between how much you can get the hardware for and then how much you sell the API and how much latency your customers are willing to let go. You need to figure out how to squeeze your margins. Like what is your unique thing here? Like I think Together and Fireworks and all these people are trying to build some faster CUDA kernels and faster, you know, hardware kernels in general. But those moats only last for a month or two. These ideas quickly propagate.

Swyx [00:24:38]: Even if they're not published?

Soumith [00:24:39]: Even if they're not published, the idea space is small. So even if they're not published, the discovery rate is going to be pretty high. It's not like we're talking about a combinatorial thing that is really large. You're talking about Llama-style LLM models. And we're going to beat those to death on a few different hardware SKUs, right? Like it's not even that we have a huge diversity of hardware you're going to aim to run it on. Now when you have such a narrow problem and you have a lot of people working on it, the rate at which these ideas are going to get figured out is going to be pretty rapid.

Swyx [00:25:15]: Is it a standard bag of tricks? Like the standard one that I know of is, you know, fusing operators and-

Soumith [00:25:22]: Yeah, it's the standard bag of tricks on figuring out how to improve your memory bandwidth and all that, yeah.

Alessio [00:25:28]: Any ideas instead of things that are not being beaten to death that people should be paying more attention to?

Novel PyTorch Applications

Swyx [00:25:34]: One thing I was like, you know, you have a thousand operators, right?
Like what's the most interesting usage of PyTorch that you're seeing maybe outside of this little bubble?

Soumith [00:25:41]: So PyTorch, it's very interesting and scary at the same time, but basically it's used in a lot of exotic ways, like from the ML angle, what kind of models are being built? And you get all the way from state-space models and all of these things to stuff like nth-order differentiable models, like neural ODEs and stuff like that. I think there's one set of interestingness factor from the ML side of things. And then there's the other set of interesting factor from the applications point of view. It's used in Mars Rover simulations, to drug discovery, to Tesla cars. And there's a huge diversity of applications in which it is used. So in terms of the most interesting application side of things, I think I'm scared at how many interesting things that are also very critical and really important it is used in. I think the scariest was when I went to visit CERN at some point and they said they were using PyTorch and they were using GANs at the same time for particle physics research. And I was scared more about the fact that they were using GANs than that they were using PyTorch, because at that time I was a researcher focusing on GANs. But the diversity is probably the most interesting. How many different things it is being used in. I think that's the most interesting to me from the applications perspective. From the models perspective, I think I've seen a lot of them. Like the really interesting ones to me are where we're starting to combine search and symbolic stuff with differentiable models, like the whole AlphaGo style of models is one example. And then I think we're attempting to do it for LLMs as well, with various reward models and search. I mean, I don't think PyTorch is being used in this, but the whole AlphaGeometry thing was interesting because again, it's an example of combining the symbolic models with the gradient-based ones.
But there is stuff like AlphaGeometry that PyTorch is used for, especially when you intersect biology and chemistry with ML. In those areas, you want stronger guarantees on the output. So yeah, maybe from the ML side, those things to me are very interesting right now.

Swyx [00:28:03]: Yeah. People are very excited about the AlphaGeometry thing. And it's kind of like, for me, it's theoretical. It's great. You can solve some Olympiad questions. I'm not sure how to make that bridge over into the real-world applications, but I'm sure people smarter than me will figure it out.

Synthetic Data vs Symbolic Models

Soumith [00:28:18]: Let me give you an example of it. You know how the whole thing about synthetic data will be the next rage in LLMs is a thing?

Swyx [00:28:27]: Already is a rage.

Soumith [00:28:28]: Which I think is fairly misplaced in how people perceive it. People think synthetic data is some kind of magic wand that you wave and it's going to be amazing. Synthetic data is useful in neural networks right now because we as humans have figured out a bunch of symbolic models of the world or made up certain symbolic models because of human innate biases. So we've figured out how to ground particle physics in a 30-parameter model. And it's just very hard to compute, as in it takes a lot of flops to compute, but it only has 30 parameters or so. I mean, I'm not a physics expert, but it's a very low-rank model. We built mathematics as a field that basically is very low rank. Language, a deep understanding of language, like the whole syntactic parse trees and just understanding how language can be broken down into a formal symbolism is something that we figured out. So we basically as humans have accumulated all this knowledge on these subjects, either synthetic, we created those subjects in our heads, or we grounded some real-world phenomenon into a set of symbols. But we haven't figured out how to teach neural networks symbolic world models directly.
The only way we have to teach them is generating a bunch of inputs and outputs and gradient descending over them. So in areas where we have the symbolic models and we need to teach all the knowledge we have that is better encoded in the symbolic models, what we're doing is we're generating a bunch of synthetic data, a bunch of input-output pairs, and then giving that to the neural network and asking it to learn, through gradient descent and in a much more over-parameterized way, the same thing that we already have a better low-rank model of. Outside of this, like where we don't have good symbolic models, synthetic data obviously doesn't make any sense. So synthetic data is not a magic wand where it'll work in every case or whatever. It's just where we as humans already have good symbolic models. We need to impart that knowledge to neural networks and we figured out that synthetic data is a vehicle to impart this knowledge. But people, because maybe they don't know enough about synthetic data as a notion, but they hear, you know, the next wave of data revolution is synthetic data, they think it's some kind of magic where we just create a bunch of random data somehow. They don't think about how, and then they think that's just a revolution. And I think that's maybe a gap in understanding most people have in this hype cycle.

Swyx [00:31:23]: Yeah, well, it's a relatively new concept, so. Oh, there's two more that I'll put in front of you and then you can see how you respond. One is, you know, I have this joke that it's, you know, it's only synthetic data if it's from the Mistral region of France, otherwise it's just a sparkling distillation, which is what Nous Research is doing. Like they're distilling GPT-4 by creating synthetic data from GPT-4, creating mock textbooks inspired by Phi 2 and then fine-tuning open source models like Llama. And so I don't know, I mean, I think that's, should we call that synthetic data? Should we call it something else?
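As an editorial aside, Soumith's description of synthetic data (sample input-output pairs from a symbolic model you already trust, then fit a learner to them by gradient descent) can be made concrete with a toy sketch. Everything here is an assumption chosen for illustration: the "symbolic model" is the made-up rule y = 3x + 2, and the learner is a two-parameter linear model rather than an over-parameterized neural network, so that the example runs with only the standard library.

```python
import random


def symbolic_model(x):
    """The compact rule we already know (hypothetical): y = 3x + 2."""
    return 3.0 * x + 2.0


def make_synthetic_pairs(n, seed=0):
    """Generate (input, output) pairs by querying the symbolic model."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, symbolic_model(x)) for x in xs]


def fit_by_gradient_descent(pairs, lr=0.1, steps=1000):
    """Fit y ~ w*x + b to the pairs by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    w, b = fit_by_gradient_descent(make_synthetic_pairs(100))
    print(w, b)  # should approach the symbolic model's 3 and 2
```

The learner only ever sees samples, never the rule itself, which is the sense in which the data is a vehicle for knowledge that already exists in a more compact form. Without a trusted `symbolic_model` to query, there would be nothing principled to generate the pairs from.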
I don't know.

Soumith [00:31:57]: Yeah, I mean, the outputs of LLMs, are they synthetic data? They probably are, but I think it depends on the goal you have. If your goal is you're creating synthetic data with the goal of trying to distill GPT-4's superiority into another model, I guess you can call it synthetic data, but it also feels disingenuous because your goal is I need to copy the behavior of GPT-4 and-

Swyx [00:32:25]: It's also not just behavior, but data set. So I've often thought of this as data set washing. Like you need one model at the top of the chain, you know, unnamed French company that, you know, makes a model that has all the data in it that we don't know where it's from, but it's open source, hey, and then we distill from that and it's great. To be fair, they also use larger models as judges for preference ranking, right? So that is, I think, a very, very accepted use of synthetic.

Soumith [00:32:53]: Correct. I think it's a very interesting time where we don't really have good social models of what is acceptable depending on how many bits of information you use from someone else, right? It's like, okay, you use one bit. Is that okay? Yeah, let's accept it to be okay. Okay, what about if you use 20 bits? Is that okay? I don't know. What if you use 200 bits? I don't think we as society have ever been in this conundrum where we have to be like, where is the boundary of copyright or where is the boundary of socially accepted understanding of copying someone else? We haven't been tested on this mathematically before, in my opinion.

Swyx [00:33:38]: Whether it's transformative use. Yes. So yeah, I think this New York Times OpenAI case is gonna go to the Supreme Court and we'll have to decide it because I think we never had to deal with it before.
And then finally, for synthetic data, the thing that I'm personally exploring is solving this great stark paradigm difference between RAG and fine-tuning, where you can kind of create synthetic data off of your retrieved documents and then fine-tune on that. That's kind of synthetic. All you need is variation or diversity of samples for you to fine-tune on. And then you can fine-tune new knowledge into your model. I don't know if you've seen that as a direction for synthetic data.

Soumith [00:34:13]: I think you're basically trying to, what you're doing is you're saying, well, language, I know how to parametrize language to an extent. And I need to teach my model variations of this input data so that it's resilient or invariant to language uses of that data.

Swyx [00:34:32]: Yeah, it doesn't overfit on the wrong source documents.

Soumith [00:34:33]: So I think that's 100% synthetic. You understand, the key is you create variations of your documents and you know how to do that because you have a symbolic model or like some implicit symbolic model of language.

Swyx [00:34:48]: Okay.

Alessio [00:34:49]: Do you think the issue with symbolic models is just the architecture of the language models that we're building? I think maybe the thing that people grasp is the inability of transformers to deal with numbers because of the tokenizer. Is it a fundamental issue there too? And do you see alternative architectures that will be better with symbolic understanding?

Soumith [00:35:09]: I am not sure if it's a fundamental issue or not. I think we just don't understand transformers enough. I don't even mean transformers as an architecture. I mean the use of transformers today, like combining the tokenizer and transformers and the dynamics of training, when you show math-heavy questions versus not. I don't have a good calibration of whether I know the answer or not. You know, there are common criticisms that are, you know, transformers will just fail at X.
But then when you scale them up to sufficient scale, they actually don't fail at that X. I think there's this entire subfield where they're trying to figure out these answers called like the science of deep learning or something. So we'll get to know more. I don't know the answer.

Meta AI and Llama 2/3

Swyx [00:35:57]: Got it. Let's touch a little bit on just Meta AI and you know, stuff that's going on there. Maybe, I don't know how deeply you're personally involved in it, but you're our first guest from Meta AI, which is really fantastic. And Llama 1 was, you know, you are such a believer in open source. Llama 1 was more or less the real breakthrough in open source AI. The most interesting thing for us covering this in this podcast was the death of Chinchilla, as people say. Any interesting insights there around the scaling models for open source models or smaller models or whatever that design decision was when you guys were doing it?

Soumith [00:36:31]: So Llama 1 was Guillaume Lample and team. There was OPT before, which I think I'm also very proud of because we bridged the gap in understanding of how complex it is to train these models to the world. Like until then, no one really published in gory detail.

Swyx [00:36:50]: The logs.

Soumith [00:36:51]: Yeah. Like, why is it complex? And everyone says, oh, it's complex. But no one really talked about why it's complex. I think OPT was cool.

Swyx [00:37:02]: I met Susan and she's very, very outspoken. Yeah.

Soumith [00:37:05]: We probably, I think, didn't train it for long enough, right? That's kind of obvious in retrospect.

Swyx [00:37:12]: For a 175B. Yeah. You trained it according to Chinchilla at the time or?

Soumith [00:37:17]: I can't remember the details, but I think it's a commonly held belief at this point that if we trained OPT longer, it would actually end up being better. Llama 1, I think, was Guillaume Lample and team. Guillaume is fantastic and went on to build Mistral.
I wasn't too involved in that side of things. So I don't know what you're asking me, which is how did they think about scaling laws and all of that? Llama 2, I was more closely involved in. I helped them a reasonable amount with their infrastructure needs and stuff. And Llama 2, I think, was more like, let's get to the evolution. At that point, we kind of understood what we were missing from the industry's understanding of LLMs. And we needed more data and we needed to train the models for longer. And we made, I think, a few tweaks to the architecture and we scaled up more. And that was Llama 2. I think Llama 2, you can think of it as after Guillaume left, the team kind of rebuilt their muscle around Llama 2. And Hugo, I think, who's the first author, is fantastic. And I think he did play a reasonably big role in Llama 1 as well.

Soumith [00:38:35]: And he overlaps between Llama 1 and 2. So in Llama 3, obviously, hopefully, it'll be awesome.

Alessio [00:38:42]: Just one question on Llama 2, and then we'll try and fish Llama 3 spoilers out of you. In the Llama 2 paper, the loss curves of the 34B and 70B parameter models, they still seem kind of steep. Like they could go lower. How, from an infrastructure level, how do you allocate resources? Could they have just gone longer or were you just, hey, this is all the GPUs that we can burn and let's just move on to Llama 3 and then make that one better?

Soumith [00:39:07]: Instead of answering specifically about that Llama 2 situation or whatever, I'll tell you how we think about things. Generally, we're, I mean, Mark released some numbers, right?

Swyx [00:39:20]: So let's cite those things again. All I remember is like 600K GPUs.

Soumith [00:39:24]: That is by the end of this year and 600K H100 equivalents. With 250K H100s, including all of our other GPU or accelerator stuff, it would be 600-and-something-K aggregate capacity.

Swyx [00:39:38]: That's a lot of GPUs.

Soumith [00:39:39]: We'll talk about that separately.
But the way we think about it is we have a train of models, right? Llama 1, 2, 3, 4. And we have a bunch of GPUs. I don't think we're short of GPUs. Like-

Swyx [00:39:54]: Yeah, no, I wouldn't say so. Yeah, so it's all a matter of time.

Soumith [00:39:56]: I think time is the biggest bottleneck. It's like, when do you stop training the previous one and when do you start training the next one? And how do you make those decisions? The data, do you have net new data, better clean data for the next one in a way that it's not worth really focusing on the previous one? It's just a standard iterative product. You're like, when is the iPhone 1? When do you start working on iPhone 2? Where is the iPhone? And so on, right? So mostly the considerations are time and generation, rather than GPUs, in my opinion.

Alessio [00:40:31]: So one of the things with the scaling laws, like Chinchilla is optimal to balance training and inference costs. I think at Meta's scale, you would rather pay a lot more maybe at training and then save on inference. How do you think about that from an infrastructure perspective? I think in your tweet, you say you can try and guess on like how we're using these GPUs. Can you just give people a bit of understanding? It's like, because I've already seen a lot of VCs say, Llama 3 has been trained on 600,000 GPUs and that's obviously not true, I'm sure. How do you allocate between the research, FAIR and the Llama training, the inference on Instagram suggestions that get me to scroll, like AI-generated stickers on WhatsApp and all of that?

Soumith [00:41:11]: Yeah, we haven't talked about any of this publicly, but as a broad stroke, it's like how we would allocate resources of any other kinds at any company. You run a VC portfolio, how do you allocate your investments between different companies or whatever? You kind of make various trade-offs and you kind of decide, should I invest in this project or this other project, or how much should I invest in this project?
It's very much a zero sum of trade-offs. And it also comes into play, how are your clusters configured, like overall, what you can fit of what size and what cluster and so on. So broadly, there's no magic sauce here. I mean, I think the details would add more spice, but also wouldn't add more understanding. It's just gonna be like, oh, okay, I mean, this looks like they just think about this as I would normally do.

Alessio [00:42:05]: So even the GPU rich run through the same struggles of having to decide where to allocate things.

Soumith [00:42:11]: Yeah, I mean, at some point I forgot who said it, but you kind of fit your models to the amount of compute you have. If you don't have enough compute, you figure out how to make do with smaller models. But no one as of today, I think would feel like they have enough compute. I don't think I've heard any company within the AI space be like, oh yeah, like we feel like we have sufficient compute and we couldn't have done better. So that conversation, I don't think I've heard from any of my friends at other companies.

Eleuther

Swyx [00:42:47]: Stella from Eleuther sometimes says that because she has a lot of donated compute. She's trying to put it to interesting uses, but for some reason she's decided to stop making large models.

Soumith [00:42:57]: I mean, that's a cool, high conviction opinion that might pay out.

Swyx [00:43:01]: Why?

Soumith [00:43:02]: I mean, she's taking a path that most people don't care to take in this climate and she probably will have very differentiated ideas. I mean, think about the correlation of ideas in AI right now. It's so bad, right? So everyone's fighting for the same pie. In some weird sense, that's partly why I don't really directly work on LLMs. I used to do image models and stuff and I actually stopped doing GANs because GANs were getting so hot that I didn't have any calibration of whether my work would be useful or not because, oh yeah, someone else did the same thing you did.
It's like, there's so much to do, I don't understand why I need to fight for the same pie. So I think Stella's decision is very smart.

Making Bets

Alessio [00:43:53]: And how do you reconcile that with how we started the discussion about intrinsic versus extrinsic kind of like accomplishment or success? How should people think about that especially when they're doing a PhD or early in their career? I think at NeurIPS, I walked through a lot of the posters and whatnot, there seems to be mode collapse in a way in the research, a lot of people working on the same things. Is it worth for a PhD to not take a bet on something that is maybe not as interesting just because of funding and visibility and whatnot? Or yeah, what suggestions would you give?

Soumith [00:44:28]: I think there's a baseline level of compatibility you need to have with the field. Basically, you need to figure out if you will get paid enough to eat, right? Like whatever reasonable normal lifestyle you want to have as a baseline. So you at least have to pick a problem within the neighborhood of fundable. Like you wouldn't wanna be doing something so obscure that people are like, I don't know, like you can work on it.

Swyx [00:44:59]: Would a limit on fundability, I'm just observing something like three months of compute, right? That's the top line, that's the like max that you can spend on any one project.

Soumith [00:45:09]: But like, I think that's very ill specified, like how much compute, right? I think that the notion of fundability is broader. It's more like, hey, are these family of models within the acceptable set of, you're not crazy or something, right? Even something like neural ODEs, which is a very boundary-pushing thing, or state-space models or whatever. Like all of these things I think are still in fundable territory. When you're talking about, I'm gonna do one of the neuromorphic models and then apply image classification to them or something, then it becomes a bit questionable.
Again, it depends on your motivation. Maybe if you're a neuroscientist, it actually is feasible. But if you're an AI engineer, like the audience of these podcasts, then it's more questionable. The way I think about it is, you need to figure out how you can be in the baseline level of fundability just so that you can just live. And then after that, really focus on intrinsic motivation and depends on your strengths, like how you can play to your strengths and your interests at the same time. Like I try to look at a bunch of ideas that are interesting to me, but also try to play to my strengths. I'm not gonna go work on theoretical ML. I'm interested in it, but when I want to work on something like that, I try to partner with someone who is actually a good theoretical ML person and see if I actually have any value to provide. And if they think I do, then I come in. So I think you'd want to find that intersection of ideas you like, and that also play to your strengths. And I'd go from there. Everything else, like actually finding extrinsic success and all of that, I think is the way I think about it is like somewhat immaterial. When you're talking about building ecosystems and stuff, slightly different considerations come into play, but that's a different conversation.

Swyx [00:47:06]: We're gonna pivot a little bit to just talking about open source AI. But one more thing I wanted to establish for Meta is this 600K number, just kind of rounding out the discussion, that's for all Meta. So including your own inference needs, right? It's not just about training.

Soumith [00:47:19]: It's gonna be the number in our data centers for all of Meta, yeah.

Swyx [00:47:23]: Yeah, so there's a decent amount of workload serving Facebook and Instagram and whatever. And then is there interest in like your own hardware?

MTIA

Soumith [00:47:31]: We already talked about our own hardware. It's called MTIA.
Our own silicon, I think we've even shown the standard photograph of you holding the chip that doesn't work. Like as in the chip that you basically just get like-

Swyx [00:47:51]: As a test, right?

Soumith [00:47:52]: Yeah, a test chip or whatever. So we are working on our silicon and we'll probably talk more about it when the time is right, but-

Swyx [00:48:00]: Like what gaps do you have that the market doesn't offer?

Soumith [00:48:04]: Okay, I mean, this is easy to answer. So basically, remember how I told you about there's this memory hierarchy and like sweet spots and all of that? Fundamentally, when you build hardware, you make it general enough that a wide set of customers and a wide set of workloads can use it effectively while trying to get the maximum level of performance they can. The more specialized you make the chip, the more hardware efficient it's going to be, the more power efficient it's gonna be, the easier it's going to be to find the software, like the kernels, right, to just map that one or two workloads to that hardware and so on. So it's pretty well understood across the industry that if you have a sufficiently large volume, enough workload, you can specialize it and get some efficiency gains, like power gains and so on. So the way you can think about every large company building silicon, I think a bunch of the other large companies are building their own silicon as well, is that each large company has a sufficient set of verticalized workloads that can be specialized, that have a pattern to them that, say, a more generic accelerator like an NVIDIA or an AMD GPU does not exploit. So there is some level of power efficiency that you're leaving on the table by not exploiting that. And you have sufficient scale and you have sufficient forecasted stability that those workloads will exist in the same form, that it's worth spending the time to build out a chip to exploit that sweet spot.
Like obviously something like this is only useful if you hit a certain scale and that your forecasted prediction of those kind of workloads being in the same kind of specializable exploitable way is true. So yeah, that's why we're building our own chips.

Swyx [00:50:08]: Awesome.

Open Source AI

Alessio [00:50:09]: Yeah, I know we've been talking a lot on a lot of different topics and going back to open source, you had a very good tweet. You said that a single company's closed source effort rate limits against people's imaginations and needs. How do you think about all the impact that some of the Meta AI work in open source has been doing and maybe directions of the whole open source AI space?

Soumith [00:50:32]: Yeah, in general, I think first, I think it's worth talking about this in terms of open and not just open source, because like with the whole notion of model weights, no one even knows what source means for these things. But just for the discussion, when I say open source, you can assume it's just I'm talking about open. And then there's the whole notion of licensing and all that, commercial, non-commercial, commercial with clauses and all that. I think at a fundamental level, the most benefited value of open source is that you make the distribution to be very wide. It's just available with no friction and people can do transformative things in a way that's very accessible. Maybe it's open source, but it has a commercial license and I'm a student in India. I don't care about the license. I just don't even understand the license. But like the fact that I can use it and do something with it is very transformative to me. Like I got this thing in a very accessible way. And then it's various degrees, right? And then if it's open source, but it's actually a commercial license, then a lot of companies are gonna benefit from gaining value that they didn't previously have, that they maybe had to pay a closed source company for it.
So open source is just a very interesting tool that you can use in various ways. So there's, again, two kinds of open source. One is some large company doing a lot of work and then open sourcing it. And that kind of effort is not really feasible by say a band of volunteers doing it the same way. So there's both a capital and operational expenditure that the large company just decided to ignore and give it away to the world for some benefits of some kind. They're not as tangible as direct revenue. So in that part, Meta has been doing incredibly good things. They fund a huge amount of the PyTorch development. They've open sourced Llama and that family of models and several other fairly transformative projects. FAISS is one, Segment Anything, Detectron, Detectron 2. DensePose. I mean, it's-
Swyx [00:52:52]: Seamless. Yeah, Seamless.
Soumith [00:52:53]: Like it's just the list is so long that we're not gonna cover. So I think Meta comes into that category where we spend a lot of CapEx and OpEx and we have a high talent density of great AI people and we open our stuff. And the thesis for that, I remember when FAIR was started, the common thing was like, wait, why would Meta wanna start an open AI lab? Like what exactly is a benefit from a commercial perspective? And for then the thesis was very simple. It was AI is currently rate limiting Meta's ability to do things. Our ability to build various product integrations, moderation, various other factors. Like AI was the limiting factor and we just wanted AI to advance more and we didn't care if the IP of the AI was uniquely in our possession or not. However the field advances, that accelerates Meta's ability to build a better product. So we just built an open AI lab and we said, if this helps accelerate the progress of AI, that's strictly great for us. But very easy, rational, right? Still the same to a large extent with the Llama stuff. And it's the same values, but the argument, it's a bit more nuanced. 
And then there's a second kind of open source, which is, oh, we built this project, nights and weekends and we're very smart people and we open sourced it and then we built a community around it. This is the Linux kernel and various software projects like that. So I think about open source, like both of these things being beneficial and both of these things being different. They're different and beneficial in their own ways. The second one is really useful when there's an active arbitrage to be done. If someone's not really looking at a particular space because it's not commercially viable or whatever, like a band of volunteers can just coordinate online and do something and then make that happen. And that's great.
Open Source LLMs
I wanna cover a little bit about open source LLMs maybe. So open source LLMs have been very interesting because I think we were trending towards an increase in open source in AI from 2010 all the way to 2017 or something. Like where more and more pressure within the community was to open source their stuff so that their methods and stuff get adopted. And then the LLMs revolution kind of took the opposite effect. OpenAI stopped open sourcing their stuff and DeepMind kind of didn't, like all the other cloud and all these other providers, they didn't open source their stuff. And it was not good in the sense that first science done in isolation probably will just form its own bubble where people believe their own b******t or whatever. So there's that problem. And then there was the other problem which was the accessibility part. Like, okay, I again always go back to I'm a student in India with no money. What is my accessibility to any of these closed source models? At some scale I have to pay money. That makes it a non-starter and stuff. And there's also the control thing. I strongly believe if you want human aligned stuff, you want all humans to give feedback. And you want all humans to have access to that technology in the first place. 
And I actually have seen, living in New York, whenever I come to Silicon Valley, I see a different cultural bubble. Like all the friends I hang out with talk about some random thing like Dyson Spheres or whatever, that's a thing. And most of the world doesn't know or care about any of this stuff. It's definitely a bubble and bubbles can form very easily. And when you make a lot of decisions because you're in a bubble, they're probably not globally optimal decisions. So I think open source, the distribution of open source powers a certain kind of non-falsifiability that I think is very important. I think on the open source models, like it's going great in the fact that LoRA I think came out of the necessity of open source models needing to be fine-tunable in some way. Yeah, and I think DPO also came out of the academic open source side of things. So do any of the closed source labs, did any of them already have LoRA or DPO internally? Maybe, but that does not advance humanity in any way. It advances some company's probability of doing the winner takes all that I talked about earlier in the podcast.
Open Source and Trust
I don't know, it just feels fundamentally good. Like when people try to, you know, people are like, well, what are the ways in which it is not okay? I find most of these arguments, and this might be a little controversial, but I find a lot of arguments based on whether closed source models are safer or open source models are safer very much related to what kind of culture they grew up in, what kind of society they grew up in. If they grew up in a society that they trusted, then I think they take the closed source argument. And if they grew up in a society that they couldn't trust, where the norm was that you didn't trust your government, obviously it's corrupt or whatever, then I think the open source argument is what they take. 
I think there's a deep connection to like people's innate biases from their childhood and their trust in society and governmental aspects that push them towards one opinion or the other. And I'm definitely in the camp of open source is definitely going to actually have better outcomes for society. Closed source to me just means that centralization of power, which, you know, is really hard to trust. So I think it's going well
From the NVIDIA Earnings Call held on Wednesday, February 21, 2024, the company reported a strong financial performance, repeatedly surpassing revenue expectations for Q4, as well as displaying considerable revenue growth for FY 2024. This fiscal prosperity underlines NVIDIA's expertise in accelerated computing and AI infrastructure, as reflected by the positive growth trends in its data center and gaming revenue segments. CEO Jensen Huang shed light on NVIDIA's journey during the call, stating, "GPUs are in every single step of a recommender system now. And as you know, recommender system is the single largest software engine on the planet... This is the list of recommender systems. And none of these, as I mentioned, existed a year ago, 100% new." The company's robust product portfolio, which includes solutions such as the NVIDIA Hopper GPU computing platform, InfiniBand networking, and GeForce RTX GPUs, contributes to the growth in key market sectors. NVIDIA's focus on AI is evident in innovative solutions like the NVIDIA DGX Cloud and NVIDIA AI Enterprise that enhance its hardware products with corresponding software and services. Emerging consumer trends underscore the growing adoption of AI across various sectors, most notably consumer internet industries. Deep learning and generative AI technologies are driving improved customer engagement and increased adoption of automated tools. NVIDIA, committed to delivering AI infrastructure across various industries and to collaborating with partners, has adapted to these trends with its ongoing investments in accelerated computing. In conclusion, based on the information available from the latest Earnings Call, NVIDIA's performance in financial terms, its focus on accelerated computing and AI infrastructure, alignment with AI trends, and investment plans, demonstrate a realistically positive position for potential growth in a highly competitive tech industry. 
The company's forward-thinking approach and commitment to technology advancement should help it maintain its industry position, but like all companies operating in this dynamic industry, that position is subject to market forces and industry trends. NVDA Company info: https://finance.yahoo.com/quote/NVDA/profile For more PSFK research : www.psfk.com This email has been published and shared for the purpose of business research and is not intended as investment advice.
Fredrik has Matt Topol and Lars Wikman over for a deep and wide chat about Apache Arrow and many, many topics in the orbit of the language-independent columnar memory format for flat and hierarchical data. What does that even mean? What is the point? And why does Arrow only feel more and more interesting and useful the more you think about deeply integrating it into your systems? Feeding data to systems fast enough is a problem which is focused on much less than it ought to be. With Arrow you can send data over the network, process it on the CPU - or GPU for that matter- and send it along to the database. All without parsing, transformation, or copies unless absolutely necessary. Thank you Cloudnet for sponsoring our VPS! Comments, questions or tips? We are @kodsnack, @tobiashieta, @oferlund and @bjoreman on Twitter, have a page on Facebook and can be emailed at info@kodsnack.se if you want to write longer. We read everything we receive. If you enjoy Kodsnack we would love a review in iTunes! You can also support the podcast by buying us a coffee (or two!) through Ko-fi. Links Lars Matt Øredev Matt’s Øredev presentations: State of the Apache Arrow ecosystem: How your project can leverage Arrow! 
and Leveraging Apache Arrow for ML workflows Kallbadhuset Apache Arrow Lars talks about his Arrow rabbit hole in Regular programming SIMD/vectorization Spark Explorer - builds on Polars Null bitmap Zeromq Airbyte Arrow flight Dremio Arrow flight SQL Influxdb Arrow flight RPC Kafka Pulsar Opentelemetry Arrow IPC format - also known as Feather ADBC - Arrow database connectivity ODBC and JDBC Snowflake DBT - SQL to SQL Jinja Datafusion Ibis Substrait Meta’s Velox engine Arrow’s project management committee (PMC) Voltron data Matt’s Arrow book - In-memory analytics with Apache Arrow Rapids and Cudf The Theseus engine - accelerator-native distributed compute engine using Arrow The composable codex The standards chapter Dremio Hugging face Apache Hop - orchestration data scheduling thing Directed acyclic graph UCX - libraries for finding fast routes for data Infiniband NUMA CUDA GRPC Foam bananas Turkish pepper - Tyrkisk peber Plopp Marianne Titles For me, it started during the speaker’s dinner Old, dated, and Java A real nerd snipe Identical representation in memory Working on columns It’s already laid out that way Pass the memory, as is Null plus null is null A wild perk Arrow into the thing So many curly brackets you need to store Arrow straight through Something data people like to do So many backends The SQL string is for people I’m rude, and he’s polite Feed the data fast enough A depressing amount of JSON Arrow the whole way through These are the problems in data Reference the bytes as they are Boiling down to Arrow Data lakehouses Removing inefficiency
InfiniBand is the king of AI networking today. Ethernet is making a big leap to take some of that market share but it's not going to dethrone the incumbent any time soon. In this episode, join Jody Lemoine, David Peñaloza, and Chris Grundemann along with Tom Hollingsworth as they debate the merits of using Ethernet in place of InfiniBand. They discuss the paradigm shift as well as the suitability of the protocols to the workloads as well as how Ultra Ethernet is similar to another shift in converged protocols - Fibre Channel over Ethernet. © Gestalt IT, LLC for Gestalt IT: Ethernet Won't Replace InfiniBand for AI Networking in 2024
Dell Technologies is helping customers achieve faster AI and generative AI (GenAI) performance with new enterprise data storage advancements and validation with the NVIDIA DGX SuperPOD AI infrastructure. "Storage performance is a critical factor for successful AI and generative AI outcomes," said Arthur Lewis, president of Infrastructure Solutions Group, Dell Technologies. "Customers are relying on us to continually push the boundaries of storage innovation, including removing data access bottlenecks that limit the throughput and scalability of compute-intensive applications."
Addressing the need for high performance and efficiency for AI storage
New advancements from Dell PowerScale, the world's most flexible, secure and efficient scale-out file storage system, address increasing customer demands for higher AI and GenAI performance. Now, with PowerScale OneFS software enhancements, companies can prepare, train, fine-tune and inference AI models more quickly. With new PowerScale all-flash storage systems based on the latest generation of Dell PowerEdge servers, customers will see up to a 2X performance increase for streaming reads and writes. "We make an engineering change every 17 minutes, a speed unachievable without a powerful IT infrastructure that underpins all of our processes," said Dan Keyworth, director of business technology at McLaren Racing. "Thanks to Dell PowerScale, we're equipped to leverage cutting-edge AI applications, combining high-performance storage with computing capabilities to create a winning formula for success." PowerScale will offer a new smart scale-out capability to improve single compute node performance for enhanced GPU utilization, leading to faster storage throughput for AI training, checkpointing and inferencing. 
"Dell continues to design its file and object storage solutions with the customer in mind, allowing them to modernize their infrastructure with an AI-ready foundation to support the most intensive workloads, such as generative AI," said Scott Sinclair, practice director of Enterprise Strategy Group.
Dell PowerScale undergoes validation for NVIDIA DGX SuperPOD
Through Dell's collaboration with NVIDIA, customers will be able to take advantage of a validated combination of NVIDIA DGX systems, Dell PowerScale storage and NVIDIA Quantum-2 InfiniBand and Spectrum Ethernet networking to achieve faster and more efficient AI storage. Dell's solution is expected to be the first Ethernet storage solution validated on NVIDIA DGX SuperPOD. With Dell PowerScale becoming validated for NVIDIA DGX SuperPOD, NVIDIA's turnkey data centre AI infrastructure solution, customers will be able to confidently accelerate their AI and GenAI initiatives with Dell's industry-leading network-attached storage. NVIDIA DGX SuperPOD includes the NVIDIA AI Enterprise software platform to provide a full-stack, secure and stable AI supercomputing solution.
Bringing AI to data wherever it resides
With nearly 87% of companies embracing multicloud strategies, Dell is giving customers the freedom to process data wherever it makes the most sense for them - on premises, at the edge or in public clouds. Dell APEX File Storage for Microsoft Azure, the latest addition to the Dell APEX Storage for Public Cloud portfolio, delivers enterprise-class file performance and management capabilities in Microsoft Azure. Customers will be able to easily meet the needs of performance-intensive AI and machine learning applications like Azure OpenAI Service and Azure AI Vision. Dell APEX File Storage in AWS and Azure allows customers to take advantage of cloud-native AI and GenAI workflows on data in public clouds or on premises with improved data access and movement. 
For example, through Dell's expanding Databricks collaboration, customers can choose from a variety of large language models (LLMs) and use libraries from Databricks MosaicML to retrain a foundational model with their proprietary data stored in Dell APEX File Storage, offering...
Amazon announced that they will be using Nvidia's NVSwitch to create new rackscale AI platforms. This allows customers to use Nvidia technology like Grace Hopper or build something using AWS Nitro DPUs and Elastic Fabric Adapters. That last combination doesn't use InfiniBand and moves to Ethernet. Two new chips are coming out from the Amazon labs. The first is Graviton4, the latest generation of Arm processor. Graviton4 has 50% more cores and 75% more memory bandwidth. This version of Graviton is based on the Demeter Neoverse V2 core. On the AI front Amazon also announced the next revision of their AI acceleration chip, Trainium2. This update has a 4x performance increase from the first generation and allows for liquid cooling. Trainium2 is designed to be deployed in clusters of 16 chips. In the data space, AWS made news with the GA of Bedrock, which enables customers to run foundational ML models from companies like Anthropic, Meta, and of course AWS itself. But many companies are worried about hallucination and want to include their own data in these models. That's what AWS is delivering with Guardrails, and what AWS partners are leaning into as well. This and more on this week's Gestalt IT Rundown. 
0:00 - Welcome to the Rundown
1:18 - DAOS Foundation Launched for Object Storage
4:57 - Autonomous Purple Team announced by Skyhawk at AWS re:Invent
8:26 - Couchbase's Capella Columnar Service Revealed
10:59 - Hackers Were Inside NXP For Two Years Before Detection
14:47 - AWS re:Invent Announcements
15:26 - NVSwitch For AI Nodes
20:46 - ARM V2 for Graviton4 and Trainium2
25:22 - AWS Guardrails for Bedrock
30:10 - The Weeks Ahead
33:10 - Thanks for Watching
Follow our Hosts on Social Media
Tom Hollingsworth: https://www.twitter.com/NetworkingNerd
Stephen Foskett: https://www.twitter.com/SFoskett
Follow Gestalt IT
Website: https://www.GestaltIT.com/
Twitter: https://www.twitter.com/GestaltIT
LinkedIn: https://www.linkedin.com/company/Gestalt-IT
Tags: #Rundown, AI, #AWSreinvent, #AWS, #Storage, @SkyhawkCloudSec, @Couchbase, #Capella, #Data, #Security, #NXP, #China, @AWS, @NVIDIA, #GraceHopper, @Arm, #Graviton4, #Bedrock, @NetworkingNerd, @SFoskett, @GestaltIT
It's time for Eyvonne, Tom, and Russ to talk about some current stories in the world of networking—the May roundtable. Yes, I know it's already June, and I'm a day late, but ... This month we talk about the IT worker shortage, Infiniband, and the "next big thing." So draw up a place to sit and hang out with us as we chat.
Tiered memory will have different performance, so operating systems will need to incorporate techniques to adapt to pages with different characteristics. This episode of Utilizing CXL features Hasan Al Maruf, part of a team that developed transparent page placement for Linux. He began his work enabling transparent page placement for InfiniBand-connected peers before applying the concept to NUMA nodes and now CXL memory. Hosts: Stephen Foskett: https://www.twitter.com/SFoskett Craig Rodgers: https://www.twitter.com/CraigRodgersms Guest Host: Hasan Al Maruf, Researcher at the University of Michigan: https://www.linkedin.com/in/hasanalmaruf/ Follow Gestalt IT and Utilizing Tech Website: https://www.UtilizingTech.com/ Website: https://www.GestaltIT.com/ Twitter: https://www.twitter.com/GestaltIT LinkedIn: https://www.linkedin.com/company/1789 Tags: #UtilizingCXL #TieringMemory #CXL #Linux @UtilizingTech @GestaltIT @UMich
This week we tried to migrate our Matrix instance to our own data center, and it didn't go well. We discuss our mistakes and what our next steps are. Your calls, your emails, we cover it all in this episode! -- During The Show -- 00:55 Steve's Weekend Client Prep 02:23 Caller Tony Client received a police call Transparent Proxy Managed Switches DNS Resolver/Forwarder Firewall Rules Squid Proxy Dragon's Tail 11:20 VLAN questions - Tyler TP Link G105E (http://www.amazon.com/dp/B00N0OHEMA/?tag=minddripmedia-20) HP OfficeConnect Switch (https://www.ebay.com/itm/373794096000?epid=9030355100&hash=item5707dd4380:g:htwAAOSwhlZhk9rb) Multiple Network Cards Managed Switch Cisco HP 1920 Series Stacking 17:30 Light Automation - John All smart bulbs have a failsafe Shelly 1 (https://shelly.cloud/products/shelly-1-smart-home-automation-relay/) Smart Assistant Smart stuff at the switch GE Enbrighten Zwave Plus (http://www.amazon.com/dp/B07226MG2T/?tag=minddripmedia-20) UltraPro Switch (http://www.amazon.com/dp/B07B3LXZJ9/?tag=minddripmedia-20) Lutron Switches (https://www.lutron.com/en-US/Products/Pages/WholeHomeSystems/RadioRA2/Overview.aspx) Inovelli (https://inovelli.com/products/switches/) 24:46 User recommends software - Brett DerbyNet (https://jeffpiazza.github.io/derbynet/) Pinewood Derby Software Real Time stats, Live instant replay, and more! 27:03 RHCE Questions - MHJ RHCE is hard Skills Based RHCSA is required before you can be called a RHCE Class is not required, but will help you a lot VTC Class 36:50 Tim Asked The guy with MoCA adapters that fail... He has to be sure that any splitters along the way are compatible (will pass the right frequencies). Also, if the cable is connected to CATV, he may need an in-line filter that will prevent his in-house MoCA signalling from propagating up the line to the cable company or the neighbors if the neighborhood infrastructure is old enough. 37:20 sunjam Asked Curious about custom local domain name resolution. 
Set up avahi for hostname.local on my pi, but curious about adding services to *.home in addition to .local Pseudo TLD Reserved Pseudo Domains mDNS 39:21 CubicleNate Asked I want to know where to get low(er) cost used rack mountable storage solutions. What interfaces are used between them? I have a raid 10 btrfs array I want to grow. eSATA Enclosure (https://www.datoptic.com/ec/hardware-raid-five-5-bay-1u-rackmount-esata-interface.html) InfiniBand (https://en.wikipedia.org/wiki/InfiniBand) 42:19 Caller George Digitize VHS C tapes Janacane Converter (http://www.amazon.com/dp/B07NPFJJ7K/?tag=minddripmedia-20) VHS player with built in converter to DVD 46:48 Linux News Wire Slackware 15 Announcement (http://www.slackware.com/announce/15.0.php) Slackware 15 The Register (https://www.theregister.com/2022/02/07/slackware/) Tiny Core Linux 13 (https://www.tomshardware.com/news/tiny-core-linux-13-released) Peppermint 11 Release (https://news.itsfoss.com/peppermint-11-release/) Trisquel 10 Release (https://trisquel.info/en/trisquel-10-nabia-release-announcement) Openmandriva 4.3 Released (https://www.openmandriva.org/en/news/article/openmandriva-lx-4-3-released) LibreOffice 7.3 Community (https://blog.documentfoundation.org/blog/2022/02/02/libreoffice-73-community/) AMD Linux Client Hiring (https://www.phoronix.com/scan.php?page=news_item&px=AMD-Linux-Client-Hiring-2022) RISC V Investment from Intel (https://riscv.org/whats-new/2022/02/intel-corporation-makes-deep-investment-in-risc-v-community-to-accelerate-innovation-in-open-computing/) Intel Invests Billion into RISC V (https://www.zdnet.com/article/intel-invests-in-open-source-risc-v-processors-with-a-billion-dollars-in-new-chip-foundries/) 3dfx Glide Coming to Linux (https://www.tomshardware.com/news/3dfx-glide-linux-gaming) Local Privilege Escalation (https://www.crowdstrike.com/blog/hunting-pwnkit-local-privilege-escalation-in-linux/) 12 Year Old Local Privilege Escalation on Linux 
(https://www.cpomagazine.com/cyber-security/all-linux-distributions-affected-by-12-year-old-pwnkit-local-privilege-escalation-bug-allowing-an-attacker-to-execute-commands-as-root/) Samba Code Execution Bug (https://www.helpnetsecurity.com/2022/02/02/samba-bug-may-allow-code-execution-as-root-on-linux-machines-nas-devices-cve-2021-44142/) Argo CD High Severity Flaw (https://www.theregister.com/2022/02/04/argo_cd_0day_kubernetes/) Google Open Source Story (https://thenewstack.io/a-kubernetes-documentary-shares-googles-open-source-story/) Oracle Linux in Windows Store (https://www.theregister.com/2022/02/02/oracle_linux_microsoft/) Github Sponsor Only Repos (https://techcrunch.com/2022/02/02/github-introduces-sponsor-only-repositories/) 48:39 MDM Data Migration Linux Delta Matrix Server Digital Ocean Bill was doubling 12,000+ Users Data Migration took 4+ days Ansible and Containers Postgres issues Learned a lot What is your baseline? Possible Solutions -- The Extra Credit Section -- For links to the articles and material referenced in this week's episode check out this week's page from our podcast dashboard! This Episode's Podcast Dashboard (http://podcast.asknoahshow.com/272) Phone Systems for Ask Noah provided by Voxtelesys (http://www.voxtelesys.com/asknoah) Join us in our dedicated chatroom #GeekLab:linuxdelta.com on Matrix (https://element.linuxdelta.com/#/room/#geeklab:linuxdelta.com) -- Stay In Touch -- Find all the resources for this show on the Ask Noah Dashboard Ask Noah Dashboard (http://www.asknoahshow.com) Need more help than a radio show can offer? Altispeed provides commercial IT services and they're excited to offer you a great deal for listening to the Ask Noah Show. Call today and ask about the discount for listeners of the Ask Noah Show! 
Altispeed Technologies (http://www.altispeed.com/) Contact Noah live [at] asknoahshow.com -- Twitter -- Noah - Kernellinux (https://twitter.com/kernellinux) Ask Noah Show (https://twitter.com/asknoahshow) Altispeed Technologies (https://twitter.com/altispeed) Special Guest: Steve Ovens.
In this talk you will learn the brief history of the clusters used to train neural networks at Yandex:
- Why did we need them?
- What is modern HPC, and why is it more than just several hundred servers joined together?
- The ways to build an HPC cluster, and why Yandex chose the hardest one.
We will talk about the fight for performance:
- Why don't clusters like ours work "out of the box"?
- How we optimized single-node performance from 30 to 110 teraflops.
- How we scaled that performance to 200 nodes, reaching 21.6 petaflops in total.
We will also explain in detail what distributed training is and why it is hard, and share 10 rules without which GPU clusters will never pay for themselves and will remain just an expensive toy.
About the speaker: Dmitry Monakhov works at Yandex on supporting and developing the Linux kernel for the needs of the internal cloud infrastructure. He is responsible for file systems, distributed systems and algorithms, RDMA, InfiniBand, HPC, and GPUs. From 2008 to 2018 he worked on Linux kernel development and on local and distributed file systems at SwSoft, Parallels, and Virtuozzo.
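The two performance figures quoted in the talk description above (110 teraflops per node, 21.6 petaflops across 200 nodes) can be related with a little arithmetic. The sketch below is only an illustration: it assumes both numbers were measured on the same workload and that perfect linear scaling is the right ideal to compare against, neither of which the description confirms. The function and variable names are hypothetical.

```python
# Figures quoted in the talk description; everything else here is illustrative.
NODES = 200
PER_NODE_TFLOPS = 110.0   # optimized single-node throughput
CLUSTER_PFLOPS = 21.6     # reported aggregate throughput


def scaling_efficiency(nodes, per_node_tflops, cluster_pflops):
    """Return (ideal_pflops, efficiency), taking perfect linear scaling as the ideal."""
    ideal_pflops = nodes * per_node_tflops / 1000.0  # 1 PFLOPS = 1000 TFLOPS
    return ideal_pflops, cluster_pflops / ideal_pflops


ideal, eff = scaling_efficiency(NODES, PER_NODE_TFLOPS, CLUSTER_PFLOPS)
print(f"ideal: {ideal:.1f} PFLOPS, efficiency: {eff:.1%}")
# prints "ideal: 22.0 PFLOPS, efficiency: 98.2%"
```

Under those assumptions, the cluster would be losing under 2% of its ideal throughput to communication and other distributed overhead.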
About Jake
Technical Lead by day at the Met Office in the UK, leading a team of software developers delivering services for the UK. By night, gamer and fitness instructor, attempting to get a home cinema and gaming setup whilst corralling 3 cats, 2 rabbits, 2 fish tanks, and my wonderful girlfriend.
Links: Met Office: https://www.metoffice.gov.uk Twitter: https://twitter.com/jakehendy
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: It seems like there is a new security breach every day. Are you confident that an old SSH key, or a shared admin account, isn't going to come back and bite you? If not, check out Teleport. Teleport is the easiest, most secure way to access all of your infrastructure. The open source Teleport Access Plane consolidates everything you need for secure access to your Linux and Windows servers—and I assure you there is no third option there. Kubernetes clusters, databases, and internal applications like AWS Management Console, Jenkins, GitLab, Grafana, Jupyter Notebooks, and more. Teleport's unique approach is not only more secure, it also improves developer productivity. To learn more visit: goteleport.com. And no, that is not me telling you to go away, it is: goteleport.com.
Corey: This episode is sponsored in part by our friends at Redis, the company behind the incredibly popular open source database that is not the bind DNS server. If you're tired of managing open source Redis on your own, or you're using one of the vanilla cloud caching services, these folks have you covered with the go-to managed Redis service for global caching and primary database capabilities; Redis Enterprise. 
To learn more and deploy not only a cache but a single operational data platform for one Redis experience, visit redis.com/hero. That's r-e-d-i-s.com/hero. And my thanks to my friends at Redis for sponsoring my ridiculous nonsense.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. It's often said that the sun never sets on the British Empire, but it's often very cloudy and hard to see the sun because many parts of it are dreary and overcast. Here to talk today about how we can predict those things in advance—in theory—is Jake Hendy, Tech Lead at the Met Office. Jake, thanks for joining me.
Jake: Hey, Corey, it's lovely to be here. Thanks for inviting me on.
Corey: There's a common misconception that it's startups in San Francisco or the culture thereof, if you can even elevate it to being a culture above something you'd find in a petri dish, that is where cloud stuff happens, where the computer stuff is done. And I've always liked cutting against that. There are governments that are doing interesting things with Cloud; there are large companies and ‘move fast and break things' is the exact opposite of what you generally want from institutions that date back centuries. What's it like working on Cloud, something that for all intents and purposes didn't exist 20 years ago, in the context of a government office?
Jake: As you can imagine, it was a bit of a foray into cloud for us when it first came around. We weren't one of the first people to jump. The Met Office, we've got our own data centers, which we proudly sit on, that contain supercomputers and mainframes as well as a plethora of x86 hardware. So, we didn't move fast at the start, but nowadays, we don't move at breakneck speeds, but we like to take advantage of those managed services. It gets out of the way of managing things for us.
Corey: Let's back up a second because I tend to be stereotypically American in many ways. What is the Met Office?
Jake: What is the Met Office? 
The Met Office is the UK's National Meteorological Service. And what does that mean? We do a lot of things though with meteorology, from weather forecasting and climate research from our Hadley Centre—which is world-renowned—down to observations, collections, and partnerships around the world. So, if you've been on a plane over Europe, the Middle East, Africa, over parts of Asia, that plane took off because the Met Office provided a forecast for that plane. There's a whole range of things we can talk about there, if you want Corey, of what the Met Office actually does.
Corey: Well, let's ask some of the baseline questions. You think of a weather office in a particular country as, oh okay, it tracks the weather in the area of operations for that particular country. Are you looking at weather on a global basis, on a somewhat local basis, or—as mentioned—since due to a long many-century history it turns out that there are UK Commonwealth territories scattered around the globe, where do you start? Where do you stop?
Jake: We don't start and we don't stop. The Met Office is very much a 24/7 operation. So, we've got a 24/7 operation center with staff constantly manning it, doing all sorts of things. So, we've got a defense, we work heavily with our defense colleagues from UK armed forces to NATO partners; we've got aviation, as mentioned; we've got marine shipping from—most of the listeners in the UK will have heard of the shipping forecast at one point or another. And we've got private sector as well, from transport, to energy, supermarkets, and more. We have a very heavy UK focus, for obvious reasons, but our remit goes wide. You can actually go and see some of our model data is actually on Amazon Open Data. We've got MOGREPS, which is our ensemble forecast, as well as global models and UK models, with a 24-hour time lag, but feel free to go and have a play. 
And you can see the wide variety of data that we produce in just those few models.
Corey: Yeah, just pulling up your website now; looking at where I am here in San Francisco, it gives me a detailed hour-by-hour forecast. There are only two problems I see with it. The first is that it's using Celsius units, which I...
Jake: [laugh].
Corey: ...as a matter of policy, don't believe in because in this country, we don't really use things that make sense in measuring context. And also, I don't believe it's a real weather site because it's not absolutely festooned with advertisements for nonsense, which is apparently (I wasn't aware) a thing that you could have on the internet. I thought that showing weather data automatically meant that you had to attempt to cater to the lowest common denominator at all times.
Jake: That's an interesting point there. So, the Met Office is owned and operated by Her Majesty's Government. We are a Trading Fund with the Department for Business, Energy and Industrial Strategy. But what does it mean that it's a Trading Fund? It means that we're funded by public money. So, that's called the Public Weather Service.
But we also offer a more commercial venture. So, depending on what extensions you've got going on in your browser, there are actually adverts that do run on our website, and we do this to help recover some of the cost. So, the Public Weather Service has to recover some of that. And then lots of things are funded by the Public Weather Service, from observations to public forecasting. But then there are those more commercial ventures, such as the energy markets, that have more paid products, and things like that as well. So, maybe not that many adverts, but definitely more usable.
Corey: Yeah, I disabled the ad blocker, and I'm reloading it and I'm not seeing any here. Maybe I'm just considered to be such a poor ad targeting prospect at this point that people have just given up in despair.
Honestly, people giving up on me in despair is kind of my entire shtick.
Jake: We focus heavily on user-centered design, so I was fortunate in my previous team to work in our digital area, consumer digital, which looked after our web and mobile channels. And I can heartily say that there were a lot of changes that had a lot of heavy research put into them. Not just internal, getting [unintelligible 00:06:09] and having a look at it, but what does this actually mean for members of the public? Sending people out doing guerrilla public testing, standing outside Tesco, which is one of our large superstores here, and saying, "Hey, what do you think of this?" And then you'd get a variety of opinions, and then features would be adjusted, tweaked, and so on.
Corey: So, you folks have been a relatively early adopter, especially in an institutional context. And by institution, I mean one of those things that feels like it is as permanent as the stones in a castle, on some level, something that's lasted more than 20 years; here in California, what a concept. And part of me wonders, were you one of the first UK government offices to use the cloud, and is that because you do weather and someone was very confused by what cloud meant?
Jake: [laugh]. I think we were possibly one of the first; I couldn't say if we were the first. Over in the UK, we've got a very capable network of government agencies doing some wonderful, and very cloud, things. And the Government Digital Service was an initiative set up... uh, I can't remember, and unfortunately I can't remember the name of the report that caused its creation, but they had a big hand in doing design and cloud-first deployments. In the Met Office, we didn't take an "Ah, screw it. Let's jump in" approach; we took a measured step into the cloud waters.
Like I said, we've been running supercomputers since the '50s, and mainframes as well, and x86. I mean, we've been around for 100 years, so we constantly adapt, and engage, and iterate, and improve.
But we don't just jump in and take a risk because, like you said, we are an institution; we have to provide services for the public. It's not something that you can just ignore. These are services that protect life and property, both at home and abroad.
Corey: You have provided a case study historically to AWS, about your use cases of what you use, back in 2014. It was, oh, you're a heavy user of EC2, and looking at the clock, and oh, it's 2014. Surprise. But you've also focused on other services as well. I believe you personally provided a bit of a case study slash story around your use of Pinpoint, of all things, which is a wrapper around SES, their email service, in the hopes of making it a little bit more, I guess, understandable slash fully-featured for contacting people, but in my experience is a great sales device to drive business to its competitors.
What's it been like working, I guess, simultaneously with the tried and true, tested, yadda, yadda, yadda, EC2 and RDS style stuff, but then looking at what else? You're deep into Lambda, and DynamoDB, and SQS sort of stands between both worlds, given it was the first service in beta, but it also is a very modern way of thinking about services. How do you contextualize all of that? Because AWS has a product strategy, clearly: "Yes." They build anything for anyone, is more or less what it seems. How do you think about the ecosystem of services that are available and apply it to problems that you're working on?
Jake: So, in my personal opinion, I think the Met Office is one of a very small handful of companies around the world that could use every Amazon service that's offered, even things like Ground Station. But on my first day in the office, I went and sat at my desk and was talking to my new colleagues, and I looked to the left and he said, "Oh, yeah, that's a satellite dish collecting data from a satellite passing overhead." So, we very much pick the best tool for the job.
So, we have systems which do heavy number crunching and very intense things; we'll go for EC2.
We have systems that store data that needs relationships and all sorts of things. Fine, we'll go RDS. In my space, we have over a billion observations a year coming through the system I lead on, SurfaceNet. So, do we need RDS? No. What about if we use something like S3 and Glue and Athena to run queries against this?
We're very fortunate that we can pick the best tool for the job, and we pride ourselves on getting the most out of our tools and getting the most value for money. Because like I said, we're funded by the taxpayer; the taxpayer wants value for money, and we are taxpayers ourselves. We don't want to see our money being wasted when we've got a hundred-size auto-scaling group, when we could do it with Lambda instead.
Corey: It's fascinating talking about some of the forward-looking stuff, and oh, serverless and throw everything at cloud and be all in on cloud. Cloud, cloud, cloud. Cloud is the future. But earlier this year, there was a press release where the Met Office and Microsoft are going to be joining forces to build the world's, and I quote, "Most powerful weather and climate forecasting supercomputer." The government (your government, to be clear) is investing over a billion pounds in the project.
It is slated to be online and running by the middle of next year, 2022, which for a government project as I contextualize them feels like it's underwear-on-outside-the-pants superhero speed. But that, I guess, is what happens when you start looking at these public-private partnerships in some respects. How do you contextualize that? What is the story behind, oh, you're clearly investing heavily in cloud, but you're also building your own custom enormous supercomputer rather than just waiting for AWS to drop one at re:Invent. What does the decision-making process look like? What is the strategy behind it?
Jake: Oh. [laugh].
So, I'll have to be careful here. Supercomputing is something that we've been doing for a long time, since the '50s, and we've grown with that. When the Met Office moved offices from Bracknell in 2002, 2003, we ran two supercomputers for operational resilience. At that point [unintelligible 00:12:06] building in the new building; it was ready, and they were like, "Okay, let's move a supercomputer." So, it came hurtling down the motorway, plugged in, and congrats, we've now got two supercomputers running again. We're very fortunate...
Corey: We had one. It got lonely. We wanted to make it a friend. Yeah, I get it.
Jake: Yeah. It's long distance; it works. And the Met Office is actually very good at running projects. We've done many supercomputers over the years, and supercomputing our models, we run some very intense models, and we have more demands. We know we can do better.
We know there's the observations in my group we collect, there's the science that's continually improving and iterating and getting better, and our limit isn't poor optimizations or poorly written code. There are scientists running some fantastic code; we have a team who go and optimize these models, and you know, in one release, they may knock down a model runtime by four minutes. And you think, okay, that's four minutes, but for example, if that's four minutes across 400 nodes, all of a sudden you've now got 400 nodes that have each got four minutes more of compute. That could be more research, that could be a different model run. You know, we're very good at running these things, and we're very fortunate to be very technically capable, to understand the difference between a workload that belongs on AWS and a workload that belongs on a supercomputer.
And you know, a supercomputer has many benefits, which the cloud providers... are getting into. You know, we have high-performance clusters on Amazon and Azure with, you know, InfiniBand networking.
But sometimes you really can't beat a hunking great big ton of metal and super water-cooling, sat in a data center somewhere, backed by... we're very fortunate to have one hundred percent renewable energy for the supercomputer, which, if you look at any of the power requirements for a supercomputer, is phenomenal, so we're throwing those credentials behind it for climate change as well. You can't beat a supercomputer sometimes.
Corey: This episode is sponsored by our friends at Oracle. HeatWave is a new high-performance accelerator for the Oracle MySQL Database Service. Although I insist on calling it "my squirrel." While MySQL has long been the world's most popular open source database, shifting from transacting to analytics required way too much overhead and, ya know, work. With HeatWave you can run your OLTP and OLAP (don't ask me to ever say those acronyms again) workloads directly from your MySQL database and eliminate the time-consuming data movement and integration work, while also performing 1100X faster than Amazon Aurora, and 2.5X faster than Amazon Redshift, at a third of the cost. My thanks again to Oracle Cloud for sponsoring this ridiculous nonsense.
Corey: I'm somewhat fortunate in that, despite living in a world of web apps these days, my business partner used to work at the Department of Energy at Oak Ridge National Lab, helping with the care and feeding of the supercomputer clusters that they had out there. And you're absolutely right; that matches my understanding, with the idea that there are certain workloads where you're not going to be able to beat just having this enormous purpose-built cluster sitting there ready to go. Or even if you can, certainly not economically. I have friends who are in the batch side of the world, the HPC side of the world over in the AWS organizations, and they keep saying, "Hey, look at this.
This thing's amazing."
But so much of what they're talking about seems to distill down to, "I have this one-off giant compute task that needs to get done." Yes, you're right. If I need to calculate the weather one time, then okay, I can make an argument for going with cloud, but you're doing this on what appears to be a pretty consistent basis. You're not just assuming, as best I can tell, that, "And starting next Wednesday, it will be sunny forever. The end."
Jake: I'm sure many people would love it if we could do weather on-demand.
Corey: Oh, yes. [unintelligible 00:15:09] going to reserved instance weather. That would be great. Like, "All right. I'd like to schedule some rain, please." It really seems like it's one of those areas that is one of the most commonly accepted in science fiction without any real understanding of just what it would take to do something like that. Even understanding and predicting the weather is something that is beyond an awful lot of our current capabilities.
Jake: This is exactly it. So, the Met Office is world-renowned for its research capabilities and those really in-depth, very powerful models that we run. So, I mentioned earlier something called MOGREPS, which is the Met Office's ensemble-based models. And what do we mean by ensembles? You may see in the documentation it's got 18 members.
What does that mean? It means that we actually run a simulation 18 times, and we tweak the starting parameters based on these real-world inputs. And then you have a number of members that iterate through, and the supercomputer runs all of them. And we have deterministic models, which have one set of inputs. And you know, it's not just, as you say, one time; these models must run.
There are a number of models we do, models on sea state as well, and they've all got to run, so we generally tend to run our supercomputers at top capacity. It's not often you get to go on a supercomputer and there'll be some space for your job to execute right this minute.
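The ensemble idea Jake describes above (run the same simulation 18 times from slightly tweaked starting values) can be sketched in a few lines of Python. This is a toy illustration only, not the Met Office's MOGREPS code: the logistic map stands in for a real forecast model, and the names `toy_model`, `run_ensemble`, `exceedance_probability`, the `spread` value, and the threshold are all assumptions made up for this sketch.

```python
import random
from typing import List

MEMBERS = 18  # ensemble size quoted in the interview


def toy_model(x0: float, steps: int = 50, r: float = 3.9) -> float:
    """Iterate the logistic map, a hypothetical stand-in for a chaotic model."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


def run_ensemble(observed: float, spread: float = 1e-3, seed: int = 0) -> List[float]:
    """Run MEMBERS simulations, each from a slightly perturbed starting value."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = []
    for _ in range(MEMBERS):
        start = observed + rng.uniform(-spread, spread)
        results.append(toy_model(start))
    return results


def exceedance_probability(results: List[float], threshold: float) -> float:
    """Fraction of members whose final value is above the threshold."""
    return sum(1 for value in results if value > threshold) / len(results)


if __name__ == "__main__":
    members = run_ensemble(observed=0.4)
    # A deterministic run, by contrast, uses a single set of inputs.
    deterministic = toy_model(0.4)
    print("deterministic run:", round(deterministic, 3))
    print("ensemble spread:", round(min(members), 3), "to", round(max(members), 3))
    print("fraction above 0.5:", exceedance_probability(members, 0.5))
```

Because the toy model is chaotic, tiny differences in the starting value lead to visibly different end states, which is why the members disagree and why the result is naturally read as a fraction of members rather than a single answer.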
And there's all the setup as well, so it's not just, okay, the supercomputer is ready to go, but there's all the things that go into it, like those observations, whether it's from the surface, whether it's from satellite data passing overhead; we have our own lightning network, as well. We have many things, like a radar network that we own and operate. We collaborate with the Environment Agency for rainfall. And all these things, they feed into these models.
Okay, now we produce a model, and now it's got to go out. So, it's got to come off the supercomputer, it's got to be processed, maybe the grid that we run the models on needs to be reprojected because different people feed maps in different ways. Then it's got to be cut up because not every customer wants to know what the weather is everywhere. They've got a bit they care about. And of course, these models aren't small; you know, they can be terabytes, so there's also a case of customers might not want to download terabytes; that might cost them a lot. They might only be able to process gigabytes an hour.
But then there's other products that we do processing on, so weather models, it might take 40 minutes to over an hour for a model to run. Okay, that's great. You might have missed the first step. Okay, well, we can enrich it with other data that's come in, things like nowcasting, where we do very short runs for the next six-hour forecast. There's a whole number of things that run in the office. And we don't have a choice; they run operationally 24/7, around the clock.
I mentioned to you before we started recording, we had an incident, the "Beast from the East", a number of years back. Some of your listeners may remember this; in the UK, we had a front come in from the east and the UK was blanketed with snow. It was a real severe event. We pretty much kept most of our services running.
We worked really hard to make sure that they continued working.
And personally I say, perhaps when you go shopping for Black Friday, you might go to a retailer and it's got a queue system up because, you know, it mimics that queue thing when you're outside a store, like in Times Square, and it's raining, and you'd be like, oh, I might get a deal in a minute. I think possibly in the Met Office, we have almost the inverse problem. If the weather's benign, we're still there. People rely on us to go, "Yeah, okay. I can go out and have fun." When the weather's bad, we don't have a choice. We have to be there because everybody wants us to be there, but we need to be there. It's not a case of this being an optional service.
Corey: People often forget that, yeah, we are living in a world in which, especially with climate change doing what it's doing, if you get this wrong, people can very easily die. That is not something to take lightly. It's not just about, can I go outside and play a pickup game of basketball today?
Jake: Exactly. So, you know, operationally, we have something called the National Severe Weather Warning Service, where we issue guidance and alerts across the UK, based on severe weather. And there's a number of different weather types that we issue guidance for. And the severity of that goes from yellow to amber to red. And these are manually generated products, so there's the chief meteorologist who's on shift, and he approves these.
And these warnings don't just go out to the members of the public. They go out to the Cabinet Office, they go out to first responders, they go out to a number of people who are interested in the weather and have a responsibility. But the other side is that we don't issue a weather warning willy-nilly. It's a measured, calculated decision by our very capable operations team. And once that weather system has passed and the weather story has changed, we'll review it.
We go back and we say, what could we have done differently?
Could the models have predicted this earlier? Could we have new data which would have picked up on this? Some of our next-generation products that are in beta, would they have spotted this earlier? There's a lot of service review that continually goes on because, like I said, we are the best, and we need to stay the best. People rely on us.
Corey: So, here's a question that probably betrays my own ignorance, and that's okay, that's what I'm here to do. When I was a kid, I distinctly remember... first, this is not the era when the world was black and white; I'm a child of the '80s, let's be clear here, so this is not old-timey nonsense quite as much, but I distinctly remember that it was a running gag how unreliable the weather report always was, and it was a bit hit or miss, like, "Well, the paper says it's going to be sunny today, but we're going to pack an umbrella because we know how this works." It feels, and I could be way off base on this, but it really feels like weather forecasting has gotten significantly more accurate since I was a kid. Is that just nostalgia, and I remember my parents complaining about it, or has there been a qualitative improvement in the accuracy of weather forecasting?
Jake: I wish I could tell you all the scientific improvements that we've made, but there's many groups of scientists in the office who I would more than happily shift that responsibility over to, but quite simply, yes. We have a lot of partners we work with around the world (the National Weather Service, DWD in Germany, Meteo France, just to name but a few; there are many) and we all collaborate with data. We all iterate. You know, the American Meteorological Society holds a conference every year, which we attend. And there have been absolutely leaping changes in forecast quality and accuracy over the years.
And that's why we continually upgrade our supercomputers.
Like I said, yeah, there's research and stuff, but we're pulling in all this science, and meteorology is generally about very chaotic systems. We're still discovering many things around how the climate works and how the weather systems work. And we're going to use them to help improve quality of life, early warnings. Actually, we can say, oh, in three days' time, it's going to be sunny at the beach. It'd be great if you could know that seven days in advance. It would be great if you knew that 14 days in advance.
I mean, we might not do that because at the moment, we might have an idea, but there's also the case of understanding, you know, it's a probability-based decision. And people say, "Oh, it's not going to rain." But actually, it's a case of, well, we said there's a 20% probability it's going to rain. That doesn't mean it's not going to, but it's saying, "Two times out of ten, at this time it's going to rain." But of course, if you go out 14 days, that's a long lead time, and you know, you talk about chaos theory, and the butterfly moves and flaps its wings, and all of a sudden a [cake 00:22:50] changes color from green to pink or something like that, in some other location in the world.
These are real systems that have real impacts, so we have to balance out the science of pure numbers, but what do people do with it? And what can people do with it, as well? So, that's why we talk about having timely data as well. People say, "Well, you could run these simulations and all your products take longer to process them and generate them," but for example, in SurfaceNet, we have five minutes to process an observation once it comes in.
We could spend hours fine-tuning that observation to make it perfect, but it needs to be useful.
Corey: As you take a look throughout all of the things that AWS is doing (and sure, not all of these are going to necessarily apply directly to empowering the accuracy of weather forecasts, let's be clear here) you have expressed personal interest in, for example, IoT, a bunch of the serverless nonsense we're seeing out there. What excites you the most? What has you the most enthusiastic about what the future of the cloud might hold? Because unlike almost everyone else I talk to in this space, you are not selling anything. You don't have a position, that I'm aware of, that, oh, yeah, I super want to see this particular thing win the industry because that means you get to buy a boat.
You work for the Met Office; you know that in some cases, oh, that boat is not going to have a great time in that part of the world anyway. I don't need one. So, you're a little bit more objective than most people I have pushing a corporate story. What excites you? Where do you see the future of this industry going in ways that are neat?
Jake: Different parts of the office will tell you different things, you know. We worked with Google DeepMind on AI and machine learning. We work with many partners on AI and machine learning; we use it internally, as well. On a personal level, I like quality-of-life improvements and things that just make my life as a developer both fun and interesting. So, CDK was a big thing.
I was a CloudFormation wizard (still hate writing YAML) but the CDK came along and it was [unintelligible 00:24:52] people wouldn't say, but that wasn't, like, you know, when Lambda launched back in, what, 2013? 2014? No, but it made our lives easier. It meant that actually, we didn't have to worry about, okay, how do we do templating with YAML?
Do we have to run some pre-processes or something?
It meant that we could invest a little bit of time upfront on CDK and migrating everything over, and then that freed us up to actually do the things that we need for what we call the business or the organization, delivering value, you know? It's great playing with tech but, you know, I need to deliver value. And I think, what was it, in the Google SRE book, they limit the things they do, toil, the manual tasks that don't really contribute anything; they're more like keeping the lights on. Let's get rid of that. Let's focus on delivering value.
It's why Lambda is so great. I could patch an EC2, I can automate it, you know, you've got AWS Systems Manager Patch Manager, or... whatever its name is, that can go and manage all those patches for you. Why, when I can do it in a Lambda and I don't need to worry about it?
Corey: So, one last question that I have for you is that you're a tech lead. It's easy for folks to fall into the trap of assuming, "Oh, you're a government. It's like an enterprise only bigger, slower, and way, way, way busier." How many hundreds of thousands of engineers are working at the Met Office along with you?
Jake: So, you can have a look at our public report and you can see the number of staff we have. I think there's about 1800 staff that work at the Met Office. And that includes our account managers, that includes our scientists, that includes HR and legal. And I'd say there's probably less than 300 people who work in technology, as we call it, which is managing our IT estate, managing our Linux estate, managing our storage area networks because, funnily enough, managing petabytes of data is not an easy thing. You know, managing a supercomputer, a mainframe.
There really aren't that many people here at the office, but we do so much great stuff. So, as a technical lead, I'm not just a leader of services, but I lead a team of people.
I'm responsible for them, for empowering them, and helping them to develop their own careers and their own training. So, it's me and a team of four that look after SurfaceNet. And it's not just SurfaceNet; we've got other systems we look after that SurfaceNet produces data for. Sending messages around the world on the World Meteorological Organization's Global Telecommunication System. What a mouthful. But you know, these messages go all around the world. And some people might say, "Well, I got a huge team for that." Well, [unintelligible 00:27:27]. We have other teams that help us (I say, help us) in their own right; they transmit that data. But we're really... I personally wouldn't say we were huge, but boy, do we pack a punch.
Corey: Can I just say on a personal note, it's so great to talk to someone who's focusing on building out these environments and solving these problems for a higher purpose slash calling than (and I will get letters for this) showing ads to people on the internet. I really want to thank you for taking time out of your day to speak with me. If people want to learn more about what you're up to, how you do it, potentially consider maybe joining you if they are eligible to work at the Met Office, where can they find you?
Jake: Yeah, so you do have to be a resident in the UK, but www.metoffice.gov.uk is our home on the internet. You can find me on Twitter at @jakehendy, and I could absolutely chew Corey's ear off for many more hours about many of the wonderful services that the Met Office provides. But I can tell he's got something more interesting to do. So, uh [crosstalk 00:28:29]...
Corey: Oh, you'd be surprised. It's loads of fun to... no, it's always fun to talk to people who are just in different areas that I don't get to work with very often. It turns out that most of my customers are not focused on telling you what the weather is going to do. And that's fine; it takes all kinds.
It's just neat to have this conversation with a different area of the industry. Thank you so much for being so generous with your time. I appreciate it.
Jake: Thank you very much for inviting me on. I guess if we get some good feedback, I'll have to come on and I will have to chew your ear off after all.
Corey: Don't offer if you're not serious.
Jake: Oh, I am.
Corey: Jake Hendy, Tech Lead at the Met Office. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with a comment yelling at one or both of us for having the temerity to rain on your parade.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.
Hello and welcome to episode number 427 of Reversim Platform. Today's date is November 25, 2021, and today we are recording remotely with Berlin [First we take], with Yair Etzioni, who is in Berlin. Hi Yair! Thanks for being here, great to have you with us.
Yair is a veteran DevOps person, and our topic will be what's called "DevOps Reloaded", or "let's talk about DevOps again and understand what it means, try to go back to basics a bit, and talk about the whole subject" ["DevOps Reloaded" is indeed catchier].
So before we dive in: Yair, who are you? What do you do today?
(Yair) My name is Yair Etzioni, I'm originally from Petah Tikva, and I lived in Tel Aviv for 10 years. I have something like 20 years of experience working in the IT and software sector in Israel. I worked at Amdocs, like many good people, and that's where I started. I worked at startups, at Qlusters, at ECI Telecom, at Voltaire, the InfiniBand company... I specialized mainly in Linux systems, quality assurance, and networking, all those things. At some point, when the cloud started to arrive, because of my background I was moved to cloud work a lot. So AWS, startups again... After that I worked at McAfee: I worked at an Israeli security startup that was acquired by McAfee. I worked there for a while too... security, networking, kernel, things like that. Mainly as a QA engineer. And then I moved to Berlin, after I had basically "retired" from the field; I said I wasn't going to work in it anymore...
(Ran) Yes, that reminds me of "I'm done with drugs"... "I'm done with DevOps"...
(Yair) Well, the thing is, I didn't even know there was such a thing as DevOps. But I was a QA person who did deployments, knew systems, and configured his own environments, and then people started offering me this thing, DevOps... I said, "What is DevOps?", because in Berlin it had suddenly become "hot": What is Puppet? What is Chef? What are these things? I started looking into it... and then I did several roles as a DevOps person...
In all those roles, however much they called me a "DevOps person", I still felt like something of a system administrator ["the king's DevOps"]. And the biggest change, I think, came when I met the place I work at now, which is called Polar Squad. I can say a bit more about them: it's a company from Finland that does only DevOps. And by this company's definition, we are essentially consultants. In Hebrew you could say we do "tikshuv" (ICT) consulting in the cloud. We see DevOps differently: we don't only do "tikshuv in the cloud", we also do something called "organizational consulting", if I may once again be...
(Ran) I think "tikshuv" exists only in the IDF... but I'm sure everyone understands... In the rest of the industry it's probably "communications" or...
(Yair) I love the word "tikshuv"; it's one of my favorite words in Hebrew... It's also nice to finally speak a bit of Hebrew... My brain has to do a lot of recalibration right now...
(Ran) Day to day, by the way, what do you speak: English? German?
(Yair) English. I've already started thinking in English... A few months ago I was in Petah Tikva [happens to the best of us], and I found myself in the supermarket thinking in English when I needed to buy things, and I said, "Something isn't right"... Lots and lots of English right now, and Berlin is very international, so English is the official language of the... Silicon Allee, as it's called, the hard-to-define startup scene here.
And what's interesting, and maybe this also connects us to the rest of the conversation, is that in Finland they take things in a... In many ways they are different from Israelis and in many ways very similar, but they take things very rationally, and they don't know how to do half a job... And in Finland they did very large studies on the fall of Nokia. It's something that really hurt them in a way, because it was something they loved very much; it was a real source of pride there. And when they did the research, they found that what was actually missing was this: the people who were professionals in their fields, the system people, the product people, couldn't get their messages through to the C-levels, and the C-levels were disconnected from what was happening.
[A good time to pause and rewatch the Riot On documentary (2002); just make sure to hold on tight first]
And what developed there is basically a whole scene of...
They call it flat hierarchies. In Berlin, a million companies will tell you they have "flat hierarchies", but they don't. They are never flat at all; it just says "flat hierarchies"... And I work at a company that has no CEO at all... People can define themselves... We even chose Teal as... if you're familiar with it, consulting... Building an organization the Teal way? It basically means building it from the bottom up...
(Ran) Teal, not "til" ("missile") in Hebrew... I'll note, as an aside: you mentioned Nokia and Finland. I have family, and also a friend, who lives in Finland, and he lives in "the city of Nokia", or at least that's what it used to be called. It's called Tampere, where the main factory...
(Yair) I've been to Tampere!
(Ran) Yes, it's a very, very beautiful city, but Nokia barely exists there. I think it still exists, but certainly not what it once was... [still here...]
(Yair) Yes... By the way, tell him you want him to bring you mustamakkara, which is funny: makkara is "sausage" in Finnish... "Ma kara" (Hebrew for "what happened"): they heard me talking with my wife and burst out laughing...
What's going on is that we are part of a very large ecosystem of very "idealistic" companies, and the point is also that our company knows how to do only one thing, and it pushes people to build that by themselves. Meaning, I have no HR. I run the Berlin branch and I have no HR; I do the HR and I also do the processes. So there is a lot of room to grow as a person, and to learn about your own field and about fields you don't know at all. And that is very connected to DevOps too, we'll get to that shortly: you're not just a test engineer, you can be much more than that, so why should we "shrink" you down to that.
We work with a great many customers, and much of the consulting is organizational consulting. Many people say something like, "But look, I automated everything, e-v-e-r-y-t-h-i-n-g is automated: I have pipelines, infrastructure as code, everything runs like clockwork, and I still don't see anything improving. Why?"
(Ran) It really is like that... From here we're really diving into the topic. You, with your experience of talking with quite a few customers and embedding practices, probably hear as one of the first reactions, as you already started to say, and I assume many of our listeners have heard it too: "OK, I tried DevOps, I tried a transformation. Why isn't it working? What's missing? Why does it work for others and not for me?"...
(Yair) Okay . . . I'll give an example, and then I'll build on it. I can give as an example two of our clients - two companies that got into Kubernetes and into DevOps. By the way - Kubernetes doesn't necessarily mean DevOps, but in this case you can say it does. One company . . .
(Ran) Yes, let's just say while we're at it that a project doesn't necessarily mean Big Data . . . . but we'll grant you that counterpoint.
(Yair) The thing is this - one client was, let's call it, a very-innovative or hipster-ish startup. You know - they all brewed their own coffee, and they were a very greenfield company. The Frontend, the Backend, the SRE - they were all developers by definition, people who come from coding. And they worked together - I saw how they work, I mean - how something like this . . . it's cross-functional teams, with a certain responsibility for each person - but they worked together, they . . . They were missing a lot of knowledge in the Kubernetes world - in their pipelines, in how to improve from a 5-minute deployment to a 10-second deployment, or 7 seconds or . . . They didn't really know the technology behind Kubernetes - but they knew how to work really, really nicely together. They - poof! They're a company that flies . . . They do sprints and they knock out the sprints and they work as a team. They enjoy working together - all the guys there, they also had the same . . . I'd say they were suited to working with one another, if you see what I mean. Maybe not the most brilliant programmers in the world - but high-level people.
The second company was a kind of more classic organization - they had sprints, but they didn't necessarily have releases at the end of the sprints. They were very, very disconnected from one another, I mean - there was the Ops group, which kept falling apart, people didn't want to be in it, because when you get 50 error messages a night, you're not a happy person . . . . There were Frontend and Backend and a Full-stack group - nobody talks to anybody . . . We did organizational consulting there too - you really see it, you sit "inside the client" and you see three people running like crazy, sweating - and others watching YouTube . . . I'm not against watching YouTube at work, but when one person is sweating and someone else is just watching YouTube . . . . I said to him, the CTO - "I'm not saying you need to work everyone to the bone, but do you notice that you and two others do everything - and the rest watch you?" . . . 
And of course, when we needed to hand over to them the knowledge about the Helm charts we built for them - about the repos, about Terraform, how this whole thing works - nobody wanted to know . . . What actually happened to them was that they kept the previous structure - nobody benefited from Kubernetes' new APIs, which can upgrade you - and Ops just got more and more and more and more work . . .
(Ran) So what you're saying, if I try to draw the moral from what you're saying - there are tools in the world, for example Kubernetes. If the Ops people once used one tool and today use a different tool - you've achieved nothing by that . . . What the tools enable is dividing the load between the Ops people and the development people - so that each one uses the part of the tool that's relevant to them. For that matter, Kubernetes does a let's-call-it-democratization of the infrastructure - I don't know if that's a word I just made up or not, but in any case it makes it possible to share the load. If part of the company is idle anyway, no Kubernetes will help, because there's a cultural issue here . . . You're saying that whoever comes and says "the DevOps transformation we did didn't work for me" - you're saying that in at least one of the cases, or one of the situations you've seen, the problem is in the organizational culture in most cases, and not necessarily in the technology or the adoption. There may be a problem there too, but that's not what you're describing . . .
(Yair) Exactly - I'd say this: the definition of DevOps, at least for us at Polar Squad, is a twofold definition. We say that it . . . a cultural change has to come with DevOps, and the cultural change is company-wide. It's also a pattern I see all the time - you have a DevOps team, but it's a team where each person knows only something very specific . . . That's already not DevOps, by definition - because they . . . I see a lot of people, and you wouldn't believe how many of them you . . . he knows only a very, very small piece of what he does, he doesn't see the [overall] picture, he knows nothing about the picture. And after that, you have lots of teams in the company - each one sees their own corner, they don't work together.
(Ran) Does a "DevOps team" even exist, in your opinion? Is it right for a company to have a team called "DevOps"?
(Yair) We're walking through a very . . . door here now. I, personally, believe in this, for the reason . . . I've really gone deep into the subject - I studied history and philosophy at Tel Aviv University, my field is the German history of ideas . . . . 
And I was very unhappy with the matter . . . I felt there's no such thing as a "DevOps Engineer", I mean - for me this role . . . I accept that there's a Platform Engineer, I accept that there's a Cloud Expert or Cloud Architect, I even accept the SRE, because the SRE - I understand their job, even if I'm not sure you need an SRE, but fine, "okay", as my mother says . . . You need someone to do reliability in the company? I get that. But I don't really get the "DevOps Engineer" . . . . I understand "DevOps Consultant" - it was a conscious choice to go with DevOps Consultant - I come, I teach you to do this methodology and I'm released, I leave, like . . . I can even accept a DevOps Advocate, or a DevOps Coach - and that's a role we think about a lot, about how to do it in a company. I don't think a DevOps Coach can be an Agile Coach - because Agile Coaches often don't know how software works . . . . I don't think you can consult inside an organization, or help an organization go through a transformation, if you don't understand how AWS and Linux and CI/CD pipelines work. Because you can't talk in the air - you need to show . . . . Say, there's a project I can tell you about, briefly - it was done at a very, very large bank in Finland. They simply built a kind of framework of pipelines and all the deployments and how to set up the environments. And then for a year they went team by team - taught the people, held their hands. Think about it - it's a bank, these are old-school programmers by definition - they held their hands, looked after them, "do this - this is Docker, do . . ." So that's very important . . .
(Ran) And it worked?
(Yair) Yes - the bank went through insane automation . . . Look, I have to put everything in parentheses again - in Finland, when I was at the Finnish broadcasting authority, they work in Scrum and it's a bit not what we think . . . It's not the broadcasting authority in Israel, it's an insane site that's, like, entirely on Infrastructure-as-Code and everything there is fully automatic. I was there, I saw what they do - it's a bit . . . very hipster-ish, I don't know if it's applicable to Germany and Israel, but still . . . . [Wait, the broadcasting authority's site as a hipster place - let that sink in . . .] But still - this bank went through . . . a happy bank, they did it. Surely they have lots of problems, I'm sure it's not . . . 
It also needs to be defined - for me, DevOps is a utopia and it's something we work on all the time, an endless loop of measurements . . . .
(Ran) Yes, so it's basically to come - if I translate what you're saying - to come and say "there's a DevOps person" or "there's a DevOps team" is perhaps equivalent to saying "there's an innovation person!" or "there's an innovation team!" - so what, does that mean all the rest aren't innovative? Does it mean all the rest don't do it? . . . . So to come and say "there's a DevOps person" is to come and say that all the rest don't do it - and that's exactly the antithesis of what DevOps says: DevOps says it belongs to everyone, not just to one person.
(Yair) Exactly - it also belongs to the salesman and it also belongs to the . . . . I'll tell you this - if DevOps stays inside a very small group of three people, then we've done nothing . . . If DevOps stays a group of seven people - we've done nothing . . . . I can't tell you if I know . . . now they call it "BizOps" and "DesignOps" and "GitOps" and all kinds of . . . "PeopleOps" . . . I think all these things come from people who didn't quite understand . . . .
(Ran) Yes, so there's the cultural side - and now you know, it's really . . . I think everyone knows it exists, but until you've really experienced it, you don't really understand what it means - and sometimes, I have to say, I make the mistakes too, and only when I look at it from the side do I realize I made mistakes there. So it's a matter that takes a very long time to understand - and in this context, people like you, who've seen a great many companies and have that experience, can come and give the right perspective. But there's also the technological matter, which we touched on a bit - and it's important to say that DevOps is a combination of both, and I think that's been said thousands of times, so we're not being original here - but let's talk for a moment about the technological side, and maybe do a short survey of which interesting things, on the technological side, have happened recently that give us the means to take DevOps one step forward.
(Yair) Okay, so I think the most important thing we've seen recently is the entry of APIs into the infrastructure world. Basically, what we're seeing is that development tools are entering the infrastructure world. I'll give you an example - when I was a SysAdmin, I had a few batch scripts, and I don't think Git existed then - and even if it did, I wouldn't have dreamed of putting that in Git 
. . . I had a sort of directory of scripts - Install - Install Apache . . . . Now it's a different world - you can't do it that way anymore, because the systems are so complex - you want everyone to share the information and for it to be as declarative as possible. So really, think about it - a tool like Kubernetes, a tool like Terraform, a tool like CDK - they basically use the capability that the cloud ICT giants and Google gave us. Basically, the developer and the operator begin to consolidate - they both do a lot of merge requests and pull requests, and Git becomes the source of truth. That's hopefully, it doesn't always happen . . . But if you think about it, you're suddenly freed - the AWS I started working on was a classic datacenter . . . meaning - you provision machines; after that they started adding services - S3 and all those building blocks. Now - it's a monster of services . . . . What I'm trying to say is that there's this thing people say, "No vendor lock-in" - but if you're a young startup, relatively poor, you don't have much money, then true - it'll cost you money, I agree, but when I think about the past and I think about the present - you can, relatively cheaply, if you think it through well, build yourself really good systems - and others do the Lift & Shift for you. In my opinion, if the ethos, when I was young, was "let's build everything ourselves, let's do everything ourselves" - now, whoever does that is committing suicide . . . You'll never finish . . .
(Ran) I agree about the complexity - I have to say that every day, when I go into the AWS dashboard, I discover new services there that I don't understand, I don't even know how to read their names, not to mention what they do . . . I use a very small part of them. Now, we talked about the democratization of the infrastructure - I'll keep saying it until it catches on - one of the challenges I personally have seen when people come and introduce DevOps practices is that developers sometimes find it hard to digest - and the dilemma is . . . because now they don't need to know only the programming language - they don't only need to know Java and all its libraries or Python or whatever - they also need to understand infrastructure, something that previously someone else did for them, so now they need to understand it too . . . And the question arises - on the one hand it's good, but on the other hand the question also arises - what's the right level of abstraction? 
That is - which abstraction should be exposed to developers so that they're productive? So that we can really . . . so that they're on board with us in this whole DevOps story - and that pretty much ties into this whole story of the Developer Platform, which I know you want to mention . . . . So let's talk about that for a moment - in your experience, what level of abstraction can work so that developers are fully on board and productive?
(Yair) Look, it very, very much depends . . . I think it's hard for me to give one answer. I think it also improves over time, and it also very much depends on who the developers are - there are developers who are dying to know these things, and there are developers who'll never touch it . . .
(Ran) So if you arrive at a company now, say - or maybe you can recall one of the recent cases where you came to a company, and I imagine that at some point this question came up too: do we want to create a platform for developers, and if so - what do we want to expose to them? Expose bare-bones Kubernetes? Expose some interface on top? Expose three interfaces on top? I mean - how? What do we expose to the developers here? [Reference - 368 Kubernetes and Dyploma at outbrain]
(Yair) Look, I'd say a classic case . . . Often, you can recommend that people use . . . or you build the platform for them . . . The best thing for developers is to work with an API - Kubernetes has an API, and it's relatively convenient to build things against it. If, say . . . tools like Terraform and so on, if they like them less, and in any case it's better that your Terraform be inside the CI/CD pipelines, better that you "do Terraform with the keyboard" as little as possible . . . In general - the less keyboard, the better. I think if they're into it, you can also open things up a bit, give them a bit of kubectl, a bit . . . but API - that's the thing. And give it to them slowly - because there's also a context change here - the person has been writing Java, or some language, for many years - and they're comfortable. They understand something is changing, and they don't want you to scare them . . . That's the level of abstraction. Or you can use tools like Humanitec, for example, which basically give you another layer, give you a nice UI on top of Kubernetes - and connect all the dots for you . . . 
And then you basically have something very convenient to use, which I think, after a very light explanation, any developer would be happy to work with. And again, it comes back to this thing I very, very much believe in - don't build tools yourself, use ready-made things. You save a lot of time and money.
(Ran) Yes . . . by the way, I didn't know Humanitec, so thanks for the reference . . . I'm looking at the site now and it says "Enable developer self-service" - so what is "self-service"? Does it mean letting developers allocate resources for themselves, when they need them, without a meeting and without forms, so to speak? Creating an API through which they can provision their workloads?
(Yair) Exactly . . .
(Ran) . . . when on the face of it, that's also something Kubernetes provides, but it may be that they do it in a more "humane" way, a more convenient way . . .
(Yair) What they do is basically provide another layer of abstraction - and basically they help you, you don't need to do the glue, they did all the glue for you . . . I don't know if you know or live Kubernetes, but Kubernetes [is something that] you need to know how to operate. If you just throw Kubernetes into the cloud somewhere [idea for a new Olympic sport?] and think things will be happy - well, no, you'll be very miserable. They simply make a lot of things easier for you - they did a ton of work, they added a ton of APIs, they added a ton of interfaces. They're a very, very strong team - lots of people who come from very good places . . .
(Ran) . . . by the way - easier on the side of operating the cluster itself, or on the side of interfacing with it and using it?
(Yair) More on the side of interfacing with it and using it, but they can also sometimes provide you with the cluster, if you want. And then you're on their cluster . . . all sorts of things like that, definitely.
(Ran) So we happened to talk specifically about Kubernetes, but obviously that's just an example - there are other tools in the world too, and I wondered whether you have any insights here about what the technology stack will look like X years from now? . . . I don't know, pick X . . . say 5 years? 10 years? Will there be some consolidation toward some particular stack, or will we keep seeing this kind of branching - and I know there are obviously business and economic questions here too, it's not only a technological question, that's totally clear . . . 
But, I mean, from the things you see today - do you see the first buds of new developments around cloud platforms?
(Yair) I think the cloud platforms - their dream is . . . they work like a drug dealer - they want you to come in for free, when you're weak and small it looks cheap to you, you buy as many services as possible, and after a while "Oh, no! I'm addicted to Lambda!" or "I'm addicted to ALB" . . . you can't get out of it. So they'll improve and upgrade their services. If, say, Azure and AWS have gone heavily into Kubernetes, they'll make "a Humanitec of their own", somehow. They'll ride this wave. I think people's desire is simply to work faster - and people's desire to work fast runs completely counter to the level of complexity we deal with, because microservices are nice, but they're hard to operate - you need lots and lots of context, lots and lots of things. And the context changes a lot, you . . . . there's some tool you think is cool, and suddenly it disappears completely, and you don't know what the next tool will be. But I think it will go toward more and more abstractions - more and more abstractions. People, even Ops people - very few people have started "getting under the metal", and more and more people will rise above it . . . I'll give you an example, at the work level: me and the other guy, who's relatively an "antique" on my team - we always have the classic question related to TCP and HTTP - you have no idea how many experienced people don't know, can't explain this thing to me . . . And the younger guy on the team always tells me - "But you're ancient, you . . . ." But how can you solve? Still, your ALB . . . sorry, how can you solve a fault if you don't understand and you don't know what a three-way handshake is? I can't, I'm sorry - it annoys me . . .
(Ran) I'm, like, tempted to come and say "go on, ask me the interview question on air", and we'll see if I manage to embarrass myself, but I'll spare myself that . . . . You know, on the one hand, I've thought about this a few times: look, I know how TCP and the three-way handshake work, great - but there are lots of other things I don't know, okay? 
I don't know how the CPU works and I also don't know how the GPU works and I don't know how the GPU's memory works - and there are lots of other things I don't know. At some point, you know - it's a kind of survival need: if you know everything, you won't be able to tell what's relevant to you from what isn't, beyond the fact that it's not practical to know everything. So I'm saying that in some sense, it annoys you that they don't know TCP and the three-way handshake - and on the other hand, they're "freeing up space in their RAM" for other things, which may be more relevant . . . So it may be that, from a survival perspective, they perhaps made the right choice, even though they didn't make it consciously - but they made the right choice of "let's not learn this, because it's a solved problem - and I'll invest the time in learning Helm or whatever", other things that have room in memory . . . .
(Yair) First of all, I hired one like that, so . . . I sound rigid, but I'm really not rigid. (2) I think - and it's also said in the context of the question - I tell him "the question isn't… I don't want you to tell me . . ." - because there was someone who wasn't that much of a networking expert, who gave it to me from the level of ARP, MAC, and he went right down into the packets - and I said "fine, that doesn't interest me either . . ." But what I do want, in the context of an Infrastructure Engineer, just give me the . . . . I don't expect you to be the world champion of networking right now, but I want you at least to know that there are layers, and that's really not a lot to ask, we're not talking about some pinpointing here, right . . . we're talking about . . . you know - there are layers and you can't solve the . . . it means - for me it means, and sorry that I disagree . . . but again - I hired someone even when he didn't know this, because he knew lots of other things . . . . So it's not 100%, right? But . . .
(Ran) Yes - he showed an ability to go deep, you're saying . . . and by the way, we're obviously drifting here into the topic of "how to interview for DevOps" . . . but that's an interesting topic too, maybe we should record something about that sometime... You're saying, though, that he went deep into something, okay? He proved that he knows how to go deep, specifically . . .
(Yair) I'll tell you the truth - really, truly - I'm looking for the state of mind. Technology can be learned. These questions are just more to know . . . 
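To make the "there are layers" point concrete, here is a minimal Python sketch, not something from the conversation. It assumes a throwaway local HTTP server on a port chosen by the operating system; `HelloHandler` and `fetch_status_line` are hypothetical names used only for illustration. The transport layer comes first (the connect call returns once the kernel has established the TCP connection, which is where the three-way handshake happens), and only then does the application layer send HTTP bytes over that connection.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a tiny 200 response."""

    def do_GET(self):
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_status_line() -> str:
    # Port 0 asks the OS for any free port (an assumption for this demo).
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        # Transport layer: create_connection() returns once the TCP
        # connection is established (SYN, SYN-ACK, ACK done by the kernel).
        with socket.create_connection((host, port), timeout=5) as conn:
            # Application layer: plain HTTP text sent over that connection.
            conn.sendall(b"GET / HTTP/1.0\r\nHost: localhost\r\n\r\n")
            chunks = []
            while True:
                data = conn.recv(4096)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
        # The first line of the response is the HTTP status line.
        return b"".join(chunks).split(b"\r\n", 1)[0].decode("ascii")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_status_line())
```

If the connect step fails (refused, timed out), the fault sits below HTTP and no amount of inspecting headers will explain it, which is the kind of reasoning the interview question is probing for.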
Listen, otherwise I'd take people with the state of mind "off the street" and teach them - and I can't. These questions are a kind of "flare" that I throw in the air to see how they react - but what mainly matters to me is how he thinks. Does he come with curiosity? Does he come with an ability to abstract from the things he deals with? Or does he bombard, or is he a robot . . .
(Ran) Let's go back to our topic for a moment - and we're already getting toward the end, so we'll pick one more topic. I wanted to maybe talk a bit about Cloud Native - obviously it's a term you hear quite a lot . . . What is it? Who is it good for? When do I need it? You know - everyone's talking about it, maybe I should know what it is too . . . .
(Yair) Okay, first of all - Cloud Native is something that every guy or gal currently working in development should know. It's basically also . . . it's also a kind of non-profit organization that's basically run by all the giants - that's the CNCF - the Cloud Native Computing Foundation. I'm sorry, but sometimes I forget words in Hebrew . . . And it basically also brings a certain approach to how you're supposed to develop software - in the cloud. Now - I know, and I also say it: the "ICT cloud" isn't such an amazing, new invention; I think anyone who worked even with a mainframe knows that it was basically a kind of ICT cloud - lots of supercomputers connected by a network. But yes - we're now in a situation where the world is changing. I mean, even giant companies are starting - and this is in Germany, a country that is, let's say, very, very slow in its ability to embrace and accept technologies - it's now starting to leave this on-premise world, this world of "I need my servers with me because they're secure" . . . and starting to think of the cloud as a "cluster of services". And this cluster of services can get you working very-very-very fast. If you add to that the concepts of Agile and DevOps, you can basically create insanely elastic environments for yourself. You can basically use lots of tools. I'll just add one more thing - these are very, very vibrant communities - all the software giants pay lots and lots of money . . . For example - Helm is completely controlled by Microsoft - all the Microsoft people working on Helm get their salaries from Microsoft . . .
(Ran) Yes, I saw that on GitHub, I think whoever created it works there and that's how it rolled along, but we can say a few words about that . . . [Interesting - Matt Butcher, and it looks like just this month he moved on . . . .] 
I just wanted to comment, to be a bit more concrete: you said "cluster of services", so let's look at a concrete example for a moment. Take storage - if in the past storage was the ability to mount some physical disk inside your computer, then today storage, in many cases, is something that sits far away - S3 is a classic example. Now - you don't know how many computers are behind it, you don't know where it's stored, you have no idea . . . But you have an API - and you know it's elastic: when you need it, you'll have it - and you pay only for what you use. It's an example, by the way - the service, specifically S3, existed long before the term Cloud Native was invented - and as in many cases, as with design patterns, first you look at what's happening and only afterwards give it a name . . . So effectively you're saying - Cloud Native is really a name given to a great many behaviors found in the field, and what's common to all those behaviors is that they use various cloud services . . . And by the way - we say "cloud", but it doesn't have to be cloud, it's also . . . I know implementations of Cloud Native, let's call it that - which aren't in the cloud at all, which are on-premise . . . .
(Yair) Right . . .
(Ran) . . . because they use Cloud Native concepts - so maybe the word "Cloud" is a bit confusing . . .
(Yair) . . . there are native tools and all the . . . all those things, definitely. Again - don't forget that underneath all these things, it's marketing tools, okay? . . . So obviously the cloud companies want you to think that they - they own the cloud, because you pay them money . . . There's a reason Kubernetes was released by Google - but Borg wasn't released by Google . . . Because Kubernetes is the open-source version of Borg. You also see the disruption Kubernetes creates and how it caught AWS and how AWS ran after it - and you understand why. There are interests here - there are enormous sums here, right? Because AWS is Amazon's engine, and Microsoft has put all its eggs into a mad dash on Azure. And Google works a bit differently - I never manage to understand the philosophy of what they're trying to do, but they have the best Kubernetes implementation, so you always have to remember - even though I talk in long sentences with lots of commas [+1] - at the end of the day they want to sell you something . . . 
You can do all these things at your own place, on-prem; you can run whatever implementation you want, it's not only from them - and you can get the same services - at your place. The only difference I'd add is that there, someone does the SRE for you, the Lift & Shift - he takes care . . . someone makes sure your S3 is always there - and if it's not there, he'll give you your money back. And that's a point that's very, very important to clarify - because basically this whole matter of paying someone else takes some of the load off you. And you can choose what you want to deal with. That is - I don't "want to see" the infrastructure at all, I don't want to hear about VMs. I want X places that I work with - as we said, say four or five services - and free me from everything else, I don't want to see it - and you can get to that place now, or come very-very-very close to it.
(Ran) So if we try to summarize for a moment the take-away from this Cloud Native section, then (1) it's a collection of concepts worth knowing; (2) you need to remember there's marketing behind it, so not everything there is "carved in stone" [candidate for understatement of the year?]. But there are quite a few best practices there that are worth knowing, and you should adopt what's relevant to you. And the term itself - "Cloud" - can perhaps be a bit confusing, because really, I think almost all the best practices that exist there can also exist outside the cloud. I know there are a great many tools that are excellent tools, with no connection to the cloud - like Grafana and others - that are part of Cloud Native, and there's no dependency between them and the ability to run on a VM in the cloud. But in any case - there are quite a few good resources there, and all the giants are effectively leading it - because nobody wants to be left out, because it's a very good marketing platform . . .
(Yair) Totally . . . .
(Ran) Alright, we're coming, like, to the end - is there anything else you'd like to add?
(Yair) I think that . . . the thing I'd want to tell people is that if you're setting out on this journey, of DevOps and Cloud Native, and you want to work with these tools - think hard about why . . . What will these tools give me? Because tools-for-tools'-sake is idle . . . Always think - and this maybe brings us, at the end, back to the beginning, to culture and to DevOps - think about how these tools will improve what we do together. And what we do is we want the business to work . . . 
How will it make the business better? What's the added value I get from this - for every step I take: Do I have the people for it? Do I have the right architecture for it? For example, put a monolith on Kubernetes - just like that, you don't gain much from it, you're "buying suffering", as they say . . .
(Ran) . . . you need the technological readiness too - but also the cultural readiness, so that the others in the company will want to be part of it, and you're not just throwing at them a set of technologies that they'll decide to ignore the day after . . .
(Yair) And I'd also say, see if it fits . . . Many times I was part of teams - I have to be honest about this - where we chose tools because they seemed cool to us. We chose tools because we knew them. We chose tools because that's what we decided at that moment, because there was a meeting and someone had to shout something . . . [this in the background] A bit . . . that's what's nice about it, and what I see now - how so many people repeat the same patterns of mistakes. And all I want to say is "I was there too!" - and now I'm outside, I don't make the mistakes, I only see the mistakes - let's stop for a moment, let's think . . . let's do something better this time.
(Ran) Yes . . . well - thank you, Yair, thank you very much! It was fun and it was interesting - and good luck and continued success at Polar Squad. We'll keep in touch - goodbye! Happy listening, and many thanks to Ofer Forer for the transcription!
This week's Network Break examines new 400G switches from Arista, discusses the Wi-Fi Alliance's certification program for the HaLow long-range low-power standard, targets key Nvidia announcements, catches up on the latest in space networking, and more IT news.
This week's Network Break examines new 400G switches from Arista, discusses the Wi-Fi Alliance's certification program for the HaLow long-range low-power standard, targets key Nvidia announcements, catches up on the latest in space networking, and more IT news. The post Network Break 359: Arista Increases Its 400G Switch Portfolio; Nvidia Accelerates InfiniBand appeared first on Packet Pushers.
Gregory Kurtzer
Gregory is the Founder and Chief Executive Officer at Ctrl IQ, Inc and the Founder of CentOS and Rocky Linux.
https://www.linkedin.com/in/gmkurtzer/
https://github.com/gmkurtzer
https://gmkurtzer.github.io
https://ctrliq.com/
Notes:
MPI Hello world - https://mpitutorial.com/tutorials/mpi-hello-world/
HPL Linpack - https://www.netlib.org/benchmark/hpl/
OpenHPC Linux Foundation Project - https://linuxfoundation.org/press-release/high-performance-computing-leaders-unite-to-develop-open-source-framework/
Warewulf - https://github.com/hpcng/warewulf
Credits: Music by ikson: https://www.iksonmusic.com
Special Guest: Gregory Kurtzer.
Google is backing a new project from the Linux Foundation to the tune of $1 million that aims to bolster the security of critical open-source projects. GitLab is going public with their IPO, plus we address your feedback and calls!
-- During The Show --
02:45 Caller Chris
450 TB backup is 90% full
Connect Multiple FreeNAS Servers
Different shares on different servers
45 Drives (https://www.45drives.com/)
Twinax (https://en.wikipedia.org/wiki/Twinaxial_cabling)
InfiniBand (https://en.wikipedia.org/wiki/InfiniBand)
09:05 Bhikhu Responds to ANS Ep 252
Conduit (https://conduit.rs/)
Mirotalk Heroku (https://mirotalk.herokuapp.com/)
Cabal Chat (https://cabal.chat/)
Email Solutions
Mail-in-a-Box (https://mailinabox.email/)
Iredmail (https://www.iredmail.org/)
Cuttle Fish (https://cuttlefish.io/)
Mailu.io (https://mailu.io/1.8/)
James Apache (https://james.apache.org/)
Accessible Coconut can't host on GitHub due to file size restrictions
Matrix is going towards P2P
Shut off federation solves many issues
Don't host your own email
16:00 In Person LinuxFests? - Jeremy
OLF Conference (https://olfconference.org/registration/)
Youth Hacking Event (https://fsfe.org/activities/yh4f/index.html)
Nugget Registration (https://registration.socio.events/e/nugget2021)
Noah's Favorite Thinkpad Models
X1 Carbon
T480
T490
24:30 Automation of Warehouse - James
Run a risk of getting devices with un-flashable chips
Cloudfree Shop (https://cloudfree.shop/)
Athom Store (https://athom.aliexpress.com/store/group/Tasmota/5790427_517820063.html?spm=a2g0o.detail.100008.9.cf265affja5i7K)
My Local Bytes (https://www.mylocalbytes.com/)
26:26 Caller
Sabrent Audio Card (http://www.amazon.com/dp/B00IRVQ0F8/?tag=minddripmedia-20)
28:44 Wireless ISP & Double NAT - Frank
Double NAT
Cradle Point (https://cradlepoint.com/)
Carrier Grade NAT
VPN out
36:46 Password Manager Add-On? - Jonathan
Doesn't pass the sniff test
If you trust Bitwarden, trust them to make a good extension
Think about what threat vector you are protecting against
41:20 Noah's Request
Drum Charts/National Number System
44:00 Pick of the Week
Linux Distro for Windows users
MS Software/services included
Tech Republic Article (https://www.techrepublic.com/article/windowsfx-is-the-linux-distribution-windows-users-have-been-looking-for/)
45:55 Gadget of the Week
Open Book - Open Source Ebook reader
GitHub (https://github.com/joeycastillo/The-Open-Book)
47:43 GitLab Goes Public
Does this bother people on GitLab?
GitLab Press Release (https://about.gitlab.com/press/releases/2021-10-04-gitlab-announces-launch-of-initial-public-offering.html)
51:00 Google Offers 1M to Harden Security in FOSS
Public Money should be spent on Public Code
Zdnet Article (https://www.zdnet.com/article/open-source-google-is-going-to-pay-developers-to-make-projects-more-secure/)
-- The Extra Credit Section --
For links to the articles and material referenced in this week's episode check out this week's page from our podcast dashboard!
This Episode's Podcast Dashboard (http://podcast.asknoahshow.com/253)
Phone Systems for Ask Noah provided by Voxtelesys (http://www.voxtelesys.com/asknoah)
Join us in our dedicated chatroom #GeekLab:linuxdelta.com on Matrix (https://element.linuxdelta.com/#/room/#geeklab:linuxdelta.com)
-- Stay In Touch --
Find all the resources for this show on the Ask Noah Dashboard
Ask Noah Dashboard (http://www.asknoahshow.com)
Need more help than a radio show can offer? Altispeed provides commercial IT services and they're excited to offer you a great deal for listening to the Ask Noah Show. Call today and ask about the discount for listeners of the Ask Noah Show!
Altispeed Technologies (http://www.altispeed.com/)
Contact Noah
live [at] asknoahshow.com
-- Twitter --
Noah - Kernellinux (https://twitter.com/kernellinux)
Ask Noah Show (https://twitter.com/asknoahshow)
Altispeed Technologies (https://twitter.com/altispeed)
Special Guest: Steve Ovens.
Episode 010 | September 28, 2021

Artificial intelligence, Machine Learning, Deep Learning, and Deep Neural Networks are today critical to the success of many industries. But they are also extremely compute intensive and expensive to run in terms of both time and cost, and resource constraints can even slow down the pace of innovation. Join us as we speak to Muthian Sivathanu, Partner Research Manager at Microsoft Research India, about the work he and his colleagues are doing to enable optimal utilization of existing infrastructure to significantly reduce the cost of AI.

Muthian's interests lie broadly in the space of large-scale distributed systems, storage, and systems for deep learning, blockchains, and information retrieval. Prior to joining Microsoft Research, he worked at Google for about 10 years, with a large part of the work focused on building key infrastructure powering Google web search, in particular the query engine for web search. Muthian obtained his Ph.D. from the University of Wisconsin-Madison in 2005 in the area of file and storage systems, and a B.E. from CEG, Anna University, in 2000.

For more information about Microsoft Research India, click here.

Transcript

Muthian Sivathanu: Continued innovation in systems and efficiency and costs are going to be crucial to drive the next generation of AI advances, right. And the last 10 years have been huge for deep learning and AI, and the primary reason for that has been the significant advance in both hardware, in terms of the emergence of GPUs and so on, as well as software infrastructure to actually parallelize jobs, run large distributed jobs efficiently and so on. And if you think about the theory of deep learning, people knew about backpropagation, about neural networks, 25 years ago. And we largely use very similar techniques today. 
But why have they really taken off in the last 10 years? The main catalyst has been sort of advancement in systems. And if you look at the trajectory of current deep learning models, the rate at which they are growing larger and larger, systems innovation will continue to be the bottleneck in sort of determining the next generation of advancement in AI.
[Music]
Sridhar Vedantham: Welcome to the Microsoft Research India podcast, where we explore cutting-edge research that's impacting technology and society. I'm your host, Sridhar Vedantham.
[Music]
Sridhar Vedantham: Artificial intelligence, Machine Learning, Deep Learning, and Deep Neural Networks are today critical to the success of many industries. But they are also extremely compute intensive and expensive to run in terms of both time and cost, and resource constraints can even slow down the pace of innovation. Join us as we speak to Muthian Sivathanu, Partner Research Manager at Microsoft Research India, about the work he and his colleagues are doing to enable optimal utilization of existing infrastructure to significantly reduce the cost of AI.
[Music]
Sridhar Vedantham: So Muthian, welcome to the podcast and thanks for making the time for this.
Muthian Sivathanu: Thanks Sridhar, pleasure to be here.
Sridhar Vedantham: And what I'm really looking forward to, given that we seem to be in some kind of final stages of the pandemic, is to actually be able to meet you face to face again after a long time. Unfortunately, we've had to again do a remote podcast which isn't all that much fun.
Muthian Sivathanu: Right, right. Yeah, I'm looking forward to the time when we can actually do this again in office.
Sridhar Vedantham: Yeah. Ok, so let me jump right into this. You know we keep hearing about things like AI and deep learning and deep neural networks and so on and so forth.
What's very interesting in all of this is that we kind of tend to hear about the end product of all this, which is kind of, you know, what actually impacts businesses, what impacts consumers, what impacts the health care industry, for example, right, in terms of AI. It's a little bit of a mystery, I think, to a lot of people as to how all this works, because... what goes on behind the scenes to actually make AI work is generally not talked about.
Muthian Sivathanu: Yeah.
Sridhar Vedantham: So, before we get into the meat of the podcast, do you just want to speak a little bit about what goes on in the background?
Muthian Sivathanu: Sure. So, machine learning, Sridhar, as you know, and deep learning in particular, is essentially about learning patterns from data, right, and a deep learning system is fed a lot of training examples, examples of input and output, and then it automatically learns a model that fits that data, right. And this is typically called the training phase. So, the training phase is where it takes data and builds a model to fit it. Now what is interesting is, once this model is built, which was really meant to fit the training data, the model is really good at answering queries on data that it had never seen before, and this is where it becomes useful. These models are built in various domains. It could be for recognizing an image, for converting speech to text, and so on, right.
And what has in particular happened over the last 10 or so years is that there has been significant advancement both on the theory side of machine learning, which is new algorithms, new model structures that do a better job at fitting the input data to a generalizable model, as well as rapid innovation in systems infrastructure which actually enables the model to sort of do its work, which is very compute intensive, in a way that's actually scalable, that's actually feasible economically, cost effective and so on.
Sridhar Vedantham: OK, Muthian, so it sounds like there's a lot of compute actually required to make things like AI and ML happen. Can you give me a sense of what kind of resources or how intensive the resource requirement is?
Muthian Sivathanu: Yeah. So the resource usage in a machine learning model is a direct function of how many parameters it has, so the more complex the data set, the larger the model gets, and correspondingly it requires more compute resources, right. To give you an idea, the early machine learning models which perform simple tasks like recognizing digits and so on, they could run on a single server machine in a few hours, but models now, just over the last two years, for example, the size of the largest model that's useful, that achieves state-of-the-art accuracy, has grown by nearly three orders of magnitude, right. And what that means is today to train these models you need thousands and thousands of servers and that's infeasible. Also, accelerators or GPUs have really taken over in the last 6-7 years. A single V100 GPU today, a Volta GPU from NVIDIA, can run about 140 trillion operations per second. And you need several hundreds of them to actually train a model like this.
And they run for months together. To train a 175-billion-parameter model like the recent GPT-3, you need on the order of thousands of such GPUs and it still takes a month.
Sridhar Vedantham: A month, that sounds like a humongous amount of time.
Muthian Sivathanu: Exactly, right? So that's why I think, just as I told you how the advance in the theory of machine learning in terms of new algorithms, new model structures, and so on have been crucial to the recent advance in the relevance and practical utility of deep learning, equally important has been this advancement in systems, right, because given this huge explosion of compute demands that these workloads place, we need fundamental innovation in systems to actually keep pace, to actually make sure that you can train them in reasonable time, you can actually do that with reasonable cost.
Sridhar Vedantham: Right. Ok, so you know for a long time, I was generally under the impression that if you wanted to run bigger and bigger models and bigger jobs, essentially you had to throw more hardware at it because at one point hardware was cheap. But I guess that kind of applies only to the CPU kind of scenario, whereas the GPU scenario tends to become really expensive, right?
Muthian Sivathanu: Yep, yeah.
Sridhar Vedantham: Ok, so in which case, when there is basically some kind of a limit being imposed because of the cost of GPUs, how does one actually go about tackling this problem of scale?
Muthian Sivathanu: Yeah, so the high-level problem ends up being, you have limited resources, so let's say you can view this in two perspectives, right. One is from the perspective of a machine learning developer or a machine learning researcher, who wants to build a model to accomplish a particular task, right. So, from the perspective of the user, there are two things you need.
A, you want to iterate really fast, right, because deep learning, incidentally, is this special category of machine learning, where the exploration is largely by trial and error. So, if you want to know which model actually works, which parameters, or which hyperparameter set actually gives you the best accuracy, the only way to really know for sure is to train the model to completion, measure accuracy, and then you would know which model is better, right. So, as you can see, the iteration time, the time to train a model and run inference on it, directly impacts the rate of progress you can achieve. The second aspect that the machine learning researcher cares about is cost. You want to do it without spending a lot of dollar cost.
Sridhar Vedantham: Right.
Muthian Sivathanu: Now from the perspective of, let's say, a cloud provider who runs this huge farm of GPUs and then offers this as a service for researchers, for users to run machine learning models, their objective function is cost, right. So, to support a given workload you need to support it with as few GPUs as possible. Or in other words, if you have a certain amount of GPU capacity, you want to maximize the utilization, the throughput you can get out of those GPUs, and that's what a lot of the work we've been doing at MSR has focused on. How do you sort of multiplex lots and lots of jobs onto a finite set of GPUs, while maximizing the throughput that you can get from them?
Sridhar Vedantham: Right, so I know you and your team have been working on this problem for a while now. Do you want to share with us some of the key insights and some of the results that you've achieved so far, because it is interesting, right? Schedulers have been around for a while.
It's not that there aren't schedulers, but essentially what you're saying is that the schedulers that exist do not really cut it, given the intensity of the compute requirements as well as the size of the jobs and models that are being run today in terms of deep learning or even machine learning models, right?
Muthian Sivathanu: That's right.
Sridhar Vedantham: So, what are your key insights and what are some of the results that you guys have achieved?
Muthian Sivathanu: So, you raise a good point. I mean, schedulers for distributed systems have been around for decades, right. But what makes deep learning somewhat special is this: traditional schedulers have to view a job as a black box, because they're meant to run arbitrary jobs, so there is a limit to how efficient they can be. Whereas in deep learning, first of all because deep learning is such a high-impact area, and I mean from an economic perspective, there are billions of dollars spent in these GPUs and so on. So, there is enough economic incentive to extract the last bit of performance out of these expensive GPUs, right. And that lends itself into this realm of: what if we co-design? What if we custom design a scheduler for the specific case of deep learning, right. And that's what we did in the Gandiva project which we published at OSDI in 2018. What we said was, instead of viewing a deep learning job as just another distributed job which is opaque to us, let's actually exploit some key characteristics that are unique to deep learning jobs, right? And one of those characteristics is that although, as I said, a single deep learning training job can run for days or even months, right, deep within, it is actually composed of millions and millions of these what are called mini batches. So, what is a mini batch?
A mini batch is an iteration in the training where it reads one set of input training examples, runs it through the model, and then back propagates the loss, and essentially changes the parameters to fit that input. And this sequence, this mini batch, repeats over and over again across millions and millions of mini batches. And what makes it particularly interesting and relevant from a systems optimization viewpoint is that from a resource usage perspective and from a performance perspective, mini batches are identical. They may be operating on different data in each mini batch, but the computation they do is pretty much identical. And what that means is we can look at the job for a few mini batches and we can know what exactly it is going to do for the rest of its lifetime, right. And that allows us to, for example, do things like, we can automatically decide which hardware generation is the best fit for this job, because you can just measure it in a whole bunch of hardware configurations. Or when you're distributing the job, you can compare it across a whole bunch of parallelism configurations, and you can automatically figure out, this is the right configuration, the right hardware assignment for this particular job, which you couldn't do for an arbitrary job with a distributed scheduler because the job could be doing different things at different times. Like a MapReduce job for example, it would keep fluctuating in how it uses CPU, network, storage, and so on, right. Whereas with deep learning there is this remarkable repeatability and predictability, right. What it also allows us to do is, we can then look at what happens within a mini batch, and it turns out, one of the things that happens is, if you look at the memory usage, how much GPU memory the training loop itself is consuming, somewhere at the middle of a mini batch, the memory peaks to almost fill the entire GPU memory, right.
And then by the time the mini batch ends, the memory usage drops down by like a factor of anywhere between 10 to 50x. Right, and so there is this sawtooth pattern in the memory usage, and so one of the things we did in Gandiva was propose this mechanism of transparently migrating a job, so you should be able to checkpoint a job on demand. The scheduler should be able to do it and just move it to a different machine, essentially a different GPU, a different machine, and so on, right. And this is very powerful for load balancing. Lots of scheduling things become easy if you do this. Now, when you're doing that, when you are actually moving a job from one machine to another, it helps if the amount of state you need to move is small, right. And so that's where this awareness of mini batch boundaries and so on helps us, because now you can choose when exactly to move it so that you move a 50x smaller amount of state.
Sridhar Vedantham: Right. Very interesting, and another part of this whole thing about resources and compute and all that is, I think, the demands on storage itself, right?
Muthian Sivathanu: Yeah.
Sridhar Vedantham: Because if the models are that big, that you need some really high-powered GPUs to compute, how do you manage the storage requirements?
Muthian Sivathanu: Right, right. So, it turns out the biggest requirement from storage that deep learning poses is on the throughput that you need from storage, right. So, as I mentioned, because GPUs are the most expensive resource in this whole infrastructure stack, the single most important objective is to keep GPUs busy all the time, right. You don't want them idling at all. What that means is the input training data that the model needs in order to run its mini batches has to be fed to it at a rate that is sufficient to keep the GPUs busy. And GPUs process, I mean the amount of data that the GPU can process from a compute perspective has been growing at a very rapid pace, right.
And so, what that means is, you know, between the Volta series and the Ampere series of GPUs, for example, there is like a 3X improvement in compute speed, right. Now that means the storage bandwidth should keep up with that pace, otherwise a faster GPU doesn't help. It will be stalling on IO. So, in that context one of the systems we built was the system called Quiver, where we say, a traditional remote storage system, like the standard model for running this training is: the datasets are large, I mean the data sets can be in terabytes, so you place it on some remote cloud storage system, like Azure Blob or something like that, and you read it remotely from whichever machine does the training, right. And that bandwidth simply doesn't cut it because it goes through network backbone switches and so on, and it becomes insanely expensive to sustain that level of bandwidth from a traditional cloud storage system, right. So what we need to achieve here is hyper locality. So, ideally the data should reside on the exact machine that runs the training, then it's a local read, and it has to reside on SSD and so on, right. So, you need several gigabytes per second of read bandwidth.
Sridhar Vedantham: And this is to reduce network latency?
Muthian Sivathanu: Yes, this is to reduce network latency and congestion, like when it goes through lots of back-end switches, like T1 switches, T2 switches etc. The end-to-end throughput that you get across the network is not as much as what you can get locally, right?
Sridhar Vedantham: Right.
Muthian Sivathanu: So, ideally you want to keep the data local in the same machine, but as I said, for some of these models, the data set can be in tens of terabytes. So, what we really need is a distributed cache, so to speak, right, but a cache that is locality aware.
So, what we have is a mechanism by which, within each locality domain, like a rack for example, we have a copy of the entire training data. So, a rack could comprise maybe 20 or 30 machines, so across them you can still fit the training data, and then you do peer to peer across machines in the rack for the access to the cache. And within a rack, network bandwidth is not a limitation. You can get nearly the same performance as you could from local SSD, so that's what we did in Quiver, and there are a bunch of challenges here, because if every model wants the entire training data to be local, to be within the rack, then there is just no cache space for keeping all of that.
Sridhar Vedantham: Right.
Muthian Sivathanu: Right. So we have this mechanism by which we can transparently share the cache across multiple jobs, or even multiple users, without compromising security, right. And we do that by sort of intelligent content addressing of the cache entries, so that even though two users may be accessing different copies of the same data, internally in the cache they will refer to the same instance.
Sridhar Vedantham: Right, I was actually just going to ask you that question about how do you maintain security of data, given that you're talking about distributed caching, right? Because it's very possible that multiuser jobs will be running simultaneously, but that's good, you answered it yourself. So, you know I've heard you speak a lot about things like micro co-design and so on. How do you bring those principles to bear in these kinds of projects here?
Muthian Sivathanu: Right, right. So, I alluded to this a little bit in one of my earlier points, which is the interface, I mean, if you look at a traditional scheduler which views the job as a black box, right.
That is an example of the traditional philosophy of system design, where you build each layer independent of the layer above or below it, right. There are good reasons to do it because, you know, multiple use cases can use the same underlying infrastructure, like if you look at an operating system, it's built to run any process, whether it is Office or a browser or whatever, right.
Sridhar Vedantham: Right.
Muthian Sivathanu: But, in workloads like deep learning, which place particularly high demands on compute and that are super expensive and so on, there is benefit to sort of relaxing this tight layering to some extent, right. So that's the philosophy we take in Gandiva, for example, where we say the scheduler no longer needs to think of it as a black box, it can make use of internal knowledge. It can know what mini batch boundaries are. It can know that mini batch times are repeatable and stuff like that, right. So, co-design is a philosophy that has been gaining traction over the last several years, and people typically refer to hardware-software co-design, for example. What we do in micro co-design is sort of take a more pragmatic view to co-design where we say, look, it's not always possible to rebuild entire software layers from scratch to make them more tightly coupled, but the reality is in existing large systems we have these software stacks, infrastructure stacks, and what can we do without rocking the ship, without essentially throwing away everything and building everything from a clean slate. So, what we do is very surgical, carefully thought through interface changes that allow us to expose more information from one layer to another, and then we also introduce some control points which allow one layer to control another. For example, the scheduler can have a control point to ask a job to suspend.
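A minimal sketch of the kind of control point described here, assuming a toy training loop. The class and method names (`TrainingJob`, `request_suspend`, `start_at`) are hypothetical and are not Gandiva's actual interface. The job only honors a suspend request at a mini batch boundary, where the state it has to hand over is small.

```python
import threading


class TrainingJob:
    """Toy training job that exposes a suspend control point (hypothetical API)."""

    def __init__(self, total_minibatches, start_at=0):
        self.total = total_minibatches
        self.done = start_at
        # A scheduler (possibly on another thread) sets this flag to ask for a suspend.
        self._suspend_requested = threading.Event()

    def request_suspend(self):
        """Control point: ask the job to stop at its next mini batch boundary."""
        self._suspend_requested.set()

    def _train_one_minibatch(self):
        # Placeholder for the forward pass, backward pass, and parameter update.
        self.done += 1

    def run(self):
        """Train until finished, or return a small checkpoint if suspended."""
        while self.done < self.total:
            self._train_one_minibatch()
            # The flag is checked only between mini batches, never in the middle of one.
            if self._suspend_requested.is_set():
                self._suspend_requested.clear()
                return {"next_minibatch": self.done}
        return None
```

A scheduler could call `request_suspend()`, take the returned checkpoint, and later construct a new `TrainingJob` with `start_at` set from it, possibly on a different machine. A real checkpoint would also carry model and optimizer state, which this sketch leaves out.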
And it turns out by opening up those carefully thought through interface points, you leave the bulk of the infrastructure unchanged, but yet achieve these efficiencies that result from richer information and richer control, right. So, micro co-design is something we have been adopting, not only in Gandiva and Quiver, but in several other projects in MSR. And MICRO stands for Minimally Invasive Cheap and Retrofittable Co-design. So, it's a more pragmatic view to co-design in the context of large cloud infrastructures.
Sridhar Vedantham: Right, where you can do the co-design with the minimum disruption to the existing systems.
Muthian Sivathanu: That's right.
Sridhar Vedantham: Excellent.
[Music]
Sridhar Vedantham: We have spoken a lot about the work that you've been doing and it's quite impressive. Do you have some numbers in terms of, you know, how jobs will run faster or savings of any nature, do you have any numbers that you can share with us?
Muthian Sivathanu: Yeah, sure. So the numbers, as always, depend on the workload and several aspects. But I can give you some examples. So, in the Gandiva work that we did, we introduced this ability to time slice jobs, right. So, the idea is, today when you launch a job in a GPU machine, that job essentially holds on to that machine until it completes, and until that time it has exclusive possession of that GPU, no other job can use it, right. And this is not ideal in several scenarios. You know, one classic example is hyperparameter tuning, where you have a model and you need to decide what exact hyperparameter values, like learning rate, etc., actually are the best fit and give the best accuracy for this model. So, people typically do what is called a hyperparameter search, where you run maybe 100 instances of the model, see how it's doing, maybe kill some instances, spawn off new instances, and so on, right. And hyperparameter exploration really benefits from parallelism.
You want to run all these instances at the same time so that you have an apples-to-apples comparison of how they are doing. And if you want to run like 100 configurations and you have only 10 GPUs, that significantly slows down hyperparameter exploration, it serializes it, right. What Gandiva has is an ability to perform fine-grained time slicing of the same GPU across multiple jobs. Just like how an operating system time slices multiple processes, multiple programs on the same CPU, we do the same in the GPU context, right. And because we make use of mini batch boundaries and so on, we can do this very efficiently. And with that we showed that for typical hyperparameter tuning, we can sort of speed up the end-to-end time to accuracy by nearly 5-6x, right. And so this is one example of how time slicing can help. We also saw that from a cluster-wide utilization perspective, some of the techniques that Gandiva adopted can improve overall cluster utilization by 20-30%. Right, and this directly translates to cost incurred to the cloud provider running those GPUs because it means with the same GPU capacity, I can serve 30% more workload or vice versa, right, for a given workload I only need 30% fewer GPUs.
Sridhar Vedantham: Yeah, I mean those savings sound huge and I think you're also therefore talking about reducing the cost of AI, making the process of AI itself more efficient.
Muthian Sivathanu: That's correct, that's correct.
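The time slicing described above can be sketched as a round-robin loop that switches jobs only at mini batch boundaries. This is an illustrative toy under stated assumptions, not Gandiva's scheduler: each job is modeled as a Python generator that yields once per mini batch, and the names `training_job`, `time_slice`, and `quantum` are made up for the example.

```python
from collections import deque


def training_job(name, num_minibatches):
    """Toy job: yields once per mini batch, which marks a safe switch point."""
    for step in range(1, num_minibatches + 1):
        # A real job would run its forward and backward pass here.
        yield (name, step)


def time_slice(jobs, quantum):
    """Share one GPU across jobs, giving each up to `quantum` mini batches per turn."""
    queue = deque(jobs)
    timeline = []  # order in which mini batches ran on the shared GPU
    while queue:
        job = queue.popleft()
        finished = False
        for _ in range(quantum):
            try:
                timeline.append(next(job))
            except StopIteration:
                finished = True
                break
        if not finished:
            queue.append(job)  # suspend at the boundary and go to the back of the line
    return timeline
```

For example, `time_slice([training_job("A", 3), training_job("B", 2)], quantum=2)` interleaves the two jobs so both make progress on the same device, which is what lets many hyperparameter configurations be compared side by side on a few GPUs.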
So, the more we are able to extract performance out of the same infrastructure, the cost per model or the cost per user goes down and so the cost of AI reduces, and for large companies like Microsoft or Google, which have first party products that require deep learning, like Search and Office and so on, it reduces the capital expenditure of running such clusters to support those workloads.
Sridhar Vedantham: Right.
Muthian Sivathanu: And we've also been thinking about areas such as, today there is this limitation that large models need to run in really tightly coupled hyperclusters which are connected via InfiniBand and so on. And that brings up another dimension of cost escalation to the equation, because these are sparse, the networking itself is expensive, there is fragmentation across hyperclusters and so on. What we showed in some recent work is how you can actually run training of large models in just commodity VMs, these are just commodity GPU VMs, but without any requirement on them being part of the same InfiniBand cluster or hypercluster, they can just be scattered anywhere in the data center, and more interestingly, we can actually run these off of spot VMs. So Azure, AWS, all cloud providers provide these bursty VMs or low priority VMs, which is a way essentially for them to sell spare capacity, right. So, you get them at a significant discount. Maybe a 5-10x cheaper price. And the disadvantage, I mean the downside of that is they can go away at any time. They can be preempted when real demand shows up. So, what we showed is it's possible to train such massive models at the same performance, despite these being on spot VMs and spread over a commodity network without custom InfiniBand and so on.
So that's another example of how you can bring down the cost of AI by reducing constraints on what hardware you need.
Sridhar Vedantham: Muthian, we're kind of reaching the end of the podcast, and is there anything that you want to leave the listeners with, based on your insights and learning from the work that you've been doing?
Muthian Sivathanu: Yeah, so taking a step back, right? I think continued innovation in systems and efficiency and costs are going to be crucial to drive the next generation of AI advances, right. And the last 10 years have been huge for deep learning and AI, and the primary reason for that has been the significant advance in both hardware, in terms of emergence of GPUs and so on, as well as software infrastructure to actually parallelize jobs, run large distributed jobs efficiently and so on. And if you think about the theory of deep learning, people knew about backpropagation, about neural networks, 25 years ago. And we largely use very similar techniques today. But why have they really taken off in the last 10 years? The main catalyst has been sort of advancement in systems. And if you look at the trajectory of current deep learning models, the rate at which they are growing larger and larger, systems innovation will continue to be the bottleneck in sort of determining the next generation of advancement in AI.
Sridhar Vedantham: Ok Muthian, I know that we're kind of running out of time now but thank you so much. This has been a fascinating conversation.
Muthian Sivathanu: Thanks Sridhar, it was a pleasure.
Sridhar Vedantham: Thank you.
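The content-addressed cache sharing mentioned for Quiver in the conversation can be illustrated with a small sketch. It assumes the simplest possible scheme: entries are keyed by a SHA-256 digest of their bytes, and each user reaches data only through their own name mapping. The class `ContentAddressedCache` and its methods are hypothetical; the transcript does not describe Quiver's real design at this level of detail.

```python
import hashlib


class ContentAddressedCache:
    """Toy shared cache: identical bytes are stored once, whoever uploads them."""

    def __init__(self):
        self._blobs = {}  # content digest -> bytes (one instance per distinct content)
        self._names = {}  # (user, path) -> content digest (per-user view of the cache)

    def put(self, user, path, data):
        """Register `data` under this user's path and return its content digest."""
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # reuse the existing instance if present
        self._names[(user, path)] = digest
        return digest

    def get(self, user, path):
        """Read through the user's own mapping; unknown paths raise KeyError."""
        return self._blobs[self._names[(user, path)]]

    def stored_bytes(self):
        """Total bytes actually held, after deduplication."""
        return sum(len(blob) for blob in self._blobs.values())
```

If two users each register their own copy of the same dataset shard, both `put` calls return the same digest and `stored_bytes()` counts the shard once, while a user who never registered a path cannot read it by name.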
Enterprises are working to simplify the process of deploying and managing systems to support AI applications. That's what NVIDIA's DGX architecture is designed to do, and what we'll talk about on this episode. Frederic Van Haren and Stephen Foskett are joined by Tony Paikeday, Senior Director, AI Systems at NVIDIA, to discuss the tools needed to operationalize AI at scale. Although many NVIDIA DGX systems have been purchased by data scientists or directly by lines of business, it is also a solution that CIOs have embraced. The system includes NVIDIA GPUs of course but also CPU, storage, and connectivity, and all of this is held together with software that makes it easy to use as a unified solution. AI is a unique enterprise workload in that it requires high storage IOPS and low storage and network latency. Another issue is balancing these needs to scale performance in a linear manner as more GPUs are used, and this is why NVIDIA relies on NVLink and NVSwitch as well as DPU and InfiniBand to connect the largest systems.
Three Questions: How big can ML models get? Will today's hundred-billion parameter model look small tomorrow or have we reached the limit? Will we ever see a Hollywood-style “artificial mind” like Mr. Data or other characters? Can you give an example where an AI algorithm went terribly wrong and gave a result that clearly wasn't correct? *Question asked by Mike O'Malley of SenecaGlobal.
Guests and Hosts: Tony Paikeday, Senior Director, AI Systems at NVIDIA. Connect with Tony on LinkedIn or on Twitter at @TonyPaikeday. Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren. Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.
Date: 9/21/2021 Tags: @TonyPaikeday, @nvidia, @SFoskett, @FredericVHaren
In this episode I talk with Liran Zvibel, Co-Founder and CEO at Weka. Weka has built the world's fastest parallel file system, designed to solve the performance challenges in AI deep learning and technical computing. WekaFS delivers high performance, low latency storage. Weka provides a seamless hybrid cloud solution between on-premises data centers and the public cloud. It has customers in several high performance markets including autonomous vehicle AI training, genomics, financial modeling, EDA, software development, satellite imaging, and media and entertainment. Weka was founded in 2014 and is based in San Francisco. Technology and Technology Partners Mentioned: NVMe, Dell, HPE, NVIDIA, Supermicro, Cisco, Mellanox, InfiniBand, Linux Web: https://wkea.io WekaFS Tech Deep Dive: https://www.youtube.com/watch?v=Cn7VGbiO9Us Interested in being on #GTwGT? https://launch.gtwgt.com Music: https://www.bensound.com
For eight years we steered clear of a technology as curious as InfiniBand, probably because 70% of network engineers will never get their hands on it. Still, it is worth looking at what can be done when a technology is designed in the 90s, after Ethernet and its companions TCP and IP had already hit every pitfall and accumulated their workarounds. Our guest on linkmeup is Mellanox, the only manufacturer of InfiniBand equipment in the world. Trying to learn a whole new world in a single episode is a doomed undertaking, but we tried anyway, which is why the episode is unusually long. Even so, many topics were left uncovered and many questions unanswered, so we may record another episode. Who: Alexander Petrovsky, Systems Engineer at NVIDIA (Mellanox); Boris Neiman, Lead Engineer in the networking department at NVIDIA (Mellanox). Topics: InfiniBand. How many InfiniBand manufacturers are there in the world? A historical excursion: the problems and the technology. Why InfiniBand, if you can buy 400 Gb/s Ethernet and build RoCE on it? QoS and congestion control. IP over IB, IB over Ethernet, Ethernet over IB, and other positions. Entire separate clusters wired with IB for artificial intelligence systems. Does IB have a future? Episode slides. Download the podcast file. Listen on YouTube. Add the RSS feed to your podcast player. The podcast is available on iTunes. You can download all episodes of the podcast from Yandex Disk. Podcast URL: https://dts.podtrac.com/redirect.mp3/https://fs.linkmeup.ru/podcasts/telecom/linkmeup-V090(2020-08).mp3
In this episode I talk with Sheng Yeo, CEO and Co-Founder of OrionVM. OrionVM are a wholesale Cloud Service Provider who designed their cloud platform for speed and distributed resiliency, leveraging the InfiniBand networking standard as the core of their architecture. They were doing Hyper Converged Infrastructure (HCI) before HCI was a thing! Sheng gives us an insight into their founding and we go through what makes them unique as an IaaS provider doing great things differently, creating their own niche in the market while continuing to learn and grow along the way. Technology and Technology Partners Mentioned: Linux, Xen, InfiniBand, Cumulus Networks Web: https://www.orionvm.com Architecture: https://www.orionvm.com/features Interested in being on #GTwGT? Register interest here: https://launch.gtwgt.com Music: https://www.bensound.com
The show starts with Dan, Jessi and Shahin in attendance. Henry is traveling from his old home base in Minnesota to his new command bunker in lovely Las Cruces, NM. Last we heard he was in Kansas City and making good time. We’re not sure how long we’re going to have to do without him, as Comcast seems to be slow-playing him on his internet installation timeline. Why Freeze the Whole Room If You Just Want a Frozen Atom? Our big topic today is the quantum computing company ColdQuanta. It’s headed by an old pal of ours, Bo Ewald, and has just come out of stealth mode into the glaring spotlight of RadioFreeHPC. They have a unique approach to quantum computing, trapping atoms themselves to create Bose-Einstein Condensate. This is a fifth state of matter, which matters quite a bit. When you cool a low-density gas of bosons to near absolute zero, you start to get macroscopic access to microscopic quantum mechanical effects, which is a pretty big deal. Once the quantum mechanics starts, you can control it, change it, and get computations out of it. The secret sauce for ColdQuanta is served cold, all the way down into the microkelvins, and kept very local, which makes it easier to get your condensate. The company is focused on measurement and sensing but also mentions straight computation, the latter like most of the other quantum competitors. They were the first company to put their quantum computer in space and the first to create Bose-Einstein Condensate while in orbit at the International Space Station. Catch of the Week Jessi: Want to chill out and help NASA at the same time? Jessi has found a way with NeMO-Net, a game where users cruise through an animated ocean floor and classify coral structures. Your answers are then fed into NASA’s Pleiades supercomputer, which uses the data as fodder to improve its own identification prowess. 
It’s a great way to while away the hours during these Covid-19 shutdowns, right? Shahin: has two catches. The first is a celebration of IBM’s quant-iversary, marking the fourth anniversary of them having a quantum computer on the web – many happy returns to Big Blue. They’re also sponsoring a contest, see the web link for details. In his second catch, Shahin shamelessly promotes his recent talk at the HPC AI Advisory Council virtual Stanford conference. He did a great job of covering just about every buzzword topic in the industry in only 30 minutes, well done. Dan: Dano likes fast things and seeing fast things get even faster. This is what attracted him to the story about ISV Risk Fuel and Microsoft’s Azure posting an article boasting a 20-million-times speedup of derivative processing. A 20-million-times speedup of anything is pretty significant, and they achieve this with a combination of 8 NVIDIA V100 GPUs (with 32 GB of memory each), InfiniBand and Risk Fuel’s amazing software. What’s great about this is that with this speed the model has complete fidelity with traditional calculations. In other words, you can speed all you like without any downside when it comes to accuracy – amazing stuff. Join us! * Download the MP3 * Sign up for the insideHPC Newsletter * Follow us on Twitter * Subscribe on Spotify * Subscribe on Google Play * Subscribe on iTunes * RSS Feed * eMail us
Here are the show notes for Episode 25 “Flit for Purpose”. The show is called this because it relates to our Topic, and can also be related to our Mainframe topic (as a pun on “Fit for Purpose”). Mainframe Topic: Highest highlights of z/OS V2.4 and z/OS on z15 Highlight 1: zCX Highlight 2: z/OSMF Lots of z/OSMF enhancements have arrived in z/OS V2.4, and the good news is that most of them are rolled back to V2.3 in PTFs that have been arriving quarterly. Security Configuration Assistant: A way within z/OSMF to validate your security configuration with graphic views, at the user and user group level. Designed to work with all three External Security Managers! Available back to V2.3 with APAR PH15504, with additional group id enhancements in APAR PH17871. Diagnostic Assistant for z/OSMF: A much simpler way to gather the information a Service person needs to debug your z/OSMF problem. Highlight 3: SRB on z15: System Recovery Boost: Speeds up your shutdown for up to 30 minutes and speeds your re-IPL for 60 minutes, with no increase to your rolling four hour average. Performance Topic: z15 from chip design on upwards Disclaimer: personal view, not from Development or Marketing. Marna and Martin were talking about the z15 chip design, and we thought those observations might be useful to include in the Performance topic. Two traditional levers were raising clock speed or shrinking the feature size. GHz and nm aren't the be all and end all. Look at chip design. Start with a similar-sized CP chip and put more on it. It helped to get rid of the InfiniBand-related circuits, and to make some layout enhancements. At the top end there are up to 190 characterisable cores, up from 170. This can give us a fifth drawer, which is quite important. Topic: How To Do A Moonlight Flit This topic is about moving one's social output, in particular blogs and podcast series. 
Martin's blog had to move because the IBM developerWorks blog site is being shut down, and this podcast had to move as well. People might immediately worry about Requests For Enhancements being affected, but they are not. Martin and Marna discuss important aspects to consider when moving your social media. You must consider all the pieces when you do the move. You must also try to redirect your audience. Contacting us You can reach Marna on Twitter as mwalle and by email. You can reach Martin on Twitter as martinpacker and by email and blogs at blog.
Shiny Crystal Ball It’s our first episode of 2020, yay! The first that was recorded in 2020 anyway. It's a predictable 20/20 joke (more of a meh comment really) but the topic today is... PREDICTIONS. More specifically, it's our predictions of what’s going to happen in the next year. We may not always be correct, but we think maybe we’re always certain. We look at compute, interconnects, security, and general innovations: Compute Dan says that we’re going to have more of it. Henry predicts that we’ll see a RISC-V based supercomputer on the TOP500 list by the end of 2020 – gutsy call on that. This is a double down on a bet that Dan and Henry have, so he’s reinforcing his position. Dan also sees 2020 as the “Year of the FPGA”, when we start to see more and more HPC boxes fueled by FPGAs, which is something Shahin mostly agrees with while Henry disputes it. We also touch on liquid cooling and process size as part of this topic. Interconnects Dan thinks that a 400 Gb/s InfiniBand interconnect will be announced by the end of this year – a bold prediction. On a communications note, Henry says that 20% of the US user base will have access to 5G phone coverage by the end of the year. Shahin asserts that only 3% of the market will actually buy it, but Dan and Henry say not so fast – it’ll be closer to 10%. Shahin is looking for a 5G connection for servers. Not as an interconnect, but more as a WAN or a cluster that spans an entire county. On another note, Shahin believes that HPE will formally get into the interconnect business, selling the Slingshot interconnect. Security Trends Dan says we need more of it but doesn’t see anything that’s going to move the needle back towards the users. Jessi thinks that security education has improved things security-wise and that will continue in 2020. Henry and Dan disagree. Jessi is adamant. Innovation/Trends Dan pegs in-memory computing as a field that will blossom over the coming year(s). 
Shahin agrees that in-memory is very interesting and ripe for innovation as well. But he also sees a lot of developments in the AI processor space. Henry talks about a new application workflow that will go something like this: Object > MemMap > Compute on the MemMap file/data > back to Object, with no POSIX in the way. Shahin also sees more quantum supremacy in the news in the coming year. Letter(s) to the Editor! We discuss our first letter to the editor, from a listener who wasn’t a fan of the episode where we answered Jessi’s question about why tape is still used. His term for that feature? “Poor.” This prompted Shahin to quip, “I’m surprised we don’t get more of these…..” Please keep those comments (good, indifferent, or critical) coming, our email is podcast@radiofreehpc.com. Why Nobody Should Ever be Online. Ever. This week, Henry doesn’t have a “Reason Why No One Should Ever Be Online. Ever.” He was offline all week, so he doesn’t have anything to scare us with. Catch of the Week Henry: has no catch, his net came up empty. Shahin: was practicing Catch & Release this week, so his creel is fishless. Jessi: discusses her new phone. She lost her old one in a Czech toilet (nasty, yikes). This is her first phone upgrade since junior high school – probably 6-7 years – and she’s agog at how the phones have advanced. She can now take pictures and use apps. Yay Jessi! Dan: Encourages listeners to have a good year and to let us know what you think via email (podcast@radiofreehpc.com) and twitter (@radiofreehpc). He also highlights the new RadioFreeHPC logo along the way. Listen in to hear the full conversation * Download the MP3 * Sign up for the insideHPC Newsletter * Follow us on Twitter * Subscribe on Spotify * Subscribe on Google Play * Subscribe on iTunes * RSS Feed * eMail us
SC19 Postview Our show today is all about what we saw at the “State Fair for Nerds” that is SC19. While there weren’t any livestock shows or supercomputers carved out of butter, there was a lot to see and hear. Shahin talks about the European Processor Initiative and conversations that he had with folks from the Barcelona Supercomputing Center, the quantum computing briefing by D-Wave, and a chat with ColdQuanta. We reiterate the bet between Henry and Dan, where Henry bets Dan that there will be a RISC-V based system on the TOP500 list by SC20. The stakes? The winner gets the dinner of his choice paid for by the loser. Jessi went to the keynote by Dr. Squires and notes that someone asked him “where did you use HPC systems in your project?” This prompts Jessi to ask us if it’s kosher to have keynotes which don’t necessarily hit directly on HPC. We discuss how there have been non-HPC centric keynote speakers at several SC events in the past….see Al Gore, Alan Alda, Bill Gates, Michael Dell, etc. Dan brings up the news from NVIDIA about how they’ve gathered a consortium of big-time industry players who will be working on adapting ARM processors for accelerated computing. We speculate on whether Fujitsu will be contributing their very sporty new ARM chip to the group, with thoughts of licensing it for use by other vendors. In other NVIDIA news, Azure now has eight GPU instances connected by InfiniBand interconnects. Why Nobody Should Ever Be Online. Ever This week Henry has a reason why no one should ever go to a local doctor again. Ever. He cites an article about how a small doctor and dental office service provider suffered a ransomware attack, which meant that the doctors they were managing archiving for could no longer get access to their records. If this can happen to service providers, it can run these smaller providers out of business, as HIPAA regulations and fines are onerous. 
Dan comments that he only goes to vets for medical services (just like Kramer on Seinfeld). Things You Think You Know, But Might Not In this installment, Jessi asks the panel about interconnects: why we need them, what they do, and what the choices are. Dan jumps in with a discussion of Ethernet and InfiniBand, while Henry jokingly brings up Token Ring. More helpfully, Henry discusses proprietary interconnects and things like RDMA and RoCE. Shahin believes he has the definitive answer, which is the start of his Computing 301 Lecture Series. This leads into a slight tangent where we discuss SMP vs. MPP and how coherency at scale is incredibly expensive. Catch of the Week Shahin: Talks about a young man who bought his own IBM z890 mainframe for $350 and installed it in his parents’ basement. Amazing feat. He has it running and now has the only mainframe in his neighborhood. Congrats to Connor. IBM needs to hire this young man and harness his passion. Jessi: Getting or giving an Alexa or Google Home device? Better think twice and then think again. Big time security flaws in both devices. Henry: Has no catch of the week. He’s looking at -14°F and has to go out and shovel snow. Dan: Biggest tech flops of 2019. Includes WeWork, Samsung Fold, hacks of VPNs, Facebook Libra and other ignominious failures. Did you say logo? Most of you probably didn't know that RadioFreeHPC even has a logo, Dan thinks, as he gives an update on a project to update the logo, in several colors. Getting closer to the sure-to-be-coveted RadioFreeHPC merch we all wanted! Listen in to hear the full conversation * Download the MP3 * Sign up for the insideHPC Newsletter * Follow us on Twitter * Subscribe on Spotify * Subscribe on Google Play * Subscribe on iTunes * RSS Feed * eMail us
The benefits of NVMe over SATA SSDs are pretty compelling. Not only do NVMe drives offer more bandwidth, but the updated interface and protocols improve latency and better handle heavy workloads. Now, as NVMe approaches price parity with SATA SSDs, organizations are making the switch, but what about the storage fabric? Are we just shifting the bottleneck to the network? This week we bring in Dr J Metz to discuss storage networking technologies, protocols and standards. J shares his thoughts on what's near the end of the road and what is gaining momentum. Dr J Metz is a Data Center Technologist in the Office of the CTO at Cisco, focusing on Storage. Topics discussed: Ethernet, Fibre Channel, PCI Express, Omni-Path (a new Intel high-performance communications architecture), and InfiniBand; NVMe's impact on datacenter design; technology transitions (the game of whack-a-mole); is tape going away? Links mentioned in this episode: Dr J Metz's Blog; Episode 46: NVM Express; NVM Express and VMware; SNIA. The Virtually Speaking Podcast The Virtually Speaking Podcast is a weekly technical podcast dedicated to discussing VMware topics related to storage and availability. Each week Pete Flecha and John Nicholson bring in various subject matter experts from VMware and within the industry to discuss their respective areas of expertise. If you’re new to the Virtually Speaking Podcast check out all episodes on vSpeakingPodcast.com.
Scott Misage, General Manager for the Intel® Omni-Path Business Unit, joins us for an update on Intel® Omni-Path Architecture (Intel® OPA). Intel OPA, part of Intel® Scalable System Framework, is a high-performance fabric enabling the responsiveness, throughput, and scalability required by today's and tomorrow's most-demanding high performance computing (HPC) workloads. In this interview, Misage talks about market uptake in Intel OPA's first year of availability, reports on some of the first HPC deployments using the Intel Xeon Scalable platform and Intel OPA, and gives a sneak peek of what Intel OPA will be talking about at SC17. For more information on Intel Omni-Path Architecture, please visit http://intel.com/omnipath. 1:11: More systems in the Top 500 than InfiniBand based on June 2017 Top500 List https://www.top500.org/list/2017/06/. 1:18: Higher percentage of systems in Top100 compared to Infiniband EDR based on June 2017 Top500 List https://www.top500.org/list/2017/06/. 2:01: Better price/performance than Infiniband: Configuration assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of 648-port director switches and 36-port edge switches. Mellanox component pricing from www.kernelsoftware.com, with prices as of April 4, 2017. Compute node pricing based on Dell PowerEdge R730 server from www.dell.com, with prices as of April 4, 2017. Intel® OPA pricing based on pricing from www.kernelsoftware.com as of August 15, 2017 *Other names and brands may be claimed as property of others.
Show: 8 Show Overview: Brian and Tyler talk with Jeremy Eder (@jeremyeder, Senior Principal Software Engineer at Red Hat) about the Kubernetes Resource Management Working Group, scaling Kubernetes environments, extending Kubernetes for high-performance workloads (HPC, HFT, Animation, GPUs, etc.), testing at scale and how companies can get involved. Show Notes: KubeCon 2017 (Austin) Schedule; OpenShift Commons Gathering (Austin, Dec. 5th); Kubernetes Resource Management Working Group; Contact the Resource Management Working Group; Deploying 1000 Nodes of Kubernetes/OpenShift (Part I); Deploying 2048 Nodes of Kubernetes/OpenShift (Part II). Topic 1 - Welcome to the show. You recently introduced the Resource Management Working Group within Kubernetes. Tell us a little bit about the group. Topic 2 - The charter of the Resource Management Working Group enumerates a prioritized list of features for increasing workload coverage on Kubernetes (below). Let’s talk about some of the types of use-cases you’re hearing that drive these priorities. Support for performance sensitive workloads (exclusive cores, cpu pinning strategies, NUMA); integrating new hardware devices (GPUs, FPGAs, InfiniBand, etc.); improving resource isolation (local storage, hugepages, caches, etc.); improving Quality of Service (performance SLOs); performance benchmarking; APIs and extensions related to the features mentioned above. Topic 3 - This is a broad list of areas to focus on. How do you determine what things should be kernel-level focus, Kubernetes-level focus, or application-level focus? Topic 4 - How do you go about testing these areas? Are there lab environments available? How will you publish methodologies and results? Topic 5 - As you talk to different companies, do you feel like they are holding back on deploying higher-performance applications on Kubernetes now, or are they looking for more optimizations? Feedback? Email: PodCTL at gmail dot com Twitter: @PodCTL Web: http://podctl.com
Fredrik sounds a bit choppy at times; that is entirely his own fault. Jocke quotes the wrong person; that is entirely his own fault. 0 Örnsköldsvik, supercomputers, fiber and SAN 21:47: Datormagazin has a BBS again! 30:17: IKEA Trådfri 35:58: Apple has changed the icon for Maps 36:25: New possible CMSes for Macpro 44:11: An important email, and ways to sponsor the podcast. If you don't want to use Patreon but still want to donate money, you can get in touch with us for Swish details 47:32: Fredrik has finally watched Mr Robot! Spoiler warning from 48:49. 52:59: Fredrik listens to talk about blockchains and sees potential for bubbles 1:01:50: Discord kicks out Nazis, Trump is awful 1:10:19: Chris Lattner goes to Google Brain and app sizes are ridiculous 1:15:05: Jocke tries to build a new web cluster 1:21:29: Jocke reviews his new USB hub Links National Supercomputer Centre SGI Origin 3200 Silicon Graphics Cray Seymour Cray The first computer worth criticizing appears to be a quote from Alan Kay Be and BeOS InfiniBand Fibre Channel The R12000 processor Ernie Bayonne Jocke's supercomputer loot. Craylink - also known as NUMAlink Promise Thunderbolt-to-Fibre Channel adapter Ali - AliExpress Datormagazin BBS is back! Fabbes BBS SUGA A590 Terrible Fire The Vampire accelerators FPGA Plipbox SD2IEC - the 1541 emulator Jocke ordered from Poland. 
Satandisk IKEA Trådfri The article on Macrumors AAPL's newsletter Apple has changed the icon for Maps Grav Jekyll Blosxom Sourceforge - where some people used to put code Ilir - many thanks, dear Oneplus 5 sponsor You can support the podcast on Patreon, but only if you want to Mr Robot The Incomparable episode about Mr Robot We talked a little, mildly, about blockchains in episode 67 Discord kicks out Nazis Cloudflare too Tim Cook's letter to employees The video clip where Anderson Cooper matter-of-factly strips Trump of all honor Chris Lattner starts working at Google Brain App sizes are still ridiculous Code is a depressingly large part of the Facebook app's file size Acorn Alpine Linux PHP-FPM Nginx WP super cache Varnish Docker Openbsd Ballmer peak Jocke reviews his USB hub Henge dock Jocke's USB graphics card Full episode information is available here: https://www.bjoremanmelin.se/podcast/avsnitt-90-superdatorer-med-inbyggda-soffor.html.
In this slidecast, Gilad Shainer from Mellanox describes the advantages of InfiniBand and the company's off-loading network architecture for HPC. Watch the video presentation Sign up for our insideHPC Newsletter
This Week In HPC Episode 81 featuring Michael Feldman and Special Guest Chris Willard. InfiniBand Pulls Ahead on TOP500; Smart Meter Supercomputing.
Lightwave Editorial Director Stephen Hardy quizzes John Calvin, BERT portfolio manager at Tektronix, about the challenges technology developers are likely to face working in applications with data rates ranging from 25G to 28G. The three primary applications are the OIF CEI interfaces, InfiniBand, and IEEE 802.3bj. While they have aspects in common, each application also presents unique challenges that must be addressed.
Bob Rizika, U.S. CEO of ProfitBricks, and Achim Weiss, CEO of ProfitBricks Germany, provide a look into the future of cloud-based infrastructure-as-a-service with a service that is available today. What makes an advanced IaaS offering today? Here are a few hints: - Scale-up servers with on-the-fly CPU and RAM elasticity - By-the-minute pricing - InfiniBand for 80 Gbps throughput per server - Software-defined networking - Replicated RAID 10 storage
Our topics today revolve around geeky Valentine's Day gifts, InfiniBand basics and hashbang URLs, and we have a job posting for a sysadmin. In the studio: Rolf Kersten, Johannes Schlüter, Marc Baumann and host Constantin Gonzalez.
