The Computer Architecture Seminar Series brings to the UT campus leading researchers from universities and companies to describe their research to UT's faculty, graduate students, and computer engineers and scientists from the Austin community, who are welcome to attend at no charge. The individual talk…
Today's emerging neural computing substrates have their origins in either biology (spiking neural networks, like IBM's TrueNorth) or machine learning (deep convolutional networks, like Google's TPU). There are glaring gaps between these two approaches, since they differ dramatically in architecture, efficiency, usability, and practical applicability to real-world tasks. Deep neural networks, which are based on the nonlinear perceptron neuron model, have recently emerged as a very powerful tool for classifying spatial inputs, such as complex, real-world image data. DNNs re-evaluate the entire network at every time step, performing either convolutions or matrix-vector multiplications for each layer; this requires massive amounts of compute time and memory to train and process, since there can be millions of parameters that must be learned and retained for the networks to achieve high accuracy. In contrast, spiking neural networks can be very energy efficient because they are fundamentally event driven, leading to an efficiency gap between the two approaches. However, spiking networks suffer from a dearth of effective approaches for configuring and/or training them to perform tasks of practical value. The semantic limitations of simplified spiking neurons in hardware can make deploying biologically inspired applications challenging, leading to a semantic gap. Similarly, numerical applications require many ad hoc changes to map to such substrates, leading to an algorithmic gap. Finally, emerging technology may provide a much more efficient substrate for these algorithms, creating a technology gap.
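To make the efficiency gap concrete, here is a minimal sketch (hypothetical Python/NumPy, not code from the talk) contrasting a dense perceptron-style layer, which performs a full matrix-vector product every time step, with an event-driven spiking layer that only touches the synapses of neurons that actually fired:

```python
import numpy as np

# Dense DNN layer: a full matrix-vector product runs at every step,
# regardless of how the input changed (perceptron-style ReLU layer).
def dnn_layer(W, x):
    return np.maximum(W @ x, 0.0)

# Event-driven spiking layer: only synapses of neurons that actually
# fired this step contribute any work.
def snn_layer(W, spikes, v, threshold=1.0):
    active = np.flatnonzero(spikes)        # indices of spiking inputs
    v = v + W[:, active].sum(axis=1)       # accumulate only active synapses
    fired = v >= threshold                 # neurons crossing threshold spike
    v[fired] = 0.0                         # reset neurons that fired
    return fired, v
```

When only a few percent of neurons spike per step, the spiking update touches a correspondingly small fraction of the weight matrix, which is the source of the event-driven efficiency advantage.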
Energy and power are the main constraints in computer architecture today and for the foreseeable future. Unfortunately, there is no silver-bullet technique that can single-handedly solve the problem, certainly not in computer architecture. Instead, we need to apply a set of technologies that, in combination, can improve energy efficiency substantially. This talk presents some of these technologies. Cores need to flexibly operate at a range of voltages, and techniques for efficient energy use such as power gating and voltage speculation need to be widespread. New control-theoretic controllers can manage multiple parameters concurrently for minimal resource waste. Further, if novel devices can be integrated with conventional CMOS, new energy-efficient architectures can be designed.

Speaker Biography
Josep Torrellas is the Saburo Muroga Professor of Computer Science at the University of Illinois at Urbana-Champaign. He leads the Center for Programmable Extreme-Scale Computing, a center focused on architectures for extreme energy and power efficiency. He has been the director of the Intel-Illinois Parallelism Center (I2PC), a center created by Intel to advance parallel computing. He has made contributions to parallel computer architecture in the areas of shared memory multiprocessor organizations, cache hierarchies and coherence protocols, thread-level speculation, and hardware and software reliability. He is a Fellow of IEEE, ACM, and AAAS. He received the 2015 IEEE CS Technical Achievement Award.
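As an illustration of the kind of control-theoretic management the abstract alludes to, here is a minimal sketch (hypothetical Python, with assumed gains and limits) of a proportional-integral controller that steers core frequency so that measured power tracks a budget:

```python
# A proportional-integral (PI) controller that nudges core frequency so
# measured power tracks a budget; gains and limits are assumed values.
class PIPowerController:
    def __init__(self, kp=0.01, ki=0.001, f_min=1.0, f_max=3.0):
        self.kp, self.ki = kp, ki
        self.f_min, self.f_max = f_min, f_max
        self.integral = 0.0

    def step(self, budget_w, measured_w, freq_ghz):
        error = budget_w - measured_w        # positive error => headroom
        self.integral += error
        delta = self.kp * error + self.ki * self.integral
        # Clamp to the legal operating range of the core.
        return min(self.f_max, max(self.f_min, freq_ghz + delta))
```

A real manager would coordinate several such loops (voltage, frequency, power gating) at once; the point here is only the feedback structure.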
Visual object detection and recognition are needed for a wide range of applications including robotics/drones, self-driving cars, smart Internet of Things, and portable/wearable electronics. For many of these applications, local embedded processing is preferred due to privacy or latency concerns. In this talk, we will describe how joint algorithm and hardware design can be used to reduce the energy consumption of object detection and recognition while delivering real-time and robust performance. We will discuss several energy-efficient techniques that exploit sparsity and reduce data movement and storage costs, and we will show how they can be applied to popular forms of object detection and recognition, including those that use deep convolutional neural nets. We will present results from recently fabricated ASICs (e.g. our deep CNN accelerator named “Eyeriss”) that demonstrate these techniques in real-time computer vision systems.

Speaker Biography
Vivienne Sze is an Assistant Professor at MIT in the Electrical Engineering and Computer Science Department. Her research interests include energy-aware signal processing algorithms, and low-power circuit and system design for multimedia applications. Prior to joining MIT, she was a Member of Technical Staff in the R&D Center at TI, where she developed algorithms and hardware for the latest video coding standard H.265/HEVC. She is a co-editor of the book entitled “High Efficiency Video Coding (HEVC): Algorithms and Architectures” (Springer, 2014). Dr. Sze received the B.A.Sc. degree from the University of Toronto in 2004, and the S.M. and Ph.D. degrees from MIT in 2006 and 2010, respectively. In 2011, she was awarded the Jin-Au Kong Outstanding Doctoral Thesis Prize in electrical engineering at MIT for her thesis on “Parallel Algorithms and Architectures for Low Power Video Decoding”. She is a recipient of the 2016 AFOSR Young Investigator Award, the 2016 3M Non-tenured Faculty Award, the 2014 DARPA Young Faculty Award, the 2007 DAC/ISSCC Student Design Contest Award and a co-recipient of the 2008 A-SSCC Outstanding Design Award. For more information about research in the Energy-Efficient Multimedia Systems Group at MIT visit: http://www.rle.mit.edu/eems/
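To illustrate one way sparsity saves energy, here is a minimal sketch (hypothetical Python, not the Eyeriss design itself) of zero-skipping, where multiply-accumulates are skipped whenever an activation is zero:

```python
# Zero-skipping dot product: multiply-accumulates are gated off whenever
# an activation is zero, the software analogue of what sparsity-aware
# accelerator datapaths do to save switching energy.
def sparse_dot(weights, activations):
    acc = 0.0
    for w, a in zip(weights, activations):
        if a == 0.0:      # skip the MAC entirely for a zero operand
            continue
        acc += w * a
    return acc
```

Since ReLU layers commonly produce many zero activations, a large fraction of MACs can be skipped, which is why this style of gating pays off in hardware.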
In the quest for more intelligent consumer devices, machine learning lets appliances understand what is happening around the computer and what is asked of it, while big data provides the history and context of the environment. But devices must also react to be useful, and for many applications the reaction needs to happen on a human timescale to be valuable. For example, an advertisement beacon must beam a discount coupon to the shopper's cellphone within a few hundred milliseconds or the shopper will walk past. Today many people prefer to use large shared data centers, remotely accessed through the public internet, for big-data analytics and machine learning, because this is the most cost-effective and energy-efficient way to do large-scale computing. But integrating real-time computing with big data and machine learning may make that impractical: exchanging messages through the internet may itself consume a substantial fraction of a second, leaving almost no time for computing if you want to guarantee an application response time of a few hundred milliseconds. In this talk I propose a FLASH-based parallel computer using large numbers of low-power processor chips with vector units. Such a system is much smaller, cheaper, and lower power than one with equal memory capacity and instruction throughput made entirely with DRAM, x86 processors, and GPUs. It is small enough to install locally in retail and office locations or on mobile platforms such as trucks and ships, and inexpensive enough that it need not be a shared computing resource. Yet because it uses primarily FLASH memory, which is extremely dense, its storage capacity can be as big as or bigger than that of any DRAM-based in-memory big-data analytics server.
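The latency argument can be made concrete with a back-of-the-envelope budget (all numbers below are illustrative assumptions, not figures from the talk):

```python
# Back-of-the-envelope latency budget with assumed (illustrative) numbers:
# a public-internet round trip can eat most of a human-scale deadline.
deadline_ms = 300       # respond before the shopper walks past (assumed)
wan_rtt_ms = 200        # round trip to a remote shared data center (assumed)
queue_and_io_ms = 50    # request handling, serialization, storage I/O (assumed)

compute_budget_ms = deadline_ms - wan_rtt_ms - queue_and_io_ms
print(f"time left for actual analytics: {compute_budget_ms} ms")  # 50 ms
```

Moving the computer into the store or onto the vehicle removes the WAN round trip from the budget entirely, which is the core of the case for local installation.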
Computers used as datacenter servers have usage patterns that differ substantially from those of desktop or laptop computers. We discuss four key differences in usage and their first-order implications for designing computers that are particularly well-suited as servers: data movement, thousands of transactions per second, program isolation, and measurement underpinnings. Maintaining high-bandwidth data movement requires coordinated design decisions throughout the memory system, instruction-issue system, and even instruction set. Serving thousands of transactions per second requires continuous attention to all sources of delay – causes of long-latency transactions. Unrelated programs running on shared hardware produce delay through undesired interference; isolating programs from one another needs further hardware help. And finally, when running datacenter servers as a business it is vital to be able to observe and hence decrease inefficiencies across dozens of layers of software and thousands of interacting servers. There are myriad open research problems related to these issues.

Speaker Biography
Dick Sites most recently was a Visiting Professor teaching a graduate course on Datacenter Software at the National University of Singapore. Before that he was a Senior Staff Engineer at Google for 12 years. He previously worked at Adobe Systems, Digital Equipment Corporation, Hewlett-Packard, Burroughs, and IBM. His accomplishments include co-architecting the DEC Alpha computers and building various computer performance monitoring and tracing tools at the above companies. He also taught Computer Science for four years at UC San Diego in the 1970s. His work at Google has included understanding CPU, disk, and network performance anomalies, disk and network hardware design, web-page language detection, and downloadable Google Translate dictionaries. Dr. Sites holds a PhD degree in Computer Science from Stanford and a BS degree in Mathematics from MIT. He also attended the Master's program in Computer Science at UNC Chapel Hill in 1969-70. He holds 38 patents and is a member of the U.S. National Academy of Engineering.
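As a toy illustration of why "all sources of delay" matter at thousands of transactions per second, here is a minimal sketch (hypothetical Python with assumed numbers) showing how rare per-server stragglers come to dominate once a request fans out to many servers:

```python
import random

# If one request fans out to 100 servers, the slowest reply sets the
# user-visible latency, so rare per-server hiccups become common.
def server_latency_ms():
    return 1.0 if random.random() > 0.01 else 50.0   # 1% slow outliers (assumed)

fanout_latencies = [max(server_latency_ms() for _ in range(100))
                    for _ in range(10_000)]
slow = sum(l >= 50.0 for l in fanout_latencies) / len(fanout_latencies)
print(f"fraction of fanout requests hitting a 50 ms straggler: {slow:.0%}")
# With 1% slow servers and a fanout of 100, about 63% of requests
# (1 - 0.99^100) see at least one straggler.
```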
Current-day data centers and IaaS clouds (e.g. Amazon EC2, MS Azure, Google GCE) use microprocessors that are very similar to or the same as those used in small servers and desktops. This work rethinks the design of microprocessors specifically for data center use, along with how microprocessors are affected by the novel economic models that have been popularized by IaaS clouds. This talk will describe several architectural changes, including how a processor can be decomposed into sub-components (e.g. ALU, Cache, Fetch Unit) that can be individually rented in IaaS clouds, how running similar programs can be taken advantage of in the data center, how architectural features such as the flavor of memory bandwidth (bursty vs. bulk) can be provisioned and sold in the data center, and novel memory architectures that enable the creation of cache-coherence sub-domains across the data center. This work has not only been simulated: many of the discussed ideas have been implemented in one of the largest academic processors ever built, the Princeton Piton Processor. Piton is a 25-core manycore built in IBM's 32nm process technology containing over 460 million transistors. This talk will discuss Piton along with what it takes to tape-out a complex microprocessor in an academic setting. Lastly, Piton has recently been open sourced as the OpenPiton (http://www.openpiton.org) project, an expandable manycore platform which includes RTL, thousands of tests, and implementation scripts. The talk will conclude by discussing how OpenPiton is able to contribute to the burgeoning field of open source hardware.
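One way to picture provisioning bandwidth "flavors" is a token-bucket lease. The following is a minimal hypothetical sketch in Python, not the actual mechanism from the talk: bulk tenants get a steady guaranteed rate, while bursty tenants spend a credit of burst tokens above it:

```python
# Hypothetical model of selling two "flavors" of memory bandwidth:
# bulk tenants get a steady guaranteed rate; bursty tenants draw on a
# prepaid token bucket when they exceed it.
class BandwidthLease:
    def __init__(self, guaranteed_gbps, burst_tokens_gb):
        self.guaranteed = guaranteed_gbps
        self.tokens = burst_tokens_gb        # spendable burst credit, in GB

    def grant(self, requested_gbps, interval_s):
        excess_gb = max(0.0, requested_gbps - self.guaranteed) * interval_s
        burst_used = min(excess_gb, self.tokens)
        self.tokens -= burst_used
        return self.guaranteed + burst_used / interval_s   # rate granted
```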
As transistor sizes continue to scale, we are about to witness stunning levels of chip integration, with 1,000 (simple) cores on a single die, and increasing levels of die stacking. Transistors may not be much faster, but there will be many more of them. In these architectures, energy and power will be the main constraint, efficient communication and synchronization a major challenge, and programmability an unknown. In this context, this talk presents some of the technologies that we will need to deploy to exploit these architectures. Cores need to flexibly operate at a range of voltages, and techniques for efficient energy use such as power gating and voltage speculation need to be widespread. To enable data sharing, we need to rethink synchronization and fence hardware for scalability. Hardware extensions to ease programming will provide a competitive edge. A combination of all of these techniques, and additional disruptive technologies, is needed.

Speaker Biography
Josep Torrellas is the Saburo Muroga Professor of Computer Science at the University of Illinois at Urbana-Champaign. He leads the Center for Programmable Extreme-Scale Computing, a center focused on architectures for extreme energy and power efficiency. He has been the director of the Intel-Illinois Parallelism Center (I2PC), a center created by Intel to advance parallel computing. He has made contributions to parallel computer architecture in the areas of shared memory multiprocessor organizations, cache hierarchies and coherence protocols, thread-level speculation, and hardware and software reliability. He is a Fellow of IEEE and ACM. He received the 2015 IEEE CS Technical Achievement Award.
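To see why operating across a range of voltages matters so much, here is a minimal worked sketch (hypothetical Python, illustrative numbers) using the standard dynamic-power relation P ≈ C·V²·f:

```python
# Dynamic power scales roughly as C * V^2 * f, so a modest drop in
# supply voltage (with a matching frequency drop) saves a lot of power.
def dynamic_power(c_eff, volts, hertz):
    return c_eff * volts**2 * hertz

nominal = dynamic_power(1.0, 1.0, 2.0e9)     # assumed 1.0 V at 2.0 GHz
scaled = dynamic_power(1.0, 0.8, 1.6e9)      # 20% lower V and f
print(f"dynamic power saved: {1 - scaled / nominal:.0%}")   # ~49%
```

The quadratic dependence on voltage is why techniques like voltage speculation, which let cores run safely at lower margins, are worth substantial hardware effort.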
Since the early 2000s, we have been experiencing two very important developments in computing. One is that a tremendous amount of resources have been invested into innovative applications such as first-principle based models, deep learning, and cognitive computing. Many application domains are questioning the conventional “it is too expensive” thinking that led to inaccuracies and missed opportunities. The other is that the industry has been taking a technological path where application performance and power efficiency vary by more than two orders of magnitude depending on their parallelism, heterogeneity, and locality. Today, most of the top supercomputers in the world are heterogeneous parallel computing systems. New standards such as the Heterogeneous Systems Architecture (HSA) are emerging to facilitate software development. Much has been and needs to be learned about algorithms, languages, compilers, and hardware architecture in these movements. What are the applications that continue to drive the technology development? How hard is it to program these systems today? How will we program these systems in the future? How will innovations in memory and storage devices present further opportunities and challenges? What is the impact on long-term software engineering cost on applications? In this talk, I will present some research opportunities and challenges that are brought about by this perfect storm.
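The claimed two-orders-of-magnitude swing can be illustrated with a simple roofline-style estimate (a hypothetical Python sketch with assumed peak numbers, not data from the talk):

```python
# Roofline-style estimate: attainable performance is the lesser of peak
# compute and (arithmetic intensity * memory bandwidth), so locality
# alone can swing performance by orders of magnitude.
peak_flops = 10e12          # 10 TFLOP/s accelerator peak (assumed)
mem_bw = 500e9              # 500 GB/s memory bandwidth (assumed)

for intensity in (0.1, 1, 10, 100):    # FLOPs performed per byte moved
    attainable = min(peak_flops, intensity * mem_bw)
    print(f"intensity {intensity:>5}: {attainable / 1e12:.2f} TFLOP/s")
# Prints 0.05, 0.50, 5.00, and 10.00 TFLOP/s: a 200x spread driven
# entirely by how much data reuse (locality) the code achieves.
```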