Science teams at Argonne working on the Aurora project are getting closer and closer to bringing the 10,000-node exascale computer to life. Doing so requires a test and development system, a miniature version of the real thing.
Sunspot is that system: two racks of actual Aurora hardware, 128 nodes in size, built around the Intel® Data Center GPU Max Series.
In this episode of Code Together, two leads on the project discuss the Sunspot testbed and how Argonne research teams are using it to optimize code performance in preparation for early science runs on Aurora.
Listen. [34:50]
Tony [00:00:04] Welcome to Code Together, a podcast for developers by developers, where we discuss technology and trends in industry.
Tony [00:00:11] I'm your host Tony Mongkolsmai.
Tony [00:00:18] In the past, we've talked a lot about the software that is making the upcoming Aurora supercomputer at Argonne National Laboratory possible. Now, as this software is being deployed to the hardware, we are learning more about the performance of the hardware and the potential scientific applications it can enable. Today, we talk about some of that performance and the early scientific projects that are being readied for Aurora.
Tony [00:00:39] Tim Williams is Deputy Director of Argonne’s Computational Science Division where he manages the Early Science Program. Since 2009, he has worked with a number of large-scale projects using ALCF’s supercomputers, especially those in the area of plasma physics. Welcome to the podcast, Tim.
Tim [00:01:05] Thanks for having us.
Tony [00:01:07] Venkat Vishwanath is a computer scientist at Argonne National Laboratory. He is the data science team lead at the Argonne Leadership Computing Facility. His current focus is on algorithms, system software and workflows to facilitate data centric applications on supercomputing systems. His interests include scientific applications, supercomputing architectures, parallel algorithms and runtimes, scalable analytics and collaborative workspaces. Welcome to the podcast, Venkat.
Venkat [00:01:34] Thanks for having me.
Tony [00:01:35] So I've talked in the past to a lot of people who are working on the Aurora project who are actually getting things ready. I know that you guys are really excited about the bring up of Sunspot. Could you guys talk about how Sunspot is helping people get ready for Aurora?
Tim [00:01:47] Sunspot is what we call the test and development system for Aurora. So when you bring up a new machine like this, the test and development system is meant to be a miniature version of the real thing. And in fact, Sunspot is two racks full of actual Aurora hardware. So it's very much just a small version of the Aurora platform. And the early users, those in the Early Science Program and the Exascale Computing Project, have been on there for some months now running their applications, benchmarking, debugging and testing. And later, when the Aurora system goes into production, Sunspot will be a platform for staging new system software and for people who want to spin up new capabilities and scale up.
Tony [00:02:43] So for the hardware people out there who get really excited about this stuff, you mentioned it's two racks of systems. How many nodes are you talking about? And are you talking about similar types of data storage, performance, etc.?
Tim [00:02:57] Two racks in this architecture is 128 nodes. So each of those has two very powerful CPUs on it, the Intel Xeon CPU Max Series, which has on-package high-bandwidth memory, and it has six of the Intel so-called Ponte Vecchio or PVC GPUs, officially known as the Intel Data Center GPU Max Series. And compare that with Aurora, which will have over 10,000 nodes.
Tony [00:03:26] The funny thing that's cool is that for most people, when they talk about a 128-node cluster, they're talking about a gigantic cluster that is super hard to manage. And when you talk about it, you talk about it as this tiny little system that everybody's going to come in and use to get ready for this even more impressive 10,000-plus-node cluster. So the amount of scale that you guys are working on is always super impressive. Venkat, how does that affect your data science team and kind of the work that you do in your planning?
Venkat [00:03:55] The goal is to really achieve novel science and enable science, including scientific discoveries, at these leadership-class, really large-scale systems. So this is really exciting: you get to work with leading-edge systems, you're scaling up with science teams, working very closely with them on these large systems, and in some cases you're attempting problems that we've never done before. You know, you're reasoning about what you can really do at these scales, and we'll talk some about the new and novel science this is going to enable. But that's what's really exciting, right? You're working with challenging science problems to scale out. You're working on the software stack that needs to scale out. So you're working across teams that span the interconnects, the storage teams; you're working across deep learning frameworks, compilers, debuggers, the Python stack and the science teams together. So we are working closely in coordination to really achieve these goals that we have for science.
Tony [00:05:02] How is Sunspot helping the developers understand what to think about as they start looking towards a full deployment on Aurora?
Tim [00:05:12] The developers working on applications for Aurora now have, in many cases, things working and performing well on single GPUs or single nodes of the Aurora architecture. All of the calculations of interest are massively parallel. So the next and very important step is demonstrating and benchmarking scaling: scaling those applications up to many, many nodes, not just a single one, many nodes, each of which has many GPUs, six in our case. That's the most important work that's going on on the Sunspot system now. And, you know, having scaled to that system, and having it help us vet problems with network software and hardware, will set the stage for the much larger scale-up that needs to happen for Aurora.
Venkat [00:06:04] It's a test and development system which has the same architecture as Aurora. We had the integrated graphics to start off with that applications worked on. We went to the discrete graphics; we had the ATS discrete graphics. We are now working with the Ponte Vecchio GPUs, so we've actually worked on Intel GPU-based architectures with applications for quite some time. And what Sunspot provides us is a mechanism to really start scaling and to be prepared as the applications get access to Aurora. It provides the exact same configuration, software stack and management stack that applications will see as they start scaling out, move to Aurora and start planning their science scaling studies. So it's a very good vehicle for us as we start running science on Aurora.
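To make the node-level scaling Tim and Venkat describe concrete, here is a minimal sketch, not taken from any Aurora or Sunspot code, of one common pattern: each MPI rank on a multi-GPU node discovers its node-local rank and binds to one of the node's GPUs (six per node in Sunspot's case). It assumes an MPI library and a oneAPI/SYCL compiler are available.

```cpp
// Minimal sketch (not Aurora/Sunspot production code): each MPI rank on a
// multi-GPU node binds to one GPU based on its node-local rank.
// Assumes an MPI library plus a SYCL (oneAPI) compiler, e.g. an MPI wrapper
// around icpx; these tool names are illustrative assumptions.
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int world_rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Ranks that share a node form a "shared-memory" communicator; our rank
  // within it tells us which of the node's GPUs to use.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int local_rank = 0;
  MPI_Comm_rank(node_comm, &local_rank);

  // Enumerate the GPUs visible to this process and pick one round-robin.
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.empty()) {
    std::fprintf(stderr, "rank %d: no GPUs visible\n", world_rank);
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  sycl::queue q{gpus[local_rank % gpus.size()]};

  std::printf("rank %d (local %d) -> %s\n", world_rank, local_rank,
              q.get_device().get_info<sycl::info::device::name>().c_str());

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```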
Tony [00:07:08] Yeah, one of the funny things about scaling: I talked to a guy at Amazon, and actually at Google too, about the size of their data lakes. Myself being a performance engineer for 20 years, we always think we know how things are going to scale and how they perform, and the Amazon guys always tell me, until you've worked at Amazon, you don't know anything about scale. I imagine for people who are thinking they're ready for something like Aurora, it's probably something similar.
Venkat [00:07:35] Yeah, it's very similar, I would say. So in the past we have run systems from the Blue Gene/Q era that were 48,000 nodes. And I remember coming from grad school, where I'd run things on 500 nodes and it worked really great. Then I went to 500 nodes of Mira, and everything broke apart at 1,000 nodes and 2,000 nodes. So you learn about new algorithms, new approaches to do it, and then things work great at 8,000 nodes and everything breaks apart at 32,000 nodes, and you relearn from these, right? It's always a good experience on what it takes to really scale in production, and the challenges and assumptions that you make or need to break here. So, you know, we will have a similar journey as we go to Aurora as well. But we have the experience from previous systems, and we have the expertise in house today and our collaboration with both Intel and HPE that will help us scale out on Aurora.
Tim [00:08:45] Yeah, and you know, touching back to Venkat's comments on the balance of the system, one of the unique features of Aurora is that it has eight network endpoints per node. So it's sort of rich with networking endpoints, to, you know, allow us to have the good communication bandwidth that we think we need. But the consequence of that is that there are lots and lots of network endpoints, at a scale which has not yet been tested with the HPE Slingshot network we have.
Tony [00:09:21] And that's what a computer scientist like Venkat and myself would call job security. Usually it's bugs, but in this case, it's the ability to scale. What's the difference between, for instance, the Early Science Program, which you guys are a part of versus, I guess what would not be the Early Science Program? I mean, the traditional science. Is there an actual difference between the Early Science Program and traditional access to something like Aurora?
Tim [00:09:52] I'll say a few words, then Venkat can jump in. In selecting the projects, which are competitively chosen for the Early Science Program that Venkat and I manage, we try to reflect the spectrum of production scientific applications, both current and what we anticipate when the system goes into production. That being said, there's a relatively new charge for the leadership computing facilities to support, in addition to traditional high-performance computing simulations in engineering and physics and chemistry, data-intensive computing associated with experimental, observational and simulation data, and the use of AI methodologies, coupling all of them. So that's where there's something new coming in, I would say.
Venkat [00:10:55] I'll add that, as Tim mentioned, in the Early Science Program we have nine applications which are more traditional HPC and ten applications which cover, I would say, data and learning. To a great extent, most applications have some combination of simulation, data and learning all combined together for their science goals here. And as I mentioned, what we tried to do as part of the Early Science Program is to have a good diversity of applications, both in terms of science and in terms of their software requirements. And that helps us build a software stack and a system that will benefit a very diverse and wide range of science as part of the Early Science Program. The hope is that this really helps us tease out the types of applications and science that we expect to see on the system right away. In addition to the Early Science projects, there are projects from the Exascale Computing Project that also run on the system. So they really provide us with the kinds of requirements, from the software stack to the architecture, that will meet the needs of science going forward. And so when we go into production, we have a system that can really benefit a wide and diverse set of science teams. And this is a constant process of improvement, right? We work with various application domains, we're always getting feedback on features that we need, and we keep improving and providing productive systems for science.
Tony [00:12:39] So this is probably one of the interesting areas, and I'd like to get your opinion on this, where oneAPI/SYCL, kind of the libraries that Intel provides that are cross-vendor, essentially allow you to run on other vendors' hardware, even including potentially, you know, something like Frontier that has AMD in it, or any of these NVIDIA data centers, to namedrop some of our competition. This is one of the places where having that open ecosystem I think probably helps for you guys as scientists. When you ask how to build the stack that's going to allow you to have science going forward, choosing something that won't tie you to Intel or NVIDIA or AMD when it comes to GPU accelerators should help promote science more generally, because as new supercomputers and new large-scale systems come up, you should be able to run the same applications with minimal tweaks to the software stack and still get the important science done.
Tim [00:13:38] One thing that's very important in all of these projects, which generally have, you know, large development teams across multiple institutions and persist for many years, is the concept of performance portability. The portability part of that is the clearest: it means one can run these applications on different architectures with minimal tweaks and maintenance of the source codes behind them. When you put that together to make performance portability, that means one also expects a high level of performance. And you know, these are big supercomputers, the biggest, fastest in the world, with premier computational scientists using them, so expectations are high there as well. And it's a balance to achieve those together. The more that the platforms look the same from a software and programming perspective, the better. And I think there's a general feeling that open software and open standards are a long-term benefit to the computational science community.
Tony [00:14:53] Yeah, for sure. And obviously I mentioned SYCL and oneAPI, but also Kokkos, right, which you guys are using in the lab, is very important, and things like that, to make sure we kind of have that performance portability.
Tim [00:15:07] Yes. And that's an example of what I would term a portability layer, where in developing applications, one chooses a layer which may be developed by a third party, such as Kokkos, or developed as a library that's part of the application. And then, as best as possible, you encapsulate the architecture-dependent parts and the performance-critical parts underneath that abstraction layer. And then the implementers of that abstraction, that portability layer, have a constrained area in which they can work to make optimal implementations under the hood for different architectures. And you know, certainly for developing for the Intel GPU architectures, most of that work is focused on SYCL implementations.
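As a rough illustration of the portability-layer idea Tim describes, here is a minimal sketch, with a hypothetical function name not taken from any Aurora application, of an application-facing interface whose architecture-dependent implementation is hidden underneath it, in this case a SYCL backend targeting a GPU.

```cpp
// Minimal sketch of the "portability layer" pattern: the application calls a
// neutral interface (axpy), and the architecture-specific code lives behind it,
// here a SYCL backend. Other backends (CUDA, HIP, OpenMP offload) would provide
// their own implementation of the same interface. Illustrative only.
#include <sycl/sycl.hpp>
#include <vector>

// Application-facing interface: no SYCL (or CUDA, or HIP) types leak through.
void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
  sycl::queue q{sycl::gpu_selector_v};  // throws if no GPU is present
  const size_t n = x.size();
  {
    sycl::buffer<float> xb{x.data(), sycl::range<1>{n}};
    sycl::buffer<float> yb{y.data(), sycl::range<1>{n}};
    q.submit([&](sycl::handler& h) {
      sycl::accessor xa{xb, h, sycl::read_only};
      sycl::accessor ya{yb, h, sycl::read_write};
      h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        ya[i] = a * xa[i] + ya[i];  // y = a*x + y on the device
      });
    });
  }  // buffers go out of scope here, so results are copied back into y
}

int main() {
  std::vector<float> x(1000, 1.0f), y(1000, 2.0f);
  axpy(3.0f, x, y);  // every element of y should now be 5.0
  return 0;
}
```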
Tony [00:16:02] And I know earlier this year you guys actually presented at one of the Intel eXtreme Performance Users Group meetings in Asia about the performance results you were able to get using the various different pieces of not just oneAPI, but OpenMP offload and all these other scalable, scale-out programming paradigms. Can you talk a little bit about the performance you were able to get out of the Intel Data Center GPU Max, which is more commonly known as Ponte Vecchio?
Tim [00:16:32] You know, over the last few years there's really been a multiyear effort to develop these applications. And the first part of that is porting to a software layer for PVC, and the next part of that is optimization. And we've found that, as the projects under programs like Early Science and the Exascale Computing Project have reached that point of optimization and benchmarking, we're now starting to see the fruits of this labor and starting to see the capabilities of the Intel hardware. So what we've found is that across a set of fairly disparate applications spanning engineering and physics and computational materials and chemistry, running benchmarks on a single GPU, we see anywhere from a factor of 1.3 to over 2.6 in performance advantage on the Intel GPUs versus the competitor A100 GPU from NVIDIA. These are very promising results, and they span a number of different numerical methodologies. And as the rest of the software stack is solidifying and maturing, we're gathering more and more data. So we expect to see more promising results from other areas, including the data-centric projects that we have.
Tony [00:18:00] Yeah. And when we talk about those data-centric projects and we talk about performance, I'm actually looking at the presentation that you guys gave, and you mentioned greater than 230 petabytes of data and greater than 25 terabytes per second in the system. How does that level of ability to transfer data between parts of the system affect the science that's possible?
Venkat [00:18:24] That's a great question, and it's a very key question as we start working on very data-intensive projects, and we have a history of working on really large data processing as part of supercomputing facilities. So if you look at some of the production systems that we have today, we have a Lustre file system, which has, I would say, approximately a peak of close to 600 to 700 gigabytes per second. As we are working with larger data sets from experiments and observations, doing more large-scale training and inference, dealing with large-scale simulations and all combinations of these together, right, we really need a very performant file system, or performant storage system, let's put it that way, which can help you with, I would say, raw throughput in terms of bytes that you can read and write. And yeah, we're talking about 25 terabytes per second, which is at least 25X more than what we currently have. So that's one tick, you know, one axis or one dimension. The other thing that DAOS really provides is extremely high metadata operation rates. So if you're doing lookups, or you're doing small reads or writes, this is very challenging for the current file systems that we have deployed, and DAOS will really help tackle these. So this really pushes us in two directions where we can really benefit from storage systems like DAOS, the Distributed Asynchronous Object Storage, and they are really embracing technologies which will benefit us going forward as we plan for future science.
Tim [00:20:18] And I would just add to that, if you look at the advance of scientific instruments, observational instruments, experiments like the Advanced Photon Source upgrade that's just about to start construction at Argonne, and each of this sequence of, you know, telescopes, if you will, being launched into space or set up at the South Pole to do cosmological sky surveys, the amount of data makes a big jump with each of these advances. And so it's at the point where something like supercomputing, with more innovative data storage and transfer systems like DAOS, is becoming a necessary tool.
Tony [00:21:05] Well, that's interesting. That's actually the question I was going to ask you, but you already answered it, which is totally fine. It's great. It's good to know, because one of the tenets that we always see is that as computers get bigger and bigger and the performance profiles of those systems change, the science, I think, and the software around it have to change in order to maximize the value. And I think that's exactly what you're pointing out there: given the ability to transfer this much data, you're actually able to do science differently than you were before, because in the past, even if you had all of the data, you just couldn't use it as part of your effort, because everything would bottleneck waiting for the data to get somewhere where you could compute something, and it's just not performant. But now, with a system like DAOS, you're actually able to do that. So that's pretty cool.
Venkat [00:21:55] And this is a very key point about building balanced systems. You know, you could increase your compute 20x, but if the performance of your storage system isn't keeping up, you know, that becomes a big bottleneck, and in some cases you turn a supercomputer into an I/O-bottlenecked system. What you really want to build is balanced systems which benefit science, and DAOS will be very key and critical to enabling that.
Tony [00:22:25] So we've talked about the software stack, we've talked about hardware. I'm going to pivot now to the part that I'm interested in, maybe even more than the software and hardware, which is the science. So I'd like to ask each of you, out of the, I guess, 19 projects you have or other things that you see coming down the pipeline, what is the most exciting and interesting science... I don't want to say science project because it makes it sound trivial... scientific use of Aurora that you guys think will be a giant step forward, maybe for science or for society?
Tim [00:23:01] You know, as a scientist with a physics background, it's very hard to pick a favorite. I think, as we said, we've spanned a lot of domains in choosing the projects for science on Aurora. And an important characteristic of the Early Science Program is that all of the projects proposed specific scientific campaigns to run on the system prior to other users getting on. And so we sort of know what the first science will be. I'll speak to one project that I know more about than any of the others, probably because of my background in plasma physics, and this is a project that simulates the plasma inside a tokamak, which is a magnetic confinement fusion energy machine. There's a giant version of this device, ITER, being built now in the south of France by an international collaboration, you know, over multiple decades and billions of dollars. And there's high confidence in the plasma physics community that this design will be able to demonstrate net energy gain and sustained long pulses. So it really is expected to achieve fusion energy. The project, which is led by PI CS Chang from Princeton Plasma Physics Laboratory, does detailed kinetic simulations of the plasma all the way out to the edge of the tokamak and parts of the walls. More importantly, a subcomponent of the tokamak called the divertor is subjected to exhaust from the fusion reactions, high-energy particles bombarding it and lots and lots of heat. And when that happens, even to tungsten, which is very tough, it can sputter atoms off of that surface, and those can become ionized. And suddenly you have these really heavy tungsten ions circulating around among your very light hydrogen ions, which are the fusion fuel, and how that affects the dynamics and control of the plasma in these devices is a key question for ITER and other tokamaks as well. So this project will be doing super high fidelity kinetic simulations of the plasma in the presence of these tungsten ion impurities to study and understand that and answer some questions for the ITER project.
Venkat [00:25:39] There are several interesting projects, and I'll just briefly touch upon a couple of them, and I would definitely add that these are all going to be really exciting; all of them are really pushing the boundaries in their fields collectively as well. You know, one project that I've been involved in is really trying to map out the structure of brains. This is called connectomics, understanding those structures, and it helps us better understand diseases such as Alzheimer's, understand plasticity, and even inform the design of new materials as well. We're working with a neuroscientist at the University of Chicago, Bobby Kasthuri. This project is led by Nicola Ferrier, who is a senior scientist at Argonne, and the collaboration includes teams from Harvard as well as Princeton. They take large-scale electron microscopy images and are looking to really reconstruct the mouse brain, I would say, for now, and in the future they want to target the human brain. And even if you start looking at the mouse brain, that is an exascale inference problem, not just training, it's an inference problem. If you do it in a very brute-force approach, it might take you 2 to 3 years on an exascale machine. There are other interesting computational approaches that the team is pursuing which will hopefully give us answers faster, with the hope that this will help us target human brains. So this is one exciting project.
Venkat [00:27:14] There is another on drug screening, where the goal is to design attractive candidates for treating cancer and to evaluate their efficacy for various tumors. This is a project that's led by Rick Stevens at Argonne National Laboratory and includes collaborators from various institutions and national laboratories. It's also a partnership with the National Cancer Institute and NIH, where they're really looking at scaling these kinds of workflows, screening through billions of compounds at scale on these systems, and also trying to generate new drug designs that meet certain criteria. So here's another interesting problem that really couples in training at scale and inference at scale. It has MD simulations that are coupled in as well, and it really brings together various aspects of data, learning and simulation to get some very, very interesting science done here.
Tim [00:28:21] And I would chime in that the first example that got discussed, the brain connectomics study, is an example where advances in scientific instruments have driven a great increase in data. There's a robotic electron microscopy setup that automatically slices and processes the tissue and can produce image data, you know, faster than computing can analyze it.
Tony [00:28:52] Oh, that's actually really interesting. My son actually has cerebral palsy, which is the reason that I live in Orlando now, so he can go to a school here called the Conductive Education Center of Orlando, which is specifically for kids with cerebral palsy. So it's really, really interesting to see that kind of work being done. I actually wasn't aware of it before I saw your presentation. So very exciting. I mean, I'm sure for a lot of people, understanding how the brain works is going to make a big difference in terms of end-of-life things with Alzheimer's, as well as, you know, in my case, my son with cerebral palsy. So very cool stuff.
Tony [00:29:25] It is interesting, as I was going through your presentation, one of the things that we talked about is how big the system is, how fast the system is. But you guys still call out that reduced precision is important. So I don't want to go too much down that rat hole because, you know, we're here to talk about big scale systems. But even in this space, it looks like with all the power that you have in Aurora, we're still looking at potentially using reduced precision to get benefits. So can you talk a little bit about how that actually still adds value in a scale system like this?
Venkat [00:30:01] There are several value adds that reduced precision can provide. So, you know, let's take connectomics, for example, right? We are talking about high-resolution imagery, and you have several channels that are encoded with eight bits here. This is a very prime candidate where you can use 16 bit and eight bit to really decrease the time to solution for reconstructing these samples and provide fast scientific insights. We are seeing the impact of using reduced precision for applications ranging from applying language models to scientific literature to the genome-scale studies that we have, where we tend to use 16 bit. For a lot of accumulation operations we'll use 32 bit, but 16 bit has, I would say, been proven to be key for reaching these goals. Having said that, we are also seeing solvers that are used for traditional HPC where, in some cases, it's more of an exploration: given the amount of reduced precision that we have, how effective are these multiprecision solvers for science domains? So this is still being teased out in traditional HPC, and in some domains it has actually already been embraced, where some of the inner loops can be run in 16 bit. Having said that, there is still a wide range of science applications that really need 128-bit precision as well. So we have to design systems that can span across the spectrum, and we will see reduced precision really being impactful for a wide range of science.
Tim [00:31:53] Yeah, and I would add that the new architectures like these GPUs, including Intel's, have a significant advantage at lower precision of being able to do more calculations per second. And so that's attractive, even as Venkat mentioned in traditional HPC simulations. For that reason alone, but also because some simulation applications are bound by the size of the system rather than the speed. Something like the HACC cosmology code had already worked to make the bulk of their calculations 32 bit rather than 64 bit just because they can fit a bigger problem that way. But because of the hardware on some of these GPUs, it also can do more calculations per second. And so that drives computational scientists to do as much as they can at reduced precision for those reasons.
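As a concrete, deliberately tiny illustration of the pattern both guests describe, doing the arithmetic on 16-bit operands while accumulating in 32-bit, here is a minimal SYCL sketch of a dot product. It is illustrative only, not from any Aurora application, and assumes a device that reports fp16 support.

```cpp
// Minimal sketch of mixed precision: 16-bit (half) inputs with a 32-bit (float)
// accumulator for a dot product. Illustrative only; assumes a SYCL device that
// supports the fp16 aspect.
#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>

int main() {
  sycl::queue q{sycl::default_selector_v};
  if (!q.get_device().has(sycl::aspect::fp16)) {
    std::cout << "Device has no fp16 support; skipping.\n";
    return 0;
  }

  const size_t n = 1 << 20;
  std::vector<sycl::half> x(n, sycl::half{0.5f}), y(n, sycl::half{2.0f});
  float result = 0.0f;  // 32-bit accumulator on the host side

  {
    sycl::buffer<sycl::half> xb{x.data(), sycl::range<1>{n}};
    sycl::buffer<sycl::half> yb{y.data(), sycl::range<1>{n}};
    sycl::buffer<float> rb{&result, sycl::range<1>{1}};

    q.submit([&](sycl::handler& h) {
      sycl::accessor xa{xb, h, sycl::read_only};
      sycl::accessor ya{yb, h, sycl::read_only};
      // Accumulate in float even though the operands are half.
      auto sum = sycl::reduction(rb, h, sycl::plus<float>());
      h.parallel_for(sycl::range<1>{n}, sum, [=](sycl::id<1> i, auto& acc) {
        acc += static_cast<float>(xa[i]) * static_cast<float>(ya[i]);
      });
    });
  }  // buffer destruction copies the reduced value back into 'result'

  std::cout << "dot = " << result << " (expected " << n * 1.0f << ")\n";
  return 0;
}
```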
Tony [00:32:51] So with that... I'll go to the question that I ask everybody at the end, which is for each of you guys, what are you guys looking forward to coming out of Aurora in the next couple of years?
Tim [00:33:02] Playing the role that Venkat and I have in the Early Science Program, of course, it's the science that is going to commence in those projects in the coming months and over the course of the next year. That's what I most look forward to. These projects will be the first science on the system, and some of the associated applications and methodologies and workflows will likely, you know, follow on to be successful production users of the system. You know, exascale, while being just a number in the most abstract and objective sense, is a milestone and certainly a significant step in scale and performance for Argonne compared to anything we've had before. So seeing that play out in these scientific campaigns is what I look forward to.
Venkat [00:33:59] Yeah. You know, I would say Aurora is going to be an exciting system. As Tim mentioned, I'm really looking forward to all the, I would say, novel discoveries that it will enable. It's got 60,000-plus GPUs, and the amount of compute that it's going to provide for science is unprecedented. So I'm excited to really see science scale and the new types of science that it will enable. At the same time, it's also going to provide a great vehicle for helping develop the next generation of computer scientists and computational scientists that will help us build the future systems that are needed for science as well. So I'm excited for the system. We've put in a tremendous amount of work, working closely with science teams, architecture teams and the software stack teams, and it's all come together and it's been a great collaboration.
Tony [00:35:06] Okay. And with that, I think that's probably about the end of our time today. I'd like to thank Tim and Venkat for joining me.
Tim [00:35:13] Well, thanks for having us.
Venkat [00:35:15] Yeah, thank you so much, Tony.
Tony [00:35:17] And thank you, our listeners, for joining us as well. And I hope you join us next time, where we talk more technology and trends in industry.