Last week I attended the Arm Research Summit, and on Tuesday there was a panel session on using Arm hardware for performance, an effort known as GoingArm. You can watch the whole panel on YouTube, from 2:12 onward (~1h30). All videos are available on the website.

It was interesting to hear the opinions of people on the front lines of ARM HPC, as well as the types of questions that came up, which made me believe even more strongly that now is the time to rethink computing in general. For a long time we have been coasting on the fact that cranking up the power and squeezing down the components made the lives of programmers easier. Scaling up to multiple cores was probably the only big change most people had to deal with, and even so, not many people can actually take real advantage of it.

Even in the HPC world we seem to be stuck between MPI and OpenMP, still relying on faster interconnects and memory access to boost performance. On the other end of the scale, people still seem to mostly be stuck in shared memory land, but also relying on better memory or faster cores. And while there is a good amount of research on FPGAs, accelerators and GPU compute, only the most market-friendly (CUDA) ends up on the Top500 clusters.

But the exascale effort is so challenging that none of the traditional methods used in the past decades will be enough. We have to innovate from CPU design to air conditioning systems, from memory design to new software methods. 

The most interesting takeaway from that panel, for me, is that nothing we have today will be able to measure which machines will be exascale, because the benchmarks we use today have no meaning in this new world.

As Eric puts it, HPC is the Formula1 of computing, where technology trickles down into general purpose. For the next 10 years, I hope, there will be a lot of new ideas that, even if they don’t stick in HPC, will be useful on more power-aware hardware such as mobile and robotics.

I have no idea what those technologies will be, but I’m excited to be part of it. 

Now, my summary of the panel. It’s a long post, but it covers all that was said. You may want to skip a few questions… however, if you do, I recommend you read at least the last two.

Invited panellists

From left to right:

  • Eric Van Hensbergen (ARM)
  • Jon Masters (RedHat)
  • Mitsuhisa Sato (RIKEN)
  • Shinji Sumimoto (Fujitsu)
  • Kevin Pedretti (Sandia National Lab)
  • Scott Hara (Qualcomm)

Q. Moore’s Law vs. Exascale

Both Scott and Kevin argue that not much will change, that the advances will continue to happen. Sato adds that exascale will come before Moore’s law ends, but the costs are getting higher. Eric compares exascale with petascale times, when we were asking the same questions, but at that time, Moore’s law was still kicking.

Sumimoto argues that it really depends on which problem we’re trying to solve, that hardware depends on software to be efficient, to which Jon agrees wholeheartedly. Jon reminds us that developers have been very lazy for a long time, but the free lunch is over.

At this moment I interjected and argued that we never had a free lunch in the first place. The cost of developing meaningful architectures was higher than just cranking up the frequency, but doing the latter has created a number of the problems we have in hardware today, from the design of gates to how primitive our distributed algorithms are.

Eric adds that we have also hit the memory limit, where adding more flops won’t make a difference if you can’t move the data fast enough. Kevin agrees, but also reminds us that we’re still seeing improvements, so it’s not all doom and gloom.
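A back-of-the-envelope way to see Eric's point is the roofline model, which is my choice of illustration here and not something the panel brought up. The sketch below caps attainable performance at whichever is lower: peak compute, or memory bandwidth times the kernel's FLOPs per byte. The function name and all the numbers are made up for the example.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Hypothetical node: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
# Hypothetical kernel: 0.25 FLOPs per byte moved (memory-bound).
base = attainable_gflops(1000, 100, 0.25)
more_flops = attainable_gflops(2000, 100, 0.25)      # double the compute
more_bandwidth = attainable_gflops(1000, 200, 0.25)  # double the bandwidth

print(base, more_flops, more_bandwidth)
```

For a memory-bound kernel like this one, doubling the peak FLOPS changes nothing, while doubling the bandwidth doubles the attainable performance.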

Even though we may still get to exascale by just cranking up the power a bit more, luckily, everyone agreed that it is best to avoid that narrow-minded approach and focus on the real problems.

Q. Omnipath as a game changer

In Kevin’s experience, Intel’s Omnipath interconnect is not that big a deal. We had similar technologies in the past and none of them were a big deal either.

Everyone else chimed in on a similar note, but the crux of the argument is that Omnipath is yet-another closed technology, which will have no adoption beyond its maker. Intel used the commodity argument for far too long after clamping down on competition by closing the technology path.

Eric shares that it’s that kind of thing that makes his job much easier. Whenever a big company tries to bring the whole stack to themselves, all customers reach out to others in order to keep the competition and with that, the innovations that are fundamental in a space as competitive as HPC.

Both Scott and Sumimoto argue that it’s only through open standards and co-design that we’ll reach real performance for real applications at a feasible energy budget.

Q. ISA changes vs. co-processors vs. accelerators

It seems there’s a consensus that less is more. To some degree, every panellist argued that having co-processors or special ISA extensions poses a huge migration and maintenance cost, while some said accelerators could skip the problem if they used open standards for programming and automatic run-time detection of the features in a way that is completely transparent to the user.
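As a rough idea of what transparent run-time detection could look like, here is a minimal sketch, assuming a Linux system where /proc/cpuinfo lists CPU features on a "Features" line (AArch64) or a "flags" line (x86). The kernel names (dot_sve, dot_neon, dot_scalar) are hypothetical placeholders, not part of any real library, and this is not how any panellist's stack actually does it.

```python
def parse_features(cpuinfo_text):
    """Collect CPU feature flags from /proc/cpuinfo-style text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("features", "flags"):
            features.update(value.split())
    return features

def pick_kernel(features):
    """Choose the most capable (hypothetical) kernel the CPU supports."""
    if "sve" in features:
        return "dot_sve"      # hypothetical SVE implementation
    if "asimd" in features:
        return "dot_neon"     # hypothetical Advanced SIMD implementation
    return "dot_scalar"       # portable fallback, always available

# Sample text standing in for the contents of /proc/cpuinfo.
sample = "processor\t: 0\nFeatures\t: fp asimd evtstrm crc32\n"
print(pick_kernel(parse_features(sample)))
```

The user only ever calls one entry point; the choice of implementation happens once, at start-up, and a scalar fallback means nothing breaks on hardware without the optional features.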

Eric argues that ARM has been through multiple iterations of multiple sub-architecture design and specialisation and the design of AArch64 has avoided all of those problems on purpose. Jon goes further and states that if a feature is optional, it doesn’t exist.

Aside: As an ARM compiler engineer for the past 7 years I can thoroughly confirm that the sub-architecture nightmare created, especially between ARMv5 and ARMv7, is extremely detrimental to performance and power management on real hardware. It’s rare that compilers can make optimal use of a real board but it’s even more rare that users know of all the compiler flags that would make that happen.

Sato shares that the decision to not add accelerators in Fujitsu’s K Supercomputer is because of the complexity of supporting too many applications in the same cluster. The porting costs are just too high.

Both Jon and Kevin agree that boring is good. If the hardware can’t help the kernel and libraries get the most performance out of it (i.e. open drivers, simple interfaces, features enabled), then the hardware is next to useless.

But Kevin disagrees that re-compilation is a problem. While for RedHat, re-compiling all packages to multiple sub-architectures would be a nightmare, in HPC systems it’s not uncommon to re-compile the same libraries over and over to extract the last FLOP of the applications.

The conclusion is basically that we need the base OS and tools to just work out-of-the-box, but also the ability to re-build almost everything when needed.

Q. Sharing the learning to get to exascale

Eric reminded the audience that HPC is like Formula1: a lot of new technologies that have no purpose elsewhere are developed and used in very restricted environments, but ultimately a lot of those have already trickled down into general purpose computing, like caching, floating point, SIMD, etc.

Q. Design decisions towards linpack or machine learning

While Linpack, HPCG and most HPC benchmarks are double precision, deep learning is largely 16-bit and moving down, often less than 8 bits. Eric argues that this may create a divergence in HPC design, but it won’t throw us back into a Linpack-only era. We’re more interested in real world applications nowadays.
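To give a feel for how far apart those two worlds are, here is a small sketch that rounds values through IEEE 754 half precision using Python's `struct` module (the `e` format). The helper names `to_half` and `half_sum` are mine, and the loop is only an illustration of precision loss, not a model of how any real deep learning or HPC code accumulates.

```python
import struct

def to_half(x):
    """Round a Python float (64-bit) to the nearest 16-bit float and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def half_sum(n):
    """Add 1.0 n times, rounding the accumulator to 16 bits after every step."""
    acc = 0.0
    for _ in range(n):
        acc = to_half(acc + 1.0)
    return acc

# 0.1 is stored with far fewer significant bits in half precision.
print(to_half(0.1), 0.1)

# The 64-bit sum is exact; the 16-bit accumulator stalls once the gap
# between neighbouring representable values grows larger than 1.0.
print(half_sum(3000), float(sum(1.0 for _ in range(3000))))
```

A long reduction like this is routine in Linpack-style workloads, which is one reason the two kinds of hardware pull in different directions.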

Jon and Sato add that there’s always space for balanced designs. Even HPC clusters have to be shared among many users. But Sato warns that deep learning is very popular at the moment, so we’ll have to come up with some solution that is manageable.

Sumimoto reminds us that data is now the biggest problem. Moving and caching data is probably one of the hardest challenges ahead.

In Scott’s view, we’re surely not going towards Linpack-focused machines. The divide that Eric mentions will drive new standards. Current HPC clusters have varying degrees of performance. For example, K is 8th on Linpack but first on HPCG, due to its high memory-to-compute ratio.

Ultimately, we’ll need both generic and heterogeneous benchmarks to measure success in the exascale world.

Q. Is process-in-memory the ultimate evolution

All answers hit the same note, going back to the general purpose argument as before. As Kevin put it, granularity is always an issue and there are lots of overheads to worry about. While Scott worries about portability issues (that will certainly be a real problem), Sato wonders if attaching a general purpose CPU to memory won’t help with that.

I personally think that memory compute will have to be minimalist, due to the power budget, but that will invariably require a common and open standard programming model, that doesn’t exist right now. We have a solution waiting for a problem, and once we get that, we’ll be able to develop the appropriate model.

Q. Opening the micro-architecture to developers

Jon warns that this is a dangerous model, and everyone else agrees. Errata and optimisations can differ between revisions of the same sub-architecture, and that means whatever programmers did for one revision may not work, or worse, may break everything, on the next.

On ARM, this is less of a problem due to its RISC history, so for Jon, optimising the compiler to emit the right instruction sequence is probably the best way forward. I agree wholeheartedly.

Sato shares that they proposed SVE to ARM in the past. ARM initially refused to implement it over concerns about ecosystem fragmentation, but later (after the formal process) incorporated it as a standard extension.

Q. Handling heterogeneity: improve tools or educate programmers

Kevin shares Sandia’s work on designing a write-once-run-everywhere model where the tool can target different hardware (general purpose, accelerators, etc.) as long as you describe the data and computation using the model they created.

For Scott, as a chip vendor, you want to create hardware that is immediately useful, to everybody. You can’t change a Java programmer into a Fortran programmer, so you need tools, common standards. “Make it boring”.

Eric has received feedback that even though some workloads would benefit tremendously from GPGPUs, those don’t make the majority of the workloads that would run on clusters, so not having the extra maintenance and tooling costs was preferable.

In Sato’s view, the heterogeneity of accelerators will have to converge, with standards like OpenCL, to be useful. This will also allow different vendors to optimise on top of those standards and still be able to compete in the open market.

Q. Memory and storage separation

“Never underestimate how slowly this industry moves”, said Jon. Even though there is a lot of new technology moving into HPC (Formula1-style), there’s still a lot of legacy and risk aversion. If we’re now mixing non-volatile memory with DRAM we’ll have to worry about what’s left over, what to persist between checkpoints and what not to overwrite between persists.

Eric has a similar opinion. NVM is used in different ways nowadays (buffers, caches), but that won’t converge towards boring filesystems any time soon. Some people still use tapes, so even though we’ll see more and more NVM in HPC systems from now on, “there will be a heavy tail on traditional storage methodologies”. Not to mention the security issues that persistence will bring us, potentially leaking data between runs of different users.

Q. Is exascale the upper limit to performance

Kevin and Eric share the opinion, repeating previous comments, that exascale is not about raw FLOPS, but efficiency. It’s about doing the same thing X times better in terms of memory bandwidth, latency and locality.

A lot of the problems we’re solving now are at a fraction of the optimal efficiency, so re-thinking how we solve those problems will improve the efficiency more than just throwing hardware at the problem. Echoing Sumimoto’s earlier comments, co-design is the key to exascale.

Sato brings up an important point, that there are two types of computation: capability and capacity. While capacity is all about FLOPS on standard benchmarks, the K computer is special because it focused on the capabilities that were needed by the workloads that run on it.

Q. Takes too long to get weather predictions

As a follow-up to the previous question, Eric and Sato answer that this is not exactly true. Special purpose clusters have been developed for that kind of task and they are largely successful. Two good examples are the hurricane models predicting the series of hurricanes this year, as well as Anton, a molecule simulator that uses some very specialised hardware (ASICs, specialised network, etc.) to achieve some very impressive results.

Q. Programmability of extremely-heterogeneous hardware

Sato thinks that hardware vendors should always design hardware with programmers in mind, pointing out that FPGA accelerators should use standards like OpenCL, but Kevin brings up a good point: “most programmers don’t know what they want until they see it”. Things like CUDA revolutionised the HPC industry, but programmers weren’t interested in GPUs in the beginning. It took NVidia the creation of a language and some serious performance improvement to entice developers to take on the steep learning curve to understand GPUs in the first place.

Scott hits the co-design note once more. From a hardware perspective, you want to provide compelling reasons for people to try your new device, and for that, collaboration in the design phase is crucial, not only to get people’s opinion and do it right the first time around, but also to get people interested before chips hit the shelves.

Q. Shared memory in HPC

Everyone seemed baffled that people were still using shared memory in HPC settings in this day and age, but Eric took the bait.

He argues that this is, again, all about balance. Not just FLOPS and bandwidth, but also I/O, storage and even air conditioning. However, he warned the ones still trying to simplify the developer’s view by packing single-core logic into HPC: “when you hide too much from the users you let them shoot themselves in the foot”. “Never underestimate communication as an important part of the balance equation”.

Q. The biggest barriers for ARM entering HPC

Everyone’s answers were exactly the same: maturity of the ecosystem.

Scott poses the billion dollar problem: “we can’t afford billions on software research”. Without a vibrant community driving the ecosystem, there’s no way ARM will succeed.

For Sumimoto, the most important thing is to migrate the existing workloads to ARM and that will take a lot more than just re-compiling everything. It’s about software packages working well, libraries being optimised, a single binary running on all platforms, but also specialised libraries optimised for particular micro-architectures.

Both Sato and Sumimoto agree that SVE is particularly well suited to the portability issue (and vendor specific value), but more importantly as a future-proof technology that will evolve with HPC instead of forcing yet-another migration in 10 years’ time. Sato argues that, without SVE, ARM cannot beat Intel in HPC.

For Eric, there are no technical reasons why ARM cannot succeed in HPC. It’ll be down to business decisions, market success and the ecosystem being minimally ready. The existing HPC CPU vendors will not be afraid to use their dominant positions to punish the integrators for experimenting with new architectures. “We need to stick together as a community”, against monopolies.

Q. ARM laptop/workstation as HPC driver

Eric shares his similar experiences with PowerPC. It’s not just for the lone open source developer, but also software vendors. They need a viable platform on which to design their software and port to ARM.

Jon points out that laptops are too far from the HPC world, not only on the class of chips but also battery problems and lighter components. Eric agrees, and believes workstations may actually be a good starter. Well, won’t that be a game changer?