As  Chief Evangelist of Intel Software Products, James Reinders spends  most of his working hours thinking about and promoting parallel  programming. He’s essentially a professor at large, attuning himself to  the needs of software developers with an interest in parallel  programming so he can offer guidance on techniques, ways of learning, and  ways to “think parallel” – all with a strong Intel bent,  naturally.
    As Intel moved from a multicore paradigm to a manycore one with the introduction of Xeon Phi in 2010, Reinders’ parallel programming evangelizing went into overdrive. In the next half-decade, Reinders and Intel colleague Jim Jeffers co-authored two books focused on demonstrating the computational potential of Phi’s 60+ cores: High Performance Parallelism Pearls, volumes 1 and 2. With the second-generation Phi, Knights Landing, on deck for general availability in 2016, we spoke with Reinders about the implications of Intel’s design choices for Knights Landing, what they mean for compatibility and performance, and what the user community can do to get ready for the first self-hosted manycore Xeon. Read the first half of our in-depth interview below.
    HPCwire: How are the various communities and stakeholders  preparing for Knights Landing? Can you talk about the challenges relating to  porting and exploiting parallelism?
    James Reinders: One of the things that distinguishes Xeon Phi is that it’s not challenging to port to at all. Being on a coprocessor or PCI card requires a lot of considerations because of the limitations in the size of memory and having to stage your algorithms and so forth. Anytime you are trying to target something that sits on a PCI card, you have a challenge, and we really felt that with Xeon Phi. But one of the huge design principles behind Xeon Phi that we’ve delivered very well on is that there is no porting effort per se to Xeon Phi, because it essentially looks like a very high-core-count Xeon. So the porting is the easy part for Knights Landing, since it will be a processor and not sitting on a coprocessor card, unless you decide to buy it that way.
    With Knights Landing as a processor, we’ve gotten rid of what I would say is the number one headache with Xeon Phi, which is the coprocessor. You’re not left with a porting challenge; you’re left with a challenge of scaling your application. You’re going to face that with any processor or any compute device of any sort. So that’s why we spent so much energy focused on evangelist work, teaching and code modernization. The real challenge for the entire industry for parallel programming is finding and exploiting parallelism, regardless of what compute device you want to use. And I think with Xeon Phi what we’ve done is eliminate the porting issues and purely made it an issue of parallel programming.
    HPCwire: Is one potential downside of  the manycore processor approach in contrast with the accelerator or  coprocessor paradigm that there are  no full-strength cores to handle the parts of the code that  don’t parallelize?
    Reinders: You’re referring to Amdahl’s law, where the part of your  program that’s serial is going to have a performance challenge. So you  get bottlenecks around that. Anytime you have a system that has something  highly parallel in it and you use that to speed up your parallel code, when you  fall back to doing serial code, you’ve got a challenge. On Knights  Corner, because it was a coprocessor, you try to divide your program between  the coprocessor that’s highly parallel and your host, which is probably a  very capable Xeon. So it is helpful to have a very capable host processor in  that case. You’re not going to want to run something fast, accelerate it  and have a weak processor coupled with it. For Knights Landing, we have much  stronger serial performance than Knights Corner — and that’s  on purpose.
    If you take a look at Knights Corner, we have 61 cores and the performance difference between a Knights Corner core and a Xeon core is give or take 10X. Some people might tell you it’s 12 or 14X; it depends on the application, but it’s pretty severe. That means you really want to avoid having a lot of serial code run on the Knights Corner. It was a pretty bad gap, owing to Amdahl’s effect. A well-parallelized program worked great; one that had serial regions had trouble.
    On Knights Landing, we’ve reduced that to about a 3X difference. Of course, the only way to reduce this to a 1X is to become a Xeon. But that’s not the point of Xeon Phi; the point is to scale higher. Now that we’re at 3X, we’re seeing really good results, meaning that a system built with just Knights Landing as processors works pretty darn well.
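To make the effect of that serial gap concrete, here is a minimal sketch of Amdahl's law with a serial penalty, normalized to a single Xeon core. It is an illustrative model, not Intel's methodology: the function name `effective_speedup`, the parameter names, and the aggregate parallel speedup of 20X are assumptions chosen for illustration. Only the roughly 10X (Knights Corner) and 3X (Knights Landing) per-core gaps come from the interview.

```python
def effective_speedup(parallel_fraction, serial_slowdown, parallel_speedup):
    """Speedup relative to one Xeon core under a simple Amdahl-style model.

    parallel_fraction: share of single-Xeon-core runtime that parallelizes (0..1).
    serial_slowdown:   how much slower one manycore core runs serial code
                       (about 10 for Knights Corner, about 3 for Knights Landing,
                       per the interview).
    parallel_speedup:  assumed aggregate speedup on the parallel part
                       (hypothetical value, not taken from the interview).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if serial_slowdown <= 0 or parallel_speedup <= 0:
        raise ValueError("serial_slowdown and parallel_speedup must be positive")
    # Serial part gets slower on the small core; parallel part gets faster.
    serial_time = (1.0 - parallel_fraction) * serial_slowdown
    parallel_time = parallel_fraction / parallel_speedup
    return 1.0 / (serial_time + parallel_time)


# Compare a 10X and a 3X serial gap at several parallel fractions,
# holding the assumed parallel speedup fixed at a hypothetical 20X.
for fraction in (0.50, 0.90, 0.99):
    gap_10x = effective_speedup(fraction, 10.0, 20.0)
    gap_3x = effective_speedup(fraction, 3.0, 20.0)
    print(f"parallel={fraction:.2f}  10X gap: {gap_10x:5.2f}x  3X gap: {gap_3x:5.2f}x")
```

Under these assumptions, the serial term dominates quickly: with a 10X gap, even a small serial fraction can erase most of the parallel gain, which is the behavior Reinders describes for Knights Corner.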
    HPCwire: Sounds like a balancing act.
    Reinders: When you step back and look at computer architecture, there are a lot of fun knobs you can turn when you are designing a machine and that’s what we as engineers do: we’re turning knobs. There are two main ones. One is big versus little cores, so when you go to a littler core, you can have more of them and you can scale further. But if you make them bigger, you can’t scale as much but you can handle a wider variety of code. We’ve turned that – that’s one knob.
    Another knob you can turn is compatibility. If you take a look at a GPU design, including Intel’s GPUs, one of the design things you do is require that a lot of the parallelism be done in lock-step, meaning the individual processing capabilities cannot do different code; they have to run exactly the same code at the same time. That has its pros and cons too. So for Xeon and Xeon Phi, we’ve made a very tight relationship between them in terms of compatibility. That’s our design decision. It has a lot of advantages and gives us the ability for Xeon Phi to scale higher than Xeon but to require that you are doing parallel programming. If you go and try to program our GPUs, you will find you don’t have that capability and are much more restricted in the programs you can run, and that gives the GPU certain capabilities that are useful for graphics units.
    Compatibility  with a processor also means an enormous amount of flexibility — which  provides a large degree of preservation for code investment. Your code can keep  working; you can decide how much to invest for performance, but you don’t  have to go make the change in it just because you changed hardware generations.
    HPCwire: Could you provide examples of codes that are  best suited for Knights Landing and those that are not as well-suited?
    Reinders: The one thing about Knights Landing is that it’s highly parallel, so Amdahl’s effect becomes a key consideration. So if your program is not parallel or not spending a significant amount of time doing things in parallel, then Knights Landing is not likely to be interesting.
    There is an exception to that due to the aggregate bandwidth on Xeon Phi being higher than on Xeon, so we have seen some examples of codes that lean pretty high on bandwidth that see benefits on Xeon Phi even though they aren’t as parallel as you might think. Because if your processor is waiting for bandwidth, feeding it more bandwidth can be helpful. Because aggregate bandwidth is high on Xeon Phi – it always has been – and you add in the high-bandwidth memory on Knights Landing, there are some applications that are a little less parallel than you’d expect that can get a boost on Knights Landing, but for the most part you are looking at programs that are parallel. So in the HPC domain, that’s the easy part: everything is a good target on Knights Landing, but that’s simply because the HPC world has been parallel for so long that to be a successful code in HPC you need to be parallel. Outside of HPC, it’s less clear. There are certainly things in technical computing that might be outside what people call traditional HPC, and Knights Landing looks very good on the ones that are parallel there, including big data problems and machine learning. Now whether you consider these to be HPC, to me it’s kind of fuzzy.
    HPCwire: Speaking of machine learning – do you expect that Knights Landing will get traction for neural networks?
    Reinders: It’s quite a good device for the different neural nets, both the training and the usage of them. Knights Landing in particular has some great attributes there. Because it’s not a coprocessor, we can talk about having large amounts of memory on it, which can be a huge advantage to many science problems and machine learning as well. That’s going to be an interesting thing to understand how to properly represent because you can choose your benchmarks carefully to fit in a small or select amount of memory, but a lot of times, with users, if their programs haven’t been as well conditioned, it would take effort, if it’s even possible, to condition an algorithm or application to run in too tight a piece of memory. When you’re talking about a processor, like Knights Landing, that has a large amount of memory capability, you can build machines with an appropriate size of memory to fit your application; you’re not saddled by what happens to fit on a coprocessor card, which with KNC was a challenge for us sometimes. That constraint of more limited memory definitely limits some of the applications or algorithms you can run, including machine learning.
     