Glenn K. Lockwood
193 followers
Posts

A little different from my usual posts--a dissection of the I/O subsystem on a desktop system and all of the neat widgets that never see the light of day in the data center.

I couldn't help but write a bit of a rebuttal to a recently posted blog post that trashes tape in favor of disk-based object stores. I get that it's fun to beat up on tape as an ancient storage technology, but it's still alive and kicking for a few very good reasons.

Related to my burst buffer blog post, I also recently updated the nonvolatile memory technologies page I had on my website. It's very hardware-focused, but it might be of interest to those who would like to learn the basics of how SSDs are built and operate, and where SSD technology is going.

I've been sitting on this post for over six months, but after a call with some colleagues of mine who wanted to know more about burst buffers, I was reminded that there aren't a lot of great resources out there that have an up-to-date view of where the burst buffer ecosystem stands today.

To do my part and share the knowledge, I put a bow on this post and present a breakdown of what burst buffers exist (or will exist) in terms of hardware and software.

This article stops short of saying what I expected it to--don't write multi-threaded programs by hand if you can use OpenMP.

The biggest reason is not really acknowledged in this article--OpenMP is a lot more than compiler directives; it's also a performance-oriented runtime. The developers of OpenMP runtimes have put monumental effort into optimizing the inner workings of thread scheduling to be performance portable and scalable. Programming threads "by hand" throws all of this out the window and leaves you, the programmer, to reinvent the wheel.

Take thread binding, for example--with pthreads, you have to bind threads to cores by hand, and without a lot of effort, your choice of core binding will not be performance portable to other processors. With OpenMP, you simply set a single environment variable, and your threads are bound according to the automatically discovered processor topology.
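
To make this concrete, here is a minimal sketch of what by-hand binding looks like (Linux-specific, using the GNU pthread_setaffinity_np extension; the thread-i-to-core-i mapping is my own made-up example, not a recommendation):

/* sketch: by-hand thread binding; compile with: cc -pthread bind.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    int core = (int)(long)arg;
    cpu_set_t set;

    /* Pin this thread to one core.  The naive identity mapping
     * bakes in assumptions about the machine's core/socket layout
     * that won't hold on other processors. */
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    printf("thread pinned to core %d\n", core);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

The OpenMP equivalent contains no affinity code at all: compile with -fopenmp and run with, say, OMP_PROC_BIND=close, and the runtime binds threads against whatever topology it discovers.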

Other examples exist for thread scheduling (e.g., blocking loops), thread creation/destruction (e.g., active thread pools), and just about every other non-trivial aspect of multithreaded programming. So, unless you enjoy wasting time programming that which has already been optimized, don't program threads by hand.
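
As a sketch of the scheduling case, here is a hypothetical OpenMP loop (the function and compile line are my own illustration, not from the article):

/* sketch: scale a vector in parallel; compile with: cc -fopenmp scale.c */
#include <stdio.h>
#include <stdlib.h>

void scale(double *x, long n, double a)
{
    /* The runtime blocks the loop and hands iterations to an
     * already-created thread pool; swapping schedule(static) for
     * schedule(dynamic) or schedule(guided) changes the scheduling
     * policy without touching any thread-management code. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        x[i] *= a;
}

int main(void)
{
    long n = 1 << 20;
    double *x = malloc(n * sizeof *x);

    for (long i = 0; i < n; i++)
        x[i] = 1.0;
    scale(x, n, 2.0);
    printf("x[0] = %f\n", x[0]);
    free(x);
    return 0;
}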

My colleagues at SDSC are looking for someone who wants to get his or her hands dirty supporting their Gordon and Comet systems. This is a fantastic opportunity to get into the HPC industry--it's exactly how I got my start--and it comes with a lot of opportunities to pursue research, industrial collaboration, and involvement in the larger scientific computing community.

SDSC will be at SC'16, as will I, and I'm happy to connect anyone who's interested in this position with the right people over at SDSC. Let me know!

We're building some of the biggest and fastest supercomputers in the world, and we're looking for people to help us design them. NERSC and LBNL have great people, great technology, and a great view, so please get in touch with me if you or anyone you know is interested.

Job description is here: https://lbl.taleo.net/careersection/2/jobdetail.ftl?lang=en&job=83027

I've had this blog post half-written for a number of months now. Figured I should just publish it so that I can always remember what all the important IOR flags are.

A little lengthier version of the G+ post I made earlier today about the TaihuLight system, its node architecture limitations, and the intrigue of its processor design.

China's new ~100 PF TaihuLight system is impressive given the indigenous processor design. That being said, it features some critically limiting design choices that make the system smell like it was built primarily to run HPL and take the #1 spot on Top500.

Jack Dongarra wrote this great understatement:

"So, TaihuLight is a lot slower for applications that involve a lot of memory traffic (data movement)."

and this sentiment seems to be echoed by a number of people in the HPC business with whom I spoke today. And while it's true that not all scientific applications are memory-bandwidth-bound, this system will do a great job of turning regular applications into memory-bandwidth-bound problems.

Consider the fact that each TaihuLight node delivers 3062 GFLOPS and has 136.51 GB/sec of memory bandwidth. This means that in the time it takes to load two 64-bit floats from memory, the processor could theoretically perform over 350 floating point operations. But it won't, because it has only loaded the operands for a single FLOP.

Of course, this is an oversimplification since caches exist to feed the extremely high FLOP rate of modern processors. And where there are so many cores that their caches can't be fed fast enough, we see technologies like GDDR DRAM and HBM (on GPUs) and on-package HMC/MCDRAM (on KNL).

The ShenWei chips in the TaihuLight machine have neither; they simply have a 64 KiB scratchpad above their L1 cache, so their CPUs will be starved for data in all but the most compute-heavy workloads like LINPACK.

To put this into perspective, consider ORNL Titan's K20X GPUs' GDDR subsystem--it can feed the 1.32 TF GPUs at 250 GB/sec, which gives a 4x better byte/flop ratio than TaihuLight. This is a huge gap, meaning that an application that is perfectly balanced to run on a Titan GPU (in terms of average memory accesses required to issue a single flop) will run 4x slower on a TaihuLight processor; despite being capable of doing 3 TFLOPS of computing, its memory bandwidth would limit it to only 0.75 TFLOPS.
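
For anyone who wants to check the arithmetic, here's a back-of-envelope sketch (all figures are the ones quoted above; "balance" here is simply memory bandwidth divided by peak flops):

/* sketch: machine balance from the numbers quoted in this post */
#include <stdio.h>

int main(void)
{
    const double taihu_flops = 3062e9;    /* GFLOPS per TaihuLight node */
    const double taihu_bw    = 136.51e9;  /* GB/sec per TaihuLight node */
    const double titan_flops = 1.32e12;   /* peak flops of a K20X GPU   */
    const double titan_bw    = 250e9;     /* K20X GDDR bandwidth        */

    /* machine balance: bytes of memory bandwidth per flop */
    double taihu_balance = taihu_bw / taihu_flops;   /* ~0.045 B/flop */
    double titan_balance = titan_bw / titan_flops;   /* ~0.19  B/flop */

    printf("TaihuLight balance: %.4f bytes/flop\n", taihu_balance);
    printf("Titan K20X balance: %.4f bytes/flop\n", titan_balance);
    printf("Titan advantage:    %.1fx\n", titan_balance / taihu_balance);

    /* flops a TaihuLight node could issue in the time it takes to
     * load one pair of 64-bit operands (16 bytes) from memory */
    printf("flops per 16-byte load: %.0f\n",
           taihu_flops / (taihu_bw / 16.0));
    return 0;
}

Running this prints a roughly 4.2x balance gap and about 359 flops per 16-byte load, consistent with the "4x" and "over 350" figures above.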