How the heck did we forget about performance?
When I speak about high-performance code, technical staff often scratch their heads and look at me as if I were a bizarre phenomenon. In this world of productive platforms like .Net or Java, many programmers write code without really knowing what is going on behind the scenes. Those enterprise programs use only a fraction of the CPU's ability, but they deliver the work at a reasonably low cost.
More and more, those languages incorporate libraries that introduce performance features, such as synchronized tasks and building blocks for structuring multi-threaded work. Lately, libraries for SIMD operations have been added to both Java and .Net. I have seen significant improvement in the way multi-threading is used thanks to those libraries. Still, many programmers believe the compiler will do most of the optimization work for them. In fact, the compiler does an excellent job of optimizing the logic in place, but it will only extrapolate your code to a certain extent.
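To make the idea concrete, here is a minimal sketch of what an explicit SIMD operation looks like. I show it in C++ with x86 AVX intrinsics rather than the Java/.Net libraries mentioned above; the array names and the multiple-of-8 length are just illustrative assumptions, and it requires an AVX-capable CPU and the matching compiler flag (e.g. -mavx).

```cpp
#include <immintrin.h>  // AVX intrinsics (x86)
#include <cstddef>

// Adds two float arrays 8 elements at a time instead of one by one.
// Assumes n is a multiple of 8 to keep the sketch short.
void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 8)
    {
        __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats from a
        __m256 vb = _mm256_loadu_ps(b + i);   // load 8 floats from b
        __m256 vr = _mm256_add_ps(va, vb);    // 8 additions in one instruction
        _mm256_storeu_ps(out + i, vr);        // store the 8 results
    }
}
```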
Most compilers, even compilers for low-level languages, do not manage non-uniform memory access (NUMA), pipelining, cache synchronization, memory fetches, or SIMD operations for you. The truth of the matter is that high-level languages are not meant for high performance and give little control for creating truly performant code, even if they are very impressive. They were conceived to let you develop code quickly and to manage the nitty-gritty details for you, reducing memory leaks, pointer issues, garbage collection work, etc., but they add overhead to achieve those abilities.
High-level or low-level language, before you start to optimize your code, I cannot stress enough how essential it is to benchmark it. The only way to get a clear vision of the possible gains is to measure the results and not assume where the bottleneck is. It is always challenging to generalize about optimization, because the reasons code is not fast enough vary and do not always require deep, low-level programming.
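As a minimal sketch of what I mean by benchmarking, here is one way to time a suspected hot routine in C++ with std::chrono. The workload, the iteration count, and the volatile sink are placeholders I invented for the example; replace them with your own code and a proper benchmarking harness if you have one.

```cpp
#include <chrono>
#include <cstdio>

// Placeholder workload; replace with the routine you suspect is hot.
volatile double sink = 0.0;
void compute_something()
{
    double s = 0.0;
    for (int i = 1; i <= 100000; ++i)
        s += 1.0 / i;
    sink = s;  // volatile store keeps the loop from being optimized away
}

int main()
{
    using clock = std::chrono::steady_clock;

    const int iterations = 100;           // repeat to average out timer noise
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        compute_something();
    const auto end = clock::now();

    const auto total_us =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::printf("average: %.2f us per call\n",
                static_cast<double>(total_us) / iterations);
}
```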
In high-performance code, the strategies used align what the algorithm wants to do with how the hardware behaves. For example, we avoid allocating or releasing memory as much as possible. We structure memory access to prevent cache misses, or force a prefetch at the appropriate time if needed. Between the phases of a calculation, we try not to write intermediate results back to memory but instead keep them in registers, limiting read and write traffic. We avoid stalling the pipeline, and losing precious CPU clocks, by coding appropriately or by relying on register renaming techniques. We avoid the costliest CPU instructions, when possible, by reformulating the formula. On CPUs with NUMA domains, we make sure each thread works on memory it can access efficiently. We use SIMD (single instruction, multiple data) operations when it is coherent to do so. We avoid transferring data between threads as much as possible and use locks or atomic operations only as needed. We help the branch predictor, use dynamic code generation or self-modifying code, and plenty of other techniques.
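As one small illustration of the memory-access points above, here is a hedged C++ sketch: the same matrix sum written in a cache-unfriendly and a cache-friendly traversal order, with the buffer allocated once by the caller rather than inside the hot loops. The row-major layout and the sizes are assumptions made for the example.

```cpp
#include <vector>
#include <cstddef>

// Sum a row-major matrix stored in one contiguous buffer.
// Column-first traversal jumps 'cols' elements at a time and misses the cache;
// row-first traversal walks memory sequentially and reuses each cache line.

double sum_column_first(const std::vector<double>& m, std::size_t rows, std::size_t cols)
{
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];     // large strides: frequent cache misses
    return total;
}

double sum_row_first(const std::vector<double>& m, std::size_t rows, std::size_t cols)
{
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];     // sequential access: cache friendly
    return total;
}
```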
The CPU tries to optimize itself via several methods. Still, when a conflict occurs, the first thing the processor does is block the parallel operation on a core until the conflict is resolved (a register conflict in the pipeline, memory not in the cache, etc.). So the goal in high-performance code is to write code that avoids hardware-resource conflicts as much as possible during execution and uses the CPU's ability to work simultaneously as much as possible. You can be amazed by the performance gained. When such code runs correctly, the improvement is impressive. Get the full potential out of your CPU clock and you will realize how fast our modern CPUs are, and how often they are blocked by some kind of restriction.
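To give one concrete, hedged example of such a conflict: in the first sum below, every addition depends on the previous one, so the core waits on that single accumulator; splitting the work across independent accumulators lets the out-of-order hardware keep several additions in flight. The four-way split is an arbitrary choice for the sketch, and the actual gain depends on your CPU and compiler, so measure it as discussed above.

```cpp
#include <cstddef>

// One long dependency chain: each addition must wait for the previous result.
double sum_single_chain(const double* data, std::size_t n)
{
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

// Four independent chains: the out-of-order core can overlap the additions.
// Assumes n is a multiple of 4 to keep the sketch short.
double sum_four_chains(const double* data, std::size_t n)
{
    double a = 0.0, b = 0.0, c = 0.0, d = 0.0;
    for (std::size_t i = 0; i < n; i += 4)
    {
        a += data[i];
        b += data[i + 1];
        c += data[i + 2];
        d += data[i + 3];
    }
    return (a + b) + (c + d);
}
```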
If the code is critical and compute-intensive, it can be economically interesting to max out its performance. Understanding performance is contextual and depends on your needs. Resource-wise, it is always more costly to get specialized resources, and code optimized for performance is generally more complicated to read and maintain. The question comes down to this: how much does your computation time cost? Must your computing task be completed in a timely manner? Code that delivers more performance will keep paying off for as long as you use it. What is the correct balance for you?
In my experience, I have never seen a case where all the code in place had to be replaced for optimization. In the performance improvement process, the optimization targets the specific portions of the code where intensive computing occurs. Usually, those parts of the code are extracted and encapsulated to get all the flexibility needed to increase performance. That encapsulation is typically integrated harmoniously via a function call, so it does not hurt the existing code's readability, and it is kept separate precisely because it will need more specialized attention.
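Here is a hedged sketch of what that encapsulation can look like: the surrounding code keeps one readable call, while the compute-intensive part lives behind a single function that can later be rewritten with SIMD, threads, or any of the techniques above without touching the callers. The names (compute_risk_scores, run_analysis) and the formula are invented purely for illustration.

```cpp
#include <vector>
#include <cstddef>

// The optimized kernel lives behind this one call. Today it is a plain loop;
// tomorrow it can be rewritten with intrinsics, threads, or NUMA-aware
// placement without changing the calling code.
std::vector<double> compute_risk_scores(const std::vector<double>& exposures,
                                        const std::vector<double>& weights)
{
    std::vector<double> scores(exposures.size());
    for (std::size_t i = 0; i < exposures.size(); ++i)
        scores[i] = exposures[i] * weights[i];   // placeholder for the real formula
    return scores;
}

// The readable, high-level code simply calls the kernel.
void run_analysis(const std::vector<double>& exposures,
                  const std::vector<double>& weights)
{
    const std::vector<double> scores = compute_risk_scores(exposures, weights);
    // ... report, store, or feed the scores to the next step ...
    (void)scores;
}
```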
High-performance computing, what we commonly call HPC, consists of splitting the computation across many servers and can take various forms. In my opinion, high-performance computing does not remove the need for high-performance code. In HPC, your compute process's bottleneck will most likely be the communication between nodes, or your slowest process if you have dependent processes. The correct balance between workload and communication becomes critical in the case of dependent processes. This computing technique allows complex algorithms to be managed on a large scale. In my opinion, this ability does not remove or simplify the need for individual performance in each algorithm; instead, it adds complexity, for obvious reasons. My rule of thumb is that if your intensive compute process can be contained on one server, with all the cores available on a modern CPU, I do not see a valid reason to split that process across many nodes. Please do not take me wrong: that does not mean the process should not run on multiple servers for different simultaneous analyses or for redundancy. It means your strategies have to work toward reducing dependency between nodes. With a large dataset, it becomes challenging to achieve no dependency at all, and that is totally understandable.
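To illustrate the rule of thumb, here is a hedged C++ sketch that fills one server before reaching for more nodes: it splits a buffer across all the hardware threads the machine reports, and each worker owns an independent slice, so there is no cross-thread or cross-node communication at all. The chunking and the doubling operation are simplifications made for the example.

```cpp
#include <thread>
#include <vector>
#include <algorithm>
#include <cstddef>

// Process a buffer with every hardware thread on this one server.
// Each worker owns an independent slice, so there is no cross-thread traffic.
void process_on_one_server(std::vector<double>& data)
{
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (data.size() + workers - 1) / workers;

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
    {
        const std::size_t begin = w * chunk;
        const std::size_t end = std::min(data.size(), begin + chunk);
        pool.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= 2.0;            // placeholder for the real computation
        });
    }
    for (std::thread& t : pool)
        t.join();
}
```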
A chain is no stronger than its weakest link; never forget this in your architecture design, in your code reviews, or when you write critical code. If you benchmark your code correctly, you should be able to identify that weakest link. Obviously, no one likes to wait, and time is money, so always keep your need for performance in mind.
Do not hesitate to contact me if you have questions. I am here to help! Please click like if you believe this article helped you.
geoffrey.bastien@gmail.com
Originally published at https://www.linkedin.com.