Into the Itanium, Part 2 - Software Pipelining and Register Stacking
(Page 4 of 4)
The last part of the Itanium architecture I'm going to discuss is of more use to programmers than it is to hardware guys, but it is another example of how the architecture is optimized to speed the execution of modern code. Those of you who have done a fair amount of programming know that loops are a standard structure in code. A loop is a block of code that is executed over and over again until an exit condition is met. Thanks to the large number of registers and execution units available in hardware, software pipelining can substantially cut the number of cycles needed to complete a loop.

Software pipelining works much like hardware pipelining: rather than waiting for one pass through the loop to finish before starting the next, the processor overlaps them. It is able to keep all three loop iterations "in flight" at once, just in various stages of completion. As can be seen in the graphic, the "software pipelined" version completes all three iterations in the same time it takes the non-pipelined version to complete two. A normal x86 processor simply doesn't have enough programmer-visible registers or execution units for something like this to occur.
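
To put rough numbers on that overlap, here is a minimal sketch of the cycle arithmetic. The 4-cycle loop body and the 2-cycle gap between iteration starts are hypothetical values picked only to reproduce the "three in the time of two" ratio described above; they are not Itanium measurements, and the function names are made up for this illustration.

```python
def sequential_cycles(iterations, body_cycles):
    """Cycles used when each iteration must finish before the next starts."""
    return iterations * body_cycles


def pipelined_cycles(iterations, body_cycles, initiation_interval):
    """Cycles used when a new iteration starts every `initiation_interval` cycles."""
    if iterations == 0:
        return 0
    # The first iteration takes the full body; each later one adds one interval.
    return body_cycles + (iterations - 1) * initiation_interval


def schedule(iterations, body_cycles, initiation_interval):
    """Return the (start, end) cycle of each overlapped iteration."""
    return [
        (i * initiation_interval, i * initiation_interval + body_cycles)
        for i in range(iterations)
    ]


if __name__ == "__main__":
    BODY, INTERVAL = 4, 2  # hypothetical values, chosen for illustration only
    for i, (start, _end) in enumerate(schedule(3, BODY, INTERVAL)):
        # Draw each iteration as a bar shifted right by its start cycle.
        print(f"iteration {i}: " + " " * start + "#" * BODY)
    print("non-pipelined, 2 iterations:", sequential_cycles(2, BODY))
    print("pipelined,     3 iterations:", pipelined_cycles(3, BODY, INTERVAL))
```

The point of the sketch is only that the cost per extra iteration drops from the full body length to the start-to-start interval. How small that interval can get depends on having enough registers and execution units to hold several iterations' worth of work at once.
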
Register stacking is another feature that can't really be done effectively when only 8 GPRs are visible to the programmer. The first 32 GPRs, 0-31, are considered "global," and values kept there are available to all procedures. Above that, a window or "frame" is created for the registers that belong to a single procedure: its inputs, its local variables, and its outputs. When you call a further nested procedure, the caller's output registers are renamed to become the inputs of the callee, which then adds its own local variables and outputs on top. On a normal x86, there aren't enough registers for this, so you would have to save register contents to the stack in memory before beginning the next procedure. When the innermost procedure completes, it hands back its results and the registers are simply renamed back to the caller's frame, without having to restore the previous state from memory. Renaming registers is obviously much faster than a round trip to memory.
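
The renaming idea can be sketched with a toy model. Everything here is an assumption made for illustration: the `RegisterStack` class, its method names, and the 96-entry size are hypothetical, the global registers are left out, and the model simply raises an error when it runs out of registers, which real hardware would have to handle differently. It is meant to show overlapping frames, not to describe the actual Itanium mechanism in detail.

```python
class RegisterStack:
    """Toy model of a stacked register file with overlapping frames."""

    FIRST_STACKED = 32  # r0-r31 are the global registers, not modeled here

    def __init__(self, size=96):
        self.physical = [0] * size
        self.base = 0        # physical slot that the current r32 maps to
        self.frame = (0, 0)  # (inputs + locals, outputs) of current procedure
        self.saved = []      # callers' (base, frame), restored on return

    def alloc(self, ins_and_locals, outs):
        """Size the current procedure's frame."""
        if self.base + ins_and_locals + outs > len(self.physical):
            raise OverflowError("toy model does not spill to memory")
        self.frame = (ins_and_locals, outs)

    def _index(self, reg):
        offset = reg - self.FIRST_STACKED
        if not 0 <= offset < sum(self.frame):
            raise IndexError(f"r{reg} is outside the current frame")
        return self.base + offset

    def write(self, reg, value):
        self.physical[self._index(reg)] = value

    def read(self, reg):
        return self.physical[self._index(reg)]

    def call(self):
        """Enter a callee: the caller's outputs become the callee's r32 onward."""
        self.saved.append((self.base, self.frame))
        ins_and_locals, outs = self.frame
        self.base += ins_and_locals  # slide the window past the caller's locals
        self.frame = (outs, 0)       # callee starts with only its inputs

    def ret(self):
        """Return: remap names to the caller's frame; no values are copied."""
        self.base, self.frame = self.saved.pop()


rs = RegisterStack()
rs.alloc(ins_and_locals=2, outs=1)  # caller: r32-r33 are locals, r34 is an output
rs.write(32, 111)                   # a caller local
rs.write(34, 7)                     # argument for the callee
rs.call()
rs.alloc(ins_and_locals=2, outs=0)  # callee: r32 is its input, r33 a local
print(rs.read(32))                  # 7: the caller's r34 under a new name
rs.write(33, rs.read(32) * 2)       # callee scribbles on its own local
rs.ret()
print(rs.read(32))                  # 111: caller's local untouched, nothing reloaded
```

Note that `call` and `ret` only change which physical slot the name r32 points at. No register contents move, which is the whole reason the approach beats saving and restoring through memory.
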

Conclusion
I hope you've gained some insight into how the IA-64 architecture differs from the IA-32 architecture, whose x86 lineage goes back to the very first PCs. With current processors hitting the limits of process technology, and raw speed getting harder to add as a result, it's well past time we looked to other methods of adding performance. Itanium, while meant mostly for the "big tin" of servers and gigantic number crunching machines, certainly possesses many advantages over its x86 (IA-32) predecessor.
At the moment the hardware itself is far from being something that can be put into desktops, but the basic architecture is a step in the right direction: it goes "wide," adds features that specifically speed up the kinds of code programmers write, and removes memory bottlenecks. In our next article on Itanium, we'll look at the hardware that is available right now in the form of the Madison core.
DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware.