Into the Itanium, Part 2 - Effects of Bundling
The compiler is essentially creating a record of execution; the hardware is merely a playback device, the equivalent of a DVD player. The focus here is parallelism, hence the use of three-instruction bundles. The Itanium architecture can actually handle more than that: it can dispatch two bundles, or six instructions' worth, per cycle if the compiler can find that many pieces to fit together. When it cannot, "no-ops" are added to fill out the bundle, and a "stop" is added to the template to show that the bundle should be executed without waiting for the one after it.
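The packing idea can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions, not the real IA-64 template encoding: the names `Instr`, `Bundle`, and `pack` are made up for this example, dependencies are reduced to "reads or writes a register that an earlier instruction in the same group wrote," and the rules about which unit type may sit in which slot are ignored.

```python
from dataclasses import dataclass

SLOTS_PER_BUNDLE = 3  # an Itanium bundle carries three instruction slots


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: a name plus the registers it reads/writes."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


@dataclass
class Bundle:
    slots: list          # exactly three instruction names, padded with "nop"
    stop: bool = False   # True: later instructions may depend on this group


def depends(later, earlier):
    """Toy check: read-after-write or write-after-write on any register."""
    return bool(later.reads & earlier.writes) or bool(later.writes & earlier.writes)


def close(slots, stop):
    """Pad a partial bundle with no-ops and record whether it ends with a stop."""
    padding = ["nop"] * (SLOTS_PER_BUNDLE - len(slots))
    return Bundle(slots + padding, stop)


def pack(instrs):
    """Greedily pack instructions, in order, into three-slot bundles."""
    bundles, slots, group = [], [], []
    for ins in instrs:
        if any(depends(ins, prev) for prev in group):
            # The new instruction needs an earlier result: end the group here.
            if slots:
                bundles.append(close(slots, stop=True))
                slots = []
            else:
                bundles[-1].stop = True
            group = []
        slots.append(ins.name)
        group.append(ins)
        if len(slots) == SLOTS_PER_BUNDLE:
            bundles.append(close(slots, stop=False))
            slots = []
    if slots:
        bundles.append(close(slots, stop=True))
    elif bundles:
        bundles[-1].stop = True
    return bundles


if __name__ == "__main__":
    program = [
        Instr("load_a", writes=frozenset({"r1"})),
        Instr("load_b", writes=frozenset({"r2"})),
        Instr("add", reads=frozenset({"r1", "r2"}), writes=frozenset({"r3"})),
        Instr("load_c", writes=frozenset({"r4"})),
    ]
    for bundle in pack(program):
        print(bundle.slots, "stop" if bundle.stop else "")
```

In this sketch the `add` cannot share a group with the two loads it depends on, so the first bundle is padded with a no-op and closed with a stop. A real compiler would try to reorder independent work into the empty slot instead of wasting it.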

When the front end receives a bundle, it takes the instructions and distributes them across the available units. In the Itanium, there are a lot of those to work with. For general-purpose registers (GPRs), which data has to pass through before being worked on, the programmer model offers a massive 128. Compare that with the meager 8 GPRs in x86-32. Itanium also has 128 floating-point registers, 128 application registers, and 8 branch registers.
All those registers are necessary for two reasons. One is to let all the code execute in parallel without fighting for resources. The other is to let more data sit inside the CPU, which reduces calls to the cache and memory and avoids the latency those operations involve. For operations that are sent to the execution core, there are many possible places to dispatch to. The original Itanium's execution core houses four integer arithmetic logic units (ALUs) with two ports, two floating-point units (FPUs), and three branch units. It can execute two memory operations per cycle, and theoretically all of these could be busy in any given cycle. That's a lot of hardware.
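As a rough illustration of what "all of these could be busy in a given cycle" means, the unit counts quoted above can be turned into a simple capacity check. This is a sketch, not the processor's actual dispersal logic: the names `UNITS` and `fits_in_one_cycle` are invented here, and the real issue rules (ports, templates, slot order) are considerably more involved.

```python
from collections import Counter

# Unit counts for the original Itanium as quoted in the text above.
UNITS = {"int": 4, "fp": 2, "branch": 3, "mem": 2}
MAX_PER_CYCLE = 6  # two bundles of three instructions


def fits_in_one_cycle(kinds):
    """Return True if this mix of instruction kinds could issue together.

    Toy model: only counts units per kind and the six-instruction limit.
    """
    if len(kinds) > MAX_PER_CYCLE:
        return False
    needed = Counter(kinds)
    return all(kind in UNITS and n <= UNITS[kind] for kind, n in needed.items())


if __name__ == "__main__":
    print(fits_in_one_cycle(["mem", "mem", "int", "int", "fp", "branch"]))
    print(fits_in_one_cycle(["fp", "fp", "fp"]))
```

The first mix spreads across different unit types and fits; the second asks for three floating-point units when only two exist, so at least one instruction would have to wait for the next cycle.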

In the current Itanium 2 revision, there are 11 issue ports, created by adding two more multimedia/integer ports. The corresponding execution hardware has also been increased, for a total of six MM/I execution units. Memory throughput was bumped up as well: it can now do two loads and two stores per clock, instead of the previous two loads or two stores, but not both. These were all added after the original Merced chip was found to have weak integer and memory performance compared to its astounding floating-point capabilities.
By comparison, a P4 is a very "narrow" processor. Instructions come in more or less one at a time and, instead of being sent "wide" as in the Itanium, are put into a long pipeline. Each stage can hold a couple of separate uops, depending on the execution units available in the out-of-order execution core and on what can be shifted around without breaking any dependencies. The Itanium's pipeline is only 1/3 the length of the one in the current Prescott iteration of the P4, much closer to the length found in the Pentium 3 or Athlon core. This is one reason why the P4 must reach insane clock speeds to kick out decent performance, while the Itanium 2 runs at a maximum of 1.5GHz. By going "wide," the EPIC architecture simply gets much more done in each of those clock cycles.
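The "wide versus fast" trade-off can be put into back-of-the-envelope numbers. The sketch below uses only the two Itanium 2 figures given in this article (six instructions per cycle, 1.5GHz) and computes peak issue slots per second, which is an upper bound that also counts no-ops, not measured performance. The narrower widths of 1, 2, and 3 are hypothetical values chosen for illustration; they are not P4 specifications.

```python
def peak_slots_per_second(width, clock_hz):
    """Upper bound on instruction slots issued per second."""
    return width * clock_hz


def clock_needed_to_match(target_slots_per_second, width):
    """Clock a machine of the given issue width would need to reach the target."""
    return target_slots_per_second / width


ITANIUM2_WIDTH = 6         # two three-instruction bundles per cycle
ITANIUM2_CLOCK_HZ = 1.5e9  # maximum clock quoted in the article

if __name__ == "__main__":
    target = peak_slots_per_second(ITANIUM2_WIDTH, ITANIUM2_CLOCK_HZ)
    print(f"Itanium 2 peak: {target / 1e9:.1f} billion slots/s")
    # Hypothetical narrower machines, for illustration only.
    for width in (1, 2, 3):
        ghz = clock_needed_to_match(target, width) / 1e9
        print(f"width {width}: would need {ghz:.1f} GHz to match")
```

The narrower the machine, the higher the clock it needs just to match the same peak, which is the arithmetic behind the clock speed gap described above.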
More By DMOS