Shader effects are all the rage these days, and there’s little doubt that the bottleneck in graphically intensive titles is increasingly the amount of math graphics chips must perform to render these effects. Rendering more sophisticated effects requires more shaders with longer instruction sequences, so making efficient use of the available shader engines is where ATI’s new architecture hopes to shine. One way to approach this problem would be to increase the number of ALUs per pixel shader pipeline, yet ATI has obviously not gone this route for these initial Radeon X1000 parts, choosing instead to keep the same number of units as in its previous generation. Keeping a GPU’s shader units fed, rather than idle and wasting clock cycles, is paramount in achieving the kind of efficiency ATI hopes for with the Radeon X1000 chips; thus it is rather hard to miss the importance of the Ultra-Threading Dispatch Processor in the above diagram.
Pixel Shader Engine
Each pixel shader pipeline in the Radeon X1000 boards is composed of two ALUs and a branch execution unit; these pipelines are arranged in groups of four, known as quads (the de facto design choice for years, which is why we see past and current GPUs with 4, 8, 16, or 24 “pipelines”). The X1800 XT and X800 XT boast four quads apiece, and the X1600 XT has three. Take note, however, that the two ALUs are not co-equal: each pipeline has a “full” and a “mini” unit, with the latter unable to execute all of the instructions the full ALU can. It is also worth noting that the internal precision of the new architecture has been increased from the FP24 (24-bit floating point) of previous generations to FP32 (32-bit floating point), again with no reduced or partial precision modes.
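To make the quad arithmetic concrete, here is a minimal Python sketch that derives pipeline and ALU counts from the quad counts quoted above. The names (QUADS, pipeline_count, alu_count) are ours for illustration only, and the ALU figure simply assumes the two-ALU layout described for the X1000 parts.

```python
# Figures taken from the text: pipelines come in groups of four (quads),
# and each X1000 pipeline pairs one "full" ALU with one "mini" ALU.
PIPELINES_PER_QUAD = 4
ALUS_PER_PIPELINE = 2

# Quad counts as quoted in the article (dictionary name is illustrative).
QUADS = {"X1800 XT": 4, "X800 XT": 4, "X1600 XT": 3}


def pipeline_count(quads: int) -> int:
    """Number of pixel shader pipelines for a given number of quads."""
    return quads * PIPELINES_PER_QUAD


def alu_count(quads: int) -> int:
    """Total pixel shader ALUs (full + mini), assuming the X1000 layout."""
    return pipeline_count(quads) * ALUS_PER_PIPELINE


if __name__ == "__main__":
    for board, quads in QUADS.items():
        print(board, pipeline_count(quads), "pipelines")
    # ALU totals are only computed for the X1000 boards described above.
    for board in ("X1800 XT", "X1600 XT"):
        print(board, alu_count(QUADS[board]), "ALUs, half full and half mini")
```

Under these assumptions the X1800 XT works out to 16 pipelines and the X1600 XT to 12, which matches the familiar “pipeline” counts mentioned in the parenthetical above.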
Looking back at the architecture diagram, we see that the Ultra-Threading Dispatch unit feeds these quad pixel shader blocks. This processor could perhaps be thought of as an integrated hardware scheduler, evenly distributing the shader work among the pipelines to increase the efficiency of the ALUs. The dispatch unit also manages the shader data in smaller pieces known as threads, aided by the new branch execution unit in each pipeline, which can execute one flow control instruction per clock cycle. This design improves efficiency in dealing with dynamic branching, an important aspect of any forward-looking SM3.0 architecture. The dispatch processor also feeds the texture address units, which are no longer a part of each pixel pipeline as in previous designs. With the texture units decoupled from the pipelines, the dispatch unit can help keep texturing from stalling the pixel shader pipelines, another design decision with an eye toward increased efficiency. ATI claims this change to its pixel shader engine should achieve 90% efficiency in the pixel shader pipeline regardless of the shader being processed.
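The general idea of keeping ALUs busy by switching between threads while a texture fetch is outstanding can be illustrated with a toy simulation. To be clear, this is not ATI’s scheduling algorithm, and it models nothing about the real hardware: the single ALU, the "T"/"A" op encoding, the fixed fetch latency, and the function name alu_utilization are all assumptions made up for this sketch.

```python
def alu_utilization(threads, tex_latency):
    """Toy model: one ALU shared by several threads.

    Each thread is a string of ops: "A" is a 1-cycle ALU op, "T" is a
    texture fetch that blocks only its own thread for `tex_latency` cycles
    (the fetch runs on a separate, decoupled unit and costs no ALU cycle).
    Returns the fraction of cycles in which the ALU did work.
    """
    pc = [0] * len(threads)        # index of the next op in each thread
    ready_at = [0] * len(threads)  # cycle at which each thread may run again
    cycle = 0
    busy = 0
    while any(pc[i] < len(ops) for i, ops in enumerate(threads)):
        # Issue pending texture fetches for every thread that is ready.
        for i, ops in enumerate(threads):
            while pc[i] < len(ops) and ready_at[i] <= cycle and ops[pc[i]] == "T":
                pc[i] += 1
                ready_at[i] = cycle + tex_latency
        # Run one ALU op from the first thread that is ready for it.
        for i, ops in enumerate(threads):
            if pc[i] < len(ops) and ready_at[i] <= cycle and ops[pc[i]] == "A":
                pc[i] += 1
                busy += 1
                break
        cycle += 1
    return busy / cycle if cycle else 0.0


if __name__ == "__main__":
    shader = "TAAAA" * 4  # hypothetical shader: one fetch, then four ALU ops
    print(alu_utilization([shader], tex_latency=4))
    print(alu_utilization([shader, shader], tex_latency=4))
```

In this toy model a single thread leaves the ALU idle half the time, while a second thread fills most of those gaps (roughly 89% utilization). The exact numbers depend entirely on the made-up op mix and latency, so they say nothing about ATI’s 90% claim; the point is only that having other threads ready to run hides fetch latency.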
Memory access has traditionally been one of the bottlenecks for graphics processing, particularly for a bandwidth-hungry feature such as anti-aliasing. To improve performance in these areas and, again, improve overall efficiency, ATI has engineered a new memory controller for the Radeon X1000 family. The R300 and R400 chips boasted 256-bit wide controllers composed of four 64-bit channels; in contrast, the new controller consists of eight 32-bit channels that feed a bi-directional ring bus, with the ring stops arbitrating memory access requests. The X1600 XT, though, as a smaller, mainstream chip, has a memory controller half as wide as that of the X1800s.
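The channel arithmetic can be checked with a short Python sketch. The channel counts and widths come from the paragraph above; the helper names and the 1000 MT/s data rate in the bandwidth example are hypothetical placeholders, not the actual memory clocks of any of these boards.

```python
def bus_width_bits(channels: int, bits_per_channel: int) -> int:
    """Total memory bus width for a controller built from equal channels."""
    return channels * bits_per_channel


def peak_bandwidth_gb_s(bus_bits: int, effective_mt_s: float) -> float:
    """Theoretical peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_bits / 8 * effective_mt_s * 1e6 / 1e9


old_width = bus_width_bits(4, 64)   # R300/R400: four 64-bit channels
new_width = bus_width_bits(8, 32)   # X1800: eight 32-bit channels
x1600_width = new_width // 2        # X1600 XT: half the width, per the text

if __name__ == "__main__":
    print(old_width, new_width, x1600_width)
    # Hypothetical data rate, chosen only to show the formula at work.
    print(peak_bandwidth_gb_s(new_width, 1000))
```

Both layouts add up to the same 256-bit total, so the change is in granularity rather than raw width: eight narrower channels give the controller more independent paths to arbitrate between.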
This new controller circles the die of the chip, theoretically helping hide latency by the nature of its topology; furthermore, because of its physical placement along the outer edges of the chip, where less heat is produced, ATI claims the new controller can be clocked at higher speeds to accommodate faster RAM as it becomes available. In addition, the new controller is programmable, allowing ATI to analyze memory access patterns and optimize the controller’s operations on a per-application basis. This controller, along with the Ultra-Threading Dispatch unit, probably contributes the most to the transistor increase R520 has over previous chips (321 million compared to the X800 XT’s 160 million). We’ll examine how well the X1800 XT performs at high resolutions with anti-aliasing compared to the older X800 XT.
A few other details of the new architecture are also worth mentioning. Aside from VS3.0 support, the vertex shader engine remains relatively unchanged from the previous generation, though the number of units has been increased from six to eight for the high-end boards. This increase, along with the higher clock speed, should allow for faster vertex processing in the X1800 XT than in previous generations. In addition, ATI now uses floating point calculations for early Z and occlusion detection, which the company claims improves hidden surface removal by 60%. Removing such occluded pixels early in the raster process is yet another step toward improving efficiency. And, lest we forget, Radeon X1000 parts like the X1600 and X1800 can be paired with Crossfire master boards for a dual PEG configuration; in fact, SimHQ will be reviewing a Radeon X1800 XT Crossfire solution in the near future.