There is evidence that FP64 performance on "gaming" cards is crippled by having very few or no FP64-capable units. As a result you end up limited to doing 64-bit math "the hard way" in 32-bit registers. That is workable, but it is far from a simple halving of throughput just because the values are double width: there is a lot of additional math involved, because you can't do a simple "add these two registers together" but instead have to do the math the long way around.

From Stack Overflow, "Multiplying 64-bit number by a 32-bit number in 8086 asm": for the final code (with merging) you'd end up with 8 MUL instructions, 3 ADD instructions and about 7 ADC instructions. There would also be additional loads/stores, and extra bytes needed to handle overflow, which might use more registers.

The whole point of vector processors is that they work on streams of instructions and data, and even in a GPU with massive-bandwidth memory, memory access is expensive, especially when your data has a dependency on previous parts of the calculation. By design, a vector processor just wants a stream of "run this simple code against this huge array", and a lot of repeated runs on one piece of data quickly eats up bandwidth and processor cores.

There is also a product-line split: one design for consumer/pro graphics, where FP64 is irrelevant for most of those workloads, and another for compute/datacenter, where there are more FP64 use cases. The compute parts are significantly larger (and more power hungry) and thus more expensive, so they reduced the number of 64-bit units in the consumer parts significantly. Hence pitches like "We are providing an FP64 solution accelerated with tensor cores that happen to use FP16 and FP32."

Most other architectures don't have hardware for floating-point types larger than 64 bits, and therefore chose the plain IEEE-754 double format.
The likely root cause is that the default register size within the units is 32 bits. A 32-bit register can hold two 16-bit values that can be multiplied across, resulting in a doubling of performance. Multiplying 64-bit values, on the other hand, would require either four registers (two 64-bit values split into 32-bit parts each) or memory loads/stores between doing the lower 32 bits and then the higher 32 bits of each 64-bit value. (Double-precision floating-point format, sometimes called FP64 or float64, is a floating-point format used to represent fractional values.) In the case of Itanium, where floating-point registers are 82 bits wide, long double would very likely have the same width, with some padding for proper alignment to 128 bits.