So the question came in:
We're running the same calculations with the 32-bit and 64-bit versions of [a C# application] and getting different results. We don't understand why this is happening. Can you explain?
Floating-point calculations on a computer are famously fickle, but if you run the same calculation on identical inputs, wouldn’t you expect the same output at the other end?
If you were using IEEE standard doubles you might. As the Wikipedia page on IEEE 754 says:
"The reproducibility clause recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language), and describes what needs to be done to achieve reproducible results."
Well, if the program is the same it should be reproducible, shouldn’t it?
Probably, if you assume the implementation follows the standard. However, Section 12.1.3 of the CLI standard (Virtual Execution System: Supported data types: Handling of floating-point data types) says that an implementation is free to use any internal representation available on the machine, provided it has at least 32 bits (single) or 64 bits (double). So there is no guarantee of strict IEEE compliance. Most languages don’t guarantee it either, since it is usually slightly faster not to follow the standard to the letter.
To investigate what’s happened we can write what is arguably the world’s least interesting vector addition program in C#.
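To make the consequence of that clause concrete, here is a minimal sketch of a calculation whose answer depends on how wide the intermediate product is. The class and method names are made up for illustration. It relies on one assumption: that an explicit `(double)` cast is compiled to a narrowing conversion (`conv.r8`), which is the usual way to ask the runtime to round an intermediate back to 64 bits. In exact arithmetic the product is 1 - 2^-54, which rounds to exactly 1.0 as a 64-bit double but survives intact in an 80-bit register, so the plain version may or may not print zero depending on the runtime and build.

```csharp
using System;

public static class PrecisionDemo
{
    // Computes a*b + c and lets the runtime choose how wide the
    // intermediate product is (the CLI permits wider than 64 bits).
    public static double MulAdd(double a, double b, double c)
    {
        return a * b + c;
    }

    // Same arithmetic, but every intermediate is explicitly cast to double.
    // Assumption: each cast narrows the value to a true 64-bit double.
    public static double MulAddNarrowed(double a, double b, double c)
    {
        double product = (double)(a * b);
        return (double)(product + c);
    }

    public static void Main()
    {
        double a = 1.0 / 3.0;
        Console.WriteLine("pointer size: " + IntPtr.Size);
        Console.WriteLine("plain:    " + MulAdd(a, 3.0, -1.0).ToString("R"));
        Console.WriteLine("narrowed: " + MulAddNarrowed(a, 3.0, -1.0).ToString("R"));
    }
}
```

Running this as both a 32-bit and a 64-bit build and comparing the two "plain" lines is a quick check on whether a given runtime keeps extra precision in intermediates.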
    double[] x = new double[1000];
    double[] y = new double[1000];
    double[] z = new double[1000];
    int xx = 234;
    int yy = 890;
    for (int i = 0; i < 1000; ++i)
    {
        // Sorta-random :-)
        xx = 16807*xx;
        yy = 16807*yy + 1;
        x[i] = (double)xx;
        y[i] = (double)yy;
    }
    while (true)
    {
        for (int i = 0; i < x.Length; ++i)
            z[i] = x[i] + y[i];
    }
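If we compile this in release mode, run it outside the debugger, and then break in with WinDbg, we can dump the generated assembly and compare the two builds. The result is below, 32-bit first and then 64-bit.
So the 32-bit build uses the x87 FPU (fld, fadd, fstp) with its 80-bit registers, while the 64-bit build uses scalar SSE2 instructions (movsd, addsd) on the 128-bit XMM registers. The x87 registers can carry intermediate results at higher precision than a 64-bit double, whereas the scalar SSE2 instructions round to 64 bits after every operation, which is how identical source code can yield different answers. Neither build uses packed SIMD instructions, despite the loop being obviously parallelizable.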
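When chasing this kind of discrepancy it helps to compare raw bit patterns rather than printed decimals, since formatting can hide a difference in the last bit. The following is a minimal sketch along those lines, not part of the original investigation: the class name `BitDump` and its helpers are hypothetical, and the overflow in the input generator is made explicit with `unchecked` on the assumption that the original was compiled without overflow checking.

```csharp
using System;

public static class BitDump
{
    // Rebuilds the post's inputs and returns the element-wise sum.
    public static double[] VectorSum()
    {
        double[] x = new double[1000];
        double[] y = new double[1000];
        double[] z = new double[1000];
        int xx = 234;
        int yy = 890;
        for (int i = 0; i < 1000; ++i)
        {
            unchecked
            {
                // Wraps around on overflow, as in the original generator.
                xx = 16807 * xx;
                yy = 16807 * yy + 1;
            }
            x[i] = (double)xx;
            y[i] = (double)yy;
        }
        for (int i = 0; i < x.Length; ++i)
            z[i] = x[i] + y[i];
        return z;
    }

    // XORs the IEEE 754 bit patterns of all values into one fingerprint.
    public static long XorBits(double[] values)
    {
        long acc = 0;
        foreach (double v in values)
            acc ^= BitConverter.DoubleToInt64Bits(v);
        return acc;
    }

    public static void Main()
    {
        double[] z = VectorSum();
        Console.WriteLine((IntPtr.Size * 8) + "-bit process");
        Console.WriteLine("z[0] bits:   " + BitConverter.DoubleToInt64Bits(z[0]).ToString("X16"));
        Console.WriteLine("xor of bits: " + XorBits(z).ToString("X16"));
    }
}
```

Saving the output of the 32-bit and 64-bit builds to files and diffing them shows whether any result differs. If the fingerprints disagree, printing the bits per element narrows down the first index that diverges.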
Text version of the Disassembly
32-bit
    00ce00aa  b9ea000000    mov ecx,0EAh
    00ce00af  b87a030000    mov eax,37Ah
    00ce00b4  33d2          xor edx,edx
    00ce00b6  69c9a7410000  imul ecx,ecx,41A7h
    00ce00bc  69c0a7410000  imul eax,eax,41A7h
    00ce00c2  83c001        add eax,1
    00ce00c5  894df0        mov dword ptr [ebp-10h],ecx
    00ce00c8  db45f0        fild dword ptr [ebp-10h]
    00ce00cb  3b5604        cmp edx,dword ptr [esi+4]
    00ce00ce  7346          jae 00ce0116
    00ce00d0  dd5cd608      fstp qword ptr [esi+edx*8+8]
    00ce00d4  8945f0        mov dword ptr [ebp-10h],eax
    00ce00d7  db45f0        fild dword ptr [ebp-10h]
    00ce00da  3b5704        cmp edx,dword ptr [edi+4]
    00ce00dd  7337          jae 00ce0116
    00ce00df  dd5cd708      fstp qword ptr [edi+edx*8+8]
    00ce00e3  83c201        add edx,1
    00ce00e6  81fae8030000  cmp edx,3E8h
    00ce00ec  7cc8          jl 00ce00b6
    00ce00ee  33d2          xor edx,edx
    00ce00f0  8b4604        mov eax,dword ptr [esi+4]
    00ce00f3  85c0          test eax,eax
    00ce00f5  7ef7          jle 00ce00ee
    00ce00f7  dd44d608      fld qword ptr [esi+edx*8+8]
    00ce00fb  3b5704        cmp edx,dword ptr [edi+4]
    00ce00fe  7316          jae 00ce0116
    00ce0100  dc44d708      fadd qword ptr [edi+edx*8+8]
    00ce0104  3b5304        cmp edx,dword ptr [ebx+4]
    00ce0107  730d          jae 00ce0116
    00ce0109  dd5cd308      fstp qword ptr [ebx+edx*8+8]
    00ce010d  83c201        add edx,1
    00ce0110  3bc2          cmp eax,edx
    00ce0112  7fe3          jg 00ce00f7
    00ce0114  ebd8          jmp 00ce00ee
64-bit
    000007ff`00190163  41b87a030000      mov r8d,37Ah
    000007ff`00190169  41b9ea000000      mov r9d,0EAh
    000007ff`0019016f  90                nop
    000007ff`00190170  b8a7410000        mov eax,41A7h
    000007ff`00190175  410fafc1          imul eax,r9d
    000007ff`00190179  b9a7410000        mov ecx,41A7h
    000007ff`0019017e  410fafc8          imul ecx,r8d
    000007ff`00190182  83c101            add ecx,1
    000007ff`00190185  f20f2ac0          cvtsi2sd xmm0,eax
    000007ff`00190189  f20f11441310      movsd mmword ptr [rbx+rdx+10h],xmm0
    000007ff`0019018f  f20f2ac1          cvtsi2sd xmm0,ecx
    000007ff`00190193  f20f11441710      movsd mmword ptr [rdi+rdx+10h],xmm0
    000007ff`00190199  41b9a7410000      mov r9d,41A7h
    000007ff`0019019f  440fafc8          imul r9d,eax
    000007ff`001901a3  41b8a7410000      mov r8d,41A7h
    000007ff`001901a9  440fafc1          imul r8d,ecx
    000007ff`001901ad  4183c001          add r8d,1
    000007ff`001901b1  f2410f2ac1        cvtsi2sd xmm0,r9d
    000007ff`001901b6  f20f11441318      movsd mmword ptr [rbx+rdx+18h],xmm0
    000007ff`001901bc  f2410f2ac0        cvtsi2sd xmm0,r8d
    000007ff`001901c1  f20f11441718      movsd mmword ptr [rdi+rdx+18h],xmm0
    000007ff`001901c7  4883c210          add rdx,10h
    000007ff`001901cb  4881fa401f0000    cmp rdx,1F40h
    000007ff`001901d2  7c9c              jl 000007ff`00190170
    000007ff`001901d4  488b4308          mov rax,qword ptr [rbx+8]
    000007ff`001901d8  0f1f840000000000  nop dword ptr [rax+rax]
    000007ff`001901e0  b8e8030000        mov eax,3E8h
    000007ff`001901e5  85c0              test eax,eax
    000007ff`001901e7  7e28              jle 000007ff`00190211
    000007ff`001901e9  33c0              xor eax,eax
    000007ff`001901eb  0f1f440000        nop dword ptr [rax+rax]
    000007ff`001901f0  f20f10440310      movsd xmm0,mmword ptr [rbx+rax+10h]
    000007ff`001901f6  f20f58440710      addsd xmm0,mmword ptr [rdi+rax+10h]
    000007ff`001901fc  f2410f11440310    movsd mmword ptr [r11+rax+10h],xmm0
    000007ff`00190203  480508000000      add rax,8
    000007ff`00190209  483d401f0000      cmp rax,1F40h
    000007ff`0019020f  7cdf              jl 000007ff`001901f0
    000007ff`00190211  ebcd              jmp 000007ff`001901e0