So the question came in:

We're running the same calculations with the 32-bit and 64-bit versions of [a C# application] and getting different results. We don't understand why this is happening. Can you explain?

Floating-point calculations on a computer are famously fickle, but if you run the same calculation on identical inputs, wouldn’t you expect the same output at the other end?

If you were using IEEE standard doubles you might. As that Wikipedia page says:

"The reproducibility clause recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language), and describes what needs to be done to achieve reproducible results."

Well, if the program is the same it should be reproducible, shouldn’t it?

Probably, if you assume you are following the standard. However, if you read the CLI standard, you find that Section 12.1.3 (Virtual Execution System: Supported data types: Handling of floating-point data types) says an implementation is free to use whatever internal representation is available on the machine, provided it has at least 32 bits (single) or 64 bits (double). This implies that there’s no guarantee of strict IEEE compliance. (Most languages don’t guarantee it either; it’s usually slightly faster not to follow the standard to the letter.)
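As a rough illustration of that latitude: the same section of the CLI standard also says that the freedom to carry extra precision ends at an explicit conversion (conv.r4 or conv.r8), which is what a C# cast to double asks for. The sketch below contrasts an expression that leaves its intermediates to the runtime with one that pins each intermediate to 64 bits. The class and method names are hypothetical and not taken from the application in question. On a runtime that already evaluates everything in 64-bit precision the two should agree; the casts only matter where wider intermediates are in play.

```csharp
using System;

static class NarrowingSketch
{
    // Leaves the precision of the intermediate products to the runtime.
    public static double SumOfProducts(double a, double b, double c, double d)
    {
        return a * b + c * d;
    }

    // Explicit casts request that each intermediate be rounded to a
    // 64-bit double before it is used again.
    public static double SumOfProductsNarrowed(double a, double b, double c, double d)
    {
        double ab = (double)(a * b);
        double cd = (double)(c * d);
        return (double)(ab + cd);
    }
}
```

Whether the extra casts change anything for a given calculation depends on the JIT and the hardware, so treat this as something to experiment with rather than a guaranteed fix.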

To investigate what’s happened, we can write what is arguably the world’s least interesting vector addition program in C#.

double[] x = new double[1000];
double[] y = new double[1000];
double[] z = new double[1000];

int xx = 234;
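int yy = 890;

for (int i = 0; i < 1000; ++i)
{
    // Sorta-random :-) (int overflow wraps silently in C#'s default unchecked context)
    xx = 16807*xx;
    yy = 16807*yy + 1;
    x[i] = (double)xx;
    y[i] = (double)yy;
}

// Loop forever so there is time to break in with the debugger
while (true)
{
    for (int i = 0; i < x.Length; ++i)
    {
        z[i] = x[i] + y[i];
    }
}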

If we compile this in release mode, run it outside the debugger, and then interrupt it with windbg, we can dump out the assembler and compare it. The result is below, with the 32-bit assembler on the left and the 64-bit on the right.

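So the 32-bit code is using the 80-bit x87 FPU registers (fild, fld, fadd, fstp), while the 64-bit code is using the 128-bit SSE registers (cvtsi2sd, movsd, addsd). Neither is using packed SIMD instructions, despite this being obviously parallelizable: the 64-bit code uses only the scalar forms, which operate on one double at a time.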
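For comparison, here is a minimal sketch of what an explicitly vectorized addition could look like using System.Numerics.Vector<T>. That type is a later addition to .NET than the JIT output examined here, so this is an assumption about what you might write today, not what the compiler did. The class name is hypothetical, and the method assumes y and z are at least as long as x.

```csharp
using System;
using System.Numerics;

static class VectorAddSketch
{
    // Adds x and y element-wise into z.
    // Assumes y and z are at least as long as x.
    public static void Add(double[] x, double[] y, double[] z)
    {
        int width = Vector<double>.Count; // doubles per hardware vector
        int i = 0;

        // Process as many full vector-width chunks as fit.
        for (; i <= x.Length - width; i += width)
        {
            var vx = new Vector<double>(x, i);
            var vy = new Vector<double>(y, i);
            (vx + vy).CopyTo(z, i);
        }

        // Handle any leftover elements one at a time.
        for (; i < x.Length; ++i)
        {
            z[i] = x[i] + y[i];
        }
    }
}
```

Each element is still a single double addition, so the packed version should produce the same values as the scalar loop; the difference is in how many elements are processed per instruction.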

Text version of the disassembly (32-bit first, then 64-bit)


00ce00aa b9ea000000      mov     ecx,0EAh
00ce00af b87a030000      mov     eax,37Ah
00ce00b4 33d2            xor     edx,edx
00ce00b6 69c9a7410000    imul    ecx,ecx,41A7h
00ce00bc 69c0a7410000    imul    eax,eax,41A7h
00ce00c2 83c001          add     eax,1
00ce00c5 894df0          mov     dword ptr [ebp-10h],ecx
00ce00c8 db45f0          fild    dword ptr [ebp-10h]
00ce00cb 3b5604          cmp     edx,dword ptr [esi+4]
00ce00ce 7346            jae     00ce0116
00ce00d0 dd5cd608        fstp    qword ptr [esi+edx*8+8]
00ce00d4 8945f0          mov     dword ptr [ebp-10h],eax
00ce00d7 db45f0          fild    dword ptr [ebp-10h]
00ce00da 3b5704          cmp     edx,dword ptr [edi+4]
00ce00dd 7337            jae     00ce0116
00ce00df dd5cd708        fstp    qword ptr [edi+edx*8+8]
00ce00e3 83c201          add     edx,1
00ce00e6 81fae8030000    cmp     edx,3E8h
00ce00ec 7cc8            jl      00ce00b6
00ce00ee 33d2            xor     edx,edx
00ce00f0 8b4604          mov     eax,dword ptr [esi+4]
00ce00f3 85c0            test    eax,eax
00ce00f5 7ef7            jle     00ce00ee
00ce00f7 dd44d608        fld     qword ptr [esi+edx*8+8]
00ce00fb 3b5704          cmp     edx,dword ptr [edi+4]
00ce00fe 7316            jae     00ce0116
00ce0100 dc44d708        fadd    qword ptr [edi+edx*8+8]
00ce0104 3b5304          cmp     edx,dword ptr [ebx+4]
00ce0107 730d            jae     00ce0116
00ce0109 dd5cd308        fstp    qword ptr [ebx+edx*8+8]
00ce010d 83c201          add     edx,1
00ce0110 3bc2            cmp     eax,edx
00ce0112 7fe3            jg      00ce00f7
00ce0114 ebd8            jmp     00ce00ee


000007ff`00190163 41b87a030000    mov     r8d,37Ah
000007ff`00190169 41b9ea000000    mov     r9d,0EAh
000007ff`0019016f 90              nop
000007ff`00190170 b8a7410000      mov     eax,41A7h
000007ff`00190175 410fafc1        imul    eax,r9d
000007ff`00190179 b9a7410000      mov     ecx,41A7h
000007ff`0019017e 410fafc8        imul    ecx,r8d
000007ff`00190182 83c101          add     ecx,1
000007ff`00190185 f20f2ac0        cvtsi2sd xmm0,eax
000007ff`00190189 f20f11441310    movsd   mmword ptr [rbx+rdx+10h],xmm0
000007ff`0019018f f20f2ac1        cvtsi2sd xmm0,ecx
000007ff`00190193 f20f11441710    movsd   mmword ptr [rdi+rdx+10h],xmm0
000007ff`00190199 41b9a7410000    mov     r9d,41A7h
000007ff`0019019f 440fafc8        imul    r9d,eax
000007ff`001901a3 41b8a7410000    mov     r8d,41A7h
000007ff`001901a9 440fafc1        imul    r8d,ecx
000007ff`001901ad 4183c001        add     r8d,1
000007ff`001901b1 f2410f2ac1      cvtsi2sd xmm0,r9d
000007ff`001901b6 f20f11441318    movsd   mmword ptr [rbx+rdx+18h],xmm0
000007ff`001901bc f2410f2ac0      cvtsi2sd xmm0,r8d
000007ff`001901c1 f20f11441718    movsd   mmword ptr [rdi+rdx+18h],xmm0
000007ff`001901c7 4883c210        add     rdx,10h
000007ff`001901cb 4881fa401f0000  cmp     rdx,1F40h
000007ff`001901d2 7c9c            jl      000007ff`00190170
000007ff`001901d4 488b4308        mov     rax,qword ptr [rbx+8]
000007ff`001901d8 0f1f840000000000 nop     dword ptr [rax+rax]
000007ff`001901e0 b8e8030000      mov     eax,3E8h
000007ff`001901e5 85c0            test    eax,eax
000007ff`001901e7 7e28            jle     000007ff`00190211
000007ff`001901e9 33c0            xor     eax,eax
000007ff`001901eb 0f1f440000      nop     dword ptr [rax+rax]
000007ff`001901f0 f20f10440310    movsd   xmm0,mmword ptr [rbx+rax+10h]
000007ff`001901f6 f20f58440710    addsd   xmm0,mmword ptr [rdi+rax+10h]
000007ff`001901fc f2410f11440310  movsd   mmword ptr [r11+rax+10h],xmm0
000007ff`00190203 480508000000    add     rax,8
000007ff`00190209 483d401f0000    cmp     rax,1F40h
000007ff`0019020f 7cdf            jl      000007ff`001901f0
000007ff`00190211 ebcd            jmp     000007ff`001901e0
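To check whether the two builds really disagree on a given value, it can help to compare raw bit patterns and not just printed decimals, since default formatting may hide a difference in the last bit. The following is a small sketch of such a helper; the class and method names are hypothetical, and it only uses BitConverter.DoubleToInt64Bits, Environment.Is64BitProcess and the "R" round-trip format.

```csharp
using System;

static class BitDumpSketch
{
    // Formats a double as its raw 64-bit pattern in hexadecimal.
    public static string ToHexBits(double value)
    {
        return BitConverter.DoubleToInt64Bits(value).ToString("X16");
    }

    // Tags the value with the bitness of the running process so that
    // output from the 32-bit and 64-bit builds can be compared line by line.
    public static string Describe(double value)
    {
        string bitness = Environment.Is64BitProcess ? "64-bit" : "32-bit";
        return bitness + " " + ToHexBits(value) + " " + value.ToString("R");
    }
}
```

Running the same calculation in both builds and comparing the output of Describe line by line should show exactly which results differ and by how many bits.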