Nvidia Tesla K20 vs Intel Core i7: Speed Comparison

Simmakers Ltd, together with Nvidia and Forsite, has tested the speed of the GPU version of the Frost 3D Universal software package for solving heat transfer problems on the Nvidia Tesla K20 system. The results exceeded all expectations: the computation for an ice wall around the Fukushima Nuclear Power Plant was completed in under 6 minutes.
In order to carry out a detailed comparison of the computational speeds on systems with different configurations, Simmakers prepared special tests that included the following model dimensions:
1) 1 million cells (100x100x100);
2) 3.4 million cells (150x150x150);
3) 8 million cells (200x200x200);
4) 15.6 million cells (250x250x250);
5) 27 million cells (300x300x300);
6) 42.9 million cells (350x350x350);
7) 64 million cells (400x400x400);
8) 91 million cells (450x450x450).
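The cell counts in the list follow directly from the cube of the edge length. A minimal Python sketch that reproduces them (the names `EDGES`, `cell_count` and `in_millions` are illustrative, not part of Frost 3D Universal):

```python
# Edge lengths (cells per side) of the eight cubic test meshes listed above.
EDGES = [100, 150, 200, 250, 300, 350, 400, 450]


def cell_count(edge: int) -> int:
    """Number of cells in a cubic mesh with `edge` cells per side."""
    return edge ** 3


def in_millions(cells: int) -> float:
    """Convert a raw cell count to millions of cells."""
    return cells / 1e6


for edge in EDGES:
    cells = cell_count(edge)
    print(f"{edge}x{edge}x{edge}: {in_millions(cells):.3f} million cells")
```

The largest mesh therefore holds roughly 91 times as many cells as the smallest, so the test series spans almost two orders of magnitude in problem size.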
The test problem was the three-dimensional, non-stationary, two-phase Stefan problem (nonlinear heat transfer with phase change).
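To give a feel for what a two-phase Stefan problem involves, here is a minimal one-dimensional sketch in Python based on the enthalpy method with an explicit time step. This is an assumption made for illustration only: the article does not say which numerical scheme Frost 3D Universal uses, the real solver is three-dimensional, and all material properties below are made-up round numbers.

```python
# Illustrative (assumed) soil properties in SI units; not taken from the article.
C_FROZEN = 2.0e6   # volumetric heat capacity of frozen soil, J/(m^3 K)
C_THAWED = 3.0e6   # volumetric heat capacity of thawed soil, J/(m^3 K)
K_FROZEN = 2.0     # thermal conductivity of frozen soil, W/(m K)
K_THAWED = 1.5     # thermal conductivity of thawed soil, W/(m K)
LATENT = 1.0e8     # volumetric latent heat of fusion, J/m^3
T_FREEZE = 0.0     # phase-change temperature, deg C


def temperature(h):
    """Temperature for volumetric enthalpy h (h = 0: fully frozen at T_FREEZE)."""
    if h < 0.0:
        return T_FREEZE + h / C_FROZEN
    if h > LATENT:
        return T_FREEZE + (h - LATENT) / C_THAWED
    return T_FREEZE  # phase change in progress: temperature stays constant


def conductivity(h):
    """Conductivity, interpolated linearly across the phase-change zone."""
    frac = min(max(h / LATENT, 0.0), 1.0)
    return K_FROZEN + frac * (K_THAWED - K_FROZEN)


def step(h, dx, dt):
    """One explicit time step; the two end cells act as fixed boundaries."""
    temps = [temperature(v) for v in h]
    conds = [conductivity(v) for v in h]
    new_h = list(h)
    for i in range(1, len(h) - 1):
        k_left = 0.5 * (conds[i - 1] + conds[i])
        k_right = 0.5 * (conds[i] + conds[i + 1])
        flux_in = k_left * (temps[i - 1] - temps[i]) / dx    # heat entering from the left
        flux_out = k_right * (temps[i] - temps[i + 1]) / dx  # heat leaving to the right
        new_h[i] = h[i] + dt * (flux_in - flux_out) / dx
    return new_h


def simulate(n_cells=50, dx=0.1, dt=1000.0, n_steps=200):
    """Freeze initially thawed soil (+5 deg C) from a cold boundary at -20 deg C."""
    h = [LATENT + C_THAWED * 5.0] * n_cells
    h[0] = C_FROZEN * -20.0
    for _ in range(n_steps):
        h = step(h, dx, dt)
    return [temperature(v) for v in h]


temps = simulate()
print("frozen cells:", sum(1 for t in temps if t < T_FREEZE))
```

The nonlinearity comes from the enthalpy-temperature relation: while a cell is changing phase, it absorbs or releases latent heat without changing temperature, so the freezing front advances much more slowly than heat would diffuse in a single-phase material.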
For efficient parallelization, the solver code was written in two versions: in C++ with OpenMP parallelization directives for Intel CPUs, and in CUDA C++ for Nvidia GPUs.
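The reason such a solver maps well onto both OpenMP threads and CUDA threads can be sketched in a few lines. The example below is written in Python purely for brevity and assumes an explicit, double-buffered stencil update, which the article does not confirm. Each new cell value depends only on the previous time step, so the mesh can be split into independent chunks. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def update_range(old, new, start, stop, r):
    """Update cells [start, stop) of `new` using only the read-only `old` buffer."""
    for i in range(start, stop):
        new[i] = old[i] + r * (old[i - 1] - 2.0 * old[i] + old[i + 1])


def step_serial(old, r):
    """Reference version: one loop over all interior cells."""
    new = list(old)
    update_range(old, new, 1, len(old) - 1, r)
    return new


def step_parallel(old, r, workers=4):
    """Same update, with the interior cells split into independent chunks."""
    new = list(old)
    last = len(old) - 1
    chunk = -(-(last - 1) // workers)  # ceiling division of the interior size
    bounds = [(1 + w * chunk, min(1 + (w + 1) * chunk, last)) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(update_range, old, new, a, b, r)
                   for a, b in bounds if a < b]
        for future in futures:
            future.result()  # re-raise any worker exception
    return new


field = [float(i % 7) for i in range(103)]
print(step_serial(field, 0.1) == step_parallel(field, 0.1))
```

In pure Python the threads bring no real speedup; the point is only that the chunks never write to the same cell. In C++ the same split would typically be expressed with an OpenMP `parallel for` directive on the loop, and in CUDA by assigning one thread per cell.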
The histogram shows the acceleration of the test computations on different processors, relative to the Core i7, for models of different dimensions.

The graph below shows the absolute computation time, in minutes, for simulating the temperature field over 1 year in a cubic region of 20x20x20 meters.

The Nvidia GTX 560 Ti outperformed the Nvidia GTX 660 Ti in 4 out of 5 tests, even though it has fewer CUDA cores. The result was influenced by the higher clock frequency of the CUDA cores in the Nvidia GTX 560 Ti, as well as by solver optimization for this graphics card.
The Nvidia GTX 560 Ti and Nvidia GTX 660 Ti graphics cards each have 2 GB of memory, which makes them suitable for heat transfer models with up to 42 million nodes. The Nvidia Tesla K20 GPU has 5 GB and can therefore handle up to 105 million simulation nodes.
By comparison, the Intel processors used in the test can use up to 32 GB of RAM, which corresponds to 690 million nodes (assuming that other processes consume 1 GB in total). Of course, computation on such an extremely large mesh would be very slow.
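The capacity figures quoted above imply a fairly constant memory cost per node. The short Python sketch below derives that cost from the quoted numbers only. The per-node byte counts are therefore inferred, not stated by Simmakers, and they assume decimal gigabytes (1 GB = 10^9 bytes).

```python
# (usable memory in GB, maximum node count) as quoted in the text.
QUOTED = {
    "GTX 560 Ti / GTX 660 Ti": (2.0, 42e6),
    "Tesla K20": (5.0, 105e6),
    "Intel CPU (32 GB minus 1 GB reserved)": (31.0, 690e6),
}


def nodes_per_gb(memory_gb, max_nodes):
    """Implied number of nodes that fit into one GB."""
    return max_nodes / memory_gb


def bytes_per_node(memory_gb, max_nodes, bytes_per_gb=1e9):
    """Implied memory footprint of a single node, in bytes."""
    return memory_gb * bytes_per_gb / max_nodes


for name, (memory_gb, max_nodes) in QUOTED.items():
    print(f"{name}: {nodes_per_gb(memory_gb, max_nodes) / 1e6:.1f} million nodes/GB, "
          f"about {bytes_per_node(memory_gb, max_nodes):.0f} bytes/node")
```

Both GPU figures work out to 21 million nodes per GB, while the CPU figure implies slightly more, about 22 million nodes per GB.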
To demonstrate the capabilities of the CUDA and OpenMP versions of the solver, a previously published applied problem was computed: simulation and thermal analysis of ground freezing around the perimeter of the Fukushima Nuclear Power Plant. The computation was run on a PC with an Intel Core i7 processor and on a server with a Tesla K20 graphics card.
The results were as follows: 58 minutes on the Intel Core i7 and 6 minutes on the Tesla K20.
Note that the same computation had previously been run with the single-core version of the solver, which took 192 minutes on the Intel Core i7.
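The three timings can be turned into speedup factors with simple division. The Python sketch below uses only the times quoted in the text; the dictionary keys are illustrative labels.

```python
# Wall-clock times for the Fukushima ground-freezing computation, in minutes.
TIMES_MIN = {
    "i7 single-core": 192.0,
    "i7 OpenMP": 58.0,
    "Tesla K20": 6.0,
}


def speedup(baseline_min, other_min):
    """How many times faster `other_min` is than `baseline_min`."""
    return baseline_min / other_min


openmp_vs_single = speedup(TIMES_MIN["i7 single-core"], TIMES_MIN["i7 OpenMP"])
k20_vs_openmp = speedup(TIMES_MIN["i7 OpenMP"], TIMES_MIN["Tesla K20"])
k20_vs_single = speedup(TIMES_MIN["i7 single-core"], TIMES_MIN["Tesla K20"])

print(f"OpenMP vs single-core: {openmp_vs_single:.1f}x")
print(f"Tesla K20 vs OpenMP:   {k20_vs_openmp:.1f}x")
print(f"Tesla K20 vs single:   {k20_vs_single:.1f}x")
```

On these numbers the Tesla K20 is roughly 10 times faster than the multithreaded Core i7 run and 32 times faster than the single-core run, while OpenMP alone gives about a 3.3-fold gain on the CPU.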