Threadripper Parallel Efficiency Improved! Gaussian16 Benchmark

In our previous post, we benchmarked Gaussian16 on the Threadripper 1950X. We mentioned that Threadripper is quite good for calculations, but that its parallel efficiency is rather bad…

We have been testing our setup and have finally managed to improve the parallel efficiency!

After the improvement, the parallel efficiency (expressed as a speedup factor) is as follows.
For the opt calculation: 1.99 (4 → 8 cores), 1.77 (8 → 16 cores) and 3.53 (4 → 16 cores).
For the freq calculation: 1.90 (4 → 8 cores), 1.85 (8 → 16 cores) and 3.51 (4 → 16 cores).

In our previous post, the parallel efficiency was around 2 (4 → 16 cores), so this is big progress!

We explain the details below.
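To put these factors in perspective, here is a minimal Python sketch that converts each quoted speedup into a fraction of the ideal speedup. It assumes the quoted figures are ratios of wall-clock times (fewer cores divided by more cores); the function and variable names are our own and are only for illustration.

```python
# Speedup factors quoted in this post.
# Assumption: each value is (time with fewer cores) / (time with more cores).
speedups = {
    "opt": {(4, 8): 1.99, (8, 16): 1.77, (4, 16): 3.53},
    "freq": {(4, 8): 1.90, (8, 16): 1.85, (4, 16): 3.51},
}


def efficiency(speedup, cores_before, cores_after):
    """Return the fraction of the ideal speedup achieved (1.0 = perfect scaling)."""
    ideal = cores_after / cores_before
    return speedup / ideal


for job, steps in speedups.items():
    for (n0, n1), s in steps.items():
        eff = efficiency(s, n0, n1)
        print(f"{job:4s} {n0:2d} -> {n1:2d} cores: speedup {s:.2f}, {eff:.1%} of ideal")
```

Read this way, a factor of 3.53 for 4 → 16 cores corresponds to roughly 88% of the ideal fourfold speedup.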

Japanese version: Threadripper Parallel Efficiency Improved? [Gaussian16]

What changed?

For the computer setup, please read our previous article. Last time we used 2 RAM modules (8 GB x 2); this time we used 4 RAM modules (8 GB x 4), and this improved the parallel efficiency. We also compared NUMA mode and UMA mode this time, as described below.

Computational Method

As in the previous post, we ran the optimization (opt) and frequency (freq) calculations for the natural product “vomilenine”.

Improvement of Parallel Efficiency

The detailed calculation results are as follows.

The parallel efficiency is now on par with that of the Xeon E5-2667 v2. This is related to a point we discuss later, but the opt calculation became slower than on the Xeon E5-2667 v2… The freq calculation, however, ran at the same speed.

Some of you might wonder why we compare the Threadripper 1950X with a previous-generation Xeon. This is simply because we own Xeon E5-2667 v2 machines (8 CPUs, 64 cores).

We also made a bar graph of the parallel efficiency, as follows.

2RAM vs 4RAM

Installing 4 RAMs greatly improved the parallel efficiency, but the calculation speed with 4 cores became about 15% slower. On the other hand, the calculation speed with 16 cores became 25% faster for opt and 25% faster for freq.

We also made a bar graph. As you can see, 2 RAMs is faster than 4 RAMs for the 4-core calculation, while 4 RAMs is faster for the 8-core and 16-core calculations.

NUMA vs UMA

In the BIOS of an MSI motherboard, you can change the memory mode under “Advanced Setting > OC > DRAM Setting > Advanced DRAM Configuration > Misc Item > Memory interleaving”.

Note that MSI's naming is confusing:
UMA mode (Distributed) = Die
NUMA mode (Local) = Channel

In conclusion, the calculation is faster in NUMA mode.

It is about 5% faster for freq and about 2% faster for opt. We think this is quite a big difference when calculating large systems. MSI also offers a memory mode called Socket, so we tried it, but the result was exactly the same as in UMA mode (Distributed, Die).
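If you want to confirm which mode the operating system actually sees after changing the BIOS setting, the following is a minimal sketch. It assumes a Linux system, where the kernel lists NUMA nodes under /sys/devices/system/node; the function name `numa_nodes` is our own. On a Threadripper 1950X we would expect two nodes in NUMA (Local) mode and a single node in UMA (Distributed) mode, but we have not verified the output on every kernel.

```python
from pathlib import Path


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: cpulist} for each NUMA node found under sysfs_root.

    Assumption: Linux sysfs layout (node0/cpulist, node1/cpulist, ...).
    Returns an empty dict if the directory does not exist.
    """
    root = Path(sysfs_root)
    nodes = {}
    if not root.is_dir():
        return nodes
    for entry in root.glob("node[0-9]*"):
        cpulist = entry / "cpulist"
        if cpulist.is_file():
            # Directory names look like "node0"; strip the "node" prefix.
            nodes[int(entry.name[4:])] = cpulist.read_text().strip()
    return dict(sorted(nodes.items()))


for node_id, cpus in numa_nodes().items():
    print(f"NUMA node {node_id}: CPUs {cpus}")
```

If only one node is listed, the memory is most likely being presented to the OS as a single uniform pool.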

Other trials

To specify the number of cores, you should use %CPU instead of %nprocshared. With %CPU, the calculation is about 2% faster than with %nprocshared.

Since the calculation speed can decrease when too much memory is allocated to a job, we also tried 16 GB, 8 GB and 4 GB of memory for a 16-core calculation with 2 RAMs, but there was almost no change in speed and no improvement in parallel efficiency.
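As an illustration of the two styles, here is a small Python sketch that builds the Link 0 lines of a Gaussian input. The helper `link0_lines` is hypothetical and not part of Gaussian; it also assumes that cores 0 to N-1 are the ones you want to pin to, which may not map to distinct physical cores if SMT is enabled.

```python
def link0_lines(n_cores, mem_gb, pin_cores=True):
    """Build Link 0 lines for a Gaussian input file.

    pin_cores=True  -> %CPU with an explicit core list (0 .. n_cores-1)
    pin_cores=False -> %NProcShared with only a core count
    Assumption: cores are numbered contiguously from 0.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    if pin_cores:
        cpu_line = "%CPU=0" if n_cores == 1 else f"%CPU=0-{n_cores - 1}"
    else:
        cpu_line = f"%NProcShared={n_cores}"
    return [cpu_line, f"%Mem={mem_gb}GB"]


print("\n".join(link0_lines(16, 8)))                   # pinned to cores 0-15
print("\n".join(link0_lines(16, 8, pin_cores=False)))  # core count only
```

The first call produces a %CPU line listing cores 0 through 15, and the second produces the %NProcShared form; the rest of the input file (route section, title, geometry) is unchanged.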

Summary

Although installing 4 RAMs improved the parallel efficiency, the calculation speed with 4 cores dropped, so it is not clear whether the overall computational efficiency has increased. We usually run four 4-core jobs at a time on the Threadripper, so . . .

Since Threadripper has 4 memory channels, using all of them may contribute to the parallel efficiency. However, we don't think that using 8 RAMs would improve it further, because all 4 channels are already in use.

Technically, EPYC has the same architecture as Threadripper, so it might be well suited to calculations using many cores.

As the parallel efficiency was dramatically improved, we strongly recommend using Threadripper for your calculations!

In our next post, we plan to benchmark test397 and test310, and also to test the dependence on the level of theory (HF and MP2) and the size of the basis set.
