Molecular dynamics is a standard computational technique used in condensed matter physics, materials science, chemistry, and other fields, consisting of following the temporal evolution of a system of N particles, interacting with each other by means of a certain law. In classical molecular dynamics, the evolution is based on the Newton's law, F=ma, and the forces are obtained as gradients of a certain potential which is function of all the particle coordinates.
Use of periodic boundary conditions is customary to simulate infinite systems. Thermodynamic variables such as pressure and temperature can be defined (even if they will fluctuate a bit because the number of particles is finite), and the technique can be used to study a large number of phenomena including phase transitions, liquid structure, defects in solids (vacancies, dislocations, grain boundaries, etc), amorphous materials, surfaces, clusters.
Typically, interactions are short range: two particles do not interact if their distance is larger than a certain cutoff distance (this does not mean that interactions should be pairwise: there can be many-body terms, but they are also typically local in space). And of course, particles can move around in the system. So an efficient molecular dynamics code must:
As a result of these points, codes using smart algorithms often end up accessing data from memory in a rather irregular way. One rather typical example could be a double do-loop of the type
DO I=1,N
DO JLST=MRKR1(I),MRKR2(I)
J=LIST(JLST)
I and J are neighbors, compute their distance,
compute energies and forces FIJ, etc ...
FORCEZ(I) = FORCEZ(I) + FZIJ
FORCEZ(J) = FORCEZ(J) - FZIJ
ENDDO
ENDDO
Here, I increases regularly, but the indexes J of the
neighboring particles---kept stored in a big array LIST with pointers
MRKR1 and MRKR2 indicating the start and the end of
the neighbor list associated to each particle I---will generally be
thoroughly scattered. Therefore, consecutive references to array elements
indexed by J could be far away in memory, generating cache
misses on cache-based hardware.
Not that the situation is much better on traditional vector supercomputers. The above inner loop may be hard to vectorize (at the very least we have to tell the compiler that all J will be different and no one of them will be equal to I), and it is typically a short loop anyway.
MDBNCH was originally developed around 1985 as a "real life" application to be used as benchmark for a machine purchase. That version was not suited for a general distribution, and I replaced it with a brand new code incorporating new algorithms in december 1988. Except for a few cosmetic modifications with no effect on timings, MDBNCH did not change since then, and it will never be changed. All the data available in the current table of results have been obtained with this same version, and can be directly compared with each other.
MDBNCH is written in 100% ANSI-standard Fortran 77 to ensure maximum portability. This is of course why its name is not, say, MDBench. It consists of about 2100 lines. It was derived from a production code named MASTER used to study metals with many-body `glue' potentials. This code was developed by starting from a program constructed by Michele Parrinello and Aneesur Rahman at Argonne National Laboratory around 1980, featuring a cell with variable volume and shape. This program was in turn evolved from previous Rahman's codes, whose development was started in the 60s.
This is perhaps the most technical part of this document and you may wish to skip it.
A single run of MDBNCH performs seven independent molecular dynamics calculations, using systems of different sizes and/or using different codes. The purpose of doing this is to sample a variety of different running conditions.
Each run is separately timed, and the total time is reported at the end. This last number is what is reported in the table of results. However, the partial times may also be useful for particular purposes.
The seven calculations are the following:
The skin is a distance added to the potential interaction range. The neighbor list contains the particles within the interaction range, plus those in the skin. The latter are not required for the current time step, but their presence allows to utilize the same neighbor list for a few time steps. The list is deteriorated when a particle not originally included into it becomes interacting after having traveled across the skin region. The skin is a parameter that can be tuned to optimize a certain calculation on a certain machine. A value too small will force too many neighbor list calculations, while a value too large will include too many neighbor which are not really interacting and add overhead. For this reason, I considered interesting to include similar runs with different skins (2., 5. and 6.)
The issue of cache trashing often suggests to compare two machines A and B under different memory access patterns. For this purpose I suggest to use the partial times for calculations 1., 2., and 7., in the order of increasing size and irregularity of memory access. One could then obtain the A/B time ratio for each calculation and see the trend.
MDBNCH results provide information about the speed of machines in classical molecular dynamics calculations using short-range potentials and neighbor lists. More generally, they may give you a feeling about the behavior of systems on codes which are relatively small, but tend to access memory in a quite irregular way, probably causing many cache misses on recent RISC architectures.
Performance on vector computers is also quite poor, as inner loops may be short and/or not vectorizable (see example above). There are vectorizable codes which do the same job and run much faster on vector machines, but they usually run slower on workstations, and such codes are not included in MDBNCH anyway. Just keep in mind that molecular dynamics production codes on vector computers are somewhat faster than what suggested by MDBNCH results.
In summary, MDBNCH is coded by striving for algorithmic efficiency for the problem at hand, rather than for computational efficiency. For this reason, it is almost certainly unable to bring any machine close to its peak theoretical limit. In this sense, MDBNCH provides for a different perspective with respect to, say, Linpack, and it should be considered as a real life benchmark.
Do not use these results to draw general conclusions about machine speeds. Different applications are known to yield very different speed ratios between different machines. If you are evaluating a set of machines for general purpose computing, MDBNCH performance may be one indicator you may wish to consider, but by no means the only one. In particular, codes which access memory in a regular pattern, such as for instance most electronic structure codes and linear-algebra based codes, will not be represented at all by MDBNCH, and you should look elsewhere to get performance indicators for this kind of codes. The same can be said for codes dominated by integer arithmetics.
Several people helped to put together the current table of results. Their names are acknowledged in the table itself. Do not be envious! You can also be a contributor.
Running MDBNCH is fairly easy, as the program is very standard-conforming, it has been already tried on a lot of platforms, there are no input files and there is just one output file (the standard output on Unix machines). Anybody is very welcome to run the program and report the result, particularly on a machine which has not been already tried.
Here: mdbnch.f (62 kB), or mdbnch.f.gz (12 kB).
You also need a version of the timing routine second.f
appropriate for your system. This routine is typically an interface
to a system-dependent timing routine, and returns the CPU time
(not the elapsed time).
There are ready solutions for
IBM RS/6000 running AIX,
VMS (Vax or AXP),
IBM mainframes using VS Fortran,
MSDOS/MS Fortran Powerstation,
and a generic Unix version,
interfaced to etime and
known to work for DEC Ultrix, DEC OSF, HPUX, SGI and Sun among others.
On recent HPUX releases you need to compile with +U77 to access
etime. On older HPUX releases get this
etime.c
and compile and link it explicitly.
You do not need anything for Crays, as they have second
in the system library.
Note that no modifications of the code are allowed, not even trivial ones such as changing the maximum dimension of arrays to modify the memory layout. This strict rule is required to allow a meaningful comparison of results without introducing an additional amount of tweaking axis.
First of all, note that MDBNCH requires 64-bit
floating point arithmetic, and that the code declares all floats
to be DOUBLE PRECISION and constants are also
explicitly written as 3.D0.
This means that on practically all the modern RISC machines you
do not have to use any particular compiler option, as
DOUBLE PRECISION means REAL*8, i.e.
64-bit. However, you must be careful on machines with a wide word,
most typically Crays, where the compiler would produce
very slow code using 128-bit arithmetics. On these systems make
sure to specify a compiler option called disable precision, or
something similar, to avoid this problem.
Hunting for the best compiler options is up to you, but here are a few additional comments:
There is no point in trying automatic parallelization on machines with multiple CPUs. This code is not suitable for parallelization and no automatic tool can do anything significantly useful with it in this area. MDBNCH is a scalar benchmark.
MDBNCH occupies a quantity of memory of the order of 8MB when running. The execution time is of the order of one minute on a typical 1994 workstation.
There are no input files. Just run the executable, redirecting the output on a file. On a Unix machine, for example,
time is optional. The official execution
time is that printed at the end of the output file.
It is also important to inspect the output file to verify the
calculation, as explained in the next section.
Even if the CPU time is returned, it is good practice to run the benchmark on a lightly loaded or possibly empty machine, as it is known that the CPU time of a run is slightly dependent on the system load. If this is not feasible at your site, data for a loaded machine are better than no data at all.
The `official' time is the last time printed on the standard output. The line containing it may look like
time or
timex usually give a number which is very close
(summing user and system time).
It may be slightly higher because of the time needed to load
the program. A large difference between the CPU time printed by
the program and that reported by time may indicate
troubles with the timing routine, and deserves investigation.
Even if the time is printed with a microsecond precision, very few machines provide timing routines with this kind of resolution. In fact, the typical resolution in a Unix machine may be 10 milliseconds, as in the example above. For pratical purposes, this is actually more than enough. In fact, you will find that repeating a run may yield a different CPU time, with fluctuations of the order of 1%. These fluctuations may be due to changes in the system load: the processor has to do extra work to switch between different tasks, and this extra work is always charged to somebody. For this reason I recommend to run on a lightly loaded system, and always run at least twice each case. If you have many runs to choose from, the time to report is the best time with a given set of compilation options.
Due to the fluctuations discussed above, the 10 milliseconds resolution discussed above is more than adequate with the current machine speed. After you have taken the time, round it (using nearest digit rounding) so that there are 3 decimal digits. That is, round to the second if the time was larger than 100 s; to the tenth of second if the time was between 10 and 100 s; to the hundredth of second if the time was between 1 and 10 s.
In order for the timing to be validated, one should make sure that the program did the right thing. Surprises often occur, particularly when pushing the limits with preprocessors and optimizers. The easy way to verify the output is to send it to me and I will have a look. Or you can do it by yourself (thanks!), using the following guidelines.
You can download an
example of output. This file has been
actually produced on a Sun 4, but DEC MIPS, DEC AXP, HP PA-RISC and
other machines with IEEE arithmetics produce the exact same output.
This output is 'correct', but it not necessarily the only one correct.
Due to the nature of molecular dynamics calculations, a one-bit
difference due to rounding at a certain point in time may cause
a calculation to drift away from a reference run and end up in a
totally different place. For instance, IBM RS/6000s produce an output
different from the above example, because some
calculations may be carried out into the chip which an internal precision
higher than IEEE's precision.
Another typical effect (observed for instance on HPs) is that use of
a preprocessor may produce a slightly different output, presumably due
to a different ordering of operations
(computer arithmetics is not commutative, that is,
(A+B)+C is generally different from A+(B+C)).
In all cases, you should check that some conditions are met. I recommend to look at the first and the last of the seven calculations done in each run. The first is a small system run for a long time, and the last is a large system run for a short time. The other are intermediate cases.
The first calculation output should look like
TOT.E represents the total energy, which is an
approximately conserved quantity except for about the first 5 time steps
(where not all the time derivatives that the algorithm needs
are into place yet). Since
in this case the output line is printed only every 100 steps (it is
1 step in all the other six cases), this means that the total
energy should be the same from step 100 to step 1000, with a +-1
fluctuation allowed on the last decimal digit.
In this case we have -3.0472 +- 0.0001.DIFFUS) going to 17.0283 at step 1000
on your machine, as in the example? If not, it is not dramatic. It just
means that your machine is not IEEE, or that calculations were carried on
with a different internal precision than IEEE, or that you used an
aggressive preprocessor which changed the order of operations. This
number should not be much different from 17, though.PX, PY and PZ
contain the three components of the total linear momentum. Since
this is a conserved quantity and it is initially zero, these numbers
should remain close to zero. In contrast with the total energy,
which is only conserved within some fluctuations, the linear momentum
is conserved exactly. That is, the residual values are
entirely due to the finite precision of computer arithmetics.
10^{-14} is typical for 64-bit IEEE.The last calculation output should look like
LENGTH= 140453...
is exactly as above. A difference there is likely to be a bug in the
compiler. On the other hand, the second LENGTH= ... line may
be different if the calculation takes a different path, for the reasons
discussed above.
Please send me your timing data, so that I can include them in the current table of results.
Please do not forget:
Thanks in advance to all contributors!
To obtain SPEC-like ratios from MDBNCH timings that can be directly compared with SPECfp95 ratios, divide 120 by the MDBNCH timing for the machine you are considering.
120 seconds is the estimated running time of MDBNCH on a Sun SparcStation 10/40, which is the machine chosen by SPEC to establish reference timings for the CPU 95 suite. Although the MDBNCH timing for this machine is not available, it has been estimated from the time of 90.5 s obtained on the similar SparcStation 20/50. SPECfp95 data are not available for the 20/50, so scaling has to be done using SPECfp92 numbers. The SPECfp92 ratings for the 10/40 and the 20/50 are respectively 60.2 and 78.8. Their ratio is 1.309. Ratios of the individual benchmarks constituting SPECfp92 range from 1.171 to 1.445. The ratio for the benchmark supposed to be closest to MDBNCH (MDLJSP2) is 1.334. Therefore, the reference time of 120 s, corresponding to a ratio of 1.326, seems to be a reasonable guess for the running time of MDBNCH on a 10/40, and we shall stipulate that this is the reference running time of MDBNCH to obtain SPEC CPU95-like metrics.