Notes on performance and benchmarking

Ressource Requests

The first release of the CDB assumes that most of the processing will occur over a PBS-like queue management system. The users must thus evaluate their resource requirements (CPU time, number of CPUs, memory) before submitting a diagnostic job. It is out understanding that with the web-based server-side processing framework will still require the users to evaluate their need and we envision that it could even be more stringent (i.e. the user would have to specify output size to request remote temporary storage). This issues will have to considered in future versions to the CDB.

Systems where the code was developed

The CDB was designed to be performant on a wide range of systems with a special emphasis on distributed computing. It has been developed and tested on four linux-based systems with widely different distributed architecture:

sila compute node

Atmospheric Science Group, University of Toronto, Ontario, Canada
Intel(R) Xeon(R) CPU E5620 at 2.40GHz w/ 16 CPU cores, 48GB shared memory
CentOS release 5.6 (Final)
Performance Tests
Used the CDB for single-core asynchronous data processing using the --y_async option (undocumented yet, still in beta testing phase). The large RAM amount allowed this setup and is mostly bug-free. After another round of testing, this option will be documented, thus allowing users of similar systems to take full advantage of their multiple-core machines.

 SciNet General Processing Cluster (GPC)

University of Toronto, Ontario, Canada
IBM iDataPlex cluster based on Intel's Nehalem architecture.
Consists of 3,780 nodes (IBM iDataPlex DX360M2) with a total of 30,240 cores (Intel Xeon E5540) at 2.53GHz, 16GB RAM per node (2GB per core).
Approximately one quarter of the cluster interconnected with non-blocking 4x-DDR InfiniBand?.
Remaining nodes are connected with gigabit ethernet.
Compute nodes are accessed through a queuing system that allows jobs with a maximum wall time of 48 hours. (From SciNet? wiki)
Performance Tests
This architecture poses major limitations to the efficient processing of large amounts of data. It has a slow (gigabit) connection to its file system, making it difficult to make an intensive use of disk swap. In fact, it is about 10 times faster to read from the file system than to write. To circumvent these limitations, the user has to point the CDB_TEMP_DIR to /dev/shm/, thus using the ramdisk. The CDB includes some ramdisk protection measures (script but we were not able to make it completely failsafe. Using a ramdisk on a GPC compute node means that a script cannot use more than ~8GB of swap space. The GPC has yet another limitation: it is not possible to request less than a full node, i.e at 8 processing cores (hyperthreaded). So while it could be possible to use the undocumented --y_async option for the asynchronous processing of years, the 8GB swap space would most likely end up being exhausted if the processing script is slightly complex. To circumvent this, an option --dim_async that has not been thoroughly tested and will remain undocumented for now has been included. This option allows the asynchronous processing of along a chosen dimension. It uses ncks and ncpdq. With this option it has been possible to use the full 8 cores with a small ramdisk swap space, thus optimizing processing power and read/write.

 CICLAD Calcul Intensif pour le Climat, l'Atmosphère et la Dynamique

Institut Pierre-Simon Laplace, France
16 nodes for a total of 224 cores.
Large data storage (400 TB).
Infiniband interconnections.
Performance Tests
This architecture posed the least challenges. On CICLAD, it is possible to request one core at a time and read/write to disk is efficient. It is thus best to use the /data/ directory for swap space, relieving the user from the worry of using the node's ramdisk. It uses a PBS with torque and the automatic submission feature of the CDB has been extensively tested on the system. With its 24 simultaneous core/user cap, it allows 24 models-experiments to be processed simultaneously. For the most computationally-intensive diagnostic offered in the first release of the CDB, cylone_mask, it was possible to process all models with the necessary data in less than 24 hours. More benchmarking will be provided in the near future.

 BADC The British Atmospheric Data Centre login node

STFC Rutherford Appleton Laboratory (RAL), Oxfordshire, UK
SUSE Linux Enterprise Server 11.
Intel(R) Xeon(R) CPU E5620 at 2.40GHz w/ 16 CPU cores, 8GB shared memory
Performance Tests
A simple server-side processing test has been conducted. The daily psl for one model (NOAA-GFDL/GFDL-ESM2M, historical, rcp45) was retrieved from the archive, compressed (using netcdf4 capabilities) and transferred to CICLAD (total ~1.5GB, less than 5 minutes transfer). On CICLAD, the daily psl was not available for this model. The cyclone_mask diagnostic was then applied on the transferred data and the compressed results were transferred back to BADC. The cyclone_mask output is at the daily frequency but it is boolean and it thus take only ~28MB when compressed. The transfer back to BADC was thus instantaneous. The combine_cyclone_mask diagnostic was then launched on the transferred cyclone_mask data taking about 45 minutes to multiply the mask with the daily surface temperature and the daily three dimensional air temperature (pressure levels 1000,850,700,500,250,100,50,10 hPa), or ~15GB of data. After the multiplication, a monthly mean was performed, greatly reducing the output to ~500MB. This suggest the following analysis:
let N be the number of three dimensional fields to be composited with the cyclone_mask
then the GB of data, D, to be transferred for GFDL-ESM2M, 1965-2004 historical and 2060-2099 rcp45 will be:
D = 1.5 (fix cost of transferring psl once) + 0.5N (transferring the final output)

If all of the data was first transferred before being processed, it would be necessary to transfer 15N GB.
The user must thus transfer only a fraction D/(15N) = 0.1/N + 0.03 of the data to be processed.
For N>3, the transfer becomes dominated by the transfer of the final output.
So if one would like to composite air temperature, specific humidity and the horizontal velocities,
the transfer size is mostly determined by the output transfer.