Changes between Version 1 and Version 2 of CDB_Bench


Timestamp: 10/04/12 22:26:49
Author: lalibert

= Notes on performance and benchmarking =

The CDB was designed to perform well on a wide range of systems, with a special emphasis on distributed computing. It has been developed and tested on four Linux-based systems with widely different distributed architectures:
----
== ''sila'' compute node ==
 Location::
  Atmospheric Science Group, University of Toronto, Ontario, Canada
 Architecture::
  Intel(R) Xeon(R) CPU E5620 at 2.40GHz w/ 16 CPU cores, 48GB shared memory\\
  CentOS release 5.6 (Final)
 Performance Tests::
  The CDB was used for single-core asynchronous data processing with the --y_async option (not yet documented, still in beta testing). The large amount of RAM made this setup possible, and the option is mostly bug-free. After another round of testing, this option will be documented, allowing users of similar systems to take full advantage of their multi-core machines. A minimal sketch of the idea is given below.

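  The exact CDB invocation is not reproduced here; the following is only a minimal bash sketch of what asynchronous processing of years means in practice (process_one_year.sh and the year range are hypothetical placeholders, not part of the CDB):
{{{
#!/bin/bash
# Minimal sketch only: process_one_year.sh is a hypothetical placeholder for a
# per-year diagnostic script; the CDB drives this kind of loop through --y_async.
for year in $(seq 1979 2005); do
    ./process_one_year.sh "$year" &   # launch each year as a background job
done
wait                                  # block until all years have completed
}}}
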
----
== [http://wiki.scinethpc.ca/ SciNet] ''General Processing Cluster (GPC)'' ==
 Location::
  University of Toronto, Ontario, Canada
 Architecture::
  IBM iDataPlex cluster based on Intel's Nehalem architecture.\\
  Consists of 3,780 nodes (IBM iDataPlex DX360M2) with a total of 30,240 cores (Intel Xeon E5540) at 2.53GHz and 16GB RAM per node (2GB per core).\\
  Approximately one quarter of the cluster is interconnected with non-blocking 4x-DDR InfiniBand.\\
  The remaining nodes are connected with gigabit ethernet.\\
  Compute nodes are accessed through a queuing system that allows jobs with a maximum wall time of 48 hours. (From the SciNet wiki)
 Performance Tests::
  This architecture poses major limitations to the efficient processing of large amounts of data. It has a slow (gigabit) connection to its file system, which makes intensive use of disk swap space difficult; in fact, reading from the file system is about 10 times faster than writing to it. To circumvent these limitations, the user has to point CDB_TEMP_DIR to /dev/shm/, thus using the ramdisk. The CDB includes some ramdisk protection measures (the script parallel_ramdisk.sh), but we were not able to make them completely failsafe. Using the ramdisk on a GPC compute node means that a script cannot use more than ~8GB of swap space. The GPC has yet another limitation: it is not possible to request less than a full node, i.e. 8 processing cores (hyperthreaded). So while it would be possible to use the undocumented --y_async option for the asynchronous processing of years, the 8GB of swap space would most likely be exhausted if the processing script is even slightly complex. To circumvent this, an option --dim_async has been included; it has not been thoroughly tested and will remain undocumented for now. This option allows asynchronous processing along a chosen dimension and uses ncks and ncpdq. With this option it has been possible to use the full 8 cores with a small ramdisk swap space, thus optimizing processing power and read/write. A rough sketch of the approach is given below.

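  The --dim_async implementation itself is not reproduced here; the following is only a rough bash sketch, using standard NCO commands, of how a file could be split along a latitude dimension and the slabs processed asynchronously with the ramdisk as temporary space (in.nc, diag_on_slab.sh, the dimension name lat and its size are hypothetical placeholders; the actual option also makes use of ncpdq, which is not shown):
{{{
#!/bin/bash
# Rough sketch only: in.nc, diag_on_slab.sh, the dimension name "lat" and its
# size are hypothetical placeholders for whatever the diagnostic actually uses.
export CDB_TEMP_DIR=/dev/shm              # use the node's ramdisk as temporary space
nlat=64                                   # assumed size of the latitude dimension
step=$(( nlat / 8 ))                      # one slab per (hyperthreaded) core
for (( i=0; i<nlat; i+=step )); do
    # extract one hyperslab along lat into the ramdisk, then process it asynchronously
    ncks -O -d lat,${i},$(( i + step - 1 )) in.nc ${CDB_TEMP_DIR}/slab_${i}.nc
    ./diag_on_slab.sh ${CDB_TEMP_DIR}/slab_${i}.nc &
done
wait                                      # wait for all slabs before combining results
}}}
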
----
== [http://ciclad-web.ipsl.jussieu.fr/accueil/ CICLAD] ''Calcul Intensif pour le Climat, l'Atmosphère et la Dynamique'' ==
 Location::
  Institut Pierre-Simon Laplace, France
 Architecture::
  16 nodes for a total of 224 cores.\\
  Large data storage (400 TB).\\
  InfiniBand interconnections.
 Performance Tests::
  This architecture posed the fewest challenges. On CICLAD, it is possible to request one core at a time, and reading from and writing to disk is efficient. It is thus best to use the /data/ directory for swap space, sparing the user from having to use the node's ramdisk. CICLAD uses PBS with Torque, and the automatic submission feature of the CDB has been extensively tested on this system. With its cap of 24 simultaneous cores per user, it allows 24 model-experiment combinations to be processed simultaneously. For the most computationally intensive diagnostic offered in the first release of the CDB, cyclone_mask, it was possible to process all models with the necessary data in less than 24 hours. More benchmarking will be provided in the near future. A minimal single-core job sketch is given below.

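  The headers generated by the CDB's automatic submission feature are not reproduced here; the following is only a minimal Torque/PBS sketch of a single-core job that uses /data/ as swap space (cdb_run_diagnostic.sh and the swap directory are hypothetical placeholders):
{{{
#!/bin/bash
#PBS -l nodes=1:ppn=1                 # CICLAD allows requesting a single core
#PBS -l walltime=24:00:00             # assumed wall time for one model-experiment
#PBS -N cdb_cyclone_mask
# Minimal sketch only: cdb_run_diagnostic.sh and the swap directory are hypothetical.
export CDB_TEMP_DIR=/data/${USER}/cdb_swap   # disk swap space instead of the ramdisk
cd ${PBS_O_WORKDIR}
./cdb_run_diagnostic.sh cyclone_mask
}}}
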
----
== [http://badc.nerc.ac.uk BADC] ''The British Atmospheric Data Centre'' login node ==
 Location::
  STFC Rutherford Appleton Laboratory (RAL), Oxfordshire, UK
 Architecture::
  SUSE Linux Enterprise Server 11.\\
  Intel(R) Xeon(R) CPU E5620 at 2.40GHz w/ 2 CPU cores, 8GB shared memory
 Performance Tests::
  A simple server-side processing test was conducted. The daily psl for one model (NOAA-GFDL/GFDL-ESM2M, historical, rcp45) was retrieved from the archive, compressed (using netCDF-4 capabilities) and transferred to CICLAD (~1.5GB in total, less than 5 minutes of transfer time). On CICLAD, the daily psl was not available for this model. The cyclone_mask diagnostic was then applied to the transferred data and the compressed results were transferred back to BADC. The cyclone_mask output is at the daily frequency, but it is boolean and thus takes only ~28MB when compressed, so the transfer back to BADC was essentially instantaneous. The combine_cyclone_mask diagnostic was then launched on the transferred cyclone_mask data, taking about 45 minutes to multiply the mask with the daily surface temperature and the daily three-dimensional air temperature (pressure levels 1000, 850, 700, 500, 250, 100, 50 and 10 hPa), or ~15GB of data. After the multiplication, a monthly mean was computed, greatly reducing the output to ~500MB. This suggests the following analysis:
{{{
Let N be the number of three-dimensional fields to be composited with the cyclone_mask.
Then D, the amount of data in GB to be transferred for GFDL-ESM2M (1965-2004 historical and 2060-2099 rcp45), is:
D = 1.5 (fixed cost of transferring psl once) + 0.5N (transferring the final output)

If all of the data were transferred before being processed, it would be necessary to transfer 15N GB.
The user must thus transfer only a fraction D/(15N) ≈ 0.1/N + 0.03 of the data to be processed.
For N>3, the transfer becomes dominated by the transfer of the final output.
So if one would like to composite air temperature, specific humidity and the horizontal velocities,
the transfer size is mostly determined by the output transfer.
}}}
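As a quick numerical check of the estimate above, the following bash snippet simply evaluates the formula for a few values of N (it assumes nothing beyond the numbers quoted in the analysis):
{{{
#!/bin/bash
# Evaluate D = 1.5 + 0.5*N and the fraction D/(15*N) for a few values of N.
for N in 1 2 3 4; do
    D=$(echo "1.5 + 0.5 * $N" | bc -l)         # GB actually transferred
    full=$(echo "15 * $N" | bc -l)             # GB if all data were transferred first
    frac=$(echo "$D / $full" | bc -l)
    printf "N=%d  D=%.1f GB  fraction of full transfer=%.3f\n" "$N" "$D" "$frac"
done
}}}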
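Returning to the compression step mentioned at the beginning of the BADC test: the netCDF-4 compression can be reproduced with the standard NCO tool ncks. The following is a minimal sketch only (the file names and the remote destination are hypothetical placeholders, and the CDB may perform this step differently):
{{{
#!/bin/bash
# Minimal sketch only: file names and the remote destination are hypothetical.
# -4 writes a netCDF-4 file and -L 4 applies deflation level 4.
ncks -4 -L 4 psl_day_GFDL-ESM2M_historical.nc psl_day_GFDL-ESM2M_historical_compressed.nc
scp psl_day_GFDL-ESM2M_historical_compressed.nc user@ciclad-host:/path/to/cdb_input/
}}}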