IPython parallel project
- The Problem
- Project structure
- Work Plan
- Useful links
Here we describe a project to investigate the use of IPython as part of a data analysis platform for the JASMIN infrastructure.
The results of large-scale climate modelling experiments typically comprise thousands to millions of 4D fields from different climate simulations, where each field may be tens of GB in size. Analysing these results therefore requires processing huge volumes of data, ranging from terabytes to petabytes.
To perform analyses across data at this scale we must adapt our current analysis tools to run in parallel. Data analysis has traditionally been done using code that is run either interactively or via ad-hoc scripts, without any explicit parallelism. Domain-specific languages such as MATLAB and IDL are common, but the Python numeric programming stack is becoming increasingly popular and CEDA makes heavy use of a range of tools built upon it.
We need to investigate whether our current Python tools can run analyses in parallel across terabytes of data. We have identified IPython-parallel as a potential framework for achieving this, and therefore propose a short project to investigate the compatibility between IPython-parallel and our Python analysis stack.
Any tickets in the "JASMIN Platform" component containing the keyword IPython will be listed here.
Software outputs are available in the ipython_project git repository.
This is a highly exploratory project which will require a lot of initiative to achieve its full potential. To help ensure we get concrete results, the work will be structured to encourage clear documentation and you will be given several opportunities to present your progress.
- The main outputs will be:
- A git repository of all software outputs
- IPython notebook files documenting your results
- Wiki pages on the CEDAServices Trac documenting how you have solved the software engineering problems.
- You will be asked to present your work 2-3 times during the project at the CEDA Developer meetings
- Wednesday July 25th -- Demonstrating initial progress and the potential of the IPython notebook
- Wednesday August 15th
- Wednesday September 12th -- Present final results
The project will be divided into three exploratory tasks, with a fourth task offered as an alternative to the third.
Task 1: Investigate parallelisability of Scientific Python analysis tools
The IPython-parallel architecture is designed for parallel, interactive data analysis. It is built on the NumPy array processing library and is optimised to minimise copying of large arrays between processing nodes. We have several tools built on top of NumPy that would benefit from being parallelisable.
Show how a simple array computation performs on an IPython-parallel cluster when using the following libraries:
- A baseline for something that should show a performance improvement
- A low-level interface to the NetCDF file format
- A high-level atmospheric sciences interface to the NetCDF file format
- A legacy interface to NetCDF
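As a concrete notion of "a simple array computation" to use as the baseline, a time-mean of a single 4D field is representative of our common algorithms. A minimal serial sketch with NumPy, against which cluster timings could be compared (the array shape here is invented and far smaller than a real field):

```python
import numpy as np

# Illustrative stand-in for one 4D field (time, level, lat, lon);
# real CMIP5 fields may be tens of GB.
field = np.random.rand(120, 17, 73, 96)

# A typical "simple array computation": the mean over the time axis.
time_mean = field.mean(axis=0)

print(time_mean.shape)  # (17, 73, 96)
```

The same computation run on an IPython-parallel cluster (e.g. by scattering the time axis across engines) gives the performance comparison the task asks for.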
Each of these libraries uses Python objects to represent parts of the NetCDF data model. A key question is whether these objects can be passed between IPython-parallel nodes and used successfully to interact with the data on disk, or whether special measures need to be taken to set up the objects on each node. What is the most efficient way of setting up these objects within the cluster, and what performance improvements can we expect by parallelising our common algorithms?
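Whether an object can be passed to another node comes down to whether it can be serialised (IPython-parallel falls back on pickle for general Python objects). A quick local check, using a plain open file handle as a stand-in for a library object that wraps one (the actual NetCDF classes may behave differently and should be tested the same way):

```python
import os
import pickle
import tempfile

# A small temporary file standing in for a NetCDF file on disk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"stand-in for a NetCDF file")
    path = tmp.name

# Objects wrapping an open OS-level file handle generally cannot be
# pickled, so they cannot be shipped to another engine as-is.
handle = open(path, "rb")
try:
    pickle.dumps(handle)
    picklable = True
except TypeError:
    picklable = False
finally:
    handle.close()
    os.remove(path)

print(picklable)  # False: the handle does not survive serialisation
```

If the library objects fail this check, the likely "special measure" is to ship the file path instead and have each engine construct its own object locally.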
Task 2: IPython notebook interface to a JASMIN VM
The IPython notebook is a web application which gives you an interactive Python session through your browser. Since it is browser-based, there is the potential to communicate with a remote system through a notebook. The notebook architecture provides many different ways in which you can distribute its components across different systems, each with different advantages and disadvantages. Briefly, the main components are:
- Web browser
- The user interacts with the system through a web browser
- Notebook server
- A Python web application that communicates with the web browser over HTTP
- IPython kernel
- The IPython interpreter that the user sees in her browser. This is a separate process which may or may not be co-located with the notebook server.
- IPython-parallel cluster
- The notebook can also communicate with extra kernels within a cluster. These can also be on separate machines to the Notebook server.
The task is to find a configuration that meets the following requirements:
- The notebook provides a Python interface directly to a user's JASMIN VM login. This will involve using the user's ssh key to log in to the VM and set up the notebook components.
- The user can initialise the system with minimum installation requirements on their desktop. This will restrict our options for how we distribute the notebook components. For instance the notebook server requires zeromq to be installed and therefore probably can't be run on the user's desktop.
- The security of JASMIN is not compromised. Access to the notebook session should be restricted to the user's desktop only. This will probably require ssh tunnels to be set up for the notebook server.
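One configuration consistent with these requirements is to bind the notebook server to the loopback interface on the VM and tunnel to it from the desktop. A sketch of the server-side configuration, assuming the `NotebookApp` option names of IPython 0.12+ (the port and hostname are illustrative and should be checked against the installed version):

```python
# ipython_notebook_config.py -- a sketch, not a tested deployment.
c = get_config()

# Bind the notebook server to loopback only, so it is not reachable
# directly over the network.
c.NotebookApp.ip = '127.0.0.1'
c.NotebookApp.port = 8888
c.NotebookApp.open_browser = False

# The user then reaches the server from their desktop through an ssh
# tunnel (hostname illustrative):
#   ssh -N -L 8888:localhost:8888 <user>@<jasmin-vm>
```

With this arrangement only the tunnel endpoint on the user's desktop can reach the session, and nothing beyond a browser and an ssh client is needed on the desktop.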
Task 3: IPython-parallel LOTUS deployment
This is an optional extension to the project which is contingent on the LOTUS HPC system being ready in the timescale of the project.
The LOTUS HPC cluster is a small batch-processing facility co-located with JASMIN. It uses a batch-scheduling system and MPI to enable highly parallel algorithms to be executed across multiple nodes. IPython-parallel should be able to utilise this system to create larger clusters of nodes connected to an IPython session.
Using the outputs from Tasks 1 and 2, scale up your analyses to run on LOTUS. Configure IPython-parallel to create a cluster through the LOTUS job-submission system, and configure the LOTUS nodes to communicate over MPI.
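IPython-parallel's ipcluster launchers can be pointed at a batch scheduler from a profile's `ipcluster_config.py` (a profile is created with `ipython profile create --parallel --profile=lotus`). A sketch, assuming LOTUS runs an LSF-style scheduler (substitute the PBS or SGE launcher aliases if not) and that MPI is available on the nodes:

```python
# ipcluster_config.py -- a sketch, assuming an LSF scheduler on LOTUS;
# launcher aliases and option names should be checked against the
# installed IPython version.
c = get_config()

# Submit the engines as a batch job through the scheduler...
c.IPClusterEngines.engine_launcher_class = 'LSF'
c.IPClusterEngines.n = 32  # number of engines, illustrative

# ...or start them under mpiexec so they can also talk over MPI:
# c.IPClusterEngines.engine_launcher_class = 'MPI'

# The controller can stay on the submission host.
c.IPClusterStart.controller_launcher_class = 'Local'
```

Once `ipcluster start --profile=lotus` is running, the analyses from Tasks 1 and 2 connect to the cluster unchanged.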
Task 4: IPython-parallel analysis of CMIP5 data
As an alternative to Task 3, if LOTUS is not available we will use the jasmintest1/2 cluster to analyse terabytes of CMIP5 data stored on Panasas.
CMIP5 Archive Data
The CMIP5 archive is an archive of the outputs of the 5th Coupled Model Intercomparison Project. The CMIP5 project coordinates simulation of the climate by modelling groups around the world by defining a set of common experiments to be performed by each group using their particular model software.
The CMIP5 data archived at BADC is in the process of being moved from our legacy storage hardware to the Panasas storage. For most models and experiments only some of the data is on Panasas at this time, so our choice of data to analyse is somewhat restricted. We have selected data from two models running the "Representative Concentration Pathway" experiments, as suitable high-volume data is available for multiple models. The data is visible on jasmintest1/2 under the following paths:
Further models may be added to /opt/data as data is migrated to Panasas. Below the model directory NetCDF file paths follow the "Data Reference Syntax" (DRS) format:
These terms are explained in detail in the DRS document, however a few observations should help selecting the right data:
- <realm> and <table> partition the set of variables into categories. Therefore each variable should appear in one and only one <realm>/<table> combination.
- <version> is used when a dataset has been updated. The special version "latest" always points to the latest version. Therefore we can ignore all data not in "latest".
- <ensemble> is of the form "r<n1>i<n2>p<n3>". Ensemble members represent repeats of the simulation under perturbed conditions to create an ensemble of results.
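The two filtering rules above (keep only "latest" versions, understand the ensemble member id) are easy to capture in helper functions. A sketch, with the regex built from the r&lt;n&gt;i&lt;n&gt;p&lt;n&gt; pattern described; function names are our own:

```python
import re

# Ensemble member ids have the form r<n1>i<n2>p<n3>, e.g. "r1i2p1".
ENSEMBLE_RE = re.compile(r'^r(\d+)i(\d+)p(\d+)$')

def parse_ensemble(member):
    """Return the three integers from an ensemble member id."""
    match = ENSEMBLE_RE.match(member)
    if match is None:
        raise ValueError('not an ensemble member id: %r' % member)
    return tuple(int(n) for n in match.groups())

def is_latest(path):
    """True if the path goes through the special 'latest' version."""
    return 'latest' in path.split('/')

print(parse_ensemble('r1i2p1'))  # (1, 2, 1)
```

Filtering a directory walk with `is_latest` before opening any files avoids double-counting superseded dataset versions.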