
Version 2 (modified by astephen, 13 years ago)

--

An overview of the Data Extractor (DX)

Introduction

The Data Extractor (DX) is a Python-based tool that gives users access to subsets of large geospatial datasets through a common interface. That interface is typically the DX Browser Client, which is accessible as a set of web pages. However, users can also interact programmatically with the DX-Server, which presents a functional interface as a Web Service. This document provides an overview of the key components of the DX. More detail is, or will soon be, available in the following guides:

  • DX Installation Guide

  • DX Data Ingestion Guide

  • DX Administrator's Guide

  • DX User Guide

  • Guide to Securing the DX

Architecture

The following diagram provides an overview of the DX architecture, highlighting the main components involved in managing and interacting with the package.

Each component is described in more detail below.

DX-Server

This is the part of the system that does the core processing such as file I/O, subsetting and writing of data files. It provides:

  • a functional interface that can be interrogated by client applications (represented by the Web Service interface in the above diagram).

  • a metadata store describing datasets located in a local archive.

  • an I/O layer that extracts requested data (and metadata).

The DX-Server is controlled by the Administrator.

Installation requires knowledge of the local file system and access to various locations such as the web server CGI area. The Server Configuration module (typically called serverConfig.py) sets the paths to local resources so that the DX-Server can access them. These issues are dealt with further in the DX Installation Guide and the DX Administrator's Guide.
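
As an illustration only, a configuration module of this kind might look like the sketch below. Every variable name and path shown is a hypothetical placeholder rather than the actual content of serverConfig.py; the DX Installation Guide describes the real settings.

```python
# Hypothetical sketch of a server configuration module.
# None of these names are taken from the real serverConfig.py.
import os

# Root of the local DX installation (assumed location).
BASE_DIR = "/usr/local/dx"

# Paths derived from the base directory (all illustrative).
CGI_DIR = os.path.join(BASE_DIR, "cgi-bin")      # web server CGI area
OUTPUT_DIR = os.path.join(BASE_DIR, "output")    # extracted data files
METADATA_FILE = os.path.join(BASE_DIR, "inputDatasets.xml")


def missing_paths(paths):
    """Return the configured paths that do not exist on this machine."""
    return [p for p in paths if not os.path.exists(p)]
```

A helper such as missing_paths lets an Administrator check the configuration before starting the server, for example with missing_paths([CGI_DIR, OUTPUT_DIR, METADATA_FILE]).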

Both the DX-Server and the DX-Clients are Python packages (i.e. collections of Python modules). The DX is written in an object-oriented style so that developers can extend and modify the code where required. The DX-Server builds upon the Climate Data Analysis Tools (CDAT) package, which provides the underlying I/O, selection and subsetting tools. CDAT is not distributed with the DX.

DX-Client (Browser)

The DX Browser Client is the main method by which users will access the DX. It provides a CGI front-end that a user can access via any standard web browser. In a secure configuration users must log in to the DX client, but you can also configure the DX to provide open access where users can see all datasets. Access can be limited by user and/or by roles associated with datasets.
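
The role-based restriction described above can be pictured with a minimal sketch. The function name, its arguments and the exact rule are assumptions made for illustration; the DX's own access checks may work differently.

```python
def can_access(user_roles, dataset_roles, secure=True):
    """Return True if the user may see the dataset.

    Hypothetical rule: in open (non-secure) mode everything is visible;
    in secure mode the user needs at least one role the dataset allows.
    """
    if not secure:
        return True
    return bool(set(user_roles) & set(dataset_roles))
```

For example, can_access(["ocean_group"], ["ocean_group", "admin"]) would grant access, while a user with no matching role would be refused unless the DX is running in open mode.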

The Administrator installs the DX-Client, which may reside on the same machine as the DX-Server or on a remote machine. The client and server communicate using SOAP (Simple Object Access Protocol) messages, which requires the Python ZSI library to be installed (not supplied with the DX).
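
To give a feel for what a SOAP message looks like, the sketch below builds a SOAP 1.1 request envelope using only the Python standard library. It does not use ZSI, and the operation name, parameter name and the urn:example:dx namespace are hypothetical; the real DX messages are generated by ZSI and will differ.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
DX_NS = "urn:example:dx"  # hypothetical namespace, not the real DX one


def build_request(operation, params):
    """Build a SOAP 1.1 request envelope and return it as a string."""
    envelope = ET.Element("{%s}Envelope" % SOAP_NS)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_NS)
    call = ET.SubElement(body, "{%s}%s" % (DX_NS, operation))
    # Each parameter becomes a child element of the operation element.
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# "listDatasets" and "datasetGroup" are illustrative names only.
message = build_request("listDatasets", {"datasetGroup": "VFGS Model Output"})
```

The resulting string is an XML document whose Body element wraps the call, which is the general shape of the messages a SOAP client and server exchange.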

The Client Configuration module (normally called clientConfig.py) is controlled by the Administrator who configures the client for the local system.

DX-Client (Command Line)

The command line client for the DX allows users to interact programmatically with the DX-Server. This is a relatively untested feature, but it has the potential to allow users to embed calls to the DX-Server in their programs and scripts so that data can be extracted seamlessly as and when the user needs it.

Archive

The data archive must currently sit on the same network as the DX-Server and be visible via local path names. The archive must contain data held in files formatted as NetCDF or GRIB. Non-standard versions of the DX also provide some support for the PP format (UK Met Office).

The metadata inside the files should adhere (to some degree) to the CF-Metadata Convention for NetCDF, although some variation will normally work. Data that follows the convention will be easy to ingest without manual intervention.
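
A rough pre-ingestion check along these lines might look like the sketch below. It works on plain dictionaries of variable attributes rather than real files, and it tests only for the units, standard_name and long_name attributes defined by the CF convention. This is a simplified illustration, not the set of checks the DX actually performs.

```python
def cf_issues(variables):
    """Report variables lacking attributes that CF-aware tools rely on.

    `variables` maps a variable name to its attribute dictionary.
    """
    issues = []
    for name, attrs in sorted(variables.items()):
        if "units" not in attrs:
            issues.append(f"{name}: no 'units' attribute")
        if "standard_name" not in attrs and "long_name" not in attrs:
            issues.append(f"{name}: no 'standard_name' or 'long_name'")
    return issues


# Illustrative metadata for two variables.
sample = {
    "sst": {"units": "K", "standard_name": "sea_surface_temperature"},
    "salinity": {"long_name": "Salinity"},
}
problems = cf_issues(sample)
```

Here problems would flag only the missing units attribute on salinity, which is the kind of gap an Administrator might need to fix by hand before ingestion.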

Dataset Metadata

The DX understands the concept of a "Dataset" as a collection of one or more data files containing variables with a repeated structure. Typically these are 2D or 3D model fields with one time step per file.

The DX also has the concept of a "Dataset Group". This is a logical collection of "Datasets". For example:

  Dataset Group        Datasets                         Variables

  VFGS Model Output    VFGS Ocean Model Output          Salinity, SST...

  VFGS Model Output    VFGS Atmospheric Model Output    u-wind, v-wind...
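
The hierarchy in this example can be modelled with a couple of small classes. The class and attribute names below are chosen for illustration and are not the classes used inside the DX.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dataset:
    name: str
    variables: List[str] = field(default_factory=list)


@dataclass
class DatasetGroup:
    name: str
    datasets: List[Dataset] = field(default_factory=list)

    def variable_names(self):
        """Return every variable in the group, in dataset order."""
        return [v for ds in self.datasets for v in ds.variables]


# The example from the text, expressed with these hypothetical classes.
group = DatasetGroup(
    "VFGS Model Output",
    [
        Dataset("VFGS Ocean Model Output", ["Salinity", "SST"]),
        Dataset("VFGS Atmospheric Model Output", ["u-wind", "v-wind"]),
    ],
)
```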

By default the DX requires the Administrator to ingest new Datasets into the DX-Server before they can be accessed by users. The Administrator can also create new Dataset Groups to put Datasets under.

When interacting with the DX (via the Browser Client or Command Line Client), the user makes selections in the following order:

  1. Dataset Group

  2. Dataset

  3. Variable

  4. Spatial (Horizontal and Vertical) axes

  5. Temporal axes

  6. Output file format
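
The ordering above can be sketched as a small helper that reports which selection comes next. The step labels are hypothetical keys invented for this example, not identifiers used by the DX.

```python
# Selection steps in the order the DX asks for them.
# The key names are hypothetical labels, not DX identifiers.
SELECTION_ORDER = [
    "dataset_group", "dataset", "variable",
    "spatial_axes", "temporal_axes", "output_format",
]


def next_step(selections):
    """Return the first step not yet chosen, or None when complete.

    Raises ValueError if a later step was chosen before an earlier one.
    """
    pending = None
    for step in SELECTION_ORDER:
        if step in selections:
            if pending is not None:
                raise ValueError(f"'{step}' chosen before '{pending}'")
        elif pending is None:
            pending = step
    return pending
```

A client loop could call next_step after each user choice and stop once it returns None, at which point the request is fully specified.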

If the user selects two variables, the DX will try to subtract variable 2 from variable 1 by interpolating variable 2 onto the grid of variable 1.
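
The idea can be shown in one dimension with plain Python. This is only a sketch of the principle using linear interpolation on made-up values; this overview does not describe how the DX performs the interpolation on real 2D or 3D fields.

```python
def interpolate(x_new, xs, ys):
    """Linearly interpolate the points (xs, ys) at x_new.

    xs must hold at least two strictly increasing coordinates and
    x_new must lie within their range.
    """
    if not xs[0] <= x_new <= xs[-1]:
        raise ValueError("target point lies outside the source grid")
    for i in range(len(xs) - 1):
        if xs[i] <= x_new <= xs[i + 1]:
            weight = (x_new - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + weight * (ys[i + 1] - ys[i])


def difference(grid1, values1, grid2, values2):
    """Return variable 1 minus variable 2, evaluated on grid 1."""
    return [v1 - interpolate(x, grid2, values2)
            for x, v1 in zip(grid1, values1)]


# Variable 1 on a 10-degree grid, variable 2 on a coarser 20-degree grid.
result = difference([0, 10, 20], [5.0, 6.0, 7.0], [0, 20], [1.0, 3.0])
```

In this example variable 2 is interpolated to 1.0, 2.0 and 3.0 at the three points of grid 1, so the difference is 4.0 everywhere.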

The Dataset Metadata is stored in an XML file (normally called inputDatasets.xml). Ingestion of datasets is described in detail in the DX Data Ingestion Guide.
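
As a rough picture of how such a file could be read, the sketch below parses a small XML fragment with the Python standard library. The element and attribute names are invented for this example; the real schema of inputDatasets.xml is documented in the ingestion guide and may look quite different.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the real inputDatasets.xml schema may differ.
SAMPLE = """
<inputDatasets>
  <datasetGroup name="VFGS Model Output">
    <dataset name="VFGS Ocean Model Output">
      <variable name="Salinity"/>
      <variable name="SST"/>
    </dataset>
  </datasetGroup>
</inputDatasets>
"""


def list_variables(xml_text):
    """Return (group, dataset, variable) name triples from the XML."""
    root = ET.fromstring(xml_text)
    return [
        (group.get("name"), dataset.get("name"), variable.get("name"))
        for group in root.iter("datasetGroup")
        for dataset in group.iter("dataset")
        for variable in dataset.iter("variable")
    ]


triples = list_variables(SAMPLE)
```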

Web Service Interface

The Web Service Interface to the DX-Server is a Python script containing a number of functions that are exposed as a Web Service when the script is run. The running script then waits for calls from client applications. Clients can only access the DX-Server while this script is running on the DX-Server machine.
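
The pattern of registering functions and dispatching incoming calls to them by name can be sketched with the standard library. Note that this example uses XML-RPC (the xmlrpc.server module) as a stand-in because it ships with Python; the DX itself uses SOAP via ZSI, and the function name here is hypothetical.

```python
from xmlrpc.server import SimpleXMLRPCDispatcher


def list_dataset_groups():
    """Hypothetical service function; the real DX functions may differ."""
    return ["VFGS Model Output"]


# Register the function so that remote callers can invoke it by name.
dispatcher = SimpleXMLRPCDispatcher()
dispatcher.register_function(list_dataset_groups, "listDatasetGroups")

# Simulate an incoming call without opening a network port.
groups = dispatcher._dispatch("listDatasetGroups", ())
```

A long-running service would instead create an xmlrpc.server.SimpleXMLRPCServer bound to a host and port, register the same functions on it and call serve_forever(), which mirrors the "script waits for calls" behaviour described above.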

Security

The DX can be run in secure or non-secure mode. This is controlled in the Server and Client Configuration modules. The DX provides a set of programmatic hooks that an Administrator can plug into the local security system. The DX allows secure tokens to be exchanged between client and server, and these hooks can be modified to provide an interface to the local security implementation.

More details are provided in the Guide to Securing the DX.
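
One way to picture a token hook is the sketch below, which signs a username with an HMAC so that the server can later verify the token it issued. The function names, the token layout and the shared secret are all assumptions for illustration; the actual hooks and token format are described in the Guide to Securing the DX.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-site-secret"  # hypothetical shared secret


def issue_token(username):
    """Return 'username:signature' for the client to present later."""
    signature = hmac.new(SECRET_KEY, username.encode("utf-8"),
                         hashlib.sha256).hexdigest()
    return f"{username}:{signature}"


def verify_token(token):
    """Return the username if the signature is valid, else None."""
    username, _, signature = token.rpartition(":")
    expected = issue_token(username).rpartition(":")[2]
    # Constant-time comparison avoids leaking how many characters match.
    if hmac.compare_digest(signature.encode("utf-8"),
                           expected.encode("utf-8")):
        return username
    return None
```

An Administrator adapting such a hook would replace the signing and verification steps with calls into the site's own authentication system.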