source: TI03-DataExtractor/trunk/dxs/manuals/ingest.html @ 1715

Revision 1715, 19.3 KB checked in by astephen, 13 years ago (diff)

Merged with titania version.

1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
4        <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
5        <TITLE>Home Page</TITLE>
6        <META NAME="GENERATOR" CONTENT=" 2.0  (Linux)">
7        <META NAME="CREATED" CONTENT="20060317;12440700">
8        <META NAME="CHANGED" CONTENT="20060628;8092900">
9        <META NAME="ProgId" CONTENT="FrontPage.Editor.Document">
12<H1 LANG="en-GB">The Data Extractor (DX) Ingestion Guide</H1>
13<H2 LANG="en-GB">Introduction</H2>
14<P LANG="en-GB">This document is one of a series of documents
15describing how to set up and use the Data Extractor (DX). Further
16information is, or soon will be, available in the following guides:</P>
18        <LI><P LANG="en-GB" STYLE="margin-bottom: 0cm"><B>DX Overview</B></P>
19        <LI><P LANG="en-GB" STYLE="margin-bottom: 0cm"><B>DX Installation
20        Guide</B> 
21        </P>
22        <LI><P LANG="en-GB" STYLE="margin-bottom: 0cm"><B>DX Administrator's
23        Guide</B> 
24        </P>
25        <LI><P LANG="en-GB" STYLE="margin-bottom: 0cm"><B>DX User Guide</B> 
26        </P>
27        <LI><P LANG="en-GB"><B>Guide to Securing the DX</B> 
28        </P>
30<H2 LANG="en-GB">Data Ingestion Overview</H2>
31<P LANG="en-GB">The DX-Server package needs to be told about your
32local data archive so that it can generate a series of XML files to
33describe the contents of the archive. These are typically CDML
34(Climate Data Management Language) or CSML (Climate Sciences
35Modelling Language) files which can describe NetCDF, GRIB and (with
36limited scope) PP-format data files. The purpose of these XML files
37is to provide a dataset-wide view of your archive rather than lists
38of files in directories. The common properties such as global
39metadata, variables and axes are then available to the DX-Server
without opening any data files. This allows the user to make logical
subsetting requests that can span thousands of data files without needing
to know the format or name of any individual file.</P>
<P LANG="en-GB">There is also a top-level XML file that the DX-Server
interrogates to find out what your dataset groups are and which
datasets they include. The CDML/CSML files map one-to-one to
datasets in the DX-Server.</P>
47<P LANG="en-GB">The following diagram shows how the DX-Server,
48Dataset Group XML file, Dataset XML files and Administrator interact
49to manage these files.</P>
50<P LANG="en-GB"><IMG SRC="images/dx_arch.gif" NAME="Graphic1" ALIGN=BOTTOM WIDTH=767 HEIGHT=383 BORDER=0></P>
51<H2 LANG="en-GB">Dataset Metadata and the DX Hierarchy</H2>
52<P LANG="en-GB">The DX understands the concept of a &quot;Dataset&quot;
53as a collection of one or more data files containing variables with a
54repeated structure. Typically these are 2D or 3D model fields with
55one time step per file.</P>
56<P LANG="en-GB">The DX also has the concept of a &quot;Dataset
57Group&quot;. This is a logical collection of &quot;Datasets&quot;.
58For example:</P>
60        <COL WIDTH=84*>
61        <COL WIDTH=84*>
62        <COL WIDTH=87*>
63        <TR>
64                <TD WIDTH=33%>
65                        <P LANG="en-GB"><B>Dataset Group</B></P>
66                </TD>
67                <TD WIDTH=33%>
68                        <P LANG="en-GB">VFGS Model Output</P>
69                </TD>
70                <TD WIDTH=34%>
71                        <P LANG="en-GB">VFGS Model Output</P>
72                </TD>
73        </TR>
74        <TR>
75                <TD WIDTH=33%>
76                        <P LANG="en-GB"><B>Datasets</B></P>
77                </TD>
78                <TD WIDTH=33%>
79                        <P LANG="en-GB">VFGS Ocean Model Output</P>
80                </TD>
81                <TD WIDTH=34%>
82                        <P LANG="en-GB">VFGS Atmospheric Model Output</P>
83                </TD>
84        </TR>
85        <TR>
86                <TD WIDTH=33%>
87                        <P LANG="en-GB"><B>Variables</B></P>
88                </TD>
89                <TD WIDTH=33%>
90                        <P LANG="en-GB">Salinity, SST...</P>
91                </TD>
92                <TD WIDTH=34%>
93                        <P LANG="en-GB">u-wind, v-wind...</P>
94                </TD>
95        </TR>
97<P LANG="en-GB"><BR><BR>
99<P LANG="en-GB">By default the DX requires the Administrator to
100ingest new Datasets into the DX-Server before they can be accessed by
101users. The Administrator can also create new Dataset Groups to put
102Datasets under.</P>
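<P LANG="en-GB">The hierarchy in the table above can be pictured as a
nested mapping. The following is only an illustrative sketch: the
dictionary layout and the <I>list_variables</I> helper are hypothetical,
and the DX itself keeps this information in XML files.</P>

```python
# Hypothetical in-memory picture of the DX hierarchy. The real DX stores
# this information in XML files, not in Python dictionaries.
hierarchy = {
    "VFGS Model Output": {                                   # Dataset Group
        "VFGS Ocean Model Output": ["Salinity", "SST"],      # Dataset and its variables
        "VFGS Atmospheric Model Output": ["u-wind", "v-wind"],
    }
}


def list_variables(tree, group, dataset):
    """Return the variables of one dataset inside one dataset group."""
    return tree[group][dataset]
```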
<P LANG="en-GB">When interacting with the DX (via the Browser Client
or Command Line Client) the user makes selections in the
following order:</P>
107        <LI><P LANG="en-GB" STYLE="margin-bottom: 0cm">Dataset Group
108        </P>
109        <LI><P LANG="en-GB" STYLE="margin-bottom: 0cm">Dataset
110        </P>
111        <LI><P LANG="en-GB" STYLE="margin-bottom: 0cm">Variable
112        </P>
113        <LI><P LANG="en-GB" STYLE="margin-bottom: 0cm">Spatial (Horizontal
114        and Vertical) axes
115        </P>
116        <LI><P LANG="en-GB" STYLE="margin-bottom: 0cm">Temporal axes
117        </P>
118        <LI><P LANG="en-GB">Output file format
119        </P>
<P LANG="en-GB">If the user selects two variables, the DX will try to
subtract variable 2 from variable 1 by interpolating variable 2 to
the grid of variable 1.</P>
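<P LANG="en-GB">To illustrate the idea of differencing on a common grid,
here is a minimal one-dimensional sketch. It assumes simple linear
interpolation along a single sorted axis; the function names are
hypothetical and this is not the regridding code the DX actually uses.</P>

```python
from bisect import bisect_right


def interpolate(src_x, src_y, x):
    """Linearly interpolate the points (src_x, src_y) at position x.

    src_x must be sorted in increasing order; x is clamped to its range.
    """
    if x <= src_x[0]:
        return src_y[0]
    if x >= src_x[-1]:
        return src_y[-1]
    i = bisect_right(src_x, x)  # src_x[i - 1] <= x < src_x[i]
    x0, x1 = src_x[i - 1], src_x[i]
    y0, y1 = src_y[i - 1], src_y[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def difference(grid1, var1, grid2, var2):
    """Return var1 minus var2, with var2 interpolated to grid1."""
    return [v1 - interpolate(grid2, var2, x) for x, v1 in zip(grid1, var1)]
```

<P LANG="en-GB">The result always lives on the grid of variable 1, which
matches the order of operations described above.</P>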
124<H2 LANG="en-GB">Generating Metadata Files</H2>
125<H3 LANG="en-GB">Adding a new Dataset Group</H3>
<P LANG="en-GB" STYLE="margin-bottom: 0cm"><SPAN STYLE="background: transparent">The
top-level metadata file used by the DX-Server is typically called
<I>inputDatasets.xml</I></SPAN> and is located within the
dxs/datasets/ directory in the standard distribution. This file can
be edited by hand, but it is safer to use the Python ingestion scripts
provided. By default, these scripts copy previous
versions to backup files in the same directory (just in case!).
134<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
136<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">To add
137a new Dataset Group you use the <I></I> script in
138the scripts/ sub-directory. It has the following syntax:</P>
139<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
141<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"></P>
142<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">===================</P>
143<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
145<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">Script
146for adding a new dataset group to the DX.</P>
147<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
149<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">Usage:</P>
150<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">======</P>
151<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
153<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">
154-s &lt;shortName&gt; -l &lt;longName&gt; [-p &lt;fileNamePrefix&gt;]</P>
155<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">[-r
156&lt;permittedRoles&gt;] [-u &lt;permittedUsers&gt;] [-b
158<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">[-d
159&lt;discoveryMetadataLink&gt;] [-a &lt;usageMetadataLink&gt;] [-w
161<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">[-h]
163<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
165<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
167<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">Where:</P>
168<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">======</P>
169<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">&lt;shortName&gt;
170is the short name of the item to add (mandatory).</P>
171<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">&lt;longName&gt;
172is the long name of the item to add (mandatory).</P>
173<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">&lt;fileNamePrefix&gt;
174is the file name prefix for the DX to use (optional).</P>
175<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">&lt;permittedRoles&gt;
176is the list of roles allowed to access the data (default=&quot;all&quot;).</P>
177<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">&lt;permittedUsers&gt;
178is the list of users allowed to access the data (default=&quot;all&quot;).</P>
179<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">&lt;detailedMetadataLink&gt;
180is the URI to the detailed (&quot;B&quot;) metadata (optional).</P>
181<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">&lt;discoveryMetadataLink&gt;
182is the URI to the discovery (&quot;D&quot;) metadata record
184<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">&lt;usageMetadataLink&gt;
185is the URI to the usage (&quot;A&quot;) metadata (optional).</P>
186<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">&lt;documentationLink&gt;
187is the URI to documentation (optional).</P>
188<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">-h
189prints this help message.</P>
190<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">&lt;outputFilePath&gt;
191is an alternative output file to write to (optional).</P>
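<P LANG="en-GB">As a rough sketch, the argument list for the dataset
group script could be assembled in Python as shown below. The
<I>script</I> argument is a placeholder for the real script path, the
function name is hypothetical, and joining roles and users with commas
is an assumption. Only flags that appear in the usage text above are
used.</P>

```python
def build_group_command(script, short_name, long_name, prefix=None,
                        roles=None, users=None):
    """Build the argument list for the dataset group script.

    `script` is a placeholder for the real script path, which this
    sketch does not know. Only flags from the usage text are used.
    """
    cmd = ["python", script, "-s", short_name, "-l", long_name]
    if prefix:
        cmd += ["-p", prefix]
    if roles:
        cmd += ["-r", ",".join(roles)]  # assumes a comma-separated list
    if users:
        cmd += ["-u", ",".join(users)]
    return cmd
```

<P LANG="en-GB">The returned list can be passed to
<I>subprocess.call</I> once the placeholder has been replaced with the
actual script in the scripts/ sub-directory.</P>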
192<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
<H3 LANG="en-GB">Adding an Existing Dataset</H3>
<P LANG="en-GB" STYLE="margin-bottom: 0cm"><SPAN STYLE="background: transparent">An
existing dataset is defined as a dataset that already has a CDML file
describing it. If you have a CDML file you can pass it as one of
the command-line arguments to the <I></I> script in the
scripts/ sub-directory. It has the following syntax:</SPAN></P>
201<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
203<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent"></SPAN></P>
204<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">=============</SPAN></P>
205<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
207<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">Script
208for adding a new dataset to the DX.</SPAN></P>
209<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
211<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">Usage:</SPAN></P>
212<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">======</SPAN></P>
213<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
215<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">
216-s &lt;shortName&gt; -l &lt;longName&gt; -g &lt;datasetGroup&gt; [-n
218<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">[-r
219&lt;permittedRoles&gt;] [-u &lt;permittedUsers&gt;] [-b
221<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">[-d
222&lt;discoveryMetadataLink&gt;] [-a &lt;usageMetadataLink&gt;] [-w
224<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">[-h]
226<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
228<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">Where:</SPAN></P>
229<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">======</SPAN></P>
230<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;shortName&gt;
231is the short name of the item to add (mandatory).</SPAN></P>
232<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;longName&gt;
233is the long name of the item to add (mandatory).</SPAN></P>
<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;datasetGroup&gt;
is the short name of the parent dataset group that the dataset sits
under (mandatory).</SPAN></P>
237<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;fileNameSection&gt;
238is the file name section for the DX to use (optional).</SPAN></P>
239<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;permittedRoles&gt;
240is the comma-separated list of roles allowed to access the data
242<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;permittedUsers&gt;
243is the comma-separated list of users allowed to access the data
245<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;detailedMetadataLink&gt;
246is the URI to the detailed (&quot;B&quot;) metadata (optional).</SPAN></P>
247<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;discoveryMetadataLink&gt;
248is the URI to the discovery (&quot;D&quot;) metadata record
250<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;usageMetadataLink&gt;
251is the URI to the usage (&quot;A&quot;) metadata. In the case of
252datasets the &lt;usageMetadataLink&gt; provides the URI of the
253CDML/CSML file used by the DX to query the metadata. This argument is
254therefore mandatory.</SPAN></P>
255<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;documentationLink&gt;
256is the URI to documentation (optional).</SPAN></P>
257<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">-h
258prints this help message. </SPAN>
260<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;outputFilePath&gt;
261is an alternative output file to write to (optional). </SPAN>
<H3 LANG="en-GB">Removing and Modifying Existing Datasets and Dataset Groups</H3>
<P LANG="en-GB">There is currently no automatic method of removing or
modifying existing Datasets and Dataset Groups in the top-level
<I>inputDatasets.xml</I> file. The recommended course of action is to
roll back to a previous backup (see the contents of the datasets/
sub-directory), rename it to <I>inputDatasets.xml</I> and re-run
<I></I> and <I></I> as required. You
can, of course, edit the XML file by hand if you are brave enough!</P>
272<P LANG="en-GB">To clean out all Dataset Groups and Datasets from the
273<I>inputDatasets.xml</I><SPAN STYLE="font-style: normal"> file you
274can use the </SPAN><I></I><SPAN STYLE="font-style: normal">
275script which just removes all ingested items. When you first install
276the DX you may wish to do this in order to remove the test datasets
277provided with the standard installation bundle.</SPAN></P>
278<H3 LANG="en-GB">Generating a new Dataset XML File: 1. CDML</H3>
<P LANG="en-GB" STYLE="margin-bottom: 0cm">The generation of a new
CDML file requires the use of CDAT's <I>cdscan</I> utility, which is
not described here in detail. The simplest usage of cdscan is as follows:
283<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
285<P LANG="en-GB" STYLE="margin-left: 2cm; margin-bottom: 0cm; font-style: normal">
286cdscan -x &lt;cdmlFile&gt; &lt;inputDataFileList&gt;</P>
287<P LANG="en-GB" STYLE="margin-left: 2cm; margin-bottom: 0cm"><BR>
<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">For example:
291<P LANG="en-GB" STYLE="margin-left: 2cm; margin-bottom: 0cm; font-style: normal">
292cdscan -x myCDMLFile.xml 2000*_*.nc</P>
293<P LANG="en-GB" STYLE="margin-left: 2cm; margin-bottom: 0cm"><BR>
<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">This
would scan all files matching the glob pattern “2000*_*.nc” in
the local directory and write the output file to “myCDMLFile.xml”.
299<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">You
can get a full listing of cdscan's functionality by typing:</P>
303<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
305<P LANG="en-GB" STYLE="margin-left: 2cm; margin-bottom: 0cm; font-style: normal">
306cdscan -h</P>
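<P LANG="en-GB">If you prefer to drive cdscan from a Python script, the
example invocation above could be assembled along the following lines.
This is a sketch only: the function name is hypothetical, and it uses
nothing beyond the <I>-x</I> option shown above.</P>

```python
import glob


def build_cdscan_command(cdml_file, pattern):
    """Return the cdscan argument list for files matching `pattern`."""
    files = sorted(glob.glob(pattern))
    if not files:
        raise ValueError("no data files match %r" % pattern)
    return ["cdscan", "-x", cdml_file] + files
```

<P LANG="en-GB">Expanding the glob in Python and checking that it matched
something avoids running cdscan with an empty file list. The resulting
list can be handed to <I>subprocess.call</I>.</P>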
307<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
309<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal">There
310are many quirks to using cdscan. If it doesn't work for you please
311consult the Trouble Shooting section below.</P>
312<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
314<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal"><B>Note
315on using file name templates in cdscan</B></P>
316<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal; font-weight: medium">
By default cdscan writes the name of every single file into a long
list within the CDML file. This can bloat the XML file greatly for
large datasets with many thousands of files. I have found it
useful to use the file name templating option to overcome this.
324<P LANG="en-GB" STYLE="margin-bottom: 0cm"><BR>
326<P LANG="en-GB" STYLE="margin-bottom: 0cm; font-style: normal; font-weight: medium">
327some stuff here....automate removal of cdms_filemap with some
329<H3 LANG="en-GB">Generating a new Dataset XML File: 2. CSML</H3>
330<P LANG="en-GB">To generate a CSML file you will need to import the
331CSML package from the NERC DataGrid software bundle. At the time of
332writing the interface to this is not finalised so please consult the
333CSML documentation for more details.</P>
334<H3 LANG="en-GB">Treating Individual Data Files as Datasets</H3>
335<P LANG="en-GB">The DX allows you to describe an individual data file
336as a dataset if you prefer not to use the CDML files. You can just
337supply your data file as the dataset argument in the <I></I>
338script and it will work the same as if it had ingested a CDML file.
339The only limitation is that a data file will only be able to describe
340a few GB of data whereas a CDML file can easily describe TBs of data.</P>
341<H3 LANG="en-GB">Trouble Shooting</H3>
342<P LANG="en-GB">This section provides a list of possible problems and
343hopefully some useful solutions.</P>
345        <LI><P LANG="en-GB"><B>Problem:</B></P>
346        <P LANG="en-GB"><B>cdscan gives error: </B>
347        </P>
<P LANG="en-GB" STYLE="margin-bottom: 0cm"><B>Setting reference time
units to days since 2006-01-01 00:00:00</B></P>
<P LANG="en-GB"><B>Traceback (most recent call last):</B></P>
<P LANG="en-GB"><B>File &quot;/usr/local/cdat/bin/cdscan&quot;, line
1343, in ?</B></P>
<P LANG="en-GB"><B>main(sys.argv)</B></P>
<P LANG="en-GB"><B>File &quot;/usr/local/cdat/bin/cdscan&quot;, line
990, in main</B></P>
<P LANG="en-GB"><B>starttime = vartime[0]</B></P>
<P LANG="en-GB"><B>File
line 1714, in __getitem__</B></P>
<P LANG="en-GB"><B>raise IndexError, 'index out of bounds'</B></P>
        <P LANG="en-GB"><B>IndexError: index out of bounds</B></P>
<P LANG="en-GB"><B>Solution:</B></P>
<P LANG="en-GB"><B>Do your data files have a dimension set as
UNLIMITED? This can cause problems for cdscan, so it is wise to
rewrite the data with a fixed dimension length instead of
an unlimited one. This should solve the problem.</B></P>
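<P LANG="en-GB">One way to check for an UNLIMITED dimension is to
inspect the header printed by the NetCDF <I>ncdump -h</I> utility. The
sketch below assumes that ncdump is installed and that you have captured
its output as text; the function name is hypothetical.</P>

```python
import re

# Matches header lines such as "time = UNLIMITED ; // (12 currently)".
UNLIMITED = re.compile(r"^\s*(\w+)\s*=\s*UNLIMITED\s*;", re.MULTILINE)


def unlimited_dimensions(header_text):
    """Return names of dimensions declared UNLIMITED in `ncdump -h` text."""
    return UNLIMITED.findall(header_text)
```

<P LANG="en-GB">An empty list means that no dimension in the header is
declared as unlimited, so the cause of the error probably lies
elsewhere.</P>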
370<P LANG="en-GB"><B>Other cdscan problems are:</B></P>
372        <LI><P LANG="en-GB"><B>second/minute timestep</B></P>
373        <LI><P LANG="en-GB"><B>repeat vars but differing on cellmethods –
374        I have a patch</B></P>
375        <LI><P LANG="en-GB"><B>delete the cdmsfilemap</B></P>
376        <LI><P LANG="en-GB"><B>template values – are they all listed in
377        cdimport -h or only in</B></P>
378        <LI><P LANG="en-GB"><B>massive linear dimension or partition</B></P>
379        <LI><P LANG="en-GB"><B>missing value stuff – we have a patch for
380        simple time cases.</B></P>
381        <LI><P LANG="en-GB"><B>non-linear time dim – might need to do -j
382        option </B>
383        </P>
384        <LI><P LANG="en-GB"></P>
387        <LI><P LANG="en-GB"><B>Problem:</B></P>
388        <UL>
389                <LI><P LANG="en-GB"><B>Solution:</B></P>
390        </UL>
391        <LI><P LANG="en-GB"><B>Problem:</B></P>
392        <UL>
393                <LI><P LANG="en-GB"><B>Solution:</B></P>
394        </UL>