source: TI03-DataExtractor/trunk/dxs/manuals/ingest.html @ 794

Subversion URL: http://proj.badc.rl.ac.uk/svn/ndg/TI03-DataExtractor/trunk/dxs/manuals/ingest.html@794
Revision 794, 16.5 KB checked in by astephen, 13 years ago

Unstable but latest version with multi-variable support and split hooks
for CDML and CSML.

1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
2<HTML>
3<HEAD>
4        <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
5        <TITLE>The Data Extractor (DX) Ingestion Guide</TITLE>
6        <META NAME="GENERATOR" CONTENT="OpenOffice.org 1.1.1  (Linux)">
7        <META NAME="CREATED" CONTENT="20060317;12440700">
8        <META NAME="CHANGED" CONTENT="20060406;9145300">
9        <META NAME="ProgId" CONTENT="FrontPage.Editor.Document">
10</HEAD>
11<BODY LANG="en-US" DIR="LTR">
12<H1>The Data Extractor (DX) Ingestion Guide</H1>
13<H2>Introduction</H2>
14<P>This document is one of a series of documents describing how to
15set up and use the Data Extractor (DX). Further information is, or
16soon will be, available in the following guides:</P>
17<UL>
18        <LI><P STYLE="margin-bottom: 0cm"><B>DX Overview</B></P>
19        <LI><P STYLE="margin-bottom: 0cm"><B>DX Installation Guide</B> 
20        </P>
21        <LI><P STYLE="margin-bottom: 0cm"><B>DX Administrator's Guide</B> 
22        </P>
23        <LI><P STYLE="margin-bottom: 0cm"><B>DX User Guide</B> 
24        </P>
25        <LI><P><B>Guide to Securing the DX</B> 
26        </P>
27</UL>
28<H2>Data Ingestion Overview</H2>
29<P>The DX-Server package needs to be told about your local data
30archive so that it can generate a series of XML files to describe the
31contents of the archive. These are typically CDML (Climate Data
32Management Language) files which can describe NetCDF, GRIB and (with
33limited scope) PP-format data files. The purpose of these XML files
34is to provide a dataset-wide view of your archive rather than lists
35of files in directories. The common properties such as global
36metadata, variables and axes are then available to the DX-Server
37without opening any data files. This allows the user to make logical
38subsetting requests that can span thousands of data files without needing
39to know the format or names of any individual file.</P>
40<P>There is also a top-level XML file that the DX-Server interrogates
41to find out what your dataset groups are and which datasets they
42include. The CDML files map one-to-one to datasets in the DX-Server.</P>
43<P>The following diagram shows how the DX-Server, Dataset Group XML
44file, Dataset XML files and Administrator interact to manage these
45files.</P>
46<P><IMG SRC="images/dx_arch.gif" NAME="Graphic1" ALIGN=BOTTOM WIDTH=767 HEIGHT=383 BORDER=0></P>
47<H2>Dataset Metadata and the DX Hierarchy</H2>
48<P>The DX understands the concept of a &quot;Dataset&quot; as a
49collection of one or more data files containing variables with a
50repeated structure. Typically these are 2D or 3D model fields with
51one time step per file.</P>
52<P>The DX also has the concept of a &quot;Dataset Group&quot;. This
53is a logical collection of &quot;Datasets&quot;. For example:</P>
54<TABLE WIDTH=100% BORDER=1 CELLPADDING=2 CELLSPACING=3>
55        <COL WIDTH=84*>
56        <COL WIDTH=84*>
57        <COL WIDTH=87*>
58        <TR>
59                <TD WIDTH=33%>
60                        <P><B>Dataset Group</B></P>
61                </TD>
62                <TD WIDTH=33%>
63                        <P>VFGS Model Output</P>
64                </TD>
65                <TD WIDTH=34%>
66                        <P>VFGS Model Output</P>
67                </TD>
68        </TR>
69        <TR>
70                <TD WIDTH=33%>
71                        <P><B>Datasets</B></P>
72                </TD>
73                <TD WIDTH=33%>
74                        <P>VFGS Ocean Model Output</P>
75                </TD>
76                <TD WIDTH=34%>
77                        <P>VFGS Atmospheric Model Output</P>
78                </TD>
79        </TR>
80        <TR>
81                <TD WIDTH=33%>
82                        <P><B>Variables</B></P>
83                </TD>
84                <TD WIDTH=33%>
85                        <P>Salinity, SST...</P>
86                </TD>
87                <TD WIDTH=34%>
88                        <P>u-wind, v-wind...</P>
89                </TD>
90        </TR>
91</TABLE>
92<P><BR><BR>
93</P>
94<P>By default the DX requires the Administrator to ingest new
95Datasets into the DX-Server before they can be accessed by users. The
96Administrator can also create new Dataset Groups to put Datasets
97under.</P>
98<P>When interacting with the DX (via the Browser Client or Command
99Line Client) the user will make selections in the following
100order:</P>
101<OL>
102        <LI><P STYLE="margin-bottom: 0cm">Dataset Group
103        </P>
104        <LI><P STYLE="margin-bottom: 0cm">Dataset
105        </P>
106        <LI><P STYLE="margin-bottom: 0cm">Variable
107        </P>
108        <LI><P STYLE="margin-bottom: 0cm">Spatial (Horizontal and Vertical)
109        axes
110        </P>
111        <LI><P STYLE="margin-bottom: 0cm">Temporal axes
112        </P>
113        <LI><P>Output file format
114        </P>
115</OL>
116<P>If the user selects two variables the DX will try to subtract
117variable 2 from variable 1 by interpolating variable 2 onto the grid of
118variable 1.</P>
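<P>The idea of differencing two variables on a common grid can be sketched in a few lines of Python. This is only a one-dimensional illustration using linear interpolation; the function names are hypothetical and the regridding method actually used by the DX is not described in this guide.</P>
<PRE>
```python
from bisect import bisect_right


def interpolate(src_grid, src_values, target_grid):
    """Linearly interpolate src_values (defined on src_grid) onto target_grid.

    Both grids must be ascending; target points should lie inside src_grid.
    """
    result = []
    for x in target_grid:
        # Index of the source interval that contains x (clamped to the ends).
        i = min(max(bisect_right(src_grid, x) - 1, 0), len(src_grid) - 2)
        x0, x1 = src_grid[i], src_grid[i + 1]
        weight = (x - x0) / (x1 - x0)
        result.append(src_values[i] * (1 - weight) + src_values[i + 1] * weight)
    return result


def difference(grid1, values1, grid2, values2):
    """Return variable 1 minus variable 2, evaluated on the grid of variable 1."""
    values2_on_grid1 = interpolate(grid2, values2, grid1)
    return [a - b for a, b in zip(values1, values2_on_grid1)]
```
</PRE>
<P>The result always lives on the grid of variable 1, so the order in which the two variables are selected matters.</P>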
119<H2>Generating Metadata Files</H2>
120<H3>Adding a new Dataset Group</H3>
121<P STYLE="margin-bottom: 0cm"><SPAN STYLE="background: transparent">The
122top-level metadata file used by the DX-Server is typically called
123<I>inputDatasets.xml</I></SPAN> and is located within the
124dxs/datasets/ directory in the standard distribution. This file can
125be edited by hand, but it is safer to use the Python ingestion scripts
126provided. By default, these scripts copy previous
127versions to backup files in the same directory (just in case!).
128</P>
129<P STYLE="margin-bottom: 0cm"><BR>
130</P>
131<P STYLE="margin-bottom: 0cm; font-style: normal">To add a new
132Dataset Group you use the <I>addDatasetGroup.py</I> script in the
133scripts/ sub-directory. It has the following syntax:</P>
134<P STYLE="margin-bottom: 0cm"><BR>
135</P>
136<P STYLE="margin-bottom: 0cm; font-style: normal">addDatasetGroup.py</P>
137<P STYLE="margin-bottom: 0cm; font-style: normal">==================</P>
138<P STYLE="margin-bottom: 0cm"><BR>
139</P>
140<P STYLE="margin-bottom: 0cm; font-style: normal">Script for adding a
141new dataset group to the DX.</P>
142<P STYLE="margin-bottom: 0cm"><BR>
143</P>
144<P STYLE="margin-bottom: 0cm; font-style: normal">Usage:</P>
145<P STYLE="margin-bottom: 0cm; font-style: normal">======</P>
146<P STYLE="margin-bottom: 0cm"><BR>
147</P>
148<P STYLE="margin-bottom: 0cm; font-style: normal">addDatasetGroup.py
149-s &lt;shortName&gt; -l &lt;longName&gt; [-p &lt;fileNamePrefix&gt;]</P>
150<P STYLE="margin-bottom: 0cm; font-style: normal">[-r
151&lt;permittedRoles&gt;] [-u &lt;permittedUsers&gt;] [-b
152&lt;detailedMetadataLink&gt;]</P>
153<P STYLE="margin-bottom: 0cm; font-style: normal">[-d
154&lt;discoveryMetadataLink&gt;] [-a &lt;usageMetadataLink&gt;] [-w
155&lt;documentationLink&gt;]</P>
156<P STYLE="margin-bottom: 0cm; font-style: normal">[-h]
157&lt;outputFilePath&gt;</P>
158<P STYLE="margin-bottom: 0cm"><BR>
159</P>
160<P STYLE="margin-bottom: 0cm"><BR>
161</P>
162<P STYLE="margin-bottom: 0cm; font-style: normal">Where:</P>
163<P STYLE="margin-bottom: 0cm; font-style: normal">======</P>
164<P STYLE="margin-bottom: 0cm; font-style: normal">&lt;shortName&gt;
165is the short name of the item to add (mandatory).</P>
166<P STYLE="margin-bottom: 0cm; font-style: normal">&lt;longName&gt; is
167the long name of the item to add (mandatory).</P>
168<P STYLE="margin-bottom: 0cm; font-style: normal">&lt;fileNamePrefix&gt;
169is the file name prefix for the DX to use (optional).</P>
170<P STYLE="margin-bottom: 0cm; font-style: normal">&lt;permittedRoles&gt;
171is the list of roles allowed to access the data (default=&quot;all&quot;).</P>
172<P STYLE="margin-bottom: 0cm; font-style: normal">&lt;permittedUsers&gt;
173is the list of users allowed to access the data (default=&quot;all&quot;).</P>
174<P STYLE="margin-bottom: 0cm; font-style: normal">&lt;detailedMetadataLink&gt;
175is the URI to the detailed (&quot;B&quot;) metadata (optional).</P>
176<P STYLE="margin-bottom: 0cm; font-style: normal">&lt;discoveryMetadataLink&gt;
177is the URI to the discovery (&quot;D&quot;) metadata record
178(optional).</P>
179<P STYLE="margin-bottom: 0cm; font-style: normal">&lt;usageMetadataLink&gt;
180is the URI to the usage (&quot;A&quot;) metadata (optional).</P>
181<P STYLE="margin-bottom: 0cm; font-style: normal">&lt;documentationLink&gt;
182is the URI to documentation (optional).</P>
183<P STYLE="margin-bottom: 0cm; font-style: normal">-h prints this help
184message.</P>
185<P STYLE="margin-bottom: 0cm; font-style: normal">&lt;outputFilePath&gt;
186is an alternative output file to write to (optional).</P>
187<P STYLE="margin-bottom: 0cm"><BR>
188</P>
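<P>As an illustration of the syntax above, the following Python sketch assembles an argument list for <I>addDatasetGroup.py</I>. The helper name, the <I>python</I> interpreter name and the relative script path are assumptions made for this example and are not part of the DX; only the option letters are taken from the usage text. The metadata link options (-b, -d, -a, -w) are left out for brevity.</P>
<PRE>
```python
def add_dataset_group_command(short_name, long_name, file_name_prefix=None,
                              roles=None, users=None, output_file_path=None):
    """Build an argument list for addDatasetGroup.py (hypothetical helper)."""
    # Assumed location: run from the dxs/ directory of the distribution.
    cmd = ["python", "scripts/addDatasetGroup.py",
           "-s", short_name, "-l", long_name]
    if file_name_prefix:
        cmd += ["-p", file_name_prefix]
    if roles:
        cmd += ["-r", roles]
    if users:
        cmd += ["-u", users]
    if output_file_path:
        # The alternative output file is the trailing positional argument.
        cmd.append(output_file_path)
    return cmd
```
</PRE>
<P>The returned list can be passed to <I>subprocess.call()</I>. Writing to an alternative output file first is a cautious way to inspect the result before touching the live <I>inputDatasets.xml</I>.</P>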
189<H3>Adding an Existing Dataset
190</H3>
191<P STYLE="margin-bottom: 0cm"><SPAN STYLE="background: transparent">An
192existing dataset is defined as a dataset that already has a CDML file
193describing it. If you have a CDML file you can pass it as one of
194the command-line arguments to the <I>addDataset.py</I> script in the
195scripts/ sub-directory. It has the following syntax:</SPAN></P>
196<P STYLE="margin-bottom: 0cm"><BR>
197</P>
198<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">addDataset.py</SPAN></P>
199<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">=============</SPAN></P>
200<P STYLE="margin-bottom: 0cm"><BR>
201</P>
202<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">Script
203for adding a new dataset to the DX.</SPAN></P>
204<P STYLE="margin-bottom: 0cm"><BR>
205</P>
206<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">Usage:</SPAN></P>
207<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">======</SPAN></P>
208<P STYLE="margin-bottom: 0cm"><BR>
209</P>
210<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">addDataset.py
211-s &lt;shortName&gt; -l &lt;longName&gt; -g &lt;datasetGroup&gt; [-n
212&lt;fileNameSection&gt;]</SPAN></P>
213<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">[-r
214&lt;permittedRoles&gt;] [-u &lt;permittedUsers&gt;] [-b
215&lt;detailedMetadataLink&gt;]</SPAN></P>
216<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">[-d
217&lt;discoveryMetadataLink&gt;] [-a &lt;usageMetadataLink&gt;] [-w
218&lt;documentationLink&gt;]</SPAN></P>
219<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">[-h]
220&lt;outputFilePath&gt;</SPAN></P>
221<P STYLE="margin-bottom: 0cm"><BR>
222</P>
223<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">Where:</SPAN></P>
224<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">======</SPAN></P>
225<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;shortName&gt;
226is the short name of the item to add (mandatory).</SPAN></P>
227<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;longName&gt;
228is the long name of the item to add (mandatory).</SPAN></P>
229<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;datasetGroup&gt;
230is the short name of the parent dataset group that the dataset sits
231under (mandatory).</SPAN></P>
232<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;fileNameSection&gt;
233is the file name section for the DX to use (optional).</SPAN></P>
234<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;permittedRoles&gt;
235is the comma-separated list of roles allowed to access the data
236(default=&quot;all&quot;).</SPAN></P>
237<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;permittedUsers&gt;
238is the comma-separated list of users allowed to access the data
239(default=&quot;all&quot;).</SPAN></P>
240<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;detailedMetadataLink&gt;
241is the URI to the detailed (&quot;B&quot;) metadata (optional).</SPAN></P>
242<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;discoveryMetadataLink&gt;
243is the URI to the discovery (&quot;D&quot;) metadata record
244(optional).</SPAN></P>
245<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;usageMetadataLink&gt;
246is the URI to the usage (&quot;A&quot;) metadata (optional).</SPAN></P>
247<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;documentationLink&gt;
248is the URI to documentation (optional).</SPAN></P>
249<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">-h
250prints this help message. </SPAN>
251</P>
252<P STYLE="margin-bottom: 0cm; font-style: normal"><SPAN STYLE="background: transparent">&lt;outputFilePath&gt;
253is an alternative output file to write to (optional). </SPAN>
254</P>
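<P>The sketch below shows one way to check the mandatory options and join the role and user lists with commas before calling <I>addDataset.py</I>. The helper name, the <I>python</I> interpreter name and the script path are assumptions for illustration. The usage text above does not show where the CDML file path goes on the command line, so the sketch deliberately leaves it out.</P>
<PRE>
```python
def add_dataset_command(short_name, long_name, dataset_group,
                        file_name_section=None, roles=(), users=()):
    """Build an argument list for addDataset.py (hypothetical helper)."""
    # -s, -l and -g are documented as mandatory, so fail early without them.
    if not (short_name and long_name and dataset_group):
        raise ValueError("short name, long name and dataset group are mandatory")
    cmd = ["python", "scripts/addDataset.py",
           "-s", short_name, "-l", long_name, "-g", dataset_group]
    if file_name_section:
        cmd += ["-n", file_name_section]
    if roles:
        # Roles and users are documented as comma-separated lists.
        cmd += ["-r", ",".join(roles)]
    if users:
        cmd += ["-u", ",".join(users)]
    return cmd
```
</PRE>
<P>Omitting <I>roles</I> and <I>users</I> leaves the script's documented default of &quot;all&quot; in place.</P>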
255<H3>Removing and Modifying Existing Datasets and Dataset Groups</H3>
256<P>There is currently no automatic method of removing or modifying
257existing Datasets and Dataset Groups in the top-level
258<I>inputDatasets.xml</I> file. The recommended course of action is to
259roll back to a previous backup (see the contents of the datasets/
260sub-directory), rename it to <I>inputDatasets.xml</I> and re-run
261<I>addDatasetGroup.py</I> and <I>addDataset.py</I> as required. You
262can, of course, edit the XML file by hand if you are brave enough!</P>
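<P>The roll-back step can be scripted. The following Python sketch copies the most recently modified backup over the live file. It assumes that backup files share the live file's name plus some suffix; this guide does not state how the ingestion scripts name their backups, so check the contents of datasets/ before relying on that pattern. The function name is hypothetical.</P>
<PRE>
```python
import glob
import os
import shutil


def restore_latest_backup(datasets_dir, name="inputDatasets.xml"):
    """Copy the newest backup over the live file and return the backup's path."""
    live = os.path.join(datasets_dir, name)
    # Assumption: backups are named like the live file with an extra suffix.
    backups = [p for p in glob.glob(live + "*") if p != live]
    if not backups:
        raise IOError("no backups of %s found in %s" % (name, datasets_dir))
    latest = max(backups, key=os.path.getmtime)
    shutil.copy2(latest, live)
    return latest
```
</PRE>
<P>After restoring, re-run the ingestion scripts for any Dataset Groups and Datasets added since that backup was made.</P>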
263<H3>Generating a new Dataset XML File (CDML)</H3>
264<P STYLE="margin-bottom: 0cm">The generation of a new CDML file
265requires the use of CDAT's <I>cdscan</I> utility which is not
266described here in detail. The simplest usage of cdscan is as follows:</P>
267<P STYLE="margin-bottom: 0cm"><BR>
268</P>
269<P STYLE="margin-left: 2cm; margin-bottom: 0cm; font-style: normal">cdscan
270-x &lt;cdmlFile&gt; &lt;inputDataFileList&gt;</P>
271<P STYLE="margin-left: 2cm; margin-bottom: 0cm"><BR>
272</P>
273<P STYLE="margin-bottom: 0cm; font-style: normal">For example:</P>
274<P STYLE="margin-left: 2cm; margin-bottom: 0cm; font-style: normal">cdscan
275-x myCDMLFile.xml 2000*_*.nc</P>
276<P STYLE="margin-left: 2cm; margin-bottom: 0cm"><BR>
277</P>
278<P STYLE="margin-bottom: 0cm; font-style: normal">This would scan all
279files matching the glob pattern “2000*_*.nc” in the local
280directory and write the output to “myCDMLFile.xml”.
281</P>
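<P>When cdscan is driven from a Python script there is no shell to expand the glob pattern, so the file list has to be built explicitly. The sketch below does this with the standard <I>glob</I> module. The function name is hypothetical, and only the <I>-x</I> option shown above is used.</P>
<PRE>
```python
import glob
import os


def cdscan_command(cdml_file, pattern, directory="."):
    """Build a 'cdscan -x' argument list for every file matching pattern."""
    # Sort the matches so the files are listed in a predictable order.
    files = sorted(glob.glob(os.path.join(directory, pattern)))
    if not files:
        raise ValueError("no files match %r in %r" % (pattern, directory))
    return ["cdscan", "-x", cdml_file] + files
```
</PRE>
<P>The returned list can be passed to <I>subprocess.call()</I>. Failing early on an empty match avoids running cdscan with no input files.</P>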
282<P STYLE="margin-bottom: 0cm"><BR>
283</P>
284<P STYLE="margin-bottom: 0cm; font-style: normal">You can get a full
285listing of cdscan's functionality by typing:
286<P STYLE="margin-bottom: 0cm"><BR>
287</P>
288<P STYLE="margin-left: 2cm; margin-bottom: 0cm; font-style: normal">cdscan
289-h</P>
290<P STYLE="margin-bottom: 0cm"><BR>
291</P>
292<P STYLE="margin-bottom: 0cm; font-style: normal">There are many
293quirks to using cdscan. If it doesn't work for you please consult the
294Trouble Shooting section below.</P>
295<P STYLE="margin-bottom: 0cm"><BR>
296</P>
297<P STYLE="margin-bottom: 0cm; font-style: normal"><B>Note on using
298file name templates in cdscan</B></P>
299<P STYLE="margin-bottom: 0cm"><BR>
300</P>
301<P STYLE="margin-bottom: 0cm; font-style: normal; font-weight: medium">
302By default cdscan writes the name of every single file into a large
303list within the CDML file. This can bloat the XML file greatly for
304large datasets with many thousands of files. I have found it
305useful to use the file name templating option to overcome this
306problem.</P>
307<P STYLE="margin-bottom: 0cm"><BR>
308</P>
309<P STYLE="margin-bottom: 0cm; font-style: normal; font-weight: medium">
310some stuff here....automate removal of cdms_filemap with some
311argument!</P>
312<H3>Treating Individual Data Files as Datasets</H3>
313<P>The DX allows you to describe an individual data file as a dataset
314if you prefer not to use the CDML files. You can just supply your
315data file as the dataset argument in the <I>addDataset.py</I> script
316and it will work the same as if it had ingested a CDML file. The only
317limitation is that a data file will only be able to describe a few GB
318of data whereas a CDML file can easily describe TBs of data.</P>
319<H3>Trouble Shooting</H3>
320<P>This section provides a list of possible problems and hopefully
321some useful solutions.</P>
322<UL>
323        <LI><P><B>Problem:</B></P>
324        <P><B>cdscan gives error: </B>
325        </P>
326</UL>
327<P STYLE="margin-bottom: 0cm"><B>Setting reference time units to days
328since 2006-01-01 00:00:00</B></P>
329<P><B>Traceback (most recent call last):</B></P>
330<P>  <B>File &quot;/usr/local/cdat/bin/cdscan&quot;, line 1343, in ?</B></P>
331<P>    <B>main(sys.argv)</B></P>
332<P>  <B>File &quot;/usr/local/cdat/bin/cdscan&quot;, line 990, in
333main</B></P>
334<P>    <B>starttime = vartime[0]</B></P>
335<P>  <B>File
336&quot;/disks/linux1/cdat_installations/cdat_latest/lib/python2.4/site-packages/cdms/axis.py&quot;,
337line 1714, in __getitem__</B></P>
338<P>    <B>raise IndexError, 'index out of bounds'</B></P>
339<UL>
340        <P><B>IndexError: index out of bounds</B></P>
341</UL>
342<P><B>Solution:</B></P>
343<P><B>Do your data files have a dimension set as UNLIMITED? This can
344cause problems for cdscan, so it is wise to rewrite the data files with
345a fixed dimension size instead of an unlimited dimension. This
346should solve the problem.</B></P>
347<P><B>Other cdscan problems are:</B></P>
348<UL>
349        <LI><P><B>second/minute timestep</B></P>
350        <LI><P><B>repeat vars but differing on cellmethods – I have a
351        patch</B></P>
352        <LI><P><B>delete the cdmsfilemap</B></P>
353        <LI><P><B>template values – are they all listed in cdimport -h or
354        only in cdmsobj.py</B></P>
355        <LI><P><B>massive linear dimension or partition</B></P>
356        <LI><P><B>missing value stuff – we have a patch for simple time
357        cases.</B></P>
358        <LI><P><B>non-linear time dim – might need to do -j option </B>
359        </P>
361</UL>
372</BODY>
373</HTML>