Overview
------------

Two ingester scripts now exist to ingest documents from a data centre into the discovery
postgres DB.  The original script, oai_document_ingester, retrieves docs harvested by the
OAI service.  A new script, feeddocumentingester, has been added to harvest documents
directly from an eXist repository running a feed reporting service, which alerts the system
to the publication of new documents to harvest.

The two scripts share some common functionality; to allow code re-use, this has been
abstracted out into a parent class, AbstractDocumentIngester.  More detail on the two
scripts is provided below, together with their shared functionality - which effectively
takes over from the point at which a newly harvested document has been discovered.

Install
-----------

The various datacentre config files used by the scripts are loaded directly from the
python package, as a package resource; as a result, the OAIBatch package needs to be
installed into the running python site-packages library before the scripts will work
- i.e. run 'python setup.py install' in the top level directory.
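
As a purely illustrative sketch of what 'loaded as a package resource' means here - the
config filename below is hypothetical, not taken from the code:

  # hypothetical example: read a datacentre config file bundled with the installed OAIBatch package
  import pkg_resources
  config_contents = pkg_resources.resource_string('OAIBatch', 'datacentre_config/badc_config.properties')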

NB, this change was made to allow the scripts to be used by other packages - in particular
the OAIInfoEditor.

oai_document_ingester
------------------------------

The oai_document_ingester script ingests documents obtained by the OAI harvester.

Usage: python oai_document_ingester.py [OPTION] <datacentre>
 - where:
   <datacentre> is the data centre to ingest data from; and options are:
 -v - verbose mode for output logging
 -d - debug mode for output logging


This script can be run standalone, by specifying a datacentre to ingest files from, or the wrapper script
run_all_ingest.py can be used; this will parse the contents of the datacentre_config directory and
run the oai_document_ingester against every config file located there.
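
For example - assuming 'badc' is one of the configured datacentres, and assuming
run_all_ingest.py needs no arguments:

  python oai_document_ingester.py -v badc
  python run_all_ingest.py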


feeddocumentingester
------------------------------

The feed ingester polls a feed set up on an eXist DB which hosts records to harvest.  This
feed should be updated when a new record to harvest is created - and when the ingester
polls this feed again, it will try to retrieve and ingest the new document.

Usage: python feeddocumentingester.py [OPTION] <feed> [interval=..], [ingestFromDate=..]
              [eXistDBHostname=..], [eXistPortNo=..], [dataCentrePoll=..]
 - where:
   <feed> is the atom feed to ingest data from; options are:
 -v - verbose mode for output logging
 -d - debug mode for output logging
 and keywords are:
 interval - interval, in seconds, at which to retrieve data from the feed
 ingestFromDate - date, in the format 'YYYY-MM-DD', from which documents should be ingested - if not set, the ingest date is taken as the current time
 eXistDBHostname - name of the eXist DB to retrieve data from - NB, this will likely be where the feed is based, too - default is 'chinook.badc.rl.ac.uk'
 eXistPortNo - port number used by the eXist DB - defaults to '8080'
 dataCentrePoll - data centre whose documents should be polled for - e.g. 'badc', 'neodc' - if not set, all documents on the feed will be ingested
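
For example - using the example feed URL given below, with purely illustrative interval and
date values:

  python feeddocumentingester.py -v http://chinook.badc.rl.ac.uk:8080/exist/atom/content/db/DIF interval=3600 ingestFromDate=2009-01-01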

NB, the feed URL will typically point at the RESTful interface of the eXist DB which is hosting the feed.
The format for this is:

http://eXistHostAndPort/exist/atom/content/<db-collection-for-feed> - e.g.:

http://chinook.badc.rl.ac.uk:8080/exist/atom/content/db/DIF

The client expects this feed to have entries whose content element has a src attribute featuring
a valid ndgURL - e.g.:

<feed xmlns="http://www.w3.org/2005/Atom">
    <id>urn:uuid:28a477f6-df52-4bb4-9c30-4ef5e3d5c615</id>
    <updated>2009-01-20T10:33:02+00:00</updated>
    <title>DIF Data</title>
    <link href="#" rel="edit" type="application/atom+xml"/>
    <link href="#" rel="self" type="application/atom+xml"/>
    <entry>
        <id>urn:uuid:fe496d7f-bdd9-409e-b3b1-2af05a2d0f33</id>
        <updated>2009-01-19T17:22:27+00:00</updated>
        <published>2009-01-19T17:22:27+00:00</published>
        <link href="?id=urn:uuid:fe496d7f-bdd9-409e-b3b1-2af05a2d0f33" rel="edit" type="application/atom+xml"/>
        <title>DIF Record [CRYOspheric STudies of Atmospheric Trends (CRYOSTAT)]</title>
        <summary>
        The CRYOspheric STudies of Atmospheric Trends in stratospherically and radiatively important gases (CRYOSTAT) will undertake the first combined measurements of virtually all significant Greenhouse gases (GHGs). GHGs (other than water vapour), ozone-depleting substances (ODSs), and related trace gases in contiguous firn and ice profiles, spanning as much as 200 years, from both the northern and southern polar ice caps. CRYOSTAT is an evolution of the FIRETRACC/100 project, the data from which is also held at BADC.
        </summary>
        <content src="http://badc.nerc.ac.uk:50001/view/badc.nerc.ac.uk__BROWSE-DIF__dataent_cryostat" type="application/atom+xml"/>
    </entry>
</feed>

- NB, the ndgURL in this case is badc.nerc.ac.uk__BROWSE-DIF__dataent_cryostat
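
As an illustration of pulling the ndgURL out of such an entry - a standalone sketch using
the python standard library, not the actual ingester code:

  # parse an atom feed (supplied as a string) and pull the ndgURL from each entry's content src
  import xml.etree.ElementTree as ET

  ATOM_NS = '{http://www.w3.org/2005/Atom}'

  def extract_ndg_urls(feed_xml):
      root = ET.fromstring(feed_xml)
      urls = []
      for entry in root.findall(ATOM_NS + 'entry'):
          content = entry.find(ATOM_NS + 'content')
          if content is not None and content.get('src'):
              # as in the example above, the ndgURL is taken to be the final path component of src
              urls.append(content.get('src').rsplit('/', 1)[-1])
      return urls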


Shared functionality
------------------------------

Once a new document to ingest has been discovered (i.e. either via the OAI harvester or
extracted directly from an eXist repository on prompting from a feed update), the following
workflow is completed:
1. document contents are adjusted to avoid namespace issues
2. documents are renamed to include their discovery ids
3. documents are then transformed, using XQueries run via the java saxon library, to
the generic Moles (1.3) format - and then transformed from this into all the other
available document formats (currently DIF, DC, ISO19139 - NB, currently the MDIP
transform does not work - see ToDo (IS THIS STILL TRUE?)).
4. all transformed docs are stored in the DB and spatiotemporal data is also
extracted from the moles files and stored.
5. if the script processes all files successfully, the harvest directory is backed up locally and then
cleared.

NB, the difference between the oai and feed ingesters here is that, for the oai ingester,
the above workflow is completed on all docs ingested from the oai, whereas for the feed
ingester, it is completed for each document retrieved from the feed, in turn.


Details for the postgres database are stored in the ingest.txt file; this file should have its
access permissions set to 0600 and be owned by the user running the script, to keep the
connection details secure.
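
For example, run from the directory containing the file:

  chmod 600 ingest.txt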

Layout
----------
Dependent python files are stored directly under the ingestAutomation/OAIBatch directory.
Dependent library files (currently only saxon9.jar) are stored in ingestAutomation/OAIBatch/lib.
Config files for the various data centres are stored in the ingestAutomation/OAIBatch/datacentre_config directory.
Files for setting up the datamodel are available in ingestAutomation/database.

When running, the script uses the current working directory as the sandbox to operate from.  A
directory, 'data', is created here and used to store the various docs, in their various forms,
during ingest.
Backup copies of the original harvested files are stored in /disks/glue1/oaiBackup/


Error Handling - OAI harvester
-----------------

The script will process all files it finds in a harvest directory for a data centre, one by one.  If
all files are successfully ingested, the harvest directory is then cleared; if not, it is retained so
that the script can be re-run once problems have been fixed.

If problems are experienced, these will be reported in the output logging; the script can be re-run as
many times as necessary to fix the problems; any files that were already successfully ingested will be skipped.


Updates & Duplicate docs
------------------------------

If a file is re-ingested, a check is done against its discovery id and the contents of the file; if this
is found to be identical to an existing row in the DB, the 'harvest count' for that row is incremented
and further processing is stopped.  If a difference is found in the content, the transformed docs are
all appropriately updated and the spatiotemporal data is recreated from scratch (with the help of cascading
delete triggers in the DB).

NB, any updates to the original doc will cause a trigger to record the original row data in the original_document_history
table - to allow basic audit tracking of changes.


Document Transforms
--------------------------
Docs are transformed using the Saxon java library - running the transform directly from the command line.  The transforms use the
XQueries and XQuery libs provided by the ndg.common.src egg - so this must be installed before the script is used.  NB, the various XQuery
libs that are referenced by the XQueries are extracted locally and the references in the scripts adjusted to point to these copies
- to ensure the current libs are used when running the scripts.
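
For reference, a transform of this kind can be run by hand along the following lines - the
query and output file names here are illustrative only, and the exact arguments the ingester
builds may differ:

  java -cp lib/saxon9.jar net.sf.saxon.Query dif2moles.xq > doc_moles.xml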


Data Model
---------------

All data is stored in the postgres database - the data model is available under ingestAutomation/database.
The DB can be recreated by running the create_database.sh script - this takes the following parameters:

create_database.sh username database [hostname] [port]
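
e.g. (the user, database name, host and port here are purely illustrative):

  ./create_database.sh postgres discovery localhost 5432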

The DB is Postgres 8.3 - this version (or later) is required for the text search functionality to be
available.  Additionally, PostGIS is installed in the DB, to allow the storage and searching of data in the
required coordinate systems.


To Do
-------------

2. Whilst testing the scripts, it was noted that the various MDIP transforms do not currently work
for the majority of the badc datasets; as such, MDIP format has been
commented out of the PostgresRecord class; this will need to be fixed asap!

3. The system should handle the deletion of files from the DB.  Not sure how this is handled by the harvester - i.e. are all files
always harvested, in which case we need to do a check in the DB so that the contents directly match - and then delete any extras - or is there
another way of determining deletions?  Once this is established, use the PostgresDAO.deleteOriginalDocument() method to do the clearing
up.

4. When adding keywords, using the keywordAdder class, the descriptionOnlineReference element loses its values
- this appears to be due to the molesReadWrite class not picking up the necessary structure - which appears
to be absent from the ndgmetadata1.3.xsd schema?

5. The molesReadWrite class, used when adding keywords to the moles file, does not correctly handle namespace-qualified attributes
- leading to these being duplicated in some cases - e.g. the ndgmetadata1.3.xsd schema features elements with the names
'dgMetadataDescriptionType' and 'moles:dgMetadataDescriptionType'
- this then leads to the elements being duplicated when the molesElement.toXML() method is called.  The most obvious
consequence of this is that the AbstractText element is included twice; as such, this particular event has been caught
and escaped in the code - however a better fix would be to ensure that namespaces are either stripped out and/or handled
properly.

6. The SpatioTemporalData class is a simple struct to store spatiotemporal data; it would be better if there were
some simple checking of this data (esp. the temporal data), to ensure it meets the input requirements - e.g.
checking dates so they can be turned into valid DB timestamps (it is not clear what formats we accept, which is why this
isn't implemented).
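
A minimal sketch of the sort of date check meant here - assuming, purely for illustration,
that dates arrive as 'YYYY-MM-DD' strings, which may not match the formats actually received:

  # illustrative only - the accepted input formats are still an open question (see above)
  from datetime import datetime

  def to_db_timestamp(date_string):
      """Return an ISO timestamp string for a 'YYYY-MM-DD' date, or None if it will not parse."""
      try:
          return datetime.strptime(date_string, '%Y-%m-%d').isoformat()
      except ValueError:
          return None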

7. Logging - only the classes I've directly written use the python logging system - i.e. existing code that is referenced
from this code does not.  This leads to slightly untidy output, as the ordering of print statements differs from that of
the logger.  Ideally the logger would be used throughout the code, but this is unlikely to happen.

8. The 'author' and 'parameter' elements from the moles files are extracted by the PostgresRecord class.  The original code used
a simple xpath search operator (&=) to find the data below a particular parent element.  Where possible, the more exact location
has been identified and used in the new code; however, not all the required info has been available to test - in particular, the
following need to be checked and adjusted if pointing to the incorrect field:

  creators = self.dgMeta.dgMetadataRecord.dgDataEntity.dgDataRoles.dgDataCreator.dgRoleHolder.dgMetadataID.localIdentifier
  authors = self.dgMeta.dgMetadataRecord.dgMetadataDescription.abstract.abstractOnlineReference.dgCitation.authors
  parameters = self.dgMeta.dgMetadataRecord.dgDataEntity.dgDataSummary.dgParameterSummary.dgStdParameterMeasured.dgValidTerm

9. The dif2moles xquery transform seems to lose the end_date info in the Temporal_Coverage elements.

10. A significant gain to the system could potentially be implemented without much effort: add a contact email address
to each datacentre config file; if problems are experienced during ingest, these are then emailed to the datacentre
directly - allowing them the chance to curate the data unprompted - thus removing the onus of this work from the BADC.