
Add new ingest script - to allow ingest of DIF docs from an eXist-hosted
atom feed.  NB, this required a restructure of the original OAI harvester
to allow re-use of shared code - by abstracting this out into a new class,
abstractdocumentingester.

Add new documentation and tidy up the codebase, removing dependencies where possible to simplify things.

Overview
------------

Two ingester scripts now exist to ingest documents from a data centre into the discovery
postgres DB.  The original script, oai_document_ingester, retrieves docs harvested by the
OAI service.  A new script, feeddocumentingester, has been added to harvest documents
directly from an eXist repository running a feed reporting service - which alerts the system
to the publication of new documents to harvest.

The two scripts share some common functionality, so to allow code re-use, this has been
abstracted out into a parent class, AbstractDocumentIngester.  More detail on the two
scripts is provided below, together with their shared functionality - which effectively
covers processing from the point at which a newly harvested document has been discovered.

oai_document_ingester
------------------------------

The oai_document_ingester script ingests documents obtained by the OAI harvester.

Usage: python oai_document_ingester.py [OPTION] <datacentre>
 - where:
   <datacentre> is the data centre to ingest data from; and options are:
   -v - verbose mode for output logging
   -d - debug mode for output logging


This script can be run standalone, by specifying a datacentre to ingest files from, or the wrapper script
run_all_ingest.py can be used; this will parse the contents of the datacentre_config directory and
run the oai_document_ingester against every config file located there.
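
For example, to ingest docs for a single data centre, or for all configured data centres (the
'badc' name here is illustrative - any config file present in datacentre_config can be used, and
the exact arguments accepted by run_all_ingest.py should be checked against the script itself):

  python oai_document_ingester.py -v badc
  python run_all_ingest.py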


feeddocumentingester
------------------------------

The feed ingester polls a feed set up on an eXist DB which hosts records to harvest.  This
feed should be updated when a new record to harvest is created - so that when the ingester
next polls the feed, it will try to retrieve and ingest the new document.

Usage: python feeddocumentingester.py [OPTION] <feed> [interval=..], [ingestFromDate=..]
              [eXistDBHostname=..], [eXistPortNo=..], [dataCentrePoll=..]
 - where:
   <feed> is the atom feed to ingest data from; options are:
   -v - verbose mode for output logging
   -d - debug mode for output logging
 and keywords are:
   interval - interval, in seconds, at which to retrieve data from the feed
   ingestFromDate - date, in the format 'YYYY-MM-DD', from which documents should be ingested - if not set, the ingest date is taken as the current time
   eXistDBHostname - name of the eXist DB to retrieve data from - NB, this will likely be where the feed is based, too - default is 'chinook.badc.rl.ac.uk'
   eXistPortNo - port number used by the eXist DB - defaults to '8080'
   dataCentrePoll - data centre whose documents should be polled for - e.g. 'badc', 'neodc' - if not set, all documents on the feed will be ingested

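An example invocation (the interval, date and data centre values here are purely illustrative):

  python feeddocumentingester.py -v http://chinook.badc.rl.ac.uk:8080/exist/atom/content/db/DIF interval=600 ingestFromDate=2009-01-01 dataCentrePoll=badc
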
NB, the feed URL will typically point at the RESTful interface of the eXist DB which is hosting the feed.
The format for this is:

http://eXistHostAndPort/exist/atom/content/<db-collection-for-feed> - e.g.:

http://chinook.badc.rl.ac.uk:8080/exist/atom/content/db/DIF

The client expects this feed to have entries with a content element whose src attribute features
a valid ndgURL - e.g.:

<feed xmlns="http://www.w3.org/2005/Atom">
    <id>urn:uuid:28a477f6-df52-4bb4-9c30-4ef5e3d5c615</id>
    <updated>2009-01-20T10:33:02+00:00</updated>
    <title>DIF Data</title>
    <link href="#" rel="edit" type="application/atom+xml"/>
    <link href="#" rel="self" type="application/atom+xml"/>
    <entry>
        <id>urn:uuid:fe496d7f-bdd9-409e-b3b1-2af05a2d0f33</id>
        <updated>2009-01-19T17:22:27+00:00</updated>
        <published>2009-01-19T17:22:27+00:00</published>
        <link href="?id=urn:uuid:fe496d7f-bdd9-409e-b3b1-2af05a2d0f33" rel="edit" type="application/atom+xml"/>
        <title>DIF Record [CRYOspheric STudies of Atmospheric Trends (CRYOSTAT)]</title>
        <summary>
        The CRYOspheric STudies of Atmospheric Trends in stratospherically and radiatively important gases (CRYOSTAT) will undertake the first combined measurements of virtually all significant Greenhouse gases (GHGs). GHGs (other than water vapour), ozone-depleting substances (ODSs), and related trace gases in contiguous firn and ice profiles, spanning as much as 200 years, from both the northern and southern polar ice caps. CRYOSTAT is an evolution of the FIRETRACC/100 project, the data from which is also held at BADC.
        </summary>
        <content src="http://badc.nerc.ac.uk:50001/view/badc.nerc.ac.uk__BROWSE-DIF__dataent_cryostat" type="application/atom+xml"/>
    </entry>
</feed>

- NB, the ndgUrl in this case is badc.nerc.ac.uk__BROWSE-DIF__dataent_cryostat
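
A minimal sketch of how the ndgURL can be pulled out of such a feed entry is shown below - this is
illustrative only; the actual parsing code in the ingester may differ:

  from xml.etree import ElementTree as ET

  ATOM_NS = '{http://www.w3.org/2005/Atom}'

  def extract_ndg_urls(feed_xml):
      """Return the trailing path component of each entry's content/@src -
      per the example above, this is the ndgURL of the record."""
      feed = ET.fromstring(feed_xml)
      urls = []
      for entry in feed.findall(ATOM_NS + 'entry'):
          content = entry.find(ATOM_NS + 'content')
          if content is not None and content.get('src'):
              # e.g. badc.nerc.ac.uk__BROWSE-DIF__dataent_cryostat
              urls.append(content.get('src').rstrip('/').rsplit('/', 1)[-1])
      return urls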


Shared functionality
------------------------------

Once a new document to ingest has been discovered (i.e. either via the OAI harvester or
extracted directly from an eXist repository on prompting from a feed update), the following
workflow is completed:

1. document contents are adjusted to avoid namespace issues
2. documents are renamed to include their discovery ids
3. documents are then transformed, using XQueries run via the java saxon library, into
the generic Moles (1.3) format - and then transformed from this into all the other
available document formats (currently DIF, DC, ISO19139) - NB, currently the MDIP
transform does not work - see ToDo (IS THIS STILL TRUE?).
4. all transformed docs are stored in the DB and spatiotemporal data is also
extracted from the moles files and stored.
5. if the script processes all files successfully, the harvest directory is backed up locally and then
cleared.

NB, the difference between the oai and feed ingesters here is that, for the oai ingester,
the above workflow is completed on the whole batch of docs ingested from the OAI harvest, whereas for the feed
ingester, it is completed for each document retrieved from the feed, in turn.
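
A simplified sketch of this shared, per-document workflow is given below; the method names are
illustrative only and are not the actual AbstractDocumentIngester API:

  def ingest_document(self, doc_path):
      """Run one harvested document through the shared ingest workflow."""
      doc_path = self.fix_namespaces(doc_path)            # 1. adjust contents to avoid namespace issues
      doc_path = self.rename_with_discovery_id(doc_path)  # 2. include the discovery id in the filename
      moles_doc = self.transform_to_moles(doc_path)       # 3. XQuery transform to Moles 1.3...
      other_formats = self.transform_to_all_formats(moles_doc)  # ...then to DIF, DC, ISO19139
      self.store_in_db(moles_doc, other_formats)          # 4. store docs plus extracted spatiotemporal data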


Connection details for the postgres database are stored in the ingest.txt file; this should be set to access
permission 0600 and owned by the user running the script - to keep the DB credentials secure.
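
A quick, optional way to verify this before running (not part of the scripts themselves - just a sketch):

  import os, stat

  st = os.stat('ingest.txt')
  if st.st_mode & (stat.S_IRWXG | stat.S_IRWXO):
      raise SystemExit("ingest.txt should be chmod 600 and owned by the user running the script")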

Layout
----------
Dependent python files are stored directly under the ingestAutomation/OAIBatch directory.
Dependent library files (currently only saxon9.jar) are stored in ingestAutomation/OAIBatch/lib.
Config files for the various data centres are stored in the ingestAutomation/OAIBatch/datacentre_config directory.
Files for setting up the datamodel are available in ingestAutomation/database.

When running, the script uses the current working directory as its sandbox.  A
'data' directory is created here and is used to store the various docs, and their transformed
forms, during ingest.
Backup copies of the original harvested files are stored in /disks/glue1/oaiBackup/
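
In summary, the layout looks roughly as follows (the annotations are indicative only):

  ingestAutomation/
      OAIBatch/                  - the ingester scripts and dependent python modules
          lib/                   - dependent library files (saxon9.jar)
          datacentre_config/     - per-data-centre config files
      database/                  - data model / DB setup files
  <working directory>/data/      - per-run sandbox for docs and their transformed forms
  /disks/glue1/oaiBackup/        - backup copies of the original harvested files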


Error Handling - OAI harvester
-----------------

The script will process all files it finds in a harvest directory for a data centre one by one.  If
all files are successfully ingested, the harvest directory is then cleared; if not, it is retained so
that the script can be re-run once problems have been fixed.

If problems are experienced, these will be reported in the output logging; the script can be re-run as
many times as necessary to fix the problems; any files that were already ingested successfully will be skipped.


Updates & Duplicate docs
------------------------------

If a file is re-ingested, a check is done against its discovery id and the contents of the file; if this
is found to be identical to an existing row in the DB, the 'harvest count' for that row is incremented
and further processing is stopped.  If a difference is found in the content, the transformed docs are
all appropriately updated and the spatiotemporal data is recreated from scratch (with the help of cascading
delete triggers in the DB).

NB, any updates to the original doc will cause a trigger to record the original row data in the original_document_history
table - to allow basic audit tracking of changes.
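
The duplicate check described above amounts to something like the following sketch (the table and
column names here are assumptions, not the real schema):

  import hashlib
  import psycopg2

  def is_duplicate(conn, discovery_id, doc_contents):
      """Return True (and bump the harvest count) if an identical doc is already stored.
      doc_contents is the raw file contents as a byte string."""
      digest = hashlib.md5(doc_contents).hexdigest()
      cur = conn.cursor()
      cur.execute("SELECT id FROM original_document WHERE discovery_id = %s AND content_hash = %s",
                  (discovery_id, digest))
      row = cur.fetchone()
      if row:
          cur.execute("UPDATE original_document SET harvest_count = harvest_count + 1 WHERE id = %s",
                      (row[0],))
          conn.commit()
          return True
      return False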


Document Transforms
--------------------------
Docs are transformed using the Saxon java library - running the transform directly from the command line.  The transforms use the
XQueries and XQuery libs provided by the ndg.common.src egg - so this must be installed before the script is used.  NB, the various XQuery
libs that are referenced by the XQueries are extracted locally, and the references in the scripts are adjusted to point to these copies
- to ensure the current libs are used when running the scripts.
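
A hedged sketch of such a command-line invocation is shown below - the file names are placeholders,
the exact Saxon options depend on the saxon9.jar version in use, and the real ingester may build the
command differently:

  import subprocess

  def run_xquery(query_file, input_file, output_file, saxon_jar='lib/saxon9.jar'):
      """Run a single XQuery transform from the command line via Saxon."""
      cmd = ['java', '-cp', saxon_jar, 'net.sf.saxon.Query',
             '-s:' + input_file, '-o:' + output_file, query_file]
      subprocess.check_call(cmd)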


Data Model
---------------

All data is stored in the postgres database - the data model is available under ingestAutomation/database.
The DB can be recreated by running the create_database.sh script - this takes the following parameters:

create_database.sh username database [hostname] [port]

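For example (the username, database name, host and port values here are purely illustrative):

  ./create_database.sh dbuser discovery localhost 5432
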
The DB is Postgres 8.3 - this version (or later) is required for the text search functionality to be
available.  Additionally, PostGIS is installed in the DB, to allow the storage, and searching, of the
required coordinate systems.


To Do
-------------

2. Whilst testing the scripts, it was noted that the various MDIP transforms do not currently work
for the majority of the badc datasets; as such, MDIP format has been
commented out of the PostgresRecord class; this will need to be fixed asap!

3. The system should handle the deletion of files from the DB.  Not sure how this is handled by the harvester - i.e. are all files
always harvested - so we would need to do a check in the DB so that the contents directly match, and then delete any extras - or is there
another way of determining deletions?  Once this is established, use the PostgresDAO.deleteOriginalDocument() method to do the clearing
up.

4. When adding keywords using the keywordAdder class, the descriptionOnlineReference element loses its values
- this appears to be due to the molesReadWrite class not picking up the necessary structure - which appears
to be absent from the ndgmetadata1.3.xsd schema?

5. The molesReadWrite class, used when adding keywords to the moles file, does not correctly handle namespace-qualified names
- leading to these being duplicated in some cases - e.g. the ndgmetadata1.3.xsd schema features elements with the names
'dgMetadataDescriptionType' and 'moles:dgMetadataDescriptionType'
- this then leads to the elements being duplicated when the molesElement.toXML() method is called.  The most obvious
consequence of this is that the AbstractText element is included twice; as such, this particular case has been caught
and worked around in the code - however, a better fix would be to ensure that namespaces are either stripped out or handled
properly.

6. The SpatioTemporalData class is a simple struct to store spatiotemporal data; it would be better if there was
some simple checking of this data (esp. the temporal data), to ensure it meets the input requirements - e.g.
checking dates so they can be turned into valid DB timestamps (not sure what formats we accept, which is why this
isn't implemented).
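
A possible sketch of the sort of date check described here (the accepted input formats listed are
assumptions, not a statement of what the ingester actually receives):

  from datetime import datetime

  ACCEPTED_FORMATS = ('%Y-%m-%d', '%Y-%m-%dT%H:%M:%S', '%d/%m/%Y')

  def to_timestamp(date_string):
      """Return an ISO-style timestamp string suitable for postgres, or None if unparseable."""
      for fmt in ACCEPTED_FORMATS:
          try:
              return datetime.strptime(date_string.strip(), fmt).isoformat(' ')
          except ValueError:
              continue
      return None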

7. Logging - only the classes I've directly written use the python logging system - i.e. existing code that is referenced
from this code does not.  This leads to slightly untidy output, as the ordering of print statements differs from that of
the logger.  Ideally the logger would be used throughout the code, but this is probably unlikely to happen.

8. The 'author' and 'parameter' elements from the moles files are extracted by the PostgresRecord class.  The original code used
a simple xpath search operator (&=) to find the data below a particular parent element.  Where possible, the more exact location
has been identified and used in the new code; however, not all the required info has been available to test - in particular, the
following need to be checked and adjusted if pointing to the incorrect field:

  creators = self.dgMeta.dgMetadataRecord.dgDataEntity.dgDataRoles.dgDataCreator.dgRoleHolder.dgMetadataID.localIdentifier
  authors = self.dgMeta.dgMetadataRecord.dgMetadataDescription.abstract.abstractOnlineReference.dgCitation.authors
  parameters = self.dgMeta.dgMetadataRecord.dgDataEntity.dgDataSummary.dgParameterSummary.dgStdParameterMeasured.dgValidTerm

9. The dif2moles xquery transform seems to lose the end_date info in the Temporal_Coverage elements.

10. A significant gain to the system could potentially be implemented without much effort: add a contact email address
to each datacentre config file; if problems are experienced during ingest, these are then emailed to the datacentre
directly - allowing them the chance to curate the data unprompted - thus removing the onus of this work from the BADC.
