source: TI01-discovery/branches/ingestAutomation-upgrade/OAIBatch/README.txt @ 3972

Revision 3972, 7.9 KB checked in by cbyrom, 13 years ago

Use the short filename in the postgres DB for storing the original
document filename.
Add fix to allow proper handling of scope fields as a ts_vector.
Add TODO comments to highlight areas of concern + update docs.

The oai_document_ingester script is used to ingest documents obtained by the OAI harvester
into a postgres database.  At the point of ingest, documents are adjusted to avoid namespace issues
and are renamed to include their discovery ids.  Following ingest, each document is
transformed, using XQueries run with the java saxon library, to the generic Moles format, and
then transformed from this into all the other available document formats
(currently DIF, DC, ISO19139 - NB currently the MDIP transform does not work - see To Do).
All the transformed docs are stored in the DB, and spatiotemporal data is also extracted from the
moles files and stored.

If the script processes all files successfully, the harvest directory is backed up locally and then
cleared.

The script can be run standalone, by specifying a datacentre to ingest files from, or via the wrapper
script; the wrapper parses the contents of the datacentre_config directory and
runs the oai_document_ingester against every config file located there.

The script can also be run with two options:

        -v - 'verbose' mode - prints out logging of level INFO and above
        -d - 'debug' mode - prints out all logging

NB, the default level is WARNING and above or, if run via the run_all_ingest script, INFO and above.
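
The option handling described above could be sketched in Python roughly as follows.  This is an
illustration only, not the script's actual code: the function name pick_log_level and the use of
argparse are assumptions; only the -v and -d flags and the default levels come from the text.

```python
import argparse
import logging


def pick_log_level(argv, default=logging.WARNING):
    """Map the -v / -d flags onto a python logging level (hypothetical helper)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-v", action="store_true")  # verbose: INFO and above
    parser.add_argument("-d", action="store_true")  # debug: all logging
    # Anything else (e.g. the datacentre name) is left for the caller to handle.
    opts, _remaining = parser.parse_known_args(argv)
    if opts.d:
        return logging.DEBUG
    if opts.v:
        return logging.INFO
    return default


if __name__ == "__main__":
    import sys
    logging.basicConfig(level=pick_log_level(sys.argv[1:]))
```

A wrapper could pass default=logging.INFO to get the higher default level described for the
run_all_ingest script.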

Details for the postgis database are stored in the ingest.txt file; this should be set to access
permission 0600 and owned by the user running the script, so that no other user can read or modify it.
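
A check of this kind could be added before the file is read.  The sketch below is an assumption
about how that might look (the helper name is hypothetical and the ingest scripts are not known
to do this); it applies to POSIX systems only.

```python
import os
import stat


def is_private_to_current_user(path):
    """Return True if path has mode 0600 and is owned by the current user."""
    info = os.stat(path)
    mode = stat.S_IMODE(info.st_mode)  # permission bits only
    return mode == 0o600 and info.st_uid == os.getuid()
```

For example, is_private_to_current_user("ingest.txt") could be tested at start-up, with the
script refusing to continue if it returns False.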

Dependent python files are stored directly under the ingestAutomation/OAIBatch directory.
Dependent library files (currently only saxon9.jar) are stored in ingestAutomation/OAIBatch/lib.
Config files for the various data centres are stored in the ingestAutomation/OAIBatch/datacentre_config directory.
Files for setting up the data model are available in ingestAutomation/database.

When running, the script uses the current working directory as the sandbox to operate from.  A
directory, 'data', is created there and is used to store the various docs in their various
forms during ingest.
Backup copies of the original harvested files are stored in /disks/glue1/oaiBackup/

Error Handling

The script processes all the files it finds in a data centre's harvest directory one by one.  If
all files are successfully ingested, the harvest directory is then cleared; if not, it is retained so
that the script can be rerun once the problems have been fixed.

If problems are experienced, these will be reported in the output logging; the script can be rerun as
many times as necessary to fix the problems; any files that were successfully ingested will be ignored.
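
The behaviour described in this section can be sketched as below.  This is a simplified
illustration under stated assumptions, not the script's actual code: the function and parameter
names are hypothetical, the harvest directory is assumed to be flat, and ingest_one stands in
for whatever ingests a single file and raises an exception on failure.

```python
import logging
import os


def ingest_harvest_dir(harvest_dir, ingest_one):
    """Ingest every file in harvest_dir; clear the directory only if all succeed."""
    failures = []
    for name in sorted(os.listdir(harvest_dir)):
        path = os.path.join(harvest_dir, name)
        try:
            ingest_one(path)
        except Exception as exc:
            # Report the problem and carry on with the remaining files.
            logging.error("Failed to ingest %s: %s", path, exc)
            failures.append(name)
    if not failures:
        # Everything was ingested, so the harvest directory can be emptied.
        for name in os.listdir(harvest_dir):
            os.remove(os.path.join(harvest_dir, name))
    return failures
```

If the returned list is non-empty, the directory is left untouched so the run can be repeated
once the reported problems have been fixed.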

Updates & Duplicate docs

If a file is re-ingested, a check is done against its discovery id and the contents of the file; if these
are found to be identical to an existing row in the DB, the 'harvest count' for that row is incremented
and further processing is stopped.  If a difference is found in the content, the transformed docs are
all updated accordingly and the spatiotemporal data is recreated from scratch (with the help of cascading
delete triggers in the DB).

NB, any update to the original doc will cause a trigger to record the original row data in the original_document_history
table, to allow basic audit tracking of changes.
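
The decision logic above can be illustrated with an in-memory stand-in for the DB.  Everything
here is hypothetical (the function name, the dict layout and the return values); in the real
system the history is recorded by a DB trigger rather than by python code, and the text does not
say whether the harvest count changes on an update, so it is left untouched in this sketch.

```python
def reingest(rows, discovery_id, content):
    """Apply the duplicate/update rules to a dict that stands in for the DB."""
    row = rows.get(discovery_id)
    if row is None:
        rows[discovery_id] = {"content": content, "harvest_count": 1, "history": []}
        return "created"
    if row["content"] == content:
        # Identical doc: bump the harvest count and stop processing.
        row["harvest_count"] += 1
        return "unchanged"
    # Changed doc: keep the old content for audit (a trigger does this in the DB),
    # then store the new content; transforms would be regenerated at this point.
    row["history"].append(row["content"])
    row["content"] = content
    return "updated"
```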

Document Transforms

Docs are transformed using the Saxon java library, running the transform directly from the command line.  The transforms use the
XQueries and XQuery libs provided by the ndgUtils egg, so this must be installed before the script is used.  NB, the various XQuery
libs that are referenced by the XQueries are extracted locally and the references in the scripts adjusted to point to these copies
- to ensure the current libs are used when running the scripts.
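
For illustration, a command-line invocation of Saxon's XQuery processor could be assembled from
python along the following lines.  This is a sketch, not the script's actual call: the helper
names and the default jar location are assumptions, and the -s:, -o: and -q: options are those of
the Saxon 9 net.sf.saxon.Query command line, so they should be checked against the saxon9.jar in
use.

```python
import subprocess


def build_saxon_command(xquery_path, source_path, output_path,
                        saxon_jar="lib/saxon9.jar"):
    """Build the argument list for running one XQuery with Saxon (hypothetical helper)."""
    return [
        "java", "-cp", saxon_jar, "net.sf.saxon.Query",
        "-s:" + source_path,   # source document
        "-o:" + output_path,   # file to write the result to
        "-q:" + xquery_path,   # the XQuery to run
    ]


def run_transform(xquery_path, source_path, output_path):
    """Run the transform, raising CalledProcessError if Saxon exits with an error."""
    subprocess.check_call(build_saxon_command(xquery_path, source_path, output_path))
```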

Data Model

All data is stored in the postgres database - the data model is available under ingestAutomation/database.
The DB can be recreated by running the script - this takes the following parameters:

 username database [hostname] [port]

The DB is Postgres 8.3 - this version (or later) is required for the text search functionality to be
available.  Additionally, PostGIS is installed in the DB, to allow the storage, and searching, of the
required coordinate systems.

To Do

1. There are a number of outstanding dependencies on the ingest scripts; these should either be
removed entirely or the script should be properly packaged as an egg, with references to these dependencies
from other eggs.  The main dependencies stem from the use of the following files:
- these require the following files:

2. Whilst testing the scripts, it was noted that the various MDIP transforms do not currently work
for the majority of the badc datasets; as such, the MDIP format has been
commented out of the PostgresRecord class; this will need to be fixed asap!

3. The system should handle the deletion of files from the DB.  Not sure how this is handled by the harvester - i.e. are all files
always harvested - so we need to do a check in the DB so that the contents directly match - and then delete any extras - or is there
another way of determining deletions?  Once this is established, use the PostgresDAO.deleteOriginalDocument() method to do the clearing

4. When adding keywords, using the keywordAdder class, the descriptionOnlineReference element loses its values
- this appears to be due to the molesReadWrite class not picking up the necessary structure - which appears
to be absent from the ndgmetadata1.3.xsd schema?

5. The molesReadWrite class, used when adding keywords to the moles file, does not correctly handle namespace qualified attributes
- leading to these being duplicated in some cases - e.g. the ndgmetadata1.3.xsd schema features elements with the names,
'dgMetadataDescriptionType' and 'moles:dgMetadataDescriptionType'
- this then leads to the elements being duplicated when the molesElement.toXML() method is called.  The most obvious
consequence of this is that the AbstractText element is included twice; as such, this particular event has been caught
and escaped in the code - however a better fix would be to ensure that namespaces are either stripped out and/or handled

6. The SpatioTemporalData class is a simple struct to store spatiotemporal data; it would be better if there was
some simple checking of this data (esp the temporal data), to ensure it meets the input requirements - e.g.
checking dates so they can be turned into valid DB timestamps (not sure what formats we accept, which is why this
isn't implemented).
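
As a starting point for item 6, a date check could look something like the sketch below.  The
list of accepted formats is purely an assumption made for illustration (the accepted formats are
not known, as noted above), and the function name is hypothetical.

```python
from datetime import datetime

# ASSUMPTION: placeholder formats only; replace with whatever the DB layer accepts.
ASSUMED_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_timestamp(text):
    """Return a datetime if text matches one of the assumed formats, else None."""
    for fmt in ASSUMED_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next format
    return None
```

A value for which parse_timestamp returns None could then be logged and skipped rather than
being passed on to the DB.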

7. Logging - only the classes I've directly written use the python logging system - i.e. existing code that is referenced
from this code does not.  This leads to slightly untidy output, as the ordering of print statements is different to that of
the logger.  Ideally the logger would be used throughout the code, but this is unlikely to happen.

8. The 'author' and 'parameter' elements from the moles files are extracted by the PostgresRecord class.  The original code used
a simple xpath search operator (&=) to find the data below a particular parent element.  Where possible, the more exact location
has been identified and used in the new code; however, not all the required info has been available to test - in particular, the
following need to be checked and adjusted if pointing to the incorrect field:

  creators = self.dgMeta.dgMetadataRecord.dgDataEntity.dgDataRoles.dgDataCreator.dgRoleHolder.dgMetadataID.localIdentifier
  authors = self.dgMeta.dgMetadataRecord.dgMetadataDescription.abstract.abstractOnlineReference.dgCitation.authors
  parameters = self.dgMeta.dgMetadataRecord.dgDataEntity.dgDataSummary.dgParameterSummary.dgStdParameterMeasured.dgValidTerm
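
When checking these paths against real moles files, a small helper that walks a dotted path and
returns None at the first missing element may be useful.  The sketch below assumes, as the paths
above suggest, that child elements are exposed as plain python attributes; resolve_path is a
hypothetical name and is not part of the existing code.

```python
def resolve_path(root, dotted_path):
    """Follow dotted_path from root via getattr; return None if any step is missing."""
    node = root
    for name in dotted_path.split("."):
        node = getattr(node, name, None)
        if node is None:
            return None
    return node
```

For example, resolve_path(dgMeta, "dgMetadataRecord.dgMetadataDescription.abstract") returns None,
rather than raising AttributeError, if a record has no abstract.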