source: TI01-discovery/branches/ingestAutomation-upgrade/OAIBatch/README.txt @ 3868

Subversion URL: http://proj.badc.rl.ac.uk/svn/ndg/TI01-discovery/branches/ingestAutomation-upgrade/OAIBatch/README.txt@3868
Revision 3868, 5.4 KB, checked in by cbyrom, 12 years ago

Add documentation file with description of new ingestion process.

Overview
------------

The oai_document_ingester script is used to ingest documents obtained by the OAI harvester
into a postgres database.  At the point of ingest, documents are adjusted to avoid namespace issues
and are renamed to include their discovery ids.  Following the ingest of a document, it is then
transformed to the generic Moles format, using XQueries run via the java saxon library, and
then transformed from this into all the other available document formats
(currently DIF, DC, ISO19139 - NB, the MDIP transform does not currently work - see ToDo).
All the transformed docs are stored in the DB, and spatiotemporal data is also extracted from the
moles files and stored.
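The two-stage transform chain described above can be pictured roughly as follows.  This is an illustrative sketch only - the function names and string handling are hypothetical stand-ins for the real Saxon-driven XQuery transforms:

```python
# Illustrative sketch of the two-stage transform chain described above.
# 'to_moles' and 'from_moles' are hypothetical stand-ins - the real
# script drives these steps via Saxon XQueries, not Python functions.

def to_moles(source_doc):
    """Stand-in for the source-format -> MOLES XQuery transform."""
    return "<moles>" + source_doc + "</moles>"

def from_moles(moles_doc, target_format):
    """Stand-in for the MOLES -> target-format XQuery transform."""
    return "<%s>%s</%s>" % (target_format, moles_doc, target_format)

def transform_all(source_doc):
    # Step 1: everything is first normalised to the generic MOLES format
    moles = to_moles(source_doc)
    # Step 2: each remaining format is generated from the MOLES doc
    # (MDIP is omitted, as its transform is currently broken - see ToDo)
    targets = ["DIF", "DC", "ISO19139"]
    docs = {"MOLES": moles}
    for t in targets:
        docs[t] = from_moles(moles, t)
    return docs
```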

If the script processes all files successfully, the harvest directory is backed up locally and then
cleared.

The script can be run standalone, by specifying a datacentre to ingest files from, or the wrapper script
run_all_ingest.py can be used; this will parse the contents of the datacentre_config directory and
run the oai_document_ingester against every config file located there.

The script also accepts two options:

        -v - 'verbose' mode - prints out logging of level INFO and above
        -d - 'debug' mode - prints out all logging

NB, the default logging level is WARNING and above or, if run via the run_all_ingest script, INFO and above.
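The verbosity handling above amounts to a mapping from the flags onto stdlib logging levels; a minimal sketch of that behaviour (the script's actual option parsing may differ):

```python
import logging
import optparse  # this codebase predates argparse

def get_log_level(argv, default=logging.WARNING):
    """Map the -v/-d flags described above onto stdlib logging levels.
    A sketch of the documented behaviour, not the script's actual code."""
    parser = optparse.OptionParser()
    parser.add_option("-v", action="store_true", dest="verbose", default=False)
    parser.add_option("-d", action="store_true", dest="debug", default=False)
    opts, _ = parser.parse_args(argv)
    if opts.debug:
        return logging.DEBUG    # 'debug' mode: all logging
    if opts.verbose:
        return logging.INFO     # 'verbose' mode: INFO and above
    return default              # default: WARNING and above
```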

Details for the postgres database are stored in the ingest.txt file; this should be set to access
permission 0600 and owned by the user running the script, to ensure maximum security of the file.

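The 0600/ownership requirement on ingest.txt can be verified programmatically before the credentials are read; a sketch of such a check (whether the real script performs it is not stated above):

```python
import os
import stat

def credentials_file_is_secure(path):
    """Return True if the file has mode 0600 and is owned by the current
    user - the constraints recommended above for ingest.txt (Unix only)."""
    st = os.stat(path)
    mode_ok = stat.S_IMODE(st.st_mode) == 0o600
    owner_ok = st.st_uid == os.getuid()
    return mode_ok and owner_ok
```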
Layout
----------
Dependent python files are stored directly under the ingestAutomation/OAIBatch directory.
Dependent library files (currently only saxon9.jar) are stored in ingestAutomation/OAIBatch/lib.
Config files for the various data centres are stored in the ingestAutomation/OAIBatch/datacentre_config directory.
Files for setting up the datamodel are available in ingestAutomation/database.

When running, the script uses the current working directory as its sandbox.  A
directory, 'data', is created here and used to store the various docs and their different
forms during ingest.
Backup copies of the original harvested files are stored in /disks/glue1/oaiBackup/


Error Handling
-----------------

The script will process all files it finds in a harvest directory for a data centre one by one.  If
all files are successfully ingested, the harvest directory is then cleared; if not, it is retained so
that the script can be re-run once problems have been fixed.

If problems are experienced, these will be reported in the output logging; the script can be re-run as
many times as necessary to fix the problems; any files that were successfully ingested will be ignored.
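The re-run behaviour described above boils down to: skip anything already ingested, and only clear the harvest directory when nothing failed.  A sketch, where `ingest_one` and the already-ingested check are hypothetical stand-ins for the real helpers:

```python
# Sketch of the re-run semantics described above.  'ingest_one' and the
# 'already_ingested' set are stand-ins for the script's real machinery.

def ingest_directory(files, already_ingested, ingest_one):
    """Process each harvested file once; return the list of failures.
    The caller should only clear the harvest directory if this is empty."""
    failures = []
    for name in files:
        if name in already_ingested:
            continue                  # succeeded on a previous run - skip
        try:
            ingest_one(name)
            already_ingested.add(name)
        except Exception as err:
            failures.append((name, str(err)))   # reported via logging
    return failures
```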


Updates & Duplicate docs
------------------------------

If a file is re-ingested, a check is done against its discovery id and the contents of the file; if these
are found to be identical to an existing row in the DB, the 'harvest count' for that row is incremented
and further processing is stopped.  If a difference is found in the content, the transformed docs are
all updated appropriately and the spatiotemporal data is recreated from scratch (with the help of cascading
delete triggers in the DB).
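The duplicate check above can be pictured as a lookup keyed on discovery id, comparing content digests.  The table layout below is a simplified stand-in for the real schema, not the actual data model:

```python
import hashlib

def ingest_or_count(db, discovery_id, content):
    """Sketch of the update/duplicate logic described above.  'db' is a
    simplified stand-in for the original document table: a dict mapping
    discovery id -> {'hash': ..., 'harvest_count': ...}."""
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    row = db.get(discovery_id)
    if row is not None and row["hash"] == digest:
        row["harvest_count"] += 1      # identical doc re-harvested
        return "duplicate"
    # new or changed doc: (re)create the row; in the real DB, cascading
    # delete triggers rebuild the transforms and spatiotemporal data
    db[discovery_id] = {"hash": digest, "harvest_count": 1}
    return "ingested"
```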

NB, any update to the original doc will cause a trigger to record the original row data in the original_document_history
table, to allow basic audit tracking of changes.


Document Transforms
--------------------------
Docs are transformed using the Saxon java library, running the transform direct from the command line.  The transforms use the
XQueries and XQuery libs provided by the ndgUtils egg, so this must be installed before the script is used.  NB, the various XQuery
libs that are referenced by the XQueries are extracted locally, and the references in the scripts adjusted to point to these copies,
to ensure the current libs are used when running the scripts.
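Running a transform "direct from the command line" presumably means shelling out to java with saxon9.jar.  A sketch of the command construction - net.sf.saxon.Query is Saxon's standard XQuery entry point, but the exact argument layout used by the script is an assumption:

```python
import subprocess

SAXON_JAR = "lib/saxon9.jar"

def build_saxon_command(xquery_path, input_doc):
    """Build the java command line for one XQuery transform.  The class
    name is Saxon's standard XQuery driver; the argument layout here is
    an assumption, not taken from the actual ingest script."""
    return [
        "java", "-cp", SAXON_JAR,
        "net.sf.saxon.Query",
        "-s:" + input_doc,       # source document
        "-q:" + xquery_path,     # the (locally extracted) XQuery
    ]

def run_transform(xquery_path, input_doc):
    # check_output raises on a non-zero exit, surfacing transform failures
    return subprocess.check_output(build_saxon_command(xquery_path, input_doc))
```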


Data Model
---------------

All data is stored in the postgres database; the data model is available under ingestAutomation/database.
The DB can be recreated by running the create_database.sh script, which takes the following parameters:

create_database.sh username database [hostname] [port]

The DB is Postgres 8.3 - this version (or later) is required for the text search functionality to be
available.  Additionally, PostGIS is installed in the DB, to allow the storage, and searching, of the
required coordinate systems.


ToDo
-------------

1. There are a number of outstanding dependencies on the ingest scripts; these should either be
removed entirely or the script should be properly packaged as an egg, with references to these dependencies
from other eggs.  The main dependencies stem from the use of the following files:

DIF.py
MDIP.py
molesReadWrite.py

- these require the following files:

AccessControl.py
ETxmlView.py
geoUtilities.py
People.py
renderEntity.py
renderService.py
SchemeNameSpace.py
Utilities.py

2. Whilst testing the scripts, it was noted that the various MDIP transforms do not currently work; as such, the MDIP format has been
commented out of the PostgresRecord class; this will need to be fixed asap!

3. The system should handle the deletion of files from the DB.  It is not clear how this is handled by the harvester - i.e. are all
files always harvested, in which case we need to check the DB so that its contents directly match the harvest, and then delete any
extras - or is there another way of determining deletions?  Once this is established, use the PostgresDAO.deleteOriginalDocument()
method to do the clearing up.
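If the harvester does always deliver the complete document set, the deletion check in item 3 would reduce to a set difference.  A sketch using the method named above - the rest of the DAO interface shown here is assumed, not real:

```python
# Sketch of the deletion handling proposed in item 3, assuming the
# harvester always delivers the full document set.  deleteOriginalDocument
# is the method named above; getAllDocumentIds is a hypothetical helper.

def remove_deleted_documents(harvested_ids, dao):
    """Delete DB rows whose discovery ids no longer appear in the harvest."""
    stale = set(dao.getAllDocumentIds()) - set(harvested_ids)
    for doc_id in sorted(stale):
        dao.deleteOriginalDocument(doc_id)   # cascades clear the transforms
    return sorted(stale)
```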