Timestamp: 20/01/09 16:33:33 (11 years ago)
Author: cbyrom
Message:

Add new ingest script - to allow ingest of DIF docs from an eXist hosted
atom feed. NB, this required a restructure of the original OAI harvester
to allow re-use of shared code, by abstracting this out into a new class,
abstractdocumentingester.

Add new documentation and tidy up the codebase, removing dependencies where possible to simplify things.

File:
1 edited

  • TI01-discovery/branches/ingestAutomation-upgrade/OAIBatch/README.txt

r3998 → r4854

------------

- The oai_document_ingester script is used to ingest documents obtained by the OAI harvester
- into a postgres database.  At the point of ingest, documents are adjusted to avoid namespace issues
- and are renamed to include their discovery ids.  Following the ingest of a document, this is then
- transformed, using XQueries ran using the java saxon library, to the generic Moles format - and
- then transformed from this into all the other available document formats
- (currently DIF, DC, IS19139 - NB currently the MDIP transform does not work - see ToDo).
- All the transformed docs are stored in the DB and spatiotemporal data is also extracted from the
- moles files and stored.
-
- If the script processes all files successfully, the harvest directory is backed up locally and then
- cleared.
-
- The script can be ran standalone, by specifying a datacentre to ingest files from, or the wrapper script
+ Two ingester scripts now exist to ingest documents from a data centre into the discovery
+ postgres DB.  The original script, oai_document_ingester, retrieves docs harvested by the
+ OAI service.  A new script, feeddocumentingester, has been added to harvest documents
+ directly from an eXist repository running a feed reporting service, which alerts the system
+ to the publication of new documents to harvest.
+
+ The two scripts share some common functionality; to allow code re-use, this has been
+ abstracted out into a parent class, AbstractDocumentIngester.  More detail on the two
+ scripts is provided below, together with their shared functionality - which effectively
+ begins at the point at which a newly harvested document has been discovered.
+
+ oai_document_ingester
+ ------------------------------
+
+ The oai_document_ingester script ingests documents obtained by the OAI harvester.
+
+ Usage: python oai_document_ingester.py [OPTION] <datacentre>
+  - where <datacentre> is the data centre to ingest data from, and options are:
+  -v - verbose mode for output logging
+  -d - debug mode for output logging
+
+
+ This script can be run standalone, by specifying a datacentre to ingest files from, or the wrapper script
run_all_ingest.py can be used; this will parse the contents of the datacentre_config directory and
run the oai_document_ingester against every config file located there.
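As an illustration of what the wrapper does, a minimal sketch (the helper names are hypothetical and it assumes each config file is named after its data centre; the real run_all_ingest.py may differ in detail):

```python
import os
import subprocess

def build_ingest_commands(config_dir="datacentre_config"):
    """Build one oai_document_ingester command per config file found in config_dir."""
    commands = []
    for name in sorted(os.listdir(config_dir)):
        # assumption: the config file name (minus extension) doubles as the datacentre name
        datacentre = os.path.splitext(name)[0]
        commands.append(["python", "oai_document_ingester.py", datacentre])
    return commands

def run_all_ingest(config_dir="datacentre_config"):
    """Run the ingester once per discovered config file."""
    for cmd in build_ingest_commands(config_dir):
        subprocess.call(cmd)
```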
- The script can also be ran using two options:
-
-         -v - 'verbose' mode - prints out logging of level INFO and above
-         -d - 'debug' mode - prints out all logging
-
- NB, the default level is WARNING and above or, if ran via the run_all_ingest script, INFO and above.
-
- Details for the postgis database are stored in the ingest.txt file; this should be set to access
+
+ feeddocumentingester
+ ------------------------------
+
+ The feed ingester polls a feed set up on an eXist DB which hosts records to harvest.  This
+ feed should be updated when a new record to harvest is created - and when the ingester
+ polls this feed again, it will try to retrieve and ingest the new document.
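The polling check can be sketched roughly as follows - an illustrative fragment only (the helper name and the use of string comparison on the timestamps are assumptions, not the real feeddocumentingester internals; fetching the feed over HTTP is omitted):

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def new_entry_ids(feed_xml, last_polled):
    """Return the ids of feed entries updated after last_polled.
    ISO-8601 timestamps with a common offset compare correctly as plain
    strings, so no date parsing is needed for this sketch."""
    feed = ET.fromstring(feed_xml)
    ids = []
    for entry in feed.findall(ATOM_NS + "entry"):
        if entry.findtext(ATOM_NS + "updated") > last_polled:
            ids.append(entry.findtext(ATOM_NS + "id"))
    return ids
```

On each poll interval the ingester would call something like this and then retrieve and ingest only the documents behind the new entries.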
+
+ Usage: python feeddocumentingester.py [OPTION] <feed> [interval=..] [ingestFromDate=..]
+               [eXistDBHostname=..] [eXistPortNo=..] [dataCentrePoll=..]
+  - where <feed> is the atom feed to ingest data from; options are:
+  -v - verbose mode for output logging
+  -d - debug mode for output logging
+  and keywords are:
+  interval - interval, in seconds, at which to retrieve data from the feed
+  ingestFromDate - date, in the format 'YYYY-MM-DD', from which documents should be ingested - if not set, the ingest date is taken as the current time
+  eXistDBHostname - name of the eXist DB to retrieve data from - NB, this will likely be where the feed is based too - default is 'chinook.badc.rl.ac.uk'
+  eXistPortNo - port number used by the eXist DB - defaults to '8080'
+  dataCentrePoll - data centre whose documents should be polled for, e.g. 'badc', 'neodc' - if not set, all documents on a feed will be ingested
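The mixed argument style above (flags, bare key=value keywords, and a positional feed URL) could be separated along these lines - a hypothetical simplification, not the script's actual option handling:

```python
def parse_ingester_args(args):
    """Split command line args into flags (-v/-d), keywords (key=value)
    and positional values such as the feed URL."""
    flags, keywords, positional = [], {}, []
    for arg in args:
        if arg.startswith("-"):
            flags.append(arg)
        elif "=" in arg:
            key, value = arg.split("=", 1)  # split once, so values may contain '='
            keywords[key] = value
        else:
            positional.append(arg)
    return flags, keywords, positional
```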
+
+ NB, the feed URL will typically point at the RESTful interface of an eXist DB which is hosting the feed.
+ The format for this is:
+
+ http://eXistHostAndPort/exist/atom/content/<db-collection-for-feed> - e.g.:
+
+ http://chinook.badc.rl.ac.uk:8080/exist/atom/content/db/DIF
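For example, the URL could be assembled from its parts like so (defaults taken from this README; the helper name is illustrative):

```python
def feed_url(host="chinook.badc.rl.ac.uk", port=8080, collection="db/DIF"):
    """Build the eXist Atom feed URL from its parts."""
    return "http://%s:%d/exist/atom/content/%s" % (host, port, collection)
```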
+
+ The client expects this feed to have entries with a content element whose src attribute features
+ a valid ndgURL - e.g.:
+
+ <feed xmlns="http://www.w3.org/2005/Atom">
+     <id>urn:uuid:28a477f6-df52-4bb4-9c30-4ef5e3d5c615</id>
+     <updated>2009-01-20T10:33:02+00:00</updated>
+     <title>DIF Data</title>
+     <link href="#" rel="edit" type="application/atom+xml"/>
+     <link href="#" rel="self" type="application/atom+xml"/>
+     <entry>
+         <id>urn:uuid:fe496d7f-bdd9-409e-b3b1-2af05a2d0f33</id>
+         <updated>2009-01-19T17:22:27+00:00</updated>
+         <published>2009-01-19T17:22:27+00:00</published>
+         <link href="?id=urn:uuid:fe496d7f-bdd9-409e-b3b1-2af05a2d0f33" rel="edit" type="application/atom+xml"/>
+         <title>DIF Record [CRYOspheric STudies of Atmospheric Trends (CRYOSTAT)]</title>
+         <summary>
+         The CRYOspheric STudies of Atmospheric Trends in stratospherically and radiatively important gases (CRYOSTAT) will undertake the first combined measurements of virtually all significant Greenhouse gases (GHGs). GHGs (other than water vapour), ozone-depleting substances (ODSs), and related trace gases in contiguous firn and ice profiles, spanning as much as 200 years, from both the northern and southern polar ice caps. CRYOSTAT is an evolution of the FIRETRACC/100 project, the data from which is also held at BADC.
+         </summary>
+         <content src="http://badc.nerc.ac.uk:50001/view/badc.nerc.ac.uk__BROWSE-DIF__dataent_cryostat" type="application/atom+xml"/>
+     </entry>
+ </feed>
+
+ - NB, the ndgUrl in this case is badc.nerc.ac.uk__BROWSE-DIF__dataent_cryostat
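Extracting that ndgURL (the last path segment of each entry's content/@src) can be done with the standard library - a minimal sketch, assuming the feed structure shown above; the function name is illustrative:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def ndg_urls(feed_xml):
    """Return the ndgURL of each entry: the final path segment of content/@src."""
    feed = ET.fromstring(feed_xml)
    urls = []
    for entry in feed.findall(ATOM_NS + "entry"):
        src = entry.find(ATOM_NS + "content").get("src")
        urls.append(src.rsplit("/", 1)[-1])
    return urls
```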
+
+
+ Shared functionality
+ ------------------------------
+
+ Once a new document to ingest has been discovered (i.e. either via the OAI harvester or
+ extracted directly from an eXist repository on prompting from a feed update), the following
+ workflow is completed:
+
+ 1. document contents are adjusted to avoid namespace issues
+ 2. documents are renamed to include their discovery ids
+ 3. documents are then transformed, using XQueries run via the java saxon library, to
+ the generic Moles (1.3) format, and then transformed from this into all the other
+ available document formats (currently DIF, DC, IS19139 - NB, currently the MDIP
+ transform does not work - see ToDo (IS THIS STILL TRUE?))
+ 4. all transformed docs are stored in the DB, and spatiotemporal data is also
+ extracted from the moles files and stored
+ 5. if the script processes all files successfully, the harvest directory is backed up locally and then
+ cleared
+
+ NB, the difference between the oai and feed ingesters here is that, for the oai ingester,
+ the above workflow is completed on all docs ingested from the oai, whereas for the feed
+ ingester, it is completed for each document retrieved from the feed, in turn.
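That batch-versus-per-document difference might be sketched as follows - illustrative only, with ingest_document standing in for steps 1-4 above (it is a hypothetical placeholder, not a real function in the codebase):

```python
def ingest_document(doc):
    """Stand-in for steps 1-4: fix namespaces, rename, transform, store."""
    return "ingested:" + doc

def oai_ingest(harvested_docs):
    """OAI ingester: the workflow runs over the whole harvested batch; only
    after all docs succeed is the harvest dir backed up and cleared (step 5)."""
    return [ingest_document(d) for d in harvested_docs]

def feed_ingest(poll_feed):
    """Feed ingester: each document is processed in turn as the feed reports it."""
    for doc in poll_feed():
        yield ingest_document(doc)
```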
+
+
+ Details for the postgres database are stored in the ingest.txt file; this should be set to access
permission 0600 and owned by the user running the script - to ensure maximum security of the file.
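A quick way to verify the ingest.txt permissions from Python (an illustrative check, not part of the scripts; it tests the mode bits only, not ownership):

```python
import os
import stat

def ingest_config_is_secure(path):
    """True if the config file is readable/writable by its owner only (mode 0600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600
```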
- Error Handling
+ Error Handling - OAI harvester
-----------------

     
--------------------------
Docs are transformed using the Saxon java library - running the transform direct from the command line.  The transforms use the
- XQueries and XQuery libs provided by the ndgUtils egg - so this must be installed before the script is used.  NB, the various XQuery
+ XQueries and XQuery libs provided by the ndg.common.src egg - so this must be installed before the script is used.  NB, the various XQuery
libs that are referenced by the XQueries are extracted locally and the references in the scripts adjusted to point to these copies,
to ensure the current libs are used when running the scripts.
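The reference adjustment could look something like this - a rough sketch using a simple regex over the XQuery's "at" location hints (the function name, local directory and regex are all assumptions about how the rewrite might be done, not the actual implementation):

```python
import re

def localise_imports(xquery, local_dir="./xqlibs"):
    """Rewrite 'import module ... at "..."' location hints in an XQuery so
    they point at locally extracted copies of the library files."""
    def repl(match):
        lib_name = match.group(1).rsplit("/", 1)[-1]  # keep just the file name
        return 'at "%s/%s"' % (local_dir, lib_name)
    return re.sub(r'at "([^"]+)"', repl, xquery)
```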
     
To Do
-------------
-
- 1. There are a number of outstanding dependencies on the ingest scripts; these should either be
- removed entirely or the script should be properly packaged as an egg, with references to these dependencies
- from other eggs.  The main dependencies stem from the use of the following files:
-
- DIF.py
- MDIP.py
- molesReadWrite.py
-
- - these require the following files:
-
- AccessControl.py
- ETxmlView.py
- geoUtilities.py
- People.py
- renderEntity.py
- renderService.py
- SchemeNameSpace.py
- Utilities.py

2. Whilst testing the scripts, it was noted that the various MDIP transforms do not currently work