Ticket #35 (closed task: fixed)

Opened 13 years ago

Last modified 12 years ago

[M] Automate OAI harvest and ingest

Reported by: lawrence
Owned by: selatham
Priority: required
Milestone: ReFactored_Discovery_WebServices
Component: discovery
Version:
Keywords: DiscoveryAuto MDIP WS-Discovery2
Cc:

Description (last modified by selatham) (diff)

Automate OAI harvest and ingest into the discovery portal, both development and live. Ref. Marta's documentation. Look at the DLESE/OAI documentation, particularly on setting up regular harvests. Put all code into Subversion (may have to archive some old stuff) - done.


Mostly done.

  • Automate postgres ingest. Gone to ticket #567.
  • Work is probably required on how deletions are handled. Gone to ticket #598.
  • Set running for BADC, NEODC (and possibly NCAR, TPAC, WDCC) on dev. Gone to ticket #599.
  • Also notify Data Centres of OAI errors. Gone to ticket #600.

Todo:-

Changes required for re-factored WS:-

  • (DEFUNCT - don't even need to remove namespace) Change oaiClean to only remove namespace - no wrapping.
  • (DONE) Rename original as repositoryid__format__localid (the three parts separated by double underscores).
  • (DONE) Place originals in the /db/discovery/original/format/'repositoryid' collection.
  • (DONE) Use new minimum Moles creator. Pass it format and collection name (/db/discovery/original/format/'repositoryid').
  • (DONE) Put moles in new eXist collection.
  • (DONE) Get a moles organisation record from each DP. Gone to ticket #605.
  • Optionally put group information (e.g. MDIP, NERC) into dgStructuredKeyword. This should be done at the Data Providers' end, but it is useful to have a general keywordAdder module. Gone to #669.
  • Parse/correct date formats (see Python time module). Gone to #668.
  • Parse localID for characters which need escaping (eXist rejects them). Gone to #667.
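
The naming and placement rules in the list above can be sketched in Python as follows. This is a minimal sketch, not the harvest script itself: the function names and the example identifiers are hypothetical, and it assumes the parts are joined in the order repository id, format, local id.

```python
def original_name(repository_id, fmt, local_id):
    """Build the file name for a harvested original.

    Assumption: the three parts are joined with double underscores
    in the order repository id, format, local id.
    """
    return "__".join([repository_id, fmt, local_id])


def original_collection(fmt, repository_id):
    """Return the eXist collection path for originals of one format and repository."""
    return "/db/discovery/original/%s/%s" % (fmt, repository_id)


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    print(original_name("badc.nerc.ac.uk", "DIF", "dataent_001"))
    print(original_collection("DIF", "badc.nerc.ac.uk"))
```

The local id is passed through unchanged here; any escaping of characters that eXist rejects would need to happen before the name is built.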

Change History

comment:1 Changed 13 years ago by lawrence

  • Type changed from defect to task

comment:2 Changed 13 years ago by lawrence

  • Keywords M06March24 added

comment:3 Changed 13 years ago by selatham

  • Status changed from new to assigned

comment:4 Changed 13 years ago by selatham

Backups of databases and OAI originals completed and running on dev. OAI harvest and eXist ingest complete and running for PML on the development portal. To do:

  • automate postgres ingest.
  • set running for BODC, BADC, NEODC, NOCS (and possibly NCAR, TPAC) on dev and live (if it looks OK).
  • work probably required on how deletions are handled.

comment:5 Changed 13 years ago by selatham

  • Priority changed from critical to required

This has become less critical.

comment:6 Changed 13 years ago by selatham

  • Keywords M06April19 added

comment:7 Changed 13 years ago by spascoe

Also notify the DCs (Data Centres) of OAI errors.

comment:8 Changed 13 years ago by selatham

  • Description modified (diff)

comment:9 Changed 13 years ago by selatham

  • Keywords M06March24 M06April19 removed

comment:10 Changed 13 years ago by selatham

Had to substantially change the script to cope with large numbers of files (ticket #318).

comment:11 Changed 13 years ago by selatham

There is a permissions glitch when deleting old records in OAI. The records are owned by user tomcat, but the script runs as user badc, which cannot delete tomcat's files. Run the script as tomcat? That might cause problems with the eXist ingest or the Java pre-processing. Investigate.
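
One way to investigate is to check, before the delete step, which old records the script's user would fail to remove. The following is a minimal sketch under assumptions: the function names are hypothetical, and it relies on the POSIX rule that removing a file needs write and search permission on the containing directory (it ignores the sticky bit and ACLs).

```python
import os


def can_delete(path):
    """Return True if the current user may remove `path`.

    On POSIX systems, removing a file needs write and search permission
    on the directory that contains it. Sticky bit and ACLs are ignored.
    """
    parent = os.path.dirname(os.path.abspath(path))
    return os.access(parent, os.W_OK | os.X_OK)


def undeletable(paths):
    """Return the paths that the current user would fail to delete."""
    return [p for p in paths if not can_delete(p)]
```

Running `undeletable()` over the old records as user badc would list the files that need either a change of ownership or a different account for the clean-up step.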

comment:12 Changed 13 years ago by selatham

Set up NOCS auto-harvest and ingest cron on glue. Seems to be working.

comment:13 Changed 13 years ago by selatham

  • Milestone changed from Replace Metadata Gateway to ReFactored Discovery WebServices

comment:14 Changed 13 years ago by selatham

  • Summary changed from Automate OAI harvest and ingest to [M] Automate OAI harvest and ingest

comment:15 Changed 13 years ago by selatham

  • Description modified (diff)

Substantial changes for:-

  • re-deployment of tomcat, java, exist.
  • switch to oaiClean.py instead of Java code for cleaning out namespaces, and additionally add a keywords wrapper to every document.
  • cope with multiple incoming formats.
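
As an illustration of the cleaning step described above, the sketch below strips namespaces from element tags and wraps the document in an outer element. It is not the actual oaiClean.py: the function names and the wrapper tag name are hypothetical, and namespaced attributes are left untouched.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Remove the '{namespace}' prefix from every element tag, in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root


def wrap(root, wrapper_tag="keywordsWrapper"):
    """Return a new element (hypothetical tag name) containing `root`."""
    wrapper = ET.Element(wrapper_tag)
    wrapper.append(root)
    return wrapper


if __name__ == "__main__":
    doc = ET.fromstring('<record xmlns="urn:example"><title>t</title></record>')
    cleaned = wrap(strip_namespaces(doc))
    print(ET.tostring(cleaned, encoding="unicode"))
```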

comment:16 Changed 13 years ago by selatham

  • Description modified (diff)

comment:17 Changed 12 years ago by selatham

  • Keywords MDIP WS-Discovery2 added

Changes required for re-factored WS:-

  • Change oaiClean to only remove namespace - no wrapping.
  • Use new minimum Moles creator. Pass it groups (e.g. MDIP, NERC) and format.
  • Rename original and produced MOLES files to reflect an identifier.
  • Put moles in new exist collection.

comment:18 Changed 12 years ago by selatham

  • Description modified (diff)

comment:19 Changed 12 years ago by selatham

  • Description modified (diff)

Actually, namespaces don't need to be removed if using the latest eXist, so Kev has advised leaving them in.

comment:20 Changed 12 years ago by selatham

  • Description modified (diff)

comment:21 Changed 12 years ago by selatham

I have now actioned the naming and ingesting of originals into the exist db on glue. I'm retaining namespaces as per Kev's advice.

This is only for those DPs currently being regularly harvested via OAI: bodc.nerc.ac.uk, neodc.nerc.ac.uk, npm.ac.uk (PML) and noc.soton.ac.uk.

NOTE - some of the records still have the old go-essp schema declaration. NEODC records have no declaration. I'm hoping the DPs will act on my e-mail instruction about this.

comment:22 Changed 12 years ago by selatham

  • Description modified (diff)

comment:23 Changed 12 years ago by selatham

  • Description modified (diff)

comment:24 Changed 12 years ago by selatham

  • Status changed from assigned to closed
  • Resolution set to fixed
  • Description modified (diff)