Version 29 (modified by mpritcha, 10 years ago)

NERC DataGrid Discovery Web Service : MEDIN Improvements November 2009

The NERC DataGrid (NDG) Discovery Web Service provides a search interface to metadata records harvested from collaborating data providers and is the backend server to which the NERC Data Discovery Service is a client.

Introduction

The Discovery Web Service is a presentation-free web service which acts as a search engine on top of the NDG Discovery metadata catalogue. This catalogue is dynamically populated by the harvesting of metadata (by the CEDA group at RAL) from a number of collaborating data providers, who make their metadata available in one of a number of supported formats.

The search capability provided by the service enables full-text and spatio-temporal searches of catalogued metadata records and returns search results in a defined XML structure, enabling search clients to be constructed by interested parties for their own purposes. The NERC Data Discovery Service is one such client, as is the Environmental Data Portal, among other examples.

Scope

This documentation covers functionality proposed for the improvements commissioned by MEDIN.

Connectivity

Consumers may access the discovery service via SOAP. Client implementations should be generated from the WSDL at the following URIs:

(URIs tbc pending implementation of proposed improvements)

XML Data Types

The XML documents used as request and response messages for each of the service operations (methods) are defined in the <xs:schema> section of the WSDL document. The structure of each of these messages is discussed as part of the operation/method descriptions below. Automatically generated documentation for the schema, showing the full structure, is available for download.

Discovery Service Operations

The discovery service implements 4 operations, namely:

  • getListNames
  • getList
  • doSearch
  • doPresent

getListNames operation

The discovery web service relies on several lists of valid terms which are specific to the functionality of this service. These lists are exposed through two "helper" operations, getListNames and getList, rather than being encoded as <xs:enumeration> values in the schema part of the WSDL, so that future modifications to the service need not necessarily require modification of the WSDL (which can be inconvenient for clients already developed around a particular release of the WSDL). The getListNames operation simply returns the names of these lists, which can then be used in a subsequent call to the getList operation.

The WSDL document defines the getListNamesRequest message as an empty <getListNames> element, so the request message should look like this (omitting the SOAP Envelope & Body parent elements):

<m:getListNames xmlns:m="urn:DiscoveryServiceAPI"/>
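For clients that assemble messages by hand rather than from WSDL-generated stubs, the omitted wrapper could be sketched as follows. This is an illustration only: it assumes a SOAP 1.1 envelope namespace (this page does not state the SOAP version), and the helper name wrap_in_soap is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: SOAP 1.1 envelope namespace; check the WSDL binding for the real version.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
API_NS = "urn:DiscoveryServiceAPI"  # namespace used by the messages on this page

# Choose readable prefixes for serialisation (purely cosmetic).
ET.register_namespace("soapenv", SOAP_NS)
ET.register_namespace("m", API_NS)


def wrap_in_soap(payload: ET.Element) -> str:
    """Place a request element inside <Envelope><Body> and serialise it."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.append(payload)
    return ET.tostring(envelope, encoding="unicode")


# The empty <getListNames> request element described above.
request = wrap_in_soap(ET.Element(f"{{{API_NS}}}getListNames"))
print(request)
```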

The getListNamesResponse message comprises a <getListNamesReturn> element, with child elements containing the names of the lists available for inspection:

<getListNamesReturn xmlns="urn:DiscoveryServiceAPI">
	<listNames>
		<listName>presentFormatList</listName>
		<listName>orderByFieldList</listName>
		<listName>scopeList</listName>
		<listName>spatialOperatorList</listName>
		<listName>termTargetList</listName>
		<listName>spatialReferenceSystemList</listName>
		<listName>dateRangeTargetList</listName>
		<listName>temporalOperatorList</listName>
		<listName>metadataFormatList</listName>
		<listName>recordDetailList</listName>
	</listNames>
</getListNamesReturn>

getList operation

The contents of each of the lists named by the getListNames operation are accessible by invoking a call to the getList operation, with the name of the list as the single argument, encoded as a getListRequest message, as defined in the WSDL :

Request:

<getList xmlns="urn:DiscoveryServiceAPI">
    <listName>presentFormatList</listName>
</getList>

Response:

<getListReturn xmlns="urn:DiscoveryServiceAPI">
    <list name="presentFormatList">
        <listMember>DC</listMember>
        <listMember>DIF_v9.4</listMember>
        <listMember>MEDIN_v2.3</listMember>
    </list>
</getListReturn>
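A client might build this request and read the response along the following lines. This is a sketch using only the Python standard library; the function names are illustrative, and the SOAP transport is deliberately left out.

```python
import xml.etree.ElementTree as ET

API_NS = "urn:DiscoveryServiceAPI"  # namespace used by the messages on this page
ET.register_namespace("", API_NS)   # serialise with a default namespace, as in the examples


def build_get_list_request(list_name: str) -> str:
    """Serialise a <getList> request naming the list to retrieve."""
    root = ET.Element(f"{{{API_NS}}}getList")
    ET.SubElement(root, f"{{{API_NS}}}listName").text = list_name
    return ET.tostring(root, encoding="unicode")


def parse_list_members(response_xml: str) -> dict:
    """Map each <list name="..."> in a getListReturn message to its members."""
    root = ET.fromstring(response_xml)
    result = {}
    for lst in root.findall(f"{{{API_NS}}}list"):
        members = lst.findall(f"{{{API_NS}}}listMember")
        result[lst.get("name")] = [m.text.strip() for m in members if m.text]
    return result


SAMPLE_RESPONSE = """<getListReturn xmlns="urn:DiscoveryServiceAPI">
    <list name="presentFormatList">
        <listMember>DC</listMember>
        <listMember>DIF_v9.4</listMember>
        <listMember>MEDIN_v2.3</listMember>
    </list>
</getListReturn>"""

print(build_get_list_request("presentFormatList"))
print(parse_list_members(SAMPLE_RESPONSE))
```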

An explanation of the presentFormatList list is given later, in the context of the doPresent operation.

doSearch operation

The doSearch operation performs a search against the NDG discovery database. Queries to this database are formulated from the doSearchRequest message, and forwarded to the database via private methods (i.e. the consumer of the web service is not able directly to interact with the database).

Although outside the scope of the Discovery web service itself, it is worth explaining the structure of the NDG Discovery database which is searched by the service. This is populated from records harvested via OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) from collaborating data providers. Records are harvested in the ISO19115-compliant metadata format agreed by the NERC Metadata Standards Working Group (MEDIN ISO v2.3.1), with legacy support for the previous format (DIF v9.4). They are tagged at ingest time with one or more "scope" keywords (listed in the scopeList list available from the getList operation). These enable the search to be restricted to particular communities, namely NERC, NERC_DDC (Designated Data Centres) and MEDIN (Marine Environmental Data Information Network). Limited quality control is also applied to records at ingest time, but it is the responsibility of the data provider to ensure that metadata records are of sufficient quality to be visible in the system.

The doSearchRequest message is defined in schema form in the WSDL; its component elements are described in the subsections below.

Choice of search criteria: <termSearch>, <spatialSearch> and <temporalSearch>

The <searchCriteria> element acts as a container enabling the selection of one or more of <termSearch>, <spatialSearch> and <temporalSearch>. At least one of these three types of search must be supplied. At most one <spatialSearch> and one <temporalSearch> may be included; multiple <termSearch> elements are permitted and are combined as described under termSearch. If more than one type of search is specified, the resulting search combines the components in "AND" combination (i.e. metadata records should match the term search AND the spatial search AND the temporal search criteria).

termSearch

TermSearch is a full-text search invoked on a specific target field in the discovery database. Child elements <term> and <termTarget> should be populated as follows:

  • <term> : text term to search for. Whitespace separates component words, which are searched in "OR" combination unless the "+" symbol is used between them, in which case the words joined in this way are searched in "AND" combination.
  • <termTarget> : target field name taken from the termTargetList list of valid term targets.

If multiple <termSearch> elements are present (e.g. to search different <termTarget> fields), these are interpreted as successive term searches to be combined in "AND" combination. For example:

  <termSearch>
    <term>snow + rain</term>
    <termTarget>abstract</termTarget>
  </termSearch>
  <termSearch>
    <term>lawrence</term>
    <termTarget>author</termTarget>
  </termSearch>

means

  Search for records where:
    abstract contains "snow" AND "rain"
AND
    author contains "lawrence"

If we were to extend the example by adding a further termSearch also targeted at the abstract, this would be combined in "OR" combination with the first termSearch for that target (abstract), i.e.

  <termSearch>
    <term>snow + rain</term>
    <termTarget>abstract</termTarget>
  </termSearch>
  <termSearch>
    <term>lawrence</term>
    <termTarget>author</termTarget>
  </termSearch>
  <termSearch>
    <term>hail</term>
    <termTarget>abstract</termTarget>
  </termSearch>

This would be interpreted to mean:

  Search for records where:
    abstract contains ("snow" AND "rain") OR "hail"
AND
    author contains "lawrence"
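The combination rules above can be expressed as a small in-memory matcher. This is a sketch of the described semantics only, not the service's actual query builder: records are plain dictionaries, "contains" is assumed to mean a case-insensitive substring match, and the function names are hypothetical.

```python
from collections import defaultdict


def parse_term(term: str) -> list:
    """Split a <term> into OR-ed groups of AND-ed words.

    Whitespace separates words (OR); a "+" between two words joins them (AND).
    """
    tokens = term.replace("+", " + ").split()
    groups = []
    join_next = False
    for tok in tokens:
        if tok == "+":
            join_next = True
        elif join_next and groups:
            groups[-1].append(tok)  # AND with the previous word
            join_next = False
        else:
            groups.append([tok])    # start a new OR alternative
            join_next = False
    return groups


def matches(record: dict, term_searches: list) -> bool:
    """term_searches is a list of (term, termTarget) pairs."""
    by_target = defaultdict(list)
    for term, target in term_searches:
        # Searches sharing a target are OR-ed together.
        by_target[target].extend(parse_term(term))
    # AND across targets, OR across groups for one target, AND inside a group.
    return all(
        any(all(w.lower() in record.get(target, "").lower() for w in group)
            for group in groups)
        for target, groups in by_target.items()
    )


searches = [("snow + rain", "abstract"), ("lawrence", "author"), ("hail", "abstract")]
record = {"abstract": "Hail storms over the Alps", "author": "B. Lawrence"}
print(matches(record, searches))
```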

Spatial searching : <spatialOperator> and <boundingBox>

The search may incorporate a spatial query to restrict results to those metadata records having spatial coverage(s) matching the search criteria defined by the <boundingBox> elements <limitNorth>, <limitSouth>, <limitEast> and <limitWest>. An optional element <spatialReferenceSystem> may be populated with an entry from the spatialReferenceSystemList to specify an alternative spatial reference system (SRS) for the bounding box coordinates. (Note: this feature is included for future development, e.g. the ability to supply spatial search coordinates in British National Grid coordinates. Currently the only supported SRS is EPSG:4326 (WGS84 lat/lon), and this will remain the default if no SRS is specified.)

When using SRS EPSG:4326 (default), values for <limitNorth>, <limitSouth>, <limitEast> and <limitWest> should be given in decimal degrees latitude and longitude. <limitNorth> and <limitSouth> must be in the range -90.0 to +90.0, with <limitNorth> greater than <limitSouth>. <limitWest> and <limitEast> must be in the range -180.0 to 180.0 and <limitEast> should be greater than <limitWest>. Bounding boxes that span the -180 degree meridian, or the poles, are not supported.
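A client may wish to check these constraints before submitting a search. The following is a minimal sketch for the default EPSG:4326 case; the function name is hypothetical, and treating equal limits as invalid is an assumption, since the text only says "greater than".

```python
def validate_bounding_box(north: float, south: float, east: float, west: float) -> list:
    """Return a list of problems with an EPSG:4326 bounding box (empty if valid)."""
    errors = []
    for name, value in (("limitNorth", north), ("limitSouth", south)):
        if not -90.0 <= value <= 90.0:
            errors.append(f"{name} must be between -90.0 and 90.0")
    for name, value in (("limitEast", east), ("limitWest", west)):
        if not -180.0 <= value <= 180.0:
            errors.append(f"{name} must be between -180.0 and 180.0")
    # Assumption: a zero-height or zero-width box is rejected.
    if north <= south:
        errors.append("limitNorth must be greater than limitSouth")
    # A box crossing the 180 degree meridian would have east < west and is not supported.
    if east <= west:
        errors.append("limitEast must be greater than limitWest")
    return errors


print(validate_bounding_box(60.0, 50.0, 2.0, -10.0))   # valid box: no problems reported
print(validate_bounding_box(50.0, 60.0, -170.0, 170.0))  # inverted latitudes, crosses 180
```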

An optional <spatialOperator> may be included, populated with a term from the spatialOperatorList, defining the comparison to be applied to spatial coverage(s) related to metadata records. The default value is "overlaps". Note that in the discovery index database, metadata records may contain several spatial coverages, so a match can occur if any of the spatial coverages related to the metadata item match the criteria specified in the spatial search.

Temporal searching : <DateRange>

The search may incorporate a temporal query to restrict results to those metadata records having temporal coverage(s) matching the search criteria specified within <DateRange>. One or two <date> elements may be specified, to represent either a single date or a date range. Each <date> element must contain a <DateValue> element populated in the form YYYY-MM-DD, and optionally a <temporalOperator> element, populated with a value from temporalOperatorList, defining the semantic meaning of this date criterion in the search. In addition, an optional <dateRangeTarget> element may be included, populated with a value from the dateRangeTargetList, to enable searching of dates other than the default of "temporal coverage of data". Examples are shown below:

<DateRange>
  <date>
    <DateValue>2001-01-01</DateValue>
  </date>
  <date>
    <DateValue>2002-02-03</DateValue>
  </date>
</DateRange>

means

Find metadata records where the temporal coverage(s) of the data overlaps the date range 2001-01-01 to 2002-02-03 inclusive.

<DateRange>
  <date>
    <DateValue>2001-01-01</DateValue>
    <temporalOperator>before</temporalOperator>
  </date>
  <date>
    <DateValue>2002-02-03</DateValue>
    <temporalOperator>after</temporalOperator>
  </date>
</DateRange>

means

Find metadata records where the temporal coverage(s) of the data is outside of the date range 2001-01-01 to 2002-02-03 inclusive.

<DateRange>
  <date>
    <DateValue>2001-01-01</DateValue>
    <temporalOperator>onOrBefore</temporalOperator>
  </date>
</DateRange>

means

Find metadata records where the temporal coverage(s) of the data is on or before the date 2001-01-01.

<DateRange>
  <date>
    <DateValue>2001-01-01</DateValue>
    <temporalOperator>onOrBefore</temporalOperator>
  </date>
  <dateRangeTarget>lastRevisionDate</dateRangeTarget>
</DateRange>

means

Find metadata records where the last revision date of the data is on or before the date 2001-01-01.
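A client could assemble and sanity-check such a fragment as sketched below. The element capitalisation follows the examples on this page (which may yet change), the operator names are taken from the temporalOperatorList given later on this page, the namespace is omitted as in the examples, and build_date_range is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Values copied from the temporalOperatorList section of this page.
TEMPORAL_OPERATORS = {"equals", "doesNotEqual", "onOrBefore", "onOrAfter", "before", "after"}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def build_date_range(criteria, target=None) -> ET.Element:
    """Build a <DateRange> from one or two (date_value, operator_or_None) pairs."""
    if not 1 <= len(criteria) <= 2:
        raise ValueError("one or two <date> elements are required")
    root = ET.Element("DateRange")
    for value, operator in criteria:
        if not DATE_PATTERN.fullmatch(value):
            raise ValueError(f"{value!r} is not in the form YYYY-MM-DD")
        date.fromisoformat(value)  # rejects impossible dates such as 2001-02-30
        if operator is not None and operator not in TEMPORAL_OPERATORS:
            raise ValueError(f"unknown temporalOperator {operator!r}")
        date_el = ET.SubElement(root, "date")
        ET.SubElement(date_el, "DateValue").text = value
        if operator is not None:
            ET.SubElement(date_el, "temporalOperator").text = operator
    if target is not None:
        ET.SubElement(root, "dateRangeTarget").text = target
    return root


fragment = build_date_range([("2001-01-01", "onOrBefore")], target="lastRevisionDate")
print(ET.tostring(fragment, encoding="unicode"))
```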

Looking at these examples we might need to tidy up the capitalisation of elements in the schema & WSDL.

Paging : <start> and <howMany>

The optional elements <start> and <howMany> control which records from the result set should be returned (although the total number of hits is always returned as a number to aid with paging in clients). If <start> is omitted, the default value used is 1 (i.e. the first record). If <howMany> is omitted, all records are returned.
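For a paged results display, the values of <start> and <howMany> for a given page can be derived from the total hit count as sketched below. The function is hypothetical, and capping <howMany> at the number of remaining hits is a defensive assumption rather than a documented requirement of the service.

```python
import math
from typing import Optional, Tuple


def page_window(page: int, page_size: int, total_hits: int) -> Optional[Tuple[int, int]]:
    """Return (start, howMany) for a 1-based page number, or None if past the end."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size + 1  # <start> is 1-based: the first record is 1
    if start > total_hits:
        return None
    how_many = min(page_size, total_hits - start + 1)
    return start, how_many


total_hits = 25  # as reported in <hits> by a previous doSearch call
pages = math.ceil(total_hits / 10)
for page in range(1, pages + 1):
    print(page, page_window(page, 10, total_hits))
```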

Ordering: <orderBy> and <orderByDirection>

Ordering of the result set can optionally be requested by providing an <orderBy> element containing one or more <orderByField>s. Available fields for use as an <orderByField> are listed in the orderByFieldList. Each <orderByField> may have an associated <orderByDirection> specifying the direction of ordering (ascending or descending). If omitted, the default direction is descending.

Scope of search: <scope>

The optional <scope> element can be used to restrict the search to one or more of the supported NDG Data Provider Groups, defined in the NDG controlled vocabulary http://vocab.ndg.nerc.ac.uk/list/N010/0. The currently supported values from this vocabulary are given in the scopeList list accessible via the getList operation. If <scope> is omitted, the search is not restricted in this way.

Format

The optional <format> element can be used to restrict the search to records in the discovery index whose original (harvested) representation was in the format specified. The format must be one of the values listed in the metadataFormatList available from the getList operation. If this element is omitted, the default behaviour is not to restrict the search in this way (and to return results irrespective of their harvested format).

RecordDetail

The optional <recordDetail> element enables selection of the level of detail included in each result returned in the search result. Values must be one of those available from the recordDetailList. The default is documentId, which simply returns the id of the document corresponding to the result. See the Search results section for further explanation of the structures returned in each of these cases.

Search results

The doSearchResponse message is defined in the WSDL.

The <doSearchReturn> element contains the following top-level elements:

  • status : true if successful AND number of hits > 0, false otherwise (designed so that a client need only proceed to parse the rest of the message if results were successfully returned).
  • statusMessage : Textual information regarding success / failure / errors.
  • resultId : Reserved for future use.
  • hits : Total number of hits returned.
  • documents : Parent element for <documentId>, <documentBrief> OR <documentSummary> elements (as per choice in search request) containing returned results.

Content below here not updated yet


A result where no hits were returned is shown below:

<doSearchReturn xmlns="urn:DiscoveryServiceAPI">
	<status>false</status>
	<statusMessage>Search was successful but generated no results.</statusMessage>
	<resultId>0</resultId>
	<hits>0</hits>
</doSearchReturn>

A result where 2 hits were returned, with the recordDetail set to documentId, is shown below:

<doSearchReturn xmlns="urn:DiscoveryServiceAPI">
	<status>true</status>
	<statusMessage>Search was successful.</statusMessage>
	<resultId>0</resultId>
	<hits>2</hits>
	<documents>
		<documentId>idForResult1GoesHere</documentId>
		<documentId>idForResult2GoesHere</documentId>
	</documents>
</doSearchReturn>
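Because <status> is false whenever there are no hits, a client can use it as a gate before reading the rest of the message. The following is a sketch of that pattern for the documentId level of detail, using only the Python standard library; the function name and the shape of the returned dictionary are illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"d": "urn:DiscoveryServiceAPI"}  # namespace used by the messages on this page


def parse_search_result(response_xml: str) -> dict:
    """Extract status, message, hit count and document ids from <doSearchReturn>."""
    root = ET.fromstring(response_xml)
    status = root.findtext("d:status", default="false", namespaces=NS).strip() == "true"
    result = {
        "status": status,
        "message": root.findtext("d:statusMessage", default="", namespaces=NS).strip(),
        "hits": int(root.findtext("d:hits", default="0", namespaces=NS)),
        "ids": [],
    }
    if status:  # only parse the payload when results were actually returned
        result["ids"] = [el.text.strip()
                         for el in root.findall("d:documents/d:documentId", NS)
                         if el.text]
    return result


SAMPLE = """<doSearchReturn xmlns="urn:DiscoveryServiceAPI">
    <status>true</status>
    <statusMessage>Search was successful.</statusMessage>
    <resultId>0</resultId>
    <hits>2</hits>
    <documents>
        <documentId>idForResult1GoesHere</documentId>
        <documentId>idForResult2GoesHere</documentId>
    </documents>
</doSearchReturn>"""

print(parse_search_result(SAMPLE))
```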

If <documentBrief> is specified as the recordDetail, a <documentBrief> element is returned for each result, as outlined in the doSearchResponse message, above. This contains the <documentId> element, holding the id of the document, accompanied by the additional element <title>, containing the title from the metadata record, and a set of <orderedField> elements corresponding to the <orderByField>s used in the search request. In other words, the requested ordering fields are returned alongside the results so that a client can display the content of those fields which contributed to the resulting record ordering. The purpose of this <documentBrief> detail option is to enable clients to render a results list directly from the search response, without immediately having to invoke the doPresent operation to retrieve additional detail.

Similarly, if <documentSummary> is specified as the recordDetail, a <documentSummary> element is returned for each result, as outlined in the doSearchResponse message, above. In addition to the content added by the <documentBrief> option, <documentSummary> includes the metadata abstract, and temporal and spatial information. For the temporal and spatial components of this <documentSummary> the schema reuses the structures used for the search request. The optional temporalOperator and spatialOperator elements are therefore redundant in (and will be omitted from) the return context. The dateRangeTarget element, however, is useful as a contextual reminder of what the returned date pertains to (temporal coverage of data, last revision date of data, or ingestion date of metadata, etc.).

A corresponding <documentFull> structure is used in the doPresent operation as the structure in which the document payload is returned.

doPresent Operation

The doPresent operation provides a means of retrieving (presenting) one or more XML documents from the database. The doPresentRequest message is defined as follows:

[Image: doPresentSchema.png]

One or more <document> elements should each contain the name of a document (in the form returned in the doSearchReturn message) to be retrieved. The optional <format> element should be populated with one of the supported format names as listed by the presentFormatList accessible via the getList operation. All documents returned by a single invocation of the doPresent operation are returned in the same format, i.e. the choice of presentFormat applies to the doPresent request as a whole and not to individual documents. Currently supported formats are:

  • original : Documents are returned unaltered, in the format in which they were harvested (via OAI-PMH) from the data provider.
  • DC : Dublin Core format
  • DIF : GCMD DIF format (version ??)
  • MDIP : Metadata format used by the Marine Data and Information Partnership
  • ISO19115 : ISO19115 (Geographic Information: Metadata) encoded as ISO19139 XML

For all formats except original, the following steps are taken prior to returning the document:

  • Check if the document exists in the discovery database in the requested format, and if so, return it unaltered. Otherwise:
  • Apply a conversion XQuery to create a new document in that format on-the-fly
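The lookup-then-convert behaviour can be sketched as follows. This illustrates the control flow only and is not the service's implementation: the in-memory dictionary stands in for the discovery database, the convert callable stands in for the conversion XQuery, and the assumption that conversion starts from the originally harvested document is not stated on this page.

```python
from typing import Callable, Dict


def present(doc_id: str,
            fmt: str,
            stored: Dict[str, Dict[str, str]],
            convert: Callable[[str, str], str]) -> str:
    """Return the document in the requested format, converting only if needed."""
    versions = stored[doc_id]          # raises KeyError for an unknown document
    if fmt == "original":
        return versions["original"]    # harvested form, returned unaltered
    if fmt in versions:
        return versions[fmt]           # already held in the requested format
    # Assumption: conversion is applied to the originally harvested document.
    return convert(versions["original"], fmt)


# Hypothetical data: one document held in its harvested form and as DC.
stored = {"doc1": {"original": "<DIF/>", "DC": "<dc/>"}}


def fake_convert(xml_text: str, fmt: str) -> str:
    """Stand-in for an XQuery transformation."""
    return f"<converted to='{fmt}'>{xml_text}</converted>"


print(present("doc1", "DC", stored, fake_convert))
print(present("doc1", "ISO19115", stored, fake_convert))
```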

doPresent response

The doPresentResponse message is defined in the WSDL as follows:

[Image: doPresentReturnSchema.png]

The <doPresentReturn> element contains the following top-level elements:

  • <status> : true if there are any documents returned in the payload, false otherwise.
  • <statusMessage> : Textual information regarding success / failure / errors.
  • <documents> : If some documents have been successfully returned, a <documents> element is present and will contain a child <document> element for each document retrieved. In the case where some but not all documents are successfully returned, the <documents> element will contain populated <document> elements for the successfully retrieved documents, but an empty <document> element for those where retrieval failed. If NO documents are successfully returned, however, then the <status> is set to false and no <documents> element is included in the doPresentResponse message.

The <document> element, if present and populated, contains the retrieved document as an encapsulated string representation of the XML. Depending on the client used to display the payload document, it either appears contained within a <![CDATA[ ... ]]> construct, or as XML with the opening angle brackets "<" escaped as "&lt;". Most XML parsers should successfully parse the string to reconstruct the XML document, but it is returned in this form to avoid namespace issues.
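Recovering the payload therefore takes two parsing passes: one for the response message and one for each embedded string. A minimal sketch with the Python standard library is shown below; the sample response and its <record> payload are invented for illustration, and both the escaped and the CDATA forms are handled identically because the parser decodes each to plain text.

```python
import xml.etree.ElementTree as ET

NS = {"d": "urn:DiscoveryServiceAPI"}  # namespace used by the messages on this page


def extract_documents(response_xml: str) -> list:
    """Parse each <document> string into an Element; None marks a failed retrieval."""
    root = ET.fromstring(response_xml)
    docs = []
    for el in root.findall("d:documents/d:document", NS):
        text = (el.text or "").strip()
        # An empty <document> means that retrieval of that document failed.
        docs.append(ET.fromstring(text) if text else None)
    return docs


# Invented sample: the payload structure (<record>, <title>) is hypothetical.
SAMPLE = """<doPresentReturn xmlns="urn:DiscoveryServiceAPI">
  <status>true</status>
  <statusMessage>Success</statusMessage>
  <documents>
    <document>&lt;record&gt;&lt;title&gt;First&lt;/title&gt;&lt;/record&gt;</document>
    <document><![CDATA[<record><title>Second</title></record>]]></document>
    <document/>
  </documents>
</doPresentReturn>"""

for doc in extract_documents(SAMPLE):
    print(None if doc is None else doc.findtext("title"))
```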

The following request / response sequence shows a successful doPresent operation:

Request:

(tbc)

Response:

(tbc)

Term Lists

termTargetList

fullText
Target is a stored version of the original harvested XML metadata document
author
Target is the entry in the discovery index database corresponding to the author of the data described by the metadata.
parameter
Target is entries in the discovery index database corresponding to parameter keywords extracted from the metadata.

MEDINTermTarget.1*
Target is the entry in the discovery database corresponding to the MEDINTermTarget.1 field in the MEDIN metadata format. (*It has been agreed that the search capability will be extended to support direct searching of a limited number of target fields within the MEDIN metadata format. The list of fields is currently to be confirmed by MEDIN.)

presentFormatList

orderByFieldList

textRelevance
Ranking metric based on relevance of match to search term (metric derived by postgres text ranking function).
datasetStartDate
The start date of the date range given for the temporal coverage of the metadata record. Records with no start date defined are treated as if their start date is later than that of the last record with a start date defined, hence appearing at the end of the results.
datasetEndDate
The end date of the date range given for the temporal coverage of the metadata record. Records with no end date defined are treated as if their end date is later than that of the last record with an end date defined, hence appearing at the end of the results.
dataCentre
The name of the data centre supplying the metadata record. In the case of records supplied in DIF format, this is the Data_Centre_Name/Short_Name field. In the case of other metadata formats, the most appropriate equivalent field is used as this index (e.g. "DistributorName" for MDIP format).
datasetResultsetPopularity
Measure of the popularity of a metadata record, related to how many times it has appeared in a result set in discovery searches.
proximity
The geographical proximity of the centre of the spatial coverage defined in the metadata record to the centre of the original search bounding box. Where no spatial information was originally selected, proximity is calculated against the centre of a 'global' bounding box (0N, 0E). Metadata records with no spatial information originally defined are omitted from the search.
proximityNearMiss
An ordering based on records that were not within the originally requested spatial extent, but that occur within an arbitrary 10% of the original bounding box extent. Where records are present that satisfy this scenario, they are ordered according to proximity to the outside edge of the original bounding box. This option is to give users an idea of datasets that were close to the original bounding box and still matched the search, without having to redefine the original bounding box and original search terms.
datasetUpdateOrder
Date order based on the most recent update/edit by the original provider to the metadata record.
datasetOrder
Alphabetical order based on the name of the metadata record. In DIF records this is the Entry_Title field and for MDIP the Title field.
discoveryIngestDate
Date order based on the date of insertion of that record into the underlying database supporting the service. Note that this differs from "datasetUpdateOrder" in that this records the actual date a record was placed in the database as opposed to the last edit date of that record.
MEDINTermTarget.1*
One of the fields specified by MEDIN for use as a termTarget: it should be possible to configure these for use as orderByFields, too.

scopeList

MDIP
Marine Data Information Partnership (organisation now renamed MEDIN)
NERC_DDC
NERC Designated Data Centres
NERC
NERC (General)
DPPP
Data Portals Project Provider

Should we deprecate MDIP and add "MEDIN" to the list?

spatialOperatorList

overlaps (default)
doesNotOverlap
within

spatialReferenceSystemList

EPSG:4326

dateRangeTargetList

temporalCoverage
lastRevisionDate
metadataIngestionDate

temporalOperatorList

equals
doesNotEqual
onOrBefore
onOrAfter
before
after

metadataFormatList

recordDetailList

id
brief
summary
full
