Version 7 (modified by mpritcha, 10 years ago)

--

NERC DataGrid Discovery Web Service : MEDIN Improvements November 2009

The NERC DataGrid (NDG) Discovery Web Service provides a search interface to metadata records harvested from collaborating data providers and is the backend server to which the NERC Data Discovery Service is a client.

Introduction

The Discovery Web Service is a presentation-free web service which acts as a search engine on top of the NDG Discovery metadata catalogue. This catalogue is dynamically populated by the harvesting of metadata (by the CEDA group at RAL) from a number of collaborating data providers, who make their metadata available in one of several supported formats.

The search capability provided by the service enables full-text and spatio-temporal searches of catalogued metadata records and returns search results in a defined XML structure, enabling search clients to be constructed by interested parties for their own purposes. The NERC Data Discovery Service is one such client, as is the Environmental Data Portal, among other examples.

Releases

This documentation covers functionality proposed for the improvements commissioned by MEDIN.

Connectivity

Consumers may access the discovery service via SOAP. Client implementations should be generated from the WSDL at the following URIs:

(URIs tbc pending implementation of proposed improvements)

XML Data Types

The XML documents used as request and response documents for each of the service operations (methods) are defined in the <xs:schema> section of the WSDL document. The structure of each of these documents is discussed as part of the operation/method descriptions below.

Discovery Service Operations

The discovery service implements 4 operations, namely:

  • getListNames
  • getList
  • doSearch
  • doPresent

getListNames operation

The discovery web service relies on several lists of valid terms which are specific to the functionality of this service. These lists are exposed through two "helper" operations, getListNames and getList, rather than being encoded as <xs:enumeration> values in the schema part of the WSDL, so that future modifications to the service need not require modification of the WSDL (which can be inconvenient for clients already developed around a particular release of the WSDL). The getListNames operation simply returns the names of these lists, which can then be used in a subsequent call to the getList operation.

The WSDL document defines the getListNamesRequest message as an empty <getListNames> element, so the request message should look like this (omitting the SOAP Envelope & Body parent elements):

<m:getListNames xmlns:m="urn:DiscoveryServiceAPI"/>

The getListNamesResponse message comprises a <getListNamesReturn> element, with child elements containing the names of the lists available for inspection:

<getListNamesReturn xmlns="urn:DiscoveryServiceAPI">
	<listNames>
		<listName>presentFormatList</listName>
		<listName>orderByFieldList</listName>
		<listName>scopeList</listName>
		<listName>spatialOperatorList</listName>
		<listName>termTargetList</listName>
		<listName>spatialReferenceSystemList</listName>
		<listName>dateRangeTargetList</listName>
		<listName>temporalOperatorList</listName>
		<listName>metadataFormatList</listName>
		<listName>recordDetailList</listName>
	</listNames>
</getListNamesReturn>
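
As a rough illustration of how a client might handle this exchange without a generated WSDL stub, the following Python sketch builds the request envelope and extracts the list names from a response payload. It assumes a SOAP 1.1 envelope namespace, which this page does not confirm; the function names are hypothetical and are not part of the service.

```python
import xml.etree.ElementTree as ET

NS = "urn:DiscoveryServiceAPI"  # namespace used in the messages above
# Assumption: SOAP 1.1 envelope namespace; check the WSDL binding before relying on it.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def build_get_list_names_envelope():
    """Wrap an empty <getListNames> element in a SOAP Envelope and Body."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    ET.SubElement(body, f"{{{NS}}}getListNames")
    return ET.tostring(envelope, encoding="utf-8")


def parse_list_names(response_xml):
    """Return the text of every <listName> found in a getListNamesReturn payload."""
    root = ET.fromstring(response_xml)
    return [el.text.strip() for el in root.iter(f"{{{NS}}}listName") if el.text]
```

The parser is given the <getListNamesReturn> payload only, i.e. after the SOAP Envelope and Body have been unwrapped.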

getList operation

The contents of each of the lists named by the getListNames operation are accessible by invoking a call to the getList operation, with the name of the list as the single argument, encoded as a getListRequest message, as defined in the WSDL:

Request:

<getList xmlns="urn:DiscoveryServiceAPI">
    <listName>presentFormatList</listName>
</getList>

Response:

<getListReturn xmlns="urn:DiscoveryServiceAPI">
    <list name="presentFormatList">
        <listMember>DC</listMember>
        <listMember>DIF_v9.4</listMember>
        <listMember>MEDIN_v2.3</listMember>
    </list>
</getListReturn>

An explanation of the presentFormatList list is given later, in the context of the doPresent operation.
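
A client will typically want to check a user-supplied value against one of these lists before using it in a later request. The Python sketch below shows one possible way to read a getListReturn payload for that purpose; the helper names are hypothetical, and the payload is assumed to have the structure of the sample response above.

```python
import xml.etree.ElementTree as ET

NS = "urn:DiscoveryServiceAPI"  # namespace used in the messages above


def build_get_list_request(list_name):
    """Serialise a <getList> request for the named list (SOAP wrapper omitted)."""
    request = ET.Element(f"{{{NS}}}getList")
    ET.SubElement(request, f"{{{NS}}}listName").text = list_name
    return ET.tostring(request, encoding="unicode")


def parse_list(response_xml):
    """Return (list name, members) from a getListReturn payload."""
    root = ET.fromstring(response_xml)
    lst = root.find(f"{{{NS}}}list")
    if lst is None:
        raise ValueError("no <list> element in response")
    members = [m.text.strip() for m in lst.findall(f"{{{NS}}}listMember") if m.text]
    return lst.get("name"), members


def is_valid_term(value, members):
    """True if the value is one of the terms the service advertises."""
    return value in members
```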

doSearch operation

The doSearch operation performs a search against the NDG discovery database. Queries to this database are formulated from the doSearchRequest message, and forwarded to the database via private methods (i.e. the consumer of the web service cannot interact directly with the database).

Although outside the scope of the Discovery web service itself, it is worth explaining the structure of the NDG Discovery database which is searched by the service. This is populated from records harvested via OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) from collaborating data providers. Records are harvested in the ISO19115-compliant metadata format agreed by the NERC Metadata Standards Working Group (MEDIN ISO v2.3.1) with legacy support for the previous format (DIF v9.4), and are tagged at ingest time with one or more "scope" keywords (listed in the scopeList list available from the getList operation). These enable the search to be restricted to particular communities, namely NERC, NERC_DDC (Designated Data Centres) and MEDIN (Marine Environmental Data Information Network). Limited quality control on ingested records is also applied at ingest time, but it is the responsibility of the data provider to ensure that metadata records are provided to sufficient quality to enable them to be visible in the system.

The doSearchRequest message is shown in schema form in fig X.

#[[Image(doSearchSchema.png)]]

Choice of search criteria: <termSearch>, <spatialSearch> and <temporalSearch>

The <searchCriteria> element acts as a container enabling the selection of one or more of <termSearch>, <spatialSearch> and <temporalSearch>. Searches of these 3 basic types may be used in combination.

termSearch

termSearch is a full-text search invoked on a specific target field in the discovery database. Child elements <term> and <termTarget> should be populated as follows:

  • <term> : text term to search for. Whitespace separates component words, which are searched in "OR" combination unless the "+" symbol is used between them, in which case the words joined in this way are searched in "AND" combination.
  • <termTarget> : target field name taken from the termTargetList list of valid term targets.

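If multiple <termSearch> elements are present (e.g. to search different <termTarget> fields), these are interpreted as successive term searches. Searches on different targets are combined in "AND" combination, while searches on the same target are combined in "OR" combination. For example: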

  <termSearch>
    <term>snow + rain</term>
    <termTarget>abstract</termTarget>
  </termSearch>
  <termSearch>
    <term>lawrence</term>
    <termTarget>author</termTarget>
  </termSearch>

means

  Search for records where:
    abstract contains "snow" AND "rain"
AND
    author contains "lawrence"

If we were to extend the example by adding an additional termSearch, also targeted at the abstract, this would be combined in OR combination with the first termSearch *for that target*, i.e.

  <termSearch>
    <term>snow + rain</term>
    <termTarget>abstract</termTarget>
  </termSearch>
  <termSearch>
    <term>lawrence</term>
    <termTarget>author</termTarget>
  </termSearch>
  <termSearch>
    <term>hail</term>
    <termTarget>abstract</termTarget>
  </termSearch>

This would be interpreted to mean:

  Search for records where:
    abstract contains ("snow" AND "rain") OR "hail"
AND
    author contains "lawrence"
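
The combination rules illustrated by the two examples above can be sketched in Python as follows. This is purely an illustration of the boolean logic, evaluated against an in-memory record with simple case-insensitive substring matching; it is not how the service executes searches, and the function names are hypothetical.

```python
def parse_term(term):
    """Split a <term> string into OR-ed groups of AND-ed words.

    Whitespace separates words (OR); "+" joins neighbouring words (AND).
    """
    groups = []
    join_next = False
    for token in term.replace("+", " + ").split():
        if token == "+":
            join_next = True
            continue
        if join_next and groups:
            groups[-1].append(token)  # AND with the previous word
        else:
            groups.append([token])    # start a new OR alternative
        join_next = False
    return groups


def matches(record, searches):
    """Evaluate (term, termTarget) pairs against a dict of target -> text.

    Searches on the same target are OR-ed; different targets are AND-ed.
    """
    by_target = {}
    for term, target in searches:
        by_target.setdefault(target, []).extend(parse_term(term))
    return all(
        any(
            all(word.lower() in record.get(target, "").lower() for word in group)
            for group in groups
        )
        for target, groups in by_target.items()
    )
```

With the three termSearch elements of the extended example, a record whose abstract mentions only "hail" and whose author field contains "lawrence" matches, whereas a record whose abstract mentions only "snow" does not.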

--Got to here while editing--

The <termTarget> element should be populated with a valid value from the termTargetList list accessible via the getList operation. At present, these are:

fullText

A full-text search is applied to the whole discovery metadata record

author

A full-text search is applied only to those sections of the discovery metadata record relating to authorship of the dataset

parameter

A full-text search is applied only to the parameter listing section of the discovery metadata record.

<term> should be populated with the search term, which can be a string of one or more words and wildcard characters. The service is currently configured to execute searches by attempting to match XML documents (in the discovery database) where ALL of the components of the search term are matched (as opposed to ANY). In this way, increasingly specific searches can be used to refine the search results. Searches are case-insensitive. Examples of fullText search terms are:

temperature
Matches records with the word "temperature" in any node of a document
sea surface temperature
Matches documents having the words "sea", "surface" AND "temperature" (in any order)
*neodc*
Matches documents containing the string "neodc", even if embedded within a larger string.

Paging : <start> and <howMany>

The optional elements <start> and <howMany> control which records from the result set should be returned (although the total number of hits is always returned as a number to aid with paging in clients). If <start> is omitted, the default value used is 1 (i.e. the first record). If <howMany> is omitted, the default number of records returned is 30.
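
For example, a client paging through a result set could derive the <start> value for each request from the total number of hits. The short Python sketch below assumes only the defaults stated above (1-based <start>, 30 records per page); the function names are hypothetical.

```python
def page_starts(hits, how_many=30):
    """Return the 1-based <start> value for each page of a result set."""
    if how_many < 1:
        raise ValueError("how_many must be at least 1")
    return list(range(1, hits + 1, how_many))


def page_window(start, how_many, hits):
    """Return (first, last) record numbers covered by one request, or None if empty."""
    if start > hits:
        return None
    return start, min(start + how_many - 1, hits)
```

With 65 hits and the default page size, three requests are needed, with <start> set to 1, 31 and 61 respectively.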

Ordering: <orderBy> and <orderByDirection>

Ordering of the result set can be requested by setting <orderBy> to one of the valid values listed in the orderByFieldList accessible via the getList operation. Currently these are:

textRelevance
Ranking metric based on relevance of match to search term (metric derived by postgres text ranking function).
datasetStartDate
The start date of the date range given for the temporal coverage of the metadata record. Records with no start date defined are treated as if their start date is later than that of the last record with a start date defined, hence appearing at the end of the results.
datasetEndDate
The end date of the date range given for the temporal coverage of the metadata record. Records with no end date defined are treated as if their end date is later than that of the last record with an end date defined, hence appearing at the end of the results.
dataCentre
The name of the data centre supplying the metadata record. In the case of records supplied in DIF format, this is the Data_Centre_Name/Short_Name field. In the case of other metadata formats, the most appropriate equivalent field is used as this index (e.g. "DistributorName" for MDIP format)
datasetResultsetPopularity
Measure of the popularity of a metadata record, related to how many times it has appeared in a result set in discovery searches.
proximity
The geographical proximity of the centre of the spatial coverage defined in the metadata record to the centre of the original search bounding box. Where no spatial information was originally selected, proximity is calculated against the centre of a 'global' bounding box (0N, 0E). Metadata records with no spatial information originally defined are omitted from the search.
proximityNearMiss
An ordering based on records that were not within the originally requested spatial extent, but that occur within an arbitrary 10% of the original bounding box extent. Where records are present that satisfy this scenario, they are ordered according to proximity to the outside edge of the original bounding box. This option is to give users an idea of datasets that were close to the original bounding box and still matched the search without having to redefine the original bounding box and original search terms.
datasetUpdateOrder
Date order based on the most recent update/edit by the original provider to the metadata record.
datasetOrder
Alphabetical order based on the name of the metadata record. In DIF records this is the Entry_Title field and for MDIP the Title field.
discoveryIngestDate
Date order based on the date of insertion of that record into the underlying database supporting the service. Note that this differs from "datasetUpdateOrder" in that this records the actual date a record was placed in the database as opposed to the last edit date of that record.

In addition, the direction of ordering (ascending or descending) can be specified. If omitted, the default direction is ascending.

[Note 1 : a bug has recently been identified whereby records with no metric (as used in the orderByDirection, above) are incorrectly treated, resulting in these records appearing at the top of the list in the case of orderByDirection="descending". See ticket 1029]
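
Until that ticket is resolved, a client that wants records lacking the ordering metric to appear last in both directions could re-sort a page of results itself. The Python sketch below shows one way of doing so. It assumes, on the basis of the note above, that "missing values last" is the intended behaviour in both directions, and it operates on hypothetical in-memory dictionaries rather than on anything returned directly by the service.

```python
def order_records(records, key, descending=False):
    """Sort records by `key`, always placing records without a value at the end."""
    present = [r for r in records if r.get(key) is not None]
    missing = [r for r in records if r.get(key) is None]
    # Only records that actually carry the metric are reordered.
    present.sort(key=lambda r: r[key], reverse=descending)
    return present + missing
```

Dates held as YYYY-MM-DD strings sort correctly with this approach because their lexical order matches their chronological order.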

Scope of search: <scope>

The optional <scope> element can be used to restrict the search to one or more of the supported NDG Data Provider Groups, defined in the NDG controlled vocabulary http://vocab.ndg.nerc.ac.uk/list/N010/0. The values from this vocabulary that are supported by the service are given in the scopeList list accessible via the getList operation. Currently these are:

MDIP
Marine Data Information Partnership (organisation now renamed MEDIN)
NERC_DDC
NERC Designated Data Centres
NERC
NERC (General)
DPPP
Data Portals Project Provider

If <scope> is omitted, the search is not restricted in this way.

Spatial searching : <spatialOperator> and <boundingBox>

Full-text, author or parameter searches, as described above, may optionally be combined with a further restriction that the spatial coverage described in the metadata records match, according to the specified <spatialOperator>, the specified spatial <boundingBox>. <spatialOperator> may be populated with any of the values from the spatialOperatorList accessible via the getList operation. Currently, supported values are:

overlaps (default)
doesNotOverlap
within

If <spatialOperator> is omitted, but a valid <boundingBox> is supplied, the default operator applied is overlaps. Values for <limitNorth>, <limitSouth>, <limitEast> and <limitWest> should be given in decimal degrees latitude and longitude. <limitNorth> and <limitSouth> must be in the range -90.0 to +90.0, with <limitNorth> greater than <limitSouth>. <limitWest> and <limitEast> must be in the range -180.0 to 180.0 and <limitEast> should be greater than <limitWest>. Bounding boxes that span the -180 degree meridian, or the poles, are not currently supported.
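
A client can apply these constraints before sending a request. The following Python sketch is one possible check, treating both ordering rules (north above south, east greater than west) as errors; the function name and the error messages are hypothetical.

```python
def validate_bounding_box(north, south, east, west):
    """Return a list of problems with a bounding box (empty if it is acceptable)."""
    errors = []
    if not (-90.0 <= south <= 90.0 and -90.0 <= north <= 90.0):
        errors.append("limitNorth and limitSouth must be in the range -90.0 to +90.0")
    if north <= south:
        errors.append("limitNorth must be greater than limitSouth")
    if not (-180.0 <= west <= 180.0 and -180.0 <= east <= 180.0):
        errors.append("limitWest and limitEast must be in the range -180.0 to 180.0")
    if east <= west:
        # Also catches boxes that wrap across the 180 degree meridian.
        errors.append("limitEast must be greater than limitWest")
    return errors
```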

Spatial searches (as a further restriction of "term" searches) are currently implemented by obtaining a resultset from the term search, obtaining a result set from the spatial search, then returning the intersection of the two result sets.
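
The intersection step can be pictured with the short Python sketch below. It works on plain lists of document identifiers and keeps the order of the term-search results, which is an assumption made for illustration; the service's internal implementation is not described on this page.

```python
def intersect_results(term_hits, spatial_hits):
    """Return the documents present in both result sets, in term-search order."""
    spatial = set(spatial_hits)  # set membership keeps the lookup cheap
    return [doc_id for doc_id in term_hits if doc_id in spatial]
```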

Temporal searching : <DateRange>

Full-text, author or parameter searches may optionally be combined with a further restriction that the temporal coverage overlaps the specified <DateRange>. Both <DateRangeStart> and <DateRangeEnd> must be specified and must be valid dates of the form YYYY-MM-DD. TODO: it is planned to implement a choice of <temporalOperator> in a similar manner to <spatialOperator>.
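
The date format requirement and the meaning of "overlaps" can be illustrated with the Python sketch below. The helper names are hypothetical, and the overlap test (two closed date ranges sharing at least one day) is an assumption about the intended semantics rather than a description of the service's implementation.

```python
import re
from datetime import date

_YMD = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_ymd(text):
    """Return a date for a strict YYYY-MM-DD string, or None if it is not valid."""
    match = _YMD.fullmatch(text)
    if match is None:
        return None
    try:
        return date(int(match.group(1)), int(match.group(2)), int(match.group(3)))
    except ValueError:  # e.g. 2001-02-30
        return None


def ranges_overlap(record_start, record_end, search_start, search_end):
    """True if two closed date ranges share at least one day."""
    return record_start <= search_end and record_end >= search_start
```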

Search results

The doSearchResponse message is defined in the WSDL as shown below:

#[[Image(doSearchReturnSchema.png)]]

The <doSearchReturn> element contains the following top-level elements:

status
true if successful AND number of hits > 0, false otherwise (designed so that a client need only proceed to parse the rest of the message if results were successfully returned)
statusMessage
Textual information regarding success / failure / errors
resultId
reserved for future use
hits
TOTAL number of hits returned
documents
parent element for array of <document> elements containing returned document IDs

A typical search result was shown in the "Quick Start" section. A result where no hits were returned is shown below:

<doSearchReturn xmlns="urn:DiscoveryServiceAPI">
	<status>false</status>
	<statusMessage>Search was successful but generated no results.</statusMessage>
	<resultId>0</resultId>
	<hits>0</hits>
</doSearchReturn>
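
A client might read the top-level elements of a <doSearchReturn> payload as in the Python sketch below. It assumes that each <document> element is a direct child of <documents> and holds a document ID as text, as described above; the function name and the returned dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"d": "urn:DiscoveryServiceAPI"}  # namespace used in the messages above


def parse_search_return(xml_text):
    """Extract status, message, hit count and document IDs from a doSearchReturn payload."""
    root = ET.fromstring(xml_text)
    status_text = root.findtext("d:status", default="", namespaces=NS)
    documents = [
        doc.text.strip()
        for doc in root.findall("d:documents/d:document", NS)
        if doc.text
    ]
    return {
        "status": status_text.strip().lower() == "true",
        "message": root.findtext("d:statusMessage", default="", namespaces=NS).strip(),
        "hits": int(root.findtext("d:hits", default="0", namespaces=NS)),
        "documents": documents,
    }
```

Because <status> is false when there are no hits, a client can stop after checking it and report the <statusMessage> text to the user.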

doPresent operation

The doPresent operation provides a means of retrieving (presenting) one or more XML documents from the database. The doPresentRequest message is defined as follows:

#[[Image(doPresentSchema.png)]]

One or more <document> elements should each contain the name of a document (in the form returned in the doSearchReturn message) to be retrieved. The optional <format> element should be populated with one of the supported format names as listed by the presentFormatList accessible via the getList operation. All documents returned by a single invocation of the doPresent operation are returned in the same format, i.e. the choice of presentFormat applies to the doPresent request and not individual documents. Currently-supported formats are:

original
Documents are returned unaltered, in the format in which they were harvested (via OAI-PMH) from the data provider.
DC
Dublin Core format
DIF
GCMD DIF format (version ??)
MDIP
Metadata format used by the Marine Data and Information Partnership
ISO19115
ISO19115 (Geographic Information: Metadata) encoded as ISO19139 XML

For all formats except original, the following steps are taken prior to returning the document:

  • Check whether the document exists in the discovery database in the requested format and, if so, return it unaltered
  • Otherwise, apply a conversion XQuery to create a new document in that format on-the-fly

doPresent response

The doPresentResponse message is defined in the WSDL as follows:

#[[Image(doPresentReturnSchema.png)]]

The <doPresentReturn> element contains the following top-level elements:

<status>
true if there are any documents returned in the payload, false otherwise.
<statusMessage>
Textual information regarding success / failure / errors.
<documents>
If some documents have been successfully returned, a <documents> element is present and will contain a child <document> element for each document retrieved. In the case where some but not all documents are successfully returned, the <documents> element will contain populated <document> elements for the successfully-retrieved documents, but an empty <document> element for those where retrieval failed. If NO documents are successfully returned, however, then the <status> is set to false and no <documents> element is included in the doPresentResponse message.

The <document> element, if present and populated, contains the retrieved document as an encapsulated string representation of the XML. Depending on the client used to display the payload document, it either appears contained within a <![CDATA[ ... ]]> construct, or as XML with the opening angle brackets "<" escaped as "&lt;". Most XML parsers should successfully parse the string to reconstruct the XML document, but it is returned in this form to avoid namespace issues.

The following request / response sequence shows a successful doPresent operation:

Request:

<m:doPresent xmlns:m="urn:DiscoveryServiceAPI">
    <m:documents>
        <m:document>badc.nerc.ac.uk__DIF__dataent_claus.xml</m:document>
        <m:document>ndg.noc.soton.ac.uk__DIF__NOCSDAT160.xml</m:document>
        <m:document>ndg.noc.soton.ac.uk__DIF__NOCSDAT162.xml</m:document>
        <m:document>ndg.noc.soton.ac.uk__DIF__NOCSDAT163.xml</m:document>
    </m:documents>
    <m:format>original</m:format>
</m:doPresent>

Response:

<doPresentReturn xmlns="urn:DiscoveryServiceAPI">
    <status>true</status>
    <statusMessage>Success</statusMessage>
    <documents>
        <document>&lt;DIF xmlns="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">&lt;Entry_ID>badc.nerc.ac.uk:DIF:dataent_claus&lt;/Entry_ID> (...) &lt;/DIF></document>
        <document>&lt;DIF xmlns="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">&lt;Entry_ID>ndg.noc.soton.ac.uk__DIF__NOCSDAT160&lt;/Entry_ID> (...) &lt;/DIF></document>
        <document>&lt;DIF xmlns="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">&lt;Entry_ID>ndg.noc.soton.ac.uk__DIF__NOCSDAT162&lt;/Entry_ID> (...) &lt;/DIF></document>
        <document>&lt;DIF xmlns="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">&lt;Entry_ID>ndg.noc.soton.ac.uk__DIF__NOCSDAT163&lt;/Entry_ID> (...) &lt;/DIF></document>
    </documents>
</doPresentReturn>
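
Since each <document> carries its payload as an escaped string, a client needs a second parsing step to recover the embedded XML. The Python sketch below shows one way of doing this with the standard library. The function name is hypothetical, and an empty <document> element is mapped to None to mark a failed retrieval, following the description of partial failures above.

```python
import xml.etree.ElementTree as ET

NS = {"d": "urn:DiscoveryServiceAPI"}  # namespace used in the messages above


def parse_present_return(xml_text):
    """Return (status, documents) where each document is a parsed Element or None."""
    root = ET.fromstring(xml_text)
    status = root.findtext("d:status", default="", namespaces=NS).strip().lower() == "true"
    documents = []
    for doc in root.findall("d:documents/d:document", NS):
        # The first parse has already unescaped the entities (or unwrapped the
        # CDATA section), so the element text is the embedded document as a string.
        payload = (doc.text or "").strip()
        documents.append(ET.fromstring(payload) if payload else None)
    return status, documents
```

Each parsed document keeps its own namespace (for DIF records, the GCMD namespace shown in the response above), independent of the namespace of the enclosing message.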

Term Lists

Supported Metadata Formats
