Changes between Initial Version and Version 1 of Discovery/DiscoveryWebServiceMEDIN


Timestamp:
04/11/09 23:30:58 (11 years ago)
Author:
mpritcha
     1= NERC !DataGrid Discovery Web Service = 
     2 
     3The NERC !DataGrid (NDG) Discovery Web Service provides a search interface to metadata records harvested from collaborating data providers and is the backend server to which the NERC Data Discovery Service is a client. 
     4 
     5  * [#Introduction Introduction] 
     6  * [#Releases Releases] 
     7  * [#Connectivity Connectivity] 
     8  * [#XMLDataTypes XML Data Types] 
     9  * [#DiscoveryServiceOperations Discovery Service Operations] 
     10  * [#TermLists Term Lists] 
     11  * [#SupportedMetadataFormats Supported Metadata Formats] 
     12 
     13== Introduction == 
     14 
     15The Discovery Web Service is a presentation-free web service which acts as a search engine on top of the NDG Discovery metadata catalogue. This catalogue is dynamically populated by the [#harvesting harvesting] of metadata (by the [http://www.ceda.ac.uk CEDA] group at RAL) from a number of collaborating data providers, who make their metadata available in one of a number of [#SupportedMetadataFormats supported formats]. 
     16 
     17The search capability provided by the service enables full-text and spatio-temporal searches of catalogued metadata records and returns search results in a defined XML structure, enabling search clients to be constructed by interested parties for their own purposes. The [http://ndg.nerc.ac.uk/discovery NERC Data Discovery Service] is one such client, as is the [http://www.edp.nerc.ac.uk Environmental Data Portal], among other examples. 
     18 
     19== Releases == 
     20 
     21== Connectivity == 
     22 
     23Consumers may access the discovery service via SOAP. Client implementations should be generated from the WSDL at the following URIs: 
     24 
     25[http://ndg3beta.badc.rl.ac.uk/axis2/services/DiscoveryService?wsdl] (ndg3beta deployment : latest stable development version) 
     26 
     27[http://proglue.badc.rl.ac.uk/axis2/services/DiscoveryService?wsdl] (ndg "live" deployment : production version) 
     28 
     29== XML Data Types == 
     30 
     31The XML documents used as request and response documents for each of the service operations (methods) are defined in the <xsd:schema> section of the WSDL document. The structure of each of these documents is discussed as part of the [#DiscoveryServiceOperations operation/method descriptions] below. 
     32 
     33== Discovery Service Operations == 
     34 
     35The discovery service implements 4 operations, namely: 
     36  * getListNames 
     37  * getList 
     38  * doSearch 
     39  * doPresent 
     40 
     41=== getListNames operation === 
     42 
     43The discovery web service relies on several lists of valid terms which are specific to the functionality of this service. These terms are exposed through 2 "helper" operations (getListNames and getList) rather than being encoded as <xs:enumeration> values in the schema part of the WSDL, so that future modifications to the service need not require modification of the WSDL (which can be inconvenient for clients already developed around a particular release of the WSDL). The getListNames operation simply returns the names of these lists, which can then be used in a subsequent call to the getList operation. 
     44 
     45The WSDL document defines the getListNamesRequest message as an empty <getListNames> element, so the request message should look like this (omitting the SOAP Envelope & Body parent elements): 
     46{{{ 
     47<m:getListNames xmlns:m="urn:DiscoveryServiceAPI"/> 
     48}}} 
     49The getListNamesResponse message comprises a <getListNamesReturn> element, with child elements containing the names of the lists available for inspection: 
     50{{{ 
     51<getListNamesReturn xmlns="urn:DiscoveryServiceAPI"> 
     52        <listNames> 
     53                <listName>presentFormatList</listName> 
     54                <listName>orderByFieldList</listName> 
     55                <listName>scopeList</listName> 
     56                <listName>termTypeList</listName> 
     57                <listName>spatialOperatorList</listName> 
     58        </listNames> 
     59</getListNamesReturn> 
     60}}} 
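
As an illustration of the exchange above, the following Python sketch wraps the empty getListNames element in a SOAP envelope and extracts the list names from a response. It is a minimal sketch, not the service's own client code: the SOAP 1.1 envelope namespace and the helper names `build_get_list_names_envelope` and `parse_list_names` are assumptions, and in practice a client generated from the WSDL would normally handle the envelope.

```python
import xml.etree.ElementTree as ET

# Assumption: the service speaks SOAP 1.1; the WSDL is the authority on this.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
API_NS = "urn:DiscoveryServiceAPI"


def build_get_list_names_envelope():
    """Wrap an empty getListNames element in a SOAP Envelope and Body."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    ET.SubElement(body, f"{{{API_NS}}}getListNames")
    return ET.tostring(envelope, encoding="utf-8")


def parse_list_names(response_xml):
    """Return the text of every listName element found in the response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iter(f"{{{API_NS}}}listName")]
```

Because `parse_list_names` searches the whole tree, it accepts either the bare getListNamesReturn element or a full SOAP response containing it.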
     61 
     62=== getList operation === 
     63 
     64The contents of each of the lists named by the getListNames operation are accessible by invoking a call to the getList operation, with the name of the list as the single argument, encoded as a getListRequest message, as defined in the WSDL : 
     65 
     66Request: 
     67{{{ 
     68<getList xmlns="urn:DiscoveryServiceAPI"> 
     69    <listName>presentFormatList</listName> 
     70</getList> 
     71}}} 
     72 
     73Response: 
     74{{{ 
     75<getListReturn xmlns="urn:DiscoveryServiceAPI"> 
     76    <list name="presentFormatList"> 
     77        <listMember>original</listMember> 
     78        <listMember>DC</listMember> 
     79        <listMember>DIF</listMember> 
     80        <listMember>MDIP</listMember> 
     81        <listMember>ISO19115</listMember> 
     82     </list> 
     83</getListReturn> 
     84}}} 
     85An explanation of the presentFormatList list is given later, in the context of the doPresent operation.  
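
A client will typically fetch a list once and then check user-supplied values against it before building a request. The sketch below shows one way to read a getListReturn message in Python; the function name `parse_list` is hypothetical and the code assumes the response has exactly the shape shown above.

```python
import xml.etree.ElementTree as ET

API_NS = "urn:DiscoveryServiceAPI"


def parse_list(response_xml):
    """Return (list name, list members) from a getListReturn message."""
    root = ET.fromstring(response_xml)
    list_el = root.find(f".//{{{API_NS}}}list")
    if list_el is None:
        raise ValueError("response contains no list element")
    members = [m.text for m in list_el.findall(f"{{{API_NS}}}listMember")]
    return list_el.get("name"), members
```

For example, the members returned for presentFormatList can be used to reject an unsupported format name before a doPresent request is sent.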
     86 
     87=== doSearch operation === 
     88 
     89The doSearch operation performs a search against the NDG discovery database. Queries to this database are formulated from the doSearchRequest message, and forwarded to the database via private methods (i.e. the consumer of the web service is not able directly to interact with the database). 
     90 
     91Although outside the scope of the Discovery web service itself, it is worth explaining the structure of the NDG Discovery database which is searched by the service. This is populated from records harvested via [http://www.openarchives.org/pmh/ OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)] from collaborating data providers. Records are currently harvested in GCMD DIF format, and are tagged at ingest time with one or more "scope" keywords (listed in the scopeList list available from the getList operation). These enable the search to be restricted to particular communities, namely NERC, NERC_DDC (Designated Data Centres) and MDIP (Marine Data Information Partnership). Limited quality control on ingested records is also applied at ingest time, and it is the responsibility of the data provider to ensure that metadata records are provided to sufficient quality to enable them to be visible in the system. 
     92 
     93The doSearchRequest message is shown in schema form in the figure below. 
     94 
     95[[Image(doSearchSchema.png)]] 
     96 
     97=== Choice of search type: <term> and <termType> ===  
     98The only mandatory elements are <term> and <termType>, as used in the example "Quick Start", above. By specifying the <termType>, a choice is made as to which of 3 variants of a full-text search should be invoked. This element should be populated with a valid value from the termTypeList list accessible via the getList operation. At present, these are:  
     99 
     100==== fullText ==== 
     101A full-text search is applied to the whole discovery metadata record  
     102==== author ====  
     103A full-text search is applied only to those sections of the discovery metadata record relating to authorship of the dataset  
     104==== parameter ====  
     105A full-text search is applied only to the parameter listing section of the discovery metadata record. 
     106  
     107<term> should be populated with the search term, which can be a string of one or more words and wildcard characters. The service is currently configured to execute searches by attempting to match XML documents (in the discovery database) where ALL of the components of the search term are matched (as opposed to ANY). In this way, increasingly specific searches can be used to refine the search results. Searches are case-insensitive. Examples of fullText search terms are:  
     108 
     109  temperature:: 
     110    Matches records with the word "temperature" in any node of a document  
     111  sea surface temperature:: 
     112    Matches documents having the words "sea", "surface" AND "temperature" (in any order)  
     113  *neodc*:: 
     114    Matches documents containing the string "neodc", even if embedded within a larger string.  
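
The following Python sketch builds the smallest possible search request, containing only the two mandatory elements. Several things here are assumptions rather than facts taken from the WSDL: the root element name `doSearch` (chosen by analogy with getList and doPresent), the order of the child elements, and the helper name `build_do_search`. The schema in the WSDL remains the authority on the real structure.

```python
import xml.etree.ElementTree as ET

API_NS = "urn:DiscoveryServiceAPI"
# Values currently listed in termTypeList, as described above.
TERM_TYPES = ("fullText", "author", "parameter")


def build_do_search(term, term_type="fullText"):
    """Serialise a minimal search request holding term and termType."""
    if term_type not in TERM_TYPES:
        raise ValueError(f"unsupported termType: {term_type!r}")
    ET.register_namespace("m", API_NS)  # emit the m: prefix used in this page
    request = ET.Element(f"{{{API_NS}}}doSearch")  # assumed element name
    ET.SubElement(request, f"{{{API_NS}}}term").text = term
    ET.SubElement(request, f"{{{API_NS}}}termType").text = term_type
    return ET.tostring(request, encoding="unicode")
```

Calling `build_do_search("sea surface temperature")` produces a request for records matching all three words, and `build_do_search("*neodc*")` one for the embedded-string match described above.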
     115 
     116=== Paging : <start> and <howMany> ===  
     117The optional elements <start> and <howMany> control which records from the result set should be returned (although the total number of hits is always returned as a number to aid with paging in clients). If <start> is omitted, the default value used is 1 (i.e. the first record). If <howMany> is omitted, the default number of records returned is 30. 
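
Since the total number of hits is always returned, a client can work out the start value for every page after the first response. A minimal Python sketch, assuming 1-based record numbering as described above (the function name `page_starts` is hypothetical):

```python
def page_starts(hits, how_many=30):
    """Return the start value for each page needed to cover all hits."""
    if how_many < 1:
        raise ValueError("how_many must be at least 1")
    # Records are numbered from 1, so pages begin at 1, 1 + how_many, ...
    return list(range(1, hits + 1, how_many))
```

With the default page size, 65 hits need three requests, with start set to 1, 31 and 61.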
     118 
     119=== Ordering: <orderBy> and <orderByDirection> === 
     120Ordering of the result set can be requested by setting <orderBy> to one of the valid values listed in the orderByFieldList accessible via the getList operation. Currently these are: 
     121  textRelevance:: 
     122    Ranking metric based on relevance of match to search term (metric derived by postgres text ranking function). 
     123  datasetStartDate::  
     124    The start date of the date range given for the temporal coverage of the metadata record. Records with no start date defined are treated as if their start date is later than that of the last record with a start date defined, hence appearing at the end of the results. 
     125  datasetEndDate::  
     126    The end date of the date range given for the temporal coverage of the metadata record. Records with no end date defined are treated as if their end date is later than that of the last record with an end date defined, hence appearing at the end of the results. 
     127  dataCentre:: 
     128    The name of the data centre supplying the metadata record. In the case of records supplied in DIF format, this is the Data_Centre_Name/Short_Name field. In the case of other metadata formats, the most appropriate equivalent field is used as this index (e.g. "DistributorName" for MDIP format) 
     129  datasetResultsetPopularity:: 
     130    Measure of the popularity of a metadata record, related to how many times it has appeared in a result set in discovery searches. 
     131  proximity:: 
     132    The geographical proximity of the centre of the spatial coverage defined in the metadata record to the centre of the original search bounding box.  Where no spatial information was originally selected, proximity is calculated against the centre of a 'global' bounding box (0N, 0E).  Metadata records with no spatial information originally defined are omitted from the search. 
     133  proximityNearMiss:: 
     134    An ordering based on records that were not within the originally requested spatial extent, but that occur within an arbitrary 10% of the original bounding box extent.  Where records are present that satisfy this scenario, they are ordered according to proximity to the outside edge of the original bounding box.  This option is to give users an idea of datasets that were close to the original bounding box and still matched the search without having to redefine the original bounding box and original search terms.  
     135  datasetUpdateOrder:: 
     136    Date order based on the most recent update/edit by the original provider to the metadata record. 
     137  datasetOrder:: 
     138    Alphabetical order based on the name of the metadata record.  In DIF records this is the Entry_Title field and for MDIP the Title field. 
     139  discoveryIngestDate:: 
     140    Date order based on the date of insertion of that record into the underlying database supporting the service.  Note that this differs from "datasetUpdateOrder" in that this records the actual date a record was placed in the database as opposed to the last edit date of that record. 
     141 
     142In addition, the direction of ordering (ascending or descending) can be specified. If omitted, the default direction is ascending. 
     143 
     144[Note 1 : a bug has recently been identified whereby records with no value for the metric used in <orderBy>, above, are incorrectly treated, resulting in these records appearing at the top of the list in the case of orderByDirection="descending". See ticket [http://proj.badc.rl.ac.uk/ndg/ticket/1029#comment:4 1029] ]  
     145 
     146=== Scope of search: <scope> === 
     147The optional <scope> element can be used to restrict the search to one or more of the supported NDG Data Provider Groups, defined in NDG controlled vocabulary http://vocab.ndg.nerc.ac.uk/list/N010/0. The currently supported values from this vocabulary are given in the scopeList list accessible via the getList operation. Currently these are: 
     148 
     149  MDIP:: 
     150    Marine Data Information Partnership (organisation now renamed MEDIN) 
     151  NERC_DDC:: 
     152    NERC Designated Data Centres  
     153  NERC:: 
     154    NERC (General) 
     155  DPPP:: 
     156    Data Portals Project Provider 
     157 
     158If <scope> is omitted, the search is not restricted in this way. 
     159 
     160=== Spatial searching : <spatialOperator> and <boundingBox> === 
     161Full-text, author or parameter searches, as described above, may optionally be combined with a further restriction that the spatial coverage described in the metadata records match, according to the specified <spatialOperator>, the specified spatial <boundingBox>. <spatialOperator> may be populated with any of the values from the spatialOperatorList accessible via the getList operation. Currently, supported values are: 
     162 
     163  overlaps (default):: 
     164 
     165  doesNotOverlap:: 
     166 
     167  within:: 
     168 
     169If <spatialOperator> is omitted, but a valid <boundingBox> is supplied, the default operator applied is overlaps. Values for <limitNorth>, <limitSouth>, <limitEast> and <limitWest> should be given in decimal degrees latitude and longitude. <limitNorth> and <limitSouth> must be in the range -90.0 to +90.0, with <limitNorth> greater than <limitSouth>. <limitWest> and <limitEast> must be in the range -180.0 to 180.0 and <limitEast> should be greater than <limitWest>. Bounding boxes that span the -180 degree meridian, or the poles, are not currently supported. 
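
These constraints are easy to check on the client side before a request is sent. The Python sketch below applies the rules stated above; the function name `validate_bounding_box` and the wording of the messages are illustrative only, and the service may report invalid boxes differently.

```python
def validate_bounding_box(north, south, east, west):
    """Return a list of problems; an empty list means the box is acceptable."""
    problems = []
    for name, value in (("limitNorth", north), ("limitSouth", south)):
        if not -90.0 <= value <= 90.0:
            problems.append(f"{name} must be in the range -90.0 to +90.0")
    for name, value in (("limitEast", east), ("limitWest", west)):
        if not -180.0 <= value <= 180.0:
            problems.append(f"{name} must be in the range -180.0 to 180.0")
    if north <= south:
        problems.append("limitNorth must be greater than limitSouth")
    # A box crossing the 180 degree meridian would have east < west,
    # which is not currently supported.
    if east <= west:
        problems.append("limitEast should be greater than limitWest")
    return problems
```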
     170 
     171Spatial searches (as a further restriction of "term" searches) are currently implemented by obtaining a resultset from the term search, obtaining a result set from the spatial search, then returning the intersection of the two result sets. 
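
The intersection step can be pictured with the short Python sketch below. It is an illustration of the idea only, not the service's implementation: the function name is hypothetical, and keeping the order of the term-search results is an assumption made here for readability.

```python
def intersect_results(term_hits, spatial_hits):
    """Keep documents present in both result sets, in term-search order."""
    spatial = set(spatial_hits)
    return [doc for doc in term_hits if doc in spatial]
```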
     172 
     173=== Temporal searching : <DateRange> === 
     174Full-text, author or parameter searches may optionally be combined with a further restriction that the temporal coverage overlaps the specified <DateRange>. Both <DateRangeStart> and <DateRangeEnd> must be specified and must be valid dates of the form YYYY-MM-DD. TODO: it is planned to implement a choice of <temporalOperator> in a similar manner to <spatialOperator>. 
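
A client can check both dates before sending the request. The following Python sketch enforces the YYYY-MM-DD form and rejects impossible dates; the function name `parse_date_range` is hypothetical, and no rule about the start preceding the end is applied because none is stated here.

```python
import re
from datetime import date, datetime

_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_date_range(start_text, end_text):
    """Return (start, end) as date objects, or raise ValueError."""
    parsed = []
    for label, text in (("DateRangeStart", start_text), ("DateRangeEnd", end_text)):
        if text is None or not _DATE_PATTERN.fullmatch(text):
            raise ValueError(f"{label} must have the form YYYY-MM-DD")
        # strptime rejects impossible dates such as 2009-02-30.
        parsed.append(datetime.strptime(text, "%Y-%m-%d").date())
    return tuple(parsed)
```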
     175 
     176== Search results == 
     177The doSearchResponse message is defined in the WSDL as shown below: 
     178 
     179[[Image(doSearchReturnSchema.png)]] 
     180 
     181The <doSearchReturn> element contains the following top-level elements: 
     182 
     183  status:: 
     184    true if successful AND number of hits > 0, false otherwise (designed so that a client need only proceed to parse the rest of the message if results were successfully returned)  
     185  statusMessage:: 
     186    Textual information regarding success / failure / errors  
     187  resultId:: 
     188    reserved for future use  
     189  hits:: 
     190    TOTAL number of hits returned  
     191  documents:: 
     192    parent element for array of <document> elements containing returned document IDs  
     193 
     194A typical search result was shown in the "Quick Start" section. A result where no hits were returned is shown below: 
     195{{{ 
     196<doSearchReturn xmlns="urn:DiscoveryServiceAPI"> 
     197        <status>false</status> 
     198        <statusMessage>Search was successful but generated no results.</statusMessage> 
     199        <resultId>0</resultId> 
     200        <hits>0</hits> 
     201</doSearchReturn> 
     202}}} 
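
The status element lets a client decide quickly whether there is anything further to parse. The Python sketch below reads the top-level elements described above; the function name `parse_do_search_return` and the dictionary layout are illustrative choices, not part of the service.

```python
import xml.etree.ElementTree as ET

API_NS = "urn:DiscoveryServiceAPI"


def parse_do_search_return(response_xml):
    """Summarise a search response as a plain dictionary."""
    root = ET.fromstring(response_xml)

    def text_of(name):
        el = next(root.iter(f"{{{API_NS}}}{name}"), None)
        return el.text if el is not None else None

    ok = text_of("status") == "true"
    result = {
        "status": ok,
        "statusMessage": text_of("statusMessage"),
        "hits": int(text_of("hits") or 0),
        "documents": [],
    }
    if ok:  # only read the document IDs when results were returned
        result["documents"] = [d.text for d in root.iter(f"{{{API_NS}}}document")]
    return result
```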
     203 
     204=== doPresent operation === 
     205 
     206The doPresent operation provides a means of retrieving (presenting) one or more XML documents from the database. The doPresentRequest message is defined as follows: 
     207 
     208[[Image(doPresentSchema.png)]] 
     209 
     210One or more <document> elements should each contain the name of a document (in the form returned in the doSearchReturn message) to be retrieved. The optional <format> element should be populated with one of the supported format names as listed by the presentFormatList accessible via the [#getListoperation getList] operation. All documents returned by a single invocation of the doPresent operation are returned in the same format, i.e. the choice of presentFormat applies to the doPresent request and not individual documents. Currently-supported formats are: 
     211 
     212  original:: 
     213    Documents are returned unaltered, in the format in which they were harvested (via OAI-PMH) from the data provider. 
     214 
     215  DC:: 
     216    Dublin Core format  
     217  DIF:: 
     218    GCMD DIF format (version ??) 
     219  MDIP:: 
     220    Metadata format used by the Marine Data and Information Partnership 
     221  ISO19115:: 
     222    ISO19115 (Geographic Information: Metadata) encoded as ISO19139 XML  
     223 
     224For all formats except original, the following steps are taken prior to returning the document: 
     225 
     226  * Check whether the document already exists in the discovery database in the requested format and, if so, return it unaltered  
     227  * Otherwise, apply a conversion XQuery to create a new document in that format on-the-fly  
     228 
     229==== doPresent response ==== 
     230The doPresentResponse message is defined in the WSDL as follows: 
     231 
     232[[Image(doPresentReturnSchema.png)]] 
     233 
     234The <doPresentReturn> element contains the following top-level elements: 
     235 
     236  <status>:: 
     237    true if there are any documents returned in the payload, false otherwise.  
     238  <statusMessage>:: 
     239    Textual information regarding success / failure / errors.  
     240  <documents>:: 
     241    If some documents have been successfully returned, a <documents> element is present and will contain a child <document> element for each document retrieved. In the case where some but not all documents are successfully returned, the <documents> element will contain populated <document> elements for the successfully-retrieved documents, but an empty <document> element for those where retrieval failed. If NO documents are successfully returned, however, then the <status> is set to false and no <documents> element is included in the doPresentResponse message. 
     242 
     243The <document> element, if present and populated, contains the retrieved document as an encapsulated string representation of the XML. Depending on the client used to display the payload document, it either appears contained within a <![CDATA[ ... ]]> construct, or as XML with the opening angle brackets "<" escaped as "&lt;". Most XML parsers should successfully parse the string to reconstruct the XML document, but it is returned in this form to avoid namespace issues.  
     244 
     245The following request / response sequence shows a successful doPresent operation: 
     246 
     247Request: 
     248{{{ 
     249<m:doPresent xmlns:m="urn:DiscoveryServiceAPI"> 
     250    <m:documents> 
     251        <m:document>badc.nerc.ac.uk__DIF__dataent_claus.xml</m:document> 
     252        <m:document>ndg.noc.soton.ac.uk__DIF__NOCSDAT160.xml</m:document> 
     253        <m:document>ndg.noc.soton.ac.uk__DIF__NOCSDAT162.xml</m:document> 
     254        <m:document>ndg.noc.soton.ac.uk__DIF__NOCSDAT163.xml</m:document> 
     255    </m:documents> 
     256    <m:format>original</m:format> 
     257</m:doPresent> 
     258}}} 
     259 
     260Response: 
     261{{{ 
     262<doPresentReturn xmlns="urn:DiscoveryServiceAPI"> 
     263    <status>true</status> 
     264    <statusMessage>Success</statusMessage> 
     265    <documents> 
     266        <document>&lt;DIF xmlns="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">&lt;Entry_ID>badc.nerc.ac.uk:DIF:dataent_claus&lt;/Entry_ID> (...) &lt;/DIF></document> 
     267        <document>&lt;DIF xmlns="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">&lt;Entry_ID>ndg.noc.soton.ac.uk__DIF__NOCSDAT160&lt;/Entry_ID> (...) &lt;/DIF></document> 
     268        <document>&lt;DIF xmlns="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">&lt;Entry_ID>ndg.noc.soton.ac.uk__DIF__NOCSDAT162&lt;/Entry_ID> (...) &lt;/DIF></document> 
     269        <document>&lt;DIF xmlns="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">&lt;Entry_ID>ndg.noc.soton.ac.uk__DIF__NOCSDAT163&lt;/Entry_ID> (...) &lt;/DIF></document> 
     270    </documents> 
     271</doPresentReturn> 
     272}}} 
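
Because each document arrives as an escaped string, a client needs two parsing passes: one for the response message and one for each payload. The Python sketch below shows this, returning None for any empty document element (a failed retrieval). The function name `parse_do_present_return` is hypothetical and the code assumes the response shape shown above.

```python
import xml.etree.ElementTree as ET

API_NS = "urn:DiscoveryServiceAPI"


def parse_do_present_return(response_xml):
    """Return (status, documents); each document is an Element or None."""
    root = ET.fromstring(response_xml)
    status_el = next(root.iter(f"{{{API_NS}}}status"), None)
    status = status_el is not None and status_el.text == "true"
    documents = []
    for el in root.iter(f"{{{API_NS}}}document"):
        # The first parse has already unescaped &lt; (or unwrapped CDATA),
        # so the element text is the payload document as a plain string.
        payload = (el.text or "").strip()
        documents.append(ET.fromstring(payload) if payload else None)
    return status, documents
```

The second call to `ET.fromstring` gives each payload its own namespace context, which is consistent with the reason given above for returning documents as strings.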
     273 
     274 
     275== Term Lists == 
     276 
     277== Supported Metadata Formats ==