
Version 9 (modified by domlowe, 13 years ago)

Changed 'running the parser' to 'running the scanner'. Answered question about where the random string comes from.

NDG Identifiers

On their definition and usage.

 Kev's updated opus

The Status Quo

We've agreed that NDG identifiers should look like:

repository:schema:local_id

with allowable alternatives of

repository/schema/local_id

A suggestion

We also allow, and start to *prefer*:

repository__schema__local_id

(note the TWO underscores between each token).
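As a rough illustration of the three forms, here is a minimal Python sketch that reads an identifier in any of them and writes it back in the double-underscore form. The names NDGId, parse_ndg_id and format_ndg_id are made up for this example, and it assumes that no token ever contains one of the separators.

```python
from typing import NamedTuple


class NDGId(NamedTuple):
    repository: str
    schema: str
    local_id: str


# Separators in the order tried: the preferred form first, then the other two.
SEPARATORS = ("__", ":", "/")


def parse_ndg_id(text: str) -> NDGId:
    """Split an identifier written in any of the three agreed forms."""
    for sep in SEPARATORS:
        parts = text.split(sep)
        # Assumption: no token contains a separator, so we expect exactly three parts.
        if len(parts) == 3 and all(parts):
            return NDGId(*parts)
    raise ValueError(f"not an NDG identifier: {text!r}")


def format_ndg_id(ndg_id: NDGId, sep: str = "__") -> str:
    """Write an identifier out, by default in the double-underscore form."""
    return sep.join(ndg_id)


print(format_ndg_id(parse_ndg_id("badc.nerc.ac.uk:csml:SomeRandomString")))
# prints badc.nerc.ac.uk__csml__SomeRandomString
```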

How we should use these

Key entities that NDG knows about are

  • data granule documents (schema CSML)
  • moles documents (schema MOLES-B0)
  • stub-b documents (in MOLES-B1) WE NEED A SCHEMA KEV!!!!
  • discovery documents (could be in DIF, ISO19139, MDIP, or DC).

There are also internal identifiers in the CSML documents and I'm going to come back to those.

Creating Metadata Documents

(Bryan's view of how the BADC should do it)

In the beginning there is data. (Let's assume the Activity, ObsStation and DPT exist or are created independently; they're not really germane to this discussion.)

In our case (BADC), the data exists on disk. For netcdf data at least, we would expect to be able to run cdmlscan and produce cdml documents. These describe the data file contents, and we expect csml2 to accept a cdml document as a storage descriptor.

We can then create csml files by running the scanner on the cdml document.

... at this point we need to put an identifier in the csml file. It will be:

badc.nerc.ac.uk__csml__SomeRandomString

(Where do we get the string from? - Currently it's just invented and supplied at the time of scanning (Dominic))
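If we ever want to stop inventing the string by hand, something like the following sketch would do. It is only an assumption that a random UUID is an acceptable local_id; mint_csml_id is a made-up name and is not part of the scanner.

```python
import uuid
from typing import Optional


def mint_csml_id(repository: str = "badc.nerc.ac.uk",
                 local_id: Optional[str] = None) -> str:
    """Return repository__csml__local_id for a newly scanned granule."""
    if local_id is None:
        # Assumption: a random UUID stands in for the hand-invented string.
        local_id = uuid.uuid4().hex
    if "__" in local_id:
        raise ValueError("local_id must not contain the '__' separator")
    return "__".join((repository, "csml", local_id))


# A string supplied at scan time is used as given.
print(mint_csml_id(local_id="SomeRandomString"))
# prints badc.nerc.ac.uk__csml__SomeRandomString
```

Called with no local_id, it generates a fresh 32-character hex string each time.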

We leave the datafiles and cdml file in the right place in the directory.

We move the csml file to the badc exist repository (why? because then we can build simple WCS etc stuff based on a document retrieval interface that is agnostic about the location of the files). We put the csml document in the repository under a document name that is the *identifier* above!

BUT, that leaves the problem of linking the csml document to specific cdml documents living on a system, somewhere.

Let's park that for the moment.

Then we can run tools on our csml file, which produce some moles snippets, which we upload into the badc catalogue into a data entity description which HAS A NEW IDENTIFIER (multiple granules can exist in one data entity). Let's call this one:

badc.nerc.ac.uk__MOLES-B0__NEWRANDOMSTRING.

We may then want to create some NumSim documents. We load those into the BADC exist too. We can link to those from within the MOLES descriptions in the badc catalogue. (These too have different identifiers.)

Sue's moles-RDB to existDB script takes that and populates the BADC existDB. (Sue, when will this actually start to happen?).

Sue's code runs a MOLES-to-DIF conversion, which puts DIF files into our OAI repository, where their identifiers are

badc.nerc.ac.uk__DIF__NEWRANDOMSTRING

(i.e. the same local_id as in the moles repository).

My browse code comes along and can pull any of these documents directly from the local existDB by their identifiers alone! It can also transform the MOLES documents into all the formats Kev supports. Kev: the DC identifier in this world OUGHT TO BE

badc.nerc.ac.uk__DC__NEWRANDOMSTRING

not NEWRANDOMSTRING as it is now. Please fix.
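Since the MOLES, DIF and DC identifiers share a local_id and differ only in the schema token, deriving one from another is a one-liner. A minimal sketch, assuming the double-underscore form throughout (with_schema is a made-up name):

```python
def with_schema(identifier: str, schema: str) -> str:
    """Return the same identifier with its middle (schema) token replaced."""
    # Unpacking raises ValueError unless there are exactly three tokens.
    repository, _old_schema, local_id = identifier.split("__")
    return "__".join((repository, schema, local_id))


moles_id = "badc.nerc.ac.uk__MOLES-B0__NEWRANDOMSTRING"
print(with_schema(moles_id, "DIF"))  # prints badc.nerc.ac.uk__DIF__NEWRANDOMSTRING
print(with_schema(moles_id, "DC"))   # prints badc.nerc.ac.uk__DC__NEWRANDOMSTRING
```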

When we OAI harvest any documents, we are now making sure that we ingest them into the NDG existDB with a filename which will be

repository__schema__identifier

(where in the case of NDG discovery documents we expect to read this direct from the DIF entry_ID, and in the case of documents from elsewhere, we will create it, because we know where they came from). We are also converting everything into MOLES, so we can extract all discovery documents back in a variety of formats.
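The naming rule for harvested documents could look roughly like the sketch below. It is not the harvester's actual code. It assumes that an NDG entry_ID is already written in the double-underscore form, and that for a document from elsewhere the foreign entry_ID can be reused as the local part. The name harvest_filename and the repository "example.org" are hypothetical.

```python
from typing import Optional


def harvest_filename(entry_id: str,
                     source_repository: Optional[str] = None,
                     schema: str = "DIF") -> str:
    """Choose the existDB filename for a harvested discovery document."""
    parts = entry_id.split("__")
    if len(parts) == 3 and all(parts):
        # NDG discovery document: the DIF entry_ID is already the identifier.
        return entry_id
    if "__" in entry_id:
        raise ValueError(f"ambiguous entry_ID: {entry_id!r}")
    if source_repository is None:
        raise ValueError("need to know where a non-NDG document came from")
    # Document from elsewhere: build the name from the repository we harvested.
    # Assumption: the foreign entry_ID is reused as the local part.
    return "__".join((source_repository, schema, entry_id))


print(harvest_filename("badc.nerc.ac.uk__DIF__NEWRANDOMSTRING"))
# prints badc.nerc.ac.uk__DIF__NEWRANDOMSTRING
print(harvest_filename("ABC123", source_repository="example.org"))
# prints example.org__DIF__ABC123
```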

That way, my discovery code, via Matt's do present, can produce a restful, bookmarkable, shopping-cartable link to all discovery documents, which looks like

http://glue.badc.rl.ac.uk/retrieve?uri=badc.nerc.ac.uk__DC__NEWRANDOMSTRING

(if the user wants DC etc).
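Building such a link from an identifier is trivial. A minimal sketch, using the retrieve endpoint from the example above (retrieve_url is a made-up name and not part of the discovery code):

```python
from urllib.parse import urlencode

# Endpoint taken from the example link in the text.
RETRIEVE_ENDPOINT = "http://glue.badc.rl.ac.uk/retrieve"


def retrieve_url(identifier: str, endpoint: str = RETRIEVE_ENDPOINT) -> str:
    """Build a bookmarkable retrieve link for one document identifier."""
    # urlencode leaves letters, digits, '.', '-' and '_' untouched, so the
    # double-underscore form passes through unchanged.
    return f"{endpoint}?{urlencode({'uri': identifier})}"


print(retrieve_url("badc.nerc.ac.uk__DC__NEWRANDOMSTRING"))
# prints http://glue.badc.rl.ac.uk/retrieve?uri=badc.nerc.ac.uk__DC__NEWRANDOMSTRING
```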

It also means in the case of NDG documents, I can construct on the fly a browse request by parsing the id alone, and going from a map of ndg repository identifiers to ndg browse services (yes, I have one of those), so that we have the browse link up as

http://localbrowsehostforrepository/retrieve?uri=badc.nerc.ac.uk__MOLES-B0__NEWRANDOMSTRING/format=MOLES-B1

(Actually there is redundancy here that is necessary only because of where we are today; I should be able to just assume that I can do this in effect by substituting the desired schema in the middle.)
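The map lookup can be sketched as follows. This is an illustration only, not the browse code itself: BROWSE_HOSTS and browse_url are made-up names, the single map entry reuses the placeholder host from the example, and the "/format=" suffix is copied as written in that example.

```python
from typing import Mapping

# Hypothetical map of NDG repository identifiers to browse services.
BROWSE_HOSTS = {
    "badc.nerc.ac.uk": "http://localbrowsehostforrepository",
}


def browse_url(identifier: str, fmt: str,
               hosts: Mapping[str, str] = BROWSE_HOSTS) -> str:
    """Construct a browse request by parsing the identifier alone."""
    repository, _schema, _local_id = identifier.split("__")
    host = hosts[repository]  # KeyError if we know no browse service for it
    # The '/format=' suffix follows the example URL in the text.
    return f"{host}/retrieve?uri={identifier}/format={fmt}"


print(browse_url("badc.nerc.ac.uk__MOLES-B0__NEWRANDOMSTRING", "MOLES-B1"))
# prints http://localbrowsehostforrepository/retrieve?uri=badc.nerc.ac.uk__MOLES-B0__NEWRANDOMSTRING/format=MOLES-B1
```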

I can also have a map of csml servers (currently dx servers), which I can point to for data services etc.

However, ideally I don't want a map, what I want you to do is put in your DIF related URL something that looks like this:

<content_Type> NDG_B_Service </content_Type>
<url>http://localbrowsehostforrepository/browse?uri=badc.nerc.ac.uk__MOLES-B0__NEWRANDOMSTRING/format=MOLES-B1</url>

or the equivalent in what replaces DIF.

Note that this is different from NDG1 and NDG-Alpha, but it means that our services are consumable by others, not just ourselves.

Now coming back to the CSML internal identifier issues. That's a matter for the storage descriptor, so I'm going to hand that back to Andrew (for now).