Ticket #739 (reopened defect)

Opened 12 years ago

Last modified 11 years ago

[M] (DI-3-1) Author search bugs

Reported by: selatham Owned by: mpritcha
Priority: discussion Milestone: NDG3
Component: discovery Version:
Keywords: Cc:

Description

Author search has the same problem as parameter search had:-

Attachments

grid.bodc.nerc.ac.uk__DIF__PCDA47973RS2302.xml Download (10.2 KB) - added by selatham 12 years ago.

Change History

comment:1 Changed 12 years ago by selatham

  • Type changed from task to defect

comment:2 Changed 12 years ago by lawrence

and note that author can turn up in two places (at least): in citation, in people (where it comes in at least two guises). An interesting question is what Kevin does with it in and out of miniMoles.

comment:3 Changed 12 years ago by selatham

  • Milestone changed from PROD to ReFactored_Discovery_WebServices

comment:4 Changed 12 years ago by mpritcha

  • Status changed from new to assigned

OK, I'm assuming dgDataCurator will be

/dgMetadata/dgMetadataRecord/dgDataEntity/dgDataRoles/dgDataCurator

...Is that correct? Can people also supply similar (XPath) locations for author-like fields that should be searched (preferably with some example doc IDs to try out)?

Presumably I would need to look up the actual dgDataCurator name in the relevant org moles record for the example above?

comment:5 follow-up: ↓ 7 Changed 12 years ago by lawrence

Just to reiterate an author is not the curator ... :-)

comment:6 Changed 12 years ago by ko23

I was assuming that the author was the data creator(s) for the various transforms, and that the creator would be publisher (where relevant)

comment:7 in reply to: ↑ 5 Changed 12 years ago by mpritcha

Replying to lawrence:

Just to reiterate an author is not the curator ... :-)

Oops I meant dgDataCreator, not dgDataCurator. So I am suggesting that one of the search targets be:

/dgMetadata/dgMetadataRecord/dgDataEntity/dgDataRoles/dgDataCreator

(but resolved via the relevant org moles records).

comment:8 Changed 12 years ago by mpritcha

  • Status changed from assigned to closed
  • Resolution set to fixed

Search target now looks like this:

(./moles:dgMetadata/moles:dgMetadataRecord/moles:dgDataEntity/moles:dgDataRoles/moles:dgDataCreator &= 'term')
or
(./moles:dgMetadata/moles:dgMetadataRecord/moles:dgMetadataDescription/moles:abstract/moles:abstractOnlineReference/moles:dgCitation/moles:authors &= 'term')

Please comment on whether this is correct. Need to update records to actually include these fields to make this work across the board, however.
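
As a rough illustration of the two search targets above, here is a minimal Python sketch. It is not the XQuery that was deployed. The namespace URI, the sample document and the `author_matches` helper are all assumptions made for the example, and the `&=` operator is approximated by a case-insensitive whole-word comparison.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the real MOLES namespace is not given in this ticket.
NS = {"moles": "urn:example:moles"}

# The two search targets from comment 8, relative to the dgMetadata root element.
AUTHOR_PATHS = [
    "moles:dgMetadataRecord/moles:dgDataEntity/moles:dgDataRoles/moles:dgDataCreator",
    "moles:dgMetadataRecord/moles:dgMetadataDescription/moles:abstract"
    "/moles:abstractOnlineReference/moles:dgCitation/moles:authors",
]


def author_matches(root, term):
    """Return True if any author-like element contains `term` as a whole word."""
    wanted = term.lower()
    for path in AUTHOR_PATHS:
        for element in root.findall(path, NS):
            # Join all descendant text, then compare word by word.
            words = " ".join(element.itertext()).lower().split()
            if wanted in words:
                return True
    return False


# Made-up sample document, trimmed to the elements the two paths touch.
SAMPLE = """
<dgMetadata xmlns="urn:example:moles">
  <dgMetadataRecord>
    <dgDataEntity>
      <dgDataRoles>
        <dgDataCreator>
          <roleName>Data Creator</roleName>
        </dgDataCreator>
      </dgDataRoles>
    </dgDataEntity>
    <dgMetadataDescription>
      <abstract>
        <abstractOnlineReference>
          <dgCitation>
            <authors>Placeholder Author</authors>
          </dgCitation>
        </abstractOnlineReference>
      </abstract>
    </dgMetadataDescription>
  </dgMetadataRecord>
</dgMetadata>
"""

root = ET.fromstring(SAMPLE)
print(author_matches(root, "creator"))      # True: found under dgDataCreator
print(author_matches(root, "placeholder"))  # True: found under dgCitation/authors
print(author_matches(root, "curator"))      # False: neither path contains it
```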

comment:9 Changed 12 years ago by selatham

There are some dgDataCreator type of authors within discovery records. Mainly DASSH and some BODC ones. So clearly the DIF/MDIP-to-miniMoles transform at ingest is picking some up (from DIF field Origination_Center by the looks of it).

Citations will have to come later (we may want to change this in Moles anyway).

comment:10 follow-up: ↓ 11 Changed 12 years ago by selatham

  • Status changed from closed to reopened
  • Resolution fixed deleted

Just spotted a problem with this due to not resolving the org/person. A typical generated mini-moles entry for dataCreator looks like this:-

                <dgDataCreator>
                    <dgMetadataID>
                        <schemeIdentifier>NDG-B0</schemeIdentifier>
                        <repositoryIdentifier>grid.bodc.nerc.ac.uk</repositoryIdentifier>
                        <localIdentifier>generated_creator-PCDA47973RS2302</localIdentifier>
                    </dgMetadataID>
                    <roleName>Data Creator</roleName>
                    <abbreviation>Creator</abbreviation>
                    <dgRoleHolder>
                        <dgOrganisationID>
                            <schemeIdentifier>NDG-B0</schemeIdentifier>
                            <repositoryIdentifier>grid.bodc.nerc.ac.uk</repositoryIdentifier>
                            <localIdentifier>generated_orgcit-Scottish%20Marine%20Biological%20Association-PCDA47973RS2302</localIdentifier>
                        </dgOrganisationID>
                        <startDate>2007-04-02+01:00</startDate>
                    </dgRoleHolder>
                </dgDataCreator>

so I only get this author if I search on generated_orgcit-Scottish%20Marine%20Biological%20Association-PCDA47973RS2302 or *scottish*.

I don't get it if I search on Scottish Marine Biological Association, or Scottish.

See the attached file for the full mini-moles for this. Note that the generated org is tacked on the end to make it available.
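
One possible way to make the organisation name searchable without a resolved org record is to decode it from the generated identifier. The sketch below is only that: the helper name `org_name_from_generated_id` is hypothetical, and the pattern `generated_orgcit-<percent-encoded name>-<record id>` is inferred from the example in this ticket rather than from any documented rule.

```python
from urllib.parse import unquote


def org_name_from_generated_id(local_identifier, record_id):
    """Recover the organisation name embedded in a generated_orgcit identifier.

    Assumes the pattern 'generated_orgcit-<percent-encoded name>-<record id>',
    inferred from the example in this ticket. Returns None if it does not fit.
    """
    prefix = "generated_orgcit-"
    suffix = "-" + record_id
    if not (local_identifier.startswith(prefix) and local_identifier.endswith(suffix)):
        return None
    # Strip the fixed prefix and the record-specific suffix, then decode %20 etc.
    encoded = local_identifier[len(prefix):-len(suffix)]
    return unquote(encoded)


name = org_name_from_generated_id(
    "generated_orgcit-Scottish%20Marine%20Biological%20Association-PCDA47973RS2302",
    "PCDA47973RS2302",
)
print(name)  # Scottish Marine Biological Association
```

Indexing the decoded name alongside the identifier would let a plain search on "Scottish" match, at the cost of relying on a naming convention that may not hold for every ingested record.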

comment:11 in reply to: ↑ 10 Changed 12 years ago by mpritcha

Replying to selatham:

Need an org moles record for Scottish Marine Biological Association, then, so I can test it!

Just spotted a problem with this due to not resolving the org/person.

so I only get this author if I search on generated_orgcit-Scottish%20Marine%20Biological%20Association-PCDA47973RS2302 or *scottish*.

I don't get it if I search on Scottish Marine Biological Association, or Scottish.

comment:12 Changed 12 years ago by mpritcha

Can I just check my understanding is correct? If a dgMetadataRecord has a dgDataCreator element, then I should be able to match the dgDataCreator/dgRoleHolder/dgOrganisationID (schemeIdentifier, repositoryIdentifier, localIdentifier) to the (schemeIdentifier, repositoryIdentifier, localIdentifier) dgMetadataID of a dgOrganisation record.

If so then we have a problem, because the examples of dgMetadataRecords that I can find have dgDataCreator/dgRoleHolder/dgOrganisationIDs that seem to be specific to that dgMetadataRecord.

Example:

This has the following for the dataCreator:

	<dgDataCreator>
		<dgMetadataID>
			<schemeIdentifier>NDG-B0</schemeIdentifier>
			<repositoryIdentifier>ndg.noc.soton.ac.uk</repositoryIdentifier>
			<localIdentifier>generated_creator-NOCSDAT236</localIdentifier>
		</dgMetadataID>
		<roleName>Data Creator</roleName>
		<abbreviation>Creator</abbreviation>
		<dgRoleHolder>
			<dgOrganisationID>
				<schemeIdentifier>NDG-B0</schemeIdentifier>
				<repositoryIdentifier>ndg.noc.soton.ac.uk</repositoryIdentifier>
				<localIdentifier>generated_orgcit-BODC%20(NOCS)-NOCSDAT236</localIdentifier>
			</dgOrganisationID>
			<startDate>2007-04-02+01:00</startDate>
		</dgRoleHolder>
	</dgDataCreator>

suggesting that the localIdentifier used to identify the dataCreator is specific to this dataEntity (NOCSDAT236). Surely you want one single dataCreator Organisation to which all these records should refer? In that case the organisation record should be the one in nocs_org_moles.xml, which is:

    <dgOrganisation>
        <dgMetadataID>
            <schemeIdentifier>NDG-B0</schemeIdentifier>
            <repositoryIdentifier>ndg.noc.soton.ac.uk</repositoryIdentifier>
            <localIdentifier>nocs</localIdentifier>
        </dgMetadataID>
        <name>BODC-NOCS, National Oceanography Centre, Southampton</name>
        <abbreviation>BODC-NOCS</abbreviation>
        ....

...and that way, a search on "*NOCS*" might work (at least, that's what I think I need to resolve things to).

In the case of DASSH records, they have a dgDataCreator/dgMetadataID (which defines the role), but no dgRoleHolder element, ...which is where the set of 3 IDs I need would be! (note problem of specific localID again)

<dgDataCreator>
	<dgMetadataID>
		<schemeIdentifier>NDG-B0</schemeIdentifier>
		<repositoryIdentifier>dassh.ac.uk</repositoryIdentifier>
		<localIdentifier>generated_creator-MRMLN00400000049</localIdentifier>
	</dgMetadataID>
	<roleName>Data Creator</roleName>
	<abbreviation>Creator</abbreviation>
</dgDataCreator>
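
The matching described in this comment can be sketched in Python as follows. This is an illustration under assumptions, not the discovery service code: the function names are hypothetical, the XML is un-namespaced as in the snippets above, and the `RECORD_TEMPLATE` document is a cut-down stand-in for a real dgMetadataRecord.

```python
import xml.etree.ElementTree as ET

ID_TAGS = ("schemeIdentifier", "repositoryIdentifier", "localIdentifier")


def id_triple(id_element):
    """Return (schemeIdentifier, repositoryIdentifier, localIdentifier)."""
    return tuple(id_element.findtext(tag) for tag in ID_TAGS)


def index_organisations(org_root):
    """Map each dgOrganisation's dgMetadataID triple to its name."""
    return {
        id_triple(org.find("dgMetadataID")): org.findtext("name")
        for org in org_root.iter("dgOrganisation")
    }


def creator_names(record_root, org_index):
    """Resolve every dgDataCreator role holder; None marks an unresolved one."""
    names = []
    for creator in record_root.iter("dgDataCreator"):
        org_id = creator.find("dgRoleHolder/dgOrganisationID")
        # DASSH-style creators have no dgRoleHolder, so there is nothing to resolve.
        names.append(org_index.get(id_triple(org_id)) if org_id is not None else None)
    return names


ORGS = """
<dgOrganisation>
  <dgMetadataID>
    <schemeIdentifier>NDG-B0</schemeIdentifier>
    <repositoryIdentifier>ndg.noc.soton.ac.uk</repositoryIdentifier>
    <localIdentifier>nocs</localIdentifier>
  </dgMetadataID>
  <name>BODC-NOCS, National Oceanography Centre, Southampton</name>
</dgOrganisation>
"""

# Cut-down record; only the localIdentifier of the role holder varies.
RECORD_TEMPLATE = """
<dgDataCreator>
  <dgRoleHolder>
    <dgOrganisationID>
      <schemeIdentifier>NDG-B0</schemeIdentifier>
      <repositoryIdentifier>ndg.noc.soton.ac.uk</repositoryIdentifier>
      <localIdentifier>{local_id}</localIdentifier>
    </dgOrganisationID>
  </dgRoleHolder>
</dgDataCreator>
"""

org_index = index_organisations(ET.fromstring(ORGS))
generated = ET.fromstring(
    RECORD_TEMPLATE.format(local_id="generated_orgcit-BODC%20(NOCS)-NOCSDAT236")
)
shared = ET.fromstring(RECORD_TEMPLATE.format(local_id="nocs"))

print(creator_names(generated, org_index))  # [None]: record-specific ID does not resolve
print(creator_names(shared, org_index))     # resolves to the shared organisation name
```

With the generated, record-specific localIdentifier the lookup finds nothing, which is the problem described above. Pointing the role holder at the shared `nocs` identifier is what would let a search on the organisation name succeed.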

comment:13 follow-up: ↓ 14 Changed 12 years ago by selatham

  • Priority changed from blocker to discussion

We have some pre-prepared Organisation records for specific ingested DataCentres, but this is only useful when resolving individual records for displaying their DataCurator. It's not going to work in a search scenario.

What we really need is for the mini-moles to be like stub-b i.e. with first level references resolved into one dgMetadata document. ...but then if we were going to radically change how discovery worked it might be better to ditch XML altogether and use postgres as suggested in discussions about performance.

comment:14 in reply to: ↑ 13 Changed 12 years ago by mpritcha

Yep, I think we'd be better off using postgres as the discovery search target (that's not to say we can't keep the original format XML docs harvested from places & serve these up as requested). But we could use JDBC-enabled XQueries (...we already do this to perform the spatial searches) to populate searchable postgres table(s) containing everything we need for discovery, I think. I reckon we'd see a huge performance improvement & there is even an extension that will do full text now, I believe. We've been talking about this for a while now: perhaps now is the time to investigate more seriously.
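
To make the idea of a flat, searchable table concrete, here is a small sketch. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; it is not postgres, it does not use JDBC-enabled XQueries, and the table name `discovery_author`, its columns and the sample rows are all illustrative assumptions.

```python
import sqlite3

# In-memory SQLite database standing in for the proposed postgres search target.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE discovery_author ("
    "record_id TEXT NOT NULL, author_name TEXT NOT NULL)"
)

# Illustrative rows: one resolved author name per discovery record.
rows = [
    ("PCDA47973RS2302", "Scottish Marine Biological Association"),
    ("NOCSDAT236", "BODC-NOCS, National Oceanography Centre, Southampton"),
]
conn.executemany("INSERT INTO discovery_author VALUES (?, ?)", rows)


def search_authors(connection, term):
    """Return record IDs whose author name contains `term`, ignoring case."""
    # Note: '%' or '_' inside `term` would act as LIKE wildcards in this sketch.
    pattern = "%" + term.lower() + "%"
    cursor = connection.execute(
        "SELECT DISTINCT record_id FROM discovery_author "
        "WHERE lower(author_name) LIKE ? ORDER BY record_id",
        (pattern,),
    )
    return [row[0] for row in cursor]


print(search_authors(conn, "Scottish"))  # ['PCDA47973RS2302']
print(search_authors(conn, "nocs"))      # ['NOCSDAT236']
```

Because the names are resolved once at ingest time, the search itself never has to follow references between XML documents.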

comment:15 Changed 11 years ago by sdonegan

  • Summary changed from [M] Author search bugs to [M] (DI-3-1) Author search bugs
  • Milestone changed from Reporting to NDG3

Fixed with the postgres upgrade? Not sure, but the problem should go away with the move to ISO as the workhorse format, hence classified as a DI-3-1 problem.
