Ticket #311 (closed defect: worksforme)

Opened 13 years ago

Last modified 12 years ago

[m] browse/discovery - invalid characters in some dif returns

Reported by: lawrence Owned by: lawrence
Priority: critical Milestone: BETA
Component: discovery Version:
Keywords: Cc:

Description

If we do a search on humidity, we find some records don't parse in elementtree. It looks like it might be funny characters like tau. Could it be we have the wrong character set definitions?

Change History

comment:1 Changed 13 years ago by lawrence

Note from xml-sig list:

As per XML spec, if no encoding is declared, UTF-8 is assumed (AFAIK expat follows this). Check if your data is valid UTF-8. Expat accepts only UTF-8, UTF-16, iso-8859-1 and ascii data, but without encoding declaration treats everything as UTF-8.

Not sure what the problem is; I would have thought UTF-8 would be OK, but maybe not ...
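
A minimal sketch of the behaviour described in the xml-sig note, assuming Python 3 and the standard-library `xml.etree.ElementTree` (the sample element `<p>` and its content are made up for illustration). Bytes that are really ISO-8859-1 but carry no encoding declaration are read as UTF-8 and rejected; the same bytes parse once the declaration is present:

```python
import xml.etree.ElementTree as ET

# Illustrative content only: one non-ASCII character, encoded as ISO-8859-1.
latin1_body = "<p>Ø</p>".encode("iso-8859-1")

# No declaration: the parser assumes UTF-8, and the byte 0xD8 is not valid there.
try:
    ET.fromstring(latin1_body)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Same bytes, but the declaration now tells the parser how to read them.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_body
declared_text = ET.fromstring(declared).text
```

If a record fails like the undeclared case here, a missing or wrong encoding declaration is a plausible cause, but it does not rule out other well-formedness problems.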

comment:2 Changed 13 years ago by lawrence

More from xml-sig:

I think you will have to tell elementtree what encoding your XML is in. Otherwise how would it know? I am sure there is a better way, but I have seen people try to guess encodings like:

  # untested and from my bad memory :-)
  encodings = ['utf-8', 'utf-16', 'iso-8859-1']
  for encoding in encodings:
      try:
          unicode(s, encoding)
      except UnicodeError:
          pass
      else:
          break

The encodings list would be a list of common encodings that you may expect. Again there must be a better way to do this... I would suggest that you try to set a standard for encodings.

comment:3 Changed 13 years ago by lawrence

Another note from xml-sig, and this one looks good for some of the problems:

BNL: For the record, we find [3 <= tau] in that block ... Response: I expect this is not a unicode but an XML problem: "<=" should in fact be spelled "&lt;=" (as "<" needs to be quoted in XML).

This is almost certainly the problem, though I'm not sure how to fix this ... how can I parse the file for invalid <? It might be that we need to make sure the original content has these things escaped ... otherwise it's going to be a smart regex ...
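
One possible shape for the "smart regex" mentioned above, as a rough sketch in Python 3 rather than the hack actually in the code base. It escapes any `<` that cannot begin a tag, comment, declaration or processing instruction. The names `STRAY_LT` and `escape_stray_lt` and the sample `Summary` element are illustrative assumptions:

```python
import re
import xml.etree.ElementTree as ET

# A "<" NOT followed by a name-start character, "/", "!" or "?" cannot open
# any markup, so it is treated as literal text that should have been "&lt;".
STRAY_LT = re.compile(r"<(?![A-Za-z_:/!?])")

def escape_stray_lt(xml_text):
    """Replace unescaped '<' characters that cannot start markup with '&lt;'."""
    return STRAY_LT.sub("&lt;", xml_text)

# Illustrative record fragment containing the "3 <= tau" text from the ticket.
broken = "<Summary>For the record, we find 3 <= tau in that block</Summary>"
fixed = escape_stray_lt(broken)
summary = ET.fromstring(fixed).text
```

This is deliberately conservative: it will not catch a stray `<` that is directly followed by a letter (as in `a<b`), and it would also rewrite such characters inside CDATA sections, so escaping at the source remains the safer fix.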

comment:4 Changed 13 years ago by lawrence

  • Milestone changed from ALPHA to PreBeta

At this point, I have a hack in there that works ... but it would be better to parse all documents and make them well-formed XML (i.e. with a declared encoding) before loading into ET ...

comment:5 Changed 13 years ago by domlowe

  • Status changed from new to assigned
  • Owner changed from lawrence to domlowe

Similar problems noted with the csml Parser, e.g. it will not parse the Norwegian character Ø.

comment:6 Changed 13 years ago by lawrence

  • Milestone changed from PreBeta to PostAlpha_review

Ok, then you probably need the loadET method in ETxmlView.py

It's got some historical dribbly crashes in there, but it works for the minute (probably ought to include sub_orphan in the same module, but plan to rationalise all this later).

comment:7 Changed 13 years ago by selatham

If we plan to just take whatever 'NDGLite' DataProviders give us for DIFs and put them straight into Discovery portal, with no editorial review process, then we will have to deal (elegantly) with content problems such as this.

comment:8 Changed 13 years ago by domlowe

  • Milestone changed from PostAlpha_review to PreBeta

comment:9 Changed 13 years ago by domlowe

I've got an alternative fix now which involves looking at the first few bytes of a file to try and determine its encoding. The file and the name of the encoding (e.g. UTF-8, UTF-16) are then passed to element tree, which handles it correctly when given the right encoding.
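
A sketch of the general idea of inspecting the first few bytes, assuming Python 3. This is not the fix referred to above: it only looks for a byte order mark, and it decodes the data before handing text to ElementTree rather than passing an encoding name to the parser. `BOMS`, `sniff_encoding` and `parse_with_sniffed_encoding` are hypothetical names:

```python
import codecs
import xml.etree.ElementTree as ET

# Byte order marks to look for, paired with the codec used to decode the rest.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_encoding(data, default="utf-8"):
    """Return (encoding, bom_length) guessed from the leading bytes."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0  # no BOM: fall back to the XML default

def parse_with_sniffed_encoding(data):
    """Decode the bytes with the sniffed encoding, then parse the text."""
    encoding, skip = sniff_encoding(data)
    return ET.fromstring(data[skip:].decode(encoding))

# Illustrative input: a UTF-16 (little-endian) document with a BOM.
sample = codecs.BOM_UTF16_LE + "<p>Ø</p>".encode("utf-16-le")
root = parse_with_sniffed_encoding(sample)
```

Files without a BOM fall through to the UTF-8 default, so single-byte encodings such as ISO-8859-1 would still need either a declaration or a separate guess.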

comment:10 Changed 12 years ago by selatham

Is this closable? Who or what is it affecting?

comment:11 Changed 12 years ago by domlowe

If it's not closeable I'd certainly like to reassign it to someone who knows about DIFs... :-) (Sue/Matt??)

Bryan and I seem to have both resolved our particular unicode problems, but the point you made about data provider content may still be an issue for DIF harvesting?

Perhaps an element-tree 'unicode-filter' should be run over the DIFs to ensure they have the correct XML encoding set at the top of the document.
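
A rough sketch of what such a filter might look like, assuming Python 3; nothing like it exists in the code base as far as this ticket shows. It guesses the encoding from a short candidate list, drops any existing XML declaration, and re-emits the document as UTF-8 with an explicit declaration. All names and the sample DIF fragment are hypothetical:

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the very start of the document, if present.
XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>")

# Tried in order. iso-8859-1 accepts any byte sequence, so it must come last.
# "utf-8-sig" is plain UTF-8 that also strips a leading BOM when there is one.
CANDIDATES = ("utf-8-sig", "iso-8859-1")

def decode_first_match(data):
    """Decode bytes with the first candidate encoding that does not fail."""
    for encoding in CANDIDATES:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            pass
    raise ValueError("no candidate encoding matched")

def normalise_to_utf8(data):
    """Return the document as UTF-8 bytes with an explicit declaration."""
    body = XML_DECL.sub("", decode_first_match(data), count=1).lstrip()
    return b'<?xml version="1.0" encoding="UTF-8"?>\n' + body.encode("utf-8")

# Illustrative DIF fragment in ISO-8859-1 with no declaration.
raw = "<DIF><Summary>Ø</Summary></DIF>".encode("iso-8859-1")
clean = normalise_to_utf8(raw)
summary = ET.fromstring(clean).find("Summary").text
```

The guess ignores whatever encoding the original declaration claimed, so a document in an encoding outside the candidate list (Windows-1252, say) would be silently mis-decoded; the list would need to reflect what data providers actually send.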

comment:12 Changed 12 years ago by selatham

  • Status changed from assigned to new
  • Owner changed from domlowe to lawrence

If someone can point me to some code that will do such 'unicode-filter' and encoding correction then I'll use it in the ingest. If I have to write something then it's lower priority than a lot of other stuff at the moment.

comment:13 Changed 12 years ago by lawrence

  • Status changed from new to closed
  • Resolution set to worksforme

I think we've got no evidence that this is a problem at the moment.
