Version 11 (modified by mpritcha, 11 years ago)

Draft for comment by attendees

NDG "Room 101" Meeting

Meeting to remind ourselves of what we have, why, & what we have learned along the way.
Held in CR03 on Friday 22 August 2008.

Attendees

  • Sam Pepler
  • Dom Lowe
  • Phil Kershaw
  • Steve Donegan
  • Kevin Marsh
  • Bryan Lawrence
  • Stephen Pascoe
  • Matt Pritchard
  • Calum Byrom (by phone)

Intro from Matt

  • Good opportunity to see where we are & look at improving the way we do things.
  • Room 101 ...ok, not quite. We're not going to vote for our least favourite components & see them disappear. In reality, design & development decisions have already been made, often for good reason (but it's worth reminding ourselves what those were & seeing if we're happy with the decision-making process ...not just on this project).
  • NDG took some excursions down some blind alleys. Did we learn things along the way? Are there lessons to be learned from how we got down these blind alleys (without dwelling on what was down there?)

Agenda

Review of Development Approach

How did we go about designing and developing the system?

  • Use of RM-ODP architecture
  • What approach did we use to the software life-cycle and did it work?
  • Development across multiple institutions

Review of current components

  • COWS
  • CSML
  • MOLES (v2+incomplete infrastructure, v3 on way)
  • Discovery (v2)
  • Security
  • Vocabserver

Suggested points for discussion for each component:

  • Overview (brief!)
    • Reminder of what it does, where it fits in
  • How did we get there?
    • Was it designed or did it grow organically?
    • (Would we do the same again, starting from scratch?)
  • SWOT
    • Strengths
      • Particular successes / things learned
      • Fitness for purpose
        • Does it do what it was specified to?
        • Does the spec meet current needs?
        • Have people been able to deploy / integrate / use it?
    • Weaknesses
      • What obstacles have been encountered?
      • Have these been overcome?
      • By satisfactory means?
    • Opportunities
      • What have we learned while building this?
      • What is the roadmap for this component?
    • Technology (...or Threats?)
      • Have we used appropriate technologies?
        • ElementTree for XML processing
        • eXist for XML databases
        • SOAP toolkits
        • WS-Security
        • Pylons
        • Postgres
      • Has development been guided too much / enough by available technology
      • Have we reinvented wheels?

Going forward

What new requirements lie ahead

  • NDG MSI
  • ??

Sum up


Notes from Meeting

Review of Development Approach

(BL) NDG wasn't [necessarily intended to be] a software development project at the start. As such, it didn't have very clear requirements at the start. We had to "do" the buzzwords in order to "do" e-science as far as funding was concerned.

How to roll out deployments : we are still asking questions now as to how we integrate security into BADC. Could it be done in smaller chunks or do we have to wait until a whole chunk is ready before attempting deployment?

(Sam) Initially there was a "do everything" approach [i.e. see which things were successful to aid in narrowing down candidate technologies for solving certain problems?]

(Stephen) Learned about modular software development, but (Bryan) at the time, there were no partners actually signed up to implementing [specific components]. It seemed that everyone was waiting for a complete system.

To make things easier to implement [do we mean deploy?], they need to be delivered in smaller chunks, and this needs to happen throughout the project.

Why didn't this happen?

  • Lost people at certain institutions (esp. key ones with link to Data Centres)
  • Had "services" but no [p??? ...sorry can't read my own notes] to start with

Was there a timetable for implementing these things?

  • Comes down to tightly-defined requirements, or lack of.

Example : CSML

  • (Dom) Suffers from being at the end of the data delivery chain. [All other bits in the chain need to be deployed in order for this to be tried out properly / generate useful feedback]
  • (Matt) Should timetable for deployments be tailored to position in data chain?

(Calum) NDG development seemed to have a very adaptive approach, with a basic idea of what it wanted to do. This maps well onto the agile software development approach (cf. predictive where everything is very well planned out from the start).

  • Good because not too much time is spent down dead ends
  • But relies on very good communication between all involved

Maybe there was some attempt to be predictive in some parts of the project (bits of project were agile, bits predictive). But there was not enough attention to the relative pace of the different development streams.

RM-ODP model

Reference Model of Open Distributed Processing

Start off with what needs to be achieved (Enterprise Viewpoint), and develop the other viewpoints to arrive at a specification of the whole system (the others are Information, Computational, Engineering and Technology; see http://www.rm-odp.net/)

(Calum) Comment : Lots of code in the NDG stack seems to be non-OO. (Phil) RM-ODP gave structure at the start, and hence a structured approach. But it felt like we fell off the end of something. This had benefits (adaptive style) but at the expense of some loss of structure.

Plus some things were never deployed.

[RM-ODP doesn't provide any help beyond the initial design phase]. Is there another [complementary?] model that is more applicable further down the development path? [Flag this, and RM-ODP in general, as something to discuss in more detail at some point.]

Development across institutions

EDP / NERC Portals experience : suggests benefits of ensuring that at least 1 person is actually using a particular component, in order to provide feedback on it.

MOLES : There seemed to be someone in each institution trying to use it. But many people were trying to understand DIFs, let alone MOLES (some "fear" of it), even worse with CSML. Perhaps should have implemented (& made deployable) early on to get people on board [...by showing what benefits were obtained rather than what overhead it entailed?]

Logging

  • Need more logging built in to code that we write
  • This was one module that we always planned to build but never did
  • If we use a good logging framework (e.g. log4j, or the Python equivalent), we can send messages down the logging pipe (labelled appropriately) without having to worry about where they end up. External configuration then decides which level of messages goes where. Much better solution.
  • For mature services, need to have monitoring systems in place that are aware of error & warning messages that get generated, perhaps linked to some alert handler.
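
The logging approach described above could look roughly like the following in Python's standard `logging` module. This is only a sketch: the logger name `ndg.discovery`, the `search` function and the configuration values are hypothetical, and in practice the configuration would live in an external file rather than in the code.

```python
import logging
import logging.config

# Hypothetical configuration. Kept inline so the sketch is self-contained;
# the point is that routing is decided here, not in the calling code.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(name)s %(levelname)s %(message)s"},
    },
    "handlers": {
        # Only WARNING and above reach the console (stderr by default).
        "console": {
            "class": "logging.StreamHandler",
            "level": "WARNING",
            "formatter": "plain",
        },
    },
    "loggers": {
        # Messages are labelled by logger name ("ndg.discovery" is made up).
        "ndg.discovery": {
            "level": "DEBUG",
            "handlers": ["console"],
            "propagate": False,
        },
    },
}


def configure_logging(config=LOGGING_CONFIG):
    """Apply the routing rules; called once at service start-up."""
    logging.config.dictConfig(config)


def search(term):
    """Toy service function: it logs, but knows nothing about destinations."""
    log = logging.getLogger("ndg.discovery")
    log.debug("search called with term=%r", term)
    if not term:
        log.warning("empty search term")
        return []
    return [term]
```

With this configuration the debug message is dropped by the console handler and the warning is written to stderr. Sending warnings to an alert handler instead would be a configuration change only.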

Testing

  • Decent test environment really important for good development
    • Feasible to set up traffic-light system to alert when things not OK.
    • nose? Similar to how this is done in JUnit. Simple server available
      • Module-level tests could be run overnight
  • Setting this up is not really possible until we have unit tests built into code
  • Responsibility of coders needs to extend beyond simply writing code.
  • Unit tests tell you (6 months down the line):
    • Whether it's working
    • What you were trying to achieve
  • Writing tests forces you to write code in a way that's testable
    • e.g. where 1 component does one thing only, ...and can be demonstrably good at it
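
As an illustration of the last two points, here is a minimal sketch of a component that does one thing only, with unit tests that also record what it was meant to achieve. The function `parse_bbox` and its rules are hypothetical and are not taken from the NDG code base.

```python
import unittest


def parse_bbox(text):
    """Parse 'west,south,east,north' into a tuple of four floats."""
    parts = [float(p) for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError("expected 4 comma-separated values, got %d" % len(parts))
    west, south, east, north = parts
    if south > north:
        raise ValueError("south must not exceed north")
    return (west, south, east, north)


class ParseBboxTest(unittest.TestCase):
    """Each test name states one thing the function is supposed to do."""

    def test_valid_bbox(self):
        self.assertEqual(parse_bbox("-10,50,2,60"), (-10.0, 50.0, 2.0, 60.0))

    def test_wrong_number_of_values(self):
        with self.assertRaises(ValueError):
            parse_bbox("1,2,3")

    def test_south_above_north(self):
        with self.assertRaises(ValueError):
            parse_bbox("0,60,1,50")
```

Tests written this way can be collected by a runner such as `python -m unittest` (or nose) and run overnight, which is the kind of input a traffic-light display would need.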

Review of components

COWS

Overview

Low-level toolkit aimed at building data services including visualisation.

With fairly small amount of "glue" code, can produce services.

COWS was an example of a change of tack. Realised that it was not going to be possible to integrate DataExtractor (Dx) code (for various reasons: Ag's availability, difference of approach etc).

Strengths

  • Standards compliant (built on OGC etc)
  • Can quickly implement any dimension we like
  • Presenting spatial data via JavaScript map interfaces is now very prevalent on the web [e.g. Google Maps]. This is a "big win", in that we now have lots of expertise in this.
  • ...whereas Dx was a component all to itself (& the code of 1 person who didn't have enough time)

Weaknesses

  • Never quite got the visualisation tools actually joined up to the data.
  • Not completely integrated with CSML

Lessons

  • Code development
    • Got to stop situation where only 1 person writes code [Another item for further discussion]
    • Very useful to review code at the end of a project.
    • People have different levels of skill : code review is good way of demonstrating good code to team members
    • Need to take step back occasionally and ask questions
    • Unit testing : very useful but not often done
      • Requires certain mind set at start and end of project
      • Time constraints can be an obstacle to proper testing. Unit tests are fairly easy to do; system tests more tricky and require more discipline.
      • Misdirected thinking to say that unit tests "take up" time (often happens early on when developers want to "get in the thick of things" straight away & see/demonstrate some result).
    • Almost everything is not quite finished (Bryan's whiteboard sketch)
      • In order to move forward, we must actually finish & implement across all our grid series data in the Data Centres.
    • There always seemed to be 1 thing in the jigsaw that was in such a state of flux as to prevent the whole from working at any time.
    • We have to think about deploying earlier, accepting that some bugs need to be left, to be fixed in a later version.
    • Need more info (from Ag?) about progress with GeoServer at UKMO : should pursue this.

CSML

Overview

GML application schema to describe content of files (concerned with data structure). XML schema, followed modelling frameworks. High-level API for making subsetting requests.

Strengths / Successes

  • Beyond NDG, interest in CSML from ocean community groups e.g. ECOOP, MarineXML.

Weaknesses

  • Hasn't really been implemented (deployed?) yet. Working in prototype but no effort on part of data scientists. All the bits are there but...
  • (Stephen) Not convinced that Feature Type approach is the way forward. See GML : simple features. How is this ever going to join up with Feature Types? This is what INSPIRE is mandating...
  • (Bryan) Problem with CSML historically : nothing changes except through Andrew Woolf
  • CSML was designed to have different I/O layers & be lossy. It would never be as good as reading netCDF directly, but could read netCDF & PNG files at the same time.
  • HDF Example : It would be great to plot Grape + ERA40 data on same plot : that would be something really new i.e. an easy win for demonstrating CSML's capability.
  • (Bryan) We have the middleware layer, but now need to get the benefit of it. We should make a big effort to get ~50 datasets working with CSML (even if all 150 is impossible for now).
  • Granulite concept should help in trying to deploy some benefits (e.g. getting parameters into MOLES records), rather than all aspects (e.g. visualisation).

MOLES

Overview

...What do we mean by MOLES? Do we mean just the schema, or everything including Browse, Discovery? Difference in perception even amongst ourselves so perhaps not clear enough about this. XML Schema : heavily flawed schema representing some quite good ideas, with a relational database schema that is flawed² !!

Weaknesses

Main problem is difficulty / inability to change. Even if the (complex enough) XML schema is changed, the relational DB schema (and subsequent changes to the editor interface) take too long to implement. New schema should be much more lightweight. XML database to be used instead of RDB which should help make evolution easier now.

In designing MOLES, we should have concentrated on the aspects of the metadata model (and hence schema) that were unique to environmental science, rather than those that were already familiar (to the developer). There are lots of instances where existing models for generic things (like people, organisations etc) could have been imported into the model rather than re-engineered.

To be fair to Kevin (developer), he kept asking for feedback on the design of MOLES as it was progressing but got hardly any. We now know what we can do with a metadata model, so presumably would be better placed to provide constructive criticism.

Lessons

  • (NDG team) No point producing schema unless accompanied by lightweight tool to help users populate example instance documents.
  • (Data centres) Need to define tools early on that satisfy the requirements of the data centre in terms of populating metadata records.

Discovery

Overview

(Could do with presentation from Calum (& Steve?) about how new Discovery service works.) Provides search facility against metadata records harvested via OAI from data providers.

Strengths / Successes

  • Used successfully in NERC Portals project and by MDIP as well as by NERC Data Discovery Service.
  • Despite delays in making it "operational", now provides useful service to NERC
  • Metadata subgroup now formed, to talk about these issues, so Data Centres are now interested and working together on these.
  • DMAG now like it. Could do with more usage stats, but satisfies FOI requirement for way of finding out what information an organisation holds.

Weaknesses

  • Performance. This has been addressed in Calum's re-write, largely by transforming to all required export formats on ingest rather than on-the-fly during Present operation of web service.
  • Could do with including context of hit with result (i.e. why was this document a hit?), plus returning more than just the document id (e.g. abstract or "summary Present")
  • SOAP : yesterday's technology that we got stuck with.
  • Looking at supporting OpenSearch, OGC WCS, etc. Hopefully new revision by Calum should make adaptation to provide these interfaces easier.
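
The performance point above (transforming on ingest rather than during the Present operation) can be sketched as follows. This is an illustration of the general idea only, not a description of the re-written service: the class `DiscoveryStore`, the record fields and the two export formats are all hypothetical.

```python
import json

# Hypothetical export formats; the real service's formats and transforms
# are not described in these notes.
EXPORTERS = {
    "json": lambda rec: json.dumps(rec, sort_keys=True),
    "summary": lambda rec: "%s: %s" % (rec["id"], rec["title"]),
}


class DiscoveryStore:
    """Keeps one pre-rendered copy of each record per export format."""

    def __init__(self, exporters=EXPORTERS):
        self._exporters = exporters
        self._rendered = {}  # (doc_id, format name) -> rendered text

    def ingest(self, record):
        """Run every transform once, at the time the record is harvested."""
        for name, export in self._exporters.items():
            self._rendered[(record["id"], name)] = export(record)

    def present(self, doc_id, fmt):
        """Serve a stored rendering; nothing is transformed per request."""
        return self._rendered[(doc_id, fmt)]
```

The trade-off is more storage and a slower ingest in exchange for a Present operation that is a simple lookup. Adding a new interface then means adding an exporter and re-ingesting.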

Lessons

  • Bryan should never be in the critical path of any development work!
  • Even if NERC DDS isn't creating huge usage stats, we should make sure that the Data Centres (esp NEODC, BADC) actually use Discovery as their own search tool on their public-facing website(s). Should do this now.

Security

Overview

Key concept = role mapping

Strengths / Successes

  • Security showed early on that it was possible to build quite sophisticated system based on Web Services.
  • Things have come a long way; in particular, good progress has been made with OpenID in Java & Python.

Weaknesses

  • Too tied to what was available at the time (Globus, Proxy, certificates etc)
  • Problems of integrating this with normal user/password system used by Data Centres. In fact this was overkill compared to what was needed.
  • Went down a blind alley with MyProxy (good tool, but made things too complicated in this context)
  • Personal User Certificates : didn't need these (can do same job by asserting someone's identity & using something like SAML)
  • Single sign on : wasn't a big requirement at the start but ended up spending lots of time on it.

Technology

  • Lots of immature tooling hence lots of time wasted trying to get things to work.
  • WS-Security : spent too much time on this. In the end people voted with their feet & just used SSL instead. Huge problems of implementation even between different Java toolkits.
  • WSGI is a really important tool for layering services & for encouraging modular development
  • Pylons : good for security but perhaps more difficult to pull other things out of it.
  • OGC side : big challenges remain. People are still breeding security solutions and having headaches getting them to be interoperable.
  • GEO-DRM experiment : largely SOAP-based. Didn't want to hinder OGC protocol, but at same time, the only emerging technology out there was SOAP.
  • There still isn't an established technology for HTTP-based security
    • Big corporations using SOAP (...works)
    • "Rebels" using RESTful systems ...?

COWS context :

  • Got stuck in security but had to move to OGC-style services
  • Came up with workable solution.
  • Could in theory create a WSGI layer for security (& Stephen wouldn't need to know too much about what was inside it)
  • Still need database with record of what resource has what security policy
  • Phil would have liked more time to look at XACML or GeoXACML (e.g. enabling restriction by bounding box)
  • One thing we did wrong was to assume that access control would be in the MOLES docs.
    • Much better if this information is in some out-of-band database (FTP needs to talk to it, too)
      • Could easily write plugin for ProFTPd to do this ...should look at this for NEODC/BADC now
  • Shibboleth is going to be key, as is OpenID
    • Is Shibboleth a "winner" technology that we should have anticipated? Doesn't meet NDG requirements but does meet those of BADC.
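
The idea of a WSGI security layer backed by an out-of-band policy store might be sketched like this. Everything named here is an assumption made for illustration: the `POLICIES` table stands in for the policy database, the `ndg.roles` environ key stands in for whatever an upstream authentication layer would provide, and `data_app` stands in for a COWS-style service.

```python
# Hypothetical policy table: path prefix -> role required to access it.
# In the notes this would be an out-of-band database, not a dict.
POLICIES = {"/data/restricted": "badc-user"}


def required_role(path, policies):
    """Return the role needed for this path, or None if it is open."""
    for prefix, role in policies.items():
        if path.startswith(prefix):
            return role
    return None


def roles_from_environ(environ):
    """Assumption: an earlier authentication layer stored the user's roles."""
    return environ.get("ndg.roles", set())


class AccessControlMiddleware:
    """WSGI layer that refuses requests lacking the required role."""

    def __init__(self, app, policies, get_roles):
        self.app = app
        self.policies = policies
        self.get_roles = get_roles

    def __call__(self, environ, start_response):
        role = required_role(environ.get("PATH_INFO", ""), self.policies)
        if role is not None and role not in self.get_roles(environ):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Authorised or unrestricted: hand over to the wrapped application.
        return self.app(environ, start_response)


def data_app(environ, start_response):
    """Stand-in for the wrapped data service; knows nothing about security."""
    body = b"data"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = AccessControlMiddleware(data_app, POLICIES, roles_from_environ)
```

Because the policy lookup is a separate function, another service (for example an FTP server plugin) could in principle consult the same store without going through WSGI at all.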

Vocab Server

Overview

Enables query by term, responds with "narrower than" terms that are useful for searching. Place to maintain lists of vocab terms : essential part of building domain vocabs (and ontologies)

Strengths / Successes

  • Key piece of NDG, one of its only operational components. In use in the discovery service.
  • Stimulated lots of interest
  • This was the one part of the jigsaw driven by BODC (& it works well)

Weaknesses

  • BODC lost some key staff & found it difficult to spend small amounts of money on required development.
    • Future progress possibly difficult. Maybe Steve D can take on some of this work?
  • Breaks if query contains 2 terms (but otherwise very easy to use)

Lessons learned

  • Roy shouldn't be in the critical path of a project, either!
  • Andrew's group is part of CEDA in the wider sense, but we need to interact with them as much as possible. We now know what their skills (& weaknesses) are. They are important contacts, even if we don't always agree on things.

Going forward

NDG MSI

Aim : to get across the message of data modelling to those parts of NERC that haven't been exposed to it before.

Note that application schema (e.g. CSML) is our viewpoint of our data, i.e. corresponding with our application.

Activities

  • New MOLES (v3) atom serialization
    • addressing "when" for the first time (Simon Cox to be brought in for this & offer up result to standards body). He's an Earth Scientist so hopefully should get buy-in to results from e.g. BGS.
  • National Capability
    • Workshops about understanding of Data Modelling
  • Security
  • COWS
  • Discovery Service
    • Improve GUI, implement OpenSearch, INSPIRE requirements, improve spatial searching (e.g. proximity ranking)
  • Vocab service

MOLES v2+v3 activities both supported. By end of FY should have v3.

Bid has now been funded. Intend to get James Doughty to organise work, workshops, keep things on track.

Not enough money to fix all flaws with COWS...

(Stephen) At its core, it is a library to help in creating OGC services. Code needs maturing so hope MSI will find some time for this.

(General comment) : So far, have tended to develop new things on a code branch (because it's expedient to do so). Need to move to situation where expediency means developing on the trunk rather than a branch.