Changes between Version 13 and Version 14 of EsgMeetings/Telecon20090807

24/11/09 12:51:12

Page moved to go-essp wiki



= ESG AR5 CMIP5 EU-US Telecon - 20090807 =

[[PageOutline]]
This page is effectively a set of minutes from the telecon, reordered a little to make it easier to follow, and annotated (in the process of writing) with questions which the BADC feel need to be expanded upon via email or at subsequent meetings. All annotated questions are shown with sidebars.

== 0. Attendees ==

Dean Williams; Cecelia !DeLuca; Bryan Lawrence; V. Balaji; Michael Lautenschlager; Don Middleton; Luca Cinquini; Bob Drach; Sylvia Murphy; Stephen Pascoe; Ann Chervenak; Rachana Ananthakrishnan; Frank Toussaint; Ag Stephens; Philip Bentley; Karl Taylor; Serguei Nikonov; Eric Nienhouse

== 1. ESG Software and data flows ==
Dean Williams gave an overview of ESG software and data flows.
 * There are two main ESG components: (i) the Gateway and (ii) the Data Nodes.
 * 30-40 climate modelling sites will send core data to PCMDI.
 * This will create the "common core archive". It will be ~600TB (rising to 1PB). This will be replicated to BADC and Max Planck.
 * Those not planning to be involved in the ESG federation can just mail 1TB disks to PCMDI as before; those within the ESG federation will move their core data to PCMDI via the proposed ESG replication mechanism.
> '''We need to clarify whether that's a push or a pull, how updates are handled, and whether or not it would be better to replicate from data nodes to all gateways rather than via the PCMDI gateway.'''
==== 1.1. The ESG Gateway ====

Luca Cinquini talked about ESG Gateways.

Prototypes are currently being deployed for testing at two sites (PCMDI and NCAR).

The Gateway provides:
 * search
 * faceted browsing through metadata records
 * metadata harvest from data nodes via RDF exchange metadata (some lack of clarity on this, see below)
 * metadata harvest from DIF
 * the ability to expose metadata for harvest by other gateways.
   * So, for example, BADC can harvest all the metadata from PCMDI, and a BADC instance of an ESG gateway could expose metadata and data services from PCMDI as well as data and services harvested from data nodes directly.
     * In this instance, the most likely configuration would be that the BADC would expose one (or more) data node(s) which expose core data, plus Met Office and NCAS complete data.
Data services in the gateway use data delivered from data nodes.

This implies that core data centres need to run both a data node and a gateway.

Also discussed as gateway services were:
  * download files
  * WGET support
  * GML
  * GridFTP support (as a client) - planned in future
> '''It's not clear which of these are features of the Gateway, which are data node services and which are unimplemented features.'''
Security is being tested with BADC, with some good results using OpenID.
> '''What does the gateway need to secure, as opposed to the data nodes?'''

> '''Need a timescale for Authorisation and a clear understanding of what is involved.'''

> '''''Comment from Philip Kershaw:'''''
>
> '''Luca and I have been working on the Attribute Service over the last couple of weeks. This is a key component for authorisation, as it allows an authorisation service to query whether a user making a request has the required attributes for access. We've done our first end-to-end tests with a query running from the BADC to Luca's ESG test service.'''
Currently to do:
  * Versioning
> '''Timescale?'''
  * Gateway Demonstrator/Installation Guide (due in a "couple of weeks")
  * Authorisation
  * Revise UI
==== 1.2. Data Nodes ====

Bob Drach talked on this. 4 or 5 groups are installing data nodes.

The data node is where the data actually resides, where you can get access to it, and where it gets published.

A Data Node includes:

 * Publisher
   * a client that orchestrates the other pieces
   * reads the files
>  '''Does this really use cdscan? If so, what is the relation between cdscan and the Thredds catalogues?'''
   * publishes Thredds catalogues
   * saves this info in a database
>  '''Why? Or is this where cdat has a role?'''
   * publishes "services" to data, which are endpoints (e.g. a URI to a service such as FTP, LAS access, etc.)
   * extensible to allow new service types to be published
 * A place to store the data
 * A Thredds catalogue to serve data
 * Optional LAS and visualisation
 * A Myproxy client
 * A GridFTP server (planned, but not going to be ready for GO-ESSP)
Data flows:

 1. Data files are in place
 1. The Publisher scans the data
 1. Generates XML files for Thredds catalogues
 1. Notifies the Gateway that the data is ready (i.e. publishing):
    * authenticates using a Myproxy client and an x509 certificate
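The publishing flow above can be sketched as follows. This is only an illustration: the function names and the simplified catalogue format are hypothetical, not the actual ESG publisher API.

```python
import glob
import os

# Hypothetical sketch of the data-node publishing flow described above.
# It only illustrates the sequence: scan -> build catalogue -> notify gateway.

def scan_data(data_dir):
    """Step 2: the publisher scans the data files."""
    return sorted(glob.glob(os.path.join(data_dir, "*.nc")))

def build_thredds_catalog(files):
    """Step 3: generate a (very simplified) THREDDS-style XML catalogue."""
    entries = "\n".join(
        '  <dataset name="%s"/>' % os.path.basename(f) for f in files
    )
    return "<catalog>\n%s\n</catalog>" % entries

def publish(data_dir, notify_gateway):
    """Step 4: tell the gateway the catalogued data is ready.

    In the real system this step authenticates with a MyProxy client and
    an x509 certificate; here it is reduced to a plain callback.
    """
    files = scan_data(data_dir)
    catalog = build_thredds_catalog(files)
    notify_gateway(catalog)
    return catalog
```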
Just starting to work on versioning (like the Gateway).

Each data node will publish to one Gateway.
> '''OK, what then will we do with core data "datasets" which are subsets of data held in gateways? If we do want to use replication from data nodes to the core data centres, how does this affect things?'''
==== 1.3 Bulk Download ====

Ann Chervenak led this discussion '''(did she: I thought it was Rachana?)'''.

GridFTP will be used for bulk download.

Looking at GridFTP in data nodes:

 1. Bulk data download - users can select files for GridFTP download
 1. Bulk data movement - replication management

The majority of the work has been on scenario 1, enabling users to start a GridFTP client via the web-start mechanism. They are working on making it handle the required authentication.

> '''Currently the main effort with GridFTP is concentrated on users using it for bulk download; little effort seems to have been put into supporting the (arguably more important) scenario 2: replicating data between data nodes and core data node(s). Can we get some timescales on when we can do this?'''

It will be possible to configure GridFTP on data node servers on multiple ports, customised for efficient delivery to users, transatlantic transfers, etc. (i.e. using different window sizes).

Note: [ uberftp] provides an FTP-like client using GridFTP transport.
==== 1.4 Replication across sites ====

This discussion was led by Ann Chervenak.

Planning to create a replication client (using the existing metadata client and query client).

What PCMDI expects is that EU data nodes will want to copy core datasets in their entirety (rather than individual files).

A request will come from an EU node:

 * the request comes to the PCMDI Gateway
 * it will work out the list of file names and pack it off to:
   * the data mover light tool (an existing tool):
     * currently using FTP
     * will eventually do bulk data movement (in future, using GridFTP)
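The request flow above can be sketched as follows. The names are hypothetical; the real gateway and data mover light tool are not shown, only the sequence: expand a dataset into a file list, transfer each file, and build a status report of any problems.

```python
# Hypothetical sketch of the replication request flow described above: an
# EU node asks the gateway for a core dataset in its entirety, the gateway
# expands it into a file list, and a mover transfers each file, recording
# failures in a status report.

def expand_dataset(catalogue, dataset_id):
    """Gateway step: work out the list of file names for a dataset."""
    return catalogue[dataset_id]

def move_files(files, transfer):
    """Data-mover step: transfer each file (FTP now, GridFTP later),
    building a status report that lists any problems."""
    report = {"ok": [], "failed": []}
    for f in files:
        try:
            transfer(f)
            report["ok"].append(f)
        except IOError as err:
            report["failed"].append((f, str(err)))
    return report

def replicate(catalogue, dataset_id, transfer):
    """EU-node step: request a whole core dataset (not individual files)."""
    return move_files(expand_dataset(catalogue, dataset_id), transfer)
```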
The data movement tool will provide a status report for any problems in data transfers.

If a mirror site wants to publish the dataset, then the replication client will package up the metadata required to publish that dataset. But the mirror metadata will be identified as different from the original metadata.

> '''This doesn't seem to address the use case of how the PCMDI core is established and maintained. It presupposes that PCMDI wants to serve the entire core (up to 1 PB) out ''twice'', to BADC and WDCC. It doesn't address the issue that we can't wait until datasets are complete at PCMDI (or wherever). Clearly we need this to be automated, shouting when it has problems, but ensuring "mirroring" where necessary.'''

(Bryan asked why we weren't using standard mirroring tools: apparently rsync can't handle the volume.)
==== 1.5 General ESG Discussion ====

At present the catalogue and gateway system doesn't handle management of replicants in different locations.
> '''This will be crucial for the proposed global federation; when and how will we do this?'''

Is ESG delivering a version of LAS which is ESG "access control" aware? (Sort of similar to what has to be done for Thredds, and so they expect to.)

'''Release cycles''': Phil noted that go-essp-tech could be used for all notifications, and there was general agreement that this was desirable.

> '''We need a clearer plan for the ESG release cycle, how they plan to handle updates between releases, etc.''' (We appreciate that the original ESG plan didn't take on a global federation, but that's where we are now, so we need to have a clear understanding of what will happen when, and what that means for other ESG gateways, nodes and services built upon ESG.)

'''Versioning issues''': e.g. what if a dataset is out of sync with its original copy? ESG are thinking about this kind of thing.

'''Master copies''': What is a master copy? E.g. if GFDL publishes (from its own data node) then it has the master copy, even though it contributes a copy to PCMDI (as part of the core). If GFDL changes its data to a new version, then the owner of the master has to inform any other data nodes that its data has changed.
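The master-copy rule above could be sketched like this (hypothetical class and names, purely illustrative): when the master version changes, its owner notifies every node holding a replica.

```python
# Sketch of the master-copy rule described above: the node that owns the
# master copy informs all replica holders whenever it publishes a new version.

class MasterCopy:
    def __init__(self, dataset, version=1):
        self.dataset = dataset
        self.version = version
        self.replicas = []  # callbacks for nodes holding copies (e.g. PCMDI)

    def add_replica(self, notify):
        """Register a node holding a copy of this dataset."""
        self.replicas.append(notify)

    def update(self, new_version):
        """Publish a new version and inform all replica holders."""
        self.version = new_version
        for notify in self.replicas:
            notify(self.dataset, new_version)
```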
==== 1.6 CMIP5 Core ====

Q: What is the core? A: Effectively defined by Karl and Ron.

Q: How should it be replicated?

Possible answer:

Suppose datasets are marked as part of the CORE when they enter their first data node. Then the federated system should work as follows:

 * all data nodes can be providers of CORE data
 * some data nodes are identified as CORE REPLICATION NODES
 * any new data that is CORE gets marked as such when it is published
 * then the replication system needs to either:
   1. Disseminate each CORE dataset to all CORE REPLICATION NODES, OR
   1. Have each CORE REPLICATION NODE poll all data nodes to check whether they have any new datasets published as CORE that are not held locally. If there is new data, or new versions of existing data, then the CORE REPLICATION NODE requests the dataset from the data node that has the new data.
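The second (polling) option might look like the following sketch. The names and data structures are entirely hypothetical, just to make the proposal concrete.

```python
# Hypothetical sketch of option 2 above: a CORE REPLICATION NODE polls all
# data nodes and requests any CORE datasets (or newer versions) it lacks.

def poll_for_core_updates(data_nodes, local_versions):
    """Return (node, dataset, version) tuples the replication node should fetch.

    data_nodes maps node name -> {dataset: (version, is_core)};
    local_versions maps dataset -> version held locally.
    """
    to_fetch = []
    for node, published in data_nodes.items():
        for dataset, (version, is_core) in published.items():
            if not is_core:
                continue  # only CORE data is replicated
            if local_versions.get(dataset, 0) < version:
                to_fetch.append((node, dataset, version))
    return to_fetch
```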
Note the difficulties that will arise in the cataloguing, the definitions of datasets, and the associated endpoints.

Effectively we can consider any given simulation as consisting of output, say the set {{{A=[1,2,3,4]}}}, but only {{{B=[2]}}} might be in the core. The metadata for A and B differ only by the dates (and perhaps parameters). The endpoints may differ (core data nodes may have more or fewer services than data nodes). There might be three replicants of B. The data node might update 3: nothing changes in the core data nodes. They might update 2: then all core nodes need to update. (They might of course update part of 2 ...)
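The A/B example above reduces to a set-intersection test: a core replica only needs updating when the updated output items intersect the subset held in the core. As a sketch (hypothetical helper, not part of ESG):

```python
# Sketch of the A/B example above: a core replica needs updating only when
# the updated items overlap the subset of the simulation held in the core.

def core_needs_update(updated_items, core_subset):
    """True if any updated output item is part of the core subset."""
    return bool(set(updated_items) & set(core_subset))
```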
'''Action Required: BADC to document this issue to form the basis of the replication and cataloguing discussion at the next telco.'''
== 2. Metadata ==

Much of the material under this section was covered in a telecon the day before; a follow-up [attachment:karl.txt note] from Karl and [attachment:cecelia.txt minutes] from Cecelia are attached here.

Cecelia and Sylvia went over those minutes.

The key message for the European partners is that the Curator team are leading the development of the ESG catalogue system which navigates model metadata. Key components include:
 * the underlying ontology (being informed by [ metafor] as well as their own efforts)
 * the faceted browse technology
 * the "trackback" pages
 * relationships to gridspec.
The model metadata input will mainly come from the output of a questionnaire being developed by the metafor team. (The link isn't included here because it's not currently secure, and we don't want it spammed.)

The Questionnaire is really a Metadata Entry Tool. It reads in:
 * controlled vocabs
 * parameters
 * CMIP5 experiment descriptions, etc.

Usage issues:
 * it will force users to fill in many aspects of the metadata, so they need to do something useful
 * it will be very time-consuming to fill in
 * BADC will provide someone to support people in filling the data in
Bryan showed the team some of the capabilities, and we agreed that it wouldn't need to be used in anger before November, so a GO-ESSP due date for delivery should be OK.

Questionnaire current timeline:
 * alpha3: 12 August
 * beta: end of August (will include some XML output in metafor CIM format)
 * (there will probably be some interim betas)
 * rc1: end of September
 * rc2: GO-ESSP
 * final: following feedback at GO-ESSP.
There has been considerable effort between the Curator and Metafor teams to harmonise the ontologies. See, for example, meta4:ticket:280.

CIM ("Common Information Model") is metafor's overall metadata format. The Questionnaire will output CIM metadata directly. We agreed that:
 * we would have some initial examples of Questionnaire output in CIM XML by the end of August, with more reliable ones by the end of September
 * we would have CIM-format numerical experiment descriptions for both CMIP3 and CMIP5 by mid-September.
== 3. Timeline ==

 * ESG: looking to release this autumn.
> '''How will updates be released?'''

We understood that after this release, the ESG team would "have a huddle" and decide on priorities and deliverables for the next release.
> The Europeans would like ESG to make this an open process ...

We expect most modelling centres to begin real runs in November ...
== 4. Next Steps ==

 1. Produce a document outlining concerns and issues to do with cataloguing and requesting replicants.
 1. Establish a follow-up telco to address the issues and actions from this one.
 1. Establish an agenda for the ESG Monday before GO-ESSP.
== 5. Outstanding BADC/IS-ENES Concerns Not Covered Elsewhere ==

This section does not reflect discussion at the telco; it records concerns which arose on subsequent reflection. They may of course be spurious, reflecting our poor knowledge of what is planned by ESG.

 1. The data flow for replication between ESG and BADC/DKRZ isn't clearly defined.
 1. No mechanism is defined to make Gateways aware of the existence of remote replicants of datasets they expose.
 1. The ESG design assumes data movement has already occurred. The Bulk Data Movement component appears less mature than the data-node and Gateway components.
   1. GridFTP development appears to be focused on the end-user use case rather than Bulk Data Movement, which IS-ENES sees as a critical use case for replication.
 1. Versioning of datasets is not clearly defined or implemented.
 1. We have very poor knowledge of the services to be deployed and their maturity. What is the status of LAS security? OpenDAP security?