Martin Dove: The value of CML in managing simulations and data; “the best kept secret” is out

#semphyssci

Martin Dove and I have collaborated for almost all the time that I have been at Cambridge. It’s fair to say that we wouldn’t have had a lot of the progress in CML without Martin’s encouragement, collaboration, getting funding, and publications. Martin was an essential choice for our symposium on Semantic Physical Science.

We were colleagues in the Cambridge eScience Centre and Martin picked up the value of CML immediately. He invited me to be part of eMinerals and later MaterialsGrid – two large collaborative projects which addressed high-throughput simulation of (mainly regular) crystalline and similar materials. One subproject which really impressed me was simulating the damage done by alpha-particles in solidified nuclear waste. The particles zinged off at high speed into the surrounding crystal (NaCl, TiO2, etc.); see Figure 3 of http://rsta.royalsocietypublishing.org/content/367/1890/967.full . IIRC materials such as NaCl recovered well from the damage (but are soluble) while TiO2 did not recover well.

One of the key features of this work was the “parameter sweep”. There are many variable parameters in studies like this – the material, the energy of the particle, the model used (e.g. the force field), etc. It easily leads to large numbers of calculations since we have to take steps along each parameter axis and multiply all possibilities.
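To make the multiplication concrete, here is a minimal sketch of how a sweep grows. The axis names and values are purely illustrative (they are not the ones used in eMinerals or MaterialsGrid), and `build_sweep` is a hypothetical helper, not part of any of the tools mentioned here.

```python
from itertools import product

# Illustrative parameter axes only; real studies would use their own values.
materials = ["NaCl", "TiO2"]
energies_kev = [1.0, 5.0, 10.0]
force_fields = ["model_a", "model_b"]  # placeholder names for force-field models


def build_sweep(**axes):
    """Return one dict of settings per point in the full parameter grid."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


jobs = build_sweep(
    material=materials,
    energy_kev=energies_kev,
    force_field=force_fields,
)

# The job count is the product of the axis lengths: 2 * 3 * 2 = 12.
print(len(jobs))
```

Adding one more axis with ten steps would turn these 12 jobs into 120, which is why the output quickly becomes the hard part to manage.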

Martin had the foresight to develop an impressive local Grid of departmental computers which lay idle for much of the time (e.g. when students were asleep). “CamGrid” (http://www.ucs.cam.ac.uk/scientific/camgrid ) has been a great success with over 1000 machines. It’s simple to use (CONDOR) and popular. The difficulty, as Martin shows below, is how to manage all the output.

That is where his support for CML has been so valuable. In MaterialsGrid the data representation was specified as CML and the members of the group built CML-aware components. Toby White developed FoX to support CML in FORTRAN programs. It’s not fun using a non-object-oriented language to support XML but Toby’s FoX can do this in a way that the community has found straightforward and useful. FoX had generic XML support and also managed a useful subset of CML. Moreover Toby, Martin and colleagues built visualisations (such as ccVIZ) and you will see these below.
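As a rough illustration of why marked-up output is easier to manage than a traditional output file, the sketch below pulls scalar properties out of a small CML fragment and writes them as CSV using only the Python standard library. This is not FoX, ccViz or any of the group’s own tools. The sample document, the `ex:` dictionary references and units, and the function names are all made up for the example; only the general `property`/`scalar` pattern and the CML schema namespace are taken from CML itself.

```python
import csv
import io
import xml.etree.ElementTree as ET

CML_URI = "http://www.xml-cml.org/schema"
CML_NS = {"cml": CML_URI}

# Invented sample output; real CML from a simulation code would be much richer.
SAMPLE = """<cml xmlns="http://www.xml-cml.org/schema">
  <propertyList>
    <property dictRef="ex:volume">
      <scalar dataType="xsd:double" units="ex:angstrom3">1234.5</scalar>
    </property>
    <property dictRef="ex:temperature">
      <scalar dataType="xsd:double" units="ex:kelvin">300.0</scalar>
    </property>
  </propertyList>
</cml>"""


def extract_scalars(cml_text):
    """Return (dictRef, value, units) for every property holding a scalar."""
    root = ET.fromstring(cml_text)
    rows = []
    for prop in root.iter("{%s}property" % CML_URI):
        scalar = prop.find("cml:scalar", CML_NS)
        if scalar is None or scalar.text is None:
            continue  # skip properties that hold arrays or matrices instead
        rows.append((prop.get("dictRef"), float(scalar.text), scalar.get("units")))
    return rows


def to_csv(rows):
    """Serialise the extracted rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["property", "value", "units"])
    writer.writerows(rows)
    return buf.getvalue()


rows = extract_scalars(SAMPLE)
print(to_csv(rows))
```

No knowledge of the producing code’s output layout is needed: the property name and units travel with the number, so the same few lines work for any file that follows the same convention.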

Martin makes it clear that he uses CML because it is useful and saves him time. We have responded by trying to build in the elements and attributes needed to support solid state and computation and these have stood up well over the last 5-7 years.

But CML is not yet universal. In the talk Martin describes it as the “best kept secret”. I’d agree, and offer some reasons why.

  • The chemistry/materials community is intrinsically conservative (compared to bioscience)
  • The codebase is “mature” and quite a lot is commercial. It’s not easy to convince developers to add in yet-another-feature. In fact we have made progress with codes such as GULP, DL_POLY, CASTEP (through MaterialsGrid).
  • Many scientists need a working prototype before they believe. The prototype has taken a great deal of work: years.
  • The infrastructure needs to be stable. That was almost impossible in the first 10 years of CML with W3C bringing out new specs, changing toolkits etc. However the spec has been essentially stable for about 5 years. It continues to be challenged and has been able to deal with those challenges. So Martin was a very early adopter and has been through a good deal of pain.

There’s more, but it’s only in the last 2-3 years that there are signals that CML might be widely deployable. It still needs a lot of work but the way is clear. We have proved dictionaries, conventions, data validation, etc. And #semphyssci has shown that there is a desire for the value that a rigorous approach brings.

So many thanks Martin, and here is the annotation of your talk.

1:10 Scientific Example materials that shrink when heated
1:40 plotted simulated volume against temperature
2:30 no reason not to plot lots of points
2:50 workflow
3:20 can launch hundreds of jobs
3:40 hard bit is extracting results
4:00 traditional output file
5:00 traditionally might have to read code to understand
5:50 traditionally bad practice is tolerated
5:45 CML makes extracting data easy
6:00 Example of CML
6:30 input parameters and metadata
7:00 typical CML scalar property
8:00 Codes which produce CML
8:15 introduces FoX; writing CML is simple
9:45 otherwise have to write parsers
10:10 Toby White’s ccViz (DEMOs from here on)
11:45 Final quantities (e.g. Diffusion Coefficient)
12:00 Graphs
12:45 can send marked up file for collaboration
13:00 very good to avoid training students on legacy
13:20 extraction data (xtract will create a CSV file)
15:00 Martin was able to present this to high-school students – they could run hundreds of jobs
15:50 Martin cares because it makes lives easier
16:15 CML is “Best kept secret”
16:40 end