This post was stimulated by Al Downie’s recent article, Research Data Management as a national service. Thanks to Al for this stimulating thought piece, and also to Andy Turner for bringing JISC’s Open Research Hub into the conversation in his comment on Al’s post. I’ll come back to the Open Research Hub later, and start by picking up on Al’s thread.
A national approach to research data management
I agree with the point, implicit in Al’s post, that ‘storage’ is the starting point for thinking about research data management as a service. ‘Storage as a platform’ is perhaps the best way to think about this. Al describes it as a “basic, discipline-agnostic, universally-relevant storage platform”. I also agree with the thrust of Al’s argument about the advantages of shared services. However, as Al points out, the ideal of a UK national shared storage facility on which a range of services can be built and provided is nothing more than a pipe dream at this point and, absent some drastic changes in institutional and funding models and in how research is prioritized, is unlikely ever to be achieved.
What steps could be taken to change that pessimistic assessment? Identifying other projects with similar objectives already planned or in development is a useful first step. One I’m familiar with is the Massachusetts Green High Performance Computing Center (MGHPCC), a data center operated by Boston University, Harvard University, the Massachusetts Institute of Technology, Northeastern University, and the University of Massachusetts, and open for use by any research organization. The MGHPCC was originally funded by a grant from an arm of the Massachusetts government, the Massachusetts Life Sciences Center, but it is managed independently by the five participating universities. The board of directors is made up of the presidents and CIOs of the universities, and operations are managed by representatives from the computing departments of the universities.
Since its creation in 2013, MGHPCC has also spearheaded or become associated with related research computing initiatives that make use of MGHPCC infrastructure. For example, the Northeast Cyberteam Initiative enables researchers from small and medium-sized New England colleges and universities to access MGHPCC resources. Most recently, and most relevant to the vision Al lays out, parties involved in MGHPCC are taking steps to create the New England Research Cloud, a production-ready open source cloud on which applications will be deployed and made available to researchers.
The MGHPCC and the New England Research Cloud seem to me to be similar in vision, structure, and implementation to Al’s concept of research data management as a national (or in this case regional) service.
A few salient points about the MGHPCC strike me as important and relevant to its successful start and promising future:
- It is supported by government funding.
- It is controlled and managed by participating universities.
- Commercial entities — Cisco and Dell so far — are involved as partners but have no role in governance.
- To continue to thrive and be relevant to the research community it requires constant innovation and that in turn means regular access to new sources of funding.
- The incentive to raise new funds lies with the universities, but they are not constrained in doing this as they would be if MGHPCC were controlled by the government entity that originally funded it.
- MGHPCC views itself as a platform, and wants to become a facility on which multiple applications needed by researchers can be deployed, i.e. it is not prescriptive in telling researchers which applications they can access and make use of.
Would this model or something like it work in the UK? University and research funding, and government support for these activities, are of course very different from those in the US, and any attempt at direct transplantation would likely fail. That doesn’t mean, however, that the model is not worth understanding, or that elements of the model could not be adopted and adapted to UK circumstances, or that a dialogue with people who are taking MGHPCC forward would not be useful. Moreover, given that the entire R&D environment in the UK is likely to change significantly in the coming years and that the new government’s stated intention is both to increase funding for science research and to implement new approaches to supporting it, it would seem to be an ideal time for innovative thinking and proposals about research infrastructure.
How services offered on top of the storage could be selected and organized – the role of a Data Commons
In response to Al’s post Andy referenced JISC’s Open Research Hub, to which Al commented, “this is a great example of the great work that is being done to drive forwards the Open Research agenda, but it doesn’t address the live data issue – the bulk of the iceberg!” Al’s vision for ‘live data’ is described in his post:
“Design [the storage infrastructure] in such a way that the developers and commercial partners can build and grow a library of interface skins that will provide specialist toolsets for those different research communities”.
Again, this closely mirrors the approach being taken by MGHPCC, which is beginning to work with specialist toolset providers to deploy them on top of MGHPCC, and crucially to do that in a way that takes advantage of particular strengths of MGHPCC, e.g. computational capabilities.
The Harvard Data Commons concept being developed by Merce Crosas at Harvard seems to me to be an excellent template for the layer of ‘live data services’ deployed on top of the storage platform. The below graphic illustrates the Harvard Data Commons, which Merce has described most recently at https://scholar.harvard.edu/mercecrosas/presentations/harvard-data-commons.
Two critical aspects of the Data Commons concept are that the services offered are as far as possible inter-operable with each other, and that they also interact productively with the storage platform on which they are to be deployed, i.e. MGHPCC and related facilities in the case of the Harvard Data Commons.
An example of inter-operable services: electronic lab notebooks and data repositories
A couple of years ago Agustina Martinez put together an interesting presentation, Cambridge use case: Integrating RSpace and Apollo repository, discussing how an integration between an electronic lab notebook and a data repository works in practice and can benefit the research community and the university. This ties in well with the discussion at the outset of this article around the Open Research Hub, which is based around data repositories, and Al’s comment that the Open Research Hub is limited to repository data and does not cover the much larger part of the iceberg, live data. Electronic lab notebooks are an example of tools used to collect and analyse live data – along with generic tools like OneNote, specialized tools like protocols.io, and many other kinds of tools. The integration between an ELN and a data repository constitutes an important bridge between live or active data and archival data.
At the end of her presentation Agustina goes on to speculate about ‘Future Solutions’, and imagines a scenario where ELNs are deployed on top of an active data storage facility (which could be the shared storage platform Al is envisioning or MGHPCC), accessed by researchers from multiple universities, and integrated with the repositories in JISC’s Research Shared Service, i.e. the Open Research Hub that Andy brought into the conversation. This is an example of how the Open Research Hub could be extended beyond repository data to include live research data, and to do that in an integrated fashion.
Research data management as a national service: the building blocks are (mostly) already in place
Reflecting on Al’s original ideas, the MGHPCC/Harvard Data Commons model, and Agustina Martinez’ thoughts on future solutions, it seems to me that the following are the core elements of research data management as a national service:
1. A scalable, well-funded and well managed storage platform open to researchers from all or most universities.
2. A portfolio of inter-operable live data tools that can be deployed on top of the storage platform.
3. A repository service into which live data tools are integrated, permitting direct data deposit from the tools to the service.
Let’s take each of these, in reverse order.
3. The repository service. This already exists in the form of the Open Research Hub.
2. The library of inter-operable tools that integrate with the Open Research Hub. A plethora of tools are already used by UK researchers; some of them are inter-operable with other tools, and a few of these are already integrated with the Open Research Hub. The task here would be to establish a procedure for vetting tools and enabling them to be deployed on top of the storage platform and hence made available as part of the integrated national research data management service. The Harvard Data Commons could be a good starting point for thinking about what categories of tools to include, and for examples of tools that are both inter-operable and serving a useful role in a comparable research environment.
1. The data storage platform. Rather than the top-down approach of trying to create from scratch a national, centralized data storage platform, which would at best take years to achieve and might never happen, why not explore bottom-up approaches that make use of and build on existing infrastructure? In the same way that Harvard, MIT, BU, Northeastern and UMass have collaborated to create the MGHPCC, several UK research institutions could work together to seek funding for and co-invest in a new storage facility, if possible one that builds on existing infrastructure. (It may well be that examples of shared storage platforms already exist and could be brought into the picture.) As with the MGHPCC, that facility could be open to use by researchers from other institutions. This, combined with the library of inter-operable tools and the Open Research Hub, would serve as a proof of concept and could help validate the utility of a multi-institutional research data management service. It would also provide a governance structure and an impetus to continue to build out the service on a truly national scale.
At the end of his post Al said he would love to hear others’ thoughts – these are a few to add to the conversation!