This post was prompted by Al Downie’s recent article, Research Data Management as a national service. Thanks to Al for this stimulating thought piece, and also to Andy Turner for bringing JISC’s Open Research Hub into the conversation in his comment on Al’s post. I’ll come back to the Open Research Hub later, and start by picking up on Al’s thread.
I agree with the point, implicit in Al’s post, that ‘storage’ is the starting point for thinking about research data management as a service. ‘Storage as a platform’ is perhaps the best way to think about this. Al describes it as a “basic, discipline-agnostic, universally-relevant storage platform”. I also agree with the thrust of Al’s argument about the advantages of shared services. However, as Al points out, the ideal of a UK national shared storage facility on which a range of services can be built and provided is nothing more than a pipe dream at this point, and absent some drastic changes in institutional and funding models and in how research priorities are set, is unlikely ever to be achieved.
What steps could be taken to change that pessimistic assessment? Identifying other projects with similar objectives already planned or in development is a useful first step. One I’m familiar with is the Massachusetts Green High Performance Computing Center (MGHPCC), a data center operated by Boston University, Harvard University, the Massachusetts Institute of Technology, Northeastern University, and the University of Massachusetts, and open for use by any research organization. The MGHPCC was originally funded by a grant from an arm of the Massachusetts government, the Massachusetts Life Sciences Center, but it is managed independently by the five participating universities. The board of directors is made up of the presidents and CIOs of the universities, and operations are managed by representatives from the computing departments of the universities.
Since its creation in 2013, MGHPCC has also spearheaded and/or become associated with related research computing initiatives which make use of MGHPCC infrastructure. For example, the Northeast Cyberteam Initiative enables researchers from small and medium-sized New England colleges and universities to access MGHPCC resources. Most recently, and most relevant to the vision Al lays out, parties involved in MGHPCC are taking steps to create the New England Research Cloud, a production-ready open source cloud on which applications will be deployed and made available to researchers.
The MGHPCC and the New England Research Cloud seem to me to be similar in vision, structure, and implementation to Al’s concept of research data management as a national (or in this case regional) service.
A few salient points about the MGHPCC strike me as important and relevant to its successful start and promising future:
Would this model or something like it work in the UK? University and research funding, and government support for these activities, are of course very different than in the US, and any attempt at direct transplantation would likely fail. That doesn’t mean, however, that the model is not worth understanding, or that elements of the model could not be adopted and adapted to UK circumstances, or that a dialogue with people who are taking MGHPCC forward would not be useful. Moreover, given that the entire R&D environment in the UK is likely to change significantly in the coming years and that the new government’s stated intention is to both increase funding for science research and to implement new approaches to supporting it, it would seem to be an ideal time for innovative thinking and proposals about research infrastructure.
In response to Al’s post Andy referenced JISC’s Open Research Hub, to which Al commented, “this is a great example of the great work that is being done to drive forwards the Open Research agenda, but it doesn’t address the live data issue – the bulk of the iceberg!” Al’s vision for ‘live data’ is described in his post:
“Design [the storage infrastructure] in such a way that the developers and commercial partners can build and grow a library of interface skins that will provide specialist toolsets for those different research communities”.
Again, this closely mirrors the approach being taken by MGHPCC, which is beginning to work with specialist toolset providers to deploy their tools on top of MGHPCC, and crucially to do that in a way that takes advantage of particular strengths of MGHPCC, e.g. computational capabilities.
The Harvard Data Commons concept being developed by Merce Crosas at Harvard seems to me to be an excellent template for the layer of ‘live data services’ deployed on top of the storage platform. The below graphic illustrates the Harvard Data Commons, which Merce has described most recently at https://scholar.harvard.edu/mercecrosas/presentations/harvard-data-commons.
Two critical aspects of the Data Commons concept are that the services offered are as far as possible inter-operable with each other, and that they also interact productively with the storage platform on which they are to be deployed, i.e. MGHPCC and related facilities in the case of the Harvard Data Commons.
A couple of years ago Agustina Martinez put together an interesting presentation, Cambridge use case: Integrating RSpace and Apollo repository, discussing how an integration between an electronic lab notebook and a data repository works in practice and can benefit the research community and the university. This ties in well with the discussion at the outset of this article around the Open Research Hub, which is based around data repositories, and Al’s comment that the Open Research Hub is limited to repository data and does not cover the much larger part of the iceberg, live data. Electronic lab notebooks are an example of tools used to collect and analyse live data – along with generic tools like OneNote, specialized tools like protocols.io, and many other kinds of tools. The integration between an ELN and a data repository constitutes an important bridge between live or active data and archival data.
At the end of her presentation Agustina goes on to speculate about ‘Future Solutions’, and imagines a scenario where ELNs are deployed on top of an active data storage facility (which could be the shared storage platform Al is envisioning or MGHPCC), accessed by researchers from multiple universities, and integrated with the repositories in JISC’s Research Shared Service, i.e. the Open Research Hub that Andy brought into the conversation. This is an example of how the Open Research Hub could be extended beyond repository data to include live research data, and to do that in an integrated fashion.
Reflecting on Al’s original ideas, the MGHPCC/Harvard Data Commons model, and Agustina Martinez’ thoughts on future solutions, it seems to me that the following are the core elements of research data management as a national service:
Let’s take each of these, in reverse order.
3. The repository service. This already exists in the form of the Open Research Hub.
2. The library of inter-operable tools that integrate with the Open Research Hub. A plethora of tools are already used by UK researchers; some of them are inter-operable with other tools, and a few of these are already integrated with the Open Research Hub. The task here would be to establish a procedure for vetting tools and enabling them to be deployed on top of the storage platform and hence made available as part of the integrated national research data management service. The Harvard Data Commons could be a good starting point for thinking about what categories of tools to include, and for examples of tools that are both inter-operable and serving a useful role in a comparable research environment.
1. The data storage platform. Rather than the top-down approach of trying to create from scratch a national, centralized data storage platform, which would at best take years to achieve and might never happen, why not explore bottom-up approaches that make use of and build on existing infrastructure? In the same way that Harvard, MIT, BU, Northeastern and UMass have collaborated to create the MGHPCC, several UK research institutions could work together to seek funding for and co-invest in a new storage facility, if possible one that builds on existing infrastructure. (It may well be that examples of shared storage platforms already exist and could be brought into the picture.) As with the MGHPCC, that facility could be open to use by researchers from other institutions. This, combined with the library of inter-operable tools and the Open Research Hub, would serve as a proof of concept and could help validate the utility of a multi-institutional research data management service. It would also provide a governance structure and an impetus to continue to build out the service on a truly national scale.
At the end of his post Al said he would love to hear others’ thoughts – these are a few to add to the conversation!