Electronic notebooks as data curation tools 2: Optimizing the ELN-to-repository workflow

This is the second of three posts on electronic lab notebooks (ELNs) as data curation tools. In the first post I looked at how ELNs can serve as a lower-cost and more effective alternative to Springer Nature's new research data support service by automating a large part of what is currently a cumbersome and manual data curation process. In the third and last post I am going to look at how data curators can work with researchers to use ELNs as effective data curation tools.

Here I'm going to take a deeper dive into two things: (1) what capabilities ELNs need in order to properly support data curation in preparation for depositing data into data repositories, and (2) what additions need to be made to data repositories to take full advantage of the data structuring capabilities of ELNs.

Capabilities ELNs need to serve as effective data curation tools

At the most basic level, ELNs provide a framework for capturing, organizing and presenting research data.  They are typically used by researchers in life sciences and related/adjacent disciplines.  Not all ELNs are created equal, however!  In order to serve as effective data curation tools, ELNs also need the capabilities described below, and illustrated with screenshots from the RSpace ELN.

1.  Linking to external data sources

ELNs are not used in isolation from the other tools and resources researchers use to gather and analyze research data. They therefore need to support linking to data held elsewhere, including file sharing apps (such as Dropbox, Google Drive, OneDrive, Box and Egnyte) as well as external institutional and lab file stores, where 'big data' such as sequencing and imaging data is likely to be stored. This ensures that data held in these external resources can, through links to documents in the ELN, become part of the research record, and remain searchable and accessible after the relevant datasets have been deposited in an appropriate data repository.
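As a rough illustration of this idea, the sketch below models an ELN entry that records links to externally stored files, so their descriptions stay searchable as part of the research record. The class and field names are invented for illustration and do not reflect any particular ELN's data model.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: an ELN entry recording links to files held in
# external stores (file sharing apps, institutional file servers), so
# the linked data remains part of the searchable research record.

@dataclass
class ExternalLink:
    provider: str      # e.g. "dropbox" or "institutional-filestore"
    uri: str           # location of the data in the external store
    description: str   # human-readable note used for search and curation

@dataclass
class ElnEntry:
    title: str
    body: str
    external_links: list = field(default_factory=list)

    def searchable_text(self) -> str:
        # Link descriptions are indexed alongside the entry body, so linked
        # 'big data' can be found without being copied into the ELN itself.
        parts = [self.title, self.body] + [l.description for l in self.external_links]
        return " ".join(parts)

entry = ElnEntry("Sequencing run 42", "Library prep notes...")
entry.external_links.append(
    ExternalLink(
        "institutional-filestore",
        "smb://labstore/seq/run42/",
        "Raw FASTQ files for sequencing run 42",
    )
)
```

The key design point is that only a reference and a description enter the ELN; the large files themselves stay where they are.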

2.  Linking between internal documents and resources

The ability to form links between documents and other resources in an ELN is a fundamental prerequisite to preserving and presenting a record of the research that can be queried and reproduced.  For example, a link between an experimental writeup and the protocol used to run that experiment provides an essential clue to the workflow used in the research.

This linking capability is significantly enhanced when a unique ID is assigned to each resource in the ELN. A unique ID makes each resource easy to identify and search for, and makes the resources linked to it easy to trace.
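The role unique IDs play in internal linking can be sketched as follows. The ID format and method names here are invented for illustration; the point is that links are stored as ID references, so a curator can later resolve, for example, which protocol an experiment used.

```python
import itertools

# Hypothetical sketch of ID-based internal linking: each resource gets a
# unique ID on creation, and links are stored as pairs of IDs rather than
# copies of content, so relationships can be queried later.

class Notebook:
    def __init__(self):
        self._counter = itertools.count(1)
        self._resources = {}   # id -> (kind, title)
        self._links = []       # (from_id, to_id) pairs

    def create(self, kind, title):
        rid = f"RS{next(self._counter):05d}"  # invented ID scheme, for illustration
        self._resources[rid] = (kind, title)
        return rid

    def link(self, from_id, to_id):
        self._links.append((from_id, to_id))

    def linked_from(self, rid):
        # Resolve outgoing links, e.g. find the protocol an experiment cites.
        return [self._resources[t] for f, t in self._links if f == rid]

nb = Notebook()
protocol = nb.create("protocol", "PCR amplification v3")
experiment = nb.create("experiment", "Amplify sample batch 7")
nb.link(experiment, protocol)
```

Because links are ID references rather than embedded copies, renaming or revising a protocol does not break the experiments that cite it.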

Together, the ability to connect to data in external resources and the ability to form links between internal resources provide a platform for data curation that already includes structured datasets and meaningful information about how different parts of the datasets relate to each other, including, in some cases, the workflows that led to their production. This platform enables researchers and data curators to start the manual data curation process from a much more advanced level, working with rich underlying structure and meaningful metadata. How they can do this is the subject of the third part of this article, which will appear tomorrow in another post.

3.  Flexible data export

In order for datasets to be made publicly available for viewing, querying, and possible reproduction of the research, they need to be transferred into an appropriate data repository. For this to be possible, it must be possible to export data from the ELN with as much flexibility as possible. This should include the ability to export (a) in multiple formats, including PDF, HTML and XML; (b) at varying levels of granularity, from individual documents, to groups of selected documents, to all documents produced by a group; and (c) by individual researchers, group leaders and admins.
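The three dimensions of export flexibility listed above can be sketched as parameters of a single export request. The schema and permission rules below are illustrative assumptions, not any particular ELN's actual export API.

```python
# Hypothetical sketch of a flexible export request: format, granularity
# (one document, a selection, or a whole group) and the requester's role
# are all parameters. The document schema is invented for illustration.

FORMATS = {"pdf", "html", "xml"}

def select_for_export(documents, fmt, scope, requester_role, selection=None):
    """documents: list of dicts with an 'id' key (illustrative schema)."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported export format: {fmt}")
    if scope == "selection":
        return [d for d in documents if d["id"] in (selection or [])]
    if scope == "group":
        # Assumption for this sketch: whole-group export is restricted
        # to group leaders and admins.
        if requester_role not in {"group_leader", "admin"}:
            raise PermissionError("whole-group export requires group leader or admin rights")
        return list(documents)
    raise ValueError(f"unknown scope: {scope}")

docs = [{"id": "d1"}, {"id": "d2"}]
exported = select_for_export(docs, "xml", "selection", "researcher", ["d1", "d2"])
```

Exporting a single document is just the selection case with one ID, which keeps the interface uniform across all three levels of granularity.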

4.  Integrations with data repositories

Finally, there needs to be a streamlined process for getting datasets from the ELN into data repositories, which can only be achieved in an optimal fashion with dedicated integrations between the ELN and a range of repositories. RSpace, for example, has been integrated with Dataverse, figshare and DSpace, allowing direct deposit of datasets and metadata from RSpace into each of these three repositories. It is also beneficial for deposited datasets to be accompanied by the researcher's ORCID iD.

A DOI is assigned to the dataset by the repository after the deposit has taken place.
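A minimal sketch of what such a deposit might carry is shown below. The field names are illustrative only, not the actual JSON schemas of the Dataverse, figshare or DSpace APIs; a real integration maps onto each repository's own metadata model. Note that no DOI appears in the payload, since the repository mints it after deposit. The ORCID iD used is ORCID's published example iD, serving as a placeholder.

```python
# Simplified, illustrative sketch of an ELN-to-repository deposit payload.
# Field names are invented; real repository APIs each have their own schema.

def build_deposit(title, creator_name, orcid, files):
    # Normalise a bare ORCID iD to its canonical URI form.
    if orcid and not orcid.startswith("https://orcid.org/"):
        orcid = "https://orcid.org/" + orcid
    return {
        "metadata": {
            "title": title,
            "creator": {"name": creator_name, "orcid": orcid},
        },
        # No DOI here: the repository assigns one after deposit.
        "files": [{"name": f} for f in files],
    }

deposit = build_deposit(
    "Amplification of sample batch 7",
    "A. Researcher",
    "0000-0002-1825-0097",  # ORCID's documented example iD, used as a placeholder
    ["run42_results.xml", "gel_image.png"],
)
```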

A serious limitation of data repositories must be addressed before the structuring capabilities of ELNs can be fully exploited

The focus of the repository community, in terms of both design and policy, has been on making data already in data repositories more accessible and reusable, i.e. on what happens to data after it gets into a repository. The assumption has been that data and metadata go into repositories directly, in an unprocessed and unorganized fashion, and there has been little discussion of data and metadata before they are deposited. This oversimplification of the research workflow is increasingly untenable; as tools for managing and organizing active research data, like ELNs, proliferate, repositories need to be designed in ways that better facilitate interoperability with these active research data tools.

This limitation is reflected in the APIs provided by data repositories, which do not support deposit of data that has already been structured in an ELN or other active data tool. It is therefore currently necessary for datasets that have been structured in an ELN to be unbundled for deposit and then restructured in the repository. The structure that exists in the ELN is still extremely useful, of course, for the researcher and their group, and for the data curators who work with the researcher to prepare the data for deposit into a repository. However, more fully featured APIs that preserved the structure of datasets when deposits are made would significantly streamline the process and save data curators in particular a significant amount of time.
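The unbundling problem can be made concrete with a small sketch. All names below are illustrative: a structured dataset in an ELN consists of documents plus the links between them, but a flat-file deposit keeps only the files, discarding the grouping and provenance links that would then have to be rebuilt by hand in the repository.

```python
# Illustrative sketch of the flattening problem described above.
# A structured ELN dataset carries documents and inter-document links;
# a flat deposit (what current repository APIs accept) keeps only files.

structured_dataset = {
    "documents": {
        "EXP-1": {"title": "Experiment writeup", "files": ["results.csv"]},
        "PROT-1": {"title": "Protocol v3", "files": ["protocol.pdf"]},
    },
    "links": [("EXP-1", "PROT-1")],  # experiment -> protocol provenance
}

def flatten_for_deposit(dataset):
    # Only the file list survives the deposit; document grouping and the
    # links between documents are lost and must be restructured manually.
    return sorted(f for doc in dataset["documents"].values() for f in doc["files"])
```

A structure-aware deposit API could instead accept something like the whole `structured_dataset` manifest, preserving the provenance links that the ELN already records.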

We at Research Space have discussed how to tackle this issue with both Dataverse and Jisc.  We understand that the new Jisc research data shared service is being enhanced to become an Open Science service and will seek to develop APIs and extend the metadata model to support deposit of previously structured datasets.  It is to be hoped that other data repositories will also go down this path.