The Research Data Support Service and early reactions
Springer Nature’s recent announcement of its new Research Data Support Service (RDSS) has produced both confusion and some consternation in the research data management and data curation communities. The service offers to “improve data findability by editing and enhancing metadata about submitted data and files (such as titles, authors, keywords and descriptions)”, and to carry out “checks on file integrity, for the presence of potentially sensitive or identifying information, and for consistency with associated manuscripts where applicable.” The platform for the service is figshare. Although it is currently free of charge for files of up to 50 gigabytes, it will become a paid service. The proposed cost is reportedly $340 per dataset. Assuming five to six datasets per publication, that works out to roughly $1,700 to $2,040 per publication.
Reactions from data curators have included shock at the additional cost and statements to the effect of, ‘that’s the service we already provide, we don’t need to pay someone else to do it.’ At a deeper level, concern has been expressed that this is reminiscent of the early days of Elsevier’s launch of a service for scholarly publishing, which resulted in scholars and universities losing control of the academic publishing process. The fear is that researchers and universities will now lose control of research data in the same way that they lost control of publications.
The RDSS does highlight a weakness in the research data workflow
Although Springer Nature’s RDSS is proving controversial, it does point to an important weakness in the current research data workflow: research data is generally not well organized during the active research process. It therefore requires substantial manual curation when the time comes to deposit it in conjunction with publication. In effect this adds a second, and in many cases largely unnecessary, stage to the research-to-publication process.
The introduction of data repositories has made available a vehicle that is well suited for storing datasets, which is useful for preservation, for subsequent querying and re-use of the datasets, and as a complement to publications. However, data repositories are not designed to provide structure to data. As long as repositories are used in isolation, extensive manual curation is needed to prepare data for deposit. This manual curation becomes a second step in the research-to-publication workflow. Typically it is done with the assistance of data curators, and now Springer Nature is offering the RDSS as a replacement for, or alternative to, data curators working at universities. Neither of these approaches, however, deals with the fundamental problem: far more data requiring structuring and preservation is being produced than can be dealt with by manual curation without an order of magnitude increase in costs. Manually curating all of this data would require either widespread uptake of the RDSS or hiring an army of new data curators.
Electronic notebooks as a data curation tool
A low-cost alternative is in prospect that will deal with this conundrum in the disciplines that produce the greatest volume of research data, i.e. life sciences and related areas. Use of electronic lab notebooks (ELNs) potentially collapses the current two-step process (first do the research, then curate the data) into a one-step process: structure the data during the active research phase using the tools provided by ELNs. Unlike data repositories, ELNs are designed to make it easy, and in some cases automatic, to provide structure to data. This structuring is done by the researcher, the person best placed to understand their research. Once the data has been structured in the ELN, it can be deposited into a data repository for storage and future re-use and querying.
This ‘self service’ approach made possible by ELNs also frees up data curators to concentrate on providing higher-value advice to researchers, because datasets are already structured by the time the data curator gets involved. Data curators can then be more productively employed in advising a greater number of researchers on the finer points of data selection and preparation, instead of on the basic, tedious collection of data from widely disparate sources that currently occupies much of their time and allows them to work with only a small subset of the research community. The result will be a substantially increased number of datasets available for deposit, higher quality deposits, and more efficient and productive use of data curators’ time.
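To make the one-step idea concrete, here is a minimal sketch of what ‘structured at capture time’ could look like. It is purely illustrative: the `NotebookEntry` record, the `missing_fields` check and the `to_deposit_metadata` function are hypothetical names, not the schema or API of any actual ELN or repository. The field names simply mirror the metadata the RDSS says it edits (titles, authors, keywords and descriptions).

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List

# Descriptive fields a deposit is assumed to need (taken from the RDSS list).
REQUIRED_FIELDS = ("title", "authors", "keywords", "description")


@dataclass
class NotebookEntry:
    """Hypothetical ELN record; not the schema of any real product."""
    title: str
    authors: List[str]
    keywords: List[str]
    description: str
    files: List[str] = field(default_factory=list)


def missing_fields(entry: NotebookEntry) -> List[str]:
    """Return the required fields that the researcher left empty."""
    return [name for name in REQUIRED_FIELDS if not getattr(entry, name)]


def to_deposit_metadata(entry: NotebookEntry) -> Dict[str, object]:
    """Turn a complete entry into a plain dict ready to send to a repository."""
    gaps = missing_fields(entry)
    if gaps:
        raise ValueError(f"entry is missing: {', '.join(gaps)}")
    return asdict(entry)


# Made-up example entry, filled in by the researcher during the experiment.
entry = NotebookEntry(
    title="Growth curves, strain A",
    authors=["A. Researcher"],
    keywords=["growth curve"],
    description="Optical density readings over 24 hours.",
    files=["od600.csv"],
)
metadata = to_deposit_metadata(entry)
print(sorted(metadata))
```

Because the descriptive fields are filled in while the work is being done, preparing the deposit becomes a mechanical export rather than a separate curation pass.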
Benefits to universities and researchers: maintaining control over data and low cost
Universities and researchers will benefit in two ways from adoption of ELNs and their use as data curation tools in conjunction with data repositories. First, it will enable them to maintain control over their research data, and not risk losing control of it in the way that they did with publications.
Second, the productivity and increase in data availability enabled by adoption of ELNs can be delivered at very little additional cost. When adopted by an entire university, the cost per researcher per year of an ELN is in the $10 to $20 range. This is modest in comparison with the RDSS, which at $340 per dataset and five to six datasets per publication would cost roughly $1,700 to $2,040 per publication. The RDSS will bring none of the productivity gains ELNs can deliver and, on the contrary, will further entrench current inefficient manual data collection practices.
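The comparison can be checked with a little arithmetic. The sketch below uses only the figures quoted in this post ($340 per dataset, five to six datasets per publication, and $10 to $20 per researcher per year for an ELN); the function names are illustrative. Note that the two prices are in different units (per publication versus per researcher-year), so the result is a rough sense of scale, not a like-for-like budget.

```python
RDSS_COST_PER_DATASET = 340              # USD, as reported for the RDSS
DATASETS_PER_PUBLICATION = (5, 6)        # assumption used in this post
ELN_COST_PER_RESEARCHER_YEAR = (10, 20)  # USD, university-wide adoption


def rdss_cost_per_publication(datasets: int) -> int:
    """Total RDSS fee for one publication with the given number of datasets."""
    return RDSS_COST_PER_DATASET * datasets


def eln_researcher_years(budget: int, eln_cost_per_year: int) -> int:
    """Whole researcher-years of ELN use that the same budget would buy."""
    return budget // eln_cost_per_year


low = rdss_cost_per_publication(DATASETS_PER_PUBLICATION[0])
high = rdss_cost_per_publication(DATASETS_PER_PUBLICATION[1])
print(f"RDSS per publication: ${low} to ${high}")

# Most conservative case: cheapest publication, most expensive ELN.
fewest = eln_researcher_years(low, ELN_COST_PER_RESEARCHER_YEAR[1])
# Most favourable case: dearest publication, cheapest ELN.
most = eln_researcher_years(high, ELN_COST_PER_RESEARCHER_YEAR[0])
print(f"Equivalent ELN use: {fewest} to {most} researcher-years")
```

On these figures, the RDSS fee for a single publication is comparable to somewhere between 85 and 204 researcher-years of ELN use.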
What about disciplines other than life sciences?
A legitimate objection often raised by the research data management community is that ELNs are not the solution for everyone: they aren’t used in the humanities, most social sciences, and many engineering disciplines. This is absolutely true, but it would be odd to conclude that ELNs should therefore not be adopted. First, ELNs solve an important problem for a large group of researchers that produces a huge amount of important research data. Second, rather than seeing a glass that is half empty, surely it is more productive to encourage greater adoption of ELNs, enabling them to serve as a model that can inspire comparable solutions for other disciplines. This is in fact already happening. For example, Dataverse is working on support for making ‘big data’ produced using computational methods available for query from the Dataverse repository, thereby adding another category of data that can be dealt with by the repository. In addition, consideration is being given to building an ‘ELN for the humanities’ around the International Image Interoperability Framework.
Next: optimizing ELNs and data repositories for data capture and presentation
In a second, follow-up post I will discuss how both ELNs and data repositories need to be designed to optimize the benefits, described above, of using ELNs as a replacement for manual data curation. In a third and final post I will focus on how data curators can use ELNs when working with researchers to prepare datasets for deposit in repositories.