FAIR sharing of health data: a systematic review of applicable solutions

3.1 Categorization of the results

The 35 reviewed solutions are heterogeneous in their features, but they can be broadly divided into two main groups, plus a small hybrid category combining aspects of both.

Firstly, the solutions implemented as a single global instance (referred to as "Centralized" in Table 1, n = 22) are online platforms, such as EGA and GDC, that can be used after a simple registration or, in some cases, an approval process. They aggregate data in a single place, through a centralized network, enabling researchers to easily work on and publish their data. These platforms are usually not customizable but are straightforward to use. Some are suited for project management (n = 7/22) and even provide analysis tools (n = 7/22), but most are meant only to be used as publishing and archiving platforms.

Secondly, the solutions that can be distributed across institutions (referred to as "Independent" in Table 1, n = 10) are software instances, managed by the researchers' organizations, with a few customizable options such as metadata standards, access controls, or member permissions. Data are maintained on an independent network (n = 8/10) but not necessarily on-premises, since storage can sometimes be provided as a service (n = 5/10). They implement tools for data and project management (n = 9/10) but can also be used as publishing platforms. Lastly, three solutions are referred to as "Federated" in Table 1. Their specificity is that the data are managed by the owning institutions (as in an independent network) but are made available for query through one global interface rather than through separate instances (as in a centralized network).

3.2 Evaluation

3.2.1 Findability

Findability was almost always satisfied (see Table 2). Digital Object Identifier (DOI) implementation is becoming increasingly popular, and most solutions have a search tool allowing users to easily browse the data and metadata. A DOI is particularly relevant for Findability because it identifies a dataset uniquely and persistently, while also making it resolvable from any web browser. Moreover, all the solutions presented here were selected after verifying that public data were in practice findable by anyone on any web browser, since we excluded 'private' solutions during the selection process (see Fig. 1). Some solutions, such as CyVerse Data Store [31], DataFed [33, 34], IDA [62, 63], and XNAT for data sharing [81], require an account to be explored, but account creation is free and open in all cases. Data on IDA are especially hard to find even with an account, because they are organized into projects with little to no description of the data, even though external links to the projects' websites are provided.
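As a concrete illustration of why DOIs support Findability, the sketch below resolves a dataset DOI through the public doi.org resolver; the DOI itself is a hypothetical placeholder, not a real dataset.

```python
# Minimal sketch: resolving a dataset DOI via the public doi.org resolver.
import requests

doi = "10.1234/example-dataset"  # hypothetical placeholder DOI
resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)

# The resolver redirects to the landing page registered for this DOI,
# which is what makes a DOI-identified dataset findable from any browser.
print(resp.status_code)  # 200 if the DOI resolves to a live landing page
print(resp.url)          # final URL of the dataset's landing page
```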

3.2.2 Accessibility

Accessibility was satisfied in almost all cases (see Table 2). Indeed, the data are usually fully open, or controlled with the possibility to submit an access request, or accompanied by enough information to contact the owner by other means (see Supplementary Table S3). The access requests can differ greatly from one solution to another (see Supplementary Table S3). The simplest are the equivalent of sending a message to the owner through the platform; the most demanding require researchers to upload a study proposal, which is then reviewed by a committee. In all cases, these protocols are "open, free, and universally implementable" (criterion A1.2 of the FAIR principles [11]). It is worth noting that, apart from OpenNeuro, no solution explicitly states that metadata would remain accessible even if the data were no longer available (criterion A2 of the FAIR principles [11]). However, many solutions suggest that they can be used to display metadata while the actual data are stored elsewhere and linked by a persistent and unique identifier. Digital Commons is the most limited of all solutions because, in practice, it hosts articles rather than datasets, even though it can theoretically do both. Therefore, we deemed the data underlying these articles to be poorly Accessible.

3.2.3 Interoperability

Interoperability is the main limiting factor for FAIR compliance (see Table 2). On the one hand, general-purpose platforms can store any type of data but do not implement all the appropriate vocabularies and formats necessary to curate, standardize, and visualize specialized data. For instance, regarding the figshare repository, we could not find information indicating that community standards or vocabularies are suggested to uploaders. On the other hand, specialized platforms provide all the tools necessary to share data associated with a particular research community, e.g., genomics or neuroimaging, and may have well-curated metadata schemas and standards specifically designed to handle that type of data; however, they inherently lack the ability to support data outside the scope of their field. XNAT Central scores poorly in terms of Interoperability: it is a neuroimaging repository whose data are not checked or curated upon upload, and in practice it contains many empty or poorly described datasets.

3.2.4 Reusability

Reusability was satisfied most of the time (see Table 2). Many solutions implement popular metadata standards such as Dublin Core or DataCite, enabling metadata to be both rich and well structured. However, the main drawback of these general metadata standards is that most solutions require only a few fields to be filled in, leading to datasets described with minimal information: title, author, contact, and a brief description. XNAT Central is a good illustration of this situation, since it is an open, non-curated repository: it contains non-standardized and poorly described datasets, which prevents the data from being Reusable. Some solutions such as EGA, GDC, and Dryad, however, implement a review process upon data submission to ensure compliance with standards.
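To make the problem concrete, the snippet below shows what a record limited to DataCite-style mandatory fields might look like; the values are invented for illustration only.

```python
# Illustrative only: a dataset described with just the handful of
# mandatory DataCite-style fields, as often seen on general-purpose
# repositories. All values below are invented placeholders.
minimal_record = {
    "identifier": "10.1234/example-dataset",   # placeholder DOI
    "creator": "Doe, Jane",
    "title": "Cohort blood pressure measurements",
    "publisher": "Example Repository",
    "publicationYear": 2022,
    "resourceType": "Dataset",
}
# Nothing above says how the data were collected, which variables the
# files contain, or under which license they may be reused -- hence the
# Reusability gap discussed in the text.
```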

3.2.5 Ease of use and implementation

This evaluation results from the various specificities of each solution, such as access protocols, conditions for data deposit, and more generally all the characteristics described in Table 1 and Supplementary Tables S3 and S4. Online platforms are mostly straightforward to use (n = 17/22 "researcher" cells colored yellow or green in Table 2). A simple registration is usually enough to use all the functionalities of the platform. Only dbGaP and GDC are time-consuming, because they host sensitive data that have not been anonymized and thus implement a long protocol of submissions and reviews. EGA implements a lighter version of such a protocol.

Software instances, on the other hand, usually require institutions to spend time on installation and customization (n = 8/10 "institution" cells colored orange in Table 2). In return, data submitters have the comfort of working with locally managed data.

3.3 IPD policy

We found that only slightly more than half of the solutions mention anonymization considerations (n = 19/35) (see Supplementary Table S3). Almost all of them require data to be anonymized before upload, even when access controls are provided. Vivli and IDA are the only platforms offering help with anonymization. Some solutions, such as EGA, dbGaP, and GDC, accept sensitive data, but they also have much more advanced access protocols, ensuring compliance with data sharing legislation. In fact, these platforms are directly connected to the bodies responsible for this legislation: dbGaP and GDC are funded and administered by the NIH, and EGA is part of the ELIXIR consortium, which is partly funded by the European Commission.

3.4 Examples

In this section, we present in more detail two solutions that illustrate the previously identified categories: Vivli for the online platforms, which are implemented as a single instance, and Dataverse for the software instances, which are distributed across institutions (see Section 3.1). These solutions were chosen because they present enough differences to be representative of the landscape of available solutions. They also provide a large amount of relevant, up-to-date documentation and are well established in their respective communities.

3.4.1 Dataverse

Dataverse is an open-source web application to share, preserve, cite, and explore research data [36]. The underlying software must be installed and configured by the institution. It then constitutes a Dataverse repository, which can host multiple virtual archives called Dataverse collections, which in turn contain datasets consisting of files and metadata. figshare [56] and B2SHARE for institutions [23, 24], as well as CyVerse [30, 32], Digital Commons [43], and XNAT for data sharing [81], also provide similar institutional repositories, with variability regarding maintenance, storage, and cost. Researchers of the institution can create Dataverse collections to deposit data featured in a project or in a published article.
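This hierarchy (repository, collections, datasets, files plus metadata) can be summarized as a simple data model; the sketch below is purely illustrative and does not mirror Dataverse's internal classes.

```python
# Purely illustrative model of the hierarchy described above:
# repository -> collections -> datasets -> files + metadata.
from dataclasses import dataclass, field

@dataclass
class DataFile:
    name: str

@dataclass
class Dataset:
    doi: str                      # persistent identifier
    metadata: dict                # bundled with the files (cf. F3)
    files: list[DataFile] = field(default_factory=list)

@dataclass
class DataverseCollection:
    name: str
    datasets: list[Dataset] = field(default_factory=list)

@dataclass
class DataverseRepository:        # one installation per institution
    collections: list[DataverseCollection] = field(default_factory=list)
```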

The FAIR principles are explicitly mentioned as the first feature of the software [85], and Dataverse is cited as a viable tool for data sharing in the article that introduced the FAIR principles [11]. The following information was extracted from the articles screened above and the corresponding website [35,36,37, 86].

Findability

The structure of Dataverse instances systematically guarantees Findability. Dataverse uses DOIs as well as Universal Numeric Fingerprints (UNFs), which are globally unique and persistent identifiers (F1, the first Findability criterion). These identifiers are registered in the metadata, which cannot be separated from the data themselves, as they are bundled together in a single entity called a dataset (F3). These datasets are contained in collections and can be searched and accessed through the Dataverse instance (F4). The search tool itself can be easily integrated into an institutional website. It is, however, the choice of that institution to make the Dataverse instance searchable by everyone or to keep it private.
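As an illustration, a public Dataverse instance can be queried through its documented Search API; the sketch below runs against Harvard's public installation and assumes network access, and the query string is arbitrary.

```python
# Sketch: querying a public Dataverse instance through its Search API.
import requests

BASE = "https://dataverse.harvard.edu"  # public Dataverse installation
resp = requests.get(
    f"{BASE}/api/search",
    params={"q": "blood pressure", "type": "dataset", "per_page": 5},
    timeout=10,
)
resp.raise_for_status()

for item in resp.json()["data"]["items"]:
    # "global_id" holds the persistent identifier (F1),
    # e.g. "doi:10.7910/DVN/...".
    print(item["global_id"], "-", item["name"])
```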

Accessibility

As with the search tool, the rest of the Dataverse infrastructure can be integrated into a website. This allows all interactions (e.g., data access, submission, requests) to take place in a single interface and guarantees that the data owner can be contacted.

The retrieval of (meta)data from a Dataverse collection or dataset depends on the level of protection the data owner has chosen. If the data are in open access, one can simply download them in a few clicks. If access to the data is controlled, it is necessary to authenticate oneself (A1.2) and submit an access request through the Dataverse interface. The data owner can then grant or deny access to their data. In every case, this is a free, open, and universally implementable protocol (A1.1). Once again, it is possible for the owner to make the data totally private, or hidden, to all users. It is not mentioned that metadata remain accessible "even when the data are no longer available" (A2). However, it is possible to create empty datasets, which means that metadata can be hosted on Dataverse without any uploaded data. This helps satisfy criterion A2, although not explicitly.
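These two retrieval paths can be sketched against Dataverse's documented data-access endpoint; the file identifier and API token below are placeholders, and the instance URL is Harvard's public installation.

```python
# Hedged sketch of the two access paths described above, using the
# documented Dataverse data-access endpoint.
import requests

BASE = "https://dataverse.harvard.edu"
FILE_ID = 123456          # hypothetical database id of a data file
API_TOKEN = "xxxx-xxxx"   # placeholder; issued after free account creation (A1.2)

# Open-access file: no credentials needed.
open_resp = requests.get(f"{BASE}/api/access/datafile/{FILE_ID}", timeout=10)

# Restricted file: the same endpoint, authenticated with an API token;
# the download succeeds only once the owner has granted the request.
auth_resp = requests.get(
    f"{BASE}/api/access/datafile/{FILE_ID}",
    headers={"X-Dataverse-key": API_TOKEN},
    timeout=10,
)
print(open_resp.status_code, auth_resp.status_code)
```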

Interoperability

The evaluation of Interoperability was less clear-cut. As Dataverse is not a specialized software, it provides neither community guidelines nor curation of data, even though it advises uploaders to follow established vocabularies and good practices. It is the responsibility of the institution to make sure the uploaded data are interoperable. Nonetheless, Dataverse provides tools that help address these challenges, such as customizable metadata schemas and some community schemas and ontologies: the Data Documentation Initiative (DDI) for social and health sciences, DATS for life sciences, and the Gene Ontology for molecular biology and genetics. Moreover, metadata can contain references to other data (I3), such as a scientific publication or any website.
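For example, a dataset's metadata can be exported in several community formats through Dataverse's documented export endpoint; in the sketch below the persistent identifier is a placeholder, and the exporter name "ddi" is taken from the Dataverse native API guide.

```python
# Sketch: exporting a dataset's metadata in a community format (DDI)
# via the documented Dataverse export endpoint.
import requests

BASE = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/EXAMPLE"   # placeholder persistent identifier

resp = requests.get(
    f"{BASE}/api/datasets/export",
    params={"exporter": "ddi", "persistentId": DOI},
    timeout=10,
)
# The same record can also be exported as schema.org JSON-LD, Dublin
# Core, or DataCite XML, which is what makes the metadata usable
# beyond the Dataverse ecosystem.
print(resp.text[:500])
```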

Reusability

Dataverse implements rich metadata schemas, such as Schema.org, DataCite, and Dublin Core. License and terms of use are available in the metadata (R1.1), and detailed provenance can be provided (R1.2). Once again, some metadata fields are mandatory, but it is the responsibility of researchers to fill in the additional fields necessary to make their data Reusable. Dataverse provides the infrastructure for this but cannot guarantee it.

NB: for software instances, it is always the responsibility of the institution to decide the level of Findability and Accessibility. This evaluation of the FAIR criteria is based solely on the possibilities offered by the software, not on the practical choices made by its users.

Dataverse does not explicitly mention the sharing of IPD. Nonetheless, all data are stored on-premises by the managing institution, and Dataverse collections can contain metadata alone when the data files are too sensitive to be shared. This is a way to ensure the Findability of the data while respecting data sharing legislation. Additionally, Dataverse is not a specialized repository, which means that every file format is accepted; however, only a few can be previewed, e.g., images, PDF, text, video, tabular data, and other basic formats. To better understand the concrete implementation of Dataverse, one can browse one of the 93 installations [86] at the time of writing (12 December 2022) or try Dataverse Harvard [38], a free, public instance of the software.

3.4.2 Vivli

Vivli [77, 78] is an online repository hosting anonymized clinical data. Anyone can search for clinical studies on Vivli; however, access and upload of data are controlled by various protocols and agreements. These protocols are similar to the ones implemented by EGA [51], dbGaP [39], and GDC [58, 59] for individually identifiable data, although only to some extent, since the data on Vivli are systematically anonymized. Moreover, Vivli offers help with the anonymization of datasets before submission, ensuring the sharing of IPD in a secure manner. At the time of writing (12 December 2022), Vivli hosted 6907 studies, and 621 data requests had been submitted.

Findability

Vivli implements automatic DOI attribution (FAIR criterion F1) and open visualization of the studies through a search tool (F4). A study contains data files, metadata, and a description of the aim of the study, and it is registered on ClinicalTrials.gov. The DOI is linked to the study, which cannot be separated from the metadata and the actual data files (F3).
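Since every Vivli study is registered on ClinicalTrials.gov, its registry record can be cross-referenced programmatically. The sketch below assumes the current public ClinicalTrials.gov REST API (v2, which postdates this review); the NCT number is a placeholder.

```python
# Sketch: looking up a study's registry record on ClinicalTrials.gov.
import requests

nct_id = "NCT01234567"  # hypothetical registry identifier
resp = requests.get(
    f"https://clinicaltrials.gov/api/v2/studies/{nct_id}", timeout=10
)
resp.raise_for_status()

study = resp.json()
# The registry record supplies provenance-style context (sponsor,
# conditions, design) that complements the Vivli metadata.
print(study["protocolSection"]["identificationModule"]["briefTitle"])
```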

Accessibility

The access protocol, although time-consuming, is very clear and open (A1.1). Anyone can submit an access request after creating a free account, which allows for authentication (A1.2). The request is approved or refused by Vivli within 3 business days. Some data can also be public, depending on the choices made by the data contributor.

Additionally, metadata are always public and available, even if no data files have been uploaded, although it is not explicitly stated what would happen if data files were deleted (A2).

Interoperability and reusability

Vivli encourages researchers to use rich metadata, dictionaries, and ontologies (e.g., the Cochrane ontology) (I1, I2, R1.3). It also reviews all uploaded data, which improves Interoperability and Reusability. When requesting data from a study, it is necessary to sign a Data Use Agreement (DUA) with clear terms of use and license (R1.1). All studies are richly described and give provenance information (R1.2), and additional information is available on ClinicalTrials.gov.
