Convert-Pheno: A software toolkit for the interconversion of standard data models for phenotypic data

Structured storage of phenotypic and clinical data (“pheno-clinical”), which encompasses the organized collection of information pertaining to observable characteristics and medical aspects of individuals, plays a vital role in advancing biomedical research and enhancing patient outcomes [1]. However, pheno-clinical data are often scattered across sources such as Excel files, PDFs, text files, databases, REDCap projects, and electronic health records (EHRs), resulting in a fragmented information landscape [2]. Furthermore, many of these sources use proprietary formats, requiring users to export or dump data to gain full access [3]. Each research center often develops its own approach to meet its goals within specific constraints, and the resulting lack of standardization makes cross-center studies challenging. For instance, one center may use the variable name “sex” and values “M” and “F” to define biological sex, while another may use “sexo” with values “Hombre” and “Mujer”, leading to errors and inconsistencies that increase with the complexity of the data. Moreover, without enforcement of a standardized vocabulary, human error and typos can go undetected.
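The “sex”/“sexo” mismatch described above can be made concrete with a minimal sketch. The lookup tables and the `harmonize` function are hypothetical illustrations, not part of any existing tool or data dictionary; they only show how a controlled vocabulary turns a silent inconsistency into an explicit error.

```python
# Hypothetical per-center synonyms (illustrative only, not from a real data dictionary).
VARIABLE_SYNONYMS = {"sex": "sex", "sexo": "sex"}
VALUE_SYNONYMS = {
    "sex": {"m": "male", "hombre": "male", "f": "female", "mujer": "female"},
}


def harmonize(record: dict) -> dict:
    """Map center-specific variable names and values to one canonical form."""
    harmonized = {}
    for name, value in record.items():
        key = name.strip().lower()
        if key not in VARIABLE_SYNONYMS:
            raise KeyError(f"unmapped variable: {name!r}")
        canonical = VARIABLE_SYNONYMS[key]
        allowed = VALUE_SYNONYMS[canonical]
        normalized = str(value).strip().lower()
        if normalized not in allowed:
            # A typo such as "Hombe" is rejected instead of being stored as-is.
            raise ValueError(f"unmapped value for {canonical!r}: {value!r}")
        harmonized[canonical] = allowed[normalized]
    return harmonized


center_a = harmonize({"sex": "M"})
center_b = harmonize({"sexo": "Hombre"})
```

Both records end up with the same canonical key and value, whereas a misspelled value raises an error rather than propagating into downstream analyses.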

Enforcing health data standards is a practical approach to address the issue of inconsistent phenotypic data storage [1], [4], [5], [6], [7], [8], [9]. These standards provide guidelines or specifications for collecting, processing, transmitting, and maintaining health-related information in a consistent and interoperable manner. Typically, a standard includes a data model defined by a schema that organizes and relates data elements [10]. Note that the terms “data model” and “standard” are often used interchangeably due to overlapping definitions. The scientific community has developed several standard data models for clinical information, such as OMOP-CDM [11], HL7-FHIR [12] for healthcare interoperability, DICOM [13] for medical imaging, and CDISC [14] for clinical research. Here we focus on text-based data models of clinical description and leave aside those for imaging data.

Standards can be viewed as a “higher” level of data harmonization, whereas ontologies play a critical role at a “lower” level by providing the vocabulary for expressing values within the data model. In particular, ontologies provide a formalized conceptualization of knowledge that specifies the concepts and relationships within a particular domain [15]. Examples of health data ontologies include Gene Ontology (GO) for molecular functions [16], Human Phenotype Ontology (HPO) for phenotypic abnormalities [17], Logical Observation Identifiers Names and Codes (LOINC) for identifying medical laboratory observations [18], and the National Cancer Institute Thesaurus (NCIt) for general-purpose biomedical coding (https://ncit.nci.nih.gov). In summary, standardizing pheno-clinical data is essential to ensure consistency and accuracy, which in turn facilitates data sharing and integration across research groups and platforms, accelerates scientific progress, and maximizes the potential impact of research findings.
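The two levels can be illustrated with a small sketch in which the data model fixes the structure of a record and an ontology supplies the permitted values. The class and function names are hypothetical; the NCIt identifiers shown for “Male” and “Female” are the ones commonly used in public examples and should be checked against the NCIt browser before reuse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyTerm:
    """A value drawn from an ontology: a CURIE identifier plus a human-readable label."""
    id: str
    label: str


@dataclass(frozen=True)
class Individual:
    """Toy data model: the schema states which fields exist and what type they hold."""
    id: str
    sex: OntologyTerm


# Assumed NCIt codes for illustration; verify before relying on them.
SEX_TERMS = {
    "male": OntologyTerm(id="NCIT:C20197", label="Male"),
    "female": OntologyTerm(id="NCIT:C16576", label="Female"),
}


def make_individual(individual_id: str, sex: str) -> Individual:
    """Build a record whose 'sex' value is restricted to the ontology terms above."""
    try:
        term = SEX_TERMS[sex.strip().lower()]
    except KeyError:
        raise ValueError(f"no ontology term for sex={sex!r}") from None
    return Individual(id=individual_id, sex=term)


patient = make_individual("P001", "Female")
```

Here the schema (the “higher” level) guarantees that every individual has a `sex` field, while the ontology (the “lower” level) guarantees that its value is an unambiguous, shared term.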

Although standardizing data is a positive step forward, research groups use different standards or coding schemes, creating barriers to data integration and sharing [1], [6]. In addition, the complexity of the data models makes it challenging for non-experts to perform one-to-one mapping of variables. To address this challenge, we have developed Convert-Pheno, an open-source software toolkit that facilitates the interconversion of common data models for phenotypic data without altering the underlying ontologies, except to fill in missing terms in required fields. It accepts various input formats, including Beacon v2 Models [19], CDISC-ODM [14], [19], OMOP-CDM (https://www.ohdsi.org), Phenopackets v2 [21], and REDCap projects [22]. By mapping all input data to the Beacon v2 Models [19] as a pivot model, Convert-Pheno can generate both Beacon v2 and Phenopackets v2 as output formats while retaining ontological integrity. This approach improves compatibility and interoperability among data systems, and support for additional formats is planned for future releases.
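The pivot-model design described above can be sketched as follows. This is not Convert-Pheno's code or API: every function name, the simplified record layouts, and the value mappings are assumptions made for illustration (the OMOP gender concept IDs 8507 and 8532 and the NCIt codes are believed correct but should be verified). The point is that each input format needs only one reader into the pivot and each output format only one writer out of it, so adding a format does not require a converter for every source and target pair.

```python
# Assumed ontology terms used by the toy pivot record (Beacon-like "individual").
MALE = {"id": "NCIT:C20197", "label": "Male"}
FEMALE = {"id": "NCIT:C16576", "label": "Female"}


def redcap_to_pivot(row: dict) -> dict:
    """Hypothetical REDCap export row with a free-coded 'sex' field."""
    sex = {"M": MALE, "F": FEMALE}[row["sex"]]
    return {"id": str(row["record_id"]), "sex": sex}


def omop_to_pivot(person: dict) -> dict:
    """Simplified OMOP PERSON row; 8507 and 8532 are assumed gender concept IDs."""
    sex = {8507: MALE, 8532: FEMALE}[person["gender_concept_id"]]
    return {"id": str(person["person_id"]), "sex": sex}


def pivot_to_beacon(pivot: dict) -> dict:
    """The pivot already follows the Beacon-like layout, so this is a plain copy."""
    return dict(pivot)


def pivot_to_phenopacket(pivot: dict) -> dict:
    """Heavily simplified Phenopackets-style output with an enumerated sex value."""
    return {
        "id": f"phenopacket_{pivot['id']}",
        "subject": {"id": pivot["id"], "sex": pivot["sex"]["label"].upper()},
    }


READERS = {"redcap": redcap_to_pivot, "omop": omop_to_pivot}
WRITERS = {"bff": pivot_to_beacon, "pxf": pivot_to_phenopacket}


def convert(data: dict, source: str, target: str) -> dict:
    """Route any supported input through the pivot to any supported output."""
    pivot = READERS[source](data)
    return WRITERS[target](pivot)


from_redcap = convert({"record_id": 1, "sex": "M"}, "redcap", "pxf")
from_omop = convert({"person_id": 1, "gender_concept_id": 8507}, "omop", "pxf")
```

With two readers and two writers, four conversions are available; supporting a further input format would mean writing one new reader only.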
