Working with synthetic biomedical data

When developing data analysis methods, one major challenge is getting access to realistic examples that make analyses robust and reproducible. For simple sanity checks, random number generators (with appropriate constraints on range, frequency, etc.) or Fisher’s Iris dataset may be enough. But if you need input in a highly specific format with realistic biological variation, as with biomedical data or electronic health records, simple random number generators are no longer sufficient.

Generating synthetic genomic data

While there are a number of public repositories for real genomic data, e.g. the Sequence Read Archive (SRA), sometimes it’s more convenient to generate your own sequencing data directly. For that, dwgsim provides a flexible way to generate sequencing reads that can be used as input for read-level tools (e.g. FastQC, fastp) or entire genomic pipelines.
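
To give a feel for what a read simulator produces, here is a toy sketch in Python. It is not dwgsim's algorithm or interface: the function names, the uniform substitution error model and the constant quality string are all assumptions made for illustration, and a real simulator models far more (indels, paired ends, realistic quality profiles).

```python
import random

BASES = "ACGT"


def random_reference(length, rng):
    """Return a random DNA string standing in for a reference genome."""
    return "".join(rng.choice(BASES) for _ in range(length))


def simulate_reads(reference, n_reads, read_len, error_rate, rng):
    """Sample fixed-length reads uniformly from the reference.

    Each base is replaced by a different base with probability error_rate
    (a deliberately simplistic, hypothetical error model).
    """
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = []
        for base in reference[start:start + read_len]:
            if rng.random() < error_rate:
                # Substitute with one of the three other bases.
                base = rng.choice([b for b in BASES if b != base])
            bases.append(base)
        reads.append((f"read_{i}_pos_{start}", "".join(bases)))
    return reads


def to_fastq(reads, quality_char="I"):
    """Format reads as FASTQ records with a constant placeholder quality."""
    lines = []
    for name, seq in reads:
        lines += [f"@{name}", seq, "+", quality_char * len(seq)]
    return "\n".join(lines) + "\n"


rng = random.Random(42)  # fixed seed so repeated runs give the same reads
reference = random_reference(1000, rng)
reads = simulate_reads(reference, n_reads=5, read_len=50, error_rate=0.01, rng=rng)
fastq_text = to_fastq(reads)
print(fastq_text)
```

The resulting text follows the four-line FASTQ layout (header, sequence, separator, qualities), which is the kind of input read-level tools expect.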

Generating synthetic observational data

The Observational Medical Outcomes Partnership (OMOP) provides a Common Data Model (CDM) for observational data, which standardizes the table structure so that compatible SQL tables can be created. OMOP is widely used, but due to obvious privacy concerns, OMOP datasets from real subjects usually aren’t publicly accessible.

However, synthetic data can be downloaded in bulk from MITRE, including general, disease-specific and geography-specific datasets. These were generated using Synthea, which creates synthetic patients according to a number of parameters. Synthea outputs its data in multiple formats, from FHIR JSON to plain CSV. There are even open-source tools such as ETL-Synthea to import these CSVs into the OMOP CDM.
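
As a rough idea of what such an import involves, the sketch below maps a few columns of a Synthea-style patients CSV onto fields of the OMOP person table. This is not how ETL-Synthea is implemented: the sample rows are invented, only three input columns are used, the helper name to_omop_person is hypothetical, and assigning person_id by row order is a simplification.

```python
import csv
import io

# A tiny stand-in for a Synthea patients CSV (invented values, few columns).
PATIENTS_CSV = """Id,BIRTHDATE,GENDER
a1,1980-05-17,F
b2,1975-11-02,M
"""

# OMOP standard gender concepts: 8532 = FEMALE, 8507 = MALE.
GENDER_CONCEPTS = {"F": 8532, "M": 8507}


def to_omop_person(rows):
    """Map Synthea-style patient rows to dicts shaped like OMOP person rows."""
    persons = []
    for person_id, row in enumerate(rows, start=1):
        year, month, day = (int(part) for part in row["BIRTHDATE"].split("-"))
        persons.append({
            "person_id": person_id,  # simplification: sequential IDs
            # 0 is the OMOP convention for "no matching concept".
            "gender_concept_id": GENDER_CONCEPTS.get(row["GENDER"], 0),
            "year_of_birth": year,
            "month_of_birth": month,
            "day_of_birth": day,
            "person_source_value": row["Id"],
            "gender_source_value": row["GENDER"],
        })
    return persons


persons = to_omop_person(csv.DictReader(io.StringIO(PATIENTS_CSV)))
for person in persons:
    print(person)
```

A real ETL also has to handle race, ethnicity, locations, visits, conditions and vocabulary lookups, which is why a dedicated tool is worth using.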

Processing synthetic biomedical data

While many analysis tasks operate on tabular data, data often comes in other formats, such as JSON for FHIR. This necessitates an additional data processing step, though dedicated tools exist for both R (fhircrackr or BiocFHIR) and Python (fhirpack and fhiry).
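
To illustrate the kind of processing step involved, here is a minimal sketch that flattens the Patient resources of a FHIR Bundle into table-like rows using only Python's standard library. The bundle content is invented, the function name patients_table is hypothetical, and this does not reflect how fhirpack or fhiry work internally.

```python
import json

# A small, invented FHIR Bundle containing two Patients and one Observation.
BUNDLE_JSON = """
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1",
                  "gender": "female", "birthDate": "1980-05-17"}},
    {"resource": {"resourceType": "Observation", "id": "o1", "status": "final",
                  "subject": {"reference": "Patient/p1"},
                  "code": {"text": "Body weight"},
                  "valueQuantity": {"value": 70.5, "unit": "kg"}}},
    {"resource": {"resourceType": "Patient", "id": "p2", "gender": "male"}}
  ]
}
"""


def patients_table(bundle):
    """Return one flat dict per Patient resource in the bundle."""
    rows = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Patient":
            continue  # skip Observations and any other resource types
        rows.append({
            "id": resource.get("id"),
            "gender": resource.get("gender"),
            # Most FHIR elements are optional, so missing values become None.
            "birth_date": resource.get("birthDate"),
        })
    return rows


bundle = json.loads(BUNDLE_JSON)
patients = patients_table(bundle)
for row in patients:
    print(row)
```

Nested and repeating elements (names, addresses, codings) are where hand-written flattening gets tedious, which is the main argument for the dedicated packages.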

Storage and querying

Depending on the scenario, OMOP and FHIR data can cover thousands of subjects in millions of rows, so performance deserves consideration. SQLite has long been a popular choice, especially for rapid prototyping, given its cross-platform availability, low overhead and high performance. However, it has since been eclipsed by DuckDB, which offers not only superior performance but also better support for languages and data formats. In fact, if you need synthetic OMOP data and are performing analyses in R, the package CDMConnector provides a DuckDB instance containing Synthea-generated OMOP data that allows you to start working almost immediately.
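
As a small prototyping sketch, the example below builds two OMOP-style tables in an in-memory database and counts distinct persons per condition concept. It uses Python's built-in sqlite3 module so that it runs without extra installs; the rows and concept IDs are invented for illustration, and only a handful of the columns of the real person and condition_occurrence tables are included. DuckDB's Python client offers a similar connect/execute style, so a query like this should carry over with little change, though that is an assumption worth checking against your own setup.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Heavily reduced versions of two OMOP CDM tables.
con.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    gender_concept_id INTEGER,
    year_of_birth INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER,
    condition_start_date TEXT
);
""")

# Invented example rows.
con.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, 8532, 1980), (2, 8507, 1975), (3, 8532, 1990)],
)
con.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
    [
        (1, 1, 201826, "2015-03-01"),
        (2, 2, 201826, "2018-07-12"),
        (3, 2, 320128, "2019-01-20"),
        (4, 1, 201826, "2020-01-05"),  # repeat record for person 1
    ],
)

# Count distinct persons per condition concept, most common first.
rows = con.execute("""
    SELECT c.condition_concept_id, COUNT(DISTINCT c.person_id) AS n_persons
    FROM condition_occurrence AS c
    JOIN person AS p ON p.person_id = c.person_id
    GROUP BY c.condition_concept_id
    ORDER BY n_persons DESC, c.condition_concept_id
""").fetchall()

for concept_id, n_persons in rows:
    print(concept_id, n_persons)
```

Counting distinct persons rather than rows matters here, because a single subject can have many records for the same condition.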

Written on November 1, 2024