DepressionDC Data Integration for FAIR Data Sharing

Dataset Info
Published on: 2026-02-06
Variables: 1

Data Access

Data is available only upon formal request and subject to approval.

Approved users receive a secure institute account and work with the data exclusively in our Trusted Research Environment (TRE) via remote desktop.

To request data, contact us by email.

Reuse & Usage Terms
  • Data is not downloadable (TRE access only).
  • Approved users receive a personal institute account.
  • Tools available: RStudio, Jupyter, Python, Stata, etc.
  • Data resides in your TRE home directory.
  • Re-use/publication per Data Use Agreement (DUA).
  • No redistribution of the data.
Contact us for the DUA template and details.

Description

Data Integration, Cleaning, and Quality Assessment

Background

In this project, trial data from the DepressionDC study were integrated and prepared for further analysis and dashboard development. The DepressionDC study is a longitudinal randomized trial with multiple measurements at different time points. The data integration process involved systematic data understanding, preprocessing, transformation into a REDCap-compatible format, and comprehensive quality checks.

Data Understanding

To ensure accurate data handling, an initial data exploration phase was conducted:

  • Exploration of the study data dictionary to understand variable definitions and coding schemes
  • Review of published literature related to the trial to gain contextual understanding
  • Examination of raw CSV files to assess data structure, variable formats, and completeness

Data Preparation and Preprocessing

Data Structuring

  • Multiple source files were organized into a structured data list

Variable Standardization

  • Variables were renamed to align with REDCap conventions (e.g., visit → REDCap event names)
  • Column names were cleaned by removing extra spaces and inconsistencies
  • Special missing value codes such as “f.A” (“fehlende Angabe”, German for “missing information”) were converted to standard missing values (NA)
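
The standardization steps above can be sketched as follows. This is a minimal illustration in Python, not the project's own code: the function names and the example columns are hypothetical, lowercasing is an assumption (REDCap variable names are lowercase), and only the "f.A" code comes from the description.

```python
import re

# Only "f.A" is documented in the description; extend this set as needed.
MISSING_CODES = {"f.A"}

def clean_column_name(name):
    """Trim the name, turn runs of whitespace into one underscore, lowercase it."""
    return re.sub(r"\s+", "_", name.strip()).lower()

def standardize_row(row):
    """Return a copy of the row with cleaned keys and missing codes set to None."""
    cleaned = {}
    for key, value in row.items():
        if isinstance(value, str):
            value = value.strip()
            if value in MISSING_CODES:
                value = None  # None plays the role of NA here
        cleaned[clean_column_name(key)] = value
    return cleaned

# Made-up example row, not study data.
row = {" Pat ID ": "P001", "Visit": "Baseline", "Total  Score": "f.A"}
print(standardize_row(row))
# {'pat_id': 'P001', 'visit': 'Baseline', 'total_score': None}
```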

Alignment with REDCap Standards

Data Type Validation

  • Variables were validated and converted according to REDCap requirements:
    • Date variables → Date format
    • Continuous variables → Numeric
    • Discrete variables → Integer
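
A minimal sketch of such a type conversion, under stated assumptions: the column names, the day.month.year date format, and the decimal comma handling are hypothetical and not taken from the study files.

```python
from datetime import datetime

def to_date(value, fmt="%d.%m.%Y"):
    """Parse a date string; the default format is an assumption."""
    return None if value is None else datetime.strptime(value, fmt).date()

def to_numeric(value):
    """Convert a continuous value to float, accepting a decimal comma."""
    return None if value is None else float(str(value).replace(",", "."))

def to_integer(value):
    """Convert a discrete value to int, rejecting non-whole numbers."""
    if value is None:
        return None
    number = to_numeric(value)
    if not number.is_integer():
        raise ValueError(f"expected a whole number, got {value!r}")
    return int(number)

# Hypothetical type map: column name -> converter.
CONVERTERS = {"visit_date": to_date, "weight_kg": to_numeric, "n_episodes": to_integer}

def convert_row(row):
    """Apply the converter registered for each column; leave other columns as-is."""
    return {key: CONVERTERS.get(key, lambda v: v)(value) for key, value in row.items()}

print(convert_row({"visit_date": "06.02.2020", "weight_kg": "71,5", "n_episodes": "3"}))
```

Raising on a non-whole value instead of truncating it makes a wrongly typed column visible before import.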

Categorical Variable Recoding

  • Categorical variables were recoded into numerical formats based on the REDCap data dictionary
    • Example: Yes/No variables → 1/0
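
As an illustration, the recoding could be driven by the choices column of the REDCap data dictionary, which stores options as "code, label" pairs separated by "|". The sketch below is hypothetical: the field name `smoker` is invented, and it assumes all codes are integers.

```python
def parse_choices(choices):
    """Turn a REDCap choices string such as '1, Yes | 0, No' into {label: code}."""
    mapping = {}
    for item in choices.split("|"):
        code, label = item.split(",", 1)  # labels may themselves contain commas
        mapping[label.strip()] = int(code.strip())  # assumes integer codes
    return mapping

def recode(row, codebook):
    """Replace labels with numeric codes; fail loudly on labels not in the codebook."""
    out = dict(row)
    for field, mapping in codebook.items():
        value = out.get(field)
        if value is None:
            continue  # missing values stay missing
        if value not in mapping:
            raise ValueError(f"unexpected value {value!r} for field {field!r}")
        out[field] = mapping[value]
    return out

# Hypothetical field; real fields come from the study's data dictionary.
codebook = {"smoker": parse_choices("1, Yes | 0, No")}
print(recode({"patid": "P001", "smoker": "Yes"}, codebook))
# {'patid': 'P001', 'smoker': 1}
```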

Handling Longitudinal and Repeated Measures

  • Repeated measurements were identified across time points
  • Appropriate REDCap structure was applied by assigning repeat instruments and event names
  • This ensured compatibility with REDCap’s longitudinal data model
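
In REDCap's flat import format, this structure is carried by the fields `redcap_event_name`, `redcap_repeat_instrument`, and `redcap_repeat_instance`. The sketch below shows one way to fill them. The event names, the instrument name, and the input rows are hypothetical, and it assumes that all rows belong to a single repeating instrument and are already in chronological order.

```python
from collections import defaultdict

# Hypothetical mapping from visit labels to REDCap unique event names.
EVENT_NAMES = {"Screening": "screening_arm_1", "Week 6": "week_6_arm_1"}

def add_redcap_structure(rows, instrument):
    """Replace 'visit' with the REDCap event name and number repeats per patient and event."""
    counters = defaultdict(int)
    structured = []
    for row in rows:
        event = EVENT_NAMES[row["visit"]]
        counters[(row["patid"], event)] += 1
        new_row = {key: value for key, value in row.items() if key != "visit"}
        new_row["redcap_event_name"] = event
        new_row["redcap_repeat_instrument"] = instrument
        new_row["redcap_repeat_instance"] = counters[(row["patid"], event)]
        structured.append(new_row)
    return structured

rows = [
    {"patid": "P001", "visit": "Screening", "score": 20},
    {"patid": "P001", "visit": "Week 6", "score": 14},
    {"patid": "P001", "visit": "Week 6", "score": 12},
]
for row in add_redcap_structure(rows, "rating_scale"):
    print(row)
```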

Data Integration into REDCap

A custom function was developed to automate the import of REDCap-compatible datasets into the REDCap system. This ensured consistency, reproducibility, and efficiency in the data integration process.
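
The project's own import function is not reproduced on this page. As a rough Python sketch of what such an automated import can look like, the code below posts records to the REDCap API "Import Records" method. The function names, the URL, and the token are placeholders; the request parameters follow the REDCap API, but the chosen values (flat JSON, normal overwrite behavior) are assumptions.

```python
import json
import urllib.parse
import urllib.request

def build_import_payload(records, token):
    """Build the form fields for a REDCap 'Import Records' API request."""
    return {
        "token": token,
        "content": "record",
        "format": "json",
        "type": "flat",
        "overwriteBehavior": "normal",
        "data": json.dumps(records),
        "returnContent": "count",
        "returnFormat": "json",
    }

def import_records(api_url, token, records):
    """POST the records to the REDCap API and return the parsed JSON response."""
    body = urllib.parse.urlencode(build_import_payload(records, token)).encode("utf-8")
    request = urllib.request.Request(api_url, data=body, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the payload needs no network access; the token is a placeholder.
payload = build_import_payload(
    [{"patid": "P001", "redcap_event_name": "screening_arm_1"}], token="PLACEHOLDER"
)
print(payload["data"])
```

Calling `import_records` requires the API URL of a real REDCap instance and a project token with import rights, so it is deliberately not invoked here.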

Data Quality Assessment (Pre-Anonymization)

Initial quality checks were performed to ensure data integrity:

  • Use of REDCap’s built-in Data Quality Tool to identify inconsistencies
  • Verification of overall data consistency across datasets
  • Detection of special characters (e.g., “ß”) that could affect data processing
  • Comparison of missing values (NA) between original datasets and REDCap exports
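
Two of these checks, the special-character scan and the missing-value comparison, can be sketched as below. This is a hypothetical illustration, not the project's code: the rows are invented, and treating both `None` and the empty string as missing is an assumption.

```python
def find_non_ascii(rows):
    """Return (row index, column, value) for each string value with non-ASCII characters."""
    return [
        (index, column, value)
        for index, row in enumerate(rows)
        for column, value in row.items()
        if isinstance(value, str) and not value.isascii()
    ]

def is_missing(value):
    """Assumption: None and the empty string both count as missing."""
    return value is None or value == ""

def missing_counts(rows):
    """Count missing values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + int(is_missing(value))
    return counts

def compare_missing(original, exported):
    """Return {column: (count in original, count in export)} where the counts differ."""
    before, after = missing_counts(original), missing_counts(exported)
    return {
        column: (before.get(column), after.get(column))
        for column in sorted(set(before) | set(after))
        if before.get(column) != after.get(column)
    }

original = [{"patid": "P001", "note": "Straße", "score": None}]
exported = [{"patid": "P001", "note": "", "score": ""}]
print(find_non_ascii(original))             # [(0, 'note', 'Straße')]
print(compare_missing(original, exported))  # {'note': (0, 1)}
```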

Data Anonymization and Feature Derivation

To ensure compliance with data protection and sharing requirements:

  • All direct patient identifiers (e.g., date of birth, other date variables, administrative identifiers) were removed
  • Patient IDs were double-pseudonymized to ensure privacy protection
  • New derived variables were created:
    • Age at randomization
    • Age at onset of depression
    • Duration of depression
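
The description does not specify how the double pseudonymization or the age derivation was implemented. The sketch below shows one possible reading, labeled as an assumption throughout: two chained keyed hashes (HMAC-SHA256) with independent keys, and age in completed years computed before the dates are dropped. All names, keys, and dates are hypothetical.

```python
import hashlib
import hmac
from datetime import date

def pseudonymize(patid, key):
    """Keyed hash of an identifier, truncated to 16 hex characters (an assumption)."""
    return hmac.new(key, patid.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def double_pseudonymize(patid, key_1, key_2):
    """Apply two pseudonymization rounds with independent keys."""
    return pseudonymize(pseudonymize(patid, key_1), key_2)

def age_in_years(born, on):
    """Completed years between two dates."""
    before_birthday = (on.month, on.day) < (born.month, born.day)
    return on.year - born.year - int(before_birthday)

# Placeholder keys; real keys would be stored separately from the data.
pseudonym = double_pseudonymize("P001", b"first-key", b"second-key")
print(len(pseudonym))                                       # 16
print(age_in_years(date(1980, 6, 15), date(2020, 6, 14)))   # 39
```

Deriving the age first and only then removing the date of birth keeps the analytically useful information while the identifying dates never leave the source system.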

Data Quality Assessment (Post-Anonymization)

Further validation was performed after anonymization:

  • Comparison of original and REDCap-exported data using graphical methods for randomly selected variables
  • Verification of missing values across datasets
  • Calculation of descriptive statistics and comparison with published study results

  • Validation of repeated measurements in both:
    • Initial integrated dataset
    • Pseudonymized REDCap dataset
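
Because patient IDs differ between the two datasets after pseudonymization, a repeated-measures check has to be independent of the ID values. One way to do this, sketched here as an assumption rather than the method actually used, is to compare how many patients have how many rows per event. The function name and the example rows are hypothetical.

```python
from collections import Counter

def repeat_profile(rows, id_field="patid"):
    """Count patients per (event, number of rows for that patient in that event)."""
    rows_per_patient = Counter((row[id_field], row["redcap_event_name"]) for row in rows)
    return Counter((event, n) for (_, event), n in rows_per_patient.items())

integrated = [
    {"patid": "P001", "redcap_event_name": "week_6_arm_1"},
    {"patid": "P001", "redcap_event_name": "week_6_arm_1"},
    {"patid": "P002", "redcap_event_name": "week_6_arm_1"},
]
pseudonymized = [
    {"patid": "a1f3", "redcap_event_name": "week_6_arm_1"},
    {"patid": "9c0e", "redcap_event_name": "week_6_arm_1"},
    {"patid": "9c0e", "redcap_event_name": "week_6_arm_1"},
]
# The profiles match although the IDs and the row order differ.
print(repeat_profile(integrated) == repeat_profile(pseudonymized))  # True
```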

Summary

The data integration process ensured that heterogeneous longitudinal trial data were transformed into a standardized, REDCap-compatible format. Through systematic preprocessing, validation, and quality checks, a high-quality dataset suitable for analysis and dashboard visualization was established.

Available Variables (1)
Event: Screening
Coversheet
  • patid

Analysis Code
Version shown: v1 (R, Multi-file Archive); default version: v3
Created by lakshmi.batchu · 2026-02-06 16:43
Archive contents
  • data_integration.txt (documentation, 4923 bytes)
  • example.csv (data, 3272 bytes)
  • DepressionDataDictionaryStruct_DataDictionary_2025-09-20 (1).csv (data, 24852 bytes)
🧾 README
No README found in this archive.
Version History (detailed)
Version           Language  Type                Relation                          Author          Date        Status
Global v1 (R v1)  R         Multi-file Archive  Initial Implementation            lakshmi.batchu  2026-02-06  selected
Global v2 (R v2)  R         Multi-file Archive  Refinement/Bug Fix ← Global v1    lakshmi.batchu  2026-02-06
Global v3 (R v3)  R         Multi-file Archive  Refinement/Bug Fix ← Global v2    lakshmi.batchu  2026-03-16  default
Contact
Lakshmi Batchu

Project
DepressionDC