DepressionDC Data Integration for FAIR Data Sharing

Dataset Info

Published on: 2026-02-06

Data Access

Data is available only upon formal request and subject to approval.

Approved users receive a secure institute account and work with the data exclusively in our Trusted Research Environment (TRE) via remote desktop.

To request access, contact us by email.

Reuse & Usage Terms

- Data cannot be downloaded; access is through the TRE only.
- Approved users receive a personal institute account.
- Available tools include RStudio, Jupyter, Python, and Stata.
- Data resides in your TRE home directory.
- Reuse and publication are governed by the Data Use Agreement (DUA).
- Redistribution of the data is not permitted.

Description

Data Integration, Cleaning, and Quality Assessment

Background

In this project, trial data from the DepressionDC study were integrated and prepared for further analysis and dashboard development. The DepressionDC study is a longitudinal randomized trial with measurements at multiple time points. The data integration process involved systematic data understanding, preprocessing, transformation into a REDCap-compatible format, and comprehensive quality checks.

Data Understanding

To ensure accurate data handling, an initial data exploration phase was conducted:

- Exploration of the study data dictionary to understand variable definitions and coding schemes
- Review of published literature related to the trial to gain contextual understanding
- Examination of raw CSV files to assess data structure, variable formats, and completeness

Data Preparation and Preprocessing

Data Structuring

Multiple source files were organized into a structured data list.
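As an illustration of this step, the following minimal Python sketch reads every CSV file in a folder into one dictionary keyed by file name. The published analysis code is in R, so this is not the project's implementation; the function name `load_sources` and the UTF-8 encoding are assumptions made for the example.

```python
import csv
from pathlib import Path


def load_sources(folder):
    """Read every CSV file in `folder` into a dict keyed by file stem.

    Hypothetical helper: the folder layout and UTF-8 encoding are assumptions.
    """
    data_list = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # Each file becomes a list of row dicts; all values are still strings.
            data_list[path.stem] = list(csv.DictReader(handle))
    return data_list
```

Keeping all sources in one keyed structure lets later cleaning steps loop over the files uniformly.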

Variable Standardization

- Variables were renamed to align with REDCap conventions (e.g., visit → REDCap event names)
- Column names were cleaned by removing extra spaces and inconsistencies
- Special missing value codes such as “f.A” (fehlende Angabe, German for “missing information”) were converted to standard missing values (NA)
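The standardization steps above can be sketched as follows. This is an illustrative Python example, not the project's R code. Lower-casing names, joining words with underscores, treating empty strings as missing, and mapping the `visit` column onto `redcap_event_name` are assumptions; only the "f.A" code and the visit example come from the description.

```python
# "f.A" is named in the description; the empty string is an added assumption.
MISSING_CODES = {"f.A", ""}
# One reading of "visit -> REDCap event names": rename the column (assumption).
RENAME = {"visit": "redcap_event_name"}


def clean_name(name):
    """Trim the name, collapse whitespace runs to underscores, lower-case it."""
    return "_".join(name.split()).lower()


def standardize(rows):
    """Return new row dicts with cleaned names and missing codes set to None."""
    cleaned = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            name = clean_name(key)
            name = RENAME.get(name, name)
            if isinstance(value, str):
                value = value.strip()
            # None plays the role of NA in this sketch.
            new_row[name] = None if value in MISSING_CODES else value
        cleaned.append(new_row)
    return cleaned
```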

Alignment with REDCap Standards

Data Type Validation

Variables were validated and converted according to REDCap requirements:
- Date variables → Date format
- Continuous variables → Numeric
- Discrete variables → Integer
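A minimal sketch of such a conversion step is shown below. The schema dictionary, the field names, the German-style input date format (`%d.%m.%Y`), and the decimal comma handling are all assumptions for illustration; the actual source formats are not stated here.

```python
from datetime import datetime


def to_date(value, fmt="%d.%m.%Y"):
    """Parse a date string; the dd.mm.yyyy input format is an assumption."""
    return datetime.strptime(value, fmt).date()


def to_numeric(value):
    """Convert to float, accepting a decimal comma (assumption)."""
    return float(value.replace(",", "."))


def to_integer(value):
    """Convert to int and reject values with a fractional part."""
    number = float(value)
    if not number.is_integer():
        raise ValueError(f"not an integer: {value!r}")
    return int(number)


CONVERTERS = {"date": to_date, "numeric": to_numeric, "integer": to_integer}


def convert_row(row, schema):
    """Apply the converter named in `schema` to each non-missing field."""
    converted = dict(row)
    for field, kind in schema.items():
        value = row.get(field)
        if value is not None:
            converted[field] = CONVERTERS[kind](value)
    return converted
```

Raising on a non-integer value, rather than silently truncating it, surfaces coding problems before import. A parsed date can be written back as `YYYY-MM-DD` with `date.isoformat()`.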

Categorical Variable Recoding

Categorical variables were recoded to the numeric codes defined in the REDCap data dictionary.
- Example: Yes/No variables → 1/0
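One way to drive the recoding from the data dictionary is to parse the choices string of each field, which REDCap writes in the form `1, Yes | 0, No`. The sketch below assumes that format and case-insensitive label matching; the helper names are hypothetical.

```python
def parse_choices(choices):
    """Map lower-cased label -> code from a string like '1, Yes | 0, No'."""
    mapping = {}
    for part in choices.split("|"):
        code, label = part.split(",", 1)
        mapping[label.strip().lower()] = code.strip()
    return mapping


def recode(value, mapping):
    """Return the code for a label; missing stays missing, unknown labels fail."""
    if value is None:
        return None
    key = value.strip().lower()
    if key not in mapping:
        raise ValueError(f"label not in data dictionary: {value!r}")
    return mapping[key]
```

Codes are kept as strings in this sketch because REDCap choice codes are not guaranteed to be numeric.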

Handling Longitudinal and Repeated Measures

- Repeated measurements were identified across time points
- Appropriate REDCap structure was applied by assigning repeat instruments and event names
- This ensured compatibility with REDCap’s longitudinal data model
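The structure described above can be sketched as follows. The columns `redcap_event_name`, `redcap_repeat_instrument`, and `redcap_repeat_instance` are REDCap's standard fields for longitudinal and repeating data. The visit labels, event names, and instrument name in this example are hypothetical, and numbering instances in row order is an assumption.

```python
# Hypothetical mapping from source visit labels to REDCap unique event names.
EVENT_NAMES = {"Screening": "screening_arm_1", "Week 6": "week_6_arm_1"}


def add_redcap_structure(rows, instrument, id_field="patid", visit_field="visit"):
    """Attach event name, repeat instrument, and a running repeat instance."""
    counters = {}
    structured = []
    for row in rows:
        event = EVENT_NAMES[row[visit_field]]
        key = (row[id_field], event)
        # Instances are numbered 1, 2, ... per patient and event, in row order.
        counters[key] = counters.get(key, 0) + 1
        new_row = {k: v for k, v in row.items() if k != visit_field}
        new_row["redcap_event_name"] = event
        new_row["redcap_repeat_instrument"] = instrument
        new_row["redcap_repeat_instance"] = counters[key]
        structured.append(new_row)
    return structured
```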

Data Integration into REDCap

A custom function was developed to automate the import of REDCap-compatible datasets into the REDCap system. This ensured consistency, reproducibility, and efficiency in the data integration process.
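The custom function itself is part of the R archive and is not reproduced here. As a rough illustration of what an automated import can look like, the sketch below posts records to the REDCap API using the record import parameters (`content=record`, `type=flat`, `overwriteBehavior`, `returnContent`). The function names, the API URL, and the token are placeholders, and the response handling is an assumption.

```python
import json
import urllib.parse
import urllib.request


def build_import_payload(token, rows):
    """Build the form fields for a REDCap 'import records' API call."""
    return {
        "token": token,
        "content": "record",
        "format": "json",
        "type": "flat",
        "overwriteBehavior": "normal",
        "returnContent": "count",
        "returnFormat": "json",
        "data": json.dumps(rows),
    }


def import_records(api_url, token, rows, timeout=60):
    """POST the rows to the REDCap API and return the decoded JSON response.

    `api_url` and `token` are placeholders supplied by the caller.
    """
    body = urllib.parse.urlencode(build_import_payload(token, rows)).encode("utf-8")
    request = urllib.request.Request(api_url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating payload construction from the network call makes the payload easy to inspect and test without contacting a server.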

Data Quality Assessment (Pre-Anonymization)

Initial quality checks were performed to ensure data integrity:
- Use of REDCap’s built-in Data Quality Tool to identify inconsistencies
- Verification of overall data consistency across datasets
- Detection of special characters (e.g., “ß”) that could affect data processing
- Comparison of missing values (NA) between original datasets and REDCap exports
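Two of these checks, special-character detection and the missing-value comparison, lend themselves to a short script. The sketch below is illustrative only: it treats any non-ASCII character as "special" and counts both `None` and the empty string as missing, on the assumption that REDCap exports leave missing fields blank.

```python
def find_non_ascii(rows):
    """Return (row index, field, value) for string values with non-ASCII characters."""
    hits = []
    for index, row in enumerate(rows):
        for field, value in row.items():
            if isinstance(value, str) and not value.isascii():
                hits.append((index, field, value))
    return hits


def missing_counts(rows):
    """Count missing values per field; None and '' both count as missing."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (value is None or value == "")
    return counts


def compare_missing(original, exported):
    """Return {field: (original count, exported count)} where the counts differ."""
    before, after = missing_counts(original), missing_counts(exported)
    fields = sorted(set(before) | set(after))
    return {
        f: (before.get(f), after.get(f))
        for f in fields
        if before.get(f) != after.get(f)
    }
```

An empty result from `compare_missing` indicates that no field gained or lost missing values during import and export.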

Data Anonymization and Feature Derivation

To ensure compliance with data protection and sharing requirements:

- All patient identifiers (e.g., date of birth, date variables, administrative identifiers) were removed
- Patient IDs were double-pseudonymized to ensure privacy protection
- New derived variables were created:
  - Age at randomization
  - Age at onset of depression
  - Duration of depression
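The following sketch shows one way the derivation and removal steps could fit together: ages are computed from the dates first, and the dates are dropped afterwards. The field names other than `patid` are hypothetical. The description does not say how double pseudonymization was performed; the two-stage keyed hash (HMAC-SHA256) used here is a stand-in chosen for illustration, as is measuring depression duration in completed years from onset to randomization.

```python
import hashlib
import hmac

# Hypothetical names of the identifying date fields removed after derivation.
IDENTIFIERS = ("birth_date", "randomization_date", "onset_date")


def completed_years(start, end):
    """Whole years elapsed between two dates."""
    years = end.year - start.year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


def pseudonymize(patient_id, key):
    """Keyed hash of an ID; a stand-in for the project's actual method."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def double_pseudonymize(patient_id, first_key, second_key):
    """Apply two pseudonymization stages with separate secret keys (assumption)."""
    return pseudonymize(pseudonymize(patient_id, first_key), second_key)


def anonymize(row, first_key, second_key):
    """Derive age variables, drop identifying dates, and replace the patient ID."""
    birth, rand, onset = (row[name] for name in IDENTIFIERS)
    out = {k: v for k, v in row.items() if k not in IDENTIFIERS}
    out["patid"] = double_pseudonymize(row["patid"], first_key, second_key)
    out["age_at_randomization"] = completed_years(birth, rand)
    out["age_at_onset"] = completed_years(birth, onset)
    out["depression_duration_years"] = completed_years(onset, rand)
    return out
```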

Data Quality Assessment (Post-Anonymization)

Further validation was performed after anonymization:

- Comparison of original and REDCap-exported data using graphical methods for randomly selected variables
- Verification of missing values across datasets
- Calculation of descriptive statistics and comparison with published study results
- Validation of repeated measurements in both the initial integrated dataset and the pseudonymized REDCap dataset
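The descriptive and repeated-measures checks can be approximated with a few lines of Python. Because patient IDs differ after pseudonymization, this sketch compares ID-independent summaries: rows per event and the sorted number of rows per patient. The field names other than `patid` are hypothetical, and the graphical comparison is not covered.

```python
from collections import Counter
from statistics import mean, stdev


def describe(rows, field):
    """Return n, mean, and sample SD of a field, skipping missing values."""
    values = [float(r[field]) for r in rows if r.get(field) not in (None, "")]
    return {
        "n": len(values),
        "mean": mean(values) if values else None,
        "sd": stdev(values) if len(values) > 1 else None,
    }


def repeat_profile(rows, id_field="patid", event_field="redcap_event_name"):
    """Summarize repeated measures without relying on the ID values themselves."""
    per_event = Counter(r[event_field] for r in rows)
    per_patient = Counter(r[id_field] for r in rows)
    return dict(per_event), sorted(per_patient.values())
```

Matching profiles and matching descriptive statistics between the integrated and the pseudonymized data set suggest that no rows were lost or duplicated along the way.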

Summary

The data integration process transformed heterogeneous longitudinal trial data into a standardized, REDCap-compatible format. Systematic preprocessing, validation, and quality checks produced a high-quality dataset suitable for analysis and dashboard visualization.

Available Variables (1)

Event: Screening

Coversheet

patid

Analysis Code

Viewing version: v1 (R, Multi-file Archive); default version: v3

Created by lakshmi.batchu · 2026-02-06 16:43

📦 Archive contents

File	Type	Size
data_integration.txt	documentation	4923 bytes
example.csv	data	3272 bytes
DepressionDataDictionaryStruct_DataDictionary_2025-09-20 (1).csv	data	24852 bytes

Files: 3
Uncompressed size: 33047 bytes

🧾 README

No README found in this archive.

Version History (detailed)

Version	Status	Language	Type	Relation	Author	Date
Global v1 (R v1)	selected	R	Multi-file Archive	Initial Implementation	lakshmi.batchu	2026-02-06
Global v2 (R v2)	-	R	Multi-file Archive	Refinement/Bug Fix ← Global v1	lakshmi.batchu	2026-02-06
Global v3 (R v3)	default	R	Multi-file Archive	Refinement/Bug Fix ← Global v2	lakshmi.batchu	2026-03-16

Contact

Lakshmi Batchu (Publisher)

Project: DepressionDC