DepressionDC Data Integration for FAIR Data Sharing

Dataset Info
Published on
2026-02-06

Variables
1

Data Access

Data is available only upon formal request and subject to approval.

Approved users receive a secure institute account and work with the data exclusively in our Trusted Research Environment (TRE) via remote desktop.

Request data (email us)

Reuse & Usage Terms
  • Data is not downloadable (TRE access only).
  • Approved users receive a personal institute account.
  • Tools available: RStudio, Jupyter, Python, Stata, etc.
  • Data resides in your TRE home directory.
  • Re-use/publication per Data Use Agreement (DUA).
  • No redistribution of the data.
Contact us for the DUA template and details.
Description

Data Integration, Cleaning, and Quality Assessment

Background

In this project, trial data from the DepressionDC study were integrated and prepared for further analysis and dashboard development. DepressionDC is a longitudinal randomized trial with repeated measurements across multiple time points. The data integration process involved systematic data understanding, preprocessing, transformation into a REDCap-compatible format, and comprehensive quality checks.

Data Understanding

To ensure accurate data handling, an initial data exploration phase was conducted:

  • Exploration of the study data dictionary to understand variable definitions and coding schemes
  • Review of published literature related to the trial to gain contextual understanding
  • Examination of raw CSV files to assess data structure, variable formats, and completeness

Data Preparation and Preprocessing

Data Structuring

  • Multiple source files were organized into a structured data list

Variable Standardization

  • Variables were renamed to align with REDCap conventions (e.g., visit → REDCap event names)
  • Column names were cleaned by removing extra spaces and inconsistencies
  • Special missing value codes such as “f.A” (fehlende Angabe) were converted to standard missing values (NA)

Alignment with REDCap Standards

Data Type Validation

  • Variables were validated and converted according to REDCap requirements:
    • Date variables → Date format
    • Continuous variables → Numeric
    • Discrete variables → Integer

Categorical Variable Recoding

  • Categorical variables were recoded into numerical formats based on the REDCap data dictionary
    • Example: Yes/No variables → 1/0
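A minimal sketch of this recoding in base R, assuming a hypothetical Yes/No variable; the actual numeric codes come from the REDCap data dictionary:

```r
# Recode Yes/No-style categorical values to REDCap numeric codes.
# The mapping below is illustrative; real codes come from the dictionary.
recode_yes_no <- function(x) {
  codes <- c("Yes" = 1L, "No" = 0L)
  unname(codes[as.character(x)])  # unmatched levels become NA
}

df <- data.frame(smoker = c("Yes", "No", "Yes", NA))
df$smoker <- recode_yes_no(df$smoker)
```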

Handling Longitudinal and Repeated Measures

  • Repeated measurements were identified across time points
  • Appropriate REDCap structure was applied by assigning repeat instruments and event names
  • This ensured compatibility with REDCap’s longitudinal data model
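The assignment of event names and repeat instances can be sketched in base R as follows; the column, instrument, and event names here are illustrative, not the study's actual names:

```r
# Assign REDCap longitudinal structure: one event name per visit and a
# repeat-instance counter within each subject.
long <- data.frame(
  patid = c("P1", "P1", "P2"),
  visit = c(1, 2, 1),
  phq9  = c(18, 12, 15)
)

long$redcap_event_name        <- paste0("visit_", long$visit, "_arm_1")
long$redcap_repeat_instrument <- "phq9"
long$redcap_repeat_instance   <- ave(seq_along(long$patid), long$patid,
                                     FUN = seq_along)  # 1, 2, ... per subject
```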

Data Integration into REDCap

A custom function was developed to automate the import of REDCap-compatible datasets into the REDCap system. This ensured consistency, reproducibility, and efficiency in the data integration process.

Data Quality Assessment (Pre-Anonymization)

Initial quality checks were performed to ensure data integrity:

  • Use of REDCap’s built-in Data Quality Tool to identify inconsistencies
  • Verification of overall data consistency across datasets
  • Detection of special characters (e.g., “ß”) that could affect data processing
  • Comparison of missing values (NA) between original datasets and REDCap exports
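The missing-value comparison can be sketched as a small helper that tallies NA counts per shared variable; the function name is illustrative:

```r
# Compare per-variable missing-value counts between the original data
# and a REDCap export; nonzero differences flag a discrepancy.
compare_na <- function(original, exported) {
  vars <- intersect(names(original), names(exported))
  data.frame(
    variable    = vars,
    na_original = vapply(original[vars], function(x) sum(is.na(x)), integer(1)),
    na_export   = vapply(exported[vars], function(x) sum(is.na(x)), integer(1)),
    row.names   = NULL
  )
}
```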

Data Anonymization and Feature Derivation

To ensure compliance with data protection and sharing requirements:

  • All patient identifiers (e.g., date of birth, date variables, administrative identifiers) were removed
  • Patient IDs were double-pseudonymized to ensure privacy protection
  • New derived variables were created:
    • Age at randomization
    • Age at onset of depression
    • Duration of depression
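A sketch of the derivation step, assuming hypothetical column names; the derived variables must be computed *before* the date identifiers are dropped, and ages via days/365.25 are approximate near birthdays:

```r
# Derive the new variables from date fields before removing identifiers.
df <- data.frame(
  dob        = as.Date("1980-05-01"),
  rand_date  = as.Date("2020-05-01"),
  onset_date = as.Date("2010-05-01")
)

yrs <- function(from, to) as.numeric(difftime(to, from, units = "days")) / 365.25

df$age_at_randomization <- floor(yrs(df$dob, df$rand_date))
df$age_at_onset         <- floor(yrs(df$dob, df$onset_date))
df$duration_depression  <- round(yrs(df$onset_date, df$rand_date), 1)

# Drop the identifying date columns once derivation is done
df <- df[, setdiff(names(df), c("dob", "rand_date", "onset_date"))]
```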

Data Quality Assessment (Post-Anonymization)

Further validation was performed after anonymization:

  • Comparison of original and REDCap-exported data using graphical methods for randomly selected variables
  • Verification of missing values across datasets
  • Calculation of descriptive statistics and comparison with published study results

Repeated measurements were validated in both:

  • Initial integrated dataset
  • Pseudonymized REDCap dataset

Summary

The data integration process ensured that heterogeneous longitudinal trial data were transformed into a standardized, REDCap-compatible format. Through systematic preprocessing, validation, and quality checks, a high-quality dataset suitable for analysis and dashboard visualization was established.

Available Variables (1)
Event: Screening
Coversheet
  • patid

Analysis Code
Version: v3 (default) · R · Multi-file Archive
Created by lakshmi.batchu · 2026-03-16 16:06 · Added Jupyter Notebooks for data integration pipeline and data comparison after integration.
📦 Archive contents
  • depressiondc-data-integration-for-fair-data-sharing-119-code-v3/README.md
    documentation · 4908 bytes
    docs
  • depressiondc-data-integration-for-fair-data-sharing-119-code-v3/example.csv
    data · 3272 bytes
    file
  • depressiondc-data-integration-for-fair-data-sharing-119-code-v3/DATAQUALITYCHECKS.ipynb
    other · 15337 bytes
    file
  • depressiondc-data-integration-for-fair-data-sharing-119-code-v3/DepressionDCDoublePseudonymize_DataDictionary_2026-03-16.csv
    data · 17324 bytes
    file
  • depressiondc-data-integration-for-fair-data-sharing-119-code-v3/data_integration.ipynb
    other · 7163 bytes
    file
🧾 README
# CSV to REDCap Data Integration Workflow

Automated R-based workflow for standardizing and importing CSV data into REDCap databases with validation and type conversion.

## Overview

This workflow handles the complete pipeline from raw CSV files to REDCap import, including:
- Auto-detection of CSV delimiters
- Column name standardization
- REDCap field alignment and type conversion
- Automated API import

## Prerequisites

### Required R Packages

```r
install.packages(c(
  "dplyr",
  "janitor", 
  "readr",
  "stringr",
  "lubridate",
  "httr",
  "jsonlite"
))
```

### Required Files

- **CSV data files**: Source data to be imported
- **REDCap data dictionary** (`Redcap_dictionary.csv`): Export from your REDCap project
- **REDCap API token**: Generated from your REDCap project settings

## Configuration

### 1. Set File Paths

```r
folder_path <- "PATH/TO/YOUR/CSV/FOLDER"
```

### 2. Configure API Credentials

**Security Best Practice**: Never hardcode API tokens in scripts. Use environment variables:

```r
# Set environment variable (run once in R console)
Sys.setenv(REDCAP_API_TOKEN = "your_token_here")

# Use in script
api_token <- Sys.getenv("REDCAP_API_TOKEN")
api_url <- "https://your-redcap-server/api/"
```

## Workflow Steps

### 1. Data Loading

The script automatically:
- Detects CSV delimiter (comma or semicolon)
- Handles UTF-8 encoding
- Loads all CSV files from the specified folder
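A simplified stand-in for the auto-detection step, using base R readers (the script itself may use `readr`; this just illustrates the idea):

```r
# Pick comma vs semicolon by counting each separator on the header line,
# then read with the matching base reader (read.csv2 handles ';' and
# decimal commas).
read_csv_auto <- function(path) {
  header  <- readLines(path, n = 1)
  n_semi  <- nchar(gsub("[^;]", "", header))
  n_comma <- nchar(gsub("[^,]", "", header))
  if (n_semi > n_comma) {
    read.csv2(path, fileEncoding = "UTF-8")
  } else {
    read.csv(path, fileEncoding = "UTF-8")
  }
}
```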

### 2. Column Standardization

Automatically maps common column names to REDCap system fields:

| Source Column | REDCap Field |
|---------------|--------------|
| `subject` | `subjid_drv` |
| `repeat_number` | `redcap_repeat_instance` |
| `visitid` / `visit_id` | `redcap_event_name` |
| `site` | `redcap_data_access_group` |

**Note**: Column names are cleaned using `janitor::make_clean_names()` (lowercase, underscores, no special characters).
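The mapping table above can be sketched as follows; the name cleaning here mimics `janitor::make_clean_names()` in base R, and the archive's actual `standardize_columns()` may differ in detail:

```r
# Clean column names (lowercase, underscores, no special characters),
# then rename known source columns to REDCap system fields.
standardize_columns_sketch <- function(df) {
  nm <- tolower(trimws(names(df)))
  nm <- gsub("[^a-z0-9]+", "_", nm)
  nm <- gsub("^_+|_+$", "", nm)
  names(df) <- nm

  mapping <- c(subject       = "subjid_drv",
               repeat_number = "redcap_repeat_instance",
               visitid       = "redcap_event_name",
               visit_id      = "redcap_event_name",
               site          = "redcap_data_access_group")
  hits <- intersect(names(mapping), names(df))
  names(df)[match(hits, names(df))] <- unname(mapping[hits])
  df
}
```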

### 3. Data Type Alignment

The workflow reads your REDCap data dictionary and applies validation rules:

- **date_dmy**: Converts to date format (supports multiple input formats)
- **integer**: Converts to integer
- **number**: Converts to numeric

System fields are preserved and not subject to validation conversion.
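The dictionary-driven alignment can be sketched with base R; the archive's `align_by_text_validation()` may differ, and the dictionary column names (`field_name`, `validation`) are assumptions:

```r
# Convert each non-system column according to its validation rule in
# the REDCap data dictionary; unknown or unlisted fields are untouched.
align_types_sketch <- function(df, dict) {
  for (var in setdiff(names(df), grep("^redcap_", names(df), value = TRUE))) {
    rule <- dict$validation[match(var, dict$field_name)]
    if (is.na(rule)) next
    df[[var]] <- switch(rule,
      date_dmy = as.Date(as.character(df[[var]]),
                         tryFormats = c("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d")),
      integer  = as.integer(df[[var]]),
      number   = as.numeric(df[[var]]),
      df[[var]]  # unknown rule: leave unchanged
    )
  }
  df
}
```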

### 4. REDCap Import

Uses the REDCap API to import data with:
- Flat JSON format
- Normal overwrite behavior (updates existing records)
- Count return for verification
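The API call behind this step can be sketched with the `httr` and `jsonlite` packages listed in the prerequisites; the archive's `import_to_redcap()` may differ in detail:

```r
# POST records to the REDCap API: flat JSON, normal overwrite behavior,
# and a record count returned for verification.
library(httr)
library(jsonlite)

import_to_redcap_sketch <- function(df, api_token, api_url) {
  response <- POST(api_url, body = list(
    token             = api_token,
    content           = "record",
    format            = "json",
    type              = "flat",            # flat JSON format
    overwriteBehavior = "normal",          # updates existing records
    data              = toJSON(df, na = "null"),
    returnContent     = "count",           # count return for verification
    returnFormat      = "json"
  ), encode = "form")
  message("HTTP status: ", status_code(response))
  content(response, as = "text", encoding = "UTF-8")
}
```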

## Usage Example

```r
# Source the script
source("data_integration_workflow.R")

# Load and standardize all CSV files
# (already done automatically when script runs)

# Apply REDCap dictionary alignment to a specific dataset
DM <- data_list[[1]]  # First CSV file
DM_aligned <- align_by_text_validation(DM, redcap_dict)

# Import to REDCap
api_token <- Sys.getenv("REDCAP_API_TOKEN")
api_url <- "https://your-redcap-server/api/"

result <- import_to_redcap(DM_aligned, api_token, api_url)
```

## Data Preprocessing Features

### Special Value Handling

- Empty strings converted to `NA`
- `"f.A."` values (German *fehlende Angabe*, i.e., missing information) converted to `NA`

### Date Format Support

The workflow supports multiple date input formats (separators such as `/` and `-` are interchangeable for `parse_date_time()`):
- `dmy`: 31/12/24 (two-digit year)
- `dmY`: 31/12/2024
- `Ymd`: 2024-12-31
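Parsing mixed formats in one pass looks like this with `lubridate` (unparseable strings become `NA` with a warning):

```r
# Try each order in turn until one matches; separators are ignored.
library(lubridate)

dates  <- c("31/12/2024", "31/12/24", "2024-12-31", "31-12-2024")
parsed <- as.Date(parse_date_time(dates, orders = c("dmy", "Ymd")))
```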

## Error Handling

The import function provides:
- HTTP status code reporting
- Success/failure messages
- Return of API response for debugging

## Security Considerations

⚠️ **Important Security Practices**:

1. **Never commit API tokens to version control**
2. **Use environment variables for credentials**
3. **Restrict file permissions on scripts containing tokens**
4. **Use separate tokens for development/production**
5. **Regularly rotate API tokens**

## Troubleshooting

### Common Issues

**Import fails with validation errors**:
- Check REDCap data dictionary field names match CSV columns
- Verify date formats match expected validation
- Ensure required fields are present

**Column mapping doesn't work**:
- Cleaned names are all lowercase with underscores; matching against the mapping table is exact
- Check for typos in source CSV headers
- Review `make_clean_names()` output

**Date conversion fails**:
- Verify date format in source data
- Add additional `orders` to `parse_date_time()` if needed
- Check for invalid dates (e.g., 31/02/2024)

## Workflow Customization

### Adding Custom Column Mappings

```r
# In standardize_columns() function, add:
if ("your_column" %in% names(df)) {
  df <- df %>% rename(redcap_field = your_column)
}
```

### Adding Custom Validation Types

```r
# In align_by_text_validation() function, add:
else if (validation == "custom_type") {
  df[[var]] <- your_conversion_function(df[[var]])
}
```

## Author

**Sowjanya Batchu**

## License

Include appropriate license information for your organization.

---

## Related Documentation

- [REDCap API Documentation](https://redcap.vanderbilt.edu/api/help/)
- [janitor Package](https://sfirke.github.io/janitor/)
- [httr Package](https://httr.r-lib.org/)
Version Timeline (by language): R

Version History (detailed)

| Version | Language | Type | Relation | Author | Date |
|---------|----------|------|----------|--------|------|
| Global v1 (R v1) | R | Multi-file Archive | Initial Implementation | lakshmi.batchu | 2026-02-06 |
| Global v2 (R v2) | R | Multi-file Archive | Refinement/Bug Fix ← Global v1 | lakshmi.batchu | 2026-02-06 |
| Global v3 (R v3), default selected | R | Multi-file Archive | Refinement/Bug Fix ← Global v2 | lakshmi.batchu | 2026-03-16 |
Contact
Lakshmi Batchu
Email
Publisher

Project
DepressionDC