DepressionDC Data Integration for FAIR Data Sharing

Dataset Info
Published on
2026-02-06

Variables
1

Data Access

Data is available only upon formal request and subject to approval.

Approved users receive a secure institute account and work with the data exclusively in our Trusted Research Environment (TRE) via remote desktop.

Request data (email us)

Reuse & Usage Terms
  • Data is not downloadable (TRE access only).
  • Approved users receive a personal institute account.
  • Tools available: RStudio, Jupyter, Python, Stata, etc.
  • Data resides in your TRE home directory.
  • Re-use/publication per Data Use Agreement (DUA).
  • No redistribution of the data.
Contact us for the DUA template and details.
Description

Data Integration, Cleaning, and Quality Assessment

Background

In this project, trial data from the DepressionDC study were integrated and prepared for further analysis and dashboard development. DepressionDC is a longitudinal randomized trial with repeated measurements across multiple time points. The data integration process involved systematic data understanding, preprocessing, transformation into a REDCap-compatible format, and comprehensive quality checks.

Data Understanding

To ensure accurate data handling, an initial data exploration phase was conducted:

  • Exploration of the study data dictionary to understand variable definitions and coding schemes
  • Review of published literature related to the trial to gain contextual understanding
  • Examination of raw CSV files to assess data structure, variable formats, and completeness

Data Preparation and Preprocessing

Data Structuring

  • Multiple source files were organized into a structured data list

Variable Standardization

  • Variables were renamed to align with REDCap conventions (e.g., visit → REDCap event names)
  • Column names were cleaned by removing extra spaces and inconsistencies
  • Special missing value codes such as “f.A” (fehlende Angabe) were converted to standard missing values (NA)

Alignment with REDCap Standards

Data Type Validation

  • Variables were validated and converted according to REDCap requirements:
    • Date variables → Date format
    • Continuous variables → Numeric
    • Discrete variables → Integer

Categorical Variable Recoding

  • Categorical variables were recoded into numerical formats based on the REDCap data dictionary
    • Example: Yes/No variables → 1/0
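A minimal sketch of this recoding in base R, assuming a hypothetical Yes/No variable; the actual numeric codes come from the REDCap data dictionary:

```r
# Recode Yes/No-style categorical values to REDCap numeric codes.
# The mapping below is illustrative; real codes come from the dictionary.
recode_yes_no <- function(x) {
  codes <- c("Yes" = 1L, "No" = 0L)
  unname(codes[as.character(x)])  # unmatched levels become NA
}

df <- data.frame(smoker = c("Yes", "No", "Yes", NA))
df$smoker <- recode_yes_no(df$smoker)
```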

Handling Longitudinal and Repeated Measures

  • Repeated measurements were identified across time points
  • Appropriate REDCap structure was applied by assigning repeat instruments and event names
  • This ensured compatibility with REDCap’s longitudinal data model
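The assignment of event names and repeat instances can be sketched in base R as follows; the column, instrument, and event names here are illustrative, not the study's actual names:

```r
# Assign REDCap longitudinal structure: one event name per visit and a
# repeat-instance counter within each subject.
long <- data.frame(
  patid = c("P1", "P1", "P2"),
  visit = c(1, 2, 1),
  phq9  = c(18, 12, 15)
)

long$redcap_event_name        <- paste0("visit_", long$visit, "_arm_1")
long$redcap_repeat_instrument <- "phq9"
long$redcap_repeat_instance   <- ave(seq_along(long$patid), long$patid,
                                     FUN = seq_along)  # 1, 2, ... per subject
```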

Data Integration into REDCap

A custom function was developed to automate the import of REDCap-compatible datasets into the REDCap system. This ensured consistency, reproducibility, and efficiency in the data integration process.

Data Quality Assessment (Pre-Anonymization)

Initial quality checks were performed to ensure data integrity:

  • Use of REDCap’s built-in Data Quality Tool to identify inconsistencies
  • Verification of overall data consistency across datasets
  • Detection of special characters (e.g., “ß”) that could affect data processing
  • Comparison of missing values (NA) between original datasets and REDCap exports
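The missing-value comparison can be sketched as a small helper that tallies NA counts per shared variable; the function name is illustrative:

```r
# Compare per-variable missing-value counts between the original data
# and a REDCap export; nonzero differences flag a discrepancy.
compare_na <- function(original, exported) {
  vars <- intersect(names(original), names(exported))
  data.frame(
    variable    = vars,
    na_original = vapply(original[vars], function(x) sum(is.na(x)), integer(1)),
    na_export   = vapply(exported[vars], function(x) sum(is.na(x)), integer(1)),
    row.names   = NULL
  )
}
```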

Data Anonymization and Feature Derivation

To ensure compliance with data protection and sharing requirements:

  • All patient identifiers (e.g., date of birth, date variables, administrative identifiers) were removed
  • Patient IDs were double-pseudonymized to ensure privacy protection
  • New derived variables were created:
    • Age at randomization
    • Age at onset of depression
    • Duration of depression
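A sketch of the derivation step, assuming hypothetical column names; the derived variables must be computed *before* the date identifiers are dropped, and ages via days/365.25 are approximate near birthdays:

```r
# Derive the new variables from date fields before removing identifiers.
df <- data.frame(
  dob        = as.Date("1980-05-01"),
  rand_date  = as.Date("2020-05-01"),
  onset_date = as.Date("2010-05-01")
)

yrs <- function(from, to) as.numeric(difftime(to, from, units = "days")) / 365.25

df$age_at_randomization <- floor(yrs(df$dob, df$rand_date))
df$age_at_onset         <- floor(yrs(df$dob, df$onset_date))
df$duration_depression  <- round(yrs(df$onset_date, df$rand_date), 1)

# Drop the identifying date columns once derivation is done
df <- df[, setdiff(names(df), c("dob", "rand_date", "onset_date"))]
```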

Data Quality Assessment (Post-Anonymization)

Further validation was performed after anonymization:

  • Comparison of original and REDCap-exported data using graphical methods for randomly selected variables
  • Verification of missing values across datasets
  • Calculation of descriptive statistics and comparison with published study results

Repeated measurements were validated in both:

  • Initial integrated dataset
  • Pseudonymized REDCap dataset

Summary

The data integration process ensured that heterogeneous longitudinal trial data were transformed into a standardized, REDCap-compatible format. Through systematic preprocessing, validation, and quality checks, a high-quality dataset suitable for analysis and dashboard visualization was established.

Available Variables (1)
Event: Screening
Coversheet
  • patid

Analysis Code
Version: v3 (default) · R · Multi-file Archive
Created by lakshmi.batchu · 2026-03-16 16:06 · Added Jupyter Notebooks for data integration pipeline and data comparison after integration.
📦 Archive contents
  • depressiondc-data-integration-for-fair-data-sharing-119-code-v3/README.md
    documentation · 4908 bytes
    docs
  • depressiondc-data-integration-for-fair-data-sharing-119-code-v3/example.csv
    data · 3272 bytes
    file
  • depressiondc-data-integration-for-fair-data-sharing-119-code-v3/DATAQUALITYCHECKS.ipynb
    other · 15337 bytes
    file
  • depressiondc-data-integration-for-fair-data-sharing-119-code-v3/DepressionDCDoublePseudonymize_DataDictionary_2026-03-16.csv
    data · 17324 bytes
    file
  • depressiondc-data-integration-for-fair-data-sharing-119-code-v3/data_integration.ipynb
    other · 7163 bytes
    file
🧾 README
# CSV to REDCap Data Integration Workflow

Automated R-based workflow for standardizing and importing CSV data into REDCap databases with validation and type conversion.

## Overview

This workflow handles the complete pipeline from raw CSV files to REDCap import, including:
- Auto-detection of CSV delimiters
- Column name standardization
- REDCap field alignment and type conversion
- Automated API import

## Prerequisites

### Required R Packages

```r
install.packages(c(
  "dplyr",
  "janitor", 
  "readr",
  "stringr",
  "lubridate",
  "httr",
  "jsonlite"
))
```

### Required Files

- **CSV data files**: Source data to be imported
- **REDCap data dictionary** (`Redcap_dictionary.csv`): Export from your REDCap project
- **REDCap API token**: Generated from your REDCap project settings

## Configuration

### 1. Set File Paths

```r
folder_path <- "PATH/TO/YOUR/CSV/FOLDER"
```

### 2. Configure API Credentials

**Security Best Practice**: Never hardcode API tokens in scripts. Use environment variables:

```r
# Set environment variable (run once in R console)
Sys.setenv(REDCAP_API_TOKEN = "your_token_here")

# Use in script
api_token <- Sys.getenv("REDCAP_API_TOKEN")
api_url <- "https://your-redcap-server/api/"
```

## Workflow Steps

### 1. Data Loading

The script automatically:
- Detects CSV delimiter (comma or semicolon)
- Handles UTF-8 encoding
- Loads all CSV files from the specified folder
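A simplified stand-in for the auto-detection step, using base R readers (the script itself may use `readr`; this just illustrates the idea):

```r
# Pick comma vs semicolon by counting each separator on the header line,
# then read with the matching base reader (read.csv2 handles ';' and
# decimal commas).
read_csv_auto <- function(path) {
  header  <- readLines(path, n = 1)
  n_semi  <- nchar(gsub("[^;]", "", header))
  n_comma <- nchar(gsub("[^,]", "", header))
  if (n_semi > n_comma) {
    read.csv2(path, fileEncoding = "UTF-8")
  } else {
    read.csv(path, fileEncoding = "UTF-8")
  }
}
```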

### 2. Column Standardization

Automatically maps common column names to REDCap system fields:

| Source Column | REDCap Field |
|---------------|--------------|
| `subject` | `subjid_drv` |
| `repeat_number` | `redcap_repeat_instance` |
| `visitid` / `visit_id` | `redcap_event_name` |
| `site` | `redcap_data_access_group` |

**Note**: Column names are cleaned using `janitor::make_clean_names()` (lowercase, underscores, no special characters).
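The mapping table above can be sketched as follows; the name cleaning here mimics `janitor::make_clean_names()` in base R, and the archive's actual `standardize_columns()` may differ in detail:

```r
# Clean column names (lowercase, underscores, no special characters),
# then rename known source columns to REDCap system fields.
standardize_columns_sketch <- function(df) {
  nm <- tolower(trimws(names(df)))
  nm <- gsub("[^a-z0-9]+", "_", nm)
  nm <- gsub("^_+|_+$", "", nm)
  names(df) <- nm

  mapping <- c(subject       = "subjid_drv",
               repeat_number = "redcap_repeat_instance",
               visitid       = "redcap_event_name",
               visit_id      = "redcap_event_name",
               site          = "redcap_data_access_group")
  hits <- intersect(names(mapping), names(df))
  names(df)[match(hits, names(df))] <- unname(mapping[hits])
  df
}
```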

### 3. Data Type Alignment

The workflow reads your REDCap data dictionary and applies validation rules:

- **date_dmy**: Converts to date format (supports multiple input formats)
- **integer**: Converts to integer
- **number**: Converts to numeric

System fields are preserved and not subject to validation conversion.
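The dictionary-driven alignment can be sketched with base R; the archive's `align_by_text_validation()` may differ, and the dictionary column names (`field_name`, `validation`) are assumptions:

```r
# Convert each non-system column according to its validation rule in
# the REDCap data dictionary; unknown or unlisted fields are untouched.
align_types_sketch <- function(df, dict) {
  for (var in setdiff(names(df), grep("^redcap_", names(df), value = TRUE))) {
    rule <- dict$validation[match(var, dict$field_name)]
    if (is.na(rule)) next
    df[[var]] <- switch(rule,
      date_dmy = as.Date(as.character(df[[var]]),
                         tryFormats = c("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d")),
      integer  = as.integer(df[[var]]),
      number   = as.numeric(df[[var]]),
      df[[var]]  # unknown rule: leave unchanged
    )
  }
  df
}
```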

### 4. REDCap Import

Uses the REDCap API to import data with:
- Flat JSON format
- Normal overwrite behavior (updates existing records)
- Count return for verification
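The API call behind this step can be sketched with the `httr` and `jsonlite` packages listed in the prerequisites; the archive's `import_to_redcap()` may differ in detail:

```r
# POST records to the REDCap API: flat JSON, normal overwrite behavior,
# and a record count returned for verification.
library(httr)
library(jsonlite)

import_to_redcap_sketch <- function(df, api_token, api_url) {
  response <- POST(api_url, body = list(
    token             = api_token,
    content           = "record",
    format            = "json",
    type              = "flat",            # flat JSON format
    overwriteBehavior = "normal",          # updates existing records
    data              = toJSON(df, na = "null"),
    returnContent     = "count",           # count return for verification
    returnFormat      = "json"
  ), encode = "form")
  message("HTTP status: ", status_code(response))
  content(response, as = "text", encoding = "UTF-8")
}
```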

## Usage Example

```r
# Source the script
source("data_integration_workflow.R")

# Load and standardize all CSV files
# (already done automatically when script runs)

# Apply REDCap dictionary alignment to a specific dataset
DM <- data_list[[1]]  # First CSV file
DM_aligned <- align_by_text_validation(DM, redcap_dict)

# Import to REDCap
api_token <- Sys.getenv("REDCAP_API_TOKEN")
api_url <- "https://your-redcap-server/api/"

result <- import_to_redcap(DM_aligned, api_token, api_url)
```

## Data Preprocessing Features

### Special Value Handling

- Empty strings converted to `NA`
- `"f.A."` values (German *fehlende Angabe*, i.e., missing information) converted to `NA`

### Date Format Support

The workflow supports multiple date input formats (separators such as `/` and `-` are interchangeable for `parse_date_time()`):
- `dmy`: 31/12/24 (two-digit year)
- `dmY`: 31/12/2024
- `Ymd`: 2024-12-31
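Parsing mixed formats in one pass looks like this with `lubridate` (unparseable strings become `NA` with a warning):

```r
# Try each order in turn until one matches; separators are ignored.
library(lubridate)

dates  <- c("31/12/2024", "31/12/24", "2024-12-31", "31-12-2024")
parsed <- as.Date(parse_date_time(dates, orders = c("dmy", "Ymd")))
```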

## Error Handling

The import function provides:
- HTTP status code reporting
- Success/failure messages
- Return of API response for debugging

## Security Considerations

⚠️ **Important Security Practices**:

1. **Never commit API tokens to version control**
2. **Use environment variables for credentials**
3. **Restrict file permissions on scripts containing tokens**
4. **Use separate tokens for development/production**
5. **Regularly rotate API tokens**

## Troubleshooting

### Common Issues

**Import fails with validation errors**:
- Check REDCap data dictionary field names match CSV columns
- Verify date formats match expected validation
- Ensure required fields are present

**Column mapping doesn't work**:
- Cleaned names are all lowercase with underscores; matching against the mapping table is exact
- Check for typos in source CSV headers
- Review `make_clean_names()` output

**Date conversion fails**:
- Verify date format in source data
- Add additional `orders` to `parse_date_time()` if needed
- Check for invalid dates (e.g., 31/02/2024)

## Workflow Customization

### Adding Custom Column Mappings

```r
# In standardize_columns() function, add:
if ("your_column" %in% names(df)) {
  df <- df %>% rename(redcap_field = your_column)
}
```

### Adding Custom Validation Types

```r
# In align_by_text_validation() function, add:
else if (validation == "custom_type") {
  df[[var]] <- your_conversion_function(df[[var]])
}
```

## Author

**Sowjanya Batchu**

## License

Include appropriate license information for your organization.

---

## Related Documentation

- [REDCap API Documentation](https://redcap.vanderbilt.edu/api/help/)
- [janitor Package](https://sfirke.github.io/janitor/)
- [httr Package](https://httr.r-lib.org/)
Version Timeline (by language): R

Version History (detailed)

| Version | Language | Type | Relation | Author | Date |
|---------|----------|------|----------|--------|------|
| Global v1 (R v1) | R | Multi-file Archive | Initial Implementation | lakshmi.batchu | 2026-02-06 |
| Global v2 (R v2) | R | Multi-file Archive | Refinement/Bug Fix ← Global v1 | lakshmi.batchu | 2026-02-06 |
| Global v3 (R v3), default selected | R | Multi-file Archive | Refinement/Bug Fix ← Global v2 | lakshmi.batchu | 2026-03-16 |
Contact
Lakshmi Batchu
Email
Publisher

Project
DepressionDC