Summary of Statistical Methods Used in the Study
The study used two broad groups of statistical methods: machine learning for mortality prediction and traditional statistical analyses for longitudinal cohort data. They are summarized below:
1. Prediction of Mortality (Cross-sectional Data):
* Machine Learning Model: XGBoost was used to predict mortality risk.
* Hyperparameter Tuning: Bayesian optimization was used to find the optimal hyperparameters for the XGBoost model (eta, max_depth, min_child_weight, subsample, nfold).
* Model Evaluation:
* AUC (Area Under the Curve) was used as the scoring function during hyperparameter tuning.
* AUC, F1 score, precision, recall, accuracy, and balanced accuracy were calculated on the test dataset and after 10-fold cross-validation.
* 10-fold Cross-validation: Used to robustly assess model performance and reduce bias. Results are reported as mean ± standard deviation.
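The evaluation metrics listed above can be illustrated with a minimal Python sketch. This is not the study's code (the analyses were run in R and GraphPad Prism); the function names, the 0.5 classification threshold, the toy labels and scores, and the per-fold AUC values are all assumptions made up for illustration. The summary uses the sample standard deviation, which is also an assumption, since the source does not say which form was reported.

```python
from statistics import mean, stdev


def auc_score(y_true, y_score):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as 0.5 (pairwise definition)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(y_true, y_pred):
    """Threshold-based metrics derived from the 2x2 confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (recall + specificity) / 2,
    }


def summarize_folds(fold_scores):
    """Mean and sample standard deviation of a metric across CV folds."""
    return mean(fold_scores), stdev(fold_scores)


# Toy data (hypothetical): 1 = died, 0 = survived.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [int(s >= 0.5) for s in y_score]  # assumed 0.5 threshold

print("AUC:", auc_score(y_true, y_score))
print(classification_metrics(y_true, y_pred))

# Hypothetical per-fold AUCs, reported as mean +/- SD.
fold_aucs = [0.80, 0.82, 0.78, 0.81, 0.79]
mu, sd = summarize_folds(fold_aucs)
print(f"AUC across folds: {mu:.3f} +/- {sd:.3f}")
```

AUC is computed from the predicted scores, whereas the other metrics depend on the threshold used to turn scores into class labels, so they can move independently of AUC.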
2. Longitudinal Cohort Analyses (IPAH Patients):
* Survival Analysis: Kaplan-Meier curves were used to visualize overall and transplant-free survival. Patients were censored at the date they were last known to be alive or at the transplantation date.
* Trend Analysis:
* Longitudinal trends of clinical variables (NOTCH3-ECD, mRAP, PVR, mPAP, TRV, 6MWD) were compared between patients with progressive disease (transplant/death) and those who survived.
* Locally Estimated Scatterplot Smoothing (LOESS): Used to visualize trends without assuming a specific functional form. Plots showed time (years) vs. variable value, with facets for each variable.
* Mixed ANOVA: Used to analyze the effects of time and prognostic status (death/transplant vs. survival) on the trends of clinical variables.
* Statistical Comparisons:
* t-tests: Independent two-sample t-tests were used for comparing continuous variables between two independent groups.
* ANOVA with Tukey’s post-hoc test: Used for comparing continuous variables between multiple groups.
* Spearman’s Rank Correlation: Used to assess correlation between continuous variables.
* Significance Level: A two-sided *P* < 0.05 was considered statistically significant.
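The Kaplan-Meier estimate behind the survival curves in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's implementation (the curves were produced with the software listed at the end of this summary); the function name `kaplan_meier` and the toy follow-up data are hypothetical. An event flag of 1 marks an observed event and 0 marks a censored patient, for example one censored at the date last known alive.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time per patient (e.g. years)
    events : 1 if the event was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)   # every patient leaves the risk set at their time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            # Patients censored at t still count as at risk at t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # remove events and censored patients alike
    return curve


# Toy cohort (hypothetical): follow-up in years, 0 = censored.
times = [1, 2, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0, 1]

for t, s in kaplan_meier(times, events):
    print(f"t = {t}: S(t) = {s:.3f}")
```

Censored patients never lower the curve directly; they only shrink the risk set, so later events produce larger drops.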
3. Data Handling:
* Missing Data: Missing values (3.5-5% in the cross-sectional data, 2-3% in the longitudinal data) were left blank; no imputation was performed.
Software Used:
* GraphPad Prism, v9.1.2
* R software, v4.2.1 (with ggplot2 package for LOESS plots)