This script develops models with several ML methods and validates their performance using nested 5-fold cross-validation. The pipeline does not include data preprocessing; please clean your data (e.g., NaNs, outliers) before using it.

Nested CV multi-model training utility with threshold tuning and SHAP summaries.

This module contains a compact collection of helper functions and a main routine (`my_nestedcv_multi`) to run a **nested cross-validation** workflow across multiple classifiers using imbalanced-learn pipelines, with:

- Standardization (applied pre-SMOTE so nearest-neighbor distances are meaningful)
- BorderlineSMOTE oversampling
- Model-based feature selection
- Grid search on the inner CV (recall-optimized)
- Threshold tuning on the inner CV (Youden / F-beta / target sensitivity / cost)
- Out-of-fold ROC aggregation
- SHAP summary importances per feature
Main toolbox used:

# conda install -y -c conda-forge \
#   "numpy=1.26.4" \
#   "scipy<1.14" \
#   "pandas=2.2.*" \
#   "scikit-learn=1.3.2" \
#   "scikit-learn-intelex=2024.6.0" \
#   "daal4py=2024.6.0" \
#   "imbalanced-learn=0.13.0" \
#   "shap=0.43.0" \
#   "xgboost=1.7.6" \
#   "tqdm=4.67.1" \
#   "matplotlib=3.8.4"
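The threshold-tuning step mentioned in the overview can be sketched with the Youden criterion. This is an assumed, simplified version for illustration; the module's tuning routine also supports F-beta, target-sensitivity, and cost criteria and operates on inner-CV folds rather than a single split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# Youden's J = sensitivity + specificity - 1 = TPR - FPR;
# pick the probability cutoff that maximizes it on held-out data
fpr, tpr, thresholds = roc_curve(y_val, proba)
best = thresholds[np.argmax(tpr - fpr)]
y_pred = (proba >= best).astype(int)
print(f"Tuned threshold: {best:.3f}")
```

The default 0.5 cutoff is rarely optimal for imbalanced data, which is why the tuned threshold is carried into the outer-fold evaluation.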
Utilities for summarizing model performance with permutation-based p-values and compact display strings.

Key pieces:

- `mypermutation_ttest`: simple permutation comparison of performance vs. chance
- `_mean_ci_95`: mean + 95% CI
- `_stars_from_p`: significance stars from p-values
- `build_model_performance_table`: assemble per-metric, per-model tables ready for display or downstream analysis