
UCI_Diabetes_Data_Feature_Engineering

Data Preprocessing

ranked groupby target value means for categorical data

ordinalizing metric columns

grouping ordinalized medications

Concatenate processed data to evaluate all featured variable groups in a loop

Import UCI Diabetes table

101766 cases

group rare medical specialties

Part 5: Dashboard

Part 2: processing variables and scoring

In this part, data that are too granular to be meaningful in a statistical sense, namely

  • the medical specialties of referring practitioners and

  • the 3 diagnosis columns,

    are grouped to a coarser granularity.

Furthermore, redundant medication columns are grouped according to their basic drug classes
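One plausible pandas sketch of such a grouping (both the choice of columns sharing a base drug and the max-over-group rule are assumptions for illustration; the workflow's actual medication_loop may differ):

```python
import pandas as pd

# Ordinal encoding of the dataset's dose-change categories
DOSE_ORDER = {"No": 0, "Down": 1, "Steady": 2, "Up": 3}

# Hypothetical group of columns containing the same base drug
METFORMIN_COLS = ["metformin", "glyburide-metformin", "glipizide-metformin"]

def combine_medication(df, cols=METFORMIN_COLS, name="metformin_comb"):
    """Collapse several columns of one drug class into a single ordinal column."""
    out = df.copy()
    # ordinalize each dose category, then keep the strongest signal in the group
    out[name] = out[cols].apply(lambda c: c.map(DOSE_ORDER)).max(axis=1)
    return out
```
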

metric data are ordinalized

Finally: features such as codes are ordinalized by ranking their groupby target value ("severity") means
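The ordinalization of the metric lab categories could look like the following sketch (the value sets follow the dataset's documented A1Cresult / max_glu_serum categories; placing "None" lowest is a modelling assumption):

```python
import pandas as pd

# Ordinal maps for the categorical lab results; ranking "None" lowest
# is an assumption, not dictated by the data.
A1C_ORDER = {"None": 0, "Norm": 1, ">7": 2, ">8": 3}
GLU_ORDER = {"None": 0, "Norm": 1, ">200": 2, ">300": 3}

def ordinalize(df):
    """Map the categorical lab columns to ordinal integer columns."""
    out = df.copy()
    out["A1Cresult_ordnl"] = out["A1Cresult"].map(A1C_ORDER)
    out["max_glu_serum_ordnl"] = out["max_glu_serum"].map(GLU_ORDER)
    return out
```
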

target value transformation

grouping diagnoses

Grouping diagnosis and med. specialties columns

To consolidate the three ICD-9 diagnosis columns and the medical specialties into fewer, more meaningful groups, the following approach was implemented:

  • Data Preparation: The three original columns were first unpivoted to create a single list of all diagnosis codes. A frequency count was then performed to identify codes occurring fewer than 50 times.

  • Grouping Logic: A mapping strategy was developed to handle the long tail of rare diagnoses. Codes with a count ≥ 50 were retained as original values, while all codes with a count < 50 were simplified and reclassified under their respective ICD-9 chapter headers.
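Assuming pandas and a simplified chapter lookup, the two steps could be sketched as follows (the chapter boundaries below are illustrative, not the full official ICD-9 chapter table):

```python
import pandas as pd

# Simplified ICD-9 chapter boundaries (illustrative subset, not the full list)
CHAPTERS = [(139, "infectious"), (239, "neoplasms"), (279, "endocrine"),
            (389, "circulatory_etc"), (999, "other")]

def icd9_chapter(code):
    """Map an ICD-9 code string to a coarse chapter label."""
    try:
        n = float(code)
    except (TypeError, ValueError):
        return "V_E_codes"                      # V/E supplementary codes
    for upper, label in CHAPTERS:
        if n <= upper:
            return label
    return "other"

def group_rare_diagnoses(df, cols=("diag_1", "diag_2", "diag_3"), min_count=50):
    """Unpivot the diagnosis columns, count code frequencies, and replace
    rare codes (count < min_count) by their chapter label."""
    counts = df[list(cols)].melt(value_name="code")["code"].value_counts()
    rare = set(counts[counts < min_count].index)
    out = df.copy()
    for c in cols:
        out[c + "_mapped"] = out[c].where(~out[c].isin(rare),
                                          out[c].map(icd9_chapter))
    return out
```
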

XGBoost scores over featured variable categories

- compounded diagnoses
- aggregated medical specialties
- ordinalized values
- aggregated pharmaceuticals
- ranked groupby means

Feature Engineering UCI-Hospital Diabetes Records

Part I: First insights and target value optimization

Abstract

The well-known UCI hospital records address the readmission of diabetes patients. 101,766 recorded patients are documented in 35 features over a 10-year period. Hitherto, approaches to predict their readmittance probability have not proven to be effective.

The present approach concerns:

1) target value engineering: changing the design of the study from the prediction of patient readmittance towards

1a) grading patients into "severe" and "less severe" cases and

1b) predicting readmittance within the severe cases

2) feature engineering of the independent variables by aggregation

3) the reliability of metrics

4) deeper insights into the most important features

5) enhanced prediction of severe vs. less severe patient outcomes (kappa 0.04 -> ~0.414)

6) enhanced predictability of readmitted patients (kappa 0.037 -> 0.212)
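The severity grading of 1a) could be sketched as follows; the set of "severe" discharge_disposition_id values below is a hypothetical placeholder (the actual set is derived later in the target value engineering loop):

```python
import pandas as pd

# Hypothetical "severe" discharge dispositions (e.g. expired, hospice);
# the workflow determines the actual set in its target engineering loop.
SEVERE_DISPOSITIONS = {11, 13, 14, 19, 20}

def grade_severity(df):
    """Binary target: 1 = severe (early readmission or severe discharge),
    0 = less severe."""
    severe = ((df["readmitted"] == "<30")
              | df["discharge_disposition_id"].isin(SEVERE_DISPOSITIONS))
    return severe.astype(int)
```
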

Scoring predictability before feature engineering

To assess predictability in the raw data, an XGBoost application yields kappa = 0.041 and accuracy = 0.88; the latter is misleading due to the imbalanced data.

To judge whether there is any predictability, in the next loop these values will be compared to the kappas and accuracies obtained from the same procedure with a shuffled target column.

Feature Importance of 3 consolidated ICD-9 code columns
Standard deviation estimation in a shuffle loop over XGBoost

To assess the quality of the XGBoost metrics, a loop (n = 100) over the permuted target variable is run and mean and standard deviation estimates are computed,

as a bootstrap-like approach yielding mean(kappa) ≈ 0 ± 0.0005 and accuracy = 0.8884 (!) ± 0.0005

Target variable feature engineering loop

Addition of discharge disposition IDs to the target values and scoring via XGBoost

CatBoost to assess the predictability of readmission in severe cases

Using the original "readmitted < 30" column on the severe cases only, CatBoost is called as Python code via subprocess()
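A minimal sketch of this handover, assuming a standalone CatBoost script; the file name, column names, and model settings are placeholders:

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical standalone script: trains CatBoost on the severe-case subset
# and prints Cohen's kappa. Paths and columns are placeholders.
CATBOOST_SCRIPT = """\
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("severe_cases.csv")
y = (df["readmitted"] == "<30").astype(int)
X = df.drop(columns=["readmitted"])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = CatBoostClassifier(iterations=300, verbose=False,
                           cat_features=list(X.select_dtypes("object").columns))
model.fit(X_tr, y_tr)
print(cohen_kappa_score(y_te, model.predict(X_te)))
"""

def run_script(code: str) -> str:
    """Write a Python snippet to a temp file and execute it in a subprocess,
    returning its stdout (this mirrors the KNIME-to-CatBoost handover)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        out = subprocess.run([sys.executable, path], capture_output=True,
                             text=True, check=True)
        return out.stdout.strip()
    finally:
        os.remove(path)

# kappa = float(run_script(CATBOOST_SCRIPT))  # requires catboost and the CSV
```
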

Feature selection loop

Finding the combination of featured and non-featured columns that maximizes the XGBoost score

NB: computation takes a long time!
Feature selection loop
  • allows selecting the subset of features that is best for model construction.

  • The Genetic Algorithm, a stochastic approach that bases its optimization on the mechanics of biological evolution and genetics, is chosen here. Similar to natural selection, different solutions (individuals) are carried and mutated from generation to generation based on their performance (fitness). This approach converges to a local optimum.

  • Advantages:

    • explores the search space more globally than Forward/Backward selection and can therefore find better optima

    • Well-suited when features interact strongly or redundant groups are present

  • Disadvantages:

    • computationally intensive (many iterations).

    • result is stochastic → different results are possible with the same settings.
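The core of such a loop can be sketched with a tiny GA (truncation selection, one-point crossover, bit-flip mutation; population size, rates, and generations below are illustrative):

```python
import random

def genetic_feature_selection(n_features, fitness, pop_size=20,
                              generations=30, mut_rate=0.05, seed=0):
    """Tiny GA sketch: individuals are feature bitmasks; fitness is any
    callable mask -> score (e.g. CV Cohen's kappa of an XGBoost model)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (rng.random() < mut_rate) for g in child]  # mutation
            children.append(child)
        pop = parents + children                   # elitism: keep parents
    return max(pop, key=fitness)
```

In the workflow, `fitness` would wrap an XGBoost cross-validation score over the columns selected by the bitmask; this is what makes the loop computationally expensive.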

Introduction

The well-known UCI records comprise administrative and medical data from 101,766 patients from ~130 US hospitals

over an approximately 10-year period. The features are mainly categorical, some of which are ordinal (e.g. the number of days spent in the hospital,

the number of laboratory procedures, etc.), as well as columns for single medicines in categories of quantities.

Few records are given in a direct medical context (e.g. laboratory outcomes). These records reflect a typical administrative hospital data set.

The target column is the readmittance of patients, comprising the categories readmitted within 30 days, readmitted in > 30 days, and not readmitted

The original objective of predicting readmittance from these records has not proven convincing in many published examples.

The present analysis therefore does not focus immediately on this prediction, but rather aims to differentiate between severe and less severe

cases, in order to enhance the understanding of the relations between categories in typical hospital records

Final part: prediction of readmitted < 30 from severe cases

Prediction of readmitted < 30 days from severe cases

yields Cohen's kappa = 0.212 using featured variables

Maximum XGBoost score column combination (Cohen's kappa = 0.414):

med_spec_grouped,

diag_1_mapped,

diag_2_mapped,

diag_3_mapped,

A1Cresult_ordnl,

max_glu_serum_ordnl,

age_ordnl,

metformin_comb,

rank_race,

rank_gender,

rank_admission_type_id,

rank_payer_code,

rank_diabetesMed,

all medical aggregations,

single medicines (examide, citoglipton, insulin),

numerical data

Feature Importance of medical specialties
Final feature importance

Part 3: evaluation of featured variables through XGBoost scores to predict severe cases

  • XGBoost scores over groups of featured variables

  • feature importances of selected variables

  • feature selection loop to find the feature combination that maximizes the predictability of severe cases

Part 4: predictability of readmissions < 30 days in severe cases using featured variables through XGBoost and CatBoost scores

  • XGBoost score and feature importance over groups of featured variables

  • CatBoost on the same setting

N.B.: To apply the ranked groupby means of the target variable in each category as a surrogate quantity for estimating feature importance in a statistically correct way,
the measures have to be estimated on a first subset of the data and applied to a second subset
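This split-then-apply encoding could look like the following pandas sketch (column names are placeholders):

```python
import pandas as pd

def ranked_target_mean_encode(train, apply_to, col, target):
    """Compute groupby target means on one subset (train) and map their
    ranks onto a second subset, avoiding target leakage. Categories unseen
    in the first subset get the median rank."""
    means = train.groupby(col)[target].mean()
    ranks = means.rank(method="dense")       # ordinal rank of each category
    encoded = apply_to[col].map(ranks)
    return encoded.fillna(ranks.median())
```

Fitting the ranks on the first subset only is what makes the encoding statistically correct: the second subset's target values never influence their own encoding.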
[KNIME workflow canvas: the full node listing is omitted here. Annotated steps on the canvas include: groupby targetval means, shuffle loop, target value optimization, xgb before target var optimization, target_val_transform, readmitted > 30 are deemed "0", medication_loop, metric categories to ordinal numbers, and final feature importance.]
