s_401_spark_label_encoder

Spark Label Encoding - prepare the data in a local Big Data environment

s_401 - prepare label encoding with spark
prepare the data preparation in a big data environment:
- label encode string variables
- transform numbers into Double format (Spark ML likes that)
- remove highly correlated data
- remove NaN variables
- remove continuous variables
- optional: normalize the data
(a sketch of the resulting Spark SQL follows below)
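To illustrate what that means in practice, here is a minimal, hand-written sketch (not the code the workflow actually generates) of the kind of Spark SQL statement such rules boil down to. The columns workclass and age come from the Census Income data, the numeric codes are invented for illustration, and #table# is the placeholder the KNIME Spark SQL Query node uses for the incoming DataFrame:

-- label encode one string variable and cast one numeric variable to Double
SELECT
  CASE WHEN workclass = 'Private'          THEN 1.0
       WHEN workclass = 'Self-emp-not-inc' THEN 2.0
       ELSE 0.0
  END AS workclass,
  CAST(age AS DOUBLE) AS age
FROM #table#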

Remove correlated variables - adapt this step if you want to keep them.

The data used is a cleaned and updated version of the Census Income dataset:
https://archive.ics.uci.edu/ml/datasets/census+income

=> Adapt the handling of numeric variables to your needs!

Additional things you could do to your data:
- reduce dimensions with PCA
- balance the targets to 50/50
- normalize or log() transform your data
- replace missing values with a more sophisticated approach

Please remember: we want to do it all with Spark and on a Big Data cluster. And make sure you reproduce all the transformations (with the exception of balancing/SMOTE) on your real-life and test data.

List of output data and tables from s_401:
- nvl_numeric.txt = CAST SQL command for all selected numeric variables
- nvl_numeric_sql.table = the CAST SQL command as a KNIME table
- spark_label_encode_sql.csv = SQL command with CASE WHEN to label encode string variables
- spark_label_encode_sql.table = the same CASE WHEN SQL command as a KNIME table
- spark_label_encoder_regex_filter.table = regex to filter the string variables you want to handle with the above rules
- spark_missing.pmml = general rules for dealing with missing values
- spark_normalizer.pmml = normalization rules

s_401 - Sparkling predictions and encoded labels: use Big Data technologies like Spark to get a robust and scalable data preparation, then use the latest AutoML technology like H2O.ai AutoML to create a robust model and deploy it in a Big Data environment (like Cloudera):
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_h2o_automl_spark

Notes from the workflow annotations:

Input and label encoding: census_income_train.parquet is the training data. Manually filter the columns into string values and numeric values. The Spark SQL Query node is where the magic of label encoding happens - the generated CASE WHEN and CAST commands are injected as flow variables:
SELECT 1 AS dummy_const_var $${Sspark_lable_encoder}$$ , $${Snvl_numeric}$$ FROM #table#
The workflow also extracts specifications for how to handle the numeric variables with CAST commands. !!! Careful: in this example negative values are handled as 0 (zero).
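A hedged sketch of one such generated fragment, assuming a Census Income column capital_gain (the real fragments live in nvl_numeric.txt). Note how NULL and negative values both end up as 0:

-- one numeric CAST fragment, appended to the SELECT list above
CAST(CASE WHEN capital_gain IS NULL OR capital_gain < 0
          THEN 0
          ELSE capital_gain
     END AS DOUBLE) AS capital_gain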
NaN and continuous variables: use Spark Statistics to see if there are NaN variables, filter numeric variables with NaN, and filter continuous numeric variables. The resulting column selections are stored as reference tables (d_reference_spark_405_all.table, d_reference_spark_405_numeric.table, d_reference_spark_exclude_500.table, d_reference_spark_include_500.table), the extracted numeric specifications in numeric_cols and nvl_numeric_sql.table, and the missing value rules in spark_missing.pmml. A RegEx is created from the final list of variables.

Partitioned table: the DB Table Creator builds data_train; set a partitioning command on the tab "Additional Options": PARTITIONED BY (education STRING). Drop education from the regular column list with the regex ^(?!education$).* - a sketch of the resulting DDL follows below.
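Roughly the DDL this setup amounts to (a sketch with a shortened, assumed column list). In Hive the partition column must not appear in the regular column list, which is why education is removed with the regex above:

-- simplified Hive DDL for the partitioned training table
CREATE TABLE default.data_train (
  Target    DOUBLE,
  age       DOUBLE,
  workclass DOUBLE
)
PARTITIONED BY (education STRING)
STORED AS PARQUET;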
Housekeeping: the data_train table is dropped again (DB Table Remover), and Destroy Spark Context deletes the whole local big data folder ../data/local_big_data. If you encounter any problems, close KNIME and delete all data from the folder /data/local_big_data/. Alternative: an SQL Executor with DROP TABLE IF EXISTS default.data_all; => user action.

Normalization: the Spark Normalizer applies decimal scaling - adapt to your needs. Exclude the Target with the regex ^(?!Target$).*; the rules are written to spark_normalizer.pmml.
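The workflow does this with the KNIME Spark Normalizer node; purely as an illustration of what decimal scaling means, the same idea in Spark SQL (column name assumed) divides each value by the smallest power of 10 that pushes the maximum absolute value below 1:

-- decimal scaling sketch: e.g. max |age| = 90 gives a divisor of 100
SELECT
  age / POW(10, FLOOR(LOG10(MAX(ABS(age)) OVER ())) + 1) AS age
FROM #table#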

Nodes

Parquet Reader, Table Reader, Spark Column Filter, Persist Spark DataFrame/RDD, Spark SQL Query, Spark Label Encoder, Spark Row Sampling, Spark to Table, Spark Statistics, Spark Missing Value, Spark Normalizer, Hive to Spark, Destroy Spark Context (local spark context create), DB Table Creator, DB Loader, DB Table Remover, Column Filter, Reference Column Filter, Column Rename, Table Row to Variable, Merge Variables (deprecated), PMML Writer, Table Writer
