
s_601_spark_label_encoder

s_601 - Sparkling predictions and encoded labels - "the poor man's ML Ops" (on a Big Data System)

Use Big Data technologies like Spark to get a robust and scalable data preparation. Use the latest AutoML technology like H2O.ai AutoML to create a robust model and deploy it in a Big Data environment (like Cloudera).

s_601 - prepare label encoding with spark
prepare the data in a big data environment:
- label encode string variables
- transform numbers into Double format (Spark ML likes that)
- remove highly correlated data
- remove NaN variables
- remove continuous variables
- optional: normalize the data
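To illustrate the label-encoding step: the workflow emits a Spark SQL CASE WHEN command (see spark_label_encode_sql.csv below). A hedged Python sketch of how such an expression could be generated — this is not the workflow's actual code, and the column name "workclass" and its values are invented for illustration:

```python
# Hedged sketch (not the workflow's actual generator): build a Spark SQL
# CASE WHEN expression that label encodes one string column.
# "workclass" and its values are invented example data.

def case_when_encoder(column: str, values: list[str]) -> str:
    """Map each distinct string value of `column` to an integer code."""
    whens = " ".join(
        f"WHEN {column} = '{v}' THEN {i}" for i, v in enumerate(sorted(values))
    )
    # unseen values fall through to -1
    return f"CASE {whens} ELSE -1 END AS {column}_encoded"

sql = case_when_encoder("workclass", ["Private", "Self-emp", "State-gov"])
```

One expression like this per string column, concatenated into a SELECT, reproduces the encoding on any new data — which is why the workflow stores the SQL itself as an artifact.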





the data used is a cleaned and updated version of the Census Income dataset
https://archive.ics.uci.edu/ml/datasets/census+income

=> Adapt the handling of numeric variables to your needs!

additional things you could do to your data:
- reduce dimensions with PCA
- balance the targets to 50/50
- normalize or log() transform your data
- replace missing values with a more sophisticated approach
- remove correlated variables (or adapt the Filter Correlation Matrix step if you want to keep them)

please remember: we want to do it all with Spark and on a Big Data cluster. And make sure you reproduce all the transformations (with the exception of balancing/SMOTE) on your real-life and test data.

list of output data and tables from s_601:
- nvl_numeric.txt = CAST SQL command for all selected numeric variables
- nvl_numeric_sql.table = CAST SQL command as a KNIME table
- spark_label_encode_sql.csv = SQL command with CASE WHEN to label encode string variables
- spark_label_encode_sql.table = SQL command with CASE WHEN to label encode string variables (as a KNIME table)
- spark_label_encoder_regex_filter.table = regex to filter the string variables you want to handle with the above rules
- spark_missing.pmml = general rules for dealing with missing values
- spark_normalizer.pmml = normalization rules

full workflow: https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_h2o_automl_spark_46

selected annotations from the workflow:
- this is where the magic happens of label encoding (Spark SQL Query):
  SELECT 1 AS dummy_const_var $${Sspark_lable_encoder}$$ , $${Snvl_numeric}$$ FROM #table#
- nvl_numeric = how to handle the numeric variables with CAST commands. !!! Careful: in this example negative values are handled as 0 (zero)
- create a local big data context. If you encounter any problems, close KNIME, delete all data from the folder /big_data/ and start over
- ../data/census_income_train.parquet = the training data, loaded into the table data_train
- set a partitioning command on the tab "Additional Options": PARTITIONED BY (education STRING)
- drop education for the partition: ^(?!education$).*
- normalizer by decimal scaling; adapt to your needs and exclude the Target: ^(?!Target$).*
- customer_number simulates the existence of a customer number that would be needed to export the relevant data lines; exclude it with ^(?!customer_number$).*
- REFRESH TABLE #table# => make sure the Spark environment 'knows' about the table
- alternative: SQL Executor with DROP TABLE IF EXISTS default.data_all; (=> user action)
- reference tables: ../data/d_reference_spark_405_all.table, ../data/d_reference_spark_405_numeric.table, d_reference_spark_exclude_500.table, d_reference_spark_include_500.table
- models: ../model/spark_missing.pmml, ../model/spark_normalizer.pmml

nodes used: Spark Column Filter, Persist Spark DataFrame/RDD, Spark SQL Query, Spark Label Encoder, Table Writer, Spark Row Sampling, Spark to Table, Table Row to Variable, Spark Statistics, Spark Missing Value, Column Filter, Reference Column Filter, Hive to Spark, Column Rename, Destroy Spark Context, Spark Normalizer, DB Table Remover, Filter Correlation Matrix, Java Snippet (simple), Create Local Big Data Context, Parquet Reader, DB Table Creator, DB Loader, DB Table Selector, Table Reader, Merge Variables, PMML Writer, DB Column Filter
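The nvl_numeric CAST commands can be sketched the same way. A hedged Python sketch, not the workflow's actual generator: NVL and GREATEST are built-in Spark SQL functions, GREATEST(..., 0) mimics the "negative values are handled as 0" behaviour noted above, and the column names are invented for illustration:

```python
# Hedged sketch: one CAST expression per numeric column —
# NULL -> 0 (NVL), negatives -> 0 (GREATEST), cast to DOUBLE.
# Column names are invented example data.

def nvl_cast(columns: list[str]) -> str:
    """Build the comma-separated CAST fragment for a Spark SQL SELECT."""
    return ", ".join(
        f"CAST(GREATEST(NVL({c}, 0), 0) AS DOUBLE) AS {c}" for c in columns
    )

sql_fragment = nvl_cast(["age", "hours_per_week"])
```

A fragment like this is what gets substituted for the $${Snvl_numeric}$$ flow variable in the Spark SQL Query above.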

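The column filters above rely on negative-lookahead patterns such as ^(?!education$).* to keep every column except the named one. A quick demonstration of how such a pattern behaves (column names invented for illustration):

```python
import re

# Negative lookahead: the pattern matches any column name EXCEPT
# exactly "education". Column names are invented example data.
pattern = re.compile(r"^(?!education$).*")

cols = ["age", "education", "workclass"]
kept = [c for c in cols if pattern.fullmatch(c)]  # "education" is dropped
```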