02_Document_Classification

Document Classification: Model Training and Deployment

The goal of this workflow is to classify YouTube comments as spam or ham (non-spam). The workflow starts with a data table containing YouTube comments taken from the YouTube Spam Collection Data Set at the UCI ML Repository [1]; the data is available in the workflow directory. The comments are divided into the two categories spam and ham, and the two classes are roughly balanced.
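As a minimal sketch of the data-import step outside of KNIME, the comments could be loaded with pandas. The file name below is hypothetical, and the CONTENT and CLASS column names (with 1 = spam, 0 = ham) are assumptions based on the CSV layout of the UCI collection.

```python
# Minimal loading sketch; file and column names are assumptions.
import pandas as pd

df = pd.read_csv("YoutubeComments.csv")            # hypothetical file name
comments = df["CONTENT"]                           # raw comment text
labels = df["CLASS"].map({0: "ham", 1: "spam"})    # assumed encoding: 1 = spam, 0 = ham
print(labels.value_counts())                       # the two classes are roughly balanced
```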

First, the comments are converted into documents whose category is the class label, spam or ham. The documents are then preprocessed by filtering and stemming. After that, the documents are transformed into a bag of words, which is filtered again: only terms that occur in at least 1% of the documents (i.e., in at least 3 documents) are kept as features; all other terms are filtered out. Finally, the documents are transformed into document vectors, a numerical representation that is subsequently used for classification with a support vector machine.
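Continuing from the loading sketch above, these preprocessing and modeling steps (punctuation and number erasure, case conversion, short-term and stop-word filtering, Snowball stemming, a binary bag of words restricted to terms in at least 1% of the documents, and an SVM) can be approximated with NLTK and scikit-learn. The minimum term length, split ratio, and SVM kernel below are assumptions, not the KNIME node settings.

```python
# Sketch of the preprocessing, transformation, and modeling steps.
import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

stemmer = SnowballStemmer("english")

def preprocess(text):
    text = re.sub(r"[^A-Za-z\s]", " ", text)                   # punctuation/number erasure
    tokens = [t.lower() for t in text.split() if len(t) >= 3]  # case conversion, N-chars filter (assumed N = 3)
    return " ".join(stemmer.stem(t) for t in tokens)           # Snowball stemming

docs = [preprocess(c) for c in comments]

# Binary document vectors; min_df=0.01 keeps only terms that occur
# in at least 1% of the documents, mirroring the term filtering above.
vectorizer = CountVectorizer(stop_words="english", min_df=0.01, binary=True)
X = vectorizer.fit_transform(docs)

# Partition into training and test sets, learn an SVM, and score it.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0)
svm = SVC().fit(X_train, y_train)
print("accuracy:", svm.score(X_test, y_test))
```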

The lower part contains the deployment workflow, which reads the trained model back in and applies it to new comments.
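In the KNIME workflow this hand-over happens via the PMML Writer and PMML Reader nodes. A rough Python analogue, reusing the fitted objects and preprocess function from the sketch above, with joblib standing in for PMML and hypothetical file and data values, might look like this:

```python
# Model hand-over sketch; joblib stands in for the PMML Writer/Reader pair.
import joblib

# Training side: persist the fitted vectorizer and model together.
joblib.dump((vectorizer, svm), "spam_model.joblib")

# Deployment side: re-import the model and apply it to new comments.
vec, model = joblib.load("spam_model.joblib")
new_comments = ["check out my channel!!!", "love this song"]  # hypothetical deployment data
X_new = vec.transform([preprocess(c) for c in new_comments])
print(model.predict(X_new))                                    # predicted spam/ham labels
```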

This workflow shows how to import textual data, preprocess documents by filtering and stemming, transform documents into a bag of words and document vectors, and finally build a predictive model to classify the documents. It also contains the corresponding deployment workflow.

[Workflow canvas: the training workflow and the deployment workflow (Document Classification: Model Deployment) are each organized into Data Import, Preprocessing, Transformation, and Predictive Modeling and Scoring sections. Annotated steps include converting the YouTube comments to documents, binary vector creation based on document frequency, coloring by category (class), and partitioning into training and test sets.]

Nodes

Table Reader, Strings To Document, Punctuation Erasure, N Chars Filter, Number Filter, Case Converter, Stop Word Filter, Snowball Stemmer, Term Filtering, Document Vector, Column Filter, Category To Class, Color Manager, Partitioning, Concatenate, Missing Value, SVM Learner, SVM Predictor, Scorer, PMML Writer, PMML Reader
