
02_​Document_​Classification

Workflow

Document Classification: Model Training and Deployment
The goal of this workflow is spam classification of YouTube comments. The workflow starts with a data table containing YouTube comments taken from the YouTube Spam Collection Data Set at the UCI ML Repository [1]; the data is available in the workflow directory. The comments are divided into two categories, spam and ham (non-spam), with a roughly equal distribution between the two. First, the comments are converted into documents whose category is the class, spam or ham. The documents are then preprocessed by filtering and stemming. After that, the documents are transformed into a bag of words, which is filtered again: only terms that occur in at least 1% of the documents (i.e., in at least 3 documents) are kept as features; all other terms are filtered out. The documents are then transformed into document vectors, a numerical representation that is subsequently used for classification with a support vector machine. The lower part of the workflow contains the deployment workflow.
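The preprocessing and transformation steps described above can be sketched in plain Python. This is a minimal illustration, not the KNIME implementation: the stop-word list and the 3-character minimum are assumptions, and the stemming step is omitted for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; KNIME's Stop Word
# Filter ships with much larger built-in lists).
STOP_WORDS = {"the", "a", "is", "to", "of", "it"}

def preprocess(text):
    # Case Converter + Punctuation Erasure: lowercase, strip punctuation.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = text.split()
    # Number Filter, N Chars Filter (drop terms shorter than 3 characters,
    # an assumed threshold), and Stop Word Filter.
    return [t for t in tokens
            if not t.isdigit() and len(t) >= 3 and t not in STOP_WORDS]

def build_vocabulary(docs, min_df=0.01):
    # Term Filtering: keep only terms that occur in at least min_df
    # (a fraction) of the documents.
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    threshold = max(1, min_df * len(docs))
    return sorted(t for t, count in df.items() if count >= threshold)

def binary_vector(doc, vocab):
    # Document Vector (binary creation): 1 if the term occurs in the
    # document, 0 otherwise.
    terms = set(doc)
    return [1 if t in terms else 0 for t in vocab]
```

For example, `preprocess("Check out my channel!!! 123")` yields `["check", "out", "channel"]`: punctuation and the number are erased, and `my` is dropped by the length filter. The resulting binary vectors are the features fed to the classifier.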
NLP, Natural Language Processing, Text Classification
This workflow shows how to import textual data, preprocess documents by filtering and stemming, transform documents into a bag of words and document vectors, and finally build a predictive model to classify the documents. It also contains the corresponding deployment workflow.

[Workflow diagram: the training workflow is organized into Data Import, Preprocessing, Transformation, and Predictive Modeling and Scoring sections; the deployment workflow ("Document Classification: Model Deployment") mirrors the same sections and reuses the dictionary structure from training. Nodes used include Table Reader, Strings To Document, Punctuation Erasure, N Chars Filter, Number Filter, Case Converter, Stop Word Filter, Snowball Stemmer, Term Filtering, Document Vector, Column Filter, Category To Class, Color Manager, Partitioning, SVM Learner, SVM Predictor, Scorer, PMML Writer, PMML Reader, Concatenate, and Missing Value.]
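The deployment half of the workflow trains the SVM once, writes the fitted model to disk, and later reloads it to score new document vectors. A minimal Python analogue, assuming scikit-learn is installed; the toy vectors and labels are invented, and `pickle` stands in for KNIME's PMML Writer/Reader export:

```python
import pickle
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Toy binary document vectors (one row per document) with spam/ham
# labels; in the workflow these come from the Document Vector node.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = ["spam", "spam", "ham", "ham"]

# Training side: fit the support vector machine (KNIME: SVM Learner).
model = SVC(kernel="linear").fit(X, y)

# Persist the fitted model. KNIME exports the model as PMML via the
# PMML Writer node; pickle is the plain-Python stand-in used here.
blob = pickle.dumps(model)

# Deployment side: reload the model (KNIME: PMML Reader) and score
# new, unlabeled document vectors (KNIME: SVM Predictor).
deployed = pickle.loads(blob)
predictions = deployed.predict([[1, 0, 0], [0, 1, 1]])
```

The point of the split is that the deployment branch never retrains: it only needs the serialized model plus the same preprocessing and vocabulary (the "dictionary structure") used during training, so new comments are vectorized in exactly the same feature space.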

Download

Get this workflow from the following link: Download

Resources

Nodes

02_​Document_​Classification consists of the following 78 nodes:

Plugins

02_​Document_​Classification contains nodes provided by the following 3 plugins: