06_Calculate_Document_Distance_Using_Word_Vectors

Calculate Document Distance using Word Vectors

First, we read in a dataset containing sentences and assign each document a unique label. The label is used to create a document vector, which represents the whole document rather than single words. Next, we train a Doc2Vec model using the Word Vector Learner node. The Learner node outputs a word vector model containing a vocabulary of all learned words and labels together with their corresponding vectors. These can be extracted with a Vocabulary Extractor node: the first output port contains a column with the words and a collection column with the corresponding word vectors, and the second output port contains the same for the labels. The length of the vectors (layer size) and other learning parameters can be adjusted in the dialog of the Word Vector Learner node.
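To make the idea of one vector per document label concrete, here is a minimal Python sketch. It assumes the second output port of the Vocabulary Extractor can be thought of as a table from label to vector; the labels (`DOC_0` and so on), the vector values, and the layer size of 4 are made up for illustration and are not output of the workflow.

```python
import math

# Hypothetical label -> document vector table (layer size 4), standing in
# for the label output of the Vocabulary Extractor node. Values are invented.
doc_vectors = {
    "DOC_0": [0.9, 0.1, 0.0, 0.2],
    "DOC_1": [0.8, 0.2, 0.1, 0.2],
    "DOC_2": [-0.7, 0.9, 0.5, -0.3],
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Compare one document against a similar and a dissimilar one.
near = euclidean_distance(doc_vectors["DOC_0"], doc_vectors["DOC_1"])
far = euclidean_distance(doc_vectors["DOC_0"], doc_vectors["DOC_2"])
print(f"DOC_0 vs DOC_1: {near:.3f}")
print(f"DOC_0 vs DOC_2: {far:.3f}")
```

Which distance measure suits a given model is a judgment call; Euclidean and cosine distance are shown here only as two common options.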
To visualize the result of the Learner, we select six sentences from the training set: five that are very similar to each other and one that is dissimilar to the other five. Next, we use PCA to reduce the dimensionality of the document vectors to two so that we can show them in a scatter plot. In the plot, the sentences are easy to tell apart: the dissimilar sentence lies far from all the others, whereas the similar sentences lie close to each other.
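The projection step can be sketched in plain Python as follows. This is a hand-rolled PCA based on power iteration, not the implementation used by the KNIME PCA node, and the six 4-dimensional vectors are invented stand-ins for five similar documents and one dissimilar one.

```python
import math

def center(rows):
    """Subtract the column means so the data is centered at the origin."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n, dim = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dim)] for i in range(dim)]

def top_component(matrix, iterations=500):
    """Approximate the dominant eigenpair of a symmetric matrix."""
    dim = len(matrix)
    v = [1.0 / (i + 1) for i in range(dim)]  # arbitrary non-uniform start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left in the matrix
        v = [x / norm for x in w]
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(dim))
                for i in range(dim))
    return value, v

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    dim = len(cov)
    components = []
    for _ in range(2):
        value, vec = top_component(cov)
        components.append(vec)
        # Deflation: remove the component just found from the matrix.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(dim)]
               for i in range(dim)]
    return [[sum(r[k] * c[k] for k in range(dim)) for c in components]
            for r in centered]

# Invented document vectors: the first five are similar, the last is not.
vectors = [
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.2],
    [0.9, 0.2, 0.0, 0.1],
    [0.8, 0.1, 0.1, 0.3],
    [0.9, 0.1, 0.1, 0.2],
    [-0.7, 0.9, 0.5, -0.3],
]
points = pca_2d(vectors)
for index, (x, y) in enumerate(points):
    print(f"sentence {index}: ({x:.3f}, {y:.3f})")
```

With data shaped like this, the first five points should land close together and the last one far away, which is the pattern the scatter plot is meant to reveal.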

Workflow Requirements
KNIME Analytics Platform 3.4.0
KNIME Deeplearning4J Integration
KNIME Deeplearning4J Integration Text Processing Extension

Workflow diagram: Calculate Document Distance using Word Vectors. This example shows how to train a Word Vector model as well as some properties of the resulting vectors.
Nodes and annotations: Read Sentences; Doc2Vec Learner; Vocabulary Extractor; Joiner (join with vectors); Select sentences (select 5 similar sentences and one that is dissimilar to the rest); Split Collection Column (split collection for PCA); PCA (deprecated) (reduce to 2-dim); Color Manager (add color); Scatter Plot (plot).
