JKISeason3-16_tomljh

Analyzing User Reviews


Level: Medium

Description: You are the architect behind an innovative vocal assistance device, and your initial goal is to process user reviews and uncover insights about "sound quality". To this end, you decide to use 4-grams to discover frequently mentioned words near the term "sound quality". What trends emerge from the 4-gram frequencies? What is the top 4-gram? Hint: The NGram Creator node can come in handy here!

Author: Michele Bassanelli

Dataset: User Reviews Data in the KNIME Community Hub

1. What trends emerge from the 4-gram frequencies?
As the value of N increases, the document frequency of the resulting phrases becomes smaller and smaller.

2. What is the top 4-gram?
Since the phrase to be checked, "sound quality", is composed of two English words, N must be at least 3 for an N-gram to include a neighboring word. You can try N = 3, 4, 5, and so on. However, when N is greater than or equal to 5, every document frequency is 1, meaning that each phrase appears only once across all documents, which is statistically insignificant. So the largest useful value of N is 4.

Note on row IDs: this step serves only to locate the row of the original document. For example, "Row647_0" indicates that the row ID of the original document is "Row647".

Workflow annotations: amazon_alexa.tsv; N = 4; Ngram frequencies; Delete duplicate rows of data; *sound quality*; N = 3; NGram bag of words; *sound quality*; delete "sound quality"; group by: Ngram, agg: sum

Nodes in the workflow: CSV Reader, Strings to Document, Punctuation Erasure, Number Filter, N Chars Filter, Case Converter, NGram Creator, Stop Word Filter, Column Filter, Duplicate Row Filter, Tag Cloud (JavaScript), Row Filter, NGram Creator, Row Filter, String Manipulation, GroupBy, Tag Cloud (JavaScript), Joiner
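
The trend described above can be illustrated outside KNIME with a minimal Python sketch. This is not the workflow itself, which relies on the NGram Creator node and several preprocessing nodes. The four reviews are made up for illustration, tokenization is a plain lowercase whitespace split, and the helper names (word_ngrams, document_frequency, source_row_id) are hypothetical.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return all word-level n-grams of a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_frequency(docs, n, phrase):
    """For each n-gram containing `phrase`, count how many documents contain it."""
    df = Counter()
    for doc in docs:
        # The set ensures an n-gram is counted at most once per document.
        # Padding with spaces makes the match respect word boundaries.
        df.update({g for g in word_ngrams(doc, n) if f" {phrase} " in f" {g} "})
    return df


def source_row_id(key):
    """Strip the trailing '_<index>' suffix, e.g. 'Row647_0' -> 'Row647'."""
    return key.rsplit("_", 1)[0]


# Made-up reviews for illustration only (not taken from amazon_alexa.tsv).
reviews = [
    "love the sound quality of this speaker",
    "the sound quality is great",
    "great sound quality for the price",
    "the sound quality is good for music",
]

for n in (3, 4, 5):
    df = document_frequency(reviews, n, "sound quality")
    print(n, df.most_common(1))
```

On this toy input the highest document frequency drops from 3 (N = 3) to 2 (N = 4) to 1 (N = 5), which mirrors the reasoning for stopping at N = 4. Results on the real dataset depend on the preprocessing steps (punctuation erasure, stop word filtering, and so on) that this sketch leaves out.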
