Text clustering of Wikipedia articles II

Text clustering of Wikipedia articles. Twelve Wikipedia articles, three each on Philosophy, Religion, Law and Quantum Mechanics, were randomly selected, manually copied from the Internet and saved as twelve text files (*.txt) in one folder. The files were then read, text-processed and clustered hierarchically. The clustering is perfect, although the sample is only 12 files: at the lowest level of the dendrogram, the articles on each subject cluster together first. Any distance measure other than 'cosine' reduces accuracy drastically.
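The sensitivity to the distance measure can be illustrated with a small sketch. It assumes (the workflow description does not say so) that article length is one reason non-cosine measures do worse: cosine distance ignores vector length, whereas Euclidean distance does not. The three term-count vectors and the four-word vocabulary below are made up for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; depends only on the direction of the vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance; grows with differences in document length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical term counts over the vocabulary: quantum, state, law, court
short_physics = [2, 1, 0, 0]
long_physics = [6, 3, 0, 0]   # same word mix, three times as long
short_law = [0, 0, 2, 1]

# Cosine treats the two physics documents as identical in direction,
# while Euclidean places the short physics text nearer to the law text.
print(cosine_distance(short_physics, long_physics))
print(euclidean_distance(short_physics, long_physics))
print(euclidean_distance(short_physics, short_law))
```

In this toy case a Euclidean-based clustering would pair the short physics text with the law text before pairing it with the longer physics text, which is the kind of error that would break the subject-level grouping.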

Last amended: 24th Jan, 2023

Workflow sections: Text cleaning; Read files & convert to Documents; Modeling

Workflow annotations: use filenames as categories; lowercase; remove punctuation; remove numbers; rel tf; normalized idf; calculate tf * idf; vector value = tfidf; remove number; Average linkage; Doc category (filenames) vs clusters; cosine distance; Extract doc category values as another column; replace RowIDs by another column; 17 text files. Each appears in 2nd column

Workflow nodes: Strings To Document, Case Converter, Punctuation Erasure, Number Filter, TF, IDF, Column Expressions, Bag Of Words Creator, Document Vector, Stop Word Filter, Document Viewer, Number Filter, Hierarchical Clustering (DistMatrix), Hierarchical Cluster View, Numeric Distances, Category To Class, RowID, Table Reader
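The annotated steps (lowercase, remove punctuation, remove numbers, stop word filter, rel tf, idf, tf * idf vectors, cosine distance, average linkage) can be approximated outside KNIME with a minimal pure-Python sketch. This is not the workflow itself: the stop-word list, the four toy documents and all function names are hypothetical, and "normalized idf" is assumed here to mean log(N / df) with a natural logarithm.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}  # tiny illustrative list

def clean(text):
    """Lowercase, strip punctuation and numbers, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    text = re.sub(r"\d+", " ", text)       # remove numbers
    return [w for w in text.split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Map each document name to a vector of relative tf * idf values."""
    tokens = {name: clean(text) for name, text in docs.items()}
    vocab = sorted({w for toks in tokens.values() for w in toks})
    n_docs = len(docs)
    doc_freq = Counter(w for toks in tokens.values() for w in set(toks))
    idf = {w: math.log(n_docs / doc_freq[w]) for w in vocab}  # assumed idf form
    vectors = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        vectors[name] = [counts[w] / len(toks) * idf[w] for w in vocab]
    return vectors

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def average_linkage(vectors, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise cosine distance until n_clusters remain."""
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(cosine_distance(vectors[a], vectors[b])
                            for a in clusters[i] for b in clusters[j])
                mean = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or mean < best[0]:
                    best = (mean, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy stand-ins for the text files; keys play the role of filenames (categories).
docs = {
    "law_1": "The court applied the statute; the judge cited 3 precedents.",
    "law_2": "A judge in court interprets the statute and precedents.",
    "quantum_1": "The electron wavefunction collapses in 1 measurement.",
    "quantum_2": "Measurement of an electron changes the wavefunction.",
}
clusters = average_linkage(tfidf_vectors(docs), n_clusters=2)
print(clusters)
```

Comparing each cluster's members with the subject prefix of the names mirrors the "Doc category (filenames) vs clusters" check in the workflow. On real articles, a library implementation of TF-IDF and hierarchical clustering would be a more practical choice than this quadratic-time loop.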
