JKISeason3-30

Kaggle

The goal of this competition is to predict which of the provided pairs of questions contain two questions with the same meaning. The ground truth is the set of labels that have been supplied by human experts. The ground truth labels are inherently subjective, as the true meaning of sentences can never be known with certainty. Human labeling is also a 'noisy' process, and reasonable people will disagree. As a result, the ground truth labels on this dataset should be taken to be 'informed' but not 100% accurate, and may include incorrect labeling. We believe the labels, on the whole, to represent a reasonable consensus, but this may often not be true on a case by case basis for individual items in the dataset.

Please note: as an anti-cheating measure, Kaggle has supplemented the test set with computer-generated question pairs. Those rows do not come from Quora, and are not counted in the scoring. All of the questions in the training set are genuine examples from Quora.

Data fields

id - the id of a training set question pair
qid1, qid2 - unique ids of each question (only available in train.csv)
question1, question2 - the full text of each question
is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

Description: You work for a company that wants to improve its support for customers. Once a customer submits a question, a system should find similar questions that were submitted in the past in its database, fetch the answers that were given, and send all this data to a customer service representative. The representative should review these answers and leverage them to assist the current customer. As a first step to create this system, you should create a mechanism that recognizes whether two questions have the same intent. This will be key for finding relevant previous questions in the company’s database, leading to more effective support. Given a dataset of question pairs, annotated with whether or not they have the same intent, create a classifier that learns how to make this distinction.

Hint: You can find more information about the datasets here.

Hint 2: The KNIME Textprocessing extension is helpful for creating features to represent the questions.

Workflow node labels: CSV Reader, Scorer (JavaScript), XGBoost Predictor, XGBoost Predictor, Scorer (JavaScript), Clean Data / Partition, Generate Terms, Test set, Training set
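
For readers who want to prototype the feature step outside KNIME, here is a minimal Python sketch of one way to turn a question pair into numeric features. It is an illustration only, not the features this workflow actually builds: the functions tokenize and pair_features, the three overlap features, and the sample rows are all assumptions made for this example. Features like these could then be passed to a classifier, for instance a gradient-boosted model similar in spirit to the one behind the XGBoost Predictor nodes.

```python
import re


def tokenize(text):
    """Lowercase a question and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pair_features(question1, question2):
    """Return simple term-overlap features for one question pair.

    These features are hypothetical examples, not the ones used in the workflow.
    """
    terms1, terms2 = tokenize(question1), tokenize(question2)
    shared = terms1 & terms2
    union = terms1 | terms2
    return {
        # Number of distinct terms that appear in both questions.
        "shared_terms": len(shared),
        # Jaccard similarity: shared terms divided by all distinct terms.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Absolute difference in the number of distinct terms.
        "length_diff": abs(len(terms1) - len(terms2)),
    }


# Made-up rows following the (question1, question2, is_duplicate) layout.
rows = [
    ("How do I reset my password?", "How can I reset my password?", 1),
    ("How do I reset my password?", "What is the capital of France?", 0),
]

for q1, q2, label in rows:
    print(label, pair_features(q1, q2))
```

Overlap features of this kind are cheap to compute and give a reasonable baseline, but they cannot tell apart questions that share most words yet differ in intent, so they are best treated as a starting point.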
