
03.1_In-database&Spark_processing

The company tracks the usage of the website and stores the information about each session.

- Various data are collected for each session, e.g., start time, duration, and number of clicks, as well as an optional session satisfaction score.
- The company calculates aggregated statistics for each customer, e.g., total number of visits and average satisfaction, and updates the "statistics" table in different locations.
- The session satisfaction score column has missing values that need to be imputed, e.g., with machine learning predictions.

We access the usage data from Hive, and the personal data (anonymized and updated in sessions 1 and 2) and the contract data from a database. We perform in-database processing, read the data into Spark, enrich the usage data with the personal and contract data to better predict the missing values, aggregate the usage data in Spark, and save the aggregated data.
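The enrich, impute, and aggregate steps can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the workflow itself: it uses in-memory lists instead of Hive, a database, and Spark, and it substitutes a simple per-contract-type mean for the machine learning prediction. All column names (`customer_id`, `satisfaction`, `contract`) and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data; names and values are illustrative only.
sessions = [
    {"customer_id": 1, "duration": 120, "clicks": 10, "satisfaction": 4.0},
    {"customer_id": 1, "duration": 60, "clicks": 4, "satisfaction": None},
    {"customer_id": 2, "duration": 300, "clicks": 25, "satisfaction": 2.0},
    {"customer_id": 2, "duration": 30, "clicks": 2, "satisfaction": None},
]
contracts = {1: "premium", 2: "basic"}  # customer_id -> contract type


def enrich(rows, contract_by_customer):
    """Attach the contract type to each session (a left join on customer_id)."""
    return [{**r, "contract": contract_by_customer.get(r["customer_id"])} for r in rows]


def impute(rows):
    """Fill missing scores with the mean observed score of the same contract type.

    This is a stand-in for a trained model, assumed here for brevity.
    """
    observed = defaultdict(list)
    for r in rows:
        if r["satisfaction"] is not None:
            observed[r["contract"]].append(r["satisfaction"])
    all_scores = [s for scores in observed.values() for s in scores]
    fallback = mean(all_scores) if all_scores else None

    filled = []
    for r in rows:
        score = r["satisfaction"]
        if score is None:
            group = observed.get(r["contract"])
            score = mean(group) if group else fallback
        filled.append({**r, "satisfaction": score})
    return filled


def aggregate(rows):
    """Compute per-customer statistics: total visits and average satisfaction."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r["customer_id"]].append(r)
    stats = {}
    for customer_id, group in grouped.items():
        scores = [r["satisfaction"] for r in group if r["satisfaction"] is not None]
        stats[customer_id] = {
            "total_visits": len(group),
            "avg_satisfaction": mean(scores) if scores else None,
        }
    return stats


statistics_table = aggregate(impute(enrich(sessions, contracts)))
```

Enriching before imputing matters in this sketch for the same reason as in the workflow: the contract type gives the imputation step extra information about the customer, so a missing score is filled from similar customers rather than from the overall mean.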
