Challenge 13 - Onsite and Online Transactions

Four solutions to extract data based on various scenarios, catering for:

1. Simplicity
2. Dimensional Reduction
3. Scalability
4. Parallelism

Each approach is benchmarked and compared. Results are verified against the expected results table.

Tasks

Extract digits from the transactions (which are related to the bought products) given the following guidelines:

(1) If the onsite transaction starts with “L”, take its first 12 digits; otherwise, take its first 6 digits.
(2) If the onsite transaction has a missing value, take the string from the online transaction.

What is the most efficient way to perform this task? For the example data, you should produce the expected output column.

Approach 1: All at once
Pros: Simple and straightforward. No fancy coding.
Cons: Might become slower for prohibitively large data sets.
Possible improvements:
- Dimensional reduction
- Looping per case (6 digits, 12 digits, etc.)
- Parallel data processing

Approach 2: Like Approach 1, but with dimensional reduction
Pros: Approach 1 with the added bonus of working only on the necessary data.
Cons: Might become slower for prohibitively large data sets.
Possible improvements:
- Looping per case (6 digits, 12 digits, etc.)
- Parallel data processing

Approach 3: Like Approach 2, but with a loop per case to reduce resource consumption (divide et impera). Missing onsite values are processed in parallel.
Pros: Approach 1 with the added bonus of working only on the necessary data and in chunks. Allows multiple conditions.
Cons: Slightly more complex. Sorting at the end can be computationally expensive for large data sets.
Possible improvements:
- Parallel data processing

Approach 4: Optimized for maximum throughput via parallel processing of both cases. No sort required. Processes only the necessary data (dimensional reduction).
Pros: Approach 1 with the added bonus of working only on the necessary data and in chunks. Allows multiple conditions.
Cons: Slightly more complex. Sorting at the end can be computationally expensive for large data sets.
Possible improvements:
- Parallel data processing

Workflow annotations: Sample data verification; Task 1: Get Product Codes; Task 2: Fallback for missing online transaction; Run 1000 times; Fail if differs to control table; Collect Performance Results; Right click to change to inflated data set.

Workflow nodes: Create Sample Data, Choose Data Set Type, Table Creator, Process Data, String Manipulation, Column Merger, Column Resorter, Concatenate, Table Difference Finder, Benchmark Start (Memory Monitoring), Benchmark End (Memory Monitoring), Save and run Garbage Collector, Collect Performance Metrics.
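
The two guidelines in the task can be sketched outside of KNIME as a small Python function. This is only an illustration of the rule, not the workflow's actual implementation. It assumes that "digits" means digit characters only (so a leading "L" is skipped), that a missing value is represented as None, and that the sample rows and the control list are made up for this example. The final check mirrors the "Fail if differs to control table" step.

```python
import re
from typing import Optional


def extract_product_code(onsite: Optional[str], online: Optional[str]) -> Optional[str]:
    """Apply the two task guidelines to one row (hypothetical helper)."""
    # Guideline 2: a missing onsite value falls back to the online string as-is.
    if onsite is None:
        return online
    # Guideline 1: 12 digits if the value starts with "L", otherwise 6.
    wanted = 12 if onsite.startswith("L") else 6
    # Assumption: only digit characters count, so non-digits are dropped first.
    digits = re.sub(r"\D", "", onsite)
    return digits[:wanted]


# Hypothetical sample rows: (onsite transaction, online transaction).
rows = [
    ("L123456789012345", None),
    ("987654321", "ignored"),
    (None, "ONLINE-42"),
]

codes = [extract_product_code(onsite, online) for onsite, online in rows]

# Hypothetical control table; the run fails if the result differs from it.
control = ["123456789012", "987654", "ONLINE-42"]
if codes != control:
    raise ValueError(f"result differs from control table: {codes}")
```

Handling each row in one pass corresponds most closely to Approach 1. The later approaches would first split the rows by case (starts with “L”, other, missing onsite) and process each group separately.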