
Data_Preparation

Cleaning the NYC taxi dataset on Spark
The NYC taxi dataset contains over 1 billion taxi trips in New York City between January 2009 and December 2017 and is provided by the NYC Taxi and Limousine Commission (TLC) [1]. It covers not only the regular yellow cabs but also green taxis, which started in August 2013, and For-Hire Vehicles (e.g. Uber), starting from January 2015. Each taxi trip is recorded with information such as the pickup and dropoff locations, datetime, number of passengers, trip distance, fare amount, and tip amount.

Since the dataset was first published, the TLC has made several changes to it, e.g. renaming, adding, and removing columns. Therefore, some preprocessing is needed before loading the data into a database. The goal of this workflow is to get the dataset from [1] and load it onto Spark for preprocessing. The preprocessing includes unifying the columns (names, values, data types), reverse geocoding (assigning GPS coordinates or location IDs to their corresponding taxi zones), and filtering out negative values that don't make sense. At the end, the cleaned data is stored in an Amazon S3 bucket in Parquet format, ready for further analysis.

[1] http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

Workflow annotations: Enter your Amazon S3 credentials and Apache Livy URL here; Get the data URLs from the TLC website.

Workflow nodes: Variable Loop End, Table Row To Variable Loop Start, Get the URLs, Preprocess Yellow taxi dataset, CASE Switch Variable (Start), Java Edit Variable, CASE Switch Variable (End), Preprocess FHV taxi dataset, Preprocess Green taxi dataset, Amazon S3 Connection, Create Spark Context (Livy).
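The three preprocessing steps described above (unifying columns, reverse geocoding, filtering negative values) can be illustrated with a minimal plain-Python sketch. This is not how the workflow itself is built: the workflow runs these steps with KNIME nodes on Spark. The column aliases, the unified column names, the helper function names, and the toy zone polygon below are all assumptions made for illustration. Only the coordinate case of reverse geocoding is shown, using a standard ray-casting point-in-polygon test; mapping location IDs to zones would be a plain lookup instead.

```python
from typing import Optional

# Hypothetical aliases: different dataset versions name the same column differently.
COLUMN_ALIASES = {
    "Trip_Pickup_DateTime": "pickup_datetime",
    "tpep_pickup_datetime": "pickup_datetime",
    "lpep_pickup_datetime": "pickup_datetime",
    "Fare_Amt": "fare_amount",
    "Trip_Distance": "trip_distance",
}

# Fields for which a negative value makes no sense (assumed list).
NON_NEGATIVE_FIELDS = ("trip_distance", "fare_amount")


def unify_columns(row: dict) -> dict:
    """Rename known aliases; lower-case any other column name."""
    unified = {}
    for name, value in row.items():
        key = name.strip()
        unified[COLUMN_ALIASES.get(key, key.lower())] = value
    return unified


def cast_numeric(row: dict, fields=NON_NEGATIVE_FIELDS) -> dict:
    """Convert the given fields to float so all versions share one data type."""
    out = dict(row)
    for field in fields:
        if field in out:
            out[field] = float(out[field])
    return out


def is_valid(row: dict, fields=NON_NEGATIVE_FIELDS) -> bool:
    """Keep a trip only if none of the checked fields is negative."""
    return all(row.get(field, 0.0) >= 0 for field in fields)


def point_in_polygon(lon: float, lat: float, polygon: list) -> bool:
    """Ray casting: count how many polygon edges a ray going east crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_zone(lon: float, lat: float, zones: dict) -> Optional[str]:
    """Return the name of the first zone containing the point, else None."""
    for name, polygon in zones.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None
```

A row would pass through `unify_columns`, then `cast_numeric`, and be dropped if `is_valid` returns `False`; `assign_zone` would then attach a zone name to each pickup or dropoff coordinate. On Spark, the same logic would typically be expressed as column renames, casts, row filters, and a spatial join rather than per-row Python functions.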
