This workflow handles the preprocessing of the NYC taxi dataset (loading, cleaning, filtering, etc.). The NYC taxi dataset contains over 1 billion taxi trips in New York City between January 2009 and December 2017 and is provided by the NYC Taxi and Limousine Commission (TLC) [1]. It covers not only the regular yellow cabs, but also green taxis, which were added in August 2013, and For-Hire Vehicles (e.g., Uber), which were added in January 2015. Each taxi trip is recorded with information such as the pickup and dropoff locations, datetime, number of passengers, trip distance, fare amount, and tip amount.

Since the dataset was first published, the TLC has made several changes to it, e.g., renaming, adding, and removing columns. Therefore, some preprocessing is needed before the data can be loaded into the database.

The goal of this workflow is to get the dataset from [1] and load it into Spark for preprocessing. The preprocessing includes unifying the columns (names, values, datatypes), reverse geocoding (assigning GPS coordinates or location IDs to their corresponding taxi zones), and filtering out negative values that don't make sense. At the end, the cleaned data is stored in an Amazon S3 bucket in Parquet format, ready for further analysis.

[1] http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
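The workflow itself is built from KNIME and Spark nodes, but the column unification and negative-value filtering it describes can be illustrated with a minimal plain-Python sketch. The alias table, the unified column names, and the helper functions `unify_record`, `is_plausible`, and `preprocess` are assumptions made for illustration; they are not taken from the workflow.

```python
from datetime import datetime

# Hypothetical alias table: maps column names from different schema
# versions to one unified name. The real workflow's mapping may differ.
COLUMN_ALIASES = {
    "trip_pickup_datetime": "pickup_datetime",
    "tpep_pickup_datetime": "pickup_datetime",
    "lpep_pickup_datetime": "pickup_datetime",
    "pickup_datetime": "pickup_datetime",
    "fare_amt": "fare_amount",
    "fare_amount": "fare_amount",
    "tip_amt": "tip_amount",
    "tip_amount": "tip_amount",
    "trip_distance": "trip_distance",
    "passenger_count": "passenger_count",
}

# Columns that are cast to numbers and must not be negative (assumed list).
NUMERIC_COLUMNS = ("fare_amount", "tip_amount", "trip_distance", "passenger_count")


def unify_record(raw):
    """Rename known columns, drop unknown ones, and cast datatypes."""
    unified = {}
    for name, value in raw.items():
        key = COLUMN_ALIASES.get(name.strip().lower())
        if key is not None:
            unified[key] = value
    for col in NUMERIC_COLUMNS:
        if col in unified:
            unified[col] = float(unified[col])
    if "pickup_datetime" in unified:
        unified["pickup_datetime"] = datetime.strptime(
            unified["pickup_datetime"], "%Y-%m-%d %H:%M:%S"
        )
    return unified


def is_plausible(record):
    """Reject records with a negative value in any numeric column."""
    return all(record.get(col, 0.0) >= 0 for col in NUMERIC_COLUMNS)


def preprocess(rows):
    """Unify every row, then keep only the plausible ones."""
    unified = (unify_record(row) for row in rows)
    return [record for record in unified if is_plausible(record)]


if __name__ == "__main__":
    sample = [
        {"Trip_Pickup_DateTime": "2009-01-01 00:10:00", "Fare_Amt": "5.5", "Tip_Amt": "1.0"},
        {"tpep_pickup_datetime": "2016-03-01 08:00:00", "fare_amount": "-3.0", "tip_amount": "0"},
    ]
    for record in preprocess(sample):
        print(record)
```

In this sketch the second sample row is dropped because its fare amount is negative. In Spark, the same idea would typically be expressed as column renames and casts followed by a filter on the DataFrame.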