Difficulty: advanced
Estimated Time: 120 minutes

Blending machine learning with real-time data flowing through a single platform opens a world of new possibilities, enabling organizations to take advantage of opportunities as they arise. Leveraging these opportunities requires processing events in real time, applying machine learning to add value, and storing the results in scalable, fast storage.

In this Hands-On Lab we will look at the architecture of a data pipeline that combines streaming data with machine learning and fast storage to predict flight delays. You will see the end-to-end process required to build this application, and you will become familiar with the MapR Data Platform by interacting with Apache Spark SQL, Spark Streaming, Spark Machine Learning, MapR Event Store for Kafka, and MapR Database on a single-node MapR cluster.

This Lab consists of the following steps:

  1. Using Apache Spark SQL to explore the flight dataset.

  2. Using Spark Machine Learning to build a model to predict flight delays.

  3. Using Spark Structured Streaming with the Kafka API to read streaming flight events, using a Spark ML model to enrich the streaming events with flight delay predictions, storing the results in MapR Database, and doing real-time analysis with Spark SQL.

  4. Optional: Analyzing flight delay data and predictions stored in MapR Database with Apache Spark SQL and Apache Drill.

  5. Optional: Analyzing flight delays with Apache Spark GraphFrames and MapR Database.

The MapR Data Platform includes a wide variety of analytics and open source tools such as Apache Hadoop, Apache Spark, and Apache Drill with real-time database capabilities, global event streaming, and scalable enterprise storage to power a new generation of Big Data applications. With support for POSIX, cutting-edge AI and ML tools can run natively on the same cluster as other analytics and leverage the power of the MapR Data Platform.

The MapR Data Platform delivers dataware for AI and analytics, effectively handling the diversity of data types, data access, and ecosystem tools needed to manage data as an enterprise resource regardless of the underlying infrastructure or location. With the MapR Data Platform, users can store, manage, process, and analyze all data - including files, tables, and streams from operational, historical, and real-time data sources - with mission-critical reliability to meet production SLAs. MapR solves the challenges of complex data environments by managing data and its ecosystem across multiple clouds and containerized infrastructures.

In this scenario you saw how MapR combines Hadoop, Spark, and Apache Drill with a distributed file system, distributed database, and distributed event streaming, all on a single cluster. This improves performance and lowers hardware costs for Big Data applications. The MapR Data Platform allows you to manage your data with any tooling on any infrastructure.

Would you like to learn more about MapR? Check out our blog, In Search of a Data Platform.

If you'd like to speak with MapR, contact us!

Predicting Flight Delays with Spark ML

Step 1 of 5

Step 1 - Using Apache Spark SQL to Explore the Flight Data

There are typically two phases in machine learning:

  • Data Discovery: The first phase involves analysis on historical data to build and train the machine learning model.
  • Analytics Using the Model: The second phase uses the model in production on new data.

In production, models need to be continuously monitored and updated with new models when needed.
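The two phases can be sketched schematically. The toy records and the trivial threshold "model" below are hypothetical stand-ins for the real flight dataset and the Spark ML pipeline built later in this lab; they only illustrate the discovery/production split.

```python
# Sketch of the two machine-learning phases, using toy flight records.
# The records, features, and threshold "model" are illustrative stand-ins
# for the real dataset and the Spark ML pipeline used later in this lab.

# Phase 1 - Data Discovery: analyze historical data and train a model.
historical = [
    {"src": "ATL", "dst": "LGA", "scheduled_hour": 17, "depdelay": 55},
    {"src": "ATL", "dst": "BOS", "scheduled_hour": 9,  "depdelay": 2},
    {"src": "ORD", "dst": "DEN", "scheduled_hour": 18, "depdelay": 48},
    {"src": "DEN", "dst": "SFO", "scheduled_hour": 8,  "depdelay": 0},
]

def train(records, delay_threshold=40):
    """'Train' a trivial model: learn which departure hours are delay-prone."""
    return {r["scheduled_hour"] for r in records
            if r["depdelay"] > delay_threshold}

model = train(historical)

# Phase 2 - Analytics Using the Model: score new events as they arrive.
def predict(model, event):
    """Predict 1 (delayed) if the event departs in a delay-prone hour."""
    return 1 if event["scheduled_hour"] in model else 0

new_event = {"src": "ATL", "dst": "LGA", "scheduled_hour": 17}
print(predict(model, new_event))  # → 1
```

In the lab, Spark ML plays the role of `train`, and the streaming step in Step 3 plays the role of `predict`, scoring each flight event as it arrives.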

In Step 1 of this HOL we will use Spark SQL for data discovery of the flights dataset stored on the MapR distributed file system.

For each step of the HOL, follow the instructions in the left frame before running the Zeppelin notebook code. After finishing a notebook, click Continue in the left frame and follow the next instructions before running the next notebook.

Wait about 3 minutes for the setup script to finish.

Copying and working with Files on MapR XD

MapR XD Distributed File and Object Store is designed to support trillions of files and to combine analytics and operations in a single platform. MapR XD supports industry-standard protocols and APIs, including POSIX, NFS, S3, and HDFS. It's easy to interact with MapR XD using traditional filesystem commands. This is possible because MapR XD is POSIX compliant, which means files and directories in MapR XD have all the characteristics you're accustomed to seeing in conventional filesystems. So you can edit files, move files, change permissions, and so on, all with traditional utilities like vi, mv, and chmod. The ability to treat MapR XD like a conventional Unix filesystem and still benefit from features like exabyte scale, multi-cloud mirroring, and failure recovery means you can do things that are impossible with non-POSIX filesystems like HDFS, AWS S3, and Azure Blob Storage.
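Because MapR XD is POSIX compliant, ordinary file APIs work on a `/mapr/...` mount exactly as they do on a local disk. The sketch below demonstrates the idea using a temporary local directory as a stand-in for a path like `/mapr/demo.mapr.com/user/mapr/data`:

```python
# Standard POSIX file operations behave the same on a MapR XD mount as on
# any local filesystem. A temporary directory stands in here for a path
# such as /mapr/demo.mapr.com/user/mapr/data (hypothetical stand-in; the
# API calls are unchanged).
import os
import tempfile

with tempfile.TemporaryDirectory() as mount:
    data_dir = os.path.join(mount, "data")
    os.mkdir(data_dir)                          # like: mkdir /mapr/.../data

    path = os.path.join(data_dir, "airports.json")
    with open(path, "w") as f:                  # ordinary open/write
        f.write('{"iata": "ATL", "city": "Atlanta"}\n')

    os.chmod(path, 0o644)                       # ordinary chmod
    with open(path) as f:                       # ordinary read
        content = f.read().strip()

print(content)
```

The shell commands below (mkdir, cp, gunzip, tail) rely on exactly this property of the mount.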

Make a directory for data on MapR XD: mkdir /mapr/demo.mapr.com/user/mapr/data

If you get an error here, wait a few minutes for the setup script to finish, then try again.

Copy flight data file from the local Linux filesystem to MapR XD: cp ~/flightdata2018.json.gz /mapr/demo.mapr.com/user/mapr/data/.

Copy airports data file from the local Linux filesystem to MapR XD: cp ~/airports.json /mapr/demo.mapr.com/user/mapr/data/.

Unzip the flights data file on MapR XD: gunzip /mapr/demo.mapr.com/user/mapr/data/flightdata2018.json.gz

Look at the end of the flights data file on MapR XD: tail /mapr/demo.mapr.com/user/mapr/data/flightdata2018.json

Using Apache Drill to explore the flight dataset on MapR XD

Apache Drill is an open source, low-latency query engine for big data that delivers interactive SQL analytics at petabyte scale. Drill provides a massively parallel processing execution engine, built to perform distributed query processing across the various nodes in a cluster.

With Drill, you can use SQL to interactively query and join data from files in JSON, Parquet, or CSV format, Hive, and NoSQL stores, including MapR Database, HBase, and MongoDB, without defining schemas.

Try Apache Drill with the Drill shell by connecting to the Drill service:

sqlline -u jdbc:drill:zk=localhost:5181 -n mapr -p mapr

Query for the longest departure delays for flights originating from Atlanta:

select id, src, dst, depdelay from dfs.`/user/mapr/data/flightdata2018.json` where id like 'ATL%' order by depdelay desc limit 20;

Exit the shell: !quit

Try Apache Drill with the Drill web UI:

  1. Click on the Drill tab on the right.
  2. This should take you to a Drill web UI.
  3. Click on the Query tab in the Drill web UI.
  4. Log in with username mapr and password mapr.
  5. Copy and paste, or type, one of the queries below next to the 1 and click Submit.

Example queries:

select id, src, dst, depdelay from dfs.`/user/mapr/data/flightdata2018.json` where id like 'ATL%' order by depdelay desc limit 20
select src, count(depdelay) as countdelay from dfs.`/user/mapr/data/flightdata2018.json` where depdelay > 40 and src='ATL' group by src
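To make the second query's semantics concrete, here is the same filter-and-aggregate logic in plain Python over a few hypothetical records shaped like the flight dataset (not the real 2018 data):

```python
# The second Drill query above counts departures from ATL delayed more than
# 40 minutes, grouped by source airport. The same logic in plain Python,
# over a few hypothetical records shaped like the flight dataset:
flights = [
    {"id": "ATL_LGA_2018-01-01_DL100", "src": "ATL", "dst": "LGA", "depdelay": 75.0},
    {"id": "ATL_BOS_2018-01-01_DL200", "src": "ATL", "dst": "BOS", "depdelay": 10.0},
    {"id": "ATL_ORD_2018-01-02_DL300", "src": "ATL", "dst": "ORD", "depdelay": 41.0},
    {"id": "ORD_DEN_2018-01-02_UA400", "src": "ORD", "dst": "DEN", "depdelay": 90.0},
]

# WHERE depdelay > 40 AND src = 'ATL' ... GROUP BY src ... COUNT(depdelay)
countdelay = sum(1 for f in flights
                 if f["src"] == "ATL" and f["depdelay"] > 40)
print({"src": "ATL", "countdelay": countdelay})  # → {'src': 'ATL', 'countdelay': 2}
```

Drill runs this same filter and count in parallel across the cluster, directly on the JSON file, with no schema definition step.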

Let's move on to Spark; we will look at Drill in more detail later.

Using Apache Spark SQL to explore the flight dataset on MapR XD

These lab exercises use Spark in Apache Zeppelin notebooks.

  1. To run this exercise, click on the Zeppelin tab on the right.
  2. This should take you to a Zeppelin page with a list of notebooks; if you do not see the list, click refresh.
  3. Open the FlightDelay1SparkDatasets notebook. ✈️
  4. Follow the Notebook lab instructions. Click on the READY > arrows in the notebook (on the right of the code paragraphs) to run the Spark code.

Summary After Running the Zeppelin Notebook

You have now learned how to load data into Spark Datasets and DataFrames and how to explore tabular data with Spark SQL. These code examples can be reused as the foundation to solve many types of business problems. Click continue before running the next notebook.
