Data Ingestion with JSON in Apache Spark: A Quick Guide
JSON (JavaScript Object Notation) is a popular format for storing and exchanging data. Apache Spark provides built-in support for ingesting and processing JSON, making it straightforward to work with semi-structured data. In this guide, we’ll cover how to ingest, transform, and write JSON data for several Formula 1 files: Constructors, Drivers, Results, and Pitstops.
1. Constructors File - Requirements
- The Constructors file contains data about the teams (constructors) in Formula 1.
- Fields may include constructor names, team names, nationality, and other team-specific details.
- For analysis, we need to read the JSON file, transform it appropriately, and write the output to a new location.
2. Constructors File - Read Data
To read the Constructors JSON file, use Spark’s DataFrameReader API. Example:

df_constructors = spark.read.json("/path/to/constructors.json")

This loads the JSON file and automatically infers the schema.
3. Constructors File - Transform & Write Data
Select relevant columns, rename them if needed, and possibly add calculated fields.

Example transformation:

df_constructors_transformed = df_constructors.select("constructorId", "name", "nationality")

Writing the output:

df_constructors_transformed.write.json("/path/to/output/constructors_transformed.json")
4. Drivers File - Requirements
- The Drivers file contains data about Formula 1 drivers (driver IDs, names, countries, etc.).
- This data needs similar ingestion and transformation for analysis (extract key information, clean or rename fields).
5. Drivers File - Spark Program
Read the Drivers JSON file:

df_drivers = spark.read.json("/path/to/drivers.json")

Select relevant columns:

df_drivers_transformed = df_drivers.select("driverId", "name", "country")

Write the result:

df_drivers_transformed.write.json("/path/to/output/drivers_transformed.json")
6. Results File - Requirements
- The Results file contains race result data: finishing positions, points, race times, etc.
- For analysis, ingest the data and filter or aggregate it, for example to extract results for specific drivers or first-place finishes.
7. Results File - Spark Program (Assignment)
Read the Results JSON file:

df_results = spark.read.json("/path/to/results.json")

Filter the data, e.g., select first-place finishes:

df_filtered_results = df_results.filter(df_results["positionOrder"] == 1)

Write the filtered results:

df_filtered_results.write.json("/path/to/output/filtered_results.json")
8. Pitstops File - Requirements
- The Pitstops file contains data about pit stops: stop times, driver info, stop numbers, etc.
- It is useful for analyzing race strategy or pit-crew efficiency, and needs similar ingestion and transformation.
9. Pitstops File - Spark Program
Read the Pitstops JSON file:

df_pitstops = spark.read.json("/path/to/pitstops.json")

Select relevant columns:

df_pitstops_transformed = df_pitstops.select("raceId", "driverId", "stop", "time")

Write the transformed data:

df_pitstops_transformed.write.json("/path/to/output/pitstops_transformed.json")
Conclusion
Spark makes it simple to ingest, transform, and write JSON data. With just a few lines of code, you can load data from files like Constructors, Drivers, Results, and Pitstops, perform necessary transformations, and save the processed data. By following these steps, you can integrate Formula 1 data into analytics workflows and leverage Spark’s power to handle large datasets efficiently.