Ingestion and processing of real-time automotive IoT data
This example scenario builds a real-time data ingestion and processing pipeline to ingest and process messages from IoT devices (in general sensors) into a big data analytic platform in Azure. Vehicle telematics ingestion and processing platforms are the key to create connected car solutions. This specific scenario is motivated by the car telematics ingestion and processing systems. However, the design patterns are relevant for many industries using sensors to manage and monitor complex systems in industries such as smart buildings, communications, manufacturing, retail, and healthcare.
This example demonstrates a real-time data ingestion and processing pipeline for messages from IoT devices installed in vehicles. Thousands and millions of messages (or events) are generated by the IoT devices and sensors. By capturing and analyzing these messages, we can decipher valuable insights and take appropriate actions. For example, with cars equipped telematics devices, if we can capture the device (IoT) messages in real time, we would be able to monitor the live location of vehicles, plan optimized routes, provide assistance to drivers, and support telematics-related industries such as auto insurance.
For this example demonstration, imagine a car manufacturing company that wants to create a real-time system to ingest and process messages from telematics devices. The company's goals include:
- Ingest and store data in real time from vehicles sensors and devices.
- Analyze the messages to understand vehicle location, and other information emitted through different types of sensors (such as engine-related sensors and environment-related sensors).
- Store the data after analysis for other downstream processing to provide actionable insights (For example, in accident scenarios, insurance agencies may be interested to know what happened during an accident etc.)
Relevant use cases
Other relevant use cases include:
- Vehicle maintenance reminders and alerting.
- Location-based services for the vehicle passengers (that is, SOS).
- Autonomous (self-driving) vehicles.
In this real-time big data processing pipeline, the data flows through the solution as follows:
- Events generated from the IoT data sources are sent to the stream ingestion layer through Azure IoT Hub as a stream of messages. Azure IoT Hub stores streams of data in partitions for a configurable amount of time.
- Azure Databricks picks up the message in real time from IoT Hub, processes the data based on the business logic and sends the data to Serving layer for storage.
- Downstream storage services, like Azure Cosmos DB, Azure SQL Data warehouse, or Azure SQL DB, will store the IoT data, plus any transactional data, and be a data source for presentation and action layer.
- Business analysts can use Microsoft Power BI to analyze warehoused data.
- Web, mobile and other applications can be built on the serving layer as well. For example, we can expose APIs based on the serving layer data for third-party uses.
IoT device-generated events (data or messages) are ingested, processed, and then stored for further analysis, presentation, and action, using the following Azure components:
- Azure IoT Hub is in the ingestion layer. IoT Hub also support bi-directional communication back to devices, allowing Actions to be sent from the cloud or Azure IoT Edge to the device.
- Azure Databricks is located in the transformation and analytics layer. Databricks uses the azure-eventhubs-spark_2.11:2.3.6 Maven library to connect to IoT Hub's Event Hub compatible endpoint.
- Azure Cosmos DB, Azure SQL Database, and Azure Synapse Analytics are in the Serving storage layer, where Azure Databricks can write the data via data connectors.
- Azure Synapse is a distributed system for storing and analyzing large datasets. Its use of massive parallel processing (MPP) makes it suitable for running high-performance analytics.
- Power BI is a suite of business analytics tools to analyze data and share insights. Power BI can query a semantic model stored in Analysis Services, or it can query Azure Synapse directly.
- Azure Active Directory (Azure AD) authenticates users, when connecting to Azure Databricks. If we would build a cube in Analysis Services based on the model based on Azure Synapse data, we could use Azure AD to connect to the Analysis Services server through Power BI. Data Factory can also use Azure AD to authenticate to Azure Synapse via a service principal or Managed Service Identity (MSI).
- Azure App Services can be used to build web and mobile applications. Azure API App can be used to expose data to third parties, based on the data stored in the Serving Layer.
A more generalized big data pipeline could be implemented using other Azure components.
- In the stream ingestion layer, we could use Event Hubs or Apache Kafka on HDInsight to ingest data. It is important to note these services only offer uni-directional communication, no device management, no edge capabilities and are recommended for non-IoT scenarios such as streaming data pipelines. Refer to Connecting IoT Devices to Azure: IoT Hub and Event Hubs for a more detailed comparison.
- In the transformation and analytics layer, we could use HDInsight Storm, HDInsight Spark or Azure Stream Analytics to perform streaming analytics, and Azure Functions to perform payload transformations.
- In the service storage layer, we could use Analysis Services to provide a semantic model for your data. It can also increase system performance when analyzing your data. You can build the model based on Azure DW data. Azure Data Explorer and Time Series Insights could be used for storing and analyzing time series data.
The technologies in this architecture were chosen based on the scale needed to process events, the SLA of the services, the cost management and ease of management of the components.
- Azure IoT Hub comes with a 99.9% SLA and has integrated disaster recovery capabilities.
- Azure Databricks is optimized from the ground up for performance and cost-efficiency in the cloud. The Databricks Runtime adds several key capabilities to Apache Spark workloads that can increase performance and reduce costs by as much as 10-100x when running on Azure, including:
- Azure Databricks integrates deeply with Azure databases and stores: Azure Synapse, Azure Cosmos DB, Azure Data Lake Storage, and Azure Blob Storage
- Autoscaling and autotermination for Spark clusters to automatically minimize costs.
- Performance optimizations including caching, indexing, and advanced query optimization, which can improve performance by as much as 10-100x over traditional Apache Spark deployments in cloud or on-premises environments.
- Integration with Azure Active Directory enables you to run complete Azure-based solutions using Azure Databricks.
- Role-based access in Azure Databricks enables fine-grained user permissions for notebooks, clusters, jobs, and data.
- Comes with Enterprise-grade SLAs.
- Azure Cosmos DB is Microsoft's globally distributed, multi-model database. Azure Cosmos DB was built from the ground up with global distribution and horizontal scale at its core. It offers turnkey global distribution across any number of Azure regions by transparently scaling and replicating your data wherever your users are. You can elastically scale throughput and storage worldwide, and pay only for the throughput and storage you need.
- The massively parallel processing architecture of Azure Synapse provides scalability and high performance.
- Azure Synapse has guaranteed SLAs and recommended practices for achieving high availability.
- When analysis activity is low, the company can scale Azure Synapse on demand, reducing or even pausing compute to lower costs.
- The Azure Synapse security model provides connection security, authentication, and authorization via Azure AD or SQL Server authentication, and encryption.
Review Azure Databricks pricing, Azure IoT Hub pricing, pricing sample for a data warehousing scenario via the Azure pricing calculator. Adjust the values to see how your requirements affect your costs.
- Azure IoT Hub is a fully managed cloud service that makes it easy, fast, and cost-effective to ingest massive amounts of IoT data
- Azure Databricks offers two distinct workloads on several VM Instances tailored for your data analytics workflow — the Data Engineering workload makes it easy for data engineers to build and execute jobs, and the Data Analytics workload makes it easy for data scientists to explore, visualize, manipulate, and share data and insights interactively.
- Azure Cosmos DB guarantees single-digit-millisecond latencies at the 99th percentile anywhere in the world, offers multiple well-defined consistency models to fine-tune performance, and guarantees high availability with multi-homing capabilities — all backed by industry leading comprehensive service level agreements (SLAs).
- Azure Synapse allows you to scale your compute and storage levels independently. Compute resources are charged per hour, and you can scale or pause these resources on demand. Storage resources are billed per terabyte, so your costs will increase as you ingest more data.
- Analysis Services is available in developer, basic, and standard tiers. Instances are priced based on query processing units (QPUs) and available memory. To keep your costs lower, minimize the number of queries you run, how much data they process, and how often they run.
- Power BI has different product options for different requirements. Power BI Embedded provides an Azure-based option for embedding Power BI functionality inside your applications. A Power BI Embedded instance is included in the pricing sample above.
- Review the Azure IoT reference architecture that shows a recommended architecture for IoT applications on Azure using PaaS (platform-as-a-service) components.
- Review the Real-time analytics reference architecture that includes big data pipeline flow.
- Review the Advanced analytics on big data reference architecture to get a peek on how different Azure components can help build a big data pipeline.