Looking for real-time data ingestion pattern with examples

Jeff Born (J&CLT-ATL) 96 Reputation points
2023-11-02T14:10:45.32+00:00

My company is in the process of converting a monolithic application into microservices. We've decided that the data lake will be the Source of Truth. A consulting company got us started, but processing messages to the Silver layer in the data lake is not performant and is cost-prohibitive. As getting data into the Silver layer of the data lake is new to me and our company, I'm looking for the best practice for processing a Service Bus message (from a microservice update/insert) ASAP. We expect a data lake query to reflect that message's details in under a second. The current pattern works like this:

  • Service Bus message lands in the data lake as a parquet file in the Bronze layer
  • A storage account trigger sends this file to a Synapse Notebook to process this single message
  • The Synapse Notebook checks whether the message is an update or an insert and processes it appropriately. This step is slow: we read the current Silver file, merge in the Bronze message, and then overwrite the Silver parquet file (a simplified sketch of this step follows the list). With the Spark pool spin-up and spin-down, this takes about 2 minutes per message.
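
Roughly, the notebook does something like the sketch below (simplified; the paths, the key column name, and the staging write are illustrative, not our actual code):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Placeholder paths and key column - not the real values.
    bronze_path = "abfss://lake@account.dfs.core.windows.net/bronze/message.parquet"
    silver_path = "abfss://lake@account.dfs.core.windows.net/silver/entity/"
    key_col = "id"

    # Read the single incoming message and the entire current Silver dataset.
    incoming = spark.read.parquet(bronze_path)
    silver = spark.read.parquet(silver_path)

    # Drop any existing row with the same key (update case), keep the rest,
    # then append the incoming row (insert case, or the updated version).
    keys = [row[key_col] for row in incoming.select(key_col).collect()]
    merged = silver.filter(~F.col(key_col).isin(keys)).unionByName(incoming)

    # Rewrite the whole Silver dataset for this one message - the expensive part.
    # (Written to a staging path first, since Spark cannot overwrite a path it
    # is still reading from, then swapped over the Silver location.)
    merged.write.mode("overwrite").parquet(silver_path + "_staging")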

I've looked into Stream Analytics to replace this process, but I have a few hurdles (maybe more):

  • The input file is a parquet file.
  • I can have it written as a JSON or CSV file instead.
  • Since the message is coming off the Service Bus, should I even be writing it to a file?
  • The output has a primary key, and I expect anything with the same key to be updated; if the primary key isn't already in the output, that message gets inserted.
  • As far as I can tell, writing to an Azure Storage Account parquet output will just append every message, whether the primary key already exists in the output or not.
  • Right now we are looking at a parquet file as the landing place for this data. Should it be something else, like Cosmos DB?

We are obviously doing something wrong here with our current solution. My question is do we just need to tweak the current solution or look into something different like Stream Analytics?

Tags: Azure Data Lake Storage, Azure Service Bus, Azure Synapse Analytics, Azure Stream Analytics

Accepted answer
    PRADEEPCHEEKATLA-MSFT 78,576 Reputation points Microsoft Employee
    2023-11-03T05:48:35.8333333+00:00

    @Jeff Born (J&CLT-ATL) - Thanks for the question and for using the MS Q&A platform.

    Real-time data ingestion is a common requirement in modern data architectures. There are several patterns and technologies that can be used to achieve this goal. Here are some examples:

    1. Event-driven architecture: Microservices publish events to a message broker (such as Azure Service Bus or Azure Event Hubs) whenever they update or insert data. Other microservices or data processing pipelines subscribe to these events and process them in real time. This pattern is highly scalable and decouples the microservices from each other (a minimal consumer sketch follows this list).
    2. Change Data Capture (CDC): CDC captures changes made to a database table as they happen and publishes them to a message broker or a data processing pipeline. This is useful when the changes originate in a database rather than in application events.
    3. Stream processing: A stream processing engine (such as Azure Stream Analytics or Apache Kafka Streams) processes data continuously as it flows through the system, rather than in batches, and can produce up-to-date results in near real time.
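
    As an illustration of pattern 1, a minimal Service Bus consumer could look like the sketch below. It uses the azure-servicebus Python SDK; the connection string, topic, and subscription names are placeholders, and handle_event is a hypothetical stand-in for whatever the subscriber does with the change (a Silver-layer merge, a Cosmos DB upsert, and so on):

        import json
        from azure.servicebus import ServiceBusClient

        # Placeholder connection details - not taken from your environment.
        CONN_STR = "<service-bus-connection-string>"
        TOPIC = "entity-changes"
        SUBSCRIPTION = "silver-layer-writer"

        def handle_event(event: dict) -> None:
            """Hypothetical handler: apply the update/insert downstream."""
            print(f"processing entity {event.get('id')}")

        with ServiceBusClient.from_connection_string(CONN_STR) as client:
            receiver = client.get_subscription_receiver(
                topic_name=TOPIC, subscription_name=SUBSCRIPTION
            )
            with receiver:
                for msg in receiver:  # blocks and yields messages as they arrive
                    handle_event(json.loads(str(msg)))
                    receiver.complete_message(msg)  # remove it from the subscription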

    Regarding your specific use case, you are effectively doing per-message batch processing in the data lake: every message spins up a Spark pool and rewrites the Silver file. That approach is slow and expensive when you need results in near real time.

    One alternative is to use a stream processing engine such as Azure Stream Analytics to process the data as it arrives. Stream Analytics reads streaming input from Event Hubs, IoT Hub, or Blob/ADLS storage; the supported input serialization formats are JSON, CSV, and Avro (Parquet is available as an output format rather than an input format), and it can write to destinations including Azure Storage, Azure Cosmos DB, and Azure SQL Database.

    To handle updates and inserts based on a primary key, the choice of output sink matters. Blob/ADLS outputs (including Parquet) are append-only, so Stream Analytics will not update rows that already exist there. If you output to Azure Cosmos DB instead, Stream Analytics performs an upsert based on the Document ID you configure: when an item with the same ID already exists it is updated, otherwise a new item is inserted.
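
    Below is a minimal sketch of that upsert-by-key behaviour, done directly with the azure-cosmos Python SDK; the account URL, key, database and container names, and the entity_id field are placeholders, and it assumes the container is partitioned on /id:

        from azure.cosmos import CosmosClient

        # Placeholder connection details.
        client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
        container = client.get_database_client("lake").get_container_client("silver")

        def apply_message(message: dict) -> None:
            # Use the business primary key as the Cosmos DB item id, so that
            # upsert_item updates an existing item with the same key and inserts
            # a new one otherwise - the same behaviour the Stream Analytics
            # Cosmos DB output gives you through its Document ID setting.
            item = {**message, "id": str(message["entity_id"])}
            container.upsert_item(item)

        apply_message({"entity_id": 42, "status": "shipped"})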

    In summary, there are several patterns and technologies that can be used to achieve real-time data ingestion. Stream processing is a popular approach, and Azure Stream Analytics supports a range of input formats and output destinations; paired with a sink such as Azure Cosmos DB, it can handle updates and inserts based on a key.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "was this answer helpful". And, if you have any further queries, do let us know.

    1 person found this answer helpful.

0 additional answers
