Designing Table Structure for Accommodating a Huge Amount of Data

Pratik Somaiya 4,201 Reputation points
2020-06-09T04:27:29.12+00:00

I am working on a use case in which I expect to receive 80 million records every week and have to load them into Azure SQL Database. Some of the data is dimensional and some is fact data, and I will be using Azure Data Factory to load it from files in Blob Storage.

The fact data consists of metrics for keywords, and the set of keywords can change over time, so I want to know how I should design the table. Keeping each keyword as a column is difficult, as we have over 1,000 keywords, and a table with that many columns is practically impossible.

Also, loading and retrieving such huge amounts of data could hamper database performance.

Can you suggest some table design methodologies, keeping size and performance in mind?

Thanks!

Azure SQL Database
Azure Data Factory

2 answers

  1. Malleswara Reddy, G 1,631 Reputation points
    2020-06-09T04:34:39.68+00:00

    Hi,

    Please see my comments below -

    The fact data consists of metrics for keywords, and the set of keywords can change over time, so I want to know how I should design the table. Keeping each keyword as a column is difficult, as we have over 1,000 keywords, and a table with that many columns is practically impossible.

    --- You can create a separate keywords table and relate it to the fact data via a foreign key wherever it is required; see the sketch below.
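
    A minimal sketch of that normalized design, assuming hypothetical names (DimKeyword, FactKeywordMetric, MetricValue): instead of one column per keyword, each metric lands as a row keyed to the keyword dimension, so a new keyword is a new row rather than a schema change.

    ```sql
    -- Keyword dimension: one row per keyword, so adding a keyword
    -- is an INSERT rather than an ALTER TABLE.
    CREATE TABLE dbo.DimKeyword
    (
        KeywordId   INT IDENTITY(1,1) PRIMARY KEY,
        KeywordText NVARCHAR(200) NOT NULL UNIQUE
    );

    -- Fact table: one row per keyword/date/metric observation.
    -- Names and types here are illustrative, not prescriptive.
    CREATE TABLE dbo.FactKeywordMetric
    (
        KeywordId   INT           NOT NULL
            REFERENCES dbo.DimKeyword (KeywordId),
        MetricDate  DATE          NOT NULL,
        MetricValue DECIMAL(18,4) NOT NULL
    );

    -- A clustered columnstore index compresses the ~80M rows/week
    -- and speeds up analytic scans (requires a service tier that
    -- supports columnstore).
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactKeywordMetric
        ON dbo.FactKeywordMetric;
    ```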

    Can you suggest some table design methodologies, keeping size and performance in mind?

    --- https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
    https://learn.microsoft.com/en-us/azure/data-factory/whitepapers
    https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features


  2. Mike Ubezzi 2,776 Reputation points
    2020-06-23T19:50:50.353+00:00

    The table design piece is important, but without detailed knowledge of the data, designing and offering guidance on a solution is limiting. You mentioned Azure SQL Database, but you are using data warehousing terminology, so perhaps a data warehouse solution is a better option. What I didn't find in this thread were any best practices relating to Azure Synapse Analytics, so based upon those best practices, the below is how this process is accomplished with Synapse Analytics (keeping in mind that not everyone's data fits the best practices).

    There is good guidance with regard to distributed tables and the common distribution methods for tables:

    The table category often determines which option to choose for distributing the table.

    Fact Tables:

    Use hash distribution with a clustered columnstore index. Performance improves when two hash-distributed tables are joined on the same distribution column.

    Dimension Tables:

    Use replicated tables for smaller tables. If a table is too large to store on each Compute node, use hash distribution. (A DDL sketch for both categories follows below.)
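
    As a hedged sketch, reusing the hypothetical DimKeyword/FactKeywordMetric names from the first answer, the dedicated SQL pool DDL for those two categories might look like this (Synapse does not enforce foreign keys, so none are declared):

    ```sql
    -- Fact table: hash-distributed on KeywordId so joins on that
    -- column avoid data movement between distributions.
    CREATE TABLE dbo.FactKeywordMetric
    (
        KeywordId   INT           NOT NULL,
        MetricDate  DATE          NOT NULL,
        MetricValue DECIMAL(18,4) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(KeywordId),
        CLUSTERED COLUMNSTORE INDEX
    );

    -- Small dimension: a full copy is kept on every Compute node,
    -- so joins against it never need to shuffle data.
    CREATE TABLE dbo.DimKeyword
    (
        KeywordId   INT           NOT NULL,
        KeywordText NVARCHAR(200) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE
    );
    ```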

    There is also key guidance on data migrations, data loading, and strategies for distributed and replicated tables.

    In attempting to go through your dataset to determine table category, I agree that the single-table approach for keywords is efficient, but I don't have clarity on your metrics or fact data. For the keyword (dimensional) data, I would use a replicated table, based upon the size of that table in relation to each Compute node in your SQL pool. Queries can then be built to join the keyword values between the fact data and the dimensional data, as in the example below.
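
    For example, an aggregation over the hypothetical tables sketched above could be written as:

    ```sql
    -- Metric totals per keyword and day; the replicated dimension
    -- makes this join local on every Compute node.
    SELECT k.KeywordText,
           f.MetricDate,
           SUM(f.MetricValue) AS TotalValue
    FROM dbo.FactKeywordMetric AS f
    JOIN dbo.DimKeyword        AS k
        ON k.KeywordId = f.KeywordId
    GROUP BY k.KeywordText, f.MetricDate;
    ```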

    Other questions remain:

    How often are keywords updated, and do multiple versions of a keyword need to exist based upon date? I am also trying to understand your Dimension1Id and Dimension2Id column headers.

    Please also see:
