Designing Table Structure for Accommodating a Huge Amount of Data

Pratik Somaiya 4,201 Reputation points
2020-06-09T04:27:29.12+00:00

I am working on a use case in which I expect to receive 80 million records every week and have to load them into Azure SQL Database. Some of the data is dimensional and some is fact data, and I will be using Azure Data Factory to load it from files in Blob Storage.

The fact data consists of metrics for keywords, and the set of keywords can change over time, so I want to know how I should design the table. Keeping each keyword as a column is difficult, as we have over 1,000 keywords, and a table with that many columns is practically impossible.

Also, loading and retrieving such huge amounts of data could hamper database performance.

Can you suggest some table design methodologies, keeping size and performance in mind?

Thanks!

Azure SQL Database
Azure Data Factory

2 answers

  1. Malleswara Reddy, G 1,631 Reputation points
    2020-06-09T04:34:39.68+00:00

    Hi,

    Please see my comments below -

    The fact data consists of metrics for keywords, and the set of keywords can change over time, so I want to know how I should design the table. Keeping each keyword as a column is difficult, as we have over 1,000 keywords, and a table with that many columns is practically impossible.

    --- You can create a separate keywords table and relate it to the fact data via a foreign key wherever it is required; see the sketch below.
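
    A minimal sketch of that normalized design, assuming hypothetical names (DimKeyword, FactKeywordMetric, MetricValue): instead of one column per keyword, each metric lands as a row keyed to the keyword dimension, so a new keyword is a new row rather than a schema change.

    ```sql
    -- Keyword dimension: one row per keyword, so adding a keyword
    -- is an INSERT rather than an ALTER TABLE.
    CREATE TABLE dbo.DimKeyword
    (
        KeywordId   INT IDENTITY(1,1) PRIMARY KEY,
        KeywordText NVARCHAR(200) NOT NULL UNIQUE
    );

    -- Fact table: one row per keyword/date/metric observation.
    -- Names and types here are illustrative, not prescriptive.
    CREATE TABLE dbo.FactKeywordMetric
    (
        KeywordId   INT           NOT NULL
            REFERENCES dbo.DimKeyword (KeywordId),
        MetricDate  DATE          NOT NULL,
        MetricValue DECIMAL(18,4) NOT NULL
    );

    -- A clustered columnstore index compresses the ~80M rows/week
    -- and speeds up analytic scans (requires a service tier that
    -- supports columnstore).
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactKeywordMetric
        ON dbo.FactKeywordMetric;
    ```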

    Can you suggest some table design methodologies, keeping size and performance in mind?

    --- https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
    https://learn.microsoft.com/en-us/azure/data-factory/whitepapers
    https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features


  2. Mike Ubezzi 2,776 Reputation points
    2020-06-23T19:50:50.353+00:00

    The table design piece is important, but without detailed knowledge of the data, designing and offering guidance on a solution is limiting. You mentioned Azure SQL Database, but you are using data warehousing terminology, so perhaps a data warehouse solution is a better option. What I didn't find in this thread were any best practices relating to Azure Synapse Analytics, so based upon those best practices, the below is how this process is accomplished with Synapse Analytics (keeping in mind that not everyone's data fits the best practices).

    There is good guidance with regard to distributed tables and the common distribution methods for tables:

    The table category often determines which option to choose for distributing the table.

    Fact Tables:

    Use hash distribution with a clustered columnstore index. Performance improves when two hash-distributed tables are joined on the same distribution column.

    Dimension Tables:

    Use replicated tables for smaller tables. If a table is too large to store on each Compute node, use hash distribution. (A DDL sketch for both categories follows below.)
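
    As a hedged sketch, reusing the hypothetical DimKeyword/FactKeywordMetric names from the first answer, the dedicated SQL pool DDL for those two categories might look like this (Synapse does not enforce foreign keys, so none are declared):

    ```sql
    -- Fact table: hash-distributed on KeywordId so joins on that
    -- column avoid data movement between distributions.
    CREATE TABLE dbo.FactKeywordMetric
    (
        KeywordId   INT           NOT NULL,
        MetricDate  DATE          NOT NULL,
        MetricValue DECIMAL(18,4) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(KeywordId),
        CLUSTERED COLUMNSTORE INDEX
    );

    -- Small dimension: a full copy is kept on every Compute node,
    -- so joins against it never need to shuffle data.
    CREATE TABLE dbo.DimKeyword
    (
        KeywordId   INT           NOT NULL,
        KeywordText NVARCHAR(200) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE
    );
    ```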

    There is also key guidance on data migrations, data loading, and strategies for distributed and replicated tables.

    In attempting to go through your dataset to determine table category, I agree that the single-table approach for keywords is efficient, but I don't have clarity on your metrics or fact data. For the keyword (dimensional) data, I would use a replicated table, based upon the size of that table in relation to each Compute node in your SQL pool. Queries can then be built to join the keyword values between the fact data and the dimensional data, as in the example below.
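
    For example, an aggregation over the hypothetical tables sketched above could be written as:

    ```sql
    -- Metric totals per keyword and day; the replicated dimension
    -- makes this join local on every Compute node.
    SELECT k.KeywordText,
           f.MetricDate,
           SUM(f.MetricValue) AS TotalValue
    FROM dbo.FactKeywordMetric AS f
    JOIN dbo.DimKeyword        AS k
        ON k.KeywordId = f.KeywordId
    GROUP BY k.KeywordText, f.MetricDate;
    ```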

    Other questions remain:

    How often are keywords updated, and do multiple versions of a keyword need to exist based upon date? I am also trying to understand your Dimension1Id and Dimension2Id column headers.

    Please also see:
