PratikSomaiya asked ·

Designing a Table Structure for Accommodating a Huge Amount of Data

I am working on a use case in which I am expected to receive 80 million records every week and have to load them into an Azure SQL Database. Some of the data is dimensional and some is fact data. I will be using Azure Data Factory to load the data from the files present in Blob Storage.

The fact data consists of metrics for some keywords, and the keywords might change over time, so I want to know how I should design the table. Keeping each keyword as a column would be difficult, as we have over 1,000 keywords, and maintaining so many columns seems practically impossible.

Also, loading and retrieving such a huge amount of data will hamper database performance.

Can you suggest some table design methodologies, keeping size and performance in mind?

Thanks!

azure-data-factory · azure-sql-database

Please let us know if you need any further help regarding this issue.

Thanks
Navtej S


Hi,
Do you have any update on my query above?
Thanks!

MalleswarReddy answered ·

Hi,

Please see my comments below -

The fact data consists of metrics for some keywords, and the keywords might change over time, so I want to know how I should design the table. Keeping each keyword as a column would be difficult, as we have over 1,000 keywords, and maintaining so many columns seems practically impossible.

--- You can try creating a separate table of keywords and linking it to the fact data with a relationship wherever it is required; see the sketch below.
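
A minimal sketch of that design, using hypothetical table and column names (nothing here comes from the thread itself), might look like this:

-- One row per keyword; adding or retiring a keyword is just a row change.
CREATE TABLE dbo.Keyword
(
    KeywordId INT IDENTITY(1,1) PRIMARY KEY,
    Keyword   NVARCHAR(200) NOT NULL
);

-- The fact data references keywords by ID, so keywords are rows,
-- not 1,000+ columns, and the schema never changes when keywords do.
CREATE TABLE dbo.KeywordMetric
(
    KeywordId    INT            NOT NULL
        REFERENCES dbo.Keyword (KeywordId),
    Dimension1Id INT            NOT NULL,
    MetricValue  DECIMAL(18, 4) NULL,
    LoadDate     DATE           NOT NULL
);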

Can you suggest some table design methodologies, keeping size and performance in mind?

--- https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance
https://docs.microsoft.com/en-us/azure/data-factory/whitepapers
https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features


I have 1,000 keywords, so do you mean that I should create 1,000 tables? A keyword that exists today may be deleted later, which would make its table useless; also, maintaining 1,000 tables just for keywords would be tedious.

I will read the links which you have provided.

Thanks!


Hi,

No. I mean, please create one table to hold all the keywords. That table's columns will be Keyword ID and Keyword. You will insert the keywords into this table in the form of records.
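
For example, continuing the hypothetical sketch from the answer above, the keywords are loaded as rows:

-- 1,000 keywords become 1,000 rows in one table,
-- not 1,000 tables or 1,000 columns (values are placeholders).
INSERT INTO dbo.Keyword (Keyword)
VALUES (N'keyword-a'), (N'keyword-b'), (N'keyword-c');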


Thanks, and sorry for the delayed reply.

As per the design, the attached image shows the table structure.

Now, my concern is that for one value of Dimension1 there will be multiple keyword values, as each dimension has multiple keywords. I will have 70K values of Dimension1 and 1,000 keywords, so the total will be 70K * 1K = 70 million records, which will be loaded into the Azure SQL database every week.

I want to know whether this design is sound, and how I will be able to load and retrieve the records faster with this approach.

I know indexes will help, but as new data will be inserted every week, I will also have to rebuild the indexes.

I want to know how I should improve the write and read performance of such tables.
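
One common way to ease that weekly index-maintenance burden, sketched here with hypothetical names rather than as a definitive design, is to partition the fact table by load week so that each weekly batch only touches its own partition:

-- Boundary dates are placeholders; in practice a new boundary would be
-- added (SPLIT) ahead of each weekly load.
CREATE PARTITION FUNCTION pfLoadWeek (date)
    AS RANGE RIGHT FOR VALUES ('2020-01-06', '2020-01-13', '2020-01-20');

-- Azure SQL Database keeps all data on PRIMARY, so map every partition there.
CREATE PARTITION SCHEME psLoadWeek
    AS PARTITION pfLoadWeek ALL TO ([PRIMARY]);

-- Each week's rows land in their own partition, so indexes on older
-- partitions do not need to be rebuilt after a load.
CREATE TABLE dbo.KeywordMetricWeekly
(
    KeywordId    INT            NOT NULL,
    Dimension1Id INT            NOT NULL,
    MetricValue  DECIMAL(18, 4) NULL,
    LoadDate     DATE           NOT NULL
)
ON psLoadWeek (LoadDate);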
[Attachment: 10411-db-size-question.png]
Thanks!


Mike-Ubezzi-MSFT answered ·

Setting aside the table design piece (which is important, but without detailed knowledge of the data, designing and offering guidance on a solution is limiting): you mentioned Azure SQL Database, yet you are using data warehousing terminology, so perhaps a data warehouse solution is a better option? Below is how this process is accomplished with Synapse Analytics, based upon best practices; not everyone's data follows best practices, and what I didn't find in this thread were any best practices as they relate to Azure Synapse Analytics.

There is good guidance with regard to distributed tables and the common distribution methods for tables:

The table category often determines which option to choose for distributing the table.

Fact Tables:

Use hash-distribution with a clustered columnstore index. Performance improves when two hash-distributed tables are joined on the same distribution column.

Dimension Tables:

Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.
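
As a rough illustration of those two options in dedicated SQL pool DDL (table and column names are hypothetical, inferred from the thread):

-- Fact table: hash-distributed on the join key, with a clustered
-- columnstore index for large scans and aggregations.
CREATE TABLE dbo.FactKeywordMetrics
(
    KeywordId    INT            NOT NULL,
    Dimension1Id INT            NOT NULL,
    MetricValue  DECIMAL(18, 4) NULL,
    LoadDate     DATE           NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (KeywordId),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small dimension table: a full copy is maintained on every compute node.
CREATE TABLE dbo.DimKeyword
(
    KeywordId INT           NOT NULL,
    Keyword   NVARCHAR(200) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE
);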

There is also some key guidance on data migrations, data loading, and strategies for distributed and replicated tables.

In attempting to go through your dataset to determine table category, I agree that the single-table approach for keywords is efficient, but I don't have clarity on your metrics or fact data. For the keyword or dimensional data, I would use replicated tables, based upon the size of that table in relation to each compute node in your SQL pool. Queries can be built to join the keyword values between the fact data and the dimensional data.
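
For instance, a join between the replicated keyword dimension and the hash-distributed fact table from the sketch above (hypothetical names) might look like:

-- The replicated dimension joins locally on each node,
-- so no data movement is needed for this query.
SELECT k.Keyword,
       f.Dimension1Id,
       SUM(f.MetricValue) AS TotalMetric
FROM dbo.FactKeywordMetrics AS f
JOIN dbo.DimKeyword AS k
    ON k.KeywordId = f.KeywordId
GROUP BY k.Keyword, f.Dimension1Id;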

Other questions remain:

How often are keywords updated, and do multiple iterations of a keyword need to exist based upon date? I am also trying to understand your Dimension1Id and Dimension2Id column headers.

Please also see:
