question

Swares asked:

DWH Modeling - Fact table PK - VARCHAR(23) vs DECIMAL(23,0)

Hi guys,

I'm modeling a fact table that has a document number as a degenerate dimension. On the source table this number is a varchar(23), but it is a guaranteed string of digits 0 to 9 in every position. Since it can logically serve as the primary key of the fact table, I checked whether the SQL Server integer types cover it, but they do not: the largest, BIGINT, only goes up to 19 digits (9,223,372,036,854,775,807).

I then noticed I could use a decimal(23,0). This would save some space: the varchar(23) takes approximately 25 bytes, i.e. 23 bytes (1 per character) plus 2 bytes of length metadata.
decimal stores any precision from 20 to 28 in 13 bytes, roughly half of that. The customer also told me this document number is changing to 15 characters, but the historical numbers will remain 23 characters long. Even at 15 characters, the same decimal(23,0), since it has a fixed precision, will still take 13 bytes vs 17 bytes for the varchar. That is a gain of at least 4 bytes per row, which on a 150-million-row table saves >550 MB from day 1, and counting.
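To sanity-check those sizes, here is a minimal sketch (variable names are made up) using DATALENGTH, which returns the number of bytes an expression's value occupies:

  DECLARE @doc_vc  varchar(23)   = '12345678901234567890123';
  DECLARE @doc_dec decimal(23,0) = 12345678901234567890123;

  -- DATALENGTH returns the storage bytes of the value itself;
  -- the 2-byte varchar length prefix is row overhead on top of this
  SELECT DATALENGTH(@doc_vc)  AS varchar_bytes,   -- 23
         DATALENGTH(@doc_dec) AS decimal_bytes;   -- 13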

In the fact table, these records will be updated (existing) / inserted (new) via a SQL MERGE on this PK, so I'm thinking it's worth changing it to a decimal(23,0) instead of the varchar(23), although I would like to hear some opinions and brainstorm a little.

Thank you

Credit to where it's due: https://www.sqlshack.com/understanding-sql-decimal-data-type/

sql-server-transact-sql


ErlandSommarskog answered:

First question: can there be leading zeroes in those document numbers? If so, decimal(23, 0) is out of the game and the only question is whether you should have char(23) or varchar(23). char(23) would be the obvious choice if they were to stick with these 23-digit numbers, but with the change to 15 digits, it seems that varchar is the better choice.
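The leading-zero problem is easy to demonstrate; the round trip through decimal silently drops them (a minimal sketch):

  DECLARE @doc varchar(23) = '00000000000000000000123';

  -- the cast succeeds, but the leading zeroes are gone for good
  SELECT CAST(@doc AS decimal(23,0))                      AS as_decimal,  -- 123
         CAST(CAST(@doc AS decimal(23,0)) AS varchar(23)) AS round_trip;  -- '123'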

Even if there can be no leading zeroes, my gut feeling says character, because to me this feels like a string. But, yes, you will save some space with decimal.

If you go for varchar, you can consider using a binary collation to speed up operations on this column.
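A binary collation can be declared per column, so it does not have to match the database default. A sketch, assuming a fact-table layout along the lines of the question (column list beyond the document number is hypothetical):

  CREATE TABLE DWH.FCT_CHARGE_DOCUMENT (
      DOCUMENT_NUMBER varchar(23) COLLATE Latin1_General_BIN2 NOT NULL,
      -- measures and dimension keys go here
      CONSTRAINT PK_FCT_CHARGE_DOCUMENT PRIMARY KEY (DOCUMENT_NUMBER)
  );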


Swares answered:

Hi @ErlandSommarskog ,

That was interesting. I found 148 old historical records, from 2004, with a leading 0. So I think you are right, and I even think I should guard against this right now, before this number logic turns into something different in the future, like backslashes, pipes, or some other character logic.
As for the binary collation, I'm stuck with a linguistic collation for now, Latin1_General_CI_AS, due to customer organisation policies.
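For the record, the check for those historical rows is a one-liner (table and column names as in the MERGE later in the thread):

  SELECT COUNT(*) AS leading_zero_docs
  FROM   DWH.FCT_CHARGE_DOCUMENT
  WHERE  DOCUMENT_NUMBER LIKE '0%';   -- 148 historical rows from 2004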


ErlandSommarskog commented:

The collation thing is not horribly important. For most operations you may get a 30% speedup, which may not be worth the increased complexity.

The exception is if you can expect searches like

  SELECT * FROM tbl WHERE documnumber LIKE '%123%'

A binary collation can give you a speedup by a factor of 7-10 for these searches.
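Even when the column itself must stay on Latin1_General_CI_AS, as in the poster's case, the comparison can be forced to a binary collation per query (same hypothetical names as above). The leading % means a scan either way, but the binary comparison makes each row's test much cheaper:

  SELECT *
  FROM   tbl
  WHERE  documnumber COLLATE Latin1_General_BIN2 LIKE '%123%';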

Swares answered:

In my case, it will only be used in the MERGE during the daily batch ETL. It is, however, the PK of the fact table.

  MERGE [DWH].[FCT_CHARGE_DOCUMENT] AS TARGET
  USING [STG].[T_CHARGE_DOCUMENT] AS SOURCE
     ON SOURCE.DOCUMENT_NUMBER = TARGET.DOCUMENT_NUMBER

(...)
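For completeness, the elided part of such a MERGE typically looks something like the sketch below; every column besides DOCUMENT_NUMBER is hypothetical, standing in for the real measures:

  MERGE [DWH].[FCT_CHARGE_DOCUMENT] AS TARGET
  USING [STG].[T_CHARGE_DOCUMENT] AS SOURCE
     ON SOURCE.DOCUMENT_NUMBER = TARGET.DOCUMENT_NUMBER
  WHEN MATCHED THEN
      UPDATE SET TARGET.CHARGE_AMOUNT = SOURCE.CHARGE_AMOUNT   -- hypothetical measure
  WHEN NOT MATCHED BY TARGET THEN
      INSERT (DOCUMENT_NUMBER, CHARGE_AMOUNT)
      VALUES (SOURCE.DOCUMENT_NUMBER, SOURCE.CHARGE_AMOUNT);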

Analytically, this is always aggregated per time period.
