Value gets pushed to zero

Kottn 176 Reputation points
2021-02-09T17:07:45.437+00:00

I am using the following formula to calculate the Pearson correlation in my data. Note: I am using a CASE WHEN to account for a divide by zero error. The code below represents solely the formula.

( COUNT(*) * SUM(X * Y) - SUM(X) * SUM(Y) ) / ( SQRT(COUNT(*) * SUM(X * X) - SUM(X) * SUM(x)) * SQRT(COUNT(*) * SUM(Y* Y) - SUM(Y) * SUM(Y) ) )

The output of my data looks like the table below. Note that excel_corr is the correlation manually calculated in Excel, which is my expected output.
The column sql_corr is the result from my sql code above. The columns from count to the end represent the X and Y values that get plugged into the formula above. My problem is that the sql_corr does not match the output from manually calculating the correlation by groupings in Excel.

+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| year | product_id | score_range | reporting_year | sql_corr | count | numerator | 1_sumprod | 1_sum     | 2_sumprod | 2_sum   | excel_corr |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1          | 1-2         | 2016           | NULL     | 1     | 0         | 0.000124  | -0.011155 | 195364    | 442     | #DIV/0!    |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1          | 3-4         | 2016           | NULL     | 1272  | -0.0683   | 4.9E-11   | -0.000007 | 304648060 | 622434  | -0.02911   |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1          | 5-6         | 2016           | -0.06416 | 3913  | -11.845   | 2.89E-09  | -0.000459 | 1.089E+09 | 2063948 | -0.06391   |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1          | 7-8         | 2016           | 0.00573  | 2593  | 1.63663   | 2.27E-08  | -0.000975 | 848560006 | 1482872 | 0.00573    |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1          | 9-10        | 2016           | -0.02106 | 1420  | -3.2855   | 4.13E-08  | -0.00131  | 555096971 | 887587  | -0.02106   |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1          | 11-12       | 2016           | 0.05231  | 917   | 6.64768   | 1.06E-07  | -0.000987 | 413059274 | 615312  | 0.052438   |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1          | 13-14       | 2016           | 0.006704 | 359   | 0.5064    | 6.18E-07  | 0.000271  | 185781413 | 258205  | 0.006705   |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1          | 15-16       | 2016           | 0.017846 | 55    | 0.14095   | 3.79E-06  | 0.000349  | 31849498  | 41850   | 0.017839   |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1          | 17-18       | 2016           | NULL     | 1     | 0         | 0         | 0         | 641601    | 801     | #DIV/0!    |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+

For example, in score_range 3-4 the sql_corr value is NULL but in excel the value is -0.02911. If we plug in the values manually into the formula -0.02911 is the correct result.

numerator / ( ( SQRT(n_count * 1_sumprod - 1_sum * 1_sum) * SQRT(n_count * 2_sumprod - 2_sum * 2_sum) ) ) 

In SQL Server the denominator is getting pushed to 0. When I calculate this manually in Excel the denominator is 2.344354. Why is my denominator being pushed to 0 in SQL Server when the same data results in a different calculation when done manually?

Edit The first part of the denominator is being pushed to 0. ( ( SQRT(n_count * 1_sumprod - 1_sum * 1_sum). When the multiplication occurs the whole denominator gets pushed to 0 in SQL activating the CASE statement to return NULL. This is incorrect confirmed by manual calculation. The following represents the output from both parts of the denominator 0.000000 and 9394.0387480572. The actual value for the first part of the denominator via manual calculation is ~0.00025.

Edit The value of (n_count * 1_sumprod - 1_sum * 1_sum) = 6.2279E-08 -- before taking the square root. However, SQL is pushing this part of the equation to 0.

I am using SQL Server 2016 v14.0.2037.2

Transact-SQL
Transact-SQL
A Microsoft extension to the ANSI SQL language that includes procedural programming, local variables, and various support functions.
4,552 questions
{count} votes

Accepted answer
  1. Tom Phillips 17,716 Reputation points
    2021-02-09T19:27:47.023+00:00

    They problem you have is you have a mix of integers and floats. This causes confusions and conversions.

    Convert all your values to float to get the value you are expecting. This is what Excel does.

    1 person found this answer helpful.

0 additional answers

Sort by: Most helpful