They problem you have is you have a mix of integers and floats. This causes confusions and conversions.
Convert all your values to float to get the value you are expecting. This is what Excel does.
This browser is no longer supported.
Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support.
I am using the following formula to calculate the Pearson correlation in my data. Note: I am using a CASE WHEN
to account for a divide by zero error. The code below represents solely the formula.
( COUNT(*) * SUM(X * Y) - SUM(X) * SUM(Y) ) / ( SQRT(COUNT(*) * SUM(X * X) - SUM(X) * SUM(x)) * SQRT(COUNT(*) * SUM(Y* Y) - SUM(Y) * SUM(Y) ) )
The output of my data looks like the table below. Note that excel_corr
is the correlation manually calculated in Excel, which is my expected output.
The column sql_corr
is the result from my sql code above. The columns from count
to the end represent the X and Y values that get plugged into the formula above. My problem is that the sql_corr
does not match the output from manually calculating the correlation by groupings in Excel.
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| year | product_id | score_range | reporting_year | sql_corr | count | numerator | 1_sumprod | 1_sum | 2_sumprod | 2_sum | excel_corr |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1 | 1-2 | 2016 | NULL | 1 | 0 | 0.000124 | -0.011155 | 195364 | 442 | #DIV/0! |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1 | 3-4 | 2016 | NULL | 1272 | -0.0683 | 4.9E-11 | -0.000007 | 304648060 | 622434 | -0.02911 |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1 | 5-6 | 2016 | -0.06416 | 3913 | -11.845 | 2.89E-09 | -0.000459 | 1.089E+09 | 2063948 | -0.06391 |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1 | 7-8 | 2016 | 0.00573 | 2593 | 1.63663 | 2.27E-08 | -0.000975 | 848560006 | 1482872 | 0.00573 |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1 | 9-10 | 2016 | -0.02106 | 1420 | -3.2855 | 4.13E-08 | -0.00131 | 555096971 | 887587 | -0.02106 |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1 | 11-12 | 2016 | 0.05231 | 917 | 6.64768 | 1.06E-07 | -0.000987 | 413059274 | 615312 | 0.052438 |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1 | 13-14 | 2016 | 0.006704 | 359 | 0.5064 | 6.18E-07 | 0.000271 | 185781413 | 258205 | 0.006705 |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1 | 15-16 | 2016 | 0.017846 | 55 | 0.14095 | 3.79E-06 | 0.000349 | 31849498 | 41850 | 0.017839 |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
| 2016 | 1 | 17-18 | 2016 | NULL | 1 | 0 | 0 | 0 | 641601 | 801 | #DIV/0! |
+------+------------+-------------+----------------+----------+-------+-----------+-----------+-----------+-----------+---------+------------+
For example, in score_range
3-4
the sql_corr
value is NULL
but in excel the value is -0.02911
. If we plug in the values manually into the formula -0.02911
is the correct result.
numerator / ( ( SQRT(n_count * 1_sumprod - 1_sum * 1_sum) * SQRT(n_count * 2_sumprod - 2_sum * 2_sum) ) )
In SQL Server the denominator is getting pushed to 0. When I calculate this manually in Excel the denominator is 2.344354. Why is my denominator being pushed to 0 in SQL Server when the same data results in a different calculation when done manually?
Edit The first part of the denominator is being pushed to 0. ( ( SQRT(n_count * 1_sumprod - 1_sum * 1_sum)
. When the multiplication occurs the whole denominator gets pushed to 0 in SQL activating the CASE statement to return NULL. This is incorrect confirmed by manual calculation. The following represents the output from both parts of the denominator 0.000000 and 9394.0387480572. The actual value for the first part of the denominator via manual calculation is ~0.00025.
Edit The value of (n_count * 1_sumprod - 1_sum * 1_sum) = 6.2279E-08 -- before taking the square root. However, SQL is pushing this part of the equation to 0.
I am using SQL Server 2016 v14.0.2037.2
They problem you have is you have a mix of integers and floats. This causes confusions and conversions.
Convert all your values to float to get the value you are expecting. This is what Excel does.