How to perform linear regression in BigQuery?


BigQuery has some statistical aggregation functions such as STDDEV(x) and CORR(x, y), but it doesn't offer functions that directly perform linear regression.

How can one compute a linear regression using the functions that do exist?

The following query performs a linear regression using calculations that are numerically stable and easily modified to work on any input table. It produces the slope and intercept of the best fit to the model y = slope * x + intercept, and the Pearson correlation coefficient using the builtin function CORR.

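As an example, we use the public natality dataset to compute birth weight as a linear function of the duration of pregnancy, broken down by state. You could write this more compactly, but we use several layers of subqueries to highlight how the pieces go together. To apply this to another dataset, you just need to replace the innermost query so that it yields the columns bucket, x and y.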

    SELECT bucket,
           slope,
           (sum_of_y - slope * sum_of_x) / n AS intercept,
           correlation
    FROM (
        SELECT bucket,
               n,
               sum_of_x,
               sum_of_y,
               correlation * stddev_of_y / stddev_of_x AS slope,
               correlation
        FROM (
            SELECT bucket,
                   COUNT(*) AS n,
                   SUM(x) AS sum_of_x,
                   SUM(y) AS sum_of_y,
                   STDDEV_POP(x) AS stddev_of_x,
                   STDDEV_POP(y) AS stddev_of_y,
                   CORR(x, y) AS correlation
            FROM (SELECT state AS bucket,
                         gestation_weeks AS x,
                         weight_pounds AS y
                  FROM [publicdata.samples.natality])
            WHERE bucket IS NOT NULL AND
                  x IS NOT NULL AND
                  y IS NOT NULL
            GROUP BY bucket));
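To see how the three layers of the query fit together, here is a minimal Python sketch of the same arithmetic for a single bucket. It is only an illustration outside BigQuery: the function name fit_line and the sample points are made up, and the standard deviations and correlation are computed by hand to mirror what STDDEV_POP and CORR return.

```python
import math

def fit_line(xs, ys):
    """Mirror the query's layers for one bucket (hypothetical helper)."""
    # Innermost aggregates: n, sums, population stddevs, correlation.
    n = len(xs)
    sum_of_x = sum(xs)
    sum_of_y = sum(ys)
    mean_x = sum_of_x / n
    mean_y = sum_of_y / n
    stddev_of_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    stddev_of_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    correlation = covariance / (stddev_of_x * stddev_of_y)
    # Middle layer: slope from the correlation and the two stddevs.
    slope = correlation * stddev_of_y / stddev_of_x
    # Outer layer: intercept from the sums and the slope.
    intercept = (sum_of_y - slope * sum_of_x) / n
    return slope, intercept, correlation

# Made-up points lying exactly on y = 2x + 1.
slope, intercept, correlation = fit_line([1.0, 2.0, 3.0, 4.0],
                                         [3.0, 5.0, 7.0, 9.0])
print(slope, intercept, correlation)
```

For these points the result should be very close to a slope of 2, an intercept of 1 and a correlation of 1, up to floating-point rounding.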

Using the STDDEV_POP and CORR functions improves the numerical stability of this query compared to summing products of x and y and then taking differences and dividing. If you use both approaches on a well-behaved dataset, though, you can verify that they produce the same results to high accuracy.
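The agreement between the two approaches on well-behaved data can be checked with a short Python sketch. This is an illustration only, not BigQuery code: the function names and the sample values (weeks and pounds) are made up, and the sum-of-products formula used is the textbook one, slope = (n * sum(xy) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2).

```python
import math

def slope_from_corr(xs, ys):
    """Slope as correlation * stddev_of_y / stddev_of_x, as in the query."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    stddev_of_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    stddev_of_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    correlation = covariance / (stddev_of_x * stddev_of_y)
    return correlation * stddev_of_y / stddev_of_x

def slope_from_sums(xs, ys):
    """Slope from raw sums of products, then differences and a division."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)

# Made-up, well-behaved sample: gestation weeks and birth weights in pounds.
weeks = [36.0, 37.0, 38.0, 39.0, 40.0, 41.0]
pounds = [5.9, 6.4, 6.9, 7.2, 7.6, 7.7]

a = slope_from_corr(weeks, pounds)
b = slope_from_sums(weeks, pounds)
print(a, b, math.isclose(a, b, rel_tol=1e-9))
```

The sum-of-products version subtracts two large, nearly equal numbers, which is where precision can be lost when the x values are large relative to their spread. On small, centered data like this sample the two slopes should match closely.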

