algorithm - Calculating cannibalization/attribution in R -
i have metric distributed among 4 categories a, b, c, d
.
over period of time track movement in metric each category. sum of these movements represents quantity has either left or entered system elsewhere ('external').
# setup ------------------------------------------------------------------- categories <- letters[1:4] set.seed(1) movements <- lapply(categories, function(...) {round(runif(10, -10,10))*10}) names(movements) <- categories movements[['external']] <- reduce(`+`, movements)*-1 problem <- as.data.frame(movements) problem b c d external 1 -50 -60 90 0 20 2 -30 -60 -60 20 130 3 10 40 30 0 -80 4 80 -20 -70 -60 70 5 -60 50 -50 70 -10 6 80 0 -20 30 -90 7 90 40 -100 60 -90 8 30 100 -20 -80 -30 9 30 -20 70 40 -120 10 -90 60 -30 -20 80
where categories have undergone positive movements , others have undergone negative movements, can infer transfer within system.
# add transfer columns , initialise 0 -------------------------------- transfer_matrix <- combn(c(categories, 'external'), 2) transfer_list <- combn(c(categories, 'external'), 2, simplify=f) problem[,sapply(transfer_list, paste, collapse='.')] <- 0 paste(names(problem), collapse=', ') [1] "a, b, c, d, external, a.b, a.c, a.d, a.external, b.c, b.d, b.external, c.d, c.external, d.external"
for example a
has decreased 50 , c
has increased 90, can infer there transfer a
c
stored in variable a.c
.
the rule calculating transfers proportional. when 'a' has decrease 50 , b
has decreased 60, 50/(50+60) of increase in c
should attributed 'a'
, , 60/(50+60) of increase in c
should attributed b
. , transfers in , out of system.
below shows full manual calculation of variables need, first row:
# manual calculation ------------------------------------------------------ row_limit <- 1 # change e.g. 1:10 problem[row_limit, 'a.b'] <- 0 problem[row_limit, 'a.c'] <- 90*(-50/(-50+-60)) problem[row_limit, 'a.d'] <- 0 problem[row_limit, 'a.external'] <- 20 * -50/(-50+-60) problem[row_limit, 'b.c'] <- 90*(-60/(-50+-60)) problem[row_limit, 'b.d'] <- 0 problem[row_limit, 'b.external'] <- 20 * -60/(-50+-60) problem[row_limit, 'c.d'] <- 0 problem[row_limit, 'c.external'] <- 0 problem[row_limit, 'd.external'] <- 0
note since a.c = -c.a
subset of possible transfers need calculated.
my question is, how can code above calculations programmatically, in concise , efficient way handle 10-20 categories , large number of rows?
i typically use data.table open suggestions of packages use.
below code checking output:
# checking ---------------------------------------------------------------- check <- function(problem, category, categories, transfer_list, transfer_matrix) { out_columns <- sapply(transfer_list[transfer_matrix[1,] == category], paste, collapse='.') in_columns <- sapply(transfer_list[transfer_matrix[2,] == category], paste, collapse='.') stopifnot(length(c(out_columns, in_columns)) == length(categories)-1) out_sum <- 0 if(length(out_columns) == 1) { out_sum <- problem[,out_columns] } else if(length(out_columns) > 1) { out_sum <- reduce(`+`, problem[,out_columns]) } in_sum <- 0 if(length(in_columns) == 1) { in_sum <- problem[,in_columns] } else if(length(in_columns) > 1) { in_sum <- reduce(`+`, problem[,in_columns]) } lhs <- out_sum - in_sum rhs <- -problem[, category] sprintf('%s vs %s',lhs, rhs) } # each category, actual vs expected sapply(c(categories,'external'), check, problem=problem, categories=c(categories,'external'), transfer_list=transfer_list,transfer_matrix=transfer_matrix) b c d [1,] "50 vs 50" "60 vs 60" "-90 vs -90" "0 vs 0" [2,] "0 vs 30" "0 vs 60" "0 vs 60" "0 vs -20" [3,] "0 vs -10" "0 vs -40" "0 vs -30" "0 vs 0" [4,] "0 vs -80" "0 vs 20" "0 vs 70" "0 vs 60" [5,] "0 vs 60" "0 vs -50" "0 vs 50" "0 vs -70" [6,] "0 vs -80" "0 vs 0" "0 vs 20" "0 vs -30" [7,] "0 vs -90" "0 vs -40" "0 vs 100" "0 vs -60" [8,] "0 vs -30" "0 vs -100" "0 vs 20" "0 vs 80" [9,] "0 vs -30" "0 vs 20" "0 vs -70" "0 vs -40" [10,] "0 vs 90" "0 vs -60" "0 vs 30" "0 vs 20"
here 1 idea. believe output matches want.
#x row problem df #y column transfer_matrix check_pairs <- function(x,y){ #split y columns being compared . e.g. if col 1 'd' vs 'external', ... <- y[1] #would 'd' b <- y[2] #would 'external' #if both pos, both neg, or 1 val 0, return 0 if( sign(x[a]) == sign(x[b]) | sign(x[[a]]) == 0){ return(0) }else{ #else return formula manual calculation return( x[[b]] * x[[a]] / sum( x[sign(x)==sign(x[[a]]) ] ) ) } } #for each row of problem matrix, compare each column of transfer_matrix check_matrix_cols <- function(x){ return( apply(transfer_matrix, 2, function(y) check_pairs(x,y)) ) } problem[,-seq(length(c(categories, 'external')))] <- t( apply(problem, 1, check_matrix_cols) ) sapply(c(categories,'external'), check, problem=problem, categories=c(categories,'external'), transfer_list=transfer_list,transfer_matrix=transfer_matrix) b c d external [1,] "50 vs 50" "60 vs 60" "-90 vs -90" "0 vs 0" "-20 vs -20" [2,] "30 vs 30" "60 vs 60" "60 vs 60" "-20 vs -20" "-130 vs -130" [3,] "-10 vs -10" "-40 vs -40" "-30 vs -30" "0 vs 0" "80 vs 80" [4,] "-80 vs -80" "20 vs 20" "70 vs 70" "60 vs 60" "-70 vs -70" [5,] "60 vs 60" "-50 vs -50" "50 vs 50" "-70 vs -70" "10 vs 10" [6,] "-80 vs -80" "0 vs 0" "20 vs 20" "-30 vs -30" "90 vs 90" [7,] "-90 vs -90" "-40 vs -40" "100 vs 100" "-60 vs -60" "90 vs 90" [8,] "-30 vs -30" "-100 vs -100" "20 vs 20" "80 vs 80" "30 vs 30" [9,] "-30 vs -30" "20 vs 20" "-70 vs -70" "-40 vs -40" "120 vs 120" [10,] "90 vs 90" "-60 vs -60" "30 vs 30" "20 vs 20" "-80 vs -80"
Comments
Post a Comment