C++ - Busy loop slows down latency-critical computation
My code does the following:
- Do some long-running intense computation (called useless below)
- Do a small latency-critical task
I find that the time it takes to execute the latency-critical task is higher with the long-running computation than without it.
Here is some stand-alone C++ code to reproduce this effect:
    #include <stdio.h>
    #include <stdint.h>

    #define LEN 128
    #define USELESS 1000000000
    //#define USELESS 0

    // Read the timestamp counter
    static inline long long get_cycles()
    {
        unsigned low, high;
        unsigned long long val;
        asm volatile ("rdtsc" : "=a" (low), "=d" (high));
        val = high;
        val = (val << 32) | low;
        return val;
    }

    // Compute a simple hash
    static inline uint32_t hash(uint32_t *arr, int n)
    {
        uint32_t ret = 0;
        for(int i = 0; i < n; i++) {
            ret = (ret + (324723947 + arr[i])) ^ 93485734985;
        }
        return ret;
    }

    int main()
    {
        uint32_t sum = 0;       // For adding dependencies
        uint32_t arr[LEN];      // We'll compute the hash of this array

        for(int iter = 0; iter < 3; iter++) {
            // Create a new array to hash for this iteration
            for(int i = 0; i < LEN; i++) {
                arr[i] = (iter + i);
            }

            // Do the intense computation
            for(int useless = 0; useless < USELESS; useless++) {
                sum += (sum + useless) * (sum + useless);
            }

            // Do the latency-critical task
            long long start_cycles = get_cycles() + (sum & 1);
            sum += hash(arr, LEN);
            long long end_cycles = get_cycles() + (sum & 1);

            printf("iteration %d cycles: %lld\n", iter, end_cycles - start_cycles);
        }
    }
When compiled with -O3 with USELESS set to 1 billion, the 3 iterations took 588, 4184, and 536 cycles, respectively. When compiled with USELESS set to 0, the iterations took 394, 358, and 362 cycles, respectively.
Why could this (particularly the 4184 cycles) be happening? I suspected cache misses or branch mispredictions induced by the intense computation. However, without the intense computation, the zeroth iteration of the latency-critical task is pretty fast, so I don't think that a cold cache or branch predictor is the cause.
Moving my speculative comment to an answer:
It is possible that while your busy loop is running, other tasks on the server are pushing the cached arr data out of the L1 cache, so that the first memory access in hash needs to reload from a lower-level cache. Without the compute loop this wouldn't happen. You could try moving the arr initialization to after the computation loop, just to see what the effect is.
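The experiment suggested above could be sketched roughly as follows. This is only an illustration, not the asker's code: `run_experiment`, `fill_array`, `hash_array`, `Result`, and the `fill_after_busy_loop` flag are hypothetical names, and `std::chrono::steady_clock` is swapped in for `rdtsc` so the sketch is not tied to x86, which means it reports nanoseconds rather than cycles.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

constexpr int LEN = 128;

// Same simple hash as in the question; the 64-bit constant is
// truncated to 32 bits explicitly, which matches the original result.
static uint32_t hash_array(const uint32_t *arr, int n) {
    uint32_t ret = 0;
    for (int i = 0; i < n; i++) {
        ret = (ret + (324723947u + arr[i])) ^ static_cast<uint32_t>(93485734985ULL);
    }
    return ret;
}

// Hypothetical helper: writes the per-iteration input array.
static void fill_array(uint32_t *arr, int iter) {
    for (int i = 0; i < LEN; i++) {
        arr[i] = static_cast<uint32_t>(iter + i);
    }
}

// Hypothetical result type: one timing per round plus the running sum,
// which is returned so the computation is not optimized away.
struct Result {
    std::vector<long long> ns;
    uint32_t sum;
};

// Runs `rounds` iterations of busy loop + timed hash.
// fill_after_busy_loop == false: arr is written before the busy loop (as in the question).
// fill_after_busy_loop == true:  arr is written after it (the variation suggested in the answer).
Result run_experiment(int rounds, uint32_t busy_iters, bool fill_after_busy_loop) {
    assert(rounds > 0);
    using steady = std::chrono::steady_clock;
    Result r{{}, 0};
    uint32_t arr[LEN];

    for (int iter = 0; iter < rounds; iter++) {
        if (!fill_after_busy_loop) fill_array(arr, iter);

        // Intense computation
        for (uint32_t u = 0; u < busy_iters; u++) {
            r.sum += (r.sum + u) * (r.sum + u);
        }

        if (fill_after_busy_loop) fill_array(arr, iter);

        // Latency-critical task
        const auto start = steady::now();
        r.sum += hash_array(arr, LEN);
        const auto end = steady::now();
        r.ns.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());
    }
    return r;
}
```

Calling `run_experiment(3, 1000000000u, false)` and `run_experiment(3, 1000000000u, true)` from `main` and printing the `ns` values would show whether the slow iteration goes away when arr is written right before it is hashed. If it does, that would be consistent with the eviction theory; if it does not, something else is probably responsible. Note that `steady_clock` may be too coarse to resolve differences of a few hundred cycles, so the `rdtsc` helper from the question is likely the better timer on x86.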