Loop-Carried Dependency: Single vs Split Accumulator
Question 19 / 51 • Correct so far: 0 (0 answered)
Single Sum
double sumArray(const std::vector<double>& data) {
double sum = 0.0;
for (double x : data)
sum += x;
return sum;
}
double result = sumArray(DATA); Split Sum
double sumArray(const std::vector<double>& data) {
double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
const std::size_t n = data.size();
std::size_t i = 0;
for (; i + 3 < n; i += 4) {
s0 += data[i];
s1 += data[i + 1];
s2 += data[i + 2];
s3 += data[i + 3];
}
for (; i < n; ++i) s0 += data[i];
return s0 + s1 + s2 + s3;
}
double result = sumArray(DATA); Shared test data (shared-setup)
constexpr int kDataSize = 65536;
static std::vector<double> makeData() {
std::vector<double> v(kDataSize);
for (int i = 0; i < kDataSize; ++i)
v[i] = 1.0 + static_cast<double>(i) * 0.000001;
return v;
}
static const std::vector<double> DATA = makeData(); Which snippet is faster?
Snippet B is faster. Floating-point addition is not associative, so the compiler cannot reorder snippet A's additions — each iteration waits for the previous result, creating a serial dependency chain limited by add latency. Four independent partial sums let the CPU keep multiple additions in flight at once, hiding the 4–5 cycle latency per add.
Benchmark results
| Snippet | CPU time / iteration | Speedup |
|---|---|---|
| Single Sum | 36.5 us | 1.0× |
| Split Sum | 9.11 us | 4.0× |
Explore the source
Open in Compiler ExplorerQuiz complete. You can return to the question list to restart and compare.