To be fair, replacing a series of ALU ops with a lookup table doesn't usually add a "data dependency" - the data dependency probably already existed, but perhaps flowed through registers rather than memory. What adding a lookup table can do is to add the load-latency to the dependency chain involving the calculation, which seems to be what you are talking about here. For an L1 hit that's usually 4 or 5 cycles, and for L2 hits and beyond it's worse, as you point out.

How much that actually matters depends on whether the code is latency-bound and whether the involved lookup is on the critical path: in many cases, where there is enough ILP, it won't be (a general rule is that in most code, most instructions are not on a critical dependency chain).

If the method is really hot, e.g., in a tight(ish) loop, then you are mostly going to be getting L1 hits. If the involved method isn't that hot, then L1 misses (like your example) or worse are definitely a possibility - on the other hand, in that case performance isn't that critical by definition.

The comparison with 384 FOPs seems a bit off: I guess you are talking about some 32-FOP-per-cycle SIMD implementation (AVX-512?), but the assumption of data dependencies kind of rules that out: one would assume it's scalar code here.
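To make the critical-path point concrete, here is a minimal sketch of my own (the table name and sizes are invented, not from the thread): the same 4-5 cycle L1 load latency is either fully exposed or almost free, depending on whether each lookup feeds the next one.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical small, read-only LUT; any 256-entry byte table behaves the same. */
extern const uint8_t lut[256];

/* Latency-bound: each load's address depends on the previous load's result,
 * so the L1 load latency is added to the loop-carried dependency chain. */
uint8_t chained(uint8_t seed, size_t n) {
    uint8_t x = seed;
    for (size_t i = 0; i < n; i++)
        x = lut[x];
    return x;
}

/* Throughput-bound: the lookups are independent, so out-of-order execution
 * overlaps them and the load latency is largely hidden. */
uint64_t independent(const uint8_t *in, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += lut[in[i]];
    return sum;
}
```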
> To be fair, replacing a series of ALU ops with a lookup table doesn't usually add a "data dependency"

If it's vectorization, then the whole equation changes!

> What adding a lookup table can do is to add the load-latency to the dependency chain involving the calculation, which seems to be what you are talking about here.

If it's not vectorizable, the LUT result is often used for an indirect jump/call (like a large switch statement) or a memory access (say, a histogram etc.).

> If the method is really hot, e.g., in a tight(ish) loop, then you are mostly going to be getting L1 hits.

To be a win, the LUT function needs to be something pretty heavy, while the LUT itself needs to be small (at least …). It's generally good to keep the L1 footprint small: there are just 512 64-byte L1 cache lines, there can be other hot loops nearby that could also benefit from a hot L1, and it's very easy to start spilling to L2 (and further). Microbenchmarks often miss "system"-level issues - loads can affect performance system-wide.

> The comparison with 384 FOPs seems a bit off

384 was for the extreme vectorization case: 12 x 2 x 8 FMACs (AVX), and most vendors nowadays count an FMAC as two FOPs.
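For readers following along, the two non-vectorizable uses mentioned above look roughly like this (a minimal sketch of my own; the names are invented and this is not code from either commenter):

```c
#include <stdint.h>
#include <stddef.h>

/* Indirect call: a LUT of function pointers standing in for a large switch. */
typedef void (*handler_fn)(void);
extern handler_fn handlers[256];        /* hypothetical dispatch table */

void dispatch(uint8_t opcode) {
    handlers[opcode]();                 /* the LUT result feeds an indirect call */
}

/* Memory access: a histogram is a read/write table, unlike a read-only LUT. */
void histogram(const uint8_t *data, size_t n, uint32_t counts[256]) {
    for (size_t i = 0; i < n; i++)
        counts[data[i]]++;              /* LUT-style indexing, but read-modify-write */
}
```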
> If it's vectorization, then the whole equation changes

Well, isn't that where the performance wins are, and what you need to do to extract maximum performance from that hot loop? A good, truly parallel vector gather implementation could make (small) LUTs very interesting performance-wise.
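A minimal sketch of what that could look like with AVX2's gather, as my own illustration (it assumes a small table of 32-bit entries and indices already in range, and it makes no claim about gather throughput on any particular core):

```c
#include <immintrin.h>   /* AVX2: compile with -mavx2 */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical read-only LUT: 256 x 4 B = 1 KiB, i.e. 16 cache lines. */
extern const int32_t lut32[256];

/* Look up eight table entries per iteration with a vector gather.
 * Assumes n is a multiple of 8 and idx[] holds values in [0, 255]. */
void gather_lookup(const int32_t *idx, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256i vidx = _mm256_loadu_si256((const __m256i *)&idx[i]);
        __m256i vres = _mm256_i32gather_epi32((const int *)lut32, vidx, 4);
        _mm256_storeu_si256((__m256i *)&out[i], vres);
    }
}
```

Whether this actually beats recomputing the values with ALU ops depends heavily on the gather throughput of the particular microarchitecture, which is exactly the "whole equation changes" caveat.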
> If it's not vectorizable, the LUT result is often used for an indirect jump/call (like a large switch statement) or a memory access (say, a histogram etc.).

The parent comments were specifically talking about using LUTs to replace calculation, and in particular calculations involving branches. So basically, rather than some ALU ops plus possibly some branches, you use a LUT and fewer or zero ALU ops, and fewer or zero branches. No one is talking about the context of a LUT of function pointers being used for a switch statement, and a histogram is generally a read/write table in memory that doesn't have much to do with a (usually read-only) LUT - unless I missed what you're getting at there.

> To be a win, the LUT function needs to be something pretty heavy, while the LUT itself needs to be small (at least …). Microbenchmarks often miss "system"-level issues.

But it's like a reverse myth now: at some point (apparently?) everyone loved LUTs - but now it's popular to just dismiss any LUT use with "yeah, but a cache miss takes yyy cycles, which will make a LUT terrible!" or "microbenchmarks can't capture the true cost of LUTs!". Now the latter is certainly true, but you can certainly put reasonable bounds on the cost. If the method is always missing to DRAM, the LUT entries are being evicted before the next invocation that hits the same line, which must therefore be a "while" later. The key observation is that the "deeper" the miss (a miss to DRAM being the deepest, ignoring swap), the lower the implied frequency at which the LUT-using method was being called anyway.
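To make the "calculation plus branches versus one lookup" framing above concrete, here is a small sketch of my own (the character-class function is invented purely for illustration; in such a tiny case the compiler may well generate branchless code for the calculation anyway):

```c
#include <stdint.h>
#include <stdbool.h>

/* Calculation: a handful of ALU ops and compiler-dependent branches. */
static bool is_word_char_calc(uint8_t c) {
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
            c == '_';
}

/* Lookup: one load from a 256-byte read-only table (4 cache lines).
 * Range designators are a GCC/Clang extension, used here for brevity. */
static const uint8_t word_char_lut[256] = {
    ['a' ... 'z'] = 1,
    ['A' ... 'Z'] = 1,
    ['0' ... '9'] = 1,
    ['_']         = 1,
};

static bool is_word_char_lut(uint8_t c) {
    return word_char_lut[c];    /* zero branches, one (hopefully L1-resident) load */
}
```

The trade-off only gets interesting when the replaced computation is genuinely heavy and the table stays small, which is the "pretty heavy function, small LUT" condition quoted above.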