uops-again.info

uops-again repository 👯‍♀️

This artifact repository accompanies the paper “A Second Look at Port Assignment on Intel CPUs”, by Yarin Oziel, Tomer Laor, Shlomi Levy, Clémentine Maurice, Yossi Oren, Thomas Rokicki and Gabriel Scalosub. Read the two-page work-in-progress report here.

It complements, and is inspired by, the monumental work of Abel and Reineke found at https://uops.info.

Microarchitecture	CPU Model Evaluated	Commands Indexed
Broadwell	Intel Xeon E5-2620 v4	2500
Coffee Lake	Intel i7-8700k	2500
Comet Lake	Intel i5-10500	2500
Cascade Lake	Intel Xeon Gold 5220	2500
Haswell	Intel Xeon E5-2630 v3	2304*
Ice Lake	Intel Xeon Platinum 8358	2500
Ivy Bridge	Intel i5-3470	2500
Skylake	Intel Xeon Gold 6130	2500
Sandy Bridge	Intel Xeon E5-2630	2116*
Westmere	Intel Xeon X5670	1

* Haswell’s ISA does not support ADOX or ADCX.

* Sandy Bridge’s ISA does not support ADOX, ADCX, ANDN, or BLSI.

Understanding the Results

Most of our microbenchmarks consisted of small code blocks comprising an LFENCE instruction followed by two other instructions. For example, to analyze the LFENCE; STC; ADD R64, R64 code block, we invoked nanoBench with the following command:

sudo ./kernel-nanoBench.sh -no_norm -n_meas 1 \
  -warm_up_count 10 \
  -config configs/cfg_port_0156_only.txt \
  -asm "LFENCE; STC; ADD RAX, RBX" \
  -unroll 120
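When the same code block has to be run with many unroll values, it can help to assemble the invocation programmatically. The following is a minimal Python sketch, not part of the artifact: the helper name `build_nanobench_cmd` is hypothetical, and the flags are copied verbatim from the command above rather than taken from any other source.

```python
import shlex


def build_nanobench_cmd(asm, unroll, config="configs/cfg_port_0156_only.txt"):
    """Return the argument list for one nanoBench run.

    Hypothetical helper: the flags mirror the example command above.
    """
    return [
        "sudo", "./kernel-nanoBench.sh",
        "-no_norm",
        "-n_meas", "1",
        "-warm_up_count", "10",
        "-config", config,
        "-asm", asm,
        "-unroll", str(unroll),
    ]


# Rebuild the example invocation and print it as a shell-quoted string.
cmd = build_nanobench_cmd("LFENCE; STC; ADD RAX, RBX", 120)
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run` avoids shell quoting issues with the semicolons in the assembly string; `shlex.join` is used here only to display the command.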

We measure the μ-ops dispatched to each port across increasing unroll factors, with unroll values ranging from 100 to 6980. Each unroll factor is repeated multiple times, and we take the mean and standard deviation of the μ-ops dispatched to each port.
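The aggregation step described above can be sketched as follows. This is an illustrative Python example under stated assumptions: the input layout (a list of `(unroll, {port: uops})` pairs, one per repetition) is hypothetical and is not nanoBench's output format, the numbers are made up, and the sample standard deviation is used because the text does not say which estimator was applied.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(runs):
    """Group repetitions by unroll factor and port.

    runs: iterable of (unroll, {port: uops}) pairs, one per repetition.
    Returns {unroll: {port: (mean, std)}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for unroll, counts in runs:
        for port, uops in counts.items():
            grouped[unroll][port].append(uops)
    return {
        unroll: {
            # stdev needs at least two samples; report 0.0 for a single run.
            port: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
            for port, vals in ports.items()
        }
        for unroll, ports in grouped.items()
    }


# Made-up measurements, for illustration only.
runs = [
    (100, {0: 50.0, 6: 150.0}),
    (100, {0: 54.0, 6: 146.0}),
    (200, {0: 100.0, 6: 300.0}),
]
summary = summarize(runs)
for unroll, ports in sorted(summary.items()):
    print(unroll, ports)
```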

We focus on instructions that are CPU-bound and involve no memory access (neither loads nor stores). This avoids latency effects caused by memory and cache accesses, which are dynamic by nature. In particular, we consider very short code blocks that fit in the L1 i-cache. For the most part, the instructions we consider decode to a single μ-op. However, different μ-ops may require a different number of cycles to execute. Following the notation used by Abel et al., we denote the eligibility set of an instruction by pXYZW. For example, the eligibility set of CBW is p0156, since it may be executed on any of the ports 0, 1, 5, or 6, whereas the eligibility set of SHL R64, 1 is p06, since it may only be executed on ports 0 or 6.
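To make the pXYZW notation concrete, here is a small Python sketch that turns an eligibility label into a set of port numbers. The function name `parse_eligibility` is hypothetical, and the sketch assumes every port is written as a single decimal digit, as in the two examples above.

```python
def parse_eligibility(label):
    """Convert a label such as 'p0156' into frozenset({0, 1, 5, 6}).

    Assumes single-digit port numbers (an assumption of this sketch).
    """
    digits = label[1:]
    if not label.startswith("p") or not digits:
        raise ValueError(f"not an eligibility label: {label!r}")
    if any(ch not in "0123456789" for ch in digits):
        raise ValueError(f"unexpected port character in {label!r}")
    return frozenset(int(ch) for ch in digits)


cbw = parse_eligibility("p0156")  # eligibility set of CBW
shl = parse_eligibility("p06")    # eligibility set of SHL R64, 1

# Ports on which both instructions may execute.
shared = cbw & shl
print(sorted(cbw), sorted(shl), sorted(shared))
```

Representing eligibility sets as `frozenset` values makes overlap between two instructions a plain set intersection, here ports 0 and 6.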

In all figures, the X axis shows the unroll factor, that is, the number of times the microbenchmark is executed before its performance is measured, and the Y axis shows the mean number of μ-ops dispatched to each port, with error bars indicating the standard deviation.

Acknowledgments

This research was supported by Israel Science Foundation grant 229/24. Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations.
