uops-again.info

uops-again repository 👯‍♀️

This artifact repository accompanies the paper “A Second Look at Port Assignment on Intel CPUs”, by Yarin Oziel, Tomer Laor, Shlomi Levy, Clémentine Maurice, Yossi Oren, Thomas Rokicki and Gabriel Scalosub. Read the two-page work-in-progress report here.

It complements, and is inspired by, the monumental work of Abel and Reineke found at https://uops.info.

Microarchitecture   CPU Model                   Evaluated Commands Indexed
Broadwell           Intel Xeon E5-2620 v4       2500
Coffee Lake         Intel i7-8700k              2500
Comet Lake          Intel i5-10500              2500
Cascade Lake        Intel Xeon Gold 5220        2500
Haswell             Intel Xeon E5-2630 v3       2304*
Ice Lake            Intel Xeon Platinum 8358    2500
Ivy Bridge          Intel i5-3470               2500
Skylake             Intel Xeon Gold 6130        2500
Sandy Bridge        Intel Xeon E5-2630          2116*
Westmere            Intel Xeon X5670            1

* Haswell’s ISA does not support ADOX or ADCX.

* Sandy Bridge’s ISA does not support ADOX, ADCX, ANDN, or BLSI.

Understanding the Results

Most of our microbenchmarks are small code blocks consisting of an LFENCE instruction followed by two other instructions. For example, to analyze the LFENCE; STC; ADD R64, R64 code block, we invoked nanoBench with the following command:

sudo ./kernel-nanoBench.sh -no_norm -n_meas 1 \
  -warm_up_count 10 \
  -config configs/cfg_port_0156_only.txt \
  -asm "LFENCE; STC; ADD RAX, RBX" \
  -unroll 120
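
To repeat this invocation for other code blocks or unroll factors, the argument list can be assembled programmatically. The following is a minimal sketch, not the script used for the paper: the helper name nanobench_command is hypothetical, and it only reuses the flags that appear in the command above. It builds the command without running it.

```python
import shlex


def nanobench_command(asm, unroll, config="configs/cfg_port_0156_only.txt"):
    """Return the argument list for one nanoBench run (nothing is executed).

    Hypothetical helper: it mirrors the flags of the command shown above
    and only varies the assembly string and the unroll factor.
    """
    return [
        "sudo", "./kernel-nanoBench.sh",
        "-no_norm",
        "-n_meas", "1",
        "-warm_up_count", "10",
        "-config", config,
        "-asm", asm,
        "-unroll", str(unroll),
    ]


cmd = nanobench_command("LFENCE; STC; ADD RAX, RBX", 120)
# shlex.join quotes the assembly string so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Keeping the arguments as a list rather than a single string means the assembly text, which contains spaces and semicolons, can be passed to subprocess.run without extra shell quoting.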

We measure the μ-ops dispatched to each port across increasing unroll factors, with unroll values ranging from 100 to 6980. The measurement at each unroll factor is repeated multiple times, and we take the mean and standard deviation of the μ-ops dispatched to each port.
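
The aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code from the repository: the function name summarize and the (unroll, port, uops) tuple layout are hypothetical, the sample standard deviation is an assumed choice, and the numbers are made up.

```python
from statistics import mean, stdev


def summarize(samples):
    """Map (unroll, port) -> (mean, sample standard deviation).

    `samples` is an iterable of (unroll, port, uops) tuples, one per
    repeated measurement. The tuple layout is an assumption of this sketch.
    """
    grouped = {}
    for unroll, port, uops in samples:
        grouped.setdefault((unroll, port), []).append(uops)
    return {
        # stdev needs at least two values; report 0.0 for a single run.
        key: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for key, vals in grouped.items()
    }


# Made-up numbers for illustration only; these are not measurements.
samples = [
    (100, 0, 48.0), (100, 0, 52.0), (100, 0, 50.0),
    (100, 6, 51.0), (100, 6, 49.0), (100, 6, 50.0),
]
summary = summarize(samples)
for (unroll, port), (avg, sd) in sorted(summary.items()):
    print(f"unroll={unroll} port={port} mean={avg:.1f} std={sd:.1f}")
```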

We focus on instructions that are CPU-bound and involve no memory access (neither loads nor stores). This avoids latency effects caused by memory and cache accesses, which are dynamic by nature. In particular, we consider very short code blocks that will be available from the L1 i-cache. For the most part, the instructions we consider are decoded to a single μ-op, although distinct μ-ops may require a different number of cycles to execute. Following the notation used by Abel et al., we denote the eligibility set of an instruction by pXYZW. For example, the eligibility set of CBW is p0156, since it may be executed on any of the ports 0, 1, 5, or 6, whereas the eligibility set of SHL R64, 1 is p06, since it may only be executed on ports 0 or 6.
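
As a small illustration of the pXYZW notation, the sketch below converts an eligibility string into a set of port numbers. The function name parse_eligibility is hypothetical, and the code assumes every port is written as a single decimal digit, which holds for the two examples above.

```python
def parse_eligibility(notation):
    """Turn a string such as 'p0156' into the set of ports {0, 1, 5, 6}.

    Assumes one decimal digit per port (an assumption of this sketch).
    """
    digits = notation[1:]
    if not notation.startswith("p") or not digits or any(
        ch not in "0123456789" for ch in digits
    ):
        raise ValueError(f"not an eligibility set: {notation!r}")
    return {int(ch) for ch in digits}


cbw = parse_eligibility("p0156")  # CBW
shl = parse_eligibility("p06")    # SHL R64, 1

print(sorted(cbw))   # [0, 1, 5, 6]
print(sorted(shl))   # [0, 6]
print(shl <= cbw)    # True: every port that can run SHL can also run CBW
```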

In all figures, the X axis shows the unroll factor, that is, the number of times the code block is repeated within a single measurement, and the Y axis shows the number of μ-ops dispatched to each port, with error bars.

Acknowledgments

This research was supported by Israel Science Foundation grant 229/24. Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations.
