The Edison main SoC is a 22 nm Intel Atom “Tangier” (Z34XX) that includes two Atom Silvermont (SLM) cores. Although never advertised by Intel, the CPU is known to be 64-bit (x86_64) capable.
Examples of code optimizations specific to Edison / x86_64 that have been implemented 1):
Base64 encoding/decoding
CRC32C encoding/decoding
1) Feel free to add Edison / NUC E3815 or other Baytrail examples here.
There are quite a number of disadvantages:
From the Intel® 64 and IA-32 Architectures Optimization Reference Manual F.8.1.2 Front End High IPC Considerations:
- The total length of the instruction bytes that can be decoded each cycle varies by microarchitecture.
SLM: up to 16 bytes per cycle with instruction not more than 8 bytes in length. For an instruction length exceeding 8 bytes, only one instruction per cycle is decoded on decoder 0.
- An instruction with multiple prefixes can restrict decode throughput. The restriction is on the length of bytes combining prefixes and escape bytes. There is a 3 cycle penalty when the escape/prefix count exceeds the following limits as specified per microarchitectures.
SLM: the limit is 3 bytes.
- Only decoder 0 can decode an instruction exceeding the limit of prefix/escape byte restriction on the Silvermont and Goldmont microarchitectures.
- The maximum number of branches that can be decoded each cycle is 1 for SLM. Prevent a re-steer penalty by avoiding back-to-back conditional branches.
Unfortunately, x86_64 mode adds a prefix byte to instructions that are already long. For instance, CRC32Q exceeds the prefix/escape limit, causing a 3 cycle penalty that completely cancels out the performance gain.
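To make the byte count concrete, here is a minimal sketch that counts the leading prefix and escape bytes of a single instruction encoding. It assumes the limit quoted above counts legacy prefixes, the REX prefix and the 0F / 0F 38 / 0F 3A escape bytes together, which is how the text above reads. The helper names `is_legacy_prefix` and `prefix_escape_bytes` and the two sample arrays are made up for illustration, and VEX-encoded instructions are not handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Sample encodings (register-to-register forms, ModRM = 0xC3):
 *   CRC32 r32, r/m32 : F2       0F 38 F1 /r
 *   CRC32 r64, r/m64 : F2 REX.W 0F 38 F1 /r   (the "CRC32Q" form) */
static const uint8_t CRC32_R32[] = { 0xF2, 0x0F, 0x38, 0xF1, 0xC3 };
static const uint8_t CRC32_R64[] = { 0xF2, 0x48, 0x0F, 0x38, 0xF1, 0xC3 };

/* Hypothetical helper: is this byte a legacy (pre-REX) prefix? */
static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0xF0: case 0xF2: case 0xF3:             /* LOCK, REPNE, REP */
    case 0x2E: case 0x36: case 0x3E: case 0x26:  /* CS, SS, DS, ES overrides */
    case 0x64: case 0x65:                        /* FS, GS overrides */
    case 0x66: case 0x67:                        /* operand size, address size */
        return 1;
    default:
        return 0;
    }
}

/* Hypothetical helper: number of prefix + escape bytes at the start of one
 * x86_64 instruction. Assumption: this is what the SLM limit of 3 counts. */
size_t prefix_escape_bytes(const uint8_t *insn, size_t len)
{
    size_t i = 0;

    while (i < len && is_legacy_prefix(insn[i]))
        i++;                                     /* legacy prefixes */
    if (i < len && (insn[i] & 0xF0) == 0x40)
        i++;                                     /* REX prefix (64-bit mode) */
    if (i < len && insn[i] == 0x0F) {
        i++;                                     /* first escape byte */
        if (i < len && (insn[i] == 0x38 || insn[i] == 0x3A))
            i++;                                 /* second escape byte */
    }
    return i;
}
```

With these inputs, `prefix_escape_bytes(CRC32_R32, sizeof CRC32_R32)` returns 3 (exactly at the limit) and `prefix_escape_bytes(CRC32_R64, sizeof CRC32_R64)` returns 4: the extra REX.W byte is what pushes the 64-bit form over the limit.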
Fortunately there is a way around this restriction: engaging the Loop Stream Detector (LSD). Again from the Optimization Reference Manual, F.8.1.4 Loop Unrolling and Loop Stream Detector:
The Silvermont and Goldmont microarchitectures include a Loop Stream Detector (LSD) that provides the back end with uops that are already decoded. This provides performance and power benefits. When the LSD is engaged, front end decode restrictions, such as number of prefix/escape bytes and instruction length, no longer apply.
It appears the LSD can kick in for short loops once a certain number of iterations has occurred (this is not clearly documented, but the number is probably 64). To take advantage of this, take care that the compiler does not unroll your loop. The effect can be quite dramatic: because the 3 cycle penalty is eliminated after 64 iterations, a 3x speedup can be observed for long-running loops.
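As an illustration of keeping such a loop rolled, the following is a minimal sketch of a CRC32C routine built on the SSE4.2 intrinsic `_mm_crc32_u64` (which compiles to CRC32Q). It is not the implementation referred to at the top of this page. The function names are made up, the unroll pragmas are GCC / clang specific, and whether the resulting loop is short enough to engage the LSD on Silvermont is an assumption you should verify by measuring. A portable bitwise version is included as a fallback and as a reference for checking the result.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <nmmintrin.h>   /* SSE4.2: _mm_crc32_u64, _mm_crc32_u8 */
#endif

/* Portable bitwise CRC32C (reflected Castagnoli polynomial 0x82F63B78). */
static uint32_t crc32c_sw(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#if defined(__x86_64__)
/* Hardware version: one CRC32Q per 8 input bytes. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(uint32_t crc, const unsigned char *buf, size_t len)
{
    uint64_t c = (uint32_t)~crc;

    /* Ask the compiler NOT to unroll, so the loop body stays short.
     * Assumption: a short rolled loop is what lets the LSD take over. */
#if defined(__clang__)
#pragma clang loop unroll(disable)
#elif defined(__GNUC__)
#pragma GCC unroll 1
#endif
    for (; len >= 8; buf += 8, len -= 8) {
        uint64_t word;
        memcpy(&word, buf, sizeof word);     /* unaligned-safe load */
        c = _mm_crc32_u64(c, word);          /* CRC32Q */
    }
    while (len--)                            /* remaining 0..7 bytes */
        c = _mm_crc32_u8((uint32_t)c, *buf++);
    return ~(uint32_t)c;
}
#endif

/* Dispatch: use CRC32Q when the CPU has SSE4.2, else the bitwise version. */
uint32_t crc32c(uint32_t crc, const void *data, size_t len)
{
#if defined(__x86_64__)
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(crc, data, len);
#endif
    return crc32c_sw(crc, data, len);
}
```

Calling `crc32c(0, "123456789", 9)` returns 0xE3069283, the standard CRC32C check value, on both code paths. To see the effect described above, time the routine on a large buffer with and without the unroll pragma and inspect the generated assembly to confirm that the loop really stayed rolled.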
If you checked out master, in meta-intel-edison/meta-intel-edison-bsp/conf/machine/edison.conf change KBUILD_DEFCONFIG="x86_64_defconfig" and set DEFAULTTUNE = "core2-64".
Alternatively, you can check out scarthgap, which does this for you.
© 2018 Ferry Toth