by Roman Bilusiak, Vas Mylko
Curiosio is the smartest travel guide for road trippers. You — curious travelers — can create interesting road trips in any geography that you want within the time and budget that you have. You can take any itinerary/route you love and customize it by perimeter, points, time, and money to the journey you need. You are traveling the world by following your curiosity.
Curiosio is a unique breed of search + computational + answering engine for travelers. Scientifically speaking, Curiosio performs multi-objective optimization with constraint satisfaction. Mathematically speaking, Curiosio finds a solution to an NP-complete conundrum in predictable real time.
Curiosio is computational intelligence, and it must be fully ARM'ed for global scale.
Curiosio is running on top of Ingeenee — our AI engine that finds a unique solution in the infinitely large solution space. Ingeenee is Evolutionary AI. Evolutionary AI lives on High-Performance Computing aka HPC. HPC lives on electrical power. Electricity is expensive. Arm server CPUs are power efficient. So here we are — evaluating a brand new Arm-based server processor Ampere Altra by Ampere Computing.
Ampere Altra Processor
Ampere Altra is an Arm-based server CPU. It is a many-core processor featuring 80 customized Arm Neoverse N1 cores. Ampere Altra is the first Arm-based server CPU with pedigree.
Ampere Computing designed two versions of the processor: Ampere Altra and Ampere Altra Max (see the brief and data sheet for Ampere Altra details). Ampere Altra has 80 cores; Ampere Altra Max has 128 cores. Our workload is more sensitive to clock speed than to other characteristics of the chip. At the time we applied, we thought the 80-core version was clocking at 3.3GHz, hence we opted for Ampere Altra over Ampere Altra Max. Now we know that both Ampere Altra and Ampere Altra Max have a consistent max frequency of 3.0GHz.
Curiosio needs high-performance many-core server processors densely packed into energy-efficient, multi-processor, HPC-optimized nodes. As soon as Ampere Computing announced the Evaluation Program, we applied. We directly mentioned that we were interested in many-core bootable chips like Intel Xeon Phi 72xx. So high hopes, fingers crossed, knock on wood.
The Evaluation Program is intended to test the Ampere systems, entire server systems, for the exclusive needs of every participant, and to share the technical report with Ampere Computing. There are several Ampere Altra computing systems designed together with server makers. We got a dual-socket Ampere Altra system [160 cores at 3.0GHz] called Mt. Jade. A few such nodes were allocated in the scope of our evaluation program.
Evaluation Program
We divided our AI workload into four classes: easy, medium, heavy, and very heavy. It involves our bits and a few 3rd-party modules. The deeper details of each class are out of the scope of this report because they are intellectual property.
During the initial probing of the system, we briefly benchmarked performance with only 20–30 loaded cores [out of the available 160], but did not formalize that measurement because we need 10x more cores than that. In that context, Ampere Altra performed on par with our Intels while consuming 2x less power.
The main benchmarking for Ampere Altra was performed on 80 and 150 cores. The workload was put onto Ampere Altra and three Intel Xeon E5-26xx v2, v3, v4 systems. We measured Ampere Altra performance from five perspectives: Performance per Server, Performance per Watt, Single Thread Performance, Multi Thread Performance, and Response Time. In most cases, performance is the number of requests per unit of time; more requests per unit of time is better. The charts [below] show the ratios between Ampere Altra and the Intel Xeons in the same contexts.
Performance per Server. Packing more cores into one server chassis unlocks more productivity from the same iron. Two Ampere Altra chips in one server give 160 cores to the OS and apps. This metric correlates with performance per space or volume. In the data center you pay for co-location, which is space, hence it is rational to have your cabinets densely packed.
Our Intel systems are also dual-socket. Even denser packaging is possible in the future: 4x dual-socket HPC nodes in a 2U server, even with air cooling. This is the strong side of Ampere Altra: 640–1024 fat cores in one chassis (4 nodes x 2 sockets x 80 cores = 640 with Ampere Altra, or 4 x 2 x 128 = 1024 with Ampere Altra Max)!
Performance per Watt. Whatever is packaged in the server, there will [usually] be two power supply units in it. Within the server, electrical power is mainly consumed by the CPU chips and the DIMM memory chips. Ampere Altra is power efficient in comparison to our Intel Xeons.
The chart is quite similar to the Performance per Server chart, but there are differences, mainly between the Intel chips crunching the light workload.
Single Thread Performance. This metric describes the amount of work done by a single thread. In our case, it is the number of requests completed by a single thread per second. Though it is RISC vs. CISC, the test showed almost identical performance for both.
Multi Thread Performance. This metric describes the amount of work done by multiple [and many] threads. We measured the number of requests per second per thread for 2, 4, 8, …, 160 threads. The chart shows similar performance for Ampere Altra and Intel Xeon CPUs. It is visible how performance drops with an increasing number of threads, and that appeared to be [shared] memory-related. (Preliminary analysis points to issues in the Go run-time when accessing shared read-only data from multiple goroutines.)
To verify the memory-related causes, we repeated the tests without using shared data (i.e. memory shared by the Go run-time). This time the modified workload loaded all 160 cores. This is where things got very interesting: Ampere Altra performed steadily on any number of threads up to a total of 160. The CPU temperature was pretty high, ~65–70°C (the Intel Xeon E5-2620 v4 showed ~37–41°C).
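For illustration only, here is a minimal Go sketch (not our actual workload; buildData, the map type, and the sizes are made up) contrasting the two test setups: many goroutines reading one shared read-only table vs. each goroutine building and reading its own private copy. Building the copy inside the goroutine means the memory is first touched by the thread that reads it, which on Linux typically places it on the local NUMA node.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

const dataSize = 1 << 16

// buildData fills a read-only lookup table used by the workers.
func buildData() map[int]int {
	m := make(map[int]int, dataSize)
	for i := 0; i < dataSize; i++ {
		m[i] = i * i
	}
	return m
}

// run spawns `workers` goroutines doing `iters` lookups each, either against
// one shared table or against a private per-goroutine copy.
func run(workers, iters int, privateCopy bool) time.Duration {
	shared := buildData()
	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			data := shared
			if privateCopy {
				data = buildData() // each goroutine allocates and touches its own copy
			}
			sum := 0
			for i := 0; i < iters; i++ {
				sum += data[i%dataSize]
			}
			_ = sum
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	workers := runtime.NumCPU() // e.g. 160 on the dual-socket Mt. Jade node
	fmt.Println("shared table :", run(workers, 1<<22, false))
	fmt.Println("private copy :", run(workers, 1<<22, true))
}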
Response Time. This metric is the one most felt by the users because it is the system response time; it is usability. The whole request is too complex to be reworked for the Ampere system, so we isolated its heaviest part and put it on Ampere Altra to benchmark against our Xeon systems.
Though we are showing only the performance ratio between Ampere Altra and Intel Xeons for our proprietary workload, it’s visible that the Arm-based server CPUs are almost there. Will we switch to Arm-based CPUs tomorrow? Keep reading, there are a bunch of details to consider.
Evaluation Details
We ran the same test on different hardware systems to compare the performance of Ampere vs. Intel. All tests were done using a high number of threads (90–95% of the total number of available cores) to measure the performance of a loaded system. The measurements were embedded into our code (like telemetry). During the load we monitored the system with the standard utilities: htop, glances, sensors (example of sensors output below), etc.
# sensors
apm_xgene-isa-0000
Adapter: ISA adapter
SoC Temperature: +65.0°C
CPU power: 145.00 W
IO power: 29.83 W
apm_xgene-isa-0000
Adapter: ISA adapter
SoC Temperature: +70.0°C
CPU power: 159.00 W
IO power: 34.96 W
Metric details. The Performance per Server metric measures the number of requests done by all cores on the host per time interval. Basically, this is performance per core multiplied by the number of running threads. This is where the number of cores really stands out.
The Performance per Watt metric is the number of completed requests normalized to electric power consumption. The total consumption of the Ampere Altra system was approximated by adding CPU + IO power. For our Intel systems, sensors returns the same total number as what we read from the BMC, which means that consumption by fans and other modules cannot be excluded; for the Arm system, consumption by storage/network/fans was not accounted for.
Single Thread Performance & Multi Thread Performance. Our real-life software does not run single-threaded, so we modified it a bit for the sake of testing purity. Both metrics are always per second per thread. For the Multi Thread scenario we ran two tests to compare the local vs. remote memory cases.
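As a hedged illustration of how such a per-thread throughput number can be collected in Go (this is not our telemetry code; handleRequest is a synthetic CPU-bound placeholder, the 95% figure mirrors how loaded our tests were, and the time window is arbitrary):
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// handleRequest stands in for one unit of the real workload.
func handleRequest() int {
	s := 0
	for i := 0; i < 100000; i++ {
		s += i % 7
	}
	return s
}

func main() {
	const window = 10 * time.Second
	// Load ~95% of the available cores, as in our tests.
	threads := runtime.NumCPU() * 95 / 100
	if threads < 1 {
		threads = 1
	}

	var completed int64 // telemetry counter, incremented by every worker
	deadline := time.Now().Add(window)

	var wg sync.WaitGroup
	for t := 0; t < threads; t++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				_ = handleRequest()
				atomic.AddInt64(&completed, 1)
			}
		}()
	}
	wg.Wait()

	total := float64(atomic.LoadInt64(&completed))
	perSec := total / window.Seconds()
	fmt.Printf("threads=%d  requests/sec=%.0f  requests/sec/thread=%.2f\n",
		threads, perSec, perSec/float64(threads))
}
The same harness can be run with 2, 4, 8, …, 160 workers to produce a per-thread curve of the kind shown in the Multi Thread chart.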
In the Response Time metric, we measured the total running time spent to calculate a request on three different real-time configurations (called X, Y, Z). They differ by CPU load, probability distribution, etc. The whole response chain of commands is too long and complex to be packed into the one-week test. No networking was used in the tests.
Anomaly. An interesting anomaly happened during the first test: we could not load all 160 cores with our workload/test. The load was ~50% whether we configured 80, 100, 150, or 160 threads (see the htop screenshot below). This led us to some useful analysis. Most probably it is our issue, as OSRM was not involved during the benchmarking. A nice curious case to dig into.
Golang details. We discovered a potential performance issue in one of our modules written in Go. The issue was not noticeable on Intel chips with fewer than 40 hardware cores. We performed dedicated tests for that module and discovered a linear dependency between the response time and the number of running goroutines. That was an unexpected revelation.
We have several hypotheses about what is going on there and will share our findings once we finish the investigation, and if it is Go-related. It could be the infamous Go maps (memory issues, concurrency issues). It could be Go IPC. Anyway, thanks to the 160-core Ampere Altra system for helping this pop up.
This was discussed with an Ampere Computing performance professional. Most likely it is our issue or a Go language/runtime issue. This anomaly was not observed on the Xeon Phi 72xx with 250+ hardware threads; perhaps because it was single-socket? Probably this is NUMA-related: is the Go scheduler NUMA-aware yet? Well, we have to do low-level profiling and probably a redesign. It would be interesting to benchmark both implementations on 128-core Ampere Altra Max chips to compare.
OSRM details. To check if the issue is in our modules, we ran an isolated OSRM test: OSM to OSRM data conversion. We ran it using different numbers of threads (as they are called in the OSRM config). We noticed an anomaly between the configured number of threads and the productivity of the cores. The isolated OSRM test shows there might be a limit on the number of cores working at max frequency (overheating?). The magic number is about 32 cores, from where we start seeing a drop in OSRM performance despite more cores being involved. 128 running threads slow down the entire process 10x compared to 16 running threads.
The chart shows the dependency between the time and the number of threads used to convert the OSM dump to OSRM using osrm-contract (Contraction Hierarchies). It is visible how the total processing time increases with an increasing number of configured threads. It does not look like a synchronization issue, as htop shows a corresponding number of loaded cores during this job.
NUMA details. We discussed the OSRM finding with performance and solution professionals at Ampere Computing. Ampere Altra does not throttle. The slowdown is most likely related to 80+ OSRM threads being mapped onto Ampere Altra hardware threads: as the processor has only 80 cores/threads, all the other threads are put onto the other processor. Accessing the other socket causes latency. Non-uniform memory access from different sockets causes noticeable latency (this is also relevant for Intel processors, though the latency is a bit smaller).
Core counts are increasing, and the difference between 'near memory' and 'far memory' grows as we move to faster DRAM technologies like DDR5. Being NUMA-aware used to be a nice-to-have; it is becoming a necessity on modern platforms.
There is no #include <numaif.h> or "numaif.h" in the OSRM codebase, which means OSRM does not use libnuma. Most probably, the OSRM issue is a software issue. It looks like not much has changed with NUMA in Go over the last several years, so we did not do anything explicitly to use NUMA. There is already a modern term for this: NUMA-aware programming.
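Go offers no NUMA controls out of the box, but a process can at least discover the node layout from Linux sysfs and use it to decide how to partition work (or to launch one instance per node, pinned externally with numactl). A minimal sketch, assuming the standard Linux sysfs paths; on the Mt. Jade node its output matches the NUMA node CPU lists in the lscpu dump further below (node0: 0-79, node1: 80-159):
package main

import (
	"fmt"
	"io/ioutil"
	"path/filepath"
	"strings"
)

func main() {
	// Each NUMA node exposes the CPUs it owns via sysfs.
	nodes, err := filepath.Glob("/sys/devices/system/node/node*/cpulist")
	if err != nil || len(nodes) == 0 {
		fmt.Println("no NUMA topology found")
		return
	}
	for _, f := range nodes {
		data, err := ioutil.ReadFile(f)
		if err != nil {
			continue
		}
		node := filepath.Base(filepath.Dir(f))
		fmt.Printf("%s: CPUs %s\n", node, strings.TrimSpace(string(data)))
	}
}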
During the second test, we tweaked our software not to use remote memory and loaded all Ampere Altra cores. It looked so beautiful in htop.
Issues. We experienced system failures. They were reported with low-level details to Ampere Computing. The same workload was tried without Docker, but the issues kept recurring. Probably this is the most useful information from us to them. Unfortunately, we could not get real-time readings about potential overheating or hardware errors because we did not have access to the BMC (because of the remote access policy).
This was discussed with the Ampere Computing folks. The system instability was related to outdated firmware. We preliminarily agreed to repeat the evaluation on an updated system in a month from now or later. It would be interesting to try the other processor, the 128-core Ampere Altra Max; a dual-socket node would be 256 juicy cores. We will see, stay tuned.
Software Details
Since the dawn of time, we have worked with Linux/Unix systems. Linux is used for everything on our servers and workstations, even as a desktop; FreeBSD Unix is used for special purposes. So we requested the nodes with Ubuntu. Since we automated scalability, we run on Docker. This evaluation was done on Ubuntu 20.04 LTS (GNU/Linux 5.4.0-40-generic aarch64) and Docker 20.10.9.
The workload is mainly our proprietary Go binaries. Go was intended as a language for writing server programs that would be easy to maintain over time. We program in Go because we need parallel programming, high scalability, low-level capabilities, high performance. Our Go codebase compiles for Intel/AMD, Arm, Power 64-bit processors. So far we are on go1.15.5.
Some Go libraries depend on C libraries, so special configuration is needed for cross-compilation. Here is what we did to build the Arm bits:
# destination OS
export GOOS=linux
# destination architecture
export GOARCH=arm64
# configure compilers
export CGO_ENABLED=1
export CC=aarch64-linux-gnu-gcc
export CC_FOR_TARGET=gcc-aarch64-linux-gnu
# then just build
go build -o /output/path
There are some 3rd-party modules involved; one of them is OSRM. We used OSRM-backend version 5.26 running on Docker.
Hardware Details
Server. Mt. Jade is a dual-socket rack server manufactured by Wiwynn. As of today, it is the platform with the highest core density in the industry. We are sure even denser packaging is possible because we did not observe any cooling issues under load.
Processor. It has a max clock speed of 3.0GHz. Is there an L3 cache? Ampere Altra does have a third level of cache, called the System Level Cache (SLC); it is 32MB on Ampere Altra. Some OSes have trouble reporting it (e.g. Ubuntu, see below).
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 1
Core(s) per socket: 80
Socket(s): 2
NUMA node(s): 2
Vendor ID: ARM
Model: 1
Model name: Neoverse-N1
...
CPU max MHz: 3000.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
L1d cache: 10 MiB
L1i cache: 10 MiB
L2 cache: 160 MiB
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s): 80-159
...
Another view of the CPU. This is the detailed information dump for CPU 1 only; the dump for CPU 2 is identical except for the socket designation number, IDs, handles, and tags. The L3 Cache Handle is shown with non-zero values for both CPUs.
$ dmidecode -t 4
Processor Information
Socket Designation: CPU 1
Type: Central Processor
Family: ARMv8
Manufacturer: Ampere(TM)
...
Version: Ampere(TM) Altra(TM) Processor
Voltage: 0.9 V
External Clock: 1600 MHz
Max Speed: 2800 MHz
Current Speed: 2800 MHz
Status: Populated, Enabled
...
L1 Cache Handle: ...
L2 Cache Handle: ...
L3 Cache Handle: ...
Serial Number:
...
Core Count: 80
Core Enabled: 80
Thread Count: 80
Characteristics:
64-bit capable
The dmidecode output calling out the max frequency as 2.8GHz is a static number; it is not an instantaneous reading. It is possible our CPUs were running at a max of 2.8GHz because our machines had slightly outdated firmware. If that is the case, Ampere Altra's performance could be even better when running at a consistent 3.0GHz.
The CPU is indeed running at a maximum of 3.0GHz, but not all the time; it is visible how the numbers fluctuate when dumping the CPU frequency several times in a row.
$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
3000000
2250000
3000000
2010000
3000000
Memory. The system has 16 x 16GB DDR4 DIMM modules manufactured by Samsung. This is a pretty standard part of modern servers.
$ dmidecode -t memory
...
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 4 TB
Error Information Handle: Not Provided
Number Of Devices: 16
...
Memory Device
...
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: Bank 1
Type: DDR4
Type Detail: Registered (Buffered)
Speed: 3200 MT/s
Manufacturer: Samsung
...
Rank: 2
Configured Memory Speed: 3200 MT/s
Minimum Voltage: 1.14 V
Maximum Voltage: 1.26 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
...
Power. Redundant 2000W power supply units. We did not have access to the IPMI console to get power consumption readings directly from the hardware.
Other Details
We have been interested in Arm-based server CPUs for HPC from the very beginning. We evaluated Cavium ThunderX back in the day (see the post Dev Stack, section Infrastructure). We could not evaluate the better ThunderX2 because of firmware issues in the system. The CEO of Packet talked to each customer back then, what a man! Then they went through the acquisition by Equinix and it all got suspended and dried out…
Conclusion
Power-efficient Arm-based CPUs are not knocking on the data center doors; they are already there, inside the data centers. Not long ago Arm-based servers were used mainly for storage and networking; now they are applicable to HPC. The Intel Xeons we are using cannot hit their declared turbo frequencies when more cores are involved, while the current Ampere Altra should run at 3.0 GHz on all cores.
You can consider Arm-based servers with Ampere Altra or Ampere Altra Max even for real-time critical jobs. The key offering is many cores with the best performance per watt. Design a redundancy solution because the nodes are going to be restarted periodically.