Evaluation of AmpereⓇ AltraⓇ Processor

Curious traveler armed with armet

Curiosio is a computational intelligence: a unique combination of computational knowledge engine, search engine, and answer engine. To serve at global scale, Curiosio must be fully ARM’ed.

Curiosio runs on top of Ingeenee, our AI engine that finds a unique solution in an infinitely large solution space. Ingeenee is Evolutionary AI. Evolutionary AI lives on High-Performance Computing, aka HPC. HPC lives on electrical power. Electricity is expensive. Arm server CPUs are power efficient. So here we are, evaluating a brand-new Arm-based server processor: Ampere Altra by Ampere Computing.

Ampere Altra Processor

Ampere Altra is an Arm-based server CPU. It is a many-core processor with 80 customized Arm Neoverse N1 cores. Ampere Altra is the first Arm-based server CPU with pedigree.


Evaluation Programme

We divided our AI workload into four classes: easy, medium, heavy, and very heavy. The workload involves our own code and a few third-party modules. Deeper details on each class are out of scope for this report because they are intellectual property.

Comparison chart for Server performance
Comparison chart for Watt performance
Comparison chart for Single Thread performance
Comparison chart for Multi Thread performance
Comparison chart for Multi Thread performance
Comparison chart for Response Time

Evaluation Details

We ran the same tests on different hardware systems to compare the performance of Ampere vs. Intel. All tests used a high number of threads (90–95% of the total number of available cores) to measure the performance of a loaded system. The measurements were embedded into our code (like telemetry). During the load we monitored the system with standard utilities: htop, glances, sensors (sample sensors output below), etc.
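
To make the setup concrete, here is a minimal Go sketch of sizing a worker pool to a share of the available cores and timing the run from inside the program. It is not the actual Curiosio telemetry: workerCount, timeLoad, and burn are hypothetical names, and burn is a dummy CPU-bound task standing in for the real workload.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// workerCount returns the number of workers for a given percentage of
// logical CPUs, never less than one. (Hypothetical helper.)
func workerCount(cpus, percent int) int {
	n := cpus * percent / 100
	if n < 1 {
		n = 1
	}
	return n
}

// timeLoad runs task on the given number of goroutines, waits for all of
// them, and returns the elapsed wall-clock time.
func timeLoad(workers int, task func(id int)) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			task(id)
		}(i)
	}
	wg.Wait()
	return time.Since(start)
}

// burn is a dummy CPU-bound task, an assumption made for illustration only.
func burn(id int) {
	sum := 0
	for i := 0; i < 5_000_000; i++ {
		sum += i ^ id
	}
	_ = sum
}

func main() {
	cpus := runtime.NumCPU()
	workers := workerCount(cpus, 95) // 95% of cores, the upper end of the range
	elapsed := timeLoad(workers, burn)
	fmt.Printf("cpus=%d workers=%d elapsed=%s\n", cpus, workers, elapsed)
}
```

On a 160-core system this sketch would start 152 workers at 95% and 144 at 90%, leaving the remaining cores for the OS and the monitoring tools.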

# sensors
apm_xgene-isa-0000
Adapter: ISA adapter
SoC Temperature: +65.0°C
CPU power: 145.00 W
IO power: 29.83 W

apm_xgene-isa-0000
Adapter: ISA adapter
SoC Temperature: +70.0°C
CPU power: 159.00 W
IO power: 34.96 W
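
For performance-per-watt comparisons, power readings like the ones above have to be turned into numbers. The following Go sketch sums the power lines of one sensors reading. The function name totalPowerWatts is hypothetical, and the parsing assumes the exact "label: value W" layout shown above, which may differ on other sensor drivers.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// totalPowerWatts sums every line of the form "<label> power: <value> W"
// in one block of sensors output. Other lines are ignored.
func totalPowerWatts(reading string) (float64, error) {
	total := 0.0
	sc := bufio.NewScanner(strings.NewReader(reading))
	for sc.Scan() {
		label, value, found := strings.Cut(sc.Text(), ":")
		if !found || !strings.HasSuffix(strings.TrimSpace(label), "power") {
			continue
		}
		fields := strings.Fields(value)
		if len(fields) != 2 || fields[1] != "W" {
			continue
		}
		watts, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("bad power value %q: %w", fields[0], err)
		}
		total += watts
	}
	return total, sc.Err()
}

func main() {
	// The two readings from the sensors output above.
	readings := []string{
		"SoC Temperature: +65.0°C\nCPU power: 145.00 W\nIO power: 29.83 W",
		"SoC Temperature: +70.0°C\nCPU power: 159.00 W\nIO power: 34.96 W",
	}
	for i, r := range readings {
		total, err := totalPowerWatts(r)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("reading %d: %.2f W\n", i+1, total)
	}
}
```

For the two readings above the sums are 174.83 W and 193.96 W (CPU plus IO power). Dividing a throughput figure by such a sum gives one possible performance-per-watt number; which power components to include is a judgment call.
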
Underloaded 160 cores (~50% load)
Go module anomaly chart, adding goroutines kills performance
OSRM anomaly chart, adding threads kills throughput

Core counts keep increasing, and the difference between ‘near memory’ and ‘far memory’ is growing as we move to faster DRAM technologies like DDR5. Being NUMA-aware used to be nice to have. It’s becoming a necessity on modern platforms.

There is no #include <numaif.h> or “numaif.h” in the OSRM codebase, which means OSRM doesn’t use libnuma. Most probably, the OSRM anomaly is a software issue. It looks like not much has changed with NUMA in Go over the last several years, so we didn’t do anything explicit to use NUMA. There is already a modern term for this: NUMA-aware programming.
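
A first step toward NUMA awareness is simply knowing which CPUs belong to which node. The Go sketch below reads the Linux sysfs files /sys/devices/system/node/nodeN/cpulist and expands ranges such as 0-79 into CPU numbers. It is an illustration only: parseCPUList is a hypothetical helper, and the program does not pin threads or memory by itself.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCPUList expands a Linux cpulist string such as "0-3,8-11" into
// individual CPU numbers.
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return nil, err
			}
		}
		if last < first {
			return nil, fmt.Errorf("invalid range %q", part)
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	// Walk node0, node1, ... until a node directory is missing.
	for node := 0; ; node++ {
		path := fmt.Sprintf("/sys/devices/system/node/node%d/cpulist", node)
		data, err := os.ReadFile(path)
		if err != nil {
			break // no more nodes, or not a Linux system
		}
		cpus, err := parseCPUList(string(data))
		if err != nil {
			fmt.Fprintln(os.Stderr, "parse error:", err)
			return
		}
		fmt.Printf("node%d: %d CPUs (%s)\n", node, len(cpus), strings.TrimSpace(string(data)))
	}
}
```

With the node layout known, one option that needs no code changes is to start one process per node under numactl, for example with --cpunodebind=0 --membind=0 for the first node. Whether that removes the anomalies we saw is an open question that we did not test.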

Ampere Altra fully loaded in htop
Ampere Altra fully loaded in glances

Software Details

We have worked with Linux/Unix systems since the dawn of time. Linux runs everything on our servers and workstations, and even serves as our desktop; FreeBSD is used for special purposes. So we requested nodes with Ubuntu. Since we automated scalability, we run on Docker. This evaluation was done on Ubuntu 20.04 LTS (GNU/Linux 5.4.0-40-generic aarch64) and Docker 20.10.9.

# destination OS
export GOOS=linux
# destination architecture
export GOARCH=arm64

# configure compilers
export CGO_ENABLED=1
export CC=aarch64-linux-gnu-gcc
export CC_FOR_TARGET=aarch64-linux-gnu-gcc

# then just build
go build -o /output/path

Hardware Details

Server. Mt. Jade is a dual-socket rack server manufactured by Wiwynn. As of this writing, it is the platform with the highest core density in the industry. We are confident that even denser packaging is possible, because we did not observe any cooling issues under load.

Mt. Jade dual-socket Ampere/Arm server
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 1
Core(s) per socket: 80
Socket(s): 2
NUMA node(s): 2
Vendor ID: ARM
Model: 1
Model name: Neoverse-N1
...
CPU max MHz: 3000.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
L1d cache: 10 MiB
L1i cache: 10 MiB
L2 cache: 160 MiB
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s): 80-159
...
$ dmidecode -t 4
Processor Information
Socket Designation: CPU 1
Type: Central Processor
Family: ARMv8
Manufacturer: Ampere(TM)
...
Version: Ampere(TM) Altra(TM) Processor
Voltage: 0.9 V
External Clock: 1600 MHz
Max Speed: 2800 MHz
Current Speed: 2800 MHz
Status: Populated, Enabled
...
L1 Cache Handle: ...
L2 Cache Handle: ...
L3 Cache Handle: ...
Serial Number:
...
Core Count: 80
Core Enabled: 80
Thread Count: 80
Characteristics:
64-bit capable
$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
3000000
2250000
3000000
2010000
3000000
$ dmidecode -t memory
...
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 4 TB
Error Information Handle: Not Provided
Number Of Devices: 16
...
Memory Device
...
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: Bank 1
Type: DDR4
Type Detail: Registered (Buffered)
Speed: 3200 MT/s
Manufacturer: Samsung
...
Rank: 2
Configured Memory Speed: 3200 MT/s
Minimum Voltage: 1.14 V
Maximum Voltage: 1.26 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
...
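
The cpuinfo_cur_freq samples above are in kHz and cover cpu0 only. To see how the whole machine clocks under load, a small Go sketch like the following could read the same sysfs file for every core and report the spread. parseMHz and minMax are hypothetical helpers. Note that cpuinfo_cur_freq is typically readable by root only, so as an unprivileged user the program may find no data.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseMHz converts the contents of a cpufreq file (a value in kHz)
// to whole MHz.
func parseMHz(raw string) (int, error) {
	khz, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return 0, err
	}
	return khz / 1000, nil
}

// minMax returns the smallest and largest value of a non-empty slice.
func minMax(values []int) (lo, hi int) {
	lo, hi = values[0], values[0]
	for _, v := range values[1:] {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	return lo, hi
}

func main() {
	pattern := "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_cur_freq"
	files, err := filepath.Glob(pattern)
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad pattern:", err)
		return
	}
	var mhz []int
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			continue // unreadable without sufficient privileges
		}
		if v, err := parseMHz(string(data)); err == nil {
			mhz = append(mhz, v)
		}
	}
	if len(mhz) == 0 {
		fmt.Println("no cpufreq data readable")
		return
	}
	lo, hi := minMax(mhz)
	fmt.Printf("cores=%d min=%d MHz max=%d MHz\n", len(mhz), lo, hi)
}
```

Applied to the five cpu0 samples listed above, the helpers give a minimum of 2010 MHz and a maximum of 3000 MHz.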

Other Details

We have been interested in Arm-based server CPUs for HPC from the very beginning. We evaluated Cavium ThunderX back in the day (see the post Dev Stack, section Infrastructure). We could not evaluate the better ThunderX2 because of firmware issues in the system. The CEO of Packet talked to each customer back then, what a man! Then Packet was in the process of being acquired by Equinix, and it all got suspended and dried out…

Conclusion

Power-efficient Arm-based CPUs are not knocking on the data center doors. They are already inside. Not long ago, Arm-based servers were used for storage and networking; now they are applicable to HPC. The Intel Xeons we use cannot hit their declared turbo frequencies when more cores are involved. The current Ampere Altra should run at 3.0 GHz on all cores.
