Scaling Globally

Curiosio
7 min read · Oct 11, 2018

by Vas Mylko

Most curious trips will be found with Curiosio. Curiosio runs on Ingeenee. Ingeenee is intelligent. Intelligence grows on substrate (organic or silicon). Silicon substrate means high-performance computing, aka HPC. HPC is dense servers in racks, in a data center or in the cloud. We have already started planning for scale — for global scale by 2020.

According to our measurements, a data center is 4x cheaper than the cloud. Not much has changed in cloud economics since Craig Venter's attempt to do it in the cloud, which ended with his own data center. The cloud is better when your growth is ultra-fast, like Netflix's. For others, like Dropbox, the cloud is pricey. Over two years Dropbox saved $74.6M in operational expenses, primarily because of its move off the cloud to its own data centers. Below are the options we are considering for 2019+, to build capabilities for global scale.

Supermicro with Xeon Scalable

The cheapest good option is dense computing by Supermicro (eBay uses white-label Supermicro). 4 nodes in a 2U form factor, with 2 chips per node. Each chip has up to 28 cores; a core can be hyper-threaded (roughly an additional 20% gain). Totals: 224 powerful cores in a 2U box, which is ~1,000 cores in 4 servers, or ~5,000 cores in one full rack, which is ~10,000 threads per rack.
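For the curious, here is the back-of-envelope math behind these totals as a tiny Python sketch. The 42U cabinet packed entirely with 2U compute chassis (21 of them) is our simplifying assumption; the per-chip figures are the ones above.

# Rack density back-of-envelope for dense multi-node servers.
# Assumption: a 42U cabinet filled with compute only (no space for switches/PDUs).
def rack_totals(cores_per_chip, chips_per_node, nodes_per_chassis,
                chassis_u, threads_per_core=2, usable_u=42):
    cores_per_chassis = cores_per_chip * chips_per_node * nodes_per_chassis
    chassis_per_rack = usable_u // chassis_u
    cores = cores_per_chassis * chassis_per_rack
    return cores, cores * threads_per_core

# 2U 4-node Supermicro with two 28-core Xeon Scalable chips per node:
print(rack_totals(28, 2, 4, 2))  # (4704, 9408): ~5,000 cores, ~10,000 threads per rack

The same sketch with 32-core EPYC chips, rack_totals(32, 2, 4, 2), returns (5376, 10752), which matches the 5,000+ cores / 10,000+ threads quoted for the EPYC options below.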

Supermicro with EPYC

An AMD EPYC chip has 32 cores, hyper-threaded to 64 threads. Even more performance is possible from the same space. EPYC is x86-compatible, so everything that works on Xeons should work on EPYCs. But reality is different: these EPYC systems are all new, and some time is required to polish them (from issues with motherboards and firmware to efficient cooling). Totals: 5,000+ cores, 10,000+ threads per full rack.

Gigabyte with ThunderX2

ARM has entered the server chip market. Several companies have recently tried to make a good server chip on the ARM architecture. ARM chips have been powering our smartphones for years; we run fat apps on them and performance is good. So why not build a server ARM chip? Calxeda tried and flopped, Applied Micro tried with X-Gene (now Ampere Computing), Cavium tried with ThunderX, Qualcomm tried with Centriq, Broadcom tried with Vulcan.

It turned out that Centriq is weak, suitable only for storage servers. Broadcom ditched Vulcan during its mega-acquisition maneuvers. Cavium built the slow ThunderX (we tested it at Packet, a node with 2 chips and 96 cores). But then good things happened. Cavium purchased Broadcom's Vulcan IP and built ThunderX2, which is challenging Intel Xeons today. Gigabyte offers 1U ARM servers. The system probably still needs stability and firmware improvements (as the older ThunderX did). Totals: ~5,000 cores in a full rack, which is ~20,000 threads per rack. But how strong is a single thread? Is it possible to uniformly load all threads at once? Questionable.

Gigabyte R181-T90 (rev. 100)
Ideal htop on single node with 2x ThunderX2 chips under load

Cisco with EPYC

Cisco UCS C4200 Series Rack Server Chassis with 4x C125 M5 Rack Server Nodes. 2x AMD EPYC chips per node, each chip with up to 32 cores, hyper-threaded to 64 threads. Build quality is higher than Gigabyte's and Supermicro's. Remote monitoring and management are richer and more convenient too. It's good to have Cisco in the game. Not so long ago they were the third-biggest server maker in the world, then lost that position. Compute density is like Supermicro's.

Cisco UCS C4200
Cisco UCS C125 M5

Dell with Xeon Scalable

Dell is our favorite, based on both new and used models. Build quality is the best. Google uses Dells when it needs to buy servers (for its own data centers Google builds its own servers; Facebook uses HP and builds its own servers). We'd love to have the FX series. We'd love to have all Dell. Compute density is like Supermicro's. Cooling is more efficient: smarter, adaptive, configurable. Dells can work at higher ambient temperatures than other makes (including HP/HPE).

Dell PowerEdge FX Chassis
Dell PowerEdge FC640

Xeon Phi

Xeon Phi 7295 is a 72-core x86 many-core microprocessor introduced by Intel in late 2017. It operates at a base frequency of 1.5 GHz with a turbo of 1.6 GHz and a TDP of 320W. It has 288 threads on 72 cores (4 threads per core); htop would show 288 processors per node, all available to the operating system on the node. Intel Xeon Phi 72xx was (and still is) a great chip for HPC. The sheer throughput is amazing. There were models from Dell, Fujitsu, Intel, and Supermicro for the socketed Xeon Phi…
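A quick way to see all of those hardware threads from the OS side; on a single Xeon Phi 7295 node this should report 288. A minimal sketch (the affinity call is Linux-specific):

import os

logical_cpus = os.cpu_count()               # all logical CPUs the OS sees on the node
usable_cpus = len(os.sched_getaffinity(0))  # CPUs this process is allowed to run on (Linux)
print(logical_cpus, usable_cpus)            # 288, 288 on an unrestricted Xeon Phi 7295 node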

We've tested a Supermicro with Xeon Phi, and it was amazing. But at that time a single chip cost more than $6.5K! Memory was crazy expensive too; remember the Samsung situation? 64GB DIMM prices skyrocketed. Though the chip price has gone down since then, things did not go well for the chip. It's a shame Intel ditched it. Unfortunately, Intel has discontinued them all: the accelerators and the socketed chips of the Phi series. It looks like that happened right after filling supercomputers around the world, like the Cray XC supercomputer; check out the compute density.

Amazing compute density in Cray XC

Power9

What to do while Intel struggles, AMD ramps up, and ARM remains uncertain? Think and measure more. We were looking at the Power chip from IBM. It was already interesting with Power8. Then they overclocked it as Power8+. Now they have Power9, and it is interesting again. Google compared alternatives to Intel Xeons and looked at Qualcomm's Centriq and IBM's Power.

Google preferred Power (Microsoft took Centriq for Azure storage servers). Google already deployed Power servers in its own data centers in early 2018, as part of the Zaius/Barreleye hardware experiments together with Rackspace. We contacted Rackspace regarding Barreleye irons some time ago… their sales rep was as numb as a beautiful barreleye fish. Come on, Rackspace, the project wasn't secret; there could be more awareness about these servers.

Zaius/Barreleye G2 by Rackspace

What is cool about Power9? SMT4/SMT8 cores, giving 4x/8x threads per core (better than x86 hyper-threading). There are two variations, a 12-core SMT8 model and a 24-core SMT4 model; either way that is 96 hardware threads per chip. The SMT4 model is optimized for the Linux ecosystem. Another thing is better bandwidth between the chip and memory and peripherals.

There are Power9 benchmarks by Phoronix. It turned out that much software is not optimized for Power9 (because almost nobody uses it, except Google and a few others). Here are other benchmarks, with Power9 optimizations. Totals: with 2 nodes in a 2U server it could be ~8,000 strong threads per rack. Due to [air] cooling requirements, probably only one node fits into a 2U box; if so, then only ~4,000 strong threads per rack.
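Spelling out the thread math behind those totals in a short sketch; the fully packed 42U cabinet (21 2U slots) and 2 chips per node are our assumptions, not IBM figures:

# Power9 thread math per chip, per node, and per rack.
threads_per_chip = 24 * 4                  # 24-core SMT4; the 12-core SMT8 is also 12 * 8 = 96
threads_per_node = threads_per_chip * 2    # 2 chips per node

for nodes_per_2u in (2, 1):                # 2 nodes per 2U vs. 1 node if air cooling limits density
    print(nodes_per_2u, threads_per_node * nodes_per_2u * 21)  # 8064 and 4032 threads per rack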

IBM Power System L922

Conclusion

We made our codebase run on three architectures: x86/amd64, ARMv8, and Power/ppc64. We automated deployment, computing, and workloads over bare metal. According to our estimates, to serve the entire planet and to build for the future, we need a few to several full server rack cabinets. One full cabinet costs between $0.5M and $1M. Very likely Ingeenee will become the name of the supercomputer Curiosio runs on. The plan is to build our travel AI infrastructure & technology for true global scale by the time of Travel Trio 2020.
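For illustration only (our deployment tooling is not published here), a minimal sketch of the kind of architecture dispatch a tri-architecture codebase needs, keyed off the kernel-reported machine name:

import platform

# Map the kernel-reported machine type to the three architectures we target.
ARCH_MAP = {
    "x86_64": "x86/amd64",
    "aarch64": "ARMv8",
    "ppc64le": "Power/ppc64",
    "ppc64": "Power/ppc64",
}

def target_arch():
    machine = platform.machine()
    if machine not in ARCH_MAP:
        raise RuntimeError(f"unsupported architecture: {machine}")
    return ARCH_MAP[machine]

print(target_arch())  # e.g. "ARMv8" on a ThunderX2 node, "Power/ppc64" on a Power9 node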

Right now we're working on beta3, making it 100x faster than beta2 and 10x cheaper/easier on hardware. Very smart geeks are working on it. After beta3 is built (this winter), we will be looking for geeks who love travel and high tech to join forces and scale Curiosio globally. By joining forces we plan to fundraise from rich geeks and hire smart geeks to invent & code more.

Hey, future hires who dream of programming 10,000+ strong cores: stay tuned for our development stack, to be published soon. There is a big probability each position will start with the word "badass". Who knows, could it be Power9+ programming on Barreleye-likes? Have you noticed there are no details on accelerators and quantum so far? Stay tuned, stay curious.

Barreleye fish with transparent head

PS.

The eyes are those big green barrels, oriented upwards, looking through the transparent head. Like two huge heat sinks on top of powerful chips. The two spots above the fish's mouth are olfactory organs called nares, which are analogous to human nostrils. Check out more about this curious fish.
