WOA Issue 65
Works on Arm News is produced most Fridays by the…
“Better, faster, cheaper: pick any two”
Qualcomm announced its Centriq 2400 processor line this week at an event at the Glass House in San Jose. Much of this newsletter is news coming out of this event!
The Qualcomm Centriq 2400 processor family is a single chip platform-level solution built using Samsung’s 10 nanometer FinFET process with 18 billion transistors on only 398mm2. It contains up to 48 high-performance, 64-bit, single-thread cores, running at up to 2.6 GHz frequency. The cores are connected with a bi-directional segmented ring bus with 250GBps of aggregate bandwidth to avoid performance bottlenecks under full load. To maximize performance under various use cases, the design has 512KB of shared L2 cache for every two cores, and 60MB of unified L3 cache distributed on the die. It has 6 channels of DDR4 memory and can support up to 768 GB of total DRAM capacity with 32 PCIe Gen3 lanes and 6 PCIe controllers. The Qualcomm Centriq 2400 processor family also supports Arm’s TrustZone secure operating environment, and supports hypervisors for virtualization. The Qualcomm Centriq 2400 is able to achieve exceptional performance, while consuming less than 120 watts.
Pricing for the processor was announced. A “Centriq 2460” SKU with 48 cores lists at $1995; “Centriq 2452” with 46 cores lists at $1373; and the “Centriq 2434” with 40 cores has a $888 price tag.
On @Cloudflare’s workloads we’re seeing equivalent performance to Intel at half the power. (Matthew Prince, Cloudflare CEO)
Vlad Krasnov from Cloudflare published the first public benchmarks of the Qualcomm Centriq 2400 server. He reports based on real workloads and real code that the Centriq delivers comparable performance to Intel servers that cost more and consume more power.
With the NGINX workload Falkor handled almost the same amount of requests as the Skylake server, and both significantly outperform Broadwell. The power readings, taken from the BMC show that it did so while consuming less than half the power of other processors. That means Falkor managed to get 214 requests/watt vs the Skylake’s 99 requests/watt and Broadwell’s 77.
The picture is not all rosy when it comes to language support for arm64, however. Krasnov’s sharpest disappointment is with the Go language runtimes, which are often not optimized for arm64 processors:
Go support for aarch64 is quite disappointing. I am very happy to say that everything compiles and works flawlessly, but on the performance side, things should get better. Is seems like the enablement effort so far was concentrated on the compiler back end, and the library was left largely untouched. There are a lot of low hanging optimization fruits out there, like my 20 minute fix for addMulVVW clearly shows. Qualcomm and other ARMv8 vendors intend to put significant engineering resources to amend this situation, but really any one can contribute to Go. So if you want to leave your mark, now is the time.
The Wall Street Journal reported on a merger proposal by chipmaker Broadcom to take over Qualcomm, in a deal that would make Broadcom the world’s third largest chipmaker behind Intel and Samsung.
Broadcom made an offer of $70 a share in cash and stock for Qualcomm, the world’s largest maker of mobile phone chips. That’s a 28 percent premium over the stock’s closing price on Nov. 2, before Bloomberg first reported talks of a deal. The proposed transaction is valued at approximately $130 billion on a pro forma basis, including $25 billion of net debt.
The Next Platform has a terrific breakdown of what this would mean to the chip-making world, and specifically to the IP around server-class arm64 chipmaking. The below quote is very interesting though it depends broadly on a lot of events happening in a specific order:
Here is the irony of it all. While it is not probable, there is an outside chance that Marvell, which has done a fair amount of innovation in the Arm chip area but very little with respect to servers, ends up with all of the relevant pieces. If Broadcom prevails in its acquisition of Qualcomm and jettisons the Centriq processor line and if Marvell buys Cavium, which has its own ThunderX as well as the former Broadcom Vulcan lines, then Marvell will have under its control three of the five important Arm server chip variants — the other two being Applied Micro’s X-Gene and Fujitsu’s Post-K HPC chip.
Marvell Technology Group Ltd is in advanced talks to buy Cavium Inc – a deal that would create a chipmaker worth about $14 billion, Wall Street Journal reported.
Drew Henry, the new senior vice president and general manager of the Infrastructure Business Unit of Arm, talks about computing in the data center and at the edge in this interview with Jeffrey Burt at The Next Platform.
Computing is becoming distributed now, compute cycles are moving off the CPU and onto the GPU, the TPU, and other types of processing nodes. The physical computing is being done elsewhere. Now as you try to get as close to where the decisions need to be made as possible. This is what’s happening at the edge. And so our center of gravity as a company is that edge.
Cavium announced that they are contributing a design of their new ThunderX2 system to be used in Microsoft’s “Project Olympus” to Open Compute Project specifications.
Kushagra Vaid, GM, Azure Hardware Infrastructure, Microsoft Corp. said, “We designed Microsoft’s Project Olympus with the ability to accommodate a variety of workloads and processor architectures. We’ve been closely collaborating with Cavium to integrate ThunderX2 into Microsoft’s Project Olympus design, and to drive innovation within the ARM ecosystem especially for workloads that benefit from high-throughput computing. The completion and contribution of our Project Olympus specification shows our continued commitment to the Open Compute Project and community developed innovation.”
Eagle-eyed watchers of C and C++ code repositories have noticed details of Qualcomm’s next generation chips. Industry analyst David Schor writes:
Looks like Qualcomm already started patching LLVM for the Saphira microarchitecture as of a month ago. Saphira will support ARMv8.3 (from v8.0 in Falkor). We’ll start tracking changes, see what they’re up to!
The University of Michigan announced a ThunderX based Hadoop cluster for use by campus researchers. It encompasses 4800 ThunderX cores, 3 petabytes of storage, with 40/100G XPliant Ethernet running Hadoop software from Hortonworks. HDP v2.6.2 includes Hive, Kafka, Spark and Sqoop.
Syed Ali, Cavium’s founder and CEO and a U-M alumnus, added, “I know from experience that U-M researchers are capable of amazing discoveries. Cavium is honored to help break new ground in big data research at one of the top universities in the world.”
At the Qualcomm announcement, ScyllaDB announced an arm64 release of their Apache Cassandra compatible NoSQL database.
ScyllaDB recently completed a benchmark study of its database running on three Qualcomm Centriq 2400-based servers, with a varying number of cores on each machine. The study shows that performance exceeded 1,000,000 IOPS, and scaled linearly as the number of cores per machine was increased from 10 to 40.
At the Qualcomm announcement, MariaDB announced an arm64 release of their MySQL compatible SQL database, targeting the Centriq 2400 architecture.
MariaDB’s architectural support for thread per connection helps MariaDB scale extremely well on the Qualcomm Centric 2400 server processor providing near consistent increase in throughput through the core count.
Peter Robinson of the Fedora project discusses the state of single-board computer (SBC) arm64 support in Fedora 27. Systems supported are the Raspberry Pi 3, the 96boards Dragonboard 410c and HiKey, and a handful of AllWinner devices with a focus on the Pine64 series of boards.
The key functionality we were aiming for was a well polished uEFI implementation in u-boot to enable a single install/boot path in Fedora on aarch64 using uEFI/shim/grub2 to boot Fedora on both SBCs and SBSA compliant aarch64 platforms. We now have that platform in place, primarily due to Herculean efforts of Rob Clark and Peter Jones, as well as many others who have provided insight into the deep dark details of the uEFI specification. Fedora 27 will ship with a quite heavily patched, well by Fedora’s standards anyway, u-boot 2017.09 which provides us the core of this functionality enabling us to use a vanilla upstream shim and grub2 to boot a standard Fedora. All this work is already upstream, or making it’s way there in 2017.11. In Fedora 28 there will be even more improvements that will enable us to do a bunch of other cool stuff (that’ll be a post for later!) and also enable much quicker upstream board enablement now all the core bits are in place.
Apache HBase and linkerd fail to build on arm64 due to a dependency on “protoc”, the protocol buffers compiler from Google. The software is available and works in source form, but both projects depend on the centralized Maven repository for getting binary artifacts, and the requisite versions are not there.
protoc binaries are anticipated to be built for the protobuf 3.5.0 release.
A team from Docker, Packet, and Arm resolved an issue with slow performance of Linuxkit. On some hardware, the kernel does not expose a hardware random number generator with sufficient entropy to satisfy Linuxkit’s need for randomness. The solution was a change from using /dev/random to instead using /dev/urandom as a source of entropy.
Niclas Mietz has a merge request for Gitlab to support arm64 in the Gitlab runner, a component of its continuous integration (CI) system.
This work was supported in part by the Works on Arm project that runs at Packet.
Supercomputer 17 (SC17) is in Denver, Colorado on November 12-17, 2017. We’ve been told by several sources that there will be Arm news coming from the event. Use the hashtag #SC17 to follow along.