Musl Malloc is Slow
So I have built swoole-cli and figured out the HPET issue. Let's see how it performs.
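The benchmark target is a minimal hello-world HTTP server. The actual http-hello.php isn't shown here, but it is roughly the canonical Swoole example, something like:

<?php
// A sketch of what http-hello.php looks like: the canonical Swoole hello-world server.
$http = new Swoole\Http\Server('127.0.0.1', 9501);

$http->on('request', function ($request, $response) {
    $response->header('Content-Type', 'text/plain');
    $response->end('Hello World');
});

$http->start();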
Start with 1 thread and 10 clients:
$ wrk -c 10 -t 1 -d 10s -L http://127.0.0.1:9501
Running 10s test @ http://127.0.0.1:9501
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.93ms    1.79ms  50.95ms   93.16%
    Req/Sec   437.09     10.57   450.00     86.00%
  Latency Distribution
     50%   22.74ms
     75%   22.94ms
     90%   23.28ms
     99%   29.29ms
  4357 requests in 10.01s, 702.06KB read
Requests/sec:    435.13
Transfer/sec:     70.11KB
OMG, QPS is less than 500. On the same machine, Go can do over 10k QPS.
Flamegraph to help
Flamegraphs are my go-to tool for troubleshooting performance issues.
$ perf record --call-graph dwarf swoole-cli ./http-hello.php
$ # start wrk in another terminal, after benchmark, Ctrl-C to stop perf
$ perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg
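The inferno-* commands are a Rust port of Brendan Gregg's FlameGraph scripts; if you don't have them yet, they install straight from crates.io:

$ cargo install inferno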
And here's the flamegraph (open it in a new tab to see details). Frankly speaking, I don't see any problem in it.
What about off-CPU time?
$ /usr/share/bcc/tools/offcputime -df -p $(pidof swoole-cli) > wait.stack
$ < wait.stack inferno-flamegraph -c blue > offcputime.svg
Hmm, we are waiting in __munmap and enframe (part of musl's mallocng allocator), and there are lots of page faults. Maybe a memory issue?
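To put a number on that hunch, perf stat can count fault events directly while wrk is running; these are the standard software events, nothing swoole-specific:

$ perf stat -e page-faults,minor-faults,major-faults -p $(pidof swoole-cli) -- sleep 10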
Mimalloc to help
We all know musl's malloc is slow, and swoole-cli has support for mimalloc.
Let's see if my swoole-cli is using mimalloc:
$ MIMALLOC_VERBOSE=1 swoole-cli
(no output)
Ah, mimalloc is not linked. From the build script, I found it didn't add -lmimalloc. :(
After adding that flag back and recompiling, the binary now has mimalloc linked.
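A side note on static linking: the allocator generally has to come before libc on the link line, otherwise musl's own malloc/free still win symbol resolution. To double-check that the override actually landed in the binary (assuming it isn't stripped), something like this works:

$ nm swoole-cli | grep mi_version

MIMALLOC_VERBOSE now confirms it, too: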
$ MIMALLOC_VERBOSE=1 swoole-cli
mimalloc: process init: 0x7f708cf35020
mimalloc: secure level: 0
mimalloc: using 1 numa regions
mimalloc: option 'show_errors': 0
mimalloc: option 'show_stats': 0
mimalloc: option 'eager_commit': 1
mimalloc: option 'deprecated_eager_region_commit': 0
mimalloc: option 'deprecated_reset_decommits': 0
mimalloc: option 'large_os_pages': 0
mimalloc: option 'reserve_huge_os_pages': 0
mimalloc: option 'reserve_huge_os_pages_at': -1
mimalloc: option 'reserve_os_memory': 0
mimalloc: option 'deprecated_segment_cache': 0
mimalloc: option 'page_reset': 0
mimalloc: option 'abandoned_page_decommit': 0
mimalloc: option 'deprecated_segment_reset': 0
mimalloc: option 'eager_commit_delay': 1
mimalloc: option 'decommit_delay': 25
mimalloc: option 'use_numa_nodes': 0
mimalloc: option 'limit_os_alloc': 0
mimalloc: option 'os_tag': 100
mimalloc: option 'max_errors': 16
mimalloc: option 'max_warnings': 16
mimalloc: option 'max_segment_reclaim': 8
mimalloc: option 'allow_decommit': 1
mimalloc: option 'segment_decommit_delay': 500
mimalloc: option 'decommit_extend_delay': 2
And now it performs much better:
Running 10s test @ http://127.0.0.1:9501
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   601.61us  186.81us   4.51ms   87.57%
    Req/Sec    16.52k   378.21    16.91k    89.00%
  Latency Distribution
     50%  531.00us
     75%  550.00us
     90%    1.05ms
     99%    1.12ms
  164353 requests in 10.00s, 25.86MB read
Requests/sec:  16432.60
Transfer/sec:      2.59MB
Again, from the on-CPU flamegraph I can't see any problem. But the off-CPU flamegraph shows it now spends most of its time in network calls.
What about newer machines?
So far I did all of the above on a Pentium T4500 machine, which is quite old.
I also copied the binary to an AMD Zen 3 machine to see how it performs.
With musl malloc (up from ~400 QPS to ~14,000):
$ wrk -c 250 -t 2 -d 60s -L http://127.0.0.1:9501
Running 1m test @ http://127.0.0.1:9501
  2 threads and 250 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    17.84ms  719.01us  41.92ms   63.52%
    Req/Sec     7.04k   261.63     7.47k    56.83%
  Latency Distribution
     50%   17.54ms
     75%   18.69ms
     90%   18.87ms
     99%   19.02ms
  840619 requests in 1.00m, 132.28MB read
Requests/sec:  14009.94
Transfer/sec:      2.20MB
With mimalloc (up from ~16,000 QPS to ~170,000):
Running 1m test @ http://127.0.0.1:9501
  2 threads and 250 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.47ms   55.89us   4.78ms   97.60%
    Req/Sec    85.31k     1.93k   88.26k    92.25%
  Latency Distribution
     50%    1.47ms
     75%    1.48ms
     90%    1.50ms
     99%    1.54ms
  10189141 requests in 1.00m, 1.57GB read
Requests/sec: 169815.19
Transfer/sec:     26.72MB
Again, over a 10x improvement!
And the CPU improvement over these 10 years is HUGE.