Optimized abs() for int64 in Go
The Go programming language has no built-in abs function for computing the absolute value of an integer; that is, the non-negative representation of a negative or positive number.
I recently needed an implementation of the abs function to solve the Day 20 challenge in Advent of Code 2017. If you want to learn something new and get a kick out of testing yourself, I strongly recommend you check it out!
Go does include an abs function in the math package: math.Abs. Unfortunately, this doesn’t fit my use case, as it only accepts and returns float64; I need int64. It is possible to use math.Abs via type conversion, but this introduces some overhead for the conversion to float64 and back to int64, as well as introducing truncation for larger numbers.
There is a great discussion on optimizing math.Abs for floating point numbers, but unfortunately these optimizations don’t directly apply to integers, due to their underlying encoding.
There must be other options. So begins my adventure into optimizing a simple function in Go!
You can find the source code and tests for this article on GitHub.
Type conversion vs. branching
The most obvious solution to me for this problem was a very basic function which returns n if n is zero or greater, and returns -n if n is less than zero. The double negative is, of course, always positive. Since it relies on a branching control structure to calculate an absolute value, let’s call this function abs.WithBranch.
package abs

func WithBranch(n int64) int64 {
    if n < 0 {
        return -n
    }
    return n
}
It works! Curiously, this is currently (Go v1.9.x) how math.Abs is implemented for float64 numbers.
Great! Branching works. But does it improve on a call to math.Abs using type conversions? Let’s implement that too.
package abs

import "math"

func WithStdLib(n int64) int64 {
    return int64(math.Abs(float64(n)))
}
Here we are converting n to a float64, passing it to math.Abs and then converting the result from float64 back to int64. Surely this must introduce some performance overhead?
A couple of simple benchmark tests yield clear results:
$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/cavaliercoder/abs
BenchmarkWithBranch-8 2000000000 0.30 ns/op
BenchmarkWithStdLib-8 2000000000 0.79 ns/op
PASS
ok github.com/cavaliercoder/abs 2.320s
At 0.30 ns/op (nanoseconds per operation), WithBranch is more than twice as fast! But our benchmark doesn’t represent the real world…
Benchmarking
We know that branching code breaks the sequential flow of a program, meaning that pipelining processors must predict what will happen next to maintain similar performance to a sequential program.
In the above benchmark, we used a constant value as the input to each call to abs. Because of this, the processor can predict which branch to take on every call, because it’s the same every time.
We can trip up the processor’s ability to predict the flow of execution, and better represent real-world use cases, by introducing a pseudo-random number generator. If the input is random, the output will be random and the processor’s predictions should be wrong more often.
Given there are only two possible branches, I assumed a pipelining processor might compute both outcomes at once, without a performance penalty. Let’s test that by experiment.
$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/cavaliercoder/go-abs
BenchmarkRand-8 1000000000 2.59 ns/op
BenchmarkWithBranch-8 200000000 6.36 ns/op
BenchmarkWithStdLib-8 500000000 3.13 ns/op
PASS
ok github.com/cavaliercoder/go-abs 6.692s
I’ve introduced a new benchmark, BenchmarkRand, to measure the overhead of the RNG. We can subtract the resulting overhead from the other benchmarks for an approximation of their actual runtime.
It turns out, abs.WithBranch doesn’t perform very well with random input. In fact, abs.WithStdLib is nearly seven times faster.
Truncation
Before we throw out the branching approach, it should be noted that it has the advantage that it won’t truncate any large numbers, because it never converts a signed int64 to an IEEE 754 float64, which sacrifices bits of integer precision to the exponent and fraction fields.
By example, abs.WithBranch(-9223372036854775807) correctly returns 9223372036854775807, while abs.WithStdLib(-9223372036854775807) experiences an overflow during the type conversion and incorrectly returns -9223372036854775808. It even returns the same incorrect, negative result for a large positive input, abs.WithStdLib(9223372036854775807).
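The precision loss can be demonstrated without relying on the implementation-specific behavior of an overflowing conversion. A small sketch (the chosen constant is my own, picked because it needs 63 bits of precision):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// float64 has a 52-bit fraction, so not every int64 beyond 2^53
	// can be represented exactly.
	n := int64(1)<<62 + 1
	fmt.Println(n)                           // 4611686018427387905
	fmt.Println(int64(math.Abs(float64(n)))) // 4611686018427387904: the low bit is lost
}
```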
The type conversion approach is faster with random input, but the branching approach is more correct. Can we do better?
A non-branching solution
Chapter 2 of Hacker’s Delight by Henry S. Warren introduces us to a branch-free way of computing the absolute value of a signed integer using a little Two’s Complement arithmetic.
To compute the absolute value of x, first compute y, which is equal to x ⟫ 63; that is, x arithmetically right-shifted by 63 bits on a 64-bit architecture. Because the shift is arithmetic (it copies the sign bit), y is 0 when x is non-negative and -1 (all one bits) when x is negative. Then compute (x ⨁ y) - y; that is, x exclusive-or y, then subtract y. This yields the absolute value of x.
Because we live hardcore, let’s implement this in assembly!
// abs.go
package abs

func WithASM(n int64) int64

// abs_amd64.s
TEXT ·WithASM(SB),$0
    MOVQ n+0(FP), AX   // copy input to AX
    MOVQ AX, CX        // y ← x
    SARQ $63, CX       // y ← y ⟫ 63
    XORQ CX, AX        // x ← x ⨁ y
    SUBQ CX, AX        // x ← x - y
    MOVQ AX, ret+8(FP) // copy result to return value
    RET
We start above with a Go function declaration for WithASM that has no immediate implementation. In a separate ASM file, we declare the body in Go’s assembler language. Note that this implementation is only valid on the amd64 architecture, so we advise the compiler by giving the file name the _amd64.s suffix.
And the benchmark results:
$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/cavaliercoder/go-abs
BenchmarkRand-8 1000000000 2.52 ns/op
BenchmarkWithBranch-8 200000000 5.91 ns/op
BenchmarkWithStdLib-8 500000000 2.88 ns/op
BenchmarkWithASM-8 1000000000 2.78 ns/op
PASS
ok github.com/cavaliercoder/go-abs 9.454s
All right! We observe a minor improvement of 0.10 ns/op over WithStdLib. Once the RNG overhead is subtracted, that’s actually a ~38% speed-up. Still, there could be more! It might help to understand how Go treats ASM code.
Compiler optimizations
We need visibility of the optimizations the compiler applies to our Go functions. The compiler accepts the -m flag to “print optimization decisions”. This can also be enabled in go build or go test with the -gcflags=-m argument.
Here’s what we see:
$ go tool compile -m abs.go
# github.com/cavaliercoder/abs
./abs.go:11:6: can inline WithBranch
./abs.go:21:6: can inline WithStdLib
./abs.go:22:23: inlining call to math.Abs
For simple functions like ours, the Go compiler supports function inlining: calls to the function are replaced at the call site with the function’s body.
By example,
package main

import (
    "fmt"

    "github.com/cavaliercoder/abs"
)

func main() {
    n := abs.WithBranch(-1)
    fmt.Println(n)
}
… might actually be compiled into something more like:
package main

import "fmt"

func main() {
    n := -1
    if n < 0 {
        n = -n
    }
    fmt.Println(n)
}
According to the compiler output above, WithBranch and WithStdLib can be inlined, but WithASM cannot. Even the underlying call to math.Abs is inlined into WithStdLib.
Since our WithASM function cannot be inlined, each caller incurs the overhead of a function call: writing a stack frame, copying in arguments, branching the program counter, and so on.
What if we level the playing field and disable inlining on our other Go functions? We can do this with a simple pragma comment before the function declaration: //go:noinline.
For example:
package abs

//go:noinline
func WithBranch(n int64) int64 {
    if n < 0 {
        return -n
    }
    return n
}
Running the compiler again, we see many fewer optimizations:
$ go tool compile -m abs.go
abs.go:22:23: inlining call to math.Abs
And here are the benchmark results:
$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/cavaliercoder/go-abs
BenchmarkRand-8 1000000000 2.40 ns/op
BenchmarkWithBranch-8 200000000 7.06 ns/op
BenchmarkWithStdLib-8 500000000 3.46 ns/op
BenchmarkWithASM-8 1000000000 2.86 ns/op
PASS
ok github.com/cavaliercoder/go-abs 10.032s
The performance of WithASM remains unchanged, while WithStdLib is three times slower, thanks to the overhead of the non-inlined function call.
The lesson learned for me here is that the performance gained by implementing ASM by hand needs to outweigh the benefits of compiler type safety, garbage collection, function inlining, etc. In most cases, this won’t be true, though there are exceptions, like taking advantage of SIMD instructions for cryptography or media encoding.
Running the benchmarks a few times it becomes clear that there is a reasonable performance gain to be had via ASM. But, can we somehow retain the benefits of inlining?
One inline function, please
The Go compiler cannot inline functions that are implemented in assembler, but implementing our Two’s Complement arithmetic in Go is easy:
package abs

func WithTwosComplement(n int64) int64 {
    y := n >> 63       // y ← x ⟫ 63
    return (n ^ y) - y // (x ⨁ y) - y
}
The compiler says our new function can be inlined:
$ go tool compile -m abs.go
...
abs.go:26:6: can inline WithTwosComplement
How does it perform? It turns out, when we re-enable inlining, it performs even better than WithASM! A whole 0.18 ns/op better which, once the RNG overhead is subtracted, is roughly a 60% speed-up.
$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/cavaliercoder/go-abs
BenchmarkRand-8 1000000000 2.39 ns/op
BenchmarkWithBranch-8 300000000 5.71 ns/op
BenchmarkWithStdLib-8 1000000000 3.01 ns/op
BenchmarkWithASM-8 1000000000 2.88 ns/op
BenchmarkWithTwosComplement-8 1000000000 2.70 ns/op
PASS
ok github.com/cavaliercoder/go-abs 14.417s
Now that the overhead of a function call is gone, the Two’s Complement implementation in Go outperforms the ASM implementation. Curiosity might lead us next to wonder: how similar is the compiler’s output to our hand-crafted ASM?
There’s an app for that.
Passing -S to the Go compiler causes it to “print assembly listing”.
$ go tool compile -S abs.go
...
"".WithTwosComplement STEXT nosplit size=24 args=0x10 locals=0x0
0x0000 00000 (abs.go:26) TEXT "".WithTwosComplement(SB), NOSPLIT, $0-16
0x0000 00000 (abs.go:26) FUNCDATA $0, gclocals·f207267fbf96a0178e8758c6e3e0ce28(SB)
0x0000 00000 (abs.go:26) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (abs.go:26) MOVQ "".n+8(SP), AX
0x0005 00005 (abs.go:26) MOVQ AX, CX
0x0008 00008 (abs.go:27) SARQ $63, AX
0x000c 00012 (abs.go:28) XORQ AX, CX
0x000f 00015 (abs.go:28) SUBQ AX, CX
0x0012 00018 (abs.go:28) MOVQ CX, "".~r1+16(SP)
0x0017 00023 (abs.go:28) RET
...
The compiler’s implementation is exactly the god-damned same. Only this time, it has the advantage of being correctly configured and cross-platform portable! Running the compiler again with GOARCH=386 produces a more complicated program that handles the 64-bit arithmetic on 32-bit machines.
I need to be more trusting and slightly less hardcore.
One last note about memory allocations: all of our implementations exhibit the same ideal behavior. When I run go test -bench=. -benchmem, I observe that each of the functions results in zero allocation operations and zero bytes allocated within the function body.
Homework
A few exercises for the reader:
- What if we implemented the assembly for abs as a compiler intrinsic, rather than a runtime function call?
- What if we used the PABSQ instruction from the AVX-512 extensions? It is not currently supported by Go’s assembler.
Conclusion
The Two’s Complement implementation in Go offers portability, function inlining, predictable performance, zero allocations and no integer truncation due to type conversions. The benchmarks showed a significant speed advantage over the other approaches.
The end result:
func abs(n int64) int64 {
    y := n >> 63
    return (n ^ y) - y
}