# Optimized abs() for int64 in Go

The Go programming language has no built in `abs`

function for computing the
absolute value of an integer.
That is, the non-negative representation of a negative or positive number.

I recently needed an implementation of the `abs`

function to solve the
Day 20 challenge in
Advent of Code 2017. If you want to learn
something new and get a kick out of testing yourself, I strongly recommend you
check it out!

Go does include an `abs`

function in the `math`

package:
math.Abs. Unfortunately, this doesn’t fit my
use case, as it only accepts `float64`

as input and output. I need `int64`

. It
is possible to use `math.Abs`

via type conversion, but this introduces some
overhead for the conversion to `float64`

and back to `int64`

, as well as
introducing truncation for larger numbers.

There is a great discussion on optimizing
`math.Abs`

for floating point integers, but unfortunately these optimizations
don’t directly apply to integers, due to their underlying encoding.

There must be other options. So begins my adventure into optimizing a simple function in Go!

You can find the source code and tests for this article on GitHub.

## Type conversion vs. branching

The most obvious solution to me for this problem was a very basic function which
returns `n`

if `n`

is zero or greater, and returns `-n`

if `n`

is less than
zero. The double negative is, of course, always positive.

Since it relies on a branching control structure to calculate an absolute value,
let’s call this function `abs.WithBranch`

.

```
package abs
func WithBranch(n int64) int64 {
if n < 0 {
return -n
}
return n
}
```

It works! Curiously, this is currently (Go v1.9.x) how `math.Abs`

is
implemented
for `float64`

numbers.

Great! Branching works. But, does it improve on a call to `math.Abs`

using type
conversions? Let’s implement that too.

```
package abs
func WithStdLib(n int64) int64 {
return int64(math.Abs(float64(n)))
}
```

Here we are converting `n`

to a `float64`

, passing it to `math.Abs`

and then
converting the result from `float64`

to `int64`

. Surely, this must introduce
some performance overhead?

A couple of simple benchmark tests yield clear results:

```
$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/cavaliercoder/abs
BenchmarkWithBranch-8 2000000000 0.30 ns/op
BenchmarkWithStdLib-8 2000000000 0.79 ns/op
PASS
ok github.com/cavaliercoder/abs 2.320s
```

At 0.30 ns/op (nanoseconds per operation), `WithBranch`

is more than twice as
fast! It also has the advantage that it won’t truncate any large numbers when
converting from a signed `int64`

to an IEEE-754 `float64`

, which loses bits to
represent the decimal point.

By example, `abs.WithBranch(-9223372036854775807)`

correctly returns
`9223372036854775807`

, while `abs.WithStdLib(-9223372036854775807)`

experiences
an overflow during the type conversion and incorrectly returns
`-9223372036854775808`

. It even returns the same incorrect, negative result for
a high positive input, `abs.WithStdLib(9223372036854775807)`

.

The branching approach is clearly faster and more correct for signed integers, but can we do better?

We know that branching code breaks the sequential flow of a program, meaning that pipelining processors cannot predict what will happen next. Sounds important?

## A non-branching solution

Chapter 2 of Hacker’s Delight by Henry S. Warren introduces us to a branch-free way of computing the absolute value of a signed integer using a little Two’s Complement arithmetic.

To compute the absolute value of `x`

, first, we compute the value `y`

which is
equal to `x ⟫ 63`

. That is, `x`

right shifted by 63 bits on a 64 bit
architecture. If you’re familiar with signed integers, you’ll note the value of
`y`

will now be `1`

if `x`

is negative, otherwise `0`

.

Finally, compute `(x ⨁ y) - y`

. That is, `x`

exclusive-or `y`

, take `y`

. This
yields the absolute value of `x`

.

Because we live hardcore, let’s implement this in assembly!

```
// abs.go
package abs
func WithASM(n int64) int64
```

```
// abs_amd64.s
TEXT ·WithASM(SB),$0
MOVQ n+0(FP), AX // copy input to AX
MOVQ AX, CX // y ← x
SARQ $63, CX // y ← y ⟫ 63
XORQ CX, AX // x ← x ⨁ y
SUBQ CX, AX // x ← x - y
MOVQ AX, ret+8(FP) // copy result to return value
RET
```

We start above with a Go function declaration for `WithASM`

that has no
immediate implementation. In a separate ASM code file, we declare the body in
Go’s Assembler language. Note that this
implementation is only valid on `amd64`

architecture systems, so we advise the
compiler by giving the file name the `_amd64.s`

suffix.

And the benchmarks results:

```
$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/cavaliercoder/abs
BenchmarkWithBranch-8 2000000000 0.29 ns/op
BenchmarkWithStdLib-8 2000000000 0.78 ns/op
BenchmarkWithASM-8 2000000000 1.78 ns/op
PASS
ok github.com/cavaliercoder/abs 6.059s
```

Well, that’s embarrassing!!! WTF?

It turn’s out, in my naive benchmarks, that the non-branching, highly succinct assembler code runs significantly slower at 1.78 ns/op. How can this be?

## Compiler optimizations

We need visibility of the optimizations the compiler is applying to the Go
functions. The compiler accepts the `-m`

flag to “print optimization decisions”.
This can also be enabled in `go build`

or `go test`

with the `-gcflags=-m`

argument.

Here’s what we see:

```
$ go tool compile -m abs.go
# github.com/cavaliercoder/abs
./abs.go:11:6: can inline WithBranch
./abs.go:21:6: can inline WithStdLib
./abs.go:22:23: inlining call to math.Abs
```

For simple functions like ours, the Go compiler supports function inlining. Function inlining means that calls to our function are replaced inline with the actual body of our function.

By example,

```
package main
import (
"fmt"
"github.com/cavaliercoder/abs"
)
func main() {
n := abs.WithBranch(-1)
fmt.Println(n)
}
```

… might actually be compiled more similar to:

```
package main
import "fmt"
func main() {
n := -1
if n < 0 {
n = -n
}
fmt.Println(n)
}
```

According to the compiler output above, `WithBranch`

and `WithStdLib`

are able
to be inlined, but `WithASM`

is not. Even the underlying call to `math.Abs`

is
inlined into `WithStdLib`

.

Since our `WithASM`

function cannot be inlined, each caller incurs the overhead
of a function call. This means allocating a stack, copying in arguments,
branching the program pointer, etc.

What if we even the playing field and disable inlining on our other Go
functions? We can do this with a simple pragma comment before the function
declaration: `//go:noinline`

.

For example:

```
package abs
//go:noinline
func WithBranch(n int64) int64 {
if n < 0 {
return -n
}
return n
}
```

Running the compiler again, we see many fewer optimizations:

```
$ go tool compile -m abs.go
abs.go:22:23: inlining call to math.Abs
```

And here are the benchmark results:

```
$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/cavaliercoder/abs
BenchmarkWithBranch-8 1000000000 1.87 ns/op
BenchmarkWithStdLib-8 1000000000 1.94 ns/op
BenchmarkWithASM-8 2000000000 1.84 ns/op
PASS
ok github.com/cavaliercoder/abs 8.122s
```

Each function now performs at around ~1.9ns/op.

You might infer the overhead of a function call to be around ~1.5ns for each solution. The overhead appears to negate any minor speed advantage in the body of our functions.

The lesson learned for me here, is that the performance gained by implementing ASM by hand needs to outweigh the benefits of compiler type safety, garbage collection and function inlining. In most cases, this won’t be the true, though there are exceptions, like taking advantage of SIMD instructions for cryptography or media encoding.

Running the benchmarks a few times it becomes clear that there are no significant performance gains to be had. We need function inlining.

## One inline function, please

The Go compiler cannot inline functions that are implemented in assembler, but implementing our Two’s Complement arithmetic in Go is easy:

```
package abs
func WithTwosComplement(n int64) int64 {
y := n >> 63 // y ← x ⟫ 63
return (n ^ y) - y // (x ⨁ y) - y
}
```

The compiler says our new function can be inlined:

```
$ go tool compile -m abs.go
...
abs.go:26:6: can inline WithTwosComplement
```

How does it perform? It turns out, when we re-enable inlining, it performs very
similar to `WithBranch`

:

```
$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/cavaliercoder/abs
BenchmarkWithBranch-8 2000000000 0.29 ns/op
BenchmarkWithStdLib-8 2000000000 0.79 ns/op
BenchmarkWithTwosComplement-8 2000000000 0.29 ns/op
BenchmarkWithASM-8 2000000000 1.83 ns/op
PASS
ok github.com/cavaliercoder/abs 6.777s
```

Now that the overhead of a function call is gone, the Two’s Complement implementation in Go out performs the ASM implementation. Curiosity might lead us next to wonder how similar the compiler’s output is to our hand-crafted ASM?

There’s an app for that.

Passing `-S`

to the Go compiler causes it to “print assembly listing”.

```
$ go tool compile -S abs.go
...
"".WithTwosComplement STEXT nosplit size=24 args=0x10 locals=0x0
0x0000 00000 (abs.go:26) TEXT "".WithTwosComplement(SB), NOSPLIT, $0-16
0x0000 00000 (abs.go:26) FUNCDATA $0, gclocals·f207267fbf96a0178e8758c6e3e0ce28(SB)
0x0000 00000 (abs.go:26) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (abs.go:26) MOVQ "".n+8(SP), AX
0x0005 00005 (abs.go:26) MOVQ AX, CX
0x0008 00008 (abs.go:27) SARQ $63, AX
0x000c 00012 (abs.go:28) XORQ AX, CX
0x000f 00015 (abs.go:28) SUBQ AX, CX
0x0012 00018 (abs.go:28) MOVQ CX, "".~r1+16(SP)
0x0017 00023 (abs.go:28) RET
...
```

The compiler’s implementation is exactly the god-damned same. Only this time, it
has the advantage of being correctly configured and being cross-platform
portable! Running the compiler again with `GOARCH=386`

produces a more
complicated program that handles the 64-bit arithmetic on 32-bit machines.

I need to be more trusting and slightly less hardcore.

One last note about memory allocations; all of our implementations exhibit the
same ideal behavior. When I run `go test -bench=. -benchmem`

, I observe that
each of the functions result in zero allocation operations and zero bytes
allocated within the function body.

## Conclusion

The Two’s Complement implementation in Go offers portability, function inlining,
non-branching code, zero allocations and no integer truncation due to type
conversions. The benchmarks never showed any significant speed advantage over
the branching approach, but this approach was selected anyway as, *in theory*,
non-branching code should perform better in more diverse scenarios.

The end result:

```
func abs(n int64) int64 {
y := n >> 63
return (n ^ y) - y
}
```