18 minutes
Challenge: make this Go function inlinable and free of bounds checks
In this post, I challenge you to refactor a small function in such a way as to make it inlinable and free of bounds checks, for better performance.
Disclaimer: this post assumes version 1.24.2 of the (official) Go compiler; you may get different results with other versions of the Go compiler or with other implementations of the Go language.
Function inlining & bounds-check elimination ¶
Some familiarity with function inlining and bounds-check elimination is a prerequisite for attempting my challenge. The following three sections serve as a crash course on those topics. Feel free to skip straight to the challenge itself.
Function inlining 101 ¶
Inlining is a compiler strategy that can roughly be described as “substituting the body of a function for calls to that function”. Because it eliminates some function-call overhead, inlining often results in faster program execution. However, function inlining also increases the size of the resulting binaries, and indiscriminate inlining could cause compilers to produce prohibitively large binaries. Therefore, compilers typically restrict eligibility for inlining to functions that they deem “simple enough”.
For each function, the Go compiler allocates an inlining budget and computes an
inlining cost. If a function’s inlining cost does not exceed its budget, the
compiler deems the function eligible for inlining (a.k.a. inlinable) and
inlines it; otherwise, it doesn’t. At the time of writing (Go
1.24.2), the default inlining budget is
80, but the compiler may, thanks to profile-guided
optimisation (PGO), decide to increase the inlining budget for functions
called on the hot path; for this post, let’s keep things simple and ignore PGO,
though. Some language constructs (such as recursion, “defer”
statements, and calls to the recover
function) preclude
function inlining outright. The Go compiler’s inlining strategy is not specified
by the language itself and may change from one release to the next.
The most straightforward way to determine whether a function is eligible for inlining is to run a command of the following form and inspect its output:
$ go build -gcflags '-m=2' <package-path> 2>&1 | grep <function-name>
Even if a function is inlinable, you may occasionally want to prevent it from
being inlined; you can instruct the compiler not to inline a function by adding
a //go:noinline
directive above the function’s declaration. But you cannot
force the compiler to inline arbitrary functions, for reasons explained above;
a more thorough rationale for this limitation can be found in issue
#21536.
By refactoring an initially non-inlinable function (and/or by relaxing its semantics), you can sometimes lower a function’s inlining cost to the point of making it eligible for inlining; however, unless you’re very familiar with the Go compiler’s inliner (and keep abreast of changes brought to it by maintainers of the Go project), you may rightly perceive making a function eligible for inlining tends more as an art than as a science.
Bounds-check elimination 101 ¶
I touched upon bounds-check elimination (BCE) in a post published earlier this year on this blog; allow me to quote it:
[…] Go is said to be memory-safe; in particular, per the language specification, implementations must trigger a run-time panic if a slice-indexing operation is ever out of bounds. Such bounds checks are relatively cheap, but they are not free. When the compiler can prove that some slice access cannot be out of bounds, it may omit, for better performance, the corresponding bounds check from the resulting executable.
What I wrote about slices in that post is also true of string
s, which are
little more than slices of byte
s under the hood.
Bounds-check elimination, like inlining strategies, remains an implementation detail; it is not specified by the language itself and may change from one release of the Go compiler to the next.
To identify where the Go compiler introduces bounds checks in a package, run a command of the following form:
$ go build -gcflags '-d=ssa/check_bce/debug=1' <package-path>
The presence of bounds checks is not systematically detrimental to performance, and eliminating all bounds checks from a program often proves impossible anyway. However, eliminating some bounds checks, perhaps by hoisting them out of loops, typically is conducive to faster program execution. You can find instances of hoisting bounds checks out of loops in Go’s standard library.
Despite frequent improvements to the Go compiler’s BCE logic, you sometimes need to refactor your code in order to convince the compiler to eliminate or displace some bounds checks. This too is more an art than a science, and typically requires some trial and error.
A delicate balancing act ¶
Function inlining and BCE are closely intertwined. Inlinability sometimes comes at the cost of a couple of additional bounds checks; conversely, eliminating some bounds checks may rob a function from its eligibility for inlining. In some lucky cases, though, you can have your cake and eat it too!
Now that you know enough about inlining and BCE, I can finally present my challenge to you.
Can you make this function inlinable and free of bounds checks? ¶
Consider the following source file (which I’ve uploaded to a Gist, for your convenience):
package headers
// TrimOWS trims up to n bytes of optional whitespace (OWS)
// from the start of and/or the end of s.
// If no more than n bytes of OWS are found at the start of s
// and no more than n bytes of OWS are found at the end of s,
// it returns the trimmed result and true.
// Otherwise, it returns some unspecified string and false.
func TrimOWS(s string, n int) (string, bool) {
if s == "" {
return s, true
}
s, ok := trimLeftOWS(s, n)
if !ok {
return "", false
}
s, ok = trimRightOWS(s, n)
if !ok {
return "", false
}
return s, true
}
func trimLeftOWS(s string, n int) (string, bool) {
var i int
for len(s) > 0 {
if !isOWS(s[0]) {
break
}
if i >= n {
return "", false
}
s = s[1:]
i++
}
return s, true
}
func trimRightOWS(s string, n int) (string, bool) {
var i int
for len(s) > 0 {
if !isOWS(s[len(s)-1]) {
break
}
if i >= n {
return "", false
}
s = s[:len(s)-1]
i++
}
return s, true
}
// See https://httpwg.org/specs/rfc9110.html#whitespace.
func isOWS(b byte) bool {
return b == '\t' || b == ' '
}
The implementation of TrimOWS
above incurs no bounds checks, as evidenced by
the empty output of the following command:
$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
However, it is not inlinable:
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
cost 130 exceeds budget 80
Here is my challenge to you: can you refactor function TrimOWS
so that it be
inlinable (without relying on profile-guided optimisation) and free of bounds
checks?
To allow you to catch bugs as you refactor the code, I’ve included a suite of unit tests; and to allow you to later measure performance changes between implementations, I’ve also included some benchmarks:
package headers
import "testing"
const maxOWS = 1 // number of OWS bytes tolerated on either side
var trimOWStests = []struct {
desc string
s string
want string
ok bool
}{
{
desc: "empty",
s: "",
want: "",
ok: true,
}, {
desc: "no OWS",
s: "foo",
want: "foo",
ok: true,
}, {
desc: "internal OWS",
s: "foo \t\tbar",
want: "foo \t\tbar",
ok: true,
}, {
desc: "leading and trailing OWS",
s: "\tfoo ",
want: "foo",
ok: true,
}, {
desc: "non-OWS whitespace",
s: "\nfoo\t",
want: "\nfoo",
ok: true,
}, {
desc: "too much leading OWS",
s: " \tfoo\t",
ok: false,
}, {
desc: "too much trailing OWS",
s: " foo\t ",
ok: false,
}, {
desc: "too much leading and trailing OWS",
s: " \tfoo\t ",
ok: false,
},
}
func TestTrimOWS(t *testing.T) {
for _, tc := range trimOWStests {
f := func(t *testing.T) {
allocs := testing.AllocsPerRun(10,
func() { str, truth = TrimOWS(tc.s, maxOWS) },
)
if allocs > 0 {
const tmpl = "TrimOWS(%q, %d) allocates %.2f objects"
t.Errorf(tmpl, tc.s, maxOWS, allocs)
}
got, ok := TrimOWS(tc.s, maxOWS)
if !tc.ok && ok {
// In cases where TrimOWS must fail,
// its string result is unspecified.
const tmpl = "TrimOWS(%q, %d): _, %t; want _, %t"
t.Fatalf(tmpl, tc.s, maxOWS, ok, tc.ok)
}
if tc.ok && (!ok || got != tc.want) {
const tmpl = "TrimOWS(%q, %d): %q, %t; want %q, %t"
t.Errorf(tmpl, tc.s, maxOWS, got, ok, tc.want, tc.ok)
}
}
t.Run(tc.desc, f)
}
}
func BenchmarkTrimOWS(b *testing.B) {
for _, tc := range trimOWStests {
f := func(b *testing.B) {
for range b.N {
str, truth = TrimOWS(tc.s, maxOWS)
}
}
b.Run(tc.desc, f)
}
}
var ( // sinks for avoiding dead-code elimination in benchmarks
str string
truth bool
)
Ready for some micro-optimisation? Clone the Gist and cd
into your clone:
$ git clone https://gist.github.com/jub0bs/0151185bad8ea4d70c4fd1d6ac78ab40
$ cd 0151185bad8ea4d70c4fd1d6ac78ab40
And off you go!
Hints ¶
Here is series of hints that you might find useful in case you get stuck.
Hint 0
Try eliminating unnecessary branching. In particular, the initial empty-string
check in TrimOWS
is redundant; it’s an attempt at
micro-optimisation, but its performance benefits are moot. Moreover, TrimOWS
doesn’t need to check the boolean results of trimRightOWS
; instead, it could
simply bubble up trimRightOWS
’s results to the caller.
func TrimOWS(s string, n int) (string, bool) {
- if s == "" {
- return s, true
- }
s, ok := trimLeftOWS(s, n)
if !ok {
return "", false
}
- s, ok = trimRightOWS(s, n)
- if !ok {
- return "", false
- }
- return s, true
+ return trimRightOWS(s, n)
}
These changes don’t introduce any bounds check and somewhat reduce TrimOWS
’s
inlining cost:
$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
cost 121 exceeds budget 80
Hint 1
Although helper functions trimLeftOWS
and trimRightOWS
are themselves
inlinable, manually inlining them in TrimOWS
helps inlinability without
hurting readability:
func TrimOWS(s string, n int) (string, bool) {
- s, ok := trimLeftOWS(s, n)
- if !ok {
- return "", false
- }
- return trimRightOWS(s, n)
-}
-
-func trimLeftOWS(s string, n int) (string, bool) {
var i int
for len(s) > 0 {
if !isOWS(s[0]) {
break
}
if i >= n {
return "", false
}
s = s[1:]
i++
}
- return s, true
-}
-
-func trimRightOWS(s string, n int) (string, bool) {
- var i int
+ i = 0
for len(s) > 0 {
if !isOWS(s[len(s)-1]) {
break
}
if i >= n {
return "", false
}
s = s[:len(s)-1]
i++
}
return s, true
}
Not only do these changes not introduce any bounds checks, but they
dramatically reduce TrimOWS
’s inlining cost:
$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
cost 88 exceeds budget 80
Manually inlining my little isOWS
function as well is tempting and does
somewhat reduce TrimOWS
’s inlining cost, but doing so isn’t ideal because it
violates the DRY principle:
Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.
Hint 2
Try walking along the string rather than repeatedly munching at it:
func TrimOWS(s string, n int) (string, bool) {
- var i int
- for len(s) > 0 {
- if !isOWS(s[0]) {
+ for i := 0; i < len(s); i++ {
+ if !isOWS(s[i]) {
+ s = s[i:]
break
}
if i >= n {
return "", false
}
- s = s[1:]
- i++
}
- i = 0
- for len(s) > 0 {
- if !isOWS(s[len(s)-1]) {
+ for i := len(s) - 1; i >= 0; i-- {
+ if !isOWS(s[i]) {
+ s = s[:i+1]
break
}
- if i >= n {
+ if i <= len(s)-1-n {
return "", false
}
- s = s[:len(s)-1]
- i++
}
return s, true
}
These changes, althouth they introduce no bounds checks, may seem
counterproductive at first because they cause TrimOWS
’s inlining cost to
increase:
$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
cost 94 exceeds budget 80
Fret not! Performance optimisation seldom is a straight line, and refactorings initially perceived as setbacks may in fact unlock new avenues for improvement.
Hint 3
Use range-over-int loops wherever possible, as they tend to be cheaper (and easier to reason about!) than classic three-clause loops:
func TrimOWS(s string, n int) (string, bool) {
- for i := 0; i < len(s); i++ {
+ for i := range len(s) {
if !isOWS(s[i]) {
s = s[i:]
break
No bounds checks have been introduced, and TrimOWS
’s inlining cost is now at its lowest so far:
$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
cost 87 exceeds budget 80
We’re back on tracks!
Hint 4
Using naked returns can help reduce a function’s inlining cost:
-func TrimOWS(s string, n int) (string, bool) {
+func TrimOWS(s string, n int) (_ string, _ bool) {
for i := range len(s) {
if !isOWS(s[i]) {
s = s[i:]
break
}
if i >= n {
- return "", false
+ return
}
}
for i := len(s) - 1; i >= 0; i-- {
if !isOWS(s[i]) {
s = s[:i+1]
break
}
if i <= len(s)-1-n {
- return "", false
+ return
}
}
return s, true
}
Still no bounds checks to deplore, and TrimOWS
’s inlining cost is lower than
ever:
$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
cost 83 exceeds budget 80
Down to 83, now. We’re breathing rare air, here!
Hint 5
You can return early from the second loop; by doing so, you can also clarify
that, if execution reaches the bottom of TrimOWS
, the string returned by the
function is empty:
func TrimOWS(s string, n int) (_ string, _ bool) {
for i := range len(s) {
if !isOWS(s[i]) {
s = s[i:]
break
}
if i >= n {
return
}
}
for i := len(s) - 1; i >= 0; i-- {
if !isOWS(s[i]) {
- s = s[:i+1]
- break
+ return s[:i+1], true
}
if i <= len(s)-1-n {
return
}
}
- return s, true
+ return "", true
}
Again, no bounds checks have been introduced, and TrimOWS
’s inlining cost has
been further reduced:
$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
cost 82 exceeds budget 80
Down to 82. Tantalisingly close to the inlining budget!
Hint 6
At this stage, you may be (as I was) running out of ideas for further reducing
TrimOWS
’s inlining cost. But try offsetting the loop variable of the second
loop by one…
func TrimOWS(s string, n int) (_ string, _ bool) {
for i := range len(s) {
if !isOWS(s[i]) {
s = s[i:]
break
}
if i >= n {
return
}
}
- for i := len(s) - 1; i >= 0; i-- {
- if !isOWS(s[i]) {
- return s[:i+1], true
+ for i := len(s); i > 0; i-- {
+ if !isOWS(s[i-1]) {
+ return s[:i], true
}
- if i <= len(s)-1-n {
+ if i <= len(s)-n {
return
}
}
return "", true
}
And voilà! TrimOWS
still doesn’t incur any bounds checks but is now inlinable:
$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: can inline TrimOWS with cost 78 as: \
func(string, int) (string, bool) { for loop; for loop; return "", true }
Mission accomplished. 🎉
My solution ¶
Below is my complete final implementation (click to reveal), both inlinable and free of bounds checks:
Final implementation
package headers
// TrimOWS trims up to n bytes of optional whitespace (OWS)
// from the start of and/or the end of s.
// If no more than n bytes of OWS are found at the start of s
// and no more than n bytes of OWS are found at the end of s,
// it returns the trimmed result and true.
// Otherwise, it returns some unspecified string and false.
func TrimOWS(s string, n int) (_ string, _ bool) {
for i := range len(s) {
if !isOWS(s[i]) {
s = s[i:]
break
}
if i >= n {
return
}
}
for i := len(s); i > 0; i-- {
if !isOWS(s[i-1]) {
return s[:i], true
}
if i <= len(s)-n {
return
}
}
return "", true
}
// See https://httpwg.org/specs/rfc9110.html#whitespace.
func isOWS(b byte) bool {
return b == '\t' || b == ' '
}
I think it remains quite readable; what’s your opinion?
You might have gotten there by a less circuitous way than I did, or you may
have arrived at an entirely different implementation. However, if your
implementation of TrimOWS
is correct, inlinable, free of bounds checks, and
has an inlining cost below 74 (my absolute record), I want to hear from you; do
let me know on on Bluesky, Mastodon, Gophers
Slack, or Reddit.
Benchmark results ¶
Was all this micro-optimisation effort worth it? As often, it depends™. Here are some benchmark results pitting my final implementation against my initial one:
goos: darwin
goarch: amd64
pkg: gist.github.com/jub0bs/0151185bad8ea4d70c4fd1d6ac78ab40
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
│ old │ new │
│ sec/op │ sec/op vs base │
TrimOWS/empty-8 1.831n ± 5% 2.203n ± 4% +20.35% (p=0.000 n=20)
TrimOWS/no_OWS-8 3.088n ± 5% 2.543n ± 1% -17.65% (p=0.000 n=20)
TrimOWS/internal_OWS-8 3.100n ± 2% 2.510n ± 1% -19.03% (p=0.000 n=20)
TrimOWS/leading_and_trailing_OWS-8 5.586n ± 1% 5.012n ± 1% -10.26% (p=0.000 n=20)
TrimOWS/non-OWS_whitespace-8 5.114n ± 1% 4.926n ± 3% -3.67% (p=0.000 n=20)
TrimOWS/too_much_leading_OWS-8 3.874n ± 1% 3.627n ± 1% -6.39% (p=0.000 n=20)
TrimOWS/too_much_trailing_OWS-8 5.148n ± 1% 4.463n ± 1% -13.30% (p=0.000 n=20)
TrimOWS/too_much_leading_and_trailing_OWS-8 3.872n ± 1% 3.609n ± 1% -6.78% (p=0.000 n=20)
geomean 3.745n 3.455n -7.74%
My final implementation is faster overall than the initial one. Granted, it saves only a fraction of a nanosecond in each benchmark case, but you should pay closer attention to the relative difference than to the absolute difference; and such seemingly insignificant improvements may matter in a performance-critical scenario.
Can those performance improvements really be ascribed to inlinability, though?
To find out, I slapped a //go:noinline
directive and re-ran the benchmarks.
According to the results, inlinability helps a lot: my final implementation,
when disallowed from being inlined, is actually slower overall than my initial,
non-inlinable implementation.
You may have noticed one fly in the ointment: execution is faster in all cases
but one; it is slower when TrimOWS
’s string
argument is empty. This
performance gap can be explained by my removal of the empty-string check at the
top of TrimOWS
in the initial implementation. In the case of an empty string
argument, TrimOWS
now has to check the length of that string not once, but
twice (once in each of its two loops). I can think of one final refactoring
that, without introducing bounds checks or compromising inlinability, speeds up
TrimOWS
’s execution even in the case where its string
argument is empty.
I’m including some benchmark results below as a teaser, but I’ll leave this
final refactoring as an exercise.
goos: darwin
goarch: amd64
pkg: gist.github.com/jub0bs/0151185bad8ea4d70c4fd1d6ac78ab40
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
│ old │ latest │
│ sec/op │ sec/op vs base │
TrimOWS/empty-8 1.831n ± 5% 1.232n ± 4% -32.67% (p=0.000 n=20)
TrimOWS/no_OWS-8 3.088n ± 5% 2.754n ± 1% -10.82% (p=0.000 n=20)
TrimOWS/internal_OWS-8 3.100n ± 2% 2.749n ± 1% -11.32% (p=0.000 n=20)
TrimOWS/leading_and_trailing_OWS-8 5.586n ± 1% 5.380n ± 1% -3.68% (p=0.000 n=20)
TrimOWS/non-OWS_whitespace-8 5.114n ± 1% 4.812n ± 5% -5.89% (p=0.000 n=20)
TrimOWS/too_much_leading_OWS-8 3.874n ± 1% 2.430n ± 1% -37.27% (p=0.000 n=20)
TrimOWS/too_much_trailing_OWS-8 5.148n ± 1% 4.700n ± 2% -8.71% (p=0.000 n=20)
TrimOWS/too_much_leading_and_trailing_OWS-8 3.872n ± 1% 2.424n ± 1% -37.38% (p=0.000 n=20)
geomean 3.745n 3.007n -19.69%
Can and should regressions be automatically detected? ¶
Manually verifying that a function remains inlinable and free of bounds checks quickly becomes tedious, to the point that I (and other Gophers) sometimes wish for compiler directives that would automatically alert me to such regressions. For instance, you could imagine a function directive for marking a function as “intended for inlining” and that would cause compilation to fail if the compiler in fact deemed the function ineligible for inlining:
//go:inlinable
func TrimOWS(s string, n int) (_ string, _ bool) { /* ... */ }
You could also imagine a compiler directive that would cause compilation to fail if the compiler detected bounds checks on a given line:
if !isOWS(s[i-1]) { //go:nobc
Alternatively, you could imagine a more granular compiler directive that would fail compilation if a line of interest incurred a prohibitive (and configurable) number of bounds checks:
func Cut(s, sep string) (before, after string, found bool) {
if i := Index(s, sep); i >= 0 {
return s[:i], s[i+len(sep):], true //go:maxbce:1
}
return s, "", false
}
However, I think that such directives are unlikely to ever be supported by the Go compiler. Recall that neither the inlining logic nor the bounds-check-elimination logic are enshrined in the language specification; they’re mere implementation details of the compiler and, as such, they may (and do, in my experience) evolve. The introduction of such compiler directives as discussed above would open the undesirable possibility that existing programs break as a result of changes to the compiler, a possibility that would violate Go 1.0’s compatibility promise.
Third-party tools, such as Jordan Lewis’s gcassert
command, purport (among other goals) to fill the gap left by the
absence of such compiler directives. However, I must admit that I’m reluctant
to pepper my code with third-party annotations; doing so would go against the
spirit of minimalism that I’ve grown accustomed to since I took my first step
in the world of Go.
And perhaps a good suite of benchmarks is all I need. After all, some inlinability regressions and bounds-check-elimination regressions may only have a negligible effect on performance and not be worth attending to.
Takeaways ¶
- Eliminating most bounds checks from a function while also making it eligible for inlining can unlock an execution speedup.
- Through trial and error, a series of simple refactorings can lower a function’s inlining cost and reduce the number of expensive bounds checks deemed necessary by the Go compiler.
- Micro-optimisation sometimes is a futile endeavour but often proves intellectually rewarding.
If you’d like to learn more about inlining and bounds-check elimination, I recommend you peruse Tapir Liu’s Go Optimizations 101 e-book.
I hope that you’ve enjoyed this challenge. Till next time.
Goperformancemicro-optimisationbenchmarksinliningbounds-check elimination
3816 Words
2025-04-30 20:45 +0000