In this post, I challenge you to refactor a small function in such a way as to make it inlinable and free of bounds checks, for better performance.

Disclaimer: this post assumes version 1.24.2 of the (official) Go compiler; you may get different results with other versions of the Go compiler or with other implementations of the Go language.

Function inlining & bounds-check elimination

Some familiarity with function inlining and bounds-check elimination is a prerequisite for attempting my challenge. The following three sections serve as a crash course on those topics. Feel free to skip straight to the challenge itself.

Function inlining 101

Inlining is a compiler strategy that can roughly be described as “substituting the body of a function for calls to that function”. Because it eliminates some function-call overhead, inlining often results in faster program execution. However, function inlining also increases the size of the resulting binaries, and indiscriminate inlining could cause compilers to produce prohibitively large binaries. Therefore, compilers typically restrict eligibility for inlining to functions that they deem “simple enough”.

For each function, the Go compiler allocates an inlining budget and computes an inlining cost. If a function’s inlining cost does not exceed its budget, the compiler deems the function eligible for inlining (a.k.a. inlinable) and inlines it; otherwise, it doesn’t. At the time of writing (Go 1.24.2), the default inlining budget is 80, but the compiler may, thanks to profile-guided optimisation (PGO), decide to increase the inlining budget for functions called on the hot path; for this post, let’s keep things simple and ignore PGO, though. Some language constructs (such as recursion, “defer” statements, and calls to the recover function) preclude function inlining outright. The Go compiler’s inlining strategy is not specified by the language itself and may change from one release to the next.

The most straightforward way to determine whether a function is eligible for inlining is to run a command of the following form and inspect its output:

$ go build -gcflags '-m=2' <package-path> 2>&1 | grep <function-name>

Even if a function is inlinable, you may occasionally want to prevent it from being inlined; you can instruct the compiler not to inline a function by adding a //go:noinline directive above the function’s declaration. But you cannot force the compiler to inline arbitrary functions, for reasons explained above; a more thorough rationale for this limitation can be found in issue #21536.


By refactoring an initially non-inlinable function (and/or by relaxing its semantics), you can sometimes lower a function’s inlining cost to the point of making it eligible for inlining; however, unless you’re very familiar with the Go compiler’s inliner (and keep abreast of changes brought to it by maintainers of the Go project), you may rightly perceive making a function eligible for inlining tends more as an art than as a science.

Bounds-check elimination 101

I touched upon bounds-check elimination (BCE) in a post published earlier this year on this blog; allow me to quote it:

[…] Go is said to be memory-safe; in particular, per the language specification, implementations must trigger a run-time panic if a slice-indexing operation is ever out of bounds. Such bounds checks are relatively cheap, but they are not free. When the compiler can prove that some slice access cannot be out of bounds, it may omit, for better performance, the corresponding bounds check from the resulting executable.


What I wrote about slices in that post is also true of strings, which are little more than slices of bytes under the hood.


Bounds-check elimination, like inlining strategies, remains an implementation detail; it is not specified by the language itself and may change from one release of the Go compiler to the next.

To identify where the Go compiler introduces bounds checks in a package, run a command of the following form:

$ go build -gcflags '-d=ssa/check_bce/debug=1' <package-path>

The presence of bounds checks is not systematically detrimental to performance, and eliminating all bounds checks from a program often proves impossible anyway. However, eliminating some bounds checks, perhaps by hoisting them out of loops, typically is conducive to faster program execution. You can find instances of hoisting bounds checks out of loops in Go’s standard library.

Despite frequent improvements to the Go compiler’s BCE logic, you sometimes need to refactor your code in order to convince the compiler to eliminate or displace some bounds checks. This too is more an art than a science, and typically requires some trial and error.

A delicate balancing act

Function inlining and BCE are closely intertwined. Inlinability sometimes comes at the cost of a couple of additional bounds checks; conversely, eliminating some bounds checks may rob a function from its eligibility for inlining. In some lucky cases, though, you can have your cake and eat it too!

Now that you know enough about inlining and BCE, I can finally present my challenge to you.

Can you make this function inlinable and free of bounds checks?

Consider the following source file (which I’ve uploaded to a Gist, for your convenience):

package headers

// TrimOWS trims up to n bytes of optional whitespace (OWS)
// from the start of and/or the end of s.
// If no more than n bytes of OWS are found at the start of s
// and no more than n bytes of OWS are found at the end of s,
// it returns the trimmed result and true.
// Otherwise, it returns some unspecified string and false.
func TrimOWS(s string, n int) (string, bool) {
	if s == "" {
		return s, true
	}
	s, ok := trimLeftOWS(s, n)
	if !ok {
		return "", false
	}
	s, ok = trimRightOWS(s, n)
	if !ok {
		return "", false
	}
	return s, true
}

func trimLeftOWS(s string, n int) (string, bool) {
	var i int
	for len(s) > 0 {
		if !isOWS(s[0]) {
			break
		}
		if i >= n {
			return "", false
		}
		s = s[1:]
		i++
	}
	return s, true
}

func trimRightOWS(s string, n int) (string, bool) {
	var i int
	for len(s) > 0 {
		if !isOWS(s[len(s)-1]) {
			break
		}
		if i >= n {
			return "", false
		}
		s = s[:len(s)-1]
		i++
	}
	return s, true
}

// See https://httpwg.org/specs/rfc9110.html#whitespace.
func isOWS(b byte) bool {
	return b == '\t' || b == ' '
}

The implementation of TrimOWS above incurs no bounds checks, as evidenced by the empty output of the following command:

$ go build -gcflags '-d=ssa/check_bce/debug=1'
$

However, it is not inlinable:

$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
    cost 130 exceeds budget 80

Here is my challenge to you: can you refactor function TrimOWS so that it be inlinable (without relying on profile-guided optimisation) and free of bounds checks?

To allow you to catch bugs as you refactor the code, I’ve included a suite of unit tests; and to allow you to later measure performance changes between implementations, I’ve also included some benchmarks:

package headers

import "testing"

const maxOWS = 1 // number of OWS bytes tolerated on either side

var trimOWStests = []struct {
	desc string
	s    string
	want string
	ok   bool
}{
	{
		desc: "empty",
		s:    "",
		want: "",
		ok:   true,
	}, {
		desc: "no OWS",
		s:    "foo",
		want: "foo",
		ok:   true,
	}, {
		desc: "internal OWS",
		s:    "foo  \t\tbar",
		want: "foo  \t\tbar",
		ok:   true,
	}, {
		desc: "leading and trailing OWS",
		s:    "\tfoo ",
		want: "foo",
		ok:   true,
	}, {
		desc: "non-OWS whitespace",
		s:    "\nfoo\t",
		want: "\nfoo",
		ok:   true,
	}, {
		desc: "too much leading OWS",
		s:    " \tfoo\t",
		ok:   false,
	}, {
		desc: "too much trailing OWS",
		s:    " foo\t ",
		ok:   false,
	}, {
		desc: "too much leading and trailing OWS",
		s:    " \tfoo\t ",
		ok:   false,
	},
}

func TestTrimOWS(t *testing.T) {
	for _, tc := range trimOWStests {
		f := func(t *testing.T) {
			allocs := testing.AllocsPerRun(10,
				func() { str, truth = TrimOWS(tc.s, maxOWS) },
			)
			if allocs > 0 {
				const tmpl = "TrimOWS(%q, %d) allocates %.2f objects"
				t.Errorf(tmpl, tc.s, maxOWS, allocs)
			}
			got, ok := TrimOWS(tc.s, maxOWS)
			if !tc.ok && ok {
				// In cases where TrimOWS must fail,
				// its string result is unspecified.
				const tmpl = "TrimOWS(%q, %d): _, %t; want _, %t"
				t.Fatalf(tmpl, tc.s, maxOWS, ok, tc.ok)
			}
			if tc.ok && (!ok || got != tc.want) {
				const tmpl = "TrimOWS(%q, %d): %q, %t; want %q, %t"
				t.Errorf(tmpl, tc.s, maxOWS, got, ok, tc.want, tc.ok)
			}
		}
		t.Run(tc.desc, f)
	}
}

func BenchmarkTrimOWS(b *testing.B) {
	for _, tc := range trimOWStests {
		f := func(b *testing.B) {
			for range b.N {
				str, truth = TrimOWS(tc.s, maxOWS)
			}
		}
		b.Run(tc.desc, f)
	}
}

var ( // sinks for avoiding dead-code elimination in benchmarks
	str   string
	truth bool
)

Ready for some micro-optimisation? Clone the Gist and cd into your clone:

$ git clone https://gist.github.com/jub0bs/0151185bad8ea4d70c4fd1d6ac78ab40
$ cd 0151185bad8ea4d70c4fd1d6ac78ab40

And off you go!

Hints

Here is series of hints that you might find useful in case you get stuck.

Hint 0

Try eliminating unnecessary branching. In particular, the initial empty-string check in TrimOWS is redundant; it’s an attempt at micro-optimisation, but its performance benefits are moot. Moreover, TrimOWS doesn’t need to check the boolean results of trimRightOWS; instead, it could simply bubble up trimRightOWS’s results to the caller.

 func TrimOWS(s string, n int) (string, bool) {
-       if s == "" {
-               return s, true
-       }
        s, ok := trimLeftOWS(s, n)
        if !ok {
                return "", false
        }
-       s, ok = trimRightOWS(s, n)
-       if !ok {
-               return "", false
-       }
-       return s, true
+       return trimRightOWS(s, n)
 }

These changes don’t introduce any bounds check and somewhat reduce TrimOWS’s inlining cost:

$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
    cost 121 exceeds budget 80
Hint 1

Although helper functions trimLeftOWS and trimRightOWS are themselves inlinable, manually inlining them in TrimOWS helps inlinability without hurting readability:

 func TrimOWS(s string, n int) (string, bool) {
-       s, ok := trimLeftOWS(s, n)
-       if !ok {
-               return "", false
-       }
-       return trimRightOWS(s, n)
-}
-
-func trimLeftOWS(s string, n int) (string, bool) {
        var i int
        for len(s) > 0 {
                if !isOWS(s[0]) {
                        break
                }
                if i >= n {
                        return "", false
                }
                s = s[1:]
                i++
        }
-       return s, true
-}
-
-func trimRightOWS(s string, n int) (string, bool) {
-       var i int
+       i = 0
        for len(s) > 0 {
                if !isOWS(s[len(s)-1]) {
                        break
                }
                if i >= n {
                        return "", false
                }
                s = s[:len(s)-1]
                i++
        }
        return s, true
 }

Not only do these changes not introduce any bounds checks, but they dramatically reduce TrimOWS’s inlining cost:

$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
    cost 88 exceeds budget 80

Manually inlining my little isOWS function as well is tempting and does somewhat reduce TrimOWS’s inlining cost, but doing so isn’t ideal because it violates the DRY principle:

Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.

Hint 2

Try walking along the string rather than repeatedly munching at it:

 func TrimOWS(s string, n int) (string, bool) {
-       var i int
-       for len(s) > 0 {
-               if !isOWS(s[0]) {
+       for i := 0; i < len(s); i++ {
+               if !isOWS(s[i]) {
+                       s = s[i:]
                        break
                }
                if i >= n {
                        return "", false
                }
-               s = s[1:]
-               i++
        }
-       i = 0
-       for len(s) > 0 {
-               if !isOWS(s[len(s)-1]) {
+       for i := len(s) - 1; i >= 0; i-- {
+               if !isOWS(s[i]) {
+                       s = s[:i+1]
                        break
                }
-               if i >= n {
+               if i <= len(s)-1-n {
                        return "", false
                }
-               s = s[:len(s)-1]
-               i++
        }
        return s, true
 }

These changes, althouth they introduce no bounds checks, may seem counterproductive at first because they cause TrimOWS’s inlining cost to increase:

$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
    cost 94 exceeds budget 80

Fret not! Performance optimisation seldom is a straight line, and refactorings initially perceived as setbacks may in fact unlock new avenues for improvement.

Hint 3

Use range-over-int loops wherever possible, as they tend to be cheaper (and easier to reason about!) than classic three-clause loops:

 func TrimOWS(s string, n int) (string, bool) {
-       for i := 0; i < len(s); i++ {
+       for i := range len(s) {
                if !isOWS(s[i]) {
                        s = s[i:]
                        break

No bounds checks have been introduced, and TrimOWS’s inlining cost is now at its lowest so far:

$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
    cost 87 exceeds budget 80

We’re back on tracks!

Hint 4

Using naked returns can help reduce a function’s inlining cost:

-func TrimOWS(s string, n int) (string, bool) {
+func TrimOWS(s string, n int) (_ string, _ bool) {
        for i := range len(s) {
                if !isOWS(s[i]) {
                        s = s[i:]
                        break
                }
                if i >= n {
-                       return "", false
+                       return
                }
        }
        for i := len(s) - 1; i >= 0; i-- {
                if !isOWS(s[i]) {
                        s = s[:i+1]
                        break
                }
                if i <= len(s)-1-n {
-                       return "", false
+                       return
                }
        }
        return s, true
 }

Still no bounds checks to deplore, and TrimOWS’s inlining cost is lower than ever:

$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
    cost 83 exceeds budget 80

Down to 83, now. We’re breathing rare air, here!

Hint 5

You can return early from the second loop; by doing so, you can also clarify that, if execution reaches the bottom of TrimOWS, the string returned by the function is empty:

 func TrimOWS(s string, n int) (_ string, _ bool) {
        for i := range len(s) {
                if !isOWS(s[i]) {
                        s = s[i:]
                        break
                }
                if i >= n {
                        return
                }
        }
        for i := len(s) - 1; i >= 0; i-- {
                if !isOWS(s[i]) {
-                       s = s[:i+1]
-                       break
+                       return s[:i+1], true
                }
                if i <= len(s)-1-n {
                        return
                }
        }
-       return s, true
+       return "", true
 }

Again, no bounds checks have been introduced, and TrimOWS’s inlining cost has been further reduced:

$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: cannot inline TrimOWS: function too complex: \
    cost 82 exceeds budget 80

Down to 82. Tantalisingly close to the inlining budget!

Hint 6

At this stage, you may be (as I was) running out of ideas for further reducing TrimOWS’s inlining cost. But try offsetting the loop variable of the second loop by one…

 func TrimOWS(s string, n int) (_ string, _ bool) {
        for i := range len(s) {
                if !isOWS(s[i]) {
                        s = s[i:]
                        break
                }
                if i >= n {
                        return
                }
        }
-       for i := len(s) - 1; i >= 0; i-- {
-               if !isOWS(s[i]) {
-                       return s[:i+1], true
+       for i := len(s); i > 0; i-- {
+               if !isOWS(s[i-1]) {
+                       return s[:i], true
                }
-               if i <= len(s)-1-n {
+               if i <= len(s)-n {
                        return
                }
        }
        return "", true
 }

And voilà! TrimOWS still doesn’t incur any bounds checks but is now inlinable:

$ go build -gcflags '-d=ssa/check_bce/debug=1'
$
$ go build -gcflags '-m=2' 2>&1 | grep TrimOWS
./ows.go:9:6: can inline TrimOWS with cost 78 as: \
    func(string, int) (string, bool) { for loop; for loop; return "", true }

Mission accomplished. 🎉

My solution

Below is my complete final implementation (click to reveal), both inlinable and free of bounds checks:

Final implementation
package headers

// TrimOWS trims up to n bytes of optional whitespace (OWS)
// from the start of and/or the end of s.
// If no more than n bytes of OWS are found at the start of s
// and no more than n bytes of OWS are found at the end of s,
// it returns the trimmed result and true.
// Otherwise, it returns some unspecified string and false.
func TrimOWS(s string, n int) (_ string, _ bool) {
	for i := range len(s) {
		if !isOWS(s[i]) {
			s = s[i:]
			break
		}
		if i >= n {
			return
		}
	}
	for i := len(s); i > 0; i-- {
		if !isOWS(s[i-1]) {
			return s[:i], true
		}
		if i <= len(s)-n {
			return
		}
	}
	return "", true
}

// See https://httpwg.org/specs/rfc9110.html#whitespace.
func isOWS(b byte) bool {
	return b == '\t' || b == ' '
}

I think it remains quite readable; what’s your opinion?

You might have gotten there by a less circuitous way than I did, or you may have arrived at an entirely different implementation. However, if your implementation of TrimOWS is correct, inlinable, free of bounds checks, and has an inlining cost below 74 (my absolute record), I want to hear from you; do let me know on on Bluesky, Mastodon, Gophers Slack, or Reddit.

Benchmark results

Was all this micro-optimisation effort worth it? As often, it depends™. Here are some benchmark results pitting my final implementation against my initial one:

goos: darwin
goarch: amd64
pkg: gist.github.com/jub0bs/0151185bad8ea4d70c4fd1d6ac78ab40
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                                            │     old     │                 new                 │
                                            │   sec/op    │   sec/op     vs base                │
TrimOWS/empty-8                               1.831n ± 5%   2.203n ± 4%  +20.35% (p=0.000 n=20)
TrimOWS/no_OWS-8                              3.088n ± 5%   2.543n ± 1%  -17.65% (p=0.000 n=20)
TrimOWS/internal_OWS-8                        3.100n ± 2%   2.510n ± 1%  -19.03% (p=0.000 n=20)
TrimOWS/leading_and_trailing_OWS-8            5.586n ± 1%   5.012n ± 1%  -10.26% (p=0.000 n=20)
TrimOWS/non-OWS_whitespace-8                  5.114n ± 1%   4.926n ± 3%   -3.67% (p=0.000 n=20)
TrimOWS/too_much_leading_OWS-8                3.874n ± 1%   3.627n ± 1%   -6.39% (p=0.000 n=20)
TrimOWS/too_much_trailing_OWS-8               5.148n ± 1%   4.463n ± 1%  -13.30% (p=0.000 n=20)
TrimOWS/too_much_leading_and_trailing_OWS-8   3.872n ± 1%   3.609n ± 1%   -6.78% (p=0.000 n=20)
geomean                                       3.745n        3.455n        -7.74%

My final implementation is faster overall than the initial one. Granted, it saves only a fraction of a nanosecond in each benchmark case, but you should pay closer attention to the relative difference than to the absolute difference; and such seemingly insignificant improvements may matter in a performance-critical scenario.

Can those performance improvements really be ascribed to inlinability, though? To find out, I slapped a //go:noinline directive and re-ran the benchmarks. According to the results, inlinability helps a lot: my final implementation, when disallowed from being inlined, is actually slower overall than my initial, non-inlinable implementation.

You may have noticed one fly in the ointment: execution is faster in all cases but one; it is slower when TrimOWS’s string argument is empty. This performance gap can be explained by my removal of the empty-string check at the top of TrimOWS in the initial implementation. In the case of an empty string argument, TrimOWS now has to check the length of that string not once, but twice (once in each of its two loops). I can think of one final refactoring that, without introducing bounds checks or compromising inlinability, speeds up TrimOWS’s execution even in the case where its string argument is empty. I’m including some benchmark results below as a teaser, but I’ll leave this final refactoring as an exercise.

goos: darwin
goarch: amd64
pkg: gist.github.com/jub0bs/0151185bad8ea4d70c4fd1d6ac78ab40
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
                                            │     old     │               latest                │
                                            │   sec/op    │   sec/op     vs base                │
TrimOWS/empty-8                               1.831n ± 5%   1.232n ± 4%  -32.67% (p=0.000 n=20)
TrimOWS/no_OWS-8                              3.088n ± 5%   2.754n ± 1%  -10.82% (p=0.000 n=20)
TrimOWS/internal_OWS-8                        3.100n ± 2%   2.749n ± 1%  -11.32% (p=0.000 n=20)
TrimOWS/leading_and_trailing_OWS-8            5.586n ± 1%   5.380n ± 1%   -3.68% (p=0.000 n=20)
TrimOWS/non-OWS_whitespace-8                  5.114n ± 1%   4.812n ± 5%   -5.89% (p=0.000 n=20)
TrimOWS/too_much_leading_OWS-8                3.874n ± 1%   2.430n ± 1%  -37.27% (p=0.000 n=20)
TrimOWS/too_much_trailing_OWS-8               5.148n ± 1%   4.700n ± 2%   -8.71% (p=0.000 n=20)
TrimOWS/too_much_leading_and_trailing_OWS-8   3.872n ± 1%   2.424n ± 1%  -37.38% (p=0.000 n=20)
geomean                                       3.745n        3.007n       -19.69%

Can and should regressions be automatically detected?

Manually verifying that a function remains inlinable and free of bounds checks quickly becomes tedious, to the point that I (and other Gophers) sometimes wish for compiler directives that would automatically alert me to such regressions. For instance, you could imagine a function directive for marking a function as “intended for inlining” and that would cause compilation to fail if the compiler in fact deemed the function ineligible for inlining:

//go:inlinable
func TrimOWS(s string, n int) (_ string, _ bool) { /* ... */ }

You could also imagine a compiler directive that would cause compilation to fail if the compiler detected bounds checks on a given line:

		if !isOWS(s[i-1]) { //go:nobc

Alternatively, you could imagine a more granular compiler directive that would fail compilation if a line of interest incurred a prohibitive (and configurable) number of bounds checks:

func Cut(s, sep string) (before, after string, found bool) {
	if i := Index(s, sep); i >= 0 {
		return s[:i], s[i+len(sep):], true //go:maxbce:1
	}
	return s, "", false
}

However, I think that such directives are unlikely to ever be supported by the Go compiler. Recall that neither the inlining logic nor the bounds-check-elimination logic are enshrined in the language specification; they’re mere implementation details of the compiler and, as such, they may (and do, in my experience) evolve. The introduction of such compiler directives as discussed above would open the undesirable possibility that existing programs break as a result of changes to the compiler, a possibility that would violate Go 1.0’s compatibility promise.

Third-party tools, such as Jordan Lewis’s gcassert command, purport (among other goals) to fill the gap left by the absence of such compiler directives. However, I must admit that I’m reluctant to pepper my code with third-party annotations; doing so would go against the spirit of minimalism that I’ve grown accustomed to since I took my first step in the world of Go.

And perhaps a good suite of benchmarks is all I need. After all, some inlinability regressions and bounds-check-elimination regressions may only have a negligible effect on performance and not be worth attending to.

Takeaways

  • Eliminating most bounds checks from a function while also making it eligible for inlining can unlock an execution speedup.
  • Through trial and error, a series of simple refactorings can lower a function’s inlining cost and reduce the number of expensive bounds checks deemed necessary by the Go compiler.
  • Micro-optimisation sometimes is a futile endeavour but often proves intellectually rewarding.

If you’d like to learn more about inlining and bounds-check elimination, I recommend you peruse Tapir Liu’s Go Optimizations 101 e-book.

I hope that you’ve enjoyed this challenge. Till next time.