go concurrency correctness

Sep 4, 2018 1305 words 7 minutes

Contents

Go has great concurrency support and as with all concurrent code, it’s very easy to write code that’s doesn’t perform as expected or intended. This is a look at correctness in concurrent go, and some of the considerations.

correctness

How do we define correctness, the dictionary definition is:

“the quality or state of being free from error; accuracy.”

But as a whole, correctness is a deep and well read computer science topic. It includes a basis for formally verifying a program; the act of proving or disproving an algorithm, with respect to a formal specification.

Both correctness as a compsci topic and in our case depends on the spec, where the spec is usually grounded in mathematical techniques.

We can consider concurrent correctness to be a program that contains concurrent (but not necessarily parallel) code and produces a logically expected and deterministic output.

why is correctness hard?

Humans are logical, often serial thinkers. We often consider that because some code block appears after another on the page, it will be executed in that order. This is empirically true for code that runs in serial, notwithstanding goto, if-else, defer and other control structures.

This model of code execution falls down when you start to consider parallel code, which by definition is more than one code block running at the same time. But introduce concurrency, and you’ve got a potentially more confusing half way house. Concurrency is a property of the code, and parallelism is a property of the runtime for the code.

Writing concurrent code does not necessarily mean that it will run in parallel (do more than one thing at once). Concurrent code just says that it could do more than one thing at once. Imagine concurrent code that runs on a machine with a processor that has a single core, with a single tthread; processor scheduling and time slicing may give the impression of parallel execution, but really it’s serial. The fact that the code is concurrent only means that it is built to handle and exploit multithreaded environments, not that it has to be run in one.

golang’s fork-join model

Go uses a fork-join model to achieve concurrency, a single goroutine will fork with the go func(...) statement to run a task concurrently, and then expects a “join” operation to happen to allow serial execution to continue. Go has a number of ways of achieving a “join”, but all of them are initiated by the new goroutine. There’s currently no way of stopping an executing goroutine without it blocking on a join point.

So what is a join point? A join point is a point in the code where the goroutine indicates that it can pause, and the scheduler can run some other part goroutines before coming back to this one. Join points are often used to communicate between goroutines, or stop concurrent execution.

Some of the most common join-point creating operations in golang are: mutex.Lock(), <-chan, chan<-, waitGroup.Wait()

All of these operations will pause the execution of the current goroutine, this allows the go scheduler to kick in and decide if the execution of the current goroutine can continue, or if it’s better to execute some other routine before continuing here.

It’s also worth noting that a goroutine coming to the end of it’s execution e.g. the go func(...) returning can be considered to be an implicit join point. As it too allows the end of the concurrent task execution.

why do join points matter?

Join points are important because they are the basis for us to reason about the execution of a program. Without an explicit join point, concurrent programs are guaranteed to be non-determinstic. Consider the following example:

1
2
3
4
5
6
7
8
func main() {
  go func() {
      fmt.Println("async")
  }
  fmt.Println("sync")
  
  <-time.After(1 * time.Second)
}

This example is trivially small, but has multiple things wrong with it. It’s output is non-deterministic, meaning we do not know (for sure) what the output will be. Some possibilities:

output 1:

1
2
async
sync

output 2:

1
2
sync
async

output 3:

1
sync

In the example there’s no way to know when anonymous function started with go func() will be executed. This means that we can get output 1 or output 2 where the anonymous function is executed before or after the fmt.Println("sync") call. output 3 happens because the sleep blocks the main goroutine allowing the go runtime to consider which routines to schedule, but it does not enforce the execution of other routines. By not enforcing the scheduling and execution of the anonymous routine, it’s entirely possible that the main routine continues execution and exits after the sleep before the anonymous function is executed, this created output 3.

The sleep blocks the execution of the main thread, but does not create a join point. The sleep can expire / complete before the anonymous function gets a chance to execute. Adding a sleep in this situation only increases the chances that other goroutines will get a chance to execute, but does not enforce this. The lack of enforcement means that we cannot consider this example program to be concurrently correct. Imagine the scenario where the system calls required to print async take longer than one second. On a multicore machine the program could exit before completing the print operation. This means there are more than just 3 possibilities, and it’s hard to enumerate them all.

Looking back at the criteria for correctness we see that it:

✔️ contains concurrent code
❌ produces a logically expected and deterministic output

TL;DR the sleep increases the chance of other goroutines being executed, but does not guarantee it. Without that guarantee, the program is not correct.

how can we make this program correct?

WITH JOIN POINTS!

Adding some synchronisation between the two goroutines ensures that one cannot be executed before the other. Let’s assume we want the output:

1
2
async
sync

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
func main() {
  ch := make(chan struct{})
  go func() {
      defer close(ch)
      fmt.Println("async")
  }
  <-ch
  fmt.Println("sync")
  
}

A closed channel always allows reading a value (the zero value). Once we’ve printed async we close the channel, this creates the join point that synchronises the anonymous and main goroutines allowing the main goroutine to continue, and we know that the output will be what we expect. The main goroutine cannot contine execution until it’s recieved the notification on the channel ch that the anonymous routine has finished.

Without using a synchronisation method such as a channel or mutex we cannot be sure of the output of the program, and therefore consider it to be incorrect.

why does correctness matter?

Correctness is the only way to guarantee that your program is not hiding bugs that are only exposed by the scheduler. Often called “finiky” bugs, these are errors that occur when something in the codes environment changes; change of machine it’s running on, change of load the machine is under, change of runtime version etc. All of these external changes can cause the bug hidden by the programs lack of correctness to rear it’s head. And what appears to cause the bug is often not related to the actual bug; this makes the hard and time consuming to track down and fix.

Writing correct programs in the first place means you do not have this problem. There are some rules to assist this;

Any goroutine that spawns others must include a method of cleaning up what it has spawned.
Never use sleeps to wait for, schedule, or orchestrate logic between goroutines.
Unless 2 routines directly communicate, there’s no guarantee of execution order.
Use sync.WaitGroup, sync.Mutex or chan to communicate between routines.
Unless the main gouroutine blocks until all routines are finished there’s no guarantee the others will get the time to complete.