Lessons learned from programming in Go

Prevent future concurrent processing headaches by learning how to address these common pitfalls.
160 readers like this.
gopher illustrations

Renee French. CC BY 3.0

When you are working with complex distributed systems, you will likely come across the need for concurrent processing. At Mode.net, we deal daily with real-time, fast and resilient software. Building a global private network that dynamically routes packets at the millisecond scale wouldn’t be possible without a highly concurrent system. This dynamic routing is based on the state of the network and, while there are many parameters to consider here, our focus is on link metrics. In our context, link metrics can be anything related to the status or current properties of a network link (e.g.: link latency).

Concurrent probing for link metrics

H.A.L.O. (Hop-by-Hop Adaptive Link-State Optimal Routing), our dynamic routing algorithm relies partially on link metrics to compute its routing table. Those metrics are collected by an independent component that sits on each PoP (Point of Presence). PoPs are machines that represent a single routing entity in our networks, connected by links and spread around multiple locations shaping our network. This component probes neighboring machines using network packets, and those neighbors will bounce back the initial probe. Link latency values can be derived from the received probes. Because each PoP has more than one neighbor, the nature of such a task is intrinsically concurrent: we need to measure latency for each neighboring link in real-time. We can’t afford sequential processing; each probe must be processed as soon as possible in order to compute this metric.

latency computation graph

Sequence numbers and resets: A reordering situation

Our probing component exchanges packets and relies on sequence numbers for packet processing. This aims to avoid processing of packet duplication or out-of-order packets. Our first implementation relied on a special sequence number 0 to reset sequence numbers. Such a number was only used during initialization of a component. The main problem was that we were considering an increasing sequence number value that always started at 0. After the component restarts, packet reordering could happen, and a packet could easily replace the sequence number with the value that was being used before the reset. This meant that the following packets would be ignored until it reaches the sequence number that was in use just before the reset.

UDP handshake and finite state machine

The problem here was proper agreement of a sequence number after a component restarts. There are a few ways to handle this and, after discussing our options, we chose to implement a 3-way handshake protocol with a clear definition of states. This handshake establishes sessions over links during initialization. This guarantees that nodes are communicating over the same session and using the appropriate sequence number for it.

To properly implement this, we have to define a finite state machine with clear states and transitions. This allows us to properly manage all corner cases for the handshake formation.

finite state machine diagram

Session IDs are generated by the handshake initiator. A full exchange sequence is as follows:

  1. The sender sends out a SYN (ID) packet.
  2. The receiver stores the received ID and sends a SYN-ACK (ID).
  3. The sender receives the SYN-ACK (ID) and sends out an ACK (ID). It also starts sending packets starting with sequence number 0.
  4. The receiver checks the last received ID and accepts the ACK (ID) if the ID matches. It also starts accepting packets with sequence number 0.

Handling state timeouts

Basically, at each state, you need to handle, at most, three types of events: link events, packet events, and timeout events. And those events show up concurrently, so here you have to handle concurrency properly.

  • Link events are either link up or link down updates. This can either initiate a link session or break an existing session.
  • Packet events are control packets (SYN/SYN-ACK/ACK) or just probe responses.
  • Timeout events are the ones triggered after a scheduled timeout expires for the current session state.

The main challenge here is how to handle concurrent timeout expiration and other events. And this is where one can easily fall into the traps of deadlocks and race conditions.

A first approach

The language used for this project is Golang. It does provide native synchronization mechanisms such as native channels and locks and is able to spin lightweight threads for concurrent processing.

You can start first by designing a structure that represents our Session and Timeout Handlers.

type Session struct {  
  State SessionState  
  Id SessionId  
  RemoteIp string  
}

type TimeoutHandler struct {  
  callback func(Session)  
  session Session  
  duration int  
  timer *timer.Timer  
}

Session identifies the connection session, with the session ID, neighboring link IP, and the current session state.

TimeoutHandler holds the callback function, the session for which it should run, the duration, and a pointer to the scheduled timer.

There is a global map that will store, per neighboring link session, the scheduled timeout handler.

SessionTimeout map[Session]*TimeoutHandler

Registering and canceling a timeout is achieved by the following methods:

// schedules the timeout callback function.  
func (timeout* TimeoutHandler) Register() {   
  timeout.timer = time.AfterFunc(time.Duration(timeout.duration) * time.Second, func() {   
    timeout.callback(timeout.session)   
  })   
}

func (timeout* TimeoutHandler) Cancel() {   
  if timeout.timer == nil {   
    return   
  }   
  timeout.timer.Stop()   
}

For the timeouts creation and storage, you can use a method like the following:

func CreateTimeoutHandler(callback func(Session), session Session, duration int) *TimeoutHandler {  
  if sessionTimeout[session] == nil {  
    sessionTimeout[session] := new(TimeoutHandler)  
  }  
    
  timeout = sessionTimeout[session]  
  timeout.session = session  
  timeout.callback = callback  
  timeout.duration = duration  
  return timeout  
}

Once the timeout handler is created and registered, it runs the callback after duration seconds have elapsed. However, some events will require you to reschedule a timeout handler (as it happens at SYN state — every 3 seconds).

For that, you can have the callback rescheduling a new timeout:

func synCallback(session Session) {  
  sendSynPacket(session)

  // reschedules the same callback.  
  newTimeout := NewTimeoutHandler(synCallback, session, SYN_TIMEOUT_DURATION)  
  newTimeout.Register()

  sessionTimeout[state] = newTimeout  
}

This callback reschedules itself in a new timeout handler and updates the global sessionTimeout map.

Data race and references

Your solution is ready. One simple test is to check that a timeout callback is executed after the timer has expired. To do this, register a timeout, sleep for its duration, and then check whether the callback actions were done. After the test is executed, it is a good idea to cancel the scheduled timeout (as it reschedules), so it won’t have side effects between tests.

Surprisingly, this simple test found a bug in the solution. Canceling timeouts using the cancel method was just not doing its job. The following order of events would cause a data race condition:

  1. You have one scheduled timeout handler.
  2. Thread 1:

    a) You receive a control packet, and you now want to cancel the registered timeout and move on to the next session state. (e.g. received a SYN-ACK after you sent a SYN).

    b) You call timeout.Cancel(), which calls a timer.Stop(). (Note that a Golang timer stop doesn’t prevent an already expired timer from running.)
  3. Thread 2:

    a) Right before that cancel call, the timer has expired, and the callback was about to execute.

    b) The callback is executed, it schedules a new timeout and updates the global map.
  4. Thread 1:

    a) Transitions to a new session state and registers a new timeout, updating the global map.

Both threads were updating the timeout map concurrently. The end result is that you failed to cancel the registered timeout, and then you also lost the reference to the rescheduled timeout done by thread 2. This results in a handler that keeps executing and rescheduling for a while, doing unwanted behavior.

When locking is not enough

Using locks also doesn’t fix the issue completely. If you add locks before processing any event and before executing a callback, it still doesn’t prevent an expired callback to run:

func (timeout* TimeoutHandler) Register() {  
  timeout.timer = time.AfterFunc(time.Duration(timeout.duration) * time._Second_, func() {  
    stateLock.Lock()  
    defer stateLock.Unlock()

    timeout.callback(timeout.session)  
  })  
}

The difference now is that the updates in the global map are synchronized, but this doesn’t prevent the callback from running after you call the timeout.Cancel() — This is the case if the scheduled timer expired but didn’t grab the lock yet. You should again lose reference to one of the registered timeouts.

Using cancellation channels

Instead of relying on golang’s timer.Stop(), which doesn’t prevent an expired timer to execute, you can use cancellation channels.

It is a slightly different approach. Now you won’t do a recursive re-scheduling through callbacks; instead, you register an infinite loop that waits for cancellation signals or timeout events.

The new Register() spawns a new go thread that runs your callback after a timeout and schedules a new timeout after the previous one has been executed. A cancellation channel is returned to the caller to control when the loop should stop.

func (timeout *TimeoutHandler) Register() chan struct{} {  
  cancelChan := make(chan struct{})  
    
  go func () {  
    select {  
    case _ = <- cancelChan:  
      return  
    case _ = <- time.AfterFunc(time.Duration(timeout.duration) * time.Second):  
      func () {  
        stateLock.Lock()  
        defer stateLock.Unlock()

        timeout.callback(timeout.session)  
      } ()  
    }  
  } ()

  return cancelChan  
}

func (timeout* TimeoutHandler) Cancel() {  
  if timeout.cancelChan == nil {  
    return  
  }  
  timeout.cancelChan <- struct{}{}  
}

This approach gives you a cancellation channel for each timeout you register. A cancel call sends an empty struct to the channel and triggers the cancellation. However, this doesn’t resolve the previous issue; the timeout can expire right before you call cancel over the channel, and before the lock is grabbed by the timeout thread.

The solution here is to check the cancellation channel inside the timeout scope after you grab the lock.

  case _ = <- time.AfterFunc(time.Duration(timeout.duration) * time.Second):  
    func () {  
      stateLock.Lock()  
      defer stateLock.Unlock()  
      
      select {  
      case _ = <- handler.cancelChan:  
        return  
      default:  
        timeout.callback(timeout.session)  
      }  
    } ()  
  }

Finally, this guarantees that the callback is only executed after you grab the lock and no cancellation was triggered.

Beware of deadlocks

This solution seems to work; however, there is one hidden pitfall here: deadlocks.

Please read the code above again and try to find it yourself. Think of concurrent calls to any of the methods described.

The last problem here is with the cancellation channel itself. We made it an unbuffered channel, which means that sending is a blocking call. Once you call cancel in a timeout handler, you only proceed once that handler is canceled. The problem here is when you have multiple calls to the same cancelation channel, where a cancel request is only consumed once. And this can easily happen if concurrent events were to cancel the same timeout handler, like a link down or control packet event. This results in a deadlock situation, possibly bringing the application to a halt.

gophers on a wire, talking

Is anyone listening?

By Trevor Forrey. Used with permission.

The solution here is to at least make the channel buffered by one, so sends are not always blocking, and also explicitly make the send non-blocking in case of concurrent calls. This guarantees the cancellation is sent once and won’t block the subsequent cancel calls.

func (timeout* TimeoutHandler) Cancel() {  
  if timeout.cancelChan == nil {  
    return  
  }  
    
  select {  
  case timeout.cancelChan <- struct{}{}:  
  default:  
    // can’t send on the channel, someone has already requested the cancellation.  
  }  
}

Conclusion

You learned in practice how common mistakes can show up while working with concurrent code. Due to their non-deterministic nature, those issues can go easily undetected, even with extensive testing. Here are the three main problems we encountered in the initial implementation.

Updating shared data without synchronization

This seems like an obvious one, but it’s actually hard to spot if your concurrent updates happen in different locations. The result is data race, where multiple updates to the same data can cause update loss, due to one update overriding another. In our case, we were updating the scheduled timeout reference on the same shared map. (Interestingly, if Go detects a concurrent read/write on the same Map object, it throws a fatal error —you can try to run Go’s data race detector). This eventually results in losing a timeout reference and making it impossible to cancel that given timeout. Always remember to use locks when they are needed.

gopher assembly line

don’t forget to synchronize gophers’ work

CC BY 3.0

Missing condition checks

Condition checks are needed in situations where you can’t rely only on the lock exclusivity. Our situation is a bit different, but the core idea is the same as condition variables. Imagine a classic situation where you have one producer and multiple consumers working with a shared queue. A producer can add one item to the queue and wake up all consumers. The wake-up call means that some data is available at the queue, and because the queue is shared, access must be synchronized through a lock. Every consumer has a chance to grab the lock; however, you still need to check if there are items in the queue. A condition check is needed because you don’t know the queue status by the time you grab the lock.

In our example, the timeout handler got a ‘wake up’ call from a timer expiration, but it still needed to check if a cancel signal was sent to it before it could proceed with the callback execution.

gopher boot camp

Condition checks might be needed if you wake up multiple gophers

CC BY 3.0

Deadlocks

This happens when one thread is stuck, waiting indefinitely for a signal to wake up, but this signal will never arrive. Those can completely kill your application by halting your entire program execution.

In our case, this happened due to multiple send calls to a non-buffered and blocking channel. This meant that the send call would only return after a receive is done on the same channel. Our timeout thread loop was promptly receiving signals on the cancellation channel; however, after the first signal is received, it would break off the loop and never read from that channel again. The remaining callers are stuck forever. To avoid this situation, you need to carefully think through your code, handle blocking calls with care, and guarantee that thread starvation doesn’t happen. The fix in our example was to make the cancellation calls non-blocking—we didn’t need a blocking call for our needs.

What to read next
User profile image.
Eduardo Ferreira is a Senior Software Engineer at Mode.net. He has a bachelor degree in Computer Science from Federal University of Rio de Janeiro (UFRJ) and studied one year abroad at Cornell University. His main interests are in the area of Software Engineering, Distributed Systems, Computer Networks and Peer to Peer Architecture.

Comments are closed.

Creative Commons LicenseThis work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.