Protocol and game agnostic multithreaded networking engine

This is part 5 of the multipart Trap Labs Code Design and Architecture Series.

Gaffer on Games has a great series on the inner workings of a networking engine. The author, Glenn Fiedler goes into detail about every component in way more detail than I ever would so I please read his articles before you read mine. Consider this article an addendum to his series. I’ll specifically focus on design and architecture, and provide some insights on testability as networking is one of the more difficult components to test.

So again, from this point forward, I am going to assume you already know the inner workings of a network engine. Otherwise this article won't make much sense to you.

Multithreaded engine

The example Glenn gave in his series is a single threaded game loop that updates networking and game at the same time. This is fine for most games. I took it a step further and improved it with a multithreaded engine

Ideally, you want the networking component to send and receive packets as fast as possible. If the entire game is on a single thread the core game loop will be the biggest bottleneck, because you can only send and receive packets as fast as your game does the processing. For example if your game runs at less than 30fps and your need a send rate of 30 ticks/second, then your network is effectively bottlenecked by your game. And if your frame rate fluctuates a lot it'll also affect your network stability and congestion. I created a multithreaded and asynchronous (Boost Asio async sockets) standalone networking library such that the bottleneck wouldn’t exist. It was a good challenge and I hope it will be useful for you too.

High level architecture

Note that OSSockets and GameLoop are modules, not classes

Including the GameLoop this is a three thread process: receiver thread, sender thread, and game loop thread. The basic work flow goes like this:

OS receives a packet asynchronously (handled by Boost Asio)
Packet goes through the ReceiverProcessor for deserialization and processing
If the packet processing is successful, the ReceiverProcessor puts the deserialize data onto a command data queue and wakes the GameLoop thread (if sleeping)
The GameLoop thread takes this data queue and inputs it through the command engine described in Part 3
Command is executed and a response is prepared by GameLoop and put on an outgoing data queue
GameLoop wakes the SenderProcessor (if sleeping)
SenderProcessor takes the outgoing data queue and creates corresponding packets
OS sends out the packets asynchronously (handled by Boost)

I used a wait-free queue from BOOST (single producer single consumer) as the data queue that delivered the data between threads.

Note the clear separation of boundaries on OSSockets and GameLoop. All boundaries dependencies are inverted using interfaces (remember DIP?). This way the networking module is agnostic from the operating system and the game. This means that it doesn’t need to know about the game protocol or the transport protocol.

Something that worried me when I came up with his architecture was whether the context switch between threads may be too expensive. I’ve yet to benchmark anything, but the CPU usage on Trap Labs’ dedicated server is only 3% on debug build hosting a 4 player game (Intel i5 2500k @ 4.0GHz), and there was no perceived lag. I’d say that’s pretty good.

Receiver and Sender interfaces

The significance of the Receiver and Sender being an interface is that you can use any transport protocol you want. This allows you to use the networking library with either UDP or TCP (or something completely different). I implemented both TCP and UDP variants with Boost’s Asio asynchronous sockets. I used UDP for game the loop and TCP for all lobby transactions.

The interesting bit of architecture worth noting is that dependency on the Receiver and Sender is different. OSSockets inherits Sender and references Receiver. Under UDP I have two implementations: UDPReceiver and UDPSender. UDPReceiver references the Receiver interface which is implemented by ReceiverProcessor. The idea is that if you are not receiving any packets you should be doing nothing. UDPReceiver will wait for packets the come in (handled by Boost Asio), and when a packet is received it goes through ReceiverProcessor and wakes up the GameLoop through the Wakeable interface (wakeable is usually implemented using condition_variable in C++ in case you are curious).

UDPSender on the other hand implements the Sender interface because it is used by SenderProcessor to send the packets. And because SenderProcessor is wakeable by GameLoop, the association on both SenderProcessor and Sender are different from Receiver.

Simply stated, receiving is passive and sending is active, and that's why the dependency is different from the networking module's perspective. I know it’s a bit difficult to see the reasoning behind this, but it should become clear once you try to implement this architecture.

To summarize:
Receiving in one direction (>>>)
UDPReceiver uses Receiver interface which is implemented by ReceiverProcessor which wakes up the GameLoop Sending in other direction (<<<)
GameLoop wakes up SenderProcessor which uses Sender interface implemented as UDPSender

ReceiverProcessor and SenderProcessor

As their name suggest, ReceiverProcessor is responsible for taking the packets received and processes them into useable data, and SenderProcessor takes game data and processes them into outgoing packets.

What’s significant in my architecture is that most of the networking component resides within the SenderProcessor:

I’ll quickly go over the components from left to right:

PacketResender – determines whether a packet should resend or not
PacketCache – caches sent packets; used by PacketResender if a packet needs to be resent
CongestionAnalyzer – analyses network congestion
PacketRTT – calculates packet’s Round Trip Time (RTT) and reports late/timeout packets
SendLimiter – manages send rate based on network congestion; swappable algorithm based on Strategy pattern
PacketGenerator – takes the data from GameLoop and generates the packet
AckProcessor – sets ack bits based on received packets
NetworkFirewall – firewall that filters bad packets and unwanted clients

If you need more information on what these modules do you can visit Gaffer on Games to find detailed descriptions of their behavior. The only component that is new is the firewall. It simply filters clients by endpoint (IP + port pair) and protocol id.

Since ReceiverProcessor and SenderProcessor are on separate threads, they communicate through a wait free queue TimeStampedAckQueue. Whenever ReceiverProcessor receives a packet it updates the queue with the sequence number and its timestamp. Whenever SenderProcessor is awaken it consumes the queue and updates its internal components like PacketRTT and CongestionAnalyzer.

Let me go through a typical receive and send process. From the ReceiverProcessor side:

Packet is received by Boost Asio and transferred to ReceiverProcessor
Packet is filtered by firewall
Packet is deserialized into data useable by GameLoop
Packet’s ack is added to the TimeStampedAckQueue
Deserialized data is added to the command data queue waiting to be consumed by GameLoop

From the SenderProcessor side:

SenderProcessor consumes outgoing data queue
Outgoing command data is serialized and protocol is applied
Packet is cached by PacketCache
Ack bits are applied on the packet based on TimeStampedAckQueue (shared by ReceiverProcessor)
Packets are sent by Boost Asio
PacketRTT updates its RTT value from by checking matched ack and sequence number from TimeStampedAckQueue and SentTimeStampedSequences. Packets that haven’t yet been acked will be saved in the TSAckQueueCache
All packets that were acked are cleaned from PacketCache
CongestionAnalyzer is updated based on the latest RTT value reported by PacketRTT. If congestion is detected send rate limiter is triggered
PacketRTT reports all late/timeout packets
PacketResender resends late packets if required

The reason that most of components residing inside SenderProcessor is that components like PacketRTT, CongestionAnalyzer needs acks from packets received in order for them to operate. In addition, PacketResender only resends based on timed-out packets reported by PacketRTT. So all these components are not useful until something is received. Having all of the components within SenderProcessor is mostly for convenience of access. Logically it makes more sense to put PacketRTT and CongestionAnalyzer in ReceiverProcessor. However, this results in SentTimeStampedSequences and TSAckQueueCache being wait free queues inorder for SenderProcessor to share them due to the multithread nature. I didn’t want to have this overhead so simply repositioning the composition inside SenderProcessor eliminated this overhead. This was one of the more interesting design decisions I had to make which was not immediately clear until refactoring.

General purpose networking library

Many of the design decisions that you make for your software is about determining what are the implementation detail that should be decoupled from the module. It is almost intuitive to make the assumption that the protocol and the underlying transport should be coupled to the networking module because networking can’t work without them… or can they? Always remember to question your designs because often the intuitive design is often a bad design.

Transport protocol agnostic

Let’s look at the transport layer that implements the Receiver and Sender interfaces. The obvious choice here is to implement UDPSender and UDPReceiever. When using TCP versions they'd still work exactly same. I simply added a flag for SenderProcessor to be able to turn off resending components modules when using TCP. Due to this convenience I used UDP for the game loop and TCP for the lobby system. I’m planning to use TCP for all ingame operations such as chat and pulling player stats, and I intend to use UDP for all gameplay mechanics. I’ll update this article in the future if I have new findings when using both protocols at the same time.

The true advantage of being transport protocol agnostic actually lies within testing. Not only this allows you to mock the interfaces for unit testing, it’ll also allow you to implement unstable version of UDP and TCP to simulate network instability for playtesting.

In one of the sections this GDC talk on Halo: Reach networking architecture, the speaker talks about their costly network traffic shaping hardware to test the robustness of their networking architecture (you should watch this talk btw). But for us indies we don’t have the money for such expensive equipment. What’s the next best thing? Since Receiver and Sender are interface we can implement our own traffic shaping with zero cost! And since the networking is a on a separate thread, I can literally do what I want to the network and it would not affect the game thread!

For example, I implemented a LossyUDPReceiver that drops a packet every x milliseconds on purpose. This was trivial to implement (simply do nothing when timer is triggered). On top of that I can test the full gameplay with the lossy variants to experience what gameplay is like on a bad network while on localhost. If I wanted, I could go one step further and implement a more complicated lossy algorithm that drops packet on random intervals and delay release of packets. This is super convenient and allows me to test variety of network issues over localhost and further optimize the gameplay. I could even see in the future to implement a full suite of lossy transports to simulate myriad of realistic networking conditions.

Of course I’m not saying this is better than using actual traffic shaping hardware. But until I have the money to afford such expensive equipment this is the best way for me to test and optimize for bad networks.

See I just saved you tens of thousands of monies how are you going to repay me? :)

Game protocol agnostic

To make the networking engine truly portable, it has to be game agnostic as well. This means that it cannot know about the game protocol. How can that be? Well if you separate the network protocol from the game protocol you can do just that. For example, the mandatory parts for the network protocol are just really 4 items:

Protocol magic number/id
Sequence number
Ack bits
Client ID

You can certainly add more fields as needed. The networking engine would only deserialize the first chunk which is what stated above, and the game loop would deserialize the latter chunk. So as long as your game uses the same networking protocol you can essentially use any game protocol you like, making the networking engine completely game and protocol agnostic.

Minimizing visible lag

Let’s say your game is single threaded and using non-blocking sockets. Regardless whether you use UDP and TCP as your underlying transport protocol your game will hang if there is packet loss. The way to reduce the visible hang effect is to continue to update your game loop even if you didn’t receive a packet, and use a custom reliability method over UDP to reduce the lag. So just as Glenn explained UDP is still the superior choice. In addition, depending on how computationally intensive your game loop is, it’ll also affect your send rate over the network.

My multithreaded architecture eliminated the networking bottleneck completely by placing the game loop on a separate thread. This way the networking components can try to send and receive as fast as possible, and the game loop can continue to process without being affected by network condition. Assuming there’s ample computing resource, the game loop will never block the network, and the network will never block the game loop.

Regardless of what architecture you use, if there is packet loss your game will lag. What you can do as the software architect is design your game in such a way to compensate for the lag as best as you can such that to the player the lag is mitigated or unnoticeable.

I don't have any emperical evidence to show that my architecture is actually better than a single threaded variant. If I have the time and resources one day I would like to run benchmarks against a single thread variant to test their performance and scalability. But hey, it works great!

Is this over design?

For some this architecture might be considered as over design. In addition, the knowledge required to implement and test multithreaded library is definitly not something for beginner or even some intermediate programmers. And I would agree, for most indie games, a single threaded version like that Glenn offered is more than sufficient.

Personally I don't think this is over design. I’m in it for the long run, and I hope you are too. I built this networking with the intention of reusing the library in the future for any real-time multi-client software. And as it stands, this networking engine is infinitely reusable as along as I respect the separation of networking and the software protocol. In addition, the fact that its multithreaded means non of my apps in the future will be blocked by the network or the app core. I’m really happy it turned out well and I’m proud to say it’s one library that I built worthy of the multi-core era.

Code'n Such