Introduction

Welcome to the bittide documentation book! Bittide is an inter-chip communication link that allows cycle-level deterministic communication between chips with no in-band overhead.

The benefits include:

  • eliminating tail latency
  • allowing compilers to statically schedule workloads across physical chips
  • scaling to an arbitrary number of nodes without in-band loss

This book serves as an introduction to bittide and its concepts. Our goal is to provide the necessary information to develop and deploy experiments on bittide-based systems.

Key terms

Clock drift - The gradual deviation of two physical clocks from each other

Logical synchrony - A property that lets distributed computation be coordinated as tightly as in a synchronous system, without distributing a global clock or referencing universal time

Latency-deterministic hardware - Computer hardware whose latencies and computation graph can be known ahead of time, independent of the data. Examples of non-latency-deterministic features are branch prediction and cache prefetching, since their computation time cannot be known at compile time. Non-latency-deterministic hardware can still use bittide, but loses most of the benefit of bittide's cycle-accurate communication.

The problem with inter-chip communication

Modern computation workloads are growing larger. To accommodate this growth, workloads are being split across multiple physical processors. The real-world scaling of these workloads, however, does not match the theoretically achievable scaling.

This disconnect is caused by non-determinism in each chip's underlying physical clock, which leads to clock drift. Due to physical phenomena such as manufacturing variation, heat, and vibration, these clocks do not run perfectly in sync. Most chips have an input buffer to compensate, but this buffer can be overrun, so chips must be able to apply backpressure: a signal to wait before sending more data.

As you scale up the number of connected processors, the problem compounds. More chips mean more independent clocks, more buffering, and more backpressure, which together place practical limits on system scaling.

Bittide's Core Idea

Bittide uses a decentralized hardware algorithm to prevent clock drift. The algorithm is based on the following observations:

  1. The clock drift between two connected nodes can be observed as the difference between the number of data frames a node has sent on a link and the number of data frames it has received on that link
  2. If you adjust the clock frequency based on this signal, you can keep two nodes in logical synchrony
  3. If this algorithm is carefully scaled up to an arbitrary number of nodes1, an entire system can be held in logical synchrony

If we remove all non-determinism from inter-chip communication, we get the logical benefits of scaling as if we were using a bigger chip, combined with the physical scaling of adding more chips to the cluster.

For an in-depth description of the bittide system, please see this paper.

Benefits of bittide

By removing non-determinism from inter-chip communication, bittide allows distributed systems to scale as if they were a single, larger chip. From the perspective of software and compilers, communication becomes a deterministic operation with known cost, enabling more aggressive scheduling, static placement of data, and compile-time reasoning about performance.

In effect, bittide shifts complexity away from runtime mechanisms and into design-time guarantees. This tradeoff enables large-scale systems that preserve the simplicity and predictability traditionally associated with small, tightly integrated hardware designs.

Who might be interested in bittide

The bittide system represents a novel approach to inter-chip communication that guarantees determinism. Certain workloads and compute architectures are better suited to take advantage of this property than others.

We believe the engineers most likely to benefit are:

  • Hardware engineers who work on latency-deterministic hardware
  • Compiler engineers who are interested in optimizing the mapping of computation onto distributed hardware
  • Software engineers who need fixed-latency output for their workloads

The requirements of a bittide system

Any engineering implementation requires tradeoffs. For bittide, these are:

  • a small amount of die space to handle input buffers and clock control
  • an adjustable clock source
  • some startup time on power-up to synchronize clocks before starting a workload
  • a compiler that is able to take advantage of a bittide-based system

Timeline of project

The bittide project is Apache 2.0 licensed and is being developed by QBayLogic and Google DeepMind.

Timeline

Mar 2025 - Paper on 8 node full bittide setup published (arXiv)

Aug 2024 - 8 node demo extended to handle logical latency and scheduling between nodes

June 2023 - 8 node proof-of-concept demo completed for clock synchronization

Jan 2023 - 2 node proof-of-concept demo completed for clock synchronization

Sep 2021 - Paper on bittide system theory published (arXiv)

Aug 2021 - Project start


  1. See this paper

Architecture Overview

Bittide as a communication link is inherently scalable. Bittide nodes can be connected point-to-point within and across boxes and racks without loss of per-cycle accuracy. The only parameters of a bittide network are link topology and link latency. Notably, link latency DOES NOT affect per-cycle accuracy, but it does affect inter-node latency.

Here's an example bittide network

Diagram not found at ../mdbook-drawio/bittide-network-simple-page-0.svg

Any Processing Element can be added to a bittide network, with the requirements that

  1. the Processing Element is able to run on the bittide clock domain, or some PLL multiple of it
  2. the Processing Element dedicates a small amount of die or FPGA space for the bittide interface
  3. ...that's about it, actually

The rest of this chapter is devoted to how bittide achieves logical synchrony on a per-node basis. For scheduling a computation over the network and failure recovery, see later chapters.

Bittide Node bringup sequence

For actually using a bittide network, these details are largely irrelevant. Still, it's useful to have a general understanding of how bittide works in practice.

Step 1: Booting the bittide clock

Diagram not found at ../mdbook-drawio/bittide-boot-diagram-page-0.svg

Contrary to what one might expect, there are actually TWO clocks in a bittide node: an adjustable clock (called the bittide clock) and a static clock. For most of the bittide boot and all of the Processing Element functioning, the bittide clock is used. However, since the adjustable clock is actually a somewhat complex piece of silicon, it itself needs to be set up. To do this, we have a Boot CPU running on the static clock, which has two jobs:

  1. set up the bittide clock by configuring its registers and setting the initial nominal frequency
  2. bring parts of the SerDes out of reset

Once the bringup sequence has moved to the next step, the Boot CPU and static clock are no longer used.

Step 2: Achieving clock syntony(ish)

Diagram not found at ../mdbook-drawio/bittide-cc-diagram-page-0.svg

At this point in the bringup sequence, all bittide clocks have been configured only with reference to local, imperfect, static clocks. The goal of this step is to align every bittide clock on the network, creating syntony.

Each bittide node starts sending a pseudorandom binary sequence over every link. The content of the data is not important, merely that data is being sent. The SerDes of every link locks onto the remote clock frequency embedded in the link in order to deserialize the data.

In typical networks, once the data is deserialized and converted to the local frequency, this remote clock information is discarded. In bittide, the remote clock frequency is stored locally in a register that counts every tick of the remote clock. Bittide also stores a counter for the local bittide clock. Collectively, we call these registers the Domain Difference Counter (or DDC). The Clock Control CPU reads in these counters and adjusts (FINC or FDEC) the local bittide clock depending on whether it detects the local clock is too fast or too slow.

Over time, the bittide clocks can be shown to settle to a common network frequency (up to a small delta). Once this state is achieved, the clock control circuits stay active to maintain a common frequency despite changes in heat and other operating conditions.
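To make this concrete, below is a minimal sketch (in Rust, the language of the firmware binaries) of the decision the Clock Control CPU makes on each control step. The names are illustrative rather than the actual firmware API, and the real control law is more refined than this simple sign-based one.

// One domain-difference counter pair, as described above: ticks observed
// from a neighbor's recovered clock versus ticks of the local bittide clock.
// (Field and function names are illustrative.)
struct DomainDiffCounter {
    remote_ticks: u64,
    local_ticks: u64,
}

impl DomainDiffCounter {
    // Positive means the local clock has produced more ticks than this
    // neighbor's clock, i.e. we are running fast relative to it.
    fn difference(&self) -> i64 {
        self.local_ticks as i64 - self.remote_ticks as i64
    }
}

enum ClockAdjust {
    SpeedUp,  // issue FINC to the adjustable clock
    SlowDown, // issue FDEC to the adjustable clock
    Hold,
}

// Look at the differences against all neighbors and nudge the local bittide
// clock in the direction that shrinks them.
fn clock_control_step(ddcs: &[DomainDiffCounter]) -> ClockAdjust {
    let total: i64 = ddcs.iter().map(|d| d.difference()).sum();
    if total > 0 {
        ClockAdjust::SlowDown
    } else if total < 0 {
        ClockAdjust::SpeedUp
    } else {
        ClockAdjust::Hold
    }
}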

Step 3: Actually achieving clock syntony

Diagram not found at ../mdbook-drawio/bittide-mu-diagram-page-0.svg

So far, we have achieved a common bittide network frequency. The dutiful engineer will have noticed that we only guaranteed that frequency up to a small delta, which seems counter to bittide's promise of cycle-level accuracy. And if we left it there, you would be right: cycle-ish-level accuracy doesn't get us much.

To absorb small wobbles between the remote bittide clocks and local bittide clock, a small Elastic Buffer (EB) is inserted. Notably, this buffer can be made much smaller than the corresponding input buffer used by most networking interfaces, because bittide guarantees a small delta of clock drift. The bittide system tries to keep the elastic buffer filled halfway at all times, so that it can absorb a wobble of (buffer size)/2 clock cycles.
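As a quick sanity check on that sizing argument, the margins of a centered elastic buffer can be written down directly (an illustrative helper, not part of the firmware):

// Remaining margins of an elastic buffer: how many more cycles of wobble it
// can absorb before overflowing or underflowing. A buffer of size B kept at
// occupancy B/2 can absorb B/2 cycles in either direction.
fn elastic_buffer_margins(size: u64, occupancy: u64) -> (u64, u64) {
    let until_overflow = size - occupancy; // room if the remote clock runs fast
    let until_underflow = occupancy;       // room if the remote clock runs slow
    (until_overflow, until_underflow)
}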

We can now say the bittide system has achieved cycle-level latency. But we have a new problem: no bittide node knows what that latency actually is between itself and its neighbors (nor even who its neighbors are).

Step 4: Determining logical latencies

Each node sends out a message containing its current clock cycle. Notably, this message sending does not need to be coordinated in any way. Each neighbor receives that clock cycle and records the local clock cycle on which it was received. Together, these two numbers define the one-way domain mapping, from which the logical latency is derived.
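The recorded pair and the numbers derived from it can be sketched as follows (names are illustrative; compare the glossary entries for domain mapping, logical latency, and roundtrip time):

// A one-way domain mapping: the neighbor's transmit cycle paired with the
// local cycle on which it was received.
struct DomainMapping {
    remote_tx_cycle: u64,
    local_rx_cycle: u64,
}

impl DomainMapping {
    // Logical latency of this direction of the link. Because the two counters
    // started at unrelated values, this number means nothing on its own
    // (hence "Uninterpretable Garbage Number"), but it stays constant for the
    // lifetime of the bittide network.
    fn logical_latency(&self) -> i64 {
        self.local_rx_cycle as i64 - self.remote_tx_cycle as i64
    }
}

// The roundtrip time of a link is the sum of its two logical latencies; unlike
// each individual term, it is a plain cycle count.
fn roundtrip(a_to_b: &DomainMapping, b_to_a: &DomainMapping) -> i64 {
    a_to_b.logical_latency() + b_to_a.logical_latency()
}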

How this "recording" happens in practice is an open technical discussion on bittide. We present the two main solutions below, both of which have been tested on the bittide hardware.

Option 1: Hard UGN capture

Hard UGN capture has two parts:

  1. For each node, the first Bittide Word it writes to the network is the local clock counter value
  2. Each node also has a hardware component, called the UGN capture, that sits between the EB and the RingBuffer (RB) on the receiving end. Its sole job is to wait for the first valid piece of bittide data and save it together with the local clock cycle at that time. It lets all data thereafter through to the RB

These two pieces, together, ensure the UGNs are captured between each node. The Management Unit (MU) can then read the UGN values and use them. Below we have an example of a UGN capture in practice.

Example Hard UGN capture
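As an illustration (with made-up numbers): suppose the first Bittide Word node A writes to its link toward node B carries A's local counter value 1000. The UGN capture in B sees this word arrive on B's local cycle 5250 and stores the pair (1000, 5250); every later word from A is passed straight through to the RB. The MU then reads this pair and derives the A-to-B logical latency from it, exactly as in the DomainMapping sketch above.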

Option 2: Soft UGN capture

Soft UGN capture is done by using the Management Unit (MU) CPU to read the RingBuffer (RB). The benefit of this approach is that we can reuse an existing component instead of creating a new one, saving space on the hardware.

However, the CPU approach comes with a major limitation - unlike the UGN Capture component, the MU cannot inspect every bittide word the same cycle it comes in.1 Therefore, we need to do two things:

  1. let the MU know which element in the Rx RB corresponds to the start of the Tx RB.
  2. have the neighbor node send its clock cycle at the start of the Tx RB.

This way, the MU does not need to inspect every element in the RB for the clock cycle, it just needs to inspect the one entry it knows the clock cycle will eventually be in. For more detail, see the Ringbuffer alignment section.

Once the relationship has been mapped, the sending node can send a "UGN event" (5 bittide words), which will be read by the receiving MU.

Step 5: Handover to the processing element

Diagram not found at ../mdbook-drawio/bittide-pe-diagram-page-0.svg

Once logical latency has been established, the bittide network guarantees these latencies until reboot. Control is handed over to the Processing Element. The Processing Element can operate as normal, without any knowledge of the bittide network. However, it now has guaranteed latency with all other nodes, allowing it to schedule and execute computations.

Glossary for bittide-specific terms

Bittide word - The smallest unit of data the bittide network transfers in one clock cycle. The word size of bittide is 64 bits.

Clock Control CPU (CC) - The CPU that reads in the domain differences between each neighbor node and the local node and sends a signal to the clock to speed up or slow down based on some function.

Boot CPU - The CPU that boots the adjustable (bittide) clock and configures its registers via SPI. It also brings the SerDes and Handshake out of reset; the Handshake then negotiates the 8b10b link with every other node.

Domain difference counter (DDC) - A pair of counters per link: one counts the ticks of the remote clock recovered by the SerDes, the other counts the ticks of the local bittide clock. The Clock Control CPU reads these counters to decide whether to speed up or slow down the local clock.

Domain mapping - Previously called the Uninterpretable Garbage Number (UGN), the domain mapping is the observation we make to obtain the logical latency. It consists of a transmit timestamp, expressed in clock cycles of your link partner. When this timestamp arrives in your own domain, it is stored together with the receive timestamp. This pair of counters is the domain mapping because it maps the transmit cycle of your link partner to your receive cycle. Both numbers are natural numbers (each represented as a u64).

Elastic buffer (EB) - A small buffer on each incoming link that absorbs residual wobble between the remote and local bittide clocks. It is kept filled halfway so it can absorb a wobble of (buffer size)/2 clock cycles in either direction.

Processing element (PE) - The computational element that is connected to the bittide network via the bittide interface. The computational element can be anything (hence the general term), but is often an ASIC or similar.

Handshake (needs better name) - A component that sends out a PRBS until the link is negotiated, then simply passes data through.

Logical latency - An integer derived from the domain mapping; we can use it to predict the exact clock cycle at which a message will arrive at our link partner.

Logic layer -

Management Unit (MU) - The CPU that performs elastic buffer centering and UGN capturing.

Nominal frequency - The target frequency the bittide clock is initially configured to by the Boot CPU, before clock control starts adjusting it.

Pseudorandom binary sequence (PRBS) - A data pattern with no meaningful content, sent during bringup so that each SerDes can lock onto the remote clock embedded in its link.

Physical layer -

(Aligned) Ring buffer (RB) - A local memory written on one side by hardware and accessed on the other by a CPU (Tx: written by the CPU, read by hardware; Rx: written by hardware, read by the CPU). An aligned ring buffer additionally accounts for the constant offset determined by the Ringbuffer Alignment Protocol.

Roundtrip time - Natural number that represents the number of clock cycles it takes for a message to make a roundtrip from node A to node B and back. It is the sum of the logical latencies l(a→b) and l(b→a).

Static clock (SCLK) - A reference clock that is not adjustable. Its only purpose is to provide a clock for the Boot CPU. Once the Boot CPU is finished, the static clock is no longer needed.

Glossary for non-bittide technical terms

Comma symbol - An alignment symbol in 8b10b link negotiation


  1. This difference occurs because the RingBuffer (RB) only supports accessing one address per cycle. The UGN Capture sits before the RingBuffer (RB) in the data pipeline, while the MU CPU sits behind the RB. So UGN capture can inspect every new word, while the MU needs to know which RB element to inspect. If the MU were to scan the entire RB, it would find the right element, but it would then not know on which clock cycle the element was put into the RB.

Hardware-in-the-Loop (HITL) Platform

This chapter describes the specific hardware setup used to realize a bittide system in our lab environment. The HITL platform is designed to implement the principles of bittide using real hardware components for experimentation, development, and validation for topologies up to 8 nodes.

Platform Overview

Requires diagram

The HITL platform consists of the following main components:

  • Host Computer: Executes experiments on the bittide system.
  • FPGA Boards: Implement clock control, bittide routing, and compute fabric.
  • Clock Adjustment Boards: Provide a single controllable clock source for each FPGA.
  • High-Speed Interconnects: Serial links (such as SFP+ or QSFP) connect the FPGAs for low-latency, high-bandwidth communication.
  • JTAG/UART dongles: Provide JTAG and UART interface for each FPGA to the host computer.
  • Ethernet connections: Each FPGA has an ethernet connection from a dedicated RJ45 port to the host computer.
  • SYNC_IN / SYNC_OUT: Ring connection through all FPGAs to get a sense of global time. This is used for starting / stopping experiments and mapping measurement data from all FPGAs to a single timeline. It is not part of the bittide architecture itself.

Example bittide System

This document describes a concrete example of a bittide design that can be programmed on the FPGAs of the HITL platform. It showcases the internal architecture and configuration of the system.

Architecture

Requires diagram

Components:

  • Transceivers
  • Domain difference counters
  • Clock control (CC)
  • SPI programmer for clock generator
  • Elastic buffers
  • UGN capture

Components:

  • Switch (Crossbar + calendar)
  • Management unit (MU)
  • 1x General purpose processing element (GPPE)

Management unit

Components:

  • RISCV core
  • Scatter unit
  • Gather unit
  • UART (For debugging)

The management unit has access to, and is responsible for, all calendars in the node.

Calendars:

  • Switch calendar
  • Management unit scatter calendar
  • Management unit gather calendar
  • GPPE scatter unit calendar
  • GPPE gather unit calendar

Other components:

  • UART arbiter
  • JTAG interconnect
  • Integrated logic analyzers
  • SYNC_IN / SYNC_OUT

Example System

Below is a diagram generated from the first page of bittide_concepts.drawio:

TODO: Insert diagram

Experiments

This chapter describes the existing infrastructure to support the creation and execution of experiments on the example bittide system.

Required components

  • Example design: Contains all bittide related FPGA logic.
  • Programs: The binary files that will be loaded into the CPUs that exist in the example design.
  • Driver: Runs on the host PC and communicates with the testbench.
  • Testbench: Contains the example design and other necessary logic such as ILAs.

Relevant infrastructure

Names which tools are relevant and where they live

Execution

Describes how tests are executed as part of a CI/CD pipeline.

Writing programs

Describes how to write new programs for the management unit, general purpose processing element or clock control.

Writing a driver

Describes how to write a driver for the experiment.

Adding your experiment to CI/CD

Describes how to add your experiment to the CI/CD pipeline.

Components

This section provides an overview of the main hardware components in the Bittide system. Each component plays a specific role in enabling efficient, synchronized, and flexible operation of the hardware platform.

Available Components

Calendar

The Calendar component is a programmable state machine that drives the configuration of other components (like the Scatter Unit, Gather Unit, or Switch) on a cycle-by-cycle basis. It allows for time-division multiplexing of resources by cycling through a sequence of pre-programmed configurations.

Architecture

Diagram not found at ../../mdbook-drawio/components-page-2.svg

The Calendar consists of two memories (buffers) to allow for double-buffering: an active calendar and a shadow calendar.

  • The active calendar drives the output signals to the target component.
  • The shadow calendar can be reconfigured via the Wishbone interface without interrupting the operation of the active calendar.

The active and shadow calendars can be swapped at the end of a metacycle. A metacycle is one full iteration through the active calendar's entries, including all repetitions.

Operation

Each entry in the calendar consists of:

  1. Configuration Data: The actual control signals for the target component.
  2. Repetition Count: The number of additional cycles this configuration should remain active.
    • 0: The entry is valid for 1 cycle.
    • N: The entry is valid for N + 1 cycles.

The Calendar iterates through the entries in the active buffer. When it reaches the end of the active buffer (determined by the configured depth), it loops back to the beginning, completing a metacycle.
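The entry encoding described above can be made concrete with a small sketch. The field names are illustrative (the authoritative definitions live in the bittide Clash sources), but the repetition arithmetic follows the rules listed above:

// One calendar entry (illustrative field names).
struct CalendarEntry<C> {
    // Control signals driven to the target component while this entry is active.
    config: C,
    // Number of *additional* cycles the entry stays active:
    // 0 means 1 cycle, N means N + 1 cycles.
    repetitions: u32,
}

// Length of one metacycle in clock cycles: one full pass over the active
// calendar, counting every repetition.
fn metacycle_length<C>(active: &[CalendarEntry<C>]) -> u64 {
    active.iter().map(|e| u64::from(e.repetitions) + 1).sum()
}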

Double Buffering and Swapping

To update the schedule:

  1. Software writes new entries into the shadow calendar using the Wishbone interface.
  2. Software configures the depth of the shadow calendar.
  3. Software arms the swap mechanism by writing to the swapActive register.

The swap does not happen immediately. It occurs only when the active calendar completes its current metacycle. This ensures that the schedule is always switched at a deterministic point, preventing glitches or partial schedules.
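Put together, the update sequence looks roughly like the sketch below. Only the swapActive register is named above; the other addresses and the write_u32 MMIO helper are assumptions made for illustration.

const SHADOW_BASE: usize = 0x0000;  // start of shadow-calendar entries (assumed)
const SHADOW_DEPTH: usize = 0x1000; // shadow-calendar depth register (assumed)
const SWAP_ACTIVE: usize = 0x1004;  // the swapActive register (address assumed)

unsafe fn write_u32(addr: usize, value: u32) {
    core::ptr::write_volatile(addr as *mut u32, value);
}

// Program a new schedule and arm the swap; the hardware performs the actual
// swap only once the active calendar finishes its current metacycle.
unsafe fn update_calendar(entries: &[u32]) {
    // 1. Write the new entries into the shadow calendar.
    for (i, entry) in entries.iter().enumerate() {
        write_u32(SHADOW_BASE + 4 * i, *entry);
    }
    // 2. Configure the depth of the shadow calendar.
    write_u32(SHADOW_DEPTH, entries.len() as u32);
    // 3. Arm the swap mechanism.
    write_u32(SWAP_ACTIVE, 1);
}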

Scatter Unit

The Scatter Unit is a hardware component designed to receive data frames from a Bittide link and store them into a local memory. It uses a double-buffered memory architecture and a configurable Calendar to determine the write address for each incoming frame. This allows for cycle-accurate, deterministic data reception.

Architecture

Diagram not found at ../../mdbook-drawio/components-page-0.svg

The Scatter Unit consists of:

  • Double-Buffered Memory: Two memory buffers (Active and Shadow).
    • Active Buffer: Receives data from the Bittide link.
    • Shadow Buffer: Can be read by the CPU via the Wishbone interface.
  • Calendar: Determines the write address in the Active Buffer for each incoming frame.
  • Wishbone Interface: Allows the CPU to read received data and monitor metacycle progress.

Operation

  1. Data Reception:

    • In each cycle, if a valid frame arrives from the Bittide link, the Calendar provides the write address for the Active Buffer.
    • The frame is written to the Active Buffer at that address.
  2. Buffer Swapping:

    • The Active and Shadow buffers are swapped at the end of each metacycle.
    • A metacycle is defined by the Calendar's schedule.
    • After the swap, the data that was just received becomes available in the Shadow Buffer for the CPU to read.
  3. CPU Access:

    • The CPU reads the data from the Shadow Buffer using the Wishbone interface.
    • To ensure data consistency, the CPU can synchronize with the metacycle using the stalling mechanism.

Gather Unit

The Gather Unit is a hardware component designed to transmit data frames over a Bittide link. It uses a double-buffered memory architecture and a configurable Calendar to determine which data to send in each cycle. This ensures deterministic data transmission.

Architecture

Diagram not found at ../../mdbook-drawio/components-page-1.svg

The Gather Unit consists of:

  • Double-Buffered Memory: Two memory buffers (Active and Shadow).
    • Active Buffer: Provides data to the Bittide link.
    • Shadow Buffer: Can be written to by the CPU via the Wishbone interface.
  • Calendar: Determines the read address in the Active Buffer for each outgoing frame.
  • Wishbone Interface: Allows the CPU to write data to be transmitted and monitor metacycle progress.

Operation

  1. Data Transmission:

    • Each clock tick, the Calendar provides the read address for the Active Buffer.
    • The data at that address is read from the Active Buffer and sent over the Bittide link.
  2. Buffer Swapping:

    • The Active and Shadow buffers are swapped at the end of each metacycle.
    • A metacycle is defined by the Calendar's schedule.
    • After the swap, the data that was written by the CPU becomes the active data being transmitted.
  3. CPU Access:

    • The CPU writes the data to the Shadow Buffer using the Wishbone interface.
    • Byte Enables: The Wishbone interface is 32-bit, but the Gather Unit memory is 64-bit. The hardware uses the byte enables to allow writing to the upper or lower 32 bits of the 64-bit memory words (see the sketch after this list).
    • To ensure data consistency, the CPU can synchronize with the metacycle using the stalling mechanism.
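The byte-enable mechanism can be sketched as two 32-bit stores per 64-bit memory word. The shadow-buffer base address, the write_u32 helper, and the little-endian half-word order are assumptions for illustration; the split itself follows the description above.

const GATHER_SHADOW_BASE: usize = 0x0000; // base of the shadow buffer (assumed)

unsafe fn write_u32(addr: usize, value: u32) {
    core::ptr::write_volatile(addr as *mut u32, value);
}

// Store `word` at 64-bit index `idx` in the Gather Unit's shadow buffer as two
// 32-bit Wishbone writes. The hardware uses the byte enables of each access to
// update only the addressed half of the 64-bit memory word.
unsafe fn write_gather_word(idx: usize, word: u64) {
    let base = GATHER_SHADOW_BASE + 8 * idx;
    write_u32(base, word as u32);             // lower 32 bits
    write_u32(base + 4, (word >> 32) as u32); // upper 32 bits
}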

Software UGN Demo

This chapter describes the specific hardware setup used to perform a test in which each node determines the UGNs to its neighbors on its own, without relying on a host PC that can read all of the information from each node in the system. This is accomplished through the firmware on the GPPE.

Architecture

Diagram not found at ../../mdbook-drawio/softUgnDemo-page-0.svg

Initialization sequence

  1. "Boot" CPU
    1. Gets programmed by the host
    2. Programs the clock boards
    3. Brings the transceiver block out of reset (enabling the bittide domain / other CPUs)
    4. Activates each transceiver channel and waits until they've negotiated a link with their neighbors.
    5. Prints "all done" message to UART.
  2. Clock control CPU
    1. Gets programmed by the host
    2. Calibrates clocks
    3. Prints "all done" message to UART.
    4. (Keeps calibrating clocks.)
  3. Management unit CPU
    1. Gets programmed by the host
    2. Centers elastic buffers
    3. Initializes the scatter/gather calendars.
    4. Sets the channels to "user" mode. This makes the channels accept data from the outside world instead of their negotiation state machinery.
    5. Prints UGNs captured by hardware component to UART.
  4. General purpose processing element (PE)
    1. Gets programmed by the host
    2. Waits for the management unit to initialize the calendars
    3. Calls the "c_main" and runs the software UGN discovery protocol
    4. Prints the discovered UGNs over UART

Components:

  • Boot CPU (BOOT)
  • Transceivers
  • Domain difference counters
  • Clock control (CC)
  • Elastic buffers (one per incoming transceiver link)
  • Hardware UGN capture (for comparison)

Components:

  • Management unit (MU)
  • 1x general purpose processing element (PE)
  • 7 scatter and gather units, one of each per elastic buffer

Management unit

Connected components:

  • Timer
  • UART (for debugging)
  • FPGA DNA register

The management unit has access to and is responsible for all scatter/gather calendars in the node. In this demo, it programs the calendars with increasing values (0, 1, 2, ...), effectively creating a transparent link for the GPPE to access the scatter/gather units directly. It also centers the elastic buffers to ensure stable communication.

To change the binary run on this CPU, one may either:

  • Edit bittide-instances/src/bittide/Instances/Hitl/SoftUgnDemo/Driver.hs, line 215 (at time of writing) to use another binary instead of soft-ugn-mu
  • Edit the source files in firmware-binaries/soft-ugn-mu/ to change the binary pre-selected by the driver function

General purpose processing element

This component is labeled as "PE" in the diagram above. Connected components:

  • 7 scatter and gather units, one of each per elastic buffer
  • UART (for debugging)
  • Timer
  • FPGA DNA register

The general purpose processing element runs the soft-ugn-gppe firmware, which implements a distributed protocol to discover the Uninterpretable Garbage Numbers (UGNs) of the network links. For a detailed description of the procedure, see Software UGN Discovery Procedure. It uses the scatter/gather units (enabled by the MU) to exchange timestamped messages with neighbors, calculating the propagation delays in software.

  • UART arbiter
  • JTAG interconnect
  • Integrated logic analyzers
  • SYNC_IN/SYNC_OUT

Running tests

One may specifically run the software UGN demo test by making a .github/synthesis/debug.json with the following contents:

[
  {"top": "softUgnDemoTest",       "stage": "test", "cc_report": true}
]

At the time of writing, the clock control CPU stabilizes the system. The driver running on the host (bittide-instances/src/bittide/Instances/Hitl/SoftUgnDemo/Driver.hs) then releases the reset of the management unit CPU. In turn, this CPU will center the elastic buffers, initialize the scatter/gather calendars, and print out the UGNs captured using the hardware UGN capture component over UART. Finally, the general purpose processing element is started. It executes the software UGN discovery protocol and prints the results over UART. The host driver then compares the hardware-captured UGNs with the software-discovered UGNs to verify correctness.

Tests are configured to run the following binaries on the system's CPUs:

  • Boot CPU: switch-demo1-boot (firmware-binaries/demos/switch-demo1-boot)
  • Clock control CPU: clock-control (firmware-binaries/demos/clock-control)
  • Management unit: soft-ugn-mu (firmware-binaries/demos/soft-ugn-mu)
  • General purpose processing element: soft-ugn-gppe (firmware-binaries/demos/soft-ugn-gppe)

One may change this by either:

  1. Changing the driver function so that it loads different binaries onto the CPUs. This may be accomplished by changing which binary name is used with each of the initGdb function calls.
  2. Changing the source code for the binaries. The locations for them are listed above.

Switch Demo with ASIC processing element

This chapter describes the specific hardware setup used to perform a demonstration of the bittide switch.

Architecture

Diagram not found at ../../mdbook-drawio/switchDemoAsic-page-2.svg

Initialization sequence

  1. "Boot" CPU
    1. Gets programmed by the host
    2. Programs the clock boards
    3. Brings the transceiver block out of reset (enabling the bittide domain / other CPUs)
    4. Activates each transceiver channel and waits until they've negotiated a link with their neighbors.
    5. Prints "all done" message to UART.
  2. Clock control CPU
    1. Gets programmed by the host
    2. Calibrates clocks
    3. Prints "all done" message to UART.
    4. (Keeps calibrating clocks.)
  3. Management unit CPU
    1. Gets programmed by the host
    2. Centers elastic buffers
    3. Sets the channels to "user" mode. This makes the channels accept data from the outside world instead of their negotiation state machinery.
    4. Prints UGNs captured by hardware component to UART.

The ASIC processing element comes out of reset as soon as the bittide domain is enabled.

Components:

  • Boot CPU (BOOT)
  • Transceivers
  • Domain difference counters
  • Clock control (CC)
  • Elastic buffers (one per incoming transceiver link)
  • Hardware UGN capture

Components:

  • Management unit (MU)
  • Crossbar
  • Crossbar calendar
  • Null link
  • ASIC processing element
  • 1x scatter/gather units

Management unit

  • Timer
  • UART (for debugging)
  • FPGA DNA register

The management unit has access to and is responsible for all scatter/gather calendars in the node, as well as the crossbar calendar.

To change the binary run on this CPU, one may either:

  • Edit bittide-instances/src/bittide/Instances/Hitl/SwitchDemo/Driver.hs, line 500 (at time of writing) to use another binary instead of switch-demo1-mu
  • Edit the source files in firmware-binaries/switch-demo1-mu/ to change the binary pre-selected by the driver function

Application specific processing element

This component is labeled as "PE" in the diagram above. It is directly connected to the output of the crossbar without any buffering; as such, it can only work on data as it is streamed to it, in a manner determined by the crossbar calendar.

The processing element itself can be instructed to read from its incoming link for a configurable number of clock cycles, starting at a specific cycle. Similarly, it can write to its outgoing link for a configurable number of clock cycles, starting at a specific cycle. In this demo, UGNs are received from all nodes in the system, and from them a "schedule" is created for when each PE reads and writes on its link. If the schedule is correct and the bittide property holds, a "thread" through all nodes is created. For more information see this presentation.

  • UART arbiter
  • JTAG interconnect
  • Integrated logic analyzers
  • SYNC_IN/SYNC_OUT

Running tests

One may specifically run the switch demo test by making a .github/synthesis/debug.json with the following contents:

[
  {"top": "switchDemoTest",       "stage": "test", "cc_report": true}
]

At the time of writing, the clock control CPU stabilizes the system. The driver running on the host (bittide-instances/src/bittide/Instances/Hitl/SwitchDemo/Driver.hs) then releases the reset of the management unit CPU. In turn, this CPU will center the elastic buffers and print out the UGNs captured using the hardware UGN capture component over UART. The behavior of the application specific processing element is explained in Application specific processing element.

Tests are configured to run the following binaries on the system's CPUs:

  • Boot CPU: switch-demo1-boot (firmware-binaries/demos/switch-demo1-boot)
  • Clock control CPU: clock-control (firmware-binaries/demos/clock-control)
  • Management unit: switch-demo1-mu (firmware-binaries/demos/switch-demo1-mu)

One may change this by either:

  1. Changing the driver function so that it loads different binaries onto the CPUs. This may be accomplished by changing which binary name is used with each of the initGdb function calls.
  2. Changing the source code for the binaries. The locations for them are listed above.

Switch Demo with GPPE

This chapter describes the specific hardware setup used to perform a demonstration of the bittide switch.

Architecture

Diagram not found at ../../mdbook-drawio/switchDemoGppe-page-1.svg

Initialization sequence

  1. "Boot" CPU
    1. Gets programmed by the host
    2. Programs the clock boards
    3. Brings the transceiver block out of reset (enabling the bittide domain / other CPUs)
    4. Activates each transceiver channel and waits until they've negotiated a link with their neighbors.
    5. Prints "all done" message to UART.
  2. Clock control CPU
    1. Gets programmed by the host
    2. Calibrates clocks
    3. Prints "all done" message to UART.
    4. (Keeps calibrating clocks.)
  3. Management unit CPU
    1. Gets programmed by the host
    2. Centers elastic buffers
    3. Sets the channels to "user" mode. This makes the channels accept data from the outside world instead of their negotiation state machinery.
    4. Prints UGNs captured by hardware component to UART.
  4. General purpose processing element (PE)
    1. Gets programmed by the host
    2. Prints "Hello!" message over UART

Components:

  • Boot CPU (BOOT)
  • Transceivers
  • Domain difference counters
  • Clock control (CC)
  • Elastic buffers (one per incoming transceiver link)
  • Hardware UGN capture

Components:

  • Management unit (MU)
  • Crossbar
  • Crossbar calendar
  • Null link
  • 1x general purpose processing element (PE)
  • 1x scatter/gather units

Management unit

  • Timer
  • UART (for debugging)
  • FPGA DNA register

The management unit has access to and is responsible for all scatter/gather calendars in the node, as well as the crossbar calendar.

To change the binary run on this CPU, one may either:

  • Edit bittide-instances/src/bittide/Instances/Hitl/SwitchDemoGppe/Driver.hs, line 215 (at time of writing) to use another binary instead of switch-demo2-mu
  • Edit the source files in firmware-binaries/switch-demo2-mu/ to change the binary pre-selected by the driver function

General purpose processing element

This component is labeled as "PE" in the diagram above. Connected components:

  • 7 scatter and gather units, one of each per elastic buffer
  • UART (for debugging)
  • Timer
  • Scatter/gather unit connection to crossbar output
  • FPGA DNA register

The general purpose processing element is a drop-in replacement for the ASIC processing element. It has no functionality other than printing "Hello!" over UART.

  • UART arbiter
  • JTAG interconnect
  • Integrated logic analyzers
  • SYNC_IN/SYNC_OUT

Running tests

One may specifically run the switch demo with GPPE test by making a .github/synthesis/debug.json with the following contents:

[
  {"top": "switchDemoGppeTest",       "stage": "test", "cc_report": true}
]

At the time of writing, the clock control CPU stabilizes the system. The driver running on the host (bittide-instances/src/bittide/Instances/Hitl/SwitchDemoGppe/Driver.hs) then releases the reset of the management unit CPU. In turn, this CPU will center the elastic buffers and print out the UGNs captured using the hardware UGN capture component over UART. Finally, the general purpose processing element has its reset deasserted. It simply prints "Hello!" over UART.

Tests are configured to run the following binaries on the system's CPUs:

  • Boot CPU: switch-demo1-boot (firmware-binaries/demos/switch-demo1-boot)
  • Clock control CPU: clock-control (firmware-binaries/demos/clock-control)
  • Management unit: switch-demo2-mu (firmware-binaries/demos/switch-demo2-mu)
  • General purpose processing element: switch-demo2-gppe (firmware-binaries/demos/switch-demo2-gppe)

One may change this by either:

  1. Changing the driver function so that it loads different binaries onto the CPUs. This may be accomplished by changing which binary name is used with each of the initGdb function calls.
  2. Changing the source code for the binaries. The locations for them are listed above.

Ringbuffer Alignment Protocol

Definitions

  • Transmitting Ringbuffer (TX): Local memory written by CPU, read by hardware using a wrapping counter.
  • Receiving Ringbuffer (RX): Local memory written by hardware using a wrapping counter, read by CPU.

Context

In a bittide system, nodes operate in a globally synchronous manner despite being asynchronous devices with unknown start times. Communication occurs via ringbuffers. When TX and RX ringbuffers are the same size, the address mapping between them is constant, determined by (logical) network latency and the start time difference between nodes.

Objective

Determine the constant offset between the transmit address (TX) and receive address (RX) for each link to enable reliable asynchronous communication.

Alignment Algorithm

  1. Initialization: Set all ringbuffers to a uniform default size.
  2. Announce: Each CPU writes a recognizable message containing the non-zero state ALIGNMENT_ANNOUNCE to index 0 of its TX ringbuffer and clears all other positions.
  3. Search: Each CPU scans its RX ringbuffer for the ALIGNMENT_ANNOUNCE message.
  4. Determine Offset: Upon finding the message at RX_Index, the CPU stores RX_Index to be used as the future read offset.
  5. Acknowledge: The CPU updates index 0 of its TX ringbuffer with the state ALIGNMENT_RECEIVED.
  6. Confirm: The CPU monitors the RX ringbuffer until it receives ALIGNMENT_RECEIVED from its neighbor.
  7. Finalize: Once the CPU has received ALIGNMENT_RECEIVED, the link is aligned. The offset is stored for future communication and we proceed to normal operation (a sketch of this procedure follows below).
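A minimal sketch of the algorithm above as it could run on each CPU, per link. The Ringbuffers trait and the concrete ALIGNMENT_ANNOUNCE / ALIGNMENT_RECEIVED encodings are assumptions; the protocol only requires that the announce message is non-zero and distinguishable from the acknowledgement.

const ALIGNMENT_ANNOUNCE: u64 = 1; // illustrative encoding
const ALIGNMENT_RECEIVED: u64 = 2; // illustrative encoding

// Hypothetical access to one link's TX and RX ringbuffers (equal sizes).
trait Ringbuffers {
    fn len(&self) -> usize;
    fn write_tx(&mut self, index: usize, value: u64);
    fn read_rx(&self, index: usize) -> u64;
}

// Run the alignment protocol for one link; returns the RX offset to use for
// all further reads on this link.
fn align_link<R: Ringbuffers>(rb: &mut R) -> usize {
    // 2. Announce: write ALIGNMENT_ANNOUNCE to TX index 0, clear the rest.
    rb.write_tx(0, ALIGNMENT_ANNOUNCE);
    for i in 1..rb.len() {
        rb.write_tx(i, 0);
    }
    // 3./4. Search the RX ringbuffer until the announce message appears.
    let rx_offset = loop {
        if let Some(i) = (0..rb.len()).find(|&i| rb.read_rx(i) == ALIGNMENT_ANNOUNCE) {
            break i;
        }
    };
    // 5. Acknowledge to the neighbor.
    rb.write_tx(0, ALIGNMENT_RECEIVED);
    // 6./7. Confirm: wait for the neighbor's acknowledgement, then the offset
    // is final and normal operation can begin.
    while rb.read_rx(rx_offset) != ALIGNMENT_RECEIVED {}
    rx_offset
}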

Resulting Interface: AlignedRingbuffer

Upon successful alignment, the system can instantiate an AlignedRingbuffer abstraction. This interface handles the offset calculations transparently, allowing the CPU to read and write to the ringbuffers without concern for alignment.

Communication Challenges

While the AlignedRingbuffer provides logical connectivity, the physical link remains unreliable due to the interaction between the read/write counters of the ringbuffers and asynchronous CPU access:

  1. Continuous Hardware Operation: The hardware continuously cycles through the ringbuffers at the network link speed.
  2. Asynchronous CPU Access: The CPU operates asynchronously and often slower than the network link.

This leads to specific failure modes:

  • Data Corruption (Pointer Overtaking):
    • TX Side: If the hardware's read pointer overtakes the CPU's write pointer during a write, a torn frame is sent.
    • RX Side: If the hardware's write pointer overtakes the CPU's read pointer during a read, the message is corrupted.
  • Data Loss: If the CPU does not read from the RX ringbuffer every iteration, the hardware will overwrite unread data.
  • Data duplication: If the CPU does not write to the TX ringbuffer every iteration, the hardware will resend old data.

Reliable communication requires a higher-level protocol to handle these errors. See the Asynchronous Communication Protocol for a proposed solution using the smoltcp library to implement a reliable TCP/IP layer over the AlignedRingbuffer.

Asynchronous Communication Protocol

Context

In a bittide system, we need asynchronous communication between nodes, particularly during the boot phase. The Ringbuffer Alignment Protocol provides an AlignedRingbuffer abstraction that allows for packet exchange.

However, as described in that protocol's documentation, the raw AlignedRingbuffer link is unreliable, subject to packet corruption and loss due to hardware/software speed mismatches.

Objective

Establish a reliable, asynchronous, point-to-point communication channel between nodes over the potentially unreliable AlignedRingbuffer links.

Proposed Solution

Leverage the TCP/IP protocol suite to handle error detection, retransmission, and flow control. We will use the smoltcp library, a lightweight TCP/IP stack designed for embedded systems, to implement this layer. Note that future versions of bittide will probably use a bespoke network stack for asynchronous communication.

Implementation Strategy

1. Network Interface (smoltcp::phy::Device)

We will implement the smoltcp::phy::Device trait for the Aligned Ringbuffer.

  • Medium: Use Medium::Ip to minimize overhead (no Ethernet headers required for point-to-point).
  • MTU: Set to 1500 bytes (standard Ethernet size) to accommodate typical payloads. A sketch of the resulting capabilities follows this list.
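The full Device implementation is out of scope here, but the capabilities chosen above would look roughly as follows with smoltcp (assuming the medium-ip feature; the surrounding trait impl for the AlignedRingbuffer is omitted):

use smoltcp::phy::{DeviceCapabilities, Medium};

// Sketch of the capabilities() method of a Device implementation backed by an
// AlignedRingbuffer.
fn capabilities() -> DeviceCapabilities {
    let mut caps = DeviceCapabilities::default();
    caps.medium = Medium::Ip;          // point-to-point IP, no Ethernet framing
    caps.max_transmission_unit = 1500; // standard Ethernet-sized MTU
    caps
}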

2. Framing & Alignment

  • The underlying Aligned Ringbuffer abstraction ensures packets are read from the correct aligned memory location.
  • Packet Boundaries: The length of each packet will be derived directly from the IP Header length field.

3. Addressing

  • Topology: Initially restricted to Peer-to-Peer links.
  • IP Assignment: Placeholder IPs for now; if necessary, we use static addresses derived from unique hardware identifiers (e.g., FPGA DNA or Port ID) to avoid the complexity of DHCP (see the sketch below).
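One way the static addressing could be realized (the 10.x.y.z scheme and the use of the low DNA byte are assumptions made for illustration, not a fixed part of the design):

use smoltcp::wire::Ipv4Address;

// Derive a static point-to-point IPv4 address from hardware identifiers.
fn static_ip(fpga_dna: u64, port_id: u8) -> Ipv4Address {
    let node = (fpga_dna & 0xff) as u8; // low byte of the FPGA DNA
    Ipv4Address::new(10, 0, node, port_id)
}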

4. Demo Application

Develop a proof-of-concept application that:

  1. Initializes the Aligned Ringbuffer.
  2. Sets up a smoltcp interface.
  3. Establishes a TCP connection between two nodes.
  4. Transfers data to verify reliability against induced packet loss/corruption.

Assumptions

  • An Aligned Ringbuffer abstraction exists that provides a read/write interface for single aligned packets.
  • The ringbuffer size is sufficient to hold at least one MTU-sized packet plus overhead.