Introduction

Welcome to the bittide documentation book. bittide is a decentralized synchronization scheme that aims to bring the benefits of single-clock, single-chip design to multiple independently clocked chips at datacenter scale. This book serves as an introduction to bittide and its concepts, bittide-hardware-as-a-library, and the experimentation platform(s) currently in use. Our goal is to provide the necessary information to develop and deploy experiments on bittide-based systems.

What is bittide

Hardware-in-the-Loop (HITL) Platform

This chapter describes the specific hardware setup used to realize a bittide system in our lab environment. The HITL platform is designed to implement the principles of bittide using real hardware components, supporting experimentation, development, and validation of topologies of up to 8 nodes.

Platform Overview

TODO: Insert diagram

The HITL platform consists of the following main components:

  • Host Computer: Executes experiments on the bittide system.
  • FPGA Boards: Implement clock control, bittide routing, and compute fabric.
  • Clock Adjustment Boards: Provide a single controllable clock source for each FPGA.
  • High-Speed Interconnects: Serial links (such as SFP+ or QSFP) connect the FPGAs for low-latency, high-bandwidth communication.
  • JTAG/UART dongles: Provide JTAG and UART interface for each FPGA to the host computer.
  • Ethernet connections: Each FPGA has an Ethernet connection from a dedicated RJ45 port to the host computer.
  • SYNC_IN / SYNC_OUT: A ring connection through all FPGAs that provides a sense of global time. It is used for starting/stopping experiments and for mapping measurement data from all FPGAs onto a single timeline. It is not part of the bittide architecture itself.

Example bittide System

This document describes a concrete example of a bittide design that can be programmed on the FPGAs of the HITL platform. It showcases the internal architecture and configuration of the system.

Architecture

TODO: Insert diagram

Components:

  • Transceivers
  • Domain difference counters
  • Clock control (CC)
  • SPI programmer for clock generator
  • Elastic buffers
  • UGN capture

Components:

  • Switch (Crossbar + calendar)
  • Management unit (MU)
  • 1x General purpose processing element (GPPE)

Management unit

Components:

  • RISCV core
  • Scatter unit
  • Gather unit
  • UART (For debugging)

The management unit has access to, and is responsible for, all calendars in the node.

Calendars:

  • Switch calendar
  • Management unit scatter calendar
  • Management unit gather calendar
  • GPPE scatter calendar
  • GPPE gather calendar

Debug and test infrastructure:

  • UART arbiter
  • JTAG interconnect
  • Integrated logic analyzers
  • SYNC_IN / SYNC_OUT

Example System

Below is a diagram generated from the first page of bittide_concepts.drawio:

TODO: Insert diagram

Experiments

This chapter describes the existing infrastructure to support the creation and execution of experiments on the example bittide system.

Required components

  • Example design: Contains all bittide-related FPGA logic.
  • Programs: The binaries that are loaded onto the CPUs in the example design.
  • Driver: Runs on the host PC and communicates with the testbench.
  • Testbench: Contains the example design and other necessary logic, such as ILAs.

Relevant infrastructure

Describes which tools are relevant and where they live.

Execution

Describes how tests are executed as part of a CI/CD pipeline.

Writing programs

Describes how to write new programs for the management unit, general purpose processing element or clock control.

Writing a driver

Describes how to write a driver for the experiment.

Adding your experiment to CI/CD

Describes how to add your experiment to the CI/CD pipeline.

Components

This section provides an overview of the main hardware components in the bittide system. Each component plays a specific role in enabling efficient, synchronized, and flexible operation of the hardware platform.

Available Components

Calendar

The Calendar component is a programmable state machine that drives the configuration of other components (like the Scatter Unit, Gather Unit, or Switch) on a cycle-by-cycle basis. It allows for time-division multiplexing of resources by cycling through a sequence of pre-programmed configurations.

Architecture

TODO: Insert diagram (components-page-2.svg)

The Calendar consists of two memories (buffers) to allow for double-buffering: an active calendar and a shadow calendar.

  • The active calendar drives the output signals to the target component.
  • The shadow calendar can be reconfigured via the Wishbone interface without interrupting the operation of the active calendar.

The active and shadow calendars can be swapped at the end of a metacycle. A metacycle is one full iteration through the active calendar's entries, including all repetitions.

Operation

Each entry in the calendar consists of:

  1. Configuration Data: The actual control signals for the target component.
  2. Repetition Count: The number of additional cycles this configuration should remain active.
    • 0: The entry is valid for 1 cycle.
    • N: The entry is valid for N + 1 cycles.

The Calendar iterates through the entries in the active buffer. When it reaches the end of the active buffer (determined by the configured depth), it loops back to the beginning, completing a metacycle.
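
The semantics of the repetition counts and the metacycle boundary can be summarized in a small software model. The Rust sketch below is illustrative only; the entry layout is an assumption for this example, not the hardware's actual encoding.

// Software model of calendar iteration (illustrative, not the real encoding).
#[derive(Clone, Copy)]
struct CalendarEntry<Cfg: Copy> {
    config: Cfg,      // control signals for the target component
    repetitions: u16, // entry stays active for repetitions + 1 cycles
}

fn run_metacycle<Cfg: Copy>(active: &[CalendarEntry<Cfg>], mut drive: impl FnMut(Cfg)) {
    for entry in active {
        // repetitions == 0 means the entry is valid for exactly 1 cycle.
        for _ in 0..=u32::from(entry.repetitions) {
            drive(entry.config);
        }
    }
    // Reaching the end of the active buffer completes one metacycle; the
    // hardware then wraps to the first entry (and may swap buffers here).
}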

Double Buffering and Swapping

To update the schedule:

  1. Software writes new entries into the shadow calendar using the Wishbone interface.
  2. Software configures the depth of the shadow calendar.
  3. Software arms the swap mechanism by writing to the swapActive register.

The swap does not happen immediately. It occurs only when the active calendar completes its current metacycle. This ensures that the schedule is always switched at a deterministic point, preventing glitches or partial schedules.
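
Seen from software, the update sequence amounts to a few memory-mapped writes over the Wishbone bus. The sketch below is a minimal illustration: the register offsets and the raw u32 entry encoding are assumptions, and the actual register map is defined by the hardware sources.

// Hypothetical register offsets (in 32-bit words); the real map differs.
const SHADOW_BASE: usize = 0x00;  // start of shadow calendar entries
const SHADOW_DEPTH: usize = 0x40; // shadow calendar depth register
const SWAP_ACTIVE: usize = 0x41;  // arming register ("swapActive")

/// Writes a new schedule into the shadow calendar and arms the swap.
/// The swap itself only happens at the next metacycle boundary.
unsafe fn update_schedule(cal: *mut u32, entries: &[u32]) {
    // 1. Write the new entries into the shadow calendar.
    for (i, &entry) in entries.iter().enumerate() {
        cal.add(SHADOW_BASE + i).write_volatile(entry);
    }
    // 2. Configure the depth of the shadow calendar.
    cal.add(SHADOW_DEPTH).write_volatile(entries.len() as u32);
    // 3. Arm the swap mechanism; hardware swaps at the metacycle boundary.
    cal.add(SWAP_ACTIVE).write_volatile(1);
}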

Scatter Unit

The Scatter Unit is a hardware component that receives data frames from a bittide link and stores them in a local memory. It uses a double-buffered memory architecture and a configurable Calendar to determine the write address for each incoming frame. This allows for cycle-accurate, deterministic data reception.

Architecture

TODO: Insert diagram (components-page-0.svg)

The Scatter Unit consists of:

  • Double-Buffered Memory: Two memory buffers (Active and Shadow).
    • Active Buffer: Receives data from the bittide link.
    • Shadow Buffer: Can be read by the CPU via the Wishbone interface.
  • Calendar: Determines the write address in the Active Buffer for each incoming frame.
  • Wishbone Interface: Allows the CPU to read received data and monitor metacycle progress.

Operation

  1. Data Reception:

    • In each cycle, if a valid frame arrives from the bittide link, the Calendar provides the write address for the Active Buffer.
    • The frame is written to the Active Buffer at that address.
  2. Buffer Swapping:

    • The Active and Shadow buffers are swapped at the end of each metacycle.
    • A metacycle is defined by the Calendar's schedule.
    • After the swap, the data that was just received becomes available in the Shadow Buffer for the CPU to read.
  3. CPU Access:

    • The CPU reads the data from the Shadow Buffer using the Wishbone interface.
    • To ensure data consistency, the CPU can synchronize with the metacycle using the stalling mechanism (see the sketch below).
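
A minimal sketch of such a CPU-side read, assuming a hypothetical memory map in which the shadow buffer and a blocking metacycle register are exposed over Wishbone; the real addresses come from the generated memory maps.

// Hypothetical addresses; the real ones come from the generated memory map.
const SCATTER_MEM: *const u32 = 0x4000_0000 as *const u32; // shadow buffer
const METACYCLE_REG: *const u32 = 0x4000_0ffc as *const u32; // stalls until swap

/// Reads `out.len()` words starting at `addr` in the shadow buffer.
unsafe fn recv_frame(addr: usize, out: &mut [u32]) {
    // Reading the metacycle register stalls the CPU until the buffers swap,
    // so the shadow buffer below holds a complete, consistent metacycle.
    let _ = METACYCLE_REG.read_volatile();
    for (i, word) in out.iter_mut().enumerate() {
        *word = SCATTER_MEM.add(addr + i).read_volatile();
    }
}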

Gather Unit

The Gather Unit is a hardware component that transmits data frames over a bittide link. It uses a double-buffered memory architecture and a configurable Calendar to determine which data to send in each cycle. This ensures deterministic data transmission.

Architecture

TODO: Insert diagram (components-page-1.svg)

The Gather Unit consists of:

  • Double-Buffered Memory: Two memory buffers (Active and Shadow).
    • Active Buffer: Provides data to the bittide link.
    • Shadow Buffer: Can be written to by the CPU via the Wishbone interface.
  • Calendar: Determines the read address in the Active Buffer for each outgoing frame.
  • Wishbone Interface: Allows the CPU to write data to be transmitted and monitor metacycle progress.

Operation

  1. Data Transmission:

    • In each clock cycle, the Calendar provides the read address for the Active Buffer.
    • The data at that address is read from the Active Buffer and sent over the bittide link.
  2. Buffer Swapping:

    • The Active and Shadow buffers are swapped at the end of each metacycle.
    • A metacycle is defined by the Calendar's schedule.
    • After the swap, the data that was written by the CPU becomes the active data being transmitted.
  3. CPU Access:

    • The CPU writes the data to the Shadow Buffer using the Wishbone interface.
    • Byte Enables: The Wishbone interface is 32-bit, but the Gather Unit memory is 64-bit. The hardware uses the byte enables to allow writing to the upper or lower 32 bits of the 64-bit memory words (see the sketch below).
    • To ensure data consistency, the CPU can synchronize with the metacycle using the stalling mechanism.
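
Seen from software, writing one 64-bit gather word then takes two naturally aligned 32-bit stores; the hardware's byte enables steer them into the lower and upper halves. The base address below is an assumption for illustration.

const GATHER_MEM: *mut u32 = 0x5000_0000 as *mut u32; // hypothetical shadow buffer base

/// Writes one 64-bit word into the Gather Unit's shadow buffer at `index`.
unsafe fn write_gather_word(index: usize, value: u64) {
    // Each 64-bit memory word spans two consecutive 32-bit bus addresses.
    GATHER_MEM.add(2 * index).write_volatile(value as u32);             // lower 32 bits
    GATHER_MEM.add(2 * index + 1).write_volatile((value >> 32) as u32); // upper 32 bits
}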

Software UGN Demo

This chapter describes the specific hardware setup used to perform a test in which each node determines the UGNs to its neighbors on its own, without relying on a host PC that can read all of the information from each of the nodes in the system. This is accomplished through firmware running on the GPPE.

Architecture

TODO: Insert diagram (softUgnDemo-page-0.svg)

Initialization sequence

  1. "Boot" CPU
    1. Gets programmed by the host
    2. Programs the clock boards
    3. Gets the transceiver block out of reset (enabling the bittide domain / other CPUs)
    4. Activates each transceiver channel and waits until they've negotiated a link with their neighbors.
    5. Prints "all done" message to UART.
  2. Clock control CPU
    1. Gets programmed by the host
    2. Calibrates clocks
    3. Prints "all done" message to UART.
    4. (Keeps calibrating clocks.)
  3. Management unit CPU
    1. Gets programmed by the host
    2. Centers elastic buffers
    3. Sets the channels to "user" mode. This makes the channels accept data from the outside world instead of from their negotiation state machinery.
    4. Prints UGNs captured by hardware component to UART.
  4. General purpose processing element (PE)
    1. Gets programmed by the host
    2. Prints "Hello!" message over UART
    3. Calls the "c_main" function
    4. Prints "Hello from C!" over UART

Components:

  • Boot CPU (BOOT)
  • Transceivers
  • Domain difference counters
  • Clock control (CC)
  • Elastic buffers (one per incoming transceiver link)
  • Hardware UGN capture (for comparison)

Components:

  • Management unit (MU)
  • 1x general purpose processing element (PE)
  • 7 scatter and gather units, one of each per elastic buffer

Management unit

Connected components:

  • Timer
  • UART (for debugging)
  • FPGA DNA register

The management unit has access to and is responsible for all scatter/gather calendars in the node.

To change the binary run on this CPU, one may either:

  • Edit bittide-instances/src/bittide/Instances/Hitl/SoftUgnDemo/Driver.hs, line 215 (at time of writing) to use another binary instead of soft-ugn-mu
  • Edit the source files in firmware-binaries/soft-ugn-mu/ to change the binary pre-selected by the driver function

General purpose processing element

This component is labeled as "PE" in the diagram above. Connected components:

  • 7 scatter and gather units, one of each per elastic buffer
  • UART (for debugging)
  • Timer
  • FPGA DNA register

The general purpose processing element has no functionality other than printing "Hello from C!" over UART.

Debug and test infrastructure:

  • UART arbiter
  • JTAG interconnect
  • Integrated logic analyzers
  • SYNC_IN/SYNC_OUT

Running tests

One may run just the software UGN demo test by creating a .github/synthesis/debug.json with the following contents:

[
  {"top": "softUgnDemoTest", "stage": "test", "cc_report": true}
]

At the time of writing, the clock control CPU stabilizes the system. The driver running on the host (bittide-instances/src/bittide/Instances/Hitl/SoftUgnDemo/Driver.hs) then releases the reset of the management unit CPU. In turn, this CPU centers the elastic buffers and prints the UGNs captured by the hardware UGN capture component over UART. Finally, the general purpose processing element has its reset deasserted. It simply prints "Hello from C!".

Tests are configured to run the following binaries on the system's CPUs:

  • Boot CPU: switch-demo1-boot (firmware-binaries/demos/switch-demo1-boot)
  • Clock control CPU: clock-control (firmware-binaries/demos/clock-control)
  • Management unit: soft-ugn-mu (firmware-binaries/demos/soft-ugn-mu)
  • General purpose processing element: soft-ugn-gppe (firmware-binaries/demos/soft-ugn-gppe)

One may change this by either:

  1. Changing the driver function so that it loads different binaries onto the CPUs. This may be accomplished by changing which binary name is used with each of the initGdb function calls.
  2. Changing the source code for the binaries. The locations for them are listed above.

Switch Demo with ASIC processing element

This chapter describes the specific hardware setup used to perform a demonstration of the bittide switch.

Architecture

TODO: Insert diagram (switchDemoAsic-page-2.svg)

Initialization sequence

  1. "Boot" CPU
    1. Gets programmed by the host
    2. Programs the clock boards
    3. Gets the transceiver block out of reset (enabling the bittide domain / other CPUs)
    4. Activates each transceiver channel and waits until they've negotiated a link with their neighbors.
    5. Prints "all done" message to UART.
  2. Clock control CPU
    1. Gets programmed by the host
    2. Calibrates clocks
    3. Prints "all done" message to UART.
    4. (Keeps calibrating clocks.)
  3. Management unit CPU
    1. Gets programmed by the host
    2. Centers elastic buffers
    3. Sets the channels to "user" mode. This makes the channels accept data from the outside world instead of from their negotiation state machinery.
    4. Prints UGNs captured by hardware component to UART.

The ASIC processing element comes out of reset as soon as the bittide domain is enabled.

Components:

  • Boot CPU (BOOT)
  • Transceivers
  • Domain difference counters
  • Clock control (CC)
  • Elastic buffers (one per incoming transceiver link)
  • Hardware UGN capture

Components:

  • Management unit (MU)
  • Crossbar
  • Crossbar calendar
  • Null link
  • ASIC processing element
  • 1x scatter/gather unit pair

Management unit

Connected components:

  • Timer
  • UART (for debugging)
  • FPGA DNA register

The management unit has access to and is responsible for all scatter/gather calendars in the node, as well as the crossbar calendar.

To change the binary run on this CPU, one may either:

  • Edit bittide-instances/src/bittide/Instances/Hitl/SwitchDemo/Driver.hs, line 500 (at time of writing) to use another binary instead of switch-demo1-mu
  • Edit the source files in firmware-binaries/switch-demo1-mu/ to change the binary pre-selected by the driver function

Application specific processing element

This component is labeled as "PE" in the diagram above. It is directly connected to the output of the crossbar without any buffering; as such, it can only work on data as it is streamed to it, in a manner determined by the crossbar calendar.

The processing element can be instructed to read from its incoming link for a configurable number of clock cycles, starting at a specific cycle. Similarly, it can write to its outgoing link for a configurable number of clock cycles, starting at a specific cycle. In the demo, UGNs are received from all nodes in the system, and from these a "schedule" is created that determines when each PE reads and writes on its link. If the schedule is correct and the bittide property holds, a "thread" through all nodes is created. For more information, see this presentation.

Debug and test infrastructure:

  • UART arbiter
  • JTAG interconnect
  • Integrated logic analyzers
  • SYNC_IN/SYNC_OUT

Running tests

One may run just the switch demo test by creating a .github/synthesis/debug.json with the following contents:

[
  {"top": "switchDemoTest", "stage": "test", "cc_report": true}
]

At the time of writing, the clock control CPU stabilizes the system. The driver running on the host (bittide-instances/src/bittide/Instances/Hitl/SwitchDemo/Driver.hs) then releases the reset of the management unit CPU. In turn, this CPU centers the elastic buffers and prints the UGNs captured by the hardware UGN capture component over UART. The behavior of the application specific processing element is explained in the Application specific processing element section above.

Tests are configured to run the following binaries on the system's CPUs:

  • Boot CPU: switch-demo1-boot (firmware-binaries/demos/switch-demo1-boot)
  • Clock control CPU: clock-control (firmware-binaries/demos/clock-control)
  • Management unit: switch-demo1-mu (firmware-binaries/demos/switch-demo1-mu)

One may change this by either:

  1. Changing the driver function so that it loads different binaries onto the CPUs. This may be accomplished by changing which binary name is used with each of the initGdb function calls.
  2. Changing the source code for the binaries. The locations for them are listed above.

Switch Demo with GPPE

This chapter describes the specific hardware setup used to perform a demonstration of the bittide switch.

Architecture

TODO: Insert diagram (switchDemoGppe-page-1.svg)

Initialization sequence

  1. "Boot" CPU
    1. Gets programmed by the host
    2. Programs the clock boards
    3. Gets the transceiver block out of reset (enabling the bittide domain / other CPUs)
    4. Activates each transceiver channel and waits until they've negotiated a link with their neighbors.
    5. Prints "all done" message to UART.
  2. Clock control CPU
    1. Gets programmed by the host
    2. Calibrates clocks
    3. Prints "all done" message to UART.
    4. (Keeps calibrating clocks.)
  3. Management unit CPU
    1. Gets programmed by the host
    2. Centers elastic buffers
    3. Sets the channels to "user" mode. This makes the channels accept data from the outside world instead of from their negotiation state machinery.
    4. Prints UGNs captured by hardware component to UART.
  4. General purpose processing element (PE)
    1. Gets programmed by the host
    2. Prints "Hello!" message over UART

Components:

  • Boot CPU (BOOT)
  • Transceivers
  • Domain difference counters
  • Clock control (CC)
  • Elastic buffers (one per incoming transceiver link)
  • Hardware UGN capture

Components:

  • Management unit (MU)
  • Crossbar
  • Crossbar calendar
  • Null link
  • 1x general purpose processing element (PE)
  • 1x scatter/gather unit pair

Management unit

Connected components:

  • Timer
  • UART (for debugging)
  • FPGA DNA register

The management unit has access to and is responsible for all scatter/gather calendars in the node, as well as the crossbar calendar.

To change the binary run on this CPU, one may either:

  • Edit bittide-instances/src/bittide/Instances/Hitl/SwitchDemoGppe/Driver.hs, line 215 (at time of writing) to use another binary instead of switch-demo2-mu
  • Edit the source files in firmware-binaries/switch-demo2-mu/ to change the binary pre-selected by the driver function

General purpose processing element

This component is labeled as "PE" in the diagram above. Connected components:

  • 7 scatter and gather units, one of each per elastic buffer
  • UART (for debugging)
  • Timer
  • Scatter/gather unit connection to crossbar output
  • FPGA DNA register

The general purpose processing element is a drop-in replacement for the ASIC processing element. It has no functionality other than printing "Hello!" over UART.

Debug and test infrastructure:

  • UART arbiter
  • JTAG interconnect
  • Integrated logic analyzers
  • SYNC_IN/SYNC_OUT

Running tests

One may run just the switch demo with GPPE test by creating a .github/synthesis/debug.json with the following contents:

[
  {"top": "switchDemoGppeTest", "stage": "test", "cc_report": true}
]

At the time of writing, the clock control CPU stabilizes the system. The driver running on the host (bittide-instances/src/bittide/Instances/Hitl/SwitchDemoGppe/Driver.hs) then releases the reset of the management unit CPU. In turn, this CPU centers the elastic buffers and prints the UGNs captured by the hardware UGN capture component over UART. Finally, the general purpose processing element has its reset deasserted. It simply prints "Hello!" over UART.

Tests are configured to run the following binaries on the system's CPUs:

  • Boot CPU: switch-demo1-boot (firmware-binaries/demos/switch-demo1-boot)
  • Clock control CPU: clock-control (firmware-binaries/demos/clock-control)
  • Management unit: switch-demo2-mu (firmware-binaries/demos/switch-demo2-mu)
  • General purpose processing element: switch-demo2-gppe (firmware-binaries/demos/switch-demo2-gppe)

One may change this by either:

  1. Changing the driver function so that it loads different binaries onto the CPUs. This may be accomplished by changing which binary name is used with each of the initGdb function calls.
  2. Changing the source code for the binaries. The locations for them are listed above.

Ringbuffer Alignment Protocol

Definitions

  • Transmitting Ringbuffer (TX): Local memory written by CPU, read by hardware using a wrapping counter.
  • Receiving Ringbuffer (RX): Local memory written by hardware using a wrapping counter, read by CPU.

Context

In a bittide system, nodes operate in a globally synchronous manner despite being asynchronous devices with unknown start times. Communication occurs via ringbuffers. When TX and RX ringbuffers are the same size, the address mapping between them is constant, determined by (logical) network latency and the start time difference between nodes.

Objective

Determine the constant offset between the transmit address (TX) and receive address (RX) for each link to enable reliable asynchronous communication.

Alignment Algorithm

  1. Initialization: Set all ringbuffers to a uniform default size.
  2. Announce: Each CPU writes a recognizable message containing the non-zero state ALIGNMENT_ANNOUNCE to index 0 of its TX ringbuffer and clears all other positions.
  3. Search: Each CPU scans its RX ringbuffer for the ALIGNMENT_ANNOUNCE message.
  4. Determine Offset: Upon finding the message at RX_Index, the CPU stores RX_Index to be used as the future read offset.
  5. Acknowledge: The CPU updates index 0 of its TX ringbuffer with the state ALIGNMENT_RECEIVED.
  6. Confirm: The CPU monitors the RX ringbuffer until it receives ALIGNMENT_RECEIVED from its neighbor.
  7. Finalize: Once the CPU has received ALIGNMENT_RECEIVED, the link is aligned. The offset is stored for future communication and the node proceeds to normal operation (a sketch of the full procedure follows below).
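
A minimal Rust sketch of this procedure is shown below. The Ringbuffer handle, its word-indexed accessors, and the state encodings are assumptions for illustration; only the sequence of steps follows the protocol above.

const ALIGNMENT_ANNOUNCE: u64 = 1; // illustrative encoding (must be non-zero)
const ALIGNMENT_RECEIVED: u64 = 2; // illustrative encoding
const SIZE: usize = 64;            // uniform default size (step 1)

/// Hypothetical word-indexed handle to one ringbuffer.
trait Ringbuffer {
    fn read(&self, index: usize) -> u64;
    fn write(&mut self, index: usize, value: u64);
}

/// Runs steps 2-7 for one link and returns the discovered read offset.
fn align(tx: &mut impl Ringbuffer, rx: &impl Ringbuffer) -> usize {
    // Step 2: announce at index 0, clear all other positions.
    tx.write(0, ALIGNMENT_ANNOUNCE);
    for i in 1..SIZE {
        tx.write(i, 0);
    }
    // Steps 3-4: scan RX for the neighbor's non-zero state word (the announce,
    // or already the acknowledgement if the neighbor is ahead of us); its
    // index is the constant offset for this link.
    let offset = loop {
        if let Some(i) = (0..SIZE).find(|&i| rx.read(i) != 0) {
            break i;
        }
    };
    // Step 5: acknowledge by updating index 0 of the TX ringbuffer.
    tx.write(0, ALIGNMENT_RECEIVED);
    // Steps 6-7: wait until the neighbor acknowledges too, then finalize.
    while rx.read(offset) != ALIGNMENT_RECEIVED {}
    offset
}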

Resulting Interface: AlignedRingbuffer

Upon successful alignment, the system can instantiate an AlignedRingbuffer abstraction. This interface handles the offset calculations transparently, allowing the CPU to read and write to the ringbuffers without concern for alignment.
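
A sketch of what this wrapper could look like, reusing the hypothetical Ringbuffer trait from the previous sketch; the real interface lives in the bittide firmware sources.

/// Wraps one aligned link; all RX indices are shifted by the discovered offset.
struct AlignedRingbuffer<T, R> {
    tx: T,
    rx: R,
    offset: usize, // RX_Index found during alignment
    size: usize,   // ringbuffer size in words
}

impl<T: Ringbuffer, R: Ringbuffer> AlignedRingbuffer<T, R> {
    /// Reads as if both sides indexed from zero.
    fn read(&self, index: usize) -> u64 {
        self.rx.read((self.offset + index) % self.size)
    }
    /// TX needs no shifting; the neighbor applies its own offset when reading.
    fn write(&mut self, index: usize, value: u64) {
        self.tx.write(index, value);
    }
}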

Communication Challenges

While the AlignedRingbuffer provides logical connectivity, the physical link remains unreliable due to the interaction between the read/write counters of the ringbuffers and asynchronous CPU access:

  1. Continuous Hardware Operation: The hardware continuously cycles through the ringbuffers at the network link speed.
  2. Asynchronous CPU Access: The CPU operates asynchronously and often slower than the network link.

This leads to specific failure modes:

  • Data Corruption (Pointer Overtaking):
    • TX Side: If the hardware's read pointer overtakes the CPU's write pointer during a write, a torn frame is sent.
    • RX Side: If the hardware's write pointer overtakes the CPU's read pointer during a read, the message is corrupted.
  • Data Loss: If the CPU does not read from the RX ringbuffer every iteration, the hardware will overwrite unread data.
  • Data Duplication: If the CPU does not write to the TX ringbuffer every iteration, the hardware will resend old data.

Reliable communication requires a higher-level protocol to handle these errors. See the Asynchronous Communication Protocol for a proposed solution using the smoltcp library to implement a reliable TCP/IP layer over the AlignedRingbuffer.

Asynchronous Communication Protocol

Context

In a bittide system, we need asynchronous communication between nodes, particularly during the boot phase. The Ringbuffer Alignment Protocol provides an AlignedRingbuffer abstraction that allows for packet exchange.

However, as described in that protocol's documentation, the raw AlignedRingbuffer link is unreliable, subject to packet corruption and loss due to hardware/software speed mismatches.

Objective

Establish a reliable, asynchronous, point-to-point communication channel between nodes over the potentially unreliable AlignedRingbuffer links.

Proposed Solution

Leverage the TCP/IP protocol suite to handle error detection, retransmission, and flow control. We will use the smoltcp library, a lightweight TCP/IP stack designed for embedded systems, to implement this layer. Note that future versions of bittide will probably use a bespoke network stack for asynchronous communication.

Implementation Strategy

1. Network Interface (smoltcp::phy::Device)

We will implement the smoltcp::phy::Device trait for the AlignedRingbuffer; a sketch follows below.

  • Medium: Use Medium::Ip to minimize overhead (no Ethernet headers required for point-to-point).
  • MTU: Set to 1500 bytes (standard Ethernet size) to accommodate typical payloads.
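
A sketch of what this could look like, assuming smoltcp 0.10's GAT-based Device trait (exact signatures differ between smoltcp versions) and a hypothetical AlignedRingbuffer handle exposing try_recv/send for whole packets. In a no_std firmware the Vec-based buffers would become static buffers.

use smoltcp::phy::{self, Device, DeviceCapabilities, Medium};
use smoltcp::time::Instant;

/// Hypothetical packet-level handle over one aligned link (assumption).
struct AlignedRingbuffer;
impl AlignedRingbuffer {
    fn try_recv(&mut self) -> Option<Vec<u8>> { None } // one whole packet, if any
    fn send(&mut self, _packet: &[u8]) {}              // one whole packet
}

struct RingbufferDevice {
    link: AlignedRingbuffer,
}

struct RbRxToken(Vec<u8>);
struct RbTxToken<'a>(&'a mut AlignedRingbuffer);

impl phy::RxToken for RbRxToken {
    fn consume<R, F: FnOnce(&mut [u8]) -> R>(mut self, f: F) -> R {
        f(&mut self.0) // hand the received packet to the stack
    }
}

impl<'a> phy::TxToken for RbTxToken<'a> {
    fn consume<R, F: FnOnce(&mut [u8]) -> R>(self, len: usize, f: F) -> R {
        let mut frame = vec![0u8; len];
        let result = f(&mut frame); // let the stack fill in the packet
        self.0.send(&frame);        // then push it into the TX ringbuffer
        result
    }
}

impl Device for RingbufferDevice {
    type RxToken<'a> = RbRxToken where Self: 'a;
    type TxToken<'a> = RbTxToken<'a> where Self: 'a;

    fn receive(&mut self, _now: Instant) -> Option<(Self::RxToken<'_>, Self::TxToken<'_>)> {
        let packet = self.link.try_recv()?; // None if no packet arrived
        Some((RbRxToken(packet), RbTxToken(&mut self.link)))
    }

    fn transmit(&mut self, _now: Instant) -> Option<Self::TxToken<'_>> {
        Some(RbTxToken(&mut self.link))
    }

    fn capabilities(&self) -> DeviceCapabilities {
        let mut caps = DeviceCapabilities::default();
        caps.medium = Medium::Ip;          // point-to-point, no Ethernet headers
        caps.max_transmission_unit = 1500; // MTU chosen above
        caps
    }
}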

2. Framing & Alignment

  • The underlying AlignedRingbuffer abstraction ensures packets are read from the correct aligned memory location.
  • Packet Boundaries: The length of each packet will be derived directly from the IP Header length field.

3. Addressing

  • Topology: Initially restricted to peer-to-peer links.
  • IP Assignment: Placeholder IPs for now; if necessary, static addresses derived from unique hardware identifiers (e.g., FPGA DNA or port ID) can be used to avoid the complexity of DHCP (see the sketch below).
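
As an illustration of the static option, a hypothetical helper could fold the FPGA DNA (a device-unique identifier on Xilinx parts) into a static point-to-point address. The mapping below is an assumption, not a decided scheme.

use smoltcp::wire::Ipv4Address;

/// Hypothetical static addressing: fold the device DNA into one byte so each
/// node gets its own 10.node.port.0/24 subnet, one subnet per link.
fn static_ip(dna: u128, port: u8) -> Ipv4Address {
    let folded = dna.to_le_bytes().iter().fold(0u8, |acc, b| acc ^ b);
    Ipv4Address::new(10, folded, port, 1)
}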

4. Demo Application

Develop a proof-of-concept application that:

  1. Initializes the AlignedRingbuffer.
  2. Sets up a smoltcp interface.
  3. Establishes a TCP connection between two nodes.
  4. Transfers data to verify reliability against induced packet loss/corruption (see the sketch below).
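
A sketch of that main loop, again assuming smoltcp 0.10 and the RingbufferDevice from the earlier sketch; the port, addresses, buffer sizes, and timer source are placeholders, and the Vec-based buffers assume the alloc feature.

use smoltcp::iface::{Config, Interface, SocketSet};
use smoltcp::socket::tcp;
use smoltcp::time::Instant;
use smoltcp::wire::{HardwareAddress, IpAddress, IpCidr};

fn demo(mut device: RingbufferDevice, now_ms: impl Fn() -> i64) {
    // IP-medium interface: no Ethernet address, no ARP (Medium::Ip).
    let config = Config::new(HardwareAddress::Ip);
    let mut iface = Interface::new(config, &mut device, Instant::from_millis(now_ms()));
    iface.update_ip_addrs(|addrs| {
        addrs.push(IpCidr::new(IpAddress::v4(10, 0, 0, 1), 24)).unwrap();
    });

    // One TCP socket with statically sized buffers.
    let rx = tcp::SocketBuffer::new(vec![0; 2048]);
    let tx = tcp::SocketBuffer::new(vec![0; 2048]);
    let mut sockets = SocketSet::new(vec![]);
    let handle = sockets.add(tcp::Socket::new(rx, tx));

    loop {
        let now = Instant::from_millis(now_ms());
        iface.poll(now, &mut device, &mut sockets); // drives RX/TX and retransmission
        let socket = sockets.get_mut::<tcp::Socket>(handle);
        if !socket.is_open() {
            socket.listen(1234).unwrap(); // placeholder port
        }
        if socket.can_recv() {
            // Echo received data back to verify end-to-end reliability.
            let data = socket.recv(|buf| (buf.len(), buf.to_vec())).unwrap();
            if socket.can_send() {
                socket.send_slice(&data).unwrap();
            }
        }
    }
}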

Assumptions

  • An AlignedRingbuffer abstraction exists that provides a read/write interface for single aligned packets.
  • The ringbuffer size is sufficient to hold at least one MTU-sized packet plus overhead.