bfloat16: Google Brain Floating-Point for Deep Learning
Training neural networks requires a format that balances three concerns: the dynamic range must be wide enough to represent gradients (which can span many orders of magnitude), the storage must be compact enough to fit large models in memory, and arithmetic must be fast enough to process billions of operations per second.
IEEE half-precision (float16) has only 5 exponent bits, giving a dynamic range of ~6.5 x 10^4 — far too narrow for gradient accumulation, where values routinely span 10^-8 to 10^4. Google’s Brain Float (bfloat16) solves this by truncating float32 to 16 bits from the top: same 8 exponent bits as float32 (dynamic range ~3.4 x 10^38), with a reduced 7-bit fraction. The result is a format perfectly tuned for deep learning: wide range, compact storage, and trivial conversion to/from float32 (just truncate or zero-extend the lower 16 bits).
bfloat16 is a fixed-format 16-bit floating-point type:
[sign : 1 bit] [exponent : 8 bits] [fraction : 7 bits]| Property | Value |
|---|---|
| Total bits | 16 |
| Exponent bits | 8 |
| Fraction bits | 7 |
| Exponent bias | 127 |
| Dynamic range | ~1.2 x 10^-38 to ~3.4 x 10^38 |
| Precision | ~2 decimal digits |
| Subnormals | No |
| Infinity | Yes |
| NaN | Yes |
Key Properties
Section titled “Key Properties”- Not a template: fixed format, no configuration parameters
- Same dynamic range as float32: 8 exponent bits, bias 127
- Trivial conversion to/from float32: top 16 bits of IEEE-754 binary32
- Constexpr support: compile-time evaluation
- Trivially copyable: suitable for hardware acceleration
Comparison with Other 16-bit Formats
Section titled “Comparison with Other 16-bit Formats”| Format | Exponent | Fraction | Dynamic Range | Precision |
|---|---|---|---|---|
| IEEE float16 | 5 bits | 10 bits | ~6.5 x 10^4 | ~3.3 digits |
| bfloat16 | 8 bits | 7 bits | ~3.4 x 10^38 | ~2 digits |
| float32 | 8 bits | 23 bits | ~3.4 x 10^38 | ~7.2 digits |
How It Works
Section titled “How It Works”bfloat16 is the upper 16 bits of a 32-bit IEEE-754 float. This means:
- Conversion from float32: truncate (or round) the lower 16 fraction bits
- Conversion to float32: zero-extend the lower 16 fraction bits
- Arithmetic: performed by promoting to float32, computing, then truncating back
The Universal implementation stores the encoding in a uint16_t and uses std::memcpy-based conversion to/from native float, ensuring constexpr compatibility where the compiler supports it.
How to Use It
Section titled “How to Use It”Include
Section titled “Include”#include <universal/number/bfloat16/bfloat16.hpp>using namespace sw::universal;Basic Usage
Section titled “Basic Usage”bfloat16 weight(0.5f);bfloat16 gradient(-0.001f);bfloat16 learning_rate(0.01f);
// Gradient descent updateweight = weight - learning_rate * gradient;
std::cout << "Updated weight: " << weight << std::endl;std::cout << "Encoding: " << to_binary(weight) << std::endl;Mixed-Precision Training Pattern
Section titled “Mixed-Precision Training Pattern”// Typical mixed-precision workflow:// 1. Store weights and activations in bfloat16 (saves memory)// 2. Accumulate in float32 (preserves accuracy)// 3. Cast result back to bfloat16 for storage
template<typename LowPrec>float accumulate_gradients(const std::vector<LowPrec>& gradients) { float sum = 0.0f; // Accumulate in float32 for (const auto& g : gradients) { sum += float(g); } return sum;}
std::vector<bfloat16> grads = { bfloat16(0.001f), bfloat16(-0.002f), bfloat16(0.0005f) };float total = accumulate_gradients(grads);bfloat16 avg_gradient(total / grads.size());Plug-in Replacement for Neural Network Layers
Section titled “Plug-in Replacement for Neural Network Layers”template<typename Real>void linear_layer(const std::vector<Real>& input, const std::vector<std::vector<Real>>& weights, std::vector<Real>& output) { for (size_t i = 0; i < weights.size(); ++i) { Real sum(0); for (size_t j = 0; j < input.size(); ++j) { sum += weights[i][j] * input[j]; } output[i] = sum; }}
// Use with bfloat16 for 2x memory reduction vs float32std::vector<bfloat16> input = { bfloat16(1.0f), bfloat16(0.5f) };std::vector<std::vector<bfloat16>> weights = { { bfloat16(0.3f), bfloat16(0.7f) }, { bfloat16(-0.2f), bfloat16(0.4f) }};std::vector<bfloat16> output(2);linear_layer(input, weights, output);Problems It Solves
Section titled “Problems It Solves”| Problem | How bfloat16 Solves It |
|---|---|
| Neural network weights don’t fit in GPU memory | 2x compression vs float32, same dynamic range |
| float16 gradients overflow during training | 8-bit exponent matches float32 range |
| Memory bandwidth limits training throughput | Half the bytes per element = 2x bandwidth efficiency |
| Need fast conversion between training (float32) and storage (16-bit) | Truncate/extend top 16 bits — no complex conversion |
| Google TPU and AI accelerator compatibility | Native bfloat16 format used by TPUs |
| Transformer models with billions of parameters | Compact storage enables larger models and batch sizes |