B-Trees vs LSM-Trees: The Two Religions of On-Disk Data
Every database you use bets on one of two storage engines: B-trees (read-optimised, update-in-place) or LSM-trees (write-optimised, append-and-compact). The choice isn't about speed but about which kind of amplification you're willing to pay.
The engine under your data is not "fast" or "slow". It is amplified — a fact that survives almost every database argument you have ever had. Every read you issue touches more bytes than the row you asked for; every write you commit rewrites more bytes than the value you changed; every gigabyte you store occupies more than a gigabyte on disk. Those three taxes — read amplification, write amplification, and space amplification — are not bugs to be tuned away. They are conserved. You can push any one of them down to almost nothing, but only by shoving the slack onto the other two. A storage engine is the specific bargain its designers struck about which tax to pay.
The two great bargains have names, and they split the database world cleanly down the middle. B-trees keep the data sorted in fixed-size pages and update those pages in place — the structure Postgres, MySQL/InnoDB, and almost every relational database of the last half-century quietly runs on. Log-structured merge-trees (LSM-trees) refuse to update anything in place; they buffer writes in memory, dump them to disk in sorted batches, and clean up later in the background — the structure behind RocksDB, Cassandra, LevelDB, and most of the "web-scale" lineage. One religion optimises the read. The other optimises the write. Neither escapes the bill.
Insert a few keys into both panes above and the two philosophies separate immediately. The B-tree on the left keeps everything sorted and, when a node fills, splits it — pushing a median key up and rewriting the page in place. The LSM-tree on the right does nothing of the sort: keys pile into an in-memory buffer, and when it fills it gets flushed wholesale to an immutable sorted run on disk. Same keys, same order in, two completely different shapes on disk. This post is about why those shapes exist, what each one costs, and why your database sometimes does surprise disk IO at 3am that you never asked for.
Two ways to put a tree on a disk
Start with the constraint that makes storage engines hard: the disk. A spinning disk or even an SSD is happiest reading and writing in large, contiguous chunks — a page, typically 4 KB to 16 KB — and miserable doing scattered single-byte updates. Memory does not care where things live; disk cares enormously. So the entire game is arranging your data so that the access pattern your workload generates maps onto the access pattern the disk wants.
There are exactly two honest answers, and they correspond to which operation you refuse to compromise.
The first answer: keep the data sorted on disk at all times, in place. Then any lookup is a search through a sorted structure — a handful of page reads, no matter how much data you hold. This is the B-tree. The cost is that keeping things sorted as you write means finding the right page, reading it, modifying it, and writing it back — a read-modify-write for every change, scattered wherever the key happens to land.
The second answer: never sort on disk; only ever append. Writes go to a buffer in memory and then to disk as immutable, append-only sorted runs. Appending is the one thing disks do superbly. The cost is deferred, not avoided: now your data is scattered across many runs, so a read has to check several of them, and a background process has to keep merging runs together or they multiply without bound. This is the LSM-tree.
The choice is made before the first query runs
You do not get to pick "read-optimised" and "write-optimised" at query time. The decision is baked into the on-disk layout the moment you choose the engine. A B-tree has already paid for fast reads by the time your first SELECT arrives — the data is sorted, the pages are placed. An LSM-tree has already paid for fast writes by the time your first INSERT arrives — the append path is clear, the sorting is postponed. The storage engine is a bet placed at schema-creation time about which way your workload leans, and you live with it.
The B-tree: update-in-place, shallow, read-optimised
A B-tree is a sorted tree built for the disk's block size. Each node is one page and holds many keys — not two children like a binary tree, but of them, where is large precisely so the tree stays shallow. A node with room for hundreds of keys has hundreds of children, so the tree fans out fast and bottoms out in very few levels.
That fan-out is the whole point, and it falls straight out of the height formula. A balanced tree of fan-out holding keys has height:
The base of that logarithm is the lever. With a binary tree, , and a billion keys is levels deep — 30 disk seeks per lookup, a catastrophe. Make instead, and the same billion keys is levels deep. Three or four page reads, and you have found any row in a billion. A large does not change what the tree does; it changes the base of the log, and the base of the log is the number of seeks. That is why every B-tree node is a fat page crammed with keys: each key you add to a node multiplies the tree's reach without adding depth.
Why the fan-out is so wide
A binary search tree and a B-tree both find a key in logarithmic time. The difference is the base of the log, and on a disk the base is everything. Each level of the tree is potentially one disk seek — the single most expensive thing the database does, milliseconds on a spinning disk, still microseconds on an SSD versus nanoseconds in memory. A binary tree spends 30 seeks where a B-tree spends 3. The B-tree wins not by being smarter per comparison but by doing the comparisons in memory, hundreds at a time, between rare disk seeks.
Now watch what an insert actually does — and the left pane of the simulator at the top of this post does exactly this. You walk down to the correct leaf and write the key into the page in place. If the page still has room, you are done: one page read, one page write. But if the page is full, it has to split — the node divides in two, the median key moves up into the parent, and now the parent might be full too, so the split can cascade upward, occasionally all the way to the root (which is how the tree grows a level). Each split rewrites several pages: the one that overflowed, the new sibling, and the parent that received the median. The B-tree's page writes counter climbs faster than the keys you inserted, and that gap is the B-tree's write amplification — the price of keeping everything sorted in place so reads stay cheap.
That bargain has a sharp consequence. Because B-tree pages are modified in place, a single-row update can mean reading a 16 KB page, changing 8 bytes, and writing the whole 16 KB page back. Databases soften this with a write-ahead log and by batching dirty pages, but the structural fact stands: the B-tree pays on the write to be lavish on the read.
The LSM-tree: append to a memtable, flush, compact
The LSM-tree starts from the opposite premise — that random in-place writes are the enemy — and arranges never to do them. A write does not touch the on-disk tree at all. It lands in two places: an append-only write-ahead log for durability, and an in-memory sorted structure called the memtable (usually a skip list or a balanced tree, sorted because sorting in RAM is cheap).
The memtable absorbs writes until it hits a size threshold. Then it is flushed: written out, in one sequential sweep, as an immutable sorted string table (SSTable) — a file of key-value pairs in sorted order. Crucially, the SSTable is never modified again. An update to a key you wrote yesterday does not find and edit yesterday's SSTable; it just writes a newer entry, in a newer SSTable, that shadows the old one. A delete writes a tombstone, a marker that says "this key is gone", which also shadows the old value until cleanup removes both.
You can already see the problem this creates. Keep flushing memtables and you accumulate dozens, then hundreds, of SSTables, each sorted internally but overlapping the others. A read for a single key might have to check many of them, newest first, until it finds the key or a tombstone. Reads degrade as the number of runs grows. The fix is compaction: a background process that merges several SSTables into one, discarding shadowed values and dead tombstones along the way. SSTables are organised into levels, each level larger than the one above, and compaction merges runs from one level down into the next.
An LSM-tree never overwrites a byte on disk; it writes a newer truth on top of the old one and lets a background janitor reconcile them later.
Now drive the right pane. Hit Insert ×8 and watch the memtable fill — four keys here — then flush as a single immutable L0 run; the SSTable writes counter ticks once per flush, not once per key. Keep inserting and L0 runs accumulate until three of them trigger a compaction that merges them into a larger L1 run. That compaction is the surprise IO. It is not driven by your query; it is the engine reconciling its deferred work, reading several runs and writing one larger run, all in the background. The "why is my database hammering the disk when nobody's querying it" phenomenon is compaction catching up on the writes you batched cheaply earlier. You paid a low price at write time; compaction is the invoice arriving later.
The append-only insight that makes it all work
Because SSTables are immutable, an LSM-tree never has a write that races a read on the same byte — readers see a consistent snapshot of files that, once written, never change. This is the same insight behind log-structured filesystems, copy-on-write, and immutable data structures everywhere: you avoid the hard concurrency problem of in-place mutation by never mutating, only appending and later garbage-collecting. The complexity does not vanish; it concentrates into the compaction scheduler, which is the single hardest piece of any LSM engine to get right.
The three amplifications and the RUM conjecture
Now name the taxes precisely, because the whole comparison reduces to them. For a storage engine, define:
- Read amplification — the number of bytes read from disk to satisfy a query, divided by the bytes of the result. A B-tree reads a handful of pages per lookup. An LSM-tree may probe several runs across several levels, so its read amplification grows with the number of levels.
- Write amplification — the number of bytes written to disk, divided by the bytes of the logical write. A B-tree rewrites whole pages and splits nodes. An LSM-tree rewrites every record once per compaction it survives.
- Space amplification — the bytes occupied on disk, divided by the bytes of live data. A B-tree leaves pages partially empty after splits (typically about 2/3 full). An LSM-tree holds stale, shadowed values and tombstones until compaction reclaims them.
The LSM-tree's write amplification has a clean approximate form for leveled compaction. Every record is written once when its memtable flushes, then rewritten again each time it is merged down a level. With levels and a per-level size ratio (fan-out) , the worst-case rewriting at each level is on the order of — when a level fills, compacting it into the next reads and rewrites roughly times as much data as it contributes. Summed across all levels, the total lands on the order of times :
A typical configuration with levels and a fan-out of gives write amplification around 50 — every byte you logically write gets physically written roughly 50 times over its lifetime, spread across the flush and the compactions it survives. That sounds ruinous until you remember the alternative: those rewrites are all sequential writes, which a disk handles far faster than the random in-place page writes a B-tree scatters around. The LSM-tree trades a higher write volume for a far friendlier write pattern.
Flip the workload toggle between read-heavy and write-heavy and watch the amplification readout re-weight. Under a write-heavy workload, the LSM-tree's bargain looks brilliant: sequential appends, batched flushes, the cost deferred to background compaction. Flip to read-heavy and the picture inverts — the LSM-tree's read amplification, all those runs to probe, starts to hurt, and the B-tree's always-sorted layout pays off. Keep inserting in either mode and watch the live vs total bytes counters drift apart: the gap between them is space amplification, the dead weight of shadowed values and half-empty pages.
This is not a coincidence of these two designs. It is a theorem-shaped observation called the RUM conjecture — Read, Update, Memory. It states that an access method can optimise at most two of the three overheads, and improving one of read, update, or memory amplification comes at the expense of at least one of the others:
The B-tree picks low read amplification and moderate space, paying with random in-place writes. The LSM-tree picks low write amplification (sequential, batched) and pays with read amplification and transient space bloat. You can move the dial — bigger Bloom filters cut LSM read amplification at the cost of memory; more aggressive compaction cuts space at the cost of write amplification — but every adjustment is a slide along the triangle, never an escape from it.
There is no free lunch, only a relocated cost
When a vendor benchmark shows their engine "beating" the other on throughput, read the workload first. A write-heavy benchmark flatters LSM-trees; a read-heavy or scan-heavy benchmark flatters B-trees. Neither result is a lie, and neither is general. The RUM triangle guarantees that any engine winning decisively on one axis is paying for it on another — the benchmark just chose which axis to spotlight. The honest question is never "which is faster" but "which amplification can my workload afford".
- B-tree lookup
- ~3-4
- page reads to find any row in a billion, at fan-out 500
- LSM write amp
- ~10-50×
- bytes written per logical write, all sequential
- RUM axes
- 2 of 3
- read, update, memory — optimise at most two
Who picks what, and why
Map the engines onto the workloads and the choices stop looking arbitrary.
Postgres and MySQL/InnoDB are B-trees because the relational world they grew up in is read-and-scan-heavy: indexes that get range-scanned, joins that walk sorted keys, point lookups that must be fast and predictable. Update-in-place also makes transactional semantics and secondary indexes simpler — the row lives in one place, so an index entry can point at it. The B-tree's bargain — pay on the write to be cheap and predictable on the read — is exactly the bargain an OLTP relational database wants.
Cassandra is an LSM-tree because it was built for write-heavy, high-ingest workloads — time-series, event logs, sensor feeds, audit trails — where data pours in far faster than it is read back, and where horizontal write scalability matters more than single-key read latency. Append-and-compact turns a flood of random-looking inserts into a stream of sequential flushes, which is the only way to keep a write-saturated node from collapsing.
RocksDB (a fork of Google's LevelDB) is the LSM-tree as a reusable embedded library — the storage engine extracted from any particular database and handed to others. It sits under MySQL's MyRocks, under TiKV (and so under TiDB), under Kafka Streams' state stores, under countless services that need a fast embedded key-value store; CockroachDB famously re-implemented the same design in Go as Pebble, a RocksDB-inspired engine it has shipped by default since 2020. RocksDB exposes the RUM triangle as configuration: compaction style, level sizes, Bloom-filter bits, block-cache size — every knob is a position on the trade-off, handed to the operator to set for their workload.
# The same logical operation, two storage philosophies underneath.
# B-tree (Postgres-style): UPDATE finds the page, reads it, modifies it
# in place, writes it back. One row changed, one page (or a few) rewritten.
db.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 42")
# -> walk B-tree to the leaf holding id=42, read page, edit, write page back
# LSM-tree (RocksDB-style): the "update" is just a newer write that
# shadows the old value. Nothing on disk is modified; a tombstone-free
# newer entry lands in the memtable, flushes later, compacts later.
store.put(b"account:42:balance", b"...") # append; old value reconciled at compactionThe pattern to carry away: a B-tree's UPDATE is genuinely an update — it finds and edits the existing bytes. An LSM-tree's put is a disguised INSERT — it writes a newer truth and trusts compaction to retire the old one. The API hides the difference; the amplification profile does not.
Hybrid engines blur the line on purpose
The split is real but not absolute. Modern engines borrow across the aisle: WiredTiger (MongoDB's default) is a B-tree with LSM-like log-structured behaviour available; MyRocks puts an LSM engine under MySQL's relational front end; fractal trees and -trees (TokuDB) buffer writes inside B-tree nodes to claw back write performance without abandoning the sorted layout. Each hybrid is an attempt to occupy a different point on the RUM triangle than either pure design — never an attempt to leave the triangle, because that exit does not exist.
The same shape in other rooms
Once you have the append-and-compact pattern in your hand, you start seeing it everywhere the cost of in-place mutation got too high to bear.
Log-structured filesystems are the LSM idea applied to whole files. LFS, the 1991 design by Mendel Rosenblum and John Ousterhout, writes all changes — data and metadata alike — to a sequential log and runs a background segment cleaner to reclaim space. The motivation was identical to the LSM-tree's: disks are fast at sequential writes and slow at random ones, so turn every random write into a sequential append and pay the cleanup cost later. The segment cleaner is compaction wearing a filesystem hat.
SSD flash translation layers (FTLs) run the same playbook in firmware, because they have no choice. NAND flash cannot overwrite a page in place — a block must be erased before rewriting, and erases are slow and wear the cell out. So every SSD is secretly log-structured: writes go to fresh pages, a mapping table redirects logical addresses to wherever the data landed, and a garbage collector compacts mostly-dead blocks. Write amplification is a term SSD engineers use daily, meaning exactly what it means for an LSM-tree. Your "update-in-place" B-tree, running on an SSD, sits on a log-structured device that turns its in-place writes into appends anyway.
Apache Kafka takes the append-only half of the idea and makes it the entire product. A Kafka topic is a partitioned, append-only commit log — the memtable's write-ahead log promoted to the central abstraction. Producers append; consumers read sequentially; log compaction (Kafka's own term) retires superseded records by key. Kafka is what happens when you decide the append-only log is not an implementation detail of a storage engine but the data model itself.
The thread through all of them is one decision made over and over: when in-place mutation is expensive — because the disk is slow at random writes, because flash cannot overwrite, because concurrent mutation is hard to reason about — stop mutating. Append the new truth, shadow the old, and run a background process to reconcile. The B-tree is the holdout that insists on mutating in place and pays for it on the write; everything in the log-structured lineage is a variation on the LSM-tree's refusal.
Reading further
- Bayer & McCreight, Organization and Maintenance of Large Ordered Indices (1970) — the original B-tree paper (presented at the 1970 SIGFIDET workshop, expanded in Acta Informatica in 1972), where the fan-out-keeps-it-shallow argument and the node-split algorithm first appear.
- O'Neil, Cheng, Gawlick & O'Neil, The Log-Structured Merge-Tree (LSM-Tree) (1996) — the paper that named the structure and worked out the merge cost; short, and the source of every modern compaction design.
- Athanassoulis et al., Designing Access Methods: The RUM Conjecture (EDBT 2016) — the formal statement of the read/update/memory trade-off triangle that frames the whole comparison.
- Petrov, Database Internals (O'Reilly, 2019) — the modern, implementation-level treatment of both engines side by side, with the compaction strategies and on-disk layouts spelled out.
Try it in the lab
All effects →Phase Portrait
mathsODE trajectories flowing through vector fields — Lotka-Volterra, Van der Pol, Duffing.
odedynamical systemsInverse Kinematics
engineering2R planar robot arm solving for joint angles via analytic IK — drag the end-effector.
roboticskinematicsTransmission Line Pulse
engineeringTDR — a voltage pulse travels, reflects, and inverts on a mismatched line.
rftdrimpedance
More from the blog
How Python Dicts Really Work: Open Addressing, Probing, and the djb2 Lie
The dict is the most-used data structure in Python and almost nobody knows it uses open addressing, not chaining — so a single deletion leaves a tombstone, and a simple hash table is really a careful dance between load factor, probe sequences, and cache lines.
A* Search, Visually: the Heuristic Is the Whole Game
A* is not a clever algorithm so much as Dijkstra plus a bet about the future. The same code becomes Dijkstra, greedy best-first, or A* depending on one term in the priority key — and admissibility is the single property that buys optimality.
AM, FM, QAM: A Tour of the Modulation Zoo
Every modulation scheme is the same act — painting information onto a carrier — and they differ only in which property of the carrier you paint on. Plotted as a constellation, AM is a line, FM is a circle, and QAM is a grid.