Skip to content

Science Behind Modern DATA Storage Systems

Familiarizing yourself with data storage and database internals eases architectural choices, helps clarify why a system acts a certain way, helps troubleshoot issues when they appear, and also fine-tunes the database to your own workload.
Reading Time: 8 minutes
data storage system

The quantities of information processed by modern software are continuously growing. With this expansion, scaling storage grows more challenging for companies. Now knowing the data storage system is critical for any CTO, as it assists in picking out the perfect one from numerous available alternatives.


Every program differs concerning Freestyle workload equilibrium, balancing demands, latencies, and access patterns.

In my last article, I talk about how SSD Storage helps our Shared Linux hosting service help boost performance, In this article, we will even dive deeper to explore the science behind the Data Storage system.

It is not possible to maximize a system in most directions. In a perfect world, there could be information structures promising the very best read and write functionality with no storage overhead but, naturally, in practice that is not feasible.

This report takes a closer look at two storage system design procedures used in the vast majority of contemporary databases–read-optimized B-trees plus write-optimized LSM (log-structured unite)-trees–also explains their use cases and tradeoffs.

Table of Contents

What is B-Tree?

B-trees are a popular read-optimized indexing information structure and generalization of binary trees. They come in many versions and are used in several databases (like MySQL Database InnoDB4 and Postgre SQL7) as well as file programs (HFS+8, HTrees at ext49).

The B in B-tree stands for Bayer, the writer of the first data structure, or Boeing, in which he worked at the time.

In a binary tree, every node has two children (called a left & right child). Left and right subtrees hold the keys that are less than and larger than the present node crucial, respectively.

To maintain the tree depth to a minimum, a tree has to be balanced: when randomly arranged keys have been added to the tree, it’s normal that one facet of the tree will eventually get deeper than the other.

One can rebalance a binary tree to use so-called rotation: rearrange nodes, compelling the parent node of this longer subtree down its kid and pulling up this child, effectively placing it in its parent’s position.

Figure 1 is a good instance of rotation used for reconciliation at a binary tree. To be able to balance the tree, then node 3 is used as a pivot (the tree is rotated around it).

Subsequently, node 5, previously a root node and a parent node for 3, becomes a child node. After the rotation step is done, the height of the left subtree decreases by one, and the elevation of the right subtree increases by one. The maximum thickness of the tree has decreased.

Binary trees are useful as in-memory information structures. Due to balancing (the necessity to maintain the depth of all subtrees into a minimum) and very low fanout (a max of two pointers per node), they do not work well on disc storage. B-trees permit for saving two or more pointers per node and operate nicely with block devices by fitting the node dimensions to the page dimensions (e.g., 4 KB).

B-trees possess the following properties:

Sorted: This allows sequential scans and simplifies lookups.

Self-balancing: There is no need to balance the tree through insertion and deletion: whenever a B-tree node is full, it’s split in 2, and once the occupancy of these neighboring nodes falls under a particular threshold, the nodes have been merged. This also suggests that leaves are both remotes from the root and may be located with an identical number of measures through exploration.

Guarantee of logarithmic hunt time. This makes B-trees a fantastic pick for database indicators, where search times are significant.

Mutable: Inserts, updates, and deletes (additionally, following splits and merges) are done disc set up. To create in-place updates potential, a particular quantity of distance overhead is necessary.

This report discusses the Btree,3 a contemporary version of the B-tree frequently used for storage. The B+tree differs from the initial B-tree1 because (a) it’s an extra amount of connected leaf nodes holding the worth, and (b) these values can’t be stored on inner nodes.

Anatomy of those B-tree

Let us take a good look at the B-tree building blocks, exemplified in figure two. B-trees have a lot of node types: root, inner, and foliage. Root (shirt) is the node that doesn’t have any parents (i.e., it isn’t a child of some other node). Leaf nodes (underside) transmit the information and don’t have any kids.


On every level, the search space is reduced into the kid subtree (the assortment of the subtree involves the hunted value) by following the kid pointer.

Figure 3 displays a search in a B-tree creating one root-to-leaf pass, after the pointers “between” the two keys, one of which will be higher than (or equal to) the searched phrase, and another of which can be significantly less than the searched phrase. When a point query is done, the hunt is complete after finding the leaf node.

Throughout the scope scan, the values and keys of this found foliage, and the intruder foliage’s nodes, are traversed, before the end of the range is attained.

Binary search is readily explained when it comes to looking for words starting with a specific letter from the dictionary, in which keywords have been sorted alphabetically.

First, you start the dictionary just at the center. If the searched correspondence is alphabetically “less than” (appears sooner than) the one opened, then you keep your search at the left half of this dictionary; differently, you continue in the ideal half. You continue reducing the remaining webpage array by half and choosing the side to follow till you locate the letter.

Every measure halves the search area, which makes the search time logarithmic. Searches in B-trees have logarithmic complexity, because the node degree keys have been sorted, and the binary search is done to be able to discover a match. That is why it is important to maintain the occupancy large and uniform around the tree.

Insert, update, & delete data

When doing insertions, the first step would be to find the target leaf. For this, the above search algorithm is utilized. Following the goal leaf is situated, value and key are appended to it. If the foliage doesn’t have sufficient free space, then this circumstance is known as overflow, as well as the leaf needs to be broken in 2.

This is accomplished by allocating a fresh leaf, shifting half the components into it, and also appending a pointer to the newly allocated foliage into the parent. If the parent does not have free space, separation is done on the parent level too. After the root exerts, its contents are divided between the newly allocated nodes, and also the root node itself is overwritten to prevent relocation.

This also means that the tree (and its own elevation) constantly grows by dividing the root node.

What is LSM-Trees?

The log-structured merge-tree is an immutable disk-resident write-optimized information structure. It’s most useful in programs where parties tend to be somewhat more regular than lookups that recover the documents.

LSM-trees are getting more attention as they can eliminate arbitrary insertions, updates, and deletions.

Anatomy of the LSM-tree

To permit sequential writes,

” LSM-trees batch writes and updates at a memory-resident dining table (often implemented with a data structure permitting logarithmic time lookups,

like a binary search tree or jump list) till its size reaches a threshold, at which stage it’s written on disc (this functionality is known as a flush).

Assessing the information requires hunting all disk-resident areas of the tree, checking the in-memory table, and mixing their contents before returning the outcome.

Figure 5 displays the arrangement of an LSM-tree: a memory-resident table used for writes.

Whenever the memory table is big enough, its sprinkled contents are written on the disc. Reads are served, hitting both discs- and – memory-resident tables, necessitating a mixed process to recreate the information.

Many Contemporary LSM-tree implementations, (For example Apache the one we use in our web hosting environment ) Apply disk-resident tables like SSTables (Sorted String Tables), due to their simplicity (simple to compose, search, and examine) and unite properties (throughout the merge, origin SSTable scans and also merged outcome writes are sequential).

Structurally, an SSTable is divided into two components: data and index blocks, as shown in figure 6. A data blocks includes composed unique key/value pairs, arranged by key. An index block includes keys to data-block pointers, pointing to where the real document is situated.

An index can be implemented with a structure optimized for rapid searches, like a B-tree, or having a hash table to get a point-query.

Every value thing within an SSTable includes a timestamp associated with it. This specifies the spare time for inserts and updates (which are usually equal) and the removal period for deletes.

SSTables have some pleasant properties:

  • Point-queries (i.e., locating a value by key) could be accomplished quickly by signing up the key indicator.
  • Scans (i.e., iterating over all key/value pairs at a given key range) may be accomplished effectively by simply studying key/value pairs sequentially in the information block.

An SSTable represents a snapshot of database operations within a time period since the SSTable is generated by the flush procedure from the memory-resident table which functioned as a buffer for operations against the database condition for this interval.

Retrieving data necessitates hunting all SSTables on disc, checking the memory-resident tables, and consolidating their contents prior to returning the outcome.

The merge step through the read is needed because the hunted data can live in numerous SSTables.The merge step can also be vital to make sure the deletes and updates operate.

Likewise, an upgrade is merely a document using a later timestamp. A similar thing occurs with all the updates: from 2 documents with the identical key, just the one with all the subsequent timestamp is returned.

Figure 7 displays a mix step reconciling the information stored in different tables for the exact same key: as exhibited here, the album for Alex was composed with timestamp 100 and upgraded with a brand new phone and timestamp 200; the album for John has been deleted. The other two entries are accepted as is, as they are not shadowed.

To decrease the number of hunted SSTables and also to prevent checking every SSTable to your hunted key, many storage systems use a data structure called a Bloom filter.

This really is a probabilistic data structure that may be utilized to check if an element is part of this group.

It may create false-positive games (i.e., say that the component is a part of place, although it’s not, in actuality, present there) but cannot create false negatives (i.e., in case a negative game is returned, the component is guaranteed not to become part of the group).

To put it differently, a Bloom filter is used to inform if the key “could be at an SSTable” or “is not within an SSTable.”


Among the largest differences between the B-tree and LSM-tree information structures is that which they optimize for and what consequences those optimizations have. In Conclusion, B-trees possess the following properties:

  • They’re mutable, allowing for in-place updates by introducing a few distance overhead and much more involved write routes, though it doesn’t demand full file rewrites or multisource merges.
  • They’re read-optimized, meaning they don’t need reading from (and then merging) multiple resources, thereby simplifying the browse path.
  • Writes may activate a cascade of node breaks, making some write operations more costly.
  • They’re optimized for paged surroundings (block storage), in which byte fixing isn’t feasible.
  • Fragmentation, brought on by regular updates, may require extra maintenance and prevent rewrites. B-trees, but usually require less upkeep compared to LSM-tree storage.
  • Concurrent access necessitates reader/writer isolation and entails chains of locks and latches.
  • They’re immutable. SSTables are composed on disc and never upgraded. Compaction can be utilized to recreate the area occupied by removed things and unite same-key information from several data files.
    Merged SSTables are discarded and removed following a successful mix within the compaction procedure. Another helpful property coming from immutability is the fact that refrigerated tables could be obtained concurrently.
  • They’re composed optimized, meaning that writes are buffered and flushed on disc, possibly allowing for the spatial area on the disc.
  • Reads might necessitate obtaining data from several sources, since information for the exact same key, composed during different times, may property in various data files. Records need to go through the mixing process prior to being returned to the customer.
Picture of Neha Somani

Neha Somani

Passionate and almost obsessive about reading books. Totally, in love with literature. Sleep is something I can never have enough of.

Leave a Comment

Get the latest news and deals

Sign up for email updates covering blogs, offers, and lots more.

Current Deals at Netspace

Subscribe: Trusted By 1M+ Readers

Get the weekly Tech Update straight to your inbox.

Our customer sales team is here to answer your questions. Ask us anything!
👋 Hi, how can I help?