Shard data set across multiple servers (Range-based)


Description

Sharding improves the capacity of the system by splitting the data set across multiple servers in a cluster, which reduces the amount of data stored on each node. Sharding can also improve performance by eliminating disk I/O bottlenecks on any single node.

The most common type of sharding is horizontal sharding, which requires two steps:

  • At design time or configuration time, define an attribute shared by every object in the collection, called the shard key.
  • At run time, the database system will apply an algorithm that maps each shard key to a particular server in the cluster.

Range-based sharding maps the shard key value directly to a server. For example, shard key values 1-100 map to Server #1, shard key values 101-200 map to Server #2, etc.
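
The following Python sketch shows how a run-time routing function might consult such a range table. The boundary values and server names are illustrative only; a real database keeps this mapping in cluster metadata rather than in application code.

  # Illustrative range-based routing: keys 1-100 -> server-1, 101-200 -> server-2, ...
  import bisect

  RANGE_UPPER_BOUNDS = [100, 200, 300]            # upper bound of each range
  SERVERS = ["server-1", "server-2", "server-3"]

  def route(shard_key: int) -> str:
      """Map a numeric shard key to the server that owns its range."""
      idx = bisect.bisect_left(RANGE_UPPER_BOUNDS, shard_key)
      if idx >= len(SERVERS):
          raise ValueError(f"shard key {shard_key} is outside all configured ranges")
      return SERVERS[idx]

  print(route(42))     # server-1
  print(route(150))    # server-2
  print(route(300))    # server-3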

Range-based sharding is conceptually simple and fast, but introduces several issues:

  1. If the designated shard key does not produce a (roughly) uniform distribution across the objects being stored, then some servers will store more objects than other servers. This can be mitigated by rebalancing (see next item).
  2. If the number of objects stored by each server starts to vary significantly, the database can rebalance the shards, by adjusting the partition points in the shard key space and moving objects from one server to another. This uses network and storage I/O resources, and can result in periodic performance degradation (analogous to garbage collection in a Java virtual machine).
  3. Naive use of range-based sharding can have a disastrous effect on write-intensive workloads (for example, bulk loading), as the database has to rebalance frequently.
  4. The use of certain types of attributes as shard keys can result in undesirable behavior. In particular, using a scheme where all new objects are assigned a shard key at the end of the key space results in all new objects being written to the same server, instead of being spread across all servers. Examples include using a timestamp to shard time series objects such as log entries, or using a counter to generate new order IDs and using that counter as the shard key (see the sketch after this list).
  5. Finally, a single shard key is used for both write and read operations. Choosing a shard key that creates good distribution of one type of operation across servers may result in hotspots where the other operation is not well distributed.
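
As noted in item 4, a monotonically increasing shard key concentrates all new writes on one server. The short Python sketch below contrasts counter-style keys with uniformly distributed keys; the shard count, key space, and key distributions are arbitrary choices for illustration.

  # Compare where inserts land when shard keys are a monotonically increasing
  # counter versus keys drawn uniformly at random from the key space.
  import random
  from collections import Counter

  NUM_SHARDS = 4
  KEY_SPACE = 1_000_000
  RANGE_SIZE = KEY_SPACE // NUM_SHARDS            # equal-width contiguous ranges

  def shard_for(key: int) -> int:
      return min(key // RANGE_SIZE, NUM_SHARDS - 1)

  # Counter-style keys: the most recent 10,000 inserts all hit the last shard.
  counter_keys = range(KEY_SPACE - 10_000, KEY_SPACE)
  print(Counter(shard_for(k) for k in counter_keys))    # Counter({3: 10000})

  # Uniformly distributed keys: inserts spread across all four shards.
  uniform_keys = (random.randrange(KEY_SPACE) for _ in range(10_000))
  print(Counter(shard_for(k) for k in uniform_keys))    # roughly 2,500 per shard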

Range-based sharding can reduce scalability, because increasing or decreasing the number of shards may require moving so much data that the system must be taken offline while the storage scheme is reconfigured.
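
As a rough illustration of that cost, the following sketch counts how many keys change owners when a cluster grows from four shards to five. It assumes equal-width ranges over a fixed numeric key space, which is a simplification of what real systems do.

  # Estimate the fraction of keys that must be migrated when the shard count
  # changes, assuming equal-width ranges over a fixed numeric key space.
  KEY_SPACE = 1_000_000

  def owner(key: int, num_shards: int) -> int:
      range_size = KEY_SPACE // num_shards
      return min(key // range_size, num_shards - 1)

  old_shards, new_shards = 4, 5
  moved = sum(1 for k in range(KEY_SPACE)
              if owner(k, old_shards) != owner(k, new_shards))
  print(f"{moved / KEY_SPACE:.0%} of keys must be migrated")   # 50% in this case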

Any sharding tactic complicates availability, because it introduces new failure modes. If the server for a shard fails, then reads and writes to that shard will fail, while reads and writes to other shards will succeed. If shards are replicated, then single server failures can be masked, but server failures may still result in the failure to reach a quorum for operations on one shard, while other shards are unaffected.
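
A toy Python sketch of how per-shard quorum evaluation behaves; the replication factor and shard names are invented for illustration, and no particular product's protocol is implied.

  # Each shard's quorum is evaluated independently: losing replicas in one
  # shard blocks that shard's operations without affecting the others.
  REPLICATION_FACTOR = 3
  QUORUM = REPLICATION_FACTOR // 2 + 1            # majority of replicas, here 2

  reachable_replicas = {"shard-A": 3, "shard-B": 1}   # replicas currently reachable

  for shard, up in reachable_replicas.items():
      status = "available" if up >= QUORUM else "quorum lost"
      print(f"{shard}: {status}")
  # shard-A: available
  # shard-B: quorum lost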


Improves Quality: Performance
Reduces Quality: Availability, Scalability
Related Tactics: Shard data set across multiple servers (Consistent Hashing)

Implementations

  • This tactic is supported by the feature MongoDB Data Distribution Features of the product MongoDB.
  • This tactic is supported by the feature Accumulo Data Distribution Features of the product Accumulo.
  • This tactic is supported by the feature CouchDB Data Distribution Features of the product CouchDB.
  • This tactic is supported by the feature FoundationDB Data Distribution Features of the product FoundationDB.
  • This tactic is supported by the feature HBase Data Distribution Features of the product HBase.

Notes

This tactic is related to the Data Distribution Method in the Data Distribution category.
