Saturday, July 9, 2016

Shard Key Selection in Sharding MongoDB

The shard key determines the distribution of the collection’s documents among the cluster’s shards. The shard key is either a single indexed field or a set of indexed compound fields that exists in every document in the collection.
MongoDB partitions data in the collection using ranges of shard key values. Each range defines a non-overlapping range of shard key values and is associated with a chunk.
MongoDB attempts to distribute chunks evenly among the shards in the cluster. The shard key has a direct relationship to the effectiveness of chunk distribution. 
IMPORTANT
Shard key fields and values are immutable. Attempting to modify a shard key field or value throws an exception.
You cannot change a shard key after sharding the collection.
Once inserted, a document’s shard key value cannot be changed.

Creating a Shard Key

The sh.shardCollection() method shards the target collection based on the indexed fields passed to the key parameter.
sh.shardCollection( namespace, key )
The namespace parameter consists of a string <database>.<collection> specifying the full namespace of the target collection. The key parameter consists of a document containing a field and the traversal direction for that field.
If the collection is empty, sh.shardCollection() creates indexes based on the specified fields and the traversal direction. If the collection is not empty, create the indexes first using db.collection.createIndex(). The fields and traversal directions passed to key must match the target indexes.
You cannot create a shard key on a field or fields that are not indexed.
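As a minimal sketch of the steps above, the helper below creates the index and then shards the collection. The database name records, the collection users, and the field email are assumptions made for illustration, and sharding is assumed to be enabled on the database already. The shell's sh and db objects are passed in as parameters so the function does not depend on globals.

```javascript
// Hypothetical helper: shard records.users on the "email" field.
// Pass the mongo shell's global `sh` and `db` objects as arguments.
function shardUsersByEmail(sh, db) {
  var key = { email: 1 }; // field and traversal direction (ascending)

  // Needed when the collection already contains documents. On an empty
  // collection, sh.shardCollection() would create this index itself.
  db.getSiblingDB("records").users.createIndex(key);

  // The key passed here must match the index created above.
  return sh.shardCollection("records.users", key);
}
```

From a shell connected to a mongos, this would be called as shardUsersByEmail(sh, db).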

Shard Key Requirements

Key Size

A shard key cannot exceed 512 bytes.
When sharding an existing collection, the shard key can constrain the maximum supported collection size for the initial sharding operation only. 
IMPORTANT
A sharded collection can grow to any size after successful sharding.

Shard Key Indexes

A shard key index can be an ascending index on the shard key, a compound index that starts with the shard key and specifies ascending order for the shard key, or a hashed index.
A shard key index cannot be a multikey index, a text index, or a geospatial index on the shard key fields.
All sharded collections must have an index that starts with the shard key; i.e. the index can be an index on the shard key or a compound index where the shard key is a prefix of the index.
The index on the shard key cannot be a multikey index.
If you shard a collection without any documents and without such an index, sh.shardCollection() creates the index on the shard key. If the collection already has documents, you must create the index before using sh.shardCollection().
If you drop the last valid index for the shard key, recover by recreating an index on just the shard key.
For restrictions on shard key indexes, see Shard Key Limitations.

Unique Indexes

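In general, sharded collections cannot enforce unique indexes, because insert and indexing operations are local to each shard; the shard key itself is the exception described below. MongoDB does not support creating new unique indexes in sharded collections and does not allow sharding collections with unique indexes on fields other than _id.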
MongoDB can enforce uniqueness on the shard key. MongoDB enforces uniqueness on the entire key combination, and not specific components of the shard key. You cannot specify a unique constraint on a hashed index. To enforce uniqueness on the shard key, pass the unique parameter as true to the sh.shardCollection() method.
The best way to ensure a field has unique values is to generate universally unique identifiers (UUIDs), such as MongoDB’s ObjectId values.
For collections where _id is not part of the shard key, the application is responsible for ensuring that the _id field is unique.
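A short sketch of passing the unique parameter might look like the following. The namespace records.users and the field email are hypothetical; the third argument to sh.shardCollection() is the unique flag described above.

```javascript
// Hypothetical helper: shard records.users on "email" and ask MongoDB
// to enforce uniqueness on the shard key.
function shardWithUniqueKey(sh) {
  // The key is a ranged (ascending) key; a hashed key cannot carry
  // a unique constraint.
  var key = { email: 1 };
  var unique = true;
  return sh.shardCollection("records.users", key, unique);
}
```

With a compound key, uniqueness would apply to the whole key combination rather than to each field separately.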

Choosing a Shard Key

The choice of shard key affects how the sharded cluster balancer creates and distributes chunks across the available shards. This affects the overall efficiency and performance of operations within the sharded cluster.
The shard key affects the performance and efficiency of the sharding strategy used by the sharded cluster.
The ideal shard key allows MongoDB to distribute documents evenly throughout the cluster.
At minimum, consider the consequences of the cardinality, frequency, and rate of change of a potential shard key.

Shard Key Cardinality

The cardinality of a shard key determines the maximum number of chunks the balancer can create. This can reduce or remove the effectiveness of horizontal scaling in the cluster.
A unique shard key value can exist on no more than a single chunk at any given time. If a shard key has a cardinality of 4, then there can be no more than 4 chunks within the sharded cluster, each storing one unique shard key value. This constrains the number of effective shards in the cluster to 4 as well - adding additional shards would not provide any benefit.
The following image illustrates a sharded cluster using the field X as the shard key. If X has low cardinality, the distribution of inserts may look similar to the following:
The cluster in this example would not scale horizontally, as incoming writes would only route to a subset of shards.
A shard key with high cardinality does not guarantee even distribution of data across the sharded cluster, though it does better facilitate horizontal scaling. The frequency and rate of change of the shard key also contribute to data distribution. Consider each factor when choosing a shard key.
If your data model requires sharding on a key that has low cardinality, consider using a compound index using a field that has higher relative cardinality.
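To make the cardinality argument concrete, the sketch below counts distinct shard key values in a small in-memory sample. The sample documents and the field names region and userId are invented for illustration; on a real collection the same idea would be applied to a representative sample of documents.

```javascript
// Count the distinct shard key values in a sample of documents.
// `keyFields` lists the fields that make up the (possibly compound) key.
function shardKeyCardinality(docs, keyFields) {
  var seen = {};
  docs.forEach(function (doc) {
    var value = keyFields.map(function (f) { return doc[f]; });
    seen[JSON.stringify(value)] = true;
  });
  return Object.keys(seen).length;
}

// Hypothetical sample: "region" alone has only two distinct values.
var sample = [
  { region: "EU", userId: 1 },
  { region: "EU", userId: 2 },
  { region: "US", userId: 3 },
  { region: "US", userId: 4 }
];

var regionOnly = shardKeyCardinality(sample, ["region"]);            // 2
var regionAndUser = shardKeyCardinality(sample, ["region", "userId"]); // 4
```

Sharding this sample on region alone would allow at most two chunks, while the compound key raises the ceiling to four.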

Shard Key Frequency

Consider a set representing the range of shard key values - the frequency of the shard key represents how often a given value occurs in the data. If the majority of documents contain only a subset of those values, then the chunks storing those documents become a bottleneck within the cluster. Furthermore, as those chunks grow, they may become indivisible chunks as they cannot be split any further. This reduces or removes the effectiveness of horizontal scaling within the cluster.
The following image illustrates a sharded cluster using the field X as the shard key. If a subset of values for X occur with high frequency, the distribution of inserts may look similar to the following:

A shard key with low frequency does not guarantee even distribution of data across the sharded cluster. The cardinality and rate of change of the shard key also contribute to data distribution. Consider each factor when choosing a shard key.
If your data model requires sharding on a key that has high frequency values, consider using a compound index using a unique or low frequency value.
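One rough way to inspect frequency is to count how often each candidate key value occurs and see what share of documents the most common value holds. The sketch below does this on an invented in-memory sample; the field name status and the documents are assumptions.

```javascript
// Count how often each value of `field` occurs in the sample.
function shardKeyFrequency(docs, field) {
  var counts = {};
  docs.forEach(function (doc) {
    var value = String(doc[field]);
    counts[value] = (counts[value] || 0) + 1;
  });
  return counts;
}

// Fraction of all documents that carry the single most frequent value.
function hottestShare(counts) {
  var total = 0;
  var max = 0;
  Object.keys(counts).forEach(function (value) {
    total += counts[value];
    if (counts[value] > max) max = counts[value];
  });
  return total === 0 ? 0 : max / total;
}

// Hypothetical sample: three of four documents share one value.
var orders = [
  { status: "open" }, { status: "open" }, { status: "open" }, { status: "closed" }
];
var counts = shardKeyFrequency(orders, "status"); // { open: 3, closed: 1 }
var share = hottestShare(counts);                 // 0.75
```

A share close to 1 suggests that a single chunk would end up holding most of the documents and could not be split further.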

Monotonically Changing Shard Keys

A shard key on a value that increases or decreases monotonically is more likely to distribute inserts to a single shard within the cluster.
This occurs because every cluster has a chunk that captures a range with an upper bound of maxKey. maxKey always compares as higher than all other values. Similarly, there is a chunk that captures a range with a lower bound of minKey. minKey always compares as lower than all other values.
If the shard key value is always increasing, all new inserts are routed to the chunk with maxKey as the upper bound. If the shard key value is always decreasing, all new inserts are routed to the chunk with minKey as the lower bound. The shard containing that chunk becomes the bottleneck for write operations.
The following image illustrates a sharded cluster using the field X as the shard key. If the values for X are monotonically increasing, the distribution of inserts may look similar to the following:
If the shard key value was monotonically decreasing, then all inserts would route to Chunk A instead. A shard key that does not change monotonically does not guarantee even distribution of data across the sharded cluster. 
If your data model requires sharding on a key that changes monotonically, consider using Hashed Sharding.
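The effect can be illustrated with a small simulation that routes keys to range-based chunks. Everything here is a simplified model, not MongoDB's implementation: the split points are invented, and toyHash is a stand-in multiplicative hash, not the function MongoDB uses for hashed indexes.

```javascript
// Return the index of the chunk whose range contains `key`.
// `splits` are sorted boundaries: chunk 0 is [minKey, splits[0]) and
// the last chunk is [splits[splits.length - 1], maxKey).
function chunkFor(key, splits) {
  var i = 0;
  while (i < splits.length && key >= splits[i]) i++;
  return i;
}

// Stand-in hash (Knuth multiplicative), NOT MongoDB's hashed index function.
function toyHash(n) {
  return (n * 2654435761) % 4294967296;
}

// Count how many keys land in each chunk after applying `transform`.
function countInserts(keys, splits, transform) {
  var counts = new Array(splits.length + 1).fill(0);
  keys.forEach(function (k) {
    counts[chunkFor(transform(k), splits)]++;
  });
  return counts;
}

// 100 monotonically increasing keys, all above the highest split point.
var keys = [];
for (var k = 1000; k < 1100; k++) keys.push(k);

// Ranged sharding: every insert lands in the chunk bounded by maxKey.
var ranged = countInserts(keys, [250, 500, 750], function (x) { return x; });
// ranged is [0, 0, 0, 100]

// Hashed sharding (model): split the 32-bit hash space into quarters.
var hashSplits = [1073741824, 2147483648, 3221225472];
var hashed = countInserts(keys, hashSplits, toyHash);
```

In this model the hashed keys spread over several chunks, while the ranged keys all hit the last one. In the shell, hashed sharding is requested by passing "hashed" as the value of the key field, for example sh.shardCollection("records.events", { ts: "hashed" }), where the namespace and field name are hypothetical.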

Unique Constraints on Arbitrary Fields

If you cannot use a unique field as the shard key or if you need to enforce uniqueness over multiple fields, you must create another collection to act as a “proxy collection”. This collection must contain both a reference to the original document (i.e. its ObjectId) and the unique key.
Consider a collection records that stores user information. The field email is not the shard key, but needs to be unique.
The proxy collection then would contain the following:
{
"_id" : ObjectId("..."),
"parent_id" : "<ID>",
"email" : "<string>"
}
Use the following command to create a unique index on the email field:
db.proxy.createIndex( { "email" : 1 }, { unique : true } )
The following example first attempts to insert a document containing the target field and a generated unique ID into the proxy collection. If the operation is successful, then it inserts the full document into the records collection.
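// Both collections are assumed to live in the current database.
var records = db.getCollection('records');
var proxy = db.getCollection('proxy');
var primary_id = ObjectId();
try {
  // Throws a duplicate key error if the email is already in use.
  proxy.insertOne({
    "_id" : primary_id,
    "email" : "example@example.net"
  });
  // The insert above returned successfully, so continue:
  records.insertOne({
    "_id" : primary_id,
    "email" : "example@example.net"
    // additional information...
  });
} catch (e) {
  print("Insert failed: " + e);
}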
Note that this methodology requires creating a unique ID for the primary_id field rather than letting MongoDB automatically create it on document insertion.
If you need to enforce uniqueness on multiple fields, then each field would require its own proxy collection.

Considerations

  • Your application must catch errors when inserting documents into the “proxy” collection and must enforce consistency between the two collections.
  • If the proxy collection requires sharding, you must shard on the single field on which you want to enforce uniqueness.
  • To enforce uniqueness on more than one field using sharded proxy collections, you must have one proxy collection for every field for which to enforce uniqueness. If you create multiple unique indexes on a single proxy collection, you will not be able to shard that proxy collection.
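One possible way to handle the first consideration is sketched below: the helper writes to the proxy collection first and removes the proxy entry again if the insert into records fails. The function name insertUnique and its return convention are assumptions for illustration; it only relies on insertOne() and deleteOne() on the two collection objects passed in.

```javascript
// Hypothetical helper: keep the proxy and records collections consistent.
// Returns true if the document was stored, false if the proxy insert was
// rejected. `doc` must carry its own _id and email.
function insertUnique(proxy, records, doc) {
  try {
    proxy.insertOne({ _id: doc._id, email: doc.email });
  } catch (e) {
    // Typically a duplicate key error on email. Either way, nothing has
    // been written to records, so there is nothing to undo.
    return false;
  }
  try {
    records.insertOne(doc);
    return true;
  } catch (e) {
    // Undo the proxy insert so the email does not stay reserved.
    proxy.deleteOne({ _id: doc._id });
    throw e;
  }
}
```

The compensation step is not atomic, so a crash between the two inserts can still leave an orphaned proxy entry that the application would need to clean up separately.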

Shard Key Limitations

Shard Key Size
A shard key cannot exceed 512 bytes.
Shard Key Index Type
A shard key index can be an ascending index on the shard key, a compound index that starts with the shard key and specifies ascending order for the shard key, or a hashed index.
A shard key index cannot be an index that specifies a multikey index, a text index or a geospatial index on the shard key fields.
Shard Key is Immutable
If you must change a shard key:
  • Dump all data from MongoDB into an external format.
  • Drop the original sharded collection.
  • Configure sharding using the new shard key.
  • Pre-split the shard key range to ensure initial even distribution.
  • Restore the dumped data into MongoDB.
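The middle three steps of this procedure could be scripted roughly as follows. This is only a sketch: the namespace records.users is hypothetical, the split points must be chosen by the caller, and dumping and restoring the data happen outside the function. The shell's sh and db objects are passed in as parameters.

```javascript
// Hypothetical helper: drop records.users, shard it on a new key, and
// pre-split the key range. Run only after the data has been dumped.
function reshardUsers(sh, db, newKey, splitPoints) {
  // Drop the original sharded collection.
  db.getSiblingDB("records").users.drop();

  // Configure sharding with the new shard key. The collection is now
  // empty, so sh.shardCollection() creates the index on the key.
  sh.shardCollection("records.users", newKey);

  // Pre-split: each entry is a document holding a shard key value at
  // which a chunk boundary should be placed.
  splitPoints.forEach(function (point) {
    sh.splitAt("records.users", point);
  });
}
```

For example, reshardUsers(sh, db, { email: 1 }, [{ email: "h" }, { email: "p" }]) would create chunk boundaries at "h" and "p" before the data is restored.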
Shard Key Value in a Document is Immutable
Shard key fields and values are immutable. Attempting to modify a shard key field or value throws an exception.
You cannot change a shard key after sharding the collection.
Once inserted, a document’s shard key value cannot be changed.
Monotonically Increasing Shard Keys Can Limit Insert Throughput
For clusters with high insert volumes, a shard key with monotonically increasing or decreasing values can affect insert throughput. If your shard key is the _id field, be aware that the default values of the _id field are ObjectIds, which have generally increasing values.
When inserting documents with monotonically increasing shard keys, all inserts belong to the same chunk on a single shard. The system eventually divides the chunk range that receives all write operations and migrates its contents to distribute data more evenly. However, at any moment the cluster directs insert operations only to a single shard, which creates an insert throughput bottleneck.
If the operations on the cluster are predominately read operations and updates, this limitation may not affect the cluster.
To avoid this constraint, use a hashed shard key or select a field that does not increase or decrease monotonically.
