In part 1 I gave an overview of our cluster configuration. In this part we’ll dig into:
- How our data is partitioned into indices to scale over time
- Optimizing bulk indexing
- Scaling real time indexing
- How we manage indexing failures and downtime.
The details of our document mappings are mostly irrelevant for our indexing scaling discussion, so we’ll skip them until part 3.
Data Partitioning
Since WordPress.com data is constantly growing we need an indexing structure that can grow over time as well. A well-known limitation of ES is that once an index is created you cannot change the number of shards. The common solution to this problem is to recognize that searching across an index with 10 shards is identical to searching across 10 indices with 1 shard each, and indices can be created at will.
In our case we create one index for every 10 million blogs, with 25 shards per index. We use index templates so that when our system tries to index into an index that doesn't exist yet, the index is created dynamically.
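To make that concrete, here's a minimal sketch of how the blog-to-index mapping and the index template could look with the Python client. The index naming scheme, template name, and replica setting are illustrative assumptions rather than our exact production configuration, and the template format shown is the ES 1.x-era one.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

def index_for_blog(blog_id):
    """Map a blog ID to its partition index, e.g. blog 12,345,678 -> 'blogs-1'."""
    return "blogs-%d" % (blog_id // 10000000)

# An index template so that writing to a not-yet-existing 'blogs-N' index
# creates it automatically with the desired shard count.
es.indices.put_template(name="blogs", body={
    "template": "blogs-*",          # applies to any index matching this pattern
    "settings": {
        "number_of_shards": 25,
        "number_of_replicas": 1,    # illustrative; the real replica count may differ
    },
})

# Indexing a document into the right partition; ES creates the index on demand.
es.index(index=index_for_blog(12345678), doc_type="post",
         id="12345678-42", body={"title": "Hello world"})
```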
There are a few factors that led to our index and sharding sizes:
- Uniform shard sizes: Shards should be of similar sizes so that you get mostly uniform response times; larger shards take longer to query. We tried one index per 1 million blogs and found too much variation. For instance, when we migrated Microsoft’s LiveSpaces to WordPress.com we got a million or so blogs created in our DBs in a row, and they have remained pretty active. This variation drove us to put many blogs into each index and rely on the hashing algorithm to spread the blogs across all the shards in the index.
- Limit the number of shards per index: New shards are not instantly created when a new index is created. Technically, I guess, the cluster state is red for a very short period. At one point we tested 200 shards per index, and we sometimes saw a few document indexing failures in our real time indexing because the primary shards were still being allocated across the cluster. There are probably other ways around this, but it's something to look out for.
- Upper limit on shard size: Early on we tried indexing 10 million blogs per index with only 5 shards per index. Our initial testing went well, but then we found that the indices with the larger shards (the older blogs) had much longer query latencies; we were starting to hit the upper limit of our shard size. That limit varies with the kind of data you have and is difficult to predict, so it's not surprising that we hit it.
- Minimize total number of shards: We’ll discuss this further in our next post on global queries, but as the number of shards increases the efficiency of the search decreases, so reducing the number of shards helps make global queries faster.
Like all fun engineering problems, there is no easy or obvious answer; the solution comes from guessing, testing, and eventually deciding that things are good enough. In our case we figured out that our maximum shard size was around 30 GB. We then created shards that are fairly large, but which we don't think will grow to that maximum for many years.
As I’m writing this, after a few months in production, we’re actually wondering whether our shards are still too large. We didn’t take into account that deleted documents also count toward shard size, and every time we reindex or update a document we effectively delete the old version. Investigation into this is still ongoing, so I won't go into the details, but the number of deleted documents in your shards depends on how much real time indexing you are doing and on your merge policy settings.
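If you want to watch how much deleted documents are inflating your shards, the docs stats expose live and deleted counts per index. A rough sketch (the output format is my own, not anything our tooling produces):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Per-index document stats; 'deleted' counts documents that still occupy
# space in the Lucene segments until merges reclaim them.
stats = es.indices.stats(metric="docs")
for name, data in sorted(stats["indices"].items()):
    docs = data["primaries"]["docs"]
    total = docs["count"] + docs["deleted"]
    if total:
        pct = 100.0 * docs["deleted"] / total
        print("%s: %d live, %d deleted (%.1f%%)"
              % (name, docs["count"], docs["deleted"], pct))
```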
Bulk Indexing Practicalities
Bulk indexing speed is a major limit in how quickly we can iterate during development, and indexing will probably be one of our limiting factors in launching new features in the future. Faster bulk indexing means faster iteration time, more testing of different shard/index configurations, and more testing of query scaling.
For these reasons we ended up putting a lot of effort into speeding up our bulk indexing. We went from bulk indexing taking about two months (estimated; we never actually ran it) to taking less than a week. Improving bulk indexing speed is very iterative and application specific. There are some obvious things to pay attention to, like using the bulk indexing API and testing different numbers of docs in each bulk request (there's a rough sketch after this list). There were also a few things that surprised me:
- Infrastructure Load: ES indexing puts a heavy load on certain parts of the WordPress.com infrastructure because it pulls data from so many places. In the end, our indexing bottleneck is not ES itself but other pieces of our infrastructure. I suppose we could throw more infrastructure at the problem, but that's a trade-off between how often we're going to bulk reindex and how much that infrastructure would cost.
- Extreme Corner Cases: For instance en.blog has millions of followers, likers, and lots of commenters. Building a list of these (and keeping it up to date with real time indexing) can be very costly – like, “oh $%#@ are we really trying to index a list of 2.3 million ids” costly (and then update it every few seconds).
- Selective Bulk Indexing: Adding fields to the index requires updating the mappings and bulk reindexing all of our data. Having a way to selectively bulk index (find all blogs of a particular type) can speed up bulk indexing a lot.
- Cluster Restarts: After bulk indexing we need to do a full rolling restart of the cluster.
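As a concrete example of the bulk API tuning mentioned before this list, here's a rough sketch of pushing documents through the Python bulk helpers and timing different batch sizes. The document source, field names, and the batch sizes tried are assumptions for illustration, not our actual pipeline.

```python
import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def bulk_actions(posts, index_name):
    """Turn post dicts into bulk-API actions for a given index."""
    for post in posts:
        yield {
            "_op_type": "index",
            "_index": index_name,
            "_type": "post",            # ES 1.x-era mapping type
            "_id": post["id"],
            "_source": post["doc"],
        }

def time_batch_size(posts, index_name, chunk_size):
    """Index the same sample with a given chunk size and report docs/sec."""
    start = time.time()
    helpers.bulk(es, bulk_actions(posts, index_name), chunk_size=chunk_size)
    return len(posts) / (time.time() - start)

# Try a few batch sizes against a sample set and pick the fastest, e.g.:
# for size in (250, 500, 1000, 2000):
#     print(size, time_batch_size(sample_posts, "blogs-0", size))
```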
I wish we had spent more time finding and implementing bulk indexing optimizations earlier in the project.
Scaling Real Time Indexing
During normal operation, our rate of indexing (20m+ document changes a day) has never really been a problem for our Elasticsearch cluster. Our real time indexing problems have mostly stemmed from combining so many pieces of information into each document that gathering the data can be a high load on our database tables.
Creating the correct triggers to detect when a user has changed a field is often non-trivial to implement in a way that won't over-index documents. Some posts get new comments or likes every second; rebuilding the entire document in those cases is a complete waste of time. We make heavy use of the Update API, mostly to reduce the load of recreating ES documents from scratch.
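For example, when a post only picks up a new comment or like, a partial update (or a scripted counter bump) touches just those fields instead of rebuilding the whole document. A minimal sketch, with index, id, and field names assumed for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Partial update: only the changed fields are sent; the rest of the
# document is left alone.
es.update(index="blogs-0", doc_type="post", id="12345678-42", body={
    "doc": {"comment_count": 57, "like_count": 312},
})

# Alternatively, a scripted update can bump a counter in place without the
# caller knowing the current value (requires dynamic scripting to be enabled).
es.update(index="blogs-0", doc_type="post", id="12345678-42", body={
    "script": "ctx._source.comment_count += 1",
})
```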
The other times when real time indexing became a problem was when something went wrong with the cluster. A couple of examples:
- Occasionally, when shards are being relocated or initialized on a number of nodes, the network can get swamped, which backs up the indexing jobs or causes a high proportion of them to start failing.
- Query load can become too high if a non-performant query is released into production.
- We make mistakes (shocking!) and break a few servers.
- Occasionally we find ES bugs in production. These have mostly been around deleteByQuery, probably because we run a lot of them.
- A router or server fails.
Real time indexing couples other portions of our infrastructure to our indexing. If indexing gets slowed down for some reason we can create a heavy load on our DBs, or on our jobs system that runs the ES indexing (and a lot of other, more important things).
In my opinion, scaling real time indexing comes down to two pieces:
- Managing downtime and performance problems in Elasticsearch, and decoupling it from our other systems.
- Recovering when indexing fails (which it will) without having to bulk index the whole data set.
Managing Downtime
We mentioned in the first post of this series that we mark ES nodes as down if we receive particular errors from them (such as a connection error). Naturally, if a node is down, then the system has less capacity for handling indexing operations. The more nodes that are down the less indexing we can handle.
We implemented some simple heuristics to reduce the indexing load when we detect that a server has been down for more than a few minutes. Once triggered, we queue certain indexing jobs for later processing by simply storing the blog IDs in a DB table. The longer any nodes are down, the fewer types of jobs we allow: as soon as any problems are found we disable bulk indexing of entire blogs; if problems persist for more than 5 minutes we stop reindexing entire posts; and eventually we also turn off any updating or deleting of documents.
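The exact job types and thresholds are specific to our system, but the shape of the heuristic is simple enough to sketch. Something like the following, with names and the final cutoff chosen purely for illustration:

```python
def allowed_job_types(nodes_down, minutes_down):
    """Decide which indexing job types may run given current cluster health.

    Anything not allowed gets queued (we store the blog IDs in a DB table)
    and is replayed once the cluster recovers.
    """
    if nodes_down == 0:
        return {"bulk_blog", "reindex_post", "update_doc", "delete_doc"}

    # Any detected problem: stop bulk reindexing of whole blogs immediately.
    allowed = {"reindex_post", "update_doc", "delete_doc"}

    # Problems persisting: stop reindexing whole posts as well.
    if minutes_down > 5:
        allowed.discard("reindex_post")

    # Prolonged outage: queue everything, including updates and deletes.
    if minutes_down > 15:   # illustrative threshold
        allowed.clear()

    return allowed
```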
Before implementing this indexing delay mechanism we had some cases where the real time indexing overwhelmed our system. Since implementing it we haven’t seen any, and we actually smoothly weathered a failure of one of the ES network routers while maintaining our full query load.
We of course also have some switches we can throw if we want to completely disable indexing and push all blog ids that need to be reindexed into our indexing queue.
Managing Indexing Failures
Managing failures means you need to define your goals for indexing:
- ES should eventually have the same data as the canonical data store.
- Minimize the need to bulk reindex everything.
- Under normal operation the index is up to date within a minute.
There are a number of different ways our indexing can fail:
1. An individual indexing job crashes (e.g. an out-of-memory error when we try to index that list of 2.3 million ids :) ).
2. An individual indexing job gets an error trying to index to ES.
3. Our heuristics delay indexing by adding it to a queue.
4. We introduce a bug to our indexing that affects a small subset of posts.
5. We turn off real time indexing.
We have a few mechanisms that deal with these different problems.
- Problems 1, 3, and 5: The previously mentioned indexing queue. This lets us pick up the pieces after any failure and prevent bulk reindexing everything.
- Problem 2: When indexing jobs fail they are retried using an exponential back-off mechanism (5 min, 10 min, 20 min, 40 min, …).
- Problem 4: We run scrolling queries against the index to find the set of blogs that could have been affected by the bug, and bulk index only those blogs (there's a rough sketch below).
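For that last case, the scrolling query can be sketched roughly like this: scan the index for documents matching the buggy condition, collect their blog IDs, and queue just those blogs for selective reindexing. The query, filter, and field names here are assumptions for illustration, written against the ES 1.x query DSL.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def blogs_affected_by_bug(index_pattern, buggy_filter):
    """Scroll over every matching document and collect the distinct blog IDs."""
    blog_ids = set()
    hits = helpers.scan(es, index=index_pattern, query={
        "query": {"filtered": {"filter": buggy_filter}},  # ES 1.x-style DSL
        "_source": ["blog_id"],
    })
    for hit in hits:
        blog_ids.add(hit["_source"]["blog_id"])
    return blog_ids

# e.g. suppose the bug left some posts without their language field:
affected = blogs_affected_by_bug("blogs-*", {"missing": {"field": "lang"}})
# ...then push these blog IDs into the indexing queue for selective bulk reindexing.
```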
It’s All About the Failures and Iteration
Looking back on what I wish we had done better, the best advice I could have gotten would have been to recognize that indexing is all about handling the error conditions. Getting the systems for handling failures and for faster bulk indexing in place sooner would have helped a lot.
In the next part of the series I’ll talk about query performance and balancing global and local queries.
