Elasticsearch Zero Downtime Upgrade and Modernization

The Fountain engineering team recently upgraded all production traffic to a new major version of Elasticsearch without having to plan and coordinate downtime with customers. 🎉

The approach was a “zero downtime” strategy: cut over one aspect of the system at a time, and make each step reversible. Avoiding downtime meant that the upgrade would not block customer usage of Fountain. Each step still required coordination and planning, but keeping the steps granular and reversible reduced risk.

While testing and fixing upgrade-related issues, the team also took the opportunity to modernize and simplify Elasticsearch-related application code.

This post recaps some of the key points in achieving both goals: a zero downtime upgrade and modernized application code.

Mission Critical

Elasticsearch is a mission-critical dependency for the Fountain platform, which is used by millions of customers for their hiring needs. Fountain takes advantage of the fast aggregations and filtered search capabilities of Elasticsearch in areas where relational database management systems may struggle. Two examples are performing many JOINs across large, normalized tables, and maintaining enough indexes on enough fields in those tables to provide fast retrieval.

Fountain also provides product functionality that is “schemaless” to help customers map out their hiring processes. A future blog post may cover this topic in more detail.

For reasons of performance, fewer schema constraints, and the traditional search use cases, Elasticsearch has been a critical piece of the Fountain platform.

Earlier this year, the Elasticsearch version in use reached “End of Life” (EOL), meaning all support for it had stopped. This created a number of issues, including with provisioning new clusters, and created risk for Fountain’s customers.

Handling Past Versions

A requirement for this project was that all application code be made compatible with Elasticsearch 7.

During the upgrade, we discovered that a mix of version 5 and version 6 clusters was actively in use! 😅

Having multiple versions in use simultaneously made the upgrade a bit more challenging because of how Elasticsearch announces deprecations from one version to the next.

For some version 5 features, the team “missed” the deprecation period in version 6 and had to go straight to removing and replacing that functionality to be compatible with version 7.

Elasticsearch Content

Elasticsearch acts as a primary data store for certain production functionality. Its data is supplied from PostgreSQL and synchronized in the background during application lifecycle events. Fountain uses Ruby on Rails for the API layer, PostgreSQL, and background jobs with Sidekiq as part of the pipeline that keeps Elasticsearch content synchronized through the Bulk API. Bulk API payloads are tuned so that indices can be populated as quickly as possible.
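To make the Bulk API step more concrete, here is a minimal sketch of how records could be grouped into bulk payloads of a tunable size. It is not Fountain's actual code: the `bulk_batches` helper, the `applicants` index name, the record fields, and the batch size are all assumptions for illustration. The action format (an `index` key whose value carries `_index`, `_id`, and `data`) is the combined form that the elasticsearch-ruby client accepts in `client.bulk(body: ...)`.

```ruby
# Build Bulk API action batches from plain record hashes.
# Assumptions: the index name, batch size, and record fields are invented.
def bulk_batches(records, index:, batch_size: 500)
  records.each_slice(batch_size).map do |slice|
    slice.map do |record|
      # One "index" action per record; the document body goes under :data.
      {
        index: {
          _index: index,
          _id: record.fetch(:id),
          data: record.reject { |key, _value| key == :id }
        }
      }
    end
  end
end

records = (1..5).map { |i| { id: i, name: "Applicant #{i}" } }
batches = bulk_batches(records, index: "applicants", batch_size: 2)
puts batches.map(&:size).inspect

# With a real client, each batch would be sent in one request:
#   batches.each { |batch| client.bulk(body: batch) }
```

The batch size is the main tuning knob in a sketch like this: larger batches mean fewer round trips, while smaller batches keep each request and any retry cheaper.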

Risks for Zero Downtime and In-Place Upgrade Strategies

An in-place upgrade was not an option due to the size of the indices and the minutes of downtime that migrating them would require. An in-place upgrade also carried more risk that bugs and issues might go undiscovered, and its rollback would be more costly. Reindex times are something the team is actively looking into for future development to provide more flexibility.

A “zero downtime” strategy was chosen early in the process because of a number of benefits:

Reducing overall risk by breaking a single step into multiple steps.
Reducing individual step risk by making each step reversible, for example by migrating “writes” first and then migrating “reads”.
Providing an opportunity to modernize the application code, including required and optional changes, and removals of unused code.

Risks of Zero Downtime Upgrades

Zero downtime upgrades are not without risks of their own. These risks tend to be less technical and more organizational: financial cost, sunk cost, and opportunity cost.

Financial cost: running clusters on both versions simultaneously temporarily increased spending.
Organizational cost: running two deployments simultaneously actually created more complexity in the short term, while unrelated product development continued. Eventually the old instances were shut down and removed, and the development environment was simplified.
Sunk cost and opportunity cost: at a certain point with this strategy, we had invested a considerable amount of engineering time, and that became a sunk cost. Spending significant engineering time on this style of upgrade also represents an opportunity cost, since that time could have been spent differently. I think the “investment” of this time was worthwhile, taking a longer view and given the stage of the company.

Steps to Perform Zero Downtime Upgrade

1. Introduce the new transport library client gem.

The 'elasticsearch', '~> 7.0' gem was compatible with version 7 and also backwards compatible with version 6.

2. Provision the destination Elasticsearch 7 cluster.

We used an environment variable to manage the connection string information and were able to verify connectivity before starting to use it from application code.

3. From all code paths that perform indexing (INDEX requests), begin indexing on the Elasticsearch 7 deployment as well.

This is sometimes called “double writing” because writes are happening in two places. There were various breaking changes to review and fix in pre-production before this could be put into production.

4. Backfill the Elasticsearch 7 indexes (perform indexing).

Once the “new” writes were working reliably, we needed to index all historical content in the new deployment. Fountain uses persistent Kubernetes jobs, detached from our rolling deployments, to call Ruby on Rails Rake tasks that perform reindexing for historical content. Once backfilling was completed, we could run sanity checks that compared query results from the old version and the new version and confirmed they were equivalent.

5. Switch reads to ES 7.

Begin testing all search queries from ES 7. This is where a majority of the work occurred. The team worked through various bugs and breaking API changes in pre-production. We relied on a large CI test suite to surface many of the issues as broken tests to investigate. Once all of the tests were passing, we conducted some manual verifications as well. Finally, “reads” (queries) could be switched over either for individual indexes or for all indexes.
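The double write, the post-backfill sanity check, and the reversible read switch described in the steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code Fountain shipped: `DualClusterRouter`, `FakeClient`, the `ES7_URL` variable name, and the policy of recording (rather than raising) failures on the new cluster are all hypothetical. The only thing borrowed from the real client is the shape of its `index`, `search`, and `count` calls.

```ruby
# Sketch of double writing with a reversible read switch.
# Assumption: both clients respond to `index`, `search`, and `count` with
# keyword arguments, as the elasticsearch-ruby client does.
class DualClusterRouter
  attr_accessor :read_from_new
  attr_reader :shadow_write_errors

  def initialize(old_client:, new_client:, read_from_new: false)
    @old_client = old_client
    @new_client = new_client
    @read_from_new = read_from_new
    @shadow_write_errors = []
  end

  # Write to the old cluster first; it remains the source of truth.
  # A failure on the new cluster is recorded instead of raised (an assumed
  # policy) so that the double write cannot break production indexing.
  def index(**args)
    result = @old_client.index(**args)
    begin
      @new_client.index(**args)
    rescue StandardError => e
      @shadow_write_errors << e
    end
    result
  end

  # Reads go to exactly one cluster; flipping the flag back is the rollback.
  def search(**args)
    (read_from_new ? @new_client : @old_client).search(**args)
  end

  # Post-backfill sanity check: compare document counts on both clusters.
  def counts_match?(index:)
    @old_client.count(index: index)["count"] == @new_client.count(index: index)["count"]
  end
end

# In-memory stand-in so the sketch runs without a cluster. In an application
# the clients might instead be built along the lines of
#   Elasticsearch::Client.new(url: ENV.fetch("ES7_URL"))  # hypothetical variable name
class FakeClient
  def initialize(name)
    @name = name
    @docs = Hash.new { |hash, key| hash[key] = {} }
  end

  def index(index:, id:, body:)
    @docs[index][id] = body
    { "result" => "created" }
  end

  def search(index:, body: {})
    { "served_by" => @name, "hits" => { "hits" => @docs[index].values } }
  end

  def count(index:)
    { "count" => @docs[index].size }
  end
end

router = DualClusterRouter.new(old_client: FakeClient.new("es6"), new_client: FakeClient.new("es7"))
router.index(index: "applicants", id: 1, body: { name: "Ada" })
before_switch = router.search(index: "applicants")["served_by"]
router.read_from_new = true if router.counts_match?(index: "applicants")
after_switch = router.search(index: "applicants")["served_by"]
puts "#{before_switch} -> #{after_switch}"
```

Keeping the read switch as a single flag is what makes the step reversible in this sketch: setting it back to `false` sends queries to the old cluster again while writes continue to land in both places. A count comparison is only the cheapest possible equivalence check; comparing the results of representative queries would give more confidence.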

Once the application code and all clusters in production had moved to the new version, the very last step was to clean everything up. The cleanup phase involved removing support for two versions, removing all temporary code, and removing all temporary and historical environment variables that were no longer referenced. Once that was done and there were no application errors, all the old deployments were terminated.

This last step was a very satisfying step after a long upgrade effort! 😎

Do Not Harm Unrelated Development

We could not forget about the developers on the team building things unrelated to this upgrade project. We wanted to minimize any interruptions to their workflows.

We configured CircleCI to run two versions of Elasticsearch simultaneously, and provided a configuration for Docker and docker-compose that developers could use to run both versions. This allowed developers to test one path or both paths, depending on what their individual work tasks were.

In terms of documentation, we maintained install instructions in the project README and on internal wikis where possible. We also offered “office hours” to help unblock developers. Several developers do not regularly work in this area and needed some assistance to get their local development environments up to date.

To summarize the steps taken to preserve Developer Experience:

Maintained install instructions in the GitHub README and wiki documentation.
Maintained development seed data, which created and populated indices for development.
Supported multiple versions simultaneously in development.
Held dedicated “office hours” (across various time zones) to provide personalized, ad hoc support for local development environment maintenance.

Repeat For All Environments

Fountain has over a dozen deployment environments, which meant that most of this upgrade process was repeated over a dozen times! Fountain makes heavy use of Terraform to manage infrastructure dependencies, although the infrastructure for this project pre-dated Terraform usage. The team does have strong Infrastructure support (shout out Sriram and Dan 🙌) helping make it easier for application developers to get configuration and code changes deployed.

Remove Accidental Complexity

In the process of application code modernization, we found that 2 indexes were unused entirely! Those 2 indexes were not migrated. We also found another index that was being used primarily for count queries and we could relocate those counts to a similar index. Maintaining fewer indexes reduced operational complexity.

In terms of accidental complexity in the application, past migrations had left us with over 5 different client configurations. When I started on the project, I thought there were many more clusters in use, when in fact there were duplicated configurations for the same clusters. We were able to consolidate these client configurations into a single configuration through a combination of removing duplicates and relocating indexes that were spread across multiple clusters into a single cluster.
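One way to spot duplicated configurations like these is to group them by the cluster they actually point at. The sketch below is purely illustrative: the configuration names and URLs are invented, and `configs_by_cluster` is a hypothetical helper, not part of any library or of Fountain's code.

```ruby
# Hypothetical client configurations; names and URLs are invented.
client_configs = {
  applicants: "https://search-a.internal:9200",
  applicants_legacy: "https://search-a.internal:9200/",
  reports: "https://search-b.internal:9200",
  reports_counts: "https://search-b.internal:9200",
  messages: "https://search-a.internal:9200"
}

# Group configuration names by normalized URL to see how many distinct
# clusters are really behind them.
def configs_by_cluster(configs)
  configs
    .group_by { |_name, url| url.chomp("/") }
    .transform_values { |pairs| pairs.map(&:first) }
end

clusters = configs_by_cluster(client_configs)
clusters.each { |url, names| puts "#{url}: #{names.join(', ')}" }
```

In this made-up example, five configurations collapse to two real clusters, which is the kind of finding that makes consolidation into a single configuration tractable.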

Removal of this “accidental complexity” will help with onboarding new engineers to the project!

Acknowledgements

Thank you to the team members who helped along the way: Steven, Andrew, Michael, Sriram, Dan, and others. 🙌

Next Steps

Fountain is hiring Application and Infrastructure Engineers looking to help grow and scale Elasticsearch and other technologies.

If these sorts of challenges are interesting to you and you wish to reach out, or if you have other general feedback on this post, please send me an email at [email protected].

Thank you for reading!