On 6/23/18, Orion experienced a roughly thirty-minute period during which new users encountered a blocking error when attempting to sign up. The problem was first reported by a member of the Orion community. Users who clicked “re-send confirmation email” also received an error. Registered users of Orion were still able to browse and use the system as usual, including upgrading and downgrading.
What was the proximate cause?
This was unrelated to any recent development changes made in the Orion codebase or new versions. The most recent version of the Orion portal, v1.35, was released on 6/21/18, and automated tests correctly confirmed that users were able to sign up following this release.
On Saturday morning, Orion's account with Sendgrid, the upstream third-party provider Orion uses to send emails, hit the maximum email volume authorized on its current plan tier. Sendgrid responded by disabling service to Orion.
What was the root cause?
Orion sends confirmation emails using a third-party backend provider, Sendgrid. On the current plan, Sendgrid caps the number of confirmation emails sent at 100/day once the 40K monthly limit is reached. This limit had not been reached during initial development, launch and rollout, but was reached as Orion continued to scale. Once the limit was reached on 6/23/18, Sendgrid shut down service to Orion. This caused the Orion application to throw an error when attempting to send a confirmation email to new users signing up.
What was the immediate fix?
The team reviewed logs to identify the source of the error.
The team temporarily turned off sending confirmation emails. This resolved the immediate error, so that users could continue to sign up without receiving an error.
The Sendgrid plan was upgraded to one that permits overages, so exceeding the limit will no longer disable the service.
Sending email was turned back on without incident and service resumed normally around 10:41.
What changes are being made to address this type of error in the future?
Downtime on critical flows in a production application is a serious matter, so we take steps (including analyses of this type) to understand the causes and implement safeguards against them.
First, and most obviously, we are reviewing all active third-party services to ensure they can tolerate increases in usage as Orion scales. We should aim to verify at 10x our current load: at 5K users, we should test to confirm we can support scaling to 50K users without interruption in service.
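The 10x target described above can be expressed as a simple check against each provider's plan limit. The following is only a minimal sketch: the `ServiceQuota` type, the service names, and the usage figures are hypothetical, and only the 40K monthly email limit comes from this report.

```python
from dataclasses import dataclass

HEADROOM_FACTOR = 10  # target from the text: verify at 10x current load


@dataclass
class ServiceQuota:
    name: str
    monthly_limit: int        # hard cap on the current plan
    current_monthly_use: int  # observed usage at today's load (hypothetical)


def has_headroom(quota: ServiceQuota, factor: int = HEADROOM_FACTOR) -> bool:
    """Return True if the plan limit covers `factor` times current usage."""
    return quota.current_monthly_use * factor <= quota.monthly_limit


def services_at_risk(quotas, factor=HEADROOM_FACTOR):
    """List the names of services whose limits would be exceeded at `factor`x load."""
    return [q.name for q in quotas if not has_headroom(q, factor)]


# Illustrative figures only; real numbers would come from each provider's dashboard.
quotas = [
    ServiceQuota("email", monthly_limit=40_000, current_monthly_use=38_000),
    ServiceQuota("error-monitoring", monthly_limit=1_000_000, current_monthly_use=20_000),
]
print(services_at_risk(quotas))  # ['email']
```

Running a check like this on a schedule, rather than once, would flag a provider before its limit is reached instead of after.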
Second, we are activating Sentry, our error-monitoring and notification system, in order to send notifications by text, email, and Slack, each time any user encounters errors in Orion. This will allow us to identify user errors and respond to them promptly without waiting for a community response. A team member on pager duty will be available to triage and remedy issues promptly.
Third, the portions of Orion that access third-party services are being moved out of the main process and into background jobs so that they tolerate failures in upstream providers. The portions that access the blockchain already run in background jobs, but the email confirmation during sign-up did not.
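To illustrate why moving the email send into a background job protects sign-up, here is a minimal sketch using only the Python standard library. It is not Orion's actual implementation: `sign_up`, `email_worker`, `EmailProviderError`, and `broken_send` are hypothetical names, and a production system would more likely use a dedicated job queue.

```python
import queue
import threading


class EmailProviderError(Exception):
    """Raised by the (hypothetical) email client when the provider rejects a send."""


def sign_up(email, users, jobs):
    """Create the account and enqueue the confirmation email; never call the provider here."""
    users.append(email)
    jobs.put(email)
    return "ok"


def email_worker(jobs, send, failed, max_attempts=3):
    """Drain the queue in the background, retrying failed sends a few times."""
    while True:
        email = jobs.get()
        if email is None:  # sentinel: stop the worker
            jobs.task_done()
            break
        for _ in range(max_attempts):
            try:
                send(email)
                break
            except EmailProviderError:
                continue
        else:
            failed.append(email)  # all attempts failed; keep for later retry or alerting
        jobs.task_done()


def broken_send(email):
    # Simulates the provider having disabled service.
    raise EmailProviderError("quota exceeded")


users, failed = [], []
jobs = queue.Queue()
worker = threading.Thread(target=email_worker, args=(jobs, broken_send, failed), daemon=True)
worker.start()

result = sign_up("new.user@example.com", users, jobs)  # succeeds despite the outage
jobs.put(None)
jobs.join()
print(result, users, failed)
```

In this sketch the sign-up call returns successfully even though every send attempt fails; the failure is recorded on the side, where it can be retried or reported, instead of surfacing to the user as an error.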
Fourth, our automatic test-runner, CircleCI, which runs automated tests on key functions every time a new version of Orion is released, will additionally run once per hour against the production environment to check that things are functioning normally, rather than only on each release. This will provide an additional layer of protection to catch failures that occur independently of our release cycle.
Last, we should plan on implementing more informative messages on error screens, so that users see a message explaining the context of their error rather than an unbranded “500” message.
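One way to replace a bare “500” with contextual messages is to map internal exception types to user-facing copy. The sketch below is illustrative only: `EmailDeliveryError`, `ErrorPage`, the status codes, and the message text are all assumptions, not Orion's actual error handling.

```python
from dataclasses import dataclass


class EmailDeliveryError(Exception):
    """Hypothetical error raised when a confirmation email cannot be sent."""


@dataclass
class ErrorPage:
    status: int
    title: str
    message: str


# Hypothetical mapping from internal exception types to user-facing copy.
FRIENDLY_ERRORS = {
    EmailDeliveryError: ErrorPage(
        503,
        "We couldn't send your confirmation email",
        "Please try again in a few minutes.",
    ),
}

# Branded fallback for anything without a specific message.
GENERIC_ERROR = ErrorPage(
    500,
    "Something went wrong",
    "Please try again shortly.",
)


def render_error(exc: Exception) -> ErrorPage:
    """Pick the matching friendly page for an exception, else the branded generic one."""
    for exc_type, page in FRIENDLY_ERRORS.items():
        if isinstance(exc, exc_type):
            return page
    return GENERIC_ERROR


print(render_error(EmailDeliveryError("quota exceeded")).title)
print(render_error(ValueError("unexpected")).title)
```

Keeping the mapping in one place means new failure modes get a tailored message by adding a single entry, while unknown errors still fall back to a branded page.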