Response to the recent incident

In this blog post, we would like to formally address the incident that happened on Friday, Dec 4, 2020. It would be easy to blame an external factor, but instead we are going to outline everything we did wrong, how we fixed it, and how we plan to prevent it from happening again.

In short, we sent a large number of misleading emails and closed some inactive accounts without prior notice. The main cause was a combination of poor coding, UX, and communications.

What were our expectations?

To improve performance and as part of routine maintenance, we updated our account closure process. Account closure emails were supposed to be sent on day 15 and day 44 of inactivity.

What happened?

At around midnight our time, our system started sending dozens of account closure emails to our customers. This caused a lot of fear and confusion.

In technical terms, this is what happened. We use AWS EKS to auto-scale our services. At that particular time, there were 12 instances of the cron service running, and each instance ran the same email job. As a result, a single user received dozens of copies of the same email in a short time. To solve this particular issue, we separated this specific cron service from the rest and disabled auto-scaling for it.
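To illustrate the failure mode, here is a minimal sketch of one common guard against duplicate sends when several replicas run the same job. It is not the fix described above (we separated the cron service and disabled auto-scaling); it is a complementary idea shown under stated assumptions. The names `NoticeLedger`, `run_cron_job`, and `simulate` are hypothetical, and the in-memory ledger stands in for a shared store, such as a database table with a unique key.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class NoticeLedger:
    """Hypothetical stand-in for a shared store that all replicas can reach."""

    def __init__(self):
        self._claimed = set()
        self._lock = threading.Lock()

    def claim(self, key):
        """Atomically record the key; return True only for the first caller."""
        with self._lock:
            if key in self._claimed:
                return False
            self._claimed.add(key)
            return True


def run_cron_job(ledger, outbox, user_id, notice):
    """One replica's run: send the notice only if no other replica claimed it."""
    key = (user_id, notice)
    if ledger.claim(key):
        outbox.append(key)  # stands in for the real email send


def simulate(replicas=12):
    """Run the same job on several replicas and return what was 'sent'."""
    ledger = NoticeLedger()
    outbox = []
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [
            pool.submit(run_cron_job, ledger, outbox, "user-1", "closure-notice")
            for _ in range(replicas)
        ]
        for future in futures:
            future.result()  # surface any exception raised in a worker
    return outbox
```

With this guard, `simulate(12)` returns a single entry even though twelve replicas ran the job, because only the first replica to claim the key sends the email.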

Here is an outline of what happened:

  1. A customer receives an account closure email, which is poorly worded and confusing.
  2. The customer tries to sign in through the organization subdomain provided in the email:
    1. Gets a “404 not found” error
    2. Tries to reset the password, but the reset does not work
    3. Successfully signs in
  3. The customer visits erxes.io/signin:
    1. Can’t log in, so creates a new account that is not related to their original organization.
    2. Signs in and ends up in the new feature called “Global Profile”, whose interface is different from the one for managing organizations.
      1. Doesn’t see their SaaS organization listed there, or the redirect link.
      2. The SaaS account limits for AppSumo users display the wrong numbers.
  4. The bulk emailing account is shut down because of the high number of spam reports caused by the incident.

Who received these emails?

SaaS account owners:

  • whose organization was on a free plan
  • who did not sign in to their organization for 45 consecutive days

Which accounts were deleted?

It’s important to point out that no active organization account was deleted.

  • Some of the email recipients had created multiple organizations with the same email address. There were 63 such cases. For example, if I had myorganization.app.erxes.io and mytest.app.erxes.io, the mytest.app.erxes.io account was deleted because I wasn’t using it.
  • In other cases, customers had forgotten that they had canceled their subscription or refunded their AppSumo codes. Since they were now on the free plan and inactive, their accounts were deleted.

What measures have we taken so far?

Backend:

  • Fixed the bulk emailing and improved the spam handling process
  • Fixed the reset password emailing bug
  • Improved our database backup practice
  • Improved our unit tests to keep the issues above from happening again

Frontend:

  • Updated the account closure emails and triggers
    • Notice emails will be sent on days 21, 30, and 40 of inactivity.
    • The account will be closed on day 45.
    • Emails now include more details on the account status and possible actions.
  • Improved the Global Profile interface with better instructions on how to access the SaaS detail page
  • Fixed the SaaS limits
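
The updated closure schedule listed above (notices on days 21, 30, and 40 of inactivity, closure on day 45) can be sketched as a small decision function. This is an illustration only, not our production code: the function name `action_for`, the constants, and the return values are hypothetical, and the sketch assumes the free-plan condition described earlier still applies.

```python
NOTICE_DAYS = (21, 30, 40)  # days of inactivity on which a notice email goes out
CLOSURE_DAY = 45            # day of inactivity on which the account is closed


def action_for(days_inactive, is_free_plan):
    """Return 'notify', 'close', or 'none' for an organization.

    Assumption: only organizations on the free plan are subject to closure.
    """
    if not is_free_plan:
        return "none"
    if days_inactive >= CLOSURE_DAY:
        return "close"
    if days_inactive in NOTICE_DAYS:
        return "notify"
    return "none"
```

For example, `[action_for(d, True) for d in (20, 21, 45)]` evaluates to `['none', 'notify', 'close']`, while an organization on a paid plan always gets `'none'`.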

Communications:

  • Emailed everyone after 4 hours, once we understood the full scope of the issue
  • Reached out to those customers who lost their free inactive accounts
  • Addressed the issue as a team during the weekly sprint review meeting

What are our takeaways from this experience?

  • Seemingly unrelated features are more deeply connected than we assumed
  • Test all new features on our test environment before releasing them
  • Publish updates and guides on new features right away

All in all, we apologize for the undue stress, and we are very grateful for our customers’ patience and understanding. We have a long way to go, but it is important to turn these moments into learning opportunities and keep going. Again, thank you to those who took the time to report the problem to us.