
Business Continuity Plan


Introduction

This document describes Eventmaker's technical infrastructure. In particular, it explains the strategies implemented for the redundancy of the Eventmaker platform's critical systems, details the database backup policy and frequency, and describes the various security mechanisms.

The business continuity plan provides a brief overview of how continuity is managed in the event of a disruption or disaster at Eventmaker. It presents:

  • the technical infrastructure

  • the strategies implemented for the redundancy of the critical systems of the platform and the information system

  • the database backup policy and frequency

  • the various security mechanisms

The main objectives of this document are to:

  • preserve customer data

  • preserve the availability of Eventmaker services

  • ensure a rapid recovery of priority and essential processes in the event of a disruption or disaster

  • respond effectively in case of a disruption or disaster

The document is reviewed at least once a year or in the event of a major change to the information system. The business continuity plan will also be tested annually, which may result in a review of the document if certain processes need to be modified following the test results.

Key people and contact information

Name | Role | Email

Ivan Maireaux | CTO | ivan.maireaux@eventmaker.io

Wilfried Deluche | Partner Solution Manager | wilfried.deluche@eventmaker.io

Tristan Verdier | Managing Director | damien.schmitz@infopro-digital.com

Identified risks

Introduction of a failure in the web application code

Description

The Eventmaker web application is developed following the principle of continuous delivery. Consequently, new versions of the application are deployed to production regularly and sometimes several times a day. A new version can introduce a failure in the application.

Impact analysis

The impact of introducing a failure can vary depending on the criticality of the failure and the part of the system it affects. The worst-case scenario is a failure that renders the application unusable.

The impact therefore ranges from a failure with no real effect on system use to a failure that makes the system unusable.

Detection methods

The introduction of a failure in production can be detected in three ways.

  • If the failure is a code error that causes an abnormal interruption of code execution, the error is reported to an error-tracking system: sentry.io. The entire development team is notified by email that an error has occurred.

  • If the application becomes completely inaccessible, a service continuity monitoring system, pingdom, notifies the development team and the CEO by email.

  • If the failure is functional, meaning the code does not follow the correct logic and therefore does not deliver a feature, it is detected by humans and recorded in our project management tool, Basecamp, so that the development team is notified by email. Depending on the criticality, the malfunction may also be reported directly to the development team, verbally or by instant message.

Preventive methods

To minimize failures, Eventmaker follows development best practices:

  • use of git as the code version control system

  • use of GitHub and the GitHub flow, with systematic code reviews

  • unit tests and integration tests (code coverage: 79.14%)

  • continuous integration: unit tests are executed automatically; a code linter, rubocop, is also executed to ensure code readability

Additionally, potentially critical new versions are tested internally on a pre-production environment.

When a new version is deployed, it becomes the main version only after it successfully responds to a startup probe. This prevents deploying a version with a critical failure that would, for example, stop the application from starting or processing HTTP requests.

Finally, processes that stop functioning abnormally are automatically restarted by Kubernetes, our Linux container management system.
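As a rough illustration of the startup-probe gate described above, the sketch below promotes a new version only once a probe succeeds within a bounded number of attempts. It is not Eventmaker's deployment code, nor how Kubernetes implements probes; `promote_if_healthy`, the `probe` callable, and the default attempt count are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable


def promote_if_healthy(
    probe: Callable[[], bool],
    max_attempts: int = 5,
    delay_seconds: float = 0.0,
) -> str:
    """Return "new" if the probe succeeds within max_attempts, else "previous".

    `probe` stands in for an HTTP startup check against the new version
    (hypothetical; in production the orchestrator performs this check).
    """
    for _ in range(max_attempts):
        try:
            if probe():
                return "new"  # the new version becomes the main version
        except Exception:
            pass  # a probe that raises counts as a failed attempt
        time.sleep(delay_seconds)
    return "previous"  # the previous version keeps serving traffic
```

A version that never answers the probe is never promoted, so the previous version continues to serve requests.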

Remediation methods

When a failure is detected, it is assessed directly by the development team. If it is determined that the failure must be addressed immediately, developers authorized to deploy the application can perform a rollback, that is, replace the production version with a previous version. The rollback duration is approximately 2 minutes.

Application cluster outage

Description

The application cluster is the set of servers that run the Eventmaker web application. It is located in Amazon AWS Cloud (more information in this document). Outages can originate from various sources: a virtual machine stops functioning, network problems within the VPC, or any other failure that AWS may experience.

Impact analysis

The impact is the total unavailability of the web application.

Detection methods

All servers are monitored by Newrelic Infrastructure. When a server stops responding, an email is sent to Robin Monjo and Ivan Maireaux.

In case of total unavailability of the web application, the development team is also notified by a pingdom email.

Preventive methods

To be resilient to potential AWS outages, the Eventmaker application is deployed on at least 4 servers distributed across 3 different zones.

Servers are managed by an auto-scaling group that ensures the desired number of servers is maintained.

Kubernetes ensures the application is distributed across the available servers.

Servers are exposed externally via a load balancer.
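To make the zone-redundancy argument concrete, here is a minimal sketch that spreads servers across zones in round-robin order and computes how many remain if the most populated zone is lost. The zone labels and helper functions are assumptions for illustration only; the actual placement is handled by the auto-scaling group and Kubernetes.

```python
from typing import Dict, List


def spread_across_zones(server_count: int, zones: List[str]) -> Dict[str, int]:
    """Assign servers to zones round-robin and return the count per zone."""
    placement = {zone: 0 for zone in zones}
    for index in range(server_count):
        placement[zones[index % len(zones)]] += 1
    return placement


def servers_left_after_zone_loss(placement: Dict[str, int]) -> int:
    """Worst case: the most populated zone becomes unavailable."""
    return sum(placement.values()) - max(placement.values())


# Hypothetical labels for the three zones of the region.
ZONES = ["zone-a", "zone-b", "zone-c"]
placement = spread_across_zones(4, ZONES)
```

With 4 servers over 3 zones, one zone holds 2 servers and the others hold 1 each, so losing any single zone still leaves at least 2 servers running.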

Remediation methods

In the scenario where all zones in the Amazon eu-west-1 region were affected by a major outage, a re-deployment of the infrastructure would be necessary.

Eventmaker uses Terraform, an Infrastructure as Code tool, to deploy its infrastructure in an automated way. This allows automatic re-deployment of the infrastructure in another AWS region.

The time to restore service in this scenario is approximately 8 hours. The people involved in this procedure are Robin Monjo, Ivan Maireaux and Wilfried Deluche.

Database cluster outage

Description

The database cluster is the set of servers used for the MongoDB database management system managed by Atlas (more information in this document). Like the application cluster, it is hosted on Amazon AWS Cloud. The outages that can occur are the same as those for the application cluster, although the database cluster is isolated in its own VPC.

Impact analysis

An outage of the database cluster can lead to the web application being unable to communicate with the database. In this situation, the application can be considered unusable.

Another possible impact is data loss.

Detection methods

Like the application servers, the database servers are monitored by Newrelic Infrastructure. In case of a server failure, an email is sent to Robin Monjo and Ivan Maireaux.

The MongoDB software is also monitored by MongoDB Cloud Manager. This allows alerts on abnormal MongoDB behavior, for example the disappearance of a node in the replica set. Alerts are sent by email to Robin Monjo and Ivan Maireaux.

Preventive methods

As a critical component of the Eventmaker system, the database is replicated in real time on 3 servers via the replica set system provided by MongoDB. The 3 servers are isolated in their own VPC and distributed across the three zones of the AWS eu-west-1 region.

Remediation methods

If fewer than 2 servers are impacted at the same time, no service interruption is observed thanks to the replication system. Human intervention is still necessary to restore replication to 100%:

  • if the failure is due to the server: a server must be re-deployed (again using Terraform) and then the new server must be added to the replica set from the MongoDB Cloud Manager interface. Operation duration: 30 minutes

  • if the failure is due to MongoDB: diagnose and restore the impacted node(s) so they rejoin the replica set. The tool to use here is MongoDB Cloud Manager. Operation duration: 20 minutes
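A single-server failure causes no interruption because a MongoDB replica set needs a majority of its voting members to keep or elect a primary. The sketch below works through that arithmetic for a 3-server cluster; the function names are illustrative and are not part of any MongoDB API.

```python
def majority(voting_members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return voting_members // 2 + 1


def can_elect_primary(voting_members: int, failed_members: int) -> bool:
    """True if the surviving members still form a majority."""
    surviving = voting_members - failed_members
    return surviving >= majority(voting_members)


# For the 3-member replica set described above:
# majority(3) is 2, so one failed member is tolerated.
one_failure_ok = can_elect_primary(3, 1)
two_failures_ok = can_elect_primary(3, 2)
```

Here `one_failure_ok` is `True` and `two_failures_ok` is `False`: with two members down, the remaining member cannot form a majority, so no primary is available for writes until a second member is restored.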

In the case where all 3 servers fail at the same time, a service interruption is observed. The time to restore service depends on the scenario:

  • data are not impacted (if the servers' disks were not damaged by the outage)

  • data were lost as a result of the outage

In the first scenario, servers are re-deployed via Terraform using the existing disks. The replica set is then recreated via MongoDB Cloud Manager. Operation duration: 2 hours.

In the second scenario, in addition to recreating the servers and the replica set, it is necessary to restore the latest database backup. The backup policy and frequency are detailed in this document.

Restoring a backup currently takes about 2 hours. The time to restore service for this scenario is therefore approximately 4 hours. The backup restoration is performed directly via the MongoDB Cloud Manager tool.

The people involved in these procedures are Robin Monjo and Ivan Maireaux.

Denial of service attack

Description

A denial of service attack floods a service with requests in order to make it unavailable to legitimate users.

Impact analysis

Such an attack can cause slowdowns and unavailability of the platform.

Detection methods

Eventmaker's servers are monitored by Newrelic Infrastructure. The Eventmaker web application is monitored by Newrelic APM.

A denial of service attack causes resource overconsumption in the application cluster (CPU and memory). An alert is sent by email to Robin Monjo and Ivan Maireaux when a server uses more than 80% CPU or memory for more than 5 minutes.

Confirming a denial of service attack is then possible by consulting the data collected by Newrelic APM, which indicate among other things the number of requests per minute on the platform.
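The alert rule described above (more than 80% CPU or memory for more than 5 minutes) can be sketched as a check on a sustained streak of samples. This is only an illustration under the assumption of one usage sample per minute; it is not how New Relic evaluates alert conditions, and `should_alert` is a hypothetical helper.

```python
from typing import Sequence


def should_alert(
    samples: Sequence[float],
    threshold: float = 80.0,
    window_minutes: int = 5,
) -> bool:
    """Return True if usage stays above `threshold` for more than
    `window_minutes` consecutive samples (assumed: one sample per minute).
    """
    streak = 0
    for usage in samples:
        # Extend the streak while above the threshold, otherwise reset it.
        streak = streak + 1 if usage > threshold else 0
        if streak > window_minutes:
            return True
    return False
```

Requiring a sustained streak rather than a single high sample avoids alerting on short spikes, which are common during normal traffic peaks.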

Preventive methods

To mitigate the effects of a denial of service attack, Eventmaker's infrastructure is designed to allow rapid horizontal scaling. The procedure is as follows:

  • add servers to the Kubernetes cluster for the server group dedicated to Eventmaker

  • add web-type processes via our Platform as a Service, Hephy

The operation takes about 2 minutes per server added. This procedure can be performed by Robin Monjo and Ivan Maireaux.

Remediation methods

In the case of a prolonged denial of service attack that cannot be sufficiently mitigated by scaling, the procedure is to enable AWS Shield on the load balancer targeted by the attack. This procedure can be carried out by Robin Monjo and/or Ivan Maireaux; its implementation takes about 2 hours.

Outage of our emailing provider

Description

Email sending is a vital system for the Eventmaker platform. The service used to send emails is Mailjet.

Impact analysis

The inability to send emails can greatly impact the Eventmaker service. Indeed, sending registration confirmation emails or email campaigns is a major feature.

Detection methods

Eventmaker communicates with Mailjet via their REST API. If an HTTP request to Mailjet fails, an exception is generated and sent to the error-tracking service: sentry.io. An alert email is then sent to each member of the development team.

Also, emails are sent asynchronously via a queue system. When one of the email sending tasks generates an exception, the job is considered failed and is kept in a special queue dedicated to storing failed tasks. The service continuity monitoring system pingdom monitors this queue and alerts the development team by email when it is not empty.
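The failed-queue mechanism described above can be pictured with the following minimal sketch. It assumes a simple in-memory queue; `EmailQueue`, `work`, and `failed_queue_alert` are hypothetical names and do not reflect the queue system or the monitoring check actually used.

```python
from collections import deque
from typing import Callable, Deque


class EmailQueue:
    """In-memory stand-in for an asynchronous email job queue."""

    def __init__(self) -> None:
        self.pending: Deque[str] = deque()
        self.failed: Deque[str] = deque()  # dedicated queue for failed jobs

    def enqueue(self, job: str) -> None:
        self.pending.append(job)

    def work(self, send: Callable[[str], None]) -> None:
        """Process every pending job; keep jobs that raise in `failed`."""
        while self.pending:
            job = self.pending.popleft()
            try:
                send(job)
            except Exception:
                self.failed.append(job)


def failed_queue_alert(queue: EmailQueue) -> bool:
    """Monitoring check: alert as soon as the failed queue is not empty."""
    return len(queue.failed) > 0
```

Keeping failed jobs in a separate queue means nothing is silently dropped: the job stays available for inspection and for a manual retry.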

Preventive methods

When Eventmaker uses Mailjet's API, the request is retried up to 3 times if the API response is an error or if the request could not be completed. This helps mitigate temporary Mailjet errors.
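A minimal sketch of this retry policy follows. It interprets "retried up to 3 times" as one initial attempt plus at most 3 retries, which is an assumption; `DeliveryError` and `send_with_retries` are hypothetical names, and the `send` callable stands in for the HTTP request to the provider.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class DeliveryError(Exception):
    """Raised when the provider returns an error or the request fails."""


def send_with_retries(send: Callable[[], T], max_retries: int = 3) -> T:
    """Call `send`, retrying up to `max_retries` times on DeliveryError.

    Re-raises the last error once all attempts are exhausted.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    attempts = 1 + max_retries  # one initial attempt plus the retries
    for attempt in range(attempts):
        try:
            return send()
        except DeliveryError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller record the failure
    raise AssertionError("unreachable")
```

A production version would typically wait between attempts (for example with exponential backoff) so that a struggling provider is not hit again immediately.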

Remediation methods

If sending fails more than 3 times in a row, it can be retried directly in the queue system interface.

In the event of a prolonged Mailjet outage, the procedure is to switch services and migrate to Amazon SES. The migration takes about 5 hours and involves the development team, including Robin Monjo or Ivan Maireaux for the SES setup. This is a degraded mode: emails can be sent, but statistics tracking is not available.

Outage of our other providers

Description

Eventmaker relies on other third-party services in addition to AWS and Mailjet:

  • Esendex for SMS sending

  • Crisp for support and documentation

  • Heroku, which hosts an application that serves the themes for Eventmaker's website builder and email builder

  • Amazon S3, which is used for the themes of Eventmaker's website builder

Impact analysis

Service | Impact

Esendex | Inability of the platform to send SMS. SMS are used for VIP notifications, sending temporary codes for multi-factor authentication and SMS campaigns

Crisp | Inability for users (event organizers) to request help from the Eventmaker team via the platform chat. Unavailability of the documentation help.eventmaker.io

Heroku | Inability to configure a new website or a new email on the Eventmaker platform. Note: duplication remains possible and existing websites or emails are not impacted

Amazon S3 | Pages of a website requiring JavaScript are displayed but do not behave as expected because the JavaScript code could not be loaded

Detection methods

Service | Detection

Esendex | As with emails, communication errors with the Esendex API generate an exception, which is then reported to sentry.io.

Zoho Invoice | Human detection

Crisp | Human detection

Heroku | Human detection

Amazon S3 | Human detection

Preventive methods

Service | Prevention

Esendex | As with emails, in case of an error the request to the Esendex API is retried up to 3 times to mitigate temporary errors

Crisp | None

Heroku | None

Amazon S3 | None

Remediation methods

Service | Remediation

Esendex | None. The SMS sending system does not prevent Eventmaker from delivering its added value. Human interventions on a case-by-case basis can be considered to unblock a user who can no longer access their account because they cannot receive their authentication code by SMS.

Crisp | None

Heroku | Deployment of the application on Eventmaker's internal application cluster. This procedure can be carried out by Robin Monjo and takes about 1 hour

Amazon S3 | None
