High availability

GovCMS is backed by a range of service levels.

Service Availability SLAs

Platform Availability Uptime

  • 24x7  99.95% (per month) for all Platform functionality (excluding scheduled maintenance windows). 

AWS Infrastructure

  • 24x7  99.99% (per month) infrastructure uptime  
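
To make the uptime percentages above concrete, they can be converted into a monthly downtime allowance. The following is a minimal sketch, assuming a 30-day month (the source does not say how a month is measured) and ignoring scheduled maintenance windows, which are excluded from the Platform figure. The function name `downtime_budget` is hypothetical.

```python
from datetime import timedelta


def downtime_budget(availability_percent: float, days_in_month: int = 30) -> timedelta:
    """Return the maximum downtime per month that an availability target permits.

    Assumption: the month length defaults to 30 days; pass the real number of
    days for a specific month.
    """
    total = timedelta(days=days_in_month)
    return total * (1 - availability_percent / 100)


# Platform target (99.95%): roughly 21.6 minutes per 30-day month.
print(downtime_budget(99.95))

# AWS infrastructure target (99.99%): roughly 4.3 minutes per 30-day month.
print(downtime_budget(99.99))
```

The allowance scales with month length, so a 31-day month permits slightly more downtime than a 30-day month under the same percentage.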

Backups

  • Mid-term storage must be an EC2 instance with attached EBS volumes (providing a 99.999% availability Service Level). 
  • Long-term backups must be held in Glacier, with a 99.999999999% availability Service Level. 

Docker Images

  • Docker images are stored in AWS S3, and must meet a 99.999999999% availability Service Level. 

OpenShift Cluster

Restoration within 48 hours of Salsa Digital becoming aware of the issue, including: 

  • Salsa Digital must recreate the whole OpenShift Cluster with completely new resources; and 
  • Salsa Digital must restore the OpenShift Cluster from backup. 

Service level applies if backups are accessible on AWS. 

 

Disaster Recovery

Table: Disaster recovery scenarios, actions and SLA detail
Service Type | Action Required | Service Level Detail
Recover from lost single or multiple files within persistent storage (database, solr, files) | The Contractor must restore from mid-term backup. | Within 4 hours
Recover from completely lost persistent Storage Volume (database, solr, files) | The Contractor must restore from mid-term backup. | Within 4 hours
Recover from lost computing node (does not cause downtime) | The Contractor must ensure this is fully automatic by AWS & Scaling Scripts. | Within 4 hours until additional node provisioned
Recover from lost control plane node (master, load balancer, storage; does not cause downtime) | The Contractor must manually recover via Ansible scripts to start a new node. | Within 4 hours
Recover from lost Availability Zone | The Contractor must wait for AWS to restore the Availability Zone, and may start additional compute nodes in a still-working Availability Zone. | Within AWS hours
Recover from a complete loss of OpenShift Cluster | The Contractor must recreate the whole OpenShift Cluster with completely new Resources. The Contractor must restore the Cluster from backup. | Within 48 hours

 

General SLAs

Service Desk Hours

  • 24x7 online support for critical issues. 
  • 8am - 8pm Monday to Friday (excluding public holidays in VIC). 

Platform Issues - Response and Reaction Times

Timeframes for this Service Level commence from the time Salsa Digital is first informed or becomes aware of the issue.

Acknowledgement 
  • Within 1 hour for acknowledgement of critical impact platform issue (Business Hours and non-Business Hours) 
  • Within 4 Business Hours for acknowledgement of non-critical impact platform issue if raised in Business Hours; within the next 4 Business Hours if raised outside Business Hours. 
Reaction (analyse issue, plan fix and communicate plan) 
  • Best effort (immediate) reaction time for critical impact platform issue, and no later than one hour after acknowledgement of the issue (during Business Hours and non-Business Hours). 
  • 8 hours reaction time for High impact platform issue if raised during Business Hours; within the next 8 Business Hours if raised outside Business Hours. 
  • 2 Business Days reaction time for Medium/Low impact platform issue. 
Resolution 

Resolution is based on best efforts and severity. 
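
The acknowledgement timeframes above mix wall-clock hours (critical issues) with Business Hours (non-critical issues). The following is a minimal sketch of how an acknowledgement deadline could be calculated. It assumes that "Business Hours" match the Service Desk Hours (8am to 8pm, Monday to Friday), that an issue raised late in the day carries its remaining time over to the next business day, and that the caller supplies the list of public holidays. None of these assumptions is confirmed by the source, and all function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta

# Assumption: Business Hours are taken to be the Service Desk Hours,
# 8am - 8pm Monday to Friday. Times are naive local datetimes.
OPEN, CLOSE = time(8, 0), time(20, 0)


def is_business_day(day, holidays=frozenset()):
    """Monday to Friday, excluding any caller-supplied public holidays."""
    return day.weekday() < 5 and day not in holidays


def add_business_hours(start, hours, holidays=frozenset()):
    """Return the moment at which `hours` Business Hours have elapsed after `start`."""
    remaining = timedelta(hours=hours)
    current = start
    while True:
        day = current.date()
        opens = datetime.combine(day, OPEN)
        closes = datetime.combine(day, CLOSE)
        if is_business_day(day, holidays) and current < closes:
            current = max(current, opens)  # the clock starts at opening time
            available = closes - current
            if remaining <= available:
                return current + remaining
            remaining -= available
        # Nothing (more) can be counted today: move to the next calendar day.
        current = datetime.combine(day + timedelta(days=1), time(0, 0))


def acknowledgement_deadline(raised_at, critical, holidays=frozenset()):
    """Critical: 1 wall-clock hour. Non-critical: 4 Business Hours."""
    if critical:
        return raised_at + timedelta(hours=1)
    return add_business_hours(raised_at, 4, holidays)


# A non-critical issue raised on Friday 1 March 2024 at 18:30 has 1.5 Business
# Hours left that day, so the remaining 2.5 hours fall on Monday morning.
print(acknowledgement_deadline(datetime(2024, 3, 1, 18, 30), critical=False))

# The same issue when the following Monday is a (hypothetical) public holiday.
print(acknowledgement_deadline(datetime(2024, 3, 1, 18, 30), critical=False,
                               holidays={date(2024, 3, 4)}))
```

The same `add_business_hours` helper could be reused for reaction times that are expressed in Business Hours.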

Critical Issues - Update Frequency and Post Incident Report

  • If a valid critical platform or application issue is raised by an Agency or Finance, status reports must be provided to Finance (GovCMS Service Manager, or if outside of business hours in the ACT, the GovCMS On-Call Operations Officer) at 15-minute intervals. 
  • A post incident report must be produced by Salsa Digital and communicated to Finance upon resolution of the issue.  A draft incident report must be produced within 3 Business Days of issue rectification.  If the root cause of the issue is not known at the time of report submission to Finance, it must be indicated as TBD.  Root cause, if subsequently known, must be communicated against the issue at the next scheduled GovCMS programme operations meeting.  

Non-Critical Issues - Update Frequency

  • For High severity platform and application issues, the service desk must provide a status update to Finance every 3 Business Days. 
  • For Medium severity platform and application issues, the service desk must provide a status update to Finance every 5 Business Days. 
  • For Low severity platform and application issues, Salsa Digital must provide a status update as a batch in the monthly operations report. 

Infrastructure Patches

Timeframes for the application of patches in this Service Level apply from the point in time when the patch becomes available to Salsa Digital.  

Proactive maintenance and patching of the platform code, per the following classifications: 

  • Critical patches applied within 24 hours, e.g. Spectre/Meltdown. These are applied outside of regular working hours where possible. 
  • Non-critical patches applied weekly within a maintenance window agreed by Finance. 

The classification of critical/non-critical as applicable to infrastructure/platform will use the Red Hat severity ratings.  

Drupal Security Patches

Timeframes for the application of patches in this Service Level apply from the point in time when the patch becomes available to Salsa Digital.  

  • Highly critical security patches applied within 48 hours (for SaaS customers) to production.  Where all applicable automated tests have not passed, deployment to production will require Finance approval. 
  • Critical security patches applied within 7 days (SaaS customers) to production.  Where all applicable automated tests have not passed, deployment to production will require Finance approval. 
  • Monthly patches applied for non-critical updates, as agreed and prioritised by Finance. 

The classification of highly critical/critical etc will use the Drupal.org rating - see https://www.drupal.org/drupal-security-team/security-risk-levels-defined 
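
The fixed patch timeframes in this section and in the Infrastructure Patches section can be expressed as a lookup from patch type and severity to a due time, counted from when the patch becomes available to Salsa Digital. The following is a minimal sketch; the names `PATCH_WINDOWS` and `patch_due` and the severity labels used as keys are hypothetical. Non-critical patches are applied in agreed weekly or monthly windows rather than against a fixed deadline, so the sketch returns `None` for them.

```python
from datetime import datetime, timedelta

# Fixed timeframes stated in the service levels; the keys are illustrative labels.
PATCH_WINDOWS = {
    ("infrastructure", "critical"): timedelta(hours=24),
    ("drupal", "highly critical"): timedelta(hours=48),
    ("drupal", "critical"): timedelta(days=7),
}


def patch_due(kind, severity, available_at):
    """Return the latest time a patch should reach production, or None when
    the patch is handled in a scheduled (weekly or monthly) window."""
    window = PATCH_WINDOWS.get((kind, severity))
    return None if window is None else available_at + window


available = datetime(2024, 3, 1, 9, 0)
print(patch_due("drupal", "highly critical", available))    # 48 hours later
print(patch_due("infrastructure", "critical", available))   # 24 hours later
print(patch_due("drupal", "non-critical", available))       # None: scheduled window
```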

Operational reporting

Salsa Digital must deliver an operations report to the Finance team that contains the information required by Finance. This report includes at a minimum:

  • Critical issues/tickets must be explicitly detailed in the operations report. 
  • Other issues as Notified by Finance must be discussed, with a summary of all issues/tickets for the month provided.  
  • Cluster performance metrics and details of future recommendations to improve performance or resolve systemic issues. 

Operational reporting must be continuously improved and refined to provide best practice reporting to Finance month on month.  Salsa Digital must take into account, and implement where agreed, any Finance comments on report content.