
Section 9.4 Response

In the response phase, the SOC deals with an incident to mitigate the harm it causes. Every incident is different, but the governing principles and steps are the same.

Subsection 9.4.1 Business Continuity

The concept of continuity is central to the steps taken to respond to an incident. Remember that the goal is to keep things running and keep services available. Business Continuity has three main parts: Business Continuity Planning (BCP), Business Impact Analysis (BIA), and Disaster Recovery Planning (DRP).
Business Continuity Planning (BCP) is a methodology for keeping things running. With BCP, threats are identified in advance and critical business processes are prioritized. Recovery procedures for these processes are developed and tested ahead of time. In response to an incident, these procedures are followed as practiced.
Business Impact Analysis (BIA) identifies business functions and rates the impact an outage would have on each of them.
BIA can help pinpoint mission-essential functions and single points of failure. This allows SOCs to determine where their resources should go to have the best chance of maintaining business continuity.
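As a rough illustration of what a BIA produces, the sketch below ranks business functions by outage cost and flags components with no redundancy. The function names, cost figures, and component names are all hypothetical and exist only to show the idea.

```python
# Hypothetical business functions: estimated cost per hour of outage and
# the components each one depends on. All figures are made up.
functions = {
    "online_orders": {"cost_per_hour": 5000, "depends_on": ["web", "db", "payments_api"]},
    "email": {"cost_per_hour": 300, "depends_on": ["mail"]},
    "payroll": {"cost_per_hour": 800, "depends_on": ["db"]},
}

# Number of independent instances of each component (1 means no redundancy).
instances = {"web": 3, "db": 1, "payments_api": 1, "mail": 2}


def rank_by_impact(functions):
    """Return function names ordered from highest to lowest outage cost."""
    return sorted(
        functions,
        key=lambda name: functions[name]["cost_per_hour"],
        reverse=True,
    )


def single_points_of_failure(functions, instances):
    """Map each non-redundant component to the functions that rely on it."""
    spofs = {}
    for name, info in functions.items():
        for component in info["depends_on"]:
            if instances.get(component, 0) <= 1:
                spofs.setdefault(component, []).append(name)
    return spofs


print(rank_by_impact(functions))
print(single_points_of_failure(functions, instances))
```

In this made-up data set the database is the most pressing single point of failure, since two functions depend on it and only one instance exists.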
Finally, having a Disaster Recovery Plan (DRP) makes it easier to recover from a large-scale issue. Disaster Recovery (DR) entails the policies, tools, and procedures used to recover from an outage. A DRP details the order of restoration and requires extensive testing to ensure that the entire suite of supported applications can be brought back up. A standard DRP will detail:
  • Purpose and Scope
  • Recovery Team
  • Preparing for a Disaster
  • Emergency Procedures or Incident Response During an Incident
  • Restoration Procedures and Return to Normal
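The order of restoration mentioned above usually follows service dependencies: a service should come up only after everything it relies on is running. A minimal sketch of that idea, assuming a hypothetical set of services and using the standard library's `graphlib` module (Python 3.9 or later), might look like this:

```python
from graphlib import TopologicalSorter

# Hypothetical services mapped to the services they need running first.
dependencies = {
    "network": set(),
    "dns": {"network"},
    "database": {"network"},
    "auth": {"database", "dns"},
    "web_app": {"auth", "database"},
}


def restoration_order(dependencies):
    """Return services in an order where every dependency comes up first."""
    return list(TopologicalSorter(dependencies).static_order())


order = restoration_order(dependencies)
print(order)
```

Services with no dependency on each other (here `dns` and `database`) may appear in either order. A real DRP would also weigh business priority, which this sketch ignores.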

Subsection 9.4.2 Redundancy

Redundant services can help with continuity by making sure there is always an uncompromised service available. The key concepts of redundancy are captured in its vocabulary:
  • Redundancy: extra components or services that can take over in case of failure
  • Failover: the process of switching over to a secondary device
  • High availability (HA): ensures a high level of operational performance and uptime
  • Fault tolerance: allows a system to continue operating in the event of a failure
  • Single Point of Failure (SPOF): a single component whose failure can cause an outage
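Failover can be sketched as picking the first healthy service from a priority-ordered list. The service names and the health lookup below are hypothetical stand-ins. A real deployment would rely on actual health checks and a load balancer or cluster manager.

```python
def pick_active(services, is_healthy):
    """Return the first healthy service in priority order, or None."""
    for service in services:
        if is_healthy(service):
            return service
    return None


# Hypothetical endpoints: primary first, then standbys.
services = ["primary", "secondary", "tertiary"]

# Stand-in for real health checks: the primary is down in this scenario.
health = {"primary": False, "secondary": True, "tertiary": True}

active = pick_active(services, lambda name: health[name])
print(active)
```

If `pick_active` returns `None`, every redundant copy has failed and the service is experiencing an outage.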

Example 9.4.1. Hot, Cold, & Warm.

One typical way to implement redundancy is through the use of hot, cold, and warm sites.
A hot site is a secondary location that is live and replicating in real time what is happening in production. If the primary site goes down, operations can fail over to a hot site immediately.
A cold site is a secondary location without equipment. A cold site will take some time to set up and configure in the case of an outage.
A warm site is a secondary location with all equipment and connectivity. The equipment will still need to be turned on and made production ready, but it will not take as long to fail over to a warm site as to a cold one.

Example 9.4.2. RAID.

RAID is an interesting case of redundancy that occurs at the server storage level. RAID stands for Redundant Array of Inexpensive/Independent Disks and, as the name states, it uses multiple disks to make reads/writes faster and to be able to recover if one of the disks fails. It is important to note that RAID is not a backup. Backups are meant to aid in recovery and can be co-located. A RAID array is meant to work on a single machine and help mitigate damage caused by disk failures.
RAID has multiple levels, each of which prioritizes a different aspect:
  • RAID 0: Data is striped across multiple disks to make reads/writes faster. If a single disk is lost the whole array goes down.
  • RAID 1: Data is mirrored across multiple disks for redundancy. If a single disk is lost the array can be recovered from the other disks.
  • RAID 5: At least three disks are used, with data striped across them along with distributed parity, such that read speeds are increased and if a single disk goes down the array can be rebuilt.
  • RAID 10: A combination of RAID 0 and RAID 1, in which mirrored sets of disks are striped together.
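The way parity-based levels such as RAID 5 rebuild a lost disk can be illustrated with XOR. The sketch below covers a single stripe on three hypothetical disks (two data blocks and one parity block). Real arrays rotate the parity block across disks and work at the block-device level, so this is only a toy model of the arithmetic.

```python
def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# One stripe: two data blocks plus a parity block computed from them.
d0 = b"ABCD"
d1 = b"WXYZ"
parity = xor_blocks(d0, d1)

# Simulate losing the disk holding d1, then rebuild it from the survivors.
recovered_d1 = xor_blocks(d0, parity)
print(recovered_d1 == d1)
```

Because XOR is its own inverse, any one missing block in a stripe can be recomputed from the remaining blocks. Losing two blocks of the same stripe leaves too little information to rebuild, which is why RAID 5 tolerates only a single disk failure.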

Subsection 9.4.3 Isolation and Containment

The first step in reacting to an incident is to remove the affected asset from the network so that the damage does not spread. It is standard procedure for malware to attempt to spread to other machines, and the fastest way for it to do that is through an internal network. By isolating the infected asset, we can help prevent this.
There are a few other tools for containing malware such as sandboxing and snapshots. Sandboxing refers to the practice of running processes in a controlled environment on a machine. Most web browsers sandbox the JavaScript they run, meaning that if a website is serving malicious JS it should not be able to affect anything else on the machine. Snapshots refer to periodically saving the state of the storage device on a machine. This allows the SOC to roll the machine back to a previous state, before malware was active.
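The snapshot-and-rollback idea can be modeled in a few lines. The `SnapshotStore` class and the file contents below are hypothetical. Real snapshots are taken by the hypervisor, filesystem, or storage layer rather than by application code.

```python
import copy


class SnapshotStore:
    """Toy model of periodic snapshots of a machine's storage state."""

    def __init__(self):
        self._snapshots = []

    def take(self, state):
        """Save an independent copy of the state and return its ID."""
        self._snapshots.append(copy.deepcopy(state))
        return len(self._snapshots) - 1

    def restore(self, snapshot_id):
        """Return a copy of the state as it was when the snapshot was taken."""
        return copy.deepcopy(self._snapshots[snapshot_id])


store = SnapshotStore()
files = {"report.txt": "Q3 numbers"}
clean_id = store.take(files)

# Simulated malware activity after the snapshot was taken.
files["report.txt"] = "ENCRYPTED"
files["ransom_note.txt"] = "pay up"

# Roll back to the last known-good state.
files = store.restore(clean_id)
print(files)
```

The rollback only helps if the chosen snapshot predates the compromise, so knowing when the malware first became active matters as much as having the snapshots.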

Subsection 9.4.4 Recovery

Recovery can be a long process, but it is the core of responding to an incident. If it is possible to remove malware from a machine, that action is taken in this step. Breached accounts are also disabled.
Unfortunately, it may be impossible to roll back some assets to a previously uncompromised state, in which case they may need to be restored from a backup or, failing that, rebuilt from the ground up. Backups make recovery much simpler, and companies that do not have a backup plan typically implement one after their first incident. That being said, malware may have also found its way into the backups if given enough time on the system. In that case the asset is typically destroyed and a new one is built. While this can take a long time, it is one of the few ways to know for sure that the asset isn’t compromised.

Subsection 9.4.5 Remediation

Remediation is focused on making sure that an incident can’t happen again. Remediation may entail patches, firewall changes, IoC database updates, or even adding more layers of security. The goal is to ensure that all assets are safe.

Subsection 9.4.6 Reporting

Reporting is a critical step. It is important to collect timestamped logs as well as accounts of how the incident plans were rolled out. This can help you determine whether the plans should be changed and what to look for in the future. In the best-case scenario, good reporting lets you catch future precursors before they become incidents.
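A timestamped record of response actions can be as simple as a list of structured entries. The sketch below is one possible shape for such a timeline. The field names, analysts, and events are hypothetical, and a real SOC would typically record this in its ticketing or SIEM tooling.

```python
import json
from datetime import datetime, timezone


def log_event(timeline, action, actor, when=None):
    """Append a timestamped record of a response action to the timeline."""
    when = when or datetime.now(timezone.utc)
    timeline.append({"time": when.isoformat(), "actor": actor, "action": action})
    return timeline


timeline = []
log_event(
    timeline,
    "Isolated host from network",
    "analyst1",
    datetime(2024, 1, 5, 14, 2, tzinfo=timezone.utc),
)
log_event(
    timeline,
    "Disabled breached account",
    "analyst2",
    datetime(2024, 1, 5, 14, 10, tzinfo=timezone.utc),
)

# Serialize the timeline so it can be attached to the incident report.
report = json.dumps(timeline, indent=2)
print(report)
```

Recording times in UTC with an explicit offset makes it easier to line these entries up against logs from systems in other time zones.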
Disclosure is also an important aspect of the reporting phase. Both compliance and basic ethics mandate that customers be made aware of any data lost. By disclosing the details of an incident you can also make other companies aware of what types of attacks are occurring "in the wild."