Produces a scored, prioritized risk register using a quantitative Probability × Impact matrix. Covers 7 disaster types across 3 themes (Environmental, Infrastructure Reliability, Security) with 18+ pre-seeded scenarios. Output drives response plan prioritization, incident response team scoping, and disaster recovery test selection.
When to Use
- Starting disaster planning for a new or existing system
- Preparing for a disaster recovery test or tabletop exercise
- Scoping an incident response team's charter and coverage
- Evaluating how a change in infrastructure (new datacenter, cloud migration) shifts risk exposure
- Conducting a per-site risk review for a multi-location organization
- Revisiting a prior assessment after a significant organizational or threat environment change
Prerequisite: Know your system's architecture and its key dependencies (networking, authentication, storage, third-party services). Risk ratings are only as good as the system inventory behind them.
Context & Input Gathering
Before scoring, establish three inputs:
1. System inventory with criticality classification
Classify every system the risk could affect into one of three tiers. This classification determines how much impact a given disaster actually has on operations.
| Tier | Label | Definition |
|---|---|---|
| 1 | Mission-essential | Absence causes total operational disruption. Organization cannot function. |
| 2 | Mission-important | Absence significantly degrades operations but does not halt them. |
| 3 | Nonessential | Absence has minimal operational impact. Tolerable downtime. |
Ask: which services, if offline for 24 hours, would be catastrophic (Tier 1), serious (Tier 2), or acceptable (Tier 3)?
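The 24-hour question above can be captured as a simple inventory mapping. A minimal sketch — the service names and tier assignments here are hypothetical placeholders, not recommendations:

```python
# Tier labels from the classification table above.
TIER_LABELS = {1: "Mission-essential", 2: "Mission-important", 3: "Nonessential"}

# Hypothetical services classified by the 24-hour question.
inventory = {
    "auth-service": 1,   # offline for 24 hours: catastrophic
    "billing-api": 2,    # degrades operations but does not halt them
    "internal-wiki": 3,  # tolerable downtime
}

# List the inventory in criticality order (Tier 1 first).
for system, tier in sorted(inventory.items(), key=lambda kv: kv[1]):
    print(f"Tier {tier} ({TIER_LABELS[tier]}): {system}")
```

Keeping the inventory in a machine-readable form like this makes the later impact-scoring step (Step 2) reproducible rather than ad hoc.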
2. Geographic and infrastructure context
Risk ratings are location-dependent. A site in Los Angeles warrants a higher earthquake probability than one in Hamburg. A site in the southeastern US warrants higher hurricane probability. A single-ISP facility warrants higher internet connectivity loss probability than one with redundant circuits.
Collect:
- Physical datacenter location(s)
- Existing fault-tolerance controls (redundant power, redundant ISP, UPS, generators)
- Known historical incidents at this site or in this region
3. Scope boundary
Decide whether you are assessing at the organizational level (global) or per site. Large organizations should do both — a site that hosts only Tier 3 systems warrants a different response plan than one hosting Tier 1 systems.
Process
Step 1 — Start with the pre-seeded risk taxonomy
The matrix in Appendix A of Building Secure and Reliable Systems groups disaster scenarios into three themes. Use these as your starting point rather than an empty list. Pre-seeded scenarios prevent the common failure mode of omitting non-obvious risks (e.g., emerging zero-day vulnerabilities, insider intellectual property theft).
Environmental theme (natural events that affect physical infrastructure)
- Earthquake
- Flood
- Fire
- Hurricane / severe storm
Infrastructure Reliability theme (component and service failures)
- Power outage
- Loss of internet connectivity
- Authentication system down
- High system latency / infrastructure slowdown
Security theme (adversarial and vulnerability-driven events)
- System compromise (external attacker gaining unauthorized access)
- Insider theft of intellectual property
- Distributed denial-of-service (DDoS) / denial-of-service (DoS) attack
- Misuse of system resources (e.g., cryptocurrency mining)
- Vandalism / website defacement
- Phishing attack
- Software security bug
- Hardware security bug
- Emerging serious vulnerability (e.g., Meltdown/Spectre, Heartbleed class)
Add organization-specific scenarios beyond this list. Examples: ransomware targeting backup systems, supply chain compromise of a build pipeline, regulatory action requiring emergency data deletion.
Step 2 — Score each scenario using the P×I scales
For each scenario, assign two values independently, then compute the ranking.
Probability of occurrence within a year (P)
| Value | Label |
|---|---|
| 0.0 | Almost never |
| 0.2 | Unlikely |
| 0.4 | Somewhat unlikely |
| 0.6 | Likely |
| 0.8 | Highly likely |
| 1.0 | Inevitable |
Score probability based on your specific location, historical data, and existing controls. A site with a generator and UPS reduces power outage probability; a site on a flood plain increases flood probability.
Impact to organization if risk occurs (I)
| Value | Label |
|---|---|
| 0.0 | Negligible |
| 0.2 | Minimal |
| 0.5 | Moderate |
| 0.8 | Severe |
| 1.0 | Critical |
Score impact relative to the Tier 1/2/3 systems affected. If a disaster only affects Tier 3 systems, impact is at most Moderate. If it takes down a Tier 1 system with no failover, impact is Severe or Critical.
Ranking = Probability × Impact
A power outage scored P=0.6, I=0.8 produces Ranking=0.48. A hurricane at P=0.2, I=1.0 produces Ranking=0.20. Sort the completed register from highest to lowest ranking.
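The two worked examples above reduce to a one-line calculation; a sketch using the P and I scale values defined in the tables:

```python
def ranking(p: float, i: float) -> float:
    """Ranking = Probability x Impact, both on the scales defined above."""
    return round(p * i, 2)

# The worked examples from the text:
print(ranking(0.6, 0.8))  # power outage -> 0.48
print(ranking(0.2, 1.0))  # hurricane -> 0.2
```

Rounding to two places keeps the register readable; the raw product is what determines sort order either way.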
Step 3 — Populate the risk register
Create one row per scenario. Minimum columns:
| Theme | Risk | Probability (P) | Impact (I) | Ranking (P×I) | Systems Impacted | Tier |
|---|---|---|---|---|---|---|
| Environmental | Earthquake | — | — | — | — | — |
| Environmental | Flood | — | — | — | — | — |
| Environmental | Fire | — | — | — | — | — |
| Environmental | Hurricane | — | — | — | — | — |
| Infrastructure Reliability | Power outage | — | — | — | — | — |
| Infrastructure Reliability | Loss of internet connectivity | — | — | — | — | — |
| Infrastructure Reliability | Authentication system down | — | — | — | — | — |
| Infrastructure Reliability | High system latency / infrastructure slowdown | — | — | — | — | — |
| Security | System compromise | — | — | — | — | — |
| Security | Insider theft of intellectual property | — | — | — | — | — |
| Security | DDoS/DoS attack | — | — | — | — | — |
| Security | Misuse of system resources | — | — | — | — | — |
| Security | Vandalism / website defacement | — | — | — | — | — |
| Security | Phishing attack | — | — | — | — | — |
| Security | Software security bug | — | — | — | — | — |
| Security | Hardware security bug | — | — | — | — | — |
| Security | Emerging serious vulnerability | — | — | — | — | — |
Fill in scores, sort by Ranking descending.
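Filling and sorting the register can be sketched as a list of plain dicts. The P and I values below are illustrative placeholders — score them against your own site and inventory:

```python
# Risk register sketch; scores are placeholders, not recommendations.
register = [
    {"theme": "Security", "risk": "System compromise", "p": 0.6, "i": 1.0},
    {"theme": "Infrastructure Reliability", "risk": "Power outage", "p": 0.6, "i": 0.8},
    {"theme": "Environmental", "risk": "Flood", "p": 0.2, "i": 0.5},
]

# Compute Ranking = P x I for every row.
for row in register:
    row["ranking"] = round(row["p"] * row["i"], 2)

# Sort by Ranking descending, as Step 3 requires.
register.sort(key=lambda r: r["ranking"], reverse=True)

for row in register:
    print(f'{row["ranking"]:.2f}  {row["theme"]}: {row["risk"]}')
```

In practice the same rows often live in a spreadsheet; the point is that the ranking and sort are mechanical once P and I are assigned.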
Step 4 — Review for outliers before finalizing
Sorting by ranking is a starting heuristic, not a final answer. Perform a manual outlier review:
- Low-probability, high-impact outliers: A scenario ranked 0.10 (P=0.1, I=1.0) may still demand a response plan because the consequence is catastrophic. Flag any scenario with I=1.0 regardless of ranking.
- Hidden dependencies: A seemingly low-impact risk may become critical if it disables a monitoring or logging system that other incident responses depend on.
- Correlated risks: An earthquake can simultaneously trigger power outage, connectivity loss, and fire. Assess whether scenarios cluster and whether the combined impact exceeds individual rankings.
- Expert review: Solicit review from someone outside the team who can identify risks with hidden factors or dependencies. Groupthink tends to underweight unfamiliar scenarios.
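The first bullet's flagging rule is mechanical and worth automating so it cannot be skipped. A sketch, with illustrative register rows:

```python
def flag_outliers(register, impact_threshold=1.0):
    """Return rows that demand a response plan regardless of ranking."""
    return [r for r in register if r["i"] >= impact_threshold]

# Illustrative rows: one low-ranked but catastrophic, one higher-ranked.
register = [
    {"risk": "Emerging serious vulnerability", "p": 0.2, "i": 1.0, "ranking": 0.2},
    {"risk": "Phishing attack", "p": 0.8, "i": 0.5, "ranking": 0.4},
]

flagged = flag_outliers(register)
# Flags the I=1.0 scenario even though it ranks below phishing.
print([r["risk"] for r in flagged])
```

The other three review items (hidden dependencies, correlated risks, outside expert review) resist automation and still require human judgment.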
Step 5 — Document scope, assumptions, and review cadence
Record alongside the register:
- Date of assessment
- Location(s) assessed
- Existing controls assumed (e.g., "assumes redundant ISP, UPS, and diesel generator")
- Owner responsible for next review
- Planned review cadence (minimum: annually; recommended: after any major infrastructure change or post-incident)
Key Principles
Quantification counters groupthink. Intuitive risk assessment tends to weight salient scenarios (recent news events, memorable near-misses) over statistically more likely ones. A scored matrix forces explicit probability and impact estimates, making invisible assumptions visible and debatable.
Probability is infrastructure-dependent, not universal. A cloud-hosted system with multi-region failover has a different authentication system downtime probability than a single on-premises deployment. Score after accounting for existing controls — but also model what happens if a control fails.
Ratings must evolve with the system. Risk posture changes when the organization adds redundant internet circuits, migrates to a different cloud region, or discovers a new vulnerability class. Schedule reviews; do not treat the register as a one-time artifact.
Low probability does not mean no plan. Scenarios with I=0.8 or I=1.0 warrant response plans even if their ranking is low. The ranking guides where to invest preparation effort first, not which risks to ignore entirely.
Assess dependencies alongside primary systems. Key operational functions include their underlying dependencies — networking, authentication, application-layer components. A mission-essential service that depends on a Tier 3 authentication system effectively elevates that dependency to Tier 1 during an incident.
Multi-location organizations need per-site assessments. Global rankings mask site-specific exposure. A site in earthquake country has different environmental risk than headquarters. Run the matrix per site and aggregate.
Examples
Example: SaaS company, single US West Coast datacenter, no redundant power
| Theme | Risk | P | I | Ranking | Systems Impacted |
|---|---|---|---|---|---|
| Security | System compromise | 0.6 | 1.0 | 0.60 | Auth service (T1), API (T1) |
| Infrastructure | Power outage | 0.6 | 0.8 | 0.48 | All systems |
| Security | Software security bug | 0.6 | 0.8 | 0.48 | API (T1) |
| Security | Phishing attack | 0.8 | 0.5 | 0.40 | Email (T2), SSO (T1) |
| Infrastructure | Loss of internet connectivity | 0.4 | 1.0 | 0.40 | All externally facing (T1) |
| Security | DDoS/DoS attack | 0.4 | 0.8 | 0.32 | API (T1) |
| Environmental | Earthquake | 0.4 | 0.8 | 0.32 | All systems |
| Security | Emerging serious vulnerability | 0.2 | 1.0 | 0.20 | All systems |
| Environmental | Flood | 0.2 | 0.5 | 0.10 | On-prem equipment (T2) |
Outlier flag: Emerging serious vulnerability ranks 0.20 but Impact=1.0. Flag for mandatory response plan despite low ranking. Earthquake and internet connectivity loss are correlated — their combined impact may be higher than either alone.
Example: Adjusting for existing controls
After adding a backup ISP: Loss of internet connectivity drops from P=0.4 to P=0.2, Ranking drops from 0.40 to 0.20.
After adding UPS and generator: Power outage drops from P=0.6 to P=0.2, Ranking drops from 0.48 to 0.16.
Re-run the matrix when controls change to confirm prioritization remains valid.
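The control adjustments above amount to re-scoring P and recomputing the ranking. A sketch (row values taken from the example register):

```python
def apply_control(row, new_p):
    """Re-score probability after adding a control; return an updated copy."""
    updated = dict(row, p=new_p)
    updated["ranking"] = round(updated["p"] * updated["i"], 2)
    return updated

connectivity = {"risk": "Loss of internet connectivity", "p": 0.4, "i": 1.0, "ranking": 0.4}
power = {"risk": "Power outage", "p": 0.6, "i": 0.8, "ranking": 0.48}

after_backup_isp = apply_control(connectivity, new_p=0.2)  # ranking 0.40 -> 0.20
after_ups_gen = apply_control(power, new_p=0.2)            # ranking 0.48 -> 0.16
```

Note that controls typically reduce P, not I: if the disaster still occurs despite the control, the organizational impact is unchanged.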
References
- Building Secure and Reliable Systems (Blank, Oprea et al., Google/O'Reilly, 2020)
- Chapter 16 "Disaster Planning" — pp. 363–382: disaster type taxonomy (p. 364), disaster risk analysis methodology (p. 366), system criticality classification (p. 366), dynamic response strategy phases (p. 365)
- Appendix A "A Disaster Risk Assessment Matrix" — pp. 499–500: Table A-1 with full probability scale, impact scale, pre-seeded scenario taxonomy, and Ranking = P×I formula
- Next steps after completing the register: incident response team setup (Chapter 16, pp. 367–375), response plan development (pp. 371–373), disaster recovery test planning (pp. 376–382)
License
This skill is licensed under CC-BY-SA-4.0.
Source: BookForge — Building Secure And Reliable Systems by Unknown.
Related BookForge Skills
This skill is standalone. Browse more BookForge skills: bookforge-skills