Testing the MetroCluster configuration
You can test failure scenarios to confirm that the MetroCluster configuration is operating correctly.
Verifying negotiated switchover
You can test the negotiated (planned) switchover operation to confirm uninterrupted data availability.
This test validates that data availability is not affected (except for Microsoft Server Message Block (SMB) and Solaris Fibre Channel protocols) by switching the cluster over to the second data center.
This test should take about 30 minutes.
This procedure has the following expected results:
- The metrocluster switchover command will present a warning prompt. If you respond yes to the prompt, the site where the command is issued will switch over the partner site.
- For MetroCluster IP configurations:
  - For ONTAP 9.4 and earlier:
    - Mirrored aggregates will become degraded after the negotiated switchover.
  - For ONTAP 9.5 and later:
    - Mirrored aggregates will remain in a normal state if the remote storage is accessible.
    - Mirrored aggregates will become degraded after the negotiated switchover if access to the remote storage is lost.
  - For ONTAP 9.8 and later:
    - Unmirrored aggregates that are located at the disaster site will become unavailable if access to the remote storage is lost. This might lead to a controller outage.
- Confirm that all nodes are in the configured state and normal mode:
metrocluster node show
cluster_A::> metrocluster node show
Cluster                        Configuration State    Mode
------------------------------ ---------------------- ------------------------
Local: cluster_A               configured             normal
Remote: cluster_B              configured             normal
- Begin the switchover operation:
metrocluster switchover
cluster_A::> metrocluster switchover
Warning: negotiated switchover is about to start. It will stop all the data
         Vservers on cluster "cluster_B" and automatically re-start them on
         cluster "cluster_A". It will finally gracefully shutdown cluster
         "cluster_B".
- Confirm that the local cluster is in the configured state and switchover mode:
metrocluster node show
cluster_A::> metrocluster node show
Cluster                        Configuration State    Mode
------------------------------ ---------------------- ------------------------
Local: cluster_A               configured             switchover
Remote: cluster_B              not-reachable          -
- Confirm that the switchover operation was successful:
metrocluster operation show
cluster_A::> metrocluster operation show
  Operation: switchover
      State: successful
 Start Time: 2/6/2016 13:28:50
   End Time: 2/6/2016 13:29:41
     Errors: -
- Use the vserver show and network interface show commands to verify that the DR SVMs and LIFs are online (a sample check follows this step).
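For example, a quick check might look like the following (a sketch using standard clustershell field filtering, not verbatim from this procedure; when all data LIFs are up, the second command should return no entries):
cluster_A::> vserver show -type data
cluster_A::> network interface show -status-oper down
There are no entries matching your query.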
Verifying healing and manual switchback
You can test the healing and manual switchback operations by switching the cluster back to the original data center after a negotiated switchover, verifying that data availability is not affected (except for SMB and Solaris FC configurations).
This test should take about 30 minutes.
The expected result of this procedure is that services should be switched back to their home nodes.
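The steps below assume that healing has already been performed after the negotiated switchover. On MetroCluster FC configurations, healing is run manually from the surviving cluster in two phases; a minimal sketch of those commands is shown here (on MetroCluster IP configurations healing is largely automated in recent ONTAP releases, so follow the procedure for your release):
cluster_A::> metrocluster heal -phase aggregates
cluster_A::> metrocluster heal -phase root-aggregates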
- Verify that healing is completed:
metrocluster node show
The following example shows the successful completion of the command:
cluster_A::> metrocluster node show
DR                               Configuration  DR
Group Cluster Node               State          Mirroring Mode
----- ------- ------------------ -------------- --------- --------------------
1     cluster_A
              node_A_1           configured     enabled   heal roots completed
      cluster_B
              node_B_2           unreachable    -         switched over
2 entries were displayed.
- Verify that all aggregates are mirrored:
storage aggregate show
The following example shows that all aggregates have a RAID status of mirrored:
cluster_A::> storage aggregate show
cluster Aggregates:
Aggregate          Size Available Used% State   #Vols  Nodes     RAID Status
---------------- ------ --------- ----- ------- ------ --------- --------------------------
data_cluster     4.19TB    4.13TB    2% online       8 node_A_1  raid_dp, mirrored, normal
root_cluster    715.5GB   212.7GB   70% online       1 node_A_1  raid4, mirrored, normal

cluster_B Switched Over Aggregates:
Aggregate          Size Available Used% State   #Vols  Nodes     RAID Status
---------------- ------ --------- ----- ------- ------ --------- --------------------------
data_cluster_B   4.19TB    4.11TB    2% online       5 node_A_1  raid_dp, mirrored, normal
root_cluster_B        -         -     - unknown      - node_A_1  -
- Boot up the nodes at the disaster site.
- Check the status of switchback recovery:
metrocluster node show
cluster_A::> metrocluster node show
DR                               Configuration  DR
Group Cluster Node               State          Mirroring Mode
----- ------- ------------------ -------------- --------- --------------------
1     cluster_A
              node_A_1           configured     enabled   heal roots completed
      cluster_B
              node_B_2           configured     enabled   waiting for switchback
                                                          recovery
2 entries were displayed.
- Perform the switchback:
metrocluster switchback
cluster_A::> metrocluster switchback
[Job 938] Job succeeded: Switchback is successful.
Verify the switchback:
- Confirm the status of the nodes:
metrocluster node show
cluster_A::> metrocluster node show
DR                               Configuration  DR
Group Cluster Node               State          Mirroring Mode
----- ------- ------------------ -------------- --------- --------------------
1     cluster_A
              node_A_1           configured     enabled   normal
      cluster_B
              node_B_2           configured     enabled   normal
2 entries were displayed.
- Confirm the switchback status:
metrocluster operation show
The output should show a successful state.
cluster_A::> metrocluster operation show
  Operation: switchback
      State: successful
 Start Time: 2/6/2016 13:54:25
   End Time: 2/6/2016 13:56:15
     Errors: -
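As an additional post-switchback health check (a suggested follow-up, not part of the original test plan), you can run the built-in MetroCluster checks and review the results:
cluster_A::> metrocluster check run
cluster_A::> metrocluster check show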
Loss of a single FC-to-SAS bridge
You can test the failure of a single FC-to-SAS bridge to make sure there is no single point of failure.
This test should take about 15 minutes.
This procedure has the following expected results:
- Errors should be generated as the bridge is switched off.
- No failover or loss of service should occur.
- Only one path from the controller module to the drives behind the bridge is available.
Beginning with ONTAP 9.8, the storage bridge command is replaced with system bridge. The following steps show the storage bridge command, but if you are running ONTAP 9.8 or later, the system bridge command is preferred.
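For example, on ONTAP 9.8 or later the bridge monitoring check in the steps below would be run with the renamed command (a sketch):
cluster_A::> system bridge show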
- Turn off the power supply to the bridge.
- Confirm that bridge monitoring indicates an error:
storage bridge show
cluster_A::> storage bridge show
                                                             Is        Monitor
Bridge     Symbolic Name Vendor Model             Bridge WWN       Monitored Status
---------- ------------- ------ ----------------- ---------------- --------- -------
ATTO_10.65.57.145
           bridge_A_1    Atto   FibreBridge 6500N
                                                  200000108662d46c true      error
- Confirm that the drives behind the bridge are available with a single path:
storage disk error show
cluster_A::> storage disk error show
Disk             Error Type        Error Text
---------------- ----------------- --------------------------------------------
1.0.0            onedomain         1.0.0 (5000cca057729118): All paths to this array LUN are connected to the same fault domain. This is a single point of failure.
1.0.1            onedomain         1.0.1 (5000cca057727364): All paths to this array LUN are connected to the same fault domain. This is a single point of failure.
1.0.2            onedomain         1.0.2 (5000cca05772e9d4): All paths to this array LUN are connected to the same fault domain. This is a single point of failure.
...
1.0.23           onedomain         1.0.23 (5000cca05772e9d4): All paths to this array LUN are connected to the same fault domain. This is a single point of failure.
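When the test is complete, power the bridge back on and confirm that the error condition clears. Re-running the same two checks should show the bridge monitoring status returning from error to ok and no remaining single-path disk errors (exact output varies by configuration):
cluster_A::> storage bridge show
cluster_A::> storage disk error show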
Verifying operation after power line disruption
You can test the MetroCluster configuration's response to the failure of a PDU.
As a best practice, each power supply unit (PSU) in a component should be connected to a separate power supply. If both PSUs are connected to the same power distribution unit (PDU) and an electrical disruption occurs, the site could go down or a complete shelf might become unavailable. Failure of one power line is tested to confirm that there is no cabling mismatch that could cause a disruption of service.
This test should take about 15 minutes.
This test requires turning off power to all left-hand PDUs and then all right-hand PDUs on all of the racks containing the MetroCluster components.
This procedure has the following expected results:
- Errors should be generated as the PDUs are disconnected.
- No failover or loss of service should occur.
- Turn off the power to the left-hand PDUs on the racks containing the MetroCluster components.
- Monitor the results on the console:
system environment sensors show -state fault
storage shelf show -errors
cluster_A::> system environment sensors show -state fault

Node     Sensor              State Value/Units Crit-Low Warn-Low Warn-Hi Crit-Hi
-------- ------------------- ----- ----------- -------- -------- ------- -------
node_A_1
         PSU1                fault PSU_OFF
         PSU1 Pwr In OK      fault FAULT
node_A_2
         PSU1                fault PSU_OFF
         PSU1 Pwr In OK      fault FAULT
4 entries were displayed.

cluster_A::> storage shelf show -errors
    Shelf Name: 1.1
     Shelf UID: 50:0a:09:80:03:6c:44:d5
 Serial Number: SHFHU1443000059

Error Type          Description
------------------  ---------------------------
Power               Critical condition is detected in storage shelf power supply unit "1". The unit might fail.
- Turn the power back on to the left-hand PDUs.
- Make sure that ONTAP clears the error condition (see the check after this list).
- Repeat the preceding steps with the right-hand PDUs.
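To confirm that ONTAP has cleared the error condition after power is restored, re-run the same two monitoring commands; neither should report any remaining faults:
cluster_A::> system environment sensors show -state fault
cluster_A::> storage shelf show -errors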
Verifying operation after switch fabric failure
You can disable a switch fabric to show that data availability is not affected by its loss.
This test should take about 15 minutes.
The expected result of this procedure is that disabling a fabric causes all cluster interconnect and disk traffic to move to the other fabric.
In the example shown, switch fabric 1 is disabled. This fabric consists of two switches, one at each MetroCluster site:
- FC_switch_A_1 on cluster_A
- FC_switch_B_1 on cluster_B
- Disable the connections to one of the two switch fabrics in the MetroCluster configuration:
  - Disable the first switch in the fabric:
    switchdisable
    FC_switch_A_1::> switchdisable
  - Disable the second switch in the fabric:
    switchdisable
    FC_switch_B_1::> switchdisable
- Monitor the results on the consoles of the controller modules.
You can check the cluster nodes with the following commands to make sure that all data is still being served. The command output shows some paths to disks as missing; this is expected.
  - vserver show
  - network interface show
  - aggr show
  - system node run -node node-name -command storage show disk -p
  - storage disk error show
- Re-enable the connections to the switch fabric that you disabled:
  - Re-enable the first switch in the fabric:
    switchenable
    FC_switch_A_1::> switchenable
  - Re-enable the second switch in the fabric:
    switchenable
    FC_switch_B_1::> switchenable
- Wait a minimum of ten minutes, and then repeat the preceding steps on the other switch fabric (an optional switch-side health check follows this list).
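Optionally, before repeating the test on the other fabric, you can confirm from the switch side that the re-enabled fabric is healthy. On Brocade switches, for example, the switchshow command reports the switch state and per-port status (a suggested check, not part of the original procedure; other switch vendors use different commands):
FC_switch_A_1::> switchshow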
Verifying operation after loss of a single storage shelf
You can test the failure of a single storage shelf to verify that there is no single point of failure.
This procedure has the following expected results:
- An error message should be reported by the monitoring software.
- No failover or loss of service should occur.
- Mirror resynchronization starts automatically after the hardware failure is restored.
- Check the storage failover status:
storage failover show
cluster_A::> storage failover show
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
node_A_1       node_A_2       true     Connected to node_A_2
node_A_2       node_A_1       true     Connected to node_A_1
2 entries were displayed.
- Check the aggregate status:
storage aggregate show
cluster_A::> storage aggregate show
cluster Aggregates:
Aggregate                      Size Available Used% State   #Vols  Nodes     RAID Status
--------------------------- ------- --------- ----- ------- ------ --------- --------------------------
node_A_1_data01_mirrored     4.15TB    3.40TB   18% online       3 node_A_1  raid_dp, mirrored, normal
node_A_1_root               707.7GB   34.29GB   95% online       1 node_A_1  raid_dp, mirrored, normal
node_A_2_data01_mirrored     4.15TB    4.12TB    1% online       2 node_A_2  raid_dp, mirrored, normal
node_A_2_data02_unmirrored   2.18TB    2.18TB    0% online       1 node_A_2  raid_dp, normal
node_A_2_root               707.7GB   34.27GB   95% online       1 node_A_2  raid_dp, mirrored, normal
- Verify that all data SVMs and data volumes are online and serving data:
vserver show -type data
network interface show -fields is-home false
volume show !vol0,!MDV*
cluster_A::> vserver show -type data
                                Admin      Operational Root
Vserver Type    Subtype     State      State       Volume     Aggregate
------- ------- ----------- ---------- ----------- ---------- ------------------------
SVM1    data    sync-source running                SVM1_root  node_A_1_data01_mirrored
SVM2    data    sync-source running                SVM2_root  node_A_2_data01_mirrored

cluster_A::> network interface show -fields is-home false
There are no entries matching your query.

cluster_A::> volume show !vol0,!MDV*
Vserver Volume        Aggregate                  State  Type  Size Available Used%
------- ------------- -------------------------- ------ ---- ----- --------- -----
SVM1    SVM1_root     node_A_1_data01_mirrored   online RW    10GB    9.50GB    5%
SVM1    SVM1_data_vol node_A_1_data01_mirrored   online RW    10GB    9.49GB    5%
SVM2    SVM2_root     node_A_2_data01_mirrored   online RW    10GB    9.49GB    5%
SVM2    SVM2_data_vol node_A_2_data02_unmirrored online RW     1GB   972.6MB    5%
- Identify a shelf in Pool 1 for node node_A_2 to power off, to simulate a sudden hardware failure:
storage aggregate show -r -node node-name !*root
The shelf you select must contain drives that are in mirrored data aggregates.
In the following example, shelf ID 31 is selected to be failed.
cluster_A::> storage aggregate show -r -node node_A_2 !*root

Owner Node: node_A_2
 Aggregate: node_A_2_data01_mirrored (online, raid_dp, mirrored) (block checksums)
  Plex: /node_A_2_data01_mirrored/plex0 (online, normal, active, pool0)
   RAID Group /node_A_2_data01_mirrored/plex0/rg0 (normal, block checksums)
                                                              Usable Physical
     Position Disk                        Pool Type     RPM     Size     Size Status
     -------- --------------------------- ---- ----- ------ -------- -------- --------
     dparity  2.30.3                       0   BSAS    7200  827.7GB  828.0GB (normal)
     parity   2.30.4                       0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.6                       0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.8                       0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.5                       0   BSAS    7200  827.7GB  828.0GB (normal)

  Plex: /node_A_2_data01_mirrored/plex4 (online, normal, active, pool1)
   RAID Group /node_A_2_data01_mirrored/plex4/rg0 (normal, block checksums)
                                                              Usable Physical
     Position Disk                        Pool Type     RPM     Size     Size Status
     -------- --------------------------- ---- ----- ------ -------- -------- --------
     dparity  1.31.7                       1   BSAS    7200  827.7GB  828.0GB (normal)
     parity   1.31.6                       1   BSAS    7200  827.7GB  828.0GB (normal)
     data     1.31.3                       1   BSAS    7200  827.7GB  828.0GB (normal)
     data     1.31.4                       1   BSAS    7200  827.7GB  828.0GB (normal)
     data     1.31.5                       1   BSAS    7200  827.7GB  828.0GB (normal)

 Aggregate: node_A_2_data02_unmirrored (online, raid_dp) (block checksums)
  Plex: /node_A_2_data02_unmirrored/plex0 (online, normal, active, pool0)
   RAID Group /node_A_2_data02_unmirrored/plex0/rg0 (normal, block checksums)
                                                              Usable Physical
     Position Disk                        Pool Type     RPM     Size     Size Status
     -------- --------------------------- ---- ----- ------ -------- -------- --------
     dparity  2.30.12                      0   BSAS    7200  827.7GB  828.0GB (normal)
     parity   2.30.22                      0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.21                      0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.20                      0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.14                      0   BSAS    7200  827.7GB  828.0GB (normal)
15 entries were displayed.
- Physically power off the selected shelf.
- Check the aggregate status again:
storage aggregate show
storage aggregate show -r -node node_A_2 !*root
The aggregates with drives on the powered-off shelf should have a "degraded" RAID status, and the drives on the affected plex should have a "failed" status, as shown in the following example:
cluster_A::> storage aggregate show
Aggregate                      Size Available Used% State   #Vols  Nodes     RAID Status
--------------------------- ------- --------- ----- ------- ------ --------- --------------------------
node_A_1_data01_mirrored     4.15TB    3.40TB   18% online       3 node_A_1  raid_dp, mirrored, normal
node_A_1_root               707.7GB   34.29GB   95% online       1 node_A_1  raid_dp, mirrored, normal
node_A_2_data01_mirrored     4.15TB    4.12TB    1% online       2 node_A_2  raid_dp, mirror degraded
node_A_2_data02_unmirrored   2.18TB    2.18TB    0% online       1 node_A_2  raid_dp, normal
node_A_2_root               707.7GB   34.27GB   95% online       1 node_A_2  raid_dp, mirror degraded

cluster_A::> storage aggregate show -r -node node_A_2 !*root

Owner Node: node_A_2
 Aggregate: node_A_2_data01_mirrored (online, raid_dp, mirror degraded) (block checksums)
  Plex: /node_A_2_data01_mirrored/plex0 (online, normal, active, pool0)
   RAID Group /node_A_2_data01_mirrored/plex0/rg0 (normal, block checksums)
                                                              Usable Physical
     Position Disk                        Pool Type     RPM     Size     Size Status
     -------- --------------------------- ---- ----- ------ -------- -------- --------
     dparity  2.30.3                       0   BSAS    7200  827.7GB  828.0GB (normal)
     parity   2.30.4                       0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.6                       0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.8                       0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.5                       0   BSAS    7200  827.7GB  828.0GB (normal)

  Plex: /node_A_2_data01_mirrored/plex4 (offline, failed, inactive, pool1)
   RAID Group /node_A_2_data01_mirrored/plex4/rg0 (partial, none checksums)
                                                              Usable Physical
     Position Disk                        Pool Type     RPM     Size     Size Status
     -------- --------------------------- ---- ----- ------ -------- -------- --------
     dparity  FAILED                       -   -          -  827.7GB        - (failed)
     parity   FAILED                       -   -          -  827.7GB        - (failed)
     data     FAILED                       -   -          -  827.7GB        - (failed)
     data     FAILED                       -   -          -  827.7GB        - (failed)
     data     FAILED                       -   -          -  827.7GB        - (failed)

 Aggregate: node_A_2_data02_unmirrored (online, raid_dp) (block checksums)
  Plex: /node_A_2_data02_unmirrored/plex0 (online, normal, active, pool0)
   RAID Group /node_A_2_data02_unmirrored/plex0/rg0 (normal, block checksums)
                                                              Usable Physical
     Position Disk                        Pool Type     RPM     Size     Size Status
     -------- --------------------------- ---- ----- ------ -------- -------- --------
     dparity  2.30.12                      0   BSAS    7200  827.7GB  828.0GB (normal)
     parity   2.30.22                      0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.21                      0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.20                      0   BSAS    7200  827.7GB  828.0GB (normal)
     data     2.30.14                      0   BSAS    7200  827.7GB  828.0GB (normal)
15 entries were displayed.
- Verify that data is being served and that all volumes are still online:
vserver show -type data
network interface show -fields is-home false
volume show !vol0,!MDV*
cluster_A::> vserver show -type data
                                Admin      Operational Root
Vserver Type    Subtype     State      State       Volume     Aggregate
------- ------- ----------- ---------- ----------- ---------- ------------------------
SVM1    data    sync-source running                SVM1_root  node_A_1_data01_mirrored
SVM2    data    sync-source running                SVM2_root  node_A_1_data01_mirrored

cluster_A::> network interface show -fields is-home false
There are no entries matching your query.

cluster_A::> volume show !vol0,!MDV*
Vserver Volume        Aggregate                  State  Type  Size Available Used%
------- ------------- -------------------------- ------ ---- ----- --------- -----
SVM1    SVM1_root     node_A_1_data01_mirrored   online RW    10GB    9.50GB    5%
SVM1    SVM1_data_vol node_A_1_data01_mirrored   online RW    10GB    9.49GB    5%
SVM2    SVM2_root     node_A_1_data01_mirrored   online RW    10GB    9.49GB    5%
SVM2    SVM2_data_vol node_A_2_data02_unmirrored online RW     1GB   972.6MB    5%
- Physically power the shelf back on.
Resynchronization starts automatically.
- Verify that resynchronization has started:
storage aggregate show
The affected aggregates should have a "resyncing" RAID status, as shown in the following example:
cluster_A::> storage aggregate show
cluster Aggregates:
Aggregate                      Size Available Used% State   #Vols  Nodes     RAID Status
--------------------------- ------- --------- ----- ------- ------ --------- --------------------------
node_A_1_data01_mirrored     4.15TB    3.40TB   18% online       3 node_A_1  raid_dp, mirrored, normal
node_A_1_root               707.7GB   34.29GB   95% online       1 node_A_1  raid_dp, mirrored, normal
node_A_2_data01_mirrored     4.15TB    4.12TB    1% online       2 node_A_2  raid_dp, resyncing
node_A_2_data02_unmirrored   2.18TB    2.18TB    0% online       1 node_A_2  raid_dp, normal
node_A_2_root               707.7GB   34.27GB   95% online       1 node_A_2  raid_dp, resyncing
- Monitor the aggregates to confirm that resynchronization is complete:
storage aggregate show
The affected aggregates should have a "normal" RAID status, as shown in the following example:
cluster_A::> storage aggregate show
cluster Aggregates:
Aggregate                      Size Available Used% State   #Vols  Nodes     RAID Status
--------------------------- ------- --------- ----- ------- ------ --------- --------------------------
node_A_1_data01_mirrored     4.15TB    3.40TB   18% online       3 node_A_1  raid_dp, mirrored, normal
node_A_1_root               707.7GB   34.29GB   95% online       1 node_A_1  raid_dp, mirrored, normal
node_A_2_data01_mirrored     4.15TB    4.12TB    1% online       2 node_A_2  raid_dp, normal
node_A_2_data02_unmirrored   2.18TB    2.18TB    0% online       1 node_A_2  raid_dp, normal
node_A_2_root               707.7GB   34.27GB   95% online       1 node_A_2  raid_dp, resyncing
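If you want to watch the resynchronization of an individual plex rather than the aggregate summary, you can also query plex status directly (a sketch; the aggregate name is taken from the examples above):
cluster_A::> storage aggregate plex show -aggregate node_A_2_data01_mirrored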