FortiGate Clustering Protocol (FGCP) - High Availability - Part 2 - Troubleshooting

Iñaki Urrutxi
Apr 5, 2020

Updated: Jun 8, 2020

This is the second installment in our series on FortiGate HA. This time we will look at how to check whether the units in an HA cluster are correctly synchronized, and how to resolve some problems that can appear in environments with or without VDOMs. In other words, a bit of TROUBLESHOOTING.

When we configure HA between two and up to four FortiGate units, HA creates one cluster. In that cluster one unit is the primary, which processes traffic, while the rest wait for the primary to fail physically, or for one of its monitored interfaces to fail, before they start processing traffic. In environments with virtual firewalls (VDOMs) we can build what are called virtual clusters: there are two clusters and we choose which one each VDOM runs on. This way both firewalls carry traffic: some VDOMs have cluster1 as their primary firewall and others cluster2, and if one of the clusters fails, the other takes over all the traffic.

- HA in environments with 1 cluster (without VDOMs)

In this environment there is a single cluster, with one unit acting as master and the other as slave:

Cluster1: 
- MASTER: FG100ETK1XXXXXXX1
- SLAVE: FG100ETK1XXXXXXX2

To check the health and status of our HA cluster, we run this command from the CLI; it shows a great deal of information about HA.

get system ha status 

HA Health Status: OK
Model: FortiGate-100E
Mode: HA A-P
Group: 0
Debug: 0
Cluster Uptime: 299 days 5:57:58
Cluster state change time: 2020-03-22 20:43:00
Master selected using:
    <2020/03/22 20:43:00> FG100ETK1XXXXXXX1 is selected as the master because it has the largest value of override priority.
ses_pickup: enable, ses_pickup_delay=disable
override: disable
Configuration Status:
    FG100ETK1XXXXXXX1(updated 3 seconds ago): in-sync
    FG100ETK1XXXXXXX2(updated 2 seconds ago): in-sync
System Usage stats:
    FG100ETK1XXXXXXX1(updated 3 seconds ago):
        sessions=2745, average-cpu-user/nice/system/idle=0%/0%/0%/99%, memory=23%
    FG100ETK1XXXXXXX2(updated 2 seconds ago):
        sessions=364, average-cpu-user/nice/system/idle=0%/0%/0%/100%, memory=22%
HBDEV stats:
    FG100ETK1XXXXXXX1(updated 3 seconds ago):
        ha1: physical/1000auto, up, rx-bytes/packets/dropped/errors=3949226204/8492792/0/0, tx=36704769/20670253/0/0
    FG100ETK1XXXXXXX2(updated 2 seconds ago):
        ha1: physical/1000auto, up, rx-bytes/packets/dropped/errors=37515960/20671410/0/0, tx=3949119434/8493552/0/0
MONDEV stats:
    FG100ETK1XXXXXXX1(updated 3 seconds ago):
        port1: physical/1000auto, up, rx-bytes/packets/dropped/errors=309397574/302323828/0/0, tx=1017271708/295835539/0/0
        port2: physical/1000auto, up, rx-bytes/packets/dropped/errors=4184771964/248443511/0/0, tx=759460361/412247490/0/0
        LAG-DMZ: aggregate/00, up, rx-bytes/packets/dropped/errors=199202242/550767339/0/0, tx=1776732069/708083029/0/0
        wan1: physical/1000auto, up, rx-bytes/packets/dropped/errors=3906985747/231285588/0/0, tx=1304391755/174943198/0/0
    FG100ETK1XXXXXXX2(updated 2 seconds ago):
        port1: physical/1000auto, up, rx-bytes/packets/dropped/errors=323651446/3974469/0/0, tx=258452/1405/0/0
        port2: physical/1000auto, up, rx-bytes/packets/dropped/errors=316543179/3845867/0/0, tx=251723/1476/0/0
        LAG-DMZ: aggregate/00, up, rx-bytes/packets/dropped/errors=640194625/7820336/0/0, tx=510175/2881/0/0
        wan1: physical/1000auto, up, rx-bytes/packets/dropped/errors=2743785164/45625558/0/0, tx=1609772/5065/0/0
Master: FG-FW-1, FG100ETK1XXXXXXX1, cluster index = 0
Slave : FG-FW-2, FG100ETK1XXXXXXX2, cluster index = 1
number of vcluster: 1
vcluster 1: work 169.254.0.1
Master: FG100ETK1XXXXXXX1, operating cluster index = 0
Slave : FG100ETK1XXXXXXX2, operating cluster index = 1

Let's break down the information this command gives us:

HA status, unit model, HA mode, HA group, how long the cluster has been running, and the last time the HA state changed.

HA Health Status: OK
Model: FortiGate-100E
Mode: HA A-P
Group: 0
Debug: 0
Cluster Uptime: 299 days 5:57:58
Cluster state change time: 2020-03-22 20:43:00

Information about why this unit was elected master of the cluster, kept as a timestamped history:

Master selected using:
    <2020/03/22 20:43:00> FG100ETK1XXXXXXX1 is selected as the master because it has the largest value of override priority.
ses_pickup: enable, ses_pickup_delay=disable --> session pickup enabled
override: disable --> override disabled

Other messages that may appear to explain how the master firewall was elected:

<2020/02/16 10:18:15> FG100ETK1XXXXXXX1 is selected as the master because the peer member FG100ETK1XXXXXXX2 has UPGRADE_SLAVE flag set.
<2020/02/16 10:18:15> FG100ETK1XXXXXXX1 is selected as the master because it's the only member in the cluster.
<2020/02/16 10:18:15> FG100ETK1XXXXXXX1 is selected as the master because it has UPGRADE_MASTER flag set.
<2020/02/16 10:18:15> FG100ETK1XXXXXXX1 is selected as the master because it has the largest value of uptime.
<2020/02/16 10:18:15> FG100ETK1XXXXXXX1 is selected as the master because it has the largest value of serialno.

Configuration synchronization status, time since the last HA synchronization packet, and general status information about our devices: CPU, memory and sessions.

Configuration Status:
    FG100ETK1XXXXXXX1(updated 3 seconds ago): in-sync
    FG100ETK1XXXXXXX2(updated 2 seconds ago): in-sync
System Usage stats:
    FG100ETK1XXXXXXX1(updated 3 seconds ago):
        sessions=2745, average-cpu-user/nice/system/idle=0%/0%/0%/99%, memory=23%
    FG100ETK1XXXXXXX2(updated 2 seconds ago):
        sessions=364, average-cpu-user/nice/system/idle=0%/0%/0%/100%, memory=22%

If the configuration were NOT synchronized, it would show this message:

Configuration Status:
    FG100ETK1XXXXXXX1(updated 3 seconds ago): in-sync
    FG100ETK1XXXXXXX2(updated 2 seconds ago): out-of-sync
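
A check like this is easy to script. The following Python sketch assumes the text of `get system ha status` has already been captured into a string (how it is collected, for example over SSH, is left out). The function name `out_of_sync_members` is hypothetical, and the regular expression is based only on the output format shown in this post.

```python
import re

# Matches lines such as:
#     FG100ETK1XXXXXXX2(updated 2 seconds ago): out-of-sync
# The pattern is inferred from the sample output in this post (an assumption).
STATUS_LINE = re.compile(r"^\s*(\S+?)\(updated .*?\):\s*(in-sync|out-of-sync)\s*$")


def out_of_sync_members(status_text: str) -> list[str]:
    """Return the serial numbers reported as out-of-sync."""
    members = []
    for line in status_text.splitlines():
        match = STATUS_LINE.match(line)
        if match and match.group(2) == "out-of-sync":
            members.append(match.group(1))
    return members


sample = """Configuration Status:
    FG100ETK1XXXXXXX1(updated 3 seconds ago): in-sync
    FG100ETK1XXXXXXX2(updated 2 seconds ago): out-of-sync
"""

print(out_of_sync_members(sample))
```

An empty list means every member reported in-sync, which makes the function convenient for a periodic monitoring job.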

Statistics for the HA heartbeat ports, in this case port ha1:

HBDEV stats:
    FG100ETK1XXXXXXX1(updated 3 seconds ago):
        ha1: physical/1000auto, up, rx-bytes/packets/dropped/errors=3949226204/8492792/0/0, tx=36704769/20670253/0/0
    FG100ETK1XXXXXXX2(updated 2 seconds ago):
        ha1: physical/1000auto, up, rx-bytes/packets/dropped/errors=37515960/20671410/0/0, tx=3949119434/8493552/0/0
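
Non-zero dropped or error counters on these ports are worth a closer look. As a rough sketch, the Python snippet below scans HBDEV/MONDEV lines and reports interfaces that are down or show drops or errors. The function name is hypothetical, the pattern is inferred only from the output format in this post (including the assumption that a failed link is printed as `down`), and the sample lines are taken from the virtual-cluster output later in this post, where ha1 on the second unit reports 183 dropped packets.

```python
import re

# Matches interface lines of the HBDEV/MONDEV sections, for example:
#   ha1: physical/10000full, up, rx-bytes/packets/dropped/errors=1/2/0/0, tx=3/4/0/0
# The pattern is inferred from the sample output in this post (an assumption).
IFACE_LINE = re.compile(
    r"^\s*(?P<name>[\w.-]+): \S+, (?P<state>up|down), "
    r"rx-bytes/packets/dropped/errors=\d+/\d+/(?P<rx_drop>\d+)/(?P<rx_err>\d+), "
    r"tx=\d+/\d+/(?P<tx_drop>\d+)/(?P<tx_err>\d+)\s*$"
)


def suspicious_interfaces(stats_text: str) -> list[tuple[str, str]]:
    """Return (interface, reason) pairs for links that are down or losing packets."""
    findings = []
    for line in stats_text.splitlines():
        m = IFACE_LINE.match(line)
        if not m:
            continue  # serial-number headers and unrelated lines are skipped
        if m["state"] != "up":
            findings.append((m["name"], "link down"))
        for counter in ("rx_drop", "rx_err", "tx_drop", "tx_err"):
            value = int(m[counter])
            if value:
                findings.append((m["name"], f"{counter}={value}"))
    return findings


sample = """\
        ha1: physical/10000full, up, rx-bytes/packets/dropped/errors=7292541481/7329524/183/0, tx=3501091966/5953929/0/0
        ha2: physical/10000full, up, rx-bytes/packets/dropped/errors=1377355015/2114105/0/0, tx=1354104972/2114159/0/0
"""

for name, reason in suspicious_interfaces(sample):
    print(name, reason)
```

The counters are cumulative, so in practice it is the growth between two runs, rather than the absolute value, that indicates an ongoing problem.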

Statistics for the monitored ports, whose failure triggers a failover to the slave firewall, on each of the units that form the HA cluster.

MONDEV stats:
    FG100ETK1XXXXXXX1(updated 3 seconds ago):
        port1: physical/1000auto, up, rx-bytes/packets/dropped/errors=309397574/302323828/0/0, tx=1017271708/295835539/0/0
        port2: physical/1000auto, up, rx-bytes/packets/dropped/errors=4184771964/248443511/0/0, tx=759460361/412247490/0/0
        LAG-DMZ: aggregate/00, up, rx-bytes/packets/dropped/errors=199202242/550767339/0/0, tx=1776732069/708083029/0/0
        wan1: physical/1000auto, up, rx-bytes/packets/dropped/errors=3906985747/231285588/0/0, tx=1304391755/174943198/0/0
    FG100ETK1XXXXXXX2(updated 2 seconds ago):
        port1: physical/1000auto, up, rx-bytes/packets/dropped/errors=323651446/3974469/0/0, tx=258452/1405/0/0
        port2: physical/1000auto, up, rx-bytes/packets/dropped/errors=316543179/3845867/0/0, tx=251723/1476/0/0
        LAG-DMZ: aggregate/00, up, rx-bytes/packets/dropped/errors=640194625/7820336/0/0, tx=510175/2881/0/0
        wan1: physical/1000auto, up, rx-bytes/packets/dropped/errors=2743785164/45625558/0/0, tx=1609772/5065/0/0
Master: FG-FW-1, FG100ETK1XXXXXXX1, cluster index = 0
Slave : FG-FW-2, FG100ETK1XXXXXXX2, cluster index = 1

Number of clusters, the IP address they use, and the index each unit uses within HA. We can use this index to jump from one device to the other.

number of vcluster: 1
vcluster 1: work 169.254.0.1
Master: FG100ETK1XXXXXXX1, operating cluster index = 0
Slave : FG100ETK1XXXXXXX2, operating cluster index = 1

How do we reach the slave firewall from the master? We connect to the master over SSH and from there run this command (execute ha manage 1) to access the slave:

FG-VDOM-HA-CLUSTER1(global) # execute ha manage 
<id>    please input peer box index.
<1>	Subsidary unit FG100ETK1XXXXXXX2

FG-VDOM-HA-CLUSTER1 (global) # execute ha manage 1
FG-VDOM-HA-CLUSTER2 login: admin
Password: ********
Welcome !

FG-VDOM-HA-CLUSTER2 $

HA status via the GUI; as we can see, both units are synchronized.

With this command we check the HA status through the checksums of both units. If they match on both units, HA is correctly synchronized; if any of the sections (global, root) differs, the units are not synchronized. The mismatch is usually in root (the primary VDOM). If any section differs, the all section will differ as well, and the GUI will show that the units are not synchronized.

 diagnose sys ha checksum cluster 

================== FG100ETK1XXXXXXX1 ==================

is_manage_master()=1, is_root_master()=1
debugzone
global: d7 cc 25 ae c0 35 b8 64 6f 58 8b 17 b1 d2 4f 37 
root: 7f ae 5c 98 d5 20 86 c2 d8 10 fe 5c c1 b2 83 b0 
all: d7 a2 c4 b8 57 f8 d0 ba 60 d9 fa f2 fc 9c c3 16 

checksum
global: d7 cc 25 ae c0 35 b8 64 6f 58 8b 17 b1 d2 4f 37 --> OK
root: 7f ae 5c 98 d5 20 86 c2 d8 10 fe 5c c1 b2 83 b0 --> OK
all: d7 a2 c4 b8 57 f8 d0 ba 60 d9 fa f2 fc 9c c3 16 -->  OK

================== FG100ETK1XXXXXXX2 ==================

is_manage_master()=0, is_root_master()=0
debugzone
global: d7 cc 25 ae c0 35 b8 64 6f 58 8b 17 b1 d2 4f 37 
root: 7f ae 5c 98 d5 20 86 c2 d8 10 fe 5c c1 b2 83 b0 
all: d7 a2 c4 b8 57 f8 d0 ba 60 d9 fa f2 fc 9c c3 16 

checksum
global: d7 cc 25 ae c0 35 b8 64 6f 58 8b 17 b1 d2 4f 37 -> OK
root: 7f ae 5c 98 d5 20 86 c2 d8 10 fe 5c c1 b2 83 b0  --> OK
all: d7 a2 c4 b8 57 f8 d0 ba 60 d9 fa f2 fc 9c c3 16 --> OK

- HA in environments with 2 clusters (with VDOMs)

Now let's talk about HA in environments with VDOMs. Later on we will cover Virtual Domains in FortiGate in more depth, that is, virtual firewalls inside a physical firewall. When we work with VDOMs in HA, we can split the load so that some VDOMs go out through one device and other VDOMs through the other. This improves performance and uses both firewalls simultaneously. We no longer have one firewall carrying traffic while the other waits for it to stop working so it can take over as master. It is the best HA setup for active-active operation.

But we have to be careful that the combined load of both clusters does not exceed the capacity of a single device; otherwise, when one of them fails, the other device will not be able to take on the load of all the VDOMs and HA will not be of much use.
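
The reasoning is simple arithmetic: the sum of what both virtual clusters handle must fit in one unit. A minimal Python sketch follows. The function name and the capacity figure are hypothetical, and session count is just one possible measure of load (throughput or CPU would be checked the same way).

```python
def survives_failover(load_vcluster1: int, load_vcluster2: int, unit_capacity: int) -> bool:
    """True if a single unit can carry the load of both virtual clusters."""
    return load_vcluster1 + load_vcluster2 <= unit_capacity


# Session counts are those shown by `get system ha status` further down in
# this post; the capacity value is a made-up number for illustration only.
sessions_unit1 = 1089
sessions_unit2 = 1569
hypothetical_capacity = 2000

print(survives_failover(sessions_unit1, sessions_unit2, hypothetical_capacity))
```

With these numbers the combined 2658 sessions exceed the hypothetical limit of 2000, so a failover would overload the surviving unit.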

In this environment two clusters are created: cluster1, for the VDOMs going out through device1, and cluster2, for the VDOMs going out through device2. In both cases the other device acts as the HA backup for that cluster.

Virtual cluster
Cluster1: root, vdom1
- MASTER: FG100ETK1XXXXXXX1
- SLAVE: FG100ETK1XXXXXXX2

Cluster2: vdom2, vdom3
- MASTER: FG100ETK1XXXXXXX2
- SLAVE: FG100ETK1XXXXXXX1

We run the same command to see how it is displayed.

get system ha status 

HA Health Status: OK
Model: FortiGate-100E
Mode: HA A-P
Group: 0
Debug: 0
Cluster Uptime: 33 days 6:25:7
Cluster state change time: 2020-04-02 08:38:18
Master selected using:
  virtual cluster 1:
    <2020/04/02 08:38:18> FG100ETK1XXXXXXX1 is selected as the master because it has the largest value of override priority.
    <2020/04/02 08:37:36> FG100ETK1XXXXXXX2 is selected as the master because it has the largest value of override priority.
  virtual cluster 2:
    <2020/04/02 08:37:36> FG100ETK1XXXXXXX2 is selected as the master because it has the largest value of override priority.
ses_pickup: enable, ses_pickup_delay=disable
override: vcluster1 enable, vcluster2 enable
Configuration Status:
    FG100ETK1XXXXXXX1(updated 2 seconds ago): in-sync
    FG100ETK1XXXXXXX2(updated 4 seconds ago): in-sync
System Usage stats:
    FG100ETK1XXXXXXX1(updated 2 seconds ago):
        sessions=1089, average-cpu-user/nice/system/idle=0%/0%/0%/99%, memory=21%
    FG100ETK1XXXXXXX2(updated 4 seconds ago):
        sessions=1569, average-cpu-user/nice/system/idle=0%/0%/0%/99%, memory=20%
HBDEV stats:
    FG100ETK1XXXXXXX1(updated 2 seconds ago):
        ha1: physical/10000full, up, rx-bytes/packets/dropped/errors=2629753093/3848324/0/0, tx=4469434381/4556998/0/0
        ha2: physical/10000full, up, rx-bytes/packets/dropped/errors=666400838/1039622/0/0, tx=676683397/1039455/0/0
    FG100ETK1XXXXXXX2(updated 4 seconds ago):
        ha1: physical/10000full, up, rx-bytes/packets/dropped/errors=7292541481/7329524/183/0, tx=3501091966/5953929/0/0
        ha2: physical/10000full, up, rx-bytes/packets/dropped/errors=1377355015/2114105/0/0, tx=1354104972/2114159/0/0
MONDEV stats:
  vcluster1:
    FG100ETK1XXXXXXX1(updated 2 seconds ago):
        LAG-DMZ: aggregate/00, up, rx-bytes/packets/dropped/errors=65778963205/169318300/0/0, tx=9595447569/7569671/0/0
        LAG-LAN: aggregate/00, up, rx-bytes/packets/dropped/errors=132097813611/348776926/0/0, tx=144887765140/324145671/0/0
        LAG-WAN: aggregate/00, up, rx-bytes/packets/dropped/errors=146869998059/328892453/0/0, tx=124546068087/348415894/0/0
    FG100ETK1XXXXXXX2(updated 4 seconds ago):
        LAG-DMZ: aggregate/00, up, rx-bytes/packets/dropped/errors=127800802682/324309575/0/0, tx=8942942094/6707500/0/0
        LAG-LAN: aggregate/00, up, rx-bytes/packets/dropped/errors=101755272639/270976470/0/0, tx=130956212705/257002817/0/0
        LAG-WAN: aggregate/00, up, rx-bytes/packets/dropped/errors=131578939098/256535746/0/0, tx=93608058231/265400336/0/0
  vcluster2:
    FG100ETK1XXXXXXX1(updated 2 seconds ago):
        LAG-DMZ: aggregate/00, up, rx-bytes/packets/dropped/errors=65778963205/169318300/0/0, tx=9595447569/7569671/0/0
        LAG-LAN: aggregate/00, up, rx-bytes/packets/dropped/errors=132097813611/348776926/0/0, tx=144887765140/324145671/0/0
        LAG-WAN: aggregate/00, up, rx-bytes/packets/dropped/errors=146869998059/328892453/0/0, tx=124546068087/348415894/0/0
    FG100ETK1XXXXXXX2(updated 4 seconds ago):
        LAG-DMZ: aggregate/00, up, rx-bytes/packets/dropped/errors=127800802682/324309575/0/0, tx=8942942094/6707500/0/0
        LAG-LAN: aggregate/00, up, rx-bytes/packets/dropped/errors=101755272639/270976470/0/0, tx=130956212705/257002817/0/0
        LAG-WAN: aggregate/00, up, rx-bytes/packets/dropped/errors=131578939098/256535746/0/0, tx=93608058231/265400336/0/0
Master: FG-VDOM-HA-CLUSTER1, FG100ETK1XXXXXXX1, cluster index = 0
Slave : FG-VDOM-HA-CLUSTER2, FG100ETK1XXXXXXX2, cluster index = 1
number of vcluster: 2
vcluster 1: work 169.254.0.1
Master: FG100ETK1XXXXXXX1, operating cluster index = 0
Slave : FG100ETK1XXXXXXX2, operating cluster index = 1
vcluster 2: standby 169.254.0.2
Slave : FG100ETK1XXXXXXX1, operating cluster index = 1
Master: FG100ETK1XXXXXXX2, operating cluster index = 0

In this environment we can also review the checksums and, since it is a multi-VDOM environment, there is one per VDOM. That is the case here, where we see that vdom3 is not correctly synchronized.

HA status via the GUI; as we can see, the two units are NOT synchronized.

As we can see, all the VDOMs are synchronized except vdom3. Because one VDOM is not correctly synchronized, the checksum for the whole unit (all) also reports that it is out of sync.

 diagnose sys ha checksum cluster 
 
================== FG100ETK1XXXXXXX1 ==================

is_manage_master()=0, is_root_master()=0
debugzone
global: 8d ba 1a e6 fe e5 2c 4f ec e4 52 d7 1d ae 90 fc 
root: 73 b8 5f 2a ab 01 92 8d 19 57 dc 5b 83 e8 50 ce 
vdom1: c5 ac e2 02 e3 b3 72 44 dc f2 b3 f4 1b 3f f9 43 
vdom2: c1 4b 78 75 96 84 89 07 ac c9 79 72 dc 39 80 4c 
vdom3: c6 6d c3 f0 64 6a 16 d3 50 43 cc 60 9b c8 f0 8d  
all: ab 6b 2e 4e eb 2a fe 48 a8 f9 72 47 62 f0 a5 70 

checksum
global: 8d ba 1a e6 fe e5 2c 4f ec e4 52 d7 1d ae 90 fc 
root: 73 b8 5f 2a ab 01 92 8d 19 57 dc 5b 83 e8 50 ce 
vdom1: c5 ac e2 02 e3 b3 72 44 dc f2 b3 f4 1b 3f f9 43 
vdom2: c1 4b 78 75 96 84 89 07 ac c9 79 72 dc 39 80 4c 
vdom3: c6 6d c3 f0 64 6a 16 d3 50 43 cc 60 9b c8 f0 8d --> KO
all: ab 6b 2e 4e eb 2a fe 48 a8 f9 72 47 62 f0 a5 70 --> KO

================== FG100ETK1XXXXXXX2 ==================

is_manage_master()=1, is_root_master()=1
debugzone
global: 8d ba 1a e6 fe e5 2c 4f ec e4 52 d7 1d ae 90 fc 
root: 73 b8 5f 2a ab 01 92 8d 19 57 dc 5b 83 e8 50 ce 
vdom1: c5 ac e2 02 e3 b3 72 44 dc f2 b3 f4 1b 3f f9 43 
vdom2: c1 4b 78 75 96 84 89 07 ac c9 79 72 dc 39 80 4c 
vdom3: 9f 44 75 a9 62 fd dc d3 e0 50 8d f6 0e a6 e6 4d  
all: 5f 3a a5 61 db b7 33 ed 4b f9 42 cd b2 bd 44 c8 

checksum
global: 8d ba 1a e6 fe e5 2c 4f ec e4 52 d7 1d ae 90 fc 
root: 73 b8 5f 2a ab 01 92 8d 19 57 dc 5b 83 e8 50 ce 
vdom1: c5 ac e2 02 e3 b3 72 44 dc f2 b3 f4 1b 3f f9 43 
vdom2: c1 4b 78 75 96 84 89 07 ac c9 79 72 dc 39 80 4c 
vdom3: 9f 44 75 a9 62 fd dc d3 e0 50 8d f6 0e a6 e6 4d --> KO
all: 5f 3a a5 61 db b7 33 ed 4b f9 42 cd b2 bd 44 c8 --> KO
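
Comparing these blocks by eye gets tedious with many VDOMs. As an illustration, the Python sketch below parses the text of `diagnose sys ha checksum cluster` and lists the scopes whose checksums differ between members. The function names are hypothetical, the parsing rules are inferred only from the output shown above, and the sample is an abbreviated copy of that output.

```python
import re

HEADER = re.compile(r"^=+ (\S+) =+$")
# "vdom3: c6 6d ... 8d" -> scope name followed by 16 hex bytes.
CHECKSUM = re.compile(r"^([\w.-]+): ((?:[0-9a-f]{2} ?){16})")


def parse_checksums(text: str) -> dict[str, dict[str, str]]:
    """Map serial -> {scope: checksum} using each member's 'checksum' block."""
    result: dict[str, dict[str, str]] = {}
    serial, in_checksum_block = None, False
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            serial, in_checksum_block = header.group(1), False
            result[serial] = {}
        elif line in ("debugzone", "checksum"):
            in_checksum_block = line == "checksum"
        elif serial and in_checksum_block:
            entry = CHECKSUM.match(line)
            if entry:
                result[serial][entry.group(1)] = entry.group(2).strip()
    return result


def mismatched_scopes(checksums: dict[str, dict[str, str]]) -> list[str]:
    """Return the scopes (global, root, VDOM names, all) whose checksums differ."""
    scopes = sorted({scope for member in checksums.values() for scope in member})
    return [s for s in scopes if len({m.get(s) for m in checksums.values()}) > 1]


sample = """\
================== FG100ETK1XXXXXXX1 ==================

is_manage_master()=0, is_root_master()=0
debugzone
global: 8d ba 1a e6 fe e5 2c 4f ec e4 52 d7 1d ae 90 fc
checksum
global: 8d ba 1a e6 fe e5 2c 4f ec e4 52 d7 1d ae 90 fc
vdom3: c6 6d c3 f0 64 6a 16 d3 50 43 cc 60 9b c8 f0 8d
all: ab 6b 2e 4e eb 2a fe 48 a8 f9 72 47 62 f0 a5 70

================== FG100ETK1XXXXXXX2 ==================

is_manage_master()=1, is_root_master()=1
debugzone
global: 8d ba 1a e6 fe e5 2c 4f ec e4 52 d7 1d ae 90 fc
checksum
global: 8d ba 1a e6 fe e5 2c 4f ec e4 52 d7 1d ae 90 fc
vdom3: 9f 44 75 a9 62 fd dc d3 e0 50 8d f6 0e a6 e6 4d
all: 5f 3a a5 61 db b7 33 ed 4b f9 42 cd b2 bd 44 c8
"""

print(mismatched_scopes(parse_checksums(sample)))
```

Only the `checksum` block of each member is compared here; the `debugzone` block is skipped, which is a simplification chosen for this sketch.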

Sometimes the GUI or CLI shows a VDOM as out of sync, yet we see no differing checksums in the VDOMs. In those cases we can run this command (diagnose sys ha checksum recalculate) on both the master and the slave to recalculate the checksums of both devices. Afterwards, we verify again that they are correctly synchronized:

FG-VDOM-HA-CLUSTER1(global) # diagnose sys ha checksum recalculate 

FG-VDOM-HA-CLUSTER1(global) # execute ha manage 
<id>    please input peer box index.
<1>	Subsidary unit FG100ETK1XXXXXXX2

FG-VDOM-HA-CLUSTER1 (global) # execute ha manage 1
FG-VDOM-HA-CLUSTER2 login: admin
Password: ********
Welcome !

FG-VDOM-HA-CLUSTER2 $ config global 
FG-VDOM-HA-CLUSTER2 (global) $ diagnose sys ha checksum recalculate
FG-VDOM-HA-CLUSTER2 (global) $ quit
FG-VDOM-HA-CLUSTER1(global)$ quit

If it is still not synchronized, we can find out which part of the vdom3 configuration is out of sync.

We run this command on both firewalls; it shows the checksum of each part of the configuration of the VDOM that is not synchronized:

FG-HA-VDOM-CLUSTER1(global)# diagnose sys ha checksum show vdom3
.........
system.object-tagging: c7e9fe21af1059c208e3681699f6916f
system.settings: fd38c82f3c459fb5ae7313c9dae52c90
system.sit-tunnel: 00000000000000000000000000000000
system.arp-table: 00000000000000000000000000000000
system.ipv6-neighbor-cache: 00000000000000000000000000000000
system.vdom-sflow: 00000000000000000000000000000000
system.vdom-netflow: 00000000000000000000000000000000
system.vdom-dns: 00000000000000000000000000000000
system.replacemsg-group: 81b6b656cb88106024f53c51d1a89c82
system.session-ttl: 00000000000000000000000000000000
system.dhcp.server: 00000000000000000000000000000000
system.dhcp6.server: 00000000000000000000000000000000
extender-controller.extender: 00000000000000000000000000000000
system.zone: 92e95f0d4f656815e0bee071826692c6
firewall.address: 1bed22c991bc1a9808a5d52e3f56e337
firewall.multicast-address: a00a0b721b4ca3cda2759ed08a6522e1
firewall.address6-template: 00000000000000000000000000000000
firewall.address6: eaebf5f469fbc3f023ad602313eb55fd
firewall.multicast-address6: 07d92ae2a2d377a0e8e46fb21876f6c4
system.ipv6-tunnel: 00000000000000000000000000000000
firewall.addrgrp: 2a3e955feff36f91ff3ab733b0042a80
firewall.addrgrp6: 00000000000000000000000000000000
firewall.wildcard-fqdn.custom: 00000000000000000000000000000000
firewall.wildcard-fqdn.group: 00000000000000000000000000000000
firewall.service.category: dd54e85333b30cbebaf19d59fa2959d3
firewall.service.custom: 085178661f96960e742f03f4f1df3e41
firewall.service.group: 5bf821d15d9e8881ab2d84afb0e0ae22
firewall.internet-service-group: 00000000000000000000000000000000
...................
router.ospf: 1a117d8a9aa3eb88b6e5ff761458b3be
router.ospf6: 1a117d8a9aa3eb88b6e5ff761458b3be
router.bgp: 5a0338395538f2b5df5763df634c201a
router.isis: 3fa36255059772b422afb00399c594b6
router.multicast-flow: 00000000000000000000000000000000
router.multicast: 5873dd45edd01f09c1ef2e7819369e82
router.multicast6: 00000000000000000000000000000000
router.auth-path: 00000000000000000000000000000000
router.setting: 00000000000000000000000000000000
router.bfd: 00000000000000000000000000000000
router.bfd6: 00000000000000000000000000000000
system.proxy-arp: 00000000000000000000000000000000
system.link-monitor: 00000000000000000000000000000000
system.wccp: 00000000000000000000000000000000
system.nat64: 00000000000000000000000000000000
system.nd-proxy: 00000000000000000000000000000000

FG-HA-VDOM-CLUSTER2(global)# diagnose sys ha checksum show vdom3
.........
system.object-tagging: c7e9fe21af1059c208e3681699f6916f
system.settings: fd38c82f3c459fb5ae7313c9dae52c90
system.sit-tunnel: 00000000000000000000000000000000
system.arp-table: 00000000000000000000000000000000
system.ipv6-neighbor-cache: 00000000000000000000000000000000
system.vdom-sflow: 00000000000000000000000000000000
system.vdom-netflow: 00000000000000000000000000000000
system.vdom-dns: 00000000000000000000000000000000
system.replacemsg-group: 81b6b656cb88106024f53c51d1a89c82
system.session-ttl: 00000000000000000000000000000000
system.dhcp.server: 00000000000000000000000000000000
system.dhcp6.server: 00000000000000000000000000000000
extender-controller.extender: 00000000000000000000000000000000
system.zone: 92e95f0d4f656815e0bee071826692c6
firewall.address: 1bed22c991bc1a9808a5d52e3f56e337
firewall.multicast-address: a00a0b721b4ca3cda2759ed08a6522e1
firewall.address6-template: 00000000000000000000000000000000
firewall.address6: eaebf5f469fbc3f023ad602313eb55fd
firewall.multicast-address6: 07d92ae2a2d377a0e8e46fb21876f6c4
system.ipv6-tunnel: 00000000000000000000000000000000
firewall.addrgrp: 2a3e955feff36f91ff3ab733b0042a80
firewall.addrgrp6: 00000000000000000000000000000000
firewall.wildcard-fqdn.custom: 00000000000000000000000000000000
firewall.wildcard-fqdn.group: 00000000000000000000000000000000
firewall.service.category: dd54e85333b30cbebaf19d59fa2959d3
firewall.service.custom: 085178661f96960e742f03f4f1df3e41
firewall.service.group: e00d62f3facd7ab20996716b63ba5b0a
firewall.internet-service-group: 00000000000000000000000000000000
........................
router.ospf: 1a117d8a9aa3eb88b6e5ff761458b3be
router.ospf6: 1a117d8a9aa3eb88b6e5ff761458b3be
router.bgp: 5a0338395538f2b5df5763df634c201a
router.isis: 3fa36255059772b422afb00399c594b6
router.multicast-flow: 00000000000000000000000000000000
router.multicast: 5873dd45edd01f09c1ef2e7819369e82
router.multicast6: 00000000000000000000000000000000
router.auth-path: 00000000000000000000000000000000
router.setting: 00000000000000000000000000000000
router.bfd: 00000000000000000000000000000000
router.bfd6: 00000000000000000000000000000000
system.proxy-arp: 00000000000000000000000000000000
system.link-monitor: 00000000000000000000000000000000
system.wccp: 00000000000000000000000000000000
system.nat64: 00000000000000000000000000000000
system.nd-proxy: 00000000000000000000000000000000

If we save both results in two files and compare them, we see that the problem is in firewall.service.group:

diff cluster1.txt cluster2.txt 

265c265
< firewall.service.group: 5bf821d15d9e8881ab2d84afb0e0ae22
---
> firewall.service.group: e00d62f3facd7ab20996716b63ba5b0a
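
If `diff` is not at hand, the same comparison can be sketched in a few lines of Python. It assumes both outputs were saved as text (inlined here as short excerpts of the listings above); the function names are hypothetical.

```python
def parse_sections(text: str) -> dict[str, str]:
    """Map 'section.name' -> checksum for every 'name: hash' line."""
    sections = {}
    for line in text.splitlines():
        name, sep, digest = line.partition(": ")
        if sep and digest.strip():
            sections[name.strip()] = digest.strip()
    return sections


def differing_sections(unit_a: str, unit_b: str) -> list[str]:
    """Return the section names whose checksum differs or exists on one unit only."""
    a, b = parse_sections(unit_a), parse_sections(unit_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


unit1 = """\
.........
firewall.service.custom: 085178661f96960e742f03f4f1df3e41
firewall.service.group: 5bf821d15d9e8881ab2d84afb0e0ae22
router.bgp: 5a0338395538f2b5df5763df634c201a
"""

unit2 = """\
.........
firewall.service.custom: 085178661f96960e742f03f4f1df3e41
firewall.service.group: e00d62f3facd7ab20996716b63ba5b0a
router.bgp: 5a0338395538f2b5df5763df634c201a
"""

print(differing_sections(unit1, unit2))
```

Keying on the section name instead of the line position means the comparison still works when one unit prints an extra section.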

We can dig further into what is wrong with this section of the configuration on both firewalls:

FG-HA-VDOM-CLUSTER1(global)$ diag sys ha checksum show vdom3 firewall.service.group

Email Access: 5873dd45edd01f09c1ef2e7819369e8e
Exchange Server: 5873dd45edd01f09c1ef2e7819369e8e
Web Access: 5873dd45edd01f09c1ef2e7819369e8e
Windows AD: 5873dd45edd01f09c1ef2e7819369e8e

FG-HA-VDOM-CLUSTER2(global)$ diag sys ha checksum show vdom3 firewall.service.group

In this case, what happened was that some service groups had been created on cluster1 but not on cluster2. Once those groups are created on cluster2, the units synchronize automatically.

Why does this happen? It is usually due to failures in configuration synchronization, caused by one-off problems or by bugs in the release we are running.

Environments where different users make many simultaneous changes can sometimes trigger this problem too.

Another interesting command that can give us more information about our HA:

diagnose sys ha dump-by command (group,vcluster,rcache,debug-zone,vdom,kernel,device)

For example:

diagnose sys ha dump-by vcluster 
 
<hatalk>             HA information.
vcluster_nr=1
vcluster_0: start_time=1584902580(2020-03-22 19:43:00), state/o/chg_time=2(work)/2(work)/1584902580(2020-03-22 19:43:00)
	pingsvr_flip_timeout/expire=3600s/0s
	mondev: port1(prio=50,is_aggr=0,status=1) port2(prio=50,is_aggr=0,status=1) LAG-DMZ(prio=50,is_aggr=1,status=1) wan1(prio=50,is_aggr=0,status=1) 
	'FG100ETK1XXXXXXX1': ha_prio/o=1/1, link_failure=0, pingsvr_failure=0, flag=0x00000000, uptime/reset_cnt=252/0
	'FG100ETK1XXXXXXX2': ha_prio/o=0/0, link_failure=0, pingsvr_failure=0, flag=0x00000001, uptime/reset_cnt=0/0

We will keep digging a little deeper into FortiGate HA in upcoming posts. I hope you found it interesting.

ENJOY!!
