Wednesday, July 03, 2019

pacemaker fail count timing experiments

Occasionally we are troubled by failure alerts; since I installed the check_crm Nagios monitoring plugin the alerts seem more 'sensitive'.  I have come to understand that Pacemaker needs manual intervention when things fail.  The thing that especially fails on our site is the VMware fencing: every few weeks one or two logins to VMware fail, forcing me to log in and issue a "pcs resource cleanup" to reset the failure.

I am doing this experiment now to understand the parameters that I need to adjust on a live system.  These observations were made on a RHEL 7.6 cluster, with a dummy shell script as the resource.

dummy shell script, configured to fail:

#! /bin/bash
exit 1   # <-- this line is removed when we want the result to succeed
while :
do
    date
    sleep 5
done > /tmp/do_nothing_log
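
For reference, here is roughly how I wire such a script into the cluster for testing.  This is only a sketch: the resource name "dummy", the script path, and the choice of the ocf:heartbeat:anything agent (from the resource-agents package; it runs and monitors an arbitrary binary) are my assumptions for illustration, not part of the original setup.

# put the script on every node (assumed path)
install -m 755 do_nothing.sh /usr/local/bin/do_nothing.sh

# wrap it as a cluster resource with a monitor operation
pcs resource create dummy ocf:heartbeat:anything \
    binfile=/usr/local/bin/do_nothing.sh \
    op monitor interval=30s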


OBSERVATIONS 1:

  • CLUSTER CONDITIONS: cluster property "start-failure-is-fatal" at its default (true).
  • RESOURCE CONDITIONS: defaults
  • RESULT: node1 is tried ONCE, node2 is tried ONCE, then nothing is tried again.  When the start fails, the fail count is immediately set to INFINITY (1000000).  This is why the documentation says "the node will no longer be allowed to run the failed resource" until manual intervention happens; the commands after this list show how to inspect and clear the count.
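
To see the INFINITY fail count for yourself, and to reset it, these are the standard commands ("resname" is a placeholder for the resource name):

pcs resource failcount show resname   # per-node fail counts for the resource
crm_mon -1 -f                         # one-shot cluster status including fail counts
pcs resource cleanup resname          # the manual intervention: clear the counts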
OBSERVATIONS 2:

  • CLUSTER CONDITIONS: "start-failure-is-fatal" set to FALSE ("pcs property set start-failure-is-fatal=false; pcs resource cleanup")
  • RESOURCE CONDITIONS: defaults
  • RESULT: the resource is restarted on node1 nonstop (to infinity?).  No attempt appears to be made to start it on the other node.  The monitoring sketch after this list is one way to watch the loop.
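
A convenient way to watch this restart loop (my suggestion in hindsight, not something I logged at the time) is to leave a monitor running while the failures accumulate:

crm_mon -f                                        # foreground status display with fail counts
watch -n2 'pcs resource failcount show resname'   # in another terminal: watch node1's count climb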
OBSERVATIONS 3:
  • CLUSTER CONDITIONS: "start-failure-is-fatal" set to FALSE ("pcs property set start-failure-is-fatal=false; pcs resource cleanup")
  • RESOURCE CONDITIONS: migration-threshold=10 ("pcs resource update resname meta migration-threshold=10; pcs resource cleanup")
  • RESULT: the resource is retried 10 times on node1, then retried 10 times on node2, then retried no longer.  A verification sketch follows this list.
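
To confirm the meta attribute actually landed, RHEL 7's pcs can print the full resource definition, and the fail counts should stop climbing at the threshold ("resname" again a placeholder):

pcs resource show resname             # look for "Meta Attrs: migration-threshold=10"
pcs resource failcount show resname   # counts should stop at 10 on each node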
OBSERVATIONS 4:
  • CLUSTER CONDITIONS: start-failure-is-fatal=false, cluster-recheck-interval=180.
  • RESOURCE CONDITIONS: migration-threshold=10 and failure-timeout=2min ("pcs resource update resname meta failure-timeout=2min")
  • RESULT: the resource is retried 10 times on node1, then 10 times on node2.  Errors are cleared after 2 minutes.  After that, the resource is tried ONCE on node1 but 10 times on node2 every cluster-recheck-interval (3 minutes).  That's because the error condition is gone but the counters do not necessarily reset (though sometimes they do reset on other nodes when the resource is tried on one node).  The commands after this list show how the two timers are set.
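
For the record, the two timers in this run are set like so (a bare number for the property is taken as seconds; shorthands like 2min also work for the meta attribute):

pcs property set cluster-recheck-interval=180
pcs resource update resname meta failure-timeout=2min
pcs resource cleanup resname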
GROUP RESOURCES CONSIDERATION:
  • I am unable to apply migration-threshold and failure-timeout to a group as a whole at this moment.  They still seem to be properties of the individual resources.
  • Update each resource's meta attributes as usual, regardless of whether the resource is part of a group; behavior should be as expected (the group proceeds from one resource to the next in the list anyway).  A loop sketch follows this list.
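
Since the attributes seem to be per-resource, a plain loop over the group members is one way to apply them uniformly.  A sketch, where "res1 res2 res3" stands in for the actual members of the group:

# apply the same meta attributes to every member of the group
for res in res1 res2 res3; do
    pcs resource update $res meta migration-threshold=10 failure-timeout=2min
done
pcs resource cleanup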

JondZ 20190703

1 comment:

JondZ said...

Now I realize I don't understand my own post, months later. What are we talking about here? The pcs commands involved set timings along these lines (and here are the actual commands) --

note: "bmdbkill" is a stonith resource, "bmpgsql" is a database launcher resource.

pcs property set start-failure-is-fatal=false
pcs resource update bmdbkill meta migration-threshold=10
pcs resource update bmdbkill meta failure-timeout=5min

pcs resource update bmpgsql meta migration-threshold=5
pcs resource update bmpgsql meta failure-timeout=5min

[root@bmdb1 ~]# pcs property
:
start-failure-is-fatal: false
