I am doing this experiment to understand the parameters that I need to adjust on a live system. These observations were made on a RHEL 7.6 cluster, with a dummy shell script as the resource.
Dummy shell script, configured to fail:
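#!/bin/bash
# Dummy resource: while the "exit 1" below is in place, every start attempt fails.
exit 1 # <-- this is removed when we want the result to succeed
# Without the exit, loop forever and log a timestamp every 5 seconds.
while :
do
    date
    sleep 5
done > /tmp/do_nothing_log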
OBSERVATIONS 1:
- CLUSTER CONDITIONS: cluster property "start-failure-is-fatal" left at its default (true).
- RESOURCE CONDITIONS: defaults
- RESULT: node1 is tried ONCE, node2 is tried ONCE, then nothing is tried again. When a start fails, the fail count is immediately set to INFINITY (1000000). This is why the documentation says "the node will no longer be allowed to run the failed resource" until a manual intervention happens.
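To look at the fail count after such a failed start, and to do the manual intervention that clears it, something like the following should work. This is only a sketch: "resname" is a placeholder resource name, and the PCS variable is my own addition (not a pcs feature) so that the function can be dry-run with PCS=echo.

```shell
#!/bin/bash
# Sketch only: "resname" is a placeholder resource name.
# PCS is an illustrative override (not a pcs feature): set PCS=echo for a dry run.
PCS="${PCS:-pcs}"

show_and_clear_failcount() {
    local res="$1"
    # Show the per-node fail count (INFINITY after a fatal start failure).
    $PCS resource failcount show "$res"
    # Clear the failure history so the nodes may run the resource again.
    $PCS resource cleanup "$res"
}
```

On a cluster node this would be called as "show_and_clear_failcount resname".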
- CLUSTER CONDITIONS: "start-failure-is-fatal" set to FALSE ("pcs property set start-failure-is-fatal=false; pcs resource cleanup")
- RESOURCE CONDITIONS: defaults
- RESULT: The cluster keeps trying to restart the resource on node1 nonstop (to infinity?). It does not appear to attempt a restart on another node.
- CLUSTER CONDITIONS: "start-failure-is-fatal" set to FALSE ("pcs property set start-failure-is-fatal=false; pcs resource cleanup")
- RESOURCE CONDITIONS: migration-threshold=10 ("pcs resource update resname meta migration-threshold=10; pcs resource cleanup")
- RESULT: Resource is retried 10 times on node1, then retried 10 times on node2, then retried no longer.
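The counting in this result can be reproduced with a small toy model. This is only my own illustration of the arithmetic, not how Pacemaker decides placement internally; the function name is made up, and it assumes every start fails, as the dummy script does.

```shell
#!/bin/bash
# Toy model only: this is NOT how Pacemaker decides placement internally.
# It reproduces the counting seen above: a node is retried until its fail
# count reaches migration-threshold, then the next node is tried.
simulate_start_attempts() {
    local threshold="$1"; shift
    local node failcount total=0
    for node in "$@"; do
        failcount=0
        while [ "$failcount" -lt "$threshold" ]; do
            failcount=$((failcount + 1))   # every start fails in this experiment
            total=$((total + 1))
        done
        echo "$node: fail count $failcount reached the threshold, node is no longer tried"
    done
    echo "total start attempts: $total"
}

simulate_start_attempts 10 node1 node2
```

With a threshold of 10 and two nodes, the last line printed is "total start attempts: 20", which matches the 10 + 10 retries observed.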
- CLUSTER CONDITIONS: start-failure-is-fatal=false, cluster-recheck-interval=180.
- RESOURCE CONDITIONS: migration-threshold=10 and failure-timeout=2min ("pcs resource update resname meta failure-timeout=2min")
- RESULT: Resource is retried 10 times on node1, then 10 times on node2. Errors are cleared after 2 minutes. After that, on every cluster-recheck-interval (3 minutes), the resource is tried ONCE on node1 but 10 times on node2. That's because the error condition is gone but the counters do not necessarily reset (though sometimes they do reset on other nodes when the resource is tried on one node).
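For reference, the full set of conditions for this last observation could be applied in one go roughly as below. Again a sketch: "resname" is a placeholder, the function name is made up, and PCS=echo is my own dry-run convention rather than anything pcs provides.

```shell
#!/bin/bash
# Sketch only: "resname" is a placeholder resource name.
# PCS is an illustrative override (not a pcs feature): set PCS=echo for a dry run.
PCS="${PCS:-pcs}"

apply_retry_settings() {
    local res="$1"
    # Cluster-wide properties.
    $PCS property set start-failure-is-fatal=false
    $PCS property set cluster-recheck-interval=180
    # Per-resource meta attributes.
    $PCS resource update "$res" meta migration-threshold=10 failure-timeout=2min
    # Start the experiment from a clean failure history.
    $PCS resource cleanup "$res"
}
```

Running "PCS=echo apply_retry_settings resname" first prints the four commands without touching the cluster.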
- I am unable to apply the resource's migration-threshold and failure-timeout at this moment. They still seem to be properties of the individual resources.
- Update the meta attributes of resname as usual, regardless of whether it is part of a group or not; behavior should be as expected (a group proceeds from one resource to the next in the list anyway).
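Since the settings stay on the individual resources, one way to cover several resources (for example all members of a group) is simply to loop over them. A sketch under the same assumptions as before: the function name is made up, the resource names are whatever the caller passes in, and PCS=echo is my own dry-run convention.

```shell
#!/bin/bash
# Sketch only: resource names come from the caller; none are assumed here.
# PCS is an illustrative override (not a pcs feature): set PCS=echo for a dry run.
PCS="${PCS:-pcs}"

set_meta_on_each() {
    local res
    for res in "$@"; do
        # Same meta attributes as in the observations, applied per resource.
        $PCS resource update "$res" meta migration-threshold=10 failure-timeout=2min
    done
}
```

It would be called with the resource names as arguments, e.g. "set_meta_on_each res1 res2" (hypothetical names).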
JondZ 20190703