JondZ: 2019

Monday, September 16, 2019

Old DDS/4mm Tape drive speed issues

OK, so I bought a tape drive from eBay. It's a Dell PowerVault 100T DDS4 tape drive (internally a Seagate STD1400LW). I really like the 4mm tape format because the tapes are so small. Just perfect for freezing my usual data output (some notes, documents).

At first I thought I had bought a lemon, since the drive was super slow. As usual I played around with block sizes. Normally I leave the block size at 512, which is what I usually find tape drives set to by default. I would usually do this:

mt -f /dev/st1 setblk 512 # if necessary
tar cvfp /dev/st1 ....

Or sometimes, if I want the hardware block size to match the default tar blocking factor:

mt -f /dev/st1 setblk 10240 # that's 20 tar blocks of 512 each
tar cvfp /dev/st1 ...

But the tape drive was SUPER SLOW no matter what I tried... until I tried 4096 (page size?) on a whim. And this drive just flew! So this is how to make this drive spin fast:

mt -f /dev/st1 setblk 4096 # page size
tar cvfpb /dev/st1 8 ...
tar tvfb /dev/st1 8 .....

What a surprise that was, and I have been using tapes for years.
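The relation between the hardware block size and the tar blocking factor used above (4096 bytes is 8 tar records of 512 bytes each) can be sketched as a small helper. The function name tar_blocking_factor is made up for illustration; it is not part of mt or tar.

```shell
# Hypothetical helper: convert a tape block size in bytes into a tar
# blocking factor. tar counts in 512-byte records, so the factor is
# block_size / 512.
tar_blocking_factor() {
    tbf_size=$1
    # Reject sizes that are not a positive multiple of 512.
    if [ "$tbf_size" -le 0 ] || [ $((tbf_size % 512)) -ne 0 ]; then
        echo "block size must be a positive multiple of 512" >&2
        return 1
    fi
    echo $((tbf_size / 512))
}

# Example (needs a real tape device, so it is left commented out):
#   mt -f /dev/st1 setblk 4096
#   tar cvfpb /dev/st1 "$(tar_blocking_factor 4096)" ...
```

Calling tar_blocking_factor 4096 prints 8, and tar_blocking_factor 10240 prints 20, matching the two setups above.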

--------------UPDATE-------------

20190918

The slow speed is for READING BACK DATA, not writing it. For some reason the tape drive stalls when delivering data for transfers larger than 4k. Therefore, to read back data, specify transfer buffers of 4k or less:

mbuffer_style# mbuffer -s 4096 -i /dev/st2 | tar tvf -
normal_tar# tar xvfpb /dev/st2 8
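If mbuffer is not installed, the same idea (never ask the drive for more than 4k per read) can probably be had with plain dd. This is only a sketch: dd is swapped in here for mbuffer, the function name list_archive_4k is made up, and I have only reasoned about it against a regular file, not this particular drive.

```shell
# Hypothetical helper: read a tape device (or any file) in 4096-byte
# reads and list the tar archive found there.
list_archive_4k() {
    la_device=$1
    # bs=4096 keeps every read request at 4 KiB; dd's statistics go to
    # stderr and are discarded so that only the tar listing is printed.
    dd if="$la_device" bs=4096 2>/dev/null | tar tvf -
}

# Example (needs a real tape device, so it is left commented out):
#   list_archive_4k /dev/st2
```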

JondZ 20190917

Wednesday, July 03, 2019

pacemaker fail count timing experiments

Occasionally we are troubled by failure alerts; since I installed the check_crm nagios monitor plugin, the alerts seem more 'sensitive'. I have come to understand that pacemaker needs manual intervention when things fail. The thing that fails most often at our site is the vmware fencing: every few weeks one or two logins to vmware fail, forcing me to log in and issue a "pcs resource cleanup" to reset the failure.

I am doing this experiment now to understand the parameters that I need to adjust on a live system. These observations were made on a RH7.6 cluster, with a dummy shell script as the resource.

dummy shell script, configured to fail:

#! /bin/bash
exit 1 # <-- this is removed when we want the result to succeed
while :
do
date
sleep 5
done > /tmp/do_nothing_log

OBSERVATIONS 1:

CLUSTER CONDITIONS: cluster property "start-failure-is-fatal" left at its default (true).
RESOURCE CONDITIONS: defaults
RESULT: node1 is tried ONCE, node2 is tried ONCE, then nothing is tried again. When a resource fails to start, the fail count is immediately set to INFINITY (1000000). This is why the documentation says "the node will no longer be allowed to run the failed resource" until a manual intervention happens.
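To watch the fail count during these experiments and to reset it afterwards, something like the following can be used. This is a sketch: the wrapper names and the PCS variable are my own invention, added so that the commands can be previewed with PCS=echo on a machine without a cluster; the underlying calls are the ordinary "pcs resource failcount show" and "pcs resource cleanup".

```shell
# PCS can be overridden (for example PCS=echo) to preview the commands
# without touching a cluster. Defaults to the real pcs binary.
PCS=${PCS:-pcs}

# Hypothetical wrapper: show the current fail count of one resource.
show_failcount() {
    $PCS resource failcount show "$1"
}

# Hypothetical wrapper: clear recorded failures so the resource may be
# tried again on nodes that were banned by an INFINITY fail count.
clear_failures() {
    $PCS resource cleanup "$1"
}

# Example on a live cluster (resname is a placeholder):
#   show_failcount resname
#   clear_failures resname
```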

OBSERVATIONS 2:

CLUSTER CONDITIONS: "start-failure-is-fatal" set to FALSE ("pcs property set start-failure-is-fatal=false; pcs resource cleanup")
RESOURCE CONDITIONS: defaults
RESULT: the resource is restarted on node1 nonstop (to infinity?). No restart appears to be attempted on another node.

OBSERVATIONS 3:

CLUSTER CONDITIONS: "start-failure-is-fatal" set to FALSE ("pcs property set start-failure-is-fatal=false; pcs resource cleanup")
RESOURCE CONDITIONS: migration-threshold=10 ("pcs resource update resname meta migration-threshold=10; pcs resource cleanup")
RESULT: Resource is retried 10 times on node1, then retried 10 times on node2, then retried no longer.

OBSERVATIONS 4:

CLUSTER CONDITIONS: start-failure-is-fatal=false, cluster-recheck-interval=180.
RESOURCE CONDITIONS: migration-threshold=10 and failure-timeout=2min ("pcs resource update resname meta failure-timeout=2min")
RESULT: Resource is retried 10 times on node1, then 10 times on node2. Errors are cleared after 2 minutes. After that, the resource is tried ONCE on node1 but 10 times on node2 at every cluster-recheck-interval (3 minutes). That's because the error condition is gone but the counters do not necessarily reset (though sometimes they do reset on other nodes when the resource is tried on one node).
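The interplay between failure-timeout (2 minutes) and cluster-recheck-interval (3 minutes) can be pictured with a little arithmetic. This is a deliberately simplified model and an assumption on my part, not pacemaker's real scheduling: it pretends rechecks fire at fixed multiples of the interval counted from time zero, and that an expired failure is only acted on at a recheck. The function name is made up.

```shell
# Simplified model (an assumption, not pacemaker's actual scheduler):
# rechecks happen at fixed multiples of the recheck interval from t=0.
# Prints the time in seconds of the first recheck at which a failure
# recorded at fail_time has outlived failure-timeout.
first_recheck_after_expiry() {
    fr_fail_time=$1   # when the failure was recorded, in seconds
    fr_timeout=$2     # failure-timeout, in seconds
    fr_interval=$3    # cluster-recheck-interval, in seconds
    fr_expiry=$((fr_fail_time + fr_timeout))
    # Round the expiry time up to the next multiple of the interval.
    echo $(( (fr_expiry + fr_interval - 1) / fr_interval * fr_interval ))
}
```

Under this model, with the values from this observation, a failure at t=0 (first_recheck_after_expiry 0 120 180) is retried at 180 seconds, while a failure at t=100 is retried at 360 seconds, so the 2 minute timeout is effectively rounded up to the 3 minute recheck grid.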

GROUP RESOURCES CONSIDERATION:

I am unable to apply migration-threshold and failure-timeout to the group itself at this moment. They still seem to be properties of the individual resources.
Update the meta attributes of each resource as usual, regardless of whether it is part of a group; behavior should be as expected (the group proceeds from one resource to the next in the list anyway).
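Since the settings have to go on each member of the group, a loop saves some typing. The sketch below only PRINTS the pcs commands so they can be reviewed first; the function name and the resource names in the example are hypothetical, and the pcs syntax is the same "pcs resource update resname meta ..." form used in the observations above.

```shell
# Hypothetical helper: print one "pcs resource update" command per
# resource, applying the same migration-threshold and failure-timeout.
# Usage: print_meta_updates THRESHOLD TIMEOUT RESOURCE...
print_meta_updates() {
    pm_threshold=$1
    pm_timeout=$2
    shift 2
    for pm_res in "$@"; do
        # echo makes this a dry run; pipe the output to sh to apply it.
        echo pcs resource update "$pm_res" meta \
            "migration-threshold=$pm_threshold" "failure-timeout=$pm_timeout"
    done
}

# Example (res_a and res_b are placeholder resource names):
#   print_meta_updates 10 2min res_a res_b
```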

JondZ 20190703

Friday, March 15, 2019

pacemaker unfencing errors

While testing pacemaker clustering with iscsi (on Red Hat 8 beta) I came upon this error:

Pending Fencing Actions:
* unfencing of rh-8beta-b pending: client=pacemaker-controld.2603, origin=rh-8beta-b

It took me almost the whole morning to understand how to clear the error. The stonith resource includes the clause "meta provides=unfencing", which means the fencing agent is expected to handle unfencing. In practice, clearing the error simply means rebooting the node (rh-8beta-b in this case).

Red Hat documentation explains this as well: "...The act of booting in this case implies that unfencing occurred..."