Saturday, December 16, 2017

tape drive over iscsi problems

One of the problems with running a tape drive as an iscsi target is that there was no software I found that worked.  I tried IETD, TGT, and of course TARGETCLI.  After thinking and googling about this problem for about a day or so I decided to see if I could patch the python code on which targetcli runs.  I am surprised I can still code!!! This took me perhaps half an hour to figure out; it has been a long time since I wrote anything.

This file is ... rtslib/utils.py

-----------PATCH 1 of 2 --------------------------------------
1. In function convert_scsi_path_to_hctl

OLD:
    try:
        hctl = os.listdir("/sys/block/%s/device/scsi_device"
                          % devname)[0].split(':')
    except:
        return None
    return [int(data) for data in hctl]

NEW:
    try:
        hctl = os.listdir("/sys/block/%s/device/scsi_device"
                          % devname)[0].split(':')
        return [int(data) for data in hctl]
    except OSError: pass

    try:
        hctl = os.listdir("/sys/class/scsi_tape/%s/device/scsi_device"
                          % devname)[0].split(':')
        return [int(data) for data in hctl]
    except OSError: pass

    return None

-----------PATCH 2 of 2 --------------------------------------
In function convert_scsi_hctl_to_path
OLD:
    for devname in os.listdir("/sys/block"):
        path = "/dev/%s" % devname
        hctl = [host, controller, target, lun]
        if convert_scsi_path_to_hctl(path) == hctl:
            return os.path.realpath(path)
NEW:
    for devname in os.listdir("/sys/block"):
        path = "/dev/%s" % devname
        hctl = [host, controller, target, lun]
        if convert_scsi_path_to_hctl(path) == hctl:
            return os.path.realpath(path)
    try:
        for devname in os.listdir("/sys/class/scsi_tape"):
            path = "/dev/%s" % devname
            hctl = [host, controller, target, lun]
            if convert_scsi_path_to_hctl(path) == hctl:
                return os.path.realpath(path)
    except OSError: pass
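
After patching, a quick sanity check is to read the same sysfs paths by hand and confirm an H:C:T:L entry shows up for the tape drive.  A sketch, assuming the drive appears as st0:

    ls /sys/class/scsi_tape/st0/device/scsi_device/
    # should print a single entry of the form host:channel:target:lun, which is
    # exactly what the patched convert_scsi_path_to_hctl() parses for /dev/st0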

Friday, December 01, 2017

pgsql archive command notes

This is a critique of the documented "archive_command" usage in pgsql.  There is an example which says:

  archive_command = 'test ! -f /destination/%f && cp %p /destination/%f'

I would not use this.  The problem is a full disk: in my tests, cp produced short (partial) files when the destination ran out of space.  Even though cp exits with a nonzero code, the next archive attempt for the same file is then seen as a success ("test ! -f __" is false since the file is already there), so the partial file never gets replaced.
What I would do is use rsync.  In its default setting it does not leave short files under the final name:

  archive_command = 'test ! -f /destination/%f && rsync %p /destination/%f'
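
If you want to see the difference yourself, here is a rough way to reproduce the failure mode with a tiny tmpfs standing in for a full archive disk (the mount point and file names below are made up for this test):

    mkdir -p /mnt/full && mount -t tmpfs -o size=1M tmpfs /mnt/full
    dd if=/dev/zero of=/tmp/fakewal bs=1M count=16       # stand-in for a 16MB WAL segment
    cp /tmp/fakewal /mnt/full/fakewal; echo "cp exit: $?"
    ls -l /mnt/full      # cp exits nonzero but leaves a short file behind
    rm -f /mnt/full/fakewal
    rsync /tmp/fakewal /mnt/full/fakewal; echo "rsync exit: $?"
    ls -l /mnt/full      # rsync also fails, but should not leave the final file name behind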

ep

Sunday, October 15, 2017

drbd and lvm: so many combinations

On vacation from major dental surgery, I am currently learning and testing these four DRBD/LVM combinations and thinking about which one I would use on a real production setup.

1. DRBD over plain device
2. DRBD over LVM
3. LVM over DRBD
4. LVM over DRBD over LVM

1. DRBD over plain device.  This puts actual device names such as sdb1 in the drbd configuration.  I don't like that.  There are ways around this, such as using multipath or /dev/disk/by-id.  I haven't tested those yet with drbd, but the point is that the actual device names are in the configuration files and they had better agree with the real devices (after years of uptime and changeover of sysadmins :).

2. DRBD over LVM.  This puts an abstraction layer at the lowest layer and it avoids having to place actual device names in drbd resource files.   For example:

/etc/drbd.d/some-resource.res

resource __ {
  ...
  ...
  device /dev/drbd0;
  disk /dev/vg/lvdisk0;
  ...
  }

There you go, no /dev/sdb1 or whatever in the disk configuration.  This avoids problems arising from devices switching device names on reboot.

3. LVM over DRBD

As the name implies, this puts the flexibility of provisioning in the LVM layer on top of DRBD, closer to the application.  It makes typical provisioning tasks such as disk allocation, destruction, extending and shrinking much easier.  However, I still do not like writing device names in the config files...

4. LVM over DRBD over LVM.

LVM over DRBD over LVM is probably the most flexible solution.  There are no actual device names in the DRBD configuration, and LVM is very resilient across machine restarts thanks to its auto-detection of metadata in whatever order the physical disks come up.  With this combination I can rearrange the physical backing storage and at the same time have the flexibility of LVM on the upper layer.  The only issue is having to ADJUST STUFF IN /etc/lvm/lvm.conf.

in /etc/lvm/lvm.conf

    # filter example -- 
    # /dev/vd* on the physical layer, 
    # /dev/drbd* on the drbd layer
    filter = [ "a|^/dev/vd.*|", "a|drbd.*|", "r|.*/|" ]
    write_cache_state = 0
    use_lvmetad = 0

Just a few lines of config.  This is fine.   The problem is having to remember what all this configuration means after 2 years of uptime...
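
For reference, here is a minimal sketch of how the layers in combination 4 might be stacked.  All the names here (vdb1, vg0, vgdrbd, r0) are hypothetical:

    pvcreate /dev/vdb1
    vgcreate vg0 /dev/vdb1
    lvcreate -L 10G -n drbdbacking0 vg0     # lower LVM layer: backing LV for DRBD
    # in the resource file:  disk /dev/vg0/drbdbacking0;  device /dev/drbd0;
    drbdadm create-md r0
    drbdadm up r0
    pvcreate /dev/drbd0                     # upper LVM layer lives on the replicated device
    vgcreate vgdrbd /dev/drbd0
    lvcreate -L 5G -n data0 vgdrbd

Note that the lvm.conf filter above has to admit both the lower-layer devices (/dev/vd*) and the drbd devices, otherwise one of the two LVM layers will not be scanned.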

---

JondZ 20171015



Wednesday, October 11, 2017

reducing lvm drbd disk size

Here is a snippet of my notes for reducing drbd disk size (assuming that the physical device is on LVM which can be resized). 

Just remember that a drbd device is a container, and HAS METADATA.  Therefore think about it as a filesystem.  Also, this procedure will only work if the disks are ONLINE (the disks are attached, and drbd is running).

In this example, a filesystem has only 100 megs worth of data; we want to shrink the physical store down from 500 to about 120 megs.

WARNING: This procedure can be destructive if done wrong.

1. Note the filesystem's consumed size.  For this example the filesystem contains 100M worth of data.  Shrink the filesystem.  Note that -M resizes to the minimum size.

   umount /dev/drbd0
   fsck -f /dev/drbd0
   resize2fs -M /dev/drbd0

At this point the filesystem on /dev/drbd0 should be at the minimum (i.e., close to the consumed size---about 100 MB in this example).  If you are not sure, mount the filesystem again and use "df", or use tune2fs (if ext4) to MAKE SURE.
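
For example, a quick way to double-check with tune2fs (ext4 assumed, as in this example):

    tune2fs -l /dev/drbd0 | egrep -i 'block count|block size'
    # consumed size is roughly "Block count" times "Block size" bytes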

2. Resize the drbd device.  Make sure it is larger than the filesystem size because drbd also uses disk space for metadata!

   drbdadm -- --size=110M resize r0

If you type "lsblk" at this point, drbd0 should show about 110M.

3. Shrink the physical backing device to a bit higher than the drbd device:

   on first node (drb7): lvresize -L 120M /dev/drt7/disk1
   on second node (drb8): lvresize -L 120M /dev/drt8/disk1

4. Size up the drbd device to use up all available LV space:

   drbdadm resize r0

5. Finally size up the filesystem:

   resize2fs /dev/drbd0

6. Mount and verify that the filesystem is indeed about 120 megs.

Friday, October 06, 2017

Stress testing drbd online verification

DRBD is so nice.  It is really very nice to have this skill available--I would use it personally, at home or in office production.  It is a very practical talent in the real world to be able to string together 2 computers with a network cable and replicate disks from one to the other automatically.

I just stress tested the "online verification" procedure.  Basically I wanted to see how I would formulate a recovery procedure for a corrupted disk.  In summary this is what I did---

1. Configure a checksum method for online verification (see the snippet after this list).
2. Perform online verification to compare disks.  This is as simple as typing out "drbdadm verify <resource>" and watching the logs in /var/log/messages.
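
Point 1 boils down to one line in the resource's net section.  A sketch, with md5 chosen here (other algorithms such as sha1 or crc32c should also work):

    resource r0 {
      net {
        verify-alg md5;
      }
      ...
    }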

STRESS TEST.  To make sure I would recover from a failed disk I tested out this scenario:

3.  On node1, stop drbd (systemctl stop drbd)
4. Force a disk corruption, for example dd if=/dev/zero of=/dev/vdb1
5. start drbd (systemctl start drbd)

RECOVERY PROCEDURE: Here is what I came up with as a procedure.

6. drbdadm verify r0 # r0 is the resource name

At this point I would notice the disk corruption in /var/log/messages.

7.  On the "bad" node:

drbdadm secondary r0
drbdadm invalidate r0

That is the summary of the procedure (perhaps with some minor detail I forgot).  After the "invalidate" instruction the disk should sync again.  Just make sure that the correct disk on the correct node is identified and invalidated.

-------
JondZ

Thursday, October 05, 2017

drbd diskless mode

I am still scratching my head over this one--that it is actually possible.  Sure, I have run diskless setups like iSCSI with special hardware cards before, but drbd?

I detached the disk and then made the resource Primary.  So basically the node without a disk is talking to a node with a disk and pretending that the disk is local:

[root@drb6 tmp]# drbdadm status
r0 role:Primary
  disk:Diskless
  drb5 role:Secondary
    peer-disk:UpToDate

[root@drb6 tmp]# df
Filesystem          1K-blocks    Used Available Use% Mounted on
/dev/drbd0            1014612   33860    964368   4% /mnt/tmp



On this node there is no local backing store behind /dev/drbd0, yet drbd0 is for all practical purposes a normal block device.
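
For the record, the sequence to get into this state is roughly the following (resource name r0, mount point as above):

    drbdadm detach r0       # drop the local backing disk; the resource goes Diskless
    drbdadm primary r0      # promote anyway; I/O is now served by the peer over the network
    mount /dev/drbd0 /mnt/tmp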

That is amazing...


jondz

Wednesday, October 04, 2017

My first DRBD cluster test

Here is my first cluster.  It took me the WHOLE MORNING to figure out (I misunderstood the meaning of the "clone-node-max" property).  Anyhow this is a 4-node active/passive drbd storage cluster.

In this example, only the Primary (pcs "Master") can use the block device at any one time.  The nodes work correctly in that nodes are promoted/demoted as expected when they leave/enter the cluster.

I will have to re-do this entire thing from scratch to make sure I can do it again and keep notes (so many things to remember!).  I will also enable some service here to use the block device: maybe an nfs or LIO iSCSI server or something.
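
As a sketch of that next step, a filesystem could be tied to the DRBD master with constraints like these (the resource name block0fs, the mount point, and the fstype are assumptions here, not something I have run yet):

    pcs resource create block0fs ocf:heartbeat:Filesystem \
        device=/dev/drbd0 directory=/mnt/tmp fstype=ext4
    pcs constraint colocation add block0fs with master block0drbms INFINITY
    pcs constraint order promote block0drbms then start block0fs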

Here are my raw notes and a sample "pcs status" output:

------------RAW NOTES-- SORT IT OUT LATER ----------

pcs resource create block0drb ocf:linbit:drbd drbd_resource=r0
pcs resource master block0drbms block0drb master-max=1 master-node-max=1 clone-max=4
# pcs resource update block0drbms clone-node-max=3 THIS IS WRONG--SHOULD BE 1 BECAUSE ONLY 1 CLONE SHOULD RUN ON EACH NODE (see below later)

pcs resource update block0drbms meta target-role='Started'
pcs resource update block0drbms notify=true

[root@drb3 cores]# systemctl disable drbd
Removed symlink /etc/systemd/system/multi-user.target.wants/drbd.service.

pcs resource update block0drb meta target-role="Started"
pcs resource update block0drb drbdconf="/etc/drbd.conf"

pcs property set stonith-enabled=false
pcs resource update block0drbms clone-node-max=1

pcs resource enable block0drbms



Also (info from the web), fix wrong permissions if needed:

--- chmod 777 some file in /var if needed ---- 

chmod 777 /var/lib/pacemaker/cores

---------- EXAMPLE PCS STATUS OUTPUT --------

[root@drb3 ~]# pcs status
Cluster name: drbdemo
Stack: corosync
Current DC: drb2 (version 1.1.16-12.el7_4.2-94ff4df) - partition with quorum
Last updated: Wed Oct  4 11:38:16 2017
Last change: Wed Oct  4 11:30:36 2017 by root via cibadmin on drb4

4 nodes configured
4 resources configured

Online: [ drb1 drb2 drb3 drb4 ]

Full list of resources:

 Master/Slave Set: block0drbms [block0drb]
     Masters: [ drb2 ]
     Slaves: [ drb1 drb3 drb4 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@drb3 ~]#

Wednesday, May 03, 2017

Random Ansible stuff -- commenting out variables

I am currently learning Ansible.  This is because I realized I had to have a way to simultaneously configure many servers: I was up to 6 (six) virtual CentOS servers to learn glusterfs, and manually configuring each one was getting troublesome.  Anyhow, I think I am a week into this already.

Today's lesson is: how to replace a variable while leaving a comment on top.  This is a personal favorite style of mine, specifically in the form:

   # Previous value was VAR=value changed 20170504
   VAR=newvalue

I use this style A LOT, in fact on all of my configuration changes whenever I can.

Here are two examples from my personal tests.

One way is to simply use newlines:

         This task:
         ====================================================
         - name: positional backrefs embedded newline hacks
           lineinfile:
              dest: /tmp/testconfig.cfg
              regexp: '^(TESTCONFIGVAR9)=(.*)'
              line: '# \1 modified {{mod_timestamp_long}}
                    \n# \1 = (old value was) \2
                    \n\1=newvalue'
              backrefs: yes
              state: present
         ====================================================
         Produces this output:
         ====================================================
         # TESTCONFIGVAR9 modified 20170504T001157
         # TESTCONFIGVAR9 = (old value was) test
         TESTCONFIGVAR9=newvalue
         ====================================================


Another way is to split up the task:

         These tasks:
         ====================================================
         - name: another attempt at custom mod notes, step 1
           lineinfile:
              dest: /tmp/testconfig.cfg
              regexp: '^(TESTCONFIGVAR6=.*)'
              line: '# OLD VALUE: \1 {{ mod_timestamp_long }}'
              backrefs: yes
         - name: another attempt at custom mod notes, step 2
           lineinfile:
              dest: /tmp/testconfig.cfg
              insertafter: '# OLD VALUE: '
              line: 'TESTCONFIGVAR6=blahblahblah'
         ====================================================
         Result in this output:
         ====================================================
         # OLD VALUE: TESTCONFIGVAR6=test 20170504T001157
         TESTCONFIGVAR6=blahblahblah
         ====================================================


If anybody is reading this, I am open to suggestions (since I am still learning this at the moment).

JondZ Thu May  4 00:16:35 EDT 2017

Wednesday, April 05, 2017

todays random thoughts

As I write this my home server is down; I was learning glusterfs when I accidentally rebooted the Xen server which was holding all my virtual machines.  It has been a few minutes now, which is unusual, so the server may have crashed.

Anyhow--

Today's lesson is: FIX HOSTNAMES FIRST before setting up glusterfs.  Glusterfs needs a good working hostname resolution in order to work.  Gluster is miserable with a broken DNS.

It also does not help that somewhere along the line something, or somebody (*cough* ISP *cough*), modifies the DNS queries and returns some far-off IP address on failed resolutions.

Specifically make sure these work and actually point to your servers:

       node-testing-1.yoursubdomain.domain.net
       node-testing-2.yoursubdomain.domain.net
       node-testing-3.yoursubdomain.domain.net
       node-client-test.yoursubdomain.domain.net

ALSO make sure these work and actually point to your servers (this is the part where something in the DNS query path might return some random IP address, making the gluster server contact some unknown far-off host); a quick check of both forms is sketched after this list:

      node-testing-1
      node-testing-2
      node-testing-3
      node-client-test
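
A quick way to check both forms is getent, which exercises the same resolver path the daemons use (host names as in the examples above):

      for h in node-testing-1 node-testing-2 node-testing-3 node-client-test; do
          getent hosts $h
          getent hosts $h.yoursubdomain.domain.net
      done
      # every line should print one of YOUR addresses; anything far-off means broken DNS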

The way I did this was put "yoursubdomain.domain.net" in the "search" parameter of /etc/resolv.conf.  Others will probably just put the entries in /etc/hosts.  Whatever works. 

By the way, configuring the search parameter in resolv.conf differs between debian- and redhat-derived distributions.  For debian-derived it is best to install "resolvconf" and put a keyword in /etc/network/interfaces; for redhat-derived it is easier to just use "nmtui" or put a keyword in /etc/sysconfig/network-scripts/whatever/ifcfg-whatever
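
For example (a sketch only; the interface name eth0 and the exact keywords depend on the distribution and version):

    # debian-derived, /etc/network/interfaces (with the resolvconf package installed):
    iface eth0 inet dhcp
        dns-search yoursubdomain.domain.net

    # redhat-derived, /etc/sysconfig/network-scripts/ifcfg-eth0:
    DOMAIN="yoursubdomain.domain.net"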

My server is back online...thank you for reading this.

JondZ 20170505

Friday, March 31, 2017

Experiment on learning active-active httpd

Not a bad way to spend a Friday afternoon.  Here are my raw notes.  I make up these tech notes for myself and this is not a bad addition:

Fri Mar 31 14:56:32 EDT 2017 LESSON: SIMPLE ACTIVE-ACTIVE HTTPD CLUSTER

This is based on the RedHat manual "Linux 7 High Availability Add On
Administration" except that this follows an active-active setup and assumes
there is a cluster filesystem available.

PACKAGES NEEDED:

wget - needed by pacemaker (for status checks; supposedly "curl" is also
       supported and must be specified by the ocf client= option)
lynx - OPTIONAL; to test status yourself.

ASSUMED CONDITIONS:

- httpd is already installed; furthermore it is enabled and running as stock
  via systemd
- there is a clustered gfs filesystem on /volumes/data1 (for common
  html content)

PART 1: HTTPD

Set up the document root as desired.  In this example, the html documents
are rooted at /volumes/data1/www, which is common to all nodes.  In the config
file /etc/httpd/conf/httpd.conf:

        DocumentRoot "/volumes/data1/www"
        <Directory "/volumes/data1/www">
            AllowOverride None
            Require all granted
        </Directory>

Put some data on the directory; in this simple example there would be a file
named /volumes/data1/www/index.html

        <html>
        <body>
        <h1>hello</h1>
        This is a test website from JondZ
        </body>
        </html>

At the end of the config, put the following; this is used by pacemaker to check
status.

        <Location /server-status>
        SetHandler server-status
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
        </Location>

Use lynx to check that it works:

        lynx http://127.0.0.1/server-status


When satisfied that things are working, disable httpd activation by systemd;
the service will be managed by pacemaker instead.

        systemctl disable httpd
        systemctl stop httpd

PART 2: LOGROTATE

Edit the file /etc/logrotate.d/httpd and modify the "postrotate" section:

        # This is the old stuff.  Comment this out. 
        # Since httpd is going to be managed
        # by pacemaker, not by systemd, this is no longer valid:
        #
        # /bin/systemctl reload httpd.service > /dev/null 2>/dev/null || true
        #
        # This is the correct line that RedHat recommends.  Note that
        # the PID file is produced by pacemaker (or httpd itself?) and is
        # probably true only as long as httpd is not managed by systemd.
        #
        /usr/sbin/httpd -f /etc/httpd/conf/httpd.conf -c \
        "PidFile /var/run/httpd.pid" -k graceful > /dev/null 2>/dev/null \
        || true
        #
        # This is how I personally respawn apache on old production systems
        # but this NO LONGER RELIABLY WORKS (need testing).
        #
    # /sbin/apachectl graceful > /dev/null 2>/dev/null && true
   

Test logrotate.  First of all make sure that /var/run/httpd.pid is current.
Then force rotations with "logrotate -f /etc/logrotate.conf".  Also watch the
pid changes on the httpd processes (on a separate terminal you could run
watch -n1 "ps -efww | grep httpd" and watch the pids being replaced).

PART 3: PCS RESOURCE ENTRY

I added the pcs resource as follows:

       pcs resource create batwww apache \
       configfile="/etc/httpd/conf/httpd.conf" \
       statusurl="http://127.0.0.1/server-status" clone

The option "clone" makes httpd run on all nodes (instead of just one
instance).
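
Since the html content lives on the shared gfs2 mount, it probably also makes sense to start the web server after the filesystem.  A sketch, where "datafs-clone" is a hypothetical name for the clone that manages the /volumes/data1 mount:

       pcs constraint order start datafs-clone then start batwww-clone
       pcs constraint colocation add batwww-clone with datafs-clone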


JondZ 201703

Thursday, March 30, 2017

Old Fashioned

I value my data: I possess an organizer that has no internet connection, and I use a tape drive for backup.

Unlike modern gadgets, I do not have to recharge my organizer every day.  It also does not require a backlight, so it is easier on my eyes.  I also do not trust the "cloud".  I had an android-based cell phone password organizer a while ago: not any more.

Tape is very cheap and I do not have to worry about replacing spinning disks every 2-5 years.  Tape is still the least expensive option and is extremely easy to use.  I can just type this in the morning:

screen
tar cvfpb /dev/st1 128 files...
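
To check that a backup is actually readable, a quick sketch (same drive and blocking factor as above):

mt -f /dev/st1 rewind      # rewind before reading back
tar tvfb /dev/st1 128      # list the archive contents from the tape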

Contrary to popular myth, tape drives are actually fast.  I have a slow server (an athlon 5350 motherboard) and a slow disk (actually QLA iSCSI over a NetGear NAS device), and I measure tape speed at 20 to 30 megabytes per second.  It sounds as if my server cannot deliver the bytes fast enough, resulting in a motor pause every few seconds.  That implies the LTO-3 tape drive is capable of more throughput.  In my case, I might also use an LTO-1 drive for smaller jobs just to keep the motor humming nicely.



Tape drive (LTO-3) bought from ebay.
A Palm Organizer

Wednesday, March 22, 2017

learning experiments on gfs2 clustering: no-quorum-policy,interleave, ordered

It has been probably a week of gfs2 (global filesystem 2) crash course in my personal study of clustered filesystems.  Here is an in-depth experiment result on 3 detail points that are mentioned in the RedHat manual:

Point 1: set no-quorum-policy to freeze
Point 2: when creating dlm and clvmd clones, set interleave=true
Point 3: when creating dlm and clvmd clones, set ordered=true

Experiments and explanations:

Point 1: What does "no-quorum-policy=freeze" do?

To differentiate "freeze" from something else, a gfs2 cluster filesystem is tested with the following two options:

pcs property set no-quorum-policy=stop
pcs property set no-quorum-policy=freeze

With "stop", the resources are stopped, resulting in the gfs2 filesystems being unmounted (because the filesystems are just services).

With "freeze", I/O is blocked, until the problem is corrected.  Specifically, commands like this are frozen:

   ls > /path/to/gfs2/filesystem/sample-output.txt

When the problem gets fixed and the cluster becomes quorate again, the command resumes normally.

Point 2: interleave=true

This is the parameter that caused me much grief for a day or so.  When I had my first successful gfs2 clustered filesystem configured, I was disappointed that the filesystems were being unmounted whenever nodes re-entered the cluster.  I found the answer by searching the web: when interleave=false, ALL instances of the dlm and/or clvmd clones need to restart before ANY gfs2 mount.

So basically if a resource2 clone is dependent on a resource1 clone, and interleave=false, then ALL instances of resource1 have to be present before ANY instance of resource2.  This results in the gfs2 filesystems being unmounted and re-mounted (in our example, resource2 is the gfs2 mounts and resource1 is dlm/clvmd).

Thank you to the person who posted it; I found it via google.
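
For reference, this is roughly the recipe from the RedHat manual for creating the dlm/clvmd clones, which is what these experiments toggle interleave and ordered on (a sketch; op options can be adjusted):

    pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence \
        clone interleave=true ordered=true
    pcs resource create clvmd ocf:heartbeat:clvm op monitor interval=30s on-fail=fence \
        clone interleave=true ordered=true
    pcs constraint order start dlm-clone then clvmd-clone
    pcs constraint colocation add clvmd-clone with dlm-clone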

Point 3: ordered=true

I have no observable difference to report; it seems to make no difference either way.  I have tested both true and false, and the dlm/clvmd processes seem to start the same way on all nodes.

JondZ 201603

Monday, March 13, 2017

DISCARD EFFECT ON THIN VOLUMES
Notes by JondZ
2017-03-14

This note was prompted by my need to use SNAPPER to protect a massive amount of data.  This morning I realized the very good space-saving effect of discard; when dealing with terabytes of data it is good to save as much space as possible.

In this example the thin POOL is tp1 and the thin VOLUME of interest is te1.  It is like this because I am merely testing out a configuration that already exists.

These are dumped-out unedited notes.

INITIAL CONDITIONS
------------------------------------------------------------------------
te1 is a 1-Gig (thin) disk mounted on /volumes/te1.
The actual, physical thin volume POOL is sized at 10.35 Gigs right now.

lvs -a | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         4.77                                  
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             22.62  15.28                          
  [tp1_tdata]     bmof Twi-ao----  10.35g                                                   
  [tp1_tmeta]     bmof ewi-ao----   8.00m                                                   
root@epike-OptiPlex-7040:/volumes/te1#


EFFECT 1: A 500-MEG FILE INSERTED
-------------------------------------------------------------------
Notice the increase in usage of "te1", now up to 52.36.   The thin pool tp1 increased as well, to 27.22 usage.

dd if=/dev/zero of=500MFILE21 bs=1024 count=500000

root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs -a | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         52.36                                 
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             27.22  18.26                          
  [tp1_tdata]     bmof Twi-ao----  10.35g                                                   
  [tp1_tmeta]     bmof ewi-ao----   8.00m                                                   
root@epike-OptiPlex-7040:/volumes/te1#

EFFECT 2: 500-MEG FILE REMOVED
---------------------------------------------------------------------
Removing a file did not reduce the Thin volume usage.  The numbers are the same for the pool use percentages.

root@epike-OptiPlex-7040:/volumes/te1# rm 500MFILE21
root@epike-OptiPlex-7040:/volumes/te1# df -h -P .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/bmof-te1  976M  1.3M  908M   1% /volumes/te1
root@epike-OptiPlex-7040:/volumes/te1#

root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs -a | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         52.45                                 
  te1-snapshot1   bmof Vri---tz-k   1.00g tp1  te1                                          
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             27.23  18.46                          
  [tp1_tdata]     bmof Twi-ao----  10.35g                                                   
  [tp1_tmeta]     bmof ewi-ao----   8.00m                                                   
root@epike-OptiPlex-7040:/volumes/te1#

EFFECT 3: fstrim
-----------------------------------------------------------------
FSTRIM will reclaim spaces on the thin volume AND the thin pool:

root@epike-OptiPlex-7040:/volumes/te1# fstrim -v /volumes/te1
/volumes/te1: 607.2 MiB (636727296 bytes) trimmed
root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs -a | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         4.77                                  
  te1-snapshot1   bmof Vri---tz-k   1.00g tp1  te1                                          
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             27.23  18.55                          
  [tp1_tdata]     bmof Twi-ao----  10.35g                                                   
  [tp1_tmeta]     bmof ewi-ao----   8.00m

Well, it does not show here, but I recall that the thin POOL is also reduced.  Perhaps the snapshot gets in the way?  It was created automatically (by snapper) while I was composing this text.

There, much better:

root@epike-OptiPlex-7040:/volumes/te1# snapper -c te1 delete 1
root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs -a | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         4.77                                  
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             22.62  15.28                          
  [tp1_tdata]     bmof Twi-ao----  10.35g                                                   
  [tp1_tmeta]     bmof ewi-ao----   8.00m                                                   
root@epike-OptiPlex-7040:/volumes/te1# fstrim -v /volumes/te1
/volumes/te1: 239.4 MiB (251031552 bytes) trimmed
root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs -a | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         4.77                                  
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             22.62  15.28                          
  [tp1_tdata]     bmof Twi-ao----  10.35g                                                   
  [tp1_tmeta]     bmof ewi-ao----   8.00m                                                   
root@epike-OptiPlex-7040:/volumes/te1#

The numbers are down to 4.77 consumed on the Thin VOLUME, and 22.62 percent on the thin POOL.

EFFECT 4: mount with DISCARD option automatically reclaims THIN space
----------------------------------------------------------------------
This example demonstrates that thin volume space is automatically reclaimed
and returned to the POOL, without needing to manually run fstrim.


root@epike-OptiPlex-7040:/volumes/te1# mount -o remount,discard /dev/mapper/bmof-te1

root@epike-OptiPlex-7040:/volumes/te1# !dd
dd if=/dev/zero of=500MFILE24 bs=1024 count=500000
500000+0 records in
500000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.553593 s, 925 MB/s
root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs -a | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         52.39                                 
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             27.22  18.26                          
  [tp1_tdata]     bmof Twi-ao----  10.35g                                                   
  [tp1_tmeta]     bmof ewi-ao----   8.00m                                                   
root@epike-OptiPlex-7040:/volumes/te1# rm 500MFILE24
root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs -a | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         52.39                                 
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             27.22  18.26                          
  [tp1_tdata]     bmof Twi-ao----  10.35g                                                   
  [tp1_tmeta]     bmof ewi-ao----   8.00m                                                   
root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs -a | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         4.79                                  
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             22.62  15.28                          
  [tp1_tdata]     bmof Twi-ao----  10.35g                                                   
  [tp1_tmeta]     bmof ewi-ao----   8.00m                                                   
root@epike-OptiPlex-7040:/volumes/te1# ls
lost+found
root@epike-OptiPlex-7040:/volumes/te1#

-----------------------------------------------------------------------
But does the reclamation work through snapshot layers?  Well, it would be difficult to test all combinations, but let's at least verify that the space is reclaimed when the snapshots are deleted.

First, mount with the discard mode

root@epike-OptiPlex-7040:~# !mount
mount -o remount,discard /dev/mapper/bmof-te1
root@epike-OptiPlex-7040:~#

the initial conditions are:

  te1             bmof Vwi-aotz--   1.00g tp1         4.77                                  
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             22.62  15.28  

Ok, so LV is 4.77 percent, LV POOL is 22.62 percent. 

So..consume space. 

root@epike-OptiPlex-7040:/volumes/te1# !dd
dd if=/dev/zero of=500MFILE24 bs=1024 count=500000
500000+0 records in
500000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.559463 s, 915 MB/s
root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         52.37                                 
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             27.22  18.26                          
root@epike-OptiPlex-7040:/volumes/te1#

snapshot, and consume space some more..

root@epike-OptiPlex-7040:/volumes/te1# snapper -c te1 create
root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         52.45                                 
  te1-snapshot1   bmof Vri---tz-k   1.00g tp1  te1                                          
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             27.23  18.36                          
root@epike-OptiPlex-7040:/volumes/te1#

root@epike-OptiPlex-7040:/volumes/te1# !dd:p
dd if=/dev/zero of=500MFILE24 bs=1024 count=500000
root@epike-OptiPlex-7040:/volumes/te1# dd if=/dev/zero of=200mfile bs=1024 count=200000
200000+0 records in
200000+0 records out
204800000 bytes (205 MB, 195 MiB) copied, 0.211273 s, 969 MB/s

root@epike-OptiPlex-7040:/volumes/te1# !snapper
snapper -c te1 create
root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         71.53                                 
  te1-snapshot1   bmof Vri---tz-k   1.00g tp1  te1                                          
  te1-snapshot2   bmof Vri---tz-k   1.00g tp1  te1                                          
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             29.08  19.82                          
root@epike-OptiPlex-7040:/volumes/te1#

Then remove the files.  The numbers should not go down since there are snap volumes.

root@epike-OptiPlex-7040:/volumes/te1# rm 200mfile 500MFILE24
root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         4.78                                  
  te1-snapshot1   bmof Vri---tz-k   1.00g tp1  te1                                          
  te1-snapshot2   bmof Vri---tz-k   1.00g tp1  te1                                          
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             29.08  20.12     

Ok so I stand corrected: the LVM VOLUME usage went down, but the LVM POOL usage did not.
That actually makes sense since the snapshots consume the space.
What happens when the snapshots are removed, is the space reclaimed into the thin POOL?

root@epike-OptiPlex-7040:/volumes/te1# snapper -c te1 delete 1
root@epike-OptiPlex-7040:/volumes/te1# snapper -c te1 delete 2
root@epike-OptiPlex-7040:/volumes/te1# !lvs
lvs | grep tp1
  te1             bmof Vwi-aotz--   1.00g tp1         4.78                                  
  te2             bmof Vwi-aotz--   1.00g tp1         97.66                                 
  te3             bmof Vwi-aotz--   1.00g tp1         4.75                                  
  te4             bmof Vwi-aotz--   3.00g tp1         42.32                                 
  tp1             bmof twi-aotz--  10.35g             22.62  15.28                          
root@epike-OptiPlex-7040:/volumes/te1#

It does!!!  When the snap volumes are removed, the space is reclaimed into the thin pool.

CONCLUSION:
--------------------
When working with thin volumes, use the DISCARD mount option, even (or especially) when not using SSDs.

OTHER TESTS
-----------
I tested mounting normally, consuming space, then mounting with the discard option.  What happens is that the space is not automatically reclaimed just by remounting: fstrim needs to run, and snapshots need to be deleted.  Still, there is no harm and in fact an advantage to adding the "discard" option in fstab even for existing (thin volume) mounts.
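
In fstab this is just one extra mount option.  A sketch for the volume used in this test (the filesystem type here is an assumption):

    /dev/mapper/bmof-te1   /volumes/te1   ext4   defaults,discard   0  2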


JondZ 20170314



                 












