Monitoring a Filesystem – Part 1

So, one fine day at work, I was handed a seemingly simple task: detect any changes in a specific folder on the filesystem and sync any new or modified files to an S3 bucket in AWS. In common terms, a file watcher. Not a big deal at all… Here was my approach.
I wrote a simple shell script (since it was a RHEL VM on AWS) that looked something like this:

#!/bin/sh
cd /appl/data || exit 1
for filename in *; do
  # Do your stuff
  aws s3 sync "$filename" "s3://sync-bucket/$filename"
done

It scans the directory and syncs it (using the AWS CLI) to an S3 bucket. This is as simple as it gets. Now, invoke this script at regular intervals from a crontab or anywhere else. A crontab periodically runs a command (such as invoking a script or collecting data) on a schedule; you can find detailed crontab references online. My cron entry looked like this (you can edit your crontab by running this command on a Linux system):

$ sudo crontab -e
* * * * * /home/sync.sh >>/home/sync.log 2>&1
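For reference, the five fields in a cron entry are minute, hour, day of month, month, and day of week; five asterisks means "run every minute":

```
# ┌───────── minute (0-59)
# │ ┌─────── hour (0-23)
# │ │ ┌───── day of month (1-31)
# │ │ │ ┌─── month (1-12)
# │ │ │ │ ┌─ day of week (0-6, Sunday = 0)
# │ │ │ │ │
  * * * * *  /home/sync.sh >>/home/sync.log 2>&1
```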

A few points here:

  1. We redirect the standard output along with the error stream to a sync.log file for monitoring purposes.
  2. The paths (to the script and the log file) should be absolute so that there is no ambiguity about the working directory.
  3. The crontab is edited as root (via sudo). You can choose to run it as any other user; just take care of the file permissions when invoking the script.

So far so good. But soon, I ran into another problem. Cron's minimum scheduling granularity is one minute. So in the worst case, if a file changes one second after a scan, the change will not be detected for another 59 seconds. That was a trade-off we decided not to live with, so we shifted to a more frequent scan model. Now my crontab looked like this:

$ sudo crontab -e
* * * * * /home/sync.sh >>/home/sync.log 2>&1
* * * * * (sleep 10; /home/sync.sh >>/home/sync.log 2>&1)
* * * * * (sleep 20; /home/sync.sh >>/home/sync.log 2>&1)
* * * * * (sleep 30; /home/sync.sh >>/home/sync.log 2>&1)
* * * * * (sleep 40; /home/sync.sh >>/home/sync.log 2>&1)
* * * * * (sleep 50; /home/sync.sh >>/home/sync.log 2>&1)

We are now invoking the script more often (every 10 seconds, by staggering six entries with sleep offsets). This is a workaround… don’t do it unless you are completely sure, as it may have performance implications on your system.
Very soon, the next challenge surfaced: what if a scan takes more than 10 seconds? The solution surfaced soon too – “LOCK THE FOLDER AGAINST FURTHER SYNCS WHILE IT IS BEING WORKED ON“. So, now the script looked like this:

#!/bin/sh
cd /appl/data || exit 1
for filename in *; do
  # Do your stuff
  flock -n /tmp/"$filename".lock aws s3 sync "$filename" "s3://sync-bucket/$filename"
done

The flock utility creates a lock file named after the folder, and rejects any further requests on that folder while the lock is already held by another process.
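A minimal sketch of this nonblocking behavior (using a hypothetical /tmp/demo.lock file): while one process holds the lock, a second `flock -n` attempt fails immediately instead of waiting.

```shell
# Hold the lock in the background for a few seconds
flock -n /tmp/demo.lock sleep 3 &

sleep 1  # give the background process time to acquire the lock

# -n makes flock fail immediately instead of blocking
if flock -n /tmp/demo.lock true; then
  echo "acquired"
else
  echo "locked"   # this branch runs while the lock is still held
fi
wait
```

This is exactly what protects overlapping cron invocations: a run that finds the folder locked simply skips it and exits.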

All that was fine, but as the size of the folder grew (we are talking about thousands of files here), the scanning time increased significantly, up to a few minutes. Yes, it also depends on the capability of the machine it runs on (and ours was a very moderate VM), but that defeats the purpose of frequent scans. The solution was no longer working as per our expectations! We started looking for more optimized and lightweight options. One suitable candidate was an event-based model; I will write more about it in subsequent posts.

Here are the lessons learnt:

  • Even if the solution to a problem appears to be a walk in the park, things may quickly change depending on the use cases and data sets we are working with
  • Scanning a folder for syncing may not scale with the amount of data
  • Always have two or three alternative options and evaluate them before committing to one

Published by Sam Banerjee

I’m an AI and software engineering consultant who helps organizations design and deliver production-ready AI systems. I specialize in translating complex machine learning ideas into scalable, reliable solutions that work under real-world constraints. My focus spans AI architecture, applied ML, and system design, with an emphasis on model behavior, generalization, and operational risk. I work end-to-end—from problem framing and technical strategy to execution and deployment—bringing clarity, rigor, and pragmatism to AI initiatives.
