How to wait for gcsfuse to write-through (flush) to the GCS storage bucket? - google-cloud-storage

After a Compute Engine worker node writes files into a gcsfuse mounted local directory and closes them, I want it to synchronously flush the data through to GCS before it notifies other worker nodes that all the files are ready. This is to ensure synchronization between workers.
Q. How to ask gcsfuse to write-through to GCS, then wait for that to complete?
Ideas:
Run the Linux sync command?
Unmount the directory then wait for that fusermount command to return? (Besides the write-through time, would it take more than a second or two to unmount then remount for the next worker task?)
Make all the programs in this task call fsync() on all their output files? That'd be challenging.
Write an additional file, then flush() and fsync() that one?

Have a look at gcsfuse semantics:
Inodes may be opened for writing. Modifications are reflected immediately in reads of the same inode by processes local to the machine using the same file system. After a successful fsync or a successful close, the contents of the inode are guaranteed to have been written to the GCS object with the matching name if the object's generation and meta-generation numbers still matched the source generation of the inode. (They may not have if there had been modifications from another actor in the meantime.) There are no guarantees about whether local modifications are reflected in GCS after writing but before syncing or closing.
So if your worker closes the files after writing them, the downstream workers that depend on them should see them consistently.
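A minimal sketch of the resulting close-then-notify pattern, assuming a hypothetical gcsfuse mount point and a placeholder notification step:
import os
MOUNT_DIR = "/mnt/gcs-output"  # hypothetical gcsfuse mount point
def write_outputs(results):
    # Write each output file and close it. Per the semantics quoted above, a
    # successful close (or fsync) means the contents have been written to the
    # GCS object, so no separate flush step is needed afterwards.
    for name, data in results.items():
        path = os.path.join(MOUNT_DIR, name)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # optional: forces the upload even before close
        # Leaving the `with` block closes the file; close does not return until
        # gcsfuse has written the object, and it surfaces an error if that failed.
def notify_workers():
    pass  # placeholder for however the other worker nodes are signalled
write_outputs({"part-0000.bin": b"example payload"})
notify_workers()  # only runs after every close above has returned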

Related

EFS storage growing too big

We have an ECS Fargate cluster that runs the fluentd application for collecting logs and routing them to Elasticsearch. Logs are buffered on disk (file buffer) before being routed to the destination. Since we are using Fargate, we mount the buffer path /var/log/fluentd/buffer/ to EFS.
What we would ideally expect is that the data in the buffer path gets flushed to Elasticsearch and the buffer directory is then deleted. However, we see a huge number of these buffer directories from tasks that died and were restarted several months ago.
So when an ECS task dies and comes back up again (autoscaling), it creates a new /var/log/fluentd/buffer/ path that gets mounted on EFS, while the old /var/log/fluentd/buffer/ path is still held on to. I am not sure if it is EFS that is holding on to these and remounting them back on the new tasks.
Is there a way to delete these stale directories from EFS and keep only the paths specific to the running tasks? At any given time we have 5 tasks running in the service.
Any help is appreciated.

What happens when network connection to GCP is lost?

Imagine I have a GCS bucket mounted on my local Linux file system. Imagine I have an app that is writing new files into a Linux directory that is mounted to GCS. My goal is to have those locally written files eventually show up in GCS.
I understand that the writes on Linux happen "locally" until the file is closed ... what happens if I lose network connectivity and hence can't write to GCS? Will the local file eventually end up in GCS? Do retries and re-attempts happen?
Based on the gcsfuse repository documentation, file upload retries are already built into the utility, and they happen when there are problems accessing the mounted storage bucket. You can modify the maximum backoff for retries with the --max-retry-sleep flag. This flag controls the maximum time allowed between retries; once that limit is reached, retrying stops. The flag accepts a duration (for example, a number of minutes) as input.
This doc page is also relevant if you would like to know more about specific characteristics of gcsfuse.
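For reference, a minimal sketch of mounting with an explicit retry ceiling; the bucket name and mount point are placeholders, and gcsfuse must already be installed on the machine:
import subprocess
BUCKET = "my-bucket"      # placeholder
MOUNT_POINT = "/mnt/gcs"  # placeholder; the directory must already exist
# Cap the exponential-backoff sleep between retries at 5 minutes.
subprocess.run(["gcsfuse", "--max-retry-sleep", "5m", BUCKET, MOUNT_POINT], check=True)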

Proper way for pods to read input files from the same persistent volume?

I'm new to Kubernetes and plan to use Google Kubernetes Engine. Hypothetically speaking, let's say I have a K8s cluster with 2 worker nodes. Each node would have its own pod housing the same application. This application will grab a file from some persistent volume and generate an output file that will be pushed into a different persistent volume. Both pods in my cluster would be doing this continuously until there are no input files left in the persistent volume to be processed. Do the pods inherently know NOT to grab the same file that one pod is already using? If not, how would I be able to account for this? I would like to avoid 2 pods using the same input file.
Do the pods inherently know NOT to grab the same file that one pod is already using?
Pods are just processes. Two separate processes accessing files from a shared directory are going to run into conflicts unless they have some sort of coordination mechanism.
Option 1
Have one process whose job it is to enumerate the available files. Your two workers connect to this process and receive filenames via some sort of queue/message bus/etc. When they finish processing a file, they request the next one, until all files are processed. Because only a single process is enumerating the files and passing out the work, there's no option for conflict.
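A minimal sketch of that idea, using a Redis list as the queue; the Redis host, queue name, and /input mount path are assumptions, and any other message bus would work the same way:
import os
import redis  # assumed available in the worker image
r = redis.Redis(host="redis", port=6379)  # hypothetical in-cluster Redis service
QUEUE = "input-files"
INPUT_DIR = "/input"  # the shared PV mount, hypothetical path
def enumerate_files():
    # Run this once (e.g. as a Job): only this process lists the directory, so
    # there is a single source of truth for the work items.
    for name in os.listdir(INPUT_DIR):
        r.lpush(QUEUE, name)
def worker_loop(process):
    # Each worker pod pops one filename at a time; BRPOP is atomic, so two pods
    # can never receive the same filename.
    while True:
        item = r.brpop(QUEUE, timeout=30)
        if item is None:
            break  # queue drained
        _, name = item
        process(os.path.join(INPUT_DIR, name.decode()))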
Option 2
In general, renaming a file within the same filesystem is an atomic operation. Each worker creates a subdirectory within your PV. To claim a file, it renames the file into its own subdirectory and then processes it. Because renames are atomic, even if both workers happen to pick the same file at the same time, only one rename will succeed.
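A minimal sketch of the claim-by-rename approach; the directory names and worker id are placeholders:
import os
INPUT_DIR = "/input"                   # shared PV mount, hypothetical path
CLAIM_DIR = "/input/claimed-worker-1"  # one subdirectory per worker
os.makedirs(CLAIM_DIR, exist_ok=True)
def claim_next_file():
    # os.rename within the same filesystem is atomic, so if another worker got
    # to a file first our rename fails and we simply try the next candidate.
    for name in os.listdir(INPUT_DIR):
        src = os.path.join(INPUT_DIR, name)
        if not os.path.isfile(src):
            continue
        try:
            os.rename(src, os.path.join(CLAIM_DIR, name))
        except OSError:
            continue  # someone else claimed it first
        return os.path.join(CLAIM_DIR, name)  # we own this file now
    return None  # nothing left to claim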
Option 3
If your files have some sort of systematic naming convention, you can divide the list of files between your two workers (e.g., "everything that ends in an even number is processed by worker 1, and everything that ends with an odd number is processed by worker 2").
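For instance, a deterministic split by filename hash could look like this; WORKER_ID and WORKER_COUNT are assumed to come from the pod spec (env vars, a StatefulSet ordinal, etc.):
import os
from zlib import crc32
WORKER_ID = int(os.environ.get("WORKER_ID", "0"))       # hypothetical env vars
WORKER_COUNT = int(os.environ.get("WORKER_COUNT", "2"))
def is_mine(filename):
    # Deterministically assign each filename to exactly one worker.
    return crc32(filename.encode()) % WORKER_COUNT == WORKER_ID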
Etc. There are many ways to coordinate this sort of activity. The wikipedia entry on Synchronization may be of interest.

what is in zookeeper datadir and how to cleanup?

I found that my ZooKeeper dataDir is huge. I would like to understand:
What is in the dataDir?
How do I clean it up? Does it clean up automatically after a certain period?
Thanks
According to Zookeeper's administrator guide:
The ZooKeeper Data Directory contains files which are a persistent copy of the znodes stored by a particular serving ensemble. These are the snapshot and transactional log files. As changes are made to the znodes these changes are appended to a transaction log, occasionally, when a log grows large, a snapshot of the current state of all znodes will be written to the filesystem. This snapshot supercedes all previous logs.
So in short, for your first question, you can assume that dataDir is used to store Zookeeper's state.
As for your second question, there is no automatic cleanup. From the doc:
A ZooKeeper server will not remove old snapshots and log files, this is the responsibility of the operator. Every serving environment is different and therefore the requirements of managing these files may differ from install to install (backup for example).
The PurgeTxnLog utility implements a simple retention policy that administrators can use. The API docs contains details on calling conventions (arguments, etc...).
In the following example the last <count> snapshots and their corresponding logs are retained and the others are deleted. The value of <count> should typically be greater than 3 (although not required, this provides 3 backups in the unlikely event a recent log has become corrupted). This can be run as a cron job on the ZooKeeper server machines to clean up the logs daily.
java -cp zookeeper.jar:log4j.jar:conf org.apache.zookeeper.server.PurgeTxnLog <dataDir> <snapDir> -n <count>
If this is a dev instance, I guess you could almost completely purge the folder (except for some files like myid, if it's there). But for a production instance you should follow the cleanup procedure shown above.

Copy a file from HDFS to a local directory for multiple tasks on a node?

So, basically, I have a read-only file (several GB big, so broadcasting is not an option) that must be copied to a local folder on the node, because each task internally runs a program (via os.system in Python or the ! operator in Scala) that reads from a local file (it can't read from HDFS). The problem, however, is that several tasks will be running on one node. If the file is not already there on the node, it should be copied from HDFS to a local directory. But how could I have one task fetch the file from HDFS while the other tasks wait for it (note that the tasks run in parallel on a node)? Which file synchronization mechanism could I use in Spark for that purpose?
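One way to get the "first task copies, the rest wait" behaviour described above is a node-local lock file. A sketch, assuming the hdfs CLI is available on the workers and using placeholder paths:
import fcntl
import os
import subprocess
HDFS_PATH = "/data/shared_input.dat"  # placeholder HDFS source
LOCAL_PATH = "/tmp/shared_input.dat"  # placeholder node-local target
LOCK_PATH = LOCAL_PATH + ".lock"
def ensure_local_copy():
    # Every task on the node opens the same lock file; the first to take the
    # exclusive lock performs the copy, the others block on flock() until it is
    # done and then find the file already present.
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            if not os.path.exists(LOCAL_PATH):
                tmp = LOCAL_PATH + ".part"
                subprocess.run(["hdfs", "dfs", "-copyToLocal", HDFS_PATH, tmp], check=True)
                os.rename(tmp, LOCAL_PATH)  # publish the finished copy atomically
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return LOCAL_PATH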