What happens when network connection to GCP is lost? - gcsfuse

Imagine I have a GCS bucket mounted on my local Linux file system. Imagine I have an app that is writing new files into a Linux directory that is mounted to GCS. My goal is to have those locally written files eventually show up in GCS.
I understand that the writes on Linux happen "locally" until the file is closed ... what happens if I lose network connectivity and hence can't write to GCS? Will the local file eventually end up in GCS? Do retries and re-attempts happen?

Based on the gcsfuse repository documentation, file upload retries are already built into the utility, and they kick in when there are problems accessing the mounted storage bucket. You can change the maximum backoff for retries with the --max-retry-sleep flag. This flag sets the longest sleep allowed between retries; once the backoff would exceed that limit, retrying stops. The flag takes a duration as input, for example a number of minutes.
This doc page is also relevant if you would like to know more about specific characteristics of gcsfuse.
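To make the retry behaviour described above concrete, here is a minimal Python sketch of capped exponential backoff. It is not gcsfuse's implementation; the function name, the default values and the choice of OSError as the retryable error are all assumptions made for illustration. It only mirrors the rule stated in the answer: keep retrying with growing sleeps, and stop once the sleep would exceed the maximum.

```python
import time


def retry_with_backoff(operation, initial_sleep=1.0, max_sleep=60.0,
                       multiplier=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or the backoff exceeds max_sleep.

    Hypothetical helper: the names and defaults are illustrative, not gcsfuse's.
    """
    delay = initial_sleep
    while True:
        try:
            return operation()
        except OSError:
            # Give up (re-raise) once the next sleep would exceed the cap.
            if delay > max_sleep:
                raise
            sleep(delay)
            delay *= multiplier
```

The sleep function is a parameter so the loop can be exercised without actually waiting.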

Related

How to wait for gcsfuse to write-through (flush) to the GCS storage bucket?

After a Compute Engine worker node writes files into a gcsfuse mounted local directory and closes them, I want it to synchronously flush the data through to GCS before it notifies other worker nodes that all the files are ready. This is to ensure synchronization between workers.
Q. How to ask gcsfuse to write-through to GCS, then wait for that to complete?
Ideas:
Run the Linux sync command?
Unmount the directory then wait for that fusermount command to return? (Besides the write-through time, would it take more than a second or two to unmount then remount for the next worker task?)
Make all the programs in this task call fsync() on all their output files? That'd be challenging.
Write an additional file, then flush() and fsync() that one?
Have a look at gcsfuse semantics:
Inodes may be opened for writing. Modifications are reflected
immediately in reads of the same inode by processes local to the
machine using the same file system. After a successful fsync or a
successful close, the contents of the inode are guaranteed to have
been written to the GCS object with the matching name if the object's
generation and meta-generation numbers still matched the source
generation of the inode. (They may not have if there had been
modifications from another actor in the meantime.) There are no
guarantees about whether local modifications are reflected in GCS
after writing but before syncing or closing.
So if your worker closes the files after writing them, subsequent dependencies should see them consistently.
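A minimal Python sketch of that approach, assuming the worker controls how its output files are written: each file is flushed, fsynced and closed before the function returns, and a marker file is written last. The function names and the "_DONE" marker name are hypothetical, and the marker is just one possible way to signal other workers.

```python
import os


def write_and_sync(path, data):
    """Write data to path; return only after fsync and close have succeeded."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the file system to persist the file
    # Leaving the with-block closes the file; a failed fsync raises OSError.


def write_outputs_then_marker(directory, outputs, marker_name="_DONE"):
    """Write every output file, then an empty marker file (hypothetical name)."""
    for name, data in outputs.items():
        write_and_sync(os.path.join(directory, name), data)
    write_and_sync(os.path.join(directory, marker_name), b"")
```

On a gcsfuse mount, the quoted semantics say a successful fsync or close means the contents have been written to the GCS object, so a worker could notify the others once write_outputs_then_marker returns without raising.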

how kubernetes deal with file write locker accross multi pods when hostpath Volumes concerned

I have an app that logs to the file my_log/1.log, and I use Filebeat to collect the logs from that file.
Now I use k8s to deploy it onto several nodes, with a hostPath volume that mounts my_log on the local file system at /home/my_log. Then I noticed a subtle situation:
what happens if more than one pod is deployed on the same machine and they all try to write to the log at the same time?
I know that normally, when multiple processes try to write to a file at the same time, the system locks the file so the processes write one by one. BUT I am not sure whether different k8s pods share the same lock space. If they don't, it will be a disaster.
I tried to test this, and it seems different pods do still share the file lock; the log file looks normal.
How does Kubernetes deal with file write locks across multiple pods when hostPath volumes are involved?
It doesn't.
Operating System and File System are handling that.
As an example, take syslog. It handles this by opening a socket, setting the socket to server mode, opening a log file in write mode, being notified of incoming packets, parsing each message and finally writing it to the file.
Logs can also be cached, and the process can be limited to one thread, so you should not have many pods writing to one file. Doing so can lead to missing logs or truncated lines.
Your application should handle the file locking itself when it writes logs. And if you want many pods writing logs, you should have a separate log file for each pod.
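Both suggestions can be sketched in Python on Linux. The helper names are hypothetical. The per-pod file name assumes the HOSTNAME environment variable holds the pod name, which is the usual default in Kubernetes but is an assumption here. Note that flock is an advisory lock: it only protects against other writers that also take the lock.

```python
import fcntl
import os


def pod_log_path(log_dir, base_name="1.log"):
    """Build a per-pod log file name (assumes HOSTNAME holds the pod name)."""
    pod = os.environ.get("HOSTNAME", "unknown-pod")
    return os.path.join(log_dir, f"{pod}-{base_name}")


def append_locked(path, line):
    """Append one line while holding an exclusive advisory lock on the file."""
    with open(path, "a", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # blocks until the lock is free
        try:
            f.write(line.rstrip("\n") + "\n")
            f.flush()                      # write out before releasing the lock
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

With one file per pod, the lock matters only if several processes inside the same pod write to that file.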

Intermittent file not found using Google Cloud Storage from Dataproc - flushing writes?

I have a series of dataproc jobs that run to import some data received each morning. The process creates a cluster, runs four jobs in sequence, then shuts down the cluster. The input file is read from Google Cloud Storage, and the intermediate results are also saved in Avro form in GCS with the final output going to Cloud SQL.
Fairly often the jobs will fail trying to read the Avro written by the previous job. It appears that GCS hasn't "caught up" and the results from the previous job haven't been fully written. I was getting failures trying to read files that appeared to be from the previous day's run and partway through those files would disappear and be replaced by the new ones. I have changed my script that runs the files to clear the work area before starting the jobs, but still have problems where sometimes it starts reading and all the parts haven't been written fully.
I could change the code to simply store the intermediate files on the cluster, though I like having them available outside for diagnosing other problems. I could also just write to both locations, using the cluster for working files and GCS for diagnostics.
But assuming this is some kind of sync issue, is there a way to force GCS to flush writes / be fully synced between jobs? Or is there some check I can do to make sure everything has been written before starting the next job in my chain?
EDIT: To answer the comment below, the sequence of jobs all run on the same cluster. The cluster is started, each job run in turn on that cluster, and then the cluster is shut down.
For now, I have worked around this by having the jobs write to HDFS on the cluster in addition to GCS, with the subsequent jobs reading from the cluster. The GCS output is now strictly for diagnostics in case of a problem. But even though my immediate problem is (I believe) fixed, I would still like to know what's happening and why GCS seems out of sync for a bit.

Google Compute Engine snapshot of instance with persistent disks attached failed

I have a working VM instance that I'm trying to copy to allow redundancy behind google load balancer.
A test run with a dummy instance worked fine, creating a new instance from a snapshot of a running one.
Now, the real "original" instance has a persistent disk attached, and this causes a problem when starting the cloned instance because of the (obviously) missing persistent disk mount.
The serial console output is:
* Stopping cold plug devices [ OK ]
* Stopping log initial device creation [ OK ]
* Starting enable remaining boot-time encrypted block devices [ OK ]
The disk drive for /mnt/XXXX-log is not ready yet or not present.
keys:Continue to wait, or Press S to skip mounting or M for manual recovery
As I understand it, there is no way to send any of these keystrokes to the instance. Is there any other way to overcome this issue? I know that I could unmount the disk before the snapshot, but the workflow I would like to set up is creating periodic snapshots of production servers, so unmounting disks every time would require instance downtime (plus all the unnecessary risks of an action that seems pointless).
Is there a way to boot this type of cloned instance successfully and attach a new persistent disk afterwards?
Is this happening because the original persistent disk is in use, or would the same problem occur even if the original instance were offline (for example after a failure, in which case I would try to create a new instance from a snapshot)?
One workaround I use for the same issue: I don't actually unmount the disk. I just comment out the mount line in /etc/fstab and take the snapshot. This way my instance has no downtime or offline disks while snapshotting. (I am using Ubuntu 14.04 as the OS, if that matters.)
Later I fix and uncomment it when I use that snapshot on a new instance.
However you can also look into adding the nofail option in the commented line to get a better solution.
By the way, I am doing a similar task: building a load-balanced setup with multiple web server nodes, each cloned from that snapshot, with extra persistent disks mounted for e.g. uploads, data and logs.
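As a rough illustration of the nofail suggestion, the following Python sketch adds nofail to the options of one /etc/fstab entry. It works on the file's text only and never touches the real file. The function name and the sample entry in the usage note are hypothetical.

```python
def add_nofail(fstab_text, mount_point):
    """Return fstab_text with 'nofail' added to the entry for mount_point.

    Comment lines and other entries are left untouched. A modified entry is
    re-joined with single spaces, so its column alignment is not preserved.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_entry = len(fields) >= 4 and not line.lstrip().startswith("#")
        if is_entry and fields[1] == mount_point:
            options = fields[3].split(",")
            if "nofail" not in options:
                options.append("nofail")
            fields[3] = ",".join(options)
            line = " ".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```

For a made-up entry such as "/dev/sdb /mnt/app-log ext4 defaults 0 2", the options field becomes "defaults,nofail", which tells the boot process not to report an error if the device is absent.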
I'm a little unclear as to what you're trying to accomplish. It sounds like you're looking to periodically snapshot the data volumes of a production server so you can clone them later.
In all likelihood, you simply need to sync and fsfreeze before you make your snapshot, rather than unmounting and remounting the disk. The GCP Snapshots documentation has a basic example of this.
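The sequence can be sketched in Python as a thin wrapper around the command-line tools. This assumes sync, fsfreeze and the gcloud CLI are installed, that the script runs with enough privileges, and that the frozen mount point is a data disk rather than the boot disk. The function name and every argument value are hypothetical. The command runner is injectable so the order of steps can be checked without executing anything.

```python
import subprocess


def snapshot_frozen(mount_point, disk, zone, snapshot_name, run=subprocess.run):
    """Flush and freeze a file system, snapshot its disk, then unfreeze it."""
    run(["sync"], check=True)                          # flush dirty pages
    run(["fsfreeze", "-f", mount_point], check=True)   # block new writes
    try:
        run(["gcloud", "compute", "disks", "snapshot", disk,
             "--zone", zone, "--snapshot-names", snapshot_name], check=True)
    finally:
        # Always unfreeze, even if the snapshot command failed.
        run(["fsfreeze", "-u", mount_point], check=True)
```

The try/finally is the important design choice: a file system left frozen blocks every writer on that mount.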

How to handle file paths in distributed environment

I'm working on setting up a distributed celery environment to do OCR on PDF files. I have about 3M PDFs and OCR is CPU-bound so the idea is to create a cluster of servers to process the OCR.
As I'm writing my task, I've got something like this:
@app.task
def do_ocr(pk, file_path):
    # Run OCR on the file, then store the extracted text on the Document row.
    content = run_tesseract_command(file_path)
    item = Document.objects.get(pk=pk)
    item.content = content
    item.save()
My question is what the best way is to make file_path work in a distributed environment. How do people usually handle this? Right now all my files live in a single directory on one of our servers.
If you are in a Linux environment, the easiest way is to mount the remote filesystems with sshfs under /mnt on each node in the cluster. Then you can pass the node name to the do_ocr function and work as if all the data were local to the current node.
For example, your cluster has N nodes named: node1, ... ,nodeN
Let's configure node1: mount the remote filesystem of every other node on it. Here's a sample /etc/fstab for node1:
sshfs#user@node2:/var/your/app/pdfs /mnt/node2 fuse port=<port>,defaults,user,noauto,uid=1000,gid=1000 0 0
....
sshfs#user@nodeN:/var/your/app/pdfs /mnt/nodeN fuse port=<port>,defaults,user,noauto,uid=1000,gid=1000 0 0
On the current node (node1), create a symlink in /mnt, named after the current server, that points to the local PDF path:
ln -s /var/your/app/pdfs node1
Your /mnt folder should now contain the remote filesystems and the symlink:
user@node1:/mnt$ ls -lsa
0 lrwxrwxrwx 1 user user 16 apr 12 2016 node1 -> /var/your/app/pdfs
0 lrwxrwxrwx 1 user user 16 apr 12 2016 node2
...
0 lrwxrwxrwx 1 user user 16 apr 12 2016 nodeN
Then your function should look like this:
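import os

MOUNT_POINT = '/mnt'  # directory holding the per-node sshfs mounts

@app.task
def do_ocr(pk, node_name, file_path):
    # Resolve the file through the mount (or symlink) named after its node.
    content = run_tesseract_command(os.path.join(MOUNT_POINT, node_name, file_path))
    item = Document.objects.get(pk=pk)
    item.content = content
    item.save()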
It works as if all the files were on the current machine, while the remote access happens transparently underneath.
Well, there are multiple ways to handle it, but let's stick to one of the simplest:
since you'd like to process a large number of files using multiple servers, my first suggestion would be to use the same OS on each server, so you won't have to worry about cross-platform compatibility
using the word 'cluster' suggests that all of those servers should know their mutual state, which adds complexity; try switching to a farm of stateless workers (by 'stateless' I mean "not knowing about the others", as they should be aware of at least their own state, e.g.: IDLE, IN_PROGRESS, QUEUE_FULL or more if needed)
for the file list processing part you could use pull or push model:
push model could be easily implemented by a simple app that crawls the files and dispatches them (e.g.: over SCP, FTP, whatever) to a set of available servers; servers can monitor their local directories for changes and pick up new files to process; it's also very easy to scale - just spin up more servers and update the push client (even in runtime); the only limit is your push client's performance
the pull model is a little trickier, because you have to handle more complexity; having a set of servers implies having a proper starting index and offset per node, which makes error handling more difficult; plus, it doesn't scale easily (imagine adding twice as many servers to speed up the processing and then updating indices and offsets properly on each node... it seems like an error-prone solution)
I assume that the network traffic isn't a big concern - having 3M files to process will generate it somewhere, one way or the other..
collecting/storing the results is a different ballpark, but here the list of possible solutions is limitless
Since I miss a lot of your architecture details and your application specifics, you can take this answer as a guiding answer rather than a strict one.
You can take this approach, in the following order:
1- deploy an internal file server that stores all the files in one place and serve them
Example:
http://internal-ip-address/storage/filenameA.pdf
http://internal-ip-address/storage/filenameB.pdf
http://internal-ip-address/storage/filenameC.pdf
and so on ...
2- Install/Deploy Redis
3- Create an upload client/service/process that takes the files you want to upload and passes them to the storage location above (/storage/), so your files are available as soon as they are uploaded. At the same time, push the full file URL to a predefined Redis list/queue (built on a linked-list data structure), like this: http://internal-ip-address/storage/filenameA.pdf
You can find more details about LPUSH and RPOP under Redis Lists here: http://redis.io/topics/data-types-intro
Examples:
A file upload form, that stores the files directly to storage area
A file upload utility, command-line tool or background process, either written yourself or an existing tool, that fetches the files from a specific location (a web address or some other server that has your files) and uploads them to the storage location
4- Now we come to your Celery workers: each worker should pull (RPOP) one file URL from the Redis queue, download the file from the internal file server (built in the first step), and do the required processing.
An important thing to note from Redis documentation:
Lists have a special feature that make them suitable to implement
queues, and in general as a building block for inter process
communication systems: blocking operations.
However it is possible that sometimes the list is empty and there is
nothing to process, so RPOP just returns NULL. In this case a consumer
is forced to wait some time and retry again with RPOP. This is called
polling, and is not a good idea in this context because it has several
drawbacks
So Redis implements commands called BRPOP and BLPOP which are versions
of RPOP and LPOP able to block if the list is empty: they'll return to
the caller only when a new element is added to the list, or when a
user-specified timeout is reached.
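The blocking-pop worker loop can be sketched in Python as follows. The loop accepts any client object that has a brpop method shaped like the one in redis-py (it returns a (key, value) pair, or None on timeout). The InMemoryQueue class is only a stand-in so the loop can be tried without a Redis server; the class name, the function name and the "ocr:files" queue name used below are hypothetical.

```python
import collections


class InMemoryQueue:
    """Stand-in exposing lpush/brpop; it never blocks, unlike a real server."""

    def __init__(self):
        self._lists = collections.defaultdict(collections.deque)

    def lpush(self, name, *values):
        for value in values:
            self._lists[name].appendleft(value)
        return len(self._lists[name])

    def brpop(self, name, timeout=0):
        # A real Redis server would wait up to `timeout` seconds here.
        if self._lists[name]:
            return (name, self._lists[name].pop())
        return None


def drain_queue(client, queue_name, handle, timeout=5):
    """Pop URLs until a pop times out; return how many were handled."""
    processed = 0
    while True:
        item = client.brpop(queue_name, timeout=timeout)
        if item is None:       # timed out: nothing left to process for now
            return processed
        _key, url = item
        handle(url)            # e.g. download the file and run OCR on it
        processed += 1
```

With redis-py installed, a client created as redis.Redis(decode_responses=True) should be usable in place of InMemoryQueue, for example drain_queue(client, "ocr:files", handle); without decode_responses the popped values arrive as bytes. Because the producer uses LPUSH and the worker pops from the right, URLs are handled in the order they were pushed.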
Let me know if that answers your question.
Things to keep in mind
You can add as many workers as you want, since this solution is very scalable. Your only bottleneck is the Redis server, which you can run as a cluster and configure to persist your queue in case of a power outage or server crash.
You can replace Redis with RabbitMQ, Beanstalk, Kafka, or any other queuing/messaging system, but Redis was chosen here for its simplicity and the myriad of features it offers out of the box.