how to avoid uploading file that is being partially written - Google storage rsync - google-cloud-storage

I'm using google cloud storage with option rsync
I create a cronjob that sync file every minute.
But there're the problem
Right on a file is being partially written, the cronjob run, then it sync a part of file even though it wasn't done.
Is there the way to settle this problem?

The gsutil rsync command doesn't have any way to check that a file is still being written. You will need to coordinate your writing and rsync'ing jobs such that they operate on disjoint parts of the file tree. For example, you could arrange your writing job to write to directory A while your rsync job rsyncs from directory B, and then switch pointers so your writing job writes to directory B while your rsync job writes to directory A. Another option would be to set up a staging area into which you copy all the files that have been written before running your rsync job. If you put it on the same file system as where they were written you could use hard links so the link operation works quickly (without byte copying).

Related

Google Cloud Task is not writing file to storage

I am working on a GAE app, one of the time consuming task, I have moved to Cloud Tasks. This task is supposed to read files from storage, process it and write final output back to the storage bucket.
Everything goes fine till generating the data, but task hangs before writing the output file back to cloud storage. Task remains continuously in running state without any error logs.
This code runs successfully if I run it as a manual python script.
I wonder if there is a permission thing that is blocking cloud task to write back to storage bucket.
Any lead on this is appreciated.
Edit: Problem appears in the path. output file path is something like: /logger_sh2 (1)/sh2_day4.
It works well for other paths.

How can i identify processed files in Data flow Job

How can I identify processed files in Data flow Job? I am using a wildcard to read files from cloud storage. but every time when the job runs, it re-read all files.
This is a batch Job and following is sample reading TextIO that I am using.
PCollection<String> filePColection = pipeline.apply("Read files from Cloud Storage ", TextIO.read().from("gs://bucketName/TrafficData*.txt"));
To see a list of files that match your wildcard you can use gsutils, which is the Cloud Storage command line utility. You'd do the following:
gsutils ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way to know which files it has analyzed already or not. To avoid analyzing new files you could do either of the following:
Define a Streaming job, and use TextIO's watchForNewFiles functionality. You would have to leave your job to run for as long as you want to keep processing files.
Find a way to provide your pipeline with files that have already been analyzed. For this, every time you run your pipeline you could generate a list of files to analyze, put it into a PCollection, read each with TextIO.readAll(), and store the list of analyzed files somewhere. Later, when you run your pipeline again you can use this list as a blacklist for files that you don't need to run again.
Let me know in the comments if you want to work out a solution around one of these two options.

Spark - Writing to csv results in _temporary file [duplicate]

After a spark program completes, there are 3 temporary directories remain in the temp directory.
The directory names are like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7
The directories are empty.
And when the Spark program runs on Windows, a snappy DLL file also remains in the temp directory.
The file name is like this: snappy-1.0.4.1-6e117df4-97b6-4d69-bf9d-71c4a627940c-snappyjava
They are created every time the Spark program runs. So the number of files and directories keeps growing.
How can let them be deleted?
Spark version is 1.3.1 with Hadoop 2.6.
UPDATE
I've traced the spark source code.
The module methods that create the 3 'temp' directories are as follows:
DiskBlockManager.createLocalDirs
HttpFileServer.initialize
SparkEnv.sparkFilesDir
They (eventually) call Utils.getOrCreateLocalRootDirs and then Utils.createDirectory, which intentionally does NOT mark the directory for automatic deletion.
The comment of createDirectory method says: "The directory is guaranteed to be
newly created, and is not marked for automatic deletion."
I don't know why they are not marked. Is this really intentional?
Three SPARK_WORKER_OPTS exists to support the worker application folder cleanup, copied here for further reference: from Spark Doc
spark.worker.cleanup.enabled, default value is false, Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval, default is 1800, i.e. 30 minutes, Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl, default is 7*24*3600 (7 days), The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
I assume you are using the "local" mode only for testing purposes. I solved this issue by creating a custom temp folder before running a test and then I delete it manually (in my case I use local mode in JUnit so the temp folder is deleted automatically).
You can change the path to the temp folder for Spark by spark.local.dir property.
SparkConf conf = new SparkConf().setMaster("local")
.setAppName("test")
.set("spark.local.dir", "/tmp/spark-temp");
After the test is completed I would delete the /tmp/spark-temp folder manually.
I don't know how to make Spark cleanup those temporary directories, but I was able to prevent the creation of the snappy-XXX files. This can be done in two ways:
Disable compression. Properties: spark.broadcast.compress, spark.shuffle.compress, spark.shuffle.spill.compress. See http://spark.apache.org/docs/1.3.1/configuration.html#compression-and-serialization
Use LZF as a compression codec. Spark uses native libraries for Snappy and lz4. And because of the way JNI works, Spark has to unpack these libraries before using them. LZF seems to be implemented natively in Java.
I'm doing this during development, but for production it is probably better to use compression and have a script to clean up the temp directories.
I do not think cleanup is supported for all scenarios. I would suggest to write a simple windows scheduler to clean up nightly.
You need to call close() on the spark context that you created at the end of the program.
for spark.local.dir, it will only move spark temp files, but the snappy-xxx file will still exists in /tmp dir.
Though didn't find way to make spark automatically clear it, but you can set JAVA option:
JVM_EXTRA_OPTS=" -Dorg.xerial.snappy.tempdir=~/some-other-tmp-dir"
to make it move to another dir, as most system has small /tmp size.

Copy a file from HDFS to a local directory for multiple tasks on a node?

So, basically, I have a read only file (several GBs big, so broadcasting is no option) that must be copied to a local folder on the node, as each task internally runs a program (by using os.system in python or ! operator in scala) that reads from a local file (can't read from HDFS). The problem however is that several tasks would be running on one node. If the file is not already there on the node, it should be copied from the HDFS to a local directory. But how could I have one task get the file from the HDFS, while other tasks wait for it (note that each task would be running in parallel on a node). Which file synchronization mechanism could I use in Spark for that purpose?

"Injecting" configuration files at startup

I have a number of legacy services running which read their configuration files from disk and a separate daemon which updates these files as they change in zookeeper (somewhat similar to confd).
For most of these types of configuration we would love to move to a more environment variable like model, where the config is fixed for the lifetime of the pod. We need to keep the outside config files as the source of truth as services are transitioning from the legacy model to kubernetes, however. I'm curious if there is a clean way to do this in kubernetes.
A simplified version of the current model that we are pursuing is:
Create a docker image which has a utility for fetching config files and writing them to disk ones. Then writes a /donepath/done file.
The main image waits until the done file exists. Then allows the normal service startup to progress.
Use an empty dir volume and volume mounts to get the conf from the helper image into the main image.
I keep seeing instances of this problem where I "just" need to get a couple of files into the docker image at startup (to allow per-env/canary/etc variance), and running all of this machinery each time seems like a burden throw on devs. I'm curious if there is a more simplistic way to do this already in kubernetes or on the horizon.
You can use the ADD command in your Dockerfile. It is used as ADD File /path/in/docker. This will allow you to add a complete file quickly to your container. You need to have the file you want to add to the image in the same directory as the Dockerfile when you build the container. You can also add a tar file this way which will be expanded during the build.
Another option is the ENV command in a your Dockerfile. This adds the data as an environment variable.