GCP Cloud Composer - files in data folder - google-cloud-storage

My DAG generator file generates ~100 DAGs based on a dag_params.json file, which holds information about each DAG: dag_id, schedule, and everything required to execute its tasks properly. The DAG generator and dag_params.json files live in the dags/ folder.
Additionally, there is a dags/ELT/ folder with SQL files, JSON schema files, etc. (~2000 files). Files from the dags/ELT/ folder are mapped to tasks in the DAGs based on dag_params.json:
"market": {
"dag_id": "dag1",
"bucket": "bucket1",
"bq_dataset": "ds1",
"sqls": [
{
"file_name": "query1.sql",
"file_path": "/home/airflow/gcs/dags/ELT/folder/sqls/query1.sql",
"file_path_template": "/folder/sqls/query1.sql"
},
{
...
}
]
}
So, as an example, I use the file_path_template value in a BigQueryInsertJobOperator templated field to execute the query.
Environment: composer-2.0.23-airflow-2.2.5
I would like to move the dags/ELT/ folder to data/ELT/. To make this work in the DAG generator I defined template_searchpath=/home/airflow/gcs/data/ELT/ and everything worked as expected. However, this change seems to generate far more GCS operations (~3 times more), which increases costs. What is the cause of this increase?
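For reference, the generator pattern described above would look roughly like this (a minimal sketch; the start_date, schedule handling, and task ids are placeholders, while template_searchpath and the templated query path follow the setup described in the question):

import json
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Sketch of the generator: one DAG per entry in dag_params.json.
with open("/home/airflow/gcs/dags/dag_params.json") as f:
    dag_params = json.load(f)

for market, params in dag_params.items():
    with DAG(
        dag_id=params["dag_id"],
        schedule_interval=params.get("schedule"),
        start_date=datetime(2022, 1, 1),  # placeholder
        catchup=False,
        # Jinja resolves relative .sql paths against this directory
        # (dags/ELT/ before the move, data/ELT/ after it).
        template_searchpath="/home/airflow/gcs/data/ELT/",
    ) as dag:
        for sql in params["sqls"]:
            BigQueryInsertJobOperator(
                task_id=sql["file_name"].replace(".sql", ""),
                configuration={
                    "query": {
                        # Templated field: the .sql file is looked up on the
                        # template_searchpath and rendered before execution.
                        "query": sql["file_path_template"],
                        "useLegacySql": False,
                    }
                },
            )
    globals()[params["dag_id"]] = dag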
Logs show thousands of these:
The documentation at https://cloud.google.com/composer/docs/composer-2/cloud-storage says there are different types of data synchronization for the dags/ and data/ folders. Based on my understanding, files like table schemas and SQL queries should be placed in data/, but if that increases storage costs by default (operations), maybe it is easier (and cheaper?) to add machines to the environment and keep them in the dags/ folder instead.

Related

HadoopDataSource: Skipping Partition {} as no new files detected @ s3:

So, I have an S3 folder with several subfolders acting as partitions (based on the date of creation). I have a Glue Table for those partitions and can see the data using Athena.
Running a Glue Job and trying to access the Catalog I get the following error:
HadoopDataSource: Skipping Partition {} as no new files detected @ s3:...
The line that gives me problems is the following:
glueContext.getCatalogSource(database = "DB_NAME", tableName = "TABLE_NAME", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame().toDF()
I want to be able to access all the data in those S3 subfolders at any point, as it is updated regularly.
I'm thinking the problem is the Glue Job Bookmark not detecting new files, but this is not running directly as part of a Job but as part of a library used by a Job.
Removing "transformationContext" or changing its value to empty hasn't worked.
So the Hadoop output you are getting is not an error but just a simple log that the partition is empty.
But the partition that is getting logged, {}, seems to be off. Can you check that?
In addition, could you run the job with bookmark disabled, to make sure that this is not the cause of the problem?
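For instance, a one-off run with the bookmark disabled could be triggered like this (a sketch using boto3; the job name and region are placeholders):

import boto3

# Hypothetical sketch: trigger a single run of the Glue job with job
# bookmarks turned off, to rule bookmarks out as the cause.
glue = boto3.client("glue", region_name="eu-west-1")  # region is a placeholder

response = glue.start_job_run(
    JobName="my-glue-job",  # placeholder job name
    Arguments={
        # Glue special parameter that disables job bookmarks for this run only.
        "--job-bookmark-option": "job-bookmark-disable",
    },
)
print(response["JobRunId"])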
I also found this unresolved GitHub issue, maybe you can comment there too, so that the issue gets some attention.

terraform remote backend using postgres

I am planning to use Postgres as the remote backend instead of S3, as the enterprise standard.
terraform {
  backend "pg" {
    conn_str = "postgres://user:pass@db.example.com/schema_name"
  }
}
When we use the Postgres remote backend and run terraform init, we have to provide a schema that is specific to that Terraform folder, as the backend supports only one table and a new record is created per workspace name.
I am stuck now: I have 50 projects, each with 2 tiers maintained in different folders, so we would need to create 100 schemas in Postgres. It is also difficult to handle so many schemas in automated provisioning.
Can we handle this similarly to S3, where we have one bucket for all projects and multiple entries in the same bucket with a different key specified in each Terraform script? Can we use a single schema for all projects, with multiple tables/records based on a key provided in the backend configuration of each Terraform folder?
You can use a single database, and the pg backend will automatically create the specified schema.
Something like this:
terraform {
  backend "pg" {
    conn_str = "postgres://user:pass@db.example.com/terraform_backend"
    schema = "fooapp"
  }
}
This keeps the projects unique, at least. You could append a tier to that, too, or use Terraform Workspaces.
If you specify the config on the command line (i.e. partial configuration), as the backend documentation recommends, it may be easier to set dynamically for your use case:
terraform init \
  -backend-config="conn_str=postgres://user:pass@db.example.com/terraform_backend" \
  -backend-config="schema=fooapp-prod"
This works pretty well in a scenario of mine similar to yours. Each project has a unique schema in a shared database, and no tasks beyond the initial creation/configuration of the database are needed: the backend creates the schema as specified.
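If the initialization is scripted, the partial configuration makes it easy to derive a schema per folder. A rough sketch (project names, tiers, and folder layout are hypothetical):

import subprocess

# Hypothetical sketch: initialize many project/tier folders against one shared
# Postgres backend database, one schema per project-tier, via partial
# backend configuration on the command line.
CONN_STR = "postgres://user:pass@db.example.com/terraform_backend"  # placeholder
PROJECTS = ["fooapp", "barapp"]  # placeholder project list
TIERS = ["dev", "prod"]          # placeholder tiers

for project in PROJECTS:
    for tier in TIERS:
        subprocess.run(
            [
                "terraform", "init", "-reconfigure",
                f"-backend-config=conn_str={CONN_STR}",
                f"-backend-config=schema={project}-{tier}",
            ],
            cwd=f"./{project}/{tier}",  # assumes one folder per project/tier
            check=True,
        )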

How can I identify processed files in a Dataflow job

How can I identify processed files in a Dataflow job? I am using a wildcard to read files from Cloud Storage, but every time the job runs, it re-reads all the files.
This is a batch job, and the following is the sample TextIO read that I am using.
PCollection<String> filePColection = pipeline.apply("Read files from Cloud Storage ", TextIO.read().from("gs://bucketName/TrafficData*.txt"));
To see a list of files that match your wildcard you can use gsutil, the Cloud Storage command line utility. You'd do the following:
gsutil ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way of knowing which files it has already analyzed. To avoid re-analyzing files, you could do either of the following:
Define a Streaming job, and use TextIO's watchForNewFiles functionality. You would have to leave your job to run for as long as you want to keep processing files.
Find a way to provide your pipeline with files that have already been analyzed. For this, every time you run your pipeline you could generate a list of files to analyze, put it into a PCollection, read each with TextIO.readAll(), and store the list of analyzed files somewhere. Later, when you run your pipeline again you can use this list as a blacklist for files that you don't need to run again.
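A rough sketch of this second option, using the Beam Python SDK for brevity (the question uses the Java SDK; the bucket, prefix, and ledger location are placeholders):

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage

# Hypothetical sketch: keep a simple "ledger" of files that were already
# processed and feed only the new ones into ReadAllFromText.
LEDGER_BLOB = "ledgers/processed_files.json"  # placeholder ledger location
BUCKET = "bucketName"                         # placeholder bucket
PREFIX = "TrafficData"                        # approximates TrafficData*.txt

client = storage.Client()
bucket = client.bucket(BUCKET)

# Load the set of files analyzed by previous runs (empty on the first run).
ledger_blob = bucket.blob(LEDGER_BLOB)
processed = set(json.loads(ledger_blob.download_as_text())) if ledger_blob.exists() else set()

# List candidate files and keep only the ones not seen before.
candidates = [
    f"gs://{BUCKET}/{b.name}"
    for b in client.list_blobs(BUCKET, prefix=PREFIX)
    if b.name.endswith(".txt")
]
new_files = [f for f in candidates if f not in processed]

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "New files" >> beam.Create(new_files)
        | "Read files" >> beam.io.ReadAllFromText()
        # ... downstream transforms on the file contents ...
    )

# Record the newly analyzed files so the next run skips them.
ledger_blob.upload_from_string(json.dumps(sorted(processed | set(new_files))))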
Let me know in the comments if you want to work out a solution around one of these two options.

How to avoid uploading a file that is being partially written - Google Storage rsync

I'm using Google Cloud Storage with the rsync option.
I created a cron job that syncs files every minute.
But there is a problem: if a file is being partially written right when the cron job runs, it syncs the partial file even though the write isn't finished.
Is there a way to solve this problem?
The gsutil rsync command doesn't have any way to check that a file is still being written. You will need to coordinate your writing and rsync'ing jobs such that they operate on disjoint parts of the file tree. For example, you could arrange your writing job to write to directory A while your rsync job rsyncs from directory B, and then switch pointers so your writing job writes to directory B while your rsync job writes to directory A. Another option would be to set up a staging area into which you copy all the files that have been written before running your rsync job. If you put it on the same file system as where they were written you could use hard links so the link operation works quickly (without byte copying).
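A rough sketch of the staging-area approach, assuming the writer and the cron job share a filesystem (paths, bucket, and the publish hook are all hypothetical):

import os
import subprocess

STAGING_DIR = "/data/staging"       # hypothetical staging area
BUCKET = "gs://my-bucket/incoming"  # placeholder destination

def publish(path):
    """Called by the writing job only after a file is completely written."""
    os.makedirs(STAGING_DIR, exist_ok=True)
    target = os.path.join(STAGING_DIR, os.path.basename(path))
    if not os.path.exists(target):
        # Hard link: fast, no byte copying, but the source and the staging
        # area must live on the same filesystem.
        os.link(path, target)

def sync():
    """Run from the cron job: only published (complete) files get uploaded."""
    subprocess.run(["gsutil", "-m", "rsync", "-r", STAGING_DIR, BUCKET], check=True)

if __name__ == "__main__":
    sync()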

Spark - Writing to csv results in _temporary file [duplicate]

After a Spark program completes, 3 temporary directories remain in the temp directory.
The directory names are like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7
The directories are empty.
And when the Spark program runs on Windows, a snappy DLL file also remains in the temp directory.
The file name is like this: snappy-1.0.4.1-6e117df4-97b6-4d69-bf9d-71c4a627940c-snappyjava
They are created every time the Spark program runs. So the number of files and directories keeps growing.
How can I get them deleted?
Spark version is 1.3.1 with Hadoop 2.6.
UPDATE
I've traced the spark source code.
The module methods that create the 3 'temp' directories are as follows:
DiskBlockManager.createLocalDirs
HttpFileServer.initialize
SparkEnv.sparkFilesDir
They (eventually) call Utils.getOrCreateLocalRootDirs and then Utils.createDirectory, which intentionally does NOT mark the directory for automatic deletion.
The comment on the createDirectory method says: "The directory is guaranteed to be newly created, and is not marked for automatic deletion."
I don't know why they are not marked. Is this really intentional?
Three SPARK_WORKER_OPTS properties exist to support worker application folder cleanup, copied here from the Spark docs for further reference:
spark.worker.cleanup.enabled (default: false) - Enables periodic cleanup of worker/application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes) - Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl (default: 7*24*3600, i.e. 7 days) - The number of seconds to retain application work directories on each worker. This is a time-to-live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir; over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
I assume you are using "local" mode only for testing purposes. I solved this issue by creating a custom temp folder before running a test and then deleting it manually (in my case I use local mode in JUnit, so the temp folder is deleted automatically).
You can change the path of the temp folder for Spark with the spark.local.dir property.
SparkConf conf = new SparkConf().setMaster("local")
.setAppName("test")
.set("spark.local.dir", "/tmp/spark-temp");
After the test is completed I would delete the /tmp/spark-temp folder manually.
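A rough PySpark equivalent of the same approach, including the manual cleanup step (the temp path is illustrative):

import shutil
from pyspark import SparkConf, SparkContext

TEMP_DIR = "/tmp/spark-temp"  # illustrative scratch location

conf = (SparkConf()
        .setMaster("local")
        .setAppName("test")
        .set("spark.local.dir", TEMP_DIR))  # redirect Spark's scratch files

sc = SparkContext(conf=conf)
try:
    # ... run the test workload here ...
    pass
finally:
    sc.stop()                                    # release the context first
    shutil.rmtree(TEMP_DIR, ignore_errors=True)  # then remove the scratch dir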
I don't know how to make Spark cleanup those temporary directories, but I was able to prevent the creation of the snappy-XXX files. This can be done in two ways:
Disable compression. Properties: spark.broadcast.compress, spark.shuffle.compress, spark.shuffle.spill.compress. See http://spark.apache.org/docs/1.3.1/configuration.html#compression-and-serialization
Use LZF as a compression codec. Spark uses native libraries for Snappy and lz4, and because of the way JNI works, Spark has to unpack these libraries before using them. LZF, by contrast, is implemented in pure Java.
I'm doing this during development, but for production it is probably better to use compression and have a script to clean up the temp directories.
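For illustration, both options come down to a few configuration properties (a sketch; whether to disable compression or switch codecs depends on the workload):

from pyspark import SparkConf, SparkContext

# Option 1: turn off the settings that trigger the native snappy/lz4 codecs.
conf = (SparkConf()
        .setMaster("local")
        .setAppName("no-native-compression")
        .set("spark.broadcast.compress", "false")
        .set("spark.shuffle.compress", "false")
        .set("spark.shuffle.spill.compress", "false"))

# Option 2: keep compression but use the pure-Java LZF codec instead.
# conf.set("spark.io.compression.codec", "lzf")

sc = SparkContext(conf=conf)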
I do not think cleanup is supported for all scenarios. I would suggest writing a simple Windows scheduler task to clean up nightly.
You need to call close() on the Spark context that you created, at the end of the program.
Setting spark.local.dir will only move the Spark temp files; the snappy-xxx file will still exist in the /tmp dir.
Though I didn't find a way to make Spark clear it up automatically, you can set a JVM option:
JVM_EXTRA_OPTS=" -Dorg.xerial.snappy.tempdir=~/some-other-tmp-dir"
to make it go to another dir, as most systems have a small /tmp.