Spark Scala checkpointing cleanup

I am running a spark application in 'local' mode. It's checkpointing correctly to the directory defined in the checkpointFolder config. However, there are two issues that I am seeing that are causing some disk space issues.
1) As we have multiple users running the application, the checkpoint folder on the server is created by the first user who executes it, which causes other users' runs to fail due to OS permission issues. Is there a way to provide a relative path in checkpointFolder, for example checkpointFolder=~/spark/checkpoint?
2) I have used the spark.worker.cleanup.enabled=true config to clean up the checkpoint folder after the run, but I don't see that happening. Is there a way to clean it up from within the app, instead of resorting to a cron job?

Hope the following is sensible:
1) You can create a unique folder for each run, e.g. /tmp/spark_checkpoint_1578032476801.
2a) You can simply delete the folder at the end of the application (see the sketch after the code below).
2b) If you use HDFS for checkpointing, you can use code like this:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Deletes the checkpoint path (and everything under it) on whatever file system it belongs to.
def cleanFS(sc: SparkContext, fsPath: String) = {
  val fs = FileSystem.get(new URI(fsPath), sc.hadoopConfiguration)
  fs.delete(new Path(fsPath), true)
}
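For the local (non-HDFS) case, points 1) and 2a) could look like the minimal sketch below; the /tmp path and the helper names are illustrative, not part of the original answer:
import java.io.File
import org.apache.spark.SparkContext

// Pick a per-run checkpoint directory so concurrent users never collide on permissions.
def createUniqueCheckpointDir(sc: SparkContext): String = {
  val dir = s"/tmp/spark_checkpoint_${System.currentTimeMillis}"
  sc.setCheckpointDir(dir)
  dir
}

// Remove the checkpoint directory (and everything under it) at the end of the application,
// e.g. deleteRecursively(new File(checkpointDir)) right before sc.stop().
def deleteRecursively(f: File): Unit = {
  Option(f.listFiles).toSeq.flatten.foreach(deleteRecursively)
  f.delete()
}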

Check this answer out!
PySpark: fully cleaning checkpoints
I was facing the same issue, and the linked answer solved it for me.

Related

HadoopDataSource: Skipping Partition {} as no new files detected # s3:

So, I have an S3 folder with several subfolders acting as partitions (based on the date of creation). I have a Glue Table for those partitions and can see the data using Athena.
Running a Glue Job and trying to access the Catalog I get the following error:
HadoopDataSource: Skipping Partition {} as no new files detected # s3:...
The line that gives me problems is the following:
glueContext.getCatalogSource(database = "DB_NAME", tableName = "TABLE_NAME", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame().toDF()
At every point I want to be able to access all the data in those S3 subfolders, as the data is updated regularly.
I'm thinking the problem is the Glue Job Bookmark not detecting new files, but this is not running directly as part of a Job but as part of a library used by a Job.
Removing "transformationContext" or changing its value to empty hasn't worked.
The Hadoop output you are getting is not an error; it is just a log message saying that the partition is empty.
But the partition that is getting logged, {}, seems to be off. Can you check that?
In addition, could you run the job with bookmark disabled, to make sure that this is not the cause of the problem?
I also found this unresolved GitHub issue, maybe you can comment there too, so that the issue gets some attention.
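One quick diagnostic (not a fix) is to read the S3 prefix directly with plain Spark, bypassing both the Glue catalog and job bookmarks. This is only a sketch: it assumes a SparkSession named spark is in scope (how you obtain it from your GlueContext depends on your Glue version), that the data is Parquet, and the bucket/prefix are placeholders.
// If this shows the data while the catalog source stays empty,
// bookmarks or catalog partition state are the likely culprit.
val directDf = spark.read.parquet("s3://YOUR_BUCKET/YOUR_PREFIX/")
directDf.show(5)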

Hadoop FileUtils not able to write files on local (Unix) filesystem from Scala

I'm trying to write a file to the local file system using the FileSystem API from org.apache.hadoop.fs. Below is the one-liner, inside a larger Scala program, that should be doing this, but it isn't working:
fs.copyToLocalFile(false, hdfsSourcePath, new Path(newFile.getAbsolutePath), true)
The value of newFile is:
val newFile = new File(s"${localPath}/fileName.dat")
localPath is just a variable containing the full path on local disk.
hdfsSourcePath is the full path on HDFS location.
The job executes properly, but I don't see the files created locally. I'm running it through the Spark engine in cluster mode, which is why I used the overload of copyToLocalFile whose 4th argument is useRawLocalFileSystem, and set it to true. I assumed that by using this we could avoid the files being written on the executor node.
Any ideas?
I used the overload of copyToLocalFile whose 4th argument is useRawLocalFileSystem, and set it to true. I assumed that by using this we could avoid the files being written on the executor node.
I think you got this point wrong. In cluster mode the driver runs on an executor node, and the "local" file system is that executor's file system. useRawLocalFileSystem only prevents checksum files from being written; it does not make the files appear on the machine that submitted the job, which is probably what you expected.
The best you can do is to save files to HDFS and retrieve them explicitly after the job finishes.
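A sketch of that pattern, run on the submitting machine after the job has finished; the NameNode URI and both paths are placeholders:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Runs on the submitting machine, so the destination really is this machine's local disk.
val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
fs.copyToLocalFile(new Path("/tasks/output/fileName.dat"), new Path("/local/path/fileName.dat"))
fs.close()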

How can I execute an s3-dist-cp command within a spark-submit application

I have a jar file that is provided to spark-submit. Within a method in that jar, I'm trying to run:
import sys.process._
s3-dist-cp --src hdfs:///tasks/ --dest s3://<destination-bucket>
I also installed s3-dist-cp on all the slaves as well as the master.
The application starts and succeeds without error, but it does not move the data to S3.
This isn't a proper direct answer to your question, but I've used hadoop distcp (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html) instead and it successfully moved the data. In my tests it's quite slow compared to spark.write.parquet(path), though (when accounting for the time taken by the additional write to HDFS that is required in order to use hadoop distcp). I'm also very interested in the answer to your question; I think s3-dist-cp might be faster given the additional optimizations done by Amazon.
s3-dist-cp is now installed by default on the master node of an EMR cluster.
I was able to run s3-dist-cp from within spark-submit successfully when the Spark application is submitted in "client" mode.
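For reference, a minimal sketch of shelling out to s3-dist-cp from the driver via sys.process, assuming client mode (so the driver runs on the EMR master node where s3-dist-cp is installed); the destination bucket is still a placeholder:
import sys.process._

// Passing the command as a Seq keeps each argument separate (no shell quoting needed).
val exitCode = Seq("s3-dist-cp", "--src", "hdfs:///tasks/", "--dest", "s3://<destination-bucket>").!
require(exitCode == 0, s"s3-dist-cp exited with code $exitCode")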

Spark - Writing to csv results in _temporary file [duplicate]

After a Spark program completes, three temporary directories remain in the temp directory.
The directory names are like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7
The directories are empty.
And when the Spark program runs on Windows, a snappy DLL file also remains in the temp directory.
The file name is like this: snappy-1.0.4.1-6e117df4-97b6-4d69-bf9d-71c4a627940c-snappyjava
They are created every time the Spark program runs. So the number of files and directories keeps growing.
How can I get them deleted?
Spark version is 1.3.1 with Hadoop 2.6.
UPDATE
I've traced the spark source code.
The module methods that create the 3 'temp' directories are as follows:
DiskBlockManager.createLocalDirs
HttpFileServer.initialize
SparkEnv.sparkFilesDir
They (eventually) call Utils.getOrCreateLocalRootDirs and then Utils.createDirectory, which intentionally does NOT mark the directory for automatic deletion.
The comment on the createDirectory method says: "The directory is guaranteed to be newly created, and is not marked for automatic deletion."
I don't know why they are not marked. Is this really intentional?
Three SPARK_WORKER_OPTS settings exist to support worker application folder cleanup, copied here from the Spark docs for further reference (see the spark-env.sh example after the list):
spark.worker.cleanup.enabled (default: false): enables periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes): controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl (default: 7*24*3600, i.e. 7 days): the number of seconds to retain application work directories on each worker. This is a time to live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
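For example, on each standalone worker these can be set through SPARK_WORKER_OPTS in conf/spark-env.sh; the values below simply spell out the defaults quoted above:
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"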
I assume you are using "local" mode only for testing purposes. I solved this issue by creating a custom temp folder before running a test and then deleting it manually (in my case I use local mode in JUnit, so the temp folder is deleted automatically).
You can change the path of Spark's temp folder via the spark.local.dir property:
SparkConf conf = new SparkConf().setMaster("local")
.setAppName("test")
.set("spark.local.dir", "/tmp/spark-temp");
After the test is completed I would delete the /tmp/spark-temp folder manually.
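If you would rather script that last step than delete by hand, a small sketch using Hadoop's FileUtil (already on the classpath of a Spark project) can go into the test teardown; the path simply mirrors the spark.local.dir set above:
import java.io.File
import org.apache.hadoop.fs.FileUtil

// Recursively removes the custom Spark temp dir once the test run is over.
FileUtil.fullyDelete(new File("/tmp/spark-temp"))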
I don't know how to make Spark cleanup those temporary directories, but I was able to prevent the creation of the snappy-XXX files. This can be done in two ways:
Disable compression. Properties: spark.broadcast.compress, spark.shuffle.compress, spark.shuffle.spill.compress. See http://spark.apache.org/docs/1.3.1/configuration.html#compression-and-serialization
Use LZF as the compression codec. Spark uses native libraries for Snappy and LZ4, and because of the way JNI works, Spark has to unpack these libraries before using them. LZF appears to be implemented in pure Java.
I'm doing this during development, but for production it is probably better to use compression and have a script to clean up the temp directories.
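A sketch of what those two options look like in a SparkConf; the property names for disabling compression come from the answer above, while spark.io.compression.codec is my assumption for the codec switch, so check it against your Spark version's configuration docs:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Option 1: disable the compression paths that load the native codecs.
  .set("spark.broadcast.compress", "false")
  .set("spark.shuffle.compress", "false")
  .set("spark.shuffle.spill.compress", "false")
// Option 2 (alternative): keep compression but switch to the pure-Java LZF codec.
//   .set("spark.io.compression.codec", "lzf")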
I do not think cleanup is supported for all scenarios. I would suggest writing a simple Windows scheduled task to clean up nightly.
You need to call stop() (or close()) on the SparkContext that you created, at the end of the program.
As for spark.local.dir, it will only relocate the Spark temp files; the snappy-xxx file will still be created in the /tmp dir.
I didn't find a way to make Spark clear it up automatically, but you can set a JVM option:
JVM_EXTRA_OPTS=" -Dorg.xerial.snappy.tempdir=~/some-other-tmp-dir"
to make it go to another directory, since most systems have a small /tmp.

Spark on YARN: how to send metrics to a Graphite sink?

I am new to Spark and we are running Spark on YARN. I can run my test applications just fine. I am trying to collect the Spark metrics in Graphite. I know what changes to make to the metrics.properties file, but how will my Spark application see this conf file?
/xxx/spark/spark-0.9.0-incubating-bin-hadoop2/bin/spark-class org.apache.spark.deploy.yarn.Client --jar /xxx/spark/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar --addJars "hdfs://host:port/spark/lib/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar" --class org.apache.spark.examples.Test --args yarn-standalone --num-workers 50 --master-memory 1024m --worker-memory 1024m --args "xx"
Where should I be specifying the metrics.properties file?
I made these changes to it:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.Graphite.host=machine.domain.com
*.sink.Graphite.port=2003
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I have found a different solution to the same problem. It looks like Spark can also take these metrics settings from its configuration properties. For example, the following line from metrics.properties:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
Can also be specified as a Spark property with key spark.metrics.conf.*.sink.graphite.class and value org.apache.spark.metrics.sink.GraphiteSink. You just need to prepend spark.metrics.conf. to each key.
I have ended up putting all these settings in the code like this:
import org.apache.spark

// Metrics settings passed programmatically, each key prefixed with spark.metrics.conf.
val sparkConf = new spark.SparkConf()
  .set("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
  .set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
  // etc.
val sc = new spark.SparkContext(sparkConf)
This way I've got the metrics sink set up for both the driver and the executors. I was using Spark 1.6.0.
I struggled with the same thing. I have it working using these flags:
--files=/path/to/metrics.properties --conf spark.metrics.conf=metrics.properties
It's tricky because the --files flag makes your /path/to/metrics.properties file end up in every executor's local disk space as metrics.properties; AFAIK there's no way to specify a more complex directory structure there, or to have two files with the same basename.
Related, I filed SPARK-5152 about letting the spark.metrics.conf file be read from HDFS, but that seems like it would require a fairly invasive change, so I'm not holding my breath on that one.