I am trying to read a file from my local EMR file system. It is there as a file under the folder /emr/myFile.csv. However, I keep getting a FileNotFoundException. Here is the line of code that I use to read it:
val myObj: File = new File("/emr/myFile.csv")
I also added a file://// prefix to the file path because I have seen that work for others, but it did not help. I then tried to read directly from the Hadoop file system, where the file is stored at /emr/CNSMR_ACCNT_BAL/myFile.csv, because I thought the lookup might default to HDFS. However, that also results in a FileNotFoundException. Here is the code for that:
val myObj: File = new File("/emr/CNSMR_ACCNT_BAL/myFile.csv")
How can I read this file into a File?
For your 1st problem:
When you submit a Hadoop job, the application master can be created on any of your worker nodes, or on the master node, depending on your configuration.
If you are using EMR, the application master is by default created on one of your worker (CORE) nodes, not on the master node.
When you say file:///emr/myFile.csv, that file exists on one local file system (I'm assuming that means the master node's). Your program looks for the file on whichever node the application master runs on, and that is evidently not your master node; if it were, you wouldn't get an error.
2nd problem:
When you try to access a file in HDFS using java.io.File, it won't be able to access that file, because java.io.File only sees the local file system.
You need to use the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) to interact with an HDFS file.
Also use the hdfs URI scheme: hdfs://<namenode>:<port>/emr/CNSMR_ACCNT_BAL/myFile.csv.
If your core-site.xml sets fs.defaultFS, you don't need the namenode and port; simply use hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv
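To make the difference between those two forms visible, here is a minimal sketch that uses only java.net.URI from the JDK (no Hadoop classes). The host namenode.example and port 8020 are placeholders I made up for illustration. It only shows how the strings split into scheme, authority, and path; it does not open the file.

```java
import java.net.URI;

public class HdfsUriDemo {
    // Splits a location string into the parts a file system client would look at.
    public static String describe(String location) {
        URI uri = URI.create(location);
        return "scheme=" + uri.getScheme()
                + " authority=" + uri.getAuthority()
                + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // Fully qualified: namenode host and port are spelled out (placeholder values).
        System.out.println(describe("hdfs://namenode.example:8020/emr/CNSMR_ACCNT_BAL/myFile.csv"));
        // No authority: the client has to take the namenode from fs.defaultFS.
        System.out.println(describe("hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv"));
        // No scheme at all: nothing in the string says "HDFS".
        System.out.println(describe("/emr/CNSMR_ACCNT_BAL/myFile.csv"));
    }
}
```

The second form has a scheme but a null authority, which is why it only works when fs.defaultFS supplies the namenode. The third form has neither, so java.io.File treats it as a plain local path.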
So what's the better option when accessing a file in a Hadoop cluster?
The answer depends on your use case, but in most cases putting the file in HDFS is much better, because you don't have to worry about where your application master runs. Every node has access to HDFS.
Hope that resolves your problem.
I'm setting up a distributed cluster for ZooKeeper based on version 3.5.2. Specifically, I'm using the reconfig command to dynamically update the configuration whenever the cluster is rebalanced (e.g. one of the nodes goes down).
What I observe is that the zoo.cfg.dynamic file is not updated even when the reconfig (add/remove) command executes correctly. Is this the expected behavior? Basically, I'm looking for guidance on whether we should also manage the zoo.cfg.dynamic file through a separate script (updating it in lock-step with the reconfig command) or whether we can rely on the reconfig command to do this for us. My preference/expectation is the latter.
Following is the sample command:
reconfig -remove 6 -add server.5=125.23.63.23:1234:1235;1236
From the reconfig documentation:
Dynamic configuration parameters are stored in a separate file on the server (which we call the dynamic configuration file). This file is linked from the static config file using the new dynamicConfigFile keyword.
So in practice I can start with any file name to hold the ensemble list, as long as the dynamicConfigFile keyword points to that file.
When the reconfig command is run, a new dynamic config file (e.g. zoo.cfg.dynamic.00000112) is generated, containing the updated list of servers in the following form (as an example):
server.1=125.23.63.23:2780:2783:participant;2791
server.2=125.23.63.24:2781:2784:participant;2792
server.3=125.23.63.25:2782:2785:participant;2793
The zoo.cfg file is then automatically updated so that the dynamicConfigFile keyword points to the new config file (zoo.cfg.dynamic.00000112). The previous dynamic config file remains in the config directory, but the main config no longer refers to it.
So overall, there is no need to update any file in lock-step with the reconfig command; reconfig takes care of it all. The only extra work to plan for up front is a periodic purge of the old dynamic config files.
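The periodic purge could look roughly like the sketch below. It is my own illustration, not something ZooKeeper ships. It assumes the dynamic files are named zoo.cfg.dynamic.<version> as in the example above, and it reads the dynamicConfigFile keyword from zoo.cfg to decide which file to keep. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

public class DynamicConfigPurge {
    // Assumption: versioned dynamic files look like zoo.cfg.dynamic.00000112.
    private static final Pattern VERSIONED =
            Pattern.compile("zoo\\.cfg\\.dynamic\\.[0-9a-fA-F]+");

    // Returns the file name that dynamicConfigFile in zoo.cfg points to, or null if unset.
    public static String currentDynamicFile(Path zooCfg) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(zooCfg)) {
            props.load(in);
        }
        String value = props.getProperty("dynamicConfigFile");
        return value == null ? null : Paths.get(value.trim()).getFileName().toString();
    }

    // Deletes versioned dynamic files in confDir that zoo.cfg no longer references.
    public static List<String> purge(Path confDir, Path zooCfg) throws IOException {
        List<String> deleted = new ArrayList<>();
        String current = currentDynamicFile(zooCfg);
        if (current == null) {
            return deleted; // nothing is referenced, so do not guess: delete nothing
        }
        List<Path> candidates = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(confDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (VERSIONED.matcher(name).matches() && !name.equals(current)) {
                    candidates.add(entry);
                }
            }
        }
        for (Path stale : candidates) {
            Files.delete(stale);
            deleted.add(stale.getFileName().toString());
        }
        return deleted;
    }
}
```

Files that do not match the versioned pattern are left alone, so anything else ZooKeeper keeps in that directory is not touched. You would run purge from cron or a similar scheduler on each server.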
After a Spark program completes, 3 temporary directories remain in the temp directory.
The directory names are like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7
The directories are empty.
And when the Spark program runs on Windows, a snappy DLL file also remains in the temp directory.
The file name is like this: snappy-1.0.4.1-6e117df4-97b6-4d69-bf9d-71c4a627940c-snappyjava
They are created every time the Spark program runs. So the number of files and directories keeps growing.
How can I get them deleted?
Spark version is 1.3.1 with Hadoop 2.6.
UPDATE
I've traced the spark source code.
The module methods that create the 3 'temp' directories are as follows:
DiskBlockManager.createLocalDirs
HttpFileServer.initialize
SparkEnv.sparkFilesDir
They (eventually) call Utils.getOrCreateLocalRootDirs and then Utils.createDirectory, which intentionally does NOT mark the directory for automatic deletion.
The comment on the createDirectory method says: "The directory is guaranteed to be newly created, and is not marked for automatic deletion."
I don't know why they are not marked. Is this really intentional?
Three SPARK_WORKER_OPTS properties exist to support cleanup of the worker application folders, copied here from the Spark docs for reference:
spark.worker.cleanup.enabled (default: false): Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes): Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl (default: 7*24*3600, i.e. 7 days): The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
I assume you are using the "local" mode only for testing purposes. I solved this issue by creating a custom temp folder before running a test and then I delete it manually (in my case I use local mode in JUnit so the temp folder is deleted automatically).
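You can change the path of the temp folder for Spark with the spark.local.dir property.
import org.apache.spark.SparkConf;

// Point Spark's scratch space at a folder the test controls.
SparkConf conf = new SparkConf().setMaster("local")
    .setAppName("test")
    .set("spark.local.dir", "/tmp/spark-temp");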
After the test is completed I would delete the /tmp/spark-temp folder manually.
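Deleting that folder after the test can be sketched with plain java.nio, for example from a JUnit teardown method. This is a minimal sketch. The class name TempDirCleanup is mine, and /tmp/spark-temp is simply the path from the snippet above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TempDirCleanup {
    // Deletes root and everything below it; does nothing if root does not exist.
    public static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        // Same folder that spark.local.dir was pointed at.
        deleteRecursively(Paths.get("/tmp/spark-temp"));
    }
}
```

Call it only after the Spark context has been stopped, so that nothing is still writing into the folder.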
I don't know how to make Spark clean up those temporary directories, but I was able to prevent the creation of the snappy-XXX files. This can be done in two ways:
Disable compression. Properties: spark.broadcast.compress, spark.shuffle.compress, spark.shuffle.spill.compress. See http://spark.apache.org/docs/1.3.1/configuration.html#compression-and-serialization
Use LZF as the compression codec (spark.io.compression.codec set to lzf). Spark uses native libraries for Snappy and lz4, and because of the way JNI works, Spark has to unpack these libraries before using them. LZF seems to be implemented in pure Java.
I'm doing this during development, but for production it is probably better to use compression and have a script to clean up the temp directories.
I do not think cleanup is supported for all scenarios. I would suggest writing a simple Windows scheduled task to clean up nightly.
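As a rough idea of what such a nightly job could run, here is a sketch in plain Java. The name patterns are taken from the two examples in the question (spark-<uuid> directories and snappy-...-snappyjava files). The one-day minimum age is my own assumption, meant to avoid touching anything a running Spark program still uses. The class name is hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class StaleSparkTempSweeper {
    // Patterns modeled on the names quoted in the question; adjust if yours differ.
    private static final Pattern LEFTOVER = Pattern.compile(
            "spark-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}|snappy-.+-snappyjava.*");

    // Removes matching entries last modified more than minAge ago; returns their names.
    public static List<String> sweep(Path tempDir, Duration minAge) throws IOException {
        Instant cutoff = Instant.now().minus(minAge);
        List<Path> candidates = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(tempDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                boolean oldEnough =
                        Files.getLastModifiedTime(entry).toInstant().isBefore(cutoff);
                if (oldEnough && LEFTOVER.matcher(name).matches()) {
                    candidates.add(entry);
                }
            }
        }
        List<String> removed = new ArrayList<>();
        for (Path entry : candidates) {
            try {
                Files.delete(entry); // works for files and for empty directories
                removed.add(entry.getFileName().toString());
            } catch (IOException e) {
                // Non-empty directory or file still locked by a process: leave it.
            }
        }
        return removed;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        System.out.println("Removed: " + sweep(tmp, Duration.ofDays(1)));
    }
}
```

Since the leftover directories are reported to be empty, a plain delete is enough. Anything that is not empty or is still locked is skipped rather than forced.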
You need to call stop() on the Spark context that you created (or close() on a JavaSparkContext) at the end of the program.
As for spark.local.dir, it only moves the Spark temp files; the snappy-xxx file will still be created in the /tmp dir.
I didn't find a way to make Spark clear it automatically, but you can set this Java option:
JVM_EXTRA_OPTS=" -Dorg.xerial.snappy.tempdir=$HOME/some-other-tmp-dir"
to move it to another directory, since most systems have a small /tmp.
I uploaded a configuration folder for a Solr core to Apache ZooKeeper using zkClient.
When I delete a file in my local configuration and upload it to ZooKeeper again, I can't see the change reflected in the Solr admin page.
Could somebody please explain how to update/delete files in ZooKeeper?
Also, where can I find the physical files in the ZooKeeper folder?
To upload a modified file to ZooKeeper, you need to:
remove the old file from Zookeeper and
upload the new one and
restart the Solr nodes (depending on the change, you could reload the collection instead).
For instance, if you need to update solrconfig.xml, you can:
a) clear the old file in ZooKeeper (otherwise, depending on the client version, you'll get an error):
zkcli.sh --zkhost <ZK_HOST>:<ZK_PORT> -cmd clear /configs/<MY_COLLECTION>/solrconfig.xml
b) upload the updated file:
zkcli.sh --zkhost <ZK_HOST>:<ZK_PORT> -cmd putfile /configs/<MY_COLLECTION>/solrconfig.xml /<MY_UPDATED_FILE_LOCAL_FOLDER>/solrconfig.xml
c) Restart the Solr nodes.
I believe your Solr files should be in /configs/<MY_COLLECTION>.
I am learning zookeeper. I see that the config file relies on a separate config file called "myid."
I know that all installations have the same config file so I guess the myid file lets a particular zookeeper instance know its ID within the system. Why does an installation need to know its own ID? Is there a particular reason, from an architectural standpoint, why this information is split into its own config file?
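The id gives each ZooKeeper server a stable identifier. Because zoo.cfg is identical on every server, the myid file is the one per-server piece: it tells an instance which server.N entry in the shared config refers to itself. This link might be helpful.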
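As a simplified illustration of that lookup (my own sketch, not ZooKeeper's actual startup code), the following takes the lines of a shared config and the content of a myid file and returns the server.N entry that belongs to this instance. The class and method names are hypothetical, and the addresses are sample values.

```java
import java.util.Arrays;
import java.util.List;

public class MyIdLookup {
    // Returns the server.N line whose N equals the id stored in myid, or null if none matches.
    public static String ownEntry(List<String> cfgLines, String myidContent) {
        String key = "server." + myidContent.trim() + "=";
        for (String line : cfgLines) {
            String trimmed = line.trim();
            if (trimmed.startsWith(key)) {
                return trimmed;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The same list would be deployed unchanged to every server.
        List<String> shared = Arrays.asList(
                "server.1=125.23.63.23:2780:2783:participant;2791",
                "server.2=125.23.63.24:2781:2784:participant;2792",
                "server.3=125.23.63.25:2782:2785:participant;2793");
        // Only the myid content differs per machine; here it is "2".
        System.out.println(ownEntry(shared, "2\n"));
    }
}
```

With identical config lines on all three machines, only the myid value decides which address and ports an instance treats as its own.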