spark No such file or directory - scala

Here is my code:
s3.textFile("s3n://data/hadoop/data_log/req/20160618/*")
.map(doMap)
.saveAsTextFile()
Environment: Spark 1.4.1, standalone cluster.
Sometimes (not always, and this is important) it throws this error:
[2016-09-13 03:22:51,545: ERROR/Worker-1] err: java.io.FileNotFoundException:
No such file or directory
's3n://data/hadoop/data_log/req/20160618/hadoop.req.2016061811.log.0.gz'
But when I run
aws s3 ls s3://data/hadoop/data_log/req/20160618/hadoop.req.2016061811.log.0.gz
the file exists.
How can I avoid this problem?

The problem is S3 consistency.
Even though the file is listed, the object is not actually available. Try aws s3 cp on the file and you will get the same exception.
"Read-after-write consistency is only valid for GETS of new objects - LISTS might not contain the new objects until the change is fully propagated."
See: Is listing Amazon S3 objects a strong consistency operation or eventual consistency operation?
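If the failure really is transient, one possible workaround is to retry the job a few times with a growing pause. The sketch below is only that: S3Retry and retryOnMissingFile are hypothetical names, the attempt count and delay are arbitrary, and it assumes the FileNotFoundException reaches the driver unwrapped (if Spark wraps it in another exception, the match would have to inspect the cause instead).

```scala
import java.io.FileNotFoundException

object S3Retry {
  /** Runs `action`; on FileNotFoundException waits `delayMs` and tries again,
    * doubling the delay each time, until `maxAttempts` attempts are used up.
    * The last failure is rethrown unchanged.
    */
  def retryOnMissingFile[T](maxAttempts: Int, delayMs: Long)(action: => T): T = {
    val result: Either[FileNotFoundException, T] =
      try Right(action)
      catch {
        // Only swallow the exception while attempts remain.
        case e: FileNotFoundException if maxAttempts > 1 => Left(e)
      }
    result match {
      case Right(value) => value
      case Left(_) =>
        Thread.sleep(delayMs)
        retryOnMissingFile(maxAttempts - 1, delayMs * 2)(action)
    }
  }
}
```

Usage would be to wrap the whole textFile(...).map(doMap).saveAsTextFile(...) chain in S3Retry.retryOnMissingFile(3, 5000L) { ... }. Note that a failed attempt may leave a partial output directory behind, which would have to be removed before the next attempt.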

Related

CDF pipeline throws error in GCS bucket doesn't exist

While moving data to Sink, I'm getting this error in GCP Data fusion pipeline.
Can someone help?
GCS path cdap-job/dd5d2bba-9cce-11ed-8666-56bac137a1c0 was not cleaned up for bucket gs://df-5999901975431383890-d2icduix5qi63hnabcpaqiyllq due to The specified bucket does not exist..
I tried recreating the temp buckets as they appeared in the log, but the name keeps changing.
I had deleted a few temp buckets from the list, and I suspect that caused this issue.
As of today, every Data Fusion instance is created together with a bucket whose name is similar to the one in your message, i.e. df-5999901975431383890-d2icduix5qi63hnabcpaqiyllq. This bucket is visible in GCS, in the project where you created your instance.
You are probably getting this error because you deleted this bucket from GCS. You can either recreate the bucket or recreate your Data Fusion instance.

Reading a file from local file system after reading it from hadoop file system

I am trying to read a file from my local EMR file system. It is there as a file under the folder /emr/myFile.csv. However, I keep getting a FileNotFoundException. Here is the line of code that I use to read it:
val myObj: File = new File("/emr/myFile.csv")
I added a file://// prefix to the file path as well because I have seen that work for others, but that still did not work. So I also tried to read it directly from the Hadoop file system, where it is stored at /emr/CNSMR_ACCNT_BAL/myFile.csv, because I thought it might be checking HDFS by default. However, that also results in a FileNotFoundException. Here is the code for that:
val myObj: File = new File("/emr/CNSMR_ACCNT_BAL/myFile.csv")
How can I read this file into a File?
For your 1st problem:
When you submit a Hadoop job, the application master can be created on any of your worker nodes, including the master node (depending on your configuration).
If you are using EMR, the application master is by default created on one of your worker nodes (a CORE node), not on the master.
When you say file:///emr/myFile.csv, this file exists on your local file system (I'm assuming that means the master node), but your program will look for it on the node where the application master runs. That node is definitely not your master node, because otherwise you wouldn't get any error.
2nd problem:
When you try to access a file in HDFS using Java's File class, it won't be able to access that file.
You need to use the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) to interact with an HDFS file.
Also use the HDFS URI form hdfs://<namenode>:<port>/emr/CNSMR_ACCNT_BAL/myFile.csv.
If your core-site.xml sets a value for fs.defaultFS, you don't need the namenode and port; simply use hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv
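A minimal sketch of reading the file through the FileSystem API instead of java.io.File, assuming hadoop-client is on the classpath and the cluster configuration is picked up by the default Configuration. HdfsRead and readLines are names made up for this example.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object HdfsRead {
  /** Reads all lines of the file at `uri` (hdfs://, file://, ...) via Hadoop. */
  def readLines(uri: String, conf: Configuration = new Configuration()): List[String] = {
    // The URI scheme decides which FileSystem implementation is returned.
    val fs = FileSystem.get(new URI(uri), conf)
    val in = fs.open(new Path(uri))
    try Source.fromInputStream(in, "UTF-8").getLines().toList
    finally in.close() // close the stream only; the FileSystem object is cached by Hadoop
  }
}
```

With fs.defaultFS set, the call would look like HdfsRead.readLines("hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv").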
So what's the better option for accessing a file in a Hadoop cluster?
The answer depends on your use case, but in most cases putting it in HDFS is much better, because you don't have to worry about where your application master is. Every node has access to HDFS.
Hope that resolves your problem.

Pyspark dataframe write parquet without deleting /_temporary folder

df.write.mode("append").parquet(path)
I'm using this to write parquet files to an S3 location. It seems that in order to write the files, it also creates a /_temporary directory and deletes it after use, so I get an access denied error. The admin of our AWS account doesn't want to grant the code delete permission on that folder.
I proposed writing the files to another folder where delete permission can be granted and then copying the files over, but the admin still wants me to write files directly to the destination folder.
Is there a configuration I can set so that PySpark does not delete the temporary directory?
I don't think there is such an option for the _temporary folder.
But if you're running your Spark job on an EMR cluster, you can write to the cluster's local HDFS first and then copy the data to S3 using the Hadoop FileUtil.copy function.
In PySpark, you can access this function via the JVM gateway like this:
sc._gateway.jvm.org.apache.hadoop.fs.FileUtil
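For reference, here is a rough Scala sketch of the copy step itself; from PySpark the same static method would be reached through the gateway object shown above. StagedCopy, copyTree and the paths in the usage line are hypothetical, and it assumes hadoop-client is on the classpath. FileUtil.copy is the Hadoop method the answer refers to.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

object StagedCopy {
  /** Copies a file or directory tree from `src` to `dst`, which may live on
    * different file systems (for example HDFS staging and an S3 destination).
    * Returns the boolean reported by FileUtil.copy.
    */
  def copyTree(src: String, dst: String, conf: Configuration = new Configuration()): Boolean = {
    val srcPath = new Path(src)
    val dstPath = new Path(dst)
    FileUtil.copy(
      srcPath.getFileSystem(conf), srcPath,
      dstPath.getFileSystem(conf), dstPath,
      false, // deleteSource: keep the staged copy on HDFS
      conf
    )
  }
}
```

A call might look like StagedCopy.copyTree("hdfs:///tmp/staging/out", "s3://my-bucket/dest/out"), with both paths invented for illustration. Whether this avoids the need for delete permission depends on the S3 connector in use, so treat it as something to try rather than a guarantee.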

Hadoop FileUtils not able to write files on local(Unix) filesystem from Scala

I'm trying to write a file to the local file system using the FileSystem library of org.apache.hadoop.fs. Below is the one-liner inside my larger Scala code that should do this, but it doesn't.
fs.copyToLocalFile(false, hdfsSourcePath, new Path(newFile.getAbsolutePath), true)
The value of newFile is:
val newFile = new File(s"${localPath}/fileName.dat")
localPath is just a variable containing the full path on local disk.
hdfsSourcePath is the full path on HDFS location.
The job executes properly, but I don't see the files created on the local disk. I'm running it through the Spark engine in cluster mode, which is why I used the copyToLocalFile overload that takes useRawLocalFileSystem as its fourth argument and set it to true. Using this, we can avoid the files being written on the executor node.
Any ideas?
I used the copyToLocalFile method which overloads the 4th argument of useRawLocalFileSystem and set it to true. Using this, we can avoid getting the files being written on the executor node.
I think you got this point wrong. Cluster mode makes the driver run on an executor node, and the local file system is that executor's file system. useRawLocalFileSystem only prevents writing checksum files; it does not make the files appear on the machine that submits the job, which is probably what you expected.
The best you can do is to save files to HDFS and retrieve them explicitly after the job finishes.

Data access Spark EC2

After following the instructions to install a cluster via the ec2 script, I'm not able to launch my .jar correctly because it can't find the data file, which I put in /root/persistent-hdfs/ on the master and slave nodes.
I read in another post that I need to prefix the file location with file://, but it doesn't change anything. I get this error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file://root/persistent-hdfs/data/ds_1.csv
To launch the job I used ./bin/spark-submit on the master node. Is that correct?
Thank you in advance for your support.
There are a few things you need to do:
1. The default configuration uses the ephemeral HDFS, so you need to turn that off with $ /root/ephemeral-hdfs/bin/stop-all.sh and turn the persistent one on with $ /root/persistent-hdfs/bin/start-all.sh.
2. Put your file into the persistent HDFS root directory for simplicity: $ /root/persistent-hdfs/bin/hadoop fs -put /root/ds_1.csv /ds_1.csv. Now check that it is actually there: $ /root/persistent-hdfs/bin/hadoop fs -ls /.
3. Finally, edit Spark's configuration files /root/spark/conf/spark-defaults.conf and /root/spark/conf/spark-env.sh and change everything that says ephemeral to persistent.
4. Assuming you put your csv in the root directory of the persistent HDFS (as we did in step 2), you can access it in Spark using val rawData = sc.textFile("/ds_1.csv").
Have fun!
Seeing the code of your job would provide more details.
So far it looks like the workers cannot access the file on the driver's local file system.
You need to use the hadoop fs -put or -cp command to upload your file to HDFS, so that the workers will be able to access the file with an hdfs:// URI.
Since you are running your cluster on EC2, I would suggest putting the file in an S3 bucket and using an s3://... file URI.
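A side note on the path in the error message: file://root/persistent-hdfs/data/ds_1.csv has only two slashes after file:, so under generic URI rules "root" is parsed as the authority (host) and not as the first directory. The small check below uses only java.net.URI to show the difference; FileUriCheck is a made-up name, and how a given Hadoop version treats the authority part of a file URI is not something this sketch verifies.

```scala
import java.net.URI

object FileUriCheck {
  /** Returns the authority (if any) and the path component of a URI string. */
  def authorityAndPath(uri: String): (Option[String], String) = {
    val parsed = new URI(uri)
    // getAuthority returns null when the URI has no authority part.
    (Option(parsed.getAuthority), parsed.getPath)
  }
}
```

Comparing FileUriCheck.authorityAndPath on the two-slash form with the three-slash form file:///root/persistent-hdfs/data/ds_1.csv shows that only the latter keeps /root as part of the path.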