PySpark dataframe write parquet without deleting /_temporary folder

df.write.mode("append").parquet(path)
I'm using this to write parquet files to an S3 location. It seems that, in order to write the files, Spark also creates a /_temporary directory and deletes it after use, so I got an access-denied error. The admin on our AWS account doesn't want to grant the code delete permission on that folder.
I proposed writing the files to another folder where delete permission can be granted and then copying them over, but the admin still wants me to write the files directly to the destination folder.
Is there a configuration I can set to tell PySpark not to delete the temporary directory?

I don't think there is such an option for the _temporary folder.
But if you're running your Spark job on an EMR cluster, you can write first to your cluster's local HDFS and then copy the data to S3 using Hadoop's FileUtil.copy function.
In PySpark, you can access this function via the JVM gateway like this:
sc._gateway.jvm.org.apache.hadoop.fs.FileUtil
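For example, here is a minimal sketch of that approach, assuming an EMR cluster with HDFS and a job role that can write (but not delete) under the S3 destination; both paths below are placeholders:

hdfs_path = "hdfs:///tmp/my_table"        # hypothetical HDFS staging location
s3_path = "s3://my-bucket/destination"    # hypothetical S3 destination

# Write to HDFS first, where Spark is free to create and remove /_temporary.
df.write.mode("append").parquet(hdfs_path)

# Then copy the finished files to S3 through the JVM gateway.
jvm = sc._gateway.jvm
conf = sc._jsc.hadoopConfiguration()
src = jvm.org.apache.hadoop.fs.Path(hdfs_path)
dst = jvm.org.apache.hadoop.fs.Path(s3_path)
jvm.org.apache.hadoop.fs.FileUtil.copy(
    src.getFileSystem(conf), src,
    dst.getFileSystem(conf), dst,
    False,  # deleteSource: keep the HDFS copy
    conf)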

Related

Reading a file from local file system after reading it from hadoop file system

I am trying to read a file from my local EMR file system. It exists at the path /emr/myFile.csv. However, I keep getting a FileNotFoundException. Here is the line of code that I use to read it:
val myObj: File = new File("/emr/myFile.csv")
I also added a file://// prefix to the file path because I have seen that work for others, but that still did not work. So I also tried to read directly from the Hadoop file system, where it is stored at /emr/CNSMR_ACCNT_BAL/myFile.csv, because I thought it was perhaps checking HDFS by default. However, that also results in a FileNotFoundException. Here is the code for that:
val myObj: File = new File("/emr/CNSMR_ACCNT_BAL/myFile.csv")
How can I read this file into a File?
For your 1st problem:
When you submit a Hadoop job, the application master can be created on any of your worker nodes, including the master node (depending on your configuration).
If you are using EMR, the application master by default gets created on one of your worker nodes (a CORE node), not on the master.
When you say file:///emr/myFile.csv exists on your local file system (I'm assuming that means on the master node): your program will search for this file on the node where the application master is running, and that is definitely not your master node, because otherwise you wouldn't get any error.
2nd problem:
When you try to access a file in HDFS using java.io.File, it won't be able to access that file.
You need to use the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) to interact with an HDFS file.
Also use the HDFS URI scheme: hdfs://<namenode>:<port>/emr/CNSMR_ACCNT_BAL/myFile.csv.
If your core-site.xml contains a value for fs.defaultFS, then you don't need to put the namenode and port info; simply use hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv.
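If it helps, here is a rough sketch (mine, not from the answer) of reaching the same FileSystem API from PySpark's JVM gateway, as in the first answer above; it assumes fs.defaultFS is set and uses the path from the question:

conf = sc._jsc.hadoopConfiguration()
jvm = sc._gateway.jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)

# Open the HDFS file (scheme resolved via fs.defaultFS from core-site.xml).
stream = fs.open(jvm.org.apache.hadoop.fs.Path("/emr/CNSMR_ACCNT_BAL/myFile.csv"))
reader = jvm.java.io.BufferedReader(jvm.java.io.InputStreamReader(stream))
line = reader.readLine()
while line is not None:
    print(line)
    line = reader.readLine()
reader.close()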
So what's the better option for accessing a file in a Hadoop cluster?
The answer depends on your use case, but in most cases putting the file in HDFS is much better, because you don't have to worry about where your application master is. Every node has access to HDFS.
Hope that resolves your problem.

Is there a way to dump a TSV file from a Storage Bucket to Cloud MySQL in GCP?

Is there a way to dump a TSV file from a Storage Bucket to Cloud MySQL in GCP? I have a large TSV file with 4M rows.
I couldn't convert it into CSV.
As of today, Cloud SQL only supports CSV and SQL. Nonetheless, I suggest that you take a look at this solution. I used Python so that you can automate this process in case you need to do it more than once. I tried to reproduce your issue and wrote a script that basically:
1. Downloads the TSV file from the specified Cloud Storage bucket.
2. Converts the TSV file to a CSV file.
3. Uploads the CSV file to the specified Cloud Storage bucket.
4. Imports the newly added CSV file into Cloud SQL.
You can find the code as well as the requirements for running this script here. Furthermore, take into account that you will need to replace the values enclosed in square brackets, such as [BUCKET_NAME], before running it. Also keep in mind that this script does not delete the downloaded TSV file or the generated CSV file, so you will need to delete them manually, or you can modify the code to delete the files automatically.
Finally, if you would like to investigate further the APIs used in the script, I will attach the needed documentation here & here.
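In case the linked script is not accessible, here is a rough sketch of the same flow (my own, not the original script): it assumes the google-cloud-storage and google-api-python-client libraries, and every bracketed value, object name, and table name is a placeholder to replace.

import csv
from google.cloud import storage
from googleapiclient import discovery

BUCKET = "[BUCKET_NAME]"
TSV_BLOB, CSV_BLOB = "data.tsv", "data.csv"   # hypothetical object names

# 1. Download the TSV file from the bucket.
bucket = storage.Client().bucket(BUCKET)
bucket.blob(TSV_BLOB).download_to_filename("/tmp/data.tsv")

# 2. Convert TSV to CSV.
with open("/tmp/data.tsv", newline="") as tsv_in, \
     open("/tmp/data.csv", "w", newline="") as csv_out:
    writer = csv.writer(csv_out)
    for row in csv.reader(tsv_in, delimiter="\t"):
        writer.writerow(row)

# 3. Upload the CSV back to the bucket.
bucket.blob(CSV_BLOB).upload_from_filename("/tmp/data.csv")

# 4. Import it into Cloud SQL (the Python client exposes `import` as `import_`).
sqladmin = discovery.build("sqladmin", "v1beta4")
body = {
    "importContext": {
        "fileType": "CSV",
        "uri": "gs://{}/{}".format(BUCKET, CSV_BLOB),
        "database": "[DATABASE_NAME]",
        "csvImportOptions": {"table": "[TABLE_NAME]"},
    }
}
sqladmin.instances().import_(
    project="[PROJECT_ID]", instance="[INSTANCE_NAME]", body=body
).execute()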

AWS EMR cluster via CloudFormation: not seeing HDFS on one

I have an EMR cluster which writes to HDFS, and I am able to view the directory and see the files via
hadoop fs -ls /user/hadoop/jobs
I am not seeing the /user/hive or jobs directory in Hadoop, but it's supposed to be there.
I need to get into the spark shell and perform sparql, so I created an identical cluster with the same VPC, security groups, and subnet id.
This is what I am supposed to see.
Why this is happening I am not sure, but I think this might be it. Or any suggestions?
Could this be something to do with a stale rule?

Deleting temp file in a Spark datasource

I am trying to write a Spark datasource package.
My datasource's job is simple:
1. Get the content from the remote system and store it in a temp file.
2. Construct a dataframe using the content present in the temp file.
I was able to do the above, but I want to delete the temp file once the dataframe has been constructed.
Since Spark constructs the dataframe lazily, it can't be deleted from my DatasetRelation, so the option of deleting it from DefaultSource or DatasetRelation is ruled out.
Another option is to add my temp folder to ShutdownHookManager, which would take care of deleting it during Spark shutdown. But unfortunately ShutdownHookManager is private.
Another option is to use the temp directory that Spark itself uses and deletes during shutdown. Spark does create such temp directories, but I can't get their names: Spark creates them with a UUID in the directory name, and there is no environment variable that exposes this temp directory. So I'm not able to use this option either.
Is there any other option to delete my temp file used to construct a dataframe in spark?
Get the content from remote system and store the content in a temp file
You should probably not do this in Spark. If you did it in an external script, you could pass the path from that script to Spark, and Spark would then copy it to the cluster and delete it afterwards.
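To make the lazy-evaluation constraint concrete, here is a rough PySpark sketch of one workaround (mine, not the asker's datasource code; the URL is hypothetical): fetch outside Spark and materialize the rows eagerly on the driver, so the temp file can be deleted immediately afterwards.

import csv
import os
import tempfile
import urllib.request

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Fetch the remote content into a local temp file (outside any datasource code).
fd, tmp_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
urllib.request.urlretrieve("https://example.com/data.csv", tmp_path)

# Load the rows eagerly on the driver (assumes they fit in driver memory),
# so the dataframe no longer refers back to the temp file lazily.
with open(tmp_path, newline="") as f:
    rows = [Row(**r) for r in csv.DictReader(f)]
df = spark.createDataFrame(rows)

# Nothing references the temp file any more; it can be deleted right away.
os.remove(tmp_path)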

Data access Spark EC2

After following the instructions to install a cluster via the ec2 script, I'm not able to launch my .jar correctly because it can't find the data file which I put in /root/persistent-hdfs/ on the master and slave nodes.
I read in another post that I need to prefix the file location with file://, but it doesn't change anything... I get this error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file://root/persistent-hdfs/data/ds_1.csv
To launch the job I used ./bin/spark-submit on the master node; am I correct?
Thank you in advance for your support.
There are a few things you need to do:
1. The default configuration uses the ephemeral HDFS, so you need to turn that off ($ /root/ephemeral-hdfs/bin/stop-all.sh) and turn the persistent one on ($ /root/persistent-hdfs/bin/start-all.sh).
2. Put your file into the persistent HDFS root directory for simplicity: $ /root/persistent-hdfs/bin/hadoop fs -put /root/ds_1.csv /ds_1.csv. Now check that it is actually there: $ /root/persistent-hdfs/bin/hadoop fs -ls.
3. Finally, edit Spark's configuration files in /root/spark/conf/spark-defaults.conf and /root/spark/conf/spark-env.sh and change everything that says ephemeral to persistent.
Assuming you put your csv in the root directory of the persistent HDFS (as we did in step 2), you can access it in Spark using val rawData = sc.textFile("/ds_1.csv").
Have fun!
Seeing the code of your job would provide more details.
So far it looks like the workers cannot access the file on the driver's local file system.
You need to use the hadoop fs -put or -cp command to upload your file to HDFS, so that the workers can access the file with an hdfs:// URI.
Since you are running your cluster on EC2, I would suggest putting the file in an S3 bucket and using an s3://... file URI.
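For instance, a minimal PySpark-flavored sketch of both suggestions (the bucket name and paths are placeholders) reads the file with an explicit scheme instead of file://:

raw_hdfs = sc.textFile("hdfs:///ds_1.csv")        # after hadoop fs -put to the HDFS root
raw_s3 = sc.textFile("s3://my-bucket/ds_1.csv")   # or from an S3 bucket
print(raw_hdfs.count(), raw_s3.count())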