How to save a Dataset[Row] as a text file in Spark? [duplicate] - scala

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Write spark dataframe to single parquet file
(2 answers)
Closed 3 years ago.
I would like to save a Dataset[Row] as a text file, with a specific name, in a specific location.
Can anybody help me?
I have tried this, but it produces a folder (LOCAL_FOLDER_TEMP/filename) with a parquet file inside it:
Dataset.write.save(LOCAL_FOLDER_TEMP + filename)
Thanks

You can't save your Dataset to a specific filename using the Spark API; there are a couple of workarounds for that:
As Vladislav suggested, collect your Dataset, then write it to your filesystem using the Scala/Java/Python APIs.
Apply repartition/coalesce(1), write your Dataset, and then rename the resulting file.
Neither is really recommended, because with large datasets collecting to the driver can cause OOM errors, and both approaches throw away the power of Spark's parallelism.
The second issue, that you are getting a parquet file, is because parquet is Spark's default output format; to get plain text, use:
df.write.format("text").save("/path/to/save")
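Note that the text data source expects a single string column, so if your Dataset[Row] has several columns you will need to collapse them into one first. A minimal sketch, assuming a DataFrame named df (the tab delimiter and output path are just placeholders):

import org.apache.spark.sql.functions.{col, concat_ws}

// Collapse all columns into one delimited string column named "value",
// which is the shape the text data source expects.
val asText = df.select(concat_ws("\t", df.columns.map(col): _*).as("value"))
asText.write.format("text").save("/path/to/save")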

Please use
RDD.saveAsTextFile()
It writes the elements of the dataset as a text file (or set of text files) in a given directory on the local filesystem, HDFS, or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
Reference: rdd-programming-guide
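For example, a minimal sketch that goes through the RDD API (the comma separator and output path are just illustrative choices):

// Convert each Row to a single line of text, then write one part file per partition
dataset.rdd
  .map(_.mkString(","))
  .saveAsTextFile("/path/to/output")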

Spark always creates multiple files - one file per partition. If you want a single file, you need to do collect() and then write it to a file the usual way.
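If the data is small enough to fit on the driver, a minimal sketch of that (the output path is arbitrary):

import java.io.PrintWriter

// collect() pulls everything onto the driver, so this only works for small datasets
val lines = dataset.collect().map(_.mkString(","))
val writer = new PrintWriter("/tmp/output.txt")
try lines.foreach(line => writer.println(line)) finally writer.close()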

Related

how to decompress and read a file containing multiple compressed file in spark

I have a file AA.zip which in turn contains multiple files, for example aa.tar.gz, bb.tar.gz, etc.
I need to read these files in Spark with Scala; how can I achieve that?
The only problem here is extracting the contents of the zip file.
ZIPs on HDFS are going to be a bit tricky because they don't split well, so you'll have to process one or more whole zip files per executor. This is also one of the few cases where you probably have to fall back to SparkContext, because for some reason binary file support in Spark is not that good.
https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext
There's a binaryFiles method there which gives you access to the zip binary data, which you can then process with the usual ZIP handling from Java or Scala.
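A minimal sketch of that approach, assuming each zip entry is small enough to be read into memory on an executor (the path glob is a placeholder):

import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream

// One (path, PortableDataStream) pair per zip file
val zips = sc.binaryFiles("hdfs:///data/*.zip")

val entries = zips.flatMap { case (path, stream: PortableDataStream) =>
  val zis = new ZipInputStream(stream.open())
  try {
    Iterator.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .filterNot(_.isDirectory)
      .map { entry =>
        // Read the current entry fully into a byte array
        val bytes = Iterator.continually(zis.read()).takeWhile(_ != -1).map(_.toByte).toArray
        (entry.getName, bytes)
      }
      .toList // materialize before the stream is closed
  } finally zis.close()
}

In the AA.zip case each resulting byte array is itself a .tar.gz, so it still needs to be unpacked with a tar/gzip library afterwards.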

How to write data as single (normal) csv file in Spark? [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 5 years ago.
I am trying to save a data frame as a CSV file on my local drive. But when I do so, a folder is generated and partition files are written inside it. Is there any suggestion to overcome this?
My Requirement:
To get a normal CSV file with the actual name given in the code.
Code Snippet:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/dataframe.csv")
TL;DR You are trying to enforce sequential, in-core concepts on a distributed environment. It cannot end well.
Spark doesn't provide a utility like this. To be able to create one in a semi-distributed fashion, you'd have to implement a multistep, source-dependent protocol where:
You write header.
You write data files for each partition.
You merge the files, and give a new name.
Since this has limited applications, is useful only for smallish files, and can be very expensive with some sources (like object stores), nothing like this is implemented in Spark.
You can of course collect the data, use a standard CSV library (Univocity, Apache Commons), and then write to the storage of your choice. This is sequential and requires multiple data transfers.
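A minimal sketch of that collect-and-write route using Apache Commons CSV, assuming the collected data fits in driver memory (the output path is arbitrary):

import java.io.FileWriter
import org.apache.commons.csv.{CSVFormat, CSVPrinter}

val writer = new FileWriter("E:/dataframe.csv")
val printer = new CSVPrinter(writer, CSVFormat.DEFAULT.withHeader(dataframe.columns: _*))
// Everything is pulled to the driver here - fine for small data only
dataframe.collect().foreach { row =>
  printer.printRecord(row.toSeq.map(v => if (v == null) "" else v.toString): _*)
}
printer.close()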
There is no automatic way to do this. I see two solutions:
If the local directory is mounted on all the executors: write the file as you did, but then move/rename the part-*.csv file to the desired name.
If the directory is not available on all executors: collect the dataframe to the driver and then create the file using plain Scala.
But both solutions kind of destroy parallelism and thus the goal of Spark.
It is not possible directly, but you can do something like this:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/data/")
import org.apache.hadoop.fs._
// Find the single part file Spark produced and rename it to the desired name
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "E:/data/"
val fileName = fs.globStatus(new Path(filePath + "part*"))(0).getPath.getName
fs.rename(new Path(filePath + fileName), new Path(filePath + "dataframe.csv"))

Load DataFrame as Text File into HDFS and S3 [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 5 years ago.
I am trying to load a DataFrame into HDFS and S3 as a text-format file using the code below. The DataFrame name is finalData.
val targetPath = "/user/test/File"
val now = Calendar.getInstance().getTime()
val formatter = new SimpleDateFormat("yyyyMMddHHmmss")
val timeStampAfterFormatting = formatter.format(now)
val targetFile = s"""$targetPath/test_$timeStampAfterFormatting.txt"""
finalData.repartition(1).rdd.saveAsTextFile(targetFile)
Using the above code I can write the data successfully, but the file name is not the one I provided, and instead of a text file a directory has been created with the name I mentioned.
Directory Name - /user/test/File/test_20170918055206.txt
-bash-4.2$ hdfs dfs -ls /user/test/File/test_20170918055206.txt
Found 2 items
/user/test/File/test_20170918055206.txt/_SUCCESS
/user/test/File/test_20170918055206.txt/part-00000
I want to create a file as I mentioned instead of creating a directory. Can anyone please assist me?
In my opinion, this is working as designed.
You have a repartition operation just before you save your RDD data; this triggers a shuffle across your whole dataset and eventually produces a new RDD with only one partition.
So that single partition is what is stored in HDFS by your saveAsTextFile operation.
The method is designed this way so that an arbitrary number of partitions can be written in a uniform way.
For example, if your RDD has 100 partitions and there is no coalesce or repartition before writing to HDFS, then you will get a directory containing a _SUCCESS flag and 100 part files!
If the method were not designed this way, how could an RDD with multiple partitions be stored in a concise, uniform and elegant way? The user would perhaps have to supply names for all the files, which would be tedious.
I hope this explanation helps you.
If you then need a single complete file on your local file system, just try the Hadoop client command:
hadoop fs -getmerge [src] [des]
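For example, with the directory from the question (the local destination path is just an example):

hadoop fs -getmerge /user/test/File/test_20170918055206.txt /tmp/test_20170918055206.txt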

Reading binary File into Spark [duplicate]

This question already has answers here:
Is it possible to read pdf/audio/video files(unstructured data) using Apache Spark?
(2 answers)
Closed 5 years ago.
I have a set of files that each contain a specific record in Marc21 binary format. I would like to ingest the set of files as an RDD, where each element would be a record object as binary data. Later on I will use a Marc library to convert the objects into Java objects for further processing.
As of now, I am puzzled as to how I can read a binary file.
I have seen the following function:
binaryRecords(path: String, recordLength: Int, conf)
However, it assumes a file with multiple records of the same length. My records will definitely be of different sizes. Besides, each one is in a separate file.
Is there a way to get around that? How can I give a length for each file? Would the only way be to calculate the length of each file and then read the records?
The other solution I see, obviously, would be to read the records in Java and serialize them into whatever format is comfortable to ingest.
Please advise.
Have you tried sc.binaryFiles() from Spark?
Here is the link to the documentation:
https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/SparkContext.html#binaryFiles(java.lang.String,%20int)
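Since each record lives in its own file, a minimal sketch could simply read every file whole (the path glob is a placeholder); the fixed-record-length restriction of binaryRecords does not apply here:

// One (path, PortableDataStream) pair per input file; toArray() reads the whole file
val records = sc.binaryFiles("hdfs:///data/marc21/*")
  .map { case (path, stream) => (path, stream.toArray()) }
// Each Array[Byte] can then be handed to a Marc21 library on the executors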

Reading in multiple files compressed in tar.gz archive into Spark [duplicate]

This question already has answers here:
Read whole text files from a compression in Spark
(2 answers)
Closed 6 years ago.
I'm trying to create a Spark RDD from several json files compressed into a tar.
For example, I have 3 files
file1.json
file2.json
file3.json
And these are contained in archive.tar.gz.
I want to create a dataframe from the json files. The problem is that Spark is not reading in the json files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.
Is there some way to handle gzipped archives containing multiple files in Spark?
UPDATE
Using the method given in the answer to Read whole text files from a compression in Spark I was able to get things running, but this method does not seem to be suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Since some of the archives I'm dealing with reach sizes of up to 2 GB after compression, I'm wondering if there is some efficient way to deal with the problem.
I'm trying to avoid extracting the archives and then merging the files together as this would be time consuming.
A solution is given in Read whole text files from a compression in Spark .
Using the code sample provided, I was able to create a DataFrame from the compressed archive like so:
val jsonRDD = sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)
  .mapValues(_.map(decode()))
val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
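The extractFiles and decode helpers come from the linked answer; a rough sketch of what they might look like, assuming Apache Commons Compress is on the classpath:

import java.io.ByteArrayOutputStream
import java.nio.charset.{Charset, StandardCharsets}
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.spark.input.PortableDataStream
import scala.util.Try

// Unpack every regular entry of a tar.gz into a byte array
def extractFiles(ps: PortableDataStream): Try[Seq[Array[Byte]]] = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open()))
  try {
    Iterator.continually(tar.getNextTarEntry)
      .takeWhile(_ != null)
      .filterNot(_.isDirectory)
      .map { _ =>
        val out = new ByteArrayOutputStream()
        val buffer = new Array[Byte](4096)
        Iterator.continually(tar.read(buffer))
          .takeWhile(_ != -1)
          .foreach(n => out.write(buffer, 0, n))
        out.toByteArray
      }
      .toList // materialize before the stream is closed
  } finally tar.close()
}

// Turn raw bytes back into a (JSON) string
def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]): String =
  new String(bytes, charset)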
This method works fine for tar archives of a relatively small size, but is not suitable for large archive sizes.
A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (as opposed to tar archives).
See: A Million Little Files – Digital Digressions by Stuart Sierra.
Files inside a *.tar.gz archive, as you have already mentioned, are compressed. You cannot put the 3 files into a single compressed tar file and expect the import function (which is looking only for text) to know how to decompress the files, unpack them from the tar archive, and then import each file individually.
I would recommend you take the time to upload each individual json file manually, since sc.textFile and sqlContext.read.json can decompress plain gzip text files but cannot unpack multi-file tar archives.