This question already has answers here:
Is it possible to read pdf/audio/video files (unstructured data) using Apache Spark?
(2 answers)
Closed 5 years ago.
I have a set of files, each containing a single record in the MARC21 binary format. I would like to ingest the set of files as an RDD, where each element would be a record as binary data. Later on I will use a MARC library to convert each record into a Java object for further processing.
As of now, I am puzzled as to how I can read a binary file.
I have seen the following function:
binaryRecords(path: String, recordLength: Int, conf: Configuration)
However, it assumes a file containing multiple records of the same length. My records will definitely be of different sizes, and besides, each one is in a separate file.
Is there a way around that? How can I provide a length for each file? Would the only way be to calculate the length of each file and then read its record?
The other solution I see, obviously, would be to read the records in Java and serialize them into whatever format is comfortable to ingest.
Please advise.
Have you tried sc.binaryFiles() from Spark?
Here is the link to the documentation:
https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/SparkContext.html#binaryFiles(java.lang.String,%20int)
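As a sketch of the idea: each (path, stream) pair from sc.binaryFiles carries the whole file, so the record length never matters. The per-file read is equivalent to this plain-JVM snippet (the file name in the usage note is hypothetical):

```scala
import java.nio.file.{Files, Paths}

// Local equivalent of what sc.binaryFiles gives you per file:
// the entire file as one Array[Byte], i.e. one MARC21 record,
// whatever its length.
def readRecord(path: String): Array[Byte] =
  Files.readAllBytes(Paths.get(path))
```

With Spark, something like sc.binaryFiles("hdfs:///data/marc/").mapValues(_.toArray()) would give the same bytes as an RDD, one element per file.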
This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Write spark dataframe to single parquet file
(2 answers)
Closed 3 years ago.
I would like to save a Dataset[Row] as a text file with a specific name in a specific location.
Can anybody help me?
I have tried this, but it produces a folder (LOCAL_FOLDER_TEMP/filename) with a Parquet file inside it:
dataset.write.save(LOCAL_FOLDER_TEMP + filename)
Thanks
You can't save your dataset to a specific filename using the Spark API; there are several workarounds to do that.
As Vladislav suggested, collect your dataset, then write it to your filesystem using the Scala/Java/Python API.
Apply repartition/coalesce(1), write your dataset, and then change the filename.
Neither is really recommended, because on large datasets it can cause OOM, or simply forfeit the power of Spark's parallelism.
The second issue, that you are getting a Parquet file, is because Parquet is Spark's default format; you should use:
df.write.format("text").save("/path/to/save")
Please use
RDD.saveAsTextFile()
It writes the elements of the dataset as a text file (or set of text files) in a given directory on the local filesystem, HDFS, or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
Reference: the RDD programming guide.
Spark always creates multiple files, one file per partition. If you want a single file, you need to do collect() and then just write it to a file the usual way.
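A minimal sketch of the collect-then-write approach (function and path names are assumptions): after df.collect() you have the rows on the driver and can write them with plain Java IO:

```scala
import java.io.PrintWriter

// Write collected lines to one local file (only safe for small datasets
// that fit in driver memory).
def writeSingleFile(path: String, lines: Seq[String]): Unit = {
  val pw = new PrintWriter(path)
  try lines.foreach(pw.println) finally pw.close()
}
```

For a DataFrame this could be called as writeSingleFile("/tmp/out.txt", df.collect().map(_.mkString(","))), with the path purely illustrative.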
This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 5 years ago.
I am trying to save a data frame as a CSV file on my local drive. But when I do so, a folder is generated, and partition files are written inside it. Is there any suggestion to overcome this?
My Requirement:
To get a normal CSV file with the actual name given in the code.
Code Snippet:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/dataframe.csv")
TL;DR: You are trying to enforce sequential, in-core concepts in a distributed environment. It cannot end well.
Spark doesn't provide a utility like this one. To create one in a semi-distributed fashion, you'd have to implement a multistep, source-dependent protocol where:
You write header.
You write data files for each partition.
You merge the files, and give a new name.
Since this has limited applications, is useful only for smallish files, and can be very expensive with some sources (like object stores), nothing like this is implemented in Spark.
You can of course collect the data, use a standard CSV writer (Univocity, Apache Commons), and then upload the result to the storage of your choice. This is sequential and requires multiple data transfers.
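For that collect-and-write route, a minimal hand-rolled CSV line builder might look like this (a sketch only; a real library such as Univocity or Apache Commons CSV handles many more edge cases):

```scala
// Quote a field only when it contains a comma, quote, or newline;
// embedded quotes are doubled, in the style of RFC 4180.
def toCsvLine(fields: Seq[String]): String =
  fields.map { f =>
    if (f.exists(c => c == ',' || c == '"' || c == '\n'))
      "\"" + f.replace("\"", "\"\"") + "\""
    else f
  }.mkString(",")
```

Each collected Row would then become one line via toCsvLine(row.toSeq.map(String.valueOf)).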
There is no automatic way to do this. I see two solutions:
If the local directory is mounted on all the executors: write the file as you did, but then move/rename the part-*.csv file to the desired name.
Or, if the directory is not available on all executors: collect the dataframe to the driver and then create the file using plain Scala.
But both solutions somewhat destroy parallelism, and thus the goal of Spark.
It is not possible directly, but you can do something like this:
dataframe.coalesce(1).write.mode("overwrite")
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .csv("E:/data/")

import org.apache.hadoop.fs._

// Find the single part file Spark wrote and rename it to the desired name.
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "E:/data/"
val fileName = fs.globStatus(new Path(filePath + "part*"))(0).getPath.getName
fs.rename(new Path(filePath + fileName), new Path(filePath + "dataframe.csv"))
This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 5 years ago.
I am trying to load a DataFrame into HDFS and S3 as a text-format file using the code below. The DataFrame name is finalData.
val targetPath = "/user/test/File"
val now = Calendar.getInstance().getTime()
val formatter = new SimpleDateFormat("yyyyMMddHHmmss")
val timeStampAfterFormatting = formatter.format(now)
val targetFile = s"""$targetPath/test_$timeStampAfterFormatting.txt"""
finalData.repartition(1).rdd.saveAsTextFile(targetFile)
Using the above code I can load the data successfully. But the file name is not the same as I provided, and it is also not in text format. A directory has been created with the name I mentioned.
Directory Name - /user/test/File/test_20170918055206.txt
-bash-4.2$ hdfs dfs -ls /user/test/File/test_20170918055206.txt
Found 2 items
/user/test/File/test_20170918055206.txt/_SUCCESS
/user/test/File/test_20170918055206.txt/part-00000
I want to create the file as I mentioned, instead of creating a directory. Can anyone please assist me?
In my opinion, this is working as designed.
You have a repartition operation just before you save your RDD data; this triggers a shuffle over your whole data and eventually yields a new RDD with only one partition.
So that single partition is what your saveAsTextFile operation stored in HDFS.
The method is designed this way so that an arbitrary number of partitions can be written in a uniform way.
For example, if your RDD has 100 partitions and there is no coalesce or repartition before writing to HDFS, you will get a directory containing a _SUCCESS flag and 100 part files!
If the method were not designed this way, how could an RDD with multiple partitions be stored in a concise, uniform and elegant way? Would the user have to dictate all the file names? That would be tedious.
I hope this explanation helps.
If you then need a single complete file on your local file system, just try the Hadoop client command:
hadoop fs -getmerge [src] [dest]
I am stuck with the following problem. I have around 30,000 JSON files stored in S3 inside a particular bucket. These files are very small; each one is only 400-500 KB, but there are a great many of them.
I want to create DataFrame based on all these files. I am reading JSON files using wildcard as follows:
var df = sqlContext.read.json("s3n://path_to_bucket/*.json")
I also tried this approach since json(...) is deprecated:
var df = sqlContext.read.format("json").load("s3n://path_to_bucket/*.json")
The problem is that it takes a very long time to create df. I waited 4 hours and the Spark job was still running.
Is there any more efficient approach to collect all these JSON files and create a DataFrame based on them?
UPDATE:
Or, at least, is it possible to read the last 1000 files instead of all of them? I found out that one can pass options via sqlContext.read.format("json").options, but I cannot figure out how to read only the N newest files.
If you can get the last 1000 modified file names into a simple list you can simply call:
sqlContext.read.format("json").json(filePathsList: _*)
Please note that the .option call(s) are usually used to configure schema options.
Unfortunately, I haven't used S3 before, but I think you can use the same logic as in the answer to this question to get the last-modified file names:
How do I find the last modified file in a directory in Java?
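For a local directory, the "last modified" logic from that answer can be sketched like this (for S3 you would instead list the object summaries and sort by their last-modified timestamp; the function name here is an assumption):

```scala
import java.io.File

// Return the absolute paths of the n most recently modified files in dir.
def newestFiles(dir: String, n: Int): Seq[String] =
  new File(dir).listFiles().toSeq
    .filter(_.isFile)
    .sortBy(f => -f.lastModified())
    .take(n)
    .map(_.getAbsolutePath)
```

The resulting list could then be passed to the reader, e.g. sqlContext.read.json(newestFiles("/data/json", 1000): _*).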
You are loading something like 13 GB of information. Are you sure it takes that long just to create the DataFrame? Maybe the rest of the application is running while the UI only shows that stage.
Try just loading and printing the first row of the DataFrame.
Anyway, what is the configuration of the cluster?
This question already has answers here:
Read whole text files from a compression in Spark
(2 answers)
Closed 6 years ago.
I'm trying to create a Spark RDD from several json files compressed into a tar.
For example, I have 3 files
file1.json
file2.json
file3.json
And these are contained in archive.tar.gz.
I want to create a dataframe from the json files. The problem is that Spark is not reading in the json files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.
Is there some way to handle gzipped archives containing multiple files in Spark?
UPDATE
Using the method given in the answer to Read whole text files from a compression in Spark I was able to get things running, but this method does not seem to be suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Since some of the archives I'm dealing with reach sizes of up to 2 GB after compression, I'm wondering if there is some efficient way to deal with the problem.
I'm trying to avoid extracting the archives and then merging the files together as this would be time consuming.
A solution is given in Read whole text files from a compression in Spark .
Using the code sample provided, I was able to create a DataFrame from the compressed archive like so:
// extractFiles and decode come from the linked answer.
val jsonRDD = sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)
  .mapValues(_.map(decode()))

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
This method works fine for tar archives of a relatively small size, but is not suitable for large archive sizes.
A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (as opposed to tar archives).
See: A Million Little Files – Digital Digressions by Stuart Sierra.
Files inside a *.tar.gz file, as you have already mentioned, are compressed. You cannot put the 3 files into a single compressed tar file and expect the import function (which expects plain text) to know how to decompress the files, unpack them from the tar archive, and then import each file individually.
I would recommend you take the time to extract and upload each individual JSON file, since sc.textFile and sqlContext.read.json can handle gzip-compressed single files but not tar archives.