Which is the fastest way to read a few lines out of a large hdfs dir using spark? - scala

My goal is to read a few lines out of a large HDFS directory. I'm using Spark 2.2.
The directory was generated by a previous Spark job in which each task wrote a single small file, so the whole directory is about 1 GB and contains thousands of small files.
When I use collect(), head() or limit(), Spark loads all the files and creates thousands of tasks (as seen in the Spark UI), which takes a long time even though I only want to show the first few lines of the files in this directory.
So what is the fastest way to read this directory? Ideally the solution would load only a few lines of data, to save time.
Here is my code:
sparkSession.sqlContext.read.format("csv").option("header","true").option("inferschema","true").load(file).limit(20).toJSON.toString()
sparkSession.sql(s"select * from $file").head(100).toString
sparkSession.sql(s"select * from $file").limit(100).toString

If you point Spark directly at the directory, it will load all the files anyway and only then take the records. So before any Spark logic, you have to get a single file name from the directory yourself, using plain Java, Scala or Python, and pass that file name to the textFile method. That way Spark won't load all the files.
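A minimal sketch of that idea in Scala, assuming the directory path is passed as the first program argument and that the files are CSV with a header, as in the question. The object name `PeekOneFile` and the helper `isDataFile` are hypothetical; the listing uses the Hadoop `FileSystem` API that ships with Spark.

```scala
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.SparkSession

object PeekOneFile {
  // Keep only real data files: Spark also writes markers such as _SUCCESS
  // and hidden checksum files into its output directories.
  def isDataFile(status: FileStatus): Boolean = {
    val name = status.getPath.getName
    status.isFile && status.getLen > 0 && !name.startsWith("_") && !name.startsWith(".")
  }

  def main(args: Array[String]): Unit = {
    val dir = args(0) // assumption: output directory of the previous job
    val spark = SparkSession.builder().appName("peek-one-file").getOrCreate()

    // List the directory on the driver; this is a metadata call and reads no data.
    val dirPath = new Path(dir)
    val fs = dirPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
    val firstFile = fs.listStatus(dirPath).find(isDataFile).map(_.getPath.toString)

    firstFile match {
      case Some(file) =>
        // Only this one file is handed to Spark, so only it is scanned.
        spark.read.option("header", "true").csv(file).limit(20).show(20, false)
      case None =>
        println(s"No data files found in $dir")
    }
    spark.stop()
  }
}
```

Schema inference is left out here on purpose, since it would add another pass over the file; add it back if the column types matter.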

Related

how to decompress and read a file containing multiple compressed file in spark

I have a file AA.zip which in turn contains multiple files, e.g. aa.tar.gz, bb.tar.gz, etc.
I need to read these files in Spark with Scala. How can I achieve that?
The only problem here is extracting the contents of the zip file.
ZIPs on HDFS are going to be a bit tricky because they don't split well, so you'll have to process one or more whole zip files per executor. This is also one of the few cases where you probably have to fall back to SparkContext, because binary file support in the higher-level Spark APIs is not that good.
https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext
There's a binaryFiles method there which gives you access to the zip binary data, which you can then process with the usual ZIP handling from Java or Scala.
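As a rough illustration of the "usual ZIP handling", here is a sketch that reads every entry of a ZIP stream into memory with `java.util.zip`. The object name `ZipEntries` is hypothetical, and holding whole entries in memory is an assumption that only suits archives that fit comfortably on one executor.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.util.zip.ZipInputStream

object ZipEntries {
  // Read every regular entry of a ZIP stream into memory as (entry name, bytes).
  def readAll(in: InputStream): List[(String, Array[Byte])] = {
    val zip = new ZipInputStream(in)
    try {
      Iterator
        .continually(zip.getNextEntry)
        .takeWhile(_ != null)
        .filterNot(_.isDirectory)
        .map { entry =>
          // Copy the current entry; read returns -1 at the end of the entry.
          val out = new ByteArrayOutputStream()
          val buffer = new Array[Byte](8192)
          var n = zip.read(buffer)
          while (n != -1) {
            out.write(buffer, 0, n)
            n = zip.read(buffer)
          }
          (entry.getName, out.toByteArray)
        }
        .toList // forces the lazy iterator while the stream is still open
    } finally zip.close()
  }
}
```

With Spark this could be plugged in as `sc.binaryFiles("hdfs:///path/*.zip").flatMap { case (_, stream) => ZipEntries.readAll(stream.open()) }` (the path is a placeholder). The inner aa.tar.gz, bb.tar.gz entries would still need to be un-gzipped and untarred in a further step.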

How to write data as single (normal) csv file in Spark? [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
I am trying to save a data frame as a CSV file on my local drive. But when I do so, a folder is generated and partition files are written inside it. Is there any way to avoid this?
My requirement:
To get a normal CSV file with the actual name given in the code.
Code snippet:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/dataframe.csv")
TL;DR You are trying to enforce sequential, in-core concepts on a distributed environment. It cannot end well.
Spark doesn't provide a utility like this. To create one in a semi-distributed fashion, you'd have to implement a multistep, source-dependent protocol where:
You write header.
You write data files for each partition.
You merge the files, and give a new name.
Since this has limited applications, is useful only for smallish files, and can be very expensive with some sources (like object stores) nothing like this is implemented in Spark.
You can of course collect the data, use a standard CSV writer (Univocity, Apache Commons CSV) and then put the result in the storage of your choice. This is sequential and requires multiple data transfers.
There is no automatic way to do this. I see two solutions:
If the local directory is mounted on all the executors: write the file as you did, then move/rename the part-*.csv file to the desired name.
Or, if the directory is not available on all executors: collect the dataframe to the driver and then create the file using plain Scala.
But both solutions more or less destroy parallelism, and thus defeat the purpose of Spark.
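The second option (collect to the driver, then write with plain Scala) could look roughly like the sketch below. `SingleCsv` and its methods are hypothetical names, the quoting follows the common RFC 4180 convention, and the whole approach assumes the data fits in driver memory.

```scala
import java.io.PrintWriter
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object SingleCsv {
  // Quote a field when it contains a comma, a quote or a line break.
  def escape(field: String): String =
    if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
      "\"" + field.replace("\"", "\"\"") + "\""
    else field

  // Write a header and rows to exactly one local file, sequentially.
  def write(path: String, header: Seq[String], rows: Iterator[Seq[String]]): Unit = {
    val writer = new PrintWriter(Files.newBufferedWriter(Paths.get(path), StandardCharsets.UTF_8))
    try {
      writer.println(header.map(escape).mkString(","))
      rows.foreach(row => writer.println(row.map(escape).mkString(",")))
    } finally writer.close()
  }
}
```

For a DataFrame named `dataframe`, a call might be `SingleCsv.write("E:/dataframe.csv", dataframe.columns, dataframe.collect().iterator.map(_.toSeq.map(v => if (v == null) "" else v.toString)))`. This runs entirely on the driver, which is exactly the loss of parallelism mentioned above.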
It is not possible directly, but you can do something like this:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/data/")
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "E:/data/"
val fileName = fs.globStatus(new Path(filePath+"part*"))(0).getPath.getName
fs.rename(new Path(filePath+fileName), new Path(filePath+"dataframe.csv"))

"sqlContext.read.json" takes very long time to read 30,000 small JSON files (400 Kb) from S3

I am stuck on the following problem. I have around 30,000 JSON files stored in S3 inside a particular bucket. The files are very small, only 400-500 KB each, but there are a lot of them.
I want to create DataFrame based on all these files. I am reading JSON files using wildcard as follows:
var df = sqlContext.read.json("s3n://path_to_bucket/*.json")
I also tried this approach since json(...) is deprecated:
var df = sqlContext.read.format("json").load("s3n://path_to_bucket/*.json")
The problem is that it takes a very long time to create df. I was waiting 4 hours and the Spark job was still running.
Is there any more efficient approach to collect all these JSON files and create a DataFrame based on them?
UPDATE:
Or at least is it possible to read last 1000 files instead of reading all files? I found out that one can pass options as follows sqlContext.read.format("json").options, however I cannot figure out how to read only N newest files.
If you can get the last 1000 modified file names into a simple list you can simply call:
sqlContext.read.format("json").json(filePathsList: _*)
Please note that the .option call(s) are usually used to configure schema options.
Unfortunately, I haven't used S3 before, but I think you can use the same logic in the answer to this question to get the last modified file names:
How do I find the last modified file in a directory in Java?
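One way to build that list from Scala is through the Hadoop `FileSystem` API, which Spark itself uses for S3 paths. This is a sketch under assumptions: `NewestFiles.newest` is a hypothetical helper, and it relies on the file system reporting a meaningful modification time (for S3 this is normally the upload time).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object NewestFiles {
  // Return the paths of the n most recently modified files matching a glob.
  def newest(glob: String, n: Int, conf: Configuration): Seq[String] = {
    val pattern = new Path(glob)
    val fs = pattern.getFileSystem(conf)
    // globStatus can return null when nothing matches, hence the Option.
    val matches: Seq[FileStatus] =
      Option(fs.globStatus(pattern)).map(_.toSeq).getOrElse(Seq.empty)
    matches
      .filter(_.isFile)
      .sortBy(status => -status.getModificationTime) // newest first
      .take(n)
      .map(_.getPath.toString)
  }
}
```

The result can then be passed on as `sqlContext.read.json(NewestFiles.newest("s3n://path_to_bucket/*.json", 1000, sc.hadoopConfiguration): _*)`. Listing 30,000 objects still takes some time, but it only touches metadata, not file contents.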
You are loading about 13 GB of data. Are you sure it is the creation of the DataFrame alone that takes so long? Maybe the rest of the application is already running and that is what the UI shows.
Try just loading the data and printing the first row of the DataFrame.
Also, what is the configuration of the cluster?

How to Process multiple files in talend one after another and the size of the files are too large?

I want to process multiple large files with Talend, one after another. If another file arrives in the directory while one file is being processed, that file has to be processed as well.
Is there a way to do this? Any suggestions would be appreciated.
You can use tFileList component, which will iterate all the files in a given directory.
You can check the component functionality here
A simple approach would be:
When there is a file in a directory, say Folder1, move that file to another location, say Folder2.
After processing the file in Folder2, check Folder1 again to see whether any new files have arrived.
If so, move the next file to Folder2 and process it.
If there is no new file, end the job.
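Outside Talend, the move/process/re-check loop described above can be sketched in plain Scala as follows. This is only an illustration of the control flow, not a Talend job: `FolderDrain.drain` and its parameters are hypothetical, with `inbox` standing in for Folder1 and `work` for Folder2.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.collection.JavaConverters._

object FolderDrain {
  // Move files one at a time from `inbox` to `work`, process each, then
  // re-check `inbox`; stop when it is empty. Returns the processed file names.
  def drain(inbox: Path, work: Path)(process: Path => Unit): List[String] = {
    // Pick the next regular file in name order, if any.
    def nextFile(): Option[Path] = {
      val stream = Files.list(inbox)
      try {
        stream.iterator().asScala
          .filter(p => Files.isRegularFile(p))
          .toList
          .sortBy(_.getFileName.toString)
          .headOption
      } finally stream.close()
    }

    val processed = List.newBuilder[String]
    var current = nextFile()
    while (current.isDefined) {
      val source = current.get
      val target = work.resolve(source.getFileName)
      Files.move(source, target, StandardCopyOption.REPLACE_EXISTING)
      process(target)
      processed += target.getFileName.toString
      current = nextFile() // also picks up files that arrived during processing
    }
    processed.result()
  }
}
```

A real job would also have to make sure a file is completely written before moving it, which this sketch does not check.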
A great way to do this in Talend is to set up a file watcher job, which is simple to do. Talend provides the tWaitForFile component, which watches a directory for files. You can configure the maximum number of iterations in which it looks for a file and the time between polls/scans. Since you said you are loading large files, leave enough time between scans to avoid DB concurrency issues.
In my example below I am watching a directory for new files, scanning every 60 seconds over an 8 hour period. You would want to schedule the job in either the TAC or whatever scheduling tool you use. In my example I simply join to a tJavaRow and display the information about the file that was found.
you can see the output from my tJavaRow here which shows the file info:

Reading in multiple files compressed in tar.gz archive into Spark [duplicate]

This question already has answers here:
Read whole text files from a compression in Spark
I'm trying to create a Spark RDD from several json files compressed into a tar.
For example, I have 3 files
file1.json
file2.json
file3.json
And these are contained in archive.tar.gz.
I want to create a dataframe from the json files. The problem is that Spark is not reading in the json files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.
Is there some way to handle gzipped archives containing multiple files in Spark?
UPDATE
Using the method given in the answer to Read whole text files from a compression in Spark I was able to get things running, but this method does not seem suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Since some of the archives I'm dealing with reach sizes of up to 2 GB after compression, I'm wondering if there is an efficient way to deal with the problem.
I'm trying to avoid extracting the archives and then merging the files together as this would be time consuming.
A solution is given in Read whole text files from a compression in Spark .
Using the code sample provided, I was able to create a DataFrame from the compressed archive like so:
// extractFiles and decode are the helper functions defined in the linked answer
val jsonRDD = sc.binaryFiles("gzarchive/*").
  flatMapValues(x => extractFiles(x).toOption).
  mapValues(_.map(decode()))
val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
This method works fine for tar archives of a relatively small size, but is not suitable for large archive sizes.
A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (as opposed to tar archives.)
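A possible shape for such a conversion, as a sketch only: it assumes Apache Commons Compress and the Hadoop client libraries are on the classpath, that the archive is a local file, and that each individual file inside the archive fits in memory. `TarToSequenceFile.convert` is a hypothetical helper; keys are the entry names and values are the raw file bytes.

```scala
import java.io.{ByteArrayOutputStream, FileInputStream}
import java.util.zip.GZIPInputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}

object TarToSequenceFile {
  // Stream a local .tar.gz and append each file in it as one
  // (entry name -> file bytes) record of a SequenceFile at `output`.
  def convert(localTarGz: String, output: String, conf: Configuration): Unit = {
    val tar = new TarArchiveInputStream(new GZIPInputStream(new FileInputStream(localTarGz)))
    val writer = SequenceFile.createWriter(
      conf,
      SequenceFile.Writer.file(new Path(output)),
      SequenceFile.Writer.keyClass(classOf[Text]),
      SequenceFile.Writer.valueClass(classOf[BytesWritable]))
    try {
      var entry = tar.getNextTarEntry
      while (entry != null) {
        if (entry.isFile) {
          // Copy the current entry; read returns -1 at the end of the entry.
          val out = new ByteArrayOutputStream()
          val buffer = new Array[Byte](8192)
          var n = tar.read(buffer)
          while (n != -1) {
            out.write(buffer, 0, n)
            n = tar.read(buffer)
          }
          writer.append(new Text(entry.getName), new BytesWritable(out.toByteArray))
        }
        entry = tar.getNextTarEntry
      }
    } finally {
      writer.close()
      tar.close()
    }
  }
}
```

The result could then be read in parallel with something like `sc.sequenceFile(output, classOf[Text], classOf[BytesWritable]).map { case (k, v) => (k.toString, new String(v.copyBytes(), "UTF-8")) }`. The conversion itself is still sequential, so it pays off mainly when the same archives are read more than once.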
See: A Million Little Files – Digital Digressions by Stuart Sierra.
Files inside a *.tar.gz file are, as you have already mentioned, compressed. You cannot put the 3 files into a single compressed tar file and expect the import function (which is looking only for text) to know how to decompress the data, unpack the files from the tar archive, and then import each file individually.
I would recommend taking the time to extract the archive and upload each JSON file individually, since neither sc.textFile nor sqlContext.read.json can unpack a tar archive. They can transparently decompress a plain gzipped text file, but they do not untar it.