"sqlContext.read.json" takes very long time to read 30,000 small JSON files (400 Kb) from S3 - scala

I get stuck with the following problem. I have around 30,000 JSON files stored in S3 inside a particular bucket. These files are very small; each one takes only 400-500 Kb, but their quantity is not so small.
I want to create DataFrame based on all these files. I am reading JSON files using wildcard as follows:
var df = sqlContext.read.json("s3n://path_to_bucket/*.json")
I also tried this approach since json(...) is deprecated:
var df = sqlContext.read.format("json").load("s3n://path_to_bucket/*.json")
The problem is that it takes a very long time to create df. I was waiting 4 hours and the Spark job was still running.
Is there any more efficient approach to collect all these JSON files and create a DataFrame based on them?
UPDATE:
Or at least is it possible to read last 1000 files instead of reading all files? I found out that one can pass options as follows sqlContext.read.format("json").options, however I cannot figure out how to read only N newest files.

If you can get the last 1000 modified file names into a simple list you can simply call:
sqlContext.read.format("json").json(filePathsList: _*)
Please note that the .option call(s) are usually used to configure schema options.
Unfortunately, I haven't used S3 before, but I think you can use the same logic in the answer to this question to get the last modified file names:
How do I find the last modified file in a directory in Java?

You are loading like 13Gb of information. Are you sure that it takes a long time in just to create the DF? Maybe it's running the rest of the application but the UI shows that.
Try just to load and print the first row of the DF.
Anyway, what is the configuration of the cluster?

Related

Which is the fastest way to read a few lines out of a large hdfs dir using spark?

My goal is to read a few lines out of a large hdfs dir, I'm using spark2.2.
This dir is generated by previous spark job and each task generated a single little file in the dir, so the whole dir is like 1GB size and have thousands of little files.
When I use collect() or head() or limit(), spark will load all the files, and creates thousands of tasks(monitoring in sparkUI), which costs a lot of time, even I just want to show the first few lines of the files in this dir.
So which is the fastest way to read this dir? I hope the best solution is only load only a few lines of data so it would save time.
Following is my code:
sparkSession.sqlContext.read.format("csv").option("header","true").option("inferschema","true").load(file).limit(20).toJSON.toString()
sparkSession.sql(s"select * from $file").head(100).toString
sparkSession.sql(s"select * from $file").limit(100).toString
If you directly want to use spark then it will anyways load the files and then it does taking records. So first even before spark logic you have to get one file name from the directory using ur technology like java or scala or python and pass that file name to text File method that won't load all files.

NullPointerException when writing parquet

I am trying to measure how long does it take me to read and write parquet files in Amazon s3 (under a specific partition)
For that I wrote a script that simply reads the files and than write them back:
val df = sqlContext.read.parquet(path + "p1.parquet/partitionBy=partition1")
df.write.mode("overwrite").parquet(path + "p1.parquet/partitionBy=partition1")
However I get a null pointer exception. I tried to add df.count in between, but got the same error.
The reason for the error is that Spark only reads the data when it is going to be used. This results in Spark reading data from the file at the same time as trying to overwrite the file. This causes an issue since data can't be overwritten while reading.
I'd recommend saving to a temporary location as this is for timing purposes. An alternative would be to use .cache() on the data when reading, perform an action to force the read (as well as actually cache the data), and then overwrite the file.

How to write data as single (normal) csv file in Spark? [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 5 years ago.
I am trying to save a data frame as CSV file in my local drive. But, when I do that so, I get a folder generated and within that partition files were written. Is there any suggestion to overcome this ?
My Requirement:
To get a normal csv file with actual name given in the code.
Code Snippet:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/dataframe.csv")
TL:DR You are trying to enforce sequential, in-core concepts on a distribute enviornment. It cannot end up well.
Spark doesn't provide utility like this one. To be able to create one in a semi distributed fashion, you'd have to implement multistep, source dependent protocol where:
You write header.
You write data files for each partition.
You merge the files, and give a new name.
Since this has limited applications, is useful only for smallish files, and can be very expensive with some sources (like object stores) nothing like this is implemented in Spark.
You can of course collect data, use standard CSV parser (Univoicity, Apache Commons) and then put to the storage of your choice. This is sequential and requires multiple data transfers.
There is no automatic way to do this. I see two solutions
If the local directory is mounted on all the executors: Write the file as you did, but then move/rename the part-*csv file to the desired name
Or if the directory is not available on all executors: collect the
dataframe to the driver and then create the file using plain scala
But both solutions kind of destroy parallelism and thus the goal of spark.
It is not possible but you can do somethings like this:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/data/")
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "E:/data/"
val fileName = fs.globStatus(new Path(filePath+"part*"))(0).getPath.getName
fs.rename(new Path(filePath+fileName), new Path(filePath+"dataframe.csv"))

Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is, how can I save each hour's data as a parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists the save throws an error.
If the destination doesn't exist then I am not appending my files.
I've tried using .mode("append") but I find that Spark sometimes fails midway through so I end up loosing how much of my data is written and how much I still need to write.
I am using parquet because the partitioning substantially increases my querying in the future. As well, I must write the data as some file format on disk and cannot use a database such as Druid or Cassandra.
Any suggestions for how to partition my dataframe and save the files (either sticking to parquet or another format) is greatly appreciated.
If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory- and IO-issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata-files will be somewhat time consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.
If you generate many partitions (> 500), I'm afraid the best I can do is suggest to you that you look into a solution not using append-mode - I simply never managed to get partitionBy to work with that many partitions.
If you're using unsorted partitioning your data is going to be split across all of your partitions. That means every task will generate and write data to each of your output files.
Consider repartitioning your data according to your partition columns before writing to have all the data per output file on the same partitions:
data
.filter(validPartnerIds($"partnerID"))
.repartition([optional integer,] "partnerID","year","month","day")
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
See: DataFrame.repartition

How to distribute processing to find waldos in csv using spark scala in a clustered environment?

I have a spark cluster of 1 master, 3 workers. I have a simple, but gigantic CSV file like this:
FirstName, Age
Waldo, 5
Emily, 7
John, 4
Waldo, 9
Amy, 2
Kevin, 4
...
I want to get all the records where FirstName is waldo. Normally on one machine and local mode, if I have an RDD, I can do a ".parallelize()" to get an RDD, then assuming the variable is "mydata", I can do:
mydata.foreach(x => // check if the first row's first column value contains "Waldo", if so, then print it)
From my understanding, using the above method, every spark slave would have to perform that iteration on the entire gigantic csv to get the result instead of each machine processing a third of the file (correct me if I am wrong).
Problem is, if I have 3 machines, how do I make it so that:
The csv file is broken up into 3 different "sets" to each of the
workers so that each worker is working with a much smaller file
(1/3rd of the original)
Each worker processes it, and finds all the "FirstName=Waldo"s"
The resulting list of "Waldo" records are
reported back to me in a way that actually takes advantage of the
cluster.
Mmm, lot of points to make here. First, if you are using HDFS, you file is already partitioned, it is a distributed file system. You probably even have the data replicated 3 times, as that is the default (depends on your config though).
Second, Spark will indeed make use of this partitioning when you told it to load data, and will process chunks locally. Shuffling data around is only required when you want to, for instance, re-partition you data by some criteria, like keys in a key/value pair, etc.
Third, Spark is indeed great for doing batch processing and some datamining if you don't want to structure a database or don't have predefined access patterns. In short, for what you seem to need. You don't even need to write and compile code since you can run a Spark Shell and try with a few lines. I do recommend you to look at the docs, since you don't seem to have a clear grasp of the platform yet.
Fourth, I don't have an IDE or anything here, but the code you need should be something line this (sort of pseudocode, but should be VERY close):
sc
.textFile("my_hdfs_path")
.keyBy(_.split('\t')(0))
.filter(_._1 == "Waldo")
.map(_._2)
.saveAsTextFile("my_hdfs_out")
if not too big, you can also use collect to bring all results to the driver location instead of saving to file, but after that you are back in a single machine. Hope it helps!