Split Spark DataFrame into parts - scala

I have a table of 400,000 distinct users. I would like to split it into 4 parts and expect each user to end up in exactly one part.
Here is my code:
val numPart = 4
val size = 1.0 / numPart
val nsizes = Array.fill(numPart)(size)
val data = userList.randomSplit(nsizes)
I then write each data(i), for i from 0 to 3, to Parquet files. When I read the output directory back, group by user id, and count the parts per user, some users show up in two or more parts.
I still have no idea why.

I have found the solution: cache the DataFrame before you split it.
Should be
val data = userList.cache().randomSplit(nsizes)
I still don't know exactly why it happens. My guess: each time randomSplit "fills" a part, it re-reads the records from userList, which is re-evaluated from the Parquet file(s) and may come back in a different row order; that is why some users are lost and some are duplicated.
If someone has a better answer or explanation, I will update this.
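With the fix in place, the check I describe above can be reproduced along these lines (the column name userId and the /tmp paths are just placeholders):
import org.apache.spark.sql.functions.col

// Cache before splitting so randomSplit works off a stable, materialized dataset.
val parts = userList.cache().randomSplit(Array.fill(4)(0.25))

// Write each part under its own subdirectory, then read everything back
// and count how many parts each user appears in.
parts.zipWithIndex.foreach { case (part, i) =>
  part.write.parquet(s"/tmp/user_parts/part_$i")
}

val overlaps = spark.read.parquet("/tmp/user_parts/part_*")
  .groupBy("userId")
  .count()
  .filter(col("count") > 1)

overlaps.show()  // with the cache in place this should come back empty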
References:
(Why) do we need to call cache or persist on a RDD
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html
http://159.203.217.164/using-sparks-cache-for-correctness-not-just-performance/

If your goal is to split the data into different files, you can use functions.hash to compute a hash of the user id, take it modulo 4 to get a number between 0 and 3, and then use partitionBy on that column when writing the Parquet output, which creates a directory for each of the 4 values.
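A minimal sketch of that idea (the column name userId and the output path are placeholders):
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// pmod keeps the bucket non-negative even when hash() returns a negative value.
val withPart = userList.withColumn("part", pmod(hash(col("userId")), lit(4)))

// partitionBy creates one directory per distinct "part" value (part=0 ... part=3).
withPart.write.partitionBy("part").parquet("<output path>")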

Related

Read file content per row of Spark DataFrame

We have an AWS S3 bucket with millions of documents in a complex hierarchy, and a CSV file with (among other data) links to a subset of those files; I estimate this file will be about 1,000 to 10,000 rows. I need to join the data from the CSV file with the contents of the documents for further processing in Spark. In case it matters, we're using Scala and Spark 2.4.4 on an Amazon EMR 6.0.0 cluster.
I can think of two ways to do this. The first is to add a transformation on the CSV DataFrame that adds the content as a new column:
val df = spark.read.format("csv").load("<csv file>")
val attempt1 = df.withColumn("raw_content", spark.sparkContext.textFile($"document_url"))
This, and variations thereof (for example, wrapping it in a UDF), don't seem to work, I think because sparkContext.textFile returns an RDD, so I'm not sure it's even supposed to work this way. Even if I got it working, would that be a performant approach in Spark?
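For what it's worth, any per-row read would have to happen on the executors, since sparkContext.textFile only exists on the driver; a UDF along those lines would have to open each object itself, e.g. via the Hadoop FileSystem API. A rough, untested sketch of that idea (assuming the cluster's S3 filesystem configuration is visible to the executors):
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.{col, udf}

// Reads the whole object behind each URL on the executor side.
val readContent = udf((url: String) => {
  val fs = FileSystem.get(new URI(url), new Configuration())
  val in = fs.open(new Path(url))
  try scala.io.Source.fromInputStream(in).mkString
  finally in.close()
})

val attempt1b = df.withColumn("raw_content", readContent(col("document_url")))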
An alternative I thought of is to use spark.sparkContext.wholeTextFiles up front and then join the two dataframes together:
val df = spark.read.format("csv").load("<csv file>")
val contents = spark.sparkContext.wholeTextFiles("<s3 bucket>").toDF("document_url", "raw_content");
val attempt2 = df.join(contents, df("document_url") === contents("document_url"), "left")
but wholeTextFiles doesn't descend into subdirectories and the paths I need are hard to predict. I'm also unsure of the performance impact of building an RDD over the entire bucket of millions of files when I only need a small fraction of them, since the S3 API probably doesn't make it very fast to list all the objects in the bucket.
Any ideas? Thanks!
I did figure out a solution in the end:
import spark.implicits._  // needed for .map on the DataFrame and .toDF on the RDD

val df = spark.read.format("csv").load("<csv file>")
val allS3Links = df.map(row => row.getAs[String]("document_url")).collect()
val joined = allS3Links.mkString(",")
val contentsDF = spark.sparkContext.wholeTextFiles(joined).toDF("document_url", "raw_content")
The downside of this solution is that it pulls all the URLs to the driver, but it's workable in my case (100,000 strings of ~100 characters each is not that much) and maybe even unavoidable.
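To finish what the question set out to do, contentsDF can then be joined back to the CSV DataFrame much like attempt 2 above (assuming the keys returned by wholeTextFiles match the document_url values):
val result = df.join(contentsDF, Seq("document_url"), "left")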

Spark: transform dataframe

I work with Spark 1.6.1 in Scala.
I have one dataframe, and I want to create several different dataframes from it while reading the source only once.
For example, the dataframe has two columns, ID and TYPE, and I want to create two dataframes: one with TYPE = A and the other with TYPE = B.
I've checked other posts on Stack Overflow, but only found options that read the data twice.
I would like a solution with the best possible performance.
Kind regards.
Spark will read from the data source multiple times if you perform multiple actions on the data. The way to avoid this is to use cache(): the data is kept in memory after the first action, which makes subsequent actions on it faster.
Your two dataframes can then be created like this, requiring only one read of the data source.
val df = spark.read.csv(path).cache()
val dfA = df.filter($"TYPE" === "A").drop("TYPE")
val dfB = df.filter($"TYPE" === "B").drop("TYPE")
The "TYPE" column is dropped as it should be unnecessary after the separation.

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But neither of the above returns the value of the "name" column.
Spark version: 2.2.0
Scala version: 2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too large it will cause the driver to fail.
So the alternative is to check a few items from the dataframe. What I generally do is
df.limit(10).select("name").as[String].collect()
This will output 10 elements, but the result doesn't look very readable.
So the second alternative is
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if the column values are long, it prints "..." instead of the actual value, which is annoying.
Hence there is a third option:
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
In all of these cases you won't get a fair sample of the data, as the first 10 rows will be picked. To truly pick rows at random from the dataframe you can use
df.select("name").sample(true, 0.2).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on dataframe
The first will do :)
val name = df.select("name") will return another DataFrame. You can, for example, do name.show() to show the content of the DataFrame. You can also do collect or collectAsMap to materialize the results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.

Output Sequence while writing to HDFS using Apache Spark

I am working on a project in Apache Spark, and the requirement is to write the processed output from Spark in a specific order: Header -> Data -> Trailer. For writing to HDFS I am using the .saveAsHadoopFile method and writing to multiple files using the key as the file name. But the issue is that the sequence of the data is not maintained; files are written as Data -> Header -> Trailer or some other combination of the three. Is there anything I am missing about RDD transformations?
OK, so after reading Stack Overflow questions, blogs, and mailing-list archives found via Google, I figured out how exactly .union() and other transformations work and how partitioning is managed. When we use .union(), the resulting RDD loses the partitioner information and also the ordering, and that's why my output sequence was not maintained.
What I did to overcome the issue is to number the records:
Header = 1, Body = 2, Footer = 3
Then I used sortBy on the RDD that is the union of all three, sorting by this order number into a single partition. After that, to write to multiple files using the key as the file name, I used a HashPartitioner so that data with the same key goes into its own file.
val header: RDD[(String,(String,Int))] = ... // this is my header RDD
val data: RDD[(String,(String,Int))] = ... // this is my data RDD
val footer: RDD[(String,(String,Int))] = ... // this is my footer RDD
val finalRDD: RDD[(String,String)] = header.union(data).union(footer).sortBy(x => x._2._2, true, 1).map(x => (x._1, x._2._1))
val output: RDD[(String,String)] = new PairRDDFunctions[String,String](finalRDD).partitionBy(new HashPartitioner(num))
output.saveAsHadoopFile ... // save to multiple files using MultipleTextOutputFormat, with the key as the file name
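For reference, the elided saveAsHadoopFile call can look roughly like the pattern from the "Write to multiple outputs by key Spark - one Spark job" question referenced below; the class name and output path here are placeholders:
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Writes each (key, value) pair into a file named after the key, dropping the key from the output.
class KeyAsFileNameTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

output.saveAsHadoopFile("<output path>", classOf[String], classOf[String],
  classOf[KeyAsFileNameTextOutputFormat])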
This might not be the final or most economical solution, but it worked. I am still trying to find other ways to maintain the output sequence as Header -> Body -> Footer. I also tried calling .coalesce(1) on all three RDDs before the union, but that just adds three more transformations, and .sortBy also takes a partition count, which I expected to amount to the same thing; still, coalescing the RDDs first also worked. If anyone has another approach or can add more to this, please let me know, as I am new to Spark.
References:
Write to multiple outputs by key Spark - one Spark job
Ordered union on spark RDDs
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-2-RDD-s-only-returns-the-first-one-td766.html -- this one helped a lot

How to find a specific record in spark in cluster mode using scala?

I have a simple csv like this, but its got 1 million records:
Name, Age
Waldo, 5
Emily, 7
John, 4
Amy Johns, 2
Kevin, 4
...
I want to find someone with the name "Amy Johns". I have a Spark cluster of 10 machines. Assuming rdd contains the RDD of the CSV, how can I take advantage of the cluster so that I can:
Split up the work so that each of the 10 machines works on 1/10th of the original gigantic set.
Stop as soon as the FIRST occurrence of "Amy Johns" is found and output to the console (e.g. if machine #4 finds "Amy Johns", all the other machines should stop looking and the result is output).
My code right now just does: rdd = sc.textFile
Then it does rdd.foreach( // checks if the field is "Amy Johns"; if so, exits ).
The problem I have with this is that the rdd contains ALL the records (if this is not the case, speak up), so I don't think the work is being distributed.
Also, I don't know how to finish/stop the job once "Amy Johns" is found.
Your RDD does, by definition, contain all of the records. You can split your RDD into multiple partitions to increase the parallelism of the computation. After partitioning, you can apply a transformation on your RDD to filter elements according to some criterion.
You may want to try something like this:
val myRDD = sc.textFile(_inputPath_, 10) // 10 partitions, so the work is spread across the cluster
val filteredRDD = myRDD.filter(line => line.split(",")(0).equals("Amy Johns"))
filteredRDD.take(1).foreach(println) // take(1) scans only as many partitions as needed to find a match
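If the DataFrame API is an option, the same lookup can be sketched with the CSV reader; this is only an illustration and assumes Spark 2.x, the header row shown in the question, and a placeholder input path:
val people = spark.read.option("header", "true").csv("<input path>")
people.filter(people("Name") === "Amy Johns").show(1)  // like take(1), this scans only as many partitions as needed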