Configuration for a Spark job to write 3,000,000 files as output - Scala

I have to generate 3,000,000 files as the output of a Spark job.
I have two input files:
File 1 -> Size = 3.3 (compressed), No. of records = 13,979,835
File 2 -> Size = 1.g (compressed), No. of records = 6,170,229
The Spark job does the following:
Read both files and join them on a common column (column1). -> DataFrame-A
Group the result of DataFrame-A by column2. -> DataFrame-B
From DataFrame-B, use array_join on the aggregated column and separate its values with the '\n' character. -> DataFrame-C
Write the result of DataFrame-C partitioned by column2.
val DF1 = sparkSession.read.json("FILE1") // |ID |isHighway|isRamp|pvId |linkIdx|ffs |length |
val DF2 = sparkSession.read.json("FILE2") // |lId |pid |
val joinExpression = DF1.col("pvId") === DF2.col("lId")
val DFA = DF1.join(DF2, joinExpression, "inner").select(col("ID").as("SCAR"), col("lId"), col("length"), col("ffs"), col("ar"), col("pid")).orderBy("linkIdx")
val DFB = DFA.select(col("SCAR"),concat_ws(",", col("lId"), col("length"),col("ffs"), col("ar"), col("pid")).as("links")).groupBy("SCAR").agg(collect_list("links").as("links"))
val DFC = DFB.select(col("SCAR"), array_join(col("links"), "\n").as("links"))
DFC.write.format("csv").option("quote", "\u0000").partitionBy("SCAR").mode(SaveMode.Append).save("/tmp")
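For reference, a minimal sketch of the same write with an explicit repartition on the partition column (the repartition step is an assumption, not part of the original post); it keeps all rows of one SCAR value in a single task, so each value produces one output file instead of one file per task per value:

// Sketch only: assumes col and SaveMode are imported as in the snippet above.
DFC.repartition(col("SCAR"))   // co-locate all rows for a given SCAR in one task
   .write
   .format("csv")
   .option("quote", "\u0000")
   .partitionBy("SCAR")
   .mode(SaveMode.Append)
   .save("/tmp")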

After running some tests, I got the idea to run this job in batches, like:
query startIdx: 0, endIndex:100000
query startIdx: 100000, endIndex:200000
query startIdx: 200000, endIndex:300000
and so on, until
query startIdx: 2900000, endIndex:3000000
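A minimal sketch of that batching idea (not from the original post; the numeric index column "idx" used for the range filter is an assumption):

// Hypothetical batching loop: write 100,000 ids per pass instead of all 3,000,000 at once.
val batchSize = 100000
val totalIds = 3000000
(0 until totalIds by batchSize).foreach { startIdx =>
  val endIdx = startIdx + batchSize
  DFC.filter(col("idx") >= startIdx && col("idx") < endIdx)   // "idx" is an assumed numeric key
     .write
     .format("csv")
     .option("quote", "\u0000")
     .partitionBy("SCAR")
     .mode(SaveMode.Append)
     .save("/tmp")
}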

Related

Convert Spark Dataframe to Scala Map of key, list of values

I have a dataframe of the form:
Abc | apple
Abc | mango
xyz | grapes
xyz | peach
I want to convert this dataframe into a Scala map of (key, list of values), e.g. (Abc -> (apple, mango), xyz -> (grapes, peach)).
My code:
var testMap = Map[String, List[String]]()   // map accumulated on the driver
concatenatedLogs.collect.map( r => {
  val key = r(0).toString
  val value = r(1).toString
  var currList = testMap.getOrElse(key, List[String]())
  currList = value :: currList
  testMap += (key -> currList)
})
It gives me a Java heap space out-of-memory error. Is there a more efficient and easier way to do this?
Spark is a distributed processing framework for dealing with a lot of data. It processes the data on a cluster, and when you call the collect function, all the data read on the different cores/machines is brought back to the driver. When you do this, you need to make sure you have enough memory on the driver.
What you are doing is highly inefficient: you collect the entire dataframe to the driver and then run the transformations there. Using Spark, you can do the same thing with the code below:
import org.apache.spark.sql.functions.{col, collect_list}
import spark.implicits._   // for toDF and the Dataset encoder (assumes a SparkSession named spark)

val someDF = Seq(
  ("Abc", "apple"),
  ("Abc", "mango"),
  ("xyz", "grapes"),
  ("xyz", "peach")
).toDF("group", "fruit")

val s = someDF
  .groupBy(col("group"))
  .agg(collect_list("fruit").as("fruits"))
  .as[(String, List[String])]
  .collect
  .toMap
The output of this is:
Map(Abc -> List(apple, mango), xyz -> List(grapes, peach))

Spark: filter out columns and create a dataFrame with the remaining columns, and a dataFrame with the filtered columns

I am new to Spark.
I have loaded a CSV file into a Spark DataFrame, say OriginalDF
Now I want to:
1. filter out some columns from it and create a new dataframe from OriginalDF with the remaining columns
2. create a dataFrame out of the extracted columns
How can these two dataframes be created in Spark Scala?
Using select, you can choose the columns you want.
val df2 = OriginalDF.select($"col1",$"col2",$"col3")
Using filter, you should be able to filter the rows.
val df3 = OriginalDF.where($"col1" < 10)
Another way to filter data is using where. filter and where are synonyms, so you can use them interchangeably.
val df3 = OriginalDF.filter($"col1" < 10)
Note that select and filter each return a new dataframe as a result.
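A minimal sketch of the two-dataframe column split the question actually asks for (the column names are assumptions used only for illustration):

import org.apache.spark.sql.functions.col

// Assume OriginalDF has columns col1..col5 and col4, col5 are the ones to extract.
val extractedCols = Seq("col4", "col5")

// dataframe built from the extracted columns only
val extractedDF = OriginalDF.select(extractedCols.map(col): _*)

// dataframe with the remaining columns (everything except the extracted ones)
val remainingDF = OriginalDF.drop(extractedCols: _*)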

Scala Spark: Order changes when writing a DataFrame to a CSV file

I have two dataframes which I am merging using union. After performing the union, printing the final dataframe with df.show() shows the records in the intended order (the first dataframe's records on top, followed by the second dataframe's records). But when I write this final dataframe to a CSV file, the first dataframe's records, which I want at the top of the CSV file, lose their position and get mixed with the second dataframe's records. Any help would be appreciated.
Below is the code sample:
val intVar = 1
val myList = List(("hello",intVar))
val firstDf = myList.toDF()
val secondDf: DataFrame = testRdd.toDF()
val finalDF = firstDf.union(secondDf)
finalDF.show() // prints the dataframe with firstDf records on the top followed by the secondDf records
val outputFilePath = "/home/out.csv"
finalDF.coalesce(1).write.csv(outputFilePath) //the first Df records are getting mixed with the second Df records.
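A minimal sketch (not from the original thread) of one common workaround: make the intended order explicit with a helper column (the name "sourceOrder" is an assumption) and sort on it before writing, since Spark does not otherwise guarantee row order across partitions:

import org.apache.spark.sql.functions.lit

// Tag each input with an explicit ordering key, union, sort, then drop the helper column.
val orderedDF = firstDf.withColumn("sourceOrder", lit(1))
  .union(secondDf.withColumn("sourceOrder", lit(2)))
  .orderBy("sourceOrder")
  .drop("sourceOrder")

orderedDF.coalesce(1).write.csv(outputFilePath)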

Compare 2 dataframes and filter results based on date column in spark

I have 2 dataframes in spark as mentioned below.
val test = hivecontext.sql("select max(test_dt) as test_dt from abc");
test: org.apache.spark.sql.DataFrame = [test_dt: string]
val test1 = hivecontext.table("testing");
where test1 has columns like id, name, age, audit_dt.
I want to compare these 2 dataframes and filter rows from test1 where audit_dt > test_dt. Somehow I am not able to do that. I am able to compare audit_dt with a literal date using the lit function, as shown below, but I am not able to compare it with a column of another dataframe.
val output = test1.filter(to_date(test1("audit_date")).gt(lit("2017-03-23")))
Max Date in test dataframe is -> 2017-04-26
Data in test1 Dataframe ->
Id,Name,Age,Audit_Dt
1,Rahul,23,2017-04-26
2,Ankit,25,2017-04-26
3,Pradeep,28,2017-04-27
I just need the data for Id=3, since only that row satisfies the greater-than criterion against the max date.
I have already tried the option below, but it is not working.
val test = hivecontext.sql("select max(test_dt) as test_dt from abc")
val MAX_AUDIT_DT = test.first().toString()
val output = test.filter(to_date(test("audit_date")).gt((lit(MAX_AUDIT_DT))))
Can anyone suggest a way to compare it with the column of dataframe test?
Thanks
You can use a non-equi join, if both columns "test_dt" and "Audit_Dt" are of date type.
/// cast to correct type
import org.apache.spark.sql.functions.to_date
val new_test = test.withColumn("test_dt",to_date($"test_dt"))
val new_test1 = test1.withColumn("Audit_Dt", to_date($"Audit_Dt"))
/// join
new_test1.join(new_test, $"Audit_Dt" > $"test_dt")
.drop("test_dt").show()
+---+-------+---+----------+
| Id| Name|Age| Audit_Dt|
+---+-------+---+----------+
| 3|Pradeep| 28|2017-04-27|
+---+-------+---+----------+
Data
val test1 = sc.parallelize(Seq((1,"Rahul",23,"2017-04-26"),(2,"Ankit",25,"2017-04-26"),
(3,"Pradeep",28,"2017-04-27"))).toDF("Id","Name", "Age", "Audit_Dt")
val test = sc.parallelize(Seq(("2017-04-26"))).toDF("test_dt")
Try with this:
test1.filter(to_date(test1("audit_date")).gt(to_date(test("test_dt"))))
Store the value in a variable and use it in the filter.
val dtValue = test.select("test_dt")
OR
val dtValue = test.first().getString(0)
Now apply the filter:
val output = test1.filter(to_date(test1("audit_date")).gt(lit(dtValue)))

How to divide dataset in two parts based on filter in Spark-scala

Is it possible to divide a DF in two parts using a single filter operation? For example,
let's say df has the records below:
UID Col
1 a
2 b
3 c
if I do
val df1 = df.filter($"UID" <=> 2)
can I save the filtered and non-filtered records in different RDDs in a single operation?
df1 can have records where uid = 2
df2 can have records with uid 1 and 3
If you're interested only in saving the data, you can add an indicator column to the DataFrame:
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)
and use it as a partition column for the DataFrameWriter with one of the supported formats (as of 1.6: Parquet, text, and JSON):
dfWithInd.write.partitionBy("ind").parquet(...)
It will create two separate directories (ind=false, ind=true) on write.
In general though, it is not possible to yield multiple RDDs or DataFrames from a single transformation. See "How to split an RDD into two or more RDDs?"
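If two separate DataFrames (rather than two output directories) are what is needed, a minimal sketch is simply two complementary filters over the same input; this is two transformations rather than one, but both are lazy and cheap to define:

// Two complementary filters over the same DataFrame. Materializing both means
// reading the data twice (no single-pass split). Assumes spark.implicits._ is
// imported for the $ syntax, as in the snippet above.
val matchDF = df.filter($"uid" <=> 2)
val nonMatchDF = df.filter(!($"uid" <=> 2))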