DataFrame persist does not improve performance in Spark - scala

I am writing a Scala script that reads from a table, transforms the data and shows the result using Spark. I am using Spark 2.1.1.2 and Scala 2.11.8. There is a dataframe I use twice in the script (df2 in the code below). Since dataframes are computed when an action is called on them, not when they are declared, I expect this dataframe to be computed twice. I thought that persisting it would improve performance, since it would then be computed once (when persisted) instead of twice.
However, the script runs ~10 seconds longer when I persist compared to when I don't. I cannot figure out what the reason for this is. If someone has an idea, it would be much appreciated.
My submission command line is below:
spark-submit --class TestQuery --master yarn --driver-memory 10G --executor-memory 10G --executor-cores 2 --num-executors 4 /home/bcp_data/test/target/TestQuery-1.0-SNAPSHOT.jar
Scala script is below:
val spark = SparkSession
.builder()
.appName("TestQuery")
.config("spark.sql.warehouse.dir", "file:/tmp/hsperfdata_hdfs/spark-warehouse/")
.enableHiveSupport()
.getOrCreate()
val m = spark.sql("select id, startdate, enddate, status from members")
val l = spark.sql("select mid, no, status, potential from log")
val r = spark.sql("select mid, code from records")
val df1 = m.filter(($"status".isin(1,2).and($"startdate" <= one_year_ago)).and((($"enddate" >= one_year_ago)))
val df2 = df1.select($"id", $"code").join(l, "mid").filter(($"status".equalTo(1)).and($"potential".notEqual(9))).select($"no", $"id", $"code")
df2.persist
val df3 = df2.join(r, df2("id").equalTo(r("mid"))).filter($"code".isin("0001","0010","0015","0003","0012","0014","0032","0033")).groupBy($"code").agg(countDistinct($"no"))
val fa = spark.sql("select mid, acode from actions")
val fc = spark.sql("select dcode, fcode from params.codes")
val df5 = fa.join(fc, fa("acode").startsWith(fc("dcode")), "left_outer").select($"mid", $"fcode")
val df6 = df2.join(df5, df2("id").equalTo(df5("mid"))).groupBy($"code", $"fcode")
println("count1: " + df3.count + " count2: " + df6.count)

Using caching is the right choice here, but your statement
df2.persist
has no effect because you do not use the returned dataframe. Just do:
val df2 = df1.select($"id", $"code")
.join(l, "mid")
.filter(($"status".equalTo(1)).and($"potential".notEqual(9)))
.select($"no", $"id", $"code")
.persist
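One way to sanity-check this (a minimal sketch, assuming Spark 2.1+ where Dataset.storageLevel is available): confirm the dataframe is actually marked for caching, and keep in mind the cache is only filled by the first action that touches df2, so that first action pays the materialization cost.
// A non-default storage level means the dataframe is marked for caching
println(df2.storageLevel)   // e.g. StorageLevel(disk, memory, deserialized, 1 replicas)
// The first action over df2 populates the cache; later actions reuse it
df2.count()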

Related

how to increase performance on Spark distinct() on multiple columns

Could you please suggest an alternative way of implementing distinct on a Spark dataframe?
I tried both SQL and Spark distinct(), but because of the dataset size (>2 billion rows) it fails on the shuffle.
If I increase the nodes and memory to >250 GB, the process runs for a long time (more than 7 hours).
import org.apache.spark.sql.functions.{col, expr}
import spark.implicits._

val df = spark.read.parquet(out)
val df1 = df
  .select($"ID", $"col2", $"suffix", $"date", $"year", $"codes")
  .distinct()
// Pair each array element with its position (transform with an index lambda needs Spark 2.4+),
// so the SQL below can read d.c (the code) and d.s (its position)
val df2 = df1.withColumn("codes", expr("transform(codes, (c, s) -> named_struct('c', c, 's', s))"))
df2.createOrReplaceTempView("df2")
val df3 = spark.sql(
  """SELECT
       ID, col2, suffix,
       d.s AS seq,
       d.c AS code,
       year, date
     FROM df2
     LATERAL VIEW explode(codes) exploded_table AS d
  """)
df3
  .repartition(600, List(col("year"), col("date")): _*)
  .write
  .mode("overwrite")
  .partitionBy("year", "date")
  .save(OutDir)
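One common mitigation for distinct() at this scale (a sketch only, reusing the column names from the question; the partition count is a placeholder to tune) is to give the deduplication shuffle more partitions and express it as dropDuplicates on the projected columns, which behaves the same as select(...).distinct() on that set:
// More shuffle partitions means smaller per-task shuffle blocks during the dedup
spark.conf.set("spark.sql.shuffle.partitions", "2000")   // placeholder value, tune for the cluster
// dropDuplicates over the projected columns is equivalent to distinct() on them
val deduped = df
  .select($"ID", $"col2", $"suffix", $"date", $"year", $"codes")
  .dropDuplicates("ID", "col2", "suffix", "date", "year", "codes")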

Spark SQL join really lazy?

I'm performing an inner join between, say, 8 dataframes, all coming from the same parent. Sample code:
// read parquet
val readDF = session.read.parquet(...)
// multiple expensive transformations are performed over readDF, making its DAG grow
// repartition + cache
val df = readDF.repartition($"type").cache
val df1 = df.filter($"type" === 1)
val df2 = df.filter($"type" === 2)
val df3 = df.filter($"type" === 3)
val df4 = df.filter($"type" === 4)
val df5 = df.filter($"type" === 5)
val df6 = df.filter($"type" === 6)
val df7 = df.filter($"type" === 7)
val df8 = df.filter($"type" === 8)
val joinColumns = Seq("col1", "col2", "col3", "col4")
val joinDF = df1
.join(df2, joinColumns)
.join(df3, joinColumns)
.join(df4, joinColumns)
.join(df5, joinColumns)
.join(df6, joinColumns)
.join(df7, joinColumns)
.join(df8, joinColumns)
Unexpectedly, the joinDF statement is taking a long time. Join is supposed to be a transformation, not an action.
Do you know what's happening? Is this a use case for checkpointing?
Notes:
- joinDF.explain shows a long DAG lineage.
- using Spark 2.3.0 with Scala
RDD join and Spark SQL join are known as transformations. I ran this with no issue in a Databricks notebook, but I am not privy to the "// multiple expensive transformations are performed over readDF, making its DAG grow" part. Maybe there is an action in there.
Indeed, checkpointing seems to fix the long-running join. It now behaves as a transformation, returning faster. So I conclude that the delay was related to the large DAG lineage.
Also, the subsequent actions are now faster.
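For reference, a minimal sketch of the checkpoint variant (assuming a checkpoint directory on storage all executors can reach; the path below is a placeholder). checkpoint() materializes the data and truncates the lineage, so the planner no longer re-analyzes the whole upstream DAG when the joins are declared:
// Checkpoint files must live on shared storage (HDFS, S3, ...)
session.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
// Eager by default: runs a job, writes the data, and cuts the lineage
val df = readDF.repartition($"type").checkpoint()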

I can't fit the FP-Growth model in spark

Please, can you help me? I have a dataset of 80 CSV files and a cluster of one master and 4 slaves. I want to read the CSV files into a dataframe and parallelize it over the four slaves. After that, I want to filter the dataframe with a group by. In my Spark queries, the result contains the columns "code_ccam" and "dossier", grouped by ("code_ccam", "dossier"). I want to use the FP-Growth algorithm to detect sequences of "code_ccam" that are repeated per "dossier". But when I use the FPGrowth.fit() command, I get the following error:
"error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Dataset[_]"
Here are my spark commands:
val df = spark.read.option("header", "true").csv("file:///home/ia/Projet-Spark-ace/Donnees/Fichiers CSV/*.csv")
import org.apache.spark.sql.functions.{concat, lit}
val df2 = df.withColumn("dossier", concat(col("num_immatriculation"), lit(""), col("date_acte"), lit(""), col("rang_naissance"), lit(""), col("date_naissance")))
val df3 = df2.drop("num_immatriculation").drop("date_acte").drop("rang_naissance").drop("date_naissance")
val df4 = df3.select("dossier","code_ccam").groupBy("dossier","code_ccam").count()
val transactions = df4.agg(collect_list("code_ccam").alias("codes_ccam")).rdd.map(x => x)
import org.apache.spark.ml.fpm.FPGrowth
val fpgrowth = new FPGrowth().setItemsCol("code_ccam").setMinSupport(0.5).setMinConfidence(0.6)
val model = fpgrowth.fit(transactions)
Thank you very much. It worked. I replaced collect_list with collect_set.
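For completeness, a sketch of the working version implied by that comment (names follow the question; the thresholds are placeholders): group per "dossier", keep the result as a DataFrame instead of dropping to an RDD, and point the model at the collected array column.
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.functions.collect_set

// One row per dossier; collect_set also removes the duplicate items FP-Growth rejects
val transactions = df4
  .groupBy("dossier")
  .agg(collect_set("code_ccam").alias("codes_ccam"))

// fit() expects a Dataset with an array items column, not an RDD[Row]
val fpgrowth = new FPGrowth()
  .setItemsCol("codes_ccam")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
val model = fpgrowth.fit(transactions)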

Better way to join in spark scala [duplicate]

I would like to join two spark-scala dataframes on multiple columns dynamically. I would like to avoid hard-coding the column name comparison, as shown in the following statement:
val joinRes = df1.join(df2, df1("col1") === df2("col1") and df1("col2") === df2("col2"))
The solution for this already exists in a PySpark version, provided in the following link:
PySpark DataFrame - Join on multiple columns dynamically
I would like to write the same thing in spark-scala.
In Scala you do it in a similar way to Python, but you need to use the map and reduce functions:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val df1 = List("a,b", "b,c", "c,d").toDF("col1","col2")
val df2 = List("1,2", "2,c", "3,4").toDF("col1","col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)
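If both dataframes use the same column names, a shorter variant (a sketch under that assumption) is to pass the shared names as a Seq, which also avoids duplicated columns in the output:
// Join on every column name the two dataframes have in common
val commonCols = df1.columns.intersect(df2.columns).toSeq
val dfJoinRes = df1.join(df2, commonCols)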

Union operation in spark running very slow

I'm running a Spark SQL job with the statements and configuration below, but the dfs.reduce((x, y) => x.union(y)).distinct().coalesce(1) step is apparently taking a lot of time to execute, roughly 5 minutes, even though my input parquet file has just 88 records. Any thoughts on what the issue could be?
val spark = SparkSession
.builder()
.appName("SparkSessionZipsExample")
.config("spark.sql.warehouse.dir", warehouseLocation)
.config("spark.master", "local")
.config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
.getOrCreate()
//set new runtime options
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")
spark.conf.set("spark.driver.host", "localhost")
spark.conf.set("spark.cores.max", "8")
val dfs = m.map(field => spark.sql(
  s"select 'DataProfilerStats' as Table_Name, '$field' as Column_Name, min($field) as min_value from parquetDFTable"))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct().coalesce(1)
UPDATE
I have a single parquet file which I'm reading into a dataframe; the question is also whether it can be split into smaller chunks.
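One way to avoid the N-way union entirely (a sketch that reuses m and parquetDFTable from the question, assuming m is the list of column names and parquetDFTable is a registered temp view) is to compute every minimum in a single aggregation, so only one Spark job runs instead of one per column:
import org.apache.spark.sql.functions.{col, min}

// One wide aggregation: a single row with one min_<field> column per entry in m
val mins = spark.table("parquetDFTable")
  .select(m.map(field => min(col(field)).alias(s"min_$field")): _*)
mins.show()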