Better way to join in spark scala [duplicate] - scala

I would like to join two spark-scala dataframes on multiple columns dynamically. I would to avoid hard coding column name comparison as shown in the following statments;
val joinRes = df1.join(df2, df1("col1") == df2("col1") and df1("col2") == df2("col2"))
The solution for this query already exists in pyspark version --provided in the following link
PySpark DataFrame - Join on multiple columns dynamically
I would like to code the same code using spark-scala

In scala you do it in similar way like in python but you need to use map and reduce functions:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val df1 = List("a,b", "b,c", "c,d").toDF("col1","col2")
val df2 = List("1,2", "2,c", "3,4").toDF("col1","col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)

Related

dataframe.select, select dataframe columns from file

I am trying to create a child dataframe from parent dataframe. but I have more than 100 cols to select.
so in Select statement can I give the columns from a file?
val Raw_input_schema=spark.read.format("text").option("header","true").option("delimiter","\t").load("/HEADER/part-00000").schema
val Raw_input_data=spark.read.format("text").schema(Raw_input_schema).option("delimiter","\t").load("/DATA/part-00000")
val filtered_data = Raw_input_data.select(all_cols)
how can I send the columns names from file in all_cols
I would assume you would read file somewhere from hdfs or from shared config file? Reason for this, that on the cluster this code, would be executed on individual node etc.
In this case I would approach this with next pice of code:
import org.apache.spark.sql.functions.col
val lines = Source.fromFile("somefile.name.csv").getLines
val cols = lines.flatMap(_.split(",")).map( col(_)).toArray
val df3 = df2.select(cols :_ *)
Essentially, you just have to provide array of strings and use :_ * notation for variable number of arguments.
finally this worked for me;
val Raw_input_schema=spark.read.format("csv").option("header","true").option("delimiter","\t").load("headerFile").schema
val Raw_input_data=spark.read.format("csv").schema(Raw_input_schema).option("delimiter","\t").load("dataFile")
val filtered_file = sc.textFile("filter_columns_file").map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList
//or
val filtered_file = sc.textFile(filterFile).map(cols=>cols.split("\t")).flatMap(x=>x).collect().toList.map(x => new Column(x))
val final_df=Raw_input_data.select(filtered_file.head, filtered_file.tail: _*)
//or
val final_df = Raw_input_data.select(filtered_file:_*)'

I can't fit the FP-Growth model in spark

Please, can you help me ? I have an 80 CSV files dataset and a cluster of one master and 4 slaves. I want to read the CSV files in a dataframe and parallelize it on the four slaves. After that, I want to filter the dataframe with a group by. In my spark queries, the result contains columns "code_ccam" and "dossier" grouped by ("code_ccam","dossier"). I want to use the FP-Growth algorithm to detect sequences of "code_ccam" which are repeated by "folder". But when I use the FPGrowth.fit() command, I have the following error :
"error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Dataset[_]"
Here are my spark commands:
val df = spark.read.option("header", "true").csv("file:///home/ia/Projet-Spark-ace/Donnees/Fichiers CSV/*.csv")
import org.apache.spark.sql.functions.{concat, lit}
val df2 = df.withColumn("dossier", concat(col("num_immatriculation"), lit(""), col("date_acte"), lit(""), col("rang_naissance"), lit(""), col("date_naissance")))
val df3 = df2.drop("num_immatriculation").drop("date_acte").drop("rang_naissance").drop("date_naissance")
val df4 = df3.select("dossier","code_ccam").groupBy("dossier","code_ccam").count()
val transactions = df4.agg(collect_list("code_ccam").alias("codes_ccam")).rdd.map(x => x)
import org.apache.spark.ml.fpm.FPGrowth
val fpgrowth = new FPGrowth().setItemsCol("code_ccam").setMinSupport(0.5).setMinConfidence(0.6)
val model = fpgrowth.fit(transactions)
Tkank you very much. It worked. I replaced collect_list by collect_set.

Spark Join of 2 dataframes which have 2 different column names in list

Is there a way to join two Spark Dataframes with different column names via 2 lists?
I know that if they had the same names in a list I could do the following:
val joindf = df1.join(df2, Seq("col_a", "col_b"), "left")
or if I knew the different column names I could do this:
df1.join(
df2,
df1("col_a") <=> df2("col_x")
&& df1("col_b") <=> df2("col_y"),
"left"
)
Since my method is expecting inputs of 2 lists which specify which columns are to be used for the join for each DF, I was wondering if Scala Spark had a way of doing this?
P.S
I'm looking for something like a python pandas merge:
joindf = pd.merge(df1, df2, left_on = list1, right_on = list2, how = 'left')
You can easely define such a method yourself:
def merge(left: DataFrame, right: DataFrame, left_on: Seq[String], right_on: Seq[String], how: String) = {
import org.apache.spark.sql.functions.lit
val joinExpr = left_on.zip(right_on).foldLeft(lit(true)) { case (acc, (lkey, rkey)) => acc and (left(lkey) === right(rkey)) }
left.join(right, joinExpr, how)
}
val df1 = Seq((1, "a")).toDF("id1", "n1")
val df2 = Seq((1, "a")).toDF("id2", "n2")
val joindf = merge(df1, df2, left_on = Seq("id1", "n1"), right_on = Seq("id2", "n2"), how = "left")
If you expect two lists of strings:
val leftOn = Seq("col_a", "col_b")
val rightOn = Seq("col_x", "coly")
Just zip and reduce:
import org.apache.spark.sql.functions.col
val on = leftOn.zip(rightOn)
.map { case (x, y) => df1(x) <=> df2(y) }
.reduce(_ and _)
df1.join(df2, on, "left")

dynamically join two spark-scala dataframes on multiple columns without hardcoding join conditions

I would like to join two spark-scala dataframes on multiple columns dynamically. I would to avoid hard coding column name comparison as shown in the following statments;
val joinRes = df1.join(df2, df1("col1") == df2("col1") and df1("col2") == df2("col2"))
The solution for this query already exists in pyspark version --provided in the following link
PySpark DataFrame - Join on multiple columns dynamically
I would like to code the same code using spark-scala
In scala you do it in similar way like in python but you need to use map and reduce functions:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val df1 = List("a,b", "b,c", "c,d").toDF("col1","col2")
val df2 = List("1,2", "2,c", "3,4").toDF("col1","col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)

Dropping constant columns in a csv file

i would like to drop columns which are constant in a dataframe , here what i did , but i see that it tooks some much time to do it , special while writing the dataframe into the csv file , please any help to optimize the code to take less time
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("inferSchema", "true").option("header", "false").csv("D:\\ProcessDataSet\\anis_data\\Set _1Mud Pumps_Merged.csv")
val aggregations = df.drop("DateTime").columns.map(c => stddev("c").as(c))
val df2 = df.agg(aggregations.head, aggregations.tail: _*)
val columnsToKeep: Seq[String] = (df2.first match {
case r : Row => r.toSeq.toArray.map(_.asInstanceOf[Double])
}).zip(df.columns)
.filter(_._1 != 0) // your special condition is in the filter
.map(_._2) // keep just the name of the column
// select columns with stddev != 0
val finalResult = df.select(columnsToKeep.head, columnsToKeep.tail : _*)
finalResult.write.option("header",true).csv("D:\\ProcessDataSet\\dataWithoutConstant\\Set _1Mud Pumps_MergedCleaned.csv")
}
I think there is no much room left for optimization. You are doing the right thing.
Maybe what you can try is to cache() your dataframe df.
df is used in two separate Spark actions so it is loaded twice.
Try :
...
val df = spark.read.option("inferSchema", "true").option("header", "false").csv("D:\\ProcessDataSet\\anis_data\\Set _1Mud Pumps_Merged.csv")
df.cache()
val aggregations = df.drop("DateTime").columns.map(c => stddev("c").as(c))
...