I can't fit the FP-Growth model in spark - scala

Please, can you help me? I have a dataset of 80 CSV files and a cluster of one master and 4 slaves. I want to read the CSV files into a dataframe and parallelize it across the four slaves. After that, I want to filter the dataframe with a group by. In my Spark queries, the result contains the columns "code_ccam" and "dossier", grouped by ("code_ccam","dossier"). I want to use the FP-Growth algorithm to detect sequences of "code_ccam" that are repeated per "dossier". But when I use the FPGrowth.fit() command, I get the following error:
"error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Dataset[_]"
Here are my spark commands:
val df = spark.read.option("header", "true").csv("file:///home/ia/Projet-Spark-ace/Donnees/Fichiers CSV/*.csv")
import org.apache.spark.sql.functions.{col, collect_list, concat, lit}
val df2 = df.withColumn("dossier", concat(col("num_immatriculation"), lit(""), col("date_acte"), lit(""), col("rang_naissance"), lit(""), col("date_naissance")))
val df3 = df2.drop("num_immatriculation").drop("date_acte").drop("rang_naissance").drop("date_naissance")
val df4 = df3.select("dossier","code_ccam").groupBy("dossier","code_ccam").count()
val transactions = df4.agg(collect_list("code_ccam").alias("codes_ccam")).rdd.map(x => x)
import org.apache.spark.ml.fpm.FPGrowth
val fpgrowth = new FPGrowth().setItemsCol("code_ccam").setMinSupport(0.5).setMinConfidence(0.6)
val model = fpgrowth.fit(transactions)

Thank you very much. It worked. I replaced collect_list with collect_set.
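For reference, here is a minimal sketch of what a working pipeline might look like (this is not the original answer; it only combines the column names from the question with the fix mentioned above): keep the transactions as a dataframe instead of converting to an RDD, and make the items column name match setItemsCol.
import org.apache.spark.sql.functions.collect_set
import org.apache.spark.ml.fpm.FPGrowth
// One transaction per "dossier": the set of distinct code_ccam values.
// Keeping a DataFrame (no .rdd) avoids the RDD[Row] vs Dataset[_] mismatch.
val transactions = df3.groupBy("dossier").agg(collect_set("code_ccam").alias("code_ccam"))
val fpgrowth = new FPGrowth().setItemsCol("code_ccam").setMinSupport(0.5).setMinConfidence(0.6) // items column name must match the column above
val model = fpgrowth.fit(transactions)
model.freqItemsets.show()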

Related

Better way to join in spark scala [duplicate]

I would like to join two spark-scala dataframes on multiple columns dynamically. I would like to avoid hard-coding the column name comparisons as shown in the following statement:
val joinRes = df1.join(df2, df1("col1") === df2("col1") && df1("col2") === df2("col2"))
A solution for this already exists in PySpark, provided in the following link:
PySpark DataFrame - Join on multiple columns dynamically
I would like to write the same logic using spark-scala.
In Scala you do it in a similar way to Python, but you need to use the map and reduce functions:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
// Tuples are needed so that toDF can create two columns
val df1 = List(("a", "b"), ("b", "c"), ("c", "d")).toDF("col1", "col2")
val df2 = List(("1", "2"), ("2", "c"), ("3", "4")).toDF("col1", "col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)
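As a side note (not part of the original answer): when the two dataframes share the same column names, Spark's Seq-based join overload avoids both the expression building and the duplicated join columns in the result.
// Join on the common column names; each join column appears only once in the output.
val dfJoinRes2 = df1.join(df2, df1.columns.intersect(df2.columns).toSeq)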

spark scala reducekey dataframe operation

I'm trying to do a count in Scala with a dataframe. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file1 = file.map(line => line.split("\t"))
val x = file1.map(line => (line(0), line(2).toInt)).reduceByKey(_ + _, 1)
I want to put the data in a dataframe, but I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the dataframe.
There are many ways to specify the schema; here is one option:
val df = file
.map(line => line.split("\t"))
.map(l => (l(0), l(2).toInt)) // at this point Spark knows the number of columns and their types
.toDF("a", "b") // give the columns names for ease of use
df
.groupBy("a")
.count()
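Since the original RDD code sums the third column with reduceByKey rather than counting rows, the dataframe equivalent would be an aggregation with sum; a small sketch using the column names from the snippet above:
import org.apache.spark.sql.functions.sum
// Equivalent of reduceByKey(_ + _): sum column "b" within each group of "a"
df
.groupBy("a")
.agg(sum("b").as("total"))
.show()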

Spark - UnsupportedOperationException: collect_list is not supported in a window operation

I am using Spark 1.6. I have a dataframe generated from a parquet file with 6 columns. I am trying to group (partitionBy) and order (orderBy) the rows in the dataframe, in order to later collect those columns into an array.
I wasn't sure if these actions were possible in Spark 1.6, but the following answers show how it can be done:
https://stackoverflow.com/a/35529093/1773841 #zero323
https://stackoverflow.com/a/45135012/1773841 #Ramesh Maharjan
Based on those answers I wrote the following code:
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, desc, struct}
val sqlContext: SQLContext = new HiveContext(sc)
val conf = sc.hadoopConfiguration
val dataPath = "/user/today/*/*"
val dfSource: DataFrame = sqlContext.read.format("parquet").option("dateFormat", "DDMONYY").option("timeFormat", "HH24:MI:SS").load(dataPath)
val w = Window.partitionBy("code").orderBy(desc("date"))
val dfCollec = dfSource.withColumn("collected", collect_list(struct("col1", "col2", "col3", "col4", "col5", "col6")).over(w))
So, I followed the pattern written by Ramesh, and I created the sqlContext based on Hive as Zero recommended. But I am still getting the following error:
java.lang.UnsupportedOperationException:
'collect_list(struct('col1,'col2,'col3,'col4,'col5,'col6)) is not
supported in a window operation.
at org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1052)
What am I still missing?
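No resolution is recorded here, but one commonly suggested workaround for Spark 1.6 (a sketch, not a verified answer from this thread) is to skip the window entirely and collect per group after sorting; note that the DataFrame API does not formally guarantee that sort order survives the groupBy, and collect_list as a plain aggregate needs the HiveContext that is already used above.
import org.apache.spark.sql.functions.{collect_list, desc, struct}
// Sort first, then collect the struct per "code" group instead of using a window.
val dfCollec = dfSource
.orderBy(desc("date"))
.groupBy("code")
.agg(collect_list(struct("col1", "col2", "col3", "col4", "col5", "col6")).as("collected"))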

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance, the total number of steps in the "steps" column.
As far as I can see, I want to use this kind of function:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Do I use the function sum wrongly?
Do I need to use the function map first? And if yes, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply the aggregation function sum to your column:
df.groupBy().sum("steps").show()
Follow the Documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link also https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked but:
df.describe("columnName").show()
gives count, mean, stddev, min and max stats on a column. I think it returns stats for all columns if you call describe() with no arguments.
Using a Spark SQL query, just in case it helps anyone!
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps").map(row => row.getAs[Long]("stepsSum")).collect()(0)
println("steps sum = " + sum) // prints 28