Can I keep the number of partitions when I use a window with Spark/Scala?

I have an RDD. The number of partitions of the result changes to 1 when I use a window function. Can I keep the original number of partitions when I use a window?
This is my code:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}
val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val result = rdd.toDF("values").withColumn("csum", sum(col("values")).over(Window.orderBy("values"))).rdd
println(result.getNumPartitions)
My input has 4 partitions and I want the result to have 4 partitions as well. Is there a cleaner solution?

The number of partitions of the RDD is 1, as expected. This is because you are performing a window function on a DataFrame without a partitionBy clause, so all of the data has to be moved into a single partition.
When we include a partitionBy clause in the window function, the number of partitions in the resulting RDD is no longer 1, as shown below. In the example below, an additional column called col1 is added to the original dataframe and the same window function is applied with a partitionBy clause on the col1 column.
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val rdd = spark.sparkContext.parallelize(List((1,1),(3,1),(2,2),(4,2),(5,2),(6,3),(7,3),(8,3)),4)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[49] at parallelize at <console>:28
scala> val result = rdd.toDF("values", "col1").withColumn("csum", sum(col("values")).over(Window.partitionBy("col1").orderBy("values"))).rdd
result: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[58] at rdd at <console>:30
scala> result.getNumPartitions
res6: Int = 200
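Note that the 200 above is the default value of spark.sql.shuffle.partitions, not the input's 4 partitions. If there is no natural column to partition the window by and you specifically need the input's partition count back, one option (a sketch based on the question's code, not a way to avoid the shuffle itself) is to repartition after the window:
// Sketch: the window still computes over a single partition; repartition only
// restores the partition count afterwards.
val restored = rdd.toDF("values")
  .withColumn("csum", sum(col("values")).over(Window.orderBy("values")))
  .repartition(rdd.getNumPartitions)
  .rdd
println(restored.getNumPartitions) // 4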

Related

I can't fit the FP-Growth model in spark

Can you help me, please? I have a dataset of 80 CSV files and a cluster of one master and 4 slaves. I want to read the CSV files into a dataframe and parallelize it over the four slaves. After that, I want to filter the dataframe with a group by. In my Spark queries, the result contains the columns "code_ccam" and "dossier", grouped by ("code_ccam", "dossier"). I want to use the FP-Growth algorithm to detect sequences of "code_ccam" that are repeated per "dossier". But when I use the FPGrowth.fit() command, I get the following error:
"error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Dataset[_]"
Here are my spark commands:
val df = spark.read.option("header", "true").csv("file:///home/ia/Projet-Spark-ace/Donnees/Fichiers CSV/*.csv")
import org.apache.spark.sql.functions.{concat, lit}
val df2 = df.withColumn("dossier", concat(col("num_immatriculation"), lit(""), col("date_acte"), lit(""), col("rang_naissance"), lit(""), col("date_naissance")))
val df3 = df2.drop("num_immatriculation").drop("date_acte").drop("rang_naissance").drop("date_naissance")
val df4 = df3.select("dossier","code_ccam").groupBy("dossier","code_ccam").count()
val transactions = df4.agg(collect_list("code_ccam").alias("codes_ccam")).rdd.map(x => x)
import org.apache.spark.ml.fpm.FPGrowth
val fpgrowth = new FPGrowth().setItemsCol("code_ccam").setMinSupport(0.5).setMinConfidence(0.6)
val model = fpgrowth.fit(transactions)
Thank you very much. It worked. I replaced collect_list with collect_set.
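For reference, a sketch of what the working version presumably looks like (it assumes the df4 from the question and its column names): keep the grouped result as a DataFrame instead of converting it to an RDD, aggregate the codes per "dossier" with collect_set, and point setItemsCol at the aggregated array column.
import org.apache.spark.sql.functions.collect_set
import org.apache.spark.ml.fpm.FPGrowth
// One row per "dossier", with its set of codes as an array column
val transactions = df4
  .groupBy("dossier")
  .agg(collect_set("code_ccam").alias("codes_ccam"))
val fpgrowth = new FPGrowth()
  .setItemsCol("codes_ccam") // the array column built above
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
// fit() accepts a Dataset/DataFrame, which resolves the type mismatch
val model = fpgrowth.fit(transactions)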

Functional way of joining multiple dataframes

I'm learning Spark in Scala coming from heavy Python abuse and I'm getting a java.lang.NullPointerException because I'm doing things the Python way.
I have, say, 3 dataframes of shape 4x2 each; the first column is always an index 0,1,2,3 and the second column is some binary feature. The end goal is a 4x4 dataframe that joins all of the individual ones. In Python I would first define some master df and then loop over the intermediate ones, assigning at each iteration the resulting joined dataframe to the master dataframe variable (ugly):
dataframes = [temp1, temp2, temp3]
df = pd.DataFrame(index=[0, 1, 2, 3])  # Master df
for temp in dataframes:
    df = df.join(temp)
In Spark this doesn't play well:
val q = "select * from table"
val df = sql(q) // works, obviously
scala> val df = df.join(sql(q))
<console>:33: error: recursive value df needs type
val df = df.join(sql(q))
Ok so:
scala> val df:org.apache.spark.sql.DataFrame = df.join(sql(q))
java.lang.NullPointerException
... 50 elided
I think it's highly likely that I'm not doing it the functional way. So I tried (ugliest!):
scala> :paste
// Entering paste mode (ctrl-D to finish)
sql(q).
join(sql(q), "device_id").
join(sql(q), "device_id").
join(sql(q), "device_id")
// Exiting paste mode, now interpreting.
res128: org.apache.spark.sql.DataFrame = [device_id: string, devtype: int ... 3 more fields]
This just looks ugly, inelegant and like beginner code. What would be a proper functional Scala way to achieve this?
foldLeft:
val dataframes: Seq[String] = ???
val df: Dataset[Row] = ???
dataframes.foldLeft(df)((acc, q) => acc.join(sql(q)))
And if you're looking for the imperative equivalent of your Python code:
val dataframes: Seq[String] = ???
var df: Dataset[Row] = ??? // IMPORTANT: var, since it is reassigned in the loop
for (q <- dataframes) { df = df.join(sql(q)) }
Even simpler,
val dataframes: Seq[String] = ???
dataframes.map(sql).reduce(_ join _)
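If, as in the question, the frames all share a key column such as device_id, the same pattern can carry an explicit join key. A sketch, where sql, q and the column name are taken from the question:
// Fold a list of SQL queries into one DataFrame joined on "device_id"
val queries: Seq[String] = Seq(q, q, q)
val joined = queries
  .map(sql)                                         // Seq of DataFrames
  .reduce((left, right) => left.join(right, "device_id"))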

How to convert a dataframe to an RDD without changing the partitioning?

For some reason I have to convert an RDD to a dataframe, do something with the dataframe, and then hand the result back as an RDD, because my interface is an RDD. But when I use df.rdd the number of partitions changes to 1, so I have to repartition and sortBy the RDD. Is there any cleaner solution? Thanks!
This is my try:
import org.apache.spark.sql.SQLContext
val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val partition = rdd.getNumPartitions
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd.toDF()
df.rdd.zipWithIndex().sortBy(_._2, ascending = true, numPartitions = partition).map(_._1)
Partitions should remain the same when you convert a DataFrame to an RDD.
For example, when an RDD with 4 partitions is converted to a DataFrame and back to an RDD, the number of partitions of the RDD stays the same, as shown below.
scala> val rdd=sc.parallelize(List(1,3,2,4,5,6,7,8),4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:27
scala> val partition=rdd.getNumPartitions
partition: Int = 4
scala> val df=rdd.toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> df.rdd.getNumPartitions
res1: Int = 4
scala> df.withColumn("col2", lit(10)).rdd.getNumPartitions
res2: Int = 4
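If some upstream transformation does change the partition count (for example a window without a partitionBy clause, as in the first question above), a simpler alternative to the zipWithIndex/sortBy workaround is to repartition before dropping down to the RDD. A sketch using df and partition from the question; note that repartition() shuffles, so it does not preserve row order:
val restored = df.repartition(partition).rdd
println(restored.getNumPartitions) // 4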

Spark: How can DataFrame be Dataset[Row] if DataFrames have a schema?

This article claims that a DataFrame in Spark is equivalent to a Dataset[Row], but this blog post shows that a DataFrame has a schema.
Take the example in the blog post of converting an RDD to a DataFrame: if a DataFrame were the same thing as a Dataset[Row], then converting an RDD to a DataFrame should be as simple as
val rddToDF = rdd.map(value => Row(value))
But instead it shows that it's this
val rddStringToRowRDD = rdd.map(value => Row(value))
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = sparkSession.createDataFrame(rddStringToRowRDD,dfschema)
val rDDToDataSet = rddToDF.as[String]
Clearly a dataframe is actually a dataset of rows and a schema.
In Spark 2.0, in code there is:
type DataFrame = Dataset[Row]
It is Dataset[Row], simply by definition.
A Dataset also has a schema; you can print it using the printSchema() function. Normally Spark infers the schema, so you don't have to write it yourself - however, it's still there ;)
You can also call createTempView(name) and use it in SQL queries, just like with DataFrames.
In other words, Dataset = the DataFrame from Spark 1.5 + an encoder that converts rows to your classes. After the types were merged in Spark 2.0, DataFrame became just an alias for Dataset[Row], i.e. a Dataset without a specified encoder.
About conversions: rdd.map() also returns an RDD, it never returns a DataFrame. You can do:
// Dataset[Row]=DataFrame, without encoder
val rddToDF = sparkSession.createDataFrame(rdd)
// And now it has information, that encoder for String should be used - so it becomes Dataset[String]
val rDDToDataSet = rddToDF.as[String]
// however, it can be shortened to:
val dataset = sparkSession.createDataset(rdd)
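As a small illustration of the points above (a sketch, assuming a running SparkSession named spark): a Dataset carries a schema and can be registered as a temp view, exactly like a DataFrame.
import spark.implicits._
val ds = Seq("a", "b", "c").toDS()    // Dataset[String], single column named "value"
ds.printSchema()                      // root |-- value: string (nullable = true)
ds.createOrReplaceTempView("letters") // temp views work on Datasets too
spark.sql("select * from letters").show()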
Note (in addition to T Gaweda's answer) that there is a schema associated with each Row (Row.schema). However, the schema is not set until the Row is integrated into a DataFrame (or Dataset[Row]):
scala> Row(1).schema
res12: org.apache.spark.sql.types.StructType = null
scala> val rdd = sc.parallelize(List(Row(1)))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[5] at parallelize at <console>:28
scala> val schema = StructType(List(StructField("a", IntegerType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))
scala> spark.createDataFrame(rdd,schema).first
res15: org.apache.spark.sql.Row = [1]
scala> spark.createDataFrame(rdd,schema).first.schema
res16: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I can see, I want to use this kind of function:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Am I using the function sum wrongly?
Do I need to use the function map first? And if so, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply the aggregation function sum to your column:
df.groupBy().sum("steps").show()
Follow the Documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link also https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked but:
df.describe("columnName").show()
gives count, mean, stddev, min and max stats for a column. I think it returns them for all columns if you just call describe() with no arguments.
Using a Spark SQL query, just in case it helps anyone!
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps").map(row => row.getAs[Long]("stepsSum")).collect()(0)
println("steps sum = " + sum) // prints 28