How to specify multiple columns in groupby together with streaming window operations? - scala

I'm not able to specify a list of columns in the groupBy function along with a window operation. My current code:
val groupCols = List("SINR_Distribution","NE_VERSION","NE_ID","NE_NAME","cNum","EarfcnDl","datetime","circle")
val aggDFrame = dframe.groupBy(groupCols, window($"EVENT_TIME", "60 minutes")).agg(Rule_Agg)
Error:
Multiple markers at this line:
overloaded method value groupBy with alternatives:
  (col1: String, cols: String*)org.apache.spark.sql.RelationalGroupedDataset
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.RelationalGroupedDataset
cannot be applied to (List[String], org.apache.spark.sql.Column)
What am I doing wrong?

You are mixing strings with a Column in the groupBy. The window expression window($"EVENT_TIME", "60 minutes") is correctly interpreted as a Column, but the list of column names needs to be converted to Columns as well; it is not possible to mix the two types.
What you can do is:
val cols = groupCols.map(col) ++ Seq(window($"EVENT_TIME", "60 minutes"))
val aggDFrame = dframe.groupBy(cols: _*).agg(...)
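For reference, a minimal end-to-end sketch of that fix, assuming the dframe, Rule_Agg and groupCols values from the question and a SparkSession named spark:
import org.apache.spark.sql.functions.{col, window}
import spark.implicits._ // for the $"..." column syntax

// Turn every string name into a Column, then append the window Column,
// so that groupBy receives one homogeneous Seq[Column] via the vararg expansion.
val cols = groupCols.map(col) ++ Seq(window($"EVENT_TIME", "60 minutes"))
val aggDFrame = dframe.groupBy(cols: _*).agg(Rule_Agg)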


error: overloaded method value select with alternatives:

I am reading a CSV file into dataframe1 and then filtering some columns into dataframe2. While selecting the columns for dataframe2 from dataframe1, I want to apply a function to one of the column values, like this:
import utilities._
val Logs = sqlContext.read
  .format("csv")
  .option("header", "true")
  .load("dbfs:/mnt/records/Logs/2016.07.17/2016.07.17.{*}.csv")
val Log = Logs.select(
  "key1",
  utility.stringToGuid("username"),
  "key2",
  "key3",
  "startdatetime",
  "enddatetime")
display(Log)
So here I am calling utility.stringToGuid("username"), and it gives me this error:
notebook:5: error: overloaded method value select with alternatives:
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
I actually found the answer to my question: I was passing the string "username" to the utility function instead of the column named "username".
So the argument should be utility.stringToGuid($"username"). In Scala, $"..." is used to refer to a column value; in Python, col() is used.
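As a hedged sketch of the corrected call, assuming the utilities package and the Logs DataFrame from above, and that utility.stringToGuid accepts a Column; note that once any argument is a Column, every argument has to be a Column:
import org.apache.spark.sql.functions.col
import utilities._

// Every argument is now a Column, so the select(cols: Column*) overload applies.
val Log = Logs.select(
  col("key1"),
  utility.stringToGuid(col("username")), // assumes stringToGuid takes a Column
  col("key2"),
  col("key3"),
  col("startdatetime"),
  col("enddatetime"))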

How to repartition a dataframe based on more than one column?

I have a dataframe: yearDF with the following columns: name, id_number, location, source_system_name, period_year.
If I want to repartition the dataframe based on a column, I'd do:
yearDF.repartition('source_system_name')
I have a variable: val partition_columns = "source_system_name,period_year"
I tried to do it this way:
val dataDFPart = yearDF.repartition(col(${prtn_String_columns}))
but I get a compilation error: cannot resolve the symbol $
Is there any way I can repartition the dataframe yearDF based on the values in partition_columns?
There are three overloads of the repartition function in the Scala / Spark API:
def repartition(partitionExprs: Column*): Dataset[T]
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
def repartition(numPartitions: Int): Dataset[T]
So in order to repartition on multiple columns, you can split your string on the comma, map each name to a Column, and use Scala's vararg operator, like this:
val columns = partition_columns.split(",").map(x => col(x))
yearDF.repartition(columns: _*)
Another way to do it is to pass each col one by one:
yearDF.repartition(col("source_system_name"), col("period_year"))
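Putting it together, a small sketch assuming the yearDF and partition_columns values from the question; the partition count of 200 in the second variant is just an illustrative number:
import org.apache.spark.sql.functions.col

val partition_columns = "source_system_name,period_year"
val columns = partition_columns.split(",").map(name => col(name.trim))

// Repartition by the parsed columns
val dataDFPart = yearDF.repartition(columns: _*)

// Or fix the number of partitions as well (200 is arbitrary here)
val dataDFPartN = yearDF.repartition(200, columns: _*)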

Spark DataFrame, how to aggregate a sequence of columns?

I have a dataframe and I can do an aggregation with static column names, i.e.:
df.groupBy("_c0", "_c1", "_c2", "_c3", "_c4").agg(
concat_ws(",", collect_list("_c5")),
concat_ws(",", collect_list("_c6")))
And it works fine, but how can I do the same if I get a sequence of groupBy columns and a sequence of aggregate columns?
In other words, what if I have
val toGroupBy = Seq("_c0", "_c1", "_c2", "_c3", "_c4")
val toAggregate = Seq("_c5", "_c6")
and want to perform the above?
To perform the same groupBy and aggregation using the sequences you can do the following:
val aggCols = toAggregate.map(c => expr(s"""concat_ws(",", collect_list($c))"""))
df.groupBy(toGroupBy.head, toGroupBy.tail:_*).agg(aggCols.head, aggCols.tail:_*)
The expr function parses a SQL expression string and evaluates it into a Column. The varargs variants of groupBy and agg are then applied to the two lists of columns.
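As an alternative sketch, the same thing can be built with the Column API instead of SQL expression strings, assuming the df, toGroupBy and toAggregate values from the question:
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

// One concat_ws(collect_list(...)) Column per name in toAggregate
val aggCols = toAggregate.map(c => concat_ws(",", collect_list(col(c))))

// groupBy(col1: String, cols: String*) takes names; agg(expr: Column, exprs: Column*) takes Columns
val result = df.groupBy(toGroupBy.head, toGroupBy.tail: _*)
  .agg(aggCols.head, aggCols.tail: _*)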

Difference b/w val df = List(("amit","porwal")) and val df = List("amit","porwal")

What is the difference between
val df = List(("amit","porwal"))
and
val df = List("amit","porwal")
My question is how the extra pair of parentheses makes a difference, because on doing
scala > val df = List(("amit","porwal")).toDF("fname","lname")
it works, but on doing
scala > val df = List("amit","porwal").toDF("fname","lname")
Scala throws me an error:
java.lang.IllegalArgumentException: requirement failed:
The number of columns doesn't match.
Old column names (1): value
New column names (2): fname, lname
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.Dataset.toDF(Dataset.scala:393)
at org.apache.spark.sql.DatasetHolder.toDF(DatasetHolder.scala:44)
... 48 elided
Yes, they are different. The inner parentheses are treated as a tuple by the Scala compiler. Since there are two string values inside the nested brackets of your first example, it is treated as a Tuple2(String, String), while in the second example the string values inside the List are treated as separate String elements.
The first one, val df = List(("amit","porwal")), is a List[(String, String)]. There is only one element in df, and to get "porwal" you have to do df(0)._2.
And,
the second one, val df = List("amit","porwal"), is a List[String]. There are two elements in df, and to get "porwal" you have to do df(1).
Even though the question is not really related to Spark:
val df = List(("amit","porwal"))
Here df is a list of Tuple2, i.e. List[(String, String)]. To get the value "amit" you should use df(0)._1, and for "porwal", df(0)._2.
val df = List("amit","porwal")
Here df is simply a list of String, i.e. List[String].
In the case of a List[String] you can simply access the elements as df(0) and df(1).
Hope this helps!
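For illustration, a minimal spark-shell sketch of the two cases (the implicits needed for toDF are already imported in the shell):
// One Tuple2 element -> a DataFrame with two columns and one row
val df1 = List(("amit", "porwal")).toDF("fname", "lname")
// df1: org.apache.spark.sql.DataFrame = [fname: string, lname: string]

// Two String elements -> a DataFrame with a single column and two rows,
// so only one column name may be supplied
val df2 = List("amit", "porwal").toDF("name")
// df2: org.apache.spark.sql.DataFrame = [name: string]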

Save two or more different RDDs in a single text file in scala

When I use saveAsTextFile like this,
rdd1.saveAsTextFile("../savefile")
rdd2.saveAsTextFile("../savefile")
I can't put two different RDDs into a single text file. Is there a way I can do so?
Besides, is there a way I can apply some format to the text I am writing to the text file? For example, add a \n or some other formatting.
A single text file is a rather ambiguous notion in Spark. Each partition is saved individually, which means you get one file per partition. If you want a single file for an RDD you have to move your data to a single partition or collect it, and most of the time that is either too expensive or simply not feasible.
You can get a union of RDDs using the union method (or ++ as mentioned by lpiepiora in the comments), but it works only if both RDDs are of the same type:
val rdd1 = sc.parallelize(1 to 5)
val rdd2 = sc.parallelize(Seq("a", "b", "c", "d", "e"))
rdd1.union(rdd2)
// <console>:26: error: type mismatch;
// found : org.apache.spark.rdd.RDD[String]
// required: org.apache.spark.rdd.RDD[Int]
// rdd1.union(rdd2)
If the types are different, though, the whole idea smells fishy.
If you want a specific format you have to apply it before calling saveAsTextFile. saveAsTextFile simply calls toString on each element.
Putting all of the above together:
import org.apache.spark.rdd.RDD
val rddStr1: RDD[String] = rdd1.map(x => ???) // Map to RDD[String]
val rddStr2: RDD[String] = rdd2.map(x => ???)
rddStr1.union(rddStr2)
  .repartition(1) // Not recommended!
  .saveAsTextFile(some_path)
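For example, a hedged sketch with the two sample RDDs from above, formatting each element before the union; the output path is only a placeholder:
// Format each RDD into strings in whatever way you need before saving
val rddStr1: RDD[String] = rdd1.map(i => s"int:$i")
val rddStr2: RDD[String] = rdd2.map(s => s"str:$s")

rddStr1.union(rddStr2)
  .repartition(1)                       // single output file; not recommended for large data
  .saveAsTextFile("/tmp/union_output")  // placeholder path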