groupByKey in Spark dataset - scala

Please help me understand the parameter we pass to groupByKey when it is used on a dataset
scala> val data = spark.read.text("Sample.txt").as[String]
data: org.apache.spark.sql.Dataset[String] = [value: string]
scala> data.flatMap(_.split(" ")).groupByKey(l=>l).count.show
In the above code, please help me understand what (l=>l) means in groupByKey(l=>l).

l => l says the whole string (in your case every word, since you're tokenizing on spaces) will be used as the key. This way you get all occurrences of each word in the same partition and you can count them.
As you have probably seen in other articles, it is preferable to use reduceByKey in this case so you don't need to collect all values for each key in memory before counting.
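For comparison, here is a minimal word-count sketch on the underlying RDD using reduceByKey, assuming the same data Dataset as in the question:
// reduceByKey combines partial counts per key on each partition before the
// shuffle, so the full list of values per key is never materialized.
data.rdd
  .flatMap(_.split(" "))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
  .collect()
  .foreach(println)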
Always a good place to start is the API Docs:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]
(Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.
You need a function that derives your key from the dataset's data.
In your example, your function takes the whole string as-is and uses it as the key. A different example would be, for a Dataset[String], to use the first 3 characters of your string as the key instead of the whole string:
scala> val ds = List("abcdef", "abcd", "cdef", "mnop").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]
scala> ds.show
+------+
| value|
+------+
|abcdef|
| abcd|
| cdef|
| mnop|
+------+
scala> ds.groupByKey(l => l.substring(0,3)).keys.show
+-----+
|value|
+-----+
| cde|
| mno|
| abc|
+-----+
The group for the key "abc" will have 2 values.
Here is the difference between how the key gets transformed and the (l => l) case, so you can see it more clearly:
scala> ds.groupByKey(l => l.substring(0,3)).count.show
+-----+--------+
|value|count(1)|
+-----+--------+
| cde| 1|
| mno| 1|
| abc| 2|
+-----+--------+
scala> ds.groupByKey(l => l).count.show
+------+--------+
| value|count(1)|
+------+--------+
| abcd| 1|
| cdef| 1|
|abcdef| 1|
| mnop| 1|
+------+--------+
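If you want to inspect which values actually end up in each group, a mapGroups sketch over the same ds could look like this (the mkString is only for display):
// Join each group's values into one string so the grouping is easy to inspect.
ds.groupByKey(l => l.substring(0, 3))
  .mapGroups((key, values) => (key, values.mkString(", ")))
  .toDF("key", "values")
  .show()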

Related

substring from lastIndexOf in spark scala

I have a column in my dataframe which contains the filename
test_1_1_1_202012010101101
I want to get the string after the lastIndexOf(_)
I tried this and it is working
val timestamp_df =file_name_df.withColumn("timestamp",split(col("filename"),"_").getItem(4))
But I want to make it more generic, so that if in the future the filename has any number of _ in it, it can still split on the last _ (lastIndexOf).
val timestamp_df =file_name_df.withColumn("timestamp", expr("substring(filename, length(filename)-15,17)"))
This also is not generic as the character length can vary.
Can anyone help me use the lastIndexOf function with withColumn?
You can use the element_at function together with split to get the last element of the array.
Example:
df.withColumn("timestamp",element_at(split(col("filename"),"_"),-1)).show(false)
+--------------------------+---------------+
|filename |timestamp |
+--------------------------+---------------+
|test_1_1_1_202012010101101|202012010101101|
+--------------------------+---------------+
You can use substring_index
scala> val df = Seq(("a-b-c", 1),("d-ef-foi",2)).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: string, c2: int]
scala> df.show
+--------+---+
| c1| c2|
+--------+---+
| a-b-c| 1|
|d-ef-foi| 2|
+--------+---+
scala> df.withColumn("c3", substring_index(col("c1"), "-", -1)).show
+--------+---+---+
| c1| c2| c3|
+--------+---+---+
| a-b-c| 1| c|
|d-ef-foi| 2|foi|
+--------+---+---+
Per docs: When the last argument "is negative, everything to the right of the final delimiter (counting from the right) is returned"
val timestamp_df =file_name_df.withColumn("timestamp",reverse(split(reverse(col("filename")),"_").getItem(0)))
This works as well: it reverses the string, splits on "_", takes the first element, and reverses it back.
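The built-in functions above are usually the better fit, but if you specifically want to use lastIndexOf with withColumn, a UDF sketch could look like this (assuming filename is never null):
import org.apache.spark.sql.functions.{col, udf}

// Take everything after the last underscore using plain String.lastIndexOf.
val afterLastUnderscore = udf { (s: String) => s.substring(s.lastIndexOf("_") + 1) }
val timestamp_df = file_name_df.withColumn("timestamp", afterLastUnderscore(col("filename")))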

Sequential Dynamic filters on the same Spark Dataframe Column in Scala Spark

I have a column named root and need to filter dataframe based on the different values of a root column.
Suppose the values in root are parent, child or sub-child, and I want to apply these filters dynamically through a variable.
val x = ("parent,child,sub-child").split(",")
x.map(eachvalue => {
  var df1 = df.filter(col("root").contains(eachvalue))
})
But when I do this, it always overwrites df1; instead, I want to apply all 3 filters and get the result.
In the future I may extend the list to any number of filter values, and the code should still work.
Thanks,
Bab
You should apply the subsequent filters to the result of the previous filter, not on df:
val x = ("parent,child,sub-child").split(",")
var df1 = df
x.foreach { eachvalue =>
  df1 = df1.filter(col("root").contains(eachvalue))
}
After the loop, df1 has all the filters applied to it.
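A variant without the var, using foldLeft over the same df and x (a sketch; note that the conditions end up ANDed together, exactly as in the loop above):
import org.apache.spark.sql.functions.col

// Each step filters the DataFrame accumulated so far.
val filtered = x.foldLeft(df)((acc, value) => acc.filter(col("root").contains(value)))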
Let's see an example with spark shell. Hope it helps you.
scala> import spark.implicits._
import spark.implicits._
scala> val df0 =
spark.sparkContext.parallelize(List(1,2,1,3,3,2,1)).toDF("number")
df0: org.apache.spark.sql.DataFrame = [number: int]
scala> val list = List(1,2,3)
list: List[Int] = List(1, 2, 3)
scala> val dfFiltered = for (number <- list) yield { df0.filter($"number" === number)}
dfFiltered: List[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = List([number: int], [number: int], [number: int])
scala> dfFiltered(0).show
+------+
|number|
+------+
| 1|
| 1|
| 1|
+------+
scala> dfFiltered(1).show
+------+
|number|
+------+
| 2|
| 2|
+------+
scala> dfFiltered(2).show
+------+
|number|
+------+
| 3|
| 3|
+------+
AFAIK, isin can be used in this case; below is an example.
import spark.implicits._
val colorStringArr = "red,yellow,blue".split(",")
val colorDF =
List(
"red",
"yellow",
"purple"
).toDF("color")
// to derive a column using a list
colorDF.withColumn(
"is_primary_color",
col("color").isin(colorStringArr: _*)
).show()
println( "if you don't want derived column and directly want to filter using a list with isin then .. ")
colorDF.filter(col("color").isin(colorStringArr: _*)).show
Result:
+------+----------------+
| color|is_primary_color|
+------+----------------+
| red| true|
|yellow| true|
|purple| false|
+------+----------------+
if you don't want derived column and directly want to filter using a list with isin then ....
+------+
| color|
+------+
| red|
|yellow|
+------+
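Applied to the question's root column, that would be something like the sketch below (note that isin does exact matching, unlike contains in the question):
import org.apache.spark.sql.functions.col

val x = "parent,child,sub-child".split(",")
// Keeps only rows whose root value is exactly one of the list entries.
df.filter(col("root").isin(x: _*)).show()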
One more way, using array_contains and swapping the arguments:
scala> val x = ("parent,child,sub-child").split(",")
x: Array[String] = Array(parent, child, sub-child)
scala> val df = Seq(("parent"),("grand-parent"),("child"),("sub-child"),("cousin")).toDF("root")
df: org.apache.spark.sql.DataFrame = [root: string]
scala> df.show
+------------+
| root|
+------------+
| parent|
|grand-parent|
| child|
| sub-child|
| cousin|
+------------+
scala> df.withColumn("check", array_contains(lit(x),'root)).show
+------------+-----+
| root|check|
+------------+-----+
| parent| true|
|grand-parent|false|
| child| true|
| sub-child| true|
| cousin|false|
+------------+-----+
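If you only want the matching rows rather than a boolean column, you can filter on the same expression (a sketch, continuing the same shell session):
// Keep only the rows flagged true above.
df.filter(array_contains(lit(x), 'root)).show()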
Here are my two cents
val filters = List(1,2,3)
val data = List(5,1,2,1,3,3,2,1,4)
val colName = "number"
val df = spark.
sparkContext.
parallelize(data).
toDF(colName).
filter(
r => filters.contains(r.getAs[Int](colName))
)
df.show()
which results in
+------+
|number|
+------+
| 1|
| 2|
| 1|
| 3|
| 3|
| 2|
| 1|
+------+

How to get the name of a Spark Column as String?

I want to write a method to round a numeric column without doing something like:
df
.select(round($"x",2).as("x"))
Therefore I need to have a reusable column-expression like:
def roundKeepName(c:Column,scale:Int) = round(c,scale).as(c.name)
Unfortunately c.name does not exist, therefore the above code does not compile. I've found a solution for ColumnName:
def roundKeepName(c:ColumnName,scale:Int) = round(c,scale).as(c.string.name)
But how can I do that with Column (which is generated if I use col("x") instead of $"x")?
Not sure if the question has really been answered. Your function could be implemented like this (toString returns the name of the column):
def roundKeepname(c:Column,scale:Int) = round(c,scale).as(c.toString)
In case you don't like relying on toString, here is a more robust version. You can rely on the underlying expression, cast it to a NamedExpression and take its name.
import org.apache.spark.sql.catalyst.expressions.NamedExpression
def roundKeepname(c: Column, scale: Int) =
  round(c, scale).as(c.expr.asInstanceOf[NamedExpression].name)
And it works:
scala> spark.range(2).select(roundKeepname('id, 2)).show
+---+
| id|
+---+
| 0|
| 1|
+---+
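One caveat worth noting (a defensive sketch, not part of the original answer): if the Column wraps an expression that is not a NamedExpression, e.g. col("x") + 1, the cast throws a ClassCastException, so you may want to fall back to toString:
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.NamedExpression
import org.apache.spark.sql.functions.round

// Use the expression's name when it has one, otherwise fall back to toString.
def roundKeepNameSafe(c: Column, scale: Int): Column = {
  val name = c.expr match {
    case ne: NamedExpression => ne.name
    case other               => other.toString
  }
  round(c, scale).as(name)
}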
EDIT
Finally, if it's OK for you to use the name of the column instead of the Column object, you can change the signature of the function, which yields a much simpler implementation:
def roundKeepName(columnName:String, scale:Int) =
round(col(columnName),scale).as(columnName)
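A quick usage sketch, assuming a dataframe df with a numeric column x:
// Rounds x to 2 decimals and keeps the column name "x".
df.select(roundKeepName("x", 2)).show()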
Update:
With the solution given by BlueSheepToken, here is how you can do it dynamically, assuming all your columns are doubles.
scala> val df = Seq((1.22,4.34,8.93),(3.44,12.66,17.44),(5.66,9.35,6.54)).toDF("x","y","z")
df: org.apache.spark.sql.DataFrame = [x: double, y: double ... 1 more field]
scala> df.show
+----+-----+-----+
| x| y| z|
+----+-----+-----+
|1.22| 4.34| 8.93|
|3.44|12.66|17.44|
|5.66| 9.35| 6.54|
+----+-----+-----+
scala> df.columns.foldLeft(df)( (acc,p) => (acc.withColumn(p+"_t",round(col(p),1)).drop(p).withColumnRenamed(p+"_t",p))).show
+---+----+----+
| x| y| z|
+---+----+----+
|1.2| 4.3| 8.9|
|3.4|12.7|17.4|
|5.7| 9.4| 6.5|
+---+----+----+
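An equivalent single-pass sketch that rounds every column in one select while keeping the original names (assuming the same df):
import org.apache.spark.sql.functions.{col, round}

// Build one rounded, re-aliased expression per column and select them all at once.
df.select(df.columns.map(c => round(col(c), 1).as(c)): _*).show()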

How to get a value from Dataset and store it in a Scala value?

I have a dataframe which looks like this:
scala> avgsessiontime.show()
+-----------------+
| avg|
+-----------------+
|2.073455735838315|
+-----------------+
I need to store the value 2.073455735838315 in a variable. I tried using
avgsessiontime.collect
but that starts giving me Task not serializable exceptions. So to avoid that I started using foreachPartition. But I don't know how to extract the value 2.073455735838315 into a variable.
scala> avgsessiontime.foreachPartition(x => x.foreach(println))
[2.073455735838315]
But when I do this:
avgsessiontime.foreachPartition(x => for (name <- x) name.get(0))
I get a blank/empty result. Even the length returns empty.
avgsessiontime.foreachPartition(x => for (name <- x) name.length)
I know name is of type org.apache.spark.sql.Row, so it should return both of those results.
You might need:
avgsessiontime.first.getDouble(0)
Here, first extracts the Row object, and .getDouble(0) extracts the value from that Row.
val df = Seq(2.0743).toDF("avg")
df.show
+------+
| avg|
+------+
|2.0743|
+------+
df.first.getDouble(0)
// res6: Double = 2.0743
scala> val df = spark.range(10)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
scala> val variable = df.select("id").as[Long].collect
variable: Array[Long] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
In the same way you can extract values of any type, e.g. double or string. You just need to give the data type while selecting values from the df.
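Applying the same pattern to the original avgsessiontime dataframe, a sketch (assuming the avg column is a double and spark.implicits._ is imported):
// Pull the single avg value back to the driver as a Scala Double.
val avg: Double = avgsessiontime.select("avg").as[Double].first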
RDDs and dataframes/datasets are distributed in nature, and foreach and foreachPartition are executed on the executors, transforming the dataframe or RDD on the executors themselves without returning anything. So if you want to return the value to the driver node, you will have to use collect.
Supposing you have a dataframe as
+-----------------+
|avg |
+-----------------+
|2.073455735838315|
|2.073455735838316|
+-----------------+
doing the following will print all the values, which you can store in a variable too
avgsessiontime.rdd.collect().foreach(x => println(x(0)))
it will print
2.073455735838315
2.073455735838316
Now if you want only the first one then you can do
avgsessiontime.rdd.collect()(0)(0)
which will give you
2.073455735838315
I hope the answer is helpful

Spark (scala) - Iterate over DF column and count number of matches from a set of items

So I can now iterate over a column of strings in a dataframe and check whether any of the strings contain any items in a large dictionary (see here, thanks to #raphael-roth and #tzach-zohar). The basic udf (not including broadcasting the dict list) for that is:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The next thing I am trying to do is also COUNT the number of matches that occur from the dict set, in the most efficient way possible (I'm dealing with very large datasets and dict files).
I have been trying to use findAllMatchIn in the UDF, with both count and map:
val checkerUdf = udf { (s: String) => dict.count(_.r.findAllMatchIn(s)) }
// OR
val checkerUdf = udf { (s: String) => dict.map(_.r.findAllMatchIn(s)) }
But the count version gives a type mismatch (found Iterator, required Boolean), and the map version returns a list of iterators (empty and non-empty). I am not sure how to count the non-empty iterators (count, size and length don't work).
Any idea what i'm doing wrong? Is there a better / more efficient way to achieve what I am trying to do?
You can just slightly change the answer from your other question:
import org.apache.spark.sql.functions._
val checkerUdf = udf { (s: String) => dict.count(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
Given the dataframe as
+---+---------+
|id |words |
+---+---------+
|1 |foo |
|2 |barriofoo|
|3 |gitten |
|4 |baa |
+---+---------+
and dict file as
val dict = Set("foo","bar","baaad")
You should have output as
+---+---------+----------+
| id| words|word_check|
+---+---------+----------+
| 1| foo| 1|
| 2|barriofoo| 2|
| 3| gitten| 0|
| 4| baa| 0|
+---+---------+----------+
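If you also want to count repeated occurrences within the same string (which a contains-based count does not), a findAllMatchIn sketch could look like this (assuming the dict entries are plain words with no regex metacharacters):
import org.apache.spark.sql.functions.udf

// Sum the regex matches of every dict entry; toSeq avoids a Set deduplicating equal counts.
val occurrenceCountUdf = udf { (s: String) => dict.toSeq.map(_.r.findAllMatchIn(s).size).sum }
df.withColumn("word_occurrences", occurrenceCountUdf($"words")).show()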
I hope the answer is helpful