How to convert column values into a single array in Scala?

I am trying to convert each column of my dataframe into a single array.
Is there an operation supported in Structured Streaming that does the opposite of "explode"?
I tried collect() and collectAsList(), but they are not supported in streaming.
Any suggestion is much appreciated! My input looks like this:
+---+---------------+----------------+--------+
|row|ADDRESS_TYPE_CD|DISCONTINUE_DATE|param_cd|
+---+---------------+----------------+--------+
|0  |1              |null            |7       |
|2  |6              |null            |1       |
+---+---------------+----------------+--------+
My result should look like:
+------+---------------+----------------+--------+
|row   |ADDRESS_TYPE_CD|DISCONTINUE_DATE|param_cd|
+------+---------------+----------------+--------+
|[0, 2]|[1, 6]         |[null, null]    |[7, 1]  |
+------+---------------+----------------+--------+

You can use collect_list on all of your columns, for instance. It would go as follows:
import org.apache.spark.sql.functions.{col, collect_list}

val aggs = df.columns.map(c => collect_list(col(c)) as c)
df.select(aggs: _*).show()
+------+---------------+----------------+--------+
|   row|ADDRESS_TYPE_CD|DISCONTINUE_DATE|param_cd|
+------+---------------+----------------+--------+
|[0, 2]|         [1, 6]|    [null, null]|  [7, 1]|
+------+---------------+----------------+--------+
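For reference, here is a minimal, self-contained batch sketch (my addition, not from the original answer) for trying the snippet above, assuming a local SparkSession named spark:
import org.apache.spark.sql.functions.{col, collect_list}
import spark.implicits._

val df = Seq(
  (0, 1, Option.empty[String], 7),
  (2, 6, Option.empty[String], 1)
).toDF("row", "ADDRESS_TYPE_CD", "DISCONTINUE_DATE", "param_cd")

// One collect_list aggregation per column collapses the frame to a single row.
// Note: collect_list skips nulls, so DISCONTINUE_DATE may come back as an empty
// array rather than [null, null].
val aggs = df.columns.map(c => collect_list(col(c)) as c)
df.select(aggs: _*).show(false)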

Related

Scala -- apply a custom if-then on a dataframe

I have this kind of dataset:
val cols = Seq("col_1", "col_2")
val data = List(
  ("a", 1),
  ("b", 1),
  ("a", 2),
  ("c", 3),
  ("a", 3)
)
val df = spark.createDataFrame(data).toDF(cols: _*)
+-----+-----+
|col_1|col_2|
+-----+-----+
|a    |1    |
|b    |1    |
|a    |2    |
|c    |3    |
|a    |3    |
+-----+-----+
I want to add an if-then column based on the existing columns.
df
  .withColumn("col_new",
    when(col("col_2").isin(2, 5), "str_1")
      .when(col("col_2").isin(4, 6), "str_2")
      .when(col("col_2").isin(1) && col("col_1").contains("a"), "str_3")
      .when(col("col_2").isin(3) && col("col_1").contains("b"), "str_1")
      .when(col("col_2").isin(1, 2, 3), "str_4")
      .otherwise(lit("other")))
Instead of the chain of when-then statements, I would prefer to apply a custom function; in Python I would use a lambda and map.
thank you!
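One possible approach (a sketch added here, not from the original thread) is to move the branching into a plain Scala function wrapped in a UDF; it reproduces the when-then logic above and assumes the same df:
import org.apache.spark.sql.functions.{col, udf}

// Plain Scala branching logic wrapped in a UDF.
val categorize = udf { (c1: String, c2: Int) =>
  if (Seq(2, 5).contains(c2)) "str_1"
  else if (Seq(4, 6).contains(c2)) "str_2"
  else if (c2 == 1 && c1.contains("a")) "str_3"
  else if (c2 == 3 && c1.contains("b")) "str_1"
  else if (Seq(1, 2, 3).contains(c2)) "str_4"
  else "other"
}

df.withColumn("col_new", categorize(col("col_1"), col("col_2"))).show()
Note that UDFs are opaque to the Catalyst optimizer, so the when-then chain usually optimizes better; the UDF mainly buys readability when the logic gets long.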

Scala spark: Access value in a struct in an array-typed column? (Or, access member of anonymous struct-typed column)

I have a dataframe that has a column which is an array of structs, like:
+-----+-----+------------------+---+----+
|index|state|entries           |0  |1   |
+-----+-----+------------------+---+----+
|0    |KY   |[[A, 45]]         |45 |null|
|1    |OR   |[[A, 30], [B, 10]]|30 |10  |
+-----+-----+------------------+---+----+
where "Entries" are structs with two fields, "name" and "number". I want to be able to grab one of those inner values at a particular index.
One way I could do this is:
df.withColumn(col("entries").getItem(0), "dumbName").select("dumbName.name")
I want to be able to do this with anonymous columns, though, so it would look more like
col("entries").getItem(0).someMagicFunction("name")
getItem works as that magic function:
df.select(col("entries").getItem(0).getItem("name")).show()
prints
+---------------+
|entries[0].name|
+---------------+
|              A|
|              A|
+---------------+
It is also possible to use element_at from the functions object (available since 2.4.0):
df.select(element_at('entries, 1).getItem("name")).show()
prints
+---------------------------+
|element_at(entries, 1).name|
+---------------------------+
|                          A|
|                          A|
+---------------------------+
For earlier Spark versions it would be possible to use SQL:
df.createOrReplaceGlobalTempView("df")
spark.sql("select entries[0].name from global_temp.df").show()

How to count the frequency of words with CountVectorizer in spark ML?

The below code gives a count vector for each row in the DataFrame:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(df)

cvModel.transform(df).show(false)
The result is:
+---+---------------+-------------------------+
|id |words          |features                 |
+---+---------------+-------------------------+
|0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
How can I get the total count of each word, like:
+---+-----+------+
|id |words|counts|
+---+-----+------+
|0  |a    |3     |
|1  |b    |3     |
|2  |c    |2     |
+---+-----+------+
Shankar's answer below only gives you the actual frequencies if the CountVectorizer model keeps every single word in the corpus (i.e. no minDF or vocabSize limitation). In those cases you can use Summarizer to sum each feature Vector directly. Note: this requires Spark 2.3+ for Summarizer.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer.metrics
import spark.implicits._

// You need to select normL1 and another metric (like mean) because, for some reason,
// Spark won't allow a single Vector metric to be selected at a time (at least in 2.4).
val totalCounts = cvModel.transform(df)
  .select(metrics("normL1", "mean").summary($"features").as("summary"))
  .select("summary.normL1", "summary.mean")
  .as[(Vector, Vector)]
  .first()
  ._1
You'll then have to zip totalCounts with cvModel.vocabulary to get the words themselves.
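A sketch of that zipping step (an addition, not part of the original answer), assuming totalCounts is the ml Vector obtained above:
// Pair each vocabulary term with its summed count.
val wordCounts = cvModel.vocabulary.zip(totalCounts.toArray)
wordCounts.foreach { case (word, count) => println(s"$word -> $count") }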
You can simply explode and groupBy to get the count of each word:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, explode, row_number}
import spark.implicits._

cvModel.transform(df).withColumn("words", explode($"words"))
  .groupBy($"words")
  .agg(count($"words").as("counts"))
  .withColumn("id", row_number().over(Window.orderBy("words")) - 1)
  .show(false)
Output:
+-----+------+---+
|words|counts|id |
+-----+------+---+
|a    |3     |0  |
|b    |3     |1  |
|c    |2     |2  |
+-----+------+---+
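As a side note (my addition), the CountVectorizer transform isn't actually needed for this particular count; the raw words column can be exploded directly, assuming the same df:
import org.apache.spark.sql.functions.explode

df.select(explode($"words").as("word"))
  .groupBy("word")
  .count()
  .show(false)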

How can I do map reduce on spark dataframe group by conditional columns?

My spark dataframe looks like this:
+------+------+-------+-----+
|userid|useid1|userid2|score|
+------+------+-------+-----+
|23    |null  |dsad   |3    |
|11    |44    |null   |4    |
|231   |null  |temp   |5    |
|231   |null  |temp   |2    |
+------+------+-------+-----+
I want to do the calculation for each pair of userid and useid1/userid2 (whichever is not null).
If it's useid1, I multiply the score by 5; if it's userid2, I multiply the score by 3.
Finally, I want to sum the scores for each pair.
The result should be:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23    |dsad    |9          |
|11    |44      |20         |
|231   |temp    |21         |
+------+--------+-----------+
How can I do this?
For the groupBy part, I know the dataframe has a groupBy function, but I don't know whether I can use it conditionally, i.e. if useid1 is null, groupBy(userid, userid2); if userid2 is null, groupBy(userid, useid1).
For the calculation part, how do I multiply by 3 or 5 based on the condition?
The solution below should solve your problem.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val groupByUserWinFun = Window.partitionBy("userid", "useid1/2")

val finalScoreDF = userDF
  .withColumn("useid1/2", when($"useid1".isNull, $"userid2").otherwise($"useid1"))
  .withColumn("finalscore", when($"useid1".isNull, $"score" * 3).otherwise($"score" * 5))
  .withColumn("finalscore", sum("finalscore").over(groupByUserWinFun))
  .select("userid", "useid1/2", "finalscore")
  .distinct()
Using the when method from Spark SQL, we select useid1 or userid2 and multiply the score based on that condition.
Output:
+------+--------+----------+
|userid|useid1/2|finalscore|
+------+--------+----------+
|    11|      44|      20.0|
|    23|    dsad|       9.0|
|   231|    temp|      21.0|
+------+--------+----------+
Group by will work:
import org.apache.spark.sql.functions.{coalesce, sum, when}
import spark.implicits._

val original = Seq(
  (23, null, "dsad", 3),
  (11, "44", null, 4),
  (231, null, "temp", 5),
  (231, null, "temp", 2)
).toDF("userid", "useid1", "userid2", "score")

// action
val result = original
  .withColumn("useid1/2", coalesce($"useid1", $"userid2"))
  .withColumn("score", $"score" * when($"useid1".isNotNull, 5).otherwise(3))
  .groupBy("userid", "useid1/2")
  .agg(sum("score").alias("final score"))

result.show(false)
Output:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23    |dsad    |9          |
|231   |temp    |21         |
|11    |44      |20         |
+------+--------+-----------+
coalesce will do what you need.
df.withColumn("useid1/2", coalesce(col("useid1"), col("userid2")))
Basically, this function returns the first non-null value in the order given.
Documentation:
COALESCE(T v1, T v2, ...)
Returns the first v that is not NULL, or NULL if all v's are NULL.
It needs an import: import org.apache.spark.sql.functions.coalesce
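A quick illustration of that behavior (an addition, not from the original answer), reusing the original DataFrame built in the previous answer:
import org.apache.spark.sql.functions.{coalesce, col}

// For each row, coalesce keeps the first non-null of useid1 / userid2,
// e.g. (23, null, "dsad") -> "dsad" and (11, "44", null) -> "44".
original.select(col("userid"), coalesce(col("useid1"), col("userid2")).as("useid1/2")).show(false)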

Spark scala join RDD between 2 datasets

Suppose I have two datasets as follows:
Dataset 1:
id, name, score
1, Bill, 200
2, Bew, 23
3, Amy, 44
4, Ramond, 68
Dataset 2:
id,message
1, i love Bill
2, i hate Bill
3, Bew go go !
4, Amy is the best
5, Ramond is the wrost
6, Bill go go
7, Bill i love ya
8, Ramond is Bad
9, Amy is great
I want to join the two datasets above and count how many times each person's name from dataset1 appears in dataset2. The result should be:
Bill, 4
Ramond, 2
..
..
I managed to join the two together, but I'm not sure how to count how many times each person's name appears.
Any suggestion would be appreciated.
Edited:
my join code:
val rdd = sc.textFile("dataset1")
val rdd2 = sc.textFile("dataset2")

val rddPair1 = rdd.map { x =>
  val data = x.split(",")
  (data(0), data(1))
}

val rddPair2 = rdd2.map { x =>
  val data = x.split(",")
  (data(0), data(1))
}

rddPair1.join(rddPair2).collect().foreach { f =>
  println(f._1 + " " + f._2._1 + " " + f._2._2)
}
Achieving what you want with RDDs would be complex; it is much simpler with DataFrames.
The first step is to read your two files into DataFrames, as below:
val df1 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset1")

val df2 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset2")
so that you have
df1
+---+------+-----+
|id |name  |score|
+---+------+-----+
|1  |Bill  |200  |
|2  |Bew   |23   |
|3  |Amy   |44   |
|4  |Ramond|68   |
+---+------+-----+
df2
+---+-------------------+
|id |message            |
+---+-------------------+
|1  |i love Bill        |
|2  |i hate Bill        |
|3  |Bew go go !        |
|4  |Amy is the best    |
|5  |Ramond is the wrost|
|6  |Bill go go         |
|7  |Bill i love ya     |
|8  |Ramond is Bad      |
|9  |Amy is great       |
+---+-------------------+
A join, groupBy and count should give your desired output:
df1.join(df2, df2("message").contains(df1("name")), "left")
  .groupBy("name")
  .count()
  .show(false)
Final output would be
+------+-----+
|name  |count|
+------+-----+
|Ramond|2    |
|Bill  |4    |
|Amy   |2    |
|Bew   |1    |
+------+-----+
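For completeness, here is a rough RDD-based sketch (an addition, not part of the original answer) of the same count, staying close to the asker's code; it assumes header lines are filtered out and that no name appears as a substring of another:
// Collect the names from dataset1, broadcast them, then scan every message.
val names = rdd.filter(!_.startsWith("id")).map(_.split(",")(1).trim).collect()
val namesBc = sc.broadcast(names)

val nameCounts = rdd2
  .filter(!_.startsWith("id"))
  .map(_.split(",")(1))                                  // the message text
  .flatMap(msg => namesBc.value.filter(n => msg.contains(n)))
  .map(name => (name, 1))
  .reduceByKey(_ + _)

nameCounts.collect().foreach(println)                    // e.g. (Bill,4), (Ramond,2), ...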