Dynamic GroupBy and count using Spark Dataframes/Datasets - scala

The use case is to group by each column in a given dataset, and get the count of that column.
The resulting set is (key, value) map and then finally uinion of them all.
For eg
students = {(age, firstname, lastname)(12, "FN", "LN"), (13, "df", "gh")}
groupby age => (12, 1), (13, 1)
groupby firstname => etc
I know the brute force approach is to do a map and maintain a map for count for each column
but i wanted to see if there is something more we can do with maybe foldLeft and windows function. I tried using rollup and cube but that does groups all column together rather than indivdual

Assuming that you need Key, Value, Grouping Column name as three columns in the output, you would have to use the below code so that key and grouping column relationships can be understood.
Code
val df = Seq(("12", "FN", "LN"),
("13", "FN", "gh")).toDF("age", "firstname", "lastname")
df.show(false)
val initialDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], StructType(
Seq(StructField("Key", StringType), StructField("Value", IntegerType),
StructField("GroupColumn", StringType))
))
val resultantDf = df.columns.foldLeft(initialDF)((df1, column) => df1.union(
df.groupBy(column).count().withColumn("GroupColumn", lit(column))
))
resultantDf.show(false)
resultantDf.collect().map { row =>
(row.getString(0), row.getLong(1))
}.foreach(println)
Output
INPUT DF:
+---+---------+--------+
|age|firstname|lastname|
+---+---------+--------+
|12 |FN |LN |
|13 |FN |gh |
+---+---------+--------+
OUTPUT DF:
+---+-----+-----------+
|Key|Value|GroupColumn|
+---+-----+-----------+
|12 |1 |age |
|13 |1 |age |
|FN |2 |firstname |
|gh |1 |lastname |
|LN |1 |lastname |
+---+-----+-----------+
OUTPUT LIST:
(12,1)
(13,1)
(FN,2)
(gh,1)
(LN,1)

Assuming that you need Union of the grouped data frames, I was able to solve it as below:
Code
val df = Seq(("12", "FN", "LN"),
("13", "FN", "gh")).toDF("age", "firstname", "lastname")
df.show(false)
val initialDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], StructType(
Seq(StructField("column", StringType), StructField("count", IntegerType))
))
df.columns.foldLeft(initialDF)((df1, column) => df1.union(df.groupBy(column).count())).show(false)
Output
INPUT DF:
+---+---------+--------+
|age|firstname|lastname|
+---+---------+--------+
|12 |FN |LN |
|13 |FN |gh |
+---+---------+--------+
OUTPUT DF:
+------+-----+
|column|count|
+------+-----+
|12 |1 |
|13 |1 |
|FN |2 |
|gh |1 |
|LN |1 |
+------+-----+

Related

Extracting array elements and mapping them to case class

The following is the output I am getting after performing a groupByKey, mapGroups and then a joinWith operation on the caseclass dataset:
+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_1 |_2 |
+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[IND0001,Christopher,Black] |null |
|[IND0002,Madeleine,Kerr] |[IND0002,WrappedArray([IND0002,ACC0155,323], [IND0002,ACC0262,60])] |
|[IND0003,Sarah,Skinner] |[IND0003,WrappedArray([IND0003,ACC0235,631], [IND0003,ACC0486,400], [IND0003,ACC0540,53])] |
|[IND0004,Rachel,Parsons] |[IND0004,WrappedArray([IND0004,ACC0116,965])] |
|[IND0005,Oliver,Johnston] |[IND0005,WrappedArray([IND0005,ACC0146,378], [IND0005,ACC0201,34], [IND0005,ACC0450,329])] |
|[IND0006,Carl,Metcalfe] |[IND0006,WrappedArray([IND0006,ACC0052,57], [IND0006,ACC0597,547])] |
The code is as follows:
val test = accountDS.groupByKey(_.customerId).mapGroups{ case (id, xs) => (id, xs.toSeq)}
test.show(false)
val newTest = customerDS.joinWith(test, customerDS("customerId") === test("_1"), "leftouter")
newTest.show(500,false)
Now I want to take the arrays and output them in a format as follows:
+----------+-----------+----------+---------------------------------------------------------------------+--------------+------------+-----------------+
* |customerId|forename |surname |accounts |numberAccounts|totalBalance|averageBalance |
* +----------+-----------+----------+---------------------------------------------------------------------+--------------+------------+-----------------+
* |IND0001 |Christopher|Black |[] |0 |0 |0.0 |
* |IND0002 |Madeleine |Kerr |[[IND0002,ACC0155,323], [IND0002,ACC0262,60]] |2 |383 |191.5 |
* |IND0003 |Sarah |Skinner |[[IND0003,ACC0235,631], [IND0003,ACC0486,400], [IND0003,ACC0540,53]] |3 |1084 |361.3333333333333|
Note: I cannot use spark.sql.functions._ at all --> training academy rules :(
How do I get the above output which should be mapped to a case class as follows:
case class CustomerAccountOutput(
customerId: String,
forename: String,
surname: String,
//Accounts for this customer
accounts: Seq[AccountData],
//Statistics of the accounts
numberAccounts: Int,
totalBalance: Long,
averageBalance: Double
)
I really need help with this. Stuck with this for weeks without a working solution.
Let's say you have the following DataFrame:
val sourceDf = Seq((1, Array("aa", "CA")), (2, Array("bb", "OH"))).toDF("id", "data_arr")
sourceDf.show()
// output:
+---+--------+
| id|data_arr|
+---+--------+
| 1|[aa, CA]|
| 2|[bb, OH]|
+---+--------+
and you want to convert it to the following schema:
val destSchema = StructType(Array(
StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("state", StringType, true)
))
You can do:
val destDf: DataFrame = sourceDf
.map { sourceRow =>
Row(sourceRow(0), sourceRow.getAs[mutable.WrappedArray[String]](1)(0), sourceRow.getAs[mutable.WrappedArray[String]](1)(1))
}(RowEncoder(destSchema))
destDf.show()
// output:
+---+----+-----+
| id|name|state|
+---+----+-----+
| 1| aa| CA|
| 2| bb| OH|
+---+----+-----+

How to count the frequency of words with CountVectorizer in spark ML?

The below code gives a count vector for each row in the DataFrame:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.fit(df)
cvModel.transform(df).show(false)
The result is:
+---+---------------+-------------------------+
|id |words |features |
+---+---------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
How to get total counts of each words, like:
+---+------+------+
|id |words |counts|
+---+------+------+
|0 |a | 3 |
|1 |b | 3 |
|2 |c | 2 |
+---+------+------+
Shankar's answer only gives you the actual frequencies if the CountVectorizer model keeps every single word in the corpus (e.g. no minDF or VocabSize limitations). In these cases you can use Summarizer to directly sum each Vector. Note: this requires Spark 2.3+ for Summarizer.
import org.apache.spark.ml.stat.Summarizer.metrics
// You need to select normL1 and another item (like mean) because, for some reason, Spark
// won't allow one Vector to be selected at a time (at least in 2.4)
val totalCounts = cvModel.transform(df)
.select(metrics("normL1", "mean").summary($"features").as("summary"))
.select("summary.normL1", "summary.mean")
.as[(Vector, Vector)]
.first()
._1
You'll then have to zip totalCounts with cvModel.vocabulary to get the words themselves.
You can simply explode and groupBy to get the count of each word
cvModel.transform(df).withColumn("words", explode($"words"))
.groupBy($"words")
.agg(count($"words").as("counts"))
.withColumn("id", row_number().over(Window.orderBy("words")) -1)
.show(false)
Output:
+-----+------+---+
|words|counts|id |
+-----+------+---+
|a |3 |1 |
|b |3 |2 |
|c |2 |3 |
+-----+------+---+

Add new record before another in Spark

I have a Dataframe:
| ID | TIMESTAMP | VALUE |
1 15:00:01 3
1 17:04:02 2
I want to add a new record with Spark-Scala before with the same time minus 1 second when the value is 2.
The output would be:
| ID | TIMESTAMP | VALUE |
1 15:00:01 3
1 17:04:01 2
1 17:04:02 2
Thanks
You need a .flatMap()
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
val data = (spark.createDataset(Seq(
(1, "15:00:01", 3),
(1, "17:04:02", 2)
)).toDF("ID", "TIMESTAMP_STR", "VALUE")
.withColumn("TIMESTAMP", $"TIMESTAMP_STR".cast("timestamp").as("TIMESTAMP"))
.drop("TIMESTAMP_STR")
.select("ID", "TIMESTAMP", "VALUE")
)
data.as[(Long, java.sql.Timestamp, Long)].flatMap(r => {
if(r._3 == 2) {
Seq(
(r._1, new java.sql.Timestamp(r._2.getTime() - 1000L), r._3),
(r._1, r._2, r._3)
)
} else {
Some(r._1, r._2, r._3)
}
}).toDF("ID", "TIMESTAMP", "VALUE").show()
Which results in:
+---+-------------------+-----+
| ID| TIMESTAMP|VALUE|
+---+-------------------+-----+
| 1|2019-03-04 15:00:01| 3|
| 1|2019-03-04 17:04:01| 2|
| 1|2019-03-04 17:04:02| 2|
+---+-------------------+-----+
You can introduce a new column array - when value =2 then Array(-1,0) else Array(0), then explode that column and add it with the timestamp as seconds. The below one should work for you. Check this out:
scala> val df = Seq((1,"15:00:01",3),(1,"17:04:02",2)).toDF("id","timestamp","value")
df: org.apache.spark.sql.DataFrame = [id: int, timestamp: string ... 1 more field]
scala> val df2 = df.withColumn("timestamp",'timestamp.cast("timestamp"))
df2: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 1 more field]
scala> df2.show(false)
+---+-------------------+-----+
|id |timestamp |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
scala> val df3 = df2.withColumn("newc", when($"value"===lit(2),lit(Array(-1,0))).otherwise(lit(Array(0))))
df3: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 2 more fields]
scala> df3.show(false)
+---+-------------------+-----+-------+
|id |timestamp |value|newc |
+---+-------------------+-----+-------+
|1 |2019-03-04 15:00:01|3 |[0] |
|1 |2019-03-04 17:04:02|2 |[-1, 0]|
+---+-------------------+-----+-------+
scala> val df4 = df3.withColumn("c_explode",explode('newc)).withColumn("timestamp2",to_timestamp(unix_timestamp('timestamp)+'c_explode))
df4: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 4 more fields]
scala> df4.select($"id",$"timestamp2",$"value").show(false)
+---+-------------------+-----+
|id |timestamp2 |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:01|2 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
scala>
If you want the time part alone, then you can do like
scala> df4.withColumn("timestamp",from_unixtime(unix_timestamp('timestamp2),"HH:mm:ss")).select($"id",$"timestamp",$"value").show(false)
+---+---------+-----+
|id |timestamp|value|
+---+---------+-----+
|1 |15:00:01 |3 |
|1 |17:04:01 |2 |
|1 |17:04:02 |2 |
+---+---------+-----+

Spark scala join RDD between 2 datasets

Supposed i have two dataset as following:
Dataset 1:
id, name, score
1, Bill, 200
2, Bew, 23
3, Amy, 44
4, Ramond, 68
Dataset 2:
id,message
1, i love Bill
2, i hate Bill
3, Bew go go !
4, Amy is the best
5, Ramond is the wrost
6, Bill go go
7, Bill i love ya
8, Ramond is Bad
9, Amy is great
I wanted to join above two datasets and counting the top number of person's name that appears in dataset2 according to the name in dataset1 the result should be:
Bill, 4
Ramond, 2
..
..
I managed to join both of them together but not sure how to count how many time it appear for each person.
Any suggestion would be appreciated.
Edited:
my join code:
val rdd = sc.textFile("dataset1")
val rdd2 = sc.textFile("dataset2")
val rddPair1 = rdd.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
val rddPair2 = rdd2.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
rddPair1.join(rddPair2).collect().foreach(f =>{
println(f._1+" "+f._2._1+" "+f._2._2)
})
Using RDDs, achieving the solution you desire, would be complex. Not so much using dataframes.
First step would be to read the two files you have into dataframes as below
val df1 = sqlContext.read.format("com.databricks.spark.csv")
.option("header", true)
.load("dataset1")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
.option("header", true)
.load("dataset1")
so that you should be having
df1
+---+------+-----+
|id |name |score|
+---+------+-----+
|1 |Bill |200 |
|2 |Bew |23 |
|3 |Amy |44 |
|4 |Ramond|68 |
+---+------+-----+
df2
+---+-------------------+
|id |message |
+---+-------------------+
|1 |i love Bill |
|2 |i hate Bill |
|3 |Bew go go ! |
|4 |Amy is the best |
|5 |Ramond is the wrost|
|6 |Bill go go |
|7 |Bill i love ya |
|8 |Ramond is Bad |
|9 |Amy is great |
+---+-------------------+
join, groupBy and count should give your desired output as
df1.join(df2, df2("message").contains(df1("name")), "left").groupBy("name").count().as("count").show(false)
Final output would be
+------+-----+
|name |count|
+------+-----+
|Ramond|2 |
|Bill |4 |
|Amy |2 |
|Bew |1 |
+------+-----+

append two dataframes and update data

Hello guys I want to update an old dataframe based on pos_id and article_id field.
If the tuple (pos_id,article_id) exist , I will add each column to the old one, if it doesn't exist I will add the new one. It worked fine. But I don't know how to deal with the case , when the dataframe is intially empty , in this case , I will add the new rows in the second dataframe to the old one. Here it is what I did
val histocaisse = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
val hist = histocaisse
.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'pos_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
val histocaisse2 = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")
val hist2 = histocaisse2.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'pos_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
hist2.show(false)
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-07|2.5 |3.5 |
|2 |2 |2000-01-07|14.7|12.0|
|3 |3 |2000-01-07|3.5 |1.2 |
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|2.5 |3.5 |
|2 |2 |2000-01-08|14.7|12.0|
|3 |3 |2000-01-08|3.5 |1.2 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|5.0 |7.0 |
|2 |2 |2000-01-08|39.4|24.0|
|3 |3 |2000-01-08|7.0 |2.4 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
Here is the solution , i found
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
This case doesn't work when hist1 is empty .Any help please ?
Thanks a lot
Not sure if I understood correctly, but if the problem is sometimes the second dataframe is empty, and that makes the join crash, something you can try is this:
val checkHist1Empty = Try(hist1.first)
val df = checkHist1Empty match {
case Success(df) => {
hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
}
case Failure(e) => {
hist2.select($"pos_id", $"article_id",
coalesce(hist2("date")).alias("date"),
coalesce(hist2("qte"), lit(0)).alias("qte"),
coalesce(hist2("ca"), lit(0)).alias("ca"))
.orderBy("pos_id", "article_id")
}
}
This basically checks if the hist1 is empty before performing the join. In case it is empty it generates the df based on the same logic but applied only to the hist2 dataframe. In case it contains information it applies the logic you had, which you said that works.
instead of doing a join, why don't you do a union of the two dataframes and then groupBy (pos_id,article_id) and apply udf to each column sum for qte and ca.
val df3 = df1.unionAll(df2)
val df4 = df3.groupBy("pos_id", "article_id").agg($"pos_id", $"article_id", max("date"), sum("qte"), sum("ca"))