Partial Replication of DataFrame rows - Scala

I have a DataFrame with the following structure and data
Source:
Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
A, B, 02/01/2021
M, N, 05/01/2021
I want to transform it into the following: the first two columns are replicated as-is, and for each source row the date is incremented day by day up to a fixed end date (07/01/2021 in this example).
Result:
1, 2, 01/01/2021
1, 2, 02/01/2021
1, 2, 03/01/2021
1, 2, 04/01/2021
1, 2, 05/01/2021
1, 2, 06/01/2021
1, 2, 07/01/2021
A, B, 02/01/2021
A, B, 03/01/2021
A, B, 04/01/2021
A, B, 05/01/2021
A, B, 06/01/2021
A, B, 07/01/2021
M, N, 05/01/2021
M, N, 06/01/2021
M, N, 07/01/2021
Any idea how this can be achieved in Scala Spark?
I found Replicate Spark Row N-times, but it gives no hint on how a particular column can be incremented during replication.

We can use the sequence function to generate the list of dates in the required range, then explode the resulting array to get the DataFrame in the required format.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")

// Sample dataframe
val df = List(
  ("1", "2", "01/01/2021"),
  ("A", "B", "02/01/2021"),
  ("M", "N", "05/01/2021"))
  .toDF("Column1(String)", "Column2(String)", "Date")

df
  // generate one row per date, from the row's own Date up to the fixed end date 07/01/2021
  .withColumn("Date", explode_outer(sequence(to_date('Date, "dd/MM/yyyy"),
    to_date(lit("07/01/2021"), "dd/MM/yyyy"))))
  // format the generated dates back to dd/MM/yyyy strings
  .withColumn("Date", date_format('Date, "dd/MM/yyyy"))
  .show(false)
/*
+---------------+---------------+----------+
|Column1(String)|Column2(String)|Date |
+---------------+---------------+----------+
|1 |2 |01/01/2021|
|1 |2 |02/01/2021|
|1 |2 |03/01/2021|
|1 |2 |04/01/2021|
|1 |2 |05/01/2021|
|1 |2 |06/01/2021|
|1 |2 |07/01/2021|
|A |B |02/01/2021|
|A |B |03/01/2021|
|A |B |04/01/2021|
|A |B |05/01/2021|
|A |B |06/01/2021|
|A |B |07/01/2021|
|M |N |05/01/2021|
|M |N |06/01/2021|
|M |N |07/01/2021|
+---------------+---------------+----------+ */
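The same transformation can also be written with selectExpr if you prefer SQL expressions; this is only a rough sketch of the same sequence/explode approach shown above, not a different method:
// explode the generated date range, then format the dates back to dd/MM/yyyy
df.selectExpr(
  "`Column1(String)`",
  "`Column2(String)`",
  "explode(sequence(to_date(`Date`, 'dd/MM/yyyy'), to_date('07/01/2021', 'dd/MM/yyyy'))) as Date")
  .selectExpr(
    "`Column1(String)`",
    "`Column2(String)`",
    "date_format(`Date`, 'dd/MM/yyyy') as Date")
  .show(false)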

Related

Convert a column from StringType to Json (object)

Here is some sample data:
val df4 = sc.parallelize(List(
("A1",45, "5", 1, 90),
("A2",60, "1", 1, 120),
("A6", 30, "9", 1, 450),
("A7", 89, "7", 1, 333),
("A7", 89, "4", 1, 320),
("A2",60, "5", 1, 22),
("A1",45, "22", 1, 1)
)).toDF("CID","age", "children", "marketplace_id","value")
Thanks to @Shu for this piece of code:
val df5 = df4.selectExpr("CID","""to_json(named_struct("id", children)) as item""", "value", "marketplace_id")
+---+-----------+-----+--------------+
|CID|item |value|marketplace_id|
+---+-----------+-----+--------------+
|A1 |{"id":"5"} |90 |1 |
|A2 |{"id":"1"} |120 |1 |
|A6 |{"id":"9"} |450 |1 |
|A7 |{"id":"7"} |333 |1 |
|A7 |{"id":"4"} |320 |1 |
|A2 |{"id":"5"} |22 |1 |
|A1 |{"id":"22"}|1 |1 |
+---+-----------+-----+--------------+
When you run df5.dtypes you get:
(CID,StringType), (item,StringType), (value,IntegerType), (marketplace_id,IntegerType)
The column item is of string type. Is there a way it can be of a JSON/object type (if that is a thing)?
EDIT 1:
I will describe what I am trying to achieve here; the above two steps remain the same.
val w = Window.partitionBy("CID").orderBy(desc("value"))
val sorted_list = df5.withColumn("item", collect_list("item").over(w)).groupBy("CID").agg(max("item") as "item")
Output:
+---+-------------------------+
|CID|item |
+---+-------------------------+
|A6 |[{"id":"9"}] |
|A2 |[{"id":"1"}, {"id":"5"}] |
|A7 |[{"id":"7"}, {"id":"4"}] |
|A1 |[{"id":"5"}, {"id":"22"}]|
+---+-------------------------+
Now whatever is inside [ ] is a string, which is causing a problem for one of the tools we are using.
Pardon me if this is a basic question; I am new to Scala and Spark.
Store the JSON data using a struct type; check the code below.
scala> dfa
.withColumn("item_without_json",struct($"cid".as("id")))
.withColumn("item_as_json",to_json($"item_without_json"))
.show(false)
+---+-----------+-----+--------------+-----------------+------------+
|CID|item |value|marketplace_id|item_without_json|item_as_json|
+---+-----------+-----+--------------+-----------------+------------+
|A1 |{"id":"A1"}|90 |1 |[A1] |{"id":"A1"} |
|A2 |{"id":"A2"}|120 |1 |[A2] |{"id":"A2"} |
|A6 |{"id":"A6"}|450 |1 |[A6] |{"id":"A6"} |
|A7 |{"id":"A7"}|333 |1 |[A7] |{"id":"A7"} |
|A7 |{"id":"A7"}|320 |1 |[A7] |{"id":"A7"} |
|A2 |{"id":"A2"}|22 |1 |[A2] |{"id":"A2"} |
|A1 |{"id":"A1"}|1 |1 |[A1] |{"id":"A1"} |
+---+-----------+-----+--------------+-----------------+------------+
Based on your comment about converting the dataset to JSON, you would use:
df4
  .select(collect_list(struct($"CID".as("id"))).as("items"))
  .write
  .json(path)
The output will look like:
{"items":[{"id":"A1"},{"id":"A2"},{"id":"A6"},{"id":"A7"}, ...]}
If you need the result in memory to pass down to a function, use toJSON instead of write.json(...).
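If you want an existing JSON string column (like item in df5 above) parsed into a struct rather than left as a string, from_json with an explicit schema is one way to do it. A minimal sketch, assuming the {"id":"..."} shape shown earlier; the names itemSchema and df6 are just for illustration:
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// schema matching the {"id":"..."} strings produced by to_json above
val itemSchema = StructType(Seq(StructField("id", StringType)))

val df6 = df5.withColumn("item", from_json(col("item"), itemSchema))
df6.printSchema() // item is now a struct<id: string>, not a plain string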

How to count the frequency of words with CountVectorizer in spark ML?

The below code gives a count vector for each row in the DataFrame:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.fit(df)
cvModel.transform(df).show(false)
The result is:
+---+---------------+-------------------------+
|id |words |features |
+---+---------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
How do I get the total count of each word, like:
+---+------+------+
|id |words |counts|
+---+------+------+
|0 |a | 3 |
|1 |b | 3 |
|2 |c | 2 |
+---+------+------+
Shankar's answer only gives you the actual frequencies if the CountVectorizer model keeps every single word in the corpus (i.e. no minDF or vocabSize limitations). In these cases you can use Summarizer to sum each Vector directly. Note that Summarizer requires Spark 2.3+.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer.metrics
import spark.implicits._

// You need to select normL1 and another metric (like mean) because, for some reason,
// Spark won't allow one Vector to be selected at a time (at least in 2.4)
val totalCounts = cvModel.transform(df)
  .select(metrics("normL1", "mean").summary($"features").as("summary"))
  .select("summary.normL1", "summary.mean")
  .as[(Vector, Vector)]
  .first()
  ._1
You'll then have to zip totalCounts with cvModel.vocabulary to get the words themselves.
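That zip step might look like the following; a minimal sketch, assuming spark.implicits._ is in scope so toDF is available (wordCounts is just an illustrative name):
// pair each vocabulary word with its summed frequency and turn the result into a DataFrame
val wordCounts = cvModel.vocabulary
  .zip(totalCounts.toArray)
  .toSeq
  .toDF("words", "counts")
wordCounts.show(false)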
You can simply explode and groupBy to get the count of each word:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

cvModel.transform(df).withColumn("words", explode($"words"))
  .groupBy($"words")
  .agg(count($"words").as("counts"))
  .withColumn("id", row_number().over(Window.orderBy("words")) - 1)
  .show(false)
Output:
+-----+------+---+
|words|counts|id |
+-----+------+---+
|a    |3     |0  |
|b    |3     |1  |
|c    |2     |2  |
+-----+------+---+

How to convert column values into a single array in scala?

I am trying to collapse each column of my DataFrame into a single array (one array per column).
Is there an operation supported in Structured Streaming that does something opposite to "explode"?
Any suggestion is much appreciated!
I tried collect() and collectAsList(), but they are not supported in streaming.
+---+---------------+----------------+--------+
|row|ADDRESS_TYPE_CD|DISCONTINUE_DATE|param_cd|
+---+---------------+----------------+--------+
|0 |1 |null |7 |
|2 |6 |null |1 |
+---+---------------+----------------+--------+
My result should look like:
+------+---------------+----------------+--------+
|row   |ADDRESS_TYPE_CD|DISCONTINUE_DATE|param_cd|
+------+---------------+----------------+--------+
|[0,2] |[1,6]          |[null,null]     |[7,1]   |
+------+---------------+----------------+--------+
You can use collect_list on all your columns, for instance. It would go as follows:
import org.apache.spark.sql.functions.{col, collect_list}

val aggs = df.columns.map(c => collect_list(col(c)) as c)
df.select(aggs: _*).show()
+------+---------------+----------------+--------+
| row|ADDRESS_TYPE_CD|DISCONTINUE_DATE|param_cd|
+------+---------------+----------------+--------+
|[0, 2]| [1, 6]| [null, null]| [7, 1]|
+------+---------------+----------------+--------+

How to make pairs of nodes using filtering in Spark?

I have the following DataFrame in Spark and Scala:
nodeId typeFrom typeTo date
1 A G 2016-10-12T12:10:00.000Z
2 B A 2016-10-12T12:00:00.000Z
3 A B 2016-10-12T12:05:00.000Z
4 D C 2016-10-12T12:30:00.000Z
5 G D 2016-10-12T12:35:00.000Z
I want to make pairs of nodeId for those cases when typeFrom and typeTo values are the same.
The expected output for the example shown above is:
nodeId_1 nodeId_2 type date
1 2 A 2016-10-12T12:10:00.000Z
3 2 A 2016-10-12T12:05:00.000Z
2 3 B 2016-10-12T12:00:00.000Z
4 5 C 2016-10-12T12:30:00.000Z
5 1 G 2016-10-12T12:35:00.000Z
I don't know how to make pairs of nodeId:
df
  .filter($"typeFrom" === $"typeTo")
  .???
You can use a self-join, matching typeFrom with typeTo:
val df = Seq(
(1, "A", "G", "2016-10-12T12:10:00.000Z"),
(2, "B", "A", "2016-10-12T12:00:00.000Z"),
(3, "A", "B", "2016-10-12T12:05:00.000Z"),
(4, "D", "C", "2016-10-12T12:30:00.000Z"),
(5, "G", "D", "2016-10-12T12:35:00.000Z")
).toDF("nodeId", "typeFrom", "typeTo", "date")
df.as("df1").join(
df.as("df2"),
$"df1.typeFrom" === $"df2.typeTo"
).select(
$"df1.nodeId".as("nodeId_1"), $"df2.nodeId".as("nodeId_2"), $"df1.typeFrom".as("type"), $"df1.date"
).show(truncate=false)
// +--------+--------+----+------------------------+
// |nodeId_1|nodeId_2|type|date |
// +--------+--------+----+------------------------+
// |1 |2 |A |2016-10-12T12:10:00.000Z|
// |2 |3 |B |2016-10-12T12:00:00.000Z|
// |3 |2 |A |2016-10-12T12:05:00.000Z|
// |4 |5 |D |2016-10-12T12:30:00.000Z|
// |5 |1 |G |2016-10-12T12:35:00.000Z|
// +--------+--------+----+------------------------+

How to create pairs of nodes in Spark?

I have the following DataFrame in Spark and Scala:
group nodeId date
1 1 2016-10-12T12:10:00.000Z
1 2 2016-10-12T12:00:00.000Z
1 3 2016-10-12T12:05:00.000Z
2 1 2016-10-12T12:30:00.000Z
2 2 2016-10-12T12:35:00.000Z
I need to group records by group, sort them in ascending order by date and make pairs of sequential nodeId. Also, date should be converted to Unix epoch.
This can be better explained with the expected output:
group nodeId_1 nodeId_2 date
1 2 3 2016-10-12T12:00:00.000Z
1 3 1 2016-10-12T12:05:00.000Z
2 1 2 2016-10-12T12:30:00.000Z
This is what I did so far:
df
.groupBy("group")
.agg($"nodeId",$"date")
.orderBy(asc("date"))
But I don't know how to create pairs of nodeId.
You can benefit from using a Window function together with the lead built-in function to create the pairs, and the to_utc_timestamp built-in function to convert the date to a timestamp. Finally, you filter out the unpaired rows, since you don't need them in the output.
The following program implements the above explanation; I have used comments for clarity.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

// window grouping by "group" and ordering by "date"
def windowSpec = Window.partitionBy("group").orderBy("date")

df.withColumn("date", to_utc_timestamp(col("date"), "Asia/Kathmandu")) // convert the date string to a timestamp; choose another timezone as required
  .withColumn("nodeId_2", lead("nodeId", 1).over(windowSpec)) // pair each nodeId with the next one in its window
  .filter(col("nodeId_2").isNotNull) // filter out the unpaired (last) rows
  .select(col("group"), col("nodeId").as("nodeId_1"), col("nodeId_2"), col("date")) // select the final dataframe as required
  .show(false)
You should get the final DataFrame as required:
+-----+--------+--------+-------------------+
|group|nodeId_1|nodeId_2|date |
+-----+--------+--------+-------------------+
|1 |2 |3 |2016-10-12 12:00:00|
|1 |3 |1 |2016-10-12 12:05:00|
|2 |1 |2 |2016-10-12 12:30:00|
+-----+--------+--------+-------------------+
I hope the answer is helpful.
Note: to get the correct epoch date I have used Asia/Kathmandu as the timezone.
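If you actually need a numeric Unix epoch (seconds) rather than a timestamp column, unix_timestamp could be used in place of to_utc_timestamp. A minimal sketch of that variant, reusing the windowSpec defined above; the format pattern is an assumption based on the date strings shown in the question:
df.withColumn("date", unix_timestamp(col("date"), "yyyy-MM-dd'T'HH:mm:ss.SSSX")) // seconds since the epoch
  .withColumn("nodeId_2", lead("nodeId", 1).over(windowSpec))
  .filter(col("nodeId_2").isNotNull)
  .select(col("group"), col("nodeId").as("nodeId_1"), col("nodeId_2"), col("date"))
  .show(false)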
If I understand your requirement correctly, you can use a self-join on group and a < inequality condition on nodeId:
val df = Seq(
(1, 1, "2016-10-12T12:10:00.000Z"),
(1, 2, "2016-10-12T12:00:00.000Z"),
(1, 3, "2016-10-12T12:05:00.000Z"),
(2, 1, "2016-10-12T12:30:00.000Z"),
(2, 2, "2016-10-12T12:35:00.000Z")
).toDF("group", "nodeId", "date")
df.as("df1").join(
df.as("df2"),
$"df1.group" === $"df2.group" && $"df1.nodeId" < $"df2.nodeId"
).select(
$"df1.group", $"df1.nodeId", $"df2.nodeId",
when($"df1.date" < $"df2.date", $"df1.date").otherwise($"df2.date").as("date")
)
// +-----+------+------+------------------------+
// |group|nodeId|nodeId|date |
// +-----+------+------+------------------------+
// |1 |1 |3 |2016-10-12T12:05:00.000Z|
// |1 |1 |2 |2016-10-12T12:00:00.000Z|
// |1 |2 |3 |2016-10-12T12:00:00.000Z|
// |2 |1 |2 |2016-10-12T12:30:00.000Z|
// +-----+------+------+------------------------+