Spark: repeat a string column's value by the frequency in another column of a dataframe - Scala

I have a dataframe with this schema:
ocId: integer (nullable = true)
freq: integer (nullable = true)
nameFile: string (nullable = true)
word: string (nullable = true)
I want to repeat the value of the word column freq times (word*freq) in a new column.
Example:
freq = 2, word = analyses ----> new column: analyses,analyses.
freq = 3, word = carried ----> new column: carried,carried,carried.
freq = 1, word = atlantic ----> new column: atlantic.
freq = 2, word = hello ----> new column: hello,hello.

You can define a UDF similar to the following:
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df = Seq(
  (1, 2, "a", "analyses"),
  (2, 3, "b", "carried"),
  (3, 1, "c", "atlantic"),
  (4, 2, "d", "hello"),
  (5, 2, "e", ""),
  (6, 1, "f", null),
  (7, 0, "g", "blah")
).toDF("ocId", "freq", "nameFile", "word")

// Repeat `word` `freq` times, comma-separated, ending with a period.
def multiWords = udf(
  (word: String, freq: Int) => word match {
    case null => null
    case ""   => ""
    case _    => if (freq > 0) ((word + ",") * freq).dropRight(1) + "." else ""
  }
)

df.withColumn("multiWords", multiWords($"word", $"freq")).show(false)
// +----+----+--------+--------+------------------------+
// |ocId|freq|nameFile|word |multiWords |
// +----+----+--------+--------+------------------------+
// |1 |2 |a |analyses|analyses,analyses. |
// |2 |3 |b |carried |carried,carried,carried.|
// |3 |1 |c |atlantic|atlantic. |
// |4 |2 |d |hello |hello,hello. |
// |5 |2 |e | | |
// |6 |1 |f |null |null |
// |7 |0 |g |blah | |
// +----+----+--------+--------+------------------------+
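If you are on Spark 2.4 or later, a UDF-free variant is also possible with the built-in array_repeat and concat_ws functions. This is only a sketch under that version assumption, mirroring the UDF's null/empty handling:

import org.apache.spark.sql.functions.{array_repeat, concat, concat_ws, lit, when}

df.withColumn(
    "multiWords",
    when($"word".isNull, lit(null))                   // keep null words as null
      .when($"word" === "" || $"freq" <= 0, lit(""))  // empty word or non-positive freq
      .otherwise(concat(concat_ws(",", array_repeat($"word", $"freq")), lit(".")))
  )
  .show(false)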

Related

Scala: get the column name corresponding to the max column value from a variable list of columns

I have the following working solution in a Databricks notebook as a test.
var maxcol = udf((col1: Long, col2: Long, col3: Long) => {
  var res = ""
  if (col1 > col2 && col1 > col3) res = "col1"
  else if (col2 > col1 && col2 > col3) res = "col2"
  else res = "col3"
  res
})

val someDF = Seq(
  (8, 10, 12, "bat"),
  (64, 61, 59, "mouse"),
  (-27, -30, -15, "horse")
).toDF("number1", "number2", "number3", "word")
  .withColumn("maxColVal", greatest("number1", "number2", "number3"))
  .withColumn("maxColVal_Name", maxcol(col("number1"), col("number2"), col("number3")))

display(someDF)
Is there any way to make this generic? My use case needs to pass a variable list of columns to this UDF and still get, as output, the name of the column holding the max value,
unlike above, where I have hard-coded the column names 'col1', 'col2' and 'col3' in the UDF.
Use the following:
import org.apache.spark.sql.functions.{greatest, lit, map, udf}

val df = List((1,2,3,5,"a"), (4,2,3,1,"a"), (1,20,3,1,"a"), (1,22,22,2,"a"))
  .toDF("mycol1", "mycol2", "mycol3", "mycol4", "mycol5")

// List all the columns among which you want to find the max value.
val colGroup = List(df("mycol1"), df("mycol2"), df("mycol3"), df("mycol4"))

// List column value -> column name pairs for the columns among which you want
// to find the NAME of the max-value column.
val colGroupMap = List(df("mycol1"), lit("mycol1"),
  df("mycol2"), lit("mycol2"),
  df("mycol3"), lit("mycol3"),
  df("mycol4"), lit("mycol4"))

var maxcol = udf((colVal: Map[Int, String]) => {
  colVal.max._2 // the max key is the max column value; its paired value is the column name
})

df.withColumn("maxColValue", greatest(colGroup: _*))
  .withColumn("maxColVal_Name", maxcol(map(colGroupMap: _*)))
  .show(false)
+------+------+------+------+------+-----------+--------------+
|mycol1|mycol2|mycol3|mycol4|mycol5|maxColValue|maxColVal_Name|
+------+------+------+------+------+-----------+--------------+
|1 |2 |3 |5 |a |5 |mycol4 |
|4 |2 |3 |1 |a |4 |mycol1 |
|1 |20 |3 |1 |a |20 |mycol2 |
|1 |22 |22 |2 |a |22 |mycol3 |
+------+------+------+------+------+-----------+--------------+
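To make this fully generic, both lists can be derived from a plain sequence of column names instead of being written out by hand. A minimal sketch (the column names below are just the example's):

import org.apache.spark.sql.functions.{greatest, lit, map, udf}

val candidateCols = Seq("mycol1", "mycol2", "mycol3", "mycol4")
val colGroup      = candidateCols.map(df(_))
val colGroupMap   = candidateCols.flatMap(c => Seq(df(c), lit(c)))

val maxcol = udf((colVal: Map[Int, String]) => colVal.max._2)

df.withColumn("maxColValue", greatest(colGroup: _*))
  .withColumn("maxColVal_Name", maxcol(map(colGroupMap: _*)))
  .show(false)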

Add new record before another in Spark

I have a Dataframe:
| ID | TIMESTAMP | VALUE |
| 1  | 15:00:01  | 3     |
| 1  | 17:04:02  | 2     |
With Spark-Scala, whenever the value is 2, I want to add a new record before it with the same timestamp minus 1 second.
The output would be:
| ID | TIMESTAMP | VALUE |
| 1  | 15:00:01  | 3     |
| 1  | 17:04:01  | 2     |
| 1  | 17:04:02  | 2     |
Thanks
You need a .flatMap()
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
val data = (spark.createDataset(Seq(
    (1, "15:00:01", 3),
    (1, "17:04:02", 2)
  )).toDF("ID", "TIMESTAMP_STR", "VALUE")
  .withColumn("TIMESTAMP", $"TIMESTAMP_STR".cast("timestamp"))
  .drop("TIMESTAMP_STR")
  .select("ID", "TIMESTAMP", "VALUE")
)

data.as[(Long, java.sql.Timestamp, Long)].flatMap(r => {
  if (r._3 == 2) {
    // Emit an extra record one second earlier, followed by the original record.
    Seq(
      (r._1, new java.sql.Timestamp(r._2.getTime() - 1000L), r._3),
      (r._1, r._2, r._3)
    )
  } else {
    Seq((r._1, r._2, r._3))
  }
}).toDF("ID", "TIMESTAMP", "VALUE").show()
Which results in:
+---+-------------------+-----+
| ID| TIMESTAMP|VALUE|
+---+-------------------+-----+
| 1|2019-03-04 15:00:01| 3|
| 1|2019-03-04 17:04:01| 2|
| 1|2019-03-04 17:04:02| 2|
+---+-------------------+-----+
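If you also want the combined rows globally time-ordered as in the expected output (a guess at intent; Spark gives no ordering guarantee after a flatMap), you could add an explicit sort, for example:

data.as[(Long, java.sql.Timestamp, Long)].flatMap(r =>
    if (r._3 == 2)
      Seq((r._1, new java.sql.Timestamp(r._2.getTime() - 1000L), r._3), (r._1, r._2, r._3))
    else
      Seq((r._1, r._2, r._3))
  ).toDF("ID", "TIMESTAMP", "VALUE")
  .orderBy("ID", "TIMESTAMP")
  .show()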
You can introduce a new array column - when value = 2 then Array(-1, 0), else Array(0) - then explode that column and add it, in seconds, to the timestamp. The below should work for you:
scala> val df = Seq((1,"15:00:01",3),(1,"17:04:02",2)).toDF("id","timestamp","value")
df: org.apache.spark.sql.DataFrame = [id: int, timestamp: string ... 1 more field]
scala> val df2 = df.withColumn("timestamp",'timestamp.cast("timestamp"))
df2: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 1 more field]
scala> df2.show(false)
+---+-------------------+-----+
|id |timestamp |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
scala> val df3 = df2.withColumn("newc", when($"value"===lit(2),lit(Array(-1,0))).otherwise(lit(Array(0))))
df3: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 2 more fields]
scala> df3.show(false)
+---+-------------------+-----+-------+
|id |timestamp |value|newc |
+---+-------------------+-----+-------+
|1 |2019-03-04 15:00:01|3 |[0] |
|1 |2019-03-04 17:04:02|2 |[-1, 0]|
+---+-------------------+-----+-------+
scala> val df4 = df3.withColumn("c_explode",explode('newc)).withColumn("timestamp2",to_timestamp(unix_timestamp('timestamp)+'c_explode))
df4: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 4 more fields]
scala> df4.select($"id",$"timestamp2",$"value").show(false)
+---+-------------------+-----+
|id |timestamp2 |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:01|2 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
If you want the time part alone, you can do the following:
scala> df4.withColumn("timestamp",from_unixtime(unix_timestamp('timestamp2),"HH:mm:ss")).select($"id",$"timestamp",$"value").show(false)
+---+---------+-----+
|id |timestamp|value|
+---+---------+-----+
|1 |15:00:01 |3 |
|1 |17:04:01 |2 |
|1 |17:04:02 |2 |
+---+---------+-----+

Convert Spark DataFrame to HashMap of HashMaps

I have a dataframe that looks like this:
column1_ID column2 column3 column4
A_123 12 A 1
A_123 12 B 2
A_123 23 A 1
B_456 56 DB 4
B_456 56 BD 5
B_456 60 BD 3
I would like to convert the above dataframe/RDD into the output below, keyed by column1_ID: HashMap(Long, HashMap(String, Long))
'A_123': {12 : {'A': 1, 'B': 2}, 23: {'A': 1} },
'B_456': {56 : {'DB': 4, 'BD': 5}, 60: {'BD': 3} }
I tried reduceByKey and groupByKey but couldn't get the output into the expected shape.
This can be done by creating a complex structure from the last three columns and then applying a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{collect_list, struct, udf}

val data = List(
  ("A_123", 12, "A", 1),
  ("A_123", 12, "B", 2),
  ("A_123", 23, "A", 1),
  ("B_456", 56, "DB", 4),
  ("B_456", 56, "BD", 5),
  ("B_456", 60, "BD", 3))
val df = data.toDF("column1_ID", "column2", "column3", "column4")

// Pack the last two columns into a struct.
val twoLastCompacted = df.withColumn("lastTwo", struct($"column3", $"column4"))
twoLastCompacted.show(false)

// Group by the first two columns and collect the structs.
val groupedByTwoFirst = twoLastCompacted
  .groupBy("column1_ID", "column2")
  .agg(collect_list("lastTwo").alias("lastTwoArray"))
groupedByTwoFirst.show(false)

// Pack column2 together with its array of structs.
val threeLastCompacted = groupedByTwoFirst.withColumn("lastThree", struct($"column2", $"lastTwoArray"))
threeLastCompacted.show(false)

// Group by the ID and collect the nested structs.
val groupedByFirst = threeLastCompacted
  .groupBy("column1_ID")
  .agg(collect_list("lastThree").alias("lastThreeArray"))
groupedByFirst.printSchema()
groupedByFirst.show(false)

// Turn the nested array-of-structs into a Map[Int, Map[String, Int]].
val structToMap = (value: Seq[Row]) =>
  value.map(v => v.getInt(0) ->
    v.getSeq(1).asInstanceOf[Seq[Row]].map(r => r.getString(0) -> r.getInt(1)).toMap)
    .toMap
val structToMapUDF = udf(structToMap)

groupedByFirst.select($"column1_ID", structToMapUDF($"lastThreeArray")).show(false)
Output:
+----------+-------+-------+-------+-------+
|column1_ID|column2|column3|column4|lastTwo|
+----------+-------+-------+-------+-------+
|A_123 |12 |A |1 |[A,1] |
|A_123 |12 |B |2 |[B,2] |
|A_123 |23 |A |1 |[A,1] |
|B_456 |56 |DB |4 |[DB,4] |
|B_456 |56 |BD |5 |[BD,5] |
|B_456 |60 |BD |3 |[BD,3] |
+----------+-------+-------+-------+-------+
+----------+-------+----------------+
|column1_ID|column2|lastTwoArray |
+----------+-------+----------------+
|B_456 |60 |[[BD,3]] |
|A_123 |12 |[[A,1], [B,2]] |
|B_456 |56 |[[DB,4], [BD,5]]|
|A_123 |23 |[[A,1]] |
+----------+-------+----------------+
+----------+-------+----------------+---------------------------------+
|column1_ID|column2|lastTwoArray |lastThree |
+----------+-------+----------------+---------------------------------+
|B_456 |60 |[[BD,3]] |[60,WrappedArray([BD,3])] |
|A_123 |12 |[[A,1], [B,2]] |[12,WrappedArray([A,1], [B,2])] |
|B_456 |56 |[[DB,4], [BD,5]]|[56,WrappedArray([DB,4], [BD,5])]|
|A_123 |23 |[[A,1]] |[23,WrappedArray([A,1])] |
+----------+-------+----------------+---------------------------------+
root
|-- column1_ID: string (nullable = true)
|-- lastThreeArray: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- column2: integer (nullable = false)
| | |-- lastTwoArray: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- column3: string (nullable = true)
| | | | |-- column4: integer (nullable = false)
+----------+--------------------------------------------------------------+
|column1_ID|lastThreeArray |
+----------+--------------------------------------------------------------+
|B_456 |[[60,WrappedArray([BD,3])], [56,WrappedArray([DB,4], [BD,5])]]|
|A_123 |[[12,WrappedArray([A,1], [B,2])], [23,WrappedArray([A,1])]] |
+----------+--------------------------------------------------------------+
+----------+----------------------------------------------------+
|column1_ID|UDF(lastThreeArray) |
+----------+----------------------------------------------------+
|B_456 |Map(60 -> Map(BD -> 3), 56 -> Map(DB -> 4, BD -> 5))|
|A_123 |Map(12 -> Map(A -> 1, B -> 2), 23 -> Map(A -> 1)) |
+----------+----------------------------------------------------+
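If a UDF is not a requirement and you are on Spark 2.4 or later, map_from_entries can build the same nested maps directly. A hedged sketch under that version assumption (it produces map columns rather than Scala HashMaps):

import org.apache.spark.sql.functions.{collect_list, map_from_entries, struct}

val innerMaps = df
  .groupBy("column1_ID", "column2")
  .agg(map_from_entries(collect_list(struct($"column3", $"column4"))).as("innerMap"))

val nested = innerMaps
  .groupBy("column1_ID")
  .agg(map_from_entries(collect_list(struct($"column2", $"innerMap"))).as("nestedMap"))

nested.show(false)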
You can convert the DF to an RDD and apply the operations as below:
scala> import scala.collection.immutable.HashMap
import scala.collection.immutable.HashMap

scala> case class Data(col1: String, col2: Int, col3: String, col4: Int)
defined class Data
scala> var x: Seq[Data] = List(Data("A_123",12,"A",1), Data("A_123",12,"B",2), Data("A_123",23,"A",1), Data("B_456",56,"DB",4), Data("B_456",56,"BD",5), Data("B_456",60,"BD",3))
x: Seq[Data] = List(Data(A_123,12,A,1), Data(A_123,12,B,2), Data(A_123,23,A,1), Data(B_456,56,DB,4), Data(B_456,56,BD,5), Data(B_456,60,BD,3))
scala> sc.parallelize(x).groupBy(_.col1).map{a => (a._1, HashMap(a._2.groupBy(_.col2).map{b => (b._1, HashMap(b._2.groupBy(_.col3).map{c => (c._1, c._2.map(_.col4).head)}.toArray: _*))}.toArray: _*))}.toDF()
res26: org.apache.spark.sql.DataFrame = [_1: string, _2: map<int,map<string,int>>]
I have initialized an RDD with the same data structure as in your case via sc.parallelize(x).
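As a usage note, if what you ultimately need is an in-memory Map on the driver rather than a DataFrame, you could drop the trailing .toDF() and collect the pair RDD instead (a sketch, assuming the result fits in driver memory):

import scala.collection.immutable.HashMap

val nestedMap: scala.collection.Map[String, HashMap[Int, HashMap[String, Int]]] =
  sc.parallelize(x)
    .groupBy(_.col1)
    .map { a =>
      (a._1, HashMap(a._2.groupBy(_.col2).map { b =>
        (b._1, HashMap(b._2.groupBy(_.col3).map { c =>
          (c._1, c._2.map(_.col4).head)
        }.toArray: _*))
      }.toArray: _*))
    }
    .collectAsMap()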

How to make pairs of nodes using filtering in Spark?

I have the following DataFrame in Spark and Scala:
nodeId typeFrom typeTo date
1 A G 2016-10-12T12:10:00.000Z
2 B A 2016-10-12T12:00:00.000Z
3 A B 2016-10-12T12:05:00.000Z
4 D C 2016-10-12T12:30:00.000Z
5 G D 2016-10-12T12:35:00.000Z
I want to make pairs of nodeId for the cases where one row's typeFrom value matches another row's typeTo value.
The expected output for the above-shown example is the following:
nodeId_1 nodeId_2 type date
1 2 A 2016-10-12T12:10:00.000Z
3 2 A 2016-10-12T12:05:00.000Z
2 3 B 2016-10-12T12:00:00.000Z
4 5 C 2016-10-12T12:30:00.000Z
5 1 G 2016-10-12T12:35:00.000Z
I don't know how to make pairs of nodeId:
df
  .filter($"typeFrom" === $"typeTo")
  .???
You can use a self-join, matching typeFrom with typeTo:
val df = Seq(
  (1, "A", "G", "2016-10-12T12:10:00.000Z"),
  (2, "B", "A", "2016-10-12T12:00:00.000Z"),
  (3, "A", "B", "2016-10-12T12:05:00.000Z"),
  (4, "D", "C", "2016-10-12T12:30:00.000Z"),
  (5, "G", "D", "2016-10-12T12:35:00.000Z")
).toDF("nodeId", "typeFrom", "typeTo", "date")

df.as("df1").join(
  df.as("df2"),
  $"df1.typeFrom" === $"df2.typeTo"
).select(
  $"df1.nodeId".as("nodeId_1"), $"df2.nodeId".as("nodeId_2"),
  $"df1.typeFrom".as("type"), $"df1.date"
).show(truncate = false)
// +--------+--------+----+------------------------+
// |nodeId_1|nodeId_2|type|date |
// +--------+--------+----+------------------------+
// |1 |2 |A |2016-10-12T12:10:00.000Z|
// |2 |3 |B |2016-10-12T12:00:00.000Z|
// |3 |2 |A |2016-10-12T12:05:00.000Z|
// |4 |5 |D |2016-10-12T12:30:00.000Z|
// |5 |1 |G |2016-10-12T12:35:00.000Z|
// +--------+--------+----+------------------------+
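As a possible refinement (an assumption about intent, not something stated in the question): if a single row could ever have typeFrom equal to its own typeTo, you may want to keep it from pairing with itself by adding a nodeId inequality to the join condition:

df.as("df1").join(
  df.as("df2"),
  $"df1.typeFrom" === $"df2.typeTo" && $"df1.nodeId" =!= $"df2.nodeId"
).select(
  $"df1.nodeId".as("nodeId_1"), $"df2.nodeId".as("nodeId_2"),
  $"df1.typeFrom".as("type"), $"df1.date"
).show(truncate = false)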

Spark - named_struct for empty Map

I use Spark 2.0.1 with Scala 2.11, and this question is related to this one.
Below is the setup:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, MapType, StructType}

val ss = new StructType().add("x", IntegerType).add("y", MapType(DoubleType, IntegerType))
val s = new StructType()
  .add("a", IntegerType)
  .add("b", ss)

val d = Seq(Row(1, Row(1, Map(1.0 -> 1, 2.0 -> 2))),
  Row(2, Row(2, Map(2.0 -> 2, 3.0 -> 3))),
  Row(3, null),
  Row(4, Row(4, Map())))

val rd = sc.parallelize(d)
val df = spark.createDataFrame(rd, s)

df.select($"a", $"b").show(false)
// +---+---------------------------+
// |a |b |
// +---+---------------------------+
// |1 |[1,Map(1.0 -> 1, 2.0 -> 2)]|
// |2 |[2,Map(2.0 -> 2, 3.0 -> 3)]|
// |3 |null |
// |4 |[4,Map()] |
// +---+---------------------------+
//
The statement below works when I have to provide a default to coalesce (the row 2, column 3 cell holds the default value):
df.groupBy($"a").pivot("a")
  .agg(expr("first(coalesce(b, named_struct('x', cast(null as Int), 'y', Map(0.0D, 0))))"))
  .show(false)
// +---+---------------------------+---------------------------+--------------------+---------+
// |a |1 |2 |3 |4 |
// +---+---------------------------+---------------------------+--------------------+---------+
// |1 |[1,Map(1.0 -> 1, 2.0 -> 2)]|null |null |null |
// |3 |null |null |[null,Map(0.0 -> 0)]|null |
// |4 |null |null |null |[4,Map()]|
// |2 |null |[2,Map(2.0 -> 2, 3.0 -> 3)]|null |null |
// +---+---------------------------+---------------------------+--------------------+---------+
But how can I create an empty Map() (like the one seen for a=4) using named_struct or otherwise?
You can achieve this with a case class and a UDF:
case class MyStruct(x: Option[Int], y: Map[Double, Int])

import org.apache.spark.sql.functions.{coalesce, first, udf}

val emptyStruct = udf(() => MyStruct(None, Map.empty[Double, Int]))

df.groupBy($"a").pivot("a")
  .agg(first(coalesce($"b", emptyStruct())))
  .show(false)
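As a follow-up, if you prefer to stay inside an expr(...) string as in the question, the same default can be exposed to SQL by registering the UDF under a name (the name empty_b below is arbitrary). A hedged sketch:

import org.apache.spark.sql.functions.{expr, first}

spark.udf.register("empty_b", () => MyStruct(None, Map.empty[Double, Int]))

df.groupBy($"a").pivot("a")
  .agg(first(expr("coalesce(b, empty_b())")))
  .show(false)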