I have a dataframe that looks something like this.
The tfs column is a map of String to Long and the weights are floats
+----------+---+------------------------+------------------+
|term |df |tfs |weight |
+----------+---+------------------------+------------------+
|keyword1 |2 |{2.txt -> 2, 1.txt -> 2}|1.3 |
|keyword2 |1 |{2.txt -> 1} |0.6931471805599453|
|keyword3 |2 |{2.txt -> 1, 1.txt -> 2}|0.52343473 |
+----------+---+------------------------+------------------+
I would like to combine the last two columns by multiplying each value in the tfs map by its respective weight, to get something like:
+----------+---+------------------------------------------+
|term |df |weighted-tfs |
+----------+---+------------------------------------------+
|keyword1 |2 |{2.txt -> 2.6, 1.txt -> 2.6} |
|keyword2 |1 |{2.txt -> 0.6931471805599453} |
|keyword3  |2  |{2.txt -> 0.52343473, 1.txt -> 1.04686946}|
+----------+---+------------------------------------------+
My guess is that it would be quite simple to write a UDF for this, but I'm quite inexperienced in both Spark and Scala, so I'm not sure how to do this.
Use the map_from_arrays, map_keys & map_values functions together with transform. Try the code below.
val finalDF = df
  .withColumn(
    "weighted-tfs",
    map_from_arrays(
      map_keys($"tfs"),
      expr("transform(map_values(tfs), i -> i * weight)")
    )
  )
Output
finalDF.show(false)
+--------+---+------------------------+------------------+------------------------------------------+
|term    |df |tfs                     |weight            |weighted-tfs                              |
+--------+---+------------------------+------------------+------------------------------------------+
|keyword1|2 |[2.txt -> 2, 1.txt -> 2]|1.3 |[2.txt -> 2.6, 1.txt -> 2.6] |
|keyword2|1 |[2.txt -> 1] |0.6931471805599453|[2.txt -> 0.6931471805599453] |
|keyword3|2 |[2.txt -> 1, 1.txt -> 2]|0.52343473 |[2.txt -> 0.52343473, 1.txt -> 1.04686946]|
+--------+---+------------------------+------------------+------------------------------------------+
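On Spark 3.0+ you could also use transform_values and skip rebuilding the map from separate key and value arrays. A minimal sketch, assuming you also want to drop the source columns to match the desired output:
import org.apache.spark.sql.functions.expr

val finalDF = df
  .withColumn("weighted-tfs", expr("transform_values(tfs, (k, v) -> v * weight)"))
  .drop("tfs", "weight")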
I am working on a Spark data frame. There are 60 columns in my data frame, and below is a sample map column of the data frame. I need to remove the 'N/A' key from the map. I haven't found any function to do this.
+----------------------------------+
| userbytestsample |
+----------------------------------+
|[TEST -> 2000050008, N/A ->] |
+----------------------------------+
schema
|-- userbytestsample: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
output
+----------------------------------+
| userbytestsample |
+----------------------------------+
|[TEST -> 2000050008] |
+----------------------------------+
For Spark 3+ you can use map_filter:
df.select(map_filter($"userbytestsample", (k, _) => k =!= "N/A").as("userbytestsample"))
  .show(false)
Output:
+----------------------------------+
| userbytestsample |
+----------------------------------+
|[TEST -> 2000050008] |
+----------------------------------+
For Spark 2.4 you might need a UDF:
val map_filter_udf = udf { (xs: Map[String, String]) => xs.filter(!_._1.equalsIgnoreCase("N/A")) }
df.select(map_filter_udf($"userbytestsample").as("userbytestsample"))
  .show(false)
Output:
+----------------------------------+
| userbytestsample |
+----------------------------------+
|[TEST -> 2000050008] |
+----------------------------------+
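If you are on Spark 2.4 but want to avoid a UDF, the built-in higher-order functions map_entries, filter and map_from_entries (all available since 2.4) can do the same job. A sketch, not tested against your data:
import org.apache.spark.sql.functions.expr

df.select(
    expr("map_from_entries(filter(map_entries(userbytestsample), e -> e.key != 'N/A'))")
      .as("userbytestsample"))
  .show(false)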
I suggest you use a UDF.
Let's suppose I have this dataframe:
val df = Seq((Map("a" -> 1, "b" -> 2, "c" -> 3)), (Map("a" -> 10, "ff" -> 2, "gg" -> 30))).toDF("colmap")
scala> df.printSchema
root
|-- colmap: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
df.show(false)
+----------------------------+
|colmap |
+----------------------------+
|[a -> 1, b -> 2, c -> 3] |
|[a -> 10, ff -> 2, gg -> 30]|
+----------------------------+
If I want to remove the key "a":
val unwantedKey : String = "a"
I create my UDF, which takes the column 'colmap', removes the key and returns the map without it:
def updateMap(unwantedKey: String) = udf((colMapName: Map[String, Int]) => {
  colMapName - unwantedKey
})
Finally, to apply this UDF you can call it this way:
val finalDF = df.withColumn("newcol", updateMap(unwantedKey)(col("colmap")))
finalDF.show(false)
+----------------------------+-------------------+
|colmap |newcol |
+----------------------------+-------------------+
|[a -> 1, b -> 2, c -> 3] |[b -> 2, c -> 3] |
|[a -> 10, ff -> 2, gg -> 30]|[ff -> 2, gg -> 30]|
+----------------------------+-------------------+
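One hedged caveat: if colmap can be null for some rows, the UDF above will throw a NullPointerException. A defensive variant (sketch; updateMapSafe is just an illustrative name):
import org.apache.spark.sql.functions.udf

def updateMapSafe(unwantedKey: String) = udf { (colMapName: Map[String, Int]) =>
  // return null for null input instead of failing (assumes the column may contain nulls)
  Option(colMapName).map(_ - unwantedKey).orNull
}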
I have a dataframe like below.
scala> df.show
+---+-------+
|key| count|
+---+-------+
| 11| 100212|
| 12| 122371|
| 13| 235637|
| 14| 54923|
| 15| 9785|
| 16| 5217|
+---+-------+
I am looking for a way to convert it into a Map like below. Please help.
Map(
"11" -> "100212",
"12" -> "122371",
"13" -> "235637",
"14" -> "54923",
"15" -> "9785",
"16" -> "9785"
)
df.collect().map(row => row.getAs[String](0) -> row.getAs[String](1)).toMap
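If key and count are not already string columns (the one-liner above assumes they are), a sketch that casts them first:
// assumes a SparkSession named `spark` is in scope, as in spark-shell
import spark.implicits._

val asMap: Map[String, String] = df
  .select($"key".cast("string"), $"count".cast("string"))
  .as[(String, String)]
  .collect()
  .toMap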
You can use the collectAsMap method.
val result = df.as[(String, String)].rdd.collectAsMap()
// result: Map[String, String] = Map(12 -> 122371, 15 -> 9785, 11 -> 100212, 14 -> 54923, 16 -> 5217, 13 -> 235637)
BTW, remember that collecting all the data to the driver is an expensive operation and may result in out-of-memory errors; make sure the data is small beforehand.
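A minimal sketch of such a guard, with a purely illustrative row limit:
val maxRows = 100000L // assumption: whatever comfortably fits on the driver
val result =
  if (df.count() <= maxRows) df.as[(String, String)].rdd.collectAsMap()
  else sys.error(s"Refusing to collect more than $maxRows rows to the driver")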
Use the map function to turn each row's key/value columns into a map column, then collect and merge. Check the code below.
scala> df.show(false)
+---+------+
|key|value |
+---+------+
|11 |100212|
|12 |122371|
|13 |235637|
|14 |54923 |
|15 |9785 |
|16 |5217 |
+---+------+
scala> df
.select(map(df.columns.map(col):_*).as("map"))
.as[Map[String,String]]
.collect()
.reduce(_ ++ _)
res48: Map[String,String] = Map(12 -> 122371, 15 -> 9785, 11 -> 100212, 13 -> 235637, 16 -> 5217, 14 -> 54923)
I would like to save my DataFrame to a Parquet file in a Hive table...but I would like to partition that DataFrame by the value of a specific map element (which is guaranteed to be present).
For example:
case class Person(name: String, attributes: Map[String, String])
val people = Seq[Person](Person("John", Map("birthDate"->"2019-12-30", "favoriteColor"->"red")),
Person("Lucy", Map("birthDate"->"2019-12-31", "favoriteFood"->"pizza")),
Person("David", Map("birthDate"->"2020-01-01", "favoriteMusic"->"jazz"))).toDF
//pseudo-code, doesn't work
//people.write.format("parquet").partitionBy("attributes[birthDate]").saveAsTable("people")
I can get around it by promoting this value to a top level field and joining (see below), but it would be nicer to avoid that. In addition to avoiding the join overhead, our users are expected to query on attributes[birthDate], so it would be advantageous to partition directly on that field, and not a separate top-level field.
Is there a way to directly partition on that value, without needing temporary DFs/joins?
val justNameAndBirthDate = people.select($"name", $"attributes"("birthDate")).withColumnRenamed("attributes[birthDate]", "birthDate")
val newDfWithBirthDate = people.join(justNameAndBirthDate, Seq("name"), "left")
newDfWithBirthDate.write.format("parquet").partitionBy("birthDate").saveAsTable("people")
One way is to create a column to partition by and name it as you wish:
val df = people.withColumn("attributes[birthDate]", $"attributes"("birthDate"))
scala> df.show(false)
+------+------------------------------------------------+---------------------+
|name |attributes |attributes[birthDate]|
+------+------------------------------------------------+---------------------+
|John |[birthDate -> 2019-12-30, favoriteColor -> red] |2019-12-30 |
|Lucy |[birthDate -> 2019-12-31, favoriteFood -> pizza]|2019-12-31 |
|David |[birthDate -> 2020-01-01, favoriteMusic -> jazz]|2020-01-01 |
+------+------------------------------------------------+---------------------+
It will certainly duplicate the data, but it does the trick.
Then you can partition as you wish:
df.write.format("parquet").partitionBy("attributes[birthDate]").saveAsTable("people")
Checking the output table:
scala> spark.sql("select * from people").show(false)
+------+------------------------------------------------+---------------------+
|name |attributes |attributes[birthDate]|
+------+------------------------------------------------+---------------------+
|David |[birthDate -> 2020-01-01, favoriteMusic -> jazz]|2020-01-01 |
|Lucy |[birthDate -> 2019-12-31, favoriteFood -> pizza]|2019-12-31 |
|John |[birthDate -> 2019-12-30, favoriteColor -> red] |2019-12-30 |
+------+------------------------------------------------+---------------------+
spark.sql("desc people").show(false)
+-----------------------+------------------+-------+
|col_name |data_type |comment|
+-----------------------+------------------+-------+
|name |string |null |
|attributes |map<string,string>|null |
|attributes[birthDate] |string |null |
|# Partition Information| | |
|# col_name |data_type |comment|
|attributes[birthDate] |string |null |
+-----------------------+------------------+-------+
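Since your users are expected to query on attributes[birthDate], here is a hedged example of a query filtering directly on the partition column (the bracketed name needs backticks in SQL), which should let Spark prune partitions:
spark.sql("select name from people where `attributes[birthDate]` = '2019-12-30'").show(false)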
I have a pyspark dataframe with a column of MapType(StringType(), FloatType()) and I would like to get a list of all keys appearing in the column. For example, given this dataframe:
+---+--------------------+
| ID| map|
+---+--------------------+
| 0|[a -> 3.0, b -> 2...|
| 1|[a -> 1.0, b -> 4...|
| 2|[a -> 6.0, c -> 5...|
| 3|[a -> 6.0, f -> 8...|
| 4|[a -> 2.0, c -> 1...|
| 5|[c -> 1.0, d -> 1...|
| 6|[a -> 4.0, c -> 1...|
| 7|[a -> 2.0, d -> 1...|
| 8| [a -> 2.0]|
| 9|[e -> 1.0, f -> 1.0]|
| 10| [g -> 1.0]|
| 11|[e -> 2.0, b -> 3.0]|
+---+--------------------+
I am expecting to get the following list:
['a', 'b', 'c', 'd', 'e', 'f', 'g']
I already tried
df.select(explode(col('map'))).groupby('key').count().select('key').collect()
df.select(explode(col('map'))).select('key').drop_duplicates().collect()
df.select(explode(col('map'))).select('key').distinct().collect()
df.select(explode(map_keys(col('map')))).select('key').distinct().collect()
...
But for each of these commands I get differing results, not only across the different commands, but also when I execute the exact same command on the same dataframe.
For example:
keys_1 = df.select(explode(col('map'))).select('key').drop_duplicates().collect()
keys_1 = [row['key'] for row in keys_1]
And:
keys_2 = df.select(explode(col('map'))).select('key').drop_duplicates().collect()
keys_2 = [row['key'] for row in keys_2]
Then it happens quite often that len(keys_1) != len(keys_2).
My dataframe has around 10e7 rows and there are around 2000 different keys for my map column.
NOTE that on the small example dataset it works without a problem, but unfortunately it is quite hard to find a large example dataset.
Small Dataset example code:
df = spark.createDataFrame([
(0, {'a':3.0, 'b':2.0, 'c':2.0}),
(1, {'a':1.0, 'b':4.0, 'd':6.0}),
(2, {'a':6.0, 'e':5.0, 'c':5.0}),
(3, {'f':8.0, 'a':6.0, 'g':4.0}),
(4, {'a':2.0, 'c':1.0, 'd':3.0}),
(5, {'d':1.0, 'g':5.0, 'c':1.0}),
(6, {'a':4.0, 'c':1.0, 'f':1.0}),
(7, {'a':2.0, 'e':2.0, 'd':1.0}),
(8, {'a':2.0}),
(9, {'e':1.0, 'f':1.0}),
(10, {'g':1.0}),
(11, {'b':3.0, 'e':2.0})
],
['ID', 'map']
)
df.select(explode(col('map'))).groupby('key').count().select('key').collect()
This should work:
df.select(explode(col('map'))).select('key').distinct().collect()
I want to merge multiple maps using Spark/Scala. The maps have a case class instance as value.
Following is the relevant code:
case class SampleClass(value1:Int,value2:Int)
val sampleDataDs = Seq(
("a",25,Map(1->SampleClass(1,2))),
("a",32,Map(1->SampleClass(3,4),2->SampleClass(1,2))),
("b",16,Map(1->SampleClass(1,2))),
("b",18,Map(2->SampleClass(10,15)))).toDF("letter","number","maps")
Output:
+------+-------+--------------------------+
|letter|number |maps |
+------+-------+--------------------------+
|a | 25 | [1-> [1,2]] |
|a | 32 | [1-> [3,4], 2 -> [1,2]] |
|b | 16 | [1 -> [1,2]] |
|b | 18 | [2 -> [10,15]] |
+------+-------+--------------------------+
I want to group the data based on the "letter" column so that the final dataset should have the below expected final output:
+------+---------------------------------+
|letter| maps |
+------+---------------------------------+
|a | [1-> [4,6], 2 -> [1,2]] |
|b | [1-> [1,2], 2 -> [10,15]] |
+------+---------------------------------+
I tried to group by "letter" and then apply a UDF to aggregate the values in the map. Below is what I tried:
val aggregatedDs = sampleDataDs.groupBy("letter").agg(collect_list(sampleDataDs("maps")).alias("mapList"))
Output:
+------+----------------------------------------+
|letter| mapList |
+------+----------------------------------------+
|a | [[1-> [1,2]],[1-> [3,4], 2 -> [1,2]]] |
|b | [[1-> [1,2]],[2 -> [10,15]]] |
+------+----------------------------------------+
After this I tried to write a UDF to merge the output of collect_list and get the final output:
def mergeMap = udf { valSeq:Seq[Map[Int,SampleClass]] =>
valSeq.flatten.groupBy(_._1).mapValues(x=>(x.map(_._2.value1).reduce(_ + _),x.map(_._2.value2).reduce(_ + _)))
}
val aggMapDs = aggregatedDs.withColumn("aggValues",mergeMap(col("mapList")))
However it fails with the error message:
Failed to execute user defined function
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to SampleClass
My Spark version is 2.3.1. Any ideas how I can get the expected final output?
The problem is due to the UDF not being able to accept the case class as input. Spark dataframes will internally represent your case class as a Row object. The problem can thus be avoided by changing the UDF input type as follows:
val mergeMap = udf((valSeq:Seq[Map[Int, Row]]) => {
valSeq.flatten
.groupBy(_._1)
.mapValues(x=>
SampleClass(
x.map(_._2.getAs[Int]("value1")).reduce(_ + _),
x.map(_._2.getAs[Int]("value2")).reduce(_ + _)
)
)
})
Notice above that some minor additional changes are necessary to handle the Row object.
Running this code will result in:
val aggMapDs = aggregatedDs.withColumn("aggValues",mergeMap(col("mapList")))
+------+----------------------------------------------+-----------------------------+
|letter|mapList |aggValues |
+------+----------------------------------------------+-----------------------------+
|b |[Map(1 -> [1,2]), Map(2 -> [10,15])] |Map(2 -> [10,15], 1 -> [1,2])|
|a |[Map(1 -> [1,2]), Map(1 -> [3,4], 2 -> [1,2])]|Map(2 -> [1,2], 1 -> [4,6]) |
+------+----------------------------------------------+-----------------------------+
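For completeness, a UDF-free alternative (a sketch, assuming Spark 2.4+ and that "sum the fields per key" is the intended merge rule): explode the maps, sum per (letter, key), then rebuild the map with map_from_entries.
import org.apache.spark.sql.functions._

val merged = sampleDataDs
  .select($"letter", explode($"maps"))  // yields columns: letter, key, value
  .groupBy($"letter", $"key")
  .agg(
    sum($"value.value1").cast("int").as("value1"),
    sum($"value.value2").cast("int").as("value2"))
  .groupBy($"letter")
  .agg(map_from_entries(collect_list(struct($"key", struct($"value1", $"value2")))).as("maps"))

merged.show(false)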
There is a slight difference between Dataframe and Dataset.
Dataset takes on two distinct APIs characteristics: a strongly-typed API and an untyped API, as shown in the table below. Conceptually, consider DataFrame as an alias for a collection of generic objects Dataset[Row], where a Row is a generic untyped JVM object. Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java
When you convert your Seq to a DataFrame, type information is lost.
val df: DataFrame = Seq(...).toDF() // <-- here
What you could have done instead is convert the Seq to a Dataset:
val typedDs: Dataset[(String, Int, Map[Int, SampleClass])] = Seq(...).toDS()
+---+---+--------------------+
| _1| _2| _3|
+---+---+--------------------+
| a| 25| [1 -> [1, 2]]|
| a| 32|[1 -> [3, 4], 2 -...|
| b| 16| [1 -> [1, 2]]|
| b| 18| [2 -> [10, 15]]|
+---+---+--------------------+
Because the top-level object in the Seq is a Tuple, Spark generates dummy column names.
Now you should pay attention to the return type; there are functions on a typed Dataset that lose type information:
val untyped: DataFrame = typedDs
.groupBy("_1")
.agg(collect_list(typedDs("_3")).alias("mapList"))
In order to work with the typed API you should explicitly define the types:
val aggregatedDs = sampleDataDs
.groupBy("letter")
.agg(collect_list(sampleDataDs("maps")).alias("mapList"))
val toTypedAgg: Dataset[(String, Array[Map[Int, SampleClass]])] = aggregatedDs
.as[(String, Array[Map[Int, SampleClass]])] //<- here
Unfortunately, a UDF won't work here, as there are only a limited number of types for which Spark can infer a schema:
toTypedAgg.withColumn("aggValues", mergeMap1(col("mapList"))).show()
Schema for type org.apache.spark.sql.Dataset[(String, Array[Map[Int,SampleClass]])] is not supported
What you could do instead is map over the Dataset:
val mapped = toTypedAgg.map(v => {
(v._1, v._2.flatten.groupBy(_._1).mapValues(x=>(x.map(_._2.value1).sum,x.map(_._2.value2).sum)))
})
+---+----------------------------+
|_1 |_2 |
+---+----------------------------+
|b |[2 -> [10, 15], 1 -> [1, 2]]|
|a |[2 -> [1, 2], 1 -> [4, 6]] |
+---+----------------------------+
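If you want the original column names back (note the map values are now plain (Int, Int) tuples rather than SampleClass instances), a small follow-up sketch:
val renamed = mapped.toDF("letter", "maps")
renamed.show(false)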