How to randomly choose an element in an array column of different sizes? - scala

Given a dataframe with a column of arrays of integers with different sizes:
scala> sampleDf.show()
+------------+
|      arrays|
+------------+
|[15, 16, 17]|
|[15, 16, 17]|
|        [14]|
|        [11]|
|        [11]|
+------------+
scala> sampleDf.printSchema()
root
 |-- arrays: array (nullable = true)
 |    |-- element: integer (containsNull = true)
I would like to generate a new column with a randomly chosen item from each array.
I've tried two solutions:
1. Using UDF:
import scala.util.Random
def getRandomElement(arr: Array[Int]): Int = {
arr(Random.nextInt(arr.size))
}
val getRandomElementUdf = udf{arr: Array[Int] => getRandomElement(arr)}
sampleDf.withColumn("randomItem", getRandomElementUdf('arrays)).show
crashes on the last line with a long error message (excerpts):
...
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array<int>) => int)
...
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
I've tried an alternative udf definition:
val getRandomElementUdf = udf[Int, Array[Int]] (getRandomElement)
but I get the same error.
2. Second method: create intermediate columns with a random index in the range of the corresponding array:
// Return a dataframe with a column of random indices for a column of arrays of different sizes
def choice(df: DataFrame, colName: String): DataFrame = {
  df.withColumn("array_size", size(col(colName)))
    .withColumn("random_idx", least('array_size, floor(rand * 'array_size)))
}
choice(sampleDf, "arrays").show
outputs:
+------------+----------+----------+
|      arrays|array_size|random_idx|
+------------+----------+----------+
|[15, 16, 17]|         3|         2|
|[15, 16, 17]|         3|         1|
|        [14]|         1|         0|
|        [11]|         1|         0|
|        [11]|         1|         0|
+------------+----------+----------+
and ideally we would like to use the column random_idx to choose an item from the column arrays, something like:
sampleDf.withColumn("chosen_item", 'arrays.getItem('random_idx))
Unfortunately, getItem cannot take a column as its argument.
Any suggestion is welcome.

You can use the udf below to select a random element from the array. Declaring the parameter as Seq rather than Array is what avoids the ClassCastException above: Spark hands array columns to Scala udfs as a WrappedArray (a Seq), not a primitive Array.
import scala.util.Random
import org.apache.spark.sql.functions.udf

val getRandomElement = udf((array: Seq[Integer]) => {
  array(Random.nextInt(array.size))
})
df.withColumn("c1", getRandomElement($"arrays"))
.withColumn("c2", getRandomElement($"arrays"))
.withColumn("c3", getRandomElement($"arrays"))
.withColumn("c4", getRandomElement($"arrays"))
.withColumn("c5", getRandomElement($"arrays"))
.show(false)
Each added column shows an independently selected random element:
+------------+---+---+---+---+---+
|arrays |c1 |c2 |c3 |c4 |c5 |
+------------+---+---+---+---+---+
|[15, 16, 17]|15 |16 |16 |17 |16 |
|[15, 16, 17]|16 |16 |17 |15 |15 |
|[14] |14 |14 |14 |14 |14 |
|[11] |11 |11 |11 |11 |11 |
|[11] |11 |11 |11 |11 |11 |
+------------+---+---+---+---+---+
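If the arrays column can ever be null or empty, Random.nextInt(array.size) will throw, so you may want a guarded variant. A minimal sketch (assuming you want null in that case; not part of the original answer):
import org.apache.spark.sql.functions.udf
import scala.util.Random

val getRandomElementSafe = udf((array: Seq[Integer]) => {
  // return null for a null or empty array instead of throwing
  if (array == null || array.isEmpty) null.asInstanceOf[Integer]
  else array(Random.nextInt(array.size))
})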

If you want to remain udf-free, here is a possibility:
First add a key to the dataframe output by choice (assume its name is choiceDf):
val myDf = choiceDf.withColumn("key", monotonically_increasing_id())
Then create an intermediate dataframe that explodes the arrays column and keeps the index of each value:
val tmp = myDf.select('key, posexplode('arrays))
Finally, join on key and random_idx:
myDf.join(tmp.withColumnRenamed("pos", "random_idx"), Seq("key", "random_idx"), "left")
The item you are looking for is stored in the column col:
+---+----------+------------+----------+---+
|key|random_idx|      arrays|array_size|col|
+---+----------+------------+----------+---+
|  0|         2|[15, 16, 17]|         3| 17|
|  1|         1|[15, 16, 17]|         3| 16|
|  2|         0|        [14]|         1| 14|
+---+----------+------------+----------+---+
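On recent Spark versions you can also skip the join entirely and index the array with the random_idx column directly. A sketch, assuming the choiceDf produced by choice above (SQL-style indexing is 0-based; element_at needs Spark 2.4+ and is 1-based):
import org.apache.spark.sql.functions.{element_at, expr}

// index with a column via a SQL expression
choiceDf.withColumn("chosen_item", expr("arrays[random_idx]")).show
// or with element_at, shifting the 0-based index by one
choiceDf.withColumn("chosen_item", element_at('arrays, 'random_idx + 1)).show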

You can extract a random element from the array in one line using Spark SQL:
sampleDf.createOrReplaceTempView("sampleDf")
spark.sql("select arrays[Cast((FLOOR(RAND() * FLOOR(size(arrays)))) as INT)] as random from sampleDf")
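The same expression works without a temp view through the DataFrame API; a small sketch using expr on the question's sampleDf:
import org.apache.spark.sql.functions.expr

sampleDf.withColumn("random", expr("arrays[cast(floor(rand() * size(arrays)) as int)]")).show()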

Related

Spark Scala split column values in a dataframe to appended lists

I have data in a Spark dataframe where I need to search for elements by name, append the values to a list, and split the searched elements into separate columns of the dataframe.
I am using Scala, and below is an example of my current code. It works to get the first value, but I need to append all available values, not just the first.
I'm new to Scala (and Python), so any help will be greatly appreciated!
val getNumber: (String => String) = (colString: String) => {
  if (colString != null) {
    raw"number:(\d+)".r
      .findAllIn(colString)
      .group(1)
  }
  else
    null
}
val udfGetColumn = udf(getNumber)
val mydf = df.select(cols.....)
.withColumn("var_number", udfGetColumn($"var"))
Example Data:
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| key| var |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |["[number:123456 rate:111970 position:1]","[number:123457 rate:662352 position:2]","[number:123458 rate:890 position:3]","[number:123459 rate:190 position:4]"] | |
|2 |["[number:654321 rate:211971 position:1]","[number:654322 rate:124 position:2]","[number:654323 rate:421 position:3]"] |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
Desired Result:
+------+------------+----------+--------------+
| key  | var_number | var_rate | var_position |
+------+------------+----------+--------------+
| 1    | 123456     | 111970   | 1            |
| 1    | 123457     | 662352   | 2            |
| 1    | 123458     | 890      | 3            |
| 1    | 123459     | 190      | 4            |
| 2    | 654321     | 211971   | 1            |
| 2    | 654322     | 124      | 2            |
| 2    | 654323     | 421      | 3            |
+------+------------+----------+--------------+
You don't need a UDF here. You can transform the array column var by converting each element into a map with str_to_map, after removing the square brackets ([]) with the regexp_replace function. Finally, explode the transformed array and select the fields:
val df = Seq(
  (1, Seq("[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]")),
  (2, Seq("[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"))
).toDF("key", "var")
val result = df.withColumn(
  "var",
  explode(expr(raw"transform(var, x -> str_to_map(regexp_replace(x, '[\\[\\]]', ''), ' '))"))
).select(
  col("key"),
  col("var").getField("number").alias("var_number"),
  col("var").getField("rate").alias("var_rate"),
  col("var").getField("position").alias("var_position")
)
result.show
//+---+----------+--------+------------+
//|key|var_number|var_rate|var_position|
//+---+----------+--------+------------+
//| 1| 123456| 111970| 1|
//| 1| 123457| 662352| 2|
//| 1| 123458| 890| 3|
//| 1| 123459| 190| 4|
//| 2| 654321| 211971| 1|
//| 2| 654322| 124| 2|
//| 2| 654323| 421| 3|
//+---+----------+--------+------------+
From your comment, it appears the column var is of type string, not array. In this case, you can first clean it by removing the [] and " characters, then split on commas to get an array:
val df = Seq(
  (1, """["[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]"]"""),
  (2, """["[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"]""")
).toDF("key", "var")
val result = df.withColumn(
  "var",
  split(regexp_replace(col("var"), "[\\[\\]\"]", ""), ",")
).withColumn(
  "var",
  explode(expr("transform(var, x -> str_to_map(x, ' '))"))
).select(
  // select your columns as above...
)
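Spelled out end to end, for convenience (a sketch; it simply repeats the select from the array version, and transform inside expr requires Spark 2.4+):
import org.apache.spark.sql.functions._

val resultFromString = df.withColumn(
  "var",
  split(regexp_replace(col("var"), "[\\[\\]\"]", ""), ",")
).withColumn(
  "var",
  explode(expr("transform(var, x -> str_to_map(x, ' '))"))
).select(
  col("key"),
  col("var").getField("number").alias("var_number"),
  col("var").getField("rate").alias("var_rate"),
  col("var").getField("position").alias("var_position")
)
resultFromString.show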

Spark GroupBy and Aggregate Strings to Produce a Map of Counts of the Strings Based on a Condition

I have a dataframe with multiple columns, two of which are id and label, as shown below.
+---+-----+
| id|label|
+---+-----+
|  1|"abc"|
|  1|"abc"|
|  1|"def"|
|  2|"def"|
|  2|"def"|
+---+-----+
I want to groupBy "id" and aggregate the label column into a map of label to count (ignoring nulls). The expected result is shown below:
+---+------------------+
| id|             label|
+---+------------------+
|  1|{"abc":2, "def":1}|
|  2|         {"def":2}|
+---+------------------+
Is it possible to do this without using user-defined aggregate functions? I saw a similar answer here, but it doesn't aggregate based on the count of each item.
I apologize if this question is silly, I am new to both Scala and Spark.
Thanks
Without Custom UDFs
import org.apache.spark.sql.functions.{map, collect_list}
df.groupBy("id", "label")
.count
.select($"id", map($"label", $"count").as("map"))
.groupBy("id")
.agg(collect_list("map"))
.show(false)
+---+------------------------+
|id |collect_list(map) |
+---+------------------------+
|1 |[[def -> 1], [abc -> 2]]|
|2 |[[def -> 2]] |
+---+------------------------+
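Note that this gives a list of single-entry maps per id rather than the single merged map in the expected output. If you are on Spark 2.4+ (an assumption beyond the original answer), map_from_entries can merge them without a UDF; a sketch:
import org.apache.spark.sql.functions.{collect_list, count, map_from_entries, struct}

df.groupBy("id", "label")
  .agg(count("*").as("cnt"))
  .groupBy("id")
  .agg(map_from_entries(collect_list(struct($"label", $"cnt"))).as("label_counts"))
  .show(false)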
Using a custom UDF:
import org.apache.spark.sql.functions.udf
val customUdf = udf((seq: Seq[String]) => {
  seq.groupBy(x => x).map(x => x._1 -> x._2.size)
})
df.groupBy("id")
.agg(collect_list("label").as("list"))
.select($"id", customUdf($"list").as("map"))
.show(false)
+---+--------------------+
|id |map |
+---+--------------------+
|1 |[abc -> 2, def -> 1]|
|2 |[def -> 2] |
+---+--------------------+
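Regarding the "ignore null" part of the question: collect_list already skips null values, so the UDF variant handles them, but the groupBy("id", "label").count variant would produce a null key. Filtering up front keeps both approaches consistent (a sketch, assuming label can actually be null in your data):
val nonNullDf = df.filter($"label".isNotNull)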

How to find the max String length of a column in Spark using dataframe?

I have a dataframe. I need to calculate the max length of the string values in a column and print both the value and its length.
I have written the code below, but it outputs only the max length, not the corresponding value.
This question, How to get max length of string column from dataframe using scala?, helped me get to the query below.
df.agg(max(length(col("city")))).show()
Use the row_number() window function, ordering by length('city) descending.
Then filter to keep only the row where row_number is 1, and add a length('city) column to the dataframe.
Ex:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{length, row_number}

val df = Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"))
  .toDF("city","num","country")
val win = Window.orderBy(length('city).desc)
df.withColumn("str_len",length('city))
.withColumn("rn", row_number().over(win))
.filter('rn===1)
.show(false)
+----+---+-------+-------+---+
|city|num|country|str_len|rn |
+----+---+-------+-------+---+
|ABC |1 |US |3 |1 |
+----+---+-------+-------+---+
(or)
In spark-sql:
df.createOrReplaceTempView("lpl")
spark.sql("select * from (select *,length(city)str_len,row_number() over (order by length(city) desc)rn from lpl)q where q.rn=1")
.show(false)
+----+---+-------+-------+---+
|city|num|country|str_len| rn|
+----+---+-------+-------+---+
| ABC| 1| US| 3| 1|
+----+---+-------+-------+---+
Update:
Find min and max values:
val win_desc=Window.orderBy(length('city).desc)
val win_asc=Window.orderBy(length('city).asc)
df.withColumn("str_len",length('city))
.withColumn("rn", row_number().over(win_desc))
.withColumn("rn1",row_number().over(win_asc))
.filter('rn===1 || 'rn1 === 1)
.show(false)
Result:
+----+---+-------+-------+---+---+
|city|num|country|str_len|rn |rn1|
+----+---+-------+-------+---+---+
|A |1 |US |1 |3 |1 | //min value of string
|ABC |1 |US |3 |1 |3 | //max value of string
+----+---+-------+-------+---+---+
In case you have multiple rows that share the same length, the window-function solution won't work, since it keeps only the first row after ordering.
Another way is to create a new column with the length of the string, find its maximum, and filter the dataframe on that maximum value.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._
val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"), ("DEF", 2, "US"))
.toDF("city","num","country")
val dfWithLength = df.withColumn("city_length", length($"city")).cache()
dfWithLength.show()
+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| A| 1| US| 1|
| AB| 1| US| 2|
| ABC| 1| US| 3|
| DEF| 2| US| 3|
+----+---+-------+-----------+
val Row(maxValue: Int) = dfWithLength.agg(max("city_length")).head()
dfWithLength.filter($"city_length" === maxValue).show()
+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| ABC| 1| US| 3|
| DEF| 2| US| 3|
+----+---+-------+-----------+
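If you would rather not pull the maximum back to the driver with head(), a window aggregate over the whole dataframe also works; a sketch (note the unpartitioned window sends every row to a single partition, so it only suits small data):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{length, max}

val fullFrame = Window.partitionBy() // one global window, no ordering needed for max
df.withColumn("city_length", length($"city"))
  .withColumn("max_length", max($"city_length").over(fullFrame))
  .filter($"city_length" === $"max_length")
  .drop("max_length")
  .show()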
Finding the maximum string length of a string column with PySpark:
from pyspark.sql.functions import length, col, max
df2 = df.withColumn("len_Description",length(col("Description"))).groupBy().max("len_Description")

Element position from nested DataFrame array (Spark 2.2)

I'm trying to explode a nested DataFrame in Spark Scala. I have a DataFrame df which contains the following information:
root
 |-- id: integer (nullable = false)
 |-- features: array (nullable = true)
 |    |-- element: float (containsNull = false)
I've exploded the array into a flat DataFrame with:
df.selectExpr("id","explode(features) as features")
and got the following DataFrame:
id features
0 0.0629885
0 0.15931357
0 0.08922347
My end goal is to pivot the data and calculate some similarities with that information. To do that, it would be very cool to get the actual position of the feature for every ID into the DataFrame, like this:
id features feature_pos
0 0.0629885 0
0 0.15931357 1
0 0.08922347 2
Use posexplode in place of explode:
Creates a new row for each element, together with its position, in the given array or map column.
(The related posexplode_outer variant additionally produces a (null, null) row when the array/map is null or empty; plain posexplode simply drops such rows.)
Here is the example with posexplode.
scala> val df = Seq((0, Seq(0.1f, 0.2f, 0.3f)),(1, Seq(0.4f, 0.5f, 0.6f))).toDF("id", "features")
df: org.apache.spark.sql.DataFrame = [id: int, features: array<float>]
scala> df.show(false)
+---+---------------+
|id |features |
+---+---------------+
|0 |[0.1, 0.2, 0.3]|
|1 |[0.4, 0.5, 0.6]|
+---+---------------+
Note that df.withColumn("pos cols", posexplode('features)).show(false) will throw an error, because posexplode produces two columns (pos and col) while withColumn supplies only one name, so use df.select():
scala> df.select(posexplode('features)).show(false)
+---+---+
|pos|col|
+---+---+
|0 |0.1|
|1 |0.2|
|2 |0.3|
|0 |0.4|
|1 |0.5|
|2 |0.6|
+---+---+
scala>
The default names are "pos" and "col". You can rename them as follows:
scala> df.select(posexplode('features).as(Seq("a","b"))).show(false)
+---+---+
|a |b |
+---+---+
|0 |0.1|
|1 |0.2|
|2 |0.3|
|0 |0.4|
|1 |0.5|
|2 |0.6|
+---+---+
scala>
When you want to explode and select all columns, use
scala> df.select(col("*"), posexplode('features).as( Seq("a","b")) ).show(false)
+---+---------------+---+---+
|id |features |a |b |
+---+---------------+---+---+
|0 |[0.1, 0.2, 0.3]|0 |0.1|
|0 |[0.1, 0.2, 0.3]|1 |0.2|
|0 |[0.1, 0.2, 0.3]|2 |0.3|
|1 |[0.4, 0.5, 0.6]|0 |0.4|
|1 |[0.4, 0.5, 0.6]|1 |0.5|
|1 |[0.4, 0.5, 0.6]|2 |0.6|
+---+---------------+---+---+
scala>
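If you want the exact column names from the question (feature_pos, features), the same as(Seq(...)) trick can name them directly; a quick sketch:
df.select(col("id"), posexplode('features).as(Seq("feature_pos", "features")))
  .select("id", "features", "feature_pos")
  .show(false)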
You can also apply Scala's zipWithIndex via a UDF as follows:
val df = Seq(
(0, Seq(0.1f, 0.2f, 0.3f)),
(1, Seq(0.4f, 0.5f, 0.6f))
).toDF("id", "features")
def addIndex = udf(
  (s: Seq[Float]) => s.zipWithIndex
)
val df2 = df.withColumn("features_idx", explode(addIndex($"features")))
df2.select($"id", $"features_idx._1".as("features"), $"features_idx._2".as("features_pos")).show
+---+--------+------------+
| id|features|features_pos|
+---+--------+------------+
| 0| 0.1| 0|
| 0| 0.2| 1|
| 0| 0.3| 2|
| 1| 0.4| 0|
| 1| 0.5| 1|
| 1| 0.6| 2|
+---+--------+------------+

Programmatically Rename All But One Column Spark Scala

I have this DataFrame:
val df = Seq(
("LeBron", 36, 18, 12),
("Kevin", 42, 8, 9),
("Russell", 44, 5, 14)).
toDF("player", "points", "rebounds", "assists")
df.show()
+-------+------+--------+-------+
| player|points|rebounds|assists|
+-------+------+--------+-------+
| LeBron| 36| 18| 12|
| Kevin| 42| 8| 9|
|Russell| 44| 5| 14|
+-------+------+--------+-------+
I want to add "season_high" to every column name except player. I also want to use a function to do this because my real data set has 250 columns.
I've come up with the method below that gets me the output that I want, but I'm wondering if there is a way to pass a rule to the renamedColumns mapping function that makes it so that the column name player doesn't get switched to season_high_player, and then back to player with the additional .withColumnRenamed function.
val renamedColumns = df.columns.map(name => col(name).as(s"season_high_$name"))
val df2 = df.select(renamedColumns : _*).
withColumnRenamed("season_high_player", "player")
df2.show()
+-------+------------------+--------------------+-------------------+
| player|season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
| LeBron| 36| 18| 12|
| Kevin| 42| 8| 9|
|Russell| 44| 5| 14|
+-------+------------------+--------------------+-------------------+
#philantrovert was right, but he just forgot to show how to use that "formula", so here you go:
val selection: Seq[Column] = Seq(col("player")) ++ df.columns.filter(_ != "player")
  .map(name => col(name).as(s"season_high_$name"))
df.select(selection : _*).show
// +-------+------------------+--------------------+-------------------+
// | player|season_high_points|season_high_rebounds|season_high_assists|
// +-------+------------------+--------------------+-------------------+
// | LeBron| 36| 18| 12|
// | Kevin| 42| 8| 9|
// |Russell| 44| 5| 14|
// +-------+------------------+--------------------+-------------------+
What we do here is filter out the column name we don't want to rename (plain Scala), then map the remaining column names to Columns and rename them.
You can do this by keeping the one column you don't want to rename as the first column and applying the following logic:
import org.apache.spark.sql.functions._
val columnsRenamed = col(df.columns.head) +: df.columns.tail.map(name => col(name).as(s"season_high_$name"))
df.select(columnsRenamed :_*).show(false)
You should get the following output:
+-------+------------------+--------------------+-------------------+
|player |season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
|LeBron |36 |18 |12 |
|Kevin |42 |8 |9 |
|Russell|44 |5 |14 |
+-------+------------------+--------------------+-------------------+
One more variation, which doesn't depend on the position of the field:
scala> val df = Seq(
| ("LeBron", 36, 18, 12),
| ("Kevin", 42, 8, 9),
| ("Russell", 44, 5, 14)).
| toDF("player", "points", "rebounds", "assists")
df: org.apache.spark.sql.DataFrame = [player: string, points: int ... 2 more fields]
scala> val newColumns = df.columns.map( x => x match { case "player" => col("player") case x => col(x).as(s"season_high_$x")} )
newColumns: Array[org.apache.spark.sql.Column] = Array(player, points AS `season_high_points`, rebounds AS `season_high_rebounds`, assists AS `season_high_assists`)
scala> df.select(newColumns:_*).show(false)
+-------+------------------+--------------------+-------------------+
|player |season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
|LeBron |36 |18 |12 |
|Kevin |42 |8 |9 |
|Russell|44 |5 |14 |
+-------+------------------+--------------------+-------------------+
scala>
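One more option, not shown in the answers above: fold over the columns and rename them in place. This keeps the original column order without building a select list (a sketch; many withColumnRenamed calls can be slow to analyze on very wide dataframes):
val renamedDf = df.columns.filter(_ != "player").foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name, s"season_high_$name")
}
renamedDf.show(false)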