How to count the elements in a column of arrays? - scala

I'm trying to count the number of elements in FavouriteCities column in the following DataFrame.
+-----------------+
| FavouriteCities |
+-----------------+
| [NY, Canada] |
+-----------------+
The schema is as follows:
scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
| |-- element: string (containsNull = true)
Expected output should be something like,
+------------+-------------+
| City | Count |
+------------+-------------+
| NY | 1 |
| Canada | 1 |
+------------+-------------+
I have tried using agg() and count() like the following, but it fails to extract the individual elements from the array and instead treats each array as a single value.
data.agg(count("FavouriteCities").alias("count"))
Can someone please guide me with this?

To match the schema you've shown:
scala> val data = Seq(Tuple1(Array("NY", "Canada"))).toDF("FavouriteCities")
data: org.apache.spark.sql.DataFrame = [FavouriteCities: array<string>]
scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
| |-- element: string (containsNull = true)
Explode:
val counts = data
  .select(explode($"FavouriteCities") as "City")
  .groupBy("City")
  .count
and, to get the most frequent city, aggregate:
import spark.implicits._
scala> counts.as[(String, Long)].reduce((a, b) => if (a._2 > b._2) a else b)
res3: (String, Long) = (Canada,1)
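For the single sample row above, each city appears exactly once, so showing counts should look roughly like this (a minimal sketch of the expected output; row order is not guaranteed):
scala> counts.show
+------+-----+
|  City|count|
+------+-----+
|    NY|    1|
|Canada|    1|
+------+-----+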

Related

Spark: Check whether a value exists in a nested array without exploding

I have a dataset like below:
val df = Seq(("beatles", Seq(Seq("help", "hey jude"))),
  ("romeo", Seq(Seq("help2", "hey judge"), Seq("help3", "they judge")))).toDF("col1", "col2")
root
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I want to add a column hasHitSongs to the dataframe that iterates over the sequences of hit songs under col2 and checks whether a hit song, e.g. "hey jude", exists, marking the row with 1 if so, else 0.
| col1    | col2                                              | hasHitSongs |
|---------|---------------------------------------------------|-------------|
| beatles | [["help", "hey jude"]]                            | 1           |
| romeo   | [["help2", "hey judge"], ["help3", "they judge"]]  | 0           |
Is there a way to do this without exploding the column col2 and just iterating the nested arrays under col2?
If you are using Spark version 2.4 or higher:
Using built-in functions:
df.withColumn("hasHitSongs", array_contains(flatten(col("col2")), "hey jude"))
Using a higher-order function:
df.withColumn("hasHitSongs", expr("exists(col2, a -> exists(a, b -> b = 'hey jude'))"))

Scala compare dataframe complex array type field

I'm trying to create a dataframe to feed to a function as part of my unit tests. If I have the following
val myDf = sparkSession.sqlContext.createDataFrame(
  sparkSession.sparkContext.parallelize(Seq(
    Row(Some(Seq(MyObject(1024, 100001D), MyObject(1, -1D)))))),
  StructType(List(
    StructField("myList", ArrayType[???], true)
  )))
MyObject is a case class.
I don't know what to put for the object type. Any suggestions? I've tried ArrayType of pretty much every combination I can think of.
I'm looking for a dataframe that looks something like:
+--------------------+
| myList |
+--------------------+
| [1024, 100001] |
| [1, -1] |
+--------------------+
Coming at it the reverse way...
val s = Seq(Array(1024, 100001D), Array(1, -1D)).toDS().toDF("myList")
println(s.schema)
s.printSchema
s.show
Your schema is like below... DoubleType appears because 100001D and -1D are doubles.
StructType(StructField(myList,ArrayType(DoubleType,false),true))
Output you needed:
root
|-- myList: array (nullable = true)
| |-- element: double (containsNull = false)
+------------------+
| myList|
+------------------+
|[1024.0, 100001.0]|
| [1.0, -1.0]|
+------------------+
Or you can also do it this way.
case class MyObject(a: Int, b: Double)
val s = Seq(MyObject(1024, 100001D), MyObject(1, -1D)).toDS()
  .select(struct($"a", $"b").as[MyObject] as "myList")
println(s.schema)
s.printSchema
s.show
Result:
//schema :
StructType(StructField(myList,StructType(StructField(a,IntegerType,false), StructField(b,DoubleType,false)),false))
root
|-- myList: struct (nullable = false)
| |-- a: integer (nullable = false)
| |-- b: double (nullable = false)
+----------------+
| myList|
+----------------+
|[1024, 100001.0]|
| [1, -1.0]|
+----------------+
Try this
scala> case class MyObject(prop1:Int, prop2:Double)
defined class MyObject
scala> val df = Seq((1024, 100001D), (1, -1D)).toDF("prop1","prop2").select(struct($"prop1",$"prop2").as[MyObject] as "myList")
df: org.apache.spark.sql.DataFrame = [myList: struct<prop1: int, prop2: double>]
scala> df.show(false)
+----------------+
|myList |
+----------------+
|[1024, 100001.0]|
|[1, -1.0] |
+----------------+
scala> df.printSchema
root
|-- myList: struct (nullable = false)
| |-- prop1: integer (nullable = false)
| |-- prop2: double (nullable = false)
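If you still want the explicit createDataFrame/StructType route from the question, here is a sketch of what could replace ArrayType[???], assuming MyObject's fields are prop1: Int and prop2: Double as in the answer above (the data must then be built from Row objects rather than case class instances):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Schema of one array element, mirroring the assumed MyObject fields
val elementType = StructType(List(
  StructField("prop1", IntegerType, false),
  StructField("prop2", DoubleType, false)))

val myDf = sparkSession.createDataFrame(
  sparkSession.sparkContext.parallelize(Seq(
    Row(Seq(Row(1024, 100001D), Row(1, -1D))))),
  StructType(List(
    StructField("myList", ArrayType(elementType), true))))

myDf.printSchema  // myList: array (element: struct<prop1: int, prop2: double>)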

Convert Array with nested struct to string column along with other columns from the PySpark DataFrame

This is similar to Pyspark: cast array with nested struct to string
But the accepted answer does not work for my case, so I'm asking here.
|-- Col1: string (nullable = true)
|-- Col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Col2Sub: string (nullable = true)
Sample JSON
{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}
This gives the result in a single column:
import pyspark.sql.functions as F
df.selectExpr("EXPLODE(Col2) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
+----------------+
| Col2_concated |
+----------------+
|foo,bar |
+----------------+
But how do I get a result or DataFrame like this:
+-------+---------------+
|Col1 | Col2_concated |
+-------+---------------+
|abc123 |foo,bar |
+-------+---------------+
EDIT:
This solution gives the wrong result
df.selectExpr("Col1","EXPLODE(Col2) AS structCol").select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
+-------+---------------+
|Col1 | Col2_concated |
+-------+---------------+
|abc123 |foo |
+-------+---------------+
|abc123 |bar |
+-------+---------------+
Just avoid the explode and you are already there. All you need is the concat_ws function. This function concatenates multiple string columns with a given separator. See the example below:
from pyspark.sql import functions as F
j = '{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}'
df = spark.read.json(sc.parallelize([j]))
#printSchema tells us the column names we can use with concat_ws
df.printSchema()
Output:
root
|-- Col1: string (nullable = true)
|-- Col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Col2Sub: string (nullable = true)
The column Col2 is an array of structs with a Col2Sub field, and we can use this field name to get the desired result:
bla = df.withColumn('Col2', F.concat_ws(',', df.Col2.Col2Sub))
bla.show()
+------+-------+
| Col1| Col2|
+------+-------+
|abc123|foo,bar|
+------+-------+
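For completeness, the same no-explode idea in the Scala API looks like the following sketch, assuming a DataFrame df with the same schema; Col2.Col2Sub pulls the Col2Sub field out of every struct in the array, giving an array<string> that concat_ws joins directly:
import org.apache.spark.sql.functions.{col, concat_ws}

// Field extraction over the array of structs, then join the elements with ","
val result = df.withColumn("Col2", concat_ws(",", col("Col2.Col2Sub")))
result.show()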

Lookup table in Spark

I have a dataframe in Spark with no clearly defined schema that I want to use as a lookup table. For example, the dataframe below:
+------------------------------------------------------------------------+
|lookupcolumn |
+------------------------------------------------------------------------+
|[val1,val2,val3,val4,val5,val6] |
+------------------------------------------------------------------------+
The schema would look like this:
|-- lookupcolumn: struct (nullable = true)
| |-- key1: string (nullable = true)
| |-- key2: string (nullable = true)
| |-- key3: string (nullable = true)
| |-- key4: string (nullable = true)
| |-- key5: string (nullable = true)
| |-- key6: string (nullable = true)
I'm saying "schema not clearly defined" since the number of keys is unknown while the data is being read, so I leave it to Spark to infer the schema.
Now, if I have another dataframe with a column as below:
+-----------------+
| datacolumn|
+-----------------+
| key1 |
| key3 |
| key5 |
| key2 |
| key4 |
+-----------------+
and I want the result to be:
+-----------------+
| resultcolumn|
+-----------------+
| val1 |
| val3 |
| val5 |
| val2 |
| val4 |
+-----------------+
I tried a UDF like this:
val get_val = udf((keyindex: String) => {
  val res = lookupDf.select($"lookupcolumn"(keyindex).alias("result"))
  res.head.toString
})
But it throws a NullPointerException.
Can someone tell me what's wrong with the UDF, and if there's a better/simpler way of doing this lookup in Spark?
I assume that the lookup table is quite small; in that case it makes more sense to collect it to the driver and convert it to a normal Map, and then use that Map inside the UDF. (The NullPointerException happens because a DataFrame like lookupDf cannot be referenced inside a UDF, which runs on the executors.) It can be done in many ways, for example like this:
val values = lookupDf.select("lookupcolumn.*").head.toSeq.map(_.toString)
val keys = lookupDf.select("lookupcolumn.*").columns
val lookup_map = keys.zip(values).toMap
Using the above lookup_map variable, the UDF will simply be:
val lookup = udf((key: String) => lookup_map.get(key))
And the final dataframe can be obtained by:
val df2 = df.withColumn("resultcolumn", lookup($"datacolumn"))
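If the lookup map ever grows beyond a trivial size, one variant is to broadcast it explicitly so each executor keeps a single copy instead of the map being serialized into every task. A sketch, assuming the lookup_map built above and a SparkSession named spark (df2b is just an illustrative name):
import org.apache.spark.sql.functions.udf

// Ship the map to the executors once
val lookupBroadcast = spark.sparkContext.broadcast(lookup_map)

// Missing keys become null in the result column (Option -> null)
val lookup_b = udf((key: String) => lookupBroadcast.value.get(key))

val df2b = df.withColumn("resultcolumn", lookup_b($"datacolumn"))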

Adding attribute of type Array[long] from existing attribute value in DF

I am using Spark 2.0 and have a use case where I need to convert the attribute type of a column from string to Array[Long].
Suppose I have a dataframe with schema :
root
|-- unique_id: string (nullable = true)
|-- column2 : string (nullable = true)
DF :
+----------+---------+
|unique_id | column2 |
+----------+---------+
| 1 | 123 |
| 2 | 125 |
+----------+---------+
Now I want to add a new column named "column3" of type Array[Long] containing the values from "column2", like:
root
|-- unique_id: string (nullable = true)
|-- column2: long (nullable = true)
|-- column3: array (nullable = true)
| |-- element: long (containsNull = true)
new DF :
+----------+---------+---------+
|unique_id | column2 | column3 |
+----------+---------+---------+
| 1 | 123 | [123] |
| 2 | 125 | [125] |
+----------+---------+---------+
Is there a way to achieve this?
You can simply use withColumn and the array function as
df.withColumn("column3", array(df("column2")))
I also see that you are trying to change column2 from string to long. A simple udf function should do the trick. So the final solution would be:
def changeToLong = udf((str: String) => str.toLong)

val finalDF = df
  .withColumn("column2", changeToLong(col("column2")))
  .withColumn("column3", array(col("column2")))
You also need to import the functions library:
import org.apache.spark.sql.functions._
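As a side note, the string-to-long conversion can also be done with a plain cast instead of a UDF; a minimal sketch under the same column names (cast yields null for non-numeric strings instead of throwing; finalDF2 is just an illustrative name):
import org.apache.spark.sql.functions.{array, col}

val finalDF2 = df
  .withColumn("column2", col("column2").cast("long"))
  .withColumn("column3", array(col("column2")))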