Pyspark RDD column value selection - pyspark

I have a rdd like this:
|item_id| recommendations|
+-------+------------------+
| 1|[{810, 5.2324243},{134, 4.58323},{810, 4.89248}]
| 23|[[{1643, 5.1180077}, {1463, 4.8429747}, {1368, 4.4758873}]
if I want to only extract the first value in each {} from col "recommendations".
Expected result looks like this:
|item_id| recommendations|
+-------+------------------+
| 1|[{810, 134, 810}]
| 23|[{1643, 1463, 1368}]
What should I do? Thanks!

Not sure if your data is an rdd or a dataframe, so I provide both here. Overall, from your sample data, I assume your recommendations is an array of struct type. You will know the exact columns by running df.printSchema() (if it was a dataframe) or rdd.first() (if it was an rdd). I created a dummy schema with two columns a and b.
This is my "dummy" class
class X():
def __init__(self, a, b):
self.a = a
self.b = b
This is my "dummy" data
schema = T.StructType([
T.StructField('id', T.IntegerType()),
T.StructField('rec', T.ArrayType(T.StructType([
T.StructField('a', T.IntegerType()),
T.StructField('b', T.FloatType()),
])))
])
df = spark.createDataFrame([
(1, [X(810, 5.2324243), X(134, 4.58323), X(810, 4.89248)]),
(23, [X(1643, 5.1180077), X(1463, 4.8429747), X(1368, 4.4758873)])
], schema)
If your data is a dataframe
df.show(10, False)
df.printSchema()
+---+---------------------------------------------------------+
|id |rec |
+---+---------------------------------------------------------+
|1 |[{810, 5.2324243}, {134, 4.58323}, {810, 4.89248}] |
|23 |[{1643, 5.1180077}, {1463, 4.8429747}, {1368, 4.4758873}]|
+---+---------------------------------------------------------+
root
|-- id: integer (nullable = true)
|-- rec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = true)
| | |-- b: float (nullable = true)
(df
.select('id', F.explode('rec').alias('rec'))
.groupBy('id')
.agg(F.collect_list('rec.a').alias('rec'))
.show()
)
+---+------------------+
| id| rec|
+---+------------------+
| 1| [810, 134, 810]|
| 23|[1643, 1463, 1368]|
+---+------------------+
If your data is an rdd
dfrdd = df.rdd
dfrdd.first()
# Row(id=1, rec=[Row(a=810, b=5.232424259185791), Row(a=134, b=4.583230018615723), Row(a=810, b=4.89247989654541)])
(dfrdd
.map(lambda x: (x.id, [r.a for r in x.rec]))
.toDF()
.show()
)
+---+------------------+
| _1| _2|
+---+------------------+
| 1| [810, 134, 810]|
| 23|[1643, 1463, 1368]|
+---+------------------+

Related

Transpose DataFrame single row to column in Spark with scala

I saw this question here:
Transpose DataFrame Without Aggregation in Spark with scala and I wanted to do exactly the opposite.
I have this Dataframe with a single row, with values that are string, int, bool, array:
+-----+-------+-----+------+-----+
|col1 | col2 |col3 | col4 |col5 |
+-----+-------+-----+------+-----+
|val1 | val2 |val3 | val4 |val5 |
+-----+-------+-----+------+-----+
And I want to transpose it like this:
+-----------+-------+
|Columns | values|
+-----------+-------+
|col1 | val1 |
|col2 | val2 |
|col3 | val3 |
|col4 | val4 |
|col5 | val5 |
+-----------+-------+
I am using Apache Spark 2.4.3 with Scala 2.11
Edit: Values can be of any type (int, double, bool, array), not only strings.
Thought differently with out using arrays_zip (which is available in => Spark 2.4)] and got the below...
It will work for Spark =>2.0 onwards in a simpler way (flatmap , map and explode functions)...
Here map function (used in with column) creates a new map column. The input columns must be grouped as key-value pairs.
Case : String data type in Data :
import org.apache.spark.sql.functions._
val df: DataFrame =Seq((("val1"),("val2"),("val3"),("val4"),("val5"))).toDF("col1","col2","col3","col4","col5")
var columnsAndValues = df.columns.flatMap { c => Array(lit(c), col(c)) }
df.printSchema()
df.withColumn("myMap", map(columnsAndValues:_*)).select(explode($"myMap"))
.toDF("Columns","Values").show(false)
Result :
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
|-- col4: string (nullable = true)
|-- col5: string (nullable = true)
+-------+------+
|Columns|Values|
+-------+------+
|col1 |val1 |
|col2 |val2 |
|col3 |val3 |
|col4 |val4 |
|col5 |val5 |
+-------+------+
Case : Mix of data types in Data :
If you have different types convert them to String... remaining steps wont change..
val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)
Full Example :
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql.Column
val df = Seq(((2), (3), (true), (2.4), ("val"))).toDF("col1", "col2", "col3", "col4", "col5")
df.printSchema()
/**
* convert all columns to to string type since its needed further
*/
val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)
df1.printSchema()
var ColumnsAndValues: Array[Column] = df.columns.flatMap { c => {
Array(lit(c), col(c))
}
}
df1.withColumn("myMap", map(ColumnsAndValues: _*))
.select(explode($"myMap"))
.toDF("Columns", "Values")
.show(false)
Result :
root
|-- col1: integer (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: boolean (nullable = false)
|-- col4: double (nullable = false)
|-- col5: string (nullable = true)
root
|-- col1: string (nullable = false)
|-- col2: string (nullable = false)
|-- col3: string (nullable = false)
|-- col4: string (nullable = false)
|-- col5: string (nullable = true)
+-------+------+
|Columns|Values|
+-------+------+
|col1 |2 |
|col2 |3 |
|col3 |true |
|col4 |2.4 |
|col5 |val |
+-------+------+
From Spark-2.4 Use arrays_zip with array(column_values), array(column_names) then explode to get the result.
Example:
val df=Seq((("val1"),("val2"),("val3"),("val4"),("val5"))).toDF("col1","col2","col3","col4","col5")
val cols=df.columns.map(x => col(s"${x}"))
val str_cols=df.columns.mkString(",")
df.withColumn("new",explode(arrays_zip(array(cols:_*),split(lit(str_cols),",")))).
select("new.*").
toDF("values","Columns").
show()
//+------+-------+
//|values|Columns|
//+------+-------+
//| val1| col1|
//| val2| col2|
//| val3| col3|
//| val4| col4|
//| val5| col5|
//+------+-------+
UPDATE:
val df=Seq(((2),(3),(true),(2.4),("val"))).toDF("col1","col2","col3","col4","col5")
df.printSchema
//root
// |-- col1: integer (nullable = false)
// |-- col2: integer (nullable = false)
// |-- col3: boolean (nullable = false)
// |-- col4: double (nullable = false)
// |-- col5: string (nullable = true)
//cast to string
val cols=df.columns.map(x => col(s"${x}").cast("string").alias(s"${x}"))
val str_cols=df.columns.mkString(",")
df.withColumn("new",explode(arrays_zip(array(cols:_*),split(lit(str_cols),",")))).
select("new.*").
toDF("values","Columns").
show()
//+------+-------+
//|values|Columns|
//+------+-------+
//| 2| col1|
//| 3| col2|
//| true| col3|
//| 2.4| col4|
//| val| col5|
//+------+-------+

Spark select item in array by max score

Given the following DataFrame containing an id and a Seq of Stuff (with an id and score), how do I select the "best" Stuff in the array by score?
I'd like NOT to use UDFs and possibly work with Spark DataFrame functions only.
case class Stuff(id: Int, score: Double)
val df = spark.createDataFrame(Seq(
(1, Seq(Stuff(11, 0.4), Stuff(12, 0.5))),
(2, Seq(Stuff(22, 0.9), Stuff(23, 0.8)))
)).toDF("id", "data")
df.show(false)
+---+----------------------+
|id |data |
+---+----------------------+
|1 |[[11, 0.4], [12, 0.5]]|
|2 |[[22, 0.9], [23, 0.8]]|
+---+----------------------+
df.printSchema
root
|-- id: integer (nullable = false)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: integer (nullable = false)
| | |-- score: double (nullable = false)
I tried going down the route of window functions but the code gets a bit too convoluted. Expected output:
+---+---------+
|id |topStuff |
+---+---------
|1 |[12, 0.5]|
|2 |[22, 0.9]|
+---+---------+
You can use Spark 2.4 higher-order functions:
df
.selectExpr("id","(filter(data, x -> x.score == array_max(data.score)))[0] as topstuff")
.show()
gives
+---+---------+
| id| topstuff|
+---+---------+
| 1|[12, 0.5]|
| 2|[22, 0.9]|
+---+---------+
As an alternative, use window-functions (requires shuffling!):
df
.select($"id",explode($"data").as("topstuff"))
.withColumn("selector",max($"topstuff.score") .over(Window.partitionBy($"id")))
.where($"topstuff.score"===$"selector")
.drop($"selector")
.show()
also gives:
+---+---------+
| id| topstuff|
+---+---------+
| 1|[12, 0.5]|
| 2|[22, 0.9]|
+---+---------+

How to loop through the Dataframe which is of type of Array and append the value to a final Dataframe using Scala

Please could you help me with the solution for the below Questions: Question 01: Is there a way i can loop only Array types as
looping string type within an array will throw an error. I cannot drop String Type(VIN) as i need this data on the final df.
df.printSchema
returns:
root
|-- APP: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- B1X: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- B2X: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- B3X: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- VIN: string (nullable = true)
After running the below forloop:
Question 02: Dataframe jsonDF2 is holding only the last E, V value as stime, can_value of the last signal B3X. Is there a way to append all the values( i mean all the signal values{APP, B1X, B2X, B3X, VIN}) to a Dataframe jsonDF2
after it comes out of foreach loop.
val columns:Array[String] = df.columns
for(col_name <- columns){
| df = df.withColumn("element", explode(col(col_name)))
| .withColumn("stime", col("element.E"))
| .withColumn("can_value", col("element.V"))
| .withColumn("SIGNAL", lit(col_name))
| .drop(col("element"))
| .drop(col(col_name))
| }
Here's one approach illustrated using the following example:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._
case class Elem(e: Long, v: Double)
val df = Seq(
(Seq(Elem(1, 1.0)), Seq(Elem(2, 2.0), Elem(3, 3.0)), Seq(Elem(4, 4.0)), Seq(Elem(5, 5.0)), "a"),
(Seq(Elem(6, 6.0)), Seq(Elem(7, 7.0), Elem(8, 8.0)), Seq(Elem(9, 9.0)), Seq(Elem(10, 10.0)), "b")
).toDF("APP", "B1X", "B2X", "B3X", "VIN")
Question #1: Is there a way i can loop only Array types?
You can simply collect all the top-level field names of ArrayType as follows:
val arrCols = df.schema.fields.collect{
case StructField(name, dtype: ArrayType, _, _) => name
}
// arrCols: Array[String] = Array(APP, B1X, B2X, B3X)
Question #2: Is there a way to append all the signal values {APP, B1X, B2X, B3X, VIN}?
Not sure I completely understand your requirement without sample output. Based on your code snippet, I'm assuming your goal is to flatten all array columns of struct-typed elements into separate top-level columns. Below are the steps:
Step 1: Group all the array columns into a single array column of struct(colName, colValue); then transform for each row using foldLeft to generate a combined array of struct(colName, Elem-E, Elem-V):
case class ColElem(c: String, e: Long, v: Double)
val df2 = df.
select(array(arrCols.map(c => struct(lit(c).as("_1"), col(c).as("_2"))): _*)).
map{ case Row(rs: Seq[Row] #unchecked) => rs.foldLeft(Seq[ColElem]()){
(acc, r) => r match { case Row(c: String, s: Seq[Row] #unchecked) =>
acc ++ s.map(el => ColElem(c, el.getAs[Long](0), el.getAs[Double](1)))
}
}}.toDF("combined_array")
df2.show(false)
// +-----------------------------------------------------------------------------+
// |combined_array |
// +-----------------------------------------------------------------------------+
// |[[APP, 1, 1.0], [B1X, 2, 2.0], [B1X, 3, 3.0], [B2X, 4, 4.0], [B3X, 5, 5.0]] |
// |[[APP, 6, 6.0], [B1X, 7, 7.0], [B1X, 8, 8.0], [B2X, 9, 9.0], [B3X, 10, 10.0]]|
// +-----------------------------------------------------------------------------+
Step 2: Flatten the combined array of struct-typed elements into top-level columns:
df2.
select(explode($"combined_array").as("flattened")).
select($"flattened.c".as("signal"), $"flattened.e".as("stime"), $"flattened.v".as("can_value")).
orderBy("signal", "stime").
show
// +------+-----+---------+
// |signal|stime|can_value|
// +------+-----+---------+
// | APP| 1| 1.0|
// | APP| 6| 6.0|
// | B1X| 2| 2.0|
// | B1X| 3| 3.0|
// | B1X| 7| 7.0|
// | B1X| 8| 8.0|
// | B2X| 4| 4.0|
// | B2X| 9| 9.0|
// | B3X| 5| 5.0|
// | B3X| 10| 10.0|
// +------+-----+---------+
You can use the schema member and then filter them out before hand with a filter and a map. Then do your for loop stuff.
import org.apache.spark.sql.types._
val schema = df.schema.filter{ case StructField(_, datatype, _, _) => datatype == ArrayType }
val columns = schema.map{ case StructField(columnName, _ , _, _) => columnName }

How to return ListBuffer as a column from UDF using Spark Scala?

I am trying to use UDF's and return ListBuffer as a column from UDF, i am getting error.
I have created Df by executing below code:
val df = Seq((1,"dept3##rama##kumar","dept3##rama##kumar"), (2,"dept31##rama1##kumar1","dept33##rama3##kumar3")).toDF("id","str1","str2")
df.show()
it show like below:
+---+--------------------+--------------------+
| id| str1| str2|
+---+--------------------+--------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar|
| 2|dept31##rama1##ku...|dept33##rama3##ku...|
+---+--------------------+--------------------+
as per my requirement i have to use i have to split the above columns based some inputs so i have tried UDF like below :
def appendDelimiterError=udf((id: Int, str1: String, str2: String)=> {
var lit = new ListBuffer[Any]()
if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("#&"){val a=str1.split("#&")}
if(str2.contains("##"){ val b=str2.split("##")}
else if(str2.contains("##"){ val b=str2.split("##") }
else if(str1.contains("##"){val b=str2.split("##")}
var tmp_row = List(a,"test1",b)
lit +=tmp_row
return lit
})
val
i try to cal by executing below code:
val df1=df.appendDelimiterError("newcol",appendDelimiterError(df("id"),df("str1"),df("str2"))
i getting error "this was a bad call" .i want use ListBuffer/list to store and return to calling place.
my expected output will be:
+---+--------------------+------------------------+----------------------------------------------------------------------+
| id| str1| str2 | newcol |
+---+--------------------+------------------------+----------------------------------------------------------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar |ListBuffer(List("dept","rama","kumar"),List("dept3","rama","kumar")) |
| 2|dept31##rama1##kumar1|dept33##rama3##kumar3 | ListBuffer(List("dept31","rama1","kumar1"),List("dept33","rama3","kumar3")) |
+---+--------------------+------------------------+----------------------------------------------------------------------+
How to achieve this?
An alternative with my own fictional data to which you can tailor and no UDF:
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = Seq(
(1, "111##cat##666", "222##fritz##777"),
(2, "AAA##cat##555", "BBB##felix##888"),
(3, "HHH##mouse##yyy", "123##mickey##ZZZ")
).toDF("c0", "c1", "c2")
val df2 = df.withColumn( "c_split", split(col("c1"), ("(##)|(##)|(##)|(##)") ))
.union(df.withColumn("c_split", split(col("c2"), ("(##)|(##)|(##)|(##)") )) )
df2.show(false)
df2.printSchema()
val df3 = df2.groupBy(col("c0")).agg(collect_list(col("c_split")).as("List_of_Data") )
df3.show(false)
df3.printSchema()
Gives answer but no ListBuffer - really necessary?, as follows:
+---+---------------+----------------+------------------+
|c0 |c1 |c2 |c_split |
+---+---------------+----------------+------------------+
|1 |111##cat##666 |222##fritz##777 |[111, cat, 666] |
|2 |AAA##cat##555 |BBB##felix##888 |[AAA, cat, 555] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[HHH, mouse, yyy] |
|1 |111##cat##666 |222##fritz##777 |[222, fritz, 777] |
|2 |AAA##cat##555 |BBB##felix##888 |[BBB, felix, 888] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[123, mickey, ZZZ]|
+---+---------------+----------------+------------------+
root
|-- c0: integer (nullable = false)
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- c_split: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---------------------------------------+
|c0 |List_of_Data |
+---+---------------------------------------+
|1 |[[111, cat, 666], [222, fritz, 777]] |
|3 |[[HHH, mouse, yyy], [123, mickey, ZZZ]]|
|2 |[[AAA, cat, 555], [BBB, felix, 888]] |
+---+---------------------------------------+
root
|-- c0: integer (nullable = false)
|-- List_of_Data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)

How to transform Spark Dataframe columns to a single column of a string array

I want to know how can I "merge" multiple dataframe columns into one as a string array?
For example, I have this dataframe:
val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")
Which looks like this:
scala> df.show
+---+----+------+-------+
| Id|Name|Number|Comment|
+---+----+------+-------+
| 1|Jack| 125| Text|
| 2|Mary| 152| Text2|
+---+----+------+-------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- Name: string (nullable = true)
|-- Number: string (nullable = true)
|-- Comment: string (nullable = true)
How can I transform it so it would look like this:
scala> df.show
+---+-----------------+
| Id| List|
+---+-----------------+
| 1| [Jack,125,Text]|
| 2| [Mary,152,Text2]|
+---+-----------------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- List: Array (nullable = true)
| |-- element: string (containsNull = true)
Use org.apache.spark.sql.functions.array:
import org.apache.spark.sql.functions._
val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")
result.show()
// +---+------------------+
// |Id |List |
// +---+------------------+
// |1 |[Jack, 125, Text] |
// |2 |[Mary, 152, Text2]|
// +---+------------------+
Can also be used with withColumn :
import org.apache.spark.sql.functions as F
df.withColumn("Id", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))