pyspark VectorUDT to integer or float conversion

Below is my dataframe
a b c d
1 2 3 [1211]
2 2 4 [1222]
4 5 4 [12322]
Here the d column is of vector type (VectorUDT), and I was not able to cast it directly to integer. Below is the code I tried for the conversion:
newDF = newDF.select(col('d'),
                     newDF.d.cast('int').alias('d'))
Could someone please help with this?

We can use a udf to deserialize the vector and access its values:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import IntegerType
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(1,2,3,Vectors.dense([1211])),(2,2,4,Vectors.dense([1222])),(4,5,4,Vectors.dense([12322]))],['a','b','c','d'])
>>> df.show()
+---+---+---+---------+
| a| b| c| d|
+---+---+---+---------+
| 1| 2| 3| [1211.0]|
| 2| 2| 4| [1222.0]|
| 4| 5| 4|[12322.0]|
+---+---+---+---------+
>>> df.printSchema()
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: long (nullable = true)
|-- d: vector (nullable = true)
>>> udf1 = F.udf(lambda x : int(x[0]),IntegerType())
>>> df.select('d',udf1('d').alias('d1')).show()
+---------+-----+
| d| d1|
+---------+-----+
| [1211.0]| 1211|
| [1222.0]| 1222|
|[12322.0]|12322|
+---------+-----+
>>> df.select('d',udf1('d').alias('d1')).printSchema()
root
|-- d: vector (nullable = true)
|-- d1: integer (nullable = true)

Related

How to change struct dataType to Integer in pyspark?

I have a dataframe df, and one column has the data type struct<long:bigint, string:string>.
Because of this data type structure, I cannot perform addition, subtraction, etc.
How do I change struct<long:bigint, string:string> to just IntegerType?
You can use dot syntax to access parts of the struct column.
For example, if you start with this dataframe:
df = spark.createDataFrame([(1,(3,'x')),(4,(8, 'y'))]).toDF("col1", "col2")
df.show()
df.printSchema()
+----+------+
|col1| col2|
+----+------+
| 1|[3, x]|
| 4|[8, y]|
+----+------+
root
|-- col1: long (nullable = true)
|-- col2: struct (nullable = true)
| |-- _1: long (nullable = true)
| |-- _2: string (nullable = true)
You can select the first part of the struct column and either create a new column or replace an existing one:
df.withColumn('col2', df['col2._1']).show()
prints
+----+----+
|col1|col2|
+----+----+
| 1| 3|
| 4| 8|
+----+----+

Flatten Scala array data type column to multiple columns

Is there any possible way to flatten an array in a Scala DataFrame?
I know that selecting filed.a and the other fields works, but I don't want to specify them manually.
df.printSchema()
|-- client_version: string (nullable = true)
|-- filed: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: string (nullable = true)
| | |-- d: string (nullable = true)
The final df should be:
df.printSchema()
|-- client_version: string (nullable = true)
|-- filed_a: string (nullable = true)
|-- filed_b: string (nullable = true)
|-- filed_c: string (nullable = true)
|-- filed_d: string (nullable = true)
You can flatten your ArrayType column with explode and map the nested struct element names to the wanted top-level column names, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
case class S(a: String, b: String, c: String, d: String)
val df = Seq(
("1.0", Seq(S("a1", "b1", "c1", "d1"))),
("2.0", Seq(S("a2", "b2", "c2", "d2"), S("a3", "b3", "c3", "d3")))
).toDF("client_version", "filed")
df.printSchema
// root
// |-- client_version: string (nullable = true)
// |-- filed: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- a: string (nullable = true)
// | | |-- b: string (nullable = true)
// | | |-- c: string (nullable = true)
// | | |-- d: string (nullable = true)
val dfFlattened = df.withColumn("filed_element", explode($"filed"))
val structElements = dfFlattened.select($"filed_element.*").columns
val dfResult = dfFlattened.select(
  col("client_version") +: structElements.map(
    c => col(s"filed_element.$c").as(s"filed_$c")
  ): _*
)
dfResult.show
// +--------------+-------+-------+-------+-------+
// |client_version|filed_a|filed_b|filed_c|filed_d|
// +--------------+-------+-------+-------+-------+
// | 1.0| a1| b1| c1| d1|
// | 2.0| a2| b2| c2| d2|
// | 2.0| a3| b3| c3| d3|
// +--------------+-------+-------+-------+-------+
dfResult.printSchema
// root
// |-- client_version: string (nullable = true)
// |-- filed_a: string (nullable = true)
// |-- filed_b: string (nullable = true)
// |-- filed_c: string (nullable = true)
// |-- filed_d: string (nullable = true)
Use explode to flatten the arrays by adding more rows and then select with the * notation to bring the struct columns back to the top.
import org.apache.spark.sql.functions.{collect_list, explode, struct}
import spark.implicits._
val df = Seq(("1", "a", "a", "a"),
("1", "b", "b", "b"),
("2", "a", "a", "a"),
("2", "b", "b", "b"),
("2", "c", "c", "c"),
("3", "a", "a","a")).toDF("idx", "A", "B", "C")
.groupBy(("idx"))
.agg(collect_list(struct("A", "B", "C")).as("nested_col"))
df.printSchema()
// root
// |-- idx: string (nullable = true)
// |-- nested_col: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- A: string (nullable = true)
// | | |-- B: string (nullable = true)
// | | |-- C: string (nullable = true)
df.show
// +---+--------------------+
// |idx| nested_col|
// +---+--------------------+
// | 3| [[a, a, a]]|
// | 1|[[a, a, a], [b, b...|
// | 2|[[a, a, a], [b, b...|
// +---+--------------------+
val dfExploded = df.withColumn("exploded", explode($"nested_col")).drop("nested_col")
dfExploded.show
// +---+---------+
// |idx| exploded|
// +---+---------+
// | 3|[a, a, a]|
// | 1|[a, a, a]|
// | 1|[b, b, b]|
// | 2|[a, a, a]|
// | 2|[b, b, b]|
// | 2|[c, c, c]|
// +---+---------+
val finalDF = dfExploded.select("idx", "exploded.*")
finalDF.show
// +---+---+---+---+
// |idx| A| B| C|
// +---+---+---+---+
// | 3| a| a| a|
// | 1| a| a| a|
// | 1| b| b| b|
// | 2| a| a| a|
// | 2| b| b| b|
// | 2| c| c| c|
// +---+---+---+---+
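The explode and the star-select can also be chained in one expression; a minimal sketch, assuming the same df and imports as above:
df.withColumn("exploded", explode($"nested_col"))
  .select("idx", "exploded.*")
  .show
// same output as finalDF above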

Read CSV with last column as array of values (and the values are inside parenthesis and separated by comma) in Spark

I have a CSV file where the last column is inside parentheses and its values are separated by commas. The number of values in the last column is variable. When I read it as a DataFrame with some column names, as follows, I get Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match. My CSV file looks like this:
a1,b1,true,2017-05-16T07:00:41.0000000,2.5,(c1,d1,e1)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,f2,g2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,k2,f2)
What I finally want is something like this:
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: boolean(nullable = true)
|-- STime: datetype(nullable = true)
|-- TotalMinutes: double(nullable = true)
|-- SomeArrayHeader: array<string>(nullable = true)
I have written the following code so far:
val infoDF = sqlContext.read.format("csv")
  .option("header", "false")
  .load(inputPath)
  .toDF(
    "MId",
    "PId",
    "IsTeacher",
    "STime",
    "TotalMinutes",
    "SomeArrayHeader")
I thought of reading the file without giving column names and then casting the columns after the fifth one to array type, but then I have problems with the parentheses. Is there a way to do this while reading, telling Spark that the fields inside parentheses are actually one field of type array?
OK. The solution below is tactical and specific to your case, but it worked for me:
import org.apache.spark.sql.functions.{regexp_replace, split}
import spark.implicits._

val df = spark.read.option("quote", "(").csv("in/staff.csv").toDF(
  "MId",
  "PId",
  "IsTeacher",
  "STime",
  "TotalMinutes",
  "arr")
df.show()
val df2 = df.withColumn("arr", split(regexp_replace('arr, "[)]", ""), ","))
df2.printSchema()
df2.show()
Output:
+---+---+---------+--------------------+------------+---------------+
|MId|PId|IsTeacher| STime|TotalMinutes| arr|
+---+---+---------+--------------------+------------+---------------+
| a1| b1| true|2017-05-16T07:00:...| 2.5| c1,d1,e1)|
| a2| b2| true|2017-05-26T07:00:...| 0.5|c2,d2,e2,f2,g2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2,d2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2,d2,e2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5|c2,d2,e2,k2,f2)|
+---+---+---------+--------------------+------------+---------------+
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: string (nullable = true)
|-- STime: string (nullable = true)
|-- TotalMinutes: string (nullable = true)
|-- arr: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+---------+--------------------+------------+--------------------+
|MId|PId|IsTeacher| STime|TotalMinutes| arr|
+---+---+---------+--------------------+------------+--------------------+
| a1| b1| true|2017-05-16T07:00:...| 2.5| [c1, d1, e1]|
| a2| b2| true|2017-05-26T07:00:...| 0.5|[c2, d2, e2, f2, g2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2, d2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2, d2, e2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5|[c2, d2, e2, k2, f2]|
+---+---+---------+--------------------+------------+--------------------+
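Note that all columns except arr are still strings after this read. If you also want the boolean, timestamp and double types from your desired schema, a follow-up cast is one option; a minimal sketch, assuming the df2 above (the timestamp pattern may need adjusting for your Spark version and data):
import org.apache.spark.sql.functions.{col, to_timestamp}

val typedDf = df2
  .withColumn("IsTeacher", col("IsTeacher").cast("boolean"))
  .withColumn("STime", to_timestamp(col("STime"), "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS"))
  .withColumn("TotalMinutes", col("TotalMinutes").cast("double"))

typedDf.printSchema()
// IsTeacher: boolean, STime: timestamp, TotalMinutes: double, arr: array<string>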

Spark GroupBy agg collect_list multiple columns

I have a question similar to this one, but the number of columns to be operated on by collect_list is given by a list of names. For example:
scala> w.show
+---+-----+----+-----+
|iid|event|date|place|
+---+-----+----+-----+
| A| D1| T0| P1|
| A| D0| T1| P2|
| B| Y1| T0| P3|
| B| Y2| T2| P3|
| C| H1| T0| P5|
| C| H0| T9| P5|
| B| Y0| T1| P2|
| B| H1| T3| P6|
| D| H1| T2| P4|
+---+-----+----+-----+
scala> val combList = List("event", "date", "place")
combList: List[String] = List(event, date, place)
scala> val v = w.groupBy("iid").agg(collect_list(combList(0)), collect_list(combList(1)), collect_list(combList(2)))
v: org.apache.spark.sql.DataFrame = [iid: string, collect_list(event): array<string> ... 2 more fields]
scala> v.show
+---+-------------------+------------------+-------------------+
|iid|collect_list(event)|collect_list(date)|collect_list(place)|
+---+-------------------+------------------+-------------------+
| B| [Y1, Y2, Y0, H1]| [T0, T2, T1, T3]| [P3, P3, P2, P6]|
| D| [H1]| [T2]| [P4]|
| C| [H1, H0]| [T0, T9]| [P5, P5]|
| A| [D1, D0]| [T0, T1]| [P1, P2]|
+---+-------------------+------------------+-------------------+
Is there any way I can apply collect_list to multiple columns inside agg without knowing the number of elements in combList in advance?
You can use collect_list(struct(col1, col2)) AS elements.
Example:
df.select("cd_issuer", "cd_doc", "cd_item", "nm_item").printSchema
val outputDf = spark.sql(s"SELECT cd_issuer, cd_doc, collect_list(struct(cd_item, nm_item)) AS item FROM teste GROUP BY cd_issuer, cd_doc")
outputDf.printSchema
df
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- cd_item: string (nullable = true)
|-- nm_item: string (nullable = true)
outputDf
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cd_item: string (nullable = true)
| | |-- nm_item: string (nullable = true)
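If you want to keep one collect_list per column, as in the original question, but without knowing the length of combList in advance, you can also build the aggregation expressions dynamically and pass them to agg as varargs; a minimal sketch, assuming the w and combList from the question:
import org.apache.spark.sql.functions.{col, collect_list}

val aggExprs = combList.map(c => collect_list(col(c)).as(s"collect_list($c)"))
val v = w.groupBy("iid").agg(aggExprs.head, aggExprs.tail: _*)
v.show()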

Sample values from a list column in a Spark dataframe

I have a Spark Scala dataframe, df1, as shown below. I would like to sample with replacement from the scores column (a List), based on the counts in another column of df1.
val df1 = sc.parallelize(Seq(("a1",2,List(20,10)),("a2",1,List(30,10)),
("a3",3,List(10)),("a4",2,List(10,20,40)))).toDF("colA","counts","scores")
df1.show()
+----+------+------------+
|colA|counts| scores|
+----+------+------------+
| a1| 2| [20, 10]|
| a2| 1| [30, 10]|
| a3| 3| [10]|
| a4| 2|[10, 20, 40]|
+----+------+------------+
The expected output is shown in df2: from row 1, sample 2 values from the list [20, 10]; from row 2, sample 1 value from the list [30, 10]; from row 3, sample 3 values from the list [10] with repetition; and so on.
df2.show() //expected output
+----+------+------------+-------------+
|colA|counts| scores|sampledScores|
+----+------+------------+-------------+
| a1| 2| [20, 10]| [20, 10]|
| a2| 1| [30, 10]| [30]|
| a3| 3| [10]| [10, 10, 10]|
| a4| 2|[10, 20, 40]| [10, 40]|
+----+------+------------+-------------+
I wrote a udf 'takeSample' and applied it to df1, but it did not work as intended.
val takeSample = udf((a: Array[Int], count1: Int) => {
  Array.fill(count1)(a(new Random(System.currentTimeMillis).nextInt(a.size)))
})
val df2 = df1.withColumn("SampledScores", takeSample(df1("Scores"),df1("counts")))
I got the following run-time error when executing
df2.printSchema()
root
|-- colA: string (nullable = true)
|-- counts: integer (nullable = true)
|-- scores: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- SampledScores: array (nullable = true)
| |-- element: integer (containsNull = false)
df2.show()
org.apache.spark.SparkException: Failed to execute user defined
function($anonfun$1: (array<int>, int) => array<int>)
Caused by: java.lang.ClassCastException:
scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
at $anonfun$1.apply(<console>:47)
Any solution is greatly appreciated.
Changing the data type from Array[Int] to Seq[Int] in the UDF will resolve the issue:
val takeSample = udf((a: Seq[Int], count1: Int) => {
  Array.fill(count1)(a(new Random(System.currentTimeMillis).nextInt(a.size)))
})
val df2 = df1.withColumn("SampledScores", takeSample(df1("Scores"),df1("counts")))
It will give us the expected output:
df2.printSchema()
root
|-- colA: string (nullable = true)
|-- counts: integer (nullable = true)
|-- scores: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- SampledScores: array (nullable = true)
| |-- element: integer (containsNull = false)
df2.show
+----+------+------------+-------------+
|colA|counts| scores|SampledScores|
+----+------+------------+-------------+
| a1| 2| [20, 10]| [20, 20]|
| a2| 1| [30, 10]| [30]|
| a3| 3| [10]| [10, 10, 10]|
| a4| 2|[10, 20, 40]| [20, 20]|
+----+------+------------+-------------+
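Note that the output above repeats the same value within a row (e.g. [20, 20]) because the UDF creates a new Random seeded with System.currentTimeMillis for every sampled element, so elements drawn in the same millisecond get the same index. Reusing one generator per row avoids this; a minimal sketch with the same Seq[Int] fix:
import scala.util.Random
import org.apache.spark.sql.functions.udf

val takeSample = udf((a: Seq[Int], count1: Int) => {
  val rnd = new Random()  // one generator per row, not one per element
  Array.fill(count1)(a(rnd.nextInt(a.size)))
})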