AWS Glue pyspark UDF - pyspark

In AWS Glue, I need to convert a float value (Celsius to Fahrenheit) and am using a UDF.
Following is my UDF:
toFahrenheit = udf(lambda x: '-1' if x in not_found else x * 9 / 5 + 32, StringType())
I am using the UDF as follows on the Spark dataframe:
weather_df.withColumn("new_tmax", toFahrenheit(weather_df["tmax"])).drop("tmax").withColumnRenamed("new_tmax","tmax")
When I run the code, I get the following error message:
IllegalArgumentException: u"requirement failed: The number of columns doesn't match.\nOld column names (11): station, name, latitude, longitude, elevation, date, awnd, prcp, snow, tmin, tmax\nNew column names (0): "
I am not sure how to invoke the UDF, as I am new to Python / PySpark, and the new column schema is not created and stays empty.
The code snippet used for the above sample is:
%pyspark
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.context import DynamicFrame
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.job import Job
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
glueContext = GlueContext(SparkContext.getOrCreate())
weather_raw = glueContext.create_dynamic_frame.from_catalog(database = "ohare-airport-2006", table_name = "ohare_intl_airport_2006_08_climate_csv")
print "cpnt : ", weather_raw.count()
weather_raw.printSchema()
weather_raw.toDF().show(10)
#UDF to convert the air temperature from celsius to fahrenheit (For sample transformation)
#toFahrenheit = udf((lambda c: c[1:], c * 9 / 5 + 32)
toFahrenheit = udf(lambda x: '-1' if x in not_found_cat else x * 9 / 5 + 32, StringType())
#Apply the UDF to maximum and minimum air temperature
wthdf = weather_df.withColumn("new_tmin", toFahrenheit(weather_df["tmin"])).withColumn("new_tmax", toFahrenheit(weather_df["tmax"])).drop("tmax").drop("tmin").withColumnRenamed("new_tmax","tmax").withColumnRenamed("new_tmin","tmin")
wthdf.toDF().show(5)
The schema for weather_df:
root
|-- station: string
|-- name: string
|-- latitude: double
|-- longitude: double
|-- elevation: double
|-- date: string
|-- awnd: double
|-- fmtm: string
|-- pgtm: string
|-- prcp: double
|-- snow: double
|-- snwd: long
|-- tavg: string
|-- tmax: long
|-- tmin: long
Error trace:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-3684249459612979499.py", line 349, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-3684249459612979499.py", line 342, in <module>
exec(code)
File "<stdin>", line 3, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 1558, in toDF
jdf = self._jdf.toDF(self._jseq(cols))
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
IllegalArgumentException: u"requirement failed: The number of columns doesn't match.\nOld column names (11): station, name, latitude, longitude, elevation, date, awnd, prcp, snow, tmin, tmax\nNew column names (0): "
Thanks

Solution for the above (Celsius to Fahrenheit), just in case, for reference:
#UDF to convert the air temperature from celsius to fahrenheit
toFahrenheit = udf(lambda x: x * 9 / 5 + 32, StringType())
weather_in_Fahrenheit = weather_df.withColumn("new_tmax", toFahrenheit(weather_df["tmax"])).withColumn("new_tmin", toFahrenheit(weather_df["tmin"])).drop("tmax").drop("tmin").withColumnRenamed("new_tmax","tmax").withColumnRenamed("new_tmin","tmin")
weather_in_Fahrenheit.show(5)
Raw data sample:
+-----------+--------------------+---------+--------+---------+----+----+----+----+----------+
| station| name|elevation|latitude|longitude|prcp|snow|tmax|tmin| date|
+-----------+--------------------+---------+--------+---------+----+----+----+----+----------+
|USW00094846|CHICAGO OHARE INT...| 201.8| 41.995| -87.9336| 0.0| 0.0| 25| 11|2013-01-01|
|USW00094846|CHICAGO OHARE INT...| 201.8| 41.995| -87.9336| 0.0| 0.0| 30| 10|2013-01-02|
|USW00094846|CHICAGO OHARE INT...| 201.8| 41.995| -87.9336| 0.0| 0.0| 29| 18|2013-01-03|
|USW00094846|CHICAGO OHARE INT...| 201.8| 41.995| -87.9336| 0.0| 0.0| 36| 13|2013-01-04|
|USW00094846|CHICAGO OHARE INT...| 201.8| 41.995| -87.9336|0.03| 0.4| 39| 18|2013-01-05|
|USW00094846|CHICAGO OHARE INT...| 201.8| 41.995| -87.9336| 0.0| 0.0| 36| 18|2013-01-06|
|USW00094846|CHICAGO OHARE INT...| 201.8| 41.995| -87.9336| 0.0| 0.0| 41| 15|2013-01-07|
|USW00094846|CHICAGO OHARE INT...| 201.8| 41.995| -87.9336| 0.0| 0.0| 44| 22|2013-01-08|
|USW00094846|CHICAGO OHARE INT...| 201.8| 41.995| -87.9336| 0.0| 0.0| 50| 27|2013-01-09|
|USW00094846|CHICAGO OHARE INT...| 201.8| 41.995| -87.9336|0.63| 0.0| 45| 22|2013-01-10|
+-----------+--------------------+---------+--------+---------+----+----+----+----+----------+
After applying the UDF toFahrenheit:
+-----------+--------------------+--------+---------+---------+----------+-----+----+----+----+----+
| station| name|latitude|longitude|elevation| date| awnd|prcp|snow|tmax|tmin|
+-----------+--------------------+--------+---------+---------+----------+-----+----+----+----+----+
|USW00094846|CHICAGO OHARE INT...| 41.995| -87.9336| 201.8|2013-01-01| 8.5| 0.0| 0.0| 77| 51|
|USW00094846|CHICAGO OHARE INT...| 41.995| -87.9336| 201.8|2013-01-02| 8.05| 0.0| 0.0| 86| 50|
|USW00094846|CHICAGO OHARE INT...| 41.995| -87.9336| 201.8|2013-01-03|11.41| 0.0| 0.0| 84| 64|
|USW00094846|CHICAGO OHARE INT...| 41.995| -87.9336| 201.8|2013-01-04| 13.2| 0.0| 0.0| 96| 55|
|USW00094846|CHICAGO OHARE INT...| 41.995| -87.9336| 201.8|2013-01-05| 9.62|0.03| 0.4| 102| 64|
+-----------+--------------------+--------+---------+---------+----------+-----+----+----+----+----+
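For reference, the original IllegalArgumentException appears to come from calling .toDF() with no arguments on what is already a Spark DataFrame (withColumn returns a DataFrame, so show() can be called on it directly). A minimal sketch of a slightly cleaner variant, assuming tmax and tmin are numeric and using DoubleType so the converted columns stay numeric instead of strings:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

# Return DoubleType so the converted temperatures remain numeric.
to_fahrenheit = udf(lambda c: c * 9.0 / 5.0 + 32.0 if c is not None else None, DoubleType())

weather_in_fahrenheit = (weather_df
    .withColumn("tmax", to_fahrenheit(col("tmax")))
    .withColumn("tmin", to_fahrenheit(col("tmin"))))
weather_in_fahrenheit.show(5)  # already a DataFrame, no extra toDF() needed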

Related

Denormalize data in AWS Glue PySpark [duplicate]


Explode multiple columns to rows in pyspark

I have the below spark dataframe.
Name age subject parts
xxxx 21 Maths,Physics I
yyyy 22 English,French I,II
I am trying to explode the above dataframe on both subject and parts, as below.
Expected output:
Name age subject parts
xxxx 21 Maths I
xxxx 21 Physics I
yyyy 22 English I
yyyy 22 English II
yyyy 22 French I
yyyy 22 French II
I tried using arrays_zip for subject and parts and then tried to explode using the temp column, but I am getting null values where there is only one part.
Is there a way to achieve this in PySpark?
You simply need to use both split and explode:
Data Sample
df.show()
+----+---+--------------+-----+
|Name|age| subject|parts|
+----+---+--------------+-----+
|xxxx| 21| Maths,Physics| I|
|yyyy| 22|English,French| I,II|
+----+---+--------------+-----+
df.printSchema()
root
|-- Name: string (nullable = true)
|-- age: long (nullable = true)
|-- subject: string (nullable = true)
|-- parts: string (nullable = true)
Data Transformation
from pyspark.sql import functions as F

df.withColumn(
    "subject", F.explode(F.split("subject", ","))
).withColumn(
    "parts", F.explode(F.split("parts", ","))
).show()
+----+---+-------+-----+
|Name|age|subject|parts|
+----+---+-------+-----+
|xxxx| 21| Maths| I|
|xxxx| 21|Physics| I|
|yyyy| 22|English| I|
|yyyy| 22|English| II|
|yyyy| 22| French| I|
|yyyy| 22| French| II|
+----+---+-------+-----+
You can split them separately, then join them back together.
Split subjects
from pyspark.sql import functions as F

df1 = (df
    .select('name', 'age', 'subject')
    .withColumn('subject', F.explode(F.split('subject', ',')))
)
# +----+---+-------+
# |name|age|subject|
# +----+---+-------+
# |xxxx| 21| Maths|
# |xxxx| 21|Physics|
# |yyyy| 22|English|
# |yyyy| 22| French|
# +----+---+-------+
Split parts
df2 = (df
    .select('name', 'age', 'parts')
    .withColumn('parts', F.explode(F.split('parts', ',')))
)
# +----+---+-----+
# |name|age|parts|
# +----+---+-----+
# |xxxx| 21| I|
# |yyyy| 22| I|
# |yyyy| 22| II|
# +----+---+-----+
Join back (joining on name and age pairs every exploded subject with every exploded part for that person)
df1.join(df2, on=['name', 'age'])
# +----+---+-------+-----+
# |name|age|subject|parts|
# +----+---+-------+-----+
# |xxxx| 21| Maths| I|
# |xxxx| 21|Physics| I|
# |yyyy| 22|English| I|
# |yyyy| 22|English| II|
# |yyyy| 22| French| I|
# |yyyy| 22| French| II|
# +----+---+-------+-----+
I did this by passing the columns as a list to a for loop and exploding the dataframe for every element in the list; a sketch of that approach is shown below.
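For reference, a minimal sketch of that loop-based approach, with the column names assumed from the sample above:
from pyspark.sql import functions as F

# Explode each comma-separated column in turn.
cols_to_explode = ['subject', 'parts']
exploded = df
for c in cols_to_explode:
    exploded = exploded.withColumn(c, F.explode(F.split(F.col(c), ',')))
exploded.show()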

Adding a list to a dataframe in Scala / Spark such that each element is added to a separate row

Say, for example, I have a dataframe in the following format (in reality there are a lot more documents):
df.show()
//output
+-----+-----+-----+
|doc_0|doc_1|doc_2|
+-----+-----+-----+
| 0.0| 1.0| 0.0|
+-----+-----+-----+
| 0.0| 1.0| 0.0|
+-----+-----+-----+
| 2.0| 0.0| 1.0|
+-----+-----+-----+
// ngramShingles is a list of shingles
println(ngramShingles)
//output
List("the", "he ", "e l")
The length of ngramShingles is equal to the number of columns in the dataframe.
How would I get to the following output?
// Desired Output
+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
| 0.0| 1.0| 0.0| "the"|
+-----+-----+-----+-------+
| 0.0| 1.0| 0.0| "he "|
+-----+-----+-----+-------+
| 2.0| 0.0| 1.0| "e l"|
+-----+-----+-----+-------+
I have tried to add a column via the following line of code:
val finalDf = df.withColumn("shingle", typedLit(ngramShingles))
But that gives me this output:
+-----+-----+-----+-----------------------+
|doc_0|doc_1|doc_2| shingle|
+-----+-----+-----+-----------------------+
| 0.0| 1.0| 0.0| ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
| 0.0| 1.0| 0.0| ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
| 2.0| 0.0| 1.0| ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
I have tried a few other solutions, but really nothing I have tried even comes close. Basically, I just want the new column to be added to each row in the DataFrame.
This question shows how to do this, but both answers rely on one column already existing. I don't think I can apply those answers to my situation, where I have thousands of columns.
You could make a dataframe from your list and then join the two dataframes together.
To do the join you need to add an additional column that is used for the join (it can be dropped later):
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._  // for .toDF on a local collection (already available in spark-shell)

val listDf = List("the", "he ", "e l").toDF("shingle")
val result = df.withColumn("rn", monotonically_increasing_id())
  .join(listDf.withColumn("rn", monotonically_increasing_id()), "rn")
  .drop("rn")
Result:
+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
| 0.0| 1.0| 0.0| the|
| 0.0| 1.0| 0.0| he |
| 2.0| 0.0| 1.0| e l|
+-----+-----+-----+-------+

string manipulations using Spark scala

I have the following Spark Scala dataframe.
val someDF = Seq(
  (1, "bat", 1.3222),
  (4, "cbat", 1.40222),
  (3, "horse", 1.501212)
).toDF("number", "word", "value")
I created a user-defined function (UDF) to create a new variable as follows.
Logic: if word equals "bat" then value, else zero.
import org.apache.spark.sql.functions.{col, udf}

val func1 = udf( (s: String, y: Double) => if (s.contains("bat")) y else 0 )
someDF.withColumn("cal_var", func1(col("word"), col("value"))).drop("value").show()
+------+-----+-------+
|number| word|cal_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat|1.40222|
| 3|horse| 0.0|
+------+-----+-------+
Here, to check the equality, I used the contains function. Because of that I am getting incorrect output.
My desired output should be like this:
+------+-----+-------+
|number| word|cal_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+-------+
Can anyone help me figure out the correct string function I should use to check equality?
Thank you
Try to avoid using UDFs, as they give poor performance.
Another approach:
val someDF = Seq(
  (1, "bat", 1.3222),
  (4, "cbat", 1.40222),
  (3, "horse", 1.501212)
).toDF("number", "word", "value")
import org.apache.spark.sql.functions._
someDF.show
+------+-----+--------+
|number| word| value|
+------+-----+--------+
| 1| bat| 1.3222|
| 4| cbat| 1.40222|
| 3|horse|1.501212|
+------+-----+--------+
someDF.withColumn("value",when('word === "bat",'value).otherwise(0)).show()
+------+-----+------+
|number| word| value|
+------+-----+------+
| 1| bat|1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+------+
The solution is to use the equals method rather than contains. contains checks whether the string "bat" is present anywhere in the given string s, not equality. The code is shown below:
scala> someDF.show
+------+-----+--------+
|number| word| value|
+------+-----+--------+
| 1| bat| 1.3222|
| 4| cbat| 1.40222|
| 3|horse|1.501212|
+------+-----+--------+
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val func1 = udf( (s:String ,y:Double) => if(s.equals("bat")) y else 0 )
func1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,DoubleType,Some(List(StringType, DoubleType)))
scala> someDF.withColumn("col_var", func1(col("word"),col("value"))).drop("value").show
+------+-----+-------+
|number| word|col_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+-------+
Let me know if it helps!!

pyspark - Convert sparse vector obtained after one hot encoding into columns

I am using the Apache Spark ML library to handle categorical features using one-hot encoding. After writing the below code I get a vector c_idx_vec as the output of the one-hot encoding. I understand how to interpret this output vector, but I am unable to figure out how to convert this vector into columns so that I get a new transformed dataframe. Take this dataset for example:
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
By default, the OneHotEncoder will drop the last category:
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
Of course, this behavior can be changed:
>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
So, I wanted to know how to convert my c_idx_vec vector into a new dataframe as below:
Here is what you can do:
>>> from pyspark.ml.feature import OneHotEncoder, StringIndexer
>>>
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
>>>
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> oe.setDropLast(False)
OneHotEncoder_49e58b281387d8dc0c6b
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
# Get c and its respective index. The one-hot encoder will put those at the same index in the vector.
>>> colIdx = fl.select("c","c_idx").distinct().rdd.collectAsMap()
>>> colIdx
{'c': 2.0, 'b': 1.0, 'a': 0.0}
>>>
>>> colIdx = sorted((value, "ls_" + key) for (key, value) in colIdx.items())
>>> colIdx
[(0.0, 'ls_a'), (1.0, 'ls_b'), (2.0, 'ls_c')]
>>>
>>> newCols = list(map(lambda x: x[1], colIdx))
>>> actualCol = fl.columns
>>> actualCol
['x', 'c', 'c_idx', 'c_idx_vec']
>>> allColNames = actualCol + newCols
>>> allColNames
['x', 'c', 'c_idx', 'c_idx_vec', 'ls_a', 'ls_b', 'ls_c']
>>>
>>> def extract(row):
...     return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())
...
>>> result = fl.rdd.map(extract).toDF(allColNames)
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|10.0|b |1.0 |(3,[1],[1.0])|0.0 |1.0 |0.0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0.0 |0.0 |1.0 |
+----+---+-----+-------------+----+----+----+
# Typecast the new columns to int
>>> for col in newCols:
...     result = result.withColumn(col, result[col].cast("int"))
...
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|10.0|b |1.0 |(3,[1],[1.0])|0 |1 |0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0 |0 |1 |
+----+---+-----+-------------+----+----+----+
Hope this helps!!
I am not sure it is the most efficient or simplest way, but you can do it with a UDF; starting from your fl dataframe:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None
ith = udf(ith_, DoubleType())
(fl.withColumn('is_a', ith("c_idx_vec", lit(0)))
   .withColumn('is_b', ith("c_idx_vec", lit(1)))
   .withColumn('is_c', ith("c_idx_vec", lit(2)))
   .show())
The result is:
+----+---+-----+-------------+----+----+----+
| x| c|c_idx| c_idx_vec|is_a|is_b|is_c|
+----+---+-----+-------------+----+----+----+
| 1.0| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
| 1.5| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
|10.0| b| 1.0|(3,[1],[1.0])| 0.0| 1.0| 0.0|
| 3.2| c| 2.0|(3,[2],[1.0])| 0.0| 0.0| 1.0|
+----+---+-----+-------------+----+----+----+
i.e. exactly as requested.
HT (and +1) to this answer that provided the udf.
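As a side note, on Spark 3.0+ the same reshaping can be done without a UDF via pyspark.ml.functions.vector_to_array; a minimal sketch, assuming the same fl dataframe as above:
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# Turn the vector column into an array column, then pick one value per index.
arr = fl.withColumn("c_arr", vector_to_array(F.col("c_idx_vec")))
arr.select(
    "x", "c", "c_idx",
    F.col("c_arr")[0].alias("is_a"),
    F.col("c_arr")[1].alias("is_b"),
    F.col("c_arr")[2].alias("is_c"),
).show()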
Given that StringIndexer was used to generate the index numbers and the one-hot encoding is then generated using OneHotEncoderEstimator, the entire code from end to end should look like this:
Generate the data and index the string values, saving the StringIndexerModel object for later:
>>> from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>>
>>> # need to save the indexer model object for indexing label info to be used later
>>> ss_fit = ss.fit(fd)
>>> ss_fit.labels # to be used later
['a', 'b', 'c']
>>> ff = ss_fit.transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
Do one-hot encoding using the OneHotEncoderEstimator class, since OneHotEncoder is deprecated:
>>> oe = OneHotEncoderEstimator(inputCols=["c_idx"],outputCols=["c_idx_vec"])
>>> oe_fit = oe.fit(ff)
>>> fe = oe_fit.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
Perform one-hot binary value reshaping. The one-hot values will always be 0.0 or 1.0.
>>> from pyspark.sql.types import FloatType, IntegerType
>>> from pyspark.sql.functions import lit, udf
>>> ith = udf(lambda v, i: float(v[i]), FloatType())
>>> fx = fe
>>> for sidx, oe_col in zip([ss_fit], oe.getOutputCols()):
...     # iterate over string values and ignore the last one
...     for ii, val in list(enumerate(sidx.labels))[:-1]:
...         fx = fx.withColumn(
...             sidx.getInputCol() + '_' + val,
...             ith(oe_col, lit(ii)).astype(IntegerType())
...         )
>>> fx.show()
+----+---+-----+-------------+---+---+
| x| c|c_idx| c_idx_vec|c_a|c_b|
+----+---+-----+-------------+---+---+
| 1.0| a| 0.0|(2,[0],[1.0])| 1| 0|
| 1.5| a| 0.0|(2,[0],[1.0])| 1| 0|
|10.0| b| 1.0|(2,[1],[1.0])| 0| 1|
| 3.2| c| 2.0| (2,[],[])| 0| 0|
+----+---+-----+-------------+---+---+
Note that Spark, by default, removes the last category, so, following that behavior, the c_c column is not necessary here.
I couldn't find a way to access the sparse vector with the DataFrame API, so I converted it to an RDD.
from pyspark.sql import Row

# new column names (prefixed so they do not collide with the existing 'c' column)
labels = ['is_a', 'is_b', 'is_c']
extract_f = lambda row: Row(**row.asDict(), **dict(zip(labels, row.c_idx_vec.toArray())))
fe.rdd.map(extract_f).collect()
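If the result should stay a distributed DataFrame rather than a collected list of Rows, the same mapping can feed toDF(); a minimal sketch, assuming the same fe and labels as above (the values are cast to plain Python floats so schema inference works):
from pyspark.sql import Row

# Keep the one-hot columns alongside the originals as a DataFrame
# instead of pulling Row objects back to the driver with collect().
to_row = lambda row: Row(**row.asDict(), **{k: float(v) for k, v in zip(labels, row.c_idx_vec.toArray())})
onehot_df = fe.rdd.map(to_row).toDF()
onehot_df.show()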