pyspark - Convert sparse vector obtained after one hot encoding into columns - pyspark

I am using apache Spark ML lib to handle categorical features using one hot encoding. After writing the below code I am getting a vector c_idx_vec as output of one hot encoding. I do understand how to interpret this output vector but I am unable to figure out how to convert this vector into columns so that I get a new transformed dataframe.Take this dataset for example:
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
By default, the OneHotEncoder will drop the last category:
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
Of course, this behavior can be changed:
>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
So, I wanted to know how to convert my c_idx_vec vector into new dataframe as below:

Here is what you can do:
>>> from pyspark.ml.feature import OneHotEncoder, StringIndexer
>>>
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
>>>
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> oe.setDropLast(False)
OneHotEncoder_49e58b281387d8dc0c6b
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
// Get c and its repective index. One hot encoder will put those on same index in vector
>>> colIdx = fl.select("c","c_idx").distinct().rdd.collectAsMap()
>>> colIdx
{'c': 2.0, 'b': 1.0, 'a': 0.0}
>>>
>>> colIdx = sorted((value, "ls_" + key) for (key, value) in colIdx.items())
>>> colIdx
[(0.0, 'ls_a'), (1.0, 'ls_b'), (2.0, 'ls_c')]
>>>
>>> newCols = list(map(lambda x: x[1], colIdx))
>>> actualCol = fl.columns
>>> actualCol
['x', 'c', 'c_idx', 'c_idx_vec']
>>> allColNames = actualCol + newCols
>>> allColNames
['x', 'c', 'c_idx', 'c_idx_vec', 'ls_a', 'ls_b', 'ls_c']
>>>
>>> def extract(row):
... return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())
...
>>> result = fl.rdd.map(extract).toDF(allColNames)
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|10.0|b |1.0 |(3,[1],[1.0])|0.0 |1.0 |0.0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0.0 |0.0 |1.0 |
+----+---+-----+-------------+----+----+----+
// Typecast new columns to int
>>> for col in newCols:
... result = result.withColumn(col, result[col].cast("int"))
...
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|10.0|b |1.0 |(3,[1],[1.0])|0 |1 |0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0 |0 |1 |
+----+---+-----+-------------+----+----+----+
Hope this helps!!

Not sure it is the most efficient or simple way, but you can do it with a udf; starting from your fl dataframe:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
try:
return float(v[i])
except ValueError:
return None
ith = udf(ith_, DoubleType())
(fl.withColumn('is_a', ith("c_idx_vec", lit(0)))
.withColumn('is_b', ith("c_idx_vec", lit(1)))
.withColumn('is_c', ith("c_idx_vec", lit(2))).show())
The result is:
+----+---+-----+-------------+----+----+----+
| x| c|c_idx| c_idx_vec|is_a|is_b|is_c|
+----+---+-----+-------------+----+----+----+
| 1.0| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
| 1.5| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
|10.0| b| 1.0|(3,[1],[1.0])| 0.0| 1.0| 0.0|
| 3.2| c| 2.0|(3,[2],[1.0])| 0.0| 0.0| 1.0|
+----+---+-----+-------------+----+----+----+
i.e. exactly as requested.
HT (and +1) to this answer that provided the udf.

Given that the situation is specified to the case that StringIndexer was used to generate the index number, and then One-hot encoding is generated using OneHotEncoderEstimator. The entire code from end to end should be like:
Generate the data and index the string values, with the StringIndexerModel object is "saved"
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>>
>>> # need to save the indexer model object for indexing label info to be used later
>>> ss_fit = ss.fit(fd)
>>> ss_fit.labels # to be used later
['a', 'b', 'c']
>>> ff = ss_fit.transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
Do one-hot encoding using OneHotEncoderEstimator class, since OneHotEncoder is deprecating
>>> oe = OneHotEncoderEstimator(inputCols=["c_idx"],outputCols=["c_idx_vec"])
>>> oe_fit = oe.fit(ff)
>>> fe = oe_fit.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
Perform one-hot binary value reshaping. The one-hot values will always be 0.0 or 1.0.
>>> from pyspark.sql.types dimport FloatType, IntegerType
>>> from pyspark.sql.functions import lit, udf
>>> ith = udf(lambda v, i: float(v[i]), FloatType())
>>> fx = fe
>>> for sidx, oe_col in zip([ss_fit], oe.getOutputCols()):
...
... # iterate over string values and ignore the last one
... for ii, val in list(enumerate(sidx.labels))[:-1]:
... fx = fx.withColumn(
... sidx.getInputCol() + '_' + val,
... ith(oe_col, lit(ii)).astype(IntegerType())
... )
>>> fx.show()
+----+---+-----+-------------+---+---+
| x| c|c_idx| c_idx_vec|c_a|c_b|
+----+---+-----+-------------+---+---+
| 1.0| a| 0.0|(2,[0],[1.0])| 1| 0|
| 1.5| a| 0.0|(2,[0],[1.0])| 1| 0|
|10.0| b| 1.0|(2,[1],[1.0])| 0| 1|
| 3.2| c| 2.0| (2,[],[])| 0| 0|
+----+---+-----+-------------+---+---+
To be noticed that Spark, by default, removes the last category. So, following the behavior, the c_c column is not necessary here.

I can't find a way to access sparse vector with data frame and i converted it to rdd.
from pyspark.sql import Row
# column names
labels = ['a', 'b', 'c']
extract_f = lambda row: Row(**row.asDict(), **dict(zip(labels, row.c_idx_vec.toArray())))
fe.rdd.map(extract_f).collect()

Related

How to add a list of data into dataschame in pyspark?

there is a dataframe contains two columns, one is key and another is value. Like below:
+--------------------+--------------------+
| key| value |
+--------------------+--------------------+
|a |abcde |
I want to slice the value into mutiple values with position and generate a new dataframe following the key. Like below:
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
|a |[a, 0] |
|a |[b, 1] |
|a |[c, 2] |
|a |[d, 3] |
|a |[e, 4] |
I have tried to use join() and StructType() but I failed. Are there any possible method to do that? THANKS!
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("a","abcd",3000)]
df = spark.createDataFrame(data,["Name","Department","Salary"])
df.select( f.expr( "posexplode( filter (split(Department,''), x -> x != '' ) ) as (value, index)") , f.col("Name")).show()
+-----+-----+----+
|value|index|Name|
+-----+-----+----+
| 0| a| a|
| 1| b| a|
| 2| c| a|
| 3| d| a|
+-----+-----+----+
# to explain the above functions more clearly.
f.expr( "..." ) #--> write a SQL expression
"posexplode( [..] )" #for an array turn them into rows with their index
"filter( [..] , x -> x != '' )" #for an array filter out ''
"split( Department, '' )" #split the column on null (extract characters) which will add null in out array that we need to filter out.
Here's the update to fit with your exact request, just a little more manipulation to put it into your required format:
df.select( f.expr( "posexplode( filter (split(Department,''), x -> x != '' ) ) as (myvalue, index)") , f.col("Name"),f.expr( "array(myvalue,index) as value ")).drop("index","myvalue").show()
+----+------+
|Name| value|
+----+------+
| a|[0, a]|
| a|[1, b]|
| a|[2, c]|
| a|[3, d]|
+----+------+
The following snippet transform your data into your specified format:
import pyspark.sql.functions as F
df = spark.createDataFrame([("a", "abcde",)], ["key", "value"])
df_split = df.withColumn("split", F.array_remove(F.split("value", ""), ""))
df_split.show()
df_exploded = df_split.select("key", F.posexplode("split"))
df_exploded.show()
df_array = df_exploded.select("key", F.array("col", "pos").alias("value"))
df_array.show()
Output:
+---+-----+---------------+
|key|value| split|
+---+-----+---------------+
| a|abcde|[a, b, c, d, e]|
+---+-----+---------------+
+---+---+---+
|key|pos|col|
+---+---+---+
| a| 0| a|
| a| 1| b|
| a| 2| c|
| a| 3| d|
| a| 4| e|
+---+---+---+
+---+------+
|key| value|
+---+------+
| a|[a, 0]|
| a|[b, 1]|
| a|[c, 2]|
| a|[d, 3]|
| a|[e, 4]|
+---+------+
First, the string is split into an array, where the split pattern is the empty string. Therefore, the last element needs to be removed.
Then each element of a the split column array is transformed to a row with its position in the array as column pos
Lastly, the columns are combined to an array.

PySpark: Pandas UDF for scipy statistical transformations

I'm trying to create a column of standardized (z-score) of a column x on a Spark dataframe, but am missing something because none of it is working.
Here's my example:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.stats import zscore
#pandas_udf('float')
def zscore_udf(x: pd.Series) -> pd.Series:
return zscore(x)
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["id","x"]
data = [("a", 81.0),
("b", 36.2),
("c", 12.0),
("d", 81.0),
("e", 36.3),
("f", 12.0),
("g", 111.7)]
df = spark.createDataFrame(data=data,schema=columns)
df.show()
df = df.withColumn('y', zscore_udf(df.x))
df.show()
Which results in obviously wrong calculations:
+---+-----+----+
| id| x| y|
+---+-----+----+
| a| 81.0|null|
| b| 36.2| 1.0|
| c| 12.0|-1.0|
| d| 81.0| 1.0|
| e| 36.3|-1.0|
| f| 12.0|-1.0|
| g|111.7| 1.0|
+---+-----+----+
Thank you for your help.
How to fix:
instead of using a UDF calculate the stddev_pop and the avg of the dataframe and calculate z-score manually.
I suggest using "window function" over the entire dataframe for the first step and then a simple arithmetic to get the z-score.
see suggested code:
from pyspark.sql.functions import avg, col, stddev_pop
from pyspark.sql.window import Window
df2 = df \
.select(
"*",
avg("x").over(Window.partitionBy()).alias("avg_x"),
stddev_pop("x").over(Window.partitionBy()).alias("stddev_x"),
) \
.withColumn("manual_z_score", (col("x") - col("avg_x")) / col("stddev_x"))
Why the UDF didn't work?
Spark is used for distributed computation. When you perform operations on a DataFrame Spark distributes the workload into partitions on the executors/workers available.
pandas_udf is not different. When running a UDF from the type pd.Series -> pd.Series some rows are sent to partition X and some to partition Y, then when zscore is run it calculates the mean and std of the data in the partition and writes the zscore based on that data only.
I'll use spark_partition_id to "prove" this.
rows a,b,c were mapped in partition 0 while d,e,f,g in partition 1. I've calculated manually the mean/stddev_pop of both the entire set and the partitioned data and then calculated the z-score. the UDF z-score was equal to the z-score of the partition.
from pyspark.sql.functions import pandas_udf, spark_partition_id, avg, stddev, col, stddev_pop
from pyspark.sql.window import Window
df2 = df \
.select(
"*",
zscore_udf(df.x).alias("z_score"),
spark_partition_id().alias("partition"),
avg("x").over(Window.partitionBy(spark_partition_id())).alias("avg_partition_x"),
stddev_pop("x").over(Window.partitionBy(spark_partition_id())).alias("stddev_partition_x"),
) \
.withColumn("partition_z_score", (col("x") - col("avg_partition_x")) / col("stddev_partition_x"))
df2.show()
+---+-----+-----------+---------+-----------------+------------------+--------------------+
| id| x| z_score|partition| avg_partition_x|stddev_partition_x| partition_z_score|
+---+-----+-----------+---------+-----------------+------------------+--------------------+
| a| 81.0| 1.327058| 0|43.06666666666666|28.584533502500186| 1.3270579815484989|
| b| 36.2|-0.24022315| 0|43.06666666666666|28.584533502500186|-0.24022314955974558|
| c| 12.0| -1.0868348| 0|43.06666666666666|28.584533502500186| -1.0868348319887526|
| d| 81.0| 0.5366879| 1| 60.25|38.663063768925504| 0.5366879387524718|
| e| 36.3|-0.61945426| 1| 60.25|38.663063768925504| -0.6194542714757446|
| f| 12.0| -1.2479612| 1| 60.25|38.663063768925504| -1.247961110593097|
| g|111.7| 1.3307275| 1| 60.25|38.663063768925504| 1.3307274433163698|
+---+-----+-----------+---------+-----------------+------------------+--------------------+
I also added df.repartition(8) prior to the calculation and managed to get similar results as in the original question.
partitions with 0 stddev --> null z score, partition with 2 rows --> (-1, 1) z scores.
+---+-----+-------+---------+---------------+------------------+-----------------+
| id| x|z_score|partition|avg_partition_x|stddev_partition_x|partition_z_score|
+---+-----+-------+---------+---------------+------------------+-----------------+
| a| 81.0| null| 0| 81.0| 0.0| null|
| d| 81.0| null| 0| 81.0| 0.0| null|
| f| 12.0| null| 1| 12.0| 0.0| null|
| b| 36.2| -1.0| 6| 73.95| 37.75| -1.0|
| g|111.7| 1.0| 6| 73.95| 37.75| 1.0|
| c| 12.0| -1.0| 7| 24.15|12.149999999999999| -1.0|
| e| 36.3| 1.0| 7| 24.15|12.149999999999999| 1.0|
+---+-----+-------+---------+---------------+------------------+-----------------+

Pyspark: How to build sum of a column(which contains negative and positive values) with stop at 0

I think a example says more then the describtion.
The right column "sum" is the one i am looking for.
enter image description here
to_count|sum
-------------
-1 |0
+1 |1
-1 |0
-1 |0
+1 |1
+1 |2
-1 |1
+1 |2
. |.
. |.
I tried to rebuild that with several groupings with comparing lead and lag but that only works for the first time the sum usually ends in a negativ value.
Summing only positive and negative values seperatly also ends in another final result.
Would be great if anyone has a good idea how to solve this in pyspark!
I would use pandas_udf here:
from pyspark.sql.functions import pandas_udf, PandasUDFType
pdf = pd.DataFrame({'g':[1]*8, 'id':range(8), 'value': [-1,1,-1,-1,1,1,-1,1]})
df = spark.createDataFrame(pdf)
df = df.withColumn('cumsum', F.lit(math.inf))
#pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def _calc_cumsum(pdf):
pdf.sort_values(by=['id'], inplace=True, ascending=True)
cumsums = []
prev = 0
for v in pdf['value'].values:
prev = max(prev + v, 0)
cumsums.append(prev)
pdf['cumsum'] = cumsums
return pdf
df = df.groupby('g').apply(_calc_cumsum)
df.show()
The results:
+---+---+-----+------+
| g| id|value|cumsum|
+---+---+-----+------+
| 1| 0| -1| 0.0|
| 1| 1| 1| 1.0|
| 1| 2| -1| 0.0|
| 1| 3| -1| 0.0|
| 1| 4| 1| 1.0|
| 1| 5| 1| 2.0|
| 1| 6| -1| 1.0|
| 1| 7| 1| 2.0|
+---+---+-----+------+
Please look at the pic first there is a testdataset(first 3 columns) and the calc steps.
The column "flag" is now in another format. We also checked our datasource and realized that we only have to handle 1 and -1 entries. We mapped 1 to 0 and -1 to 1. Now it's working like exspected as you see in the column result.
The code is this:
w1 = Window.partitionBy('group').orderBy('order')
df_0 = tst.withColumn('edge_det',F.when(((F.col('flag')==0)&((F.lag('flag',default=1).over(w1))==1)),1).otherwise(0))
df_0 = df_0.withColumn('edge_cyl',F.sum('edge_det').over(w1))
df1 = df_0.withColumn('condition', F.when(F.col('edge_cyl')==0,0).otherwise(F.when(F.col('flag')==1,-1).otherwise(1)))
df1 =df1.withColumn('cond_sum',F.sum('condition').over(w1))
cond = (F.col('cond_sum')>=0)|(F.col('condition')==1)
df2 = df1.withColumn('new_cond',F.when(cond,F.col('condition')).otherwise(0))
df3 = df2.withColumn("result",F.sum('new_cond').over(w1))

Unsure how to apply row-wise normalization on pyspark dataframe

Disclaimer: I'm a beginner when it comes to Pyspark.
For each cell in a row, I'd like to apply the following function
new_col_i = col_i / max(col_1,col_2,col_3,...,col_n)
At the very end, I'd like the range of values to go from 0.0 to 1.0.
Here are the details of my dataframe:
Dimensions: (6.5M, 2905)
Dtypes: Double
Initial DF:
+-----+-------+-------+-------+
|. id| col_1| col_2| col_n |
+-----+-------+-------+-------+
| 1| 7.5| 0.1| 2.0|
| 2| 0.3| 3.5| 10.5|
+-----+-------+-------+-------+
Updated DF:
+-----+-------+-------+-------+
|. id| col_1| col_2| col_n |
+-----+-------+-------+-------+
| 1| 1.0| 0.013| 0.26|
| 2| 0.028| 0.33| 1.0|
+-----+-------+-------+-------+
Any help would be appreciated.
You can find the maximum value from an array of columns and loop your dataframe to replace the normalized column value.
cols = df.columns[1:]
import builtins as p
df2 = df.withColumn('max', array_max(array(*[col(c) for c in cols]))) \
for c in cols:
df2 = df2.withColumn(c, col(c) / col('max'))
df2.show()
+---+-------------------+--------------------+-------------------+----+
| id| col_1| col_2| col_n| max|
+---+-------------------+--------------------+-------------------+----+
| 1| 1.0|0.013333333333333334|0.26666666666666666| 7.5|
| 2|0.02857142857142857| 0.3333333333333333| 1.0|10.5|
+---+-------------------+--------------------+-------------------+----+

transform a feature of a spark groupedBy DataFrame

I'm searching for a scala analogue of python .transform()
Namely, i need to create a new feature - a group mean of a corresponding: class
val df = Seq(
("a", 1),
("a", 3),
("b", 3),
("b", 7)
).toDF("class", "val")
+-----+---+
|class|val|
+-----+---+
| a| 1|
| a| 3|
| b| 3|
| b| 7|
+-----+---+
val grouped_df = df.groupBy('class)
Here's python implementation:
df["class_mean"] = grouped_df["class"].transform(
lambda x: x.mean())
So, the desired result:
+-----+---+----------+
|class|val|class_mean|
+-----+---+---+------+
| a| 1| 2.0|
| a| 3| 2.0|
| b| 3| 5.0|
| b| 7| 5.0|
+-----+---+----------+
You can use
df.groupBy("class").agg(mean("val").as("class_mean"))
If you can want all the columns then you can use window function
val w = Window.partitionBy("class")
df.withColumn("class_mean", mean("val").over(w))
.show(false)
Output:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
|b |3 |5.0 |
|b |7 |5.0 |
|a |1 |2.0 |
|a |3 |2.0 |
+-----+---+----------+