PySpark: string array of dynamic length in dataframe column to one-hot encoded - pyspark

I would like to convert a column which contains strings like:
["ABC","def","ghi"]
["Jkl","ABC","def"]
["Xyz","ABC"]
into an encoded column like this:
[1,1,1,0,0]
[1,1,0,1,0]
[0,1,0,0,1]
Is there a class for that in pyspark.ml.feature?
Edit: In the encoded column, the first entry always corresponds to the value "ABC", and so on; 1 means "ABC" is present, while 0 means it is not present in the corresponding row.

You can probably use CountVectorizer. Below is an example.
Update: I removed the step that dropped duplicates in the arrays; instead, you can set binary=True when setting up the CountVectorizer:
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import udf, col
df = spark.createDataFrame([
(["ABC","def","ghi"],)
, (["Jkl","ABC","def"],)
, (["Xyz","ABC"],)
], ['arr']
)
Create the CountVectorizer model:
cv = CountVectorizer(inputCol='arr', outputCol='c1', binary=True)
model = cv.fit(df)
vocabulary = model.vocabulary
# [u'ABC', u'def', u'Xyz', u'ghi', u'Jkl']
Create a UDF to convert a vector into an array:
udf_to_array = udf(lambda v: v.toArray().tolist(), 'array<double>')
Get the vector and check the content:
df1 = model.transform(df)
df1.withColumn('c2', udf_to_array('c1')) \
.select('*', *[ col('c2')[i].astype('int').alias(vocabulary[i]) for i in range(len(vocabulary))]) \
.show(3,0)
+---------------+-------------------------+-------------------------+---+---+---+---+---+
|arr |c1 |c2 |ABC|def|Xyz|ghi|Jkl|
+---------------+-------------------------+-------------------------+---+---+---+---+---+
|[ABC, def, ghi]|(5,[0,1,3],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 1.0, 0.0]|1 |1 |0 |1 |0 |
|[Jkl, ABC, def]|(5,[0,1,4],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 0.0, 1.0]|1 |1 |0 |0 |1 |
|[Xyz, ABC] |(5,[0,2],[1.0,1.0]) |[1.0, 0.0, 1.0, 0.0, 0.0]|1 |0 |1 |0 |0 |
+---------------+-------------------------+-------------------------+---+---+---+---+---+
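As a side note, on Spark 3.0+ you could likely replace the UDF with the built-in pyspark.ml.functions.vector_to_array (a sketch, assuming the same df1, vocabulary and col imports as above):
from pyspark.ml.functions import vector_to_array  # available since Spark 3.0
df1.withColumn('c2', vector_to_array(col('c1'))) \
    .select('*', *[ col('c2')[i].astype('int').alias(vocabulary[i]) for i in range(len(vocabulary))]) \
    .show(3,0)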

You will have to expand the list in the single column into n separate columns (where n is the number of items in the list). Then you can use the OneHotEncoderEstimator class to convert them into one-hot encoded features.
Please follow the example in the documentation:
from pyspark.ml.feature import OneHotEncoderEstimator
df = spark.createDataFrame([
(0.0, 1.0),
(1.0, 0.0),
(2.0, 1.0),
(0.0, 2.0),
(0.0, 1.0),
(2.0, 0.0)
], ["categoryIndex1", "categoryIndex2"])
encoder = OneHotEncoderEstimator(inputCols=["categoryIndex1", "categoryIndex2"],
outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
encoded = model.transform(df)
encoded.show()
The OneHotEncoder class has been deprecated since v2.3 because it is a stateless transformer: it is not usable on new data where the number of categories may differ from the training data.
This will help you to split the list: How to split a list to multiple columns in Pyspark?
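To get plain 0/1 columns directly (rather than the vector from the first answer), a minimal explode-and-pivot sketch, assuming the array dataframe df from the first answer with an added row id, would be:
from pyspark.sql import functions as F
# one row per array element, then pivot the elements into presence columns
df_id = df.withColumn("id", F.monotonically_increasing_id())
df_id.withColumn("val", F.explode("arr")) \
    .groupBy("id", "arr").pivot("val").count() \
    .na.fill(0).show(truncate=False)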

Related

Pivot dataframe in pyspark using column for suffix

This question is similar to one I've asked before (Pandas pivot using column as suffix), but this time I need to do it using Pyspark instead of Pandas. The problem is as follows.
I have a dataframe like the following example:
Id   | Type | Value_1 | Value_2
1234 | A    | 1       | 2
1234 | B    | 1       | 2
789  | A    | 1       | 2
789  | B    | 1       | 2
567  | A    | 1       | 2
And I want to transform it to get the following:
Id   | Value_1_A | Value_1_B | Value_2_A | Value_2_B
1234 | 1         | 1         | 2         | 2
789  | 1         | 1         | 2         | 2
567  | 1         |           | 2         |
In summary: replicate the value columns using the 'Type' column as a suffix and convert the dataframe to a wide format.
One solution I can think of is creating the columns with the suffix manually and then aggregating.
Another solution I've tried is using the pyspark GroupedData pivot function as follows:
import pandas as pd
df = spark.createDataFrame(pd.DataFrame({'Id': {0: 1234, 1: 1234, 2: 789, 3: 789, 4: 567},
'Type': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A'},
'Value_1': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'Value_2': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}}))
df.groupBy("Id").pivot("Type").avg().show()
The issue with this solution is that the resulting dataframe includes the Id column aggregated three times, and that the columns cannot be named with Type as a suffix, since they end up named like this:
['Id',
'A_avg(Id)',
'A_avg(Value_1)',
'A_avg(Value_2)',
'B_avg(Id)',
'B_avg(Value_1)',
'B_avg(Value_2)']
I also tried specifying the value columns to the pivot function as follows:
df.groupBy("Id").pivot("Type", values=["Value_1", "Value_2"]).avg().show()
This removes the extra Id columns, but the rest of the columns only have null values.
Is there any elegant way to do the transformation I'm attempting on pyspark?
Option 1:
If you don't mind having your Type values as column prefixes rather than suffixes, you can use a combination of agg, avg, and alias:
import pyspark.sql.functions as F
df_pivot = df \
    .groupBy("Id") \
    .pivot("Type") \
    .agg(F.avg("Value_1").alias("Value_1"), F.avg("Value_2").alias("Value_2"))
+----+---------+---------+---------+---------+
|Id |A_Value_1|A_Value_2|B_Value_1|B_Value_2|
+----+---------+---------+---------+---------+
|789 |1.0 |2.0 |1.0 |2.0 |
|567 |1.0 |2.0 |null |null |
|1234|1.0 |2.0 |1.0 |2.0 |
+----+---------+---------+---------+---------+
Separately, it's worth noting here that the values argument in the pivot method is used to limit which values you want to retain from your pivot (i.e., Type) column. For example, if you only wanted A and not B in your output, you would specify pivot("Type", values=["A"]).
Option 2:
If you do still want them as suffixes, you'll likely have to use some regex and withColumnRenamed, which could look something like this:
import pyspark.sql.functions as F
import re

df_pivot = df \
    .groupBy("Id") \
    .pivot("Type") \
    .agg(F.avg("Value_1"), F.avg("Value_2"))

for col in df_pivot.columns:
    if "avg(" in col:
        suffix = re.findall(r"^.*(?=_avg\()|$", col)[0]
        base_name = re.findall(r"(?<=\().*(?=\)$)|$", col)[0]
        df_pivot = df_pivot.withColumnRenamed(col, "_".join([base_name, suffix]))
+----+---------+---------+---------+---------+
|Id |Value_1_A|Value_2_A|Value_1_B|Value_2_B|
+----+---------+---------+---------+---------+
|789 |1.0 |2.0 |1.0 |2.0 |
|567 |1.0 |2.0 |null |null |
|1234|1.0 |2.0 |1.0 |2.0 |
+----+---------+---------+---------+---------+
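Alternatively, a minimal sketch that does the renaming in a single select (assuming the aliased aggregation from Option 1, so the pivoted columns look like A_Value_1):
renamed = df_pivot.select(
    "Id",
    *[F.col(c).alias(f"{c.split('_', 1)[1]}_{c.split('_', 1)[0]}")
      for c in df_pivot.columns if c != "Id"]
)
renamed.show(truncate=False)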

How do I split a column by using delimiters from another column in Spark/Scala

I have another question that is related to the split function.
I am new to Spark/Scala.
Below is the sample data frame -
+-------------------+---------+
| VALUES|Delimiter|
+-------------------+---------+
| 50000.0#0#0#| #|
| 0#1000.0#| #|
| 1$| $|
|1000.00^Test_string| ^|
+-------------------+---------+
and I want the output to be -
+-------------------+---------+----------------------+
|VALUES |Delimiter|split_values |
+-------------------+---------+----------------------+
|50000.0#0#0# |# |[50000.0, 0, 0, ] |
|0#1000.0# |# |[0, 1000.0, ] |
|1$ |$ |[1, ] |
|1000.00^Test_string|^ |[1000.00, Test_string]|
+-------------------+---------+----------------------+
I tried to split this manually -
dept.select(split(col("VALUES"),"#|#|\\$|\\^")).show()
and the output is -
+-----------------------+
|split(VALUES,#|#|\$|\^)|
+-----------------------+
| [50000.0, 0, 0, ]|
| [0, 1000.0, ]|
| [1, ]|
| [1000.00, Test_st...|
+-----------------------+
But I want the delimiter to be picked up automatically from the Delimiter column for a large dataset.
You need to use expr with split() to make the split dynamic:
import pyspark.sql.functions as F

df = spark.createDataFrame([("50000.0#0#0#","#"),("0#1000.0#","#")],["VALUES","Delimiter"])
df = df.withColumn("split", F.expr("""split(VALUES, Delimiter)"""))
df.show()
+------------+---------+-----------------+
| VALUES|Delimiter| split|
+------------+---------+-----------------+
|50000.0#0#0#| #|[50000.0, 0, 0, ]|
| 0#1000.0#| #| [0, 1000.0, ]|
+------------+---------+-----------------+
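For delimiters that are regex metacharacters (such as $ or ^), one possible variation (a sketch, assuming Spark's default SQL parser settings) is to wrap the delimiter in regex quoting inside the expression:
# \Q ... \E makes the regex engine treat the delimiter literally
df = df.withColumn("split", F.expr(r"split(VALUES, concat('\\Q', Delimiter, '\\E'))"))
df.show()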
EDIT: Please check the bottom of the answer for the Scala version.
You can use a custom user-defined function (pyspark.sql.functions.udf) to achieve this.
from typing import List
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
def split_col(value: str, delimiter: str) -> List[str]:
return str(value).split(str(delimiter))
udf_split = udf(lambda x, y: split_col(x, y), ArrayType(StringType()))
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
('50000.0#0#0#', '#'), ('0#1000.0#', '#'), ('1$', '$'), ('1000.00^Test_string', '^')
], schema='VALUES String, Delimiter String')
df = df.withColumn("split_values", udf_split(df['VALUES'], df['Delimiter']))
df.show(truncate=False)
Output
+-------------------+---------+----------------------+
|VALUES |Delimiter|split_values |
+-------------------+---------+----------------------+
|50000.0#0#0# |# |[50000.0, 0, 0, ] |
|0#1000.0# |# |[0, 1000.0, ] |
|1$ |$ |[1, ] |
|1000.00^Test_string|^ |[1000.00, Test_string]|
+-------------------+---------+----------------------+
Note that the split_values column contains a list of strings. You can also update the split_col function to apply further transformations to the values.
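For instance, a hypothetical variation that also drops empty trailing fields could look like this:
def split_col(value: str, delimiter: str) -> List[str]:
    # split literally on the delimiter, then drop empty fields
    return [part for part in str(value).split(str(delimiter)) if part != ""]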
EDIT: Scala version
import org.apache.spark.sql.functions.udf
import spark.implicits._
val data = Seq(("50000.0#0#0#", "#"), ("0#1000.0#", "#"), ("1$", "$"), ("1000.00^Test_string", "^"))
var df = data.toDF("VALUES", "Delimiter")
val udf_split_col = udf {(x:String,y:String)=> x.split(y)}
df = df.withColumn("split_values", udf_split_col(df.col("VALUES"), df.col("Delimiter")))
df.show(false)
Edit 2
To avoid the issue with special characters being interpreted as regexes, you can use a char instead of a String when calling the split() method, as follows.
val udf_split_col = udf { (x: String, y: String) => x.split(y.charAt(0)) }
This is another way of handling this, using Spark SQL:
df.createOrReplaceTempView("test")
spark.sql("""select VALUES,delimiter,split(values,case when delimiter in ("$","^") then concat("\\",delimiter) else delimiter end) as split_value from test""").show(false)
Note that I included the case when statement to add escape characters to handle the '$' and '^' cases; otherwise it doesn't split.
+-------------------+---------+----------------------+
|VALUES |delimiter|split_value |
+-------------------+---------+----------------------+
|50000.0#0#0# |# |[50000.0, 0, 0, ] |
|0#1000.0# |# |[0, 1000.0, ] |
|1$ |$ |[1, ] |
|1000.00^Test_string|^ |[1000.00, Test_string]|
+-------------------+---------+----------------------+
This is my latest solution:
import java.util.regex.Pattern
val split_udf = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))
val solution = dept.withColumn("split_values", split_udf(col("VALUES"),col("Delimiter")))
solution.show(truncate = false)
Pattern.quote ensures that special characters in the Delimiter column are treated as literals.
The other answers do not work for
("50000.0\\0\\0\\", "\\")
and linusRian's answer requires adding special characters manually.

Convert Array of String column to multiple columns in spark scala

I have a dataframe with following schema:
id : int,
emp_details: Array(String)
Some sample data:
1, Array(empname=xxx,city=yyy,zip=12345)
2, Array(empname=bbb,city=bbb,zip=22345)
This data is in a dataframe, and I need to read emp_details from the array and assign it to new columns as below, or split this array into multiple columns with the column names empname, city and zip:
.withColumn("empname", xxx)
.withColumn("city", yyy)
.withColumn("zip", 12345)
Could you please advise how we can achieve this using Spark (1.6) with Scala?
Really appreciate your help...
Thanks a lot
You can use withColumn and split to get the required data:
df1.withColumn("empname", split($"emp_details" (0), "=")(1))
.withColumn("city", split($"emp_details" (1), "=")(1))
.withColumn("zip", split($"emp_details" (2), "=")(1))
Output:
+---+----------------------------------+-------+----+-----+
|id |emp_details |empname|city|zip |
+---+----------------------------------+-------+----+-----+
|1 |[empname=xxx, city=yyy, zip=12345]|xxx |yyy |12345|
|2 |[empname=bbb, city=bbb, zip=22345]|bbb |bbb |22345|
+---+----------------------------------+-------+----+-----+
UPDATE:
If you don't have a fixed sequence of data in the array, then you can use a UDF to convert it to a map and use it as follows:
val getColumnsUDF = udf((details: Seq[String]) => {
  val detailsMap = details.map(_.split("=")).map(x => (x(0), x(1))).toMap
  (detailsMap("empname"), detailsMap("city"), detailsMap("zip"))
})
Now use the UDF:
df1.withColumn("emp",getColumnsUDF($"emp_details"))
.select($"id", $"emp._1".as("empname"), $"emp._2".as("city"), $"emp._3".as("zip"))
.show(false)
Output:
+---+-------+----+-----+
|id |empname|city|zip  |
+---+-------+----+-----+
|1  |xxx    |yyy |12345|
|2  |bbb    |bbb |22345|
+---+-------+----+-----+
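For reference, on newer Spark (2.4+) the same reshaping can be sketched in PySpark without a UDF, using the transform higher-order function and map_from_entries (the dataframe below just mirrors the sample data):
from pyspark.sql import functions as F

emp_df = spark.createDataFrame(
    [(1, ["empname=xxx", "city=yyy", "zip=12345"]),
     (2, ["empname=bbb", "city=bbb", "zip=22345"])],
    ["id", "emp_details"])

# turn each "key=value" string into a two-field struct, then into a map
kv = F.map_from_entries(
    F.expr("transform(emp_details, x -> struct(split(x, '=')[0], split(x, '=')[1]))"))

emp_df.select("id",
              kv["empname"].alias("empname"),
              kv["city"].alias("city"),
              kv["zip"].alias("zip")).show(truncate=False)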
Hope this helps!

Flip each bit in Spark dataframe calling a custom function

I have a spark Dataframe that looks like
ID |col1|col2|col3|col4.....
A |0 |1 |0 |0....
C |1 |0 |0 |0.....
E |1 |0 |1 |1......
ID is a unique key and other columns have binary values 0/1
Now, I want to iterate over each row, and if a column value is 0, I want to apply some function, passing this single row as a dataframe to that function.
For example, col1 == 0 in the above dataframe for ID A;
the resulting single-row DF should look like:
newDF.show()
ID |col1|col2|col3|col4.....
A |1 |1 |0 |0....
myfunc(newDF)
The next 0 is encountered at col3 for ID A, so the new DF looks like:
newDF.show()
ID |col1|col2|col3|col4.....
A |0 |1 |1 |0....
val max=myfunc(newDF) //function returns a double.
and so on...
Note: each 0 bit is flipped once at row level for the function call, resetting the effect of the last flipped bit.
P.S.: I tried using withColumn calling a UDF, but ran into serialization issues from having a DF inside a DF.
Actually, the myfunc I'm calling sends the row for scoring to an ML model I have, which returns the probability for that user if a particular bit is flipped. So I have to iterate through each column that is set to 0 and set it to 1 for that particular instance.
I'm not sure you need anything particularly complex for this. Given that you have imported the SQL functions and the session implicits
val spark: SparkSession = ??? // your session
import spark.implicits._
import org.apache.spark.sql.functions._
you should be able to "flip the bits" (although I'm assuming those are actually encoded as numbers) by applying the following function
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))
as in this example
df.select($"ID", flip($"col1") as "col1", flip($"col2") as "col2")
You can easily rewrite the flip function to deal with edge cases or use different type (if, for example, the "bits" are encoded with booleans or strings).
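For reference, the same idea in PySpark, applied to every column except ID (a sketch, assuming df is the analogous PySpark dataframe with numeric 0/1 columns):
from pyspark.sql import functions as F
# flip every 0/1 column except the ID column
flipped = df.select(
    "ID",
    *[F.when(F.col(c) == 1, F.lit(0)).otherwise(F.lit(1)).alias(c)
      for c in df.columns if c != "ID"]
)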

Unzip a list of tuples - PySpark

I asked the reverse question here: Create a tuple out of two columns - PySpark. What I am trying to do now is unzip a list of tuples located in a dataframe column into two different lists per row. So, based on the dataframe below, I want to turn the v_tuple column back into v1 and v2.
+---------------+---------------+--------------------+
| v1| v2| v_tuple|
+---------------+---------------+--------------------+
|[2.0, 1.0, 9.0]|[9.0, 7.0, 2.0]|[(2.0,9.0), (1.0,...|
|[4.0, 8.0, 9.0]|[1.0, 1.0, 2.0]|[(4.0,1.0), (8.0,...|
+---------------+---------------+--------------------+
Based on my previous question, I tried the following without success:
unzip_ = udf(
lambda l: list(zip(*l)),
ArrayType(ArrayType("_1", DoubleType()), ArrayType("_2", DoubleType())))
I am using pyspark 1.6
You can explode your array and then group it back again.
First let's create our dataframe:
df = spark.createDataFrame(
sc.parallelize([
[[2.0, 1.0, 9.0], [9.0, 7.0, 2.0], [(2.0,9.0), (1.0,7.), (9.,2.)]],
[[4.0, 8.0, 9.0], [1.0, 1.0, 2.0], [(4.0,1.0), (8.0,1.), (9., 2.)]]
]),
["v1", "v2", "v_tuple"]
)
Let's add a row id to identify it uniquely:
import pyspark.sql.functions as psf
df = df.withColumn("id", psf.monotonically_increasing_id())
Now, we can explode column "v_tuple" and create two columns from the two elements of the tuple:
df = df.withColumn("v_tuple", psf.explode("v_tuple")).select(
"id",
psf.col("v_tuple._1").alias("v1"),
psf.col("v_tuple._2").alias("v2")
)
+-----------+---+---+
| id| v1| v2|
+-----------+---+---+
|42949672960|2.0|9.0|
|42949672960|1.0|7.0|
|42949672960|9.0|2.0|
|94489280512|4.0|1.0|
|94489280512|8.0|1.0|
|94489280512|9.0|2.0|
+-----------+---+---+
Finally, we can group it back again:
df = df.groupBy("id").agg(
psf.collect_list("v1").alias("v1"),
psf.collect_list("v2").alias("v2")
)
+-----------+---------------+---------------+
| id| v1| v2|
+-----------+---------------+---------------+
|42949672960|[2.0, 1.0, 9.0]|[9.0, 7.0, 2.0]|
|94489280512|[4.0, 8.0, 9.0]|[1.0, 1.0, 2.0]|
+-----------+---------------+---------------+
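For reference, on Spark 2.4+ the explode/groupBy round trip could likely be avoided with the transform higher-order function (a sketch, assuming the original df from the start of this answer, where the tuples are structs with fields _1 and _2):
df.select(
    "v_tuple",
    psf.expr("transform(v_tuple, x -> x._1)").alias("v1"),
    psf.expr("transform(v_tuple, x -> x._2)").alias("v2")
).show(truncate=False)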