I have the Spark DataFrame below.
Name age subject parts
xxxx 21 Maths,Physics I
yyyy 22 English,French I,II
I am trying to explode the above dataframe on both the subject and parts columns, as below.
Expected output:
Name age subject parts
xxxx 21 Maths I
xxxx 21 Physics I
yyyy 22 English I
yyyy 22 English II
yyyy 22 French I
yyyy 22 French II
I tried using arrays_zip on subject and parts and then exploding the resulting temp column, but I am getting null values where there is only one part.
Is there a way to achieve this in PySpark?
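(For reference, the arrays_zip attempt presumably looked something like the sketch below; arrays_zip pads the shorter array with null, which is where the nulls come from.)
from pyspark.sql import functions as F

# hypothetical reconstruction of the attempt, not the original code
tmp = df.withColumn("tmp", F.arrays_zip(F.split("subject", ","), F.split("parts", ",")))
tmp = tmp.withColumn("tmp", F.explode("tmp"))
# for xxxx this yields (Maths, I) and (Physics, null), hence the nulls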
You simply need to use both split and explode:
Data Sample
df.show()
+----+---+--------------+-----+
|Name|age| subject|parts|
+----+---+--------------+-----+
|xxxx| 21| Maths,Physics| I|
|yyyy| 22|English,French| I,II|
+----+---+--------------+-----+
df.printSchema()
root
|-- Name: string (nullable = true)
|-- age: long (nullable = true)
|-- subject: string (nullable = true)
|-- parts: string (nullable = true)
Data Transformation
from pyspark.sql import functions as F
df.withColumn(
"subject", F.explode(F.split("subject", ","))
).withColumn(
"parts", F.explode(F.split("parts", ","))
).show()
+----+---+-------+-----+
|Name|age|subject|parts|
+----+---+-------+-----+
|xxxx| 21| Maths| I|
|xxxx| 21|Physics| I|
|yyyy| 22|English| I|
|yyyy| 22|English| II|
|yyyy| 22| French| I|
|yyyy| 22| French| II|
+----+---+-------+-----+
You can split them separately, then join them back together:
Split subjects
df1 = (df
.select('name', 'age', 'subject')
.withColumn('subject', F.explode(F.split('subject', ',')))
)
# +----+---+-------+
# |name|age|subject|
# +----+---+-------+
# |xxxx| 21| Maths|
# |xxxx| 21|Physics|
# |yyyy| 22|English|
# |yyyy| 22| French|
# +----+---+-------+
Split parts
df2 = (df
.select('name', 'age', 'parts')
.withColumn('parts', F.explode(F.split('parts', ',')))
)
# +----+---+-----+
# |name|age|parts|
# +----+---+-----+
# |xxxx| 21| I|
# |yyyy| 22| I|
# |yyyy| 22| II|
# +----+---+-----+
Join back
df1.join(df2, on=['name', 'age'])
# +----+---+-------+-----+
# |name|age|subject|parts|
# +----+---+-------+-----+
# |xxxx| 21| Maths| I|
# |xxxx| 21|Physics| I|
# |yyyy| 22|English| I|
# |yyyy| 22|English| II|
# |yyyy| 22| French| I|
# |yyyy| 22| French| II|
# +----+---+-------+-----+
I did this by passing the columns as a list to a for loop and exploding the dataframe for every element in the list, as in the sketch below.
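A minimal sketch of that approach in PySpark (assuming the comma-separated string columns from the question; the column list and names here are illustrative):
from pyspark.sql import functions as F

cols_to_explode = ['subject', 'parts']   # illustrative: the comma-separated string columns
result = df
for c in cols_to_explode:
    result = result.withColumn(c, F.explode(F.split(c, ',')))
result.show()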
Say, for example, I have a dataframe in the following format (in reality there are a lot more documents):
df.show()
//output
+-----+-----+-----+
|doc_0|doc_1|doc_2|
+-----+-----+-----+
| 0.0| 1.0| 0.0|
+-----+-----+-----+
| 0.0| 1.0| 0.0|
+-----+-----+-----+
| 2.0| 0.0| 1.0|
+-----+-----+-----+
// ngramShingles is a list of shingles
println(ngramShingles)
//output
List("the", "he ", "e l")
Here the length of ngramShingles is equal to the length of the dataframe's columns, i.e. one shingle per row.
How would I get to the following output?
// Desired Output
+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
| 0.0| 1.0| 0.0| "the"|
+-----+-----+-----+-------+
| 0.0| 1.0| 0.0| "he "|
+-----+-----+-----+-------+
| 2.0| 0.0| 1.0| "e l"|
+-----+-----+-----+-------+
I have tried to add a column via the following line of code:
val finalDf = df.withColumn("shingle", typedLit(ngramShingles))
But that gives me this output:
+-----+-----+-----+-----------------------+
|doc_0|doc_1|doc_2| shingle|
+-----+-----+-----+-----------------------+
| 0.0| 1.0| 0.0| ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
| 0.0| 1.0| 0.0| ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
| 2.0| 0.0| 1.0| ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
I have tried a few other solutions, but really nothing I have tried even comes close. Basically, I just want the new column to be added to each row in the DataFrame.
This question shows how to do this, but both answers rely on one column already existing. I don't think I can apply those answers to my situation, where I have thousands of columns.
You could make a dataframe from your list and then join the two dataframes together.
To do the join you'd need to add an additional column that is used as the join key (it can be dropped later):
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._  // for toDF on a local List

val listDf = List("the", "he ", "e l").toDF("shingle")
val result = df.withColumn("rn", monotonically_increasing_id())
  .join(listDf.withColumn("rn", monotonically_increasing_id()), "rn")
  .drop("rn")
Result:
+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
| 0.0| 1.0| 0.0| the|
| 0.0| 1.0| 0.0| he |
| 2.0| 0.0| 1.0| e l|
+-----+-----+-----+-------+
I have the following Spark Scala dataframe:
val someDF = Seq(
(1, "bat",1.3222),
(4, "cbat",1.40222),
(3, "horse",1.501212)
).toDF("number", "word","value")
I created a User Defined Function (UDF) to create a new variable, as follows.
Logic: if word equals "bat", then value, else zero.
import org.apache.spark.sql.functions.{col, udf}

val func1 = udf((s: String, y: Double) => if (s.contains("bat")) y else 0)
// presumably applied along these lines:
someDF.withColumn("cal_var", func1(col("word"), col("value"))).drop("value").show()
+------+-----+-------+
|number| word|cal_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat|1.40222|
| 3|horse| 0.0|
+------+-----+-------+
Here, to check equality, I used the contains function. Because of that I am getting the incorrect output.
My desired output should be like this:
+------+-----+-------+
|number| word|cal_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+-------+
Can anyone help me figure out the correct string function I should use to check equality?
Thank you
Try to avoid using UDFs, as they give poor performance.
Another approach:
val someDF = Seq(
(1, "bat",1.3222),
(4, "cbat",1.40222),
(3, "horse",1.501212)
).toDF("number", "word","value")
import org.apache.spark.sql.functions._
someDF.show
+------+-----+--------+
|number| word| value|
+------+-----+--------+
| 1| bat| 1.3222|
| 4| cbat| 1.40222|
| 3|horse|1.501212|
+------+-----+--------+
someDF.withColumn("value",when('word === "bat",'value).otherwise(0)).show()
+------+-----+------+
|number| word| value|
+------+-----+------+
| 1| bat|1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+------+
The solution is to use the equals method rather than contains. contains checks whether the string "bat" is present anywhere in the given string s, not equality. The code is shown below:
scala> someDF.show
+------+-----+--------+
|number| word| value|
+------+-----+--------+
| 1| bat| 1.3222|
| 4| cbat| 1.40222|
| 3|horse|1.501212|
+------+-----+--------+
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val func1 = udf( (s:String ,y:Double) => if(s.equals("bat")) y else 0 )
func1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,DoubleType,Some(List(StringType, DoubleType)))
scala> someDF.withColumn("col_var", func1(col("word"),col("value"))).drop("value").show
+------+-----+-------+
|number| word|col_var|
+------+-----+-------+
| 1| bat| 1.3222|
| 4| cbat| 0.0|
| 3|horse| 0.0|
+------+-----+-------+
Let me know if it helps!!
I am using Apache Spark MLlib to handle categorical features using one-hot encoding. After writing the below code I get a vector c_idx_vec as the output of the one-hot encoding. I understand how to interpret this output vector, but I cannot figure out how to convert this vector into columns so that I get a new transformed dataframe. Take this dataset for example:
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
By default, the OneHotEncoder will drop the last category:
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
Of course, this behavior can be changed:
>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
So, I wanted to know how to convert my c_idx_vec vector into new columns of the dataframe, one per category.
Here is what you can do:
>>> from pyspark.ml.feature import OneHotEncoder, StringIndexer
>>>
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
>>>
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> oe.setDropLast(False)
OneHotEncoder_49e58b281387d8dc0c6b
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
# Get c and its respective index. The one-hot encoder will put those at the same index in the vector
>>> colIdx = fl.select("c","c_idx").distinct().rdd.collectAsMap()
>>> colIdx
{'c': 2.0, 'b': 1.0, 'a': 0.0}
>>>
>>> colIdx = sorted((value, "ls_" + key) for (key, value) in colIdx.items())
>>> colIdx
[(0.0, 'ls_a'), (1.0, 'ls_b'), (2.0, 'ls_c')]
>>>
>>> newCols = list(map(lambda x: x[1], colIdx))
>>> actualCol = fl.columns
>>> actualCol
['x', 'c', 'c_idx', 'c_idx_vec']
>>> allColNames = actualCol + newCols
>>> allColNames
['x', 'c', 'c_idx', 'c_idx_vec', 'ls_a', 'ls_b', 'ls_c']
>>>
>>> def extract(row):
... return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())
...
>>> result = fl.rdd.map(extract).toDF(allColNames)
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|10.0|b |1.0 |(3,[1],[1.0])|0.0 |1.0 |0.0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0.0 |0.0 |1.0 |
+----+---+-----+-------------+----+----+----+
# Typecast the new columns to int
>>> for col in newCols:
... result = result.withColumn(col, result[col].cast("int"))
...
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|10.0|b |1.0 |(3,[1],[1.0])|0 |1 |0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0 |0 |1 |
+----+---+-----+-------------+----+----+----+
Hope this helps!!
Not sure it is the most efficient or simplest way, but you can do it with a udf, starting from your fl dataframe:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None

ith = udf(ith_, DoubleType())

(fl.withColumn('is_a', ith("c_idx_vec", lit(0)))
   .withColumn('is_b', ith("c_idx_vec", lit(1)))
   .withColumn('is_c', ith("c_idx_vec", lit(2)))
   .show())
The result is:
+----+---+-----+-------------+----+----+----+
| x| c|c_idx| c_idx_vec|is_a|is_b|is_c|
+----+---+-----+-------------+----+----+----+
| 1.0| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
| 1.5| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
|10.0| b| 1.0|(3,[1],[1.0])| 0.0| 1.0| 0.0|
| 3.2| c| 2.0|(3,[2],[1.0])| 0.0| 0.0| 1.0|
+----+---+-----+-------------+----+----+----+
i.e. exactly as requested.
HT (and +1) to this answer that provided the udf.
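If you are on Spark 3.0 or later, a similar result can be obtained without a Python udf by using vector_to_array; this is a sketch assuming the same fl dataframe and the label order a, b, c from the StringIndexer:
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

labels = ['a', 'b', 'c']   # assumed to match the StringIndexer label order
(fl.withColumn('arr', vector_to_array('c_idx_vec'))
   .select('x', 'c', 'c_idx', 'c_idx_vec',
           *[F.col('arr')[i].alias('is_' + l) for i, l in enumerate(labels)])
   .show())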
Given that StringIndexer was used to generate the index numbers and the one-hot encoding is then generated using OneHotEncoderEstimator, the entire code from end to end would look like this.
Generate the data and index the string values, saving the fitted StringIndexerModel object for later use:
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>>
>>> # need to save the indexer model object for indexing label info to be used later
>>> ss_fit = ss.fit(fd)
>>> ss_fit.labels # to be used later
['a', 'b', 'c']
>>> ff = ss_fit.transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
Do the one-hot encoding using the OneHotEncoderEstimator class, since OneHotEncoder is deprecated (in Spark 3.0 the estimator was renamed back to OneHotEncoder):
>>> oe = OneHotEncoderEstimator(inputCols=["c_idx"],outputCols=["c_idx_vec"])
>>> oe_fit = oe.fit(ff)
>>> fe = oe_fit.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
Perform one-hot binary value reshaping. The one-hot values will always be 0.0 or 1.0.
>>> from pyspark.sql.types import FloatType, IntegerType
>>> from pyspark.sql.functions import lit, udf
>>> ith = udf(lambda v, i: float(v[i]), FloatType())
>>> fx = fe
>>> for sidx, oe_col in zip([ss_fit], oe.getOutputCols()):
...
... # iterate over string values and ignore the last one
... for ii, val in list(enumerate(sidx.labels))[:-1]:
... fx = fx.withColumn(
... sidx.getInputCol() + '_' + val,
... ith(oe_col, lit(ii)).astype(IntegerType())
... )
>>> fx.show()
+----+---+-----+-------------+---+---+
| x| c|c_idx| c_idx_vec|c_a|c_b|
+----+---+-----+-------------+---+---+
| 1.0| a| 0.0|(2,[0],[1.0])| 1| 0|
| 1.5| a| 0.0|(2,[0],[1.0])| 1| 0|
|10.0| b| 1.0|(2,[1],[1.0])| 0| 1|
| 3.2| c| 2.0| (2,[],[])| 0| 0|
+----+---+-----+-------------+---+---+
Note that Spark, by default, drops the last category; following that behavior, the c_c column is not needed here.
I couldn't find a way to access the sparse vector through the DataFrame API, so I converted it to an RDD.
from pyspark.sql import Row

# names for the new columns (prefixed so they don't clash with the existing 'c' column)
labels = ['is_a', 'is_b', 'is_c']

# tolist() converts the numpy values to plain Python floats
extract_f = lambda row: Row(**row.asDict(), **dict(zip(labels, row.c_idx_vec.toArray().tolist())))
fe.rdd.map(extract_f).collect()
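If you want the result back as a DataFrame rather than a collected list of Rows, the same mapping can feed toDF() directly; this is a sketch, and the tolist() call above matters here since plain Python floats keep the schema inference happy:
fe.rdd.map(extract_f).toDF().show()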