Unzip a list of tuples - PySpark

I asked the reverse question here: Create a tuple out of two columns - PySpark. What I am trying to do now is unzip a list of tuples located in a dataframe column into two different lists per row. So, based on the dataframe below, I want to turn the v_tuple column back into v1 and v2.
+---------------+---------------+--------------------+
| v1| v2| v_tuple|
+---------------+---------------+--------------------+
|[2.0, 1.0, 9.0]|[9.0, 7.0, 2.0]|[(2.0,9.0), (1.0,...|
|[4.0, 8.0, 9.0]|[1.0, 1.0, 2.0]|[(4.0,1.0), (8.0,...|
+---------------+---------------+--------------------+
Based on my previous question, I tried the following without success:
unzip_ = udf(
    lambda l: list(zip(*l)),
    ArrayType(ArrayType("_1", DoubleType()), ArrayType("_2", DoubleType())))
I am using pyspark 1.6

You can explode your array and then group it back again:
First let's create our dataframe:
df = spark.createDataFrame(
    sc.parallelize([
        [[2.0, 1.0, 9.0], [9.0, 7.0, 2.0], [(2.0, 9.0), (1.0, 7.), (9., 2.)]],
        [[4.0, 8.0, 9.0], [1.0, 1.0, 2.0], [(4.0, 1.0), (8.0, 1.), (9., 2.)]]
    ]),
    ["v1", "v2", "v_tuple"]
)
Let's add a row id to identify it uniquely:
import pyspark.sql.functions as psf
df = df.withColumn("id", psf.monotonically_increasing_id())
Now, we can explode column "v_tuple" and create two columns from the two elements of the tuple:
df = df.withColumn("v_tuple", psf.explode("v_tuple")).select(
"id",
psf.col("v_tuple._1").alias("v1"),
psf.col("v_tuple._2").alias("v2")
)
+-----------+---+---+
| id| v1| v2|
+-----------+---+---+
|42949672960|2.0|9.0|
|42949672960|1.0|7.0|
|42949672960|9.0|2.0|
|94489280512|4.0|1.0|
|94489280512|8.0|1.0|
|94489280512|9.0|2.0|
+-----------+---+---+
Finally, we can group it back again:
df = df.groupBy("id").agg(
psf.collect_list("v1").alias("v1"),
psf.collect_list("v2").alias("v2")
)
+-----------+---------------+---------------+
| id| v1| v2|
+-----------+---------------+---------------+
|42949672960|[2.0, 1.0, 9.0]|[9.0, 7.0, 2.0]|
|94489280512|[4.0, 8.0, 9.0]|[1.0, 1.0, 2.0]|
+-----------+---------------+---------------+
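For completeness, the UDF idea from the question can also be made to work by declaring the return type as an array of arrays and then splitting that array into two columns. A minimal sketch, applied to the dataframe that still contains the v_tuple column (untested on 1.6):
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, DoubleType

# Each (v1, v2) pair arrives in the UDF as a Row, which is iterable,
# so zip(*l) transposes the pairs back into two sequences.
unzip_ = udf(lambda l: [list(x) for x in zip(*l)], ArrayType(ArrayType(DoubleType())))

unzipped = df.withColumn("unzipped", unzip_(col("v_tuple"))).select(
    col("unzipped")[0].alias("v1"),
    col("unzipped")[1].alias("v2")
)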

Related

Pyspark string array of dynamic length in dataframe column to onehot-encoded

I would like to convert a column which contains arrays of strings like:
["ABC","def","ghi"]
["Jkl","ABC","def"]
["Xyz","ABC"]
Into an encoded column like this:
[1,1,1,0,0]
[1,1,0,1,0]
[0,1,0,0,1]
Is there a class for that in pyspark.ml.feature?
Edit: In the encoded column the first entry always corresponds to the value "ABC" etc. 1 means "ABC" is present while 0 means it is not present in the corresponding row.
You can probably use CountVectorizer; below is an example.
Update: I removed the step to drop duplicates in arrays; you can instead set binary=True when setting up the CountVectorizer:
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import udf, col
df = spark.createDataFrame([
    (["ABC", "def", "ghi"],),
    (["Jkl", "ABC", "def"],),
    (["Xyz", "ABC"],)
], ['arr'])
create the CountVectorizer model:
cv = CountVectorizer(inputCol='arr', outputCol='c1', binary=True)
model = cv.fit(df)
vocabulary = model.vocabulary
# [u'ABC', u'def', u'Xyz', u'ghi', u'Jkl']
Create a UDF to convert a vector into an array:
udf_to_array = udf(lambda v: v.toArray().tolist(), 'array<double>')
Get the vector and check the content:
df1 = model.transform(df)
df1.withColumn('c2', udf_to_array('c1')) \
.select('*', *[ col('c2')[i].astype('int').alias(vocabulary[i]) for i in range(len(vocabulary))]) \
.show(3,0)
+---------------+-------------------------+-------------------------+---+---+---+---+---+
|arr |c1 |c2 |ABC|def|Xyz|ghi|Jkl|
+---------------+-------------------------+-------------------------+---+---+---+---+---+
|[ABC, def, ghi]|(5,[0,1,3],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 1.0, 0.0]|1 |1 |0 |1 |0 |
|[Jkl, ABC, def]|(5,[0,1,4],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 0.0, 1.0]|1 |1 |0 |0 |1 |
|[Xyz, ABC] |(5,[0,2],[1.0,1.0]) |[1.0, 0.0, 1.0, 0.0, 0.0]|1 |0 |1 |0 |0 |
+---------------+-------------------------+-------------------------+---+---+---+---+---+
You will have to expand the list in the single column into n separate columns (where n is the number of items in the given list). Then you can use the OneHotEncoderEstimator class to convert them into one-hot encoded features.
Please follow the example in the documentation:
from pyspark.ml.feature import OneHotEncoderEstimator
df = spark.createDataFrame([
    (0.0, 1.0),
    (1.0, 0.0),
    (2.0, 1.0),
    (0.0, 2.0),
    (0.0, 1.0),
    (2.0, 0.0)
], ["categoryIndex1", "categoryIndex2"])
encoder = OneHotEncoderEstimator(inputCols=["categoryIndex1", "categoryIndex2"],
                                 outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
encoded = model.transform(df)
encoded.show()
The OneHotEncoder class has been deprecated since v2.3 because it is a stateless transformer: it is not usable on new data where the number of categories may differ from the training data.
This will help you to split the list: How to split a list to multiple columns in Pyspark?
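For the splitting step described in that link, a minimal sketch (assuming an array column named arr and a known maximum length n; the alias names are just illustrative):
from pyspark.sql.functions import col

n = 3  # assumed maximum length of the arrays in column 'arr'
df_split = df.select("*", *[col("arr")[i].alias("arr_{}".format(i)) for i in range(n)])
df_split.show()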

Spark [Scala]: Checking if all the Rows of a smaller DataFrame exists in the bigger DataFrame

I have two DataFrames with the same schema (but over 100 columns):
Smaller one: 1,000 rows
Bigger one: 90,000 rows
How can I check that every Row of the smaller DataFrame exists in the bigger one? What is the "Spark way" of doing this? Should I use map and then deal with it at the Row level, or should I use join and then some sort of comparison with the smaller DataFrame?
You can use except, which returns all rows of the first dataset that are not present in the second
smaller.except(bigger).isEmpty()
You can inner join the DFs and count to check if there is a difference.
def isIncluded(smallDf: DataFrame, biggerDf: DataFrame): Boolean = {
  val keys = smallDf.columns.toSeq
  val joinedDf = smallDf.join(biggerDf, keys) // You might want to broadcast smallDf for performance issues
  joinedDf.count == smallDf.count
}
However, I think the except method is clearer. I'm not sure about the performance (it might just be a join underneath).
I would probably do it with a join.
The join will give you all rows that are in the small data frame but are missing in the large data frame. Then just check whether the result is empty or not.
Code:
val seq1 = Seq(
  ("A", "abc", 0.1, 0.0, 0),
  ("B", "def", 0.15, 0.5, 0),
  ("C", "ghi", 0.2, 0.2, 1),
  ("D", "jkl", 1.1, 0.1, 0),
  ("E", "mno", 0.1, 0.1, 0)
)
val seq2 = Seq(
  ("A", "abc", "a", "b", "?"),
  ("C", "ghi", "a", "c", "?")
)
val df1 = ss.sparkContext.makeRDD(seq1).toDF("cA", "cB", "cC", "cD", "cE")
val df2 = ss.sparkContext.makeRDD(seq2).toDF("cA", "cB", "cH", "cI", "cJ")
df2.join(df1, df1("cA") === df2("cA"), "leftOuter").show
Output:
+---+---+---+---+---+---+---+---+---+---+
| cA| cB| cH| cI| cJ| cA| cB| cC| cD| cE|
+---+---+---+---+---+---+---+---+---+---+
| C|ghi| a| c| ?| C|ghi|0.2|0.2| 1|
| A|abc| a| b| ?| A|abc|0.1|0.0| 0|
+---+---+---+---+---+---+---+---+---+---+
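To turn this into an actual containment check, a left anti join keeps exactly the rows of the smaller frame that have no match in the bigger one. A minimal sketch, written in PySpark syntax for illustration (small_df, big_df and keys are assumed names; keys is the list of columns to match on):
# Rows of the small frame with no match in the big frame; an empty result means full containment
keys = small_df.columns  # assuming both frames share the same schema
missing = small_df.join(big_df, on=keys, how="left_anti")
all_present = missing.rdd.isEmpty()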

Pivot and aggregate a PySpark Data Frame with alias

I have a PySpark DataFrame similar to this:
df = sc.parallelize([
    ("c1", "A", 3.4, 0.4, 3.5),
    ("c1", "B", 9.6, 0.0, 0.0),
    ("c1", "A", 2.8, 0.4, 0.3),
    ("c1", "B", 5.4, 0.2, 0.11),
    ("c2", "A", 0.0, 9.7, 0.3),
    ("c2", "B", 9.6, 8.6, 0.1),
    ("c2", "A", 7.3, 9.1, 7.0),
    ("c2", "B", 0.7, 6.4, 4.3)
]).toDF(["user_id", "type", "d1", "d2", "d3"])
df.show()
which gives:
+-------+----+---+---+----+
|user_id|type| d1| d2| d3|
+-------+----+---+---+----+
| c1| A|3.4|0.4| 3.5|
| c1| B|9.6|0.0| 0.0|
| c1| A|2.8|0.4| 0.3|
| c1| B|5.4|0.2|0.11|
| c2| A|0.0|9.7| 0.3|
| c2| B|9.6|8.6| 0.1|
| c2| A|7.3|9.1| 7.0|
| c2| B|0.7|6.4| 4.3|
+-------+----+---+---+----+
And I've pivoted it by the type column, aggregating the result with sum():
data_wide = df.groupBy('user_id')\
.pivot('type').sum()
data_wide.show()
which gives:
+-------+-----------------+------------------+-----------+------------------+-----------+------------------+
|user_id| A_sum(`d1`)| A_sum(`d2`)|A_sum(`d3`)| B_sum(`d1`)|B_sum(`d2`)| B_sum(`d3`)|
+-------+-----------------+------------------+-----------+------------------+-----------+------------------+
| c1|6.199999999999999| 0.8| 3.8| 15.0| 0.2| 0.11|
| c2| 7.3|18.799999999999997| 7.3|10.299999999999999| 15.0|4.3999999999999995|
+-------+-----------------+------------------+-----------+------------------+-----------+------------------+
Now, the resulting column names contain the ` (backtick) character, and this is a problem when, for example, feeding these new columns into a VectorAssembler, because it returns a syntax error in the attribute name. For this reason, I need to rename the columns, but calling the withColumnRenamed method inside a loop or inside a reduce(lambda...) function takes a lot of time (my df actually has 11,520 columns).
Is there any way to avoid this character in the pivot+aggregation step, or to recursively assign an alias that depends on the name of the new pivoted column?
Thank you in advance
You can do the renaming within the aggregation for the pivot using alias:
import pyspark.sql.functions as f
data_wide = df.groupBy('user_id')\
.pivot('type')\
.agg(*[f.sum(x).alias(x) for x in df.columns if x not in {"user_id", "type"}])
data_wide.show()
#+-------+-----------------+------------------+----+------------------+----+------------------+
#|user_id| A_d1| A_d2|A_d3| B_d1|B_d2| B_d3|
#+-------+-----------------+------------------+----+------------------+----+------------------+
#| c1|6.199999999999999| 0.8| 3.8| 15.0| 0.2| 0.11|
#| c2| 7.3|18.799999999999997| 7.3|10.299999999999999|15.0|4.3999999999999995|
#+-------+-----------------+------------------+----+------------------+----+------------------+
However, this is really no different than doing the pivot and renaming afterwards. Here is the execution plan for this method:
#== Physical Plan ==
#HashAggregate(keys=[user_id#0], functions=[pivotfirst(type#1, sum(`d1`) AS `d1`#169, A, B, 0, 0), pivotfirst(type#1, sum(`d2`)
#AS `d2`#170, A, B, 0, 0), pivotfirst(type#1, sum(`d3`) AS `d3`#171, A, B, 0, 0)])
#+- Exchange hashpartitioning(user_id#0, 200)
# +- HashAggregate(keys=[user_id#0], functions=[partial_pivotfirst(type#1, sum(`d1`) AS `d1`#169, A, B, 0, 0), partial_pivotfirst(type#1, sum(`d2`) AS `d2`#170, A, B, 0, 0), partial_pivotfirst(type#1, sum(`d3`) AS `d3`#171, A, B, 0, 0)])
# +- *HashAggregate(keys=[user_id#0, type#1], functions=[sum(d1#2), sum(d2#3), sum(d3#4)])
# +- Exchange hashpartitioning(user_id#0, type#1, 200)
# +- *HashAggregate(keys=[user_id#0, type#1], functions=[partial_sum(d1#2), partial_sum(d2#3), partial_sum(d3#4)])
# +- Scan ExistingRDD[user_id#0,type#1,d1#2,d2#3,d3#4]
Compare this with the method in this answer:
import re
def clean_names(df):
    p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)\)(?:\(\))?")
    return df.toDF(*[p.sub(r"\1_\3", c) for c in df.columns])
pivoted = df.groupBy('user_id').pivot('type').sum()
clean_names(pivoted).explain()
#== Physical Plan ==
#HashAggregate(keys=[user_id#0], functions=[pivotfirst(type#1, sum(`d1`)#363, A, B, 0, 0), pivotfirst(type#1, sum(`d2`)#364, A, B, 0, 0), pivotfirst(type#1, sum(`d3`)#365, A, B, 0, 0)])
#+- Exchange hashpartitioning(user_id#0, 200)
# +- HashAggregate(keys=[user_id#0], functions=[partial_pivotfirst(type#1, sum(`d1`)#363, A, B, 0, 0), partial_pivotfirst(type#1, sum(`d2`)#364, A, B, 0, 0), partial_pivotfirst(type#1, sum(`d3`)#365, A, B, 0, 0)])
# +- *HashAggregate(keys=[user_id#0, type#1], functions=[sum(d1#2), sum(d2#3), sum(d3#4)])
# +- Exchange hashpartitioning(user_id#0, type#1, 200)
# +- *HashAggregate(keys=[user_id#0, type#1], functions=[partial_sum(d1#2), partial_sum(d2#3), partial_sum(d3#4)])
# +- Scan ExistingRDD[user_id#0,type#1,d1#2,d2#3,d3#4]
You'll see that the two are practically identical. You'll likely have some minuscule speed up by avoiding the regular expression, but it will be negligible compared to the pivot.
I wrote an easy and fast function to rename PySpark pivot tables. Enjoy! :)
# This function efficiently renames pivot tables' ugly column names
def rename_pivot_cols(rename_df, remove_agg):
    """Change a Spark pivot table's default ugly column names with ease.
    Option 1: remove_agg = True: `2_sum(sum_amt)` --> `sum_amt_2`.
    Option 2: remove_agg = False: `2_sum(sum_amt)` --> `sum_sum_amt_2`
    """
    for column in rename_df.columns:
        if remove_agg == True:
            start_index = column.find('(')
            end_index = column.find(')')
            if (start_index > 0 and end_index > 0):
                rename_df = rename_df.withColumnRenamed(column, column[start_index+1:end_index] + '_' + column[:1])
        else:
            new_column = column.replace('(', '_').replace(')', '')
            rename_df = rename_df.withColumnRenamed(column, new_column[2:] + '_' + new_column[:1])
    return rename_df
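A short usage sketch, applied here to the pivoted dataframe from the earlier answer (hypothetical; note that the docstring examples assume column names without backticks):
# Hypothetical usage: bulk-rename the pivoted columns in one pass
renamed = rename_pivot_cols(pivoted, remove_agg=True)
print(renamed.columns)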

Problem in converting MS-SQL Query to spark SQL

I want to convert this basic SQL query to Spark:
select Grade, count(*) * 100.0 / sum(count(*)) over()
from StudentGrades
group by Grade
I have tried using window functions in Spark like this:
val windowSpec = Window.rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df1.select(
  $"Arrest"
).groupBy($"Arrest").agg(sum(count("*")) over windowSpec, count("*")).show()
+------+-------------------------------------------------------------------------------+--------+
|Arrest|sum(count(1)) OVER (RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|count(1)|
+------+-------------------------------------------------------------------------------+--------+
|  true|                                                                         665517|  184964|
| false|                                                                         665517|  480553|
+------+-------------------------------------------------------------------------------+--------+
But when I try dividing by count(*) it throws an error:
df1.select(
  $"Arrest"
).groupBy($"Arrest").agg(count("*") / sum(count("*")) over windowSpec, count("*")).show()
It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
My question is: in the first query I'm already using count() inside sum() and I don't receive any error about using an aggregate function inside another aggregate function, so why do I get the error in the second one?
An example:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
  ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
  ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")
val df2 = df
  .groupBy("c1")
  .agg(sum("Val1").alias("sum"))
  .withColumn("fraction", col("sum") / sum("sum").over())
df2.show
You will need to tailor this to your own situation, e.g. count instead of sum, as follows:
val df2 = df
  .groupBy("c1")
  .agg(count("*"))
  .withColumn("fraction", col("count(1)") / sum("count(1)").over())
returning:
+---+--------+-------------------+
| c1|count(1)| fraction|
+---+--------+-------------------+
| E| 1|0.16666666666666666|
| B| 1|0.16666666666666666|
| D| 1|0.16666666666666666|
| C| 1|0.16666666666666666|
| A| 2| 0.3333333333333333|
+---+--------+-------------------+
You can multiply by 100 to get a percentage. I note the alias does not seem to work as it did for the sum, so I worked around this and left the comparison above. Again, you will need to tailor this to your specifics; this is part of my general modules for research and such.
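For reference, a minimal PySpark sketch of the original percentage query, assuming a dataframe df with a Grade column (untested):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Per-grade count divided by the overall total, expressed as a percentage
result = (df.groupBy("Grade")
            .agg(F.count("*").alias("cnt"))
            .withColumn("pct", F.col("cnt") * 100.0 / F.sum("cnt").over(Window.partitionBy())))
result.show()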

Join/unfolded mapType column in spark back with the original dataframe

I have a dataframe in (py)Spark where one of the columns is of type 'map'. I want to flatten or split that column into multiple columns, which should be added to the original dataframe. I'm able to unfold the column with flatMap; however, I lose the key needed to join the new dataframe (from the unfolded column) with the original dataframe.
My schema is like this:
root
|-- key: string (nullable = true)
|-- metric: map (nullable = false)
| |-- key: string
| |-- value: float (valueContainsNull = true)
As you can see, the column 'metric' is a map-field. This is the column that I want to flatten. Before flattening it looks like:
+----+---------------------------------------------------+
|key |metric |
+----+---------------------------------------------------+
|123k|Map(metric1 -> 1.3, metric2 -> 6.3, metric3 -> 7.6)|
|d23d|Map(metric1 -> 1.5, metric2 -> 2.0, metric3 -> 2.2)|
|as3d|Map(metric1 -> 2.2, metric2 -> 4.3, metric3 -> 9.0)|
+----+---------------------------------------------------+
To convert that field to columns I do
df2.select('metric').rdd.flatMap(lambda x: x).toDF().show()
which gives
+------------------+-----------------+-----------------+
| metric1| metric2| metric3|
+------------------+-----------------+-----------------+
|1.2999999523162842|6.300000190734863|7.599999904632568|
| 1.5| 2.0|2.200000047683716|
| 2.200000047683716|4.300000190734863| 9.0|
+------------------+-----------------+-----------------+
However, I don't see the key, so I don't know how to add this data to the original dataframe.
What I want is:
+----+-------+-------+-------+
| key|metric1|metric2|metric3|
+----+-------+-------+-------+
|123k| 1.3| 6.3| 7.6|
|d23d| 1.5| 2.0| 2.2|
|as3d| 2.2| 4.3| 9.0|
+----+-------+-------+-------+
My question thus is: how can I get df2 back to df (given that I originally don't know df and only have df2)?
To make df2:
from pyspark.sql.types import StructType, StructField, StringType, FloatType

rdd = sc.parallelize([('123k', 1.3, 6.3, 7.6),
                      ('d23d', 1.5, 2.0, 2.2),
                      ('as3d', 2.2, 4.3, 9.0)
                      ])
schema = StructType([StructField('key', StringType(), True),
                     StructField('metric1', FloatType(), True),
                     StructField('metric2', FloatType(), True),
                     StructField('metric3', FloatType(), True)])
df = sqlContext.createDataFrame(rdd, schema)
from pyspark.sql.functions import lit, col, create_map
from itertools import chain
metric = create_map(list(chain(*(
    (lit(name), col(name)) for name in df.columns if "metric" in name
)))).alias("metric")
df2 = df.select("key", metric)
from pyspark.sql.functions import explode
# fetch column names of the original dataframe from keys of MapType 'metric' column
col_names = df2.select(explode("metric")).select("key").distinct().sort("key").rdd.flatMap(lambda x: x).collect()
exprs = [col("key")] + [col("metric").getItem(k).alias(k) for k in col_names]
df2_to_original_df = df2.select(*exprs)
df2_to_original_df.show()
Output is:
+----+-------+-------+-------+
| key|metric1|metric2|metric3|
+----+-------+-------+-------+
|123k| 1.3| 6.3| 7.6|
|d23d| 1.5| 2.0| 2.2|
|as3d| 2.2| 4.3| 9.0|
+----+-------+-------+-------+
I can select a certain key from a MapType column by doing:
df.select("maptypecolumn.key")
In my example I did it as follows:
columns = df2.select('metric').rdd.flatMap(lambda x: x).toDF().columns
for i in columns:
    df2 = df2.withColumn(i, lit(df2.metric[i]))
You can access key and value for example like this:
from pyspark.sql.functions import explode
df.select(explode("custom_dimensions")).select("key")