PySpark DataFrame Transformation

I have the following DataFrame:
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext('local')
df_pd = pd.DataFrame([[11, 'abc', 1, 114],
                      [11, 'abc', 2, 104],
                      [11, 'def', 9, 113],
                      [12, 'abc', 1, 14],
                      [12, 'def', 3, 110],
                      [14, 'abc', 1, 194],
                      [14, 'abc', 2, 164],
                      [14, 'abc', 3, 104]],
                     columns=['id', 'str', 'num', 'val'])
sql_sc = SQLContext(sc)
df_spark = sql_sc.createDataFrame(df_pd)
df_spark.show()
Which prints:
+---+---+---+---+
| id|str|num|val|
+---+---+---+---+
| 11|abc| 1|114|
| 11|abc| 2|104|
| 11|def| 9|113|
| 12|abc| 1| 14|
| 12|def| 3|110|
| 14|abc| 1|194|
| 14|abc| 2|164|
| 14|abc| 3|104|
+---+---+---+---+
My goal is to transform it to this:
+---+-----+-----+-----+-----+-----+
| id|abc_1|abc_2|abc_3|def_3|def_9|
+---+-----+-----+-----+-----+-----+
| 11| 114| 104| NaN| NaN| 113|
| 12| 14| NaN| NaN| 110| NaN|
| 14| 194| 164| 104| NaN| NaN|
+---+-----+-----+-----+-----+-----+
(One row per id; the column names are str+'_'+str(num); the resulting table is filled with the respective vals; all other entries are NaN.)
How would I achieve this?
I started with
from pyspark.sql.functions import concat, col, lit
column = df_spark.select(concat(col("str"), lit("_"), col("num")))
which gives me the column names.
df_spark.select('id').distinct()
gives the distinct ids.
But I fail to build the new DataFrame or fill it.
Edit: The difference to the possible duplicate is that I didn't know about the pivot functionality, whereas the other question asked where to find the "pivot" function in pyspark. I don't know whether that makes this a duplicate, but I didn't find the other question because I didn't know what to look for.

I am not sure which kind of aggregation you want to use for the val field. I used sum; here is the solution:
import pyspark.sql.functions as F
df_spark = df_spark.withColumn('col', F.concat(F.col("str"), F.lit("_"), F.col("num")))
df_spark.groupBy('id').pivot('col').agg({'val':'sum'}).orderBy('id').show()
+---+-----+-----+-----+-----+-----+
| id|abc_1|abc_2|abc_3|def_3|def_9|
+---+-----+-----+-----+-----+-----+
| 11| 114| 104| null| null| 113|
| 12| 14| null| null| 110| null|
| 14| 194| 164| 104| null| null|
+---+-----+-----+-----+-----+-----+
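If each (id, col) pair maps to a single row, as in this example, a minimal alternative sketch is to aggregate with first instead of sum and to pass the pivot values explicitly, which saves Spark an extra pass to discover them (column names assumed from the output above):
import pyspark.sql.functions as F
pivot_values = ['abc_1', 'abc_2', 'abc_3', 'def_3', 'def_9']
(df_spark
 .groupBy('id')
 .pivot('col', pivot_values)   # explicit values skip the discovery pass
 .agg(F.first('val'))          # each (id, col) pair occurs only once here
 .orderBy('id')
 .show())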

Related

Calculate number of columns with missing values per each row in PySpark

Let's say we have the following data set:
columns = ['id', 'dogs', 'cats']
values = [(1, 2, 0),(2, None, None),(3, None,9)]
df = spark.createDataFrame(values,columns)
df.show()
+----+----+----+
| id|dogs|cats|
+----+----+----+
| 1| 2| 0|
| 2|null|null|
| 3|null| 9|
+----+----+----+
I would like to calculate the number ("miss_nb") and percentage ("miss_pt") of columns with missing values per row and get the following table:
+----+-------+-------+
| id|miss_nb|miss_pt|
+----+-------+-------+
| 1| 0| 0.00|
| 2| 2| 0.67|
| 3| 1| 0.33|
+----+-------+-------+
The number of columns should not be fixed (it should work for any list of columns).
How can I do it?
Thanks!
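One possible approach, as a minimal sketch (assuming the example df above): build a 1/0 null flag per column, sum the flags across the row, and divide by the number of columns.
import pyspark.sql.functions as F
cols = df.columns  # works for any, non-fixed, list of columns
# 1 where the value is null, 0 otherwise, summed across the row
miss_nb = sum(F.when(F.col(c).isNull(), 1).otherwise(0) for c in cols)
(df
 .withColumn('miss_nb', miss_nb)
 .withColumn('miss_pt', F.round(F.col('miss_nb') / len(cols), 2))
 .select('id', 'miss_nb', 'miss_pt')
 .show())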

Explode function is increasing job time in Spark DataFrame

I have a dataframe with one column arrs holding an array of size close to 100000.
Now I need to explode this column to get one unique row per element of the array.
The explode function from spark.sql does the job but takes a lot of time.
Is there any alternative to explode that I can try to optimize the job?
dfs.printSchema()
println("Orginal DF")
dfs.show()
//Performing Explode operation
import org.apache.spark.sql.functions.{explode,col}
val opdfs=dfs.withColumn("explarrs",explode(col("arrs"))).drop("arrs")
println("Exploded DF")
opdfs.show()
The expected result is shown below; I am looking for an alternative to this code that will run the job more efficiently.
Original DF
+----+------+----+--------------------+
|col1| col2|col3| arrs|
+----+------+----+--------------------+
| A|DFtest| K|[1, 2, 3, 4, 5, 6...|
+----+------+----+--------------------+
Exploded DF
+----+------+----+--------+
|col1| col2|col3|explarrs|
+----+------+----+--------+
| A|DFtest| K| 1|
| A|DFtest| K| 2|
| A|DFtest| K| 3|
| A|DFtest| K| 4|
| A|DFtest| K| 5|
| A|DFtest| K| 6|
| A|DFtest| K| 7|
| A|DFtest| K| 8|
| A|DFtest| K| 9|
| A|DFtest| K| 10|
| A|DFtest| K| 11|
| A|DFtest| K| 12|
| A|DFtest| K| 13|
| A|DFtest| K| 14|
| A|DFtest| K| 15|
| A|DFtest| K| 16|
| A|DFtest| K| 17|
| A|DFtest| K| 18|
| A|DFtest| K| 19|
| A|DFtest| K| 20|
+----+------+----+--------+
only showing top 20 rows
You can do the same without explode by using the flatMap method on the Dataframe. For example, if you need to explode an array of integers you can proceed with something like:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val els = Seq(Row(Array(1, 2, 3)))
val df = spark.createDataFrame(spark.sparkContext.parallelize(els), StructType(Seq(StructField("data", ArrayType(IntegerType), false))))
df.show()
It gives:
+---------+
| data|
+---------+
|[1, 2, 3]|
+---------+
Using the Dataframe's flatMap:
import scala.collection.mutable
df.flatMap(row => row.getAs[mutable.WrappedArray[Int]](0)).show()
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
+-----+
The problem with this is that you need to put the right type of the array elements in the getAs function, and there is some memory overhead. As I said in my comment, there was a bug that has since been fixed: https://issues.apache.org/jira/browse/SPARK-21657
But if you can't upgrade your Spark version, you can try the code above and compare.
If you want to add the other fields to your result you could do something like:
val els = Seq(Row(Array(1, 2, 3), "data1", "data2"), Row(Array(1, 2, 3, 4, 5, 6), "data10", "data20"))
val df = spark.createDataFrame(spark.sparkContext.parallelize(els),
StructType(Seq(StructField("data", ArrayType(IntegerType), false), StructField("data1", StringType, false), StructField("data2", StringType, false))))
df.show()
df.flatMap { row =>
  val arr = row.getAs[mutable.WrappedArray[Int]](0)
  arr.map { el =>
    (row.getAs[String](1), row.getAs[String](2), el)
  }
}.show()
It gives:
+------+------+---+
| _1| _2| _3|
+------+------+---+
| data1| data2| 1|
| data1| data2| 2|
| data1| data2| 3|
|data10|data20| 1|
|data10|data20| 2|
|data10|data20| 3|
|data10|data20| 4|
|data10|data20| 5|
|data10|data20| 6|
+------+------+---+
Maybe it can help.

Count number of times array contains string per category in PySpark

I begin with the Spark DataFrame "df_spark":
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np
import pyspark.sql.functions as F
spark = SparkSession.builder.master("local").appName("Word Count").config("spark.some.config.option", "some-value").getOrCreate()
np.random.seed(0)
rows = 6
df_pandas = pd.DataFrame({'color': pd.Categorical(np.random.choice(["blue", "orange", "red"], rows)),
                          'animal': [['cat', 'dog'], ['cat', 'monkey'], ['monkey', 'cat'],
                                     ['dog', 'monkey'], ['cat', 'dog'], ['monkey', 'dog']]})
print(df_pandas)
df_spark = spark.createDataFrame(df_pandas)
df_spark.show()
I want to end up with a new Spark DataFrame "df_results_spark" that counts the occurrences of each of the strings "cat", "monkey", "dog" in the array, per color category "red", "blue", "orange".
df_results_pandas = pd.DataFrame({'color': ['red', 'blue', 'orange'],
                                  'cat': [0, 2, 2],
                                  'dog': [1, 1, 2],
                                  'monkey': [1, 1, 2]})
print(df_results_pandas)
df_results_spark = spark.createDataFrame(df_results_pandas)
df_results_spark.show()
You can use the explode() function to create one row per element in the array.
df_spark_exploded = df_spark.selectExpr("color","explode(animal) as animal")
df_spark_exploded.show()
+------+------+
| color|animal|
+------+------+
| blue| cat|
| blue| dog|
|orange| cat|
|orange|monkey|
| blue|monkey|
| blue| cat|
|orange| dog|
|orange|monkey|
|orange| cat|
|orange| dog|
| red|monkey|
| red| dog|
+------+------+
Then reshape the dataframe using pivot() and apply the count aggregate function to get the count of each animal.
df_results_spark = df_spark_exploded.groupby("color").pivot("animal").count().fillna(0)
df_results_spark.show()
+------+---+---+------+
| color|cat|dog|monkey|
+------+---+---+------+
|orange| 2| 2| 2|
| red| 0| 1| 1|
| blue| 2| 1| 1|
+------+---+---+------+
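If the set of animals is known up front, a small variant (a sketch, not required here) is to pass the values to pivot() explicitly, which keeps the column order stable and avoids the extra pass Spark otherwise needs to discover them:
animals = ["cat", "dog", "monkey"]
df_results_spark = (df_spark_exploded
                    .groupby("color")
                    .pivot("animal", animals)  # explicit values: fixed column order
                    .count()
                    .fillna(0))
df_results_spark.show()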

How to add columns in pyspark dataframe dynamically

I am trying to add a few columns based on the input variable vIssueCols:
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
from pyspark.sql.window import Window
vIssueCols=['jobid','locid']
vQuery1 = 'vSrcData2= vSrcData'
vWindow1 = Window.partitionBy("vKey").orderBy("vOrderBy")
for x in vIssueCols:
    vQuery1 = vQuery1 + '.withColumn("' + x + '_prev",F.lag(vSrcData.' + x + ').over(vWindow1))'
exec(vQuery1)
The loop above generates vQuery1 as shown below, and it works:
vSrcData2= vSrcData.withColumn("jobid_prev",F.lag(vSrcData.jobid).over(vWindow1)).withColumn("locid_prev",F.lag(vSrcData.locid).over(vWindow1))
But can't I write something like
vSrcData2 = vSrcData.withColumn(x+"_prev", F.lag(vSrcData.x).over(vWindow1)) for x in vIssueCols
and generate the columns with a loop statement? Some blogs have suggested adding a UDF and calling that, but instead of using a UDF I am using the string-and-exec approach above.
You can build your query using reduce.
from pyspark.sql.functions import lag
from pyspark.sql.window import Window
from functools import reduce
#sample data
df = sc.parallelize([[1, 200, '1234', 'asdf'],
                     [1, 50, '2345', 'qwerty'],
                     [1, 100, '4567', 'xyz'],
                     [2, 300, '123', 'prem'],
                     [2, 10, '000', 'ankur']]).\
    toDF(["vKey", "vOrderBy", "jobid", "locid"])
df.show()
vWindow1 = Window.partitionBy("vKey").orderBy("vOrderBy")
#your existing processing
df1 = df.\
    withColumn("jobid_prev", lag(df.jobid).over(vWindow1)).\
    withColumn("locid_prev", lag(df.locid).over(vWindow1))
df1.show()
#to-be processing
vIssueCols=['jobid','locid']
df2 = (reduce(
    lambda r_df, col_name: r_df.withColumn(col_name + "_prev", lag(r_df[col_name]).over(vWindow1)),
    vIssueCols,
    df
))
df2.show()
Sample data:
+----+--------+-----+------+
|vKey|vOrderBy|jobid| locid|
+----+--------+-----+------+
| 1| 200| 1234| asdf|
| 1| 50| 2345|qwerty|
| 1| 100| 4567| xyz|
| 2| 300| 123| prem|
| 2| 10| 000| ankur|
+----+--------+-----+------+
Output:
+----+--------+-----+------+----------+----------+
|vKey|vOrderBy|jobid| locid|jobid_prev|locid_prev|
+----+--------+-----+------+----------+----------+
| 1| 50| 2345|qwerty| null| null|
| 1| 100| 4567| xyz| 2345| qwerty|
| 1| 200| 1234| asdf| 4567| xyz|
| 2| 10| 000| ankur| null| null|
| 2| 300| 123| prem| 000| ankur|
+----+--------+-----+------+----------+----------+
Hope this helps!
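As a small alternative sketch (reusing lag, vWindow1, vIssueCols and df from above), a single select with a list comprehension also adds all the lagged columns in one pass, without reduce:
df3 = df.select(
    '*',
    *[lag(df[c]).over(vWindow1).alias(c + '_prev') for c in vIssueCols]  # one lagged copy per column
)
df3.show()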

How to create new DataFrame with dict

I have a dict like:
cMap = {"k1" : "v1", "k2" : "v1", "k3" : "v2", "k4" : "v2"}
and one DataFrame A, like:
+---+
|key|
+---+
| k1|
| k2|
| k3|
| k4|
+---+
I created the DataFrame above with this code:
data = [('k1',),
        ('k2',),
        ('k3',),
        ('k4',)]
A = spark.createDataFrame(data, ['key'])
I want to get the new DataFrame, like:
+---+-----+-----+
|key|   v1|   v2|
+---+-----+-----+
| k1| true|false|
| k2| true|false|
| k3|false| true|
| k4|false| true|
+---+-----+-----+
I would appreciate some suggestions, thanks!
I just wanted to contribute a different and possibly easier way to solve this.
In my code I convert a dict to a pandas dataframe, which I find much easier. Then I convert the pandas dataframe directly to spark.
import pandas as pd
data = {'visitor': ['foo', 'bar', 'jelmer'],
        'A': [0, 1, 0],
        'B': [1, 0, 1],
        'C': [1, 0, 0]}
df = pd.DataFrame(data)
ddf = spark.createDataFrame(df)
Output:
+---+---+---+-------+
| A| B| C|visitor|
+---+---+---+-------+
| 0| 1| 1| foo|
| 1| 0| 0| bar|
| 0| 1| 0| jelmer|
+---+---+---+-------+
I just wanted to add an easy way to create a DF, using pyspark:
values = [("K1","true","false"),("K2","true","false")]
columns = ['Key', 'V1', 'V2']
df = spark.createDataFrame(values, columns)
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
sc = SparkContext()
spark = SQLContext(sc)
# val1, val2, val3 are placeholder values for whatever you want to store
val_dict = {
    'key1': val1,
    'key2': val2,
    'key3': val3
}
rdd = sc.parallelize([val_dict])
bu_zdf = spark.read.json(rdd)
The dictionary can be converted to a dataframe and joined with the other one. My piece of code:
import pyspark.sql.functions as F
data = sc.parallelize([(k,) + (v,) for k, v in cMap.items()]).toDF(['key', 'val'])
keys = sc.parallelize([('k1',), ('k2',), ('k3',), ('k4',)]).toDF(["key"])
newDF = data.join(keys, 'key').select("key",
    F.when(F.col("val") == "v1", "True").otherwise("False").alias("v1"),
    F.when(F.col("val") == "v2", "True").otherwise("False").alias("v2"))
>>> newDF.show()
+---+-----+-----+
|key| v1| v2|
+---+-----+-----+
| k1| True|False|
| k2| True|False|
| k3|False| True|
| k4|False| True|
+---+-----+-----+
If there are more values, you can code that when clause as a UDF and use it.
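As a small sketch of that generalization (assuming the data and keys frames from the snippet above), you can also build one boolean column per distinct val with a list comprehension instead of a UDF:
distinct_vals = sorted(r['val'] for r in data.select('val').distinct().collect())
newDF = data.join(keys, 'key').select(
    'key',
    *[(F.col('val') == v).alias(v) for v in distinct_vals]  # true/false per value
)
newDF.show()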
Thanks everyone for the suggestions. I figured out another way to resolve my problem, using pivot; the code is:
cMap = {"k1" : "v1", "k2" : "v1", "k3" : "v2", "k4" : "v2"}
a_cMap = [(k,)+(v,) for k,v in cMap.items()]
data = spark.createDataFrame(a_cMap, ['key','val'])
from pyspark.sql.functions import count
data = data.groupBy('key').pivot('val').agg(count('val'))
data.show()
+---+----+----+
|key| v1| v2|
+---+----+----+
| k2| 1|null|
| k4|null| 1|
| k1| 1|null|
| k3|null| 1|
+---+----+----+
data = data.na.fill(0)
data.show()
+---+---+---+
|key| v1| v2|
+---+---+---+
| k2| 1| 0|
| k4| 0| 1|
| k1| 1| 0|
| k3| 0| 1|
+---+---+---+
keys = spark.createDataFrame([('k1','2'),('k2','3'),('k3','4'),('k4','5'),('k5','6')], ["key",'temp'])
newDF = keys.join(data,'key')
newDF.show()
+---+----+---+---+
|key|temp| v1| v2|
+---+----+---+---+
| k2| 3| 1| 0|
| k4| 5| 0| 1|
| k1| 2| 1| 0|
| k3| 4| 0| 1|
+---+----+---+---+
But, I can't convert 1 to true, 0 to false.
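A minimal follow-up sketch for that last step (assuming the pivoted, zero-filled data frame above): comparing the counts to zero yields boolean columns.
import pyspark.sql.functions as F
bool_df = data.select(
    'key',
    *[(F.col(c) > 0).alias(c) for c in ['v1', 'v2']]  # any nonzero count becomes true
)
bool_df.show()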
I parallelize cMap.items() and check whether the value equals v1 or v2. Then I join back to dataframe A on the key column.
from pyspark.sql import Row
# example dataframe A
df_A = spark.sparkContext.parallelize(['k1', 'k2', 'k3', 'k4']).map(lambda x: Row(**{'key': x})).toDF()
cmap_rdd = spark.sparkContext.parallelize(cMap.items())
cmap_df = cmap_rdd.map(lambda x: Row(**dict([('key', x[0]), ('v1', x[1] == 'v1'), ('v2', x[1] == 'v2')]))).toDF()
df_A.join(cmap_df, on='key').orderBy('key').show()
Output:
+---+-----+-----+
|key| v1| v2|
+---+-----+-----+
| k1| true|false|
| k2| true|false|
| k3|false| true|
| k4|false| true|
+---+-----+-----+