Apply StringIndexer to change columns in a PySpark DataFrame

I am new to PySpark. I want to apply StringIndexer to change the values of a column to indices.
I checked this post:
Apply StringIndexer to several columns in a PySpark Dataframe
That solution creates a new column rather than updating the input column. Is there a way to update the current column? I tried using the same name for the input and output columns, but it does not work.
label_stringIdx = StringIndexer(inputCol ="WindGustDir", outputCol = "WindGustDir_index")

You cannot simply update that column in place. What you can do is:
1. create a new column using the StringIndexer
2. delete the original column
3. rename the new column to the name of the original column
You can use this code:
from pyspark.ml.feature import StringIndexer
import pyspark.sql.functions as F
df = spark.createDataFrame([['a', 1], ['b', 1], ['c', 2], ['b', 5]], ['WindGustDir', 'value'])
df.show()
# +-----------+-----+
# |WindGustDir|value|
# +-----------+-----+
# |          a|    1|
# |          b|    1|
# |          c|    2|
# |          b|    5|
# +-----------+-----+
# 1. create new column
label_stringIdx = StringIndexer(inputCol ="WindGustDir", outputCol = "WindGustDir_index")
label_stringIdx_model = label_stringIdx.fit(df)
df = label_stringIdx_model.transform(df)
# 2. delete original column
df = df.drop("WindGustDir")
# 3. rename new column
to_rename = ['WindGustDir_index', 'value']
replace_with = ['WindGustDir', 'value']
mapping = dict(zip(to_rename, replace_with))
df = df.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
df.show()
# +-----------+-----+
# |WindGustDir|value|
# +-----------+-----+
# |        1.0|    1|
# |        0.0|    1|
# |        2.0|    2|
# |        0.0|    5|
# +-----------+-----+
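If you prefer to skip the alias mapping, a shorter variant of the same three steps is to chain drop and withColumnRenamed (a minimal sketch, assuming the same df and StringIndexer as above):
label_stringIdx = StringIndexer(inputCol="WindGustDir", outputCol="WindGustDir_index")
# fit, transform, drop the original column, then give the indexed column the original name
df = (label_stringIdx.fit(df)
      .transform(df)
      .drop("WindGustDir")
      .withColumnRenamed("WindGustDir_index", "WindGustDir"))
df.show()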

Related

Filter records in a dataframe based on a list of values

I have the below scenario:
li = ['g1','g2','g3']
df1:
id  name   goal
1   raj    g1
2   harsh  g3/g1
3   ramu   g1
Above you can see the dataframe df1 and the list li.
I want to filter the records in df1 based on the values in li, but the goal column first needs to be split on the / delimiter. I tried
df1 = df1.filter(~df1.goal.isin(li))
but this is not returning any records...
Is there any way to get the records?
Using this example:
from pyspark.sql import functions as F
from pyspark.sql.types import *
li = ['g1','g2','g3']
df1 = spark.createDataFrame(
    [
        ('1', 'raj', 'g1'),
        ('2', 'harsh', 'g3/g1'),
        ('3', 'ramu', 'g1'),
        ('4', 'luiz', 'g2/g4')
    ],
    ["id", "name", "goal"]
)
df1.show()
# +---+-----+-----+
# | id| name| goal|
# +---+-----+-----+
# |  1|  raj|   g1|
# |  2|harsh|g3/g1|
# |  3| ramu|   g1|
# |  4| luiz|g2/g4|
# +---+-----+-----+
You can use split to split the goal column and then array_except to find which records are not in your list:
result = df1\
    .withColumn('goal_split', F.split(F.col('goal'), "/"))\
    .withColumn('li', F.array([F.lit(x) for x in li]))\
    .withColumn("test", F.array_except('goal_split', 'li'))\
    .filter(F.col('test') == F.array([]))
result.show()
# +---+-----+-----+----------+------------+----+
# | id| name| goal|goal_split|          li|test|
# +---+-----+-----+----------+------------+----+
# |  1|  raj|   g1|      [g1]|[g1, g2, g3]|  []|
# |  2|harsh|g3/g1|  [g3, g1]|[g1, g2, g3]|  []|
# |  3| ramu|   g1|      [g1]|[g1, g2, g3]|  []|
# +---+-----+-----+----------+------------+----+
Then, select the columns you want for the result:
result.select('id', 'name', 'goal').show()
# +---+-----+-----+
# | id| name| goal|
# +---+-----+-----+
# |  1|  raj|   g1|
# |  2|harsh|g3/g1|
# |  3| ramu|   g1|
# +---+-----+-----+
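If you do not need the intermediate columns, the same logic can be written as a single filter (a sketch using the same df1 and li as above; array_except requires Spark 2.4+, as in the answer above):
# keep a row only when nothing is left after removing the allowed goals
result = df1.filter(
    F.size(
        F.array_except(
            F.split(F.col('goal'), '/'),
            F.array([F.lit(x) for x in li])
        )
    ) == 0
)
result.show()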

How to process multiple input files and save the output in respective folders with filename as folder name using pyspark rdd

I have 5 input files, say A, B, C, D, E. I want to load these files into a PySpark RDD and do some processing. Finally, I want to save the output in a folder with the corresponding filename as the folder name. Is this possible in Spark cluster mode?
As the RDD/dataframe is essentially a bunch of rows distributed across multiple partitions, you lose track of the origin of the data after it is read in from multiple sources. So my simple solution is to assign each row an additional value that tracks its origin. Using the dataframe API:
from pyspark.sql.functions import lit, col
from pyspark.sql import DataFrame
from functools import reduce
fnames = ['file_A.csv','file_B.csv','file_C.csv']
dfs = []
# 1. Read in data from individual sources and assign the filename as a string to a column
for fname in fnames:
    dfs.append(spark.read.format('csv')
               .option("header", "true")
               .load(fname)
               .withColumn('origin', lit(fname))
               )
# 2. Combine data
df = reduce(DataFrame.unionAll,dfs)
# +---+---+---+----------+
# |  A|  B|  C|    origin|
# +---+---+---+----------+
# |  1|  1|  1|file_A.csv|
# |  1|  1|  1|file_A.csv|
# |  1|  1|  1|file_A.csv|
# |  0|  0|  0|file_B.csv|
# |  0|  0|  0|file_B.csv|
# |  0|  0|  0|file_B.csv|
# |  2|  2|  2|file_C.csv|
# |  2|  2|  2|file_C.csv|
# |  2|  2|  2|file_C.csv|
# +---+---+---+----------+
# 3. Do processing
# ...
# 4. Subset the combined data by origin and write out each subset to file
for fname in fnames:
    out_fname = '_new.'.join(fname.split('.'))
    df.filter(col('origin') == fname)\
      .write.format('csv')\
      .option('header', True)\
      .save(out_fname)
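An alternative sketch, assuming the inputs are CSV files that can all be read in one pass: input_file_name() records the source path of each row at read time, and partitionBy on the write splits the output into one subfolder per origin ('output_dir' is just a placeholder path):
from pyspark.sql.functions import input_file_name

# read all files at once; input_file_name() returns the full path of the file each row came from
df = (spark.read.format('csv')
      .option('header', 'true')
      .load(fnames)
      .withColumn('origin', input_file_name()))

# ... processing ...

# one subfolder per source file, e.g. output_dir/origin=.../part-*.csv
df.write.format('csv').option('header', True).partitionBy('origin').save('output_dir')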

Merge many dataframes into one in Pyspark [non pandas df]

I will be getting dataframes generated one by one through a process. I have to merge them into one.
+--------+----------+
| Name|Age |
+--------+----------+
|Alex | 30|
+--------+----------+
+--------+----------+
| Name|Age |
+--------+----------+
|Earl | 32|
+--------+----------+
+--------+----------+
| Name|Age |
+--------+----------+
|Jane | 15|
+--------+----------+
Finally:
+--------+----------+
| Name|Age |
+--------+----------+
|Alex | 30|
+--------+----------+
|Earl | 32|
+--------+----------+
|Jane | 15|
+--------+----------+
I have tried many options like concat, merge, and append, but I guess those are all pandas functions. I am not using pandas. I am using Python 2.7 and Spark 2.2.
Edited to cover the final scenario with foreachPartition:
l = [('Alex', 30)]
k = [('Earl', 32)]
ldf = spark.createDataFrame(l, ('Name', 'Age'))
ldf = spark.createDataFrame(k, ('Name', 'Age'))
# option 1:
union_df(ldf).show()
#option 2:
uxdf = union_df(ldf)
uxdf.show()
output in both cases:
+-------+---+
| Name|Age|
+-------+---+
|Earl | 32|
+-------+---+
You can use unionAll() for dataframes:
from functools import reduce # For Python 3.x
from pyspark.sql import DataFrame
def unionAll(*dfs):
    return reduce(DataFrame.union, dfs)
df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))
unionAll(df1, df2, df3).show()
## +---+----+
## |  k|   v|
## +---+----+
## |  1|foo1|
## |  2|bar1|
## |  3|foo2|
## |  4|bar2|
## |  5|foo3|
## |  6|bar3|
## +---+----+
EDIT:
You can create an empty dataframe, and keep doing a union to it:
# Create first dataframe
ldf = spark.createDataFrame(l, ["Name", "Age"])
ldf.show()
# Save its schema
schema = ldf.schema
# Create an empty DF with the same schema (you need to provide a schema to create an empty dataframe)
empty_df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
empty_df.show()
# Union the first DF with the empty df
empty_df = empty_df.union(ldf)
empty_df.show()
# New dataframe after some operations
ldf = spark.createDataFrame(k, schema)
# Union with the empty_df again
empty_df = empty_df.union(ldf)
empty_df.show()
# First DF ldf
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
+----+---+
# Empty dataframe empty_df
+----+---+
|Name|Age|
+----+---+
+----+---+
# After first union empty_df.union(ldf)
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
+----+---+
# After second union with new ldf
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
|Earl| 32|
+----+---+
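If the dataframes really do arrive one by one, another option (a sketch, not from the answers above) is to collect them in a list as they are produced and union them once at the end, which avoids creating an empty dataframe and its schema up front; make_next_df() is a hypothetical stand-in for whatever process generates each dataframe:
from functools import reduce
from pyspark.sql import DataFrame

dfs = []
for i in range(3):
    dfs.append(make_next_df(i))  # make_next_df() is hypothetical

merged = reduce(DataFrame.union, dfs)
merged.show()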

PySpark: select a specific column by its position

I would like to know how to select a specific column in a dataframe by its position rather than by its name.
Like this in pandas:
df = df.iloc[:, 2]
Is this possible?
You can always get the name of the column with df.columns[n] and then select it:
df = spark.createDataFrame([[1,2], [3,4]], ['a', 'b'])
To select column at position n:
n = 1
df.select(df.columns[n]).show()
+---+
|  b|
+---+
|  2|
|  4|
+---+
To select all but column n:
n = 1
You can either use drop:
df.drop(df.columns[n]).show()
+---+
|  a|
+---+
|  1|
|  3|
+---+
Or select with manually constructed column names:
df.select(df.columns[:n] + df.columns[n+1:]).show()
+---+
|  a|
+---+
|  1|
|  3|
+---+
Same solution as mirkhosro:
For a dataframe df, you can select the column at position n using df[n], where n is the index of the column.
Example:
df = df.filter(df[3]!=0)
will remove the rows of df where the value in the fourth column is 0.
Note that you can check the columns using df.printSchema()
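The same trick extends to picking several columns by position at once; a minimal sketch with the df from the first answer:
positions = [0, 1]  # column indices to keep
df.select([df.columns[i] for i in positions]).show()
# +---+---+
# |  a|  b|
# +---+---+
# |  1|  2|
# |  3|  4|
# +---+---+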

Adding historical path feature to a PySpark dataframe

I have a column 'Event' in my original dataframe, and I want to add the other 2 columns:
+-----+---------+----------+
|Event|Event_lag|Hist_event|
+-----+---------+----------+
|    0|        N|         N|
|    0|        0|        N0|
|    1|        0|       N00|
|    0|        1|      N001|
+-----+---------+----------+
from pyspark.sql.functions import lag, col, monotonically_increasing_id, collect_list, concat_ws
from pyspark.sql import Window
#sample data
df= sc.parallelize([[0], [0], [1], [0]]).toDF(["Event"])
#add row index to the dataframe
df = df.withColumn("row_idx", monotonically_increasing_id())
w = Window.orderBy("row_idx")
#add 'Event_Lag' column to the dataframe
df = df.withColumn("Event_Lag", lag(col('Event').cast('string')).over(w))
df = df.fillna({'Event_Lag':'N'})
#finally add 'Hist_Event' column to the dataframe and remove row index column (i.e. 'row_idx') to have the final result
df = df.withColumn("Hist_Event", collect_list(col('Event_Lag')).over(w)).\
withColumn("Hist_Event", concat_ws("","Hist_Event")).\
drop("row_idx")
df.show()
Sample input:
+-----+
|Event|
+-----+
|    0|
|    0|
|    1|
|    0|
+-----+
Output is:
+-----+---------+----------+
|Event|Event_Lag|Hist_Event|
+-----+---------+----------+
|    0|        N|         N|
|    0|        0|        N0|
|    1|        0|       N00|
|    0|        1|      N001|
+-----+---------+----------+
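One caveat worth noting (not part of the original answer): a window that is ordered but has no partitionBy moves all rows into a single partition, so Spark will warn about possible performance degradation on large data. If the data has a natural grouping column, adding it to the window keeps the computation distributed; 'group_col' below is a hypothetical column name:
# hypothetical: compute the running history per group instead of over the whole dataframe
w = Window.partitionBy("group_col").orderBy("row_idx")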