np.where logic in pyspark dataframe - pyspark

I'm looking for a way to get character after 2nd place from a string in a dataframe column only if the length of the character is > 2 and place it into another column else null. I have several other columns in the spark dataframe
I have a Spark dataframe that looks like this:
animal
======
mo
cat
mouse
snake
reptiles
I want something like this:
remainder
========
null
t
use
ake
ptiles
I can do it using np.where in pandas dataframe like below
import numpy as np
df['remainder'] = np.where(len(df['animal]) > 2, df['animal].str[2:], 'null)
How do I do the same in pyspark dataframe

You can easily do this with a combination of when-otherwise with substring
Data Preparation
s = StringIO("""
animal
mo
cat
mouse
snake
reptiles
""")
df = pd.read_csv(s,delimiter=',')
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+--------+
| animal|
+--------+
| mo|
| cat|
| mouse|
| snake|
|reptiles|
+--------+
When-Otherwise - Substring
sparkDF = sparkDF.withColumn('animal_length',F.length(F.col('animal'))) \
.withColumn('remainder',F.when(F.col('animal_length') > 2
,F.substring(F.col('animal'),2,1000)
).otherwise(None)
) \
.drop('animal_length')
sparkDF.show()
+--------+---------+
| animal|remainder|
+--------+---------+
| mo| null|
| cat| at|
| mouse| ouse|
| snake| nake|
|reptiles| eptiles|
+--------+---------+

Related

assign values to a new column depending on old column values in dataframe

I have assigned values to 4 variables in a conf or application.properties file,
A = 1
B = 2
C = 3
D = 4
I have a dataframe as follows,
+-----+
|name |
+-----+
| A |
| C |
| B |
| D |
| B |
+-----+
I want to add a new column that has the values assigned from the conf variables declared above for A,B,C,D respectively depending on the value in the name column.
Final Dataframe should have,
+----+----------+
|name|NAME_VALUE|
+----+----------+
| A | 1 |
| C | 3 |
| B | 2 |
| D | 4 |
| B | 2 |
+----+----------+
I tried lit function in .WITHCOLUMN with conf.getint($name), not accepting Column in lit func requires string, I have to hardcode the variable names in lit. Is there anyway for me to dynamically assign those respective conf variable names in LIT so it can automatically assign values to another column in spark scala?
For this moment i dont have any ideas how to do it as you intended with dynamic usage of vals names.
My proposition is to use a seq of tuples instead of multiple vals, in such case you can create some udf and try to map this value for each row, but you can also use join which i am showing in below example:
val data = Seq(("A"),("C"), ("B"), ("D"), ("B"))
val df = data.toDF("name")
val mappings = Seq(("A",1), ("B",2), ("C",3), ("D",4))
val mappingsDf = mappings.toDF("name", "value")
df.join(broadcast(mappingsDf), df("name") === mappingsDf("name"), "left")
.select(
df("name"),
mappingsDf("value")
).show
output is as expected:
+----+-----+
|name|value|
+----+-----+
| A| 1|
| C| 3|
| B| 2|
| D| 4|
| B| 2|
+----+-----+
This solution is pretty generic as your mapping are df here so you can hardcode them as showed in my example or load them from some csv or json easily with spark api
Due to broadcast join it should be quite efficient (you should remove this hint if you want to use big amount of mappings!)
I think its easy to understand and maintain as its not udf but only Spark api

How to select elements of a column of a dataframe with respect to a column of the another dataframe?

How I can use two dataframes, and select elements of df2, if a column in df1 is included in a column in df2and NA otherwise.
df2:
name
summer
winter
water
play
df1:
col1
play ground
winter cold
something
work
output:
col1 name
play ground play
winter cold winter
something NA
work NA
#Create match column
df1 = df1.alias('df1').withColumn('col_new',explode(split('col1','\s')))
new = (df1.join(df2, how='left',on=df1.col_new==df2.name)#merge on common columns
.drop('col_new')#drop the match column introduced
.orderBy([df2.name.desc(),'name'])#Order the df
.drop_duplicates(['col1'])#eliminate duplicates
).show()
+-----------+------+
| col1| name|
+-----------+------+
|play ground| play|
| something| null|
|winter cold|winter|
| work| null|
+-----------+------+
It is recommended to use the contains condition directly to join.
df = df1.join(df2, on=[df1.col1.contains(df2.name)], how='left')
df.show(truncate=False)
df1 = spark.createDataFrame([("play ground",),("winter cold",),("something",),("work",)], ['col1',])
df2 = spark.createDataFrame([("summer",),("winter",),("play bc",),("play",)], ['name',])
df1 = df1.withColumn('common_word', explode(split(col('col1'), '\s')))
# Also split & explode Column 'name' of df2.
df2 = df2.withColumn('common_word', explode(split(col('name'), '\s')))
(
df1
.join(df2, ['common_word'], "left")
.sort('col1')
.fillna("NA")
.show()
)
+-----------+-----------+-------+
|common_word| col1| name|
+-----------+-----------+-------+
| ground|play ground| NA|
| play|play ground|play bc|
| play|play ground| play|
| something| something| NA|
| cold|winter cold| NA|
| winter|winter cold| winter|
| work| work| NA|
+-----------+-----------+-------+

transform distinct row values to different columns with corresponding rows using Pyspark

I'm new to Pyspark and trying to transform data
Given dataframe
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:
A B C
id1a id1b id1c
id2a id2b id2c
id3a id3b id3b
id4a null null
I have tried pivot, but that gives first value.
There might be a better way , however an approach is splitting the column on spaces to create array of the entries and then using higher order functions(spark 2.4+) to split on the '=' for each entry in the splitted array .Then explode and create 2 columns one with the id and one with the value. Then we can assign a row number to each partition and groupby then pivot:
import pyspark.sql.functions as F
df1 = (df.withColumn("Col1",F.split(F.col("Col1"),"\s+")).withColumn("Col1",
F.explode(F.expr("transform(Col1,x->split(x,'='))")))
.select(F.col("Col1")[0].alias("cols"),F.col("Col1")[1].alias("vals")))
from pyspark.sql import Window
w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum",F.row_number().over(w)).groupBy("Rnum")
.pivot("cols").agg(F.first("vals")).orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum| A| B| C| D|
+----+----+----+----+----+
| 1|id1a|id1b|id1c|id1d|
| 2|id2a|id2b|id2c|null|
| 3|id3a|id3b|id3c|null|
| 4|id4a|null|null|null|
+----+----+----+----+----+
this is how df1 looks like after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
| A|id1a|
| A|id2a|
| B|id1b|
| C|id1c|
| B|id2b|
| D|id1d|
| A|id3a|
| B|id3b|
| C|id2c|
| A|id4a|
| C|id3c|
+----+----+
May be I don't know the full picture, but the data format seems to be strange. If nothing can be done at the data source, then some collects, pivots and joins will be needed. Try this.
import pyspark.sql.functions as F
test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b',1),('D=id1d A=id3a B=id3b C=id2c',2),('A=id4a C=id3c',3)],schema=['col1','id'])
tst_spl = test.withColumn("item",(F.split('col1'," ")))
tst_xpl = tst_spl.select(F.explode("item"))
tst_map = tst_xpl.withColumn("key",F.split('col','=')[0]).withColumn("value",F.split('col','=')[1]).drop('col')
#%%
tst_pivot = tst_map.groupby(F.lit(1)).pivot('key').agg(F.collect_list(('value'))).drop('1')
#%%
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col',coln) for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1,df2:df1.join(df2,on='pos',how='full'),tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos| A| B| C| D|
+---+----+----+----+----+
| 0|id3a|id3b|id1c|id1d|
| 1|id4a|id1b|id2c|null|
| 2|id1a|id2b|id3c|null|
| 3|id2a|null|null|null|
+---+----+----+----+----

Split Data-frame Column on Hyphen Delimiter in PySpark

I am having trouble splitting my data-frame column into two rows based on a hyphen delimiter.
from pyspark.mllib.linalg.distributed import IndexedRow
rows = sc.parallelize([['14-banana'], ['12-cheese'], ['13-olives'], ['11-almonds']])
rows_df = rows.toDF(["ID"])
rows_df.show()
+----------+
| ID|
+----------+
| 14-banana|
| 12-cheese|
| 13-olives|
|11-almonds|
+----------+
So I want two columns one for ID in numeric and one for the food type as a string.
You are looking for the split function. Please find example below:
import pyspark.sql.functions as F
rows = sc.parallelize([['14-banana'], ['12-cheese'], ['13-olives'], ['11-almonds']])
rows_df = rows.toDF(["ID"])
split = F.split(rows_df.ID, '-')
rows_df = rows_df.withColumn('number', split.getItem(0))
rows_df = rows_df.withColumn('fruit', split.getItem(1))
rows_df.show()
Output:
+----------+------+-------+
| ID|number| fruit|
+----------+------+-------+
| 14-banana| 14| banana|
| 12-cheese| 12| cheese|
| 13-olives| 13| olives|
|11-almonds| 11|almonds|
+----------+------+-------+

List to DataFrame in pyspark

Can someone tell me how to convert a list containing strings to a Dataframe in pyspark. I am using python 3.6 with spark 2.2.1. I am just started learning spark environment and my data looks like below
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, i want to create a Dataframe as follows
---------------------------------
|ID | words |
---------------------------------
1 | ['apple','ball','ballon'] |
2 | ['cat','camel','james'] |
I even want to add ID column which is not associated in the data
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
Try this -
data_array = []
for i in range (0,len(my_data)) :
data_array.extend([(i, my_data[i])])
df = spark.createDataframe(data = data_array, schema = ["ID", "words"])
df.show()
Try this -- the simplest approach
from pyspark.sql import *
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple Approach:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
spark.sparkContext.parallelize(my_data).zipWithIndex() \
toDF(["id", "words"]).show(truncate=False)
+---------------------+-----+
|id |words|
+---------------------+-----+
|[apple, ball, ballon]|0 |
|[cat, camel, james] |1 |
|[none, focus, cake] |2 |
+---------------------+-----+