I'm practicing using spark.sql() functions for PySpark. When I use the not-equal operators in Spark SQL (<>, !=, NOT), I can't seem to build more complex queries.
Sample data and query:
+--------------------+-------------+--------------------+
| Party| Handle| Tweet|
+--------------------+-------------+--------------------+
| Democrat|RepDarrenSoto|Today, Senate Dem...|
| Democrat|RepDarrenSoto|RT #WinterHavenSu...|
| Democrat|RepDarrenSoto|RT #NBCLatino: .#...|
|Congress has allo...| null| null|
| Democrat|RepDarrenSoto|RT #NALCABPolicy:...|
| Democrat|RepDarrenSoto|RT #Vegalteno: Hu...|
| Democrat|RepDarrenSoto|RT #EmgageActionF...|
| Democrat|RepDarrenSoto|Hurricane Maria l...|
| Democrat|RepDarrenSoto|RT #Tharryry: I a...|
| Democrat|RepDarrenSoto|RT #HispanicCaucu...|
| Democrat|RepDarrenSoto|RT #RepStephMurph...|
| Democrat|RepDarrenSoto|RT #AllSaints_FL:...|
| Democrat|RepDarrenSoto|.#realDonaldTrump...|
| Democrat|RepDarrenSoto|Thank you to my m...|
| Democrat|RepDarrenSoto|We paid our respe...|
|Sgt Sam Howard - ...| null| null|
| Democrat|RepDarrenSoto|RT #WinterHavenSu...|
| Democrat|RepDarrenSoto|Meet 12 incredibl...|
| Democrat|RepDarrenSoto|RT #wildlifeactio...|
| Democrat|RepDarrenSoto|RT #CHeathWFTV: K...|
+--------------------+-------------+--------------------+
spark.sql("""select Party from tweets_tempview where Party <>'Democrat' or 'Republican' """).show(20,False)
Error Message:
"cannot resolve '((NOT (tweets_tempview.`Party` = 'Democrat')) OR 'Republican')' due to data type mismatch: differing types in '((NOT (tweets_tempview.`Party` = 'Democrat')) OR 'Republican')' (boolean and string).; line 1 pos 40;\n'Project ['Party]\n+- 'Filter (NOT (Party#98 = Democrat) || Republican)\n +- SubqueryAlias `tweets_tempview`\n +- Relation[Party#98,Handle#99,Tweet#100] csv\n"
What is the correct Spark SQL syntax to make the WHERE clause check both values?
You can't compare a column against two strings with a single <> operation. Either use:
where Party <> 'Democrat' and Party <> 'Republican'
Or use this, as suggested in the comments:
where Party not in ('Democrat', 'Republican')
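For example, a quick sketch using the view and column names from the question (both queries are equivalent here):
spark.sql("""select Party from tweets_tempview where Party <> 'Democrat' and Party <> 'Republican' """).show(20, False)
spark.sql("""select Party from tweets_tempview where Party not in ('Democrat', 'Republican') """).show(20, False)
Note that, as in standard SQL, rows where Party is NULL are excluded by both forms, since comparisons with NULL never evaluate to true.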
Related
I have the following PySpark DataFrame where each column represents a time series and I'd like to study their distance to the mean.
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| 1 | 2 | ... | 2 |
| -1 | 5 | ... | 4 |
+----+----+-----+---------+
This is what I'm hoping to get:
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| -1 | 0 | ... | 2 |
| -5 | 1 | ... | 4 |
+----+----+-----+---------+
Up until now, I've tried naively running a UDF on individual columns, but it takes progressively longer per column (30s, 50s, 80s, ...), so I'm probably doing something wrong.
cols = ["T1", "T2", ...]
for c in cols:
    df = df.withColumn(c, df[c] - df["Average"])
Is there a better way to do this kind of transformation, i.e. combining one column with many others?
Using the underlying RDD, it can be done this way:
+---+---+-------+
|T1 |T2 |Average|
+---+---+-------+
|1 |2 |2 |
|-1 |5 |4 |
+---+---+-------+
# subtract the last field (Average) from every other field, keeping Average itself
df.rdd.map(lambda r: (*[r[i] - r[-1] for i in range(0, len(r) - 1)], r[-1])) \
    .toDF(df.columns).show()
+---+---+-------+
| T1| T2|Average|
+---+---+-------+
| -1| 0| 2|
| -5| 1| 4|
+---+---+-------+
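As a side note, a single select with a list comprehension stays in the DataFrame API and avoids both the RDD round-trip and a growing chain of withColumn calls. This is only a sketch, assuming Average is the column being subtracted from all the others:
from pyspark.sql import functions as F

# rebuild every series column as (column - Average), keeping Average as-is
other_cols = [c for c in df.columns if c != "Average"]
df.select(*[(F.col(c) - F.col("Average")).alias(c) for c in other_cols],
          F.col("Average")).show()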
I am a newbie to PySpark.
I am reading the data from a table and updating the same table. I have a requirement where I have to search for a small string in two columns and, if it is found, write a value into a new column.
Logic is like this:
IF
(Terminal_Region is not NULL & Terminal_Region contains "WC") OR
(Terminal_Footprint is not NULL & Terminal_Footprint contains "WC")
THEN REGION = "EOR"
ELSE
REGION ="WOR"
If both of those fields have NULL, then REGION = 'NotMapped'
I need to create a new REGION column in the DataFrame using PySpark. Can somebody help me?
+-------------------+-------------------+----------+
|Terminal_Region    |Terminal_footprint |REGION    |
+-------------------+-------------------+----------+
|west street WC     |                   |EOR       |
|WC 87650           |                   |EOR       |
|BOULVEVARD WC      |                   |EOR       |
|                   |                   |Not Mapped|
|                   |landinf dr WC      |EOR       |
|                   |FOX VALLEY WC 76543|EOR       |
+-------------------+-------------------+----------+
I think the following code should create your desired output. It should work with Spark 2.2+, which includes the contains function.
from pyspark.sql.functions import *
df = spark.createDataFrame([("west street WC",None),\
("WC 87650",None),\
("BOULVEVARD WC",None),\
(None,None),\
(None,"landinf dr WC"),\
(None,"FOX VALLEY WC 76543")],\
["Terminal_Region","Terminal_footprint"]) #Creating Dataframe
df.show() #print initial df
df.withColumn("REGION", when( col("Terminal_Region").isNull() & col("Terminal_footprint").isNull(), "NotMapped").\ #check if both are Null
otherwise(when((col("Terminal_Region").contains("WC")) | ( col("Terminal_footprint").contains("WC")), "EOR").otherwise("WOR"))).show() #otherwise search for "WC"
Output:
#initial dataframe
+---------------+-------------------+
|Terminal_Region| Terminal_footprint|
+---------------+-------------------+
| west street WC| null|
| WC 87650| null|
| BOULVEVARD WC| null|
| null| null|
| null| landinf dr WC|
| null|FOX VALLEY WC 76543|
+---------------+-------------------+
# df with the logic applied
+---------------+-------------------+---------+
|Terminal_Region| Terminal_footprint| REGION|
+---------------+-------------------+---------+
| west street WC| null| EOR|
| WC 87650| null| EOR|
| BOULVEVARD WC| null| EOR|
| null| null|NotMapped|
| null| landinf dr WC| EOR|
| null|FOX VALLEY WC 76543| EOR|
+---------------+-------------------+---------+
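The same logic can also be expressed as a single SQL CASE expression via expr, which some people find easier to read; this is just a sketch equivalent to the when/otherwise chain above:
from pyspark.sql.functions import expr

df.withColumn("REGION", expr("""
    CASE
        WHEN Terminal_Region IS NULL AND Terminal_footprint IS NULL THEN 'NotMapped'
        WHEN Terminal_Region LIKE '%WC%' OR Terminal_footprint LIKE '%WC%' THEN 'EOR'
        ELSE 'WOR'
    END
""")).show()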
I currently have a dataset grouped into hourly increments by a variable "aggregator". There are gaps in this hourly data, and what I would ideally like to do is forward-fill the missing rows using the prior row for the same aggregator.
I've seen some solutions to similar problems using pandas, but ideally I would like to understand how best to approach this with a PySpark UDF.
I'd initially thought about something like the following with pandas, but I also struggled to implement this even as a first pass that just fills while ignoring the aggregator:
df = df.set_index(keys=[df.timestamp]).resample('1H', fill_method='ffill')
But ideally I'd like to avoid using pandas.
In the example below I have two missing rows of hourly data (labeled as MISSING).
| timestamp | aggregator |
|----------------------|------------|
| 2018-12-27T09:00:00Z | A |
| 2018-12-27T10:00:00Z | A |
| MISSING | MISSING |
| 2018-12-27T12:00:00Z | A |
| 2018-12-27T13:00:00Z | A |
| 2018-12-27T09:00:00Z | B |
| 2018-12-27T10:00:00Z | B |
| 2018-12-27T11:00:00Z | B |
| MISSING | MISSING |
| 2018-12-27T13:00:00Z | B |
| 2018-12-27T14:00:00Z | B |
The expected output here would be the following:
| timestamp | aggregator |
|----------------------|------------|
| 2018-12-27T09:00:00Z | A |
| 2018-12-27T10:00:00Z | A |
| 2018-12-27T11:00:00Z | A |
| 2018-12-27T12:00:00Z | A |
| 2018-12-27T13:00:00Z | A |
| 2018-12-27T09:00:00Z | B |
| 2018-12-27T10:00:00Z | B |
| 2018-12-27T11:00:00Z | B |
| 2018-12-27T12:00:00Z | B |
| 2018-12-27T13:00:00Z | B |
| 2018-12-27T14:00:00Z | B |
Appreciate the help.
Thanks.
Here is a solution to fill the missing hours, using a window, lag, and a UDF. With a little modification it can be extended to cover missing days as well.
from pyspark.sql.window import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *
from dateutil.relativedelta import relativedelta
def missing_hours(t1, t2):
    # generate the hourly timestamps strictly between t2 (the previous row) and t1 (the current row)
    return [t1 + relativedelta(hours=-x) for x in range(1, t1.hour - t2.hour)]
missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))
df = spark.read.csv('dates.csv',header=True,inferSchema=True)
window = Window.partitionBy("aggregator").orderBy("timestamp")
df_missing = df.withColumn("prev_timestamp", lag(col("timestamp"), 1, None).over(window)) \
    .filter(col("prev_timestamp").isNotNull()) \
    .withColumn("timestamp", explode(missing_hours_udf(col("timestamp"), col("prev_timestamp")))) \
    .drop("prev_timestamp")
df.union(df_missing).orderBy("aggregator", "timestamp").show()
which results in:
+-------------------+----------+
| timestamp|aggregator|
+-------------------+----------+
|2018-12-27 09:00:00| A|
|2018-12-27 10:00:00| A|
|2018-12-27 11:00:00| A|
|2018-12-27 12:00:00| A|
|2018-12-27 13:00:00| A|
|2018-12-27 09:00:00| B|
|2018-12-27 10:00:00| B|
|2018-12-27 11:00:00| B|
|2018-12-27 12:00:00| B|
|2018-12-27 13:00:00| B|
|2018-12-27 14:00:00| B|
+-------------------+----------+
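On Spark 2.4 or later (an assumption about your environment), the UDF can be replaced by the built-in sequence function, which generates the missing timestamps natively. A rough sketch of that variant:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, expr, explode, lag

window = Window.partitionBy("aggregator").orderBy("timestamp")

# rows that sit right after a gap, expanded into the hours that are missing before them
gaps = df.withColumn("prev_timestamp", lag("timestamp", 1).over(window)) \
    .filter(expr("prev_timestamp is not null and timestamp > prev_timestamp + interval 1 hour")) \
    .withColumn("timestamp",
                explode(expr("sequence(prev_timestamp + interval 1 hour, "
                             "timestamp - interval 1 hour, interval 1 hour)"))) \
    .drop("prev_timestamp")

df.union(gaps).orderBy("aggregator", "timestamp").show()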
Given a DataFrame df, when I do df.select(df['category_id'] + 1000), I get results:
>>> df.select(df['category_id']).limit(3).show()
+-----------+
|category_id|
+-----------+
| 1|
| 2|
| 3|
+-----------+
>>> df.select(df['category_id']+1000).limit(3).show()
+--------------------+
|(category_id + 1000)|
+--------------------+
| 1001|
| 1002|
| 1003|
+--------------------+
However, when I do df.select(df['category_name'] + 'blah'), I get null:
>>> df.select(df['category_name']).limit(3).show()
+-------------------+
| category_name|
+-------------------+
| Football|
| Soccer|
|Baseball & Softball|
+-------------------+
>>> df.select(df['category_name']+'blah').limit(3).show()
+----------------------+
|(category_name + blah)|
+----------------------+
| null|
| null|
| null|
+----------------------+
Just wondering, what makes one work but not the other? What am I missing?
Unlike Python, the + operator is not defined as string concatenation in Spark (and SQL doesn't do this either). Spark treats + as numeric addition, implicitly casting both string operands to a numeric type; the cast fails and the result is null. For string concatenation Spark has concat/concat_ws instead.
import pyspark.sql.functions as f
df.select(f.concat(df.category_name, f.lit('blah')).alias('category_name')).show(truncate=False)
#+-----------------------+
#|category_name |
#+-----------------------+
#|Footballblah |
#|Soccerblah |
#|Baseball & Softballblah|
#+-----------------------+
df.select(f.concat_ws(' ', df.category_name, f.lit('blah')).alias('category_name')).show(truncate=False)
#+------------------------+
#|category_name |
#+------------------------+
#|Football blah |
#|Soccer blah |
#|Baseball & Softball blah|
#+------------------------+
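The same thing works in plain SQL if the DataFrame is registered as a temp view (categories_view below is a hypothetical name, just for illustration):
df.createOrReplaceTempView("categories_view")
spark.sql("select concat(category_name, 'blah') as category_name from categories_view").show(truncate=False)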
I have an input dataframe, as below, with id, app, and customer columns.
Input dataframe
+--------------------+-----+---------+
| id|app |customer |
+--------------------+-----+---------+
|id1 | fw| WM |
|id1 | fw| CS |
|id2 | fw| CS |
|id1 | fe| WM |
|id3 | bc| TR |
|id3 | bc| WM |
+--------------------+-----+---------+
Expected output
Using pivot and aggregate: make the app values the column names and put the aggregated customer names as a list in the dataframe.
Expected dataframe
+--------------------+----------+-------+----------+
| id| bc | fe| fw |
+--------------------+----------+-------+----------+
|id1 | 0 | WM| [WM,CS]|
|id2 | 0 | 0| [CS] |
|id3 | [TR,WM] | 0| 0 |
+--------------------+----------+-------+----------+
What have I tried?
val newDF =
df.groupBy("id").pivot("app").agg(expr("coalesce(first(customer),0)")).drop("app").show()
+--------------------+-----+-------+------+
| id|bc | fe| fw|
+--------------------+-----+-------+------+
|id1 | 0 | WM| WM|
|id2 | 0 | 0| CS|
|id3 | TR | 0| 0|
+--------------------+-----+-------+------+
Issue: In my query, I am not able to get the list of customers like [WM,CS] for "id1" under "fw" (as shown in the expected output); only "WM" is coming. Similarly, for "id3" only "TR" is appearing, whereas a list with the value [TR,WM] should appear under "bc" for "id3".
I need your suggestions on how to get the list of customers under each app.
You can use collect_list if you can live with an empty list in the cells where you want a zero:
df.groupBy("id").pivot("app").agg(collect_list("customer")).show
+---+--------+----+--------+
| id| bc| fe| fw|
+---+--------+----+--------+
|id3|[TR, WM]| []| []|
|id1| []|[WM]|[CS, WM]|
|id2| []| []| [CS]|
+---+--------+----+--------+
Using concat_ws we can flatten each array into a comma-separated string, removing the square brackets:
df.groupBy("id").pivot("app").agg(concat_ws(",",collect_list("customer")))