Getting Error In The when/otherwise clause - pyspark

I am trying to create a new column in a dataframe based on three other columns. Below is the code I have written for it:
dataFrame.withColumn('net_inventory_qty',
                     when((dataFrame.raw_wip_fg_indicator == 'RAW MATERIALS') |
                          (dataFrame.raw_wip_fg_indicator == 'WIP') |
                          (dataFrame.raw_wip_fg_indicator == 'FINISHED GOODS'),
                          dataFrame.total_stock_qty + dataFrame.sit_qty)
                     .otherwise(dataFrame.sit_qty))
But when I run the Glue job it throws this error:
pyspark.sql.utils.AnalysisException: u"cannot resolve '(`total_stock_qty` + `sit_qty`)' due to data type mismatch: differing types in '(`total_stock_qty` + `sit_qty`)' (struct<double:double,string:string> and double)
What am I missing? Any suggestion will help.

Check your schema: according to the error message, I can guess the types of these two columns:
total_stock_qty: struct<double:double,string:string>
sit_qty: double
You can call printSchema() or show() to check the data first.
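If that is the case, one possible fix is to pull the numeric field out of the struct before adding. A minimal sketch, assuming the struct's double field is the quantity you actually want (the field names here are only inferred from the error message):
from pyspark.sql.functions import col, when

dataFrame.printSchema()  # confirm the real layout of total_stock_qty first

# isin() is just a shorter form of the three chained OR conditions.
dataFrame = dataFrame.withColumn(
    'net_inventory_qty',
    when(dataFrame.raw_wip_fg_indicator.isin('RAW MATERIALS', 'WIP', 'FINISHED GOODS'),
         col('total_stock_qty.double') + col('sit_qty'))
    .otherwise(col('sit_qty')))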

Related

pyspark add int column to a fixed date

I have a fixed date "2000/01/01" and a dataframe:
data1 = [{'index':1,'offset':50}]
data_p = sc.parallelize(data1)
df = spark.createDataFrame(data_p)
I want to create a new column by adding the offset column to this fixed date.
I tried different methods, but I cannot pass the column in, and expr raises errors such as:
function is neither a registered temporary function nor a permanent function registered in the database 'default'
The only solution I can think of is
df = df.withColumn("zero",lit(datetime.strptime('2000/01/01', '%Y/%m/%d')))
df.withColumn("date_offset",expr("date_add(zero,offset)")).drop("zero")
Since I cannot use lit and datetime.strptime inside expr, I have to use this approach, which creates a redundant column and redundant operations.
Any better way to do it?
Since you have tagged this as a PySpark question, in Python you can do the following:
df_a3.withColumn("date_offset",F.lit("2000-01-01").cast("date") + F.col("offset").cast("int")).show()
Edit: as per the comment below, let's assume there is an extra column called type; based on it, the code below can be used:
df_a3.withColumn("date_offset",F.expr("case when type ='month' then add_months(cast('2000-01-01' as date),offset) else date_add(cast('2000-01-01' as date),cast(offset as int)) end ")).show()

Trying to join tables and getting "Resolved attribute(s) columnName#17 missing from ..."

I'm trying to join two tables and getting a frustrating series of errors:
If I try this:
pop_table = mtrips.join(trips, (mtrips["DOLocationID"] == trips["PULocationID"]))
Then I get this error:
Resolved attribute(s) PULocationID#17 missing from PULocationID#2508,
If I try this:
pop_table = mtrips.join(trips, (col("DOLocationID") == col("PULocationID")))
I get this error:
"Reference 'DOLocationID' is ambiguous, could be: DOLocationID, DOLocationID.;"
If I try this:
pop_table = mtrips.join(trips, col("mtrips.DOLocationID") == col("trips.PULocationID"))
I get this error:
"cannot resolve '`mtrips.DOLocationID`' given input columns: [DOLocationID]
When I search on SO for these errors it seems like every post is telling me to try something that I've already tried and isn't working.
I don't know where to go from here. Help appreciated!
It looks like this problem: there is some ambiguity in the column names.
Are you deriving one of the dataframes from another one? In that case, use withColumnRenamed() to rename the 'join' columns in the second dataframe before you do the join operation.
This makes it pretty evident that the issue is with the column names in the two dataframes.
1. When all the columns in the two dataframes are different, except for the join key, which has the same name in both, use this:
**`df = df.join(df_right, 'join_col_which_is_same_in_both_df', 'left')`**
2. When the join column has a different name in each dataframe. This join keeps both columns, i.e. col1 and col2, in the joined dataframe:
**`df = df.join(df_right, df.col1 == df_right.col2, 'left')`**
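The question does not show how mtrips and trips were built, but "Resolved attribute(s) ... missing" usually shows up when one dataframe is derived from the other, so both sides carry the same attribute IDs. Two common workarounds, sketched here with the column names from the question, are renaming the join key on one side or joining through aliases:
from pyspark.sql.functions import col

# Option 1: rename the join key on one side before joining.
trips_renamed = trips.withColumnRenamed("PULocationID", "trips_PULocationID")
pop_table = mtrips.join(trips_renamed,
                        mtrips["DOLocationID"] == trips_renamed["trips_PULocationID"])

# Option 2: alias both sides and qualify the columns explicitly.
pop_table = (mtrips.alias("m")
             .join(trips.alias("t"), col("m.DOLocationID") == col("t.PULocationID")))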

Sum() and count() are not working together with agg in pyspark2

I am using the JSON data file "order_items" and the data looks like this:
{"order_item_id":1,"order_item_order_id":1,"order_item_product_id":957,"order_item_quantity":1,"order_item_subtotal":299.98,"order_item_product_price":299.98}
{"order_item_id":2,"order_item_order_id":2,"order_item_product_id":1073,"order_item_quantity":1,"order_item_subtotal":199.99,"order_item_product_price":199.99}
{"order_item_id":3,"order_item_order_id":2,"order_item_product_id":502,"order_item_quantity":5,"order_item_subtotal":250.0,"order_item_product_price":50.0}
{"order_item_id":4,"order_item_order_id":2,"order_item_product_id":403,"order_item_quantity":1,"order_item_subtotal":129.99,"order_item_product_price":129.99}
orders = spark.read.json("/user/data/retail_db_json/order_items")
I am getting an error while running the following command:
orders.where("order_item_order_id in( 2,4,5,6,7,8,9,10) ").groupby(“order_item_order_id”).agg(sum(“order_item_subtotal”),count()).orderBy(“order_item_order_id”).show()
TypeError: unsupported operand type(s) for +: ‘int’ and 'str’
I am not sure why I am getting this... all column values are strings. Any suggestions?
Cast the column to a numeric type; you can't apply aggregation methods to string types.
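A minimal sketch of what that can look like, assuming pyspark.sql.functions is imported as F. Using F.sum and F.count explicitly also avoids accidentally calling Python's built-in sum, which raises exactly this 'int' + 'str' TypeError when handed a column name string:
from pyspark.sql import functions as F

orders = spark.read.json("/user/data/retail_db_json/order_items")

(orders
 .where("order_item_order_id in (2,4,5,6,7,8,9,10)")
 .groupBy("order_item_order_id")
 .agg(F.sum(F.col("order_item_subtotal").cast("double")).alias("order_total"),
      F.count("*").alias("item_count"))
 .orderBy("order_item_order_id")
 .show())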

How to apply nltk.pos_tag on pyspark dataframe

I'm trying to apply POS tagging to one of my tokenized columns, called "removed", in a PySpark dataframe.
I'm trying with
nltk.pos_tag(df_removed.select("removed"))
But all I get is a ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
How can I make it work?
It seems the answer is in the error message: the input of pos_tag should be a plain list of tokens (or a string), while you are passing it a whole Spark Column. You should apply pos_tag to each row of your column by wrapping it in a UDF and using withColumn.
For example, you could start by writing:
my_new_df = df_removed.withColumn("removed", pos_tag_udf(df_removed.removed))
where pos_tag_udf is a UDF wrapping nltk.pos_tag (a full sketch follows below).
You can also do it on the RDD side:
my_new_df = df_removed.select("removed").rdd.map(lambda x: (nltk.pos_tag(x[0]),)).toDF()
Here you have the documentation.
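For completeness, a self-contained sketch of the UDF variant mentioned above; the UDF name and the return schema are my own choices, and nltk's tagger model (averaged_perceptron_tagger) has to be available on every executor:
import nltk
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Each row of "removed" is an array of tokens; return an array of (word, tag) structs.
tag_schema = ArrayType(StructType([
    StructField("word", StringType()),
    StructField("tag", StringType()),
]))

pos_tag_udf = udf(lambda tokens: nltk.pos_tag(tokens) if tokens else [], tag_schema)

my_new_df = df_removed.withColumn("removed", pos_tag_udf("removed"))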

In Spark I am not able to filter by existing column

I am trying to filter by one of the columns in the dataframe using Spark, but Spark throws the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'Inv. Pty' given input columns: [Pstng Date, Name 1, Inv. Pty, Year]
invDF.filter(col("Inv. Pty") === "2001075").show()
Try this with backticks (`):
invDF.filter(col("`Inv. Pty`") === "2001075").show()
The issue is that Spark treats a column name containing a dot as a reference to a field of a struct column.
To counter that, you need to wrap the name in backticks "`". This should work:
invDF.filter(col("`Inv. Pty`") === "2001075").show()
Not sure, but in given input columns: [Pstng Date, Name 1, Inv. Pty, Year] the Inv. Pty column seems to have an extra space; that might be the problem.