I am trying to create a new column in a dataframe based on the data in three other columns. I have written the code below for this:
dataFrame.withColumn('net_inventory_qty', when((dataFrame.raw_wip_fg_indicator =='RAW MATERIALS') |
(dataFrame.raw_wip_fg_indicator =='WIP') |
(dataFrame.raw_wip_fg_indicator =='FINISHED GOODS'), dataFrame.total_stock_qty+dataFrame.sit_qty).
otherwise(dataFrame.sit_qty))
But when I run the Glue job it throws this error:
pyspark.sql.utils.AnalysisException: u"cannot resolve '(`total_stock_qty` + `sit_qty`)' due to data type mismatch: differing types in '(`total_stock_qty` + `sit_qty`)' (struct<double:double,string:string> and double)
What am I missing? Any suggestions would help.
Check your schema. From the error message, I can guess the types of the two columns involved:
total_stock_qty: struct<double:double,string:string>
sit_qty: double
You are adding a struct column to a double column, which is why the addition fails. Run printSchema() or show() on the dataframe first to check the data.
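If total_stock_qty really is a struct with a numeric field named double (which is what the error message suggests, but please confirm with printSchema()), a minimal sketch of pulling that field out before the addition could look like this:
from pyspark.sql.functions import col, when

# Assumption: the struct's numeric value sits in a field literally named "double",
# as hinted by struct<double:double,string:string> in the error message.
dataFrame = dataFrame.withColumn(
    "net_inventory_qty",
    when(
        col("raw_wip_fg_indicator").isin("RAW MATERIALS", "WIP", "FINISHED GOODS"),
        col("total_stock_qty").getField("double") + col("sit_qty"),
    ).otherwise(col("sit_qty")),
)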
I have a fixed date "2000/01/01" and a dataframe:
data1 = [{'index':1,'offset':50}]
data_p = sc.parallelize(data1)
df = spark.createDataFrame(data_p)
I want to create a new column by adding the offset column to this fixed date
I tried different methods, but I cannot pass the column in, and expr fails with the error:
function is neither a registered temporary function nor a permanent function registered in the database 'default'
The only solution I can think of is
df = df.withColumn("zero",lit(datetime.strptime('2000/01/01', '%Y/%m/%d')))
df.withColumn("date_offset",expr("date_add(zero,offset)")).drop("zero")
Since I cannot use lit and datetime.strptime in the expr, I have to use this approach which creates a redundant column and redundant operations.
Any better way to do it?
Since you have tagged this as a PySpark question, in Python you can do the following:
df_a3.withColumn("date_offset",F.lit("2000-01-01").cast("date") + F.col("offset").cast("int")).show()
Edit: as per the comment below, let's assume there is an extra column called type; based on it, the code below can be used:
df_a3.withColumn("date_offset",F.expr("case when type ='month' then add_months(cast('2000-01-01' as date),offset) else date_add(cast('2000-01-01' as date),cast(offset as int)) end ")).show()
I'm trying to join two tables and getting a frustrating series of errors:
If I try this:
pop_table = mtrips.join(trips, (mtrips["DOLocationID"] == trips["PULocationID"]))
Then I get this error:
Resolved attribute(s) PULocationID#17 missing from PULocationID#2508,
If I try this:
pop_table = mtrips.join(trips, (col("DOLocationID") == col("PULocationID")))
I get this error:
"Reference 'DOLocationID' is ambiguous, could be: DOLocationID, DOLocationID.;"
If I try this:
pop_table = mtrips.join(trips, col("mtrips.DOLocationID") == col("trips.PULocationID"))
I get this error:
"cannot resolve '`mtrips.DOLocationID`' given input columns: [DOLocationID]
When I search on SO for these errors it seems like every post is telling me to try something that I've already tried and isn't working.
I don't know where to go from here. Help appreciated!
This looks like an ambiguity problem with the column names.
Are you deriving one of the dataframes from another one? In that case, use withColumnRenamed() to rename the 'join' columns in the second dataframe before you do the join operation.
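For example, a small sketch of that rename-then-join approach (the new name PULocationID_r is just an arbitrary choice for illustration):
# Rename the key on the right-hand side so the two lineage-related
# dataframes no longer expose the same attribute
trips_r = trips.withColumnRenamed("PULocationID", "PULocationID_r")
pop_table = mtrips.join(trips_r, mtrips["DOLocationID"] == trips_r["PULocationID_r"])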
It is pretty evident that the issue is with the column names in the two dataframes.
1. When all the columns in the two dataframes are different, except that the join key column has the same name in both, use this:
**`df = df.join(df_right, 'join_col_which_is_same_in_both_df', 'left')`**
2. When the join column has a different name in each dataframe - this join keeps both columns, i.e. col1 and col2, in the joined df (see the drop sketch after this list):
**`df = df.join(df_right, df.col1 == df_right.col2, 'left')`**
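In case 2, if you only want to keep one of the two key columns, a small optional sketch drops the right-hand one after the join:
# Drop the duplicated key from the right dataframe once the join is done
df = df.join(df_right, df.col1 == df_right.col2, 'left').drop(df_right.col2)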
I am using the JSON data file "order_items" and the data looks like:
{"order_item_id":1,"order_item_order_id":1,"order_item_product_id":957,"order_item_quantity":1,"order_item_subtotal":299.98,"order_item_product_price":299.98}
{"order_item_id":2,"order_item_order_id":2,"order_item_product_id":1073,"order_item_quantity":1,"order_item_subtotal":199.99,"order_item_product_price":199.99}
{"order_item_id":3,"order_item_order_id":2,"order_item_product_id":502,"order_item_quantity":5,"order_item_subtotal":250.0,"order_item_product_price":50.0}
{"order_item_id":4,"order_item_order_id":2,"order_item_product_id":403,"order_item_quantity":1,"order_item_subtotal":129.99,"order_item_product_price":129.99}
orders = spark.read.json("/user/data/retail_db_json/order_items")
I am getting an error while running the following command:
orders.where("order_item_order_id in( 2,4,5,6,7,8,9,10) ").groupby(“order_item_order_id”).agg(sum(“order_item_subtotal”),count()).orderBy(“order_item_order_id”).show()
TypeError: unsupported operand type(s) for +: ‘int’ and 'str’
I am not sure why I am getting this... All column values are strings. Any suggestions?
Cast the column to a numeric type first. You can't apply numeric aggregation methods on string types.
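A hedged sketch of that fix, assuming orders was read from the JSON files exactly as in the question (the alias names order_total and item_count are arbitrary):
from pyspark.sql import functions as F

# Cast the string column before aggregating, and use F.sum / F.count
# explicitly so Python's builtin sum is not picked up by mistake
orders.where("order_item_order_id in (2,4,5,6,7,8,9,10)") \
    .groupBy("order_item_order_id") \
    .agg(F.sum(F.col("order_item_subtotal").cast("double")).alias("order_total"),
         F.count("*").alias("item_count")) \
    .orderBy("order_item_order_id") \
    .show()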
I'm trying to apply POS tagging to one of my tokenized columns, called "removed", in a PySpark dataframe.
I'm trying with
nltk.pos_tag(df_removed.select("removed"))
But all I get is a ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
How can I make it work?
It seems the answer is in the error message: the input of pos_tag should be a string, but you are providing a column. You should apply pos_tag to each row of your column, using withColumn.
For example, you could start by writing:
my_new_df = df_removed.withColumn("removed", nltk.pos_tag(df_removed.removed))
You can also do:
my_new_df = df_removed.select("removed").rdd.map(lambda x: nltk.pos_tag(x)).toDF()
Here you have the documentation.
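Note that passing a Column straight into nltk.pos_tag will still raise an error, so a common workaround is to wrap it in a UDF. A minimal sketch, assuming the removed column holds an array of token strings and that nltk plus its tagger data are installed on every executor (the output column name tagged is arbitrary):
import nltk
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructType, StructField

# pos_tag returns (word, tag) pairs, modeled here as an array of structs
pos_tag_udf = udf(
    lambda tokens: nltk.pos_tag(tokens) if tokens else [],
    ArrayType(StructType([
        StructField("word", StringType()),
        StructField("tag", StringType()),
    ])),
)

my_new_df = df_removed.withColumn("tagged", pos_tag_udf("removed"))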
I am trying to filter by one of the columns in the dataframe using Spark, but Spark throws the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'Inv. Pty' given input columns: [Pstng Date, Name 1, Inv. Pty, Year]
invDF.filter(col("Inv. Pty") === "2001075").show()
Try this with backticks (`):
invDF.filter(col("`Inv. Pty`") === "2001075").show()
The issue is that Spark treats a column name containing a dot as a struct column reference.
To counter that, you need to use a backtick "`". This should work:
invDF.filter(col("`Inv. Pty`") === "2001075").show()
Not sure, but in the given input columns [Pstng Date, Name 1, Inv. Pty, Year] the column name Inv. Pty contains a space; that might be the problem.