My main goal in this code is to keep all of the columns of the DataFrame after a groupBy, but after the groupBy only the grouping columns come through. So I tried using a join to merge the grouped DataFrame back with the original DataFrame, but it is throwing an error.
Code:
from pyspark.sql.types import IntegerType
from pyspark.sql.types import *
from pyspark.sql.functions import *

retail_sales_transaction = glueContext.create_dynamic_frame.from_catalog(
    database="conform_main_mobconv",
    table_name="retail_sales_transaction"
)
# Convert the Glue DynamicFrame to a Spark DataFrame before selecting columns.
df_retail_sales_transaction = retail_sales_transaction.toDF().select(
    "transaction_id", "transaction_key", "transaction_timestamp", "personnel_key",
    "retail_site_id", "personnel_role", "country_code", "business_week"
)
df_retail_sales_transaction = df_retail_sales_transaction.groupBy("business_week", "location_id", "country_code", "personnel_key")
df_retail_sales_transaction = df_retail_sales_transaction.join(df_retail_sales_transaction, ["business_week"], "outer")
The error I'm getting is:
df_retail_sales_transaction = df_retail_sales_transaction.join(df_retail_sales_transaction,['business_week'],'outer')
AttributeError: 'GroupedData' object has no attribute 'join'
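For reference, the usual way to keep all columns is to aggregate first and then join the aggregated result back to the original DataFrame on the grouping keys. A minimal sketch of that pattern (the count() aggregation is an assumption, and retail_site_id is used here because location_id is not among the selected columns):

# Aggregate first (assumed metric), then join back on the grouping keys.
df_grouped = (df_retail_sales_transaction
    .groupBy("business_week", "retail_site_id", "country_code", "personnel_key")
    .agg(count("transaction_id").alias("transaction_count")))

# The join brings the aggregate back alongside all original columns.
df_all_columns = df_retail_sales_transaction.join(
    df_grouped,
    on=["business_week", "retail_site_id", "country_code", "personnel_key"],
    how="left")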
I received help with the following PySpark to prevent errors when doing a merge in Databricks; see the related question "Databricks Error: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table conflicting way". I was wondering if I could get help modifying the code so that it also drops NULLs.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
df2 = partdf.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
df3 = df2.filter("rn = 1").drop("rn")
Thanks
The code that you are using does not completely delete the rows where P_key is null: row_number() is also applied within the null partition, so the null row that gets row number 1 is never filtered out. You can instead use df.na.drop to get the required result.
df.na.drop(subset=["P_key"]).show(truncate=False)
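For illustration, a tiny hypothetical example (column names follow the question) showing why one null row still gets row number 1 and survives the original rn = 1 filter:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Hypothetical data: two rows share P_key 'a', two rows have a null P_key.
demo = spark.createDataFrame([('1', 'a'), ('2', 'a'), ('3', None), ('4', None)], ['Id', 'P_key'])
demo = demo.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
demo.filter("rn = 1").show()  # a row with a null P_key still has rn = 1 and is kept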
To make your original approach work, you can do the following: add a row with the least possible unique id value, store that id in a variable, use the same code, and add an additional condition in the filter, as shown below.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,when,col
df = spark.read.option("header",True).csv("dbfs:/FileStore/sample1.csv")
#adding row with least possible id value.
dup_id = '0'
new_row = spark.createDataFrame([[dup_id,'','x','x']], schema = ['id','P_key','c1','c2'])
#replacing empty string with null for P_Key
new_row = new_row.withColumn('P_key',when(col('P_key')=='',None).otherwise(col('P_key')))
df = df.union(new_row) #row added
#code to remove duplicates
df2 = df.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("id")))
df2.show(truncate=False)
#additional condition to remove added id row.
df3 = df2.filter((df2.rn == 1) & (df2.id != dup_id)).drop("rn")
df3.show()
I'm getting the following error: Invalid argument, not a string or column: 1586906.0151878505 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
The code goes something like this:
from pyspark.sql.types import *
population= int(df.count())
income_sum= int(df.agg(_sum('income_job1')).collect()[0][0])
Tot_mean_income = round(income_sum/population,2)
I think you have a conflict between the round function in the pyspark.sql.functions package and the regular Python round function.
Try importing the functions module with an alias so that there is no conflict with round.
#import functions with alias
from pyspark.sql import functions as F
from pyspark.sql.types import *
df=spark.createDataFrame([(1,'a'),(2,'b')],['id','name'])
population=int(df.count())
income_sum=int(df.agg(F.sum(F.col("id"))).collect()[0][0])
Tot_mean_income = round(income_sum/population,2)
#1.5
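With the alias in place, the built-in round handles plain Python numbers while F.round handles columns, so the two no longer collide. A small illustrative example (the derived column name is made up):

# Column-level rounding uses the aliased F.round; scalar rounding uses Python's round.
df.withColumn("half_id_rounded", F.round(F.col("id") / 2, 2)).show()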
I am trying to convert PySpark code to Spark Scala and I am facing the error below.
PySpark code:
import pyspark.sql.functions as fn

valid_data = (bcd_df.filter(fn.lower(bcd_df.table_name) == tbl_nme)
              .select("valid_data").rdd
              .map(lambda x: x[0])
              .collect()[0])
From the bcd_df DataFrame I take the table_name column, match its value against the tbl_nme argument that I am passing in, and then select the data in the valid_data column.
Here is the code in Spark Scala:
val valid_data = bcd_df.filter(col(table_name) === tbl_nme).select(col("valid_data")).rdd.map(x => x(0)).collect()(0)
The error is as below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`abcd`' given input columns:
Not sure why it is treating abcd as a column.
Any help is appreciated.
Versions:
Scala 2.11.8
Spark 2.3
Enclose the table_name column name in quotes (") inside col. Without the quotes, col(table_name) resolves the Scala variable table_name, so its value (apparently abcd in your run) is used as the column name, which is why the analyzer cannot resolve abcd. If you also need the case-insensitive match from the PySpark version, wrap the column in lower() as well.
val valid_data = bcd_df.filter(col("table_name") === tbl_nme).select(col("valid_data")).rdd.map(x => x(0)).collect()(0)
I'm trying to create a 30-minute time bucket and I'm getting the following attribute error:
'datetime.timedelta' object has no attribute '_get_object_id'
The column being ingested is called timestamp and has the schema type timestamp, e.g.
2019-02-01T15:53:44Z
I can't work out why I'm getting the error, given that the code below should be able to handle the timestamp.
def ceil_dt(dt, delta):
    return dt + (datetime.min - dt) % delta

df = df.withColumn("bucket_timestamp", ceil_dt(df.timestamp, timedelta(minutes=30)))
return df
You need to make use of a user defined function (UDF):
from pyspark.sql.types import *
from pyspark.sql import functions as f
from pyspark.sql import Row
from datetime import datetime, timedelta
# example DF
date = datetime.strptime('2019-02-01T15:53:44', '%Y-%m-%dT%H:%M:%S')
df = sc.parallelize([Row(timestamp=date)]).toDF()
# define UDF based on OP's function
ceil_dt = f.udf(lambda dt, delta: dt + (datetime.min - dt) % timedelta(minutes=delta),
                TimestampType())
# now apply to timestamp columns
df = df.withColumn("bucket_timestamp", ceil_dt(df.timestamp, f.lit(30)))
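As a side note, if a Python UDF is not required, the same 30-minute ceiling can be computed with built-in functions. This is a sketch, not part of the original answer, assuming a 1800-second bucket width:

# Round the timestamp up to the next 30-minute (1800-second) boundary without a UDF.
df = df.withColumn(
    "bucket_timestamp_builtin",
    f.from_unixtime(f.ceil(f.unix_timestamp("timestamp") / 1800) * 1800).cast("timestamp")
)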
I have a dataframe in Pyspark with a date column called "report_date".
I want to create a new column called "report_date_10" that is 10 days added to the original report_date column.
Below is the code I tried:
df_dc["report_date_10"] = df_dc["report_date"] + timedelta(days=10)
This is the error I got:
AttributeError: 'datetime.timedelta' object has no attribute '_get_object_id'
Help! thx
It seems you are using the pandas syntax for adding a column. For Spark, you need to use withColumn to add a new column, and for adding days there is the built-in date_add function:
import pyspark.sql.functions as F
df_dc = spark.createDataFrame([['2018-05-30']], ['report_date'])
df_dc.withColumn('report_date_10', F.date_add(df_dc['report_date'], 10)).show()
+-----------+--------------+
|report_date|report_date_10|
+-----------+--------------+
| 2018-05-30| 2018-06-09|
+-----------+--------------+
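A small follow-up: to keep the new column on the DataFrame rather than only display it, assign the result back (this reuses the df_dc example above):

df_dc = df_dc.withColumn('report_date_10', F.date_add(df_dc['report_date'], 10))
df_dc.printSchema()  # report_date_10 comes back as a date column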