Trying to create a 30-minute time bucket and I'm getting the following attribute error:
'datetime.timedelta' object has no attribute '_get_object_id'
The column being ingested is called timestamp and has the schema type timestamp, e.g.
2019-02-01T15:53:44Z
I can't work out why I'm getting the error, given that the code below should be able to ingest the timestamp.
from datetime import datetime, timedelta

def ceil_dt(dt, delta):
    # round dt up to the next multiple of delta
    return dt + (datetime.min - dt) % delta

df = df.withColumn("bucket_timestamp", ceil_dt(df.timestamp, timedelta(minutes=30)))
return df
The problem is that ceil_dt is evaluated as plain Python on a Column: datetime.min - dt produces a Column, and applying % timedelta(minutes=30) to it makes Spark try to turn the timedelta into a column literal, which is where _get_object_id comes from. You need to make use of a user defined function (UDF) instead:
from pyspark.sql.types import *
from pyspark.sql import functions as f
from pyspark.sql import Row
from datetime import datetime, timedelta
# example DF
date = datetime.strptime('2019-02-01T15:53:44', '%Y-%m-%dT%H:%M:%S')
df = sc.parallelize([Row(timestamp=date)]).toDF()
# define UDF based on OP's function
ceil_dt = f.udf(lambda dt, delta: dt + (datetime.min - dt) % timedelta(minutes=delta),
                TimestampType())
# now apply to timestamp columns
df = df.withColumn("bucket_timestamp", ceil_dt(df.timestamp, f.lit(30)))
Related
My main purpose in this code is to get all the columns of the dataframe back after the group by, but after the group by only the grouped columns are coming through.
So I tried using a join to merge the grouped dataframe with the original dataframe, but it is throwing an error.
Code:
from pyspark.sql.types import IntegerType
from pyspark.sql.types import *
from pyspark.sql.functions import *

retail_sales_transaction = glueContext.create_dynamic_frame.from_catalog(
    database="conform_main_mobconv",
    table_name="retail_sales_transaction")

# conversion from the Glue DynamicFrame to a Spark DataFrame (assumed; not shown in the original snippet)
df_retail_sales_transaction = retail_sales_transaction.toDF()

df_retail_sales_transaction = df_retail_sales_transaction.select(
    "transaction_id", "transaction_key", "transaction_timestamp", "personnel_key",
    "retail_site_id", "personnel_role", "country_code", "business_week")

df_retail_sales_transaction = df_retail_sales_transaction.groupBy(
    "business_week", "location_id", "country_code", "personnel_key")

df_retail_sales_transaction = df_retail_sales_transaction.join(
    df_retail_sales_transaction, ['business_week'], 'outer')
The error I'm getting is:
df_retail_sales_transaction = df_retail_sales_transaction.join(df_retail_sales_transaction,['business_week'],'outer')
AttributeError: 'GroupedData' object has no attribute 'join'
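groupBy on its own only returns a GroupedData object, which is why it has no join method; an aggregation has to be applied first, and the aggregated DataFrame can then be joined back to the original one on the grouping keys. A minimal sketch of that pattern (the count aggregation, the choice of keys, and starting again from the selected dataframe are my own illustration, not taken from the original post):
from pyspark.sql import functions as F

# the aggregation turns the GroupedData back into a DataFrame
weekly_counts = (df_retail_sales_transaction
    .groupBy("business_week", "country_code", "personnel_key")
    .agg(F.count("transaction_id").alias("transaction_count")))

# join the aggregated result back so all of the original columns are kept
df_with_counts = df_retail_sales_transaction.join(
    weekly_counts, ["business_week", "country_code", "personnel_key"], "left")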
I need to create a column in PySpark that has the row number of each row. I'm using the monotonically_increasing_id function, but it sometimes generates very large values. How can I generate a column whose values start from 1 and go up to the size of my dataframe?
top_seller_elast_df = top_seller_elast_df.withColumn("rank", F.monotonically_increasing_id() + 1)
Use the row_number() window function, ordering by monotonically_increasing_id():
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# order the window by the monotonically increasing id so the numbering follows the original order
w = Window.orderBy("mid")
top_seller_elast_df = top_seller_elast_df.withColumn("mid", monotonically_increasing_id())
top_seller_elast_df.withColumn("row_number", row_number().over(w)).show()
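As a quick check (a toy example of my own, not from the original answer; it assumes an existing spark session), the same pattern yields consecutive numbers even though the intermediate ids can be sparse:
toy_df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])
toy_df = toy_df.withColumn("mid", monotonically_increasing_id())
toy_df.withColumn("row_number", row_number().over(Window.orderBy("mid"))).show()
Note that a window with orderBy but no partitionBy pulls all rows into a single partition, which is fine for producing a global row number but worth keeping in mind on large data.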
I tried the following code to subset my data so that it only gives me a date range from 6/1 to yesterday:
day_1 = '2018-06-01'
df = df.where((F.col('report_date') >= day_1) & (F.col('report_date') < F.current_date()))
I get the following error: AnalysisException: u"cannot resolve '2018-06-01' given input columns
You can use the lit method from the SQL functions module to create a literal column:
df = df.where((F.col('report_date') >= F.lit(day_1)) & (F.col('report_date') < F.current_date()))
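A small variant of the same fix (my own sketch, assuming report_date is a date or timestamp column): compare against an explicit date value rather than a raw string.
import pyspark.sql.functions as F

# to_date makes the lower bound an actual date value instead of a string literal
df = df.where(
    (F.col('report_date') >= F.to_date(F.lit(day_1))) &
    (F.col('report_date') < F.current_date())
)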
I am querying Hive data from inside Spark using:
val selectMemCntQry = "select column1 from table1 where column2 = "+col_2_val
val table_col2 = sparkSession.sql(selectMemCntQry)
val diff = table_col2 - file_member_count
where file_member_count is an integer value. I know the result of table_col2 is always going to be a single number.
I want to subtract the result of the query from an integer value, but the error I am facing is:
value - is not a member of org.apache.spark.sql.DataFrame
sparkSession.sql returns a DataFrame, not a number, so you have to extract the value from its first row before subtracting:
import org.apache.spark.sql.Row
val Row(colum1: Integer) = sparkSession.sql(selectMemCntQry).first
colum1 - file_member_count
or
sparkSession.sql(selectMemCntQry).first.getAs[Integer]("column1") - file_member_count
or
sparkSession.sql(selectMemCntQry).as[Integer].first - file_member_count
I have a dataframe in Pyspark with a date column called "report_date".
I want to create a new column called "report_date_10" that is 10 days added to the original report_date column.
Below is the code I tried:
df_dc["report_date_10"] = df_dc["report_date"] + timedelta(days=10)
This is the error I got:
AttributeError: 'datetime.timedelta' object has no attribute '_get_object_id'
Help! thx
It seems you are using the pandas syntax for adding a column. For Spark, you need to use withColumn to add a new column, and for adding days to a date there's the built-in date_add function:
import pyspark.sql.functions as F
df_dc = spark.createDataFrame([['2018-05-30']], ['report_date'])
df_dc.withColumn('report_date_10', F.date_add(df_dc['report_date'], 10)).show()
+-----------+--------------+
|report_date|report_date_10|
+-----------+--------------+
| 2018-05-30| 2018-06-09|
+-----------+--------------+
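If the number of days has to come from another column instead of a constant (my own extension; days_to_add is a hypothetical column used purely for illustration, and in older Spark versions F.date_add only accepts an integer literal for its second argument), the SQL expression form can be used:
# days_to_add is a hypothetical per-row integer column, shown only for illustration
df_var = spark.createDataFrame([('2018-05-30', 3), ('2018-05-30', 10)],
                               ['report_date', 'days_to_add'])
df_var.withColumn('report_date_n', F.expr('date_add(report_date, days_to_add)')).show()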