PySpark java.lang.NullPointerException

from pyspark.sql.functions import col, lit
from pyspark.sql import functions as F

df_customer_account_statement = (
    df_current_balance.select(
        col('last_statement_close_balance').alias('beginBalance'),
        col('current_statement_close_balance').alias('closeBalance'),
    )
    .withColumn('documentType', lit('RewardAccountStatement'))
    .withColumn('statementBeginAt', lit(current_statement_begin_date.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + 'Z'))
    .withColumn('statementCloseAt', lit(current_statement_close_date.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3] + 'Z'))
    .withColumn('uuid', F.expr('uuid()'))
)
The last withColumn throws a java.lang.NullPointerException.
Is there anything wrong with this code?

The 'z' and 'Z' are used to represent the time zone in a datetime format string. Use the code below
strftime("%Y-%m-%dT%H:%M:%S%z")
instead of
strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3]+'Z'
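Applied to the snippet from the question, the chain would then look like the sketch below. Note the assumption that current_statement_begin_date and current_statement_close_date are timezone-aware datetime objects; for a naive datetime, %z produces an empty string.

from pyspark.sql.functions import col, lit
from pyspark.sql import functions as F

df_customer_account_statement = (
    df_current_balance.select(
        col('last_statement_close_balance').alias('beginBalance'),
        col('current_statement_close_balance').alias('closeBalance'),
    )
    .withColumn('documentType', lit('RewardAccountStatement'))
    # %z emits the UTC offset from the datetime itself instead of a hard-coded 'Z'
    .withColumn('statementBeginAt', lit(current_statement_begin_date.strftime('%Y-%m-%dT%H:%M:%S%z')))
    .withColumn('statementCloseAt', lit(current_statement_close_date.strftime('%Y-%m-%dT%H:%M:%S%z')))
    .withColumn('uuid', F.expr('uuid()'))
)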


Calling a function on a DataFrame's columns raises TypeError: Column is not iterable

I am using Databricks with Spark 2.4, and I am coding in Python.
I have created this function to convert null to an empty string:
def xstr(s):
    if s is None:
        return ""
    return str(s)
Then I have the code below:
from pyspark.sql.functions import *
lv_query = """
SELECT
SK_ID_Site, Designation_Site
FROM db_xxx.t_xxx
ORDER BY SK_ID_Site
limit 2"""
lvResult = spark.sql(lv_query)
a = lvResult1.select(map(xstr, col("Designation_Site")))
display(a)
I get this error: TypeError: Column is not iterable
What I need to do here is call a function for each row in my DataFrame. I would like to pass columns as parameters and get a result back.
That's not how Spark works. You cannot apply plain Python code directly to the contents of a Spark DataFrame.
There are already built-in functions that do the job for you.
from pyspark.sql import functions as F
a = lvResult1.select(
    F.when(F.col("Designation_Site").isNull(), "").otherwise(
        F.col("Designation_Site").cast("string")
    )
)
If you need something more complex than the built-in functions can express, you can use a UDF, but it may significantly hurt performance (check for an existing built-in function before writing your own UDF).
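For completeness, if you did want to keep the original xstr logic, a minimal sketch of wrapping it in a UDF could look like this (assuming the lvResult1 DataFrame from the question):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Wrap the plain-Python null-to-empty-string logic in a UDF. Spark serializes
# each value out to Python and back, which is why the built-in version above
# is preferred for performance.
xstr_udf = F.udf(lambda s: "" if s is None else str(s), StringType())

a = lvResult1.select(xstr_udf(F.col("Designation_Site")).alias("Designation_Site"))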

pyspark udf with parameter

I need to convert one PySpark DataFrame column, checkin_time, from milliseconds to a timezone-adjusted timestamp; the timezone information is in another column, tz_info.
I tried the following:
def tz_adjust(x, tz_info):
    if tz_info:
        y = col(x) + col(tz_info)
        return from_unixtime(col(y) / 1000)
    else:
        return from_unixtime(col(x) / 1000)

def udf_tz_adjust(tz_info):
    return udf(lambda l: tz_adjust(l, tz_info))
When applying this UDF to the column
df.withColumn('checkin_time',udf_tz_adjust('time_zone')(col('checkin_time')))
I got this error:
AttributeError: 'NoneType' object has no attribute '_jvm'
Any idea how to pass the second column as a parameter to the UDF?
Thanks.
IMHO, what you are doing is a combination of a UDF and a partial function, which can get tricky. I don't think you need a UDF at all for this. You can do the following:
# not tested
from pyspark.sql.functions import col, from_unixtime, when

df.withColumn(
    'checkin_time',
    when(col('tz_info').isNotNull(),
         from_unixtime((col('checkin_time') + col('tz_info')) / 1000))
    .otherwise(from_unixtime(col('checkin_time') / 1000))
)
A UDF has its own serde inefficiencies, which are even worse in Python because of the extra overhead of converting Scala data types into Python data types.
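That said, if you really do need a UDF that reads both columns, a minimal sketch is to give the UDF two arguments and pass both columns at call time, so no partial function is required. This assumes checkin_time holds epoch milliseconds and time_zone / tz_info holds an offset in milliseconds, as the question's arithmetic implies:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from datetime import datetime, timezone

# Two-argument UDF: both columns are supplied when the UDF is called.
# This runs row by row in Python, so the built-in version above is still
# preferable for performance.
@F.udf(returnType=StringType())
def tz_adjust_udf(checkin_ms, tz_offset_ms):
    if checkin_ms is None:
        return None
    total_ms = checkin_ms + (tz_offset_ms or 0)
    return datetime.fromtimestamp(total_ms / 1000, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')

df.withColumn('checkin_time', tz_adjust_udf(F.col('checkin_time'), F.col('time_zone')))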

I'm having problems rounding a plain calculation to two digits in PySpark on Databricks

I'm getting the following error: Invalid argument, not a string or column: 1586906.0151878505 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
The code goes something like this:
from pyspark.sql.types import *
population= int(df.count())
income_sum= int(df.agg(_sum('income_job1')).collect()[0][0])
Tot_mean_income = round(income_sum/population,2)
I think you have a conflict between the round function in the pyspark.sql.functions package and the regular Python round function.
Try importing the functions module with an alias so that there is no conflict with round.
#import functions with alias
from pyspark.sql import functions as F
from pyspark.sql.types import *
df=spark.createDataFrame([(1,'a'),(2,'b')],['id','name'])
population=int(df.count())
income_sum=int(df.agg(F.sum(F.col("id"))).collect()[0][0])
Tot_mean_income = round(income_sum/population,2)
#1.5
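For reference, here is a small sketch of how the shadowing happens (hypothetical, since the question doesn't show its imports, but a star import like this is common in notebooks): after from pyspark.sql.functions import *, the name round refers to Spark's column function, which rejects a plain float.

from pyspark.sql.functions import *   # shadows Python's built-in round with Spark's round
import builtins

value = 1586906.0151878505
# round(value, 2) would now raise: Invalid argument, not a string or column ...
print(builtins.round(value, 2))       # 1586906.02 -- explicitly call the Python built-in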

How can I access a variable in PySpark just like in Scala by using s & $

I have the code below, which copies data from a Hive table to HDFS in append mode as a Parquet file.
from pyspark.sql.functions import current_date, date_format, date_sub
from datetime import datetime, timedelta
import datetime
q = """select label_yyyy_mm_dd
,label_yyyy_mm
,q_media_name
,a_accepted
,a_end_ts
,a_media_name
,a_resource_name
,a_start_ts
,k_callpurpose
,k_srf
,q_entry_ordinal
,q_interaction_id
,q_interaction_type
,q_ixn_resource_id
,q_resource_name
,a_consult_rcv_warm_engage_time
,a_consult_rcv_warm_hold_time
,a_consult_rcv_warm_wrap_time
,a_customer_handle_count
,a_customer_talk_duration
,a_interaction_resource_id
,a_interaction_id
,a_wrap_time
,a_technical_result
,k_ixn_type
,k_ixn_type_source
,k_transfer_count
,k_language
,k_agentauth
,k_auth,k_rg
,k_channel
,k_gms_result
,k_connid
,k_rbcprimaryid
,k_agent_id
,a_interaction_resource_ordinal
from prod_T0V0_cct0.cct0_gim_measures_gold A
inner join prod_T0V0_cct0.yle0_gim_date_time B on A.a_start_date_time_key = B.date_time_key
where label_yyyy_mm_dd = date_format(date_sub(current_date(), 1), 'y-MM-dd')
"""
date = (datetime.date.today()-datetime.timedelta(days=1)).strftime('%Y-%m-%d')
spark.sql(q).write.mode('append').parquet('hdfs:/prod/11323/app/H9A0/data/T0V0/DIG/info_gold_chat.parquet/label_yyyy_mm_dd=$date')
The Parquet files need to be written into a folder named after the value of the variable "date". However, this throws a syntax error; as I understand it, the path above uses 's' and '$', which is Scala syntax, not PySpark. I tried removing both, and it works, but then the files end up in a folder literally named "date", which I think is treated as a constant rather than the variable's value.
Can someone help me write the Parquet files into a folder named after Day-1 in %Y-%m-%d format?
The issue is with the last line. I have tested this in the PySpark shell and it gives the correct result. Use proper string formatting in the last line, as below:
date = (datetime.date.today()-datetime.timedelta(days=1)).strftime('%Y-%m-%d')
date # Testing the date value in PySpark Shell.
'2018-09-24'
spark.sql(q).write.mode('append').parquet('hdfs:/prod/11323/app/H9A0/data/T0V0/DIG/info_gold_chat.parquet/label_yyyy_mm_dd=%s' % date)
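An equivalent sketch using str.format, which some find easier to read than %-formatting when building the output path:

# Same idea with str.format; the placeholder is filled with the Day-1 date string.
out_path = 'hdfs:/prod/11323/app/H9A0/data/T0V0/DIG/info_gold_chat.parquet/label_yyyy_mm_dd={}'.format(date)
spark.sql(q).write.mode('append').parquet(out_path)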

Pyspark: How to add ten days to existing date column

I have a dataframe in Pyspark with a date column called "report_date".
I want to create a new column called "report_date_10" that is 10 days added to the original report_date column.
Below is the code I tried:
df_dc["report_date_10"] = df_dc["report_date"] + timedelta(days=10)
This is the error I got:
AttributeError: 'datetime.timedelta' object has no attribute '_get_object_id'
Help! thx
It seems you are using pandas syntax for adding a column. In Spark, you need withColumn to add a new column, and for adding days there is the built-in date_add function:
import pyspark.sql.functions as F
df_dc = spark.createDataFrame([['2018-05-30']], ['report_date'])
df_dc.withColumn('report_date_10', F.date_add(df_dc['report_date'], 10)).show()
+-----------+--------------+
|report_date|report_date_10|
+-----------+--------------+
| 2018-05-30|    2018-06-09|
+-----------+--------------+
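For reference, the same addition can also be written as a SQL expression string, which can be handy if you prefer expressions over the Python helpers (a small sketch on the same DataFrame):

# Equivalent result using a SQL expression instead of the Python date_add helper.
df_dc.withColumn('report_date_10', F.expr('date_add(report_date, 10)')).show()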