pyspark function to change datatype

The code works outside the function, but when I move it inside the function and adjust for the var argument that is passed in, I get an error. Thanks for the help!
from pyspark.sql.types import DateType
from pyspark.sql.functions import col, unix_timestamp, to_date
def change_string_to_date(df, var):
    df = df.withColumn("{}".format(var), to_date(unix_timestamp(col("{}".format(var))), 'yyyy-MM-dd').cast("timestamp"))
    return df
df_data = change_string_to_date(df_data,'mis_dt')

Figured it out: unix_timestamp was causing the problem. Very silly error.
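For anyone who hits the same thing, a minimal sketch of the corrected function, assuming the column holds 'yyyy-MM-dd' strings and Spark 2.2+ (where to_date accepts a format argument); the fix is simply to drop unix_timestamp and let to_date parse the string directly:
from pyspark.sql.functions import col, to_date

def change_string_to_date(df, var):
    # Parse the string column in place; to_date returns a DateType column.
    return df.withColumn(var, to_date(col(var), 'yyyy-MM-dd'))

df_data = change_string_to_date(df_data, 'mis_dt')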

Related

Azure Databricks: analyze whether the column names are lower case, using the islower() function

This is my logic in PySpark:
df2 = spark.sql(f" SELECT tbl_name, column_name, data_type, current_count FROM {database_name}.{tablename}")
query_df = spark.sql(f"SELECT tbl_name, COUNT(column_name) as `num_cols` FROM {database_name}.{tablename} GROUP BY tbl_name")
df_join = df2.join(query_df,['tbl_name'])
Then I want to add another column to the DataFrame, called 'column_case_lower', with the analysis of whether the column names are lower case, using the islower() function.
I'm using this logic to do the analysis:
df_join.withColumn("column_case_lower",
when((col("column_name").islower()) == 'true'.otherwise('false'))
The error is: unexpected EOF while parsing
I'm expecting a column_case_lower column containing true/false for each column name.
islower() can't be applied to a Column type. Use the code below, which uses a UDF instead.
from pyspark.sql.functions import col, udf, when
from pyspark.sql.types import BooleanType

# Plain Python function that checks whether a single value is lower case.
def checkCase(col_value):
    return col_value.islower()

# Wrap it in a UDF; declaring BooleanType lets the result drive when() directly.
checkUDF = udf(checkCase, BooleanType())

df.withColumn("new_col", when(checkUDF(col('column_name')), "True")
              .otherwise("False")).show()
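As a side note (not part of the answer above): if comparing each name with its lower-cased form is close enough to islower() for this check (strings with no cased characters, e.g. all digits, behave differently), the built-in lower() function avoids the UDF entirely. A rough sketch against the df_join DataFrame from the question:
from pyspark.sql.functions import col, lower, when

df_join.withColumn(
    "column_case_lower",
    when(col("column_name") == lower(col("column_name")), "True")
    .otherwise("False")
).show()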

Convert timestamp to day-of-week string with date_format in Spark/Scala

I keep getting an error that I am passing too many arguments, and I'm not sure why, as I am following the exact examples from:
command-629173529675356:9: error: too many arguments for method apply: (colName: String)org.apache.spark.sql.Column in class Dataset
val df_2 = date_format.withColumn("day_of_week", date_format(col("date"), "EEEE"))
My code:
val date_format = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = date_format.withColumn("day_of_week", date_format(col("date"), "EEEE"))
Thank you for the help!
dayofweek is the function you're looking for (note that it returns the day as an integer, 1 = Sunday through 7 = Saturday, rather than a name), so something like this:
import org.apache.spark.sql.functions.dayofweek
date_format.withColumn("day_of_week", dayofweek(col("date")))
You get the error because you named your first DataFrame date_format, which is the same name as the Spark built-in function you want to use. So when you call date_format, you're retrieving your DataFrame instead of the built-in date_format function.
To solve this, you should either rename your first DataFrame:
val df_1 = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = df_1.withColumn("day_of_week", date_format(col("date"), "EEEE"))
Or make sure you're calling the right date_format by importing the functions object and calling functions.date_format when extracting the day of week:
import org.apache.spark.sql.functions
val date_format = df_filter.withColumn("date", to_date(col("pickup_datetime")))
val df_2 = date_format.withColumn("day_of_week", functions.date_format(col("date"), "EEEE"))

Filtering rows which are causing datatype parsing issue in spark

I have a Spark DataFrame with a column Salary as shown below:
|Salary|
|"100"|
|"200"|
|"abc"|
The default datatype is string. I want to convert it to Integer while removing the rows that cause the parsing issue.
Desired Output
|Salary|
|100|
|200|
Can someone please let me know the code for filtering out the rows that cause the datatype parsing issue?
Thanks in advance.
You can filter the desired field with a regex and then cast the column:
import org.apache.spark.sql.types._
df.filter(row => row.getAs[String]("Salary").matches("""\d+"""))
.withColumn("Salary", $"Salary".cast(IntegerType))
You can also do it with Try if you don't like regex:
import scala.util._
df.filter(row => Try(row.getAs[String]("Salary").toInt).isSuccess)
.withColumn("Salary", $"Salary".cast(IntegerType))

Pyspark Window Function

I am trying to calculate the row_number on a dataset based on certain columns, but I am getting the error below:
AttributeError: 'module' object has no attribute 'rowNumber'
I am using the script below to get the row number based on MID and ClaimID. Any thoughts on why this is coming up?
from pyspark.sql.functions import first
from pyspark.sql.types import *
from pyspark.sql import *
from pyspark.sql import Row, functions as F
from pyspark.sql.window import Window
import pyspark.sql.functions as func
def Codes(pharmacyCodes):
    df_data = pharmacyCodes
    (df_data
        .select("MID", "claimid",
                F.rowNumber()
                 .over(Window
                       .partitionBy("MID")
                       .orderBy("MID"))
                 .alias("rowNum"))
        .show())
I think you're looking for row_number rather than rowNumber. The mixture of camel case and snake case in PySpark can get confusing.
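A minimal sketch of the snippet above with the correct function name, assuming the same df_data with MID and claimid columns (ordering by claimid within each MID partition is an assumption, made so that the row numbers are deterministic):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

(df_data
    .select("MID", "claimid",
            F.row_number()
             .over(Window
                   .partitionBy("MID")
                   .orderBy("claimid"))
             .alias("rowNum"))
    .show())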

How to group by on an epoch timestamp field in Scala Spark

I want to group the records by date, but the date is an epoch timestamp in milliseconds.
Here is the sample data.
date, Col1
1506838074000, a
1506868446000, b
1506868534000, c
1506869064000, a
1506869211000, c
1506871846000, f
1506874462000, g
1506879651000, a
Here is what I'm trying to achieve.
date        Count of records
02-10-2017  4
04-10-2017  3
03-10-2017  5
Here is the code I tried for the group by:
import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
But while executing the code I am getting the exception below.
<console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
Please help me to resolve the issue.
You would need to change the date column, which appears to be a long, to a date data type. This can be done with the from_unixtime built-in function (see the update below) or with a udf, and then it's just groupBy and agg function calls using the count function.
import org.apache.spark.sql.functions._
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))
df.withColumn("date", stringDate($"date"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
The answer above uses a udf, which should be avoided as much as possible, since a udf is a black box to Spark's optimizer and requires serialization and deserialization of the column values.
Updated
Thanks to @philantrovert for the suggestion to divide by 1000.
import org.apache.spark.sql.functions._
df.withColumn("date", from_unixtime($"date"/1000, "yyyy-MM-dd"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
Both ways work.