I am getting a value from a DataFrame using a max aggregation, so I get a string and I want to convert it to a Date.
What I am doing is this:
var date = spark.read.parquet("data/users").select("Date").agg(max(col("Date"))).first.get(0).toString
df2 = table_read.filter("Date=" + lastDate)
This way I get a variable of string type, and now I want to convert it to the Date type. I have been searching other answers, but all I found was how to do it on DataFrames using to_date. How can I do it in this case?
EDIT:
Schema:
root
|-- Date: date (nullable = false)
|-- op: string (nullable = true)
|-- value: string (nullable = true)
Output of spark.read.parquet("data/users").select("Date").agg(max(col("Date"))).show:
+-----------+
|max(Date) |
+-----------+
|2019-11-10 |
+-----------+
Error:
Exception message: cannot resolve '(`Date` = ((2021 - 12) - 14))' due to data type mismatch: differing types in '(`Date` = ((2021 - 12) - 14))' (date and int).; line 1 pos 0;
'Filter (Date#5488 = ((2021 - 12) - 14))
You can use .getDate, e.g.
var date = spark.read.parquet("data/users").select("Date").agg(max(col("Date"))).first.getDate(0)
That gives you a java.sql.Date instead of a string. To use it in a filter, you can do
df2 = table_read.filter(col("Date") === date)
// or df2 = table_read.filter("Date='" + date + "'")
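To see why the original filter failed: concatenating the unquoted date into the SQL string produces something like Date=2021-12-14, which Spark parses as the arithmetic expression ((2021 - 12) - 14), hence the "date and int" mismatch in the error. A minimal sketch of the failing versus working forms, reusing table_read and the column names from the question:

import org.apache.spark.sql.functions.{col, lit, max}

// getDate(0) returns a java.sql.Date instead of a String
val lastDate = spark.read.parquet("data/users")
  .agg(max(col("Date"))).first.getDate(0)

// Fails: builds the SQL text Date=<unquoted date>, parsed as integer arithmetic
// val df2 = table_read.filter("Date=" + lastDate)

// Works: the value is passed as a typed literal (=== wraps non-Column values in lit)
val df2 = table_read.filter(col("Date") === lit(lastDate))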
Related
I am working on a requirement in which I am getting one small table in the form of a CSV file, as follows:
root
|-- ACCT_NO: string (nullable = true)
|-- SUBID: integer (nullable = true)
|-- MCODE: string (nullable = true)
|-- NewClosedDate: timestamp (nullable = true)
We also have a very big external Hive table in the form of Avro, which is stored in HDFS, as follows:
root
|-- accountlinks: array (nullable = true)
| | |-- account: struct (nullable = true)
| | | |-- acctno: string (nullable = true)
| | | |-- subid: string (nullable = true)
| | | |-- mcode: string (nullable = true)
| | | |-- openeddate: string (nullable = true)
| | | |-- closeddate: string (nullable = true)
Now, the requirement is to look up the external Hive table based on the three columns from the CSV file: ACCT_NO - SUBID - MCODE. If they match, update accountlinks.account.closeddate with NewClosedDate from the CSV file.
I have already written the following code to explode the required columns and join them with the small table, but I am not really sure how to update the closeddate field (which is currently null for all account holders) with NewClosedDate, because closeddate is a nested column and I cannot easily use withColumn to populate it. In addition, the schema and column names cannot be changed, as these files are linked to an external Hive table.
val df = spark.sql("select * from db.table where archive='201711'")
val ExtractedColumn = df
.coalesce(150)
.withColumn("ACCT_NO", explode($"accountlinks.account.acctno"))
.withColumn("SUBID", explode($"accountlinks.account.acctsubid"))
.withColumn("MCODE", explode($"C.mcode"))
val ReferenceData = spark.read.format("csv")
.option("header","true")
.option("inferSchema","true")
.load("file.csv")
val FinalData = ExtractedColumn.join(ReferenceData, Seq("ACCT_NO","SUBID","MCODE") , "left")
All you need is to explode the accountlinks array and then join the two DataFrames, like this:
val explodedDF = df.withColumn("account", explode($"accountlinks"))
val joinCondition = $"ACCT_NO" === $"account.acctno" && $"SUBID" === $"account.subid" && $"MCODE" === $"account.mcode"
val joinDF = explodedDF.join(ReferenceData, joinCondition, "left")
Now you can update the account struct column as below, and use collect_list to get back the array structure:
val FinalData = joinDF.withColumn("account",
    struct($"account.acctno", $"account.subid", $"account.mcode",
           $"account.openeddate", $"NewClosedDate".alias("closeddate")))
  // an empty groupBy() collects every account struct into a single row;
  // group by your record's key columns instead if you need one array per original record
  .groupBy().agg(collect_list($"account").alias("accountlinks"))
The idea is to create a new struct with all the fields from account except closeddate, which you take from the NewClosedDate column instead.
If the struct contains many fields, you can build the field list programmatically (for example with a for-comprehension over the schema) rather than typing them all, as in the sketch below.
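In case it helps, a minimal sketch of that idea, building the field list from the struct's schema instead of typing every field by hand (it assumes the exploded column is named account and the replacement column NewClosedDate, as in the answer above):

import org.apache.spark.sql.functions.{col, struct}

val accountFields = joinDF.select($"account.*").columns        // all field names of the struct
val rebuiltFields = accountFields.map {
  case "closeddate" => $"NewClosedDate".alias("closeddate")     // swap in the new value
  case other        => col(s"account.$other")                   // keep every other field as-is
}
val updatedDF = joinDF.withColumn("account", struct(rebuiltFields: _*))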
During an ETL process I have one SAS date field that is stored as a 5-digit integer, indicating days since 1960-01-01. In order to make this column more useful in analysis, I would like to convert it to a date data type field in Redshift.
Currently I am trying to do this in pyspark as follows:
I created a new column "sas_date" with the string literal "1960-01-01".
Using pyspark.sql.functions.date_add, I pass the "sas_date" column as the start date parameter and the integer "arrival_date" column as the second parameter.
When the date_add function runs, I get the error that the Column is not iterable, even though I would think the arrival_date column, being a series, would be iterable. But it's not; why?
When I remove the 'arrival_date' column and replace it with a static integer value (say 1) the date_add function will work.
from pyspark.sql.types import IntegerType
i94 = i94.withColumn('arrival_date', col('arrival_date').cast(IntegerType()))
i94 = i94.withColumn('sas_date', lit("1960-01-01"))
i94 = i94.withColumn('arrival_date', date_add(col('sas_date'), i94['arrival_date']))
I want to be able to pass my column so that the second date_add parameter is dynamic. However, it seems date_add does not accept this. If date_add cannot accomplish this, what other options do I have besides a UDF?
UPDATE:
State of data right before the date_add() operation
i94.printSchema()
root
|-- cic_id: double (nullable = true)
|-- visa_id: string (nullable = true)
|-- port_id: string (nullable = true)
|-- airline_id: string (nullable = true)
|-- cit_id: double (nullable = true)
|-- res_id: double (nullable = true)
|-- year: double (nullable = true)
|-- month: double (nullable = true)
|-- age: double (nullable = true)
|-- gender: string (nullable = true)
|-- arrival_date: integer (nullable = true)
|-- depart_date: double (nullable = true)
|-- date_begin: string (nullable = true)
|-- date_end: string (nullable = true)
|-- sas_date: string (nullable = false)
i94.limit(10).toPandas()
toPandas() result
I think you are absolutely right: date_add is designed to take only int values up until Spark 3.0.0.
In the Spark Scala implementation I see the lines below. They show that whatever value we pass to date_add is converted back into a column with lit:
Spark <3.0.0:
def date_add(start: Column, days: Int): Column = date_add(start, lit(days))
Spark >=3.0.0:
def date_add(start: Column, days: Column): Column = withExpr { DateAdd(start.expr, days.expr) }
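For reference, on Spark >= 3.0.0 the Column overload above means you can pass the day count as a column directly, without expr or a UDF. A minimal Scala sketch (illustrative names, assuming a SparkSession called spark):

import org.apache.spark.sql.functions.{col, date_add, lit, to_date}
import spark.implicits._

val sample = Seq(20574, 1).toDF("arrival_date")
  .withColumn("sas_date", to_date(lit("1960-01-01")))

// days passed as a Column, not an Int
val result = sample.withColumn("actual_arrival_date",
  date_add(col("sas_date"), col("arrival_date")))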
Now let's talk about solutions; I can think of two approaches.
Imports and a small sample of your dataset:
import pyspark.sql.functions as f
import pyspark.sql.types as t
from datetime import datetime
from datetime import timedelta
l1 = [(5748517.0,'1960-01-01', 20574), (5748517.0,'1960-01-01', 20574), (5748517.0,'1960-01-01', 20574)]
df = spark.createDataFrame(l1).toDF('cic_id','sas_date','arrival_date')
df.show()
+---------+----------+------------+
| cic_id| sas_date|arrival_date|
+---------+----------+------------+
|5748517.0|1960-01-01| 20574|
|5748517.0|1960-01-01| 20574|
|5748517.0|1960-01-01| 20574|
+---------+----------+------------+
Now, there are two ways to achieve the functionality.
UDF way:
def date_add_(date, days):
    # Type check and convert to a datetime object
    # Format and other things should be handled more delicately
    if type(date) is not datetime:
        date = datetime.strptime('1960-01-01', "%Y-%m-%d")
    return date + timedelta(days)

date_add_udf = f.udf(date_add_, t.DateType())
df.withColumn('actual_arrival_date', date_add_udf(f.to_date('sas_date'), 'arrival_date')).show()
+---------+----------+------------+-------------------+
| cic_id| sas_date|arrival_date|actual_arrival_date|
+---------+----------+------------+-------------------+
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
+---------+----------+------------+-------------------+
Using expr evaluation:
df.withColumn('new_arrival_date', f.expr("date_add(sas_date, arrival_date)")).show()
+---------+----------+------------+----------------+
| cic_id| sas_date|arrival_date|new_arrival_date|
+---------+----------+------------+----------------+
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
+---------+----------+------------+----------------+
Sample Code:
val sparkSession = SparkUtil.getSparkSession("timestamp_format_test")
import sparkSession.implicits._
val format = "yyyy/MM/dd HH:mm:ss.SSS"
val time = "2018/12/21 08:07:36.927"
val df = sparkSession.sparkContext.parallelize(Seq(time)).toDF("in_timestamp")
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp"), format))
Output:
df2.show(false)
Please notice: out_timestamp loses the millisecond part of the original value.
+-----------------------+-------------------+
|in_timestamp |out_timestamp |
+-----------------------+-------------------+
|2018/12/21 08:07:36.927|2018-12-21 08:07:36|
+-----------------------+-------------------+
df2.printSchema()
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
In the result above, in_timestamp is of string type and I would like to convert it to the timestamp data type. It does get converted, but the millisecond part is lost. Any ideas? Thanks!
Sample code for preserving milliseconds during the conversion from string to timestamp:
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp")))
df2.show(false)
+-----------------------+-----------------------+
|in_timestamp |out_timestamp |
+-----------------------+-----------------------+
|2018-12-21 08:07:36.927|2018-12-21 08:07:36.927|
+-----------------------+-----------------------+
scala> df2.printSchema
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
You just need to remove the format parameter from to_timestamp. The result keeps the timestamp data type and still matches the original string value, milliseconds included.
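Note that this works here because the shown in_timestamp values already match Spark's default timestamp layout (yyyy-MM-dd HH:mm:ss with optional fractional seconds). If your input still uses the slash-separated format from the question, one hedged option is to normalise the string first and then call to_timestamp without a format, for example:

import org.apache.spark.sql.functions.{regexp_replace, to_timestamp}

// 2018/12/21 08:07:36.927 -> 2018-12-21 08:07:36.927, then parse with the default layout
val df3 = df
  .withColumn("in_timestamp_norm", regexp_replace($"in_timestamp", "/", "-"))
  .withColumn("out_timestamp", to_timestamp($"in_timestamp_norm"))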
I have a DataFrame whose simplified schema has two struct columns with three fields each:
root
|-- npaDetails: struct (nullable = true)
| |-- additionalInformation: struct (nullable = true)
| |-- npaStatus: struct (nullable = true)
| |-- npaDetails: struct (nullable = true)
|-- npaHeaderData: struct (nullable = true)
| |-- npaNumber: string (nullable = true)
| |-- npaDownloadDate: string (nullable = true)
| |-- npaDownloadTime: string (nullable = true)
Possible values:
npaDownloadDate - "30JAN17"
npaDownloadTime - "19.50.00"
I need to compare two rows in a DataFrame with this schema, to determine which one is "fresher". To do so I need to merge the fields npaDownloadDate and npaDownloadTime to generate a Date that I can compare easily.
Below is the code I have written so far. It works, but I think it takes more steps than necessary, and I'm sure Scala offers better solutions than my approach.
val parquetFileDF = sqlContext.read.parquet("MyParquet.parquet")
val relevantRows = parquetFileDF.filter($"npaHeaderData.npaNumber" === "123456")
val date = relevantRows .select($"npaHeaderData.npaDownloadDate").head().get(0)
val time = relevantRows .select($"npaHeaderData.npaDownloadTime").head().get(0)
val dateTime = new SimpleDateFormat("ddMMMyykk.mm.ss").parse(s"$date$time")
//I would replicate the previous steps to get dateTime2
if(dateTime.before(dateTime2))
println("dateTime is before dateTime2")
So the output of "30JAN17" and "19.50.00" would be Mon Jan 30 19:50:00 GST 2017
Is there another way to generate a Date from two fields of a column, without extracting and merging them as strings? Or even better, is it possible to compare both values (date and time) directly between two different rows in a DataFrame, to know which one has the older date?
In Spark 2.2,
df.filter(
  to_date(
    concat(
      $"npaHeaderData.npaDownloadDate",
      $"npaHeaderData.npaDownloadTime"),
    fmt = "[your format here]"
  ) < lit(someDate))  // someDate: your cutoff date
I'd use
import org.apache.spark.sql.functions._
df.withColumn("some_name", date_format(unix_timestamp(
concat($"npaHeaderData.npaDownloadDate", $"npaHeaderData.npaDownloadTime"),
"ddMMMyykk.mm.ss").cast("timestamp"),
"EEE MMM d HH:mm:ss z yyyy"))
I put some log files into SQL tables through Spark, and my schema looks like this:
|-- timestamp: timestamp (nullable = true)
|-- c_ip: string (nullable = true)
|-- cs_username: string (nullable = true)
|-- s_ip: string (nullable = true)
|-- s_port: string (nullable = true)
|-- cs_method: string (nullable = true)
|-- cs_uri_stem: string (nullable = true)
|-- cs_query: string (nullable = true)
|-- sc_status: integer (nullable = false)
|-- sc_bytes: integer (nullable = false)
|-- cs_bytes: integer (nullable = false)
|-- time_taken: integer (nullable = false)
|-- User_Agent: string (nullable = true)
|-- Referrer: string (nullable = true)
As you can notice, I created a timestamp field, which I read is supported by Spark (Date wouldn't work, as far as I understood). I would love to use it for queries like "where timestamp>(2012-10-08 16:10:36.0)", but when I run them I keep getting errors.
I tried the following two syntax forms:
For the second one, I parse a string so I'm sure I'm actually passing it in a timestamp format.
I use two functions: parse and date2timestamp.
Any hint on how I should handle timestamp values?
Thanks!
1)
scala> sqlContext.sql("SELECT * FROM Logs as l where l.timestamp=(2012-10-08 16:10:36.0)").collect
java.lang.RuntimeException: [1.55] failure: ``)'' expected but 16 found
SELECT * FROM Logs as l where l.timestamp=(2012-10-08 16:10:36.0)
^
2)
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp="+date2timestamp(formatTime3.parse("2012-10-08 16:10:36.0"))).collect
java.lang.RuntimeException: [1.54] failure: ``UNION'' expected but 16 found
SELECT * FROM Logs as l where l.timestamp=2012-10-08 16:10:36.0
^
I figured out that the problem was, first of all, the precision of the timestamp, and also that the timestamp has to be cast as a String to compare it against the string I pass.
So this query works now:
sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestampLog as String) <= '2012-10-08 16:10:36'")
You forgot the quotation marks.
Try something with this syntax:
L.timestamp = '2012-07-16 00:00:00'
Alternatively, try
L.timestamp = CAST('2012-07-16 00:00:00' AS TIMESTAMP)
Cast the string representation of the timestamp to a timestamp, e.g. cast('2012-10-10 12:00:00' as timestamp). Then you can do the comparison as timestamps, not strings. Instead of:
sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestamp as String) <= '2012-10-08 16:10:36'")
try
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp <= cast('2012-10-08 16:10:36' as timestamp)")
Sadly this didn't work for me. I am using Apache Spark 1.4.1. The following code is my solution:
Date date = new Date();
String query = "SELECT * FROM Logs as l where l.timestampLog <= CAST('" + new java.sql.Timestamp(date.getTime()) + "' as TIMESTAMP)";
sqlContext.sql(query);
Casting the timestampLog as string did not throw any errors but returned no data.
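For completeness, the same cast-to-timestamp comparison can also be written with the DataFrame API instead of an SQL string. A hedged sketch, assuming the Logs table is registered so it can be loaded as a DataFrame (names are illustrative):

import org.apache.spark.sql.functions.{col, lit}

val logs = sqlContext.table("Logs")
val recent = logs.filter(col("timestamp") <= lit("2012-10-08 16:10:36").cast("timestamp"))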