SparkSQL Timestamp query failure - scala

I put some log files into sql tables through Spark and my schema looks like this:
|-- timestamp: timestamp (nullable = true)
|-- c_ip: string (nullable = true)
|-- cs_username: string (nullable = true)
|-- s_ip: string (nullable = true)
|-- s_port: string (nullable = true)
|-- cs_method: string (nullable = true)
|-- cs_uri_stem: string (nullable = true)
|-- cs_query: string (nullable = true)
|-- sc_status: integer (nullable = false)
|-- sc_bytes: integer (nullable = false)
|-- cs_bytes: integer (nullable = false)
|-- time_taken: integer (nullable = false)
|-- User_Agent: string (nullable = true)
|-- Referrer: string (nullable = true)
As you can see, I created a timestamp field, which I read is supported by Spark (Date wouldn't work as far as I understood). I would love to use it in queries like "where timestamp>(2012-10-08 16:10:36.0)", but when I run them I keep getting errors.
I tried the following 2 syntax forms:
For the second one I parse a string, so I'm sure I'm actually passing it in a timestamp format.
I use 2 functions: parse and date2timestamp.
Any hint on how I should handle timestamp values?
Thanks!
1)
scala> sqlContext.sql("SELECT * FROM Logs as l where l.timestamp=(2012-10-08 16:10:36.0)").collect
java.lang.RuntimeException: [1.55] failure: ``)'' expected but 16 found
SELECT * FROM Logs as l where l.timestamp=(2012-10-08 16:10:36.0)
^
2)
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp="+date2timestamp(formatTime3.parse("2012-10-08 16:10:36.0"))).collect
java.lang.RuntimeException: [1.54] failure: ``UNION'' expected but 16 found
SELECT * FROM Logs as l where l.timestamp=2012-10-08 16:10:36.0
^

I figured out that the problem was, first of all, the precision of the timestamp, and also that the string I pass to represent the timestamp has to be cast as a String.
So this query works now:
sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestampLog as String) <= '2012-10-08 16:10:36'")

You forgot the quotation marks.
Try something with this syntax:
L.timestamp = '2012-07-16 00:00:00'
Alternatively, try
L.timestamp = CAST('2012-07-16 00:00:00' AS TIMESTAMP)
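For example, plugged back into the query from the question (same Logs table and alias):
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp = CAST('2012-07-16 00:00:00' AS TIMESTAMP)").collect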

Cast the string representation of the timestamp to a timestamp, e.g. cast('2012-10-10 12:00:00' as timestamp). Then you can do the comparison as timestamps, not strings. Instead of:
sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestamp as String) <= '2012-10-08 16:10:36'")
try
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp <= cast('2012-10-08 16:10:36' as timestamp)")

Sadly this didn't work for me. I am using Apache Spark 1.4.1. The following code is my solution:
Date date = new Date();
String query = "SELECT * FROM Logs as l where l.timestampLog <= CAST('" + new java.sql.Timestamp(date.getTime()) + "' as TIMESTAMP)";
sqlContext.sql(query);
Casting the timestampLog as string did not throw any errors but returned no data.
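For reference, a rough Scala equivalent of that workaround (same assumed table and column names, using the current time as the cutoff):
import java.sql.Timestamp

// Build the cutoff from the current time; Timestamp.toString produces a literal that CAST can parse.
val now = new Timestamp(System.currentTimeMillis())
val query = s"SELECT * FROM Logs as l where l.timestampLog <= CAST('$now' as TIMESTAMP)"
sqlContext.sql(query)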

Related

How to rename a column in Spark dataframe while using explode function

I am trying to use the explode array function in PySpark, and below is the code -
df_map_transformation.select(col("_name") , explode(arrays_zip(col("instances.Instance._name"), col("instances.Instance._id") ))).select(col("_name"), col("col.*")).printSchema()
Output -
root
|-- _name: string (nullable = true)
|-- 0: string (nullable = true)
|-- 1: string (nullable = true)
When I try to select the "_name" column, I am able to do so like this -
df_map_transformation.select(col("_name") , explode(arrays_zip(col("instances.Instance._name"), col("instances.Instance._id") ))).select(col("_name"), col("col.*")).select(col("_name")).show(50,False)
But the same does not work when trying to access the "0" or "1" column -
Error -
File "/usr/local/spark/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1614.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _gen_alias_696#696
Is there any way to rename columns "0" and "1", or extract them via select on the dataframe?
Use explode('col_name').alias('new_name')
Try casting the col column to struct<cola:string,colb:string>. You can choose your own column names inside the struct; for example, I have taken cola & colb.
Check the code below.
df_map_transformation.select(col("_name") , explode(arrays_zip(col("instances.Instance._name"), col("instances.Instance._id") ))).select(col("_name"), col("col").cast("struct<cola:string,colb:string>")).select(col("_name"),col("col.cola"),col("col.colb")).printSchema()
root
|-- _name: string (nullable = true)
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
You can also use withColumnRenamed:
(df_map_transformation
    .select(
        col("_name"),
        explode(arrays_zip(
            col("instances.Instance._name"),
            col("instances.Instance._id"))))
    .select(col("_name"), col("col.*"))
    .withColumnRenamed("0", "cola")
    .withColumnRenamed("1", "colb"))
df = (df_map_transformation
      .select(
          col("_name"),
          explode(arrays_zip(
              col("instances.Instance._name"),
              col("instances.Instance._id")
          )).alias('name', 'id')   # <-- alias can take multiple names
      ))
here's what it looks like in my notebook:

Is there a way to use pyspark.sql.functions.date_add with a col('column_name') as the second parameter instead of a static integer?

During an ETL process I have this one SAS date field that is in a 5-digit integer format, which indicates days since 01-01-1960. In order to make this data column more useful in analysis I would like to convert the column to a date data type field in Redshift.
Currently I am trying to do this in pyspark as follows:
I created a new column "sas_date" with the string literal "1960-01-01".
Using pyspark.sql.functions.date_add I pass the "sas_date" column as the start date parameter and the integer-valued 'arrival_date' column as the second parameter.
When the date_add function runs I get the error Column not iterable, even though I would think the arrival_date column being a series would mean it was iterable. But it's not, why?
When I remove the 'arrival_date' column and replace it with a static integer value (say 1) the date_add function will work.
i94 = i94.withColumn('arrival_date', col('arrival_date').cast(Int()))
i94 = i94.withColumn('sas_date', lit("1960-01-01"))
i94 = i94.withColumn('arrival_date', date_add(col('sas_date'), i94['arrival_date']))
I want to be able to pass my column so that the second date_add parameter will be dynamic. However, it seems date_add does not accept this? If date_add does not accomplish this, what other option do I have outside of using a UDF?
UPDATE:
State of data right before the date_add() operation
i94.printSchema()
root
|-- cic_id: double (nullable = true)
|-- visa_id: string (nullable = true)
|-- port_id: string (nullable = true)
|-- airline_id: string (nullable = true)
|-- cit_id: double (nullable = true)
|-- res_id: double (nullable = true)
|-- year: double (nullable = true)
|-- month: double (nullable = true)
|-- age: double (nullable = true)
|-- gender: string (nullable = true)
|-- arrival_date: integer (nullable = true)
|-- depart_date: double (nullable = true)
|-- date_begin: string (nullable = true)
|-- date_end: string (nullable = true)
|-- sas_date: string (nullable = false)
i94.limit(10).toPandas()
toPandas() result
I think you are absolutely right: date_add is designed to take int values only in Spark <3.0.0.
In the Spark Scala implementation I see the lines below.
They indicate that whatever value we pass to the date_add function is converted back into a Column with lit.
Spark <3.0.0:
def date_add(start: Column, days: Int): Column = date_add(start, lit(days))
Spark >=3.0.0:
def date_add(start: Column, days: Column): Column = withExpr { DateAdd(start.expr, days.expr) }
Now let's talk about the solution; I can think of two approaches.
Imports and preparing a small sample of your dataset:
import pyspark.sql.functions as f
import pyspark.sql.types as t
from datetime import datetime
from datetime import timedelta
l1 = [(5748517.0,'1960-01-01', 20574), (5748517.0,'1960-01-01', 20574), (5748517.0,'1960-01-01', 20574)]
df = spark.createDataFrame(l1).toDF('cic_id','sas_date','arrival_date')
df.show()
+---------+----------+------------+
| cic_id| sas_date|arrival_date|
+---------+----------+------------+
|5748517.0|1960-01-01| 20574|
|5748517.0|1960-01-01| 20574|
|5748517.0|1960-01-01| 20574|
+---------+----------+------------+
Now, there are two ways to achieve the functionality.
UDF way:
def date_add_(date, days):
    # Type check and convert to a datetime object
    # Format and other things should be handled more delicately
    if type(date) is not datetime:
        date = datetime.strptime('1960-01-01', "%Y-%m-%d")
    return date + timedelta(days)
date_add_udf = f.udf(date_add_, t.DateType())
df.withColumn('actual_arrival_date', date_add_udf(f.to_date('sas_date'), 'arrival_date')).show()
+---------+----------+------------+-------------------+
| cic_id| sas_date|arrival_date|actual_arrival_date|
+---------+----------+------------+-------------------+
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
+---------+----------+------------+-------------------+
Using expr evaluation:
df.withColumn('new_arrival_date', f.expr("date_add(sas_date, arrival_date)")).show()
+---------+----------+------------+----------------+
| cic_id| sas_date|arrival_date|new_arrival_date|
+---------+----------+------------+----------------+
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
+---------+----------+------------+----------------+
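On Spark 3.0+, where the Column overload quoted above exists, date_add can take the days column directly; a minimal Scala sketch, assuming a DataFrame df with the sas_date and arrival_date columns from the sample:
import org.apache.spark.sql.functions.{col, date_add, to_date}

// Spark 3.0+ only: the days argument of date_add may be a Column.
val withArrival = df.withColumn(
  "new_arrival_date",
  date_add(to_date(col("sas_date")), col("arrival_date")))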

How to assign a value to an existing empty dataframe column in Scala?

I am reading a CSV file which has a trailing | delimiter; in Spark 1.6 the load method then creates a last column in the dataframe with no name and no values.
df.withColumnRenamed(df.columns(83),"Invalid_Status").drop(df.col("Invalid_Status"))
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter","|").option("header","true").load("filepath")
val df2 = df.withColumnRenamed(df.columns(83),"Invalid_Status").
The result I expected:
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- Invalid_Status: string (nullable = true)
but the actual output is
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- : string (nullable = true)
with no value in the column, so I have to drop this column and create a new column again.
It is not completely clear what you want to do: just rename the column to Invalid_Status, or drop the column entirely. What I understand is that you are trying to operate (rename/drop) on the last column, which has no name.
But I will try to help you with both solutions.
To rename the column, keeping its (blank) values as they are:
val df2 = df.withColumnRenamed(df.columns.last,"Invalid_Status")
To just drop the last column without knowing its name, use:
val df3 = df.drop(df.columns.last)
And then add the "Invalid_Status" column with default values:
val requiredDf = df3.withColumn("Invalid_Status", lit("Any_Default_Value"))
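Putting the two steps together, a minimal end-to-end sketch, assuming the CSV load from the question and that you want the new column present but empty (null) rather than filled with a placeholder value:
import org.apache.spark.sql.functions.lit

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("header", "true")
  .load("filepath")

// Drop the unnamed trailing column, then add Invalid_Status as an empty (null) string column.
val requiredDf = df
  .drop(df.columns.last)
  .withColumn("Invalid_Status", lit(null).cast("string"))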

How to merge two fields (string type) of a DataFrame's column to generate a Date

I have a DataFrame whose simplified schema has two columns with 3 fields each:
root
|-- npaDetails: struct (nullable = true)
| |-- additionalInformation: struct (nullable = true)
| |-- npaStatus: struct (nullable = true)
| |-- npaDetails: struct (nullable = true)
|-- npaHeaderData: struct (nullable = true)
| |-- npaNumber: string (nullable = true)
| |-- npaDownloadDate: string (nullable = true)
| |-- npaDownloadTime: string (nullable = true)
Possible values:
npaDownloadDate - "30JAN17"
npaDownloadTime - "19.50.00"
I need to compare two rows in a DataFrame with this schema, to determine which one is "fresher". To do so I need to merge the fields npaDownloadDate and npaDownloadTime to generate a Date that I can compare easily.
Below is the code I have written so far. It works, but I think it takes more steps than necessary and I'm sure that Scala offers better solutions than my approach.
import java.text.SimpleDateFormat

val parquetFileDF = sqlContext.read.parquet("MyParquet.parquet")
val relevantRows = parquetFileDF.filter($"npaHeaderData.npaNumber" === "123456")
val date = relevantRows.select($"npaHeaderData.npaDownloadDate").head().getString(0)
val time = relevantRows.select($"npaHeaderData.npaDownloadTime").head().getString(0)
val dateTime = new SimpleDateFormat("ddMMMyykk.mm.ss").parse(date + time)
//I would replicate the previous steps to get dateTime2
if(dateTime.before(dateTime2))
println("dateTime is before dateTime2")
So the output of "30JAN17" and "19.50.00" would be Mon Jan 30 19:50:00 GST 2017
Is there another way to generate a Date from two fields of a column, without extracting and merging them as strings? Or even better, is it possible to directly compare both values (date and time) between two different rows in a dataframe to know which one has an older date?
In Spark 2.2,
df.filter(
  to_date(
    concat(
      $"npaHeaderData.npaDownloadDate",
      $"npaHeaderData.npaDownloadTime"),
    fmt = "[your format here]") < lit(someDate))
I'd use
import org.apache.spark.sql.functions._
df.withColumn("some_name", date_format(unix_timestamp(
concat($"npaHeaderData.npaDownloadDate", $"npaHeaderData.npaDownloadTime"),
"ddMMMyykk.mm.ss").cast("timestamp"),
"EEE MMM d HH:mm:ss z yyyy"))

How to get the first row data of each list?

My DataFrame is like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
and I get the column of probability using
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
the result is :
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get the first column of data for each row, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
how can this be done?
The input is the standard random forest output; the Data above comes from val Data = predictions.select("docID", "probability")
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
and I want to get the first value of the "probability" column
You can use the Column.apply method to get the n-th item of an array column, in this case the first element (using index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the dataframe into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()