Is there any way to prevent Spark SQL functions from nulling values?
For example, I have the following dataframe:
df.show
+--------------------+--------------+------+------------+
| Title|Year Published|Rating|Length (Min)|
+--------------------+--------------+------+------------+
| 101 Dalmatians| 01/1996| G| 103|
|101 Dalmatians (A...| 1961| G| 79|
|101 Dalmations II...| 2003| G| 70|
I want to apply Spark SQL's date_format function to the Year Published column.
val sql = """date_format(`Year Published`, 'MM/yyyy')"""
val df2 = df.withColumn("Year Published", expr(sql))
df2.show
+--------------------+--------------+------+------------+
| Title|Year Published|Rating|Length (Min)|
+--------------------+--------------+------+------------+
| 101 Dalmatians| null| G| 103|
|101 Dalmatians (A...| 01/1961| G| 79|
|101 Dalmations II...| 01/2003| G| 70|
The first row of the Year Published column has been nulled as the original value was in a different date format than the other dates.
This behaviour is not unique to date_format; for example, format_number will null non-numeric values.
With my dataset I expect mixed date formats and dirty data with unparseable values. If the value of a cell cannot be formatted, I want to return the current value rather than null.
Is there a way to make Spark keep the original value from df instead of null when the function applied in df2 fails?
What I've tried
I've looked at wrapping Expressions in org.apache.spark.sql.catalyst.expressions but could not see a way to replace the existing functions.
The only working solution I could find is creating my own date_format and registering it as a UDF, but this isn't practical for every function. I'm looking for a solution that never returns null when the input to a function is non-null, or an automated way to wrap all existing Spark functions.
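For illustration, here is a rough sketch of the kind of UDF I mean (safe_date_format is a made-up name, and it only handles a bare yyyy input, while the real data has many more formats):
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try
import org.apache.spark.sql.functions.expr
// Made-up "safe" variant of date_format: reformat a bare year like "1961"
// to MM/yyyy, and fall back to the original value when parsing fails.
spark.udf.register("safe_date_format", (value: String) =>
  Option(value).map { v =>
    Try(LocalDate.parse(s"${v.trim}-01-01"))
      .map(_.format(DateTimeFormatter.ofPattern("MM/yyyy")))
      .getOrElse(v)
  }.orNull
)
val dfSafe = df.withColumn("Year Published", expr("safe_date_format(`Year Published`)"))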
You could probably use the coalesce function for your purposes:
coalesce(date_format(`Year Published`, 'MM/yyyy'), `Year Published`)
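For example, plugged into the same withColumn call as in the question (a sketch using the dataframe above):
import org.apache.spark.sql.functions.expr
val sql = """coalesce(date_format(`Year Published`, 'MM/yyyy'), `Year Published`)"""
val df2 = df.withColumn("Year Published", expr(sql))
// rows whose value cannot be parsed keep their original string instead of becoming null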
Related
I have a dataframe like this:
df.show
+----+-------+
| age| name|
+----+-------+
| 20|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
I need to put it into a Scala Map like (Michael -> 20, Andy -> 30, Justin -> 19).
How can I achieve this?
Plain and simple - convert to static types and collect:
df.select($"name", $"age".cast("int")).as[(String, Int)].collect.toMap
Though in practice you probably won't find much use for this, given that it collects all the data into driver memory.
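A quick usage sketch, assuming the dataframe shown above:
import spark.implicits._
val nameToAge: Map[String, Int] =
  df.select($"name", $"age".cast("int")).as[(String, Int)].collect.toMap
nameToAge("Michael")   // 20
nameToAge.get("Andy")  // Some(30)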
I am trying to filter out bad records from a CSV file using PySpark. A code snippet is given below.
from pyspark.sql import SparkSession
schema="employee_id int,name string,address string,dept_id int"
spark = SparkSession.builder.appName("TestApp").getOrCreate()
data = (spark.read.format("csv")
        .option("header", True)
        .schema(schema)
        .option("badRecordsPath", "/tmp/bad_records")
        .load("/path/to/csv/file"))
schema_for_bad_record="path string,record string,reason string"
bad_records_frame=spark.read.schema(schema_for_bad_record).json("/tmp/bad_records")
bad_records_frame.select("reason").show()
The valid dataframe is
+-----------+-------+-------+-------+
|employee_id| name|address|dept_id|
+-----------+-------+-------+-------+
| 1001| Floyd| Delhi| 1|
| 1002| Watson| Mumbai| 2|
| 1004|Thomson|Chennai| 3|
| 1005| Bonila| Pune| 4|
+-----------+-------+-------+-------+
In one of the records, both employee_id and dept_id have incorrect values, but the reason shows only one column's issue.
java.lang.NumberFormatException: For input string: "abc"
Is there any way to show reasons for multiple columns in case of failure?
I have a sample CSV file with columns as shown below.
col1,col2
1,57.5
2,24.0
3,56.7
4,12.5
5,75.5
I want a new Timestamp column in HH:mm:ss format, and the timestamp should increase by one second per row as shown below.
col1,col2,ts
1,57.5,00:00:00
2,24.0,00:00:01
3,56.7,00:00:02
4,12.5,00:00:03
5,75.5,00:00:04
Thanks in advance for your help.
I can propose a solution based on PySpark; the Scala implementation should translate almost directly.
My idea is to create a column filled with a constant timestamp (here 1980 as an example, but it does not matter) and add seconds to it based on your first column (the row number). Then you just reformat the timestamp to show only the time of day:
import pyspark.sql.functions as psf
df = (df
.withColumn("ts", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='YYYY-MM-dd HH:mm:ss'))
.withColumn("ts", psf.col("ts") + psf.col("i") - 1)
.withColumn("ts", psf.from_unixtime("ts", format='HH:mm:ss'))
)
df.show(2)
+---+----+---------+
| i| x| ts|
+---+----+---------+
| 1|57.5| 00:00:00|
| 2|24.0| 00:00:01|
+---+----+---------+
only showing top 2 rows
Data generation
df = spark.createDataFrame([(1,57.5),
(2,24.0),
(3,56.7),
(4,12.5),
(5,75.5)], ['i','x'])
df.show(2)
+---+----+
| i| x|
+---+----+
| 1|57.5|
| 2|24.0|
+---+----+
only showing top 2 rows
Update: if you don't have a row number in your CSV (from your comment)
In that case, you will need the row_number function.
Numbering rows is not straightforward in Spark because the data are distributed over independent partitions and locations, so the order observed in the CSV will not necessarily be respected when mapping file rows to partitions. If the order in the file matters, I think it would be better not to use Spark to number your rows; a pre-processing step based on pandas, looping over your files one at a time, could make it work.
Anyway, I can propose a solution if you don't mind the row order being different from the one in the CSV stored on disk.
import pyspark.sql.window as psw
w = psw.Window.partitionBy().orderBy("x")
(df
.drop("i")
.withColumn("i", psf.row_number().over(w))
.withColumn("Timestamp", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='YYYY-MM-dd HH:mm:ss'))
.withColumn("Timestamp", psf.col("Timestamp") + psf.col("i") - 1)
.withColumn("Timestamp", psf.from_unixtime("Timestamp", format='HH:mm:ss'))
.show(2)
)
+----+---+---------+
| x| i|Timestamp|
+----+---+---------+
|12.5| 1| 00:00:00|
|24.0| 2| 00:00:01|
+----+---+---------+
only showing top 2 rows
In terms of efficiency this is bad (it is like collecting all the data onto a single node) because the window has no partitionBy. For this step, using Spark is overkill.
You could also add a temporary constant column and order by that. In this particular example it produces the expected output, but I am not sure it works well in general:
w2 = psw.Window.partitionBy().orderBy("temp")
(df
.drop("i")
.withColumn("temp", psf.lit(1))
.withColumn("i", psf.row_number().over(w2))
.withColumn("Timestamp", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='YYYY-MM-dd HH:mm:ss'))
.withColumn("Timestamp", psf.col("Timestamp") + psf.col("i") - 1)
.withColumn("Timestamp", psf.from_unixtime("Timestamp", format='HH:mm:ss'))
.show(2)
)
+----+----+---+---------+
| x|temp| i|Timestamp|
+----+----+---+---------+
|57.5| 1| 1| 00:00:00|
|24.0| 1| 2| 00:00:01|
+----+----+---+---------+
only showing top 2 rows
I want to understand the best way to solve date-related problems in Spark SQL. I'm trying to solve a simple problem where I have a file with date ranges like below:
startdate,enddate
01/01/2018,30/01/2018
01/02/2018,28/02/2018
01/03/2018,30/03/2018
and another table that has date and counts:
date,counts
03/01/2018,10
25/01/2018,15
05/02/2018,23
17/02/2018,43
Now all I want to find is the sum of counts for each date range, so the expected output is:
startdate,enddate,sum(count)
01/01/2018,30/01/2018,25
01/02/2018,28/02/2018,66
01/03/2018,30/03/2018,0
Following is the code I have written, but it's giving me a Cartesian result set:
val spark = SparkSession.builder().appName("DateBasedCount").master("local").getOrCreate()
import spark.implicits._
val df1 = spark.read.option("header","true").csv("dateRange.txt").toDF("startdate","enddate")
val df2 = spark.read.option("header","true").csv("dateCount").toDF("date","count")
df1.createOrReplaceTempView("daterange")
df2.createOrReplaceTempView("datecount")
val res = spark.sql("select startdate,enddate,date,visitors from daterange left join datecount on date >= startdate and date <= enddate")
res.rdd.foreach(println)
The output is:
+----------+----------+----------+--------+
| startdate|   enddate|      date|visitors|
+----------+----------+----------+--------+
|01/01/2018|30/01/2018|03/01/2018|      10|
|01/01/2018|30/01/2018|25/01/2018|      15|
|01/01/2018|30/01/2018|05/02/2018|      23|
|01/01/2018|30/01/2018|17/02/2018|      43|
|01/02/2018|28/02/2018|03/01/2018|      10|
|01/02/2018|28/02/2018|25/01/2018|      15|
|01/02/2018|28/02/2018|05/02/2018|      23|
|01/02/2018|28/02/2018|17/02/2018|      43|
|01/03/2018|30/03/2018|03/01/2018|      10|
|01/03/2018|30/03/2018|25/01/2018|      15|
|01/03/2018|30/03/2018|05/02/2018|      23|
|01/03/2018|30/03/2018|17/02/2018|      43|
+----------+----------+----------+--------+
Now if I group by startdate and enddate with a sum on count, I see the following result, which is incorrect:
+----------+----------+----------+
| startdate|   enddate|sum(count)|
+----------+----------+----------+
|01/01/2018|30/01/2018|      91.0|
|01/02/2018|28/02/2018|      91.0|
|01/03/2018|30/03/2018|      91.0|
+----------+----------+----------+
So how do we handle this, and what is the best way to deal with dates in Spark SQL? Should we build the columns as DateType in the first place, or read them as strings and cast them to dates when necessary?
The problem is that your dates are not automatically interpreted as dates by Spark; they are just strings. The solution is therefore to convert them into dates:
val df1 = spark.read.option("header","true").csv("dateRange.txt")
.toDF("startdate","enddate")
.withColumn("startdate", to_date(unix_timestamp($"startdate", "dd/MM/yyyy").cast("timestamp")))
.withColumn("enddate", to_date(unix_timestamp($"enddate", "dd/MM/yyyy").cast("timestamp")))
val df2 = spark.read.option("header","true").csv("dateCount")
.toDF("date","count")
.withColumn("date", to_date(unix_timestamp($"date", "dd/MM/yyyy").cast("timestamp")))
Then use the same code as before. The output of the SQL command is now:
+----------+----------+----------+------+
| startdate| enddate| date|counts|
+----------+----------+----------+------+
|2018-01-01|2018-01-30|2018-01-03| 10|
|2018-01-01|2018-01-30|2018-01-25| 15|
|2018-02-01|2018-02-28|2018-02-05| 23|
|2018-02-01|2018-02-28|2018-02-17| 43|
|2018-03-01|2018-03-30| null| null|
+----------+----------+----------+------+
If the last line should be ignored, simply change to an inner join instead.
Using df.groupBy("startdate", "enddate").sum() on this new dataframe will give the wanted output.
I am using CassandraSQLContext from spark-shell to query data from Cassandra. I want to know two things: first, how to fetch more than 20 rows using CassandraSQLContext, and second, how to display the full value of a column. As you can see below, by default it appends dots to long string values.
Code :
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("KeySpace")
val maxDF = csc.sql("SQL_QUERY" )
maxDF.show
Output:
+--------------------+--------------------+-----------------+--------------------+
| id| Col2| Col3| Col4|
+--------------------+--------------------+-----------------+--------------------+
|8wzloRMrGpf8Q3bbk...| Value1| X| K1|
|AxRfoHDjV1Fk18OqS...| Value2| Y| K2|
|FpMVRlaHsEOcHyDgy...| Value3| Z| K3|
|HERt8eFLRtKkiZndy...| Value4| U| K4|
|nWOcbbbm8ZOjUSNfY...| Value5| V| K5|
If you want to print the whole value of a column in Scala, you just need to set the truncate argument of the show method to false:
maxDf.show(false)
and if you wish to show more than 20 rows:
// example showing 30 rows of
// maxDf untruncated
maxDf.show(30, false)
For PySpark, you'll need to specify the argument name:
maxDF.show(truncate = False)
Alternatively you can use take, but you won't get the output in a nice tabular form; instead it is converted to a Scala object:
maxDF.take(50)
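For example, to print each collected Row on its own line:
maxDF.take(50).foreach(println)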