I'm trying to create a calendar file with columns for day, month, etc. The code below works fine, but I couldn't find a clean way to extract the week of year (1-52). In Spark 3.0+, this line no longer works: .withColumn("week_of_year", date_format(col("day_id"), "W"))
I know that I can create a view/table and then run a SQL query on it to extract the week_of_year, but is there no better way to do it?
df.withColumn("day_id", to_date(col("day_id"), date_fmt))
.withColumn("week_day", date_format(col("day_id"), "EEEE"))
.withColumn("month_of_year", date_format(col("day_id"), "M"))
.withColumn("year", date_format(col("day_id"), "y"))
.withColumn("day_of_month", date_format(col("day_id"), "d"))
.withColumn("quarter_of_year", date_format(col("day_id"), "Q"))
It seems those week-based patterns are no longer supported in Spark 3+:
Caused by: java.lang.IllegalArgumentException: All week-based patterns are unsupported since Spark 3.0, detected: w, Please use the SQL function EXTRACT instead
You can use this:
import org.apache.spark.sql.functions._
df.withColumn("week_of_year", weekofyear($"date"))
TESTING
INPUT
val df = List("2021-05-15", "1985-10-05")
.toDF("date")
.withColumn("date", to_date($"date", "yyyy-MM-dd")
df.show
+----------+
| date|
+----------+
|2021-05-15|
|1985-10-05|
+----------+
OUTPUT
df.withColumn("week_of_year", weekofyear($"date")).show
+----------+------------+
| date|week_of_year|
+----------+------------+
|2021-05-15| 19|
|1985-10-05| 40|
+----------+------------+
The exception you saw recommends using the EXTRACT SQL function instead: https://spark.apache.org/docs/3.0.0/api/sql/index.html#extract
val df = Seq(("2019-11-16 16:50:59.406")).toDF("input_timestamp")
df.selectExpr("input_timestamp", "extract(week FROM input_timestamp) as w").show
+--------------------+---+
| input_timestamp| w|
+--------------------+---+
|2019-11-16 16:50:...| 46|
+--------------------+---+
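To tie this back to the calendar chain in the question, a minimal Spark 3 sketch could look like this (one possible rewrite, not the only one; note that month, year, dayofmonth, quarter and weekofyear return integers, whereas the original date_format calls returned strings, and date_fmt is assumed to be the input format from the question):
import org.apache.spark.sql.functions._

val calendar = df
  .withColumn("day_id", to_date(col("day_id"), date_fmt))
  .withColumn("week_day", date_format(col("day_id"), "EEEE"))
  .withColumn("month_of_year", month(col("day_id")))
  .withColumn("year", year(col("day_id")))
  .withColumn("day_of_month", dayofmonth(col("day_id")))
  .withColumn("quarter_of_year", quarter(col("day_id")))
  .withColumn("week_of_year", weekofyear(col("day_id")))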
Related
I have a sample CSV file with columns as shown below.
col1,col2
1,57.5
2,24.0
3,56.7
4,12.5
5,75.5
I want a new timestamp column in the HH:mm:ss format, and the timestamp should keep increasing by one second per row, as shown below.
col1,col2,ts
1,57.5,00:00:00
2,24.0,00:00:01
3,56.7,00:00:02
4,12.5,00:00:03
5,75.5,00:00:04
Thanks in advance for your help.
I can propose a solution based on PySpark; the Scala implementation should translate almost directly (a rough Scala sketch follows after the example output below).
The idea is to create a column filled with a constant timestamp (1980 here as an example, the value does not matter) and add seconds to it based on your first column (the row number). Then you simply reformat the timestamp to show only the time of day.
import pyspark.sql.functions as psf
df = (df
.withColumn("ts", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='YYYY-MM-dd HH:mm:ss'))
.withColumn("ts", psf.col("ts") + psf.col("i") - 1)
.withColumn("ts", psf.from_unixtime("ts", format='HH:mm:ss'))
)
df.show(2)
+---+----+---------+
| i| x| ts|
+---+----+---------+
| 1|57.5| 00:00:00|
| 2|24.0| 00:00:01|
+---+----+---------+
only showing top 2 rows
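Since the Scala implementation should be almost transparent, here is a rough Scala sketch of the same idea (hypothetical columns i and x as in the data generation below, and yyyy instead of the week-based YYYY pattern):
import org.apache.spark.sql.functions._

val withTs = df
  .withColumn("ts", unix_timestamp(lit("1980-01-01 00:00:00"), "yyyy-MM-dd HH:mm:ss"))
  .withColumn("ts", col("ts") + col("i") - 1)
  .withColumn("ts", from_unixtime(col("ts"), "HH:mm:ss"))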
Data generation
df = spark.createDataFrame([(1,57.5),
(2,24.0),
(3,56.7),
(4,12.5),
(5,75.5)], ['i','x'])
df.show(2)
+---+----+
| i| x|
+---+----+
| 1|57.5|
| 2|24.0|
+---+----+
only showing top 2 rows
Update: if you don't have a row number in your csv (from your comment)
In that case, you will need row_number function.
Numbering rows is not straightforward in Spark because the data is distributed across independent partitions and locations, and the row order observed in the CSV is not guaranteed to be preserved when Spark maps file rows to partitions. If the order in the file matters, it is probably better not to number your rows with Spark at all; a pre-processing step based on pandas, looping over your files one at a time, could do the job.
Anyway, I can propose you a solution if you don't mind having row order different from the one in your csv stored in disk.
import pyspark.sql.window as psw
w = psw.Window.partitionBy().orderBy("x")
(df
.drop("i")
.withColumn("i", psf.row_number().over(w))
.withColumn("Timestamp", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='YYYY-MM-dd HH:mm:ss'))
.withColumn("Timestamp", psf.col("Timestamp") + psf.col("i") - 1)
.withColumn("Timestamp", psf.from_unixtime("Timestamp", format='HH:mm:ss'))
.show(2)
)
+----+---+---------+
| x| i|Timestamp|
+----+---+---------+
|12.5| 1| 00:00:00|
|24.0| 2| 00:00:01|
+----+---+---------+
only showing top 2 rows
In terms of efficiency this is bad: an empty partitionBy moves all the data into a single partition, which is much like collecting everything on the driver. For this step, Spark is overkill.
You could also add a temporary constant column and order on that instead. In this particular example it produces the expected output, but there is no guarantee it preserves the file order in general:
w2 = psw.Window.partitionBy().orderBy("temp")
(df
.drop("i")
.withColumn("temp", psf.lit(1))
.withColumn("i", psf.row_number().over(w2))
.withColumn("Timestamp", psf.unix_timestamp(timestamp=psf.lit('1980-01-01 00:00:00'), format='YYYY-MM-dd HH:mm:ss'))
.withColumn("Timestamp", psf.col("Timestamp") + psf.col("i") - 1)
.withColumn("Timestamp", psf.from_unixtime("Timestamp", format='HH:mm:ss'))
.show(2)
)
+----+----+---+---------+
| x|temp| i|Timestamp|
+----+----+---+---------+
|57.5| 1| 1| 00:00:00|
|24.0| 1| 2| 00:00:01|
+----+----+---+---------+
only showing top 2 rows
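For completeness, a rough Scala sketch of the row_number variant with the temporary ordering column (same caveat: an empty partitionBy puts everything in one partition):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy().orderBy("temp")

val numbered = df
  .withColumn("temp", lit(1))
  .withColumn("i", row_number().over(w))
  .withColumn("Timestamp", unix_timestamp(lit("1980-01-01 00:00:00"), "yyyy-MM-dd HH:mm:ss"))
  .withColumn("Timestamp", col("Timestamp") + col("i") - 1)
  .withColumn("Timestamp", from_unixtime(col("Timestamp"), "HH:mm:ss"))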
Given:
val df = Seq((1L, "04-04-2015")).toDF("id", "date")
val df2 = df.withColumn("month", from_unixtime(unix_timestamp($"date", "dd/MM/yy"), "MMMMM"))
df2.show()
I got this output:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| null|
+---+----------+-----+
However, I want the output to be as below:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
How can I do that in sparkSQL using Scala?
This should do it:
val df2 = df.withColumn("month", date_format(to_date($"date", "dd-MM-yyyy"), "MMMM"))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
NOTE:
The first string (to_date) must match the format of your existing date
Be careful with: "dd-MM-yyyy" vs "MM-dd-yyyy"
The second string (date_format) is the format of the output
Docs:
to_date
date_format
Nothing is wrong with your approach; just keep the format string consistent with your date column (here "dd-MM-yyyy" rather than "dd/MM/yy").
Not exactly related to this question, but for anyone who wants the month as an integer there is a month function (note that it takes a date column, so parse the string first):
val df2 = df.withColumn("month", month(to_date($"date", "dd-MM-yyyy")))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| 4|
+---+----------+-----+
In the same way, you can use the year function to get only the year.
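For example, a quick sketch (same assumption about the dd-MM-yyyy input format):
import org.apache.spark.sql.functions.{to_date, year}

val df3 = df.withColumn("year", year(to_date($"date", "dd-MM-yyyy")))
df3.show()
// the 04-04-2015 row would show year = 2015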
I want to understand the best way to solve date-related problems in Spark SQL. I'm trying to solve a simple problem where I have a file with date ranges like below:
startdate,enddate
01/01/2018,30/01/2018
01/02/2018,28/02/2018
01/03/2018,30/03/2018
and another table that has date and counts:
date,counts
03/01/2018,10
25/01/2018,15
05/02/2018,23
17/02/2018,43
Now all I want to find is sum of counts for each date range, so the output expected is:
startdate,enddate,sum(count)
01/01/2018,30/01/2018,25
01/02/2018,28/02/2018,66
01/03/2018,30/03/2018,0
Following is the code I have written, but it's giving me a Cartesian result set:
val spark = SparkSession.builder().appName("DateBasedCount").master("local").getOrCreate()
import spark.implicits._
val df1 = spark.read.option("header","true").csv("dateRange.txt").toDF("startdate","enddate")
val df2 = spark.read.option("header","true").csv("dateCount").toDF("date","count")
df1.createOrReplaceTempView("daterange")
df2.createOrReplaceTempView("datecount")
val res = spark.sql("select startdate,enddate,date,visitors from daterange left join datecount on date >= startdate and date <= enddate")
res.rdd.foreach(println)
The output is:
| startdate| enddate| date|visitors|
|01/01/2018|30/01/2018|03/01/2018| 10|
|01/01/2018|30/01/2018|25/01/2018| 15|
|01/01/2018|30/01/2018|05/02/2018| 23|
|01/01/2018|30/01/2018|17/02/2018| 43|
|01/02/2018|28/02/2018|03/01/2018| 10|
|01/02/2018|28/02/2018|25/01/2018| 15|
|01/02/2018|28/02/2018|05/02/2018| 23|
|01/02/2018|28/02/2018|17/02/2018| 43|
|01/03/2018|30/03/2018|03/01/2018| 10|
|01/03/2018|30/03/2018|25/01/2018| 15|
|01/03/2018|30/03/2018|05/02/2018| 23|
|01/03/2018|30/03/2018|17/02/2018| 43|
Now if I group by startdate and enddate with a sum on count, I see the following result, which is incorrect:
| startdate| enddate| sum(count)|
|01/01/2018|30/01/2018| 91.0|
|01/02/2018|28/02/2018| 91.0|
|01/03/2018|30/03/2018| 91.0|
So how do we handle this, and what is the best way to deal with dates in Spark SQL? Should we build the columns as DateType in the first place, or read them as strings and cast them to dates when necessary?
The problem is that Spark does not automatically interpret your dates as dates; they are just strings. The solution is therefore to convert them into dates:
val df1 = spark.read.option("header","true").csv("dateRange.txt")
.toDF("startdate","enddate")
.withColumn("startdate", to_date(unix_timestamp($"startdate", "dd/MM/yyyy").cast("timestamp")))
.withColumn("enddate", to_date(unix_timestamp($"enddate", "dd/MM/yyyy").cast("timestamp")))
val df2 = spark.read.option("header","true").csv("dateCount")
.toDF("date","count")
.withColumn("date", to_date(unix_timestamp($"date", "dd/MM/yyyy").cast("timestamp")))
Then use the same code as before. The output of the SQL command is now:
+----------+----------+----------+------+
| startdate| enddate| date|counts|
+----------+----------+----------+------+
|2018-01-01|2018-01-30|2018-01-03| 10|
|2018-01-01|2018-01-30|2018-01-25| 15|
|2018-02-01|2018-02-28|2018-02-05| 23|
|2018-02-01|2018-02-28|2018-02-17| 43|
|2018-03-01|2018-03-30| null| null|
+----------+----------+----------+------+
If the last line should be ignored, simply change to an inner join instead.
Using df.groupBy("startdate", "enddate").sum() on this new dataframe will give the wanted output.
I am reading the data from HDFS into DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data as I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
F.regexp_extract($"value", r, 1).as("id"),
F.regexp_extract($"value", r, 2).as("community")
).show()
A bunch of regular expressions should give you the required result:
import org.apache.spark.sql.functions._

df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use the split and regexp_replace built-in functions to get your desired output dataframe:
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
I hope the answer is helpful
I have two Spark dataframe's, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns specified in another file.
For example, the column look up file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The idea is how to build a where condition dynamically for the above scenario, because the lookup file is configurable, so it might have 1 to n fields.
You can use the except dataframe method. For simplicity, I'm assuming that the columns to use are given in two lists. The order of both lists must be correct: columns at the same position in the lists will be compared (regardless of column name). After except, use a join to get the missing columns back from the first dataframe.
val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
.toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
If you're doing this from a SQL query, I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that aren't used in the diff (e.g. age), you can reselect the data again based on your diff results. This may not be the optimal way of doing it, but it should work.
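A rough sketch of that approach, reusing the sample dataframes from the answer above (the renames are for illustration; except plays the role of the dataframe diff, and the final join pulls back the extra columns such as age):
import org.apache.spark.sql.functions.{broadcast, col}

// Line df2's key columns up with df1's names, then diff on the keys only
val keyCols = Seq("name", "empNo")
val df2Renamed = df2
  .withColumnRenamed("eName", "name")
  .withColumnRenamed("eNo", "empNo")

val missingKeys = df1.select(keyCols.map(col): _*)
  .except(df2Renamed.select(keyCols.map(col): _*))

// Reselect the remaining df1 columns (e.g. age) for those keys
val result = df1.join(broadcast(missingKeys), keyCols)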