I have a data file in which a column datetime has values like 01-01-2011 00:00. I want to extract the year, month and hour from this value.
Code:
val actualData=df1.select(year(col("datetime")).as("Year"),
month(col("datetime")).as("Month"),
hour(col("datetime")).as("Hour"),
dayofmonth(col("datetime")).as("DayOfMonth"))
Output I am getting:
+----+-----+----+----------+
|Year|Month|Hour|DayOfMonth|
+----+-----+----+----------+
|null| null|null|      null|
|null| null|null|      null|
|null| null|null|      null|
|null| null|null|      null|
+----+-----+----+----------+
It seems the datetime column does not have the appropriate datatype. Casting the datetime column to a date should help:
df1 = df1.withColumn("datetime",to_date(col("datetime"),"dd-MM-yyyy HH:mm"))
The format of "datetime" is not "yyyy-MM-dd HH:mm", so it need to specify the format of it.
val actualData=df1.select(year(to_date(col("datetime"), "dd-MM-yyyy HH:mm")).as("Year"),
month(to_date(col("datetime"), "dd-MM-yyyy HH:mm")).as("Month"),
hour(to_timestamp(col("datetime"), "dd-MM-yyyy HH:mm")).as("Hour"),
dayofmonth(to_date(col("datetime"), "dd-MM-yyyy HH:mm")).as("DayOfMonth"))
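If you prefer to parse only once, here is a minimal sketch of an equivalent approach (assuming df1 is the frame from the question): convert the string to a proper timestamp in a helper column and extract every part from that.
import org.apache.spark.sql.functions._

// Parse the string once into a real timestamp, then derive all parts from it.
val withTs = df1.withColumn("ts", to_timestamp(col("datetime"), "dd-MM-yyyy HH:mm"))
val actualData = withTs.select(
  year(col("ts")).as("Year"),
  month(col("ts")).as("Month"),
  hour(col("ts")).as("Hour"),
  dayofmonth(col("ts")).as("DayOfMonth"))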
Related
I have a self-written function that takes a dataframe and returns the whole dataframe plus a new column. That new column must not have a fixed name; instead, the current month should be part of the new column name, e.g. "forecast_august2022".
I tried it like this:
.withColumnRenamed(
old_columnname,
new_columnname
)
But I do not know how to create the new column name by concatenating a fixed prefix (forecast_) with the current month. Ideas?
You can define a variable at the start with the current month and year, and use it in an f-string when adding the column with withColumn:
from pyspark.sql import functions as F
import datetime
mydate = datetime.datetime.now()
month_nm=mydate.strftime("%B%Y") #gives you July2022 for today
dql1=spark.range(3).toDF("ID")
dql1.withColumn(f"forecast_{month_nm}",F.lit(0)).show()
#output
+---+-----------------+
| ID|forecast_July2022|
+---+-----------------+
| 0| 0|
| 1| 0|
| 2| 0|
+---+-----------------+
You can do it like this:
>>> import datetime
>>> from pyspark.sql.functions import col
>>> #provide month number
>>> month_num = "3"
>>> datetime_object = datetime.datetime.strptime(month_num, "%m")
>>> full_month_name = datetime_object.strftime("%B")
>>> df.withColumn(("newcol"+"_"+full_month_name+"2022"),col('period')).show()
+------+-------+------+----------------+
|period|product|amount|newcol_March2022|
+------+-------+------+----------------+
| 20191| prod1| 30| 20191|
| 20192| prod1| 30| 20192|
| 20191| prod2| 20| 20191|
| 20191| prod3| 60| 20191|
| 20193| prod1| 30| 20193|
| 20193| prod2| 30| 20193|
+------+-------+------+----------------+
I am new to Spark and Scala and I would like to convert a column of string dates into Unix epochs. My dataframe looks like this:
+----------+-------+
|     Dates|Reports|
+----------+-------+
|2020-07-20|     34|
|2020-07-21|     86|
|2020-07-22|    129|
|2020-07-23|     98|
+----------+-------+
The output should be
+----------+-------+
|     Dates|Reports|
+----------+-------+
|1595203200|     34|
|1595289600|     86|
|1595376000|    129|
|1595462400|     98|
+----------+-------+
Use unix_timestamp.
val df = Seq(("2020-07-20")).toDF("date")
df.show
df.withColumn("unix_time", unix_timestamp('date, "yyyy-MM-dd")).show
+----------+
| date|
+----------+
|2020-07-20|
+----------+
+----------+----------+
| date| unix_time|
+----------+----------+
|2020-07-20|1595203200|
+----------+----------+
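Applied to the frame from the question, a sketch could look like this (assuming df here is the frame with the Dates and Reports columns, and that the string column is simply overwritten; the exact epoch values in the expected output also assume a UTC session time zone):
import org.apache.spark.sql.functions.{col, unix_timestamp}

// Replace the yyyy-MM-dd strings with their Unix-epoch seconds.
val result = df.withColumn("Dates", unix_timestamp(col("Dates"), "yyyy-MM-dd"))
result.show()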
I am trying to extract in PySpark the date of the Sunday of every given week in a year. Week and year are in the format yyyyww. This works for every week except the first week, where I get a null value. Here is the sample code and result:
columns = ['id', 'week_year']
vals = [
(1, 201952),
(2, 202001),
(3, 202002),
(4, 201901),
(5, 201902)
]
df = spark.createDataFrame(vals, columns)
+---+---------+
| id|week_year|
+---+---------+
|  1|   201952|
|  2|   202001|
|  3|   202002|
|  4|   201901|
|  5|   201902|
+---+---------+
df = df.withColumn("day", to_timestamp(concat(df.week_year, lit("-Sunday")), 'yyyyww-E'))
As a result I got
+---+---------+-------------------+
| id|week_year|                day|
+---+---------+-------------------+
|  1|   201952|2019-12-22 00:00:00|
|  2|   202001|               null|
|  3|   202002|2020-01-05 00:00:00|
|  4|   201901|               null|
|  5|   201902|2019-01-06 00:00:00|
+---+---------+-------------------+
Do you have an idea why it does not work for the first week? It also seems strange to me that 5.01 and 6.01 are in the second week, not in the first.
If you look at the calendar for 2020, the year starts on a Wednesday, which is in the middle of the 1st week, and that first week does not have a Sunday within the year. The same goes for 2019. That is why 2020-01-05 falls in the second week.
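You can reproduce that numbering outside Spark with plain java.time; a small Scala sketch (assuming Sunday-first US week fields, which appears to match the week numbering used here):
import java.time.LocalDate
import java.time.temporal.WeekFields
import java.util.Locale

// Sunday-first weeks, where week 1 is the week that contains January 1st.
val wf = WeekFields.of(Locale.US)
// The part of week 1 lying in 2020 is Wed 2020-01-01 .. Sat 2020-01-04; its Sunday (2019-12-29) is still in 2019.
println(LocalDate.of(2020, 1, 1).get(wf.weekOfYear()))  // 1
// The first Sunday of 2020 already belongs to week 2.
println(LocalDate.of(2020, 1, 5).get(wf.weekOfYear()))  // 2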
Hope this helps!
I have problems converting a date to a timestamp: in Spark, converting a date to a timestamp with unix_timestamp returns null.
scala> import org.apache.spark.sql.functions.unix_timestamp
scala> spark.sql("select from_unixtime(unix_timestamp(('2017-08-13 00:06:05'),'yyyy-MM-dd HH:mm:ss')) AS date").show(false)
+----+
|date|
+----+
|null|
+----+
The problem was the time change (daylight saving) in Chile, thank you very much.
+-------------------+---------+
| DateIntermedia|TimeStamp|
+-------------------+---------+
|13-08-2017 00:01:07| null|
|13-08-2017 00:10:33| null|
|14-08-2016 00:28:42| null|
|13-08-2017 00:04:43| null|
|13-08-2017 00:33:51| null|
|14-08-2016 00:28:08| null|
|14-08-2016 00:15:34| null|
|14-08-2016 00:21:04| null|
|13-08-2017 00:34:13| null|
+-------------------+---------+
The solution: set the timeZone:
spark.conf.set("spark.sql.session.timeZone", "UTC-6")
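With that setting in place, a sketch of the conversion itself (assuming df is the frame with the DateIntermedia column shown above):
import org.apache.spark.sql.functions.{col, unix_timestamp}

// With a fixed-offset session time zone there is no DST gap, so these just-after-midnight times parse instead of becoming null.
val fixed = df.withColumn("TimeStamp",
  unix_timestamp(col("DateIntermedia"), "dd-MM-yyyy HH:mm:ss").cast("timestamp"))
fixed.show(false)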
Spark Version: spark-2.0.1-bin-hadoop2.7
Scala: 2.11.8
I am loading a raw CSV into a DataFrame. In the CSV, although the column is supposed to be in date format, the values are written as 20161025 instead of 2016-10-25. The parameter date_format holds the names of the columns that need to be converted to yyyy-MM-dd format.
In the following code, I first load the CSV with the date columns as StringType via the schema. Then I check whether date_format is non-empty, i.e. there are columns that need to be converted from String to Date, and cast each such column using unix_timestamp and to_date. However, in csv_df.show() the returned rows are all null.
def read_csv(csv_source:String, delimiter:String, is_first_line_header:Boolean,
schema:StructType, date_format:List[String]): DataFrame = {
println("|||| Reading CSV Input ||||")
var csv_df = sqlContext.read
.format("com.databricks.spark.csv")
.schema(schema)
.option("header", is_first_line_header)
.option("delimiter", delimiter)
.load(csv_source)
println("|||| Successfully read CSV. Number of rows -> " + csv_df.count() + " ||||")
if(date_format.length > 0) {
for (i <- 0 until date_format.length) {
csv_df = csv_df.select(to_date(unix_timestamp(
csv_df(date_format(i)), "yyyy-MM-dd").cast("timestamp")))
csv_df.show()
}
}
csv_df
}
Returned Top 20 rows:
+-------------------------------------------------------------------------+
|to_date(CAST(unix_timestamp(prom_price_date, YYYY-MM-DD) AS TIMESTAMP))|
+-------------------------------------------------------------------------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+-------------------------------------------------------------------------+
Why am I getting all null?
To convert yyyyMMdd to yyyy-MM-dd you can:
spark.sql("""SELECT DATE_FORMAT(
CAST(UNIX_TIMESTAMP('20161025', 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'
)""")
or, with DataFrame functions:
date_format(unix_timestamp(col, "yyyyMMdd").cast("timestamp"), "yyyy-MM-dd")
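Dropped into the loop from the question, the conversion could look like this (a sketch; it parses the raw yyyyMMdd strings and uses withColumn, so the other columns survive, unlike the select in the original):
import org.apache.spark.sql.functions.{col, unix_timestamp, to_date}

for (date_col <- date_format) {
  // Overwrite each listed column with a proper DateType parsed from yyyyMMdd.
  csv_df = csv_df.withColumn(date_col,
    to_date(unix_timestamp(col(date_col), "yyyyMMdd").cast("timestamp")))
}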