I have a column that holds a date but is declared as decimal(38,0), not as a timestamp or date, and the values are in 'yyyyMMdd' form. How do I select the rows that are one or two days old in Spark SQL (Scala), converting that column to a date formatted as 'yyyy-MM-dd'?
I have tried:
select count(*) from table_name where to_date('column_name','yyyy-MM-dd') = date_sub(current_date(),1);
This returns a count of 0, even though the table has well over 500,000 records.
I tried:
select count(*) from table_name where from_unixtime(cast(load_dt_id as string), 'yyyy-MM-dd') = date_sub(current_date(), 1);
This returned dates in 1970 (for example 1970-01-31), although the table contains no data from that year. Even when I select that column with like '1970%', I only get "OK" with the sign indicating that the query was accelerated with Delta. Ordered by that column, the data starts at 20140320.
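The format argument for to_date is the format of the input, not the desired output. Note also that 'column_name' in quotes is a string literal rather than a column reference, so to_date returns null for every row and the count is 0. A decimal(38,0) column should be cast to string before parsing, for example to_date(cast(load_dt_id as string), 'yyyyMMdd'). Assuming your values are in yyyyMMdd form: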
Seq(("20200208")).toDF("RawDate").select(col("RawDate"),to_date(col("RawDate"),"yyyyMMdd").as("formatted_date")).show()
+--------+--------------+
| RawDate|formatted_date|
+--------+--------------+
|20200208| 2020-02-08|
+--------+--------------+
Expanding this to filter by the derived date column:
val raw = Seq(("20200208"),("20200209"),("20200210")).toDF("RawDate")
raw: org.apache.spark.sql.DataFrame = [RawDate: string]
raw.select(col("RawDate"),to_date(col("RawDate"),"yyyyMMdd").as("formatted_date")).filter($"formatted_date".geq(date_add(current_date,-1))).show
+--------+--------------+
| RawDate|formatted_date|
+--------+--------------+
|20200209| 2020-02-09|
|20200210| 2020-02-10|
+--------+--------------+
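The 1970 dates in the question come from treating a yyyyMMdd number as a count of seconds since the epoch. The following is a minimal sketch of both interpretations in plain Python (standard library only, not Spark); the helper names `parse_yyyymmdd` and `loaded_since` are made up for illustration, and `today` is passed in explicitly so the result is reproducible.

```python
from datetime import date, datetime, timedelta, timezone
from decimal import Decimal


def parse_yyyymmdd(value):
    """Parse a numeric yyyyMMdd value (int or Decimal) into a date."""
    return datetime.strptime(str(int(value)), "%Y%m%d").date()


def loaded_since(values, today, days_back=1):
    """Keep the values whose parsed date is on or after today - days_back."""
    cutoff = today - timedelta(days=days_back)
    return [v for v in values if parse_yyyymmdd(v) >= cutoff]


# Reading the same number as epoch seconds lands in 1970,
# which matches the symptom described in the question.
as_epoch = datetime.fromtimestamp(20140320, tz=timezone.utc).date()

sample = [Decimal(20200208), Decimal(20200209), Decimal(20200210)]
recent = loaded_since(sample, today=date(2020, 2, 10))

print(parse_yyyymmdd(Decimal(20200208)))  # 2020-02-08
print(as_epoch)                           # 1970-08-22
print(recent)
```

Parsing the digits with an input pattern is the equivalent of passing "yyyyMMdd" to to_date; the epoch interpretation is what from_unixtime does.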
Related
I have a field in a dataframe that has a column with date like 1632838270314 as an example
I want to convert it to date like 'yyyy-MM-dd' I have this so far but it doesn't work:
date = df['createdOn'].cast(StringType())
df = df.withColumn('date_key',unix_timestamp(date),'yyyy-MM-dd').cast("date"))
createdOn is the field that derives the date_key
The function unix_timestamp() converts a timestamp or date string into the number of seconds since 1970-01-01 (the "epoch"). I understand that you want to do the opposite.
Your example value "1632838270314" seems to be milliseconds since epoch.
Here you can simply cast it after converting from milliseconds to seconds:
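from pyspark.sql import Row
from pyspark.sql import functions as F

# sql_context is assumed to be an existing SQLContext (a SparkSession works the same way)
df = sql_context.createDataFrame([
    Row(unix_in_ms=1632838270314),
])
(
    df
    # milliseconds -> seconds, then cast the number to a timestamp
    .withColumn('timestamp_type', (F.col('unix_in_ms')/1e3).cast('timestamp'))
    # drop the time of day
    .withColumn('date_type', F.to_date('timestamp_type'))
    # a date cast to string is rendered as yyyy-MM-dd
    .withColumn('string_type', F.col('date_type').cast('string'))
    # and back again: date string -> seconds since the epoch
    .withColumn('date_to_unix_in_s', F.unix_timestamp('string_type', 'yyyy-MM-dd'))
    .show(truncate=False)
)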
# Output
+-------------+-----------------------+----------+-----------+-----------------+
|unix_in_ms |timestamp_type |date_type |string_type|date_to_unix_in_s|
+-------------+-----------------------+----------+-----------+-----------------+
|1632838270314|2021-09-28 16:11:10.314|2021-09-28|2021-09-28 |1632780000 |
+-------------+-----------------------+----------+-----------+-----------------+
You can combine the conversion into a single command:
df.withColumn('date_key', F.to_date((F.col('unix_in_ms')/1e3).cast('timestamp')).cast('string'))
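For reference, the same arithmetic can be checked outside Spark. This is a small sketch in plain Python (standard library only); `epoch_ms_to_date_key` is a hypothetical helper name, and it uses UTC, whereas Spark uses the session time zone (the output above appears to come from a UTC+2 session).

```python
from datetime import datetime, timezone


def epoch_ms_to_date_key(epoch_ms):
    """Convert milliseconds since the epoch to a 'yyyy-MM-dd' string in UTC."""
    seconds = epoch_ms / 1000  # milliseconds -> seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


date_key = epoch_ms_to_date_key(1632838270314)
print(date_key)  # 2021-09-28
```

Skipping the division by 1000 would interpret the value as seconds and produce a date tens of thousands of years in the future, which is a quick way to spot a millisecond value.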
I want to create a timestamp column to create a line chart from two columns containing month and year respectively.
The df looks like this:
I know I can create a string concat and then convert it to a datetime column:
df.select('*',
concat('01', df['month'],
df['year']).alias('date')).withColumn("date",
df['date'].cast(TimestampType()))
But I wanted a cleaner approach using an inbuilt PySpark functionality that can also help me create other date parts, like week number, quarters, etc. Any suggestions?
You will have to concatenate the strings once and turn the result into a timestamp column; after that you can easily extract the week, quarter, and so on.
You can use this function (and edit it to create whatever other columns you need as well):
from pyspark.sql import functions as F


def spark_date_parsing(df, date_column, date_format):
    """
    Parses the date column given the date format in a spark dataframe
    NOTE: This is a Pyspark implementation
    Parameters
    ----------
    :param df: Spark dataframe having a date column
    :param date_column: Name of the date column
    :param date_format: Simple Date Format (Java-style) of the dates in the date column
    Returns
    -------
    :return: A spark dataframe with a parsed date column
    """
    df = df.withColumn(date_column, F.to_timestamp(F.col(date_column), date_format))
    # Spark returns 'null' if the parsing fails, so first check the count of null values
    # (values that were already null before parsing are counted as failures too)
    # If parse_fail_count = 0, return parsed column else raise error
    parse_fail_count = df.select(
        ([F.count(F.when(F.col(date_column).isNull(), date_column))])
    ).collect()[0][0]
    if parse_fail_count == 0:
        return df
    else:
        raise ValueError(
            f"Incorrect date format '{date_format}' for date column '{date_column}'"
        )
Usage (with whatever is your resultant date format):
df = spark_date_parsing(df, "date", "dd/MM/yyyy")
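To illustrate which date parts become available once month and year are combined into a real date, here is a minimal sketch in plain Python (standard library only, not PySpark). The helper names `month_start` and `date_parts` are hypothetical, and the first day of the month is assumed as the day component, as in the question.

```python
from datetime import date


def month_start(year, month):
    """Build the first day of the given month from separate year and month values."""
    return date(int(year), int(month), 1)


def date_parts(d):
    """Derive a few common date parts from a date."""
    return {
        "quarter": (d.month - 1) // 3 + 1,
        "iso_week": d.isocalendar()[1],
        "day_of_year": d.timetuple().tm_yday,
    }


parts = date_parts(month_start("2021", "9"))
print(parts)  # {'quarter': 3, 'iso_week': 35, 'day_of_year': 244}
```

In PySpark the corresponding parts come from the built-in functions quarter, weekofyear and dayofyear applied to the parsed column.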
I have a dataset containing a date field in the format MM/dd/yyyy. My goal is to create a table that holds the same date as a real date/timestamp type, so that date functions can be run on it (they do not work if the value is an int or a string).
Things I tried:
my column name: as_of_date
1) cast(unix_timestamp(as_of_date, "MM/dd/yyyy") as timestamp)
i/p -> 01/03/2006, o/p ->2006-01-03 00:00:00
problem - I do not want the extra zeros (00:00:00) in the output, and substr does not work on the result of the date function
2) If I keep the value as a string, date functions don't work.
day('01/03/2006')
input: '01/03/2006' , output:null (but expected 3)
Can you please point me to an existing date format, or help me create a new one, that fits my logic?
Try this:
Use the unix_timestamp function to parse your input date format, then use the from_unixtime function to change the output format, then cast the result to the date type.
hive> select date(from_unixtime(unix_timestamp("03/06/2018", "MM/dd/yyyy"),"yyyy-MM-dd"));
+-------------+--+
| _c0 |
+-------------+--+
| 2018-03-06 |
+-------------+--+
In Impala:
hive> select from_unixtime(unix_timestamp("03/06/2018", "MM/dd/yyyy"),"yyyy-MM-dd");
+-------------+--+
| _c0 |
+-------------+--+
| 2018-03-06 |
+-------------+--+
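The parse-then-reformat idea is independent of Hive. As a rough sketch in plain Python (standard library only), with `parse_us_date` as a hypothetical helper name: the input pattern describes the incoming string, and the resulting date value can then be printed in ISO form or queried for its parts.

```python
from datetime import datetime


def parse_us_date(text):
    """Parse an MM/dd/yyyy string into a date (no time-of-day component)."""
    return datetime.strptime(text, "%m/%d/%Y").date()


as_of = parse_us_date("01/03/2006")
print(as_of.isoformat())  # 2006-01-03
print(as_of.day)          # 3
```

Because the result is a date rather than a timestamp, there is no trailing 00:00:00 to strip.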
I'm trying to select records from a DB2 Iseries system where the date field is greater than the first of this year.
However, the date fields I'm selecting from are actually PACKED fields, not true dates.
I'm trying to convert them to YYYY-MM-DD format and get everything greater than '2018-01-01' but no matter what I try it says it's invalid.
Currently trying this:
SELECT *
FROM table1
WHERE val = 145
AND to_date(char(dateShp), 'YYYY-MM-DD') >= '2018-01-01';
it says expression not valid using format string specified.
Any ideas?
char(dateShp) is going to return a string like '20180319',
so your format string should not include the dashes: 'YYYYMMDD'
example:
select to_date(char(20180101), 'YYYYMMDD')
from sysibm.sysdummy1;
So your code should be
SELECT *
FROM table1
WHERE val = 145
AND to_date(char(dateShp), 'YYYYMMDD') >= '2018-01-01';
Charles gave you a solution that converts the packed field to a date, and if you are comparing it to another date field, that is a good solution. But if you are comparing it to a constant value or another numeric field, you can compare the numbers directly, because yyyymmdd values sort in the same order as the dates they represent:
select *
from table1
where val = 145
and dateShp >= 20180101;
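The numeric comparison works because a yyyymmdd number orders the same way as the date it encodes. A small sketch in plain Python (standard library only) that checks this on a few made-up sample values; `packed_to_date` is a hypothetical helper, not a DB2 function.

```python
from datetime import date, datetime


def packed_to_date(number):
    """Interpret an integer in yyyymmdd form as a date."""
    return datetime.strptime(str(number), "%Y%m%d").date()


shipped = [20171231, 20180101, 20180319]  # sample values, not real data

# Filter once by comparing the raw numbers, once by comparing real dates.
by_number = [n for n in shipped if n >= 20180101]
by_date = [n for n in shipped if packed_to_date(n) >= date(2018, 1, 1)]

print(by_number)  # [20180101, 20180319]
print(by_date)    # [20180101, 20180319]
```

Comparing the raw column also avoids applying a function to every row before filtering.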
I created a table with a datetime field "dt" and use the COPY command to load data. The corresponding value for the field in the file is just the time of day, for example 14:50:00, so the value being stored is 1900-01-01 14:50:00. I don't need the date part. How can I get rid of it?
Or is there an alternative data type that can store only the time?
At the time of writing, Amazon Redshift supports only the date (year, month, day) and timestamp (year, month, day, hour, minute, second) types; it does not support PostgreSQL's time (hour, minute, second) type.
I see two ways to work around this.
First, as @Damien_The_Unbeliever mentioned, ignore the date part of the timestamp.
create table date_test(id int, timestamp timestamp);
insert into date_test values (1, '1900-01-01 14:50:00');
insert into date_test values (2, '1900-01-01 17:20:00');
select * from date_test where date_test.timestamp > '1900-01-01 14:50:00';
id | timestamp
----+---------------------
2 | 1900-01-01 17:20:00
(1 row)
Second, use a char or varchar type to store the time value.
create table date_test2(id int, timestamp char(8));
insert into date_test2 values (1, '14:50:00');
insert into date_test2 values (2, '17:20:00');
select * from date_test2 where timestamp > '14:50:00';
id | timestamp
----+-----------
2 | 17:20:00
(1 row)
The second solution looks easier, but according to the Redshift documentation it performs worse. If you store a large amount of data, you should consider the first one.
Here are the relevant documentation pages on date/time columns:
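The string-based workaround relies on zero-padded HH:MM:SS strings comparing in the same order as the times they represent. A minimal sketch in plain Python (standard library only, not Redshift) that checks this on the two rows used above; `parse_time` is a hypothetical helper name.

```python
from datetime import datetime, time


def parse_time(text):
    """Parse a zero-padded HH:MM:SS string into a time-of-day value."""
    return datetime.strptime(text, "%H:%M:%S").time()


rows = [(1, "14:50:00"), (2, "17:20:00")]

# Filter once by comparing strings, once by comparing real time values.
by_string = [row_id for row_id, t in rows if t > "14:50:00"]
by_time = [row_id for row_id, t in rows if parse_time(t) > time(14, 50)]

print(by_string)  # [2]
print(by_time)    # [2]
```

This only holds while every value is zero-padded to the same width; a value such as '9:05:00' would sort after '14:50:00' as a string.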
http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-timestamp-date-columns.html
http://docs.aws.amazon.com/redshift/latest/dg/r_Datetime_types.html