How to create a column with the date datatype in a Hive table

I have created a table in Hive (0.10.0) using values like:
2012-01-11 17:51 Stockton Children's Clothing 168.68 Cash
2012-01-11 17:51 Tampa Health and Beauty 441.08 Amex
............
Here date and time are tab-separated values, and I need to work on the date column. Since Hive (0.10.0) doesn't allow a "date" datatype, I used TIMESTAMP for the first date column (2012-01-11, ...).
However, after creating the table, it shows NULL values for the first column.
How can I solve this? Please guide.

I loaded the data into a table with all columns defined as STRING, then cast the date value and loaded it into another table where the column was defined as DATE. It seems to work without any issues. The only difference is that I am using the Shark version of Hive, and to be honest, I am not sure whether there are any profound differences between actual Hive and Shark's Hive.
Data:
hduser2@ws-25:~$ more test.txt
2010-01-05 17:51 Visakh
2013-02-16 09:31 Nair
Code:
[localhost:12345] shark> create table test_time(dt string, tm string, nm string) row format delimited fields terminated by '\t' stored as textfile;
Time taken (including network latency): 0.089 seconds
[localhost:12345] shark> describe test_time;
dt string
tm string
nm string
Time taken (including network latency): 0.06 seconds
[localhost:12345] shark> load data local inpath '/home/hduser2/test.txt' overwrite into table test_time;
Time taken (including network latency): 0.124 seconds
[localhost:12345] shark> select * from test_time;
2010-01-05 17:51 Visakh
2013-02-16 09:31 Nair
Time taken (including network latency): 0.397 seconds
[localhost:12345] shark> select cast(dt as date) from test_time;
2010-01-05
2013-02-16
Time taken (including network latency): 0.399 seconds
[localhost:12345] shark> create table test_date as select cast(dt as date) from test_time;
Time taken (including network latency): 0.71 seconds
[localhost:12345] shark> select * from test_date;
2010-01-05
2013-02-16
Time taken (including network latency): 0.366 seconds
[localhost:12345] shark>
If you are using TIMESTAMP, you could try something along the lines of concatenating the date and time strings and then casting them:
create table test_1 as select cast(concat(dt,' ', tm,':00') as string) as ts from test_time;
select cast(ts as timestamp) from test_1;
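Equivalently, the concatenation and cast can be combined into a single step (a sketch against the same test_time table; the table name test_ts is mine and untested):
-- build the 'yyyy-MM-dd HH:mm:ss' string and cast it to timestamp in one statement
create table test_ts as select cast(concat(dt, ' ', tm, ':00') as timestamp) as ts from test_time;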

It works fine for me using the LOAD command from the Beeline side.
Data:
[root@hostname workspace]# more timedata
buy,1977-03-12 06:30:23
sell,1989-05-23 07:23:12
Table creation statement:
create table mytime(id string ,t timestamp) row format delimited fields terminated by ',';
Data loading statement:
load data local inpath '/root/workspace/timedata' overwrite into table mytime;
Table structure:
describe mytime;
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| id | string | |
| t | timestamp | |
+-----------+------------+----------+--+
Result of querying:
select * from mytime;
+------------+------------------------+--+
| mytime.id | mytime.t |
+------------+------------------------+--+
| buy | 1977-03-12 06:30:23.0 |
| sell | 1989-05-23 07:23:12.0 |
+------------+------------------------+--+
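If you only need the date portion of the stored timestamp, Hive's to_date function should work on the column (a hedged follow-up query, not part of the original answer):
-- extract just the date part from the timestamp column
select id, to_date(t) from mytime;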

Apache Hive data types are very important for the query language and for data modeling (the representation of the data structures in a table for a company's database).
It is necessary to know the data types and their usage when defining table column types.
There are two main categories of Apache Hive data types:
Primitive data types
Complex data types
This section discusses the complex data types.
Complex data types are further classified into four types, explained below.
2.1 ARRAY
It is an ordered collection of fields.
The fields must all be of the same type.
Syntax: ARRAY<data_type>
Example: array(1, 4)
2.2 MAP
It is an unordered collection of key-value pairs.
Keys must be primitives; values may be of any type.
Syntax: MAP<primitive_type, data_type>
Example: map('a', 1, 'c', 3)
2.3 STRUCT
It is a collection of elements of different types.
Syntax: STRUCT<col_name : data_type, ...>
Example: struct('a', 1, 1.0)
2.4 UNION
It is a collection of heterogeneous data types.
Syntax: UNIONTYPE<data_type, data_type, ...>
Example: create_union(1, 'a', 63)
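To tie these together, here is a minimal HiveQL sketch (the table and column names are illustrative, and the FROM-less SELECT requires Hive 0.13+):
-- a table whose columns use the complex types described above
create table complex_demo (
  a array<int>,
  m map<string,int>,
  s struct<c1:string, c2:int, c3:double>,
  u uniontype<int,string>
);
-- build values with the constructor functions from the examples
select array(1, 4), map('a', 1, 'c', 3), struct('a', 1, 1.0);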

Related

PySpark rolling operation with variable ranges

I have a dataframe that looks like this:
some_data | date     | date_from | date_to
1234      | 1-2-2020 | 1-2-2020  | 2-2-2020
5678      | 2-2-2020 | 1-2-2020  | 2-3-2020
I need to perform some operations on some_data based on time ranges that are different for every row and are stored in date_from and date_to. This is basically a rolling operation on some_data vs. date, where the width of the window is not constant.
If the time ranges were always the same, e.g. 7 days preceding/following, I would just use a window with rangeBetween. Any idea how I can still use rangeBetween with these variable ranges? I could really use the partitioning capability Window provides...
My current solution (sketched below) is:
a join of the table with itself to obtain a secondary/nested date column; at this point every date has the full list of possible dates
some WHERE filters to select, for each primary date, the proper secondary dates according to date_from and date_to
a GROUP BY on the primary date, with agg performing the actual operation on the selected rows
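In Spark SQL terms, this is roughly the following (a sketch; it assumes the dataframe is registered as a temp view named events, and uses sum as a stand-in for the actual operation):
-- for each row's window [date_from, date_to], aggregate the matching rows by date
SELECT t1.date, SUM(t2.some_data) AS windowed_sum
FROM events t1
JOIN events t2
  ON t2.date BETWEEN t1.date_from AND t1.date_to
GROUP BY t1.date;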
But I am afraid this would not be very performant on large datasets. Can this be done with Window? Do you have a better/more performant suggestion?
Thanks a lot,
Andrea.

Have two different strings that represent dates in two different Hive tables, and I want to use them to join

So I have two external tables in Hive, in my Hadoop cluster.
One table has a (date STRING) column with this format: '2019-05-24 11:16:31.0',
and the other one has a (date STRING) column with this format: '23/May/2019:22:15:04'; they are both strings. I need to transform them to the same date format and use them to join these two tables.
How would you approach solving this entirely within Hive? Would it be possible? I'm quite the rookie in Hadoop, and I'm not fully aware of the possibilities of Hive.
PS: My Hive version does not support the !hive --version command, so I'm not sure how to check which version I'm working with. It's not my cluster and I'm not a root user.
You need to convert both strings to the same format before joining.
Converting non-standard format '23/May/2019:22:15:04'
Use unix_timestamp(string date, string pattern) to convert the given date format to seconds elapsed since 1970-01-01, then use from_unixtime() to convert to the required format:
select from_unixtime(unix_timestamp('23/May/2019:22:15:04','dd/MMM/yyyy:HH:mm:ss'));
returns:
2019-05-23 22:15:04
If you want the date only, specify the date format 'yyyy-MM-dd' in the from_unixtime function:
select from_unixtime(unix_timestamp('23/May/2019:22:15:04','dd/MMM/yyyy:HH:mm:ss'),'yyyy-MM-dd');
Returns:
2019-05-23
The second table contains the more standard format '2019-05-24 11:16:31.0', so you can take a simpler approach.
A simple substr works, because the date part is already in the Hive format 'yyyy-MM-dd':
select substr('2019-05-24 11:16:31.0',1,10);
Returns:
2019-05-24
Or if you want the same format as in the first example 'yyyy-MM-dd HH:mm:ss':
select substr('2019-05-24 11:16:31.0',1,19);
Returns:
2019-05-24 11:16:31
The date_format function (as of Hive 1.2.0) can also be used for the same:
select date_format('2019-05-24 11:16:31.0','yyyy-MM-dd HH:mm:ss');
Returns:
2019-05-24 11:16:31
And for the date portion only, using date_format (as of Hive 1.2.0):
select date_format('2019-05-24 11:16:31.0','yyyy-MM-dd');
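Putting it together, the join itself might then look like this (a sketch; table_a, table_b, and the column names are assumed):
-- normalize both sides to 'yyyy-MM-dd' and join on the result
select a.*, b.*
from table_a a
join table_b b
  on substr(a.date,1,10) = from_unixtime(unix_timestamp(b.date,'dd/MMM/yyyy:HH:mm:ss'),'yyyy-MM-dd');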
OK, you can use Hive's string functions and operators to bring the two different date formats into the same form, like below:
select regexp_replace(substring('2019-05-24 11:16:31.0',0,10),'-','') as date;
+-----------+
| date |
+-----------+
| 20190524 |
+-----------+
select concat(split(substring_index('23/May/2019:22:15:04',':',1),'/')[2],case when split(substring_index('23/May/2019:22:15:04',':',1),'/')[1]='May' then '05' end,split(substring_index('23/May/2019:22:15:04',':',1),'/')[0]) as date;
+-----------+
| date |
+-----------+
| 20190523 |
+-----------+
And then join them. Below is a simple example to clarify the usage; you can refine the details.
select *
from table1 t1
join table2 t2
  on regexp_replace(substring(t1.date,0,10),'-','') =
     concat(split(substring_index(t2.date,':',1),'/')[2],
            case when split(substring_index(t2.date,':',1),'/')[1]='May' then '05' end,
            split(substring_index(t2.date,':',1),'/')[0])
Does that make it clear?

CrateDB timestamp query

I have created a table in CrateDB with a timestamp column. However, when inserting records into it, no timezone information is passed along, as mentioned in the docs.
insert into t1 values(2,'2017-06-30T02:21:20');
this gets stored as:
2 | 1498789280000 (Fri, 30 Jun 2017 02:21:20 GMT)
Now my queries are all failing, as the timestamp got recorded as GMT while my queries are all in the local timezone (Asia/Kolkata).
If anyone has run into this problem, could you please let me know the best way to modify the column to change the values from GMT to IST without losing them? It has a couple of million important records which cannot be lost or corrupted.
cheers!
CrateDB always assumes that timestamps are UTC when they are stored without timezone information. This is due to the internal representation as a simple long data type - which means that your timestamp is stored as a simple number: https://crate.io/docs/reference/en/latest/sql/data_types.html#timestamp
CrateDB also accepts timezone information in your ISO string, so inserting insert into t1 values(2,'2017-06-30T02:21:20+05:30'); will convert it to the appropriate UTC value.
For records that are already stored, you can make the DB aware of the timezone when querying for the field and convert the output back by passing the corresponding timezone value into a date_trunc or date_format function: https://crate.io/docs/reference/en/latest/sql/scalar.html#date-and-time-functions
This UPDATE should do it: UPDATE test SET ts = date_format('%Y-%m-%dT%H:%i:%s.%fZ','+05:30', ts);
cr> create table test(ts timestamp);
CREATE OK, 1 row affected (0.089 sec)
cr> insert into test values('2017-06-30T02:21:20');
INSERT OK, 1 row affected (0.005 sec)
cr> select date_format(ts) from test;
+-----------------------------+
| date_format(ts) |
+-----------------------------+
| 2017-06-30T02:21:20.000000Z |
+-----------------------------+
SELECT 1 row in set (0.004 sec)
cr> UPDATE test set ts = date_format('%Y-%m-%dT%H:%i:%s.%fZ','+05:30', ts);
UPDATE OK, 1 row affected (0.006 sec)
cr> select date_format(ts) from test;
+-----------------------------+
| date_format(ts) |
+-----------------------------+
| 2017-06-30T07:51:20.000000Z |
+-----------------------------+
SELECT 1 row in set (0.004 sec)
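Alternatively, if you only need the conversion at read time rather than rewriting the stored values, date_trunc accepts an optional timezone argument in CrateDB (a sketch; adjust the interval as needed):
-- truncate to day boundaries in the Asia/Kolkata timezone without modifying the data
select date_trunc('day', 'Asia/Kolkata', ts) from test;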

Pivot query by date in Amazon Redshift

I have a table in Redshift like:
category | date
----------------
1 | 9/29/2016
1 | 9/28/2016
2 | 9/28/2016
2 | 9/28/2016
which I'd like to turn into:
category | 9/29/2016 | 9/28/2016
--------------------------------
1 | 1 | 1
2 | 0 | 2
(count of each category for each date)
Pivot a table with Amazon RedShift / PostgreSQL seems helpful, using CASE statements, but that requires knowing all possible cases beforehand. How could I do this if the columns I want are every day starting from a given date?
There is no functionality provided with Amazon Redshift that can automatically pivot the data.
The Pivot a table with Amazon RedShift / PostgreSQL page you referenced shows how the output can be generated, but it is unable to automatically adjust the number of columns based upon the input data.
One option would be to write a program that queries the available date ranges and then generates the SQL query. However, this can't be done entirely within Amazon Redshift.
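For a fixed, known set of dates, the CASE technique from the linked answer looks roughly like this (a sketch against a hypothetical table named mytable; a small generator script would emit one such column per day in the desired range):
-- one conditional count per pivoted date column
select category,
       sum(case when date = '2016-09-29' then 1 else 0 end) as d_2016_09_29,
       sum(case when date = '2016-09-28' then 1 else 0 end) as d_2016_09_28
from mytable
group by category;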
You could do a self join on date, which I'm currently looking up how to do.

PostgreSQL: Index the day part of a timestamp

Consider the following table:
Column | Type |
--------------------+--------------------------+
id | bigint |
creation_time | timestamp with time zone |
...
Queries like the following (let alone more complicated JOINs) take quite a while, because they need to calculate creation_time::DATE for each item:
SELECT creation_time::DATE, COUNT(*) FROM items GROUP BY 1;
How do I create an index on the day part of the timestamp - creation_time::DATE?
I have tried:
CREATE INDEX items_day_of_creation_idx ON items (creation_time)::date;
CREATE INDEX items_day_of_creation_idx ON items (creation_time::date);
But both failed with:
ERROR: syntax error at or near "::"
When you create an index on an expression, that expression must be put between parentheses (in addition to the parentheses that surround the column/expression list):
CREATE INDEX items_day_of_creation_idx ON items ( (creation_time::date) );
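As a quick sanity check (a hypothetical session), after refreshing statistics, EXPLAIN should show the planner considering the new index for the original query on a sufficiently large table:
-- refresh planner statistics, then inspect the query plan
ANALYZE items;
EXPLAIN SELECT creation_time::DATE, COUNT(*) FROM items GROUP BY 1;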
This worked, but I noticed that the index using an expression to cast the timestamp to a date used more disk space (~15-20%) than an index on the plain timestamp.
I had hoped for a disk space reduction from building an index on a 4-byte date instead of an 8-byte timestamp, but that seems not to be the case, because 8 bytes appears to be the lowest common denominator for an element in the index. So the disk use was worse and the query performance was about the same, and I abandoned this approach.