Hive casting of MongoDB Date - mongodb

I'm linking Hive to a MongoDb collection that has a date. The MongoDB collection's structure looks like this:
{
"name" : "Using Hive",
"validFrom" : ISODate("2014-11-04T00:00:00.000Z"),
"validTo" : ISODate("2016-01-30T00:00:00.000Z"),
"_id" : ObjectId("54da1c02ead8571c292901d3")
}
I'm adding it to Hive as follows:
CREATE TABLE certificate
(
name STRING,
validFrom TIMESTAMP,
validTo TIMESTAMP,
id STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.certificate');
When I do a select, the dates are null:
hive> select * from certificate;
OK
Using Hive NULL NULL 54da1c02ead8571c292901d3
MongoDb NULL NULL 54da1c02ead8571c292901d4
Hadoop NULL NULL 54da1c02ead8571c292901d5
I know Hive supports date casting; is that something I can do in the CREATE statement to ensure the dates are correctly cast? I'll be using queries like "where the validFrom date is less than today and the validTo date is greater than today", so having those columns as dates and not strings is vital.
Thanks =D

Specify the mappings for the columns validFrom and validTo explicitly. By default, Hive converts column names to lowercase, so they no longer match the mixed-case MongoDB field names. Please check if the following works:
CREATE TABLE certificate
(
name STRING,
validfrom TIMESTAMP,
validto TIMESTAMP,
id STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","validfrom":"validFrom","validto":"validTo"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.certificate');
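With the mapping in place, the TIMESTAMP columns can be used directly in the validity checks mentioned in the question. A minimal sketch (assuming the certificate table above; current_timestamp is built into Hive 1.2+):
-- certificates whose validity window covers the current moment
select name
from certificate
where validfrom <= current_timestamp
  and validto >= current_timestamp;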

Related

Create rows from part of column names

Source data
I am working on an ELT project to load data from CSV files into PostgreSQL where I will transform it. The CSV files have many columns that are consistent across files, but also contain activity columns that are inconsistent with names like Date (05/19/2020), Type (05/19/2020), etc.
In the loading script I am merging all of the columns with dates in the column name into one jsonb column so I don't have to constantly add new columns to the raw data table.
The resulting jsonb column in the raw data table looks like this:
id       | activity
---------+---------------------------------------------------------------------------------------------------------------------
12345678 | {"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"}
98765432 | {"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}
JSON to columns
Using the amazing create_jsonb_flat_view function from this post I can convert the jsonb to columns like this:
id       | Date (05/19/2020) | Type (05/19/2020) | Date (06/03/2020) | Type (06/03/2020) | Date (10/23/2020) | Type (10/23/2020)
---------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------
12345678 | null              | null              | 06/01/2020        | E                 | null              | null
98765432 | 05/18/2020        | B                 | null              | null              | 10/26/2020        | T
Need to move part of column name to row
Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate with additional columns for Date and Type like this:
id       | ActivityDate | Date       | Type
---------+--------------+------------+-----
12345678 | 05/19/2020   | null       | null
12345678 | 06/03/2020   | 06/01/2020 | E
98765432 | 05/19/2020   | 05/18/2020 | B
98765432 | 10/23/2020   | 10/26/2020 | T
I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.
A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:
with parse as (
    -- unpack each jsonb key/value pair, then derive two helper columns:
    -- the key with the "(mm/dd/yyyy)" suffix stripped, and the date alone
    select id, e.k, e.v,
           regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
           regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       -- each (id, date) group has exactly one Date and one Type entry,
       -- so min() simply picks that single value
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
db<>fiddle here
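If real date values are needed, the strings can be cast in the final select; a sketch using the same parse CTE (to_date with the MM/DD/YYYY template is standard PostgreSQL, and nulls pass through unchanged):
select id,
       to_date(k_date_only, 'MM/DD/YYYY') as activity_date,
       to_date(min(v) filter (where k_no_date = 'Date'), 'MM/DD/YYYY') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;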
@Mike-Organek's answer works beautifully!
However, I was curious whether the regexp_replace() calls might be slowing the query down, and it seemed I could get the same results using simpler functions.
Since Mike gave me a great example to start with, I modified it to split on the space between Date and (05/19/2020).
For 20,000 rows, it went from taking an average of 7 sec on my local machine to an average of 0.9 sec.
Here is the resulting query:
with parse as (
    -- same unpacking as before, but split the key on the space and trim
    -- the parentheses instead of using regular expressions
    select id, e.k, e.v,
           split_part(e.k, ' ', 1) as k_no_date,
           trim(split_part(e.k, ' ', 2), '()') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;

Column values become NULL when creating a Hive table from a BSON file

I created a Hive (3.1.2) table from a BSON file dumped from MongoDB (4.0). After creating the table, I selected a couple of entries from it, but some of the values are null.
I tried printing the table rows from the BSON file using Python, and it printed the values correctly, meaning the values are not missing. Any clue about how to troubleshoot further?
SQL to create the Hive table:
CREATE EXTERNAL TABLE `tmp_test_status`(
`id` string COMMENT 'frame_id',
`createdAt` INT,
`updatedAt` string,
`task` string)
row format serde 'com.mongodb.hadoop.hive.BSONSerDe'
with serdeproperties('mongo.columns.mapping'='{"id":"_id"}')
stored as inputformat 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
outputformat 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION
'oss://data-warehouse/hive/warehouse/data.db/tmp_test_status';
===========================================
Data I printed with the Python bson lib:
{'_id': '00003a02-280d-4e59-8483-a0143e0a3359', 'createdAt': '1557999191951', 'updatedAt': '1557999191951', 'task': 'lane', '__v': 0}
===========================================
Data I selected from the Hive table:
00003a02-280d-4e59-8483-a0143e0a3359 NULL NULL lane
093e72ae-206b-4112-ac28-5ba38f9485d0 NULL NULL lane
093ebe41-183c-47b4-ab25-93336875ae10 NULL NULL lane
093ec16b-ba1d-4ddc-90bc-9981342e8071 NULL NULL lane
I found the answer myself: BSON attribute names are case-sensitive, but Hive column names are not. If an attribute name contains upper-case letters in the BSON file, Hive returns NULL for that column when queried. Simply mapping the attribute names via the SerDe properties worked for me:
with serdeproperties('mongo.columns.mapping'='{"id":"_id", "createdAt": "createdAt", "updatedAt": "updatedAt", "reLabeled1" : "reLabeled1", "isValid": "isValid"}')
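For completeness, a sketch of the question's DDL with that mapping applied (createdAt is declared string here to match the string value in the BSON dump; the extra fields in the mapping above, reLabeled1 and isValid, presumably come from the poster's fuller schema):
CREATE EXTERNAL TABLE `tmp_test_status`(
`id` string COMMENT 'frame_id',
`createdAt` string,
`updatedAt` string,
`task` string)
row format serde 'com.mongodb.hadoop.hive.BSONSerDe'
with serdeproperties('mongo.columns.mapping'='{"id":"_id", "createdAt": "createdAt", "updatedAt": "updatedAt"}')
stored as inputformat 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
outputformat 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION
'oss://data-warehouse/hive/warehouse/data.db/tmp_test_status';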

How to extract timestamp from mongodb objectid in postgres

In MongoDB you can retrieve the date from an ObjectId using the getTimestamp() function. How can I retrieve the date from a MongoDB ObjectId using Postgresql (e.g., in the case where such an ObjectId is stored in a Postgres database)?
Example input:
507c7f79bcf86cd7994f6c0e
Wanted output:
2012-10-15T21:26:17Z
In the MongoDB documentation, the ObjectId is described as containing a timestamp as its first 4 bytes, represented in hexadecimal. Assuming that hexadecimal value is stored as a string in PostgreSQL, the following query extracts the first 8 characters of the ObjectId, converts them to an integer (seconds since 1970-01-01), and then converts that integer to a timestamp. For example:
SELECT TO_TIMESTAMP(int_val) AS ts_val
FROM (
    SELECT ('x' || lpad(left(objectid, 8), 8, '0'))::bit(32)::int AS int_val
    FROM (
        VALUES ('507c7f79bcf86cd7994f6c0e')
    ) AS t1(objectid)
) AS t2;
Converting a hexadecimal string to integer is discussed here:
Convert hex in text representation to decimal number
The first answer is quite excellent. This one expands on it by wrapping it in a reusable function.
create function extractMongoTimestamp(text) RETURNS TIMESTAMP WITH TIME ZONE
as
'SELECT TO_TIMESTAMP(int_val) ts_val
 FROM (
     SELECT (''x'' || lpad(left(objectid, 8), 8, ''0''))::bit(32)::int AS int_val
     FROM (
         VALUES ($1)
     ) AS t1(objectid)
 ) AS t2'
language sql
immutable
RETURNS NULL ON NULL INPUT;
Use it in your query:
select extractMongoTimestamp('507c7f79bcf86cd7994f6c0e');
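This returns 2012-10-15 21:26:17+00, matching the wanted output above. The function can also be applied to a stored column; a sketch assuming a hypothetical table mongo_docs with a text column objectid:
SELECT extractMongoTimestamp(objectid) AS created_at
FROM mongo_docs;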

Hive date is showing null in elasticsearch

I have a Hive table details with the below schema:
name STRING,
address STRING,
dob DATE
My dob is stored in yyyy-MM-dd format, like 1988-01-27.
I am trying to load this into an Elasticsearch table, so I followed the instructions below in Hue:
CREATE EXTERNAL TABLE sampletable (name STRING, address STRING, dob DATE)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'test4/test4','es.nodes' = 'x.x.x.x:9200');
INSERT OVERWRITE TABLE sampletable SELECT * FROM details;
select * from sampletable;
But the dob field shows NULL for all rows, whereas I can verify that my original Hive table has data in the date field.
After some research I found that Elasticsearch expects date fields to be in yyyy-MM-dd'T'HH:mm:ss format; since my data doesn't match that, it throws an error. It also mentioned that I can change the mapping to the "strict_date" format, which would then work with my Hive date format, but I am not sure where in the Hive statements I need to mention this.
Can someone help me with this?
The date type mapping to Hive has some problems.
You can map the Elasticsearch date type to a Hive string column, but you must set the parameter es.mapping.date.rich to false for the table, i.e. 'es.mapping.date.rich' = 'false'. In the create table statement it looks like this:
CREATE EXTERNAL TABLE temp.data_index_es(
id bigint,
userId int,
createTime string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes' = 'xxxx:9200',
'es.index.auto.create' = 'false',
'es.resource' = 'abc/{_type}',
'es.mapping.date.rich' = 'false',
'es.read.metadata' = 'true',
'es.mapping.id' = 'id',
'es.mapping.names' = 'id:id, userId:userId, createTime:createTime');
Refer to this link: Mapping and Types
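Applied to the question's table, a sketch of the same fix (dob declared as STRING so the yyyy-MM-dd value passes through unconverted, and the rich date mapping disabled):
CREATE EXTERNAL TABLE sampletable (name STRING, address STRING, dob STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
    'es.resource' = 'test4/test4',
    'es.nodes' = 'x.x.x.x:9200',
    'es.mapping.date.rich' = 'false');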

What is equivalent of isNullOrEmpty in Postgresql select command?

select complaintno from complaintprocess where endtime='';
This is not working.
In the complaintprocess table, the endtime datatype is timestamp without time zone.
I want to get the complaintno column from complaintprocess for rows where endtime is empty.
You cannot store '' as a timestamp. I suspect that by "empty" you mean a NULL value:
SELECT CAST('' AS timestamp);
-- ERROR: invalid input syntax for type timestamp: ""
To filter them you could use:
SELECT complaintno
FROM complaintprocess
WHERE endtime IS NULL;
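For text columns, where both NULL and the empty string '' can occur, a close equivalent of isNullOrEmpty is a coalesce() check; a sketch assuming a hypothetical text column remark:
SELECT complaintno
FROM complaintprocess
WHERE coalesce(remark, '') = '';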