Column values become NULL when creating a Hive table from a BSON file - mongodb

I created a Hive (3.1.2) table from a BSON file dumped from MongoDB (4.0). After creating the table, I selected a couple of entries from it, but some of the column values are NULL.
I printed the same rows from the BSON file using Python, and the values printed correctly, so they are not missing from the file. Any clue how to troubleshoot this further?
SQL used to create the Hive table:
CREATE EXTERNAL TABLE `tmp_test_status`(
`id` string COMMENT 'frame_id',
`createdAt` INT,
`updatedAt` string,
`task` string)
row format serde 'com.mongodb.hadoop.hive.BSONSerDe'
with serdeproperties('mongo.columns.mapping'='{"id":"_id"}')
stored as inputformat 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
outputformat 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION
'oss://data-warehouse/hive/warehouse/data.db/tmp_test_status';
===========================================
Data printed with the Python bson library:
{'_id': '00003a02-280d-4e59-8483-a0143e0a3359', 'createdAt': '1557999191951', 'updatedAt': '1557999191951', 'task': 'lane', '__v': 0}
===========================================
Data selected from the Hive table:
00003a02-280d-4e59-8483-a0143e0a3359 NULL NULL lane
093e72ae-206b-4112-ac28-5ba38f9485d0 NULL NULL lane
093ebe41-183c-47b4-ab25-93336875ae10 NULL NULL lane
093ec16b-ba1d-4ddc-90bc-9981342e8071 NULL NULL lane

I found the answer myself. Attribute names in the BSON file are case-sensitive, but Hive column names are not. If an attribute name in the BSON file contains upper-case letters, Hive returns NULL when that column is queried. Explicitly mapping the attribute names in the SerDe properties worked for me:
with serdeproperties('mongo.columns.mapping'='{"id":"_id", "createdAt": "createdAt", "updatedAt": "updatedAt", "reLabeled1" : "reLabeled1", "isValid": "isValid"}')

Related

Convert text column to jsonb in Postgres

I have a column products of type text in a table test, with values in the following format:
[{"is_bulk_product": false, "rate": 0, "subtotal": 7.17, "qty": 2, "tax": 0.90}]
It is an array with nested dictionary values. When I tried to alter the column using this:
alter table test alter COLUMN products type jsonb using products::jsonb;
I get the error below:
ERROR: 22P02: invalid input syntax for type json
DETAIL: Character with value 0x09 must be escaped.
CONTEXT: JSON data, line 1: ...some_id": 2613, "qty": 2, "upc": "1234...
LOCATION: json_lex_string, json.c:789
Time: 57000.237 ms (00:57.000)
How can we make sure the JSON is valid before altering the column?
Thank you
The JSON string you posted is valid, so this SQL executes without error:
select '[{"is_bulk_product": false, "rate": 0, "subtotal": 7.17, "qty": 2, "tax": 0.90}]'::jsonb
Other rows in the table probably contain invalid JSON. The error detail points to a raw tab character (0x09) inside a string value. You can check this first by casting in a SELECT, for example:
select products::jsonb from test;
Also make sure the USING clause casts the products column, not test, which is the table name:
alter table test
alter COLUMN products type jsonb using products::jsonb;
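To find the offending rows before running the ALTER, one option is to export the column and check each value with a strict JSON parser. The following is a minimal sketch in Python, not a replacement for PostgreSQL's own parser: the function name find_invalid_json and the (id, text) row shape are assumptions made for illustration, and Python's rules are close to PostgreSQL's but not identical.

```python
import json


def _reject_constant(name):
    # Python's json module accepts NaN/Infinity by default; treat them as invalid here.
    raise ValueError(f"non-standard JSON constant: {name}")


def find_invalid_json(rows):
    """Return (row_id, error message) for every value that fails strict JSON parsing.

    `rows` is an iterable of (row_id, text) pairs (a hypothetical shape).
    """
    bad = []
    for row_id, text in rows:
        try:
            # strict=True (the default) rejects raw control characters such as
            # a tab (0x09) inside a string, like the error in the question.
            json.loads(text, parse_constant=_reject_constant)
        except ValueError as exc:  # JSONDecodeError is a subclass of ValueError
            bad.append((row_id, str(exc)))
    return bad


rows = [
    (1, '[{"qty": 2, "tax": 0.90}]'),
    (2, '[{"upc": "1234\t5678"}]'),  # raw tab inside a string value
]
for row_id, err in find_invalid_json(rows):
    print(row_id, err)
```

Rows reported here can then be cleaned (for example by escaping or removing the control characters) before the column type is changed.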

postgres SQL: getting rid of NA while migrating data from csv file

I am migrating data from a CSV file into a newly created table named fortune500. The code is shown below:
CREATE TABLE "fortune500"(
"id" SERIAL,
"rank" INTEGER,
"title" VARCHAR PRIMARY KEY,
"name" VARCHAR,
"ticker" CHAR(5),
"url" VARCHAR,
"hq" VARCHAR,
"sector" VARCHAR,
"industry" VARCHAR,
"employees" INTEGER,
"revenues" INTEGER,
"revenues_change" REAL,
"profits" NUMERIC,
"profits_change" REAL,
"assets" NUMERIC,
"equity" NUMERIC
);
Then I tried to load the data from the CSV file using the code below:
COPY "fortune500"("rank", "title", "name", "ticker", "url", "hq", "sector", "industry", "employees",
"revenues", "revenues_change", "profits", "profits_change", "assets", "equity")
FROM 'C:\Users\Yasser A.RahmAN\Desktop\SQL for Business Analytics\fortune.csv'
DELIMITER ','
CSV HEADER;
But I got the error message below because of NA values in one of the columns:
ERROR: invalid input syntax for type real: "NA"
CONTEXT: COPY fortune500, line 12, column profits_change: "NA"
SQL state: 22P02
So how can I get rid of 'NA' values while migrating the data?
Consider using a staging table that does not have restrictive data types, then do your transformations and insert into the final table after the data has been loaded into staging. This is known as the ELT (Extract - Load - Transform) approach. You could also use an external tool to implement an ETL process and do the transformation in that tool, before the data reaches your database.
In your case, an ELT approach would be to first create a table with all text columns, load that table, and then insert into your final table, casting the text values to the appropriate types. Depending on your requirements, either filter out the values that cannot be cast, or insert NULL (or maybe 0) where the cast can't be made. For example, you'd filter out rows where profits_change = 'NA' (or better, rows WHERE NOT (profits_change ~ '^-?\d+(\.\d+)?$'), which catches any non-numeric value), or you'd insert NULL or 0:
CASE
WHEN profits_change ~ '^-?\d+(\.\d+)?$'
THEN profits_change::real
ELSE NULL -- or 0, depending on what you need
END
You'd perform this kind of validation for all fields.
Alternatively, if it's a one-off, just edit your CSV before importing.
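If the cleanup is done outside the database (the ETL variant mentioned above), the same "numeric or NULL" rule can be sketched in Python. This is only an illustration under stated assumptions: the helper name to_real is hypothetical, and the pattern used here accepts an optional leading minus sign and an optional decimal part, which may or may not match every value in the real file.

```python
import re

# Optional minus sign, digits, optional decimal part (an assumed definition of "numeric").
NUMERIC = re.compile(r"-?\d+(\.\d+)?")


def to_real(text):
    """Convert a CSV field to float, or None (SQL NULL) when it is not numeric."""
    if text is None:
        return None
    text = text.strip()
    return float(text) if NUMERIC.fullmatch(text) else None


for raw in ["12.5", "-3.5", "7", "NA", ""]:
    print(repr(raw), "->", to_real(raw))
```

Values mapped to None can then be written as empty fields or loaded as NULL, depending on how the cleaned file is imported.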

Unable to make generated column in postgresql for Json data

I'm trying out generated columns with Postgres 12. I need to create a table with a generated column derived from JSON data, where the JSON will contain a "name" key. However, when I tried this, I got the error below:
postgres=# create table json_tab2 (data jsonb ,
postgres(# "json_tab2.pname" text generated always as (data ->> "name" ) stored
postgres(# );
ERROR: column "name" does not exist
LINE 2: ...on_tab2.pname" text generated always as (data ->> "name" ) ...
After this, I tried to alter an existing table instead, since its JSON data already contains a value for the generated column, so it should be able to identify "name" now. This time I ran:
postgres=# alter table json_tab add column Pname text generated always as (data ->> "name") stored
;
ERROR: column "name" does not exist
However, "name" has a value here:
data
-------------------------------------------------
{"age": 31, "city": "New York", "name": "John"}
I can't understand what I'm doing wrong here.
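The right-hand side of the ->> operator should be a value. In PostgreSQL, double quotes denote identifiers such as column names, which is why "name" is reported as a missing column. Since the key is a string, you need to surround it with single quotes ('):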
create table json_tab2 (
data jsonb,
pname text generated always as (data ->> 'name') stored
-- Here ---------------------------------^----^
);

Hive date is showing null in elasticsearch

I have a Hive table details with the schema below:
name STRING,
address STRING,
dob DATE
My dob is stored in yyyy-mm-dd format, like 1988-01-27.
I am trying to load this table into Elasticsearch, so I followed the instructions below in Hue.
CREATE EXTERNAL TABLE sampletable (name STRING, address STRING, dob DATE)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'test4/test4','es.nodes' = 'x.x.x.x:9200');
INSERT OVERWRITE TABLE sampletable SELECT * FROM details;
select * from sampletable;
But the dob field shows NULL for every row, even though I can verify that my original Hive table has data in the date field.
After some research I found that Elasticsearch expects the date field to be in yyyy-mm-ddThh:mm:zz format, and since my data doesn't match that, it throws an error. It was also mentioned that if I change the format to "strict_date", it will work with my Hive date format. But I am not sure where in the Hive query I need to specify this.
Can someone help me with this?
Mapping the Hive DATE type to an Elasticsearch date field is problematic.
You can map a Hive STRING column to the Elasticsearch date field instead, but you must set the es.mapping.date.rich parameter to false in the table properties, like this: 'es.mapping.date.rich' = 'false'. In a CREATE TABLE statement it looks like this:
CREATE EXTERNAL TABLE temp.data_index_es(
id bigint,
userId int,
createTime string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes' = 'xxxx:9200',
'es.index.auto.create' = 'false',
'es.resource' = 'abc/{_type}',
'es.mapping.date.rich' = 'false',
'es.read.metadata' = 'true',
'es.mapping.id' = 'id',
'es.mapping.names' = 'id:id, userId:userId, createTime:createTime');
Reference: Mapping and Types

Hive casting of MongoDb Date

I'm linking Hive to a MongoDb collection that has a date. The MongoDB collection's structure looks like this:
{
"name" : "Using Hive",
"validFrom" : ISODate("2014-11-04T00:00:00.000Z"),
"validTo" : ISODate("2016-01-30T00:00:00.000Z"),
"_id" : ObjectId("54da1c02ead8571c292901d3")
}
I'm adding it to Hive as follows:
CREATE TABLE certificate
(
name STRING,
validFrom TIMESTAMP,
validTo TIMESTAMP,
id STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.certificate');
When I do a select, the dates are null:
hive> select * from certificate;
OK
Using Hive NULL NULL 54da1c02ead8571c292901d3
MongoDb NULL NULL 54da1c02ead8571c292901d4
Hadoop NULL NULL 54da1c02ead8571c292901d5
I know Hive supports date casting, is that something I can do with the CREATE statement to ensure the dates are correctly cast? I'll be using queries with "where valid from date's less than today and valid to date's more than today" and such, so having those columns as dates and not strings is vital.
Thanks =D
Specify the mappings for the columns validFrom and validTo. By default, Hive converts column names to lowercase, so they no longer match the mixed-case MongoDB field names. Please check whether the following works.
CREATE TABLE certificate
(
name STRING,
validfrom TIMESTAMP,
validto TIMESTAMP,
id STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","validfrom":"validFrom","validto":"validTo"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.certificate');
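When a collection has many mixed-case fields, the JSON value for 'mongo.columns.mapping' can be generated rather than typed by hand. The following is a minimal Python sketch of that idea: the helper name build_columns_mapping and its renames parameter are hypothetical, and it assumes the Hive column name is simply the lower-cased field name unless an explicit rename is given.

```python
import json


def build_columns_mapping(field_names, renames=None):
    """Build the JSON string for 'mongo.columns.mapping'.

    Keys are Hive column names (lower case), values are the MongoDB field names.
    `renames` maps a MongoDB field to a chosen Hive column name, e.g. {"_id": "id"}.
    """
    renames = renames or {}
    mapping = {}
    for field in field_names:
        column = renames.get(field, field.lower())
        if column != field:  # identical names need no explicit entry
            mapping[column] = field
    return json.dumps(mapping)


fields = ["name", "validFrom", "validTo", "_id"]
clause = "WITH SERDEPROPERTIES('mongo.columns.mapping'='{}')".format(
    build_columns_mapping(fields, renames={"_id": "id"})
)
print(clause)
```

The printed clause can be pasted into the CREATE TABLE statement; the field list itself would come from inspecting a sample document.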