How to import data from a folder into Hive with new columns for the file name and folder name? - hiveql

I have data input like this:
Drivers/
  driver_1/
    1.csv
    2.csv
    ...
  driver_2/
    1.csv
    2.csv
    ...
  ...
The structure of each CSV file is:
x,y
0.0,0.0
18.6,-11.1
36.1,-21.9
53.7,-32.6
70.1,-42.8
86.5,-52.6
I want to load all the files in these folders into a Hive table like:
id, x, y, file_name, folder_name
1, 0.0, 0.0, 1.csv, driver_1
...
How can I do it?
Can anyone help me please?

Hive has a virtual column named INPUT__FILE__NAME that contains the full path of the input file each record came from. Using REGEXP_EXTRACT we can then pull the parent directory and the file name out of that path:
SELECT
    x
  , y
  , REGEXP_EXTRACT(INPUT__FILE__NAME, '.*/(.*)/(.*)', 2) AS file_name
  , REGEXP_EXTRACT(INPUT__FILE__NAME, '.*/(.*)/(.*)', 1) AS folder_name
FROM your_table;
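To make this end to end, here is a minimal sketch (the HDFS location, table name, and column types are assumptions, not from the question): create one external table over the whole Drivers directory, tell Hive to read the per-driver subdirectories recursively, and skip each file's x,y header row.
-- hypothetical HDFS location and table name
CREATE EXTERNAL TABLE drivers_raw (x DOUBLE, y DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/Drivers'
TBLPROPERTIES ('skip.header.line.count'='1');

-- let queries descend into the driver_* subdirectories
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
The query above then works with drivers_raw in place of your_table. If the id column is also needed, ROW_NUMBER() over INPUT__FILE__NAME can approximate it, though row order within a file is not guaranteed.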

Related

PostgreSQL extract multiple values from XML into columns

I have the below XML in each row with different data.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:Declaration xmlns="urn:wco:datamodel:WCO:Declaration_DS:DMS:2" xmlns:ns2="urn:wco:datamodel:WCO:DEC-DMS:2" xmlns:ns3="urn:wco:datamodel:WCO:WCO_DEC_EDS_AUTHORISATION:1">
  <ns2:FunctionCode>9</ns2:FunctionCode>
  <ns2:ProcedureCategory>B1</ns2:ProcedureCategory>
  <ns2:FunctionalReferenceID>LRNU4YZHFFG</ns2:FunctionalReferenceID>
  <ns2:IssueDateTime>
    <DateTimeString formatCode="304">20210816084322+01</DateTimeString>
  </ns2:IssueDateTime>
  <ns2:TypeCode>EXA</ns2:TypeCode>
  <ns2:GoodsItemQuantity>2</ns2:GoodsItemQuantity>
  <ns2:DeclarationOfficeID>ABCd</ns2:DeclarationOfficeID>
  <ns2:TotalGrossMassMeasure unitCode="KGM">33000.000</ns2:TotalGrossMassMeasure>
  <ns2:TotalPackageQuantity>400</ns2:TotalPackageQuantity>
  <ns2:Submitter>
    <ns2:Name>ABC</ns2:Name>
    <ns2:ID>ABC</ns2:ID>
  </ns2:Submitter>
</ns2:Declaration>
ID | messagebody
---+------------
1  | <Xml...
2  | <Xml...
What I would like is a query that extracts some elements from the XML and puts them in a table like the one below:
ID | messagebody | FunctionalReferenceID | ProcedureCategory
---+-------------+-----------------------+------------------
1  | <Xml...     | LRR....               | B1
2  | <Xml...     | LR1....               | B2
I'm using the SQL below, which extracts only one path:
select u.val::text
from sw_customs_message scm
cross join unnest(xpath('//ns2:ProcedureCategory/text()',
scm.messagebody::xml,
array[array['ns2','urn:wco:datamodel:WCO:DEC-DMS:2']])) as u(val)
where u.val::text = 'H7'
How can I use xmltable()?
Using xmltable() is typically easier for multiple columns (and rows), especially if namespaces come into play:
select scm.id, mb.*
from sw_customs_message scm
  cross join xmltable(xmlnamespaces ('urn:wco:datamodel:WCO:DEC-DMS:2' as ns2,
                                     'urn:wco:datamodel:WCO:WCO_DEC_EDS_AUTHORISATION:1' as ns3),
                      '/ns2:Declaration'
                      passing cast(messagebody as xml)
                      columns
                        functional_reference_id text path 'ns2:FunctionalReferenceID',
                        procedure_category text path 'ns2:ProcedureCategory',
                        function_code int path 'ns2:FunctionCode',
                        good_items_quantity int path 'ns2:GoodsItemQuantity'
  ) as mb
where mb.procedure_category = ...
messagebody should really be defined as xml so that you don't need to cast it each time you want to do something with the XML.
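Since messagebody is apparently stored as text, the column type can be changed once instead. A sketch using standard PostgreSQL DDL; it will fail if any stored value is not well-formed XML:
ALTER TABLE sw_customs_message
  ALTER COLUMN messagebody TYPE xml
  USING messagebody::xml;
After that, the passing clause can reference messagebody directly, without the cast.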

Are the bucket hash algorithms of Tez and MR different?

I'm using Hive 3.1.2 and tried to create a bucketed table with bucketing version 2.
When I created the buckets and checked the bucket files using hdfs dfs -cat, I could see that the hashing results were different between the two engines.
Are the hash algorithms of Tez and MR different? Shouldn't they be the same if the bucketing version is 2?
Here's the test method and its results.
1. Create Bucket table & Data table
CREATE EXTERNAL TABLE `bucket_test`(
  `id` int COMMENT ' ',
  `name` string COMMENT ' ',
  `age` int COMMENT ' ',
  `phone` string COMMENT ' ')
CLUSTERED BY (id, name, age) SORTED BY (phone) INTO 2 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  'bucketing_version'='2',
  'orc.compress'='ZLIB');
CREATE TABLE data_table (id int, name string, age int, phone string)
row format delimited fields terminated by ',';
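As a side check (not part of the original test), the bucketing version the table actually carries can be confirmed before inserting anything:
DESCRIBE FORMATTED bucket_test;
-- 'bucketing_version' should appear as 2 under Table Parameters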
2. Insert data into DATA_TABLE
INSERT INTO TABLE data_table
select stack
( 20
, 1, 'a', 11, '111'
, 1, 'a', 11, '222'
, 3, 'b', 14, '333'
, 3, 'b', 13, '444'
, 5, 'c', 18, '555'
, 5, 'c', 18, '666'
, 5, 'c', 21, '777'
, 8, 'd', 23, '888'
, 9, 'd', 24, '999'
, 10, 'd', 26, '1110'
, 11, 'd', 27, '1112'
, 12, 'e', 28, '1113'
, 13, 'f', 28, '1114'
, 14, 'g', 30, '1115'
, 15, 'q', 31, '1116'
, 16, 'w', 32, '1117'
, 17, 'e', 33, '1118'
, 18, 'r', 34, '1119'
, 19, 't', 36, '1120'
, 20, 'y', 36, '1130');
3. Create Bucket with MR
set hive.enforce.bucketing = true;
set hive.execution.engine = mr;
set mapreduce.job.queuename=root.test;
Insert overwrite table bucket_test
select * from data_table ;
4. Check Bucket contents
# bucket0 : 6 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000000_0
10d261110
11d271112
18r341119
3b13444
5c18555
5c18666
# bucket1 : 14 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000001_0
1a11111
12e281113
13f281114
14g301115
15q311116
16w321117
17e331118
19t361120
20y361130
1a11222
3b14333
5c21777
8d23888
9d24999
5. Create Bucket with Tez
set hive.enforce.bucketing = true;
set hive.execution.engine = tez;
set tez.queue.name=root.test;
Insert overwrite table bucket_test
select * from data_table ;
6. Check Bucket contents
# bucket0 : 11 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000000_0
1a11111
10d261110
11d271112
13f281114
16w321117
17e331118
18r341119
20y361130
1a11222
5c18555
5c18666
# bucket1 : 9 rows
[root@test ~]# hdfs dfs -cat /user/hive/warehouse/bucket_test/000001_0
12e281113
14g301115
15q311116
19t361120
3b14333
3b13444
5c21777
8d23888
9d24999

Databricks and Polybase cannot parse CSV including polygon

I have Azure Data Factory, which reads a CSV via an HTTP connection and stores the data in Azure Storage Gen2. The file encoding is UTF-8. It seems the file gets somehow corrupted because of the polygon definitions.
The file content is as follows:
Shape123|"MULTIPOLYGON (((496000 6908000, 495000 6908000, 495000 6909000, 496000 6909000, 496000 6908000)))"|"Red"|"Long"|"208336"|"5"|"-1"
Problem 1:
Polybase complains about the encoding and cannot read the file.
Problem 2:
A Databricks DataFrame cannot handle this; it cuts the row and reads only "Shape123|"MULTIPOLYGON (((496000 6908000,"
Quick solution:
Open the CSV file with Notepad++ and re-save it with UTF-8 encoding. Then Polybase is able to handle it.
Question:
What is an automatic way to fix the CSV file?
How can I make the DataFrame handle the entire row if the CSV file cannot be fixed?
Polybase can cope perfectly well with UTF-8 files and various delimiters. Did you create an external file format with a pipe delimiter and double-quote as the string delimiter, something like this?
CREATE EXTERNAL FILE FORMAT ff_pipeFileFormatSHAPE
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '"',
        ENCODING = 'UTF8'
    )
);
GO
CREATE EXTERNAL TABLE shape_data (
    col1 VARCHAR(20),
    col2 VARCHAR(8000),
    col3 VARCHAR(20),
    col4 VARCHAR(20),
    col5 VARCHAR(20),
    col6 VARCHAR(20),
    col7 VARCHAR(20)
)
WITH (
    LOCATION = 'yourPath/shape/shape working.txt',
    DATA_SOURCE = ds_azureDataLakeStore,
    FILE_FORMAT = ff_pipeFileFormatSHAPE,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
My results: [screenshot omitted]
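For the Databricks side (Problem 2), the pipe separator and the double-quote character also have to be declared when reading; otherwise Spark splits on its default comma and truncates the row at the first comma inside the polygon. A minimal Spark SQL sketch, with a hypothetical file path:
CREATE TEMPORARY VIEW shape_csv
USING csv
OPTIONS (
  path 'abfss://yourcontainer@youraccount.dfs.core.windows.net/shape/shape.csv',
  sep '|',
  quote '"',
  header 'false'
);

SELECT * FROM shape_csv;
The same sep and quote options apply to spark.read.csv if you prefer the DataFrame API.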

PostgreSQL absolute over relative xpath location

Consider the following xml document that is stored in a PostgreSQL field:
<E_sProcedure xmlns="http://www.minushabens.com/2008/FMSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" modelCodeScheme="Emo_ex" modelCodeSchemeVersion="01" modelCodeValue="EMO_E_PROCEDURA" modelCodeMeaning="Section" sectionID="11">
  <tCatSnVsn_Pmax modelCodeScheme="Emodinamica_referto" modelCodeSchemeVersion="01" modelCodeValue="tCat4" modelCodeMeaning="My text"><![CDATA[1]]></tCatSnVsn_Pmax>
</E_sProcedure>
If I run the following query I get the correct result for Line 1, while Line 2 returns nothing:
SELECT
  --Line 1
  TRIM(BOTH FROM array_to_string((xpath('//child::*[@modelCodeValue="tCat4"]/text()', t.xml_element)),'')) as tCatSnVsn_Pmax_MEANING
  --Line 2
  ,TRIM(BOTH FROM array_to_string((xpath('/tCatSnVsn_Pmax/text()', t.xml_element)),'')) as tCatSnVsn_Pmax
FROM (
  SELECT unnest(xpath('//x:E_sProcedure', s.XMLDATA::xml, ARRAY[ARRAY['x', 'http://www.minushabens.com/2008/FMSchema']])) AS xml_element
  FROM sr_data as s)t;
What's wrong in the xpath of Line 2?
Your second xpath() doesn't return anything because of two problems. First: you need to use //tCatSnVsn_Pmax as the xml_element still starts with <E_sProcedure>. The path /tCatSnVsn_Pmax tries to select a top-level element with that name.
But even then, the second one won't return anything because of the namespace. You need to pass the same namespace definition to the xpath(), so you need something like this:
SELECT (xpath('/x:tCatSnVsn_Pmax/text()', t.xml_element, ARRAY[ARRAY['x', 'http://www.minushabens.com/2008/FMSchema']]))[1] as tCatSnVsn_Pmax
FROM (
SELECT unnest(xpath('//x:E_sProcedure', s.XMLDATA::xml, ARRAY[ARRAY['x', 'http://www.minushabens.com/2008/FMSchema']])) AS xml_element
FROM sr_data as s
)t;
With modern Postgres versions (>= 10) I prefer using xmltable() for anything nontrivial. It makes passing namespaces easier and accessing multiple attributes or elements.
SELECT xt.*
FROM sr_data
  cross join xmltable(xmlnamespaces ('http://www.minushabens.com/2008/FMSchema' as x),
                      '/x:E_sProcedure'
                      passing (xmldata::xml)
                      columns
                        sectionid text path '@sectionID',
                        pmax text path 'x:tCatSnVsn_Pmax',
                        model_code_value text path 'x:tCatSnVsn_Pmax/@modelCodeValue') as xt
For your sample XML, the above returns:
sectionid | pmax | model_code_value
----------+------+-----------------
11        | 1    | tCat4

Invalid Input Syntax when Uploading CSV to Postgres Table

I'm fairly new to Postgres. I am trying to copy a file from my computer to a Postgres server. I first initialize the table with
CREATE TABLE coredb (
id text, lng numeric(6,4), lat numeric(6,4),
score1 numeric(5,4), score2 numeric(5,4));
And my CSV looks like this:
ID lng lat score1 score2
1 -72.298 43.218 0.561 0.894
2 -72.298 43.218 0.472 0.970
3 -72.285 43.250 0.322 0.959
4 -72.285 43.250 0.370 0.934
5 -72.325 43.173 0.099 0.976
6 -72.325 43.173 0.099 0.985
However, when I try to copy the CSV over, I get the following error
COPY coredb FROM '/home/usr/Documents/filefordb.csv' DELIMITER ',' CSV;
ERROR: invalid input syntax for type numeric: "lng"
CONTEXT: COPY nhcore, line 1, column lng: "lng"
Oddly enough, the CSV imports just fine when I make all the columns text in CREATE TABLE. Could someone explain why this is happening? I am using psql 9.4.1.
You have to use HEADER true to tell COPY to skip the header line.
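For example, reusing the COPY command from the question:
COPY coredb FROM '/home/usr/Documents/filefordb.csv' DELIMITER ',' CSV HEADER;
This also explains the all-text observation: with text columns the header row is accepted as an ordinary data row, while a numeric column rejects the literal string "lng", hence the error.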