Not able to create Hive table with TIMESTAMP datatype in Azure Databricks - pyspark

org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.UnsupportedOperationException: Parquet does not support
timestamp. See HIVE-6384;
I am getting the above error while executing the following code in Azure Databricks:
spark_session.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time TIMESTAMP
)
PARTITIONED BY (
Date DATE)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION "/mnt/data_analysis/pre-processed/"
""")

As per the HIVE-6384 JIRA, starting from Hive 1.2 you can use TIMESTAMP and DATE types in Parquet tables.
Workarounds for Hive versions < 1.2:
1. Using String type:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time STRING
)
PARTITIONED BY (
Date STRING)
STORED AS PARQUET
LOCATION '/mnt/data_analysis/pre-processed/';
Then, while processing, you can cast arrival_time and Date to the TIMESTAMP and DATE types.
You can also use a view that casts the columns (a sketch follows after the ORC workaround below), but views are slow.
2. Using ORC format:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time TIMESTAMP
)
PARTITIONED BY (
Date DATE)
STORED AS ORC
LOCATION '/mnt/data_analysis/pre-processed/';
ORC supports both the TIMESTAMP and DATE types.
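For the string-type workaround above, a casting view could look like the following minimal sketch (the view name is illustrative, and it assumes arrival_time is stored as yyyy-MM-dd HH:mm:ss and Date as yyyy-MM-dd):
CREATE VIEW IF NOT EXISTS dev_db.processing_table_vw AS
SELECT
  campaign,
  status,
  file_name,
  CAST(arrival_time AS TIMESTAMP) AS arrival_time, -- string to timestamp
  CAST(Date AS DATE) AS Date -- string to date
FROM dev_db.processing_table;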

Related

Unexpected type: BINARY

I am trying to read Parquet files via a Flink table, and it throws an error when I select one of the timestamp columns.
My Parquet table is something like this; I create the table with this SQL:
CREATE TABLE MyDummyTable (
`id` INT,
ts BIGINT,
ts_ltz AS TO_TIMESTAMP_LTZ(ts, 3),
ts2 TIMESTAMP,
ts3 TIMESTAMP,
ts4 TIMESTAMP,
ts5 TIMESTAMP
)
It throws an error when I select any one of ts2, ts3, ts4, ts5.
The error stack is this:
Caused by: java.lang.IllegalArgumentException: Unexpected type: BINARY
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:77)
at org.apache.flink.formats.parquet.vector.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:369)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createWritableVectors(ParquetVectorizedInputFormat.java:264)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReaderBatch(ParquetVectorizedInputFormat.java:254)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createPoolOfBatches(ParquetVectorizedInputFormat.java:244)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReader(ParquetVectorizedInputFormat.java:137)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReader(ParquetVectorizedInputFormat.java:73)
at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.checkSplitOrStartNext(FileSourceSplitReader.java:112)
at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:65)
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:56)
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:138)
... 7 more
My current approach
I am currently using the approach below to get around the problem, but it does not seem like a legitimate solution, since I have to create two columns in the table.
CREATE TABLE MyDummyTable (
`id` INT,
ts2 STRING,
ts2_ts AS TO_TIMESTAMP(ts2)
)
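For completeness, queries against this workaround table go through the computed column rather than the raw string, for example (the filter value is only an illustration):
SELECT id, ts2_ts
FROM MyDummyTable
WHERE ts2_ts >= TIMESTAMP '2021-01-01 00:00:00';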
Flink Version: 1.13.2
Scala Version: 2.11.12

Grafana query for created table in clickhouse

I am trying to see data from ClickHouse as a graph in Grafana. I have tried a lot with query processing, but I couldn't get any points to show up in Grafana. My tables look like this:
CREATE TABLE m_psutilinfo (timestamp String, namespace String, data Float, unit String, plugin_running_on String, version UInt64, last_advertised_time String) ENGINE = Kafka('10.224.54.99:9092', 'psutilout', 'group3', 'JSONEachRow');
CREATE TABLE m_psutilinfo_t (timestamp DateTime,namespace String,data Float,unit String,plugin_running_on String,version UInt64,last_advertised_time String,DAY Date)ENGINE = MergeTree PARTITION BY DAY ORDER BY (DAY, timestamp) SETTINGS index_granularity = 8192;
CREATE MATERIALIZED VIEW m_psutilinfo_view TO m_psutilinfo_t AS SELECT toDateTime(substring(timestamp, 1, 19)) AS timestamp, namespace, data, unit, plugin_running_on, version, last_advertised_time, toDate(timestamp) AS DAY FROM m_psutilinfo;
These are the tables I created in ClickHouse. What should my query in Grafana be to get the data as a graph?
SELECT
$timeSeries as t,
count()
FROM $table
WHERE $timeFilter
GROUP BY t
ORDER BY t
I used Tabix, but I want this in Grafana.
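For reference, a concrete form of that template against the MergeTree table above could look like the sketch below; it assumes the ClickHouse Grafana datasource with DAY chosen as the Date column and timestamp as the DateTime column in the query editor, and it averages the data column instead of just counting rows:
SELECT
    $timeSeries AS t,
    avg(data) AS data
FROM m_psutilinfo_t
WHERE $timeFilter
GROUP BY t
ORDER BY t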

Specified partition columns do not match the partition columns of the table. Please use () as the partition columns

Here I'm trying to persist a data frame into a partitioned Hive table and I am getting this silly exception. I have looked into it many times but am not able to find the fault.
org.apache.spark.sql.AnalysisException: Specified partition columns
(timestamp_val) do not match the partition columns of the table.
Please use () as the partition columns.;
Here is the script with which the external table was created:
CREATE EXTERNAL TABLE IF NOT EXISTS events2 (
action string
,device_os_ver string
,device_type string
,event_name string
,item_name string
,lat DOUBLE
,lon DOUBLE
,memberid BIGINT
,productupccd BIGINT
,tenantid BIGINT
) partitioned BY (timestamp_val DATE)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
stored AS inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 'maprfs:///location/of/events2'
tblproperties ('serialization.null.format' = '');
Here is the result of describe formatted for table "events2":
hive> describe formatted events2;
OK
# col_name data_type comment
action string
device_os_ver string
device_type string
event_name string
item_name string
lat double
lon double
memberid bigint
productupccd bigint
tenantid bigint
# Partition Information
# col_name data_type comment
timestamp_val date
# Detailed Table Information
Database: default
CreateTime: Wed Jan 11 16:58:55 IST 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: maprfs:/location/of/events2
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
serialization.null.format
transient_lastDdlTime 1484134135
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.078 seconds, Fetched: 42 row(s)
Here is the line of code where the data is partitioned and stored into the table:
val tablepath = Map("path" -> "maprfs:///location/of/events2")
AppendDF.write.format("parquet").partitionBy("Timestamp_val").options(tablepath).mode(org.apache.spark.sql.SaveMode.Append).saveAsTable("events2")
While running the application, I'm getting the error below:
Specified partition columns (timestamp_val) do not match the partition
columns of the table.Please use () as the partition columns.
I might be committing an obvious error, any help is much appreciated with an upvote :)
Please print the schema of the DataFrame:
AppendDF.printSchema()
and make sure there is no type mismatch.

Hive date is showing null in elasticsearch

I have a Hive table details with the below schema:
name STRING,
address STRING,
dob DATE
My dob is stored in yyyy-MM-dd format, like 1988-01-27.
I am trying to load this into an Elasticsearch table, so I followed the steps below in Hue.
CREATE EXTERNAL TABLE sampletable (name STRING, address STRING, dob DATE)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'test4/test4','es.nodes' = 'x.x.x.x:9200');
INSERT OVERWRITE TABLE sampletable SELECT * FROM details;
select * from sampletable;
But the dob field shows NULL for every row, whereas I can verify that my original Hive table has data in the date field.
After some research I found that Elasticsearch expects date fields to be in yyyy-MM-ddThh:mm:ssZ format; since my data doesn't match that, it throws an error. It also mentioned that I can change the mapping to the "strict_date" format, which would work fine with my Hive date format, but I am not sure where in the Hive query I need to mention this.
Can some one help me with this?
The date type mapping to Hive has some problems.
You can map a Hive string type to an ES date type, but you must set the table parameter es.mapping.date.rich to false, i.e. 'es.mapping.date.rich' = 'false' in the create table statement, like this:
CREATE EXTERNAL TABLE temp.data_index_es(
id bigint,
userId int,
createTime string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes' = 'xxxx:9200',
'es.index.auto.create' = 'false',
'es.resource' = 'abc/{_type}',
'es.mapping.date.rich' = 'false',
'es.read.metadata' = 'true',
'es.mapping.id' = 'id',
'es.mapping.names' = 'id:id, userId:userId, createTime:createTime');
Refer to this link: Mapping and Types
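Applied to the question's table, a minimal sketch (the index name and node address are taken from the question; keeping dob as a string on the Hive side is the assumption here) would be:
CREATE EXTERNAL TABLE sampletable (name STRING, address STRING, dob STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.resource' = 'test4/test4',
'es.nodes' = 'x.x.x.x:9200',
'es.mapping.date.rich' = 'false');
INSERT OVERWRITE TABLE sampletable SELECT name, address, cast(dob AS STRING) FROM details;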

Hive returns no data for simple select on partitioned external table

My select query is fetching no rows from a partitioned external table.
I created an external partitioned table audit_test at the location /user/abcdef/audit_table/, and I am loading .csv files by creating a partition directory per date.
CREATE EXTERNAL TABLE audit_test
(perm_bitmap_txt STRING,
blank_txt STRING,
ownr_id STRING,
ad_grp_txt STRING,
size_bytes_tot INT,
last_mod_dt STRING,
last_mod_tm STRING,
hdfs_phy_loc_txt STRING,
reg_hdfs_loc_txt STRING,
reg_hdfs_grp_txt STRING,
reg_hdfs_comp_txt string)
PARTITIONED BY (data_ext_DT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/abcdef/audit_table/';
Now my output location would be /user/abcdef/audit_table/data_ext_dt=20150203/20150203_audit.csv
When I run a simple select query, I get zero rows:
select * from audit_test where data_ext_dt = '20150203'
I had to create the partitions manually by using the alter command:
alter table data_sec_audit_rpt_test add partition(data_ext_dt=20150203);
It worked.
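As a side note, when there are many such date directories, Hive can also discover them in one step with MSCK REPAIR TABLE instead of adding each partition by hand (this assumes the directories follow the data_ext_dt=<value> naming shown above):
MSCK REPAIR TABLE audit_test;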