Hive returns no data for simple select on partitioned external table - hiveql

My select query fetches no rows from a partitioned external table.
I created an external partitioned table audit_test at the location /user/abcdef/audit_table/, and I am loading .csv files by creating partition directories based on date.
CREATE EXTERNAL TABLE audit_test
(perm_bitmap_txt STRING,
blank_txt STRING,
ownr_id STRING,
ad_grp_txt STRING,
size_bytes_tot INT,
last_mod_dt STRING,
last_mod_tm STRING,
hdfs_phy_loc_txt STRING,
reg_hdfs_loc_txt STRING,
reg_hdfs_grp_txt STRING,
reg_hdfs_comp_txt STRING)
PARTITIONED BY (data_ext_DT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/abcdef/audit_table/';
Now my output location would be /user/abcdef/audit_table/data_ext_dt=20150203/20150203_audit.csv.
When I run a simple select query, I get zero rows:
select * from audit_test where data_ext_dt = '20150203';
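A quick way to confirm the cause is to ask the metastore what it knows; if no partitions are registered, the output here is empty, which matches the empty result set:
SHOW PARTITIONS audit_test;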

I had to create the partitions manually with an ALTER TABLE command:
alter table audit_test add partition(data_ext_dt='20150203');
That worked: Hive does not automatically register partition directories that are created directly on HDFS, so each one has to be added to the metastore before queries can see its data.
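Because the directory name already follows Hive's key=value convention (data_ext_dt=20150203), the partitions can also be discovered in bulk instead of one ALTER per date. A minimal sketch, assuming the audit_test table above:
-- scans the table location and registers any partition directories it finds
MSCK REPAIR TABLE audit_test;
On Amazon EMR the equivalent command is ALTER TABLE audit_test RECOVER PARTITIONS.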

Related

pyspark + hive: difference between first row in dataframe and table

I created a table in Hive using a csv file containing a header:
CREATE TABLE resultado(
data_jogo date,
mandante string,
visitante string,
gols_mandante int,
gols_visitante int,
torneio string,
cidade string,
pais string,
campo_neutro boolean
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES('skip.header.line.count'='1');
LOAD DATA INPATH '/user/hive/projeto/results.csv' OVERWRITE INTO TABLE resultado;
And it works fine when I try a select:
SELECT * FROM resultado LIMIT 5;
Then I went to pyspark to see the same data:
from pyspark.sql import HiveContext
h = HiveContext(sc)
df = h.table('resultado')
df.show(5)
But it returns a dataframe that still contains the header row of the loaded file as a data row.
Please, can someone tell me what I'm doing wrong? As you can see, I'm really new to this xD
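Several Spark versions are known to ignore the skip.header.line.count table property when reading Hive text tables, so the header comes back as data. A hedged workaround sketch: since data_jogo is a DATE column, the header text cannot be parsed as a date and surfaces as NULL, so the header row can be filtered out (this assumes no legitimate row has a NULL data_jogo):
-- run through the existing HiveContext, e.g. h.sql(...)
SELECT * FROM resultado WHERE data_jogo IS NOT NULL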

Not able to create Hive table with TIMESTAMP datatype in Azure Databricks

org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.UnsupportedOperationException: Parquet does not support
timestamp. See HIVE-6384;
I get the above error while executing the following code in Azure Databricks.
spark_session.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time TIMESTAMP
)
PARTITIONED BY (
Date DATE)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION "/mnt/data_analysis/pre-processed/"
""")
As per the HIVE-6384 Jira, starting from Hive 1.2 you can use the TIMESTAMP and DATE types in Parquet tables.
Workarounds for Hive versions below 1.2:
1. Using STRING type:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time STRING
)
PARTITIONED BY (
Date STRING)
STORED AS PARQUET
LOCATION '/mnt/data_analysis/pre-processed/';
Then, while processing, you can cast arrival_time and Date to the TIMESTAMP and DATE types. You can also put a view on top of the table that casts the columns, but views are slow.
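A minimal sketch of the view approach, assuming the string-typed table above (the view name is hypothetical):
CREATE VIEW dev_db.processing_table_typed AS
SELECT campaign,
       status,
       file_name,
       -- cast the string columns back to their intended types at read time
       CAST(arrival_time AS TIMESTAMP) AS arrival_time,
       CAST(`Date` AS DATE) AS `Date`
FROM dev_db.processing_table;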
2. Using ORC format:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time TIMESTAMP
)
PARTITIONED BY (
Date DATE)
STORED AS ORC
LOCATION '/mnt/data_analysis/pre-processed/';
ORC supports both the TIMESTAMP and DATE types.

Grafana query for created table in clickhouse

I am trying to see data from ClickHouse as a graph in Grafana. I have tried a lot of query variations, but I could not get any points to show up in Grafana. My tables look like this:
CREATE TABLE m_psutilinfo (timestamp String, namespace String, data Float, unit String, plugin_running_on String, version UInt64, last_advertised_time String) ENGINE = Kafka('10.224.54.99:9092', 'psutilout', 'group3', 'JSONEachRow');
CREATE TABLE m_psutilinfo_t (timestamp DateTime, namespace String, data Float, unit String, plugin_running_on String, version UInt64, last_advertised_time String, DAY Date) ENGINE = MergeTree PARTITION BY DAY ORDER BY (DAY, timestamp) SETTINGS index_granularity = 8192;
CREATE MATERIALIZED VIEW m_psutilinfo_view TO m_psutilinfo_t AS SELECT toDateTime(substring(timestamp, 1, 19)) AS timestamp, namespace, data, unit, plugin_running_on, version, last_advertised_time, toDate(timestamp) AS DAY FROM m_psutilinfo;
These are the tables I created in ClickHouse. What should my query in Grafana be to get the data as a graph? My current attempt, using the Grafana macros, is:
SELECT
$timeSeries as t,
count()
FROM $table
WHERE $timeFilter
GROUP BY t
ORDER BY t
I used Tabix, but I want this in Grafana.
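For reference, a hedged sketch of an equivalent query written without the macros, with illustrative values for the bucket size and time range (in the panel you would keep $timeSeries and $timeFilter, point $table at the MergeTree table, and pick timestamp as the DateTime column):
SELECT
    intDiv(toUInt32(timestamp), 60) * 60 AS t,  -- bucket into 60-second intervals
    count() AS points
FROM m_psutilinfo_t
WHERE timestamp >= now() - 3600                 -- last hour, for illustration
GROUP BY t
ORDER BY t
Note that the panel should query the MergeTree table m_psutilinfo_t, not the Kafka table: the Kafka engine only streams messages through to the materialized view and is not suitable for repeated reads.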

Hive date is showing null in elasticsearch

I have a hive table details with below schema
name STRING,
address STRING,
dob DATE
The dob is stored in yyyy-MM-dd format, e.g. 1988-01-27.
I am trying to load this into an Elasticsearch table, so I ran the statements below in Hue:
CREATE EXTERNAL TABLE sampletable (name STRING, address STRING, dob DATE)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'test4/test4','es.nodes' = 'x.x.x.x:9200');
INSERT OVERWRITE TABLE sampletable SELECT * FROM details;
select * from sampletable;
But the dob field shows NULL for every row, whereas I can verify that my original Hive table has data in the date field.
After some research I found that Elasticsearch expects the date field to be in yyyy-MM-dd'T'HH:mm:ss format; since my data does not match that, it throws an error. It was also mentioned that I can change the mapping to the "strict_date" format, and then my Hive date format will work fine. But I am not sure where in the Hive query I need to mention this.
Can someone help me with this?
Mapping the Elasticsearch date type to a Hive DATE has known problems.
You can map an Elasticsearch date field to a Hive STRING column instead, but then you must set the table parameter es.mapping.date.rich to false, i.e. 'es.mapping.date.rich' = 'false' in the CREATE TABLE statement:
CREATE EXTERNAL TABLE temp.data_index_es(
id bigint,
userId int,
createTime string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes' = 'xxxx:9200',
'es.index.auto.create' = 'false',
'es.resource' = 'abc/{_type}',
'es.mapping.date.rich' = 'false',
'es.read.metadata' = 'true',
'es.mapping.id' = 'id',
'es.mapping.names' = 'id:id, userId:userId, createTime:createTime');
Reference: Mapping and Types in the elasticsearch-hadoop documentation.
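Applied to the schema in the question, a hedged sketch (the table, index, and node values are taken from the question; dob becomes a STRING on the Hive side):
CREATE EXTERNAL TABLE sampletable (name STRING, address STRING, dob STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.resource' = 'test4/test4',
'es.nodes' = 'x.x.x.x:9200',
'es.mapping.date.rich' = 'false');
-- the DATE column is cast to STRING on the way in
INSERT OVERWRITE TABLE sampletable SELECT name, address, CAST(dob AS STRING) FROM details;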

Most Efficient Way to Join Massive/Small Datasets

I currently have a large RDD called chartEvents containing data of the form:
case class ChartEvent(patientID: String, itemID: String, chartTime: String, storeTime: String, value: String,
valueNum: String, warning: String, error: String)
The data comes from a 35 GB .csv file, which I am parsing in with Spark SQL:
CSVUtils.loadCSVAsTable(sqlContext, "data_unzipped/CHARTEVENTS.csv")
val chartEvents = sqlContext.sql(
"""
|SELECT SUBJECT_ID, ITEMID, CHARTTIME, STORETIME, VALUE, VALUENUM, WARNING, ERROR
|FROM CHARTEVENTS
""".stripMargin)
.map(r => ChartEvent(r(0).toString, r(1).toString, r(2).toString, r(3).toString, r(4).toString,
r(5).toString, r(6).toString, r(7).toString))
I have a separate, very small (fewer than 100 rows) RDD called featureMapping of the form RDD[(itemID, label)], where both are strings. What I am trying to do is filter the chartEvents RDD down to the rows whose itemID appears in featureMapping. My current method is to perform an inner join of the two RDDs as follows:
val result = chartEvents.map{case event => (event.itemID, event)}.join(featureMapping)
However, I am noticing that this is on track to take several hours to run, and it is using a massive amount of space in my /user/<user>/appdata/local/temp folder. Is there a more efficient way to perform this filtering? Would coding it into the sqlContext be faster?
If you register your tables in the Hive metastore, you can set spark.sql.autoBroadcastJoinThreshold.
From the docs:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1, broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
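On Spark 2.2 and later, the same idea can be expressed directly as a broadcast hint in SQL, which avoids shuffling the 35 GB side entirely. A minimal sketch, assuming the small mapping has been registered as a temporary table named feature_mapping with columns ITEMID and LABEL (the name and columns are hypothetical):
-- the hint asks Spark to ship the tiny table to every executor,
-- turning the join into a map-side filter of CHARTEVENTS
SELECT /*+ BROADCAST(f) */ c.*
FROM CHARTEVENTS c
JOIN feature_mapping f ON c.ITEMID = f.ITEMID;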