I am trying to read Parquet files via the Flink Table API, and it throws an error when I select one of the timestamp columns.
My Parquet table is something like this. I create the table with this SQL:
CREATE TABLE MyDummyTable (
`id` INT,
ts BIGINT,
ts_ltz AS TO_TIMESTAMP_LTZ(ts, 3),
ts2 TIMESTAMP,
ts3 TIMESTAMP,
ts4 TIMESTAMP,
ts5 TIMESTAMP
)
It throws an error when I select any of ts2, ts3, ts4 or ts5.
The stack trace is:
Caused by: java.lang.IllegalArgumentException: Unexpected type: BINARY
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:77)
at org.apache.flink.formats.parquet.vector.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:369)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createWritableVectors(ParquetVectorizedInputFormat.java:264)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReaderBatch(ParquetVectorizedInputFormat.java:254)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createPoolOfBatches(ParquetVectorizedInputFormat.java:244)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReader(ParquetVectorizedInputFormat.java:137)
at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat.createReader(ParquetVectorizedInputFormat.java:73)
at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.checkSplitOrStartNext(FileSourceSplitReader.java:112)
at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:65)
at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:56)
at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:138)
... 7 more
My current approach
I am currently using the approach below to work around the problem, but it does not feel like a proper solution, since I have to create two columns in the table.
CREATE TABLE MyDummyTable (
`id` INT,
ts2 STRING,
ts2_ts AS TO_TIMESTAMP(ts2)
)
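For reference, the workaround table with the connector options spelled out would look roughly like this; the WITH clause and the path below are assumptions, since they are omitted above:
CREATE TABLE MyDummyTable (
  `id` INT,
  -- read the Parquet timestamp as STRING to avoid the BINARY vector error,
  -- then derive a proper TIMESTAMP via a computed column
  ts2 STRING,
  ts2_ts AS TO_TIMESTAMP(ts2)
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///path/to/parquet',  -- placeholder path
  'format' = 'parquet'
);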
Flink Version: 1.13.2
Scala Version: 2.11.12
Related
org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.UnsupportedOperationException: Parquet does not support
timestamp. See HIVE-6384;
I am getting the above error while executing the following code in Azure Databricks:
spark_session.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time TIMESTAMP
)
PARTITIONED BY (
Date DATE)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION "/mnt/data_analysis/pre-processed/"
""")
As per the HIVE-6384 Jira, starting from Hive 1.2 you can use the timestamp and date types in Parquet tables.
Workarounds for Hive versions < 1.2:
1. Using String type:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time STRING
)
PARTITIONED BY (
Date STRING)
Stored as parquet
Location '/mnt/data_analysis/pre-processed/';
Then, while processing, you can cast arrival_time and Date to the timestamp and date types, or use a view that casts the columns (though views are slow).
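For example, a view along these lines (a sketch only; the view name is made up) exposes the casted columns:
CREATE VIEW IF NOT EXISTS dev_db.processing_table_vw AS
SELECT
    campaign,
    status,
    file_name,
    CAST(arrival_time AS TIMESTAMP) AS arrival_time,  -- stored as STRING in the table
    CAST(`Date` AS DATE) AS `Date`                    -- partition column stored as STRING
FROM dev_db.processing_table;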
2. Using ORC format:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time Timestamp
)
PARTITIONED BY (
Date date)
Stored as orc
Location '/mnt/data_analysis/pre-processed/';
ORC supports both the timestamp and date types.
While writing data into a partitioned Hive table, I am getting the error below.
org.apache.spark.SparkException: Requested partitioning does not match the tablename table:
I have converted my RDD to a DataFrame using a case class, and then I am trying to write the data into an existing partitioned Hive table. But I am getting this error, and as per the printed logs, "Requested partitions:" comes up blank, while the partition columns are present as expected in the Hive table.
Spark shell error:
scala> data1.write.format("hive").partitionBy("category", "state").mode("append").saveAsTable("sampleb.sparkhive6")
org.apache.spark.SparkException: Requested partitioning does not match the sparkhive6 table:
Requested partitions:
Table partitions: category,state
Hive table format:
hive> describe formatted sparkhive6;
OK
col_name data_type comment
txnno int
txndate string
custno int
amount double
product string
city string
spendby string
Partition Information
col_name data_type comment
category string
state string
Try the insertInto() function instead of saveAsTable():
scala> data1.write.format("hive")
         .mode("append")
         .insertInto("sampleb.sparkhive6")  // insertInto() uses the table's existing partitioning; combining it with partitionBy() raises an AnalysisException
(or)
Register a temp view on top of the DataFrame, then use a SQL statement to insert the data into the Hive table:
scala> data1.createOrReplaceTempView("temp_vw")
scala> spark.sql("insert into sampleb.sparkhive6 partition(category,state) select txnno,txndate,custno,amount,product,city,spendby,category,state from temp_vw")
I am creating a VoltDB table with the following CREATE statement:
CREATE TABLE EMPLOYEE (
ID VARCHAR(4) NOT NULL,
CODE VARCHAR(4) NOT NULL,
FIRST_NAME VARCHAR(30) NOT NULL,
LAST_NAME VARCHAR(30) NOT NULL,
PRIMARY KEY (ID, CODE)
);
And partitioning the table with
PARTITION TABLE EMPLOYEE ON COLUMN ID;
I have written a Spark job to insert data into VoltDB. I am using the Scala code below to insert records, and the code works well if we do not partition the table.
import org.voltdb._;
import org.voltdb.client._;
import scala.collection.JavaConverters._
val voltClient:Client = ClientFactory.createClient();
voltClient.createConnection("IP:PORT");
val empDf = spark.read.format("csv")
.option("inferSchema", "true")
.option("header", "true")
.option("sep", ",")
.load("/FileStore/tables/employee.csv")
// Code to convert scala seq to java varargs
def callProcedure(procName: String, parameters: Any*): ClientResponse =
voltClient.callProcedure(procName, paramsToJavaObjects(parameters: _*): _*)
def paramsToJavaObjects(params: Any*) = params.map { param ⇒
val value = param match {
case None ⇒ null
case Some(v) ⇒ v
case _ ⇒ param
}
value.asInstanceOf[AnyRef]
}
empDf.collect().foreach { row =>
callProcedure("EMPLOYEE.insert", row.toSeq:_*);
}
But I get the error below if I partition the table:
Mispartitioned tuple in single-partition insert statement.
Constraint Type PARTITIONING, Table CatalogId EMPLOYEE
Relevant Tuples:
ID CODE FIRST_NAME LAST_NAME
--- ----- ----------- ----------
1 CD01 Naresh "Joshi"
at org.voltdb.client.ClientImpl.internalSyncCallProcedure(ClientImpl.java:485)
at org.voltdb.client.ClientImpl.callProcedureWithClientTimeout(ClientImpl.java:324)
at org.voltdb.client.ClientImpl.callProcedure(ClientImpl.java:260)
at line4c569b049a9d4e51a3e8fda7cbb043de32.$read$$iw$$iw$$iw$$iw$$iw$$iw.callProcedure(command-3986740264398828:9)
at line4c569b049a9d4e51a3e8fda7cbb043de40.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-3986740264399793:8)
at line4c569b049a9d4e51a3e8fda7cbb043de40.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-3986740264399793:7)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
I found a link (https://forum.voltdb.com/forum/voltdb-discussions/building-voltdb-applications/1182-mispartitioned-tuple-in-single-partition-insert-statement) regarding the problem and tried to partition the procedure using the statements below:
PARTITION PROCEDURE EMPLOYEE.insert ON TABLE EMPLOYEE COLUMN ID;
AND
PARTITION PROCEDURE EMPLOYEE.insert ON TABLE EMPLOYEE COLUMN ID [PARAMETER 0];
But I am getting the error [Ad Hoc DDL Input]: VoltDB DDL Error: "Partition references an undefined procedure "EMPLOYEE.insert"" while executing these statements.
However, I am able to insert the data by using the @AdHoc stored procedure, but I am not able to figure out the problem or a solution for the scenario above, where I am using the EMPLOYEE.insert stored procedure to insert data into a partitioned table.
The procedure "EMPLOYEE.insert" is what is referred to as a "default" procedure, which is automatically generated by VoltDB when you create the table EMPLOYEE. It is already automatically partitioned based on the partitioning of the table, therefore you cannot call "PARTITION PROCEDURE EMPLOYEE.insert ..." to override this.
I think what is happening is that the procedure is partitioned by the ID column which in the EMPLOYEE table is a VARCHAR. The input parameter therefore should be a String. However, I think your code is somehow reading the CSV file and passing in the first column as an int value.
The java client callProcedure(String procedureName, Object... params) method accepts varargs for the parameters. This can be any Object[]. There is a check along the way somewhere on the server where the # of arguments must match the # expected by the procedure, or else the procedure call is returned as rejected, and it would never have been executed. However, I think in your case the # of arguments is ok, so it then tries to execute the procedure. It hashes the 1st parameter value corresponding to ID and then determines which partition this should go to. The invocation is routed to that partition for execution. When it executes, it tries to insert the values, but there is another check that the partition key value is correct for this partition, and this is failing.
I think if the value is passed in as an int, it is hashed to the wrong partition. In that partition it tries to insert the value into the column, which is a VARCHAR, so it probably implicitly converts the int to a String; but since the row is not in the correct partition, the insert fails with the error "Mispartitioned tuple in single-partition insert statement." This is the same error you would get if you wrote a Java stored procedure and configured the wrong column as the partition key.
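If that is the cause, one possible fix (a sketch based on the code in the question, not a VoltDB requirement) is to make sure the ID value reaches callProcedure as a String, for example by not inferring the CSV schema so that every column stays a String:
// Read the CSV without inferSchema so every column, including ID, stays a String;
// the String key then hashes to the partition where the row belongs.
val empDf = spark.read.format("csv")
  .option("header", "true")
  .option("sep", ",")
  .load("/FileStore/tables/employee.csv")

empDf.collect().foreach { row =>
  // EMPLOYEE.insert expects parameters in column order: ID, CODE, FIRST_NAME, LAST_NAME
  callProcedure("EMPLOYEE.insert", row.toSeq: _*)
}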
Disclosure: I work at VoltDB.
I am trying to see data from ClickHouse as a graph in Grafana. I have tried a lot with query processing, but I couldn't get any points to show up in Grafana. My tables look like this:
CREATE TABLE m_psutilinfo (
    timestamp String,
    namespace String,
    data Float,
    unit String,
    plugin_running_on String,
    version UInt64,
    last_advertised_time String
) ENGINE = Kafka('10.224.54.99:9092', 'psutilout', 'group3', 'JSONEachRow');

CREATE TABLE m_psutilinfo_t (
    timestamp DateTime,
    namespace String,
    data Float,
    unit String,
    plugin_running_on String,
    version UInt64,
    last_advertised_time String,
    DAY Date
) ENGINE = MergeTree
PARTITION BY DAY
ORDER BY (DAY, timestamp)
SETTINGS index_granularity = 8192;

CREATE MATERIALIZED VIEW m_psutilinfo_view TO m_psutilinfo_t AS
SELECT
    toDateTime(substring(timestamp, 1, 19)) AS timestamp,
    namespace,
    data,
    unit,
    plugin_running_on,
    version,
    last_advertised_time,
    toDate(timestamp) AS DAY
FROM m_psutilinfo;
These are the tables I created in ClickHouse. What should my query in Grafana be to get the data as a graph?
SELECT
$timeSeries as t,
count()
FROM $table
WHERE $timeFilter
GROUP BY t
ORDER BY t
I used Tabix, but I want this in Grafana.
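For the tables above, one concrete shape of that query, plotting the average of data instead of a row count (an assumption about what should be graphed), would be:
SELECT
    $timeSeries AS t,        -- macro: expands using the time column selected in the query editor (timestamp here)
    avg(data) AS avg_data    -- plot the metric value rather than count()
FROM $table                  -- macro: m_psutilinfo_t chosen as the table
WHERE $timeFilter            -- macro: restricts rows to the dashboard's time range
GROUP BY t
ORDER BY t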
Here I'm trying to persist a data frame into a partitioned Hive table and I am getting this silly exception. I have looked into it many times, but I am not able to find the fault.
org.apache.spark.sql.AnalysisException: Specified partition columns
(timestamp value) do not match the partition columns of the table.
Please use () as the partition columns.;
Here is the script with which the external table is created:
CREATE EXTERNAL TABLE IF NOT EXISTS events2 (
action string
,device_os_ver string
,device_type string
,event_name string
,item_name string
,lat DOUBLE
,lon DOUBLE
,memberid BIGINT
,productupccd BIGINT
,tenantid BIGINT
) partitioned BY (timestamp_val DATE)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
stored AS inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 'maprfs:///location/of/events2'
tblproperties ('serialization.null.format' = '');
Here is the result of DESCRIBE FORMATTED for the table "events2":
hive> describe formatted events2;
OK
# col_name data_type comment
action string
device_os_ver string
device_type string
event_name string
item_name string
lat double
lon double
memberid bigint
productupccd bigint
tenantid bigint
# Partition Information
# col_name data_type comment
timestamp_val date
# Detailed Table Information
Database: default
CreateTime: Wed Jan 11 16:58:55 IST 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: maprfs:/location/of/events2
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
serialization.null.format
transient_lastDdlTime 1484134135
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.078 seconds, Fetched: 42 row(s)
Here is the line of code where the data is partitioned and stored into the table:
val tablepath = Map("path" -> "maprfs:///location/of/events2")
AppendDF.write.format("parquet").partitionBy("Timestamp_val").options(tablepath).mode(org.apache.spark.sql.SaveMode.Append).saveAsTable("events2")
While running the application, I'm getting the error below:
Specified partition columns (timestamp_val) do not match the partition
columns of the table.Please use () as the partition columns.
I might be committing an obvious error, any help is much appreciated with an upvote :)
Please print the schema of the DataFrame:
AppendDF.printSchema()
and make sure there is no type mismatch.
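To compare both sides quickly (a sketch, not part of the original answer), print the DataFrame schema next to the target table's schema and check the partition column's name and type:
// DataFrame side: the column passed to partitionBy() must exist here with the expected type
AppendDF.printSchema()

// Table side: the partition column is declared as timestamp_val DATE
spark.table("events2").printSchema()
spark.sql("DESCRIBE FORMATTED events2").show(100, false)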