Specified partition columns do not match the partition columns of the table. Please use () as the partition columns - scala

Here I'm trying to persist the data frame into a partitioned Hive table and getting this silly exception. I have looked into it many times but am not able to find the fault.
org.apache.spark.sql.AnalysisException: Specified partition columns
(timestamp value) do not match the partition columns of the table.
Please use () as the partition columns.;
Here is the script with which the external table is created:
CREATE EXTERNAL TABLE IF NOT EXISTS events2 (
action string
,device_os_ver string
,device_type string
,event_name string
,item_name string
,lat DOUBLE
,lon DOUBLE
,memberid BIGINT
,productupccd BIGINT
,tenantid BIGINT
) partitioned BY (timestamp_val DATE)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
stored AS inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 'maprfs:///location/of/events2'
tblproperties ('serialization.null.format' = '');
Here is the result of describe formatted on the table "events2":
hive> describe formatted events2;
OK
# col_name data_type comment
action string
device_os_ver string
device_type string
event_name string
item_name string
lat double
lon double
memberid bigint
productupccd bigint
tenantid bigint
# Partition Information
# col_name data_type comment
timestamp_val date
# Detailed Table Information
Database: default
CreateTime: Wed Jan 11 16:58:55 IST 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: maprfs:/location/of/events2
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
serialization.null.format
transient_lastDdlTime 1484134135
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.078 seconds, Fetched: 42 row(s)
Here is the line of code where the data is partitioned and stored into the table:
val tablepath = Map("path" -> "maprfs:///location/of/events2")
AppendDF.write
  .format("parquet")
  .partitionBy("Timestamp_val")
  .options(tablepath)
  .mode(org.apache.spark.sql.SaveMode.Append)
  .saveAsTable("events2")
While running the application, I'm getting the below error:
Specified partition columns (timestamp_val) do not match the partition
columns of the table. Please use () as the partition columns.
I might be committing an obvious error; any help is much appreciated, with an upvote :)

Please print the schema of the DataFrame:
AppendDF.printSchema()
Make sure it is not a type mismatch.
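For what it's worth, here is a minimal sketch of the kind of check I would do before the write, assuming the DataFrame is AppendDF and the target partition column is timestamp_val of type date as in the DDL above (the rename and cast are assumptions about where the mismatch might be, not a confirmed fix):
import org.apache.spark.sql.functions.col
// Compare the DataFrame schema with the Hive table's column names and types.
AppendDF.printSchema()
// Assumption: normalize the partition column to the exact name and type the table
// declares (timestamp_val DATE) before writing, since a case or type mismatch is a
// common cause of "Specified partition columns ... do not match" errors.
val fixedDF = AppendDF
  .withColumnRenamed("Timestamp_val", "timestamp_val")
  .withColumn("timestamp_val", col("timestamp_val").cast("date"))
fixedDF.write
  .format("parquet")
  .partitionBy("timestamp_val")
  .options(tablepath)
  .mode(org.apache.spark.sql.SaveMode.Append)
  .saveAsTable("events2")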

Related

ksql - CREATE TABLE results in table with null values even though kafka topic is populated

Using ksqlDB, I have created a JDBC connector with a custom query. Then, from the resulting kafka topic, I have created a table. However, selecting from the table returns data only for the PRIMARY KEY, while returning null for all other values. The postgres database that I am connecting to has its sales table constantly updated with new data, which I am trying to stream using ksql.
ksql> CREATE SOURCE CONNECTOR con WITH (
'connector.class' ='io.confluent.connect.jdbc.JdbcSourceConnector',
'connection.url' = '....',
'topic.prefix' = 'sales',
...
'key' = 'id',
'query' = 'SELECT id, time, price FROM sales');
Message
Created connector CON
ksql> print sales limit 1;
Key format: HOPPING(KAFKA_STRING) or TUMBLING(KAFKA_STRING) or KAFKA_STRING
Value format: JSON or KAFKA_STRING
rowtime: 2020/11/30 09:07:55.109 Z, key: [123], value: {"schema":{"type":"struct","fields":[{"type":"string","optional":false,"field":"id"},{"type":"int64","optional":true,"field":"time"},{"type":"float","optional":true,"field":"price"}],"optional":false},"payload":{"id":"123","time":1,"price":10.0}}
Topic printing ceased
ksql> CREATE TABLE sales_table (id VARCHAR PRIMARY KEY, time INT, price DOUBLE) WITH (kafka_topic='sales', partitions=1, value_format='JSON');
Message
Table created
ksql> SELECT * FROM sales_table EMIT CHANGES LIMIT 1;
+-----+-----+-----+
|ID |TIME |PRICE|
+-----+-----+-----+
|123 |null |null |
Limit Reached
Query terminated
As you can see, the kafka topic has entries with proper values in the time and price fields. However, when a table is created over that topic, selecting from the table yields null time and price fields. Only the id (which is the PRIMARY KEY column) is printed correctly.
Any idea why this is happening?
You're using the org.apache.kafka.connect.json.JsonConverter converter in your connector with schemas.enable=true, so the message value is wrapped in a schema/payload envelope (as the print output above shows). The value's top-level fields are therefore schema and payload, not (id VARCHAR PRIMARY KEY, time INT, price DOUBLE), and thus you get NULL values.
Better is to use io.confluent.connect.avro.AvroConverter (or Protobuf, or JSON Schema) in your source connector, because then you don't even have to type the schema into your CREATE TABLE; you just have:
CREATE TABLE sales_table WITH (kafka_topic='sales', value_format='AVRO');
You specify the alternative converter thus:
CREATE SOURCE CONNECTOR SOURCE_01 WITH (
…
'key.converter'= 'org.apache.kafka.connect.storage.StringConverter',
'value.converter'= 'io.confluent.connect.avro.AvroConverter',
'value.converter.schema.registry.url'= 'http://schema-registry:8081'
);
But if you must use JSON, disable schemas in your source connector:
CREATE SOURCE CONNECTOR SOURCE_01 WITH (
…
'value.converter.schemas.enable'= 'false'
);
Ref: https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained

Not able to create Hive table with TIMESTAMP datatype in Azure Databricks

org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.UnsupportedOperationException: Parquet does not support
timestamp. See HIVE-6384;
I am getting the above error while executing the following code in Azure Databricks.
spark_session.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time TIMESTAMP
)
PARTITIONED BY (
Date DATE)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION "/mnt/data_analysis/pre-processed/"
""")
As per the HIVE-6384 Jira, starting from Hive 1.2 you can use timestamp and date types in Parquet tables.
Workarounds for Hive versions < 1.2:
1. Using String type:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time STRING
)
PARTITIONED BY (
Date STRING)
Stored as parquet
Location '/mnt/data_analysis/pre-processed/';
Then, while processing, you can cast arrival_time and Date to timestamp and date types (see the sketch below).
You can also use a view and cast the columns there, but views are slow.
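A minimal sketch of that cast step in Spark, assuming the string-typed table above has been created and is read back with the same spark_session as in the question (the variable names are illustrative):
import org.apache.spark.sql.functions.col
// Read the string-typed table and cast the columns back to their logical types.
val raw = spark_session.table("dev_db.processing_table")
val typed = raw
  .withColumn("arrival_time", col("arrival_time").cast("timestamp"))
  .withColumn("Date", col("Date").cast("date"))
typed.printSchema()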
2. Using ORC format:
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
(
campaign STRING,
status STRING,
file_name STRING,
arrival_time Timestamp
)
PARTITIONED BY (
Date date)
Stored as orc
Location '/mnt/data_analysis/pre-processed/';
ORC supports both timestamp and date types.

How to resolve this error "org.apache.spark.SparkException: Requested partitioning does not match the tablename table" in spark-shell

While writing data into a partitioned Hive table, I am getting the below error.
org.apache.spark.SparkException: Requested partitioning does not match the tablename table:
I have converted my RDD to a DataFrame using a case class, and then I am trying to write the data into the existing partitioned Hive table. But I am getting this error, and as per the printed logs, "Requested partitions:" is coming up blank. Partition columns are showing as expected in the Hive table.
spark-shell error :-
scala> data1.write.format("hive").partitionBy("category", "state").mode("append").saveAsTable("sampleb.sparkhive6")
org.apache.spark.SparkException: Requested partitioning does not match the sparkhive6 table:
Requested partitions:
Table partitions: category,state
Hive table format :-
hive> describe formatted sparkhive6;
OK
col_name data_type comment
txnno int
txndate string
custno int
amount double
product string
city string
spendby string
Partition Information
col_name data_type comment
category string
state string
Try the insertInto() function instead of saveAsTable(). insertInto() takes the partitioning from the table definition, so partitionBy() is not needed here (Spark 2.x will refuse to combine insertInto() with partitionBy()).
scala> data1.write.format("hive")
         .mode("append")
         .insertInto("sampleb.sparkhive6")
(or)
Register a temp view on top of the DataFrame, then write a SQL statement to insert the data into the Hive table.
scala> data1.createOrReplaceTempView("temp_vw")
scala> spark.sql("insert into sampleb.sparkhive6 partition(category,state) select txnno,txndate,custno,amount,product,city,spendby,category,state from temp_vw")

Getting `Mispartitioned tuple in single-partition insert statement` while trying to insert data into a partitioned table using `TABLE_NAME.insert`

I am creating a VoltDB table with the following CREATE statement:
CREATE TABLE EMPLOYEE (
ID VARCHAR(4) NOT NULL,
CODE VARCHAR(4) NOT NULL,
FIRST_NAME VARCHAR(30) NOT NULL,
LAST_NAME VARCHAR(30) NOT NULL,
PRIMARY KEY (ID, CODE)
);
And partitioning the table with
PARTITION TABLE EMPLOYEE ON COLUMN ID;
I have written a Spark job to insert data into VoltDB. I am using the below Scala code to insert records into VoltDB; the code works well if we do not partition the table.
import org.voltdb._;
import org.voltdb.client._;
import scala.collection.JavaConverters._
val voltClient:Client = ClientFactory.createClient();
voltClient.createConnection("IP:PORT");
val empDf = spark.read.format("csv")
.option("inferSchema", "true")
.option("header", "true")
.option("sep", ",")
.load("/FileStore/tables/employee.csv")
// Code to convert scala seq to java varargs
def callProcedure(procName: String, parameters: Any*): ClientResponse =
voltClient.callProcedure(procName, paramsToJavaObjects(parameters: _*): _*)
def paramsToJavaObjects(params: Any*) = params.map { param ⇒
val value = param match {
case None ⇒ null
case Some(v) ⇒ v
case _ ⇒ param
}
value.asInstanceOf[AnyRef]
}
empDf.collect().foreach { row =>
callProcedure("EMPLOYEE.insert", row.toSeq:_*);
}
But I get the below error if I partition the table:
Mispartitioned tuple in single-partition insert statement.
Constraint Type PARTITIONING, Table CatalogId EMPLOYEE
Relevant Tuples:
ID CODE FIRST_NAME LAST_NAME
--- ----- ----------- ----------
1 CD01 Naresh "Joshi"
at org.voltdb.client.ClientImpl.internalSyncCallProcedure(ClientImpl.java:485)
at org.voltdb.client.ClientImpl.callProcedureWithClientTimeout(ClientImpl.java:324)
at org.voltdb.client.ClientImpl.callProcedure(ClientImpl.java:260)
at line4c569b049a9d4e51a3e8fda7cbb043de32.$read$$iw$$iw$$iw$$iw$$iw$$iw.callProcedure(command-3986740264398828:9)
at line4c569b049a9d4e51a3e8fda7cbb043de40.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-3986740264399793:8)
at line4c569b049a9d4e51a3e8fda7cbb043de40.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-3986740264399793:7)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
I found a link (https://forum.voltdb.com/forum/voltdb-discussions/building-voltdb-applications/1182-mispartitioned-tuple-in-single-partition-insert-statement) regarding the problem and tried to partition the procedure using the below queries:
PARTITION PROCEDURE EMPLOYEE.insert ON TABLE EMPLOYEE COLUMN ID;
AND
PARTITION PROCEDURE EMPLOYEE.insert ON TABLE EMPLOYEE COLUMN ID [PARAMETER 0];
But I am getting an [Ad Hoc DDL Input]: VoltDB DDL Error: "Partition references an undefined procedure "EMPLOYEE.insert"" error while executing these statements.
However, I am able to insert the data by using the #AdHoc stored procedure, but I am not able to figure out the problem or a solution for the above scenario, where I am using the EMPLOYEE.insert stored procedure to insert data into a partitioned table.
The procedure "EMPLOYEE.insert" is what is referred to as a "default" procedure, which is automatically generated by VoltDB when you create the table EMPLOYEE. It is already automatically partitioned based on the partitioning of the table, therefore you cannot call "PARTITION PROCEDURE EMPLOYEE.insert ..." to override this.
I think what is happening is that the procedure is partitioned by the ID column which in the EMPLOYEE table is a VARCHAR. The input parameter therefore should be a String. However, I think your code is somehow reading the CSV file and passing in the first column as an int value.
The java client callProcedure(String procedureName, Object... params) method accepts varargs for the parameters. This can be any Object[]. There is a check along the way somewhere on the server where the # of arguments must match the # expected by the procedure, or else the procedure call is returned as rejected, and it would never have been executed. However, I think in your case the # of arguments is ok, so it then tries to execute the procedure. It hashes the 1st parameter value corresponding to ID and then determines which partition this should go to. The invocation is routed to that partition for execution. When it executes, it tries to insert the values, but there is another check that the partition key value is correct for this partition, and this is failing.
I think if the value is passed in as an int, it is hashed to the wrong partition. Then in that partition it tries to insert the value into the column, which is a VARCHAR so it probably implicitly converts the int to a String, but it's not in the correct partition, so the insert fails with this error "Mispartitioned tuple in single-partition insert statement." which is the same error you would get if you wrote a java stored procedure and configured the wrong column as the partition key.
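A minimal sketch of that fix against the code from the question, assuming the CSV header names the first column ID (the cast, and the alternative of turning off inferSchema, are assumptions rather than a confirmed diagnosis):
import org.apache.spark.sql.functions.col
// Assumption: inferSchema typed ID as an integer, so cast it back to a string
// so the client hashes the partition key the same way the VARCHAR column does.
val empDfFixed = empDf.withColumn("ID", col("ID").cast("string"))
empDfFixed.collect().foreach { row =>
  callProcedure("EMPLOYEE.insert", row.toSeq: _*)
}
Alternatively, reading the CSV with .option("inferSchema", "false") keeps every column as a string in the first place.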
Disclosure: I work at VoltDB.

HiveContext.sql("insert into")

I'm trying to insert data with HiveContext like this:
/* table filedata
CREATE TABLE `filedata`(
`host_id` string,
`reportbatch` string,
`url` string,
`datatype` string,
`data` string,
`created_at` string,
`if_del` boolean)
*/
hiveContext.sql("insert into filedata (host_id, data) values (\"a1e1\", \"welcome\")")
It throws an error, so I tried to use "select":
hiveContext.sql("select \"a1e1\" as host_id, \"welcome\"as data").write.mode("append").saveAsTable("filedata")
/*
stack trace
java.lang.ArrayIndexOutOfBoundsException: 2
*/
It needs all columns, like this:
hc.sql("""select "a1e1" as host_id,
  "xx" as reportbatch,
  "xx" as url,
  "xx" as datatype,
  "welcome" as data,
  "2017" as created_at,
  1 as if_del""").write.mode("append").saveAsTable("filedata")
Is there a way to insert specified columns? For example, only insert columns "host_id" and "data".
As far as I know, Hive does not support inserting values into only some columns.
From the documentation
Each row listed in the VALUES clause is inserted into table tablename.
Values must be provided for every column in the table. The standard
SQL syntax that allows the user to insert values into only some
columns is not yet supported. To mimic the standard SQL, nulls can be
provided for columns the user does not wish to assign a value to.
So you should try this (one value per column, seven columns in total):
val data = sqlc.sql("select 'a1e1', null, null, null, 'welcome', null, null")
data.write.mode("append").insertInto("filedata")
You can do it if you are using a row columnar file format such as ORC. Please see the working example below. This example is in Hive, but it will work very well with HiveContext.
hive> use default;
OK
Time taken: 1.735 seconds
hive> create table test_insert (a string, b string, c string, d int) stored as orc;
OK
Time taken: 0.132 seconds
hive> insert into test_insert (a,c) values('x','y');
Query ID = user_20171219190337_b293c372-5225-4084-94a1-dec1df9e930d
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1507021764560_1375895)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 4.06 s
--------------------------------------------------------------------------------
Loading data to table default.test_insert
Table default.test_insert stats: [numFiles=1, numRows=1, totalSize=417, rawDataSize=254]
OK
Time taken: 6.828 seconds
hive> select * from test_insert;
OK
x NULL y NULL
Time taken: 0.142 seconds, Fetched: 1 row(s)
hive>