Errors while saving Spark DataFrame to HBase using Apache Phoenix - scala

I'm trying to save a jsonRDD into HBase using the Apache Phoenix Spark plugin: df.saveToPhoenix(tableName, zkUrl = Some(quorumAddress)). The table looks like:
CREATE TABLE IF NOT EXISTS person (
ID BIGINT NOT NULL PRIMARY KEY,
NAME VARCHAR,
SURNAME VARCHAR) SALT_BUCKETS = 40, COMPRESSION='GZ';
I have between 100,000 and 2,000,000 records in tables of this kind. Some saves complete normally, but others fail with this error:
java.lang.RuntimeException: org.apache.phoenix.exception.PhoenixIOException:
callTimeout=1200000, callDuration=2902566: row 'PERSON' on table 'SYSTEM.CATALOG' at
region=SYSTEM.CATALOG,,1443172839381.a593d4dbac97863f897bca469e8bac66.,
hostname=hadoop-02,16020,1443292360474, seqNum=339
at org.apache.phoenix.mapreduce.PhoenixRecordWriter.close(PhoenixRecordWriter.java:62)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$5.apply$mcV$sp(PairRDDFunctions.scala:1043)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1294)
What could this error mean? Are there other ways to bulk insert data from a DataFrame into HBase?

Related

Create a table from a topic using a where clause using ksql

I'm using what I believe is the latest version of the KSQL server, 0.29.2. I'm trying to create a table that reads from a specific topic. The topic receives lots of events, but I'm only interested in specific ones. Each JSON event has a property named "evenType", so I want to continually filter the events and create a table that stores the client data (phone number, email, etc.) so that I can update the client info.
I created a stream called orders_inputs for testing purposes only, and then tried to create this table, but got the following error.
create table orders(orderid varchar PRIMARY KEY, itemid varchar) WITH (KAFKA_TOPIC='ORDERS', PARTITIONS=1, REPLICAS=1) as select orderid, itemid from orders_inputs where type='t1';
line 1:120: mismatched input 'as' expecting ';'
Statement: create table orders(orderid varchar PRIMARY KEY, itemid varchar) WITH (KAFKA_TOPIC='ORDERS', PARTITIONS=1, REPLICAS=1) as select orderid, itemid from orders_inputs where type='t1';
Caused by: line 1:120: mismatched input 'as' expecting ';'
Caused by: org.antlr.v4.runtime.InputMismatchException
If you want to create a table that contains the results of a select query from a stream, you can use CREATE TABLE AS SELECT:
https://docs.confluent.io/5.2.1/ksql/docs/developer-guide/create-a-table.html#create-a-ksql-table-with-streaming-query-results
e.g.
CREATE TABLE orders AS
SELECT orderid, itemid FROM orders_inputs
WHERE type='t1';
You can specify the primary key when creating the stream orders_inputs: https://docs.confluent.io/5.4.4/ksql/docs/developer-guide/syntax-reference.html#message-keys
Otherwise, you can specify the primary key when creating a table from a topic:
https://docs.confluent.io/5.2.1/ksql/docs/developer-guide/create-a-table.html#create-a-table-with-selected-columns
e.g.
CREATE TABLE orders
(orderid VARCHAR PRIMARY KEY,
itemid VARCHAR)
WITH (KAFKA_TOPIC = 'orders',
VALUE_FORMAT='JSON');
However, you would then have to query the table and filter where type='t1'.

Flyway - Postgresql partitioned table

I would like to create a partitioned table in a PostgreSQL 11 database using Flyway. When I try to execute a simple SQL file like
CREATE TABLE blabla (id varchar(100) NOT NULL, name varchar(100) NULL)
PARTITION BY LIST(name);
I get an error saying that "PARTITION" is not valid, even though I'm using the latest release of the Flyway core library.
Does anyone know whether partitioned tables on PostgreSQL are supported by Flyway, or what the correct way to create a partitioned table is?

Data not syncing from mysql to elastic search after processing through Kafka

We are trying to send data from MySQL to Elasticsearch (ETL) through Kafka.
In MySQL we have multiple tables that we need to aggregate into a specific format before we can send the result to Elasticsearch.
For that, we used Debezium to connect MySQL and Elasticsearch, and transformed the data through KSQL.
We created streams for both tables, repartitioned them, and created a table for one entity, but after joining we didn't get data from both tables.
We are trying to join two MySQL tables through KSQL and send the result to Elasticsearch using Debezium.
Table 1: items
Table 2: item_images
CREATE STREAM items_from_debezium (id integer, tenant_id integer, name string, sku string, barcode string, qty integer, type integer, archived integer)
WITH (KAFKA_TOPIC='mySqlTesting.orderhive.items',VALUE_FORMAT='json');
CREATE STREAM images_from_debezium (id integer,item_id integer,image string, thumbnail string)
WITH (KAFKA_TOPIC='mySqlTesting.orderhive.item_images',VALUE_FORMAT='json');
CREATE STREAM items_flat
WITH (KAFKA_TOPIC='ITEMS_REPART',VALUE_FORMAT='json',PARTITIONS=1) as SELECT * FROM items_from_debezium PARTITION BY id;
CREATE STREAM images_flat
WITH (KAFKA_TOPIC='IMAGES_REPART',VALUE_FORMAT='json',PARTITIONS=1) as SELECT * FROM images_from_debezium PARTITION BY item_id;
CREATE TABLE item_images (id integer,item_id integer,image string, thumbnail string)
WITH (KAFKA_TOPIC='IMAGES_REPART',VALUE_FORMAT='json',KEY='item_id');
SELECT item_images.id,item_images.image,item_images.thumbnail,items_flat.id,items_flat.name,items_flat.sku,items_flat.barcode,items_flat.type,items_flat.archived,items_flat.qty
FROM items_flat left join item_images on items_flat.id=item_images.item_id
limit 10;
We expect data from both tables, but we get null for the columns from the item_images table.

Not able to insert data into hive elasticsearch index using spark SQL

I used the following steps in the Hive terminal to insert into an Elasticsearch index:
Create a Hive table pointing to the Elasticsearch index
CREATE EXTERNAL TABLE test_es(
id string,
name string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test/person', 'es.mapping.id' = 'id');
Create a staging table and insert data into it
Create table emp(id string,name string) row format delimited fields terminated by ',';
load data local inpath '/home/monami/data.txt' into table emp;
Insert data from staging table into the hive elasticsearch index
insert overwrite table test_es select * from emp;
I could browse the Hive Elasticsearch index successfully after following the above steps in the Hive CLI. But whenever I try to insert in the same way using a Spark SQL hiveContext object, I get the following error:
java.lang.RuntimeException: java.lang.RuntimeException: class org.elasticsearch.hadoop.mr.EsOutputFormat$EsOutputCommitter not org.apache.hadoop.mapred.OutputCommitter
Can you please let me know the reason for this error? If it is not possible to insert this way using Spark, what is the method for inserting into a Hive Elasticsearch index using Spark?
Versions used: Spark 1.6, Scala 2.10, Elasticsearch 6.4, Hive 1.1

Apache Spark - Error persisting Dataframe to MemSQL database using JDBC driver

I'm currently facing an issue while trying to save an Apache Spark DataFrame loaded from an Apache Spark temp table to a distributed MemSQL database.
The catch is that I cannot use the MemSQLContext connector for the moment, so I'm using the JDBC driver.
Here is my code:
//store suppliers data from temp table into a dataframe
val suppliers = sqlContext.read.table("tmp_SUPPLIER")
//append data to the target table
suppliers.write.mode(SaveMode.Append).jdbc(url_memsql, "R_SUPPLIER", prop_memsql)
Here is the error message (occurring during the suppliers.write statement):
java.sql.SQLException: Distributed tables must either have a PRIMARY or SHARD key.
Note:
The R_SUPPLIER table has exactly the same fields and datatypes as the temp table and has a primary key set.
FYI, here are some clues:
R_SUPPLIER script:
CREATE TABLE R_SUPPLIER
(
SUP_ID INT NOT NULL PRIMARY KEY,
SUP_CAGE_CODE CHAR(5) NULL,
SUP_INTERNAL_SAP_CODE CHAR(5) NULL,
SUP_NAME VARCHAR(255) NULL,
SHARD KEY(SUP_ID)
);
The suppliers.write statement worked once, but in that case the data had been loaded into the DataFrame with sqlContext.read.jdbc rather than sqlContext.sql (the data was stored in a remote database, not in a local Apache Spark temp table).
Has anyone faced the same issue?
Are you getting that error when you run the CREATE TABLE, or when you run the suppliers.write code? That is an error you should only get when creating a table. Therefore, if you are hitting it when running suppliers.write, your code is probably trying to create and write to a new table, not the one you created before.
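One way to test that theory is to check whether the table name used in the write call is actually visible through the same JDBC connection. The following is only a rough sketch using plain java.sql metadata calls. The URL, credentials, and the object and method names (TableVisibilityCheck, tableVisible) are hypothetical placeholders, not values from the question, and it assumes a MySQL-compatible JDBC driver is on the classpath.

```scala
import java.sql.{Connection, DriverManager}
import java.util.Properties

object TableVisibilityCheck {

  // True if at least one table matching the name pattern is visible through
  // this connection. Catalog and schema are left as null so the driver does
  // not restrict the search to a particular one.
  def tableVisible(conn: Connection, table: String): Boolean = {
    val rs = conn.getMetaData.getTables(null, null, table, Array("TABLE"))
    try rs.next() finally rs.close()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical connection settings: replace them with the values behind
    // url_memsql and prop_memsql from the question.
    val url = "jdbc:mysql://memsql-host:3306/mydb"
    val props = new Properties()
    props.setProperty("user", "root")
    props.setProperty("password", "")

    val conn = DriverManager.getConnection(url, props)
    try {
      // Check the exact spelling used in the write call plus a lower-case
      // variant, since table name matching can be case-sensitive.
      Seq("R_SUPPLIER", "r_supplier").foreach { name =>
        println(s"$name visible: ${tableVisible(conn, name)}")
      }
    } finally conn.close()
  }
}
```

If the name passed to suppliers.write is not reported as visible (for example because the URL points at a different database, or the case differs), that would be consistent with Spark attempting to create a new table rather than appending to the existing one.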