Read pipe-separated values in KSQL - apache-kafka

I am working on a POC in which I have to read a pipe-separated value file and insert its records into MS SQL Server.
I am using Confluent 5.4.1 so that I can use the VALUE_DELIMITER property of CREATE STREAM, but it gives the exception: Delimeter only supported with DELIMITED format
1. Start Confluent (version: 5.4.1):
[Dev root # myip ~]
# confluent local start
The local commands are intended for a single-node development environment
only, NOT for production usage. https://docs.confluent.io/current/cli/index.html
Using CONFLUENT_CURRENT: /tmp/confluent.vHhSRAnj
Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
schema-registry is [UP]
Starting kafka-rest
kafka-rest is [UP]
Starting connect
connect is [UP]
Starting ksql-server
ksql-server is [UP]
Starting control-center
control-center is [UP]
[Dev root # myip ~]
# jps
49923 KafkaRestMain
50099 ConnectDistributed
49301 QuorumPeerMain
50805 KsqlServerMain
49414 SupportedKafka
52103 Jps
51020 ControlCenter
1741
49646 SchemaRegistryMain
[Dev root # myip ~]
#
2. Create Topic:
[Dev root # myip ~]
# kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic SampleData
Created topic SampleData.
3. Provide pipe-separated data to the SampleData topic:
[Dev root # myip ~]
# kafka-console-producer --broker-list localhost:9092 --topic SampleData <<EOF
> this is col1|and now col2|and col 3 :)
> EOF
>>[Dev root # myip ~]
#
4. Start KSQL:
[Dev root # myip ~]
# ksql
===========================================
=        _  __ _____  ____  _             =
=       | |/ // ____|/ __ \| |            =
=       | ' /| (___ | |  | | |            =
=       |  <  \___ \| |  | | |            =
=       | . \ ____) | |__| | |____        =
=       |_|\_\_____/ \___\_\______|       =
=                                         =
=  Streaming SQL Engine for Apache Kafka® =
===========================================
Copyright 2017-2019 Confluent Inc.
CLI v5.4.1, Server v5.4.1 located at http://localhost:8088
Having trouble? Type 'help' (case-insensitive) for a rundown of how things work!
5. Declare a schema for the existing topic: SampleData
ksql> CREATE STREAM sample_delimited (
> column1 varchar(1000),
> column2 varchar(1000),
> column3 varchar(1000))
> WITH (KAFKA_TOPIC='SampleData', VALUE_FORMAT='DELIMITED', VALUE_DELIMITER='|');
Message
----------------
Stream created
----------------
6. Verify the data in the KSQL stream:
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
ksql> SELECT * FROM sample_delimited emit changes limit 1;
+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
|ROWTIME |ROWKEY |COLUMN1 |COLUMN2 |COLUMN3 |
+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
|1584339233947 |null |this is col1 |and now col2 |and col 3 :) |
Limit Reached
Query terminated
7. Write to a new Kafka topic, SampleDataAvro, serializing all the data from the sample_delimited stream to Avro:
ksql> CREATE STREAM sample_avro WITH (KAFKA_TOPIC='SampleDataAvro', VALUE_FORMAT='AVRO') AS SELECT * FROM sample_delimited;
Delimeter only supported with DELIMITED format
ksql>
8. The above statement fails with the exception:
Delimeter only supported with DELIMITED format
9. Load the MS SQL Kafka Connect sink configuration:
confluent local load test-sink -- -d ./etc/kafka-connect-jdbc/sink-quickstart-mssql.properties
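For reference, a JDBC sink properties file of this kind looks roughly like the following sketch. This is only an illustration of the shape of the file; the actual sink-quickstart-mssql.properties shipped with Confluent may differ, and the connection details below are placeholders:
name=test-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=SampleDataAvro
connection.url=jdbc:sqlserver://localhost:1433;databaseName=testdb
connection.user=sa
connection.password=<password>
auto.create=true
insert.mode=insert
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081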

The only time you need to specify the delimiter is when you define the stream that is reading from the source topic.
Here's my worked example:
Populate a topic with pipe-delimited data:
$ kafkacat -b localhost:9092 -t SampleData -P<<EOF
this is col1|and now col2|and col 3 :)
EOF
Declare a stream over it
CREATE STREAM sample_delimited (
column1 varchar(1000),
column2 varchar(1000),
column3 varchar(1000))
WITH (KAFKA_TOPIC='SampleData', VALUE_FORMAT='DELIMITED', VALUE_DELIMITER='|');
Query the stream to make sure it works
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
ksql> SELECT * FROM sample_delimited emit changes limit 1;
+----------------+--------+---------------+--------------+--------------+
|ROWTIME |ROWKEY |COLUMN1 |COLUMN2 |COLUMN3 |
+----------------+--------+---------------+--------------+--------------+
|1583933584658 |null |this is col1 |and now col2 |and col 3 :) |
Limit Reached
Query terminated
Reserialise the data to Avro:
CREATE STREAM sample_avro WITH (KAFKA_TOPIC='SampleDataAvro', VALUE_FORMAT='AVRO') AS SELECT * FROM sample_delimited;
Dump the contents of the topic - note that it is now Avro:
ksql> print SampleDataAvro;
Key format: UNDEFINED
Value format: AVRO
rowtime: 3/11/20 1:33:04 PM UTC, key: <null>, value: {"COLUMN1": "this is col1", "COLUMN2": "and now col2", "COLUMN3": "and col 3 :)"}
The error that you're hitting is a result of bug #4200. You can wait for the next release of Confluent Platform, or use standalone ksqlDB, in which the issue is already fixed.
Here's using ksqlDB 0.7.1 streaming the data to MS SQL:
CREATE SINK CONNECTOR SINK_MSSQL WITH (
'connector.class' = 'io.confluent.connect.jdbc.JdbcSinkConnector',
'connection.url' = 'jdbc:sqlserver://mssql:1433',
'connection.user' = 'sa',
'connection.password' = 'Admin123',
'topics' = 'SampleDataAvro',
'key.converter' = 'org.apache.kafka.connect.storage.StringConverter',
'auto.create' = 'true',
'insert.mode' = 'insert'
);
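If you want to confirm the connector is running before querying MS SQL, ksqlDB also exposes connector management statements from the same prompt; a usage sketch:
ksql> SHOW CONNECTORS;
ksql> DESCRIBE CONNECTOR SINK_MSSQL;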
Now query the data in MS SQL
1> SELECT @@VERSION
2> go
---------------------------------------------------------------------
Microsoft SQL Server 2017 (RTM-CU17) (KB4515579) - 14.0.3238.1 (X64)
Sep 13 2019 15:49:57
Copyright (C) 2017 Microsoft Corporation
Developer Edition (64-bit) on Linux (Ubuntu 16.04.6 LTS)
(1 rows affected)
1> SELECT * FROM SampleDataAvro;
2> GO
COLUMN3 COLUMN2 COLUMN1
-------------- --------------- ------------------
and col 3 :) and now col2 this is col1
(1 rows affected)

Related

Configuration of JDBC Sink Connector in KsqlDB from MySQL database to PostgreSQL database

I wanted to copy a table from MySQL database to PostgreSQL. I have KsqlDB which acts as a stream processor. For the start, I just want to copy a simple table from the source's 'inventory' database to the sink database (PostgreSQL). The following is the structure of inventory database:
mysql> show tables;
+---------------------+
| Tables_in_inventory |
+---------------------+
| addresses |
| customers |
| geom |
| orders |
| products |
| products_on_hand |
+---------------------+
I have logged into ksqlDB and registered a source connector using the following configuration:
CREATE SOURCE CONNECTOR inventory_connector WITH (
'connector.class' = 'io.debezium.connector.mysql.MySqlConnector',
'database.hostname' = 'mysql',
'database.port' = '3306',
'database.user' = 'debezium',
'database.password' = 'dbz',
'database.allowPublicKeyRetrieval' = 'true',
'database.server.id' = '223344',
'database.server.name' = 'dbserver',
'database.whitelist' = 'inventory',
'database.history.kafka.bootstrap.servers' = 'broker:9092',
'database.history.kafka.topic' = 'schema-changes.inventory',
'transforms' = 'unwrap',
'transforms.unwrap.type'= 'io.debezium.transforms.UnwrapFromEnvelope',
'key.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'key.converter.schemas.enable'= 'false',
'value.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'value.converter.schemas.enable'= 'false'
);
The following are the topics created
ksql> LIST TOPICS;
Kafka Topic | Partitions | Partition Replicas
-----------------------------------------------------------------------
_ksql-connect-configs | 1 | 1
_ksql-connect-offsets | 25 | 1
_ksql-connect-statuses | 5 | 1
dbserver | 1 | 1
dbserver.inventory.addresses | 1 | 1
dbserver.inventory.customers | 1 | 1
dbserver.inventory.geom | 1 | 1
dbserver.inventory.orders | 1 | 1
dbserver.inventory.products | 1 | 1
dbserver.inventory.products_on_hand | 1 | 1
default_ksql_processing_log | 1 | 1
schema-changes.inventory | 1 | 1
-----------------------------------------------------------------------
Now I just need to copy the contents of 'dbserver.inventory.customers' to the PostgreSQL database. The following is the structure of the data:
ksql> PRINT 'dbserver.inventory.customers' FROM BEGINNING;
Key format: JSON or HOPPING(KAFKA_STRING) or TUMBLING(KAFKA_STRING) or KAFKA_STRING
Value format: JSON or KAFKA_STRING
rowtime: 2022/08/29 02:39:20.772 Z, key: {"id":1001}, value: {"id":1001,"first_name":"Sally","last_name":"Thomas","email":"sally.thomas@acme.com"}, partition: 0
rowtime: 2022/08/29 02:39:20.773 Z, key: {"id":1002}, value: {"id":1002,"first_name":"George","last_name":"Bailey","email":"gbailey@foobar.com"}, partition: 0
rowtime: 2022/08/29 02:39:20.773 Z, key: {"id":1003}, value: {"id":1003,"first_name":"Edward","last_name":"Walker","email":"ed@walker.com"}, partition: 0
rowtime: 2022/08/29 02:39:20.773 Z, key: {"id":1004}, value: {"id":1004,"first_name":"Anne","last_name":"Kretchmar","email":"annek@noanswer.org"}, partition: 0
I have tried the following configuration of the sink connector:
CREATE SINK CONNECTOR postgres_sink WITH (
'connector.class'= 'io.confluent.connect.jdbc.JdbcSinkConnector',
'connection.url'= 'jdbc:postgresql://postgres:5432/inventory',
'connection.user' = 'postgresuser',
'connection.password' = 'postgrespw',
'topics'= 'dbserver.inventory.customers',
'transforms'= 'unwrap',
'transforms.unwrap.type'= 'io.debezium.transforms.ExtractNewRecordState',
'transforms.unwrap.drop.tombstones'= 'false',
'key.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'key.converter.schemas.enable'= 'false',
'value.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'value.converter.schemas.enable'= 'false',
'auto.create'= 'true',
'insert.mode'= 'upsert',
'auto.evolve' = 'true',
'table.name.format' = '${topic}',
'pk.mode' = 'record_key',
'pk.fields' = 'id',
'delete.enabled'= 'true'
);
It creates the connector but shows the following errors:
ksqldb-server | Caused by: org.apache.kafka.connect.errors.ConnectException: Sink connector 'POSTGRES_SINK' is configured with 'delete.enabled=true' and 'pk.mode=record_key' and therefore requires records with a non-null key and non-null Struct or primitive key schema, but found record at (topic='dbserver.inventory.customers',partition=0,offset=0,timestamp=1661740760772) with a HashMap key and null key schema.
What should be the configuration of the Sink Connector to copy these data to PostgreSQL?
I have also tried creating a stream first in AVRO and then using the AVRO key and value converters, but it did not work. I think it has something to do with using the right SMTs, but I am not sure.
My ultimate aim is to join different streams and then store the result in PostgreSQL as part of implementing the CQRS architecture. So if someone can share a framework I could use in such a case, it would be really useful.
As the error says, the key must be a primitive, not a JSON object, and not Avro either.
Based on the JSON you've shown, you'd need an ExtractField transform on your key:
transforms=getKey,unwrap
transforms.getKey.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.getKey.field=id
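Expressed in the ksqlDB CREATE SINK CONNECTOR syntax used in the question, that would look roughly as follows. This is a sketch only: the 'transforms' value becomes 'getKey,unwrap' and the two getKey lines are new; every other property is copied unchanged from the question's config:
CREATE SINK CONNECTOR postgres_sink WITH (
'connector.class'= 'io.confluent.connect.jdbc.JdbcSinkConnector',
'connection.url'= 'jdbc:postgresql://postgres:5432/inventory',
'connection.user' = 'postgresuser',
'connection.password' = 'postgrespw',
'topics'= 'dbserver.inventory.customers',
'transforms'= 'getKey,unwrap',
'transforms.getKey.type'= 'org.apache.kafka.connect.transforms.ExtractField$Key',
'transforms.getKey.field'= 'id',
'transforms.unwrap.type'= 'io.debezium.transforms.ExtractNewRecordState',
'transforms.unwrap.drop.tombstones'= 'false',
'key.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'key.converter.schemas.enable'= 'false',
'value.converter'= 'org.apache.kafka.connect.json.JsonConverter',
'value.converter.schemas.enable'= 'false',
'auto.create'= 'true',
'insert.mode'= 'upsert',
'auto.evolve' = 'true',
'table.name.format' = '${topic}',
'pk.mode' = 'record_key',
'pk.fields' = 'id',
'delete.enabled'= 'true'
);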
Or, you might be able to change your source connector to use IntegerConverter rather than JsonConverter for the keys.
Debezium also has an old blog post covering this exact use case - https://debezium.io/blog/2017/09/25/streaming-to-another-database/

KSQL left join giving 'null' result even when data is present

I'm learning KSQL/ksqlDB and am currently exploring joins. Below is the issue where I'm stuck.
I have one stream, 'DRIVERSTREAMREPARTITIONEDKEYED', and one table, 'COUNTRIES'; below are their descriptions.
ksql> describe DRIVERSTREAMREPARTITIONEDKEYED;
Name: DRIVERSTREAMREPARTITIONEDKEYED
Field | Type
--------------------------------------
COUNTRYCODE | VARCHAR(STRING) (key)
NAME | VARCHAR(STRING)
RATING | DOUBLE
--------------------------------------
ksql> describe countries;
Name : COUNTRIES
Field | Type
----------------------------------------------
COUNTRYCODE | VARCHAR(STRING) (primary key)
COUNTRYNAME | VARCHAR(STRING)
----------------------------------------------
This is the sample data they contain:
ksql> select * from DRIVERSTREAMREPARTITIONEDKEYED emit changes;
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|COUNTRYCODE |NAME |RATING |
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|SGP |Suresh |3.5 |
|IND |Mahesh |2.4 |
ksql> select * from countries emit changes;
+---------------------------------------------------------------------+---------------------------------------------------------------------+
|COUNTRYCODE |COUNTRYNAME |
+---------------------------------------------------------------------+---------------------------------------------------------------------+
|IND |INDIA |
|SGP |SINGAPORE |
I'm trying to do a 'left outer' join on them with the stream on the left side, but below is the output I get:
select d.name,d.rating,c.COUNTRYNAME from DRIVERSTREAMREPARTITIONEDKEYED d left join countries c on d.COUNTRYCODE=c.COUNTRYCODE emit changes;
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|NAME |RATING |COUNTRYNAME |
+---------------------------------------------+---------------------------------------------+---------------------------------------------+
|Suresh |3.5 |null |
|Mahesh |2.4 |null |
In the ideal scenario I should get data in the 'COUNTRYNAME' column, since the 'COUNTRYCODE' column in both the stream and the table has matching values.
I tried searching a lot, but to no avail.
I'm using 'Confluent Platform: 6.1.1'
For a join to work, it is our responsibility to ensure that the keys of both entities being joined lie in the same partitions; ksqlDB can't verify whether the partitioning strategies are the same for both join inputs.
In my case, my 'Drivers' topic had 2 partitions, over which I had created the stream 'DriversStream', which in turn also had 2 partitions, but the table 'Countries' that I wanted to join it with had only 1 partition. Because of this I re-keyed 'DriversStream' and created another stream, 'DRIVERSTREAMREPARTITIONEDKEYED', shown in the question.
But the data of the table and the stream still did not end up in the same partitions, hence the join was failing.
I created another topic, 'DRIVERINFO', with 1 partition:
kafka-topics --bootstrap-server localhost:9092 --create --partitions 1 --replication-factor 1 --topic DRIVERINFO
Then I created a stream, 'DRIVERINFOSTREAM', over it:
CREATE STREAM DRIVERINFOSTREAM (NAME STRING, RATING DOUBLE, COUNTRYCODE STRING) WITH (KAFKA_TOPIC='DRIVERINFO', VALUE_FORMAT='JSON');
Finally, I joined it with the 'COUNTRIES' table, which worked:
ksql> select d.name,d.rating,c.COUNTRYNAME from DRIVERINFOSTREAM d left join countries c on d.COUNTRYCODE=c.COUNTRYCODE EMIT CHANGES;
+-------------------------------------------+-------------------------------------------+-------------------------------------------+
|NAME |RATING |COUNTRYNAME |
+-------------------------------------------+-------------------------------------------+-------------------------------------------+
|Suresh |2.4 |SINGAPORE |
|Mahesh |3.6 |INDIA |
Refer to the links below for details:
KSQL join
Partitioning data for Joins
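As an aside, a quick way to confirm co-partitioning before attempting a join is to check the partition count of each side's backing topic, either with kafka-topics or from the ksql prompt; a sketch using names from this answer (DESCRIBE EXTENDED reports the underlying Kafka topic and its partition count):
kafka-topics --bootstrap-server localhost:9092 --describe --topic DRIVERINFO
ksql> DESCRIBE EXTENDED DRIVERINFOSTREAM;
ksql> DESCRIBE EXTENDED COUNTRIES;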

Are nested objects defined in JSON supported in KSQL using the STRUCT type?

How do I create a KSQL stream listening on topic T where the JSON structure of the message is:
{"k":"1","a":{"b":1,"c":{"d":10}}}
I tried the following and it does not work; I'm getting a syntax error.
create stream s (k VARCHAR,a STRUCT <b INT ,c <STRUCT d INT >> )
with (KAFKA_TOPIC='T',VALUE_FORMAT='JSON',KEY='k')
Here's an example of how to do it. I've got the test data in a topic:
ksql> PRINT test FROM BEGINNING;
Format:JSON
{"ROWTIME":1578571011016,"ROWKEY":"null","k":"1","a":{"b":1,"c":{"d":10}}}
^CTopic printing ceased
Declare the stream:
ksql> CREATE STREAM TEST (K VARCHAR,
A STRUCT<B INT,
C STRUCT<D INT>>
) WITH (KAFKA_TOPIC='test',
VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
Query the stream:
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
ksql> SELECT * FROM TEST EMIT CHANGES LIMIT 1;
+---------------------+----------+-----+---------------------+
|ROWTIME |ROWKEY |K |A |
+---------------------+----------+-----+---------------------+
|1578571011016 |null |1 |{B=1, C={D=10}} |
Limit Reached
Query terminated
ksql> SELECT K, A, A->B, A->C, A->C->D FROM TEST EMIT CHANGES LIMIT 1;
+----+-----------------+-------+---------+----------+
|K |A |A__B |A__C |A__C__D |
+----+-----------------+-------+---------+----------+
|1 |{B=1, C={D=10}} |1 |{D=10} |10 |
Limit Reached
Query terminated

Need to filter out Kafka Records based on a certain keyword

I have a Kafka topic which has around 3 million records. I want to pick out a single record from it that has a certain parameter. I have been trying to query this using Lenses, but I'm unable to form the correct query. Below are the contents of one message:
{
"header": {
"schemaVersionNo": "1",
},
"payload": {
"modifiedDate": 1552334325212,
"createdDate": 1552334325212,
"createdBy": "A",
"successful": true,
"source_order_id": "1111111111111",
}
}
Now I want to filter out the record with a particular source_order_id, but I'm not able to figure out the right way to do so.
We have tried via Lenses as well as Kafka Tool.
A sample query that we tried in lenses is below:
SELECT * FROM `TEST`
WHERE _vtype='JSON' AND _ktype='BYTES'
AND _sample=2 AND _sampleWindow=200 AND payload.createdBy='A'
This query works; however, if we try with the source order id as shown below, we get an error:
SELECT * FROM `TEST`
WHERE _vtype='JSON' AND _ktype='BYTES'
AND _sample=2 AND _sampleWindow=200 AND payload.source_order_id='1111111111111'
Error : "Invalid syntax at line=3 and column=41.Invalid syntax for 'payload.source_order_id'. Field 'payload' resolves to primitive type STRING.
Consuming all 3 million records via a custom consumer and then iterating over them doesn't seem like an optimised approach to me, so I'm looking for any available solutions for such a use case.
Since you said you are open to other solutions, here is one built using KSQL.
First, let's get some sample records into a source topic:
$ kafkacat -P -b localhost:9092 -t TEST <<EOF
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325212, "createdDate": 1552334325212, "createdBy": "A", "successful": true, "source_order_id": "3411976933214" } }
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325412, "createdDate": 1552334325412, "createdBy": "B", "successful": true, "source_order_id": "3411976933215" } }
{ "header": { "schemaVersionNo": "1" }, "payload": { "modifiedDate": 1552334325612, "createdDate": 1552334325612, "createdBy": "C", "successful": true, "source_order_id": "3411976933216" } }
EOF
Using KSQL we can inspect the topic with PRINT:
ksql> PRINT 'TEST' FROM BEGINNING;
Format:JSON
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325212,"createdDate":1552334325212,"createdBy":"A","successful":true,"source_order_id":"3411976933214"}}
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325412,"createdDate":1552334325412,"createdBy":"B","successful":true,"source_order_id":"3411976933215"}}
{"ROWTIME":1552476232988,"ROWKEY":"null","header":{"schemaVersionNo":"1"},"payload":{"modifiedDate":1552334325612,"createdDate":1552334325612,"createdBy":"C","successful":true,"source_order_id":"3411976933216"}}
Then declare a schema on the topic, which enables us to run SQL against it:
ksql> CREATE STREAM TEST (header STRUCT<schemaVersionNo VARCHAR>,
payload STRUCT<modifiedDate BIGINT,
createdDate BIGINT,
createdBy VARCHAR,
successful BOOLEAN,
source_order_id VARCHAR>)
WITH (KAFKA_TOPIC='TEST',
VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
Tell KSQL to work with all the data in the topic:
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
And now we can select all the data:
ksql> SELECT * FROM TEST;
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325212, CREATEDDATE=1552334325212, CREATEDBY=A, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933214}
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325412, CREATEDDATE=1552334325412, CREATEDBY=B, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933215}
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325612, CREATEDDATE=1552334325612, CREATEDBY=C, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933216}
^CQuery terminated
or we can selectively query it, using the -> notation to access nested fields in the schema:
ksql> SELECT * FROM TEST
WHERE PAYLOAD->CREATEDBY='A';
1552475910106 | null | {SCHEMAVERSIONNO=1} | {MODIFIEDDATE=1552334325212, CREATEDDATE=1552334325212, CREATEDBY=A, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933214}
As well as selecting all records, you can return just the fields of interest:
ksql> SELECT payload FROM TEST
WHERE PAYLOAD->source_order_id='3411976933216';
{MODIFIEDDATE=1552334325612, CREATEDDATE=1552334325612, CREATEDBY=C, SUCCESSFUL=true, SOURCE_ORDER_ID=3411976933216}
With KSQL you can write the results of any SELECT statement to a new topic, which populates it with all existing messages along with every new message on the source topic filtered and processed per the declared SELECT statement:
ksql> CREATE STREAM TEST_CREATED_BY_A AS
SELECT * FROM TEST WHERE PAYLOAD->CREATEDBY='A';
Message
----------------------------
Stream created and running
----------------------------
List the topics on the Kafka cluster:
ksql> SHOW TOPICS;
Kafka Topic | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
----------------------------------------------------------------------------------------------------
orders | true | 1 | 1 | 1 | 1
pageviews | false | 1 | 1 | 0 | 0
products | true | 1 | 1 | 1 | 1
TEST | true | 1 | 1 | 1 | 1
TEST_CREATED_BY_A | true | 4 | 1 | 0 | 0
Print the contents of the new topic:
ksql> PRINT 'TEST_CREATED_BY_A' FROM BEGINNING;
Format:JSON
{"ROWTIME":1552475910106,"ROWKEY":"null","HEADER":{"SCHEMAVERSIONNO":"1"},"PAYLOAD":{"MODIFIEDDATE":1552334325212,"CREATEDDATE":1552334325212,"CREATEDBY":"A","SUCCESSFUL":true,"SOURCE_ORDER_ID":"3411976933214"}}

Unable to view Hive records in Spark SQL, but can view them on Hive CLI

I am unable to view Hive records in Spark SQL, but can view them on Hive CLI.
Where the Hive CLI shows 2 records, the same query using the Hive context shows 0.
In Hive
hive> select ord_id from order;
OK
157434411
157435932
Time taken: 0.389 seconds, Fetched: 2 row(s)
In Spark SQL
hiveCtx.sql("select ord_id from order").show()
+------------+
|ord_id |
+------------+
+------------+
I have attempted to refresh the table, and restarted Hive, but the issue remains.
I have checked the solution from unable-to-view-data-of-hive-tables-after-update-in-spark but nothing seems to work.
Any advice would be gratefully received.
EDIT:
Corrected the name of the columns above.
I am also providing the output from desc:
hive> desc rpt_derived.rpttradeleg;
OK
ord_id string
ord_date timestamp
cust_buy_sell string
ord_cust_entity string
ord_deal_year int
ord_deal_month int
ord_deal_day int
# Partition Information
# col_name data_type comment
ord_year int
ord_month int
ord_day int
Time taken: 2.036 seconds, Fetched: 16 row(s)
hive>
From Spark SQL:
scala> hiveContext.sql("desc rpt_derived.rpttradeleg").show()
+--------------------+---------+-------+
| col_name|data_type|comment|
+--------------------+---------+-------+
| ord_id| string| |
| ord_date|timestamp| |
| cust_buy_sell| string| |
| ord_cust_entity| string| |
| ord_deal_year| int| |
| ord_deal_month| int| |
| ord_deal_day| int| |
+--------------------+---------+-------+
Spark version: spark-1.5.0+cdh5.5.2+114
Hive version: hive-1.1.0+cdh5.5.2+377
To connect to the Hive metastore you need to copy the hive-site.xml file into the spark/conf directory. After that, Spark will be able to connect to the Hive metastore.
So run the following command after logging in as the root user:
cp /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/
Or use hiveCtx.sql("select * from databasename.tablename").show()
I hope it works for you.