Getting exception when writing Spark Structured Streaming to an HBase table - scala

As described in Spark Structured Streaming with HBase integration, I'm interested in writing data to HBase from the Structured Streaming framework. I've cloned the SHC code from GitHub, extended it with a sink provider, and tried to write records to HBase. But I received the error "Queries with streaming sources must be executed with writeStream.start()". My Python code is below:
JARS
Spark version: 2.4.0
scala.version: 2.11.8
com.hortonworks.shc-core: 1.1.1-2.1-s_2.11
spark-sql-kafka-0-10_2.11
spark = SparkSession \
    .builder \
    .appName("SparkConsumer") \
    .getOrCreate()
print 'read Avro schema from file: {}...'.format(schema_name)
schema = avro.schema.parse(open(schema_name, 'rb').read())
reader = avro.io.DatumReader(schema)
print 'the schema is read'
rows = spark \
    .readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', brokers) \
    .option('subscribe', topic) \
    .option('group.id', group_id) \
    .option('maxOffsetsPerTrigger', 1000) \
    .option("startingOffsets", "earliest") \
    .load()
rows.printSchema()
schema = StructType([
    StructField('consumer_id', StringType(), False),
    StructField('audit_system_id', StringType(), False),
    StructField('object_path', StringType(), True),
    StructField('object_type', StringType(), False),
    StructField('what_action', StringType(), False),
    StructField('when', LongType(), False),
    StructField('where', StringType(), False),
    StructField('who', StringType(), True),
    StructField('workstation', StringType(), True)
])
def decode_avro(msg):
    bytes_reader = io.BytesIO(bytes(msg))
    decoder = avro.io.BinaryDecoder(bytes_reader)
    data = reader.read(decoder)
    return (
        data['consumer_id'],
        data['audit_system_id'],
        data['object_path'],
        data['object_type'],
        data['what_action'],
        data['when'],
        data['where'],
        data['who'],
        data['workstation']
    )
udf_decode_avro = udf(decode_avro, schema)
values = rows.select('value')
values.printSchema()
changes = values.withColumn('change', udf_decode_avro(col('value'))).select('change.*')
changes.printSchema()
change_catalog = '''
{
"table":
{
"namespace": "uba_input",
"name": "changes"
},
"rowkey": "consumer_id",
"columns":
{
"consumer_id": {"cf": "rowkey", "col": "consumer_id", "type": "string"},
"audit_system_id": {"cf": "data", "col": "audit_system_id", "type": "string"},
"object_path": {"cf": "data", "col": "object_path", "type": "string"},
"object_type": {"cf": "data", "col": "object_type", "type": "string"},
"what_action": {"cf": "data", "col": "what_action", "type": "string"},
"when": {"cf": "data", "col": "when", "type": "bigint"},
"where": {"cf": "data", "col": "where", "type": "string"},
"who": {"cf": "data", "col": "who", "type": "string"},
"workstation": {"cf": "data", "col": "workstation", "type": "string"}
}
}'''
query = changes \
    .writeStream \
    .outputMode("append") \
    .format('HBase.HBaseSinkProvider') \
    .option('hbasecat', change_catalog) \
    .option("checkpointLocation", '/tmp/checkpoint') \
    .start()
# .format('org.apache.spark.sql.execution.datasources.hbase') \
# query = changes \
#     .writeStream \
#     .format('console') \
#     .start()
query.awaitTermination()
EXCEPTION
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
LogicalRDD [key#146, value#147, topic#148, partition#149, offset#150L, timestamp#151, timestampType#152], true
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)

Related

Queries with streaming sources must be executed with writeStream.start();

I'm new to Spark Structured Streaming, so I cannot understand the problem. I've implemented it with HBaseSinkProvider... Can you please help me with it?
Spark version: 2.4.0,
scala.version: 2.11.8,
com.hortonworks.shc-core: 1.1.1-2.1-s_2.11
spark-sql-kafka-0-10_2.11
rows = spark \
    .readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', brokers) \
    .option('subscribe', topic) \
    .option('group.id', group_id) \
    .option('maxOffsetsPerTrigger', 1000) \
    .option("startingOffsets", "earliest") \
    .load()
rows.printSchema()
schema = StructType([
    StructField('consumer_id', StringType(), False),
    StructField('audit_system_id', StringType(), False),
    StructField('object_path', StringType(), True),
    StructField('object_type', StringType(), False),
    StructField('what_action', StringType(), False),
    StructField('when', LongType(), False),
    StructField('where', StringType(), False),
    StructField('who', StringType(), True),
    StructField('workstation', StringType(), True)
])
udf_decode_avro = udf(decode_avro, schema)
values = rows.select('value')
values.printSchema()
changes = values.withColumn('change',
udf_decode_avro(col('value'))).select('change.*')
changes.printSchema()
change_catalog ={"table":{
"namespace": "default",
"name": "changes",
"tableCoder": "PrimitiveType"
},
"rowkey": "consumer_id",
"columns":{
"consumer_id": {"cf": "rowkey", "col": "consumer_id", "type": "string"},
"audit_system_id": {"cf": "d", "col": "audit_system_id", "type": "string"},
"object_path": {"cf": "d", "col": "object_path", "type": "string"},
"object_type": {"cf": "d", "col": "object_type", "type": "string"},
"what_action": {"cf": "d", "col": "what_action", "type": "string"},
"when": {"cf": "t", "col": "when", "type": "bigint"},
"where": {"cf": "d", "col": "where", "type": "string"},
"who": {"cf": "d", "col": "who", "type": "string"},
"workstation": {"cf": "d", "col": "workstation", "type": "string"}}}
query = changes \
    .writeStream \
    .format('HBase.HBaseSinkProvider') \
    .option('hbasecat', change_catalog) \
    .option("checkpointLocation", '/tmp/checkpoint') \
    .outputMode("append") \
    .start().awaitTermination()
ERROR
Caused by: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:389)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)

Kafka connect JDBC error despite giving schema and payload

I am getting the following error when I run the Kafka JDBC sink connector to PostgreSQL:
JsonConverter with schemas.enable requires "schema" and "payload"
fields and may not contain additional fields. If you are trying to
deserialize plain JSON data, set schemas.enable=false in your
converter configuration.
However, my topic contains the following message structure, with a schema added just as it's presented online:
rowtime: 2022/02/04 12:45:48.520 Z, key: , value:
{
  "schema": {
    "type": "struct",
    "fields": [
      {"type": "int", "field": "ID", "optional": false},
      {"type": "date", "field": "Date", "optional": false},
      {"type": "varchar", "field": "ICD", "optional": false},
      {"type": "int", "field": "CPT", "optional": false},
      {"type": "double", "field": "Cost", "optional": false}
    ],
    "optional": false,
    "name": "test"
  },
  "payload": {
    "ID": "24427934",
    "Date": "2019-05-22",
    "ICD": "883.436",
    "CPT": "60502",
    "cost": "1374.36"
  }
}
partition: 0
My configuration for the connector is:
curl -X PUT http://localhost:8083/connectors/claim_test/config \
-H "Content-Type: application/json" \
-d '{
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url":"jdbc:postgresql://localhost:5432/ae2772",
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable":"true",
"topics":"test_7",
"auto.create":"true",
"insert.mode":"insert"
}'
After some changes, I now get the following message:
WorkerSinkTask{id=claim_test} Error converting message value in topic 'test_9' partition 0 at offset 0 and timestamp 1644005137197: Unknown schema type: int
int is not a valid schema type. It should be int8, int16, int32, or int64.
Similarly, date and varchar are not valid either.
The types used in the JSON schema are Kafka Connect types, not Postgres or any other SQL-specific types (a date should be converted to a Unix epoch int64 value or be made a string).
You can find the supported schema types here: https://github.com/apache/kafka/blob/trunk/connect/api/src/main/java/org/apache/kafka/connect/data/Schema.java
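For illustration, a version of the message that the converter should accept could look like the sketch below. The field names are taken from the question; mapping Date to a plain string and keeping Cost as double are assumptions, and note that the payload key has to match the declared field name (so "cost" becomes "Cost"):
{
  "schema": {
    "type": "struct",
    "name": "test",
    "optional": false,
    "fields": [
      {"type": "int32", "field": "ID", "optional": false},
      {"type": "string", "field": "Date", "optional": false},
      {"type": "string", "field": "ICD", "optional": false},
      {"type": "int32", "field": "CPT", "optional": false},
      {"type": "double", "field": "Cost", "optional": false}
    ]
  },
  "payload": {
    "ID": 24427934,
    "Date": "2019-05-22",
    "ICD": "883.436",
    "CPT": 60502,
    "Cost": 1374.36
  }
}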

Kafka connect path format not working properly

I create a connector with the script below, but in S3 I see a partition format of /year=2015/month=12/day=07/hour=15/. Is there a way to get a 'dt'=YYYY-MM-dd/'hour'=HH/ format?
curl -X POST \
-H "Content-Type: application/json" \
--data '{
"name": "content.logging.test",
"config": {
"topics": "content.logging",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"s3.region": "ap-northeast-1",
"s3.bucket.name": "kafka-connect-test",
"locale": "en-US",
"timezone": "UTC",
"tasks.max": 1,
"flush.size": 10,
"partitioner.class": "io.confluent.connect.storage.partitioner.HourlyPartitioner",
"partition.duration.ms": 3600000,
"path.format": "'dt'=YYYY-MM-dd/'hour'=HH/"
}
}' http://$CONNECT_REST_ADVERTISED_HOST_NAME:8083/connectors
You should use the TimeBasedPartitioner if you want to use a custom path.format:
https://docs.confluent.io/kafka-connect-s3-sink/current/index.html#partitioning-records-into-s3-objects
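For example, a minimal sketch of the partitioner-related keys (the rest of the connector config stays as in the question; the timestamp.extractor value here is an assumption, and Wallclock or RecordField may fit your data better):
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'dt'=YYYY-MM-dd/'hour'=HH",
"partition.duration.ms": 3600000,
"locale": "en-US",
"timezone": "UTC",
"timestamp.extractor": "Record"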

save rest api get method response as a json document

I am using the code below to read from a REST API, write the response to a JSON document in PySpark, and save the file to Azure Data Lake Gen2. The code works fine when the response has no blank data, but when I try to get all the data back, I run into the following error.
Error Message: ValueError: Some of types cannot be determined after inferring.
Code:
import requests
response = requests.get('https://apiurl.com/demo/api/v3/data',
auth=('user', 'password'))
data = response.json()
from pyspark.sql import *
df=spark.createDataFrame([Row(**i) for i in data])
df.show()
df.write.mode("overwrite").json("wasbs://<file_system>#<storage-account-name>.blob.core.windows.net/demo/data")
Response:
[
{
"ProductID": "156528",
"ProductType": "Home Improvement",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
},
{
"ProductID": "126789",
"ProductType": "Pharmacy",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
}
]
I tried to fix the schema like below.
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("ProductID", StringType(), True), StructField("ProductType", StringType(), True), "Description", StringType(), True), StructField("SaleDate", StringType(), True), StructField("UpdateDate", StringType(), True)])
df = spark.createDataFrame([[None, None, None, None, None]], schema=schema)
df.show()
I am not sure how to create the dataframe and write the data to a JSON document.
You can pass the data and schema variables to spark.createDataFrame(), and Spark will create the dataframe.
Example:
from pyspark.sql.functions import *
from pyspark.sql import *
from pyspark.sql.types import *
data=[
{
"ProductID": "156528",
"ProductType": "Home Improvement",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
},
{
"ProductID": "126789",
"ProductType": "Pharmacy",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
}
]
schema = StructType([StructField("ProductID", StringType(), True), StructField("ProductType", StringType(), True), StructField("Description", StringType(), True), StructField("SaleDate", StringType(), True), StructField("UpdateDate", StringType(), True)])
df = spark.createDataFrame(data, schema=schema)
df.show()
#+---------+----------------+-----------+-------------------+--------------------+
#|ProductID| ProductType|Description| SaleDate| UpdateDate|
#+---------+----------------+-----------+-------------------+--------------------+
#| 156528|Home Improvement| |0001-01-01T00:00:00|2015-02-01T16:43:...|
#| 126789| Pharmacy| |0001-01-01T00:00:00|2015-02-01T16:43:...|
#+---------+----------------+-----------+-------------------+--------------------+

Spark application with Hive Warehouse Connector saves array and map fields wrongly in Hive table

I am using Spark 2.4 with the Hive Warehouse Connector and Scala 2.11. The current Hive Warehouse Connector provided by Hortonworks is not compatible with Spark 2.4, so I compiled my jar from https://github.com/abh1sh2k/spark-llap/pull/1/files, which makes it compatible with Spark 2.4.
My Spark application reads from a Kafka input stream and writes to a Hive table (ORC format) using the Hive output stream provided by the Hive Warehouse Connector.
Here's my Spark code (Scala):
package example
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.read.confluent.SchemaManager
import za.co.absa.abris.avro.functions.from_confluent_avro
object NormalizedEventsToHive extends Logging {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("NormalizedEventsToHive")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val schema_registry_config = Map(
      "schema.registry.url" -> "http://schema-registry:8081",
      "value.schema.naming.strategy" -> "topic.name",
      "schema.registry.topic" -> "events-v1",
      "value.schema.id" -> "latest"
    )

    val input_stream_df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("startingOffsets", "earliest")
      .option("subscribe", "events-v1")
      .load()

    val data = input_stream_df
      .select(from_confluent_avro(col("value"), schema_registry_config) as 'data)
      .select("data.*")

    val output_stream_df = data.writeStream
      .format("com.hortonworks.spark.sql.hive.llap.streaming.HiveStreamingDataSource")
      .option("database", "default")
      .option("table", "events")
      .option("checkpointLocation", "file:///checkpoint2")
      .option("metastoreUri", "thrift://hive-metastore:9083")
      .start()

    output_stream_df.awaitTermination()
  }
}
The input Kafka messages are Avro-encoded, and the Confluent Schema Registry is used for schema version control. za.co.absa.abris.avro.functions.from_confluent_avro is used to decode the Avro-encoded Kafka messages.
Here's the AVRO schema:
{
"type": "record",
"name": "events",
"fields": [
{ "name": "id", "type": ["null", "string"], "default": null },
.....
{ "name": "field_map", "type": ["null", { "type": "map", "values": ["null", "string"] }], "default": null },
{ "name": "field_array", "type": ["null", { "type": "array", "items": "string" }], "default": null },
{ "name": "field_array_of_map", "type": ["null", { "type": "array", "items": { "type": "map", "values": ["null", "string"] }}], "default": null }
]
}
The events Hive table (ORC format) is created as:
CREATE TABLE `events`(
`id` string,
......
`field_map` map<string,string>,
`field_array` array<string>,
`field_array_of_map` array<map<string,string>>
)
CLUSTERED BY(id) INTO 9 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
The fields with array<string>, map<string,string>, and array<map<string,string>> types are saved wrongly in the Hive table.
When a SELECT query is issued in Beeline, they show:
field_map {"org.apache.spark.sql.catalyst.expressions.UnsafeMapData#101c5674":null}
field_array ["org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#6b5730c2"]
field_array_of_map [{"org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#ca82f1a4":null}]
https://github.com/hortonworks-spark/spark-llap mentions that the Array type is supported, although Map isn't. Any idea how to save Array correctly? Is there any workaround for the Map type?
The changes found in the HWC GitHub repository pull request worked for me in Structured Streaming.
What I did:
Cloned the #massoudm branch
Ran sbt assembly in the project root directory
Used the newly created HWC jar
My code:
data
.writeStream
.queryName(config("stream.name") + "_query")
.options(hiveConfig)
.option("writer", "json")
.format(HiveWarehouseSession.STREAM_TO_STREAM)
.outputMode("append")
.start()
Most important are:
.option("writer", "json")
.format(HiveWarehouseSession.STREAM_TO_STREAM)
This is the link to the pull request.