Save REST API GET method response as a JSON document - PySpark

I am using the code below to read from a REST API and write the response to a JSON document in PySpark, saving the file to Azure Data Lake Gen2. The code works fine when the response has no blank data, but when I try to get all of the data back I run into the following error.
Error Message: ValueError: Some of types cannot be determined after inferring.
Code:
import requests
response = requests.get('https://apiurl.com/demo/api/v3/data',
auth=('user', 'password'))
data = response.json()
from pyspark.sql import *
df=spark.createDataFrame([Row(**i) for i in data])
df.show()
df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/demo/data")
Response:
[
{
"ProductID": "156528",
"ProductType": "Home Improvement",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
},
{
"ProductID": "126789",
"ProductType": "Pharmacy",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
}
]
I am trying to fix the schema like below:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("ProductID", StringType(), True), StructField("ProductType", StringType(), True), "Description", StringType(), True), StructField("SaleDate", StringType(), True), StructField("UpdateDate", StringType(), True)])
df = spark.createDataFrame([[None, None, None, None, None]], schema=schema)
df.show()
I am not sure how to create the dataframe and write the data to a JSON document.

You can pass the data and schema variables to spark.createDataFrame(), and Spark will create the dataframe. With an explicit schema, Spark no longer has to infer the column types from the values, which is what fails when a column contains only empty or None values.
Example:
from pyspark.sql.functions import *
from pyspark.sql import *
from pyspark.sql.types import *
data=[
{
"ProductID": "156528",
"ProductType": "Home Improvement",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
},
{
"ProductID": "126789",
"ProductType": "Pharmacy",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
}
]
schema = StructType([StructField("ProductID", StringType(), True), StructField("ProductType", StringType(), True), StructField("Description", StringType(), True), StructField("SaleDate", StringType(), True), StructField("UpdateDate", StringType(), True)])
df = spark.createDataFrame(data, schema=schema)
df.show()
#+---------+----------------+-----------+-------------------+--------------------+
#|ProductID| ProductType|Description| SaleDate| UpdateDate|
#+---------+----------------+-----------+-------------------+--------------------+
#| 156528|Home Improvement| |0001-01-01T00:00:00|2015-02-01T16:43:...|
#| 126789| Pharmacy| |0001-01-01T00:00:00|2015-02-01T16:43:...|
#+---------+----------------+-----------+-------------------+--------------------+
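Once the dataframe has an explicit schema, writing it out works the same way as in your original code. A minimal sketch, assuming the same placeholder container and storage account names from the question:
df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/demo/data")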

Related

Queries with streaming sources must be executed with writeStream.start();

I'm new to Spark Structured Streaming, so I cannot understand the problem. I've implemented it with HBaseSinkProvider... Can you please help me with it?
Spark version: 2.4.0,
scala.version: 2.11.8,
com.hortonworks.shc-core: 1.1.1-2.1-s_2.11
spark-sql-kafka-0-10_2.11
rows = spark \
.readStream \
.format('kafka') \
.option('kafka.bootstrap.servers', brokers) \
.option('subscribe', topic) \
.option('group.id', group_id) \
.option('maxOffsetsPerTrigger', 1000) \
.option("startingOffsets", "earliest") \
.load()
rows.printSchema()
schema = StructType([ \
StructField('consumer_id', StringType(), False), \
StructField('audit_system_id', StringType(), False), \
StructField('object_path', StringType(), True), \
StructField('object_type', StringType(), False), \
StructField('what_action', StringType(), False), \
StructField('when', LongType(), False), \
StructField('where', StringType(), False), \
StructField('who', StringType(), True), \
StructField('workstation', StringType(), True) \
])
udf_decode_avro = udf(decode_avro, schema)
values = rows.select('value')
values.printSchema()
changes = values.withColumn('change',
udf_decode_avro(col('value'))).select('change.*')
changes.printSchema()
change_catalog ={"table":{
"namespace": "default",
"name": "changes",
"tableCoder": "PrimitiveType"
},
"rowkey": "consumer_id",
"columns":{
"consumer_id": {"cf": "rowkey", "col": "consumer_id", "type": "string"},
"audit_system_id": {"cf": "d", "col": "audit_system_id", "type": "string"},
"object_path": {"cf": "d", "col": "object_path", "type": "string"},
"object_type": {"cf": "d", "col": "object_type", "type": "string"},
"what_action": {"cf": "d", "col": "what_action", "type": "string"},
"when": {"cf": "t", "col": "when", "type": "bigint"},
"where": {"cf": "d", "col": "where", "type": "string"},
"who": {"cf": "d", "col": "who", "type": "string"},
"workstation": {"cf": "d", "col": "workstation", "type": "string"}}}
query = changes \
    .writeStream \
    .format('HBase.HBaseSinkProvider') \
    .option('hbasecat', change_catalog) \
    .option("checkpointLocation", '/tmp/checkpoint') \
    .outputMode("append") \
    .start().awaitTermination()
ERROR
Caused by: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:389)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)

Getting exception in Writing Spark Structured Streaming to HBase Table

As described in Spark Structured Streaming with HBase integration, I'm interested in writing data to HBase from the structured streaming framework. I've cloned the SHC code from GitHub, extended it with a sink provider, and am trying to write records to HBase. But I've received the error: "Queries with streaming sources must be executed with writeStream.start()". My Python code is below:
JARS
Spark version: 2.4.0, scala.version: 2.11.8, com.hortonworks.shc-core: 1.1.1-2.1-s_2.11, spark-sql-kafka-0-10_2.11
spark = SparkSession \
.builder \
.appName("SparkConsumer") \
.getOrCreate()
print 'read Avro schema from file: {}...'.format(schema_name)
schema = avro.schema.parse(open(schema_name, 'rb').read())
reader = avro.io.DatumReader(schema)
print 'the schema is read'
rows = spark \
.readStream \
.format('kafka') \
.option('kafka.bootstrap.servers', brokers) \
.option('subscribe', topic) \
.option('group.id', group_id) \
.option('maxOffsetsPerTrigger', 1000) \
.option("startingOffsets", "earliest") \
.load()
rows.printSchema()
schema = StructType([ \
StructField('consumer_id', StringType(), False), \
StructField('audit_system_id', StringType(), False), \
StructField('object_path', StringType(), True), \
StructField('object_type', StringType(), False), \
StructField('what_action', StringType(), False), \
StructField('when', LongType(), False), \
StructField('where', StringType(), False), \
StructField('who', StringType(), True), \
StructField('workstation', StringType(), True) \
])
def decode_avro(msg):
    bytes_reader = io.BytesIO(bytes(msg))
    decoder = avro.io.BinaryDecoder(bytes_reader)
    data = reader.read(decoder)
    return (
        data['consumer_id'],
        data['audit_system_id'],
        data['object_path'],
        data['object_type'],
        data['what_action'],
        data['when'],
        data['where'],
        data['who'],
        data['workstation']
    )
udf_decode_avro = udf(decode_avro, schema)
values = rows.select('value')
values.printSchema()
changes = values.withColumn('change', udf_decode_avro(col('value'))).select('change.*')
changes.printSchema()
change_catalog = '''
{
"table":
{
"namespace": "uba_input",
"name": "changes"
},
"rowkey": "consumer_id",
"columns":
{
"consumer_id": {"cf": "rowkey", "col": "consumer_id", "type": "string"},
"audit_system_id": {"cf": "data", "col": "audit_system_id", "type": "string"},
"object_path": {"cf": "data", "col": "object_path", "type": "string"},
"object_type": {"cf": "data", "col": "object_type", "type": "string"},
"what_action": {"cf": "data", "col": "what_action", "type": "string"},
"when": {"cf": "data", "col": "when", "type": "bigint"},
"where": {"cf": "data", "col": "where", "type": "string"},
"who": {"cf": "data", "col": "who", "type": "string"},
"workstation": {"cf": "data", "col": "workstation", "type": "string"}
}
}'''
query = changes \
.writeStream \
.outputMode("append") \
.format('HBase.HBaseSinkProvider')\
.option('hbasecat', change_catalog) \
.option("checkpointLocation", '/tmp/checkpoint') \
.start()
# .format('org.apache.spark.sql.execution.datasources.hbase')\
# query = changes \
# .writeStream \
# .format('console') \
# .start()
query.awaitTermination()
EXCEPTION
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
LogicalRDD [key#146, value#147, topic#148, partition#149, offset#150L, timestamp#151, timestampType#152], true at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
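For what it's worth, a common workaround for this error on Spark 2.4 is to skip the custom streaming sink and instead write each micro-batch with a batch writer inside foreachBatch. The sketch below is only an illustration, not the HBaseSinkProvider approach from the question; it assumes the SHC batch data source (org.apache.spark.sql.execution.datasources.hbase) and its catalog/newtable options are available on the classpath:
def write_batch_to_hbase(batch_df, batch_id):
    # Batch write of one micro-batch via the SHC batch data source;
    # change_catalog is the JSON catalog string defined above.
    batch_df.write \
        .options(catalog=change_catalog, newtable="5") \
        .format('org.apache.spark.sql.execution.datasources.hbase') \
        .save()

query = changes.writeStream \
    .outputMode('append') \
    .option('checkpointLocation', '/tmp/checkpoint') \
    .foreachBatch(write_batch_to_hbase) \
    .start()
query.awaitTermination()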

Spark application with Hive Warehouse Connector saves array and map fields wrongly in Hive table

I am using Spark 2.4 with the Hive Warehouse Connector and Scala 2.11. The current Hive Warehouse Connector provided by Hortonworks is not compatible with Spark 2.4, so I compiled my jar file from https://github.com/abh1sh2k/spark-llap/pull/1/files, which makes it compatible with Spark 2.4.
My Spark application reads from a Kafka input stream and writes to a Hive table (ORC format) using the Hive output stream provided by the Hive Warehouse Connector.
Here's my Spark code (Scala):
package example
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.read.confluent.SchemaManager
import za.co.absa.abris.avro.functions.from_confluent_avro
object NormalizedEventsToHive extends Logging {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("NormalizedEventsToHive")
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
val schema_registry_config = Map(
"schema.registry.url" -> "http://schema-registry:8081",
"value.schema.naming.strategy" -> "topic.name",
"schema.registry.topic" -> "events-v1",
"value.schema.id" -> "latest"
)
val input_stream_df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("startingOffsets", "earliest")
.option("subscribe", "events-v1")
.load()
val data = input_stream_df
.select(from_confluent_avro(col("value"), schema_registry_config) as 'data)
.select("data.*")
val output_stream_df = data.writeStream.format("com.hortonworks.spark.sql.hive.llap.streaming.HiveStreamingDataSource")
.option("database", "default")
.option("table", "events")
.option("checkpointLocation", "file:///checkpoint2")
.option("metastoreUri", "thrift://hive-metastore:9083")
.start()
output_stream_df.awaitTermination()
}
}
The input Kafka messages are AVRO encoded and Confluent Schema Registry is used for schema version control. za.co.absa.abris.avro.functions.from_confluent_avro is used to decode the AVRO encoded Kafka messages.
Here's the AVRO schema:
{
"type": "record",
"name": "events",
"fields": [
{ "name": "id", "type": ["null", "string"], "default": null },
.....
{ "name": "field_map", "type": ["null", { "type": "map", "values": ["null", "string"] }], "default": null },
{ "name": "field_array", "type": ["null", { "type": "array", "items": "string" }], "default": null },
{ "name": "field_array_of_map", "type": ["null", { "type": "array", "items": { "type": "map", "values": ["null", "string"] }}], "default": null }
]
}
The events Hive table (ORC format) is created as:
CREATE TABLE `events`(
`id` string,
......
`field_map` map<string,string>,
`field_array` array<string>,
`field_array_of_map` array<map<string,string>>
)
CLUSTERED BY(id) INTO 9 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
The fields with array<string>, map<string,string> and array<map<string,string>> types are saved wrongly in the Hive table.
When a SELECT query is issued in Beeline, they show:
field_map {"org.apache.spark.sql.catalyst.expressions.UnsafeMapData#101c5674":null}
field_array ["org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#6b5730c2"]
field_array_of_map [{"org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#ca82f1a4":null}]
From https://github.com/hortonworks-spark/spark-llap, it mentions that the Array type is supported, although Map isn't. Any idea how to save Array correctly? Any workaround for the Map type?
Changes found on the HWC GitHub repository pull request worked for me in Structured Streaming.
What I did:
Cloned the #massoudm branch
In the project root directory I ran sbt assembly
I used the newly created HWC jar
My code:
data
.writeStream
.queryName(config("stream.name") + "_query")
.options(hiveConfig)
.option("writer", "json")
.format(HiveWarehouseSession.STREAM_TO_STREAM)
.outputMode("append")
.start()
The most important parts are:
.option("writer", "json")
.format(HiveWarehouseSession.STREAM_TO_STREAM)
This is the link to the pull request.

Pyspark with AWS Glue join 1-N relation into a JSON array

I don't know how to join 1-N relations on AWS Glue and export a JSON file like:
{"id": 123, "name": "John Doe", "profiles": [ {"id": 1111, "channel": "twitter"}, {"id": 2222, "channel": "twitter"}, {"id": 3333, "channel": "instagram"} ]}
{"id": 345, "name": "Test", "profiles": []}
The profiles JSON array should be created from the other tables. I would also like to add the channel column.
The 3 tables that I have on AWS Glue data catalog are:
person_json
{"id": 123,"nanme": "John Doe"}
{"id": 345,"nanme": "Test"}
instagram_json
{"id": 3333, "person_id": 123}
{"id": 3333, "person_id": null}
twitter_json
{"id": 1111, "person_id": 123}
{"id": 2222, "person_id": 123}
This is the script I have so far:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import lit
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
# catalog: database and table names
db_name = "test_database"
tbl_person = "person_json"
tbl_instagram = "instagram_json"
tbl_twitter = "twitter_json"
# Create dynamic frames from the source tables
person = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_person)
instagram = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_instagram)
twitter = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_twitter)
# Join the frames
joined_instagram = Join.apply(person, instagram, 'id', 'person_id').drop_fields(['person_id'])
joined_all = Join.apply(joined_instagram, twitter, 'id', 'person_id').drop_fields(['person_id'])
# Writing output to S3
output_s3_path = "s3://xxx/xxx/person.json"
output = joined_all.toDF().repartition(1)
output.write.mode("overwrite").json(output_s3_path)
How should the script be changed in order to achieve the desired output?
Thanks
from pyspark.sql.functions import collect_set, lit, struct
...
person = person.toDF()
instagram = instagram.toDF().withColumn( 'channel', lit('instagram') )
instagram = instagram.withColumn( 'profile', struct('id', 'channel') )
twitter = twitter.toDF().withColumn( 'channel', lit('twitter') )
twitter = twitter.withColumn( 'profile', struct('id', 'channel') )
profiles = instagram.union(twitter)
profiles = profiles.groupBy('person_id').agg( collect_set('profile').alias('profiles') )
joined_all = person.join(profiles, person.id == profiles.person_id, 'left_outer').drop('channel', 'person_id')
joined_all.show(n=2, truncate=False)
+---+--------+-----------------------------------------------------+
|id |name |profiles |
+---+--------+-----------------------------------------------------+
|123|John Doe|[[1111, twitter], [2222, twitter], [3333, instagram]]|
|345|Test |null |
+---+--------+-----------------------------------------------------+
.show() doesn't display the full structure of the structs in the profiles field; collect() does:
print(joined_all.collect())
[Row(id=123, name='John Doe', profiles=[Row(id=1111, channel='twitter'), Row(id=2222, channel='twitter'), Row(id=3333, channel='instagram')]), Row(id=345, name='Test', profiles=None)]
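To get JSON lines like the ones at the top of the question, the joined frame can be written back to S3 the same way as in the original script (a sketch, reusing the output_s3_path placeholder):
joined_all.repartition(1).write.mode("overwrite").json(output_s3_path)
Note that Spark's JSON writer omits null fields, so the row with no profiles will be written without a profiles key rather than with an empty array.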

Dynamically generating schema from JSON file

I need your help in defining a dynamic schema with fields and datatypes taken from an input metadata JSON file. Below is the JSON file:
[
{
"trim": true,
"name": "id",
"nullable": true,
"id": null,
"position": 0,
"table": "employee",
"type": "integer",
"primaryKey": true
},
{
"trim": true,
"name": "salary",
"nullable": true,
"id": null,
"position": 1,
"table": "employee",
"type": "double",
"primaryKey": false
},
{
"trim": true,
"name": "dob",
"nullable": true,
"id": null,
"position": 2,
"table": "employee",
"type": "date",
"primaryKey": false
}
]
I found a useful link, but there all the fields are mapped to the string data type:
Programmatically generate the schema AND the data for a dataframe in Apache Spark
My requirement is a bit different. I have an input CSV file without any header contents, and therefore all column values are of string datatype. Similarly, I have a JSON metadata file that contains the field names and corresponding data types. I want to define a schema that maps each field name to its corresponding datatype from the JSON onto the input CSV data.
For example, below is the sample code I have written for mapping the column names from the JSON file to the input CSV data. But it doesn't convert or map the columns to the corresponding datatypes.
val in_emp = spark.read.csv(in_source_data)
val in_meta_emp = spark.read.option("multiline","true").json(in_meta_data)
val in_cols = in_meta_emp.select("name","type").map(_.getString(0)).collect
val in_cols_map = in_emp.toDF(in_cols:_*)
in_emp.show
in_cols_map.show
in_cols_map.dtypes
Result (mapped input datatypes):
Array[(String, String)] = Array((id,StringType), (salary,StringType), (dob,StringType))
The code below depicts the static way to define the schema, but I am looking for a dynamic way that picks the columns and corresponding data types from the JSON metadata file.
val schema = StructType (Array(
StructField("id",IntegerType,true ),
StructField("dob",DateType,true ),
StructField("salary",DoubleType,true )
))
val in_emp =
spark.read
.schema(schema)
.option("inferSchema","true")
.option("dateFormat", "yyyy.MM.dd")
.csv(in_source_data)
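The question is written in Scala, but the general idea is the same in either API: read the metadata JSON, map each "type" string to a Spark data type, and build the StructType programmatically. A rough PySpark sketch of that idea, reusing the in_meta_data and in_source_data paths from the question and assuming the metadata only contains the types shown:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType

# Map the "type" strings from the metadata file to Spark data types.
type_map = {"integer": IntegerType(), "double": DoubleType(), "date": DateType(), "string": StringType()}

meta = spark.read.option("multiline", "true").json(in_meta_data)
fields = [
    StructField(r["name"], type_map.get(r["type"], StringType()), r["nullable"])
    for r in sorted(meta.collect(), key=lambda r: r["position"])
]
schema = StructType(fields)

in_emp = (spark.read
    .schema(schema)
    .option("dateFormat", "yyyy.MM.dd")
    .csv(in_source_data))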