Spark 2.4.0 on Java 1.8.0_161 (Scala 2.11.12)
Run command: spark-shell --jars=spark-avro_2.11-2.4.0.jar
I'm working on a proof of concept with small Avro files. I want to read in a single Avro file, make a change, then write it back out.
Reading is fine:
val myAv = spark.read.format("avro").load("myAvFile.avro")
However, I am getting this error when trying to write back out (even before making any changes):
scala> myAv.write.format("avro").save("./output-av-file.avro")
org.apache.spark.sql.AnalysisException:
Datasource does not support writing empty or nested empty schemas.
Please make sure the data schema has at least one or more column(s).
;
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$validateSchema(DataSource.scala:733)
at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:523)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:281)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
... 49 elided
I've tried specifying the schema of the dataframe manually, but to no avail:
.write.option("avroSchema", c_schema.toString).format("avro") ...
The reason is straightforward: the schema is arriving empty. This is the check in Spark's DataSource code that throws the exception:
if (hasEmptySchema(schema)) {
  throw new AnalysisException(
    s"""
       |Datasource does not support writing empty or nested empty schemas.
       |Please make sure the data schema has at least one or more column(s).
     """.stripMargin)
}
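To confirm that diagnosis on the DataFrame from the question, one could inspect its schema before writing. The following is a minimal sketch, not Spark's own code: `SchemaCheck` and `isEmptyOrNestedEmpty` are hypothetical names, and the logic only approximates the condition the error message describes (a struct with no fields, either at the top level or nested directly inside another struct).

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper, written for illustration only.
object SchemaCheck {
  // True when the schema has no fields, or when any directly nested
  // struct field is itself empty (checked recursively).
  def isEmptyOrNestedEmpty(schema: StructType): Boolean =
    schema.fields.isEmpty || schema.fields.exists {
      case StructField(_, nested: StructType, _, _) => isEmptyOrNestedEmpty(nested)
      case _                                        => false
    }
}
```

In spark-shell, `SchemaCheck.isEmptyOrNestedEmpty(myAv.schema)` returning `true` would be consistent with the exception above, and `myAv.printSchema()` shows which columns Spark actually picked up from the file.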
I am trying to register and serialize an object with Flink, Kafka, Glue and Avro. I found this approach and am trying it:
Schema schema = parser.parse(new File("path/to/avro/file"));
GlueSchemaRegistryAvroSerializationSchema<GenericRecord> test =
    GlueSchemaRegistryAvroSerializationSchema.forGeneric(schema, topic, configs);
FlinkKafkaProducer<GenericRecord> producer = new FlinkKafkaProducer<GenericRecord>(
    kafkaTopic,
    test,
    properties);
My problem is that this setup only accepts GenericRecord. The object I want to send is of a different type and is very big, so big that converting it to a GenericRecord is too complex.
I can't find much documentation. How can I send an object that is not a GenericRecord, or is there any way to include my object inside a GenericRecord?
I'm not sure I understand correctly, but GlueSchemaRegistryAvroSerializationSchema has another method called forSpecific that accepts a SpecificRecord. So you can use an Avro code generation plugin for whichever build tool you use (e.g. for sbt) to generate classes from your Avro schema, and those classes can then be passed to the forSpecific method.
Use Case and Description
My use case is described more here, but the gist of the issue is:
I am making a custom SMT and want to make sure the Elasticsearch sink connector deserializes incoming records properly, but after that I don't need any sort of schema at all. Each record has a dynamic number of fields set, so I don't want to have any makeUpdatedSchema step (e.g., this code) at all. This keeps the code simpler, and I would assume it also improves performance, since I don't have to recreate schemas for each record.
What I tried
I tried doing something like the applySchemaless code (as shown here), even when the record has a schema, by returning a new record with null for the schema:
return newRecord(record, null, updatedValue);
However, at runtime it errors out, saying I have an incompatible schema.
Key Question
I might be misunderstanding the role of the schema at this point in the process (is it needed at all once we're in the Elasticsearch sink connector?) or how it works, and if so that would be helpful to know as well. But is there some way to write a custom SMT like this?
I am implementing a Spark Structured Streaming application that processes web server log files from a folder on disk, or perhaps from S3.
Spark Structured Streaming fits the use case almost perfectly, with one wrinkle.
The file names in the folder also contain the machine name, for example:
/node1_20181101.json.gz
/node1_20181102.json.gz
/node2_20181101.json.gz
/node3_20181102.json.gz
/node4_20181102.json.gz
...and so on.
A (simplified) version of the source looks something like this (I would turn the code below into a continuous stream with windowing etc.):
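import org.apache.hadoop.io.compress.GzipCodec

val inputDF = spark.read
  .option("codec", classOf[GzipCodec].getName)
  .option("maxFilesPerTrigger", 1.toString)
  .json(config.directory)
  .transform { ds =>
    // inputFiles is an Array[String], so join it before logging
    logger.info(ds.inputFiles.mkString(", "))
    ds
  }
  .foreach(println(_))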
I would like to transform the batch and add the node ID from the file name to each record line. I can't seem to find any kind of onBatch trigger that I could use to enrich the record schema with the node ID from the file name.
I have looked at the following and nothing seems to fit:
[FileStreamSource](https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html#metadataLog)
Unfortunately, getting a handle on the machine name from the file name is key to the analytics I do later, and I have no control over how the logs are populated.
Any clues?
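One approach that may fit: Spark SQL's `input_file_name()` function returns the path of the file each row was read from, so the node ID can be derived per row rather than per batch. Below is a minimal sketch, assuming the file names always look like `<node>_<yyyymmdd>.json.gz`. `NodeIdEnricher`, `withNodeId`, `NodePattern` and the `node_id` column name are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

// Hypothetical helper, written for illustration only.
object NodeIdEnricher {
  // Captures the file-name part before "_<8 digits>.json.gz".
  // Assumption: every input file follows this naming scheme.
  val NodePattern: String = "([^/]+)_\\d{8}\\.json\\.gz$"

  // Adds a node_id column derived from the path of each row's source file.
  // Rows whose path does not match the pattern get an empty string.
  def withNodeId(df: DataFrame): DataFrame =
    df.withColumn("node_id", regexp_extract(input_file_name(), NodePattern, 1))
}
```

With the batch read from the question this would be called as `NodeIdEnricher.withNodeId(spark.read.json(config.directory))`. The same column expression should also be usable on a DataFrame created with `readStream`, though that is worth verifying on the Spark version in use.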
SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/tmp/spark")
  .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
  .appName("my-test")
  .getOrCreate
  .readStream
  .schema(schema)
  .json("src/test/data")
  .cache
  .writeStream
  .start
  .awaitTermination
Executing this sample in Spark 2.1.0 gave me an error.
Without the .cache call it worked as intended, but with .cache I got:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[src/test/data]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:58)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:102)
at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65)
at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89)
at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479)
at org.apache.spark.sql.Dataset.cache(Dataset.scala:2489)
at org.me.App$.main(App.scala:23)
at org.me.App.main(App.scala)
Any idea?
Your (very interesting) case boils down to the following line (that you can execute in spark-shell):
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.readStream.text("files").cache
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[files]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
... 48 elided
The reason for this turned out to be quite simple to explain (no pun on Spark SQL's explain intended).
spark.readStream.text("files") creates a so-called streaming Dataset.
scala> val files = spark.readStream.text("files")
files: org.apache.spark.sql.DataFrame = [value: string]
scala> files.isStreaming
res2: Boolean = true
Streaming Datasets are the foundation of Spark SQL's Structured Streaming.
As you may have read in Structured Streaming's Quick Example:
And then start the streaming computation using start().
Quoting the scaladoc of DataStreamWriter's start:
start(): StreamingQuery Starts the execution of the streaming query, which will continually output results to the given path as new data arrives.
So, you have to use start (or foreach) to start the execution of the streaming query. You knew it already.
But...there are Unsupported Operations in Structured Streaming:
In addition, there are some Dataset methods that will not work on streaming Datasets. They are actions that will immediately run queries and return results, which does not make sense on a streaming Dataset.
If you try any of these operations, you will see an AnalysisException like "operation XYZ is not supported with streaming DataFrames/Datasets".
That looks familiar, doesn't it?
cache is not in the list of the unsupported operations, but that's because it has simply been overlooked (I reported SPARK-20927 to fix it).
cache should have been in the list as it does execute a query before the query gets registered in Spark SQL's CacheManager.
Let's go deeper into the depths of Spark SQL...hold your breath...
cache is simply persist, and persist asks the current CacheManager to cache the query:
sparkSession.sharedState.cacheManager.cacheQuery(this)
While caching a query CacheManager does execute it:
sparkSession.sessionState.executePlan(planToCache).executedPlan
which we know is not allowed, since only start (or foreach) may execute a streaming query.
Problem solved!
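Until cache is rejected with a clearer message, one way to avoid the exception is to check isStreaming (shown above) before caching. This is a minimal sketch rather than anything Spark provides: `CacheGuard` and `cacheIfBatch` are hypothetical names, and simply skipping the cache for streaming Datasets is an assumption about what the caller wants.

```scala
import org.apache.spark.sql.Dataset

// Hypothetical helper, written for illustration only.
object CacheGuard {
  // Caches batch Datasets; returns streaming Datasets untouched, because
  // caching would execute the query outside of writeStream.start().
  def cacheIfBatch[T](ds: Dataset[T]): Dataset[T] =
    if (ds.isStreaming) ds else ds.cache()
}
```

With this guard, `CacheGuard.cacheIfBatch(spark.readStream.text("files"))` hands back the streaming Dataset unchanged instead of throwing the AnalysisException.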
I have a stream of data coming from Spark Streaming that I need to process and finally store in Cassandra. Earlier I tried the Spark Cassandra connector, but it doesn't give access to the Spark Streaming context object on the workers, so I have to use a separate Scala driver for Cassandra, and I ended up with phantom. I have already defined the column family in Cassandra. How do I run select and update queries from Scala?
I have followed this documentation (link1), but I don't understand why we need to give the table definition on the client (Scala code) side. Why can't we just give the keyspace, ClusterPoints and column family and be done with it?
object CustomConnector {
  val hosts = Seq("IP1", "IP2")
  val Connector = ContactPoints(hosts).keySpace("KEYSPACE_NAME")
}
realTimeAgg.foreachRDD { x =>
  if (x.toLocalIterator.nonEmpty) {
    x.foreachPartition { partition =>
      // How to achieve select/insert in the Cassandra table here using phantom?
    }
  }
}
This is not yet possible using phantom. We are actively working on phantom-spark to allow you to do this, but it is still a few months away.
In the interim, you will have to rely on the Spark Cassandra connector and use its non-type-safe API to achieve this. It's a less fortunate setup, but this will be resolved in the very near future.