Reading runtime values using AvroIO class - apache-beam

I need to read an Avro file in Apache Beam using AvroIO, passing the schema and file path dynamically. Is there any way to pass a ValueProvider, a side input, or anything else to AvroIO.read()?
Below is the code that I'm using:
PCollection<GenericRecord> records = p.apply(
    AvroIO.readGenericRecords(dynamicallyProvidedSchema)
        .from(dynamicallyProvidedFilePath));

AvroIO.read().from() can take a ValueProvider. For a dynamically provided schema, Beam 2.2 (whose release is currently in progress) includes AvroIO.parseGenericRecords(), which lets you avoid specifying a schema altogether; you just have to specify a function from GenericRecord to your custom type.
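A minimal sketch of how this could look once 2.2 is available, assuming a hypothetical custom type MyRecord, a field name "someField", and a pipeline option getInputFilePattern() that returns a ValueProvider<String> (none of these names are from the original post):
PCollection<MyRecord> records = p.apply(
    AvroIO.parseGenericRecords(new SerializableFunction<GenericRecord, MyRecord>() {
          @Override
          public MyRecord apply(GenericRecord record) {
            // Convert the GenericRecord into your own type here; no schema needs to
            // be supplied at pipeline construction time.
            return new MyRecord(record.get("someField").toString());
          }
        })
        // Assuming from() also accepts a ValueProvider<String> here, as it does for
        // AvroIO.read(); otherwise fall back to a static file pattern.
        .from(options.getInputFilePattern()));
If Beam cannot infer a coder for MyRecord, you may also need to set one explicitly on the transform.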

Related

Glue avro schema registry with flink and kafka for any object

I am trying to register and serialize an object with Flink, Kafka, Glue and Avro. I've seen this method, which I'm trying:
Schema schema = parser.parse(new File("path/to/avro/file"));
GlueSchemaRegistryAvroSerializationSchema<GenericRecord> test =
    GlueSchemaRegistryAvroSerializationSchema.forGeneric(schema, topic, configs);
FlinkKafkaProducer<GenericRecord> producer = new FlinkKafkaProducer<GenericRecord>(
    kafkaTopic,
    test,
    properties);
My problem is that this approach only lets me send a GenericRecord, and the object I actually want to send is a different, very large class; so large that converting it to a GenericRecord is too complex. I can't find much documentation. How can I send an object other than a GenericRecord, or is there any way to wrap my object inside a GenericRecord?
I'm not sure if I understand correctly, but GlueSchemaRegistryAvroSerializationSchema has another factory method, forSpecific, that accepts a SpecificRecord. So you can use an Avro code-generation plugin for your build tool, depending on what you use (e.g. there is one for sbt), to generate classes from your Avro schema, which can then be passed to the forSpecific method.
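A rough sketch of what that could look like, assuming MyGeneratedRecord is a class generated from your Avro schema (the name is illustrative) and that forSpecific takes the generated class plus the same topic/configs arguments as forGeneric:
GlueSchemaRegistryAvroSerializationSchema<MyGeneratedRecord> serializationSchema =
    GlueSchemaRegistryAvroSerializationSchema.forSpecific(MyGeneratedRecord.class, topic, configs);
FlinkKafkaProducer<MyGeneratedRecord> producer = new FlinkKafkaProducer<>(
    kafkaTopic,
    serializationSchema,
    properties);
// The records handed to this producer are now instances of MyGeneratedRecord
// rather than GenericRecord, so no manual conversion is needed.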

azure data flow converting boolean to string

I am trying to split large JSON files into smaller chunks using Azure Data Flow. It splits the file, but it changes the boolean column type to string in the output files. The same data flow will be used for different JSON files with different schemas, so I can't have any fixed schema mapping defined and have to use the auto-mapping option. Please suggest how I could solve this issue of automatic data type conversion, or any other approach to splitting the file in Azure Data Factory.
Here, with my dataset, I tried using a JSON file as the source and JSON as the sink. If you have a fixed schema and import it, the data flow works fine and returns a boolean value after running the pipeline. But, as you stated, the "same data flow will be used for different json files with different schemas therefore can't have any fixed schema mapping defined". Hence, you must add a Derived Column to explicitly convert the values back to boolean.
(Screenshots in the original answer showed the imported schema, the sink inspection, and the data preview.)
In your ADF Data Flow Source transformation, click on the Projection Tab and click "Define default format". Set explicit values for Boolean True/False so that ADF can use that hint for proper data type inference for your data.

Kafka Stream Ksql Json

Does Kafka Streams / KSQL actually support JSON natively somehow? What are the other supported formats? I have seen that it is possible to have flat JSON interpreted as a table, and I want to understand that part a bit better: which formats can Kafka Streams, via KSQL, expose for querying with SQL? How is that possible or supported? What is the native support?
KSQL
For value formats, KSQL supports AVRO, JSON and DELIMITED (like CSV).
You can find the documentation here:
https://docs.confluent.io/current/ksql/docs/developer-guide/syntax-reference.html
Kafka Streams
Kafka Streams comes with some primitive/basic SerDes (Serializers / Deserializers) under the org.apache.kafka.common.serialization package.
You can find the documentation here:
https://kafka.apache.org/22/documentation/streams/developer-guide/datatypes.html
Confluent also provides schema-registry compatible Avro SerDes for data in generic Avro and in specific Avro format. You can find the documentation here:
https://docs.confluent.io/current/streams/developer-guide/datatypes.html#avro
You can also use basic SerDe implementation for JSON that comes with the examples:
https://github.com/apache/kafka/blob/2.2/streams/examples/src/main/java/org/apache/kafka/streams/examples/pageview/PageViewTypedDemo.java#L83
As a last resort, you can always create your own custom SerDes. For that, you must:
Write a serializer for your data type T by implementing org.apache.kafka.common.serialization.Serializer.
Write a deserializer for T by implementing org.apache.kafka.common.serialization.Deserializer.
Write a serde for T by implementing org.apache.kafka.common.serialization.Serde, which you do either manually (see the existing SerDes in the previous section) or by leveraging helper functions in Serdes such as Serdes.serdeFrom(Serializer<T>, Deserializer<T>). Note that you will need to implement your own class (one with no generic types) if you want to use your custom serde in the configuration provided to KafkaStreams. If your serde class has generic types or you use Serdes.serdeFrom(Serializer<T>, Deserializer<T>), you can pass your serde only via method calls (for example builder.stream("topicName", Consumed.with(...))).

How to find out Avro schema from binary data that comes in via Spark Streaming?

I set up a Spark Streaming pipeline that receives measurement data via Kafka. This data was serialized using Avro. The data can be of two types - EquidistantData and DiscreteData. I created these using an avdl file and the sbt-avrohugger plugin, with the variant that generates Scala case classes inheriting from SpecificRecord.
In my receiving application, I can get the two schemas by querying EquidistantData.SCHEMA$ and DiscreteData.SCHEMA$.
Now, my Kafka stream gives me RDDs whose value class is Array[Byte]. So far so good.
How can I find out from the byte array which schema was used when serializing it, i.e., whether to use EquidistantData.SCHEMA$ or DiscreteData.SCHEMA$?
I thought of sending appropriate info in the message key (currently, I don't use the message key). Would this be a feasible approach, or can I somehow get the schema from the serialized byte array I received?
Followup:
Another possibility would be to use separate topics for discrete and equidistant data. Would this be feasible?
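As a hedged sketch of the message-key idea from the question, the receiving side could pick the schema based on a marker the producer puts into the key (the key values and the helper class are purely illustrative; the two Schema fields would be filled with EquidistantData.SCHEMA$ and DiscreteData.SCHEMA$):
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

class MeasurementDecoder {
    private final Schema equidistantSchema; // e.g. EquidistantData.SCHEMA$
    private final Schema discreteSchema;    // e.g. DiscreteData.SCHEMA$

    MeasurementDecoder(Schema equidistantSchema, Schema discreteSchema) {
        this.equidistantSchema = equidistantSchema;
        this.discreteSchema = discreteSchema;
    }

    // key: the Kafka message key set by the producer (e.g. "equidistant" or "discrete")
    // value: the Avro-serialized byte array taken from the stream
    GenericRecord decode(String key, byte[] value) throws IOException {
        Schema schema = "equidistant".equals(key) ? equidistantSchema : discreteSchema;
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(value, null);
        return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    }
}
Separate topics for discrete and equidistant data (the follow-up idea) would avoid the key convention entirely, at the cost of managing two topics.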

Spark: what options can be passed with DataFrame.saveAsTable or DataFrameWriter.options?

Neither the developer documentation nor the API documentation includes any reference to which options can be passed in DataFrame.saveAsTable or DataFrameWriter.options, or how they affect the saving of a Hive table.
My hope is that in the answers to this question we can aggregate information that would be helpful to Spark developers who want more control over how Spark saves tables and, perhaps, provide a foundation for improving Spark's documentation.
The reason you don't see options documented anywhere is that they are format-specific and developers can keep creating custom write formats with a new set of options.
However, for a few of the supported formats I have listed the option classes as they appear in the Spark code itself (a short usage sketch follows the list):
CSVOptions
JDBCOptions
JSONOptions
ParquetOptions
TextOptions
OrcOptions
AvroOptions
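As a minimal sketch of how such format-specific options are passed through the writer (in Java; the table name and option values are made up), here are a couple of the CSV options:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

class CsvWriteSketch {
    // df is whatever DataFrame you want to persist as a table.
    static void saveAsCsvTable(Dataset<Row> df) {
        df.write()
          .format("csv")                      // the options below are handled by CSVOptions
          .option("header", "true")           // write a header row
          .option("compression", "gzip")      // compression codec for the output files
          .mode(SaveMode.Overwrite)
          .saveAsTable("my_db.my_csv_table"); // illustrative table name
    }
}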
Take a look at the DeltaOptions class at https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/DeltaOptions.scala.
Currently, the supported options are (a short usage sketch follows the list):
replaceWhere
mergeSchema
overwriteSchema
maxFilesPerTrigger
excludeRegex
ignoreFileDeletion
ignoreChanges
ignoreDeletes
optimizeWrite
dataChange
queryName
checkpointLocation
path
timestampAsOf
versionAsOf
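As a hedged sketch of how a couple of those options are used when writing with the Delta format (the paths and the predicate are illustrative):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

class DeltaWriteSketch {
    // Overwrite only the rows matching a predicate, via the replaceWhere option.
    static void overwriteRecentData(Dataset<Row> df) {
        df.write()
          .format("delta")
          .mode(SaveMode.Overwrite)
          .option("replaceWhere", "date >= '2021-01-01'") // illustrative predicate
          .save("/delta/events");                         // illustrative path
    }

    // Append while letting new columns evolve the table schema, via mergeSchema.
    static void appendWithSchemaEvolution(Dataset<Row> df) {
        df.write()
          .format("delta")
          .mode(SaveMode.Append)
          .option("mergeSchema", "true")
          .save("/delta/events");
    }
}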
According to the source code, you can specify the path option (it indicates where to store the Hive external table data in HDFS, and is translated to LOCATION in the Hive DDL).
I'm not sure whether there are other options associated with saveAsTable, but I'll keep searching for more.
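A small sketch of that path option (the database, table and location names are made up); with it, saveAsTable registers an external table whose data lives at the given location:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

class ExternalTableSketch {
    static void saveAsExternalTable(Dataset<Row> df) {
        df.write()
          .format("parquet")
          .mode(SaveMode.Overwrite)
          .option("path", "hdfs:///warehouse/external/my_table") // becomes LOCATION in the Hive DDL
          .saveAsTable("my_db.my_table");
    }
}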
As per the latest Spark documentation, the following options can be passed when writing a DataFrame to external storage using the .saveAsTable(name, format=None, mode=None, partitionBy=None, **options) API.
If you click the source hyperlink on the right-hand side of the documentation, you can traverse the code and find details of the other, less clearly described arguments, e.g. format and options, which are described under the DataFrameWriter class.
So when the documentation says "options – all other string options", it is referring to options, which as of Spark 2.4.4 gives you the following option:
timeZone: sets the string that indicates a timezone to be used to format timestamps in the JSON/CSV datasources or partition values. If it isn't set, it uses the default value, the session-local timezone.
And when it says "format – the format used to save", it is referring to format(source), which specifies the underlying output data source; its source parameter is a string naming the data source, e.g. 'json' or 'parquet'.
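For example, a hedged sketch of passing that timeZone option through the writer for a JSON datasource (the output path and zone are illustrative):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

class TimeZoneOptionSketch {
    static void writeJson(Dataset<Row> df) {
        df.write()
          .format("json")
          .option("timeZone", "UTC") // used when formatting timestamps in the JSON output
          .save("/tmp/events_json"); // illustrative path
    }
}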
Hope this was helpful.
The difference is between versions. In Spark 2 we have the following:
createOrReplaceTempView()
createTempView()
createOrReplaceGlobalTempView()
createGlobalTempView()
The old DataFrame.saveAsTable method is deprecated in Spark 2; use DataFrameWriter.saveAsTable via df.write() instead.
Basically these are distinguished by the scope in which the resulting table or view is available.
Please refer to the link
saveAsTable(String tableName)
Saves the content of the DataFrame as the specified table.
FYI -> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html
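To illustrate the difference in availability (a hedged sketch; the view and table names are made up): a temp view lives only in the current SparkSession, while saveAsTable persists the data as a metastore table:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

class ViewVsTableSketch {
    static void demo(SparkSession spark, Dataset<Row> df) {
        // Session-scoped: visible only to this SparkSession and gone when it stops.
        df.createOrReplaceTempView("events_tmp");
        spark.sql("SELECT COUNT(*) FROM events_tmp").show();

        // Persisted: written out and registered in the metastore, visible to other sessions.
        df.write().mode(SaveMode.Overwrite).saveAsTable("my_db.events");
    }
}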