Avro support in Flink - Scala

How do I read Avro in Flink with Scala?
Is it the same for batch/stream/table: StreamExecutionEnvironment / ExecutionEnvironment / TableEnvironment?
Would it be something like: val custTS: TableSource = new AvroInputFormat("/path/to/file", ...)
Below is the Java Avro implementation reference (connectors), but I can't find a Scala reference anywhere:
AvroInputFormat<User> users = new AvroInputFormat<User>(in, User.class);
DataSet<User> usersDS = env.createInput(users);

You can use Flink's InputFormats, including the AvroInputFormat, from both the Java and the Scala API:
Streaming & batch: val avroInputStream = env.createInput(new AvroInputFormat[User](in, classOf[User]))
Table API: tableEnv.registerTable("table", avroInputStream.toTable(tableEnv))
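A slightly fuller sketch with the batch API (exact imports and factory methods vary by Flink version; User is assumed to be an Avro-generated class, and the path and table name are placeholders):
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.avro.AvroInputFormat
import org.apache.flink.table.api.scala._

// "User" is assumed to be an Avro-generated class; the path is a placeholder.
val env = ExecutionEnvironment.getExecutionEnvironment
val tableEnv = BatchTableEnvironment.create(env)

// Batch: create a DataSet[User] from the Avro file
val users: DataSet[User] = env.createInput(
  new AvroInputFormat[User](new Path("/path/to/users.avro"), classOf[User]))

// Table API: convert the DataSet to a Table and register it under a name
tableEnv.registerTable("users", users.toTable(tableEnv))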

Related

Spark 3.2.0 Structured Streaming save data to Kafka with Confluent Schema Registry

Is there an easy way to save a Spark Structured Streaming DataFrame to Kafka with the Confluent Schema Registry? Spark version is 3.2.0, Scala 2.12.
I managed to read data from Kafka with the Confluent Schema Registry with a bit of ugly code:
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.generic.GenericRecord

val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistry, 128)
val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)

object DeserializerWrapper {
  val deserializer = kafkaAvroDeserializer
}

class AvroDeserializer extends AbstractKafkaAvroDeserializer {
  def this(client: SchemaRegistryClient) {
    this()
    this.schemaRegistry = client
  }

  // Deserialize the Confluent-framed Avro bytes and render the record as a string
  override def deserialize(bytes: Array[Byte]): String = {
    val genericRecord = super.deserialize(bytes).asInstanceOf[GenericRecord]
    genericRecord.toString
  }
}

spark.udf.register("deserialize", (bytes: Array[Byte]) =>
  DeserializerWrapper.deserializer.deserialize(bytes))
Now I would like to write the data to another Kafka topic - is there a simple way?
You'd need to use similarly ugly code that uses a serializer UDF over a Struct column (or primitive type).
There are libraries that can help make it less ugly, e.g. https://github.com/AbsaOSS/ABRiS
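For the write side itself, here is a minimal sketch of the Kafka sink, assuming serialized is a streaming DataFrame whose value column already holds the Confluent-Avro-encoded bytes (produced by a serializer UDF or by ABRiS); the servers, topic name and checkpoint path are placeholders:
import org.apache.spark.sql.functions.col

// "serialized" is assumed to be a streaming DataFrame with a binary "value"
// column that already holds the Avro payload; the option values below are
// placeholders to adjust for your cluster.
serialized
  .select(col("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/checkpoints/output-topic")
  .start()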

Push Data to Nifi Flow using apache spark and scala

I want to get data from a NiFi flow into Spark and do some processing. After that I want to send the result back to the NiFi flow.
This is my NiFi flow that sends data to Spark using output ports.
To get data from the NiFi flow I wrote the function below.
import java.nio.charset.StandardCharsets

import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.NiFiReceiver
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

def process(): Unit = {
  // Schema for the incoming CSV records
  val schema =
    StructType(
      Seq(
        StructField(name = "ID", dataType = StringType, nullable = false),
        StructField(name = "Client_Name", dataType = StringType, nullable = false),
        StructField(name = "Due_Date", dataType = StringType, nullable = false),
        StructField(name = "Score", dataType = StringType, nullable = false)
      )
    )

  // Site-to-site client pointing at the NiFi output port
  val config = new SiteToSiteClient.Builder()
    .url("http://localhost:8090/nifi")
    .portName("Data For Spark")
    .buildConfig()

  val sparkConf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("NiFi-Spark Streaming example")
  val ssc = new StreamingContext(sparkConf, Seconds(10))
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()

  // Receive flow files from NiFi and decode their content as UTF-8 strings
  val packetStream = ssc.receiverStream(new NiFiReceiver(config, StorageLevel.MEMORY_ONLY))
  val file = packetStream.map(dataPacket => new String(dataPacket.getContent, StandardCharsets.UTF_8))

  file.foreachRDD(rdd => {
    // Drop the header line and split each CSV line into a Row
    val data = spark.createDataFrame(rdd
      .filter(!_.contains("ID,Client_Name,Due_Date,Score"))
      .map(line => Row.fromSeq(line.split(",").toSeq)), schema)
    data.show(100)
    val id = data.select("ID")
  })

  ssc.start()
  ssc.awaitTermination()
}
The final result of the above function is the id DataFrame. I want to send that result to the NiFi flow. I don't want to write the result as a file to some destination and pick it up in NiFi with a GetFile processor.
How can I send the final result to the NiFi flow?
This is an interesting approach.
Have you considered introducing a brokering service such as Apache Kafka? This can be used both as a source and as a sink in your Apache Spark application and the integration is out of the box. You can also follow the official guide here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html. The guide describes a flow using the relatively new Apache Spark Structured Streaming.
Then on Apache NiFi you can use the ConsumeKafkaRecord processor to consume from the same topic being used as a sink in your Apache Spark application. You can also make use of the PublishKafkaRecord processor if you wish to refactor your application to make use of Apache Kafka as a source as well rather than relying on Apache NiFi sockets directly.
Update: If you absolutely must write directly to Apache NiFi, using Apache Spark Structured Streaming you can extend the ForeachWriter class (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter) to implement your own custom sink.
Look at ListenHTTP. That way you treat NiFi as a simple REST service. Personally, I would prefer a message bus between Spark and NiFi, but if that is not possible for your use case you can see whether this works for you.
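Putting the two suggestions together, here is a minimal sketch (not the asker's code) of a Structured Streaming ForeachWriter that POSTs each row to a NiFi ListenHTTP processor; the endpoint URL, port and content type are placeholders:
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

import org.apache.spark.sql.{ForeachWriter, Row}

// POSTs each row, rendered as a CSV line, to a NiFi ListenHTTP endpoint.
// The endpoint (host, port, base path) is a placeholder for your flow.
class NiFiListenHttpWriter(endpoint: String) extends ForeachWriter[Row] {

  override def open(partitionId: Long, epochId: Long): Boolean = true

  override def process(row: Row): Unit = {
    val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "text/csv")
    val out = conn.getOutputStream
    try out.write(row.mkString(",").getBytes(StandardCharsets.UTF_8)) finally out.close()
    conn.getResponseCode // trigger the request; ListenHTTP answers 200 on success
    conn.disconnect()
  }

  override def close(errorOrNull: Throwable): Unit = ()
}

// Usage, with ids being a streaming DataFrame of the ID column:
// ids.writeStream.foreach(new NiFiListenHttpWriter("http://localhost:8081/contentListener")).start()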

AvroInputFormat returns set of object addresses instead of values

I'm writing some data using Flink's AvroOutputFormat:
val source: DataSet[Row] = environment.createInput(inputBuilder.finish)
val tableEnv: BatchTableEnvironment = new BatchTableEnvironment(environment, TableConfig.DEFAULT)
val table: Table = source.toTable(tableEnv)
val avroOutputFormat = new AvroOutputFormat[Row](classOf[Row])
avroOutputFormat.setCodec(AvroOutputFormat.Codec.NULL)
source.write(avroOutputFormat, "/Users/x/Documents/test_1.avro").setParallelism(1)
environment.execute()
This writes the data into a file called test_1.avro. When I try to read the file back as:
val users = new AvroInputFormat[Row](new Path("/Users/x/Documents/test_1.avro"), classOf[Row])
val usersDS = environment.createInput(users)
usersDS.print()
This prints the rows as:
java.lang.Object#4462efe1,java.lang.Object#7c3e4b1a,java.lang.Object#2db4ad1,java.lang.Object#765d55d5,java.lang.Object#2513a118,java.lang.Object#2bfb583b,java.lang.Object#73ae0257,java.lang.Object#6fc1020a,java.lang.Object#5762658b
Is there a way to print the data values instead of object addresses?
You are mixing the Table API and the DataSet API in a weird fashion. It would be best to stick to one API or use the proper conversion methods.
As is, you are basically not letting Flink know the expected input/output schema: classOf[Row] says everything and nothing.
To write a table to an Avro file, use the table connector. Basic sketch:
tableEnv.connect(new FileSystem().path("/path/to/file"))
  .withFormat(new Avro().avroSchema("...")) // <- adjust the Avro schema
  .withSchema(schema)
  .createTemporaryTable("AvroSinkTable")
table.insertInto("AvroSinkTable")
Edit: as of now, the filesystem connector unfortunately does not support Avro.
So there is no option but to use the DataSet API. I recommend using avrohugger to generate an appropriate Scala class for your Avro schema.
// convert the Table to a DataSet of the generated Scala class
val dsTuple: DataSet[User] = tableEnv.toDataSet[User](table)
// write out
val avroOutputFormat = new AvroOutputFormat[User](classOf[User])
avroOutputFormat.setCodec(AvroOutputFormat.Codec.SNAPPY)
avroOutputFormat.setSchema(User.SCHEMA$)
dsTuple.write(avroOutputFormat, outputPath1)
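For completeness, reading the file back with the same generated class then prints real field values instead of java.lang.Object addresses. A minimal sketch (the path is a placeholder, and the AvroInputFormat package differs between older and newer Flink versions):
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.avro.AvroInputFormat

// Read the Avro file back with the generated specific class so the records
// are properly deserialized; the path below is a placeholder.
val users = environment.createInput(
  new AvroInputFormat[User](new Path("/path/to/output.avro"), classOf[User]))
users.print()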

What are the available output formats for writeStream in Spark structured streaming

Consider a generic writeStream invocation - with the typical "console" output format:
out.writeStream
.outputMode("complete")
.format("console")
.start()
What are the alternatives? I noticed that the default is actually parquet:
In DataStreamWriter:
/**
 * Specifies the underlying output data source.
 *
 * @since 2.0.0
 */
def format(source: String): DataStreamWriter[T] = {
  this.source = source
  this
}

private var source: String = df.sparkSession.sessionState.conf.defaultDataSourceName
In SQLConf:
def defaultDataSourceName: String = getConf(DEFAULT_DATA_SOURCE_NAME)
val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default")
  .doc("The default data source to use in input/output.")
  .stringConf
  .createWithDefault("parquet")
But then how is the path for the parquet file specified? What are the other formats supported and what options do they have/require?
Here is the official spark documentation for the same: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
As of Spark 2.4.1, five sinks are supported out of the box:
File sink
Kafka sink
Foreach sink
Console sink
Memory sink
On top of that, one can also implement a custom sink by extending Spark's Sink API: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala
I found one reference: https://community.hortonworks.com/questions/89282/structured-streaming-writestream-append-to-file.html
It seems that option("path", path) can be used with the file sink.
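A minimal sketch of the file sink, assuming out is the streaming Dataset from the question; the paths are placeholders:
// File (here parquet, the default) sink: the target directory is given via the
// "path" option (or directly to start()); a checkpoint location is required,
// and the file sink only supports append mode. Paths are placeholders.
out.writeStream
  .format("parquet")
  .outputMode("append")
  .option("path", "/tmp/output")
  .option("checkpointLocation", "/tmp/checkpoints")
  .start()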

flink what is the equivalent to parseQuotedStrings in scala api

I am trying to convert this Java code to Scala:
DataSet<Tuple3<Long, String, String>> lines = env.readCsvFile("movies.csv")
.ignoreFirstLine()
.parseQuotedStrings('"')
.ignoreInvalidLines()
.types(Long.class, String.class, String.class);
I couldn't find any alternative to parseQuotedStrings in Scala. I would appreciate any assistance here.
The following code uses Flink's Java API; it is a literal translation of the code you provided:
import org.apache.flink.api.java._
val env = ExecutionEnvironment.getExecutionEnvironment
val movies = env.readCsvFile("movies.csv")
.ignoreFirstLine()
.parseQuotedStrings('"')
.ignoreInvalidLines()
.types(classOf[Long], classOf[String], classOf[String])
You can also use Flink's Scala API, something like this:
import org.apache.flink.api.scala._
val env = ExecutionEnvironment.getExecutionEnvironment
val movies = env.readCsvFile[(Long, String, String)](
  "movies.csv", ignoreFirstLine = true, quoteCharacter = '"', lenient = true)
AFAIK the Scala API does not have the fluent API of the Java version. The "lenient" option is the same as "ignoreInvalidLines", and the other options should be self-explanatory.