I'm trying to write a unit test for a streaming service written in Spark/Scala, using EmbeddedKafka.
Everything works fine for JSON input, but for Avro we need a schema registry.
I therefore tried MockSchemaRegistryClient, but it fails with the error "Connection refused".
Code snippet:
// Imports assumed: scala.jdk.CollectionConverters._ for .asJava and
// org.scalatest.wordspec.AnyWordSpec for the should/in syntax.
class MyTest extends AnyWordSpec with EmbeddedKafka {

  "Simple test for EmbeddedKafka" should {
    "successfully run the test case" in {
      val customBrokerProperties = Map(
        "includeHeaders" -> "true",
        AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG -> "http://localhost:1234"
      )
      val customProducerProperties = Map(
        "includeHeaders" -> "true",
        AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG -> "http://localhost:1234"
      )

      // schema and isKey are defined elsewhere in the test
      val mockSchemaRegistryClient = new MockSchemaRegistryClient()
      mockSchemaRegistryClient.register("testtopic-value", schema, 1, 1)

      val serializer = new KafkaAvroSerializer(mockSchemaRegistryClient)
      serializer.configure(customProducerProperties.asJava, isKey)

      val deserializer = new KafkaAvroDeserializer(mockSchemaRegistryClient)
      deserializer.configure(customBrokerProperties.asJava, isKey)
    }
  }
}
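As a side note (not part of the original question), newer Confluent serializers (5.4+) accept a mock:// scheme in schema.registry.url, which backs every serializer and deserializer with a shared in-memory registry instead of a real HTTP endpoint, so nothing tries to connect to localhost. A minimal, hedged sketch:

// Sketch only: "embedded-test" is an arbitrary mock-registry scope name, not a real URL;
// imports as in the test above, plus scala.jdk.CollectionConverters._ for .asJava.
val mockSerDeProps = Map[String, AnyRef](
  AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG -> "mock://embedded-test"
)
val avroSerializer = new KafkaAvroSerializer()           // no client passed in explicitly
avroSerializer.configure(mockSerDeProps.asJava, false)   // resolves to the in-memory mock registry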
We have an Akka/Scala application in which we serialize the incoming messages to Avro and then try to write them to a Kafka topic that is an Avro topic. While writing we get the exception below:
org.apache.kafka.common.errors.SerializationException: Error registering Avro schema
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Internal Server Error; error code: 500
We have checked that the Schema Registry is working fine and that the subject and version exist; we are not registering a new schema, it already exists. We are using Scala 2.13.8 and have tried different Confluent kafka-avro-serializer versions (5.1.0, 5.2.0, 5.3.0, 6.1.3). Can you tell what might be the cause of this?
After disabling auto-registration of schemas, the error was fixed: we set auto.register.schemas = false in the producer settings and that solved the problem.
Below are the complete producer settings for reference:
object KafkaSinkSettings {
val BootstrapServers = sys.env.get(Constants.BOOTSTRAP_SERVERS)
val ProducerConfig = Constants.PRODUCER_CONFIG
val SchemaRegistryUrl = sys.env.get(Constants.SCHEMA_REGISTRY_URL)
val Topic = sys.env.get(Constants.TOPIC)
val ClientId = Constants.METRICS_PREFIX
val MaxInFlightReqPerConn = sys.env.get(Constants.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION)
val RetriesConfig = sys.env.get(Constants.RETRIES_CONFIG)
val RequestTimeoutMSConfig = sys.env.get(Constants.REQUEST_TIMEOUT_MS_CONFIG)
val RetryBackoffMsConfig = sys.env.get(Constants.RETRY_BACKOFF_MS_CONFIG)
def apply(implicit config: Config): KafkaSinkSettings = new KafkaSinkSettings()
}
class KafkaSinkSettings(implicit config: Config) {
val producerConfig = config.getConfig(KafkaSinkSettings.ProducerConfig)
val kafkaAvroSerDeConfig = Map[String, Any](
AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG -> KafkaSinkSettings.SchemaRegistryUrl.getOrElse(
config.getString("kafka.schema-registry-url")),
AbstractKafkaAvroSerDeConfig.AUTO_REGISTER_SCHEMAS -> config.getString("kafka.auto-register-schemas"),
ProducerConfig.BOOTSTRAP_SERVERS_CONFIG -> KafkaSinkSettings.BootstrapServers.getOrElse(
config.getString("kafka.bootstrap-servers")),
KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG -> true.toString
)
def createAvroProducerSettings(): ProducerSettings[String, AnyRef] = {
val kafkaAvroSerializer = new KafkaAvroSerializer()
kafkaAvroSerializer.configure(kafkaAvroSerDeConfig.asJava, false)
val producerSettings = ProducerSettings(producerConfig, new StringSerializer, kafkaAvroSerializer)
.withBootstrapServers(KafkaSinkSettings.BootstrapServers.getOrElse(
config.getString("kafka.bootstrap-servers")))
.withProperty(ProducerConfig.CLIENT_ID_CONFIG, KafkaSinkSettings.ClientId + randomClientIdPostfix)
.withProperty(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, KafkaSinkSettings.MaxInFlightReqPerConn.getOrElse(
config.getString("kafka.max-in-flight-requests-per-connection")
))
.withProperty(ProducerConfig.RETRIES_CONFIG, KafkaSinkSettings.RetriesConfig.getOrElse(
config.getString("kafka.retries-config")
))
.withProperty(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, KafkaSinkSettings.RequestTimeoutMSConfig.getOrElse(
config.getString("kafka.request-timeout-ms-config")
))
.withProperty(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, KafkaSinkSettings.RetryBackoffMsConfig.getOrElse(
config.getString("kafka.retry-backoff-ms-config")
))
producerSettings
}
}
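For completeness, a hedged usage sketch of these settings with Alpakka Kafka's plain producer sink; the topic name, the loaded Config, and the avroRecord value are assumptions rather than part of the original answer:

import akka.actor.ActorSystem
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import com.typesafe.config.ConfigFactory
import org.apache.kafka.clients.producer.ProducerRecord

// Sketch only: "my-avro-topic" is a placeholder, and avroRecord stands in for
// a GenericRecord/SpecificRecord built elsewhere.
implicit val system: ActorSystem = ActorSystem("avro-producer")
val settings = KafkaSinkSettings(ConfigFactory.load()).createAvroProducerSettings()

Source
  .single(new ProducerRecord[String, AnyRef]("my-avro-topic", avroRecord))
  .runWith(Producer.plainSink(settings))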
I have a function getS3Object to get a JSON object stored in S3:
def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  // getObjectAsString returns the object's content as a String
  val object_to_write = s3client.getObjectAsString(bucketName, s3ObjectName)
  val file = new File(filename)
  val fileWriter = new FileWriter(file)
  val bw = new BufferedWriter(fileWriter)
  bw.write(object_to_write)
  bw.close()
  fileWriter.close()
}
My DataFrame (df) contains one column where each row is an S3ObjectName:
S3ObjectName
a1.json
b2.json
c3.json
d4.json
e5.json
When I execute the logic below, I get a "Task not serializable" error.
Method 1: df.foreach(x => getS3Object(x.getString(0)))
I tried converting the DataFrame to an RDD, but I still get the same error.
Method 2: df.rdd.foreach(x => getS3Object(x.getString(0)))
However, it works with collect().
Method 3: df.collect.foreach(x => getS3Object(x.getString(0)))
I do not want to use collect(), because it brings all the elements of the DataFrame to the driver and can cause an OutOfMemoryError.
Is there a way to make foreach() work as in Method 1?
The problem with your s3Client can be solved as follows. But remember that these functions run on the executor nodes (other machines), so writing to a local file with val file = new File(filename) is probably not going to work there.
You can put the files on a distributed file system such as HDFS or S3 instead.
object S3ClientWrapper extends Serializable {
  // The S3 client must be created here, so it is initialized on each executor JVM
  // when the object is first accessed instead of being serialized from the driver.
  val s3Client = {
    val awsCreds = new BasicAWSCredentials("access_key_id", "secret_key_id")
    AmazonS3ClientBuilder.standard()
      .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
      .build()
  }
}

def getS3Object(s3ObjectName: String): Unit = {
  val bucketName = "xyz"
  // getObjectAsString returns the object's content as a String
  val object_to_write = S3ClientWrapper.s3Client.getObjectAsString(bucketName, s3ObjectName)
  // now you have to solve your file problem
}
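To tie this back to the original foreach question, here is a hedged sketch of calling the wrapper from the executors with foreachPartition and writing the result back to S3 instead of to a local file; the output bucket name is an assumption:

// Sketch only: "xyz-output" is a hypothetical destination bucket.
df.select("S3ObjectName").rdd.foreachPartition { rows =>
  val s3 = S3ClientWrapper.s3Client                  // initialized once per executor JVM
  rows.foreach { row =>
    val key = row.getString(0)
    val content = s3.getObjectAsString("xyz", key)   // bucket name taken from the question
    s3.putObject("xyz-output", key, content)         // no local-file write on the executor
  }
}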
I would like to create my own CacheStore using Slick, storing the values in binary mode in a Postgres DB.
I have read the documentation about the binary marshaller on the Ignite website.
I was inspired by the code here: https://github.com/gastonlucero/ignite-persistence/blob/master/src/main/scala/test/db/CachePostgresSlickStore.scala
So I have written this code:
val myCacheCfg = new CacheConfiguration[String, MySpecialCustomObject]("MYCACHE")
myCacheCfg.setStoreKeepBinary(true)
myCacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(classOf[myCacheSlickStore]))
myCacheCfg.setBackups(1)
myCacheCfg.setCacheMode(CacheMode.LOCAL)
myCacheCfg.setReadThrough(true)
myCacheCfg.setWriteThrough(true)
.......
class myCacheSlickStore extends CacheStoreAdapter[String, MySpecialCustomObject] with PostgresSlickConnection with Serializable {.....}
......
trait PostgresSlickConnection extends PostgresSlickParameters {
val tableName: String
}
But I get a "type mismatch" error on the line with setCacheStoreFactory.
Do you have any idea, or an example, of how to create your own CacheStore with setStoreKeepBinary(true)?
Here is a complete example to illustrate:
final case class myObject(
parameters_1: Map[String, Set[String]],
parameters_2: Map[String, Set[String]]
)
class CacheSlickStore extends CacheStoreAdapter[String, BinaryObject] {}
val JdbcPersistence =
"myJdbcPersistence"
val cacheCfg =
new CacheConfiguration[String, myObject](JdbcPersistence)
cacheCfg.setStoreKeepBinary(true)
cacheCfg.setCacheStoreFactory(
FactoryBuilder.factoryOf(classOf[CacheSlickStore])
)
cacheCfg.setBackups(1)
cacheCfg.setCacheMode(CacheMode.LOCAL)
cacheCfg.setReadThrough(true)
cacheCfg.setWriteThrough(true)
var cache: IgniteCache[String, myObject] = _
val config = new IgniteConfiguration()
val ignition = Ignition.getOrStart(config)
cache = ignition.getOrCreateCache[String, myObject](JdbcPersistence)
ignition.addCacheConfiguration(cacheCfg)
If I cast CacheConfiguration it compiles but fails to run.
Finally, the solution is to cast in Scala to Any and not to BinaryObject. You can find a solution in this GitHub project.
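The linked project is not reproduced here, but a hedged sketch of what such a cast might look like, reusing the CacheSlickStore from the example above and assuming the factory's value type is widened to Any rather than BinaryObject:

cacheCfg.setCacheStoreFactory(
  FactoryBuilder
    .factoryOf(classOf[CacheSlickStore])
    // Sketch only: widen the factory's value type to Any so it lines up with
    // CacheConfiguration[String, myObject]; the linked project may do this differently.
    .asInstanceOf[javax.cache.configuration.Factory[org.apache.ignite.cache.store.CacheStore[String, Any]]]
)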
My goal is to use Kafka to read in a string in JSON format, filter it, select part of the message, and sink the message out (still as a JSON string).
For testing purposes, my input message looks like:
{"a":1,"b":2,"c":"3"}
And my implementation is:
def main(args: Array[String]): Unit = {
val inputProperties = new Properties()
inputProperties.setProperty("bootstrap.servers", "localhost:9092")
inputProperties.setProperty("group.id", "myTest2")
val inputTopic = "test"
val outputProperties = new Properties()
outputProperties.setProperty("bootstrap.servers", "localhost:9092")
val outputTopic = "test2"
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.disableSysoutLogging
env.getConfig.setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 10000))
// create a checkpoint every 5 seconds
env.enableCheckpointing(5000)
// create a Kafka streaming source consumer for Kafka 0.10.x
val kafkaConsumer = new FlinkKafkaConsumer010(
inputTopic,
new JSONDeserializationSchema(),
inputProperties)
val messageStream : DataStream[ObjectNode]= env
.addSource(kafkaConsumer).rebalance
val filteredStream: DataStream[ObjectNode] = messageStream.filter(node => node.get("a")
.asText.equals("1") && node.get("b").asText.equals("2"))
// Need help in this part, how to extract for instance a,c and
// get something like {"a":"1", "c":"3"}?
val testStream:DataStream[JsonNode] = filteredStream.map(
node => {
node.get("a")
}
)
testStream.addSink(new FlinkKafkaProducer010[JsonNode](
outputTopic,
new SerializationSchema[JsonNode] {
override def serialize(element: JsonNode): Array[Byte] = element.toString.getBytes()
}, outputProperties
))
env.execute("Kafka 0.10 Example")
}
As shown in the comment in this code, I am not sure how to properly select part of the message. I used map, but I don't know how to put the whole message back together. For instance, what I did in the code only gives me "1", but what I want is {"a":1, "c":"3"}.
Or maybe there is a completely different way to solve this problem. The thing is, Spark Streaming has a "select" API, but I cannot find an equivalent in Flink.
Thanks a lot for the Flink community's help! This is the last feature I would like to achieve in this small project.
A Flink streaming job processes each input once and outputs it to the next task or saves it to external storage.
One way is to save all the outputs to external storage, like HDFS; after the streaming job is done, a batch job can combine them into one JSON.
Another way is to use state and a RichMapFunction to get a JSON containing all the key-value pairs.
stream.map(new MapFunction<String, Tuple2<String, String>>() {
public Tuple2<String, String> map(String value) throws Exception {
return new Tuple2<String, String>("mock", value);
}
}).keyBy(0).map(new RichMapFunction<Tuple2<String,String>, String>() {
@Override
public String map(Tuple2<String, String> value) throws Exception {
ValueState<String> old = getRuntimeContext().getState(new ValueStateDescriptor<String>("test", String.class));
String newVal = old.value();
if (newVal != null) makeJSON(newVal, value.f1);
else newVal = value.f1;
old.update(newVal);
return newVal;
}
}).print();
And use this map function: filteredStream.map(function);
Note that when using state, you will see output like this:
{"a": 1}, {"a": 1, "c": 3}.
The last output should be what you want.
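Coming back to the original field-selection question, a hedged alternative sketch (not part of the answer above) is to project the fields inside the existing map by copying the ObjectNode and retaining only the wanted keys; this assumes the Jackson ObjectNode that JSONDeserializationSchema produces:

// Sketch only: retain drops every field except those listed, keeping "a" and "c".
val projectedStream: DataStream[ObjectNode] = filteredStream.map(
  node => node.deepCopy().retain("a", "c")
)

toString on the resulting ObjectNode yields the JSON text, so the same kind of SerializationSchema as in the question can write it to Kafka.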
I am writing a Flink CEP program inside a Lagom microservice implementation. My Flink CEP program runs perfectly fine in a simple Scala application, but when I use this code inside the Lagom service implementation I get the following exception.
Lagom Service Implementation
override def start = ServiceCall[NotUsed, String] {
val env = StreamExecutionEnvironment.getExecutionEnvironment
var executionConfig = env.getConfig
env.setParallelism(1)
executionConfig.disableSysoutLogging()
val topic_name = "topic_test"
val props = new Properties
props.put("bootstrap.servers", "localhost:9092")
props.put("acks", "all")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
props.put("block.on.buffer.full", "false")
val kafkaSource = new FlinkKafkaConsumer010(topic_name, new KafkaDeserializeSchema, props)
val stream = env.addSource(kafkaSource)
val deliveryPattern = Pattern.begin[XYZ]("begin").where(_.ABC == 5)
.next("next").where(_.ABC == 10).next("end").where(_.ABC==5)
val deliveryPatternStream = CEP.pattern(stream, deliveryPattern)
def selectFn(pattern : collection.mutable.Map[String, XYZ]): String = {
val startEvent = pattern.get("begin").get
val nextEvent = pattern.get("next").get
"Alert Detected"
}
val deliveryResult = deliveryPatternStream.select(selectFn(_)).print()
env.execute("CEP")
req => Future.successful("Done")
}
}
I don't understand how to resolve this issue.
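No answer is recorded above and the exception text is missing, so here is only a structural sketch, under the assumption that the job should be launched when the service call is invoked and that the blocking env.execute should stay off the request thread:

// Sketch only, not a fix for the unshown exception: buildAndExecuteCepJob() is a
// hypothetical helper wrapping the pipeline above, ending in env.execute("CEP").
override def start = ServiceCall[NotUsed, String] { _ =>
  Future {
    buildAndExecuteCepJob()
    "Done"
  }(scala.concurrent.ExecutionContext.global)
}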