No File writen down to HDFS in flink

No File writen down to HDFS in flink - apache-kafka

I'm trying to consume kafka by flink and save the result to hdfs but no file was produces all the time.. and no error message raise up..
btw, it's ok to save to local file but when I change the path to hdfs, I got nothing.
object kafka2Hdfs {
private val ZOOKEEPER_HOST = "ip1:2181,ip2:2181,ip3:2181"
private val KAFKA_BROKER = "ip1:9092,ip2:9092,ip3:9092"
private val TRANSACTION_GROUP = "transaction"
val topic = "tgt3"
def main(args : Array[String]){
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.enableCheckpointing(1000L)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
// configure Kafka consumer
val kafkaProps = new Properties()
.... //topic infos
kafkaProps.setProperty("fs.default-scheme", "hdfs://ip:8020")
val consumer = new FlinkKafkaConsumer010[String](topic, new SimpleStringSchema(), kafkaProps)
val source = env.addSource(consumer)
val path = new Path("/user/jay/data")
// sink
val rollingPolicy : RollingPolicy[String,String] = DefaultRollingPolicy.create()
.withRolloverInterval(15000)
.build()
val sink: StreamingFileSink[String] = StreamingFileSink
.forRowFormat(path, new SimpleStringEncoder[String]("UTF-8"))
.withRollingPolicy(rollingPolicy)
.build()
source.addSink(sink)
env.execute("test")
}
}
I'm very confused..

Off the top of my head, there could be two things to look into:
Is the HDFS namenode properly configured so that Flink knows it tries to write to HDFS instead of local disk?
What do the nodemanger and taskmanager logs say? it could fail due to permission issue on HDFS.

Related

Presto is giving this error : Cannot invoke "com.fasterxml.jackson.databind.JsonNode.has(String)" because "currentNode" is null

I'm pushing a JSON file into a Kafka topic, connecting the topic in presto and structuring the JSON data into a queryable table.
The problem I am facing is that , presto is not to fetch data its shows error Cannot invoke "com.fasterxml.jackson.databind.JsonNode.has(String)" because "currentNode" is null.
Code for pushing data into kafka topic:
object Producer extends App{
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.connect.json.JsonSerializer")
val producer = new KafkaProducer[String,JsonNode](props)
println("inside prducer")
val mapper = (new ObjectMapper() with ScalaObjectMapper).
registerModule(DefaultScalaModule).
configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false).
findAndRegisterModules(). // register joda and java-time modules automatically
asInstanceOf[ObjectMapper with ScalaObjectMapper]
val filename = "/Users/rishunigam/Documents/devicd.json"
val jsonNode: JsonNode= mapper.readTree(new File(filename))
val s = jsonNode.size()
for(i <- 0 to jsonNode.size()) {
val js = jsonNode.get(i)
println(js)
val record = new ProducerRecord[String, JsonNode]( "tpch.devicelog", js)
println(record)
producer.send( record )
}
println("producer complete")
producer.close()
}

Unable to turn kafka stream into flink table

I want to make a table out of a kafka stream and print it but getting this trivial errror.
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
val src = new FlinkKafkaConsumer010[ObjectNode]("broadcast", new JSONKeyValueDeserializationSchema(false), properties)
val stream = env.
addSource(src)
val tbl = tEnv.registerDataStream("ASK", stream, 'locationID, 'temp)
env.execute()
Kindly Help, ThankYou

Flink crash with “java.lang.IllegalArgumentException” after I fetch some data to kafka topic

I have a Flink program consuming a Kafka topic. I use Spark to send some message(JSON String I copied from the topic) to the topic (what I want to do is to manually trigger the flink computation). Then the Flink crashed instantly with following error:
java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:275)
at kafka.message.Message.sliceDelimited(Message.scala:236)
at kafka.message.Message.payload(Message.scala:218)
at org.apache.flink.streaming.connectors.kafka.internals.SimpleConsumerThread.run(SimpleConsumerThread.java:338)
Anyone can tell me why this happened and how to resolve it ?
There is the spark code I use to write json string to Kafka:
// Connect Kafka
println("Connecting kafka")
val KAFKA_QUEUE_TIME = 5000
val KAFKA_BATCH_SIZE = 16384
val brokerList = "kafka05broker01.cnsuning.com:9092,kafka05broker02.cnsuning.com:9092,kafka05broker03.cnsuning.com:9092"
val props = new Properties
props.put("serializer.class", "kafka.serializer.StringEncoder")
props.put("partitioner.class", "utils.SimplePartitioner")
props.put("metadata.broker.list", brokerList)
props.put("producer.type", "async")
props.put("queue.time", "5000")
props.put("batch.size", "16384")
val config = new ProducerConfig(props)
val producer = new Producer[AnyRef, AnyRef](config)
// Send Kafka
println("Sending msg to kafka")
val topic = "xxxxxx"
val msg = "xxx"
for (i <- 0 to 1000) {
println(i)
val randomPartition = "" + new Random().nextInt(255)
val message = new KeyedMessage[AnyRef, AnyRef](topic, randomPartition, msg)
producer.send(message)
}
There is the how I consume in flink:
val allActProperties = kafkaPropertiesGen( GroupId, BrokerServer, ZKConnect)
val streamComsumer = new FlinkKafkaConsumer08[TraitRecord](topic, new TraitRecordSchema(), allActProperties)
val stream: DataStream[TraitRecord] = env.addSource(streamComsumer ).setParallelism(12)

Which version of kafka you are using?It's seems the kafka jar version is not corresponding to your kafka version.Or FlinkKafkaConsumer08 is not corresponding to your kafka version?Visit
java.lang.IllegalArgumentException kafka console consumer

scala- how can I confirm specific topic exists in Kafka server(broker)?

I am using scala, spark, and Kafka. I have 2 questions.
1.how can I confirm the topic exists in Kafka broker(server)?
2.how can I confirm the Kafka server (bootstrap server) is running or not?
object kafkaProducer extends App {
def sendMessages(): Unit = {
//define topic
val topic = "spark-topic" // how can i confirm this topic is exist in kafka server ?
//define producer properties
val props = new java.util.Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("client.id", "KafkaProducer")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.connect.json.JsonSerializer")
//create producer instance
val kafkaProducer = new KafkaProducer[String, JsonNode](props)
//create object mapper
val mapper = new ObjectMapper with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
//mapper Json object to string
def toJson(value: Any): String = {
mapper.writeValueAsString(value)
}
//send producer message
val jsonstring =
s"""{
| "id": "0001",
| "name": "Peter"
|}
""".stripMargin
val jsonNode: JsonNode = mapper.readTree(jsonstring)
val rec = new ProducerRecord[String, JsonNode](topic, jsonNode)
kafkaProducer.send(rec)
//println(rec)
}
}

1) The recommended way to check if a topic exists is to use the AdminClient API.
You can use listTopics() or describeTopics().
2) Assuming you don't have any privilege access to the cluster (to check metrics or liveness probes), the only way to check the cluster is running is to try connecting to/use it.
With the AdminClient, you can use describeCluster(), for example, to attempt to retrieve the state of the cluster.

How to write spark streaming DF to Kafka topic

I am using Spark Streaming to process data between two Kafka queues but I can not seem to find a good way to write on Kafka from Spark. I have tried this:
input.foreachRDD(rdd =>
rdd.foreachPartition(partition =>
partition.foreach {
case x: String => {
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
println(x)
val producer = new KafkaProducer[String, String](props)
val message = new ProducerRecord[String, String]("output", null, x)
producer.send(message)
}
}
)
)
and it works as intended but instancing a new KafkaProducer for every message is clearly unfeasible in a real context and I'm trying to work around it.
I would like to keep a reference to a single instance for every process and access it when I need to send a message. How can I write to Kafka from Spark Streaming?

Yes, unfortunately Spark (1.x, 2.x) doesn't make it straight-forward how to write to Kafka in an efficient manner.
I'd suggest the following approach:
Use (and re-use) one KafkaProducer instance per executor process/JVM.
Here's the high-level setup for this approach:
First, you must "wrap" Kafka's KafkaProducer because, as you mentioned, it is not serializable. Wrapping it allows you to "ship" it to the executors. The key idea here is to use a lazy val so that you delay instantiating the producer until its first use, which is effectively a workaround so that you don't need to worry about KafkaProducer not being serializable.
You "ship" the wrapped producer to each executor by using a broadcast variable.
Within your actual processing logic, you access the wrapped producer through the broadcast variable, and use it to write processing results back to Kafka.
The code snippets below work with Spark Streaming as of Spark 2.0.
Step 1: Wrapping KafkaProducer
import java.util.concurrent.Future
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}
class MySparkKafkaProducer[K, V](createProducer: () => KafkaProducer[K, V]) extends Serializable {
/* This is the key idea that allows us to work around running into
NotSerializableExceptions. */
lazy val producer = createProducer()
def send(topic: String, key: K, value: V): Future[RecordMetadata] =
producer.send(new ProducerRecord[K, V](topic, key, value))
def send(topic: String, value: V): Future[RecordMetadata] =
producer.send(new ProducerRecord[K, V](topic, value))
}
object MySparkKafkaProducer {
import scala.collection.JavaConversions._
def apply[K, V](config: Map[String, Object]): MySparkKafkaProducer[K, V] = {
val createProducerFunc = () => {
val producer = new KafkaProducer[K, V](config)
sys.addShutdownHook {
// Ensure that, on executor JVM shutdown, the Kafka producer sends
// any buffered messages to Kafka before shutting down.
producer.close()
}
producer
}
new MySparkKafkaProducer(createProducerFunc)
}
def apply[K, V](config: java.util.Properties): MySparkKafkaProducer[K, V] = apply(config.toMap)
}
Step 2: Use a broadcast variable to give each executor its own wrapped KafkaProducer instance
import org.apache.kafka.clients.producer.ProducerConfig
val ssc: StreamingContext = {
val sparkConf = new SparkConf().setAppName("spark-streaming-kafka-example").setMaster("local[2]")
new StreamingContext(sparkConf, Seconds(1))
}
ssc.checkpoint("checkpoint-directory")
val kafkaProducer: Broadcast[MySparkKafkaProducer[Array[Byte], String]] = {
val kafkaProducerConfig = {
val p = new Properties()
p.setProperty("bootstrap.servers", "broker1:9092")
p.setProperty("key.serializer", classOf[ByteArraySerializer].getName)
p.setProperty("value.serializer", classOf[StringSerializer].getName)
p
}
ssc.sparkContext.broadcast(MySparkKafkaProducer[Array[Byte], String](kafkaProducerConfig))
}
Step 3: Write from Spark Streaming to Kafka, re-using the same wrapped KafkaProducer instance (for each executor)
import java.util.concurrent.Future
import org.apache.kafka.clients.producer.RecordMetadata
val stream: DStream[String] = ???
stream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
val metadata: Stream[Future[RecordMetadata]] = partitionOfRecords.map { record =>
kafkaProducer.value.send("my-output-topic", record)
}.toStream
metadata.foreach { metadata => metadata.get() }
}
}
Hope this helps.

My first advice would be to try to create a new instance in foreachPartition and measure if that is fast enough for your needs (instantiating heavy objects in foreachPartition is what the official documentation suggests).
Another option is to use an object pool as illustrated in this example:
https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/kafka/PooledKafkaProducerAppFactory.scala
I however found it hard to implement when using checkpointing.
Another version that is working well for me is a factory as described in the following blog post, you just have to check if it provides enough parallelism for your needs (check the comments section):
http://allegro.tech/2015/08/spark-kafka-integration.html

With Spark >= 2.2
Both read and write operations are possible on Kafka using Structured Streaming API
Build stream from Kafka topic
// Subscribe to a topic and read messages from the earliest to latest offsets
val ds= spark
.readStream // use `read` for batch, like DataFrame
.format("kafka")
.option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
.option("subscribe", "source-topic1")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load()
Read the key and value and apply the schema for both, for simplicity we are making converting both of them to String type.
val dsStruc = ds.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
Since dsStruc have the schema, it accepts all SQL kind operations like filter, agg, select ..etc on it.
Write stream to Kafka topic
dsStruc
.writeStream // use `write` for batch, like DataFrame
.format("kafka")
.option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
.option("topic", "target-topic1")
.start()
More configuration for Kafka integration to read or write
Key artifacts to add in the application
"org.apache.spark" % "spark-core_2.11" % 2.2.0,
"org.apache.spark" % "spark-streaming_2.11" % 2.2.0,
"org.apache.spark" % "spark-sql-kafka-0-10_2.11" % 2.2.0,

There is a Streaming Kafka Writer maintained by Cloudera (actually spun off from a Spark JIRA [1]). It basically creates a producer per partition, which amortizes the time spent to create 'heavy' objects over a (hopefully large) collection of elements.
The Writer can be found here: https://github.com/cloudera/spark-kafka-writer

I was having the same issue and found this post.
The author solves the problem by creating 1 producer per executor. Instead of sending the producer itself, he sends only a “recipe” how to create a producer in an executor by broadcasting it.
val kafkaSink = sparkContext.broadcast(KafkaSink(conf))
He uses a wrapper that lazily creates the producer:
class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
lazy val producer = createProducer()
def send(topic: String, value: String): Unit = producer.send(new ProducerRecord(topic, value))
}
object KafkaSink {
def apply(config: Map[String, Object]): KafkaSink = {
val f = () => {
val producer = new KafkaProducer[String, String](config)
sys.addShutdownHook {
producer.close()
}
producer
}
new KafkaSink(f)
}
}
The wrapper is serializable because the Kafka producer is initialized just before first use on an executor. The driver keeps the reference to the wrapper and the wrapper sends the messages using each executor's producer:
dstream.foreachRDD { rdd =>
rdd.foreach { message =>
kafkaSink.value.send("topicName", message)
}
}

Why is it infeasible? Fundamentally each partition of each RDD is going to run independently (and may well run on a different cluster node), so you have to redo the connection (and any synchronization) at the start of each partition's task. If the overhead of that is too high then you should increase the batch size in your StreamingContext until it becomes acceptable (obv. there's a latency cost to doing this).
(If you're not handling thousands of messages in each partition, are you sure you need spark-streaming at all? Would you do better with a standalone application?)

This might be what you want to do. You basically create one producer for each partition of records.
input.foreachRDD(rdd =>
rdd.foreachPartition(
partitionOfRecords =>
{
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String,String](props)
partitionOfRecords.foreach
{
case x:String=>{
println(x)
val message=new ProducerRecord[String, String]("output",null,x)
producer.send(message)
}
}
})
)
Hope that helps

With Spark < 2.2
Since there is no direct way of writing the messages to Kafka from Spark Streaming
Create a KafkaSinkWritter
import java.util.Properties
import org.apache.kafka.clients.producer._
import org.apache.spark.sql.ForeachWriter
class KafkaSink(topic:String, servers:String) extends ForeachWriter[(String, String)] {
val kafkaProperties = new Properties()
kafkaProperties.put("bootstrap.servers", servers)
kafkaProperties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
kafkaProperties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val results = new scala.collection.mutable.HashMap[String, String]
var producer: KafkaProducer[String, String] = _
def open(partitionId: Long,version: Long): Boolean = {
producer = new KafkaProducer(kafkaProperties)
true
}
def process(value: (String, String)): Unit = {
producer.send(new ProducerRecord(topic, value._1 + ":" + value._2))
}
def close(errorOrNull: Throwable): Unit = {
producer.close()
}
}
Write messages using SinkWriter
val topic = "<topic2>"
val brokers = "<server:ip>"
val writer = new KafkaSink(topic, brokers)
val query =
streamingSelectDF
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(ProcessingTime("25 seconds"))
.start()
Reference link

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

No File writen down to HDFS in flink - apache-kafka

Off the top of my head, there could be two things to look into: Is the HDFS namenode properly configured so that Flink knows it tries to write to HDFS instead of local disk? What do the nodemanger and taskmanager logs say? it could fail due to permission issue on HDFS.

Related

Presto is giving this error : Cannot invoke "com.fasterxml.jackson.databind.JsonNode.has(String)" because "currentNode" is null

Unable to turn kafka stream into flink table

Flink crash with “java.lang.IllegalArgumentException” after I fetch some data to kafka topic

scala- how can I confirm specific topic exists in Kafka server(broker)?

How to write spark streaming DF to Kafka topic

Categories

Resources