I'm following Flink's documentation on how to use a WatermarkStrategy with the FlinkKafkaConsumer. The code is shown below:
val kafkaSource = new FlinkKafkaConsumer[MyType]("myTopic", schema, props)
kafkaSource.assignTimestampsAndWatermarks(
  WatermarkStrategy
    .forBoundedOutOfOrderness(Duration.ofSeconds(20)))
val stream: DataStream[MyType] = env.addSource(kafkaSource)
Any time I try to compile the code above, I get an error saying:
error: overloaded method value assignTimestampsAndWatermarks with alternatives:
[ERROR] (x$1: org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks[String])org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase[String] <and>
[ERROR] (x$1: org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks[String])org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase[String] <and>
[ERROR] (x$1: org.apache.flink.api.common.eventtime.WatermarkStrategy[String])org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase[String]
[ERROR] cannot be applied to (org.apache.flink.api.common.eventtime.WatermarkStrategy[Nothing])
[ERROR] consumer.assignTimestampsAndWatermarks(
The code below returns WatermarkStrategy[Nothing] instead of WatermarkStrategy[String]:
WatermarkStrategy
  .forBoundedOutOfOrderness(Duration.ofSeconds(20))
I solved this by using the following code:
val kafkaSource = new FlinkKafkaConsumer[MyType]("myTopic", schema, props)
val watermarkStrategy: WatermarkStrategy[MyType] = WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20))
kafkaSource.assignTimestampsAndWatermarks(watermarkStrategy)
@Mayokun is right. But to make the code simpler, you can put the type parameter right after the static method. Because assignTimestampsAndWatermarks is overloaded, the compiler cannot infer the type parameter of forBoundedOutOfOrderness from the surrounding call and falls back to Nothing, so specifying it explicitly resolves the error:
val kafkaSource = new FlinkKafkaConsumer[MyType]("myTopic", schema, props)
kafkaSource.assignTimestampsAndWatermarks(
WatermarkStrategy.forBoundedOutOfOrderness[MyType](Duration.ofSeconds(20))
)
I am new to Scala and I am trying to write a function that takes in a JSON string, converts it to a Scala dictionary (Map), and checks for certain keys.
Below is part of a function that checks for a bunch of keys:
import play.api.libs.json.Json
def setParams(jsonString: Map[String, Any]) = {
val paramsMap = Json.parse(jsonString)
if (parmsMap.contains("key_1")) {
println('key_1 present')
}
On compiling it with sbt, I get the following errors
/Users/usr/scala_codes/src/main/scala/wrapped_code.scala:29:26: overloaded method value parse with alternatives:
[error] (input: Array[Byte])play.api.libs.json.JsValue <and>
[error] (input: java.io.InputStream)play.api.libs.json.JsValue <and>
[error] (input: String)play.api.libs.json.JsValue
[error] cannot be applied to (Map[String,Any])
[error] val paramsMap = Json.parse(jsonString)
[error] ^
[error] /Users/usr/scala_codes/src/main/scala/wrapped_code.scala:31:9: not found: value parmsMap
[error] if (parmsMap.contains("key_1")) {
Also, in the JSON's key-value pairs the keys are all strings, but the values could be integers, floats, or strings. Do I need to make changes for that?
It seems the input type of your setParams function should be String, not Map[String, Any], and you have one typo: if (parmsMap.contains("key_1")) should be if (paramsMap.contains("key_1")).
The corrected function:
import play.api.libs.json.{JsValue, Json}

def setParams(jsonString: String): Unit = {
  val paramsMap = Json.parse(jsonString).as[Map[String, JsValue]]
  if (paramsMap.contains("key_1")) println("key_1 present")
}
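Regarding the mixed value types: if you parse into Map[String, JsValue] as above, each value stays a JsValue, so integers, floats and strings can all be handled by pattern matching when you need the concrete type. A minimal sketch (the describeValue helper and the sample JSON are only illustrative, not part of the original code):

import play.api.libs.json.{JsNumber, JsString, JsValue, Json}

// Hypothetical helper: inspect a JsValue whose concrete type is not known up front.
def describeValue(v: JsValue): String = v match {
  case JsString(s) => s"string: $s"
  case JsNumber(n) => s"number: $n" // covers both integers and floats
  case other       => s"other: $other"
}

val paramsMap = Json.parse("""{"key_1": 1, "key_2": 2.5, "key_3": "abc"}""").as[Map[String, JsValue]]
paramsMap.foreach { case (k, v) => println(s"$k -> ${describeValue(v)}") }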
New to Scala, trying to compute and log latency for my Kafka records using log4j, but running into errors. I tried to look at some SO articles, but I think I'm missing some Scala concept here. Any help is appreciated.
Solution 1: does not give an error.
val currentTimeInMillis = Instant.now.toEpochMilli
val latency = Math.max(0, currentTimeInMillis - record.timestamp())
logger.info("record latency: {}", latency)
logger.info("record KafkaPartition: {}, record Offset: {}", record.kafkaPartition(), record.kafkaOffset())
Solution 2: this gives an error:
val currentTimeInMillis = Instant.now.toEpochMilli
val latency = Math.max(0, currentTimeInMillis - record.timestamp())
logger.info("record latency: {}, record KafkaPartition: {}, record Offset: {}", latency, record.kafkaPartition(), record.kafkaOffset())
I get the following error for Solution 2:
error: overloaded method value info with alternatives
[ERROR] (x$1: org.slf4j.Marker,x$2: String,x$3: Object*)Unit <and>
[ERROR] (x$1: org.slf4j.Marker,x$2: String,x$3: Any,x$4: Any)Unit <and>
[ERROR] (x$1: String,x$2: Object*)Unit
[ERROR] cannot be applied to (String, Long, Integer, Long)
[ERROR] logger.info("record latency: {}, record KafkaPartition: {}, record Offset: {}", latency, record.kafkaPartition(), record.kafkaOffset())
[ERROR] ^
[ERROR] one error found
I've run into this before with Scala. Adding .toString to log arguments that aren't Strings will fix the problem.
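For example, a minimal sketch of the failing call with the non-String arguments converted, so that it matches the (String, Object*) overload:

logger.info(
  "record latency: {}, record KafkaPartition: {}, record Offset: {}",
  latency.toString,
  record.kafkaPartition().toString,
  record.kafkaOffset().toString
)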
I need to group my RDD by two columns and aggregate the count. I have a function:
def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic])
: RDD[FeatureTuple] = {
val grouped_patients = diagnostic
.groupBy(x => (x.patientID, x.code))
.map(_._2)
.map{ events =>
val p_id = events.map(_.patientID).take(1).mkString
val f_code = events.map(_.code).take(1).mkString
val count = events.size.toDouble
((p_id, f_code), count)
}
//should be in form:
//diagnostic.sparkContext.parallelize(List((("patient", "diagnostics"), 1.0)))
}
At compile time, I am getting an error:
/FeatureConstruction.scala:38:3: type mismatch;
[error] found : Unit
[error] required: org.apache.spark.rdd.RDD[edu.gatech.cse6250.features.FeatureConstruction.FeatureTuple]
[error] (which expands to) org.apache.spark.rdd.RDD[((String, String), Double)]
[error] }
[error] ^
How can I fix it?
I read this post: Scala Spark type missmatch found Unit, required rdd.RDD, but I do not use collect(), so it does not help me.
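For what it's worth, a minimal sketch of the likely fix, assuming FeatureTuple expands to ((String, String), Double) as the error message says: the method body ends with a val definition followed by comments, so it evaluates to Unit; making the grouped RDD the last expression fixes the type mismatch.

import org.apache.spark.rdd.RDD

def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] = {
  diagnostic
    .groupBy(x => (x.patientID, x.code))
    .map { case ((patientID, code), events) =>
      ((patientID, code), events.size.toDouble) // count of events per (patientID, code)
    }
}

A reduceByKey over ((patientID, code), 1.0) pairs would avoid materializing the groups, but the version above keeps the original groupBy approach with the smallest change.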
I have a text file with the following format (id,f1,f2,f3,...,fn):
12345,0,0,1,2,...,3
23456,0,0,1,2,...,0
33333,0,1,1,0,...,0
56789,1,0,0,0,...,4
a_123,0,0,0,6,...,3
And I want to read the file (ignoring lines like a_123,0,0,0,6,...,3) to create an RDD[(Long, Vector)]. Here's my solution:
def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
val sc = sparkSession.sparkContext
sc.textFile(path)
.map({ line => val values=line.split(",")
(
values(0).toLong,
//util.Try(values(0).toLong).getOrElse(0L),
Vectors.dense(values.slice(1, values.length).map {x => x.toDouble }).toSparse
)})
.filter(x => x._1 > 0)
}
However, this code does not compile:
[ERROR] found : org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.SparseVector)]
[ERROR] required: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] (which expands to) org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] Note: (Long, org.apache.spark.ml.linalg.SparseVector) <: (Long, org.apache.spark.ml.linalg.Vector), but class RDD is invariant in type T.
[ERROR] You may wish to define T as +T instead. (SLS 4.5)
[ERROR] .filter(x => x._1 > 0)
[ERROR] ^
[ERROR] one error found
But if I remove .toSparse or .filter(x => x._1 > 0), the code compiles successfully.
Does someone know why, and what should I do to fix it?
Also, is there a better way to read the file into an RDD while ignoring the lines with non-numeric ids?
The code compiles successfully if you remove toSparse because then the element type of your pair RDD matches (ItemId, Vector).
org.apache.spark.ml.linalg.Vector is the type of the dense vector you are generating with Vectors.dense; when you call toSparse it is converted to org.apache.spark.ml.linalg.SparseVector, which is not the type your pair RDD expects.
As for filtering out the non-numeric IDs, I would say your method is a good way to do that.
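For reference, a minimal sketch of one way to make the types line up, assuming ItemId is just an alias for Long: pin the pair's second component to the base Vector trait so the filter operates on RDD[(Long, Vector)], and reuse the Try(...) idea from the question to turn non-numeric ids into 0 so they get filtered out.

import scala.util.Try

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def readDataset(path: String, sparkSession: SparkSession): RDD[(Long, Vector)] = {
  sparkSession.sparkContext
    .textFile(path)
    .map { line =>
      val values = line.split(",")
      // Non-numeric ids (e.g. "a_123") become 0 and are dropped by the filter below.
      val id: Long = Try(values(0).toLong).getOrElse(0L)
      // Ascribing the base trait keeps the RDD element type at (Long, Vector).
      val vec: Vector = Vectors.dense(values.drop(1).map(_.toDouble)).toSparse
      (id, vec)
    }
    .filter(_._1 > 0)
}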
There are some examples of using SQL over Spark Streaming in foreachRDD(). But what if I want to use SQL in transform():
case class AlertMsg(host:String, count:Int, sum:Double)
val lines = ssc.socketTextStream("localhost", 8888)
lines.transform( rdd => {
if (rdd.count > 0) {
val t = sqc.jsonRDD(rdd)
t.registerTempTable("logstash")
val sqlreport = sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY host ORDER BY host_c DESC LIMIT 100")
sqlreport.map(r => AlertMsg(r(0).toString,r(1).toString.toInt,r(2).toString.toDouble))
} else {
rdd
}
}).print()
I got this error:
[error] /Users/raochenlin/Downloads/spark-1.2.0-bin-hadoop2.4/logstash/src/main/scala/LogStash.scala:52: no type parameters for method transform: (transformFunc: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[U])(implicit evidence$5: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U] exist so that it can be applied to arguments (org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable])
[error] --- because ---
[error] argument expression's type is not compatible with formal parameter type;
[error] found : org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable]
[error] required: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[?U]
[error] lines.transform( rdd => {
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
It seems it only works if I use sqlreport.map(r => r.toString). Is that the correct usage?
dstream.transform takes a function transformFunc: (RDD[T]) ⇒ RDD[U].
In this case, both branches of the if must result in the same type, which they do not:
if (count == 0) => RDD[String]
if (count > 0) => RDD[AlertMsg]
To fix it, remove the if (rdd.count > 0) optimization so that you have a single transformation path, as sketched below.
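A minimal sketch of that single-path version, assuming sqc is the SQLContext from the snippet above and that running jsonRDD on an empty batch is acceptable (otherwise handle the empty case downstream):

lines.transform { rdd =>
  val t = sqc.jsonRDD(rdd)
  t.registerTempTable("logstash")
  val sqlreport = sqc.sql(
    "SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash " +
      "WHERE path = '/var/log/system.log' AND lineno > 70 " +
      "GROUP BY host ORDER BY host_c DESC LIMIT 100")
  // Every invocation now returns RDD[AlertMsg], so the type parameter U is fixed.
  sqlreport.map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
}.print()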