New to Scala, trying to compute and log latency for my Kafka records using log4j, but running into errors. I looked at some SO articles, but I think I'm missing some Scala concept here. Any help is appreciated.
Solution 1: This does not give an error.
val currentTimeInMillis = Instant.now.toEpochMilli
val latency = Math.max(0, currentTimeInMillis - record.timestamp())
logger.info("record latency: {}", latency)
logger.info("record KafkaPartition: {}, record Offset: {}", record.kafkaPartition(), record.kafkaOffset())
Solution 2: This gives an error:
val currentTimeInMillis = Instant.now.toEpochMilli
val latency = Math.max(0, currentTimeInMillis - record.timestamp())
logger.info("record latency: {}, record KafkaPartition: {}, record Offset: {}", latency, record.kafkaPartition(), record.kafkaOffset())
I get the error below for Solution 2:
error: overloaded method value info with alternatives
[ERROR] (x$1: org.slf4j.Marker,x$2: String,x$3: Object*)Unit <and>
[ERROR] (x$1: org.slf4j.Marker,x$2: String,x$3: Any,x$4: Any)Unit <and>
[ERROR] (x$1: String,x$2: Object*)Unit
[ERROR] cannot be applied to (String, Long, Integer, Long)
[ERROR] logger.info("record latency: {}, record KafkaPartition: {}, record Offset: {}", latency, record.kafkaPartition(), record.kafkaOffset())
[ERROR] ^
[ERROR] one error found
I've run into this before with Scala. Adding .toString to log arguments that aren't Strings will fix the problem.
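Applied to the failing call from Solution 2, the fix would look something like this:
val currentTimeInMillis = Instant.now.toEpochMilli
val latency = Math.max(0, currentTimeInMillis - record.timestamp())
// .toString turns the Scala Long / java.lang.Integer arguments into Strings,
// so the (String, Object*) overload of info applies
logger.info("record latency: {}, record KafkaPartition: {}, record Offset: {}",
  latency.toString, record.kafkaPartition().toString, record.kafkaOffset().toString)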
I'm following Flink's documentation on how to use WatermarkStrategy with KafkaConsumer. The code is shown below
val kafkaSource = new FlinkKafkaConsumer[MyType]("myTopic", schema, props)
kafkaSource.assignTimestampsAndWatermarks(
WatermarkStrategy
.forBoundedOutOfOrderness(Duration.ofSeconds(20)))
val stream: DataStream[MyType] = env.addSource(kafkaSource)
Anytime I try to compile the code above I get an error saying
error: overloaded method value assignTimestampsAndWatermarks with alternatives:
[ERROR] (x$1: org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks[String])org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase[String] <and>
[ERROR] (x$1: org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks[String])org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase[String] <and>
[ERROR] (x$1: org.apache.flink.api.common.eventtime.WatermarkStrategy[String])org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase[String]
[ERROR] cannot be applied to (org.apache.flink.api.common.eventtime.WatermarkStrategy[Nothing])
[ERROR] consumer.assignTimestampsAndWatermarks(
The code below returns WatermarkStrategy[Nothing] instead of WatermarkStrategy[String]:
WatermarkStrategy
  .forBoundedOutOfOrderness(Duration.ofSeconds(20))
I solved this by using this code:
val kafkaSource = new FlinkKafkaConsumer[MyType]("myTopic", schema, props)
val watermark: WatermarkStrategy[MyType] = WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20))
kafkaSource.assignTimestampsAndWatermarks(watermark)
@Mayokun is right. But to make the code simpler, you could put the type information right after the static method (without an expected type, Scala infers Nothing for the type parameter):
val kafkaSource = new FlinkKafkaConsumer[MyType]("myTopic", schema, props)
kafkaSource.assignTimestampsAndWatermarks(
WatermarkStrategy.forBoundedOutOfOrderness[MyType](Duration.ofSeconds(20))
)
I am new to Scala and I am trying to write a function that takes in a JSON string, converts it to a Scala dictionary (Map) and checks for certain keys.
Below is part of a function that checks for a bunch of keys:
import play.api.libs.json.Json
def setParams(jsonString: Map[String, Any]) = {
val paramsMap = Json.parse(jsonString)
if (parmsMap.contains("key_1")) {
println('key_1 present')
}
On compiling it with sbt, I get the following errors
/Users/usr/scala_codes/src/main/scala/wrapped_code.scala:29:26: overloaded method value parse with alternatives:
[error] (input: Array[Byte])play.api.libs.json.JsValue <and>
[error] (input: java.io.InputStream)play.api.libs.json.JsValue <and>
[error] (input: String)play.api.libs.json.JsValue
[error] cannot be applied to (Map[String,Any])
[error] val paramsMap = Json.parse(jsonString)
[error] ^
[error] /Users/usr/scala_codes/src/main/scala/wrapped_code.scala:31:9: not found: value parmsMap
[error] if (parmsMap.contains("key_1")) {
Also, in the key-value pairs of the JSON, the keys are all strings but the values could be integers, floats or strings. Do I need to make changes for that?
It seems the input type of your setParams function should be String, not Map[String, Any],
and you have one typo: if (parmsMap.contains("key_1")) should be if (paramsMap.contains("key_1")).
Corrected function:
import play.api.libs.json.{JsValue, Json}

def setParams(jsonString: String): Unit = {
  val paramsMap = Json.parse(jsonString).as[Map[String, JsValue]]
  if (paramsMap.contains("key_1")) println("key_1 present")
}
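As for the mixed value types: keeping the values as JsValue means you only convert each one when you read it, for example (a sketch; key_2 and key_3 are hypothetical names):
val maybeCount: Option[Int] = paramsMap.get("key_1").flatMap(_.asOpt[Int])
val maybeName: Option[String] = paramsMap.get("key_2").flatMap(_.asOpt[String]) // hypothetical key
val maybeRate: Option[Double] = paramsMap.get("key_3").flatMap(_.asOpt[Double]) // hypothetical key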
I need to group my RDD by two columns and aggregate the count. I have a function:
def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic])
: RDD[FeatureTuple] = {
val grouped_patients = diagnostic
.groupBy(x => (x.patientID, x.code))
.map(_._2)
.map{ events =>
val p_id = events.map(_.patientID).take(1).mkString
val f_code = events.map(_.code).take(1).mkString
val count = events.size.toDouble
((p_id, f_code), count)
}
//should be in form:
//diagnostic.sparkContext.parallelize(List((("patient", "diagnostics"), 1.0)))
}
At compile time, I am getting an error:
/FeatureConstruction.scala:38:3: type mismatch;
[error] found : Unit
[error] required: org.apache.spark.rdd.RDD[edu.gatech.cse6250.features.FeatureConstruction.FeatureTuple]
[error] (which expands to) org.apache.spark.rdd.RDD[((String, String), Double)]
[error] }
[error] ^
How can I fix it?
I read this post: Scala Spark type missmatch found Unit, required rdd.RDD, but I do not use collect(), so it does not help me.
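The body of the method above ends with a val definition followed only by comments, so the block evaluates to Unit; returning the grouped RDD as the last expression gives the declared RDD[FeatureTuple]. A minimal sketch of that fix, keeping the original logic:
def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] = {
  val grouped_patients = diagnostic
    .groupBy(x => (x.patientID, x.code))
    .map { case ((patientID, code), events) =>
      // the group key already carries patientID and code, so no need to re-extract them
      ((patientID, code), events.size.toDouble)
    }
  grouped_patients // last expression is the RDD, not Unit
}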
I am reading data from a Kafka topic in a Spark Streaming job. I need to create a key-value RDD out of the data.
val messages = KafkaUtils.createStream(streamingContext, "localhost:2181","abc",topics, StorageLevel.MEMORY_ONLY)
messages.print()
Create a key-value RDD out of CustomerId and Tokens:
val xactionByCustomer = messages.map(_._2).map {
transaction =>
val key = transaction.customerId
var tokens = transaction.tokens
(key, tokens)
}
Error:
[error] /home/ec2-user/alok/marseille/src/main/scala/com/jcalc/feed/MarkovPredictor.scala:115: value customerId is not a member of String
[error] val key = transaction.customerId
[error] ^
[error] /home/ec2-user/alok/marseille/src/main/scala/com/jcalc/feed/MarkovPredictor.scala:116: value tokens is not a member of String
[error] var tokens = transaction.tokens
[error] ^
[error] two errors found
[error] (compile:compileIncremental) Compilation failed
Sample data:
(null,W3Q6TF3CCI,X84N230CIH,NNN)
(null,O8IV7KEXT0,G1D590G05V,NNS)
(null,LBQKYNE081,MYU0O7JC5H,NHN)
(null,SRB4P501SW,E0FTI4RN7X,LHL)
(null,HELRFMAXVS,W6F704TN21,LHN)
(null,FS4PLQLI63,TK1O9YHS15,NNN)
(null,KI70UDVJLC,4ANBDAW7SU,LNN)
(null,IP6IVPGCWQ,MD93GGGBKA,NNN)
(null,976N9RPXSP,JKU0SV7UMH,LNL)
(null,J4V3AB1YVT,J9WXC1BRAY,LHN)
I am only interested in the 2nd & 4th values for the pair RDD.
Any help?
Your data looks like a tuple (String, String, String, String), and since you're interested in the 2nd & 4th values, mapping:
val xactionByCustomer = messages.map(row => (row._2, row._4))
should be enough.
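If messages actually comes back as a DStream[(String, String)] (the usual result of KafkaUtils.createStream) and the value is the comma-separated record shown in the sample output, a sketch under that assumption would split the value instead:
// assumes each value string looks like "null,W3Q6TF3CCI,X84N230CIH,NNN"
val xactionByCustomer = messages.map(_._2).map { value =>
  val fields = value.split(",")
  (fields(1), fields(3)) // 2nd and 4th fields
}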
There are some examples of using SQL over Spark Streaming in foreachRDD(). But what if I want to use SQL in transform():
case class AlertMsg(host:String, count:Int, sum:Double)
val lines = ssc.socketTextStream("localhost", 8888)
lines.transform( rdd => {
if (rdd.count > 0) {
val t = sqc.jsonRDD(rdd)
t.registerTempTable("logstash")
val sqlreport = sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY host ORDER BY host_c DESC LIMIT 100")
sqlreport.map(r => AlertMsg(r(0).toString,r(1).toString.toInt,r(2).toString.toDouble))
} else {
rdd
}
}).print()
I got this error:
[error] /Users/raochenlin/Downloads/spark-1.2.0-bin-hadoop2.4/logstash/src/main/scala/LogStash.scala:52: no type parameters for method transform: (transformFunc: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[U])(implicit evidence$5: scala.reflect.ClassTag[U])org.apache.spark.streaming.dstream.DStream[U] exist so that it can be applied to arguments (org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable])
[error] --- because ---
[error] argument expression's type is not compatible with formal parameter type;
[error] found : org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[_ >: LogStash.AlertMsg with String <: java.io.Serializable]
[error] required: org.apache.spark.rdd.RDD[String] => org.apache.spark.rdd.RDD[?U]
[error] lines.transform( rdd => {
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
It seems it only works if I use sqlreport.map(r => r.toString). Is that the correct usage?
dstream.transform takes a function transformFunc: (RDD[T]) ⇒ RDD[U].
In this case, the if must produce the same type in both branches, which is not the case:
if (count == 0) => RDD[String]
if (count > 0) => RDD[AlertMsg]
In this case, remove the if (rdd.count > 0) optimization so that you have a single transformation path.
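A minimal sketch of that single path, reusing the question's sqc, query, and AlertMsg (this assumes sqc.jsonRDD copes with whatever batches arrive, including empty ones):
lines.transform { rdd =>
  val t = sqc.jsonRDD(rdd)
  t.registerTempTable("logstash")
  val sqlreport = sqc.sql("SELECT host, COUNT(host) AS host_c, AVG(lineno) AS line_a FROM logstash WHERE path = '/var/log/system.log' AND lineno > 70 GROUP BY host ORDER BY host_c DESC LIMIT 100")
  // every batch now yields RDD[AlertMsg], so U is inferred as AlertMsg
  sqlreport.map(r => AlertMsg(r(0).toString, r(1).toString.toInt, r(2).toString.toDouble))
}.print()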