Kafka stream count distinct in scala? - apache-kafka

I need to do a distinct count per user in a Kafka stream. This is my initial implementation, but the aggregate call fails with:
Required: [Seq[String], mutable.HashSet[String]]
Found: mutable.HashSet[String]
and I'm not sure how to provide a custom serde for mutable.HashSet either ...
val totalUniqueCategoriesCounts: KTable[String, Int] = inputStream
  .filter((_, ev) => ev.eventData.evData.pageType.isDefined)
  .groupBy((_, ev) => ev.eventData.custData.customerUid.get)
  .aggregate(initializer = mutable.HashSet[String])(
    (aggKey: String, newValue: Event, aggValue: mutable.HashSet[String]) => {
      val cat: String = newValue.eventData.cntData.contentCategory.get
      aggValue += cat
      aggValue
    }, **Serde Here?**)
  .mapValues((set: mutable.HashSet[String]) => set.size)
  //.count()
totalUniqueCategoriesCounts.toStream.to("total_unique_categories")
Any help would be appreciated.
I'm also concerned about performance. Is this the best way to do a count distinct in a kafka stream?
Update: I fixed the code issue, but I'm still concerned about the performance implications of this (if any), and about any better ways to do the same thing.

It seems all I was forgetting was the () after mutable.HashSet[String], so it should instead be:
val totalUniqueCategoriesCounts: KTable[String, Int] = inputStream
  .filter((_, ev) => ev.eventData.evData.pageType.isDefined)
  .groupBy((_, ev) => ev.eventData.custData.customerUid.get)
  .aggregate(initializer = mutable.HashSet[String]())(
    (aggKey: String, newValue: Event, aggValue: mutable.HashSet[String]) => {
      val cat: String = newValue.eventData.cntData.contentCategory.get
      aggValue += cat
      aggValue
    })
  .mapValues((set: mutable.HashSet[String]) => set.size)
  //.count()
totalUniqueCategoriesCounts.toStream.to("total_unique_categories")
IntelliJ was totally throwing me off :/
The performance question still remains, and I would really appreciate any input.
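On the serde part of the original question: assuming this is the kafka-streams-scala 2.x DSL (which the two-parameter-list aggregate suggests), aggregate takes an implicit Materialized that can be derived from an implicit Serde for the aggregate type, together with the usual ImplicitConversions and default Serdes imports. A minimal sketch of a hand-rolled Serde for the set, using a naive delimiter-based encoding (a real implementation would use JSON, Avro, or similar, and a delimiter that cannot appear in a category name):
import org.apache.kafka.common.serialization.Serde
import org.apache.kafka.streams.scala.Serdes
import scala.collection.mutable

// Naive encoding: join the categories with a control character on write, split on read.
implicit val hashSetSerde: Serde[mutable.HashSet[String]] =
  Serdes.fromFn[mutable.HashSet[String]](
    (set: mutable.HashSet[String]) => set.mkString("\u0001").getBytes("UTF-8"),
    (bytes: Array[Byte]) =>
      Option(bytes).map(b => mutable.HashSet(new String(b, "UTF-8").split("\u0001").toSeq: _*))
  )
On performance: with this approach the whole per-user set is deserialized, updated and reserialized on every event, so it stays cheap only while the number of distinct categories per user is small. For large or unbounded cardinality, a probabilistic sketch (e.g. HyperLogLog) or a two-stage aggregation keyed by (user, category) followed by a count per user is usually a better fit.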

Related

Wrong object type after stream and filter in Scala: Stream[_$2] instead of Stream[String]

I'm learning Scala and having some trouble with streams.
I'm trying to filter a collection of "Element" (from the scala-scraper library, which wraps Jsoup objects) down to the one that contains a "%" and extract its value.
override def extractRoi(line: Element): Double = {
line.asInstanceOf[JsoupElement].underlying
.select("td")
.stream()
.map(e => e.text().toString)
.filter(e => e.contains("%"))
.findFirst()
.orElse("")
.replace("%", "").toDouble
}
When I map with "e.text()" I should get a Stream[String], but it is a Stream[_$2] and I don't understand why. So the code doesn't compile at the orElse.
What transformation do I need to do to end up with a Stream[String]?
It seems that using Java streams in Scala does something weird. In my code above it couldn't infer the right type. So this is how I fixed it:
override def extractRoi(line: Element): Double = {
  // requires import scala.collection.JavaConverters._ for asScala
  line.select("td")
    .iterator()
    .asScala
    .map(e => e.text().toString)
    .find(e => e.contains("%"))
    .getOrElse("")
    .replace("%", "").toDouble
}
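An alternative that keeps the Java stream instead of converting to a Scala iterator is to pin map's type parameter explicitly, so inference never has to guess. A sketch, assuming Scala 2.12+ (so Scala lambdas convert to the java.util.function SAM types) and the same Element/JsoupElement types as the question:
override def extractRoi(line: Element): Double = {
  line.asInstanceOf[JsoupElement].underlying
    .select("td")
    .stream()
    .map[String](e => e.text())   // explicit type argument keeps the rest of the chain at Stream[String]
    .filter(e => e.contains("%"))
    .findFirst()
    .orElse("")
    .replace("%", "").toDouble
}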

request timeout from flatMapping over cats.effect.IO

I am attempting to transform some data that is encapsulated in cats.effect.IO, using a Map that is also in an IO monad. I'm using http4s with the Blaze server, and when I use the following code the request times out:
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
implicit val shiftJsonReader = new Reader[ShiftJson] {
def read(value: JValue): ShiftJson = value.extract[ShiftJson]
}
implicit val shiftJsonDec = jsonOf[IO, ShiftJson]
// get the shifts
var getDbShifts: IO[List[Shift]] = shiftModel.findByUserId(userId)
// use the userRoleId to get the RoleId then get the tasks for this role
val taskMap : IO[Map[String, Double]] = taskModel.findByUserId(userId).flatMap {
case tskLst: List[Task] => IO(tskLst.map((task: Task) => (task.name -> task.standard)).toMap)
}
val traversed: IO[List[Shift]] = for {
shifts <- getDbShifts
traversed <- shifts.traverse((shift: Shift) => {
val lstShiftJson: IO[List[ShiftJson]] = read[List[ShiftJson]](shift.roleTasks)
.map((sj: ShiftJson) =>
taskMap.flatMap((tm: Map[String, Double]) =>
IO(ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / tm.get(sj.name).get)))
).sequence
//TODO: this flatMap is bricking my request
lstShiftJson.flatMap((sjLst: List[ShiftJson]) => {
IO(Shift(shift.id, shift.shiftDate, shift.shiftStart, shift.shiftEnd,
shift.lunchDuration, shift.shiftDuration, shift.breakOffProd, shift.systemDownOffProd,
shift.meetingOffProd, shift.trainingOffProd, shift.projectOffProd, shift.miscOffProd,
write[List[ShiftJson]](sjLst), shift.userRoleId, shift.isApproved, shift.score, shift.comments
))
})
})
} yield traversed
traversed.flatMap((sLst: List[Shift]) => Ok(write[List[Shift]](sLst)))
}
As you can see from the TODO comment, I've narrowed this method down to the flatMap below it. If I remove that flatMap and merely return "IO(shift)" to the traversed variable, the request does not time out; however, that doesn't help me much because I need to make use of the lstShiftJson variable, which holds my transformed JSON.
My intuition tells me I'm abusing the IO monad somehow, but I'm not quite sure how.
Thank you for your time in reading this!
With the guidance of Luis's comment I refactored my code to the following. I don't think it is optimal (the flatMap at the end seems unnecessary, but I couldn't figure out how to remove it), but it's the best I've got.
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
implicit val shiftJsonReader = new Reader[ShiftJson] {
def read(value: JValue): ShiftJson = value.extract[ShiftJson]
}
implicit val shiftJsonDec = jsonOf[IO, ShiftJson]
// FOR EACH SHIFT
// - read the shift.roleTasks into a ShiftJson object
// - divide each task value by the task.standard where task.name = shiftJson.name
// - write the list of shiftJson back to a string
val traversed = for {
taskMap <- taskModel.findByUserId(userId).map((tList: List[Task]) => tList.map((task: Task) => (task.name -> task.standard)).toMap)
shifts <- shiftModel.findByUserId(userId)
traversed <- shifts.traverse((shift: Shift) => {
val lstShiftJson: List[ShiftJson] = read[List[ShiftJson]](shift.roleTasks)
.map((sj: ShiftJson) => ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / taskMap.get(sj.name).get ))
shift.roleTasks = write[List[ShiftJson]](lstShiftJson)
IO(shift)
})
} yield traversed
traversed.flatMap((t: List[Shift]) => Ok(write[List[Shift]](t)))
}
Luis mentioned that mapping my List[Task] to a Map[String, Double] is a pure operation, so we want to use map instead of flatMap.
He also mentioned that I was wrapping every operation that comes from the database in IO, which was causing a great deal of recomputation (including DB transactions).
To solve this I moved all of the database operations inside my for-comprehension; using the "<-" operator to flatMap each of the return values lets those variables reside within the IO monad, preventing the recomputation I was seeing before.
I do still think there must be a better way of producing the return value: flatMapping the "traversed" variable to get back inside the IO monad seems unnecessary, so please correct me if there is one.
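For what it's worth, the final flatMap is not extra recomputation (it just sequences one more IO step), but it can be folded into the same for-comprehension so the whole handler is a single expression. A rough sketch under the same assumptions as the code above (same models, http4s dsl and json4s implicits, which are omitted here); since nothing effectful happens per shift any more, traverse is replaced by a plain map:
def getScoresByUserId(userId: Int): IO[Response[IO]] =
  for {
    // build the task name -> standard lookup once
    taskMap <- taskModel.findByUserId(userId)
                 .map((tList: List[Task]) => tList.map(task => task.name -> task.standard).toMap)
    shifts  <- shiftModel.findByUserId(userId)
    // pure transformation of each shift's roleTasks json
    updated  = shifts.map { shift =>
                 val lstShiftJson = read[List[ShiftJson]](shift.roleTasks)
                   .map(sj => ShiftJson(sj.name, sj.taskType, sj.label,
                     sj.value.toString.toDouble / taskMap(sj.name)))
                 shift.roleTasks = write[List[ShiftJson]](lstShiftJson)
                 shift
               }
    resp    <- Ok(write[List[Shift]](updated))
  } yield resp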

Handling and throwing Exceptions in Scala

I have the following implementation:
val dateFormats = Seq("dd/MM/yyyy", "dd.MM.yyyy")
implicit def dateTimeCSVConverter: CsvFieldReader[DateTime] = (s: String) => Try {
val elem = dateFormats.map {
format =>
try {
Some(DateTimeFormat.forPattern(format).parseDateTime(s))
} catch {
case _: IllegalArgumentException =>
None
}
}.collectFirst {
case e if e.isDefined => e.get
}
if (elem.isDefined)
elem.get
else
throw new IllegalArgumentException(s"Unable to parse DateTime $s")
}
So basically, I'm running over my Seq and trying to parse the DateTime with the different formats. I then collect the first one that succeeds, and if none do, I throw the exception back.
I'm not completely satisfied with the code. Is there a better way to make it simpler? I need the exception message passed on to the caller.
The one problem with your code is that it tries all patterns even if the date was already parsed. You could use a lazy collection, like Stream, to solve this:
def dateTimeCSVConverter(s: String) = Stream("dd/MM/yyyy", "dd.MM.yyyy")
  .map(format => Try(DateTimeFormat.forPattern(format).parseDateTime(s)))
  .dropWhile(_.isFailure)
  .headOption
Even better is the solution proposed by jwvh with find (you don't have to call headOption):
def dateTimeCSVConverter(s: String) = Stream("dd/MM/yyyy", "dd.MM.yyyy")
  .map(format => Try(DateTimeFormat.forPattern(format).parseDateTime(s)))
  .find(_.isSuccess)
It returns None if none of the patterns matched. If you want to throw an exception in that case, you can unwrap the Option with getOrElse:
...
.dropWhile(_.isFailure)
.headOption
.getOrElse(throw new IllegalArgumentException(s"Unable to parse DateTime $s"))
The important thing is that as soon as any validation succeeds, it won't go further but will return the parsed date right away.
This is a possible solution that iterates through all the options
val dateFormats = Seq("dd/MM/yyyy", "dd.MM.yyyy")
val dates = Vector("01/01/2019", "01.01.2019", "01-01-2019")
dates.foreach(s => {
val d: Option[Try[DateTime]] = dateFormats
.map(format => Try(DateTimeFormat.forPattern(format).parseDateTime(s)))
.filter(_.isSuccess)
.headOption
d match {
case Some(d) => println(d.toString)
case _ => throw new IllegalArgumentException("foo")
}
})
This is an alternative solution that returns the first successful conversion, if any
val dateFormats = Seq("dd/MM/yyyy", "dd.MM.yyyy")
val dates = Vector("01/01/2019", "01.01.2019", "01-01-2019")
dates.foreach(s => {
dateFormats.find(format => Try(DateTimeFormat.forPattern(format).parseDateTime(s)).isSuccess) match {
case Some(format) => println(DateTimeFormat.forPattern(format).parseDateTime(s))
case _ => throw new IllegalArgumentException("foo")
}
})
I made it sweet like this now! I like this a lot better! Use this if you want to collect all the successes and all the failures. Note that this might be a bit inefficient when you need to break out of the loop as soon as you find one success!
implicit def dateTimeCSVConverter: CsvFieldReader[DateTime] = (s: String) => Try {
val (successes, failures) = dateFormats.map {
case format => Try(DateTimeFormat.forPattern(format).parseDateTime(s))
}.partition(_.isSuccess)
if (successes.nonEmpty)
successes.head.get
else
failures.head.get
}
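If you do want to stop at the first format that parses while keeping the question's original exception message, the two ideas above combine into something like this (a sketch, assuming the same dateFormats and CsvFieldReader as above, plus scala.util.{Try, Success} in scope):
implicit def dateTimeCSVConverter: CsvFieldReader[DateTime] = (s: String) => Try {
  dateFormats.toStream                                                  // lazy, so later formats are not tried
    .map(format => Try(DateTimeFormat.forPattern(format).parseDateTime(s)))
    .collectFirst { case Success(d) => d }                              // first successful parse wins
    .getOrElse(throw new IllegalArgumentException(s"Unable to parse DateTime $s"))
}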

Flink 1.5 DataStream: Duplicate Filtering

I have an unbound DataStream that represents friendship in a social network. These friendships can be bidirectional and therefore appear twice in the stream.
The structure of the data is: timestamp|user1|user2 .
For example:
2010-03-09T02:51:11.571+0000|143|1219
2010-03-09T06:08:51.942+0000|1242|4624
2010-03-09T08:24:03.773+0000|2191|4986
2010-03-09T09:37:09.788+0000|459|4644
I want to reduce bidirectional friendships so that each friendship is counted only once. In practice, I would like to filter out the duplicates.
I have found a solution here
My FilterFunction looks like:
def filter(ds: DataStream[String]): DataStream[(String, String, String)] = {
  val res = ds
    .mapWith(line => {
      val str = line.split("\\|")
      if (str(1).toLong > str(2).toLong)
        (str(0), str(1), str(2))
      else
        (str(0), str(2), str(1))
    })
    .keyBy(tuple => (tuple._2, tuple._3))
    .flatMap(new FilterFunction())
  res
}
And I have implemented my RichFlatMapFunction as:
class FilterFunction extends RichFlatMapFunction[(String, String, String), (String, String, String)] {
private var seen: ValueState[Boolean] = _
override def flatMap(value: (String, String, String), out:
Collector[(String, String, String)]): Unit = {
if (!seen.value() || seen.value() == null) {
seen.update(true)
out.collect(value)
}
}
override def open(parameters: Configuration): Unit = {
seen = getRuntimeContext.getState(
new ValueStateDescriptor("seen", classOf[Boolean])
)
}
}
However, when I print, I'm getting random results. I have tried to perform a count over a time window of 1 year:
val da1 = filter(data)
.mapWith(tuple => Parser.parseUserConnection(tuple).get)
.assignAscendingTimestamps(connection => connection.timestamp.getMillis)
.mapWith(connection => (connection, 1))
.timeWindowAll(Time.days(365))
.sum(1)
.mapWith(tuple => tuple._2)
.print()
The first time, my console prints:
1> 33735
Then:
1> 10658
2> 33735
and for subsequent executions, different results (just 33735 seems to be stable). I cannot understand this strange behavior.
It's hard to follow what you are finding surprising. But a general technique for debugging such an app is to print the results of different stages of the pipeline to see at what point the results become strange. Or debug the job in an IDE and step through it.
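To make that concrete, a sketch of tapping the intermediate stages of the job above with extra print sinks (to be removed again once the discrepancy is located):
// 1. the canonicalised, de-duplicated tuples
val deduped = filter(data)
deduped.print()

// 2. the yearly counts before the final projection
val counts = deduped
  .mapWith(tuple => Parser.parseUserConnection(tuple).get)
  .assignAscendingTimestamps(connection => connection.timestamp.getMillis)
  .mapWith(connection => (connection, 1))
  .timeWindowAll(Time.days(365))
  .sum(1)
counts.print()

// 3. the projected result that was printed originally
counts.mapWith(tuple => tuple._2).print()
Also note that the 1> / 2> prefixes in the output are the index of the parallel print-sink subtask that emitted the line, not part of the result itself, so lines with different prefixes are not necessarily duplicates or errors.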

Pattern matching and RDDs

I have a very simple (n00b) question but I'm somehow stuck. I'm trying to read a set of files in Spark with wholeTextFiles and want to return an RDD[LogEntry], where LogEntry is just a case class. I want to end up with an RDD of valid entries and I need to use a regular expression to extract the parameters for my case class. When an entry is not valid I do not want the extractor logic to fail but simply write an entry in a log. For that I use LazyLogging.
object LogProcessors extends LazyLogging {
def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[Option[CleaningLogEntry]] = {
val pattern = "<some pattern>".r
val logs = sc.wholeTextFiles(path, numPartitions)
val entries = logs.map(fileContent => {
val file = fileContent._1
val content = fileContent._2
content.split("\\r?\\n").map(line => line match {
case pattern(dt, ev, seq) => Some(LogEntry(<...>))
case _ => logger.error(s"Cannot parse $file: $line"); None
})
})
That gives me an RDD[Array[Option[LogEntry]]]. Is there a neat way to end up with an RDD of the LogEntrys? I'm somehow missing it.
I was thinking about using Try instead, but I'm not sure if that's any better.
Thoughts greatly appreciated.
To get rid of the Array - simply replace the map command with flatMap - flatMap will treat a result of type Traversable[T] for each record as separate records of type T.
To get rid of the Option - collect only the successful ones: entries.collect { case Some(entry) => entry }.
Note that this collect(p: PartialFunction) overload (which performs something equivalent to a map and a filter combined) is very different from collect() (which sends all data to the driver).
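A tiny illustration of the difference (hypothetical RDD, just for the example):
val rdd = sc.parallelize(Seq(Some(1), None, Some(3)))
rdd.collect { case Some(x) => x }   // RDD[Int]: stays distributed and drops the Nones
rdd.collect()                       // Array[Option[Int]]: ships every record to the driver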
Altogether, this would be something like:
def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[CleaningLogEntry] = {
val pattern = "<some pattern>".r
val logs = sc.wholeTextFiles(path, numPartitions)
val entries = logs.flatMap(fileContent => {
val file = fileContent._1
val content = fileContent._2
content.split("\\r?\\n").map(line => line match {
case pattern(dt, ev, seq) => Some(LogEntry(<...>))
case _ => logger.error(s"Cannot parse $file: $line"); None
})
})
entries.collect { case Some(entry) => entry }
}
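An alternative sketch under the same assumptions: flatten the Options at the line level as well, so the trailing collect step disappears (the LogEntry arguments are left as the same placeholder as above):
def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[CleaningLogEntry] = {
  val pattern = "<some pattern>".r
  sc.wholeTextFiles(path, numPartitions).flatMap { case (file, content) =>
    content.split("\\r?\\n").flatMap {
      case pattern(dt, ev, seq) => Some(LogEntry(<...>))
      case line => logger.error(s"Cannot parse $file: $line"); None
    }
  }
}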