Flink 1.5 DataStream: Duplicate Filtering - scala

I have an unbounded DataStream that represents friendships in a social network. These friendships can be bidirectional and therefore appear twice in the stream.
The structure of the data is: timestamp|user1|user2.
For example:
2010-03-09T02:51:11.571+0000|143|1219
2010-03-09T06:08:51.942+0000|1242|4624
2010-03-09T08:24:03.773+0000|2191|4986
2010-03-09T09:37:09.788+0000|459|4644
I want to deduplicate bidirectional friendships so that each friendship is counted only once. In practice, I would like to filter out duplicates.
I have found a solution here
My FilterFunction looks like:
def filter(ds: DataStream[String]): DataStream[(String, String, String)] = {
  val res = ds
    .mapWith(line => {
      val str = line.split("\\|")
      if (str(1).toLong > str(2).toLong)
        (str(0), str(1), str(2))
      else
        (str(0), str(2), str(1))
    })
    .keyBy(tuple => (tuple._2, tuple._3))
    .flatMap(new FilterFunction())
  res
}
And I have implemented my RichFlatMapFunction as:
class FilterFunction extends RichFlatMapFunction[(String, String, String), (String, String, String)] {
  // Keyed state: non-null once we have emitted this (user1, user2) pair.
  // Using java.lang.Boolean so that "no value yet" is represented by null.
  private var seen: ValueState[java.lang.Boolean] = _

  override def open(parameters: Configuration): Unit = {
    seen = getRuntimeContext.getState(
      new ValueStateDescriptor("seen", classOf[java.lang.Boolean])
    )
  }

  override def flatMap(value: (String, String, String),
                       out: Collector[(String, String, String)]): Unit = {
    // The state is null until the first element for the current key arrives,
    // so the null check must come before reading the value.
    if (seen.value() == null) {
      seen.update(true)
      out.collect(value)
    }
  }
}
However, when I print, I'm getting inconsistent results. I have tried to perform a count in a time window of one year:
val da1 = filter(data)
  .mapWith(tuple => Parser.parseUserConnection(tuple).get)
  .assignAscendingTimestamps(connection => connection.timestamp.getMillis)
  .mapWith(connection => (connection, 1))
  .timeWindowAll(Time.days(365))
  .sum(1)
  .mapWith(tuple => tuple._2)
  .print()
My console print the first time:
1> 33735
Then:
1> 10658
2> 33735
and for subsequent executions, different results (just 33735 seems to be stable). I cannot understand this strange behavior.

It's hard to follow what you are finding surprising. But a general technique for debugging such an app is to print the results of different stages of the pipeline to see at what point the results become strange. Or debug the job in an IDE and step through it.
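For example, a minimal sketch of that approach using the stages from the question (nothing new here beyond extra print() calls; each stage is bound to a val so it can be both printed and consumed downstream):
// Tap each stage so you can compare its output against your
// expectations before wiring up the next stage.
val deduped = filter(data)
deduped.print() // 1) inspect the deduplicated tuples

val timestamped = deduped
  .mapWith(tuple => Parser.parseUserConnection(tuple).get)
  .assignAscendingTimestamps(connection => connection.timestamp.getMillis)
timestamped.print() // 2) inspect parsed connections before windowing
One detail worth knowing when reading the console output: the 1> and 2> prefixes are just the indices of the parallel subtasks executing the print sink, so the interleaving of lines can legitimately differ from run to run.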

Related

How to convert multiple strings in a list to be keys in a map

I am trying to write a function that would return a map in which every word is a key and the values are pages at which the word shows up. Currently, I am stuck at the point where I have data of the following type: List(List(words),page).
Is there any sensible way to reformat this data? If so, please explain, as I have no idea how to even begin.
object G {
  def main(args: Array[String]): Unit = {
    stwórzIndeks()
  }

  def stwórzIndeks(): Unit = {
    val linie = io.Source
      .fromResource("tekst.txt")
      .getLines
      .toList
    val zippedLinie: List[(String, Int)] = linie.zipWithIndex
    val splitt = zippedLinie.foldLeft(List.empty[(List[String], Int)])((acc, curr) => {
      curr match {
        case (arr, int) =>
          val toAdd = (arr.split("\\s+").toList, zippedLinie.length - int)
          toAdd +: acc
      }
    })
  }
}
You can replace that foldLeft with a flatMap and an inner map to get a big List of (word, page).
val wordsAndPage = zippedLinie.flatMap {
  case (line, idx) =>
    line.split("\\s+").toList.map(word => word -> (idx + 1))
}
After that you can check for one of the grouping methods in the scaladoc.
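For instance, a minimal sketch of that grouping step, assuming the wordsAndPage value from the snippet above:
// Group the (word, page) pairs by word, then keep only the page numbers.
// .distinct drops repeats when a word occurs twice on the same page.
val index: Map[String, List[Int]] =
  wordsAndPage
    .groupBy { case (word, _) => word }
    .map { case (word, pairs) => word -> pairs.map(_._2).distinct }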

Pass map to slick filter and filter by the values in the map

I'm new to Scala, and I'm trying to pass a map, i.e. Map[String, Any]("from_type" -> "Admin", "from_id" -> 1), to my service for dynamic filtering. I'm trying to avoid writing my code like this: filter(_.fromType === val1 && _.fromId === val2)
When trying this example Slick dynamically filter by a list of columns and values
I get a type mismatch: Required: Function1[K, NotInfered T], Found: Rep[Boolean].
Service code:
val query = TableQuery[UserTable]

def all(perPage: Int, page: Int, listFilters: Map[String, Any]): Future[ResultPagination[User]] = {
  val baseQuery = for {
    items <- query.filter( listFilters ).take(perPage).drop(page).result // <---- I want to filter here
    total <- query.length.result
  } yield ResultPagination[User](items, total)
  db.run(baseQuery)
}
Table code:
def fromId: Rep[Int] = column[Int]("from_id")
def fromType: Rep[String] = column[String]("from_type")

def columnToRep(column: String): Rep[_] = {
  column match {
    case "from_type" => this.fromType
    case "from_id"   => this.fromId
  }
}
Well, I would not recommend using the Map[String, Any] construction: by using Any you are losing type safety. For instance, you could pass Map("fromId" -> "1") to the function by mistake, and the compiler won't help you identify the issue.
I guess what you want is to pass some kind of structure representing a variative filter, and Query.filterOpt can help you in this case. You can take a look at usage examples at: https://scala-slick.org/doc/3.3.2/queries.html#sorting-and-filtering
Please, see code example below:
// Your domain filter structure. None values will be ignored,
// so `UserFilter()` will match everything.
case class UserFilter(fromId: Option[Int] = None, fromString: Option[String] = None)

def all(perPage: Int, page: Int, filter: UserFilter): Future[ResultPagination[User]] = {
  val baseQuery = for {
    items <- {
      query
        .filterOpt(filter.fromId)(_.fromId === _)
        .filterOpt(filter.fromString)(_.fromType === _)
        .take(perPage)
        .drop(page)
        .result
    }
    total <- query.length.result
  } yield ResultPagination[User](items, total)
  db.run(baseQuery)
}
And this will be type safe.
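For example, a hypothetical call site, assuming the definitions above:
// Only the fields that are Some(...) become filters; None values are ignored.
all(perPage = 20, page = 0, UserFilter(fromId = Some(1)))           // filter by from_id
all(perPage = 20, page = 0, UserFilter(fromString = Some("Admin"))) // filter by from_type
all(perPage = 20, page = 0, UserFilter())                           // no filtering at all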
Hope this helps!

Unpacking tuple directly into class in scala

Scala gives the ability to unpack a tuple into multiple local variables when performing various operations, for example if I have some data
val infos = Array(("Matt", "Awesome"), ("Matt's Brother", "Just OK"))
then instead of doing something ugly like
infos.map{ person_info => person_info._1 + " is " + person_info._2 }
I can choose the much more elegant
infos.map{ case (person, status) => person + " is " + status }
One thing I've often wondered about is how to directly unpack the tuple into, say, the arguments to be used in a class constructor. I'm imagining something like this:
case class PersonInfo(person: String, status: String)
infos.map{ case (p: PersonInfo) => p.person + " is " + p.status }
or even better if PersonInfo has methods:
infos.map{ case (p: PersonInfo) => p.verboseStatus() }
But of course this doesn't work. Apologies if this has already been asked -- I haven't been able to find a direct answer -- is there a way to do this?
I believe you can get to the methods, at least in Scala 2.11.x. Also, if you haven't heard of it, you should check out The Neophyte's Guide to Scala, Part 1: Extractors.
The whole 16 part series is fantastic, but part 1 deals with case classes, pattern matching and extractors, which is what I think you are after.
Also, I get that java.lang.String complaint in IntelliJ as well; it defaults to that for reasons that are not entirely clear to me. I was able to work around it by explicitly setting the type in the typical "postfix style", i.e. _: String. There must be a cleaner way to work around that, though.
object Demo {
  case class Person(name: String, status: String) {
    def verboseStatus() = s"The status of $name is $status"
  }

  val peeps = Array(("Matt", "Alive"), ("Soraya", "Dead"))

  peeps.map {
    case p @ (_: String, _: String) => Person.tupled(p).verboseStatus()
  }
}
UPDATE:
So after seeing a few of the other answers, I was curious whether there were any performance differences between them. I set up what I think might be a reasonable test, using an Array of 1,000,000 random string tuples, with each implementation run 100 times:
import scala.util.Random

object Demo extends App {
  // Utility code
  def randomTuple(): (String, String) = {
    val random = new Random
    (random.nextString(5), random.nextString(5))
  }

  def timer[R](code: => R)(implicit runs: Int): Unit = {
    var total = 0L
    (1 to runs).foreach { i =>
      val t0 = System.currentTimeMillis()
      code
      val t1 = System.currentTimeMillis()
      total += (t1 - t0)
    }
    println(s"Time to perform code block ${total / runs}ms\n")
  }

  // Setup
  case class Person(name: String, status: String) {
    def verboseStatus() = s"$name is $status"
  }

  object PersonInfoU {
    def unapply(x: (String, String)) = Some(Person(x._1, x._2))
  }

  val infos = Array.fill[(String, String)](1000000)(randomTuple)

  // Timer
  implicit val runs: Int = 100

  println("Using two map operations")
  timer {
    infos.map(Person.tupled).map(_.verboseStatus)
  }

  println("Pattern matching and calling tupled")
  timer {
    infos.map {
      case p @ (_: String, _: String) => Person.tupled(p).verboseStatus()
    }
  }

  println("Another pattern matching without tupled")
  timer {
    infos.map {
      case (name, status) => Person(name, status).verboseStatus()
    }
  }

  println("Using unapply in a companion object that takes a tuple parameter")
  timer {
    infos.map { case PersonInfoU(p) => p.name + " is " + p.status }
  }
}
/*Results
Using two map operations
Time to perform code block 208ms
Pattern matching and calling tupled
Time to perform code block 130ms
Another pattern matching without tupled
Time to perform code block 130ms
WINNER
Using unapply in a companion object that takes a tuple parameter
Time to perform code block 69ms
*/
Assuming my test is sound, it seems the unapply in a companion-ish object was ~2x faster than the pattern matching, and the pattern matching another ~1.5x faster than the two maps. Each implementation probably has its use cases/limitations.
I'd appreciate it if anyone who sees anything glaringly dumb in my testing strategy would let me know about it (and sorry about that var). Thanks!
The extractor for a case class takes an instance of the case class and returns a tuple of its fields. You can write an extractor which does the opposite:
object PersonInfoU {
  def unapply(x: (String, String)) = Some(PersonInfo(x._1, x._2))
}
infos.map { case PersonInfoU(p) => p.person + " is " + p.status }
You can use tupled for the case class:
val infos = Array(("Matt", "Awesome"), ("Matt's Brother", "Just OK"))
infos.map(PersonInfo.tupled)
scala> infos: Array[(String, String)] = Array((Matt,Awesome), (Matt's Brother,Just OK))
scala> res1: Array[PersonInfo] = Array(PersonInfo(Matt,Awesome), PersonInfo(Matt's Brother,Just OK))
and then you can use PersonInfo however you need.
You mean like this (scala 2.11.8):
scala> :paste
// Entering paste mode (ctrl-D to finish)
case class PersonInfo(p: String)
Seq(PersonInfo("foo")) map {
  case p @ PersonInfo(info) => s"info=$info / ${p.p}"
}
// Exiting paste mode, now interpreting.
defined class PersonInfo
res4: Seq[String] = List(info=foo / foo)
Methods won't be possible by the way.
Several answers can be combined to produce a final, unified approach:
val infos = Array(("Matt", "Awesome"), ("Matt's Brother", "Just OK"))
object Person {
  case class Info(name: String, status: String) {
    def verboseStatus() = name + " is " + status
  }
  def unapply(x: (String, String)) = Some(Info(x._1, x._2))
}

infos.map { case Person(p) => p.verboseStatus }
Of course in this small case it's overkill, but for more complex use cases this is the basic skeleton.
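For reference, mapping over the sample data with this skeleton yields:
infos.map { case Person(p) => p.verboseStatus }
// Array("Matt is Awesome", "Matt's Brother is Just OK")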

Allocation of Function Literals in Scala

I have a class that represents sales orders:
class SalesOrder(val f01: String, val f02: Int, ..., val f50: Date)
The fXX fields are of various types. I am faced with the problem of creating an audit trail of my orders. Given two instances of the class, I have to determine which fields have changed. I have come up with the following:
class SalesOrder(val f01: String, val f02: Int, ..., val f50: Date) {
  def auditDifferences(that: SalesOrder): List[String] = {
    def diff[A](fieldName: String, getField: SalesOrder => A) =
      if (getField(this) != getField(that)) Some(fieldName) else None

    val diffList = diff("f01", _.f01) :: diff("f02", _.f02) :: ...
      :: diff("f50", _.f50) :: Nil
    diffList.flatten
  }
}
I was wondering what the compiler does with all the _.fXX functions: are they instantiated just once (statically) and shared by all instances of my class, or will they be instantiated every time I create an instance of my class?
My worry is that, since I will use a lot of SalesOrder instances, it may create a lot of garbage. Should I use a different approach?
One clean way of solving this problem would be to use the standard library's Ordering type class. For example:
class SalesOrder(val f01: String, val f02: Int, val f03: Char) {
  def diff(that: SalesOrder) = SalesOrder.fieldOrderings.collect {
    case (name, ord) if !ord.equiv(this, that) => name
  }
}

object SalesOrder {
  val fieldOrderings: List[(String, Ordering[SalesOrder])] = List(
    "f01" -> Ordering.by(_.f01),
    "f02" -> Ordering.by(_.f02),
    "f03" -> Ordering.by(_.f03)
  )
}
And then:
scala> val orderA = new SalesOrder("a", 1, 'a')
orderA: SalesOrder = SalesOrder@5827384f
scala> val orderB = new SalesOrder("b", 1, 'b')
orderB: SalesOrder = SalesOrder@3bf2e1c7
scala> orderA diff orderB
res0: List[String] = List(f01, f03)
You almost certainly don't need to worry about the perfomance of your original formulation, but this version is (arguably) nicer for unrelated reasons.
Yes, that creates 50 short-lived functions. I don't think you should be worried unless you have manifest evidence that this causes a performance problem in your case.
But I would define a method that transforms a SalesOrder into a Map[String, Any]; then you would just have:
trait SalesOrder {
  def fields: Map[String, Any]
}

def diff(a: SalesOrder, b: SalesOrder): Iterable[String] = {
  val af = a.fields
  val bf = b.fields
  af.collect { case (key, value) if bf(key) != value => key }
}
If the field names are indeed just incremental numbers, you could simplify
trait SalesOrder {
  def fields: Iterable[Any]
}

def diff(a: SalesOrder, b: SalesOrder): Iterable[String] =
  (a.fields zip b.fields).zipWithIndex.collect {
    case ((av, bv), idx) if av != bv => f"f${idx + 1}%02d"
  }
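To make the second variant concrete, here is a minimal sketch (the OrderImpl name and its two fields are hypothetical stand-ins for the real 50-field class):
// Hypothetical concrete order exposing its fields in declaration order.
class OrderImpl(val f01: String, val f02: Int) extends SalesOrder {
  def fields: Iterable[Any] = List(f01, f02)
}

val a = new OrderImpl("abc", 1)
val b = new OrderImpl("abd", 1)
diff(a, b) // List(f01): only the first field differs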

Scala: how to traverse stream/iterator collecting results into several different collections

I'm going through a log file that is too big to fit into memory, collecting two types of expressions. What is a better functional alternative to my iterative snippet below?
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
  val lines: Iterator[String] = io.Source.fromFile(file).getLines()
  val logins: mutable.Map[String, String] = new mutable.HashMap[String, String]()
  val errors: mutable.ListBuffer[(String, String)] = mutable.ListBuffer.empty

  for (line <- lines) {
    line match {
      case errorPat(date, ip) => errors.append((ip, date))
      case loginPat(date, user, ip, id) => logins.put(ip, id)
      case _ => ""
    }
  }
  errors.toList.map(line => (logins.getOrElse(line._1, "none") + " " + line._1, line._2))
}
Here is a possible solution:
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
  val lines = Source.fromFile(file).getLines
  val (err, log) = lines.collect {
    case errorPat(inf, ip) => (Some((ip, inf)), None)
    case loginPat(_, _, ip, id) => (None, Some((ip, id)))
  }.toList.unzip
  val ip2id = log.flatten.toMap
  err.collect { case Some((ip, inf)) => (ip2id.getOrElse(ip, "none") + " " + ip, inf) }
}
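For reference, a hypothetical call, with regex patterns borrowed from the sample data used in the iteratee answer further down:
val errorPat = "Error: (\\S+) (\\S+)".r               // groups: date, ip
val loginPat = "Login: (\\S+) (\\S+) (\\S+) (\\S+)".r // groups: date, user, ip, id

streamData(new File("test.log"), errorPat, loginPat)
// => List(("Joe 1.2.3.4", "2012/03"), ("none 1.2.3.5", "2012/03"), ...)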
Corrections:
1) removed unnecessary type declarations
2) tuple deconstruction instead of ugly ._1
3) left fold instead of mutable accumulators
4) used more convenient operator-like methods :+ and +
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
  val lines = io.Source.fromFile(file).getLines()
  val (logins, errors) =
    ((Map.empty[String, String], Seq.empty[(String, String)]) /: lines) {
      case ((loginsAcc, errorsAcc), next) =>
        next match {
          case errorPat(date, ip) => (loginsAcc, errorsAcc :+ (ip -> date))
          case loginPat(date, user, ip, id) => (loginsAcc + (ip -> id), errorsAcc)
          case _ => (loginsAcc, errorsAcc)
        }
    }
  // more concise equivalent of
  // errors.toList.map { case (ip, date) => (logins.getOrElse(ip, "none") + " " + ip) -> date }
  for ((ip, date) <- errors.toList)
    yield (logins.getOrElse(ip, "none") + " " + ip) -> date
}
I have a few suggestions:
Instead of a pair/tuple, it's often better to use your own class. It gives meaningful names to both the type and its fields, which makes the code much more readable.
Split the code into small parts. In particular, try to decouple pieces of code that don't need to be tied together. This makes your code easier to understand, more robust, less prone to errors and easier to test. In your case it'd be good to separate producing your input (lines of a log file) and consuming it to produce a result. For example, you'd be able to make automatic tests for your function without having to store sample data in a file.
As an example and exercise, I tried to make a solution based on Scalaz iteratees. It's a bit longer (it includes some auxiliary code for IteratorEnumerator) and probably overkill for the task, but perhaps someone will find it helpful.
import java.io._
import scala.util.matching.Regex
import scalaz._
import scalaz.IterV._

object MyApp extends App {
  // A type for the result. Having names keeps things
  // clearer and shorter.
  type LogResult = List[(String, String)]

  // Represents a state of our computation. Not only does it
  // give a name to the data, we can also put here
  // functions that modify the state. This nicely
  // separates what we're computing and how.
  sealed case class State(
    logins: Map[String, String],
    errors: Seq[(String, String)]
  ) {
    def this() = {
      this(Map.empty[String, String], Seq.empty[(String, String)])
    }

    def addError(date: String, ip: String): State =
      State(logins, errors :+ (ip -> date))

    def addLogin(ip: String, id: String): State =
      State(logins + (ip -> id), errors)

    // Produce the final result from accumulated data.
    def result: LogResult =
      for ((ip, date) <- errors.toList)
        yield (logins.getOrElse(ip, "none") + " " + ip) -> date
  }

  // An iteratee that consumes lines of our input. Based
  // on the given regular expressions, it produces an
  // iteratee that parses the input and uses State to
  // compute the result.
  def logIteratee(errorPat: Regex, loginPat: Regex): IterV[String, List[(String, String)]] = {
    // Consumes a single line.
    def consume(line: String, state: State): State =
      line match {
        case errorPat(date, ip) => state.addError(date, ip)
        case loginPat(date, user, ip, id) => state.addLogin(ip, id)
        case _ => state
      }

    // The core of the iteratee. Every time we consume a
    // line, we update our state. When done, compute the
    // final result.
    def step(state: State)(s: Input[String]): IterV[String, LogResult] =
      s(el = line => Cont(step(consume(line, state))),
        empty = Cont(step(state)),
        eof = Done(state.result, EOF[String]))

    // Return the iteratee waiting for its first input.
    Cont(step(new State()))
  }

  // Converts an iterator into an enumerator. This should
  // probably be moved to Scalaz.
  // Adapted from scalaz.ExampleIteratee
  implicit val IteratorEnumerator = new Enumerator[Iterator] {
    @annotation.tailrec def apply[E, A](e: Iterator[E], i: IterV[E, A]): IterV[E, A] = {
      val next: Option[(Iterator[E], IterV[E, A])] =
        if (e.hasNext) {
          val x = e.next()
          i.fold(done = (_, _) => None, cont = k => Some((e, k(El(x)))))
        } else
          None
      next match {
        case None => i
        case Some((es, is)) => apply(es, is)
      }
    }
  }

  // main ---------------------------------------------------
  {
    // Read a file as an iterator of lines:
    // val lines: Iterator[String] =
    //   io.Source.fromFile("test.log").getLines()

    // Create our testing iterator:
    val lines: Iterator[String] = Seq(
      "Error: 2012/03 1.2.3.4",
      "Login: 2012/03 user 1.2.3.4 Joe",
      "Error: 2012/03 1.2.3.5",
      "Error: 2012/04 1.2.3.4"
    ).iterator

    // Create an iteratee.
    val iter = logIteratee("Error: (\\S+) (\\S+)".r,
                           "Login: (\\S+) (\\S+) (\\S+) (\\S+)".r)

    // Run the iteratee against the input
    // (the enumerator is implicit).
    println(iter(lines).run)
  }
}
}