Summarizing Scala collection stats without boilerplate - agnostic to sequential vs parallel - scala

In Scala 3 I keep finding myself writing boilerplate shaped like this:
import scala.collection.parallel.mutable.ParArray
import scala.collection.parallel.CollectionConverters._ // provides .par on standard collections (2.13+/Scala 3)

def summarizeEmployees(employees: ParArray[Employee]): Unit = {
  // print various Employee collection statistics
}
def summarizeEmployees(employees: Array[Employee]): Unit = {
  summarizeEmployees(employees.par)
}
def summarizePets(pets: ParArray[Pet]): Unit = {
  // print various Pet collection statistics
}
def summarizePets(pets: Array[Pet]): Unit = {
  summarizePets(pets.par)
}
def summarizeCars(cars: ParArray[Car]): Unit = {
  // print various Car collection statistics
}
def summarizeCars(cars: Array[Car]): Unit = {
  summarizeCars(cars.par)
}
How can one cut down on such repetition and boilerplate and write a single method which is agnostic to whether a ParArray or an Array is passed as the argument? Something like:
def summarizeEmployees(employees: ParOrSeqCollectionSuperclass[Employee]): Unit = {
  // print various Employee collection statistics - I don't care about side effects of par vs seq
}
def summarizePets(pets: ParOrSeqCollectionSuperclass[Pet]): Unit = {
  // print various Pet collection statistics - I don't care about side effects of par vs seq
}
def summarizeCars(cars: ParOrSeqCollectionSuperclass[Car]): Unit = {
  // print various Car collection statistics - I don't care about side effects of par vs seq
}
Any help would be gratefully received, especially if you can show sample code. Thanks!
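One possible direction (a sketch, not an answer from the original thread): the old GenSeq/GenIterable layer that used to sit above both sequential and parallel collections no longer exists in recent Scala, so a small type class can abstract over the two collection kinds instead. Employee, its fields, and the printed statistics below are made up for illustration:

import scala.collection.parallel.mutable.ParArray

case class Employee(name: String, salary: Double)

// Type class describing just the operations the summaries need
trait Summarizable[C[_]] {
  def length[A](c: C[A]): Int
  def sumBy[A](c: C[A])(f: A => Double): Double
}

object Summarizable {
  implicit val forArray: Summarizable[Array] = new Summarizable[Array] {
    def length[A](c: Array[A]): Int = c.length
    def sumBy[A](c: Array[A])(f: A => Double): Double = c.map(f).sum
  }
  implicit val forParArray: Summarizable[ParArray] = new Summarizable[ParArray] {
    def length[A](c: ParArray[A]): Int = c.length
    def sumBy[A](c: ParArray[A])(f: A => Double): Double = c.map(f).sum
  }
}

// One summarizer per element type, written only once per collection kind
def summarizeEmployees[C[_]: Summarizable](employees: C[Employee]): Unit = {
  val S = implicitly[Summarizable[C]]
  println(s"count = ${S.length(employees)}, total salary = ${S.sumBy(employees)(_.salary)}")
}

With this, both summarizeEmployees(employees) for an Array[Employee] and summarizeEmployees(employees.par) for a ParArray[Employee] resolve to the same method.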

Related

Collector in Flink. What does it do?

I'm learning Flink and one of the things that is confusing for me is the use of an object called Collector, for example in the flatMap function. What is this Collector and its collect method? And why, for example, does a map function not need to pass results by explicitly using it?
Here are some examples of using Collector in the flatMap function:
https://www.programcreek.com/scala/org.apache.flink.util.Collector
Also, if I search for where the Collector would sit in the Flink architecture, I can't find any diagram with that mapping.
Flink passes a Collector to any user function that has the possibility of emitting an arbitrary number of stream elements. A map function doesn't use a Collector because it performs a one-to-one transformation, with the return value of the map function being the output. A flatMap, by contrast, can emit zero, one, or many stream elements for each incoming event, which makes the Collector a convenient way to accommodate this.
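For a concrete contrast, here is a minimal sketch (not from the original answer); it assumes an existing DataStream[String] named text:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// map: exactly one output per input, so the return value is the output
val upper: DataStream[String] = text.map(_.toUpperCase)

// flatMap: zero, one, or many outputs per input, emitted through the Collector
val words: DataStream[String] = text.flatMap { (line: String, out: Collector[String]) =>
  line.split(" ").foreach(out.collect)
}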
As you know, if you want one input element to produce N outputs in the data stream, you can use the Collector to emit the output data in flatMap. Map, on the contrary, usually produces one-to-one data, so it doesn't need to use it. In a sense, Collector has a wide range of internal applications. You can take a look at org.apache.flink.streaming.api.operators.Output (which extends Collector) and org.apache.flink.runtime.operators.shipping.OutputCollector; they are usually used to collect records and emit them to a writer, and so on. collect is called whenever data needs to be written.
Examples (not necessarily accurate):
There are three definitions of flatMap in the Flink Scala API source. Let's take a look at the first one.
/**
 * Creates a new DataStream by applying the given function to every element and flattening
 * the results.
 */
def flatMap[R: TypeInformation](fun: (T, Collector[R]) => Unit): DataStream[R] = {
  if (fun == null) {
    throw new NullPointerException("FlatMap function must not be null.")
  }
  val cleanFun = clean(fun)
  val flatMapper = new FlatMapFunction[T, R] {
    def flatMap(in: T, out: Collector[R]) { cleanFun(in, out) }
  }
  flatMap(flatMapper)
}
Examples of using this method are as follows:
text.flatMap((input: String, out: Collector[String]) => {
  input.split(" ").foreach(out.collect)
})
In this variant, we need to emit the data manually through the Collector.
Then let's take a look at the second definition in the source code:
/**
 * Creates a new DataStream by applying the given function to every element and flattening
 * the results.
 */
def flatMap[R: TypeInformation](fun: T => TraversableOnce[R]): DataStream[R] = {
  if (fun == null) {
    throw new NullPointerException("FlatMap function must not be null.")
  }
  val cleanFun = clean(fun)
  val flatMapper = new FlatMapFunction[T, R] {
    def flatMap(in: T, out: Collector[R]) { cleanFun(in) foreach out.collect }
  }
  flatMap(flatMapper)
}
Instead of using a Collector to collect the output, here we return a collection directly and Flink flattens it for us. Using TraversableOnce also means we must always return a collection, even if it is an empty one, otherwise we cannot match the signature of the function.
text.flatMap(input => {
  if (input.size > 15) {
    input.split(" ")
  } else {
    Seq.empty
  }
})
You can find many similar places: wherever data records are being emitted, you will almost always see a Collector.

Reduce testing overhead when DAO contains action

For accessing objects, a Slick DAO was created which contains functions that return actions and objects of a stored type. Example:
def findByKeysAction(a: String, b: String, c: String) = {
  Users.filter(x => x.a === a && x.b === b && x.c === c).result
}
def findByKeys(a: String, b: String, c: String): Future[Option[foo]] = {
  db.run(findByKeysAction(consumerId, contextId, userId)).map(_.headOption)
}
Notice how the non-action-based function wraps the other in db.run().
What is a solid approach to testing both functions while minimizing code redundancy?
A naive method would of course be to test them both with their individual test setups (the above is a simple example; there could be a lot of test setup needed to satisfy DB constraints).
Notice how the non-action-based function wraps the other in db.run().
Not really. Your findByKeys method does not actually call findByKeysAction as written (the argument names don't match), so I'm adjusting for that minor detail in this answer.
def findByUserIdAction(userId: String) = {
  Users.filter(_.userId === userId).result
}
The above code returns a DBIOAction. As the documentation states:
Just like a query, an I/O action is only a description of an operation. Creating or composing actions does not execute anything on a database.
As far as a user of Slick is concerned, there is no meaningful test for a DBIOAction, because by itself it does nothing; it's only a recipe for what one wants to do. To execute the above DBIOAction, you have to materialize it, which is what the following does:
def findByUserId(userId: String): Future[Option[User]] = {
  db.run(findByUserIdAction(userId)).map(_.headOption)
}
The materialized result is what you want to test. One way to do so is to use ScalaTest's ScalaFutures trait. For example, in a spec that mixes in that trait, you could have something like:
"Users" should "return a single user by id" in {
findByUserId("id3").futureValue shouldBe Option(User("id3", ...))
}
Take a look at this Slick 3.2.0 test project for more examples: specifically, TestSpec and QueryCoffeesTest.
In summary, don't bother trying to test a DBIOAction in isolation; just test its materialized result.
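For completeness, here is roughly how such a spec could be wired together with recent ScalaTest. The DAO call is stubbed here so the snippet stands alone; in a real spec it would be the db.run-backed findByUserId from above:
import scala.concurrent.Future
import org.scalatest.concurrent.ScalaFutures
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

case class User(userId: String, name: String) // assumed row shape

class UsersSpec extends AnyFlatSpec with Matchers with ScalaFutures {

  // stand-in for the materialized DAO method under test
  def findByUserId(userId: String): Future[Option[User]] =
    Future.successful(Some(User(userId, "Ada")))

  "Users" should "return a single user by id" in {
    findByUserId("id3").futureValue shouldBe Some(User("id3", "Ada"))
  }
}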

Elegant traversal of a source in Scala

As a data scientist I frequently use the following pattern for data extraction (i.e. DB, file reading and others):
val source = open(sourceName)
var item = source.getNextItem()
while (item != null) {
  processItem(item)
  item = source.getNextItem()
}
source.close()
My (current) dream is to wrap this verbosity into a Scala object "SourceTrav" that would allow this elegance:
SourceTrav(sourceName).foreach(item => processItem(item))
with the same functionality as above, but without running into StackOverflowError, as might happen with the examples in Semantics of Scala Traversable, Iterable, Sequence, Stream and View?
Any idea?
If Scala's standard library (for example scala.io.Source) doesn't suit your needs, you can use different Iterator or Stream companion object methods to wrap manual iterator traversal.
In this case, for example, you can do the following, when you already have an open source:
Iterator.continually(source.getNextItem()).takeWhile(_ != null).foreach(processItem)
If you also want to add automatic opening and closing of the source, don't forget to add try-finally or some other flavor of loan pattern:
case class SourceTrav(sourceName: String) {
  def foreach(processItem: Item => Unit): Unit = {
    val source = open(sourceName)
    try {
      Iterator.continually(source.getNextItem()).takeWhile(_ != null).foreach(processItem)
    } finally {
      source.close()
    }
  }
}
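With that in place, the one-liner from the question works as hoped (sourceName and processItem are whatever the surrounding code defines):
SourceTrav(sourceName).foreach(processItem)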

Rich function on Flink join, Scala API

I'm struggling with Flink and Scala.
I have a join transformation over a DataSet that pretty much works, but I want to turn it into a RichFunction, so that I can access a broadcasted set:
val newBoard: DataSet[Cell] = board.rightOuterJoin(neighbours)
  .where("coords").equalTo("cellCoords") {
    (cell, neighbours) => {
      // Do some rich function things, like
      // override the open method so I can get
      // the broadcasted set
    }
  }.withBroadcastSet(board, "aliveCells")
I have been looking all over the documentation, but I can't find any example of a RichJoinFunction being used in Scala. I only find examples of rich functions used in map or filter, but the syntax is different for the join transformation (function between parentheses vs. between braces).
You can use a RichJoinFunction with the Scala DataSet API as follows
val newBoard: DataSet[Cell] = board.rightOuterJoin(neighbours)
  .where("coords").equalTo("cellCoords")
  .apply(new YourJoinFunction())
  .withBroadcastSet(board, "aliveCells")

class YourJoinFunction extends RichJoinFunction[IN1, IN2, Cell] {
  override def join(first: IN1, second: IN2): Cell = {
    // Do some rich function things, like
    // override the open method so I can get
    // the broadcasted set
  }
}
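To make the broadcast-set part concrete, here is a rough sketch of what the rich function body could look like (Cell is a stand-in for the question's type, and the join logic is a placeholder):
import org.apache.flink.api.common.functions.RichJoinFunction
import org.apache.flink.configuration.Configuration
import scala.collection.JavaConverters._

case class Cell(coords: (Int, Int)) // assumed shape, only for illustration

class YourJoinFunction extends RichJoinFunction[Cell, Cell, Cell] {
  private var aliveCells: Set[Cell] = _

  override def open(parameters: Configuration): Unit = {
    // the set registered via withBroadcastSet(board, "aliveCells")
    aliveCells = getRuntimeContext.getBroadcastVariable[Cell]("aliveCells").asScala.toSet
  }

  override def join(first: Cell, second: Cell): Cell = {
    // placeholder: a real implementation would combine first and second,
    // consulting aliveCells as needed
    second
  }
}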

Scala DSL without extra syntax

I asked myself this question a couple of times and came up with a solution that feels very dirty. Maybe you can give me some advice, since I think this is a basic problem for every DSL written in Scala.
I want to have a hierarchical structure of nested objects without adding any extra syntax. Specs is a good example for this:
class MySpec extends Specification {
  "system" should {
    "example0" in { ... }
    "example1" in { ... }
    "example2" in { ... }
  }
  "system" can {
    "example0" in { ... }
  }
}
For instance I do not have to write "example0" in { ... } :: "example1" in { ... } :: "example2" in { ... } :: Nil.
This is exactly the same behaviour I would like to have. I think this is achieved by an implicit definition in the Specification class in Specs, something like this (please do not be offended if you are the Specs author and I misunderstood something :))
implicit def sus2spec(sus: Sus): Specification = {
  suslist += sus
  this
}
My main problem arises now when I want to nest such objects. Imagine I have this grammar:
root: statement*;
statement:
  IDENT '{' statement* '}'
  | declaration*
  ;
declaration: IDENT ':=' INT+;
I would like to translate this into a DSL that looks like this:
class MyRoot extends Root {
  "statement0" is {
    "nested_statement0" is {
      "nested_nested_statement0" is {
        "declaration0" := 0
      }
      "declaration1" := 1
      "declaration2" := 2
    }
    "declaration3" := 3
  }
  "statement1" is {
    "declaration4" := 4
  }
}
The problem that arises here, for me, is that the implicit solution does not work. The implicit definition would be executed in the scope of the root object, which means I would add all objects to the root and the hierarchy would be lost.
Then I thought I could use something like a Stack[Statement]. I could push an object to it for every call to is, but that feels very dirty.
To put the question in one sentence: how do I create a recursive DSL that respects its hierarchy without adding any extra syntax, and is there a solution that uses only immutable objects?
I've seen a nice trick in XScalaWT for achieving this kind of nesting in a DSL. I didn't check whether Specs uses the same one or something different.
I think the following example shows the main idea. The heart of it is the setup function: it accepts some functions (more precisely closures, if I'm not mistaken) that need only a Nestable and calls them on the current one.
printName happens to be such a method, just like addChild with its first parameter list filled in.
For me understanding this was the revealing part. After that you can relatively simply add many other fancy features (like implicit magic, dsl methods based on structural typing, etc.).
Of course you can have any "context-like" class instead of Nestable, especially if you go for purely immutable structures. If parents need references to children, you can collect the children during setups() and create the parent only at the end.
In this case you would probably have something like
private def setupChildren[A, B](a: A, setups: (A => B)*): Seq[B] = {
  for (setup <- setups) yield setup(a)
}
You would pass in the "context", and create the parent using the returned children.
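Concretely, that immutable variant could look something like the following sketch, reusing setupChildren from above (Node is invented for illustration, and the "context" handed to the children is simply the parent's name):
case class Node(name: String, children: Seq[Node])

object Node {
  def apply(name: String, setups: (String => Node)*): Node =
    Node(name, setupChildren(name, setups: _*)) // children are built from the returned values
}

// usage: each child is a function of the parent's context (here just its name)
val tree = Node("root",
  parent => Node(s"$parent/first"),
  parent => Node(s"$parent/second"))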
BTW I think this setup thing was needed in XScalaWT because it's for SWT where child objects need a reference to their parent control. If you don't need it (or anything from the current "context") then everything becomes a bit easier.
Using companion objects with proper apply methods should mostly solve the problem. Most likely they should also accept other functions (having the same number of params, or a tuple if you need more).
One disadvantage of this trick is that you have to have a separate dsl method (even if a simple one) for each method that you want to call on your classes. Alternatively you can use lines like
x => x.printName
which will do the job, but it's not as nice (especially if you have to do it often).
object NestedDsl {
  object Nestable {
    def apply(name: String, setups: (Nestable => Unit)*): Nestable = {
      val n = new Nestable(None, name)
      setup(n, setups: _*)
      n
    }
  }

  class Nestable(parent: Option[Nestable], name: String) {
    def printName() { println(name) }
  }

  // DSL part: partially applying addChild yields a function of the parent,
  // which fits the setups parameter of the enclosing call
  def addChild(name: String, setups: (Nestable => Unit)*)(parent: Nestable) = {
    val n = new Nestable(Some(parent), name)
    setup(n, setups: _*)
    n
  }

  def printName(n: Nestable) = n.printName

  // applies each setup closure to the freshly created object
  private def setup[T](t: T, setups: (T => Unit)*): T = {
    setups.foreach(setup => setup(t))
    t
  }

  def main(args: Array[String]) {
    Nestable("root",
      addChild("first",
        addChild("second",
          printName
        )
      )
    )
  }
}
I have had a look at Specs and they do not do it any differently. Basically all you need is a mutable stack. You can have a look at the result here: cssx-dsl
The code is quite simple. Basically I have a mutable builder and convert it to an immutable representation afterwards.
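The rough shape of that approach (with invented names; the real cssx-dsl code differs) is a mutable builder per node that is frozen into an immutable tree at the end:
import scala.collection.mutable

case class Statement(name: String, children: List[Statement]) // immutable result

class StatementBuilder(name: String) {
  private val children = mutable.Buffer[StatementBuilder]()
  def add(child: StatementBuilder): StatementBuilder = { children += child; this }
  def result(): Statement = Statement(name, children.map(_.result()).toList)
}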