UDF in Spark runs very slowly - Scala

I have a UDF in Spark (running on EMR), written in Scala, that parses the device from a user agent using the uap-scala user-agent parser library. It works fine on small sets (5,000 rows) but runs very slowly on larger sets (2M rows).
I tried collecting the DataFrame to a list and looping over it on the driver, and that was also very slow, which makes me believe that the UDF runs on the driver and not on the workers.
How can I verify this? Does anyone have another theory?
If that is the case, why can this happen?
This is the UDF code:
def calcDevice(userAgent: String): String = {
  val userAgentVal = Option(userAgent).getOrElse("")
  Parser.get.parse(userAgentVal).device.family
}
val calcDeviceValUDF: UserDefinedFunction = udf(calcDevice _)
Usage:
.withColumn("agentDevice", udfDefinitions.calcDeviceValUDF($"userAgent"))
Thanks
Nir

The problem was instantiating the parser within the UDF itself, i.e. once per row. The solution is to create the object outside the UDF and use it at row level:
val userAgentAnalyzerUAParser = Parser.get

def calcDevice(userAgent: String): String = {
  val userAgentVal = Option(userAgent).getOrElse("")
  userAgentAnalyzerUAParser.parse(userAgentVal).device.family
}

val calcDeviceValUDF: UserDefinedFunction = udf(calcDevice _)
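If the parser cannot be serialized and shipped to the executors, a common variant is a @transient lazy val inside a serializable holder, so that each executor JVM builds the parser once on first use. A minimal sketch, assuming uap-scala's Parser under org.uaparser.scala:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import org.uaparser.scala.Parser

object UdfDefinitions extends Serializable {
  // @transient keeps the parser out of the serialized closure;
  // lazy re-creates it once per executor JVM on first use.
  @transient lazy val parser: Parser = Parser.get

  val calcDeviceValUDF: UserDefinedFunction =
    udf { (userAgent: String) => parser.parse(Option(userAgent).getOrElse("")).device.family }
}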

We ran into the same issue, where Spark jobs were hanging. One additional thing we did was use a broadcast variable. Note that even after all these changes this UDF is still quite slow, so your mileage may vary. One other caveat is acquiring the SparkSession: we run in Databricks, and if the SparkSession isn't available the job crashes, so if you need the job to continue you have to handle that failure case.
object UDFs extends Serializable {
  val uaParser = SparkSession.getActiveSession.map(_.sparkContext.broadcast(CachingParser.default(100000)))

  val parseUserAgent = udf { (userAgent: String) =>
    // We simply return an empty map if uaParser is None, because that would mean
    // there is no active Spark session to broadcast the parser.
    //
    // Also, wrapping the potentially null value in an Option and using flatMap
    // and map for type safety makes it slower.
    if (userAgent == null || uaParser.isEmpty) {
      Map[String, Map[String, String]]()
    } else {
      val parsed = uaParser.get.value.parse(userAgent)
      Map(
        "browser" -> Map(
          "family" -> parsed.userAgent.family,
          "major" -> parsed.userAgent.major.getOrElse(""),
          "minor" -> parsed.userAgent.minor.getOrElse(""),
          "patch" -> parsed.userAgent.patch.getOrElse("")
        ),
        "os" -> Map(
          "family" -> parsed.os.family,
          "major" -> parsed.os.major.getOrElse(""),
          "minor" -> parsed.os.minor.getOrElse(""),
          "patch" -> parsed.os.patch.getOrElse(""),
          "patch-minor" -> parsed.os.patchMinor.getOrElse("")
        ),
        "device" -> Map(
          "family" -> parsed.device.family,
          "brand" -> parsed.device.brand.getOrElse(""),
          "model" -> parsed.device.model.getOrElse("")
        )
      )
    }
  }
}
You might also want to play with the size of the CachingParser.
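For example (same uap-scala API as above; the argument is, as far as I know, the maximum number of cached entries):
val uaParserBigCache = CachingParser.default(500000) // more memory, fewer repeated parses of recurring user agents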

Since Parser.get.parse is missing from the question, it is only possible to judge the UDF part.
For performance you can remove the Option:
def calcDevice(userAgent: String): String = {
  val userAgentVal = if (userAgent == null) "" else userAgent
  Parser.get.parse(userAgentVal).device.family
}

Related

Can I return Map collection in Scala using for-yield syntax?

I'm fairly new to Scala, so hopefully you'll tolerate this question in case you find it noobish :)
I wrote a function that returns a Seq of elements using yield syntax:
def calculateSomeMetrics(names: Seq[String]): Seq[Long] = {
  for (name <- names) yield {
    // some auxiliary actions
    val metrics = somehowCalculateMetrics()
    metrics
  }
}
Now I need to modify it to return a Map to preserve the original names against each of the calculated values:
def calculateSomeMetrics(names: Seq[String]): Map[String, Long] = { ... }
I've attempted to use the same yield-syntax but to yield a tuple instead of a single element:
def calculateSomeMetrics(names: Seq[String]): Map[String, Long] = {
  for (name <- names) yield {
    // Everything is the same as before
    (name, metrics)
  }
}
However, the compiler infers the result as Seq[(String, Long)], as per the error message:
type mismatch;
 found   : Seq[(String, Long)]
 required: Map[String, Long]
So I'm wondering, what is the "canonical Scala way" to implement such a thing?
The efficient way to create a different collection type is scala.collection.breakOut. It works with Maps and for comprehensions too:
import scala.collection.breakOut
val x: Map[String, Int] = (for (i <- 1 to 10) yield i.toString -> i)(breakOut)
// x: Map[String,Int] = Map(8 -> 8, 4 -> 4, 9 -> 9, 5 -> 5, 10 -> 10, 6 -> 6, 1 -> 1, 2 -> 2, 7 -> 7, 3 -> 3)
In your case it should work too:
import scala.collection.breakOut
def calculateSomeMetrics(names: Seq[String]): Map[String, Long] = {
  (for (name <- names) yield {
    // Everything is the same as before
    (name, metrics)
  })(breakOut)
}
Comparison with the toMap solutions: toMap first creates an intermediate Seq of Tuple2s (which, incidentally, might itself already be a Map in certain cases) and then builds the Map from it, whereas breakOut skips the intermediate Seq and builds the Map directly.
Usually this is not a huge difference in memory or CPU usage (plus GC pressure), but sometimes these things matter.
Either:
def calculateSomeMetrics(names: Seq[String]): Map[String, Long] = {
  (for (name <- names) yield {
    // Everything is the same as before
    (name, metrics)
  }).toMap
}
Or:
names.map { name =>
  // doStuff
  (name, metrics)
}.toMap
Several links that other people pointed me at, or that I managed to find later on, assembled in a single answer for my future reference:
breakOut - suggested by Michał in his comment
toMap - in this thread
A great, profound explanation of how breakOut works - in this answer
Note, though, that breakOut is going away, as noted by Karl
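For reference: in Scala 2.13 breakOut was removed. The closest replacement builds the Map directly from a view; a minimal sketch of the 2.13 idiom:
val x: Map[String, Int] = (1 to 10).view.map(i => i.toString -> i).to(Map)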

Does this specific exercise lend itself well to a 'functional style' design pattern?

Say we have an array of one-dimensional JavaScript objects in a file Array.json, for which the key schema isn't known; that is, the keys aren't known until the file is read.
We then wish to output a CSV file whose header, i.e. first line, is the comma-delimited set of keys across all of the objects.
Each subsequent line of the file should contain the comma-separated values corresponding to each key from the file.
Array.json:
[
  {
    "abc": 123,
    "xy": "yz",
    "s12": 13
  },
  ...
  {
    "abc": 1,
    "s": 133
  }
]
A valid output:
abc,xy,s12,s
123,yz,13,
1,,,133
I'm teaching myself 'functional style' programming, but I'm thinking that this problem doesn't lend itself well to a functional solution.
I believe this problem requires some state to be kept for the output header, and that each subsequent line depends on that header.
I'm looking to solve the problem in a single pass. My goals are efficiency for a large data set, minimal traversals, and, if possible, parallelizability. If this isn't possible, can you give a proof or reasoning to explain why?
EDIT: Is there a way to solve the problem like this, functionally? Say you pass through the array once, in some particular order. From the start, the header set looks like abc,xy,s12 for the first object, with CSV entry 123,yz,13. Then on the next object we add an additional key to the header set, so abc,xy,s12,s would be the header and the CSV entry would be 1,,,133. In the end we wouldn't need to pass through the data set a second time; we could just append extra commas to the result set. This is one way we could approach a single pass.
Are there functional tools (functions) designed to solve problems like this, and what should I be considering? (By functional tools I mean monads, flatMap, filters, etc.) Alternatively, should I be considering things like Futures?
Currently I've been trying to approach this using Java 8, but I'm open to solutions in Scala, etc. Ideally I'd like to determine whether Java 8's functional approach can solve the problem, since that's the language I'm currently working in.
Since the CSV output will change with every new line of input, you must hold it in memory before writing it out. If you consider creating the output text from the internal representation of the CSV file another "pass" over the data (the internal representation is practically a Map[String, List[String]], which you must traverse to convert it to text), then it's not possible to do this in a single pass.
If, however, this is acceptable, then you can use a Stream to read a single item from your JSON file, merge it into the CSV file, and repeat until the stream is empty.
Assuming that the internal representation of the CSV file is
trait CsvFile {
  def merge(line: Map[String, String]): CsvFile
}
and that you can represent a single item as
trait Item {
  def asMap: Map[String, String]
}
you can implement it using foldLeft:
def toCsv(items: Stream[Item]): CsvFile =
  items.foldLeft(CsvFile(Map()))((csv, item) => csv.merge(item.asMap))
or use recursion to get the same result
import scala.annotation.tailrec

@tailrec def toCsv(items: Stream[Item], prevCsv: CsvFile): CsvFile =
  items match {
    case Stream.Empty => prevCsv
    case item #:: rest =>
      val newCsv = prevCsv.merge(item.asMap)
      toCsv(rest, newCsv)
  }
Note: of course you don't have to create types for CsvFile or Item; you can use Map[String, List[String]] and Map[String, String] respectively.
UPDATE:
As more detail was requested for the CsvFile trait/class, here's an example implementation:
case class CsvFile(lines: Map[String, List[String]], rowCount: Int = 0) {
  def merge(line: Map[String, String]): CsvFile = {
    val orig = lines.withDefaultValue(List.fill(rowCount)(""))
    val current = line.withDefaultValue("")
    val newLines = (lines.keySet ++ line.keySet) map {
      k => (k, orig(k) :+ current(k))
    }
    CsvFile(newLines.toMap, rowCount + 1)
  }
}
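To actually produce the CSV text, one more traversal over this structure is needed, as noted above. A small rendering sketch (render is my name, not part of the original answer):

def render(csv: CsvFile): String = {
  val keys = csv.lines.keys.toList
  // Every column holds exactly rowCount cells, so row i is the i-th cell of each column.
  val rows = (0 until csv.rowCount).map(i => keys.map(k => csv.lines(k)(i)).mkString(","))
  (keys.mkString(",") +: rows).mkString("\n")
}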
This could be one approach:
val arr = Array(Map("abc" -> 123, "xy" -> "yz", "s12" -> 13), Map("abc" -> 1, "s" -> 133))
val keys = arr.flatMap(_.keys).distinct         // get the distinct keys for the header
arr.map(x => keys.map(y => x.getOrElse(y, ""))) // get an array of rows
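To finish, the header and rows can be stitched into the CSV text (a small addition of mine, reusing the values above):
val rows = arr.map(x => keys.map(y => x.getOrElse(y, "")))
val csv = (keys.mkString(",") +: rows.map(_.mkString(","))).mkString("\n")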
It's completely OK to have state in functional programming, but mutable state, or mutating state in place, is not allowed.
Functional programming advocates creating new, changed state instead of mutating state in place.
So it's OK to read and access state created in the program, as long as you are not mutating it or causing side effects.
Coming to the point:
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.map { inner => inner.map { case (k, v) => k } }.flatten
list.map { inner => inner.map { case (k, v) => v } }.flatten
REPL
scala> val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list: List[List[(String, String)]] = List(List((abc,123), (xy,yz)), List((abc,1)))
scala> list.map { inner => inner.map { case (k, v) => k}}.flatten
res1: List[String] = List(abc, xy, abc)
scala> list.map { inner => inner.map { case (k, v) => v}}.flatten
res2: List[String] = List(123, yz, 1)
or use flatMap instead of map and flatten:
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.flatMap { inner => inner.map { case (k, v) => k } }
list.flatMap { inner => inner.map { case (k, v) => v } }
In functional programming, mutable state is not allowed, but immutable state/values are fine.
Assuming you have read your JSON file into a value input: List[Map[String, String]], the code below will solve your problem:
val input = List(Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"), Map("abc" -> "1", "s" -> "33"))
val keys = input.map(_.keys).flatten.toSet
val keyvalues = input.map(kvs => keys.map(k => k -> kvs.getOrElse(k, "")).toMap)
val values = keyvalues.map(_.values)
val result = keys.mkString(",") + "\n" + values.map(_.mkString(",")).mkString("\n")

Building a document functionally and based on input value in a Play 2 controller in Scala and ReactiveMongo

I have a Play controller Action that edits a document in MongoDB using ReactiveMongo; the code is shown below. Both name and keywords are optional. I'm creating a temp BSONDocument() and adding tuples to it depending on whether name and keywords exist and are not empty. However, tmp is currently mutable (a var), and I'm wondering how I can get rid of the var.
def editEntity(id: String, name: Option[String], keywords: Option[String]) = Action {
  val objectId = new BSONObjectID(id)
  // TODO get rid of var here
  var tmp = BSONDocument()
  if (name.exists(_.trim.nonEmpty)) {
    tmp = tmp.add("name" -> BSONString(name.get))
  }
  val typedKeywords: Option[List[String]] = Utils.getKeywords(keywords)
  if (typedKeywords.exists(_.size > 0)) {
    tmp = tmp.add("keywords" -> typedKeywords.get.map(x => BSONString(x)))
  }
  val modifier = BSONDocument("$set" -> tmp)
  val updateFuture = collection.update(BSONDocument("_id" -> objectId), modifier)
}
UPDATE: After looking at the solution from @Vikas it occurred to me: what if there are more (say 10 or 15) input Options that I need to deal with? Maybe a fold- or reduce-based solution would scale better?
In your current code you're adding an empty BSONDocument() if none of the if conditions match: val modifier = BSONDocument("$set" -> tmp) will have an empty tmp if name is None and typedKeywords is None. Assuming that's what you want, here is one approach to get rid of the transient var. Also note that having a var locally (in a method) isn't a bad thing (sure, I'll still make that code look a bit prettier):
val typedKeywords: Option[List[String]] = Utils.getKeywords(keywords)
val bsonDoc = (name, typedKeywords) match {
  case (Some(n), Some(kw)) => BSONDocument().add("name" -> BSONString(n)).add("keywords" -> kw.map(x => BSONString(x)))
  case (Some(n), None)     => BSONDocument().add("name" -> BSONString(n))
  case (None, Some(kw))    => BSONDocument().add("keywords" -> kw.map(x => BSONString(x)))
  case (None, None)        => BSONDocument()
}
val modifier = BSONDocument("$set" -> bsonDoc)
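For the follow-up about many (say 10 or 15) input Options, a fold scales better than matching on ever-larger tuples. A sketch, assuming the same BSONDocument API used above and that ++ merges two documents: represent each optional field as an Option[BSONDocument] and fold the defined ones together.

val parts: List[Option[BSONDocument]] = List(
  name.filter(_.trim.nonEmpty).map(n => BSONDocument("name" -> BSONString(n))),
  typedKeywords.filter(_.nonEmpty).map(kw => BSONDocument("keywords" -> kw.map(x => BSONString(x))))
)
// flatten drops the Nones; each defined sub-document is merged into the result
val bsonDoc = parts.flatten.foldLeft(BSONDocument())(_ ++ _)
val modifier = BSONDocument("$set" -> bsonDoc)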

How to convert Future[BSONDocument] to list?

The code sends a request to MongoDB using ReactiveMongo and returns a Future[BSONDocument], but my code handles lists of data, so I need to get the value of the Future[BSONDocument] and then turn it into a list.
How do I do that, preferably without blocking?
Update:
I am using a ReactiveMongo RawCommand:
def findLByDistance()(implicit ec: ExecutionContext) = db.command(RawCommand(
  BSONDocument(
    "aggregate" -> collName,
    "pipeline" -> BSONArray(BSONDocument(
      "$geoNear" -> BSONDocument(
        "near" -> BSONArray(44.72, 25.365),
        "distanceField" -> "location.distance",
        "maxDistance" -> 0.08,
        "uniqueDocs" -> true)))
  )))
The result comes out as a Future[BSONDocument]. For some simple queries I used the default query builder, which allowed a simple conversion:
def findLimitedEvents()(implicit ec: ExecutionContext) =
  collection.find(BSONDocument.empty)
    .query(BSONDocument("tags" -> "lazy"))
    .options(QueryOpts().batchSize(10))
    .cursor.collect[List](10, true)
Basically I need the RawCommand output type to match the one used previously.
Not sure about your exact use case (showing some more code would help), but it might be useful to convert a List[Future[BSONDocument]] into a single Future[List[BSONDocument]], which you can then more easily onSuccess or map on. You can do that via:
val futures: List[Future[A]] = List(f1, f2, f3)
val futureOfThose: Future[List[A]] = Future.sequence(futures)
You cannot "get" a future without blocking. If you want to wait for a Future to complete, then you must block.
What you can do is map a Future into another Future:
val futureDoc: Future[BSONDocument] = ...
val futureList = futureDoc map { doc => docToList(doc) }
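For the aggregation above, docToList could extract the documents from the command reply: the old aggregate command returns a reply shaped like { "result": [ ... ], "ok": 1 }. A sketch under that assumption (docToList is this answer's name, not a library call; it needs the default BSON readers from reactivemongo.bson in scope):
def docToList(doc: BSONDocument): List[BSONDocument] =
  doc.getAs[List[BSONDocument]]("result").getOrElse(List.empty)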
Eventually, you'll hit a point where you've combined, mapped, recovered, etc. all your futures and want something to happen with the result. This is where you either block, or establish a handler to do something with the eventual result:
val futureThing: Future[Thing] = ...
//this next bit will be executed at some later time,
//probably on a different thread
futureThing onSuccess {
  case thing => doWhateverWith(thing)
}
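As an aside, onSuccess was deprecated in Scala 2.12 and later removed; the equivalent with onComplete looks like this (a sketch; it needs an ExecutionContext in scope):
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

futureThing.onComplete {
  case Success(thing) => doWhateverWith(thing)
  case Failure(e)     => e.printStackTrace() // handle the failure case as needed
}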

How to query with '$in' over '_id' in reactive mongo and play

I have a project set up with Play Framework 2.2.0 and play2-reactivemongo 0.10.0-SNAPSHOT. I'd like to query for a few documents by their ids, in a fashion similar to this:
def usersCollection = db.collection[JSONCollection]("users")
val ids: List[String] = ??? /* fetched from somewhere else */
val query = ???
val users = usersCollection.find(query).cursor[User].collect[List]()
As a query I tried:
Json.obj("_id" -> Json.obj("$in" -> ids)) // 1
Json.obj("_id.$oid" -> Json.obj("$in" -> ids)) // 2
Json.obj("_id" -> Json.obj("$oid" -> Json.obj("$in" -> ids))) // 3
The first and second return empty lists, and the third fails with the error assertion 10068 invalid operator: $oid.
NOTE: copy of my response on the ReactiveMongo mailing list.
First, sorry for the delay of my answer, I may have missed your question.
Play-ReactiveMongo cannot guess on its own that the values of a JSON array are ObjectIds. That's why you have to make a JSON object for each id that looks like this: {"$oid": "526fda0f9205b10c00c82e34"}. When the ReactiveMongo Play plugin sees an object whose first field is $oid, it treats it as an ObjectId, so that the driver can send the right type for this value (BSONObjectID in this case).
This is actually a more general problem: the JSON format does not exactly match the BSON one. That is the case for the numeric types (BSONInteger, BSONLong, BSONDouble), BSONRegex, BSONDateTime, and BSONObjectID. You may find more detailed information in the MongoDB documentation: http://docs.mongodb.org/manual/reference/mongodb-extended-json/ .
I managed to solve it with:
val objectIds = ids.map(id => Json.obj("$oid" -> id))
val query = Json.obj("_id" -> Json.obj("$in" -> objectIds))
usersCollection.find(query).cursor[User].collect[List]()
since the play-reactivemongo format treats an object as a BSONObjectID only when "$oid" is followed by a string:
implicit object BSONObjectIDFormat extends PartialFormat[BSONObjectID] {
  def partialReads: PartialFunction[JsValue, JsResult[BSONObjectID]] = {
    case JsObject(("$oid", JsString(v)) +: Nil) => JsSuccess(BSONObjectID(v))
  }
  val partialWrites: PartialFunction[BSONValue, JsValue] = {
    case oid: BSONObjectID => Json.obj("$oid" -> oid.stringify)
  }
}
Still, I hope there is a cleaner solution. If not, I guess it would make a nice pull request.
I'm wondering if transforming the ids to BSONObjectID isn't more secure this way:
val ids: List[String] = ???
val bsonObjectIds = ids.map(BSONObjectID.parse(_)).collect { case Success(t) => t }
This only generates valid BSONObjectIDs (and discards invalid ones). If you do it this way:
val objectIds = ids.map(id => Json.obj("$oid" -> id))
your objectIds may not be valid ones, depending on whether each string id really is the stringified version of a BSONObjectID or not.
If you import play.modules.reactivemongo.json._ it works without any $oid formatters.
import play.modules.reactivemongo.json._
...
val ids: Seq[BSONObjectID] = ???
val selector = Json.obj("_id" -> Json.obj("$in" -> ids))
usersCollection.find(selector).cursor[User].collect[Seq]()
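Combining this with the parsing suggestion above yields ids that are both valid and correctly serialized (a sketch, assuming the same imports and collection as in this thread):
import play.modules.reactivemongo.json._
import scala.util.Success

val rawIds: Seq[String] = ???
val bsonIds: Seq[BSONObjectID] = rawIds.map(BSONObjectID.parse).collect { case Success(id) => id }
val selector = Json.obj("_id" -> Json.obj("$in" -> bsonIds))
usersCollection.find(selector).cursor[User].collect[Seq]()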
I tried the following and it worked for me:
val listOfItems = BSONArray(51, 61)
val query = BSONDocument("_id" -> BSONDocument("$in" -> listOfItems))
val ruleListFuture = bsonFutureColl.flatMap(
  _.find(query, Option.empty[BSONDocument]).cursor[ResponseAccDataBean]()
    .collect[List](-1, Cursor.FailOnError[List[ResponseAccDataBean]]()))