How to perform initialization in Spark? - scala

I want to perform GeoIP lookups on my data in Spark. To do that I'm using MaxMind's GeoIP database.
What I want to do is initialize a GeoIP database object once per partition, and later use it to look up the city related to an IP address.
Does Spark have an initialization phase for each node, or should I instead check whether an instance variable is undefined and, if so, initialize it before continuing? E.g. something like this (it's Python, but I want a Scala solution):
class IPLookup(object):
    database = None

    def getCity(self, ip):
        if not self.database:
            self.database = self.initialise(geoipPath)
        ...
Of course, doing this requires that Spark serialise the whole object, something the docs caution against.

In Spark, per-partition operations can be done using:
def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)
This mapper will execute the function f once per partition over an iterator of elements. The idea is that the cost of setting up resources (like DB connections) is offset by reusing those resources over the many elements in the iterator.
Example:
val logsRDD = ???
logsRDD.mapPartitions { iter =>
  val geoIp = new GeoIPLookupDB(...)
  // this is a local map over the iterator - do not confuse with rdd.map
  iter.map(elem => (geoIp.resolve(elem.ip), elem))
}

This seems like a good use of a broadcast variable. Have you looked at the documentation for that functionality, and if you have, does it fail to meet your requirements in some way?

As @bearrito mentioned, you can load your GeoDB on the driver and then broadcast it to the executors (a sketch follows below).
Another option to consider is to provide an external service that you can use to do the lookup. It could be an in-memory cache such as Redis/Memcached/Tachyon, or a regular datastore.
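For illustration, a minimal sketch of the broadcast approach. It reuses the GeoIPLookupDB placeholder from the mapPartitions example above and assumes it (or a wrapper around it) is Serializable so it can be broadcast; LogLine is a hypothetical element type:
import org.apache.spark.rdd.RDD

case class LogLine(ip: String, msg: String)   // hypothetical element type
val logsRDD: RDD[LogLine] = ???

// Load the database once on the driver; Spark ships it to every executor once.
val geoIpBroadcast = sc.broadcast(new GeoIPLookupDB(geoipPath))

// Each task reads the broadcast value; no per-partition initialisation needed.
val citiesRDD = logsRDD.map(line => (geoIpBroadcast.value.resolve(line.ip), line))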

Related

How to persist externally obtained stateful data in apache-beam python?

In my Apache Beam job I call an external source, GCP Storage; for practical purposes this can be treated like an HTTP call. The important part is that it is an external call made to enrich the job.
For every piece of data I process, I call this API to obtain some information to enrich the data. There is a heavy amount of repeated calls to the API for the same data.
Is there a good way to cache or store the results for reuse across the pieces of data being processed, to limit the amount of network traffic required? It is a massive bottleneck for processing.
You can consider persisting this value as instance state on your DoFn. For example:
class MyDoFn(beam.DoFn):
    def __init__(self):
        # This will be called during construction and pickled to the workers.
        self.value1 = some_api_call()

    def setup(self):
        # This will be called once for each DoFn instance (generally
        # once per worker), good for non-pickleable stuff that won't change.
        self.value2 = some_api_call()
        self.some_lru_cache = {}  # per-worker cache used in process() below

    def start_bundle(self):
        # This will be called per-bundle, possibly many times on a worker.
        self.value3 = some_api_call()

    def process(self, element):
        # This is called on each element.
        key = ...
        if key not in self.some_lru_cache:
            self.some_lru_cache[key] = some_api_call()
        value4 = self.some_lru_cache[key]
        # Use self.value1, self.value2, self.value3 and/or value4 here.
There is no internal persistence layer in Beam. You have to download the data you want to process, and this can potentially happen on a fleet of workers that all need access to the data.
However, you might want to consider accessing your data as a side input. You would have to preload all the data, but then you won't need to query the external source for each element: https://beam.apache.org/documentation/programming-guide/#side-inputs
For GCS specifically you might want to try the existing IOs, e.g. TextIO: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java

Syncing the speed of reading from a DB and writing to Elasticsearch using Akka gRPC streams

We have developed multiple services, each using Akka actors, and the services communicate via Akka gRPC. One service fills an in-memory database; another service, called Reader, applies a query, shapes the data, and then transfers it to an Elasticsearch service for insertion/update. The volume of data in each reading phase is about 1M rows.
The problem arises when Reader transfers a large amount of data and Elasticsearch cannot keep up with inserting/updating it all.
I used Akka Streams for the communication between these two services. I also use the ScalikeJDBC library and code like the snippet below to read and insert the data in batches instead of all at once.
def applyQuery(query: String, mergeResult: Map[String, Any] => Unit) = {
  val publisher = DB readOnlyStream {
    SQL(s"${query}").map(_.toMap()).list().fetchSize(100000)
      .iterator()
  }
  Source.fromPublisher(publisher).runForeach(mergeResult)
}

////////////////////////////////////////////////////////

var batchRows: ListBuffer[Map[String, Any]] = new ListBuffer[Map[String, Any]]
val batchSize: Int = 100000

def mergeResult(row: Map[String, Any]): Unit = {
  batchRows :+= row
  if (batchRows.size == batchSize) {
    send2StorageServer(readyOutput(batchRows))
    batchRows.clear()
  }
}

def readyOutput(res: ListBuffer[Map[String, Any]]): ListBuffer[StorageServerRequest] = {
  // code to format res
}
Now, when using the foreach, operations become much slower. I tried different batch sizes, but it made no difference. Am I wrong in using foreach this way, or is there a better way to resolve the speed problem using Akka streams, flows, etc.?
I found that the correct operation for appending to a ListBuffer is
batchRows += row
Using :+= does not produce a bug, but it is very inefficient, so with the correct operator the foreach is no longer slow. However, the speed problem still exists: this time reading the data is fast, but writing to Elasticsearch is slow.
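For reference, a short sketch of why the operator matters (batchRows is the var ListBuffer from the question):
batchRows += row    // appends in place; constant time per element
batchRows :+= row   // desugars to batchRows = batchRows :+ row,
                    // copying the whole buffer on every append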
After some searching, I came up with these solutions:
1. Using a queue as a buffer between the database and Elasticsearch may help.
2. If blocking the read operation until the write is done is not too costly, that can be another solution (a sketch of this idea follows below).
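A sketch of the second idea, using the stream's own backpressure instead of a mutable buffer. It reuses publisher, batchSize and readyOutput from the question, assumes the same implicit materializer already in scope for runForeach above, and assumes a hypothetical Future-returning send2StorageServerAsync; mapAsync(1) ensures a batch is fully written before the next one is pulled from the database:
import akka.Done
import akka.stream.scaladsl.{Sink, Source}
import scala.collection.mutable.ListBuffer
import scala.concurrent.Future

val done: Future[Done] =
  Source.fromPublisher(publisher)
    .grouped(batchSize)                      // emits Seq[Map[String, Any]] batches
    .mapAsync(parallelism = 1) { rows =>
      send2StorageServerAsync(readyOutput(ListBuffer(rows: _*)))
    }
    .runWith(Sink.ignore)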

Spark - ElasticSearch Index creation performance too slow

I am trying to use Apache Spark to create an index in Elasticsearch (writing huge amounts of data to ES). I have written a Scala program to create the index using Apache Spark. The data to index arrives as my product beans in a LinkedList. I then traverse the product bean list and create the index. My code is given below.
import java.util
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._           // provides saveToEs
import scala.collection.JavaConversions._  // to iterate over the Java LinkedList

val conf = new SparkConf().setAppName("ESIndex").setMaster("local[*]")
conf.set("es.index.auto.create", "true").set("es.nodes", "127.0.0.1")
  .set("es.port", "9200")
  .set("es.http.timeout", "5m")
  .set("es.scroll.size", "100")

val sc = new SparkContext(conf)

// Returns my product beans in a LinkedList.
val list: util.LinkedList[product] = getData()

for (item <- list) {
  sc.makeRDD(Seq(item)).saveToEs("my_core/json")
}
The issue with this approach is that it takes too much time to create the index.
Is there a better way to create the index?
Don't pass data through the driver unless it is necessary. Depending on the source of the data returned by getData, you should use the relevant input method or create your own. If the data comes from MongoDB, use for example mongo-hadoop, Spark-MongoDB, or Drill with a JDBC connection. Then use map or a similar method to build the required objects, and use saveToEs on the transformed RDD.
Creating an RDD with a single element doesn't make sense. It doesn't benefit from the Spark architecture at all: you just start a potentially huge number of tasks that each do almost nothing, with only a single active executor at a time.
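For illustration, a sketch of that suggestion applied to the code in the question (reusing getData(), the product bean and the existing SparkContext sc): build one RDD from the whole collection and issue a single saveToEs call instead of one tiny RDD per element.
import scala.collection.JavaConverters._
import org.elasticsearch.spark._   // provides saveToEs on RDDs

val products: Seq[product] = getData().asScala.toSeq
sc.makeRDD(products).saveToEs("my_core/json")
If the data is too large to collect on the driver in the first place, prefer reading it with a proper distributed input source as described above.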

Difference between map and mapAsync

Can anyone please explain the difference between map and mapAsync with respect to Akka Streams? The documentation says:
Stream transformations and side effects involving external non-stream
based services can be performed with mapAsync or mapAsyncUnordered
Why can't we simply use map here? I assume that Flow, Source and Sink are all monadic in nature, so shouldn't map work fine with respect to their delayed/asynchronous nature?
Signature
The difference is best highlighted in the signatures: Flow.map takes in a function that returns a type T while Flow.mapAsync takes in a function that returns a type Future[T].
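Slightly simplified, the two signatures look like this (Out is the element type flowing through the stage; the real akka-stream signatures carry a few more type parameters):
def map[T](f: Out => T): Repr[T]
def mapAsync[T](parallelism: Int)(f: Out => Future[T]): Repr[T]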
Practical Example
As an example, suppose that we have a function which queries a database for a user's full name based on a user id:
type UserID = String
type FullName = String
val databaseLookup : UserID => FullName = ??? //implementation unimportant
Given an akka stream Source of UserID values we could use Flow.map within a stream to query the database and print the full names to the console:
val userIDSource : Source[UserID, _] = ???
val stream =
  userIDSource.via(Flow[UserID].map(databaseLookup))
              .to(Sink.foreach[FullName](println))
              .run()
One limitation of this approach is that this stream will only make 1 db query at a time. This serial querying will be a "bottleneck" and likely prevent maximum throughput in our stream.
We could try to improve performance through concurrent queries using a Future:
def concurrentDBLookup(userID : UserID) : Future[FullName] =
  Future { databaseLookup(userID) }

val concurrentStream =
  userIDSource.via(Flow[UserID].map(concurrentDBLookup))
              .to(Sink.foreach[Future[FullName]](_ foreach println))
              .run()
The problem with this simplistic addendum is that we have effectively eliminated backpressure.
The Sink is just pulling in the Future and adding a foreach println, which is relatively fast compared to database queries. The stream will continuously propagate demand to the Source and spawn off more Futures inside of the Flow.map. Therefore, there is no limit to the number of databaseLookup running concurrently. Unfettered parallel querying could eventually overload the database.
Flow.mapAsync to the rescue; we can have concurrent db access while at the same time capping the number of simultaneous lookups:
val maxLookupCount = 10

val maxLookupConcurrentStream =
  userIDSource.via(Flow[UserID].mapAsync(maxLookupCount)(concurrentDBLookup))
              .to(Sink.foreach[FullName](println))
              .run()
Also notice that the Sink.foreach got simpler; it no longer takes in a Future[FullName], but just a FullName instead.
Unordered Async Map
If maintaining a sequential ordering of the UserIDs to FullNames is unnecessary, then you can use Flow.mapAsyncUnordered, for example if you just need to print all of the names to the console and don't care about the order in which they are printed.
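A minimal sketch, reusing userIDSource, concurrentDBLookup and maxLookupCount from the examples above; names are printed as each Future completes rather than in the order the UserIDs arrived:
val unorderedStream =
  userIDSource.via(Flow[UserID].mapAsyncUnordered(maxLookupCount)(concurrentDBLookup))
              .to(Sink.foreach[FullName](println))
              .run()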

Spark Streaming: how to propagate updates to a Broadcast variable to the whole cluster?

I have a module in the Spark driver listening to a Kafka queue, and depending on the content of the queue I need to modify the content of a broadcast variable (or a closure). In this example it could be a String.
For example if the string "change" arrives on the queue, I need to update the Broadcast variable in every node.
I would like to see a clean and performant pattern for doing this, or at least some pointers to material that explains how to propagate modifications across the Spark cluster.
Broadcast variables are indeed a way of propagating variables or whole closures to the Spark cluster, using a peer-to-peer protocol.
From the Learning Spark book:
A broadcast variable is simply an object of type spark.broadcast.Broadcast[T], which wraps a value of type T. We can access this value by calling value on the Broadcast object in our tasks. The value is sent to each node only once, using an efficient, BitTorrent-like communication mechanism.
What will impact performance is the serialization method you're using (e.g. Kryo, a custom serializer, ...).
There is an example in the book:
Example 6-8. Country lookup with Broadcast values in Scala
// Look up the countries for each call sign for the
// contactCounts RDD. We load an array of call sign
// prefixes to country code to support this lookup.
val signPrefixes = sc.broadcast(loadCallSignTable())
val countryContactCounts = contactCounts.map {
  case (sign, count) =>
    val country = lookupInArray(sign, signPrefixes.value)
    (country, count)
}.reduceByKey((x, y) => x + y)
countryContactCounts.saveAsTextFile(outputDir + "/countries.txt")
As shown in these examples, the process of using broadcast variables is simple:
1. Create a Broadcast[T] by calling SparkContext.broadcast on an object of type T. Any type works as long as it is also Serializable.
2. Access its value with the value property (or the value() method in Java).
3. The variable will be sent to each node only once, and should be treated as read-only (updates will not be propagated to other nodes).
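A minimal, self-contained sketch of those three steps (all names here are illustrative, not taken from the question):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

// 1. Create a Broadcast[T] on the driver from a Serializable value.
val prefixToCountry = sc.broadcast(Map("J" -> "Japan", "VK" -> "Australia"))

// 2. Access it inside tasks via .value; it is shipped to each node only once.
val signs = sc.parallelize(Seq("JA1XYZ", "VK2ABC", "W1AW"))
val countries = signs.map { sign =>
  prefixToCountry.value.collectFirst {
    case (prefix, country) if sign.startsWith(prefix) => country
  }.getOrElse("unknown")
}

// 3. Treat the broadcast value as read-only: mutating it on an executor
//    is not propagated back to the driver or to other executors.
countries.collect().foreach(println)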