How to persist externally obtained stateful data in apache-beam python? - apache-beam

In my apache-beam job I call an external source, GCP Storage. For practical purposes this can be treated like an HTTP call; the important part is that it is an external call used to enrich the job.
For every piece of data I process, I call this API to obtain some information to enrich the data. There is a heavy amount of repeat calls to the same data on the API.
Is there a good way to cache or store the results for reuse across the pieces of data processed, to limit the amount of network traffic required? It is a massive bottleneck for processing.

You can consider persisting this value as instance state on your DoFn. For example:

class MyDoFn(beam.DoFn):
  def __init__(self):
    # This will be called during construction and pickled to the workers.
    self.value1 = some_api_call()

  def setup(self):
    # This will be called once for each DoFn instance (generally
    # once per worker), good for non-pickleable stuff that won't change.
    self.value2 = some_api_call()
    # Initialize the per-element cache used in process() below.
    self.some_lru_cache = {}

  def start_bundle(self):
    # This will be called per-bundle, possibly many times on a worker.
    self.value3 = some_api_call()

  def process(self, element):
    # This is called on each element.
    key = ...
    if key not in self.some_lru_cache:
      self.some_lru_cache[key] = some_api_call()
    value4 = self.some_lru_cache[key]
    # Use self.value1, self.value2, self.value3 and/or value4 here.

There is no internal persistence layer in Beam. You have to download the data you want to process, and this can potentially happen on a fleet of workers that all need access to the data.
However, you might want to consider accessing your data as a side input. You will have to preload all the data, but you won't need to query the external source for each element: https://beam.apache.org/documentation/programming-guide/#side-inputs
For GCS specifically you might want to try to use the existing IO, e.g. TextIO: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java

Related

Syncing speed of reading from DB and writing to Elasticsearch using Akka gRPC stream

We developed multiple services, each using Akka actors, and the services communicate via Akka gRPC. One service fills an in-memory database; another service, called Reader, applies some queries, shapes the data, and then transfers it to an Elasticsearch service for insertion/update. The volume of data in each reading phase is about 1M rows.
The problem arises when Reader transfers a large amount of data and Elasticsearch cannot process and insert/update it all.
I used Akka Streams for the communication between these two services. I also use the scalikejdbc library and code like the following to read and insert data in batches instead of all at once.
def applyQuery(query: String, mergeResult: Map[String, Any] => Unit) = {
  val publisher = DB readOnlyStream {
    SQL(s"${query}").map(_.toMap()).list().fetchSize(100000)
      .iterator()
  }
  Source.fromPublisher(publisher).runForeach(mergeResult)
}
////////////////////////////////////////////////////////
var batchRows: ListBuffer[Map[String, Any]] = new ListBuffer[Map[String, Any]]
val batchSize: Int = 100000

def mergeResult(row: Map[String, Any]): Unit = {
  batchRows :+= row
  if (batchRows.size == batchSize) {
    send2StorageServer(readyOutput(batchRows))
    batchRows.clear()
  }
}

def readyOutput(res: ListBuffer[Map[String, Any]]): ListBuffer[StorageServerRequest] = {
  // code to format res
}
Now, using the foreach operation makes things much slower. I tried different batch sizes but it made no difference. Am I wrong to use foreach, or is there a better way to resolve the speed problem using Akka Streams, Flow, etc.?
I found that the correct operation for appending to a ListBuffer is
batchRows += row
Using :+= does not produce a bug, but it is very inefficient, so with the correct operator foreach is no longer slow. However, the speed problem remains: this time reading the data is fast but writing to Elasticsearch is slow.
After some searching, I came up with these solutions:
1. Using a queue as a buffer between the database and Elasticsearch may help.
2. If blocking the read operation until the write is done is not too costly, that could be
another solution (see the sketch below, which lets the stream itself handle batching and backpressure).
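For instance, a minimal sketch of the second idea, reusing the DB, SQL, batchSize and readyOutput names from the question and assuming the same implicit materializer is in scope. It also assumes send2StorageServer can be wrapped in a hypothetical send2StorageServerAsync that returns a Future (not part of the original code). grouped batches the rows, and mapAsync(1) backpressures the database read until each batch has been written:

import akka.Done
import akka.stream.scaladsl.{Sink, Source}
import scala.collection.mutable.ListBuffer
import scala.concurrent.Future

def applyQueryBatched(query: String): Future[Done] = {
  val publisher = DB readOnlyStream {
    SQL(query).map(_.toMap()).list().fetchSize(100000).iterator()
  }
  Source
    .fromPublisher(publisher)
    .grouped(batchSize)                    // emits Seq[Map[String, Any]] batches
    .mapAsync(parallelism = 1) { batch =>  // backpressures the upstream read until the write completes
      // hypothetical Future-returning wrapper around send2StorageServer
      send2StorageServerAsync(readyOutput(ListBuffer(batch: _*)))
    }
    .runWith(Sink.ignore)
}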

Flink - how to aggregate and query a rich sink function's state across multiple task slots

I implemented a rich sink function which performs some network calls per invoked object. I would like to be able to count some metadata on these events, keyed by some contextual information contained in the event (the event's batchId), and expose this metadata to an external system.
For example, an event looks like this:
case class MyEvent(batchId: String, eventId: String, moreInformation: ...)

class MySink(...) extends RichSinkFunction[MyEvent] {
  override def open(parameters: Configuration): Unit = {
    ...
  }

  override def close(): Unit = {
    ...
  }

  override def invoke(event: MyEvent) = {
    // some processing is done here
    ...
    if (success) {
      // I want to save the metadata here, per event.batchId:
      // state.count.number.of.events.processed.for.event.batchId
    }
  }
}
And in another place I want to somehow be able to query how many events were processed for a given batchId.
A few options:
Plan A: Use Metric objects and a MetricReporter to expose the data to the external system(s). This has the drawback that metrics aren't checkpointed, and if there are a lot of batchIds, you'll probably end up polluting the metrics system with lots of metrics that can't get GC'ed.
Plan B: Rewrite your RichSinkFunction as a RichFlatMap (or ProcessFunction) that emits a stream of Tuples holding (batchId, number.of.events.in.batchId). You can key this stream by the batchId, and then use keyed state in a KeyedProcessFunction (for example) to store and expose this state via queryable state (a rough sketch of this follows the list of options). This has the drawback that queryable state only allows for point queries (one key at a time).
Plan C: In this variant, the external systems could query the state created in Plan B by injecting queries into a stream that is broadcast into a KeyedBroadcastProcessFunction that holds keyed state.count.number.of.events.processed.for.event.batchId data. You can then use ctx.applyToKeyedState in the processBroadcastElement method of the KeyedBroadcastProcessFunction to respond to these queries. See one of the Flink training exercises for an example.
Plan D: write the results from B (or C) into redis, or elasticsearch, or some other queryable data store, and have the external systems get this info from there.
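As a rough illustration of Plan B (a sketch only, not the asker's actual pipeline; MyEvent is the case class above, and the state and queryable-state names are made up), a KeyedProcessFunction could keep a per-batchId count and mark the state queryable:

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Counts events per batchId and exposes the count via queryable state.
class BatchCounter extends KeyedProcessFunction[String, MyEvent, (String, Long)] {

  @transient private var countState: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    val descriptor = new ValueStateDescriptor("events-per-batch", classOf[java.lang.Long])
    descriptor.setQueryable("events-per-batch") // name used by the QueryableStateClient
    countState = getRuntimeContext.getState(descriptor)
  }

  override def processElement(event: MyEvent,
                              ctx: KeyedProcessFunction[String, MyEvent, (String, Long)]#Context,
                              out: Collector[(String, Long)]): Unit = {
    val updated = Option(countState.value()).map(_.toLong).getOrElse(0L) + 1
    countState.update(updated)
    out.collect((event.batchId, updated))
  }
}

// Usage sketch: events.keyBy(_.batchId).process(new BatchCounter)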

How to set up a domain model as an actor?

I'm fairly new to both Scala and Akka and I'm trying to figure out how you would create a proper domain model which is also an Actor.
Let's imagine we have a simple business case where you can open a new Bank Account. Let's say that one of the rules is that you can only create one bank account per last name (not realistic, but just for the sake of simplicity). My first approach, without applying any business rules, would look something like this:
object Main {
  def main(args: Array[String]): Unit = {
    implicit val system = ActorSystem("account")
    implicit val materializer = ActorMaterializer()
    implicit val executionContext = system.dispatcher

    val account = system.actorOf(Props[Account])
    account ! CreateAccount("Doe")
  }
}

case class CreateAccount(lastName: String)

class Account extends Actor {
  var lastName: String = null

  override def receive: Receive = {
    case createAccount: CreateAccount =>
      this.lastName = createAccount.lastName
  }
}
Eventually you would persist this data somewhere. However, when adding the rule that there can only be one Bank Account per last name, a query to some data storage needs to be done. Let's say we put that logic inside a repository and the repository eventually returns an Account; then we run into the problem that Account isn't an Actor anymore, since the repository won't be able to create Actors.
This is definitely a wrong implementation and not how Actors should be used. My question is: what are the ways to solve these kinds of problems? I am aware that my knowledge of Akka is not at a decent level yet, so it might be a weirdly formulated question.
This might be a long answer and I am sorry there isn't a TLDR version. :)
Ok, so you want to "Actorize" your domain model? Bad idea. Domain models are not necessarily actors. Sometimes they are, but often they are not. It would be an anti-pattern to deploy one actor per domain model, because if you do that you are simply swapping method calls for message passing while losing the single-threaded guarantees of method calling. You cannot guarantee the timing of the messages hitting your actor, and programming based upon ASK patterns is a good way to build a system that is not scalable: eventually you have too many threads and too many futures, you can't proceed further, and the system bogs down and chokes. So what does that mean for your particular problem?
First you have to stop thinking of the domain model as a single thing, and definitely stop using POJO entities. I entirely agree with Martin Fowler when he discusses the anemic domain model. In a well-built actor system there will often be three domain models. One is the persisted model, which has entities that model your database. The second is the immutable model. This is the model that the actors use to communicate with each other. All the entities are immutable from the bottom up, all collections unmodifiable, all objects only have getters, and all constructors copy the collections to new immutable collections. The immutable model means your actors never have to copy anything; they just pass around references to data. Lastly you have the API model, which is usually the set of entities that model the JSON for the clients to consume. The API model is there to insulate the back end from client code changes and vice versa; it's the contract between the systems.
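As a purely illustrative sketch of that layering (the entity and field names here are invented, not taken from the answer), the three models for a single concept might look like:

// Persisted model: mirrors the database table, mutable for the ORM's benefit.
class ProductRow {
  var id: Long = 0L
  var name: String = ""
  var stockLevel: Int = 0
}

// Immutable model: what actors pass around in messages; safe to share by reference.
final case class Product(id: Long, name: String, stockLevel: Int)

// API model: the JSON contract exposed to clients, insulated from the other two.
final case class ProductApi(id: String, displayName: String, inStock: Boolean)

object ProductMappings {
  def toImmutable(row: ProductRow): Product =
    Product(row.id, row.name, row.stockLevel)

  def toApi(p: Product): ProductApi =
    ProductApi(p.id.toString, p.name, p.stockLevel > 0)
}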
To create your actors stop thinking about your persistent model and what you will do with it but instead start thinking of the use cases. What does your system have to do? Model your actors based on the use cases and that will change the implementation of the actors and their deployment strategies.
For example, consider a server that delivers inventory information to users, including current stock levels, reviews by users, and so on, for products from a single vendor. The users hammer this information and it changes quickly as stock levels change. This information is likely stored in half a dozen different tables. We don't model an actor for each table but rather a single actor to serve this use case. In this case the information is accessed by a large group of people in a heavy-load environment, so we are best off creating an actor that aggregates all of this data, replicating that actor to each node, and informing all replicas on all nodes whenever the data changes. This means the user getting the overview doesn't even touch the database: they hit the actors, get the immutable model, convert that to the API model and then return the data.
On the other hand, if a user wants to change the stock levels, we need to make sure that two users don't do it concurrently, yet large DB transactions slow the system down massively. So instead we pick one node that will hold the stock management actor for that vendor and we cluster-shard the actor. Any requests are routed to that actor and handled serially. A company user logs in and notes the receipt of a delivery of 20 new items. The message goes from whatever node they hit to the node holding the actor for that vendor; that actor then makes the appropriate database changes and broadcasts the change, which is picked up by all the replicated inventory view actors to update their data.
Now this is simplistic, because you have to deal with lost messages (read the articles on why reliable messaging is not necessary). However, once you start down that road you soon realize that simply turning your domain model into an actor system is an anti-pattern and there are better ways to do things.
Anyway that is my 2 cents :)
General Design
Actors should generally be simple dispatchers to business logic and contain as little functionality as possible. Think of Actors as similar to a Future: when you want concurrency in Scala you don't extend the Future class, you just use Future functionality around your existing logic.
Limiting your Actors to bare-bones responsibility has several advantages:
Testing the code can be done without having to construct ActorSystems, probes, ActorRefs, etc...
The business logic can easily be transplanted to other asynchronous libraries, e.g. Futures and Akka Streams.
It's easier to create a "proper domain model" with plain old classes and functions than it is with Actors.
Placing business logic in Actors naturally emphasizes a more object-oriented code/system design rather than a functional approach (we picked Scala for a reason).
Business Logic (No Akka)
Here we will setup all of the domain specific logic without using any akka related "stuff".
import scala.collection.immutable.HashMap

object BusinessLogicDomain {

  type FirstName = String
  type LastName = String
  type Balance = Double

  val defaultBalance: Balance = 0.0

  case class Account(firstName: FirstName,
                     lastName: LastName,
                     balance: Balance = defaultBalance)

Let's model your account directory as a HashMap:

  type AccountDirectory = HashMap[LastName, Account]

  val emptyDirectory: AccountDirectory = HashMap.empty[LastName, Account]

We can now create a function that matches your requirement of one distinct account per last name:

  val addAccount: (AccountDirectory, Account) => AccountDirectory =
    (accountDirectory, account) =>
      if (accountDirectory contains account.lastName)
        accountDirectory
      else
        accountDirectory + (account.lastName -> account)

}//end object BusinessLogicDomain
Repository (Akka)
Now that the unpolluted business code is complete, and isolated, we can add the concurrency layer on top of the foundational logic.
We can use the become functionality of Actors to store the state and respond to requests:
import akka.actor.Actor
import BusinessLogicDomain.{Account, AccountDirectory, addAccount, emptyDirectory}

case object QueryAccountDirectory

class RepoActor(accountDirectory: AccountDirectory = emptyDirectory) extends Actor {

  val statefulReceive: AccountDirectory => Receive =
    currentDirectory => {
      case account: Account =>
        context become statefulReceive(addAccount(currentDirectory, account))
      case QueryAccountDirectory =>
        sender ! currentDirectory
    }

  override def receive: Receive = statefulReceive(accountDirectory)
}
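For completeness, a small usage sketch (the names here are illustrative; it assumes the usual akka.pattern.ask with an implicit timeout for the query):

import akka.actor.{ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._

object RepoMain {
  def main(args: Array[String]): Unit = {
    implicit val system  = ActorSystem("account")
    implicit val timeout = Timeout(3.seconds)
    import system.dispatcher

    val repo = system.actorOf(Props(new RepoActor()))

    repo ! BusinessLogicDomain.Account("John", "Doe")
    repo ! BusinessLogicDomain.Account("Jane", "Doe") // ignored: last name already taken

    (repo ? QueryAccountDirectory)
      .mapTo[BusinessLogicDomain.AccountDirectory]
      .foreach { directory =>
        println(directory) // expect a single "Doe" entry
        system.terminate()
      }
  }
}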

Service Fabric Actors - save state to database

I'm working on a sample Service Fabric project, where I have to maintain a shopping list. For this I have a ShoppingList actor, which is identifiable by a specific id. It stores the current list content in its state using StateManager. All works fine.
However, in parallel I'd like to maintain the shopping list content in a sql database. In particular:
store all add/remove item request for future analysis (ML)
on first actor initialization load list content from db (e.g. after cluster has been re-created)
What is the best approach to achieve that? Create a custom StateProvider (how? can't find examples)?
Or maybe have another service/actor for handling all db operations (possibly using queues and reminders)?
All examples seem to completely rely on default StateManager, with no data persistence to external storage, so I'm not sure what's the best practice.
The best way would be to have a separate entity responsible for storing the data in the DB. The actor would just send an event (not implying SF events) with some data about the performed operation, and the other entity would catch it and perform the rest of the work.
Of course you can implement this in the actor itself, but that brings two possible issues:
1. The actor will not be able to process other requests if there are issues with the DB, with the connectivity between the actor and the DB, or if the DB itself is under heavy load and processes requests slowly. The actor would have to wait until the transfer to the DB successfully completes.
2. Possible overloading of the DB with many single connections from many actors, instead of one or a few connections from another entity doing batch insertion.
So your final solution will depend on the workload of your system. But you will definitely need a reliable queue to safely store the data in the DB if the data is too valuable to afford losing it.
Also, I think you could use the default state manager to store logs and information about transactions before they are transferred to the DB, and remove them from the service's state after the transaction completes. There is no need to keep permanent storage of such data in the services.
Another thing to take into consideration is reading from the DB. If you have a relational database, only one table is updated with new records, and a huge number of actors query that data on activation, you may see performance degradation, because the table will be locked for reading or writing unless you configure it to behave differently. So you will probably need a caching layer for reading data during actor activation; it depends on your workload.
And about implementing your custom State Manager: take a look at this example. Basically, all you need to do is implement the IReliableStateManagerReplica interface and pass it to the StatefulService constructor.

How to perform initialization in spark?

I want to perform GeoIP lookups of my data in Spark. To do that I'm using MaxMind's GeoIP database.
What I want to do is to initialize a GeoIP database object once on each partition, and later use it to look up the city related to an IP address.
Does Spark have an initialization phase for each node, or should I instead check whether an instance variable is undefined and, if so, initialize it before continuing? E.g. something like this (it is Python, but I want a Scala solution):
class IPLookup(object):
    database = None

    def getCity(self, ip):
        if not self.database:
            self.database = self.initialise(geoipPath)
        ...
Of course, doing this requires that Spark serialise the whole object, something the docs caution against.
In Spark, per-partition operations can be done using:
def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)
This will execute the function f once per partition, over an iterator of the partition's elements. The idea is that the cost of setting up resources (like DB connections) is offset by using those resources across the many elements in the iterator.
Example:
val logsRDD = ???
logsRDD.mapPartitions { iter =>
  val geoIp = new GeoIPLookupDB(...)
  // this is a local map over the iterator - do not confuse it with rdd.map
  iter.map(elem => (geoIp.resolve(elem.ip), elem))
}
This seems like a good use of a broadcast variable. Have you looked at the documentation for that functionality, and if you have, does it fail to meet your requirements in some way?
As @bearrito mentioned, you can load your GeoDB and then broadcast it from your driver.
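A minimal sketch of that approach (assuming the GeoIPLookupDB class from the earlier answer is serializable, and using hypothetical geoipPath and LogEntry names; none of these details are specified in the question):

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Build the lookup once on the driver and broadcast it: each executor
// receives a single read-only copy instead of one per task or per element.
def enrichWithGeo(sc: SparkContext, logsRDD: RDD[LogEntry]) = {
  val geoIpBroadcast: Broadcast[GeoIPLookupDB] = sc.broadcast(new GeoIPLookupDB(geoipPath))
  logsRDD.map(elem => (geoIpBroadcast.value.resolve(elem.ip), elem))
}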
Another option to consider is to provide an external service that you can use to do a lookup. It could be an in-memory cache such as Redis/Memcached/Tachyon, or a regular datastore.