What are other approaches instead of using collect() in Spark Scala?

My piece of Scala code looks like:
val orgIncInactive = orgIncLatest.filter("(LD_TMST != '' and LD_TMST is not null)").select("ORG_ID").rdd
orgIncInactive.collect.foreach(p => DenormalizedTablesMethodsUtil.hbaseTablePurge(p(0).toString, tableName, connection))
Is there any way that I can avoid using collect() here?
I tried various possibilities, but I keep ending up with serialization errors.
Thanks.

It depends on what you are trying to do and what is ultimately causing the serialization error. It looks like you are trying to pass some kind of database connection into the anonymous function. That's generally going to fail: even if you made the connection object itself serializable, say by sub-classing it and implementing Serializable, database connections are not something you can share between the driver and the executors.
Instead, what you need to do is to create the connection object on each of the executors, and then use the local connection object instead of one defined in the driver. There are a couple of ways to accomplish this.
One is to use mapPartitions, which allows you to instantiate objects locally before the logic is run. See here for more on this.
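For the side-effecting purge in the question, the foreachPartition variant of the same idea might look roughly like this; createConnection() is a placeholder for whatever HBase connection setup you are currently doing on the driver:
orgIncInactive.foreachPartition { rows =>
  val connection = createConnection()   // built on the executor, never shipped from the driver
  try {
    rows.foreach(p => DenormalizedTablesMethodsUtil.hbaseTablePurge(p(0).toString, tableName, connection))
  } finally {
    connection.close()                  // one connection per partition
  }
}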
Another possibility is to create a singleton object that, on initialization, sets a connection object to null or None. You would then define a method in the object, say "getConnection", that checks whether the connection has been initialized; if not, it initializes it, and either way it returns a valid connection.
I use the second approach more than the first, because it limits initialization to only once per executor instead of forcing it to happen once per partition.
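A rough sketch of this second approach; ExecutorConnection, Connection, and createConnection() are placeholders for your actual HBase client setup:
object ExecutorConnection {
  private var connection: Connection = null      // starts out uninitialized

  def getConnection: Connection = synchronized {
    if (connection == null) {
      connection = createConnection()            // runs at most once per executor JVM
    }
    connection
  }
}

// Inside the RDD operation, each executor resolves its own local connection:
orgIncInactive.foreachPartition { rows =>
  val conn = ExecutorConnection.getConnection
  rows.foreach(p => DenormalizedTablesMethodsUtil.hbaseTablePurge(p(0).toString, tableName, conn))
}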


Object cache on Spark executors

A good question for Spark experts.
I am processing data in a map operation (RDD). Within the mapper function, I need to look up objects of class A to be used in processing the elements of the RDD.
Since this will be performed on executors, and creating the elements of type A that will be looked up is an expensive operation, I want to pre-load and cache these objects on each executor. What is the best way of doing this?
One idea is to broadcast a lookup table, but class A is not serializable (no control over its implementation).
Another idea is to load them up in a singleton object. However, I want to control what gets loaded into that lookup table (e.g. possibly different data on different Spark jobs).
Ideally, I want to specify what will be loaded on executors once (including the case of Streaming, so that the lookup table stays in memory between batches), through a parameter that will be available on the driver during its start-up, before any data gets processed.
Is there a clean and elegant way of doing it or is it impossible to achieve?
This is exactly the targeted use case for broadcast. Broadcast variables are transmitted once, use a torrent-like protocol to move efficiently to all executors, and stay in memory / on local disk until you no longer need them.
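As a minimal sketch of that happy path, assuming the lookup table itself is serializable; loadLookupTable, path, and fn are placeholders, the same ones used in the snippet further down:
val lookupTableBc = sc.broadcast(loadLookupTable(path))   // shipped once per executor, not once per task

rdd.map { elem =>
  fn(lookupTableBc.value, elem)                           // .value resolves to the executor-local copy
}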
Serialization often pops up as an issue when using others' interfaces. If you can enforce that the objects you consume are serializable, that's going to be the best solution. If this is impossible, your life gets a little more complicated. If you can't serialize the A objects, then you have to create them on the executors for each task. If they're stored in a file somewhere, this would look something like:
rdd.mapPartitions { it =>
  val lookupTable = loadLookupTable(path)
  it.map(elem => fn(lookupTable, elem))
}
Note that if you're using this model, then you have to load the lookup table once per task -- you can't benefit from the cross-task persistence of broadcast variables.
EDIT: Here's another model, which I believe lets you share the lookup table across tasks per JVM.
class BroadcastableLookupTable extends Serializable {
  @transient private var lookupTable: LookupTable[A] = _

  def get: LookupTable[A] = synchronized {
    if (lookupTable == null)
      lookupTable = ??? // load lookup table from disk
    lookupTable
  }
}
This class can be broadcast (nothing substantial is transmitted) and the first time it's called per JVM, you'll load the lookup table and return it.
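A hedged usage sketch, assuming sc is the SparkContext and fn is the per-element function from earlier:
val tableBc = sc.broadcast(new BroadcastableLookupTable)

rdd.map { elem =>
  fn(tableBc.value.get, elem)   // get loads the table from disk at most once per executor JVM
}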
In case serialisation turns out to be impossible, how about storing the lookup objects in a database? It's not the easiest solution, granted, but it should work just fine. I could recommend checking e.g. spark-redis, but I am sure there are better solutions out there.
Since A is not serializable, the easiest solution is to create your own serializable type A1 with all the data from A required for the computation. Then use the new lookup table in the broadcast.
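A minimal sketch of that idea; the fields of A1, the accessors on A (getId, getWeight), and lookupA (the original non-serializable lookup table) are made-up stand-ins for whatever your computation actually needs:
// Case classes are serializable by default, so A1 can travel in a broadcast.
case class A1(id: String, weight: Double)

// Build the slimmed-down lookup table on the driver, then broadcast it.
val lookupA1: Map[String, A1] =
  lookupA.map { case (key, a) => key -> A1(a.getId, a.getWeight) }

val lookupBc = sc.broadcast(lookupA1)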

Drools 6 Fusion Notification

We are working on a very complex solution using Drools 6 (Fusion), and I would like your opinion about the best way to read objects created as correlation results over time.
My first basic approach was to read the Working Memory at fixed intervals, looking for new objects and reporting them to an external service (REST).
AgendaEventListener does not seem to be the "best" approach because I don't care about most of the objects being inserted into working memory, so maybe the best approach would be to inject the particular "object" into some sort of service inside the DRL. Is this a good approach?
You have quite a lot of options. In decreasing order of my preference:
AgendaEventListener is probably the solution requiring the smallest amount of LOC. It might be useful for other tasks as well; all you have on the negative side is one additional method call and a class test per inserted fact (see the sketch after this list). Peanuts.
You can wrap the insert macro in a DRL function and collect inserted facts of class X in a global List. The problem you have here is that you'll have to pass the KieContext as a second parameter to the function call.
If the creation of a class X object is inevitably linked with its insertion into WM, you could register new objects in a static List inside class X, in a factory method (or the constructor).
I'm putting your "basic approach" last because it requires many more cycles than the listener (#1) and tons of overhead for maintaining the set of X objects that have already been sent to REST.
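A rough Scala sketch of the listener idea from option #1; kieSession, restClient, and the fact class X are placeholders from your own setup, and note that in the Drools 6 kie-api the per-insert callback lives on RuleRuntimeEventListener:
import org.kie.api.event.rule.{DefaultRuleRuntimeEventListener, ObjectInsertedEvent}

kieSession.addEventListener(new DefaultRuleRuntimeEventListener {
  override def objectInserted(event: ObjectInsertedEvent): Unit =
    event.getObject match {
      case x: X => restClient.report(x) // forward only the facts you care about
      case _    =>                      // one class test per inserted fact; everything else is ignored
    }
})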

Mixing Parallel Collections with Akka

How well do Scala parallel collection operations get along with the concurrency/parallelism used by Akka actors (and Futures) with respect to efficient scheduling on the system?
Actors' and Futures' execution is handled by an ExecutionContext, generally provided by the Dispatcher. What I can find on parallel collections indicates that they use a TaskSupport object. I found an ExecutionContextTaskSupport object that may connect the two, but I am not sure.
What is the proper way to mix the two concurrency solutions, or is it advised not to?
At present this is not supported / handled well.
Prior to Scala 2.11-M7, attempting to use the dispatcher as the ExecutionContext throws an exception.
That is, the following code in an actor's receive will throw a NotImplementedError:
val par = List(1,2,3).par
par.tasksupport = new ExecutionContextTaskSupport(context.dispatcher)
par foreach println
Incidentally, this has been fixed in 2.11-M7, though it was not done to correct the above issue.
In reading through the notes on the fix it sounds like the implementation provided by ExecutionContextTaskSupport in the above case could have some overhead over directly using one of the other TaskSupport implementations; however, I have done nothing to test that interpretation or evaluate the magnitude of any impact.
A Note on Parallel Collections:
By default, parallel collections will use the global ExecutionContext (ExecutionContext.Implicits.global), just as you might for Futures. While this is well behaved, if you want to be constrained by the dispatcher (using context.dispatcher), as you are likely to do with Futures in Akka, you need to set a different TaskSupport as shown in the code sample above.
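If you would rather not tie the parallel collection to the actor system's dispatcher at all, a dedicated pool is another option. A small sketch; the pool size of 4 is an arbitrary choice, and on Scala 2.12+ the ForkJoinPool import comes from java.util.concurrent instead:
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

val par = List(1, 2, 3).par
par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4)) // isolate the collection's work in its own pool
par foreach println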

Implementing a Mondrian shared SegmentCache

I am trying to implement a Mondrian SegmentCache. The cache is to be shared by multiple JVMs running the Mondrian library. We are using Redis as the backing store, however for the purpose of this question, any persistent key-value store should be fine.
Will the Stack Overflow community help complete this implementation? The documentation and Google searches are not yielding a sufficient level of detail. Here we go:
new SegmentCache {
  private val logger = Logger("my-segment-cache")
  import logger._

  import com.redis.serialization.Parse
  import Parse.Implicits.parseByteArray

  private def redis = new RedisClient("localhost", 6379)

  def get(header: SegmentHeader): SegmentBody = {
    val result = redis.get[Array[Byte]](header.getUniqueID) map { bytes ⇒
      val st = new ByteArrayInputStream(bytes)
      val o = new ObjectInputStream(st)
      o.readObject.asInstanceOf[SegmentBody]
    }
    info(s"cache get\nHEADER $header\nRESULT $result")
    result.orNull
  }

  def getSegmentHeaders: util.List[SegmentHeader] = ???

  def put(header: SegmentHeader, body: SegmentBody): Boolean = {
    info(s"cache put\nHEADER $header\nBODY $body")
    val s = new ByteArrayOutputStream
    val o = new ObjectOutputStream(s)
    o.writeObject(body)
    redis.set(header.getUniqueID, s.toByteArray)
    true
  }

  def remove(header: SegmentHeader): Boolean = ???

  def tearDown() {}
  def addListener(listener: SegmentCacheListener) {}
  def removeListener(listener: SegmentCacheListener) {}
  def supportsRichIndex(): Boolean = true
}
Some immediate questions:
is SegmentHeader.getUniqueID the appropriate key to use in the cache?
how should getSegmentHeaders be implemented? The current implementation above just throws an exception, and doesn't ever seem to be called by Mondrian. How do we make the SegmentCache re-use existing cache records on startup?
how are addListener and removeListener meant to be used? I assume they have something to do with coordinating cache changes across nodes sharing the cache. But how?
what should supportsRichIndex return? In general, how does someone implementing a SegmentCache know what value to return?
I feel like these are basic issues that should be covered in the documentation, but they are not (as far as I can find). Perhaps we can correct the lack of available information here. Thanks!
is SegmentHeader.getUniqueID the appropriate key to use in the cache?
Yes and no. The UUID is convenient on systems like memcached, where everything boils down to a key/value match. If you use the UUID, you'll need to implement supportsRichIndex() as false. The reason for this is that excluded regions are not part of the UUID. That's by design, for good reasons.
What we recommend is an implementation that serializes the SegmentHeader (it implements Serializable, hashCode() and equals()) and uses that directly as a binary key that you propagate, so that it retains the invalidated regions and keeps everything nicely in sync.
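Against the Redis-backed sketch in the question, that might look roughly like the following; the Base64 step is only an assumption to fit a string-keyed client, the essential point being that the serialized header, not getUniqueID, becomes the key:
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.Base64

// Serialize the full SegmentHeader (it is Serializable) so invalidated regions are part of the key.
private def headerKey(header: SegmentHeader): String = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(header)
  out.close()
  Base64.getEncoder.encodeToString(bytes.toByteArray)
}

// get/put/remove would then use headerKey(header) instead of header.getUniqueID.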
You should look at how we've implemented it in the default memory cache.
There is also an implementation using Hazelcast.
We at Pentaho have also used Infinispan with great success.
how should getSegmentHeaders be implemented?
Again, take a look at the default in-memory implementation. You simply need to return the list of all the currently known SegmentHeader. If you can't provide that list for whatever reason, either because you've used the UUID only, or because your storage backend doesn't support obtaining a list, like memcached, you return an empty list. Mondrian won't be able to use in-memory rollup and won't be able to share the segments, unless it hits the right UUIDs in cache.
how are addListener and removeListener meant to be used?
Mondrian needs to be notified when new elements appear in the cache. These could be created by other nodes. Mondrian maintains an index of all the segments it should know about (thus enabling in-memory operations), so that's a way to propagate the updates. You need to bridge the backend with the Mondrian instances here. Take a look at how the Hazelcast implementation does it.
The idea behind this is that Mondrian maintains a spatial index of the currently known cells and will only query the necessary/missing cells from SQL if it absolutely needs to. This is necessary to achieve greater scalability. Fetching cells from SQL is extremely slow compared to objects which we maintain in an in-memory data grid.
How do we make the SegmentCache re-use existing cache records on startup?
This is a caveat. Currently this is possible by applying this patch. It wasn't ported to the master codeline because it is a mess and is tangled with the fixes for another case. It has been reported to work, but wasn't tested internally by us. The relevant code is about here. If you get around to testing this, we always welcome contributions. Let us know if you're interested on the mailing list. There are a ton of people who will gladly help.
One workaround is to update the local index through the listener when your cache implementation starts.

Mapping to legacy MongoDB store

I'm attempting to write up a Yesod app as a replacement for a Ruby JSON service that uses MongoDB on the backend and I'm running into some snags.
the sql=foobar syntax in the models file does not seem to affect which collection Persistent.MongoDB uses. How can I change that?
is there a way to easily configure mongodb (preferably through the yaml file) to be explicitly read only? I'd take more comfort deploying this knowing that there was no possible way the app could overwrite or damage production data.
Is there any way I can get Persistent.MongoDB to ignore fields it doesn't know about? This service only needs a fraction of the fields in the collection in question. In order to keep the code as simple as possible, I'd really like to just map to the fields I care about and have Yesod ignore everything else. Instead it complains that the fields don't match.
How does one go about defining instances for models, such as ToJSON? I'd like to customize how that JSON gets rendered, but I get the following error:
Handler/ProductStat.hs:8:10:
    Illegal instance declaration for ToJSON Product'
      (All instance types must be of the form (T t1 ... tn)
       where T is not a synonym.
       Use -XTypeSynonymInstances if you want to disable this.)
    In the instance declaration for ToJSON Product'
1) It seems that sql= is not hooked up to Mongo. Since the SQL backends already do this, it shouldn't be difficult to add for Mongo.
2) You can change the function that runs the queries. In persistent/persistent-mongoDB/Database/Persist there is a runPool function of PersistConfig. That gets used in yesod-defaults. We should probably change the loadConfig function to check a readOnly setting.
3) I am OK with changing the reorder function to allow for ignoring unknown fields, although in the future (if MongoDB returns everything in order) that may have performance implications, so ideally you would list the ignored columns.
4) This shouldn't require changes to Persistent. Did you try turning on TypeSynonymInstances?
I have several other Yesod/Persistent priorities to attend to before these changes- please roll up your sleeves and let me know what help you need making them. I can change 2 & 3 myself fairly soon if you are committed to testing them.