A good question for Spark experts.
I am processing data in a map operation (RDD). Within the mapper function, I need to look up objects of class A to be used when processing the elements of the RDD.
Since this will be performed on executors AND creation of elements of type A (that will be looked up) happens to be an expensive operation, I want to pre-load and cache these objects on each executor. What is the best way of doing it?
One idea is to broadcast a lookup table, but class A is not serializable (no control over its implementation).
Another idea is to load them up in a singleton object. However, I want to control what gets loaded into that lookup table (e.g. possibly different data on different Spark jobs).
Ideally, I want to specify what will be loaded on executors once (including the case of Streaming, so that the lookup table stays in memory between batches), through a parameter that will be available on the driver during its start-up, before any data gets processed.
Is there a clean and elegant way of doing it or is it impossible to achieve?
This is exactly the targeted use case for broadcast. Broadcast variables are transmitted once, use a BitTorrent-like protocol to reach all executors efficiently, and stay in memory / on local disk until you no longer need them.
Serialization often pops up as an issue when using others' interfaces. If you can enforce that the objects you consume are serializable, that's going to be the best solution. If this is impossible, your life gets a little more complicated. If you can't serialize the A objects, then you have to create them on the executors for each task. If they're stored in a file somewhere, this would look something like:
rdd.mapPartitions { it =>
  val lookupTable = loadLookupTable(path)
  it.map(elem => fn(lookupTable, elem))
}
Note that if you're using this model, then you have to load the lookup table once per task -- you can't benefit from the cross-task persistence of broadcast variables.
EDIT: Here's another model, which I believe lets you share the lookup table across tasks per JVM.
class BroadcastableLookupTable extends Serializable {
  @transient private var lookupTable: LookupTable[A] = _  // not shipped with the broadcast; loaded lazily per executor JVM

  def get: LookupTable[A] = synchronized {
    if (lookupTable == null)
      lookupTable = < load lookup table from disk >
    lookupTable
  }
}
This class can be broadcast (nothing substantial is transmitted) and the first time it's called per JVM, you'll load the lookup table and return it.
In case serialization turns out to be impossible, how about storing the lookup objects in a database? It's not the easiest solution, granted, but it should work just fine. I'd recommend checking e.g. spark-redis, but I am sure there are better solutions out there.
Since A is not serializable, the easiest solution is to create your own serializable type A1 with all the data from A required for the computation. Then use the new lookup table in the broadcast.
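For illustration, a minimal sketch of that idea; A1, loadAllA and process are hypothetical names standing in for your own copy class, loader and processing function:

// A1 is a serializable copy of just the fields of A needed for the computation.
case class A1(id: String, weight: Double)

// Driver side: build the serializable lookup table once and broadcast it.
val lookupTable: Map[String, A1] =
  loadAllA().map(a => a.getId -> A1(a.getId, a.getWeight)).toMap
val lookupBc = sc.broadcast(lookupTable)

// Executor side: use the broadcast value inside the map operation.
val result = rdd.map(elem => process(elem, lookupBc.value))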
I'm trying to perform an isin filter as optimized as possible. Is there a way to broadcast collList using the Scala API?
Edit: I'm not looking for an alternative, I know them, but I need isin so my RelationProviders will push down the values.
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
// collList.size == 200,000
val retTable = df.filter(col("col1").isin(collList: _*))
The list I'm passing to the isin method has up to ~200,000 unique elements.
I know this doesn't look like the best option and a join sounds better, but I need those elements pushed down into the filters; it makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS+Parquet; the base data is too big and queries work on around 1% of that data). I already measured everything, and it saved me around 30 minutes of execution time :). Plus my method already takes care of the case where the isin list is larger than 200,000.
My problem is, I'm getting some Spark "tasks are too big" (~8 MB per task) warnings. Everything works fine, so not a big deal, but I'm looking to remove them and also to optimize.
I've tried the following, which does nothing; I still get the warning (I guess because the broadcast variable gets resolved in Scala and passed to the varargs):
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))
And this one which doesn't compile:
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))
And this one, which doesn't work (the task-too-big warning still appears):
val broadcastedList=df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
val filterBroadcasted=In(col("col1").expr, broadcastedList.value)
val retTable = df.filter(new Column(filterBroadcasted))
Any ideas on how to broadcast this variable? (Hacks allowed.) Any alternative to isin that allows filter pushdown is also valid. I've seen some people doing it in PySpark, but the API is not the same.
PS: Changes to the storage are not possible. I know partitioning (it's already partitioned, but not by that field) and the like could help, but user inputs are totally random and the data is accessed and changed by many clients.
I'd opt for a DataFrame broadcast hash join in this case instead of a broadcast variable.
Prepare a DataFrame with the collectedDf("col1") collection list you want to filter with isin, and then
use a join between the two DataFrames to filter the matching rows.
I think it would be more efficient than isin since you have 200k entries to be filtered. spark.sql.autoBroadcastJoinThreshold is the property you need to set to an appropriate size (10 MB by default). AFAIK you can go up to 200 MB or 300 MB based on your requirements.
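For illustration, a minimal sketch of the join approach, reusing collectedDf and df from the question (the leftsemi join keeps only the rows of df whose col1 appears in the small side):

import org.apache.spark.sql.functions.broadcast

val filterDf = collectedDf.select("col1").distinct()   // the ~200k filter keys as a DataFrame
// broadcast() hints Spark to use a broadcast hash join, shipping the small side to every executor.
val retTable = df.join(broadcast(filterDf), Seq("col1"), "leftsemi")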
See this BHJ explanation of how it works.
Further reading: Spark efficiently filtering entries from big dataframe that exist in a small dataframe.
I'll just live with the big tasks, since I only use this twice (but it saves a lot of time) in my program and I can afford it, but if someone else needs it badly... well, this seems to be the path.
Best alternatives I found to have big-arrays pushdown:
Change your relation provider so it broadcasts big lists when pushing down In filters. This will probably leave some broadcast garbage behind, but as long as your app is not streaming it shouldn't be a problem; alternatively, you can keep the broadcasts in a global list and clean them up after a while.
Add a filter in Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417 ) which allows broadcasted pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom "Pushdown" (you can do this by adding a new rule) and then rewrite your RDD/Relation provider so it can exploit the fact the variable is broadcasted.
Use coalesce(X) after reading to decrease the number of tasks; this can work sometimes, depending on how the RelationProvider/RDD is implemented.
I am persisting some dataframes which are stored in vars. Now when the value of such a var changes, how does persistence work? For example:
var checkedBefore_c = AddressValidation.validateAddressInAI(inputAddressesDF, addressDimTablePath, target_o, target_c, autoSeqColName).distinct.filter(col(CommonConstants.API_QUALITY_RATING) >= minQualityThreshold)
checkedBefore_c.persist(StorageLevel.MEMORY_AND_DISK_SER)
var pre_checkedBefore_c = checkedBefore_c.except(checkedBefore_o)
pre_checkedBefore_c.persist(StorageLevel.MEMORY_AND_DISK_SER)
checkedBefore_c = pre_checkedBefore_c.drop(target_o).drop(autoSeqColName)
.withColumn(target_o, pre_checkedBefore_c(target_c))
.withColumn(CommonConstants.API_STATUS, lit("AI-INSERT"))
.withColumn(CommonConstants.API_ERROR_MESSAGE, lit(""))
checkedBefore_c = CommonUtils.addAutoIncremetColumn(checkedBefore_c, autoSeqColName)
checkedBefore_c = checkedBefore_c.select(addDimWithLoggingSchema.head, addDimWithLoggingSchema.tail: _*)
checkedBefore_c.persist(StorageLevel.MEMORY_AND_DISK_SER)
You are trying to persist checkedBefore_c DataFrame, but in your code you have not called any action.
Brief explanation
Spark has two types of operations: transformations and actions.
Transformation: transformations are lazily evaluated, e.g. map, reduceByKey, etc.
Action: actions are eagerly evaluated, e.g. foreach, count, save, etc.
persist and cache are also lazy operations, so until you invoke an action, the persist/cache will not actually be performed.
For more details please refer to Actions in Spark. You could also refer to this.
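For example, a minimal sketch of that laziness (spark is an existing SparkSession; the DataFrame and column names are illustrative):

import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val df = spark.range(1000000L).toDF("id")
df.persist(StorageLevel.MEMORY_AND_DISK_SER)   // lazy: nothing is cached yet
df.count()                                     // first action: partitions are computed and cached
df.filter(col("id") > 10).count()              // later actions read the cached partitions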
Now, how persist works.
With persist, Spark stores partitions in memory, on disk, or both.
There are various options; for the full list, refer to the org.apache.spark.storage.StorageLevel source code.
Each executor is responsible for storing its own partitions. If a memory option is given, it first tries to fit all partitions in memory; if they do not fit, it evicts old cached data (the cache is LRU). If all the partitions still do not fit in memory, it caches the ones that do fit and leaves out the rest.
If the memory-and-disk option is selected, it first performs all the steps mentioned above and then stores the remaining partitions on local disk.
If the replication factor is two, each partition will be cached on two different executors.
In your case you have passed MEMORY_AND_DISK_SER, which means all objects will be serialized before caching. By default Java serialization is used, but you can override it and use Kryo serialization, which is recommended.
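If you want to switch to Kryo, it is configured on the SparkConf; a minimal sketch (MyRecord is a placeholder for one of your own classes):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("persist-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))   // optional, but makes the serialized data smaller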
My piece of Scala code looks like this:
val orgIncInactive = orgIncLatest.filter("(LD_TMST != '' and LD_TMST is not null)").select("ORG_ID").rdd
orgIncInactive.collect.foreach(p => DenormalizedTablesMethodsUtil.hbaseTablePurge(p(0).toString, tableName, connection))
Is there any way that I can avoid using collect() here?
I tried various possibilities but I am ending up with Serializable errors.
Thanks.
Depends what you are trying to do, and what is ultimately causing the serialization error. It looks like you are trying to pass some kind of database connection into the anonymous function. That's generally going to fail for a couple of reasons. Even if you made the connection object itself serializable -- say by sub-classing the object and implementing Serializable -- database connections are not something you can share between the driver and the executors.
Instead, what you need to do is to create the connection object on each of the executors, and then use the local connection object instead of one defined in the driver. There are a couple of ways to accomplish this.
One is to use mapPartitions, which allows you to instantiate objects locally before the logic is run. See here for more on this.
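Applied to the code in the question, a minimal sketch of that style using foreachPartition (the action counterpart of mapPartitions), with the connection created per partition; createConnection() is a placeholder for your actual HBase connection setup:

orgIncInactive.foreachPartition { rows =>
  val connection = createConnection()   // hypothetical: open one connection per partition, on the executor
  try {
    rows.foreach(p => DenormalizedTablesMethodsUtil.hbaseTablePurge(p(0).toString, tableName, connection))
  } finally {
    connection.close()                  // always release the per-partition connection
  }
}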
Another possibility is to create a singleton object that on initialization sets a connection object to null or None. Then, you would define a method in the object like "getConnection" that checks whether the connection has been initialized. If not, it initializes the connection. Then either way it returns the valid connection.
I use the second approach more than the first, because it limits initialization to only once per executor instead of forcing it to happen once per partition.
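A minimal sketch of that singleton pattern (Connection and createConnection are placeholders for your actual connection type and factory):

object ConnectionProvider {
  private var connection: Connection = _                // one connection per executor JVM

  def getConnection(connectionString: String): Connection = synchronized {
    if (connection == null) {
      connection = createConnection(connectionString)   // hypothetical factory call
    }
    connection
  }
}
// Inside the anonymous function, call ConnectionProvider.getConnection(...) on the executor
// instead of closing over a connection that was created on the driver.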
We are working on a very complex solution using Drools 6 (Fusion) and I would like your opinion about the best way to read objects created as correlation results over time.
My first basic approach was to read the working memory at regular intervals, looking for new objects and reporting them to an external service (REST).
AgendaEventListener does not seem to be the "best" approach because I don't care about most of the objects being inserted into working memory, so maybe the best approach would be to inject a particular "object" into some sort of service inside the DRL. Is this a good approach?
You have quite a lot of options. In decreasing order of my preference:
AgendaEventListener is probably the solution requiring the smallest amount of LOC. It might be useful for other tasks as well; all you have on the negative side is one additional method call and a class test per inserted fact. Peanuts.
You can wrap the insert macro in a DRL function and collect inserted facts of class X in a global List. The problem you have here is that you'll have to pass the KieContext as a second parameter to the function call.
If the creation of a class X object is inevitably linked with its insertion into WM, you could add the registry of new objects into a static List inside class X, to be done in a factory method (or the constructor).
I'm putting your "basic approach" last because it requires many more cycles than the listener (#1) and tons of overhead for maintaining the set of X objects that have already been sent to REST.
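For what it's worth, a minimal Scala sketch of the listener approach (#1), assuming the kie-api runtime event listener interfaces; kieSession, MyResultFact and reportToRest are placeholders for your session, fact class and REST call:

import org.kie.api.event.rule.{DefaultRuleRuntimeEventListener, ObjectInsertedEvent}

kieSession.addEventListener(new DefaultRuleRuntimeEventListener {
  override def objectInserted(event: ObjectInsertedEvent): Unit =
    event.getObject match {
      case fact: MyResultFact => reportToRest(fact)   // the one class test per inserted fact
      case _                  =>                      // ignore everything else
    }
})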
I am trying to implement a Mondrian SegmentCache. The cache is to be shared by multiple JVMs running the Mondrian library. We are using Redis as the backing store, however for the purpose of this question, any persistent key-value store should be fine.
Will the stackoverflow community help complete this implementation? The documentation and Google searches are not yielding enough level of detail. Here we go:
new SegmentCache {
  private val logger = Logger("my-segment-cache")
  import logger._

  import com.redis.serialization.Parse
  import Parse.Implicits.parseByteArray

  private def redis = new RedisClient("localhost", 6379)

  def get(header: SegmentHeader): SegmentBody = {
    val result = redis.get[Array[Byte]](header.getUniqueID) map { bytes ⇒
      val st = new ByteArrayInputStream(bytes)
      val o = new ObjectInputStream(st)
      o.readObject.asInstanceOf[SegmentBody]
    }
    info(s"cache get\nHEADER $header\nRESULT $result")
    result.orNull
  }

  def getSegmentHeaders: util.List[SegmentHeader] = ???

  def put(header: SegmentHeader, body: SegmentBody): Boolean = {
    info(s"cache put\nHEADER $header\nBODY $body")
    val s = new ByteArrayOutputStream
    val o = new ObjectOutputStream(s)
    o.writeObject(body)
    redis.set(header.getUniqueID, s.toByteArray)
    true
  }

  def remove(header: SegmentHeader): Boolean = ???

  def tearDown() {}
  def addListener(listener: SegmentCacheListener) {}
  def removeListener(listener: SegmentCacheListener) {}
  def supportsRichIndex(): Boolean = true
}
Some immediate questions:
is SegmentHeader.getUniqueID the appropriate key to use in the cache?
how should getSegmentHeaders be implemented? The current implementation above just throws an exception, and doesn't ever seem to be called by Mondrian. How do we make the SegmentCache re-use existing cache records on startup?
how are addListener and removeListener meant to be used? I assume they have something to do with coordinating cache changes across nodes sharing the cache. But how?
what should supportsRichIndex return? In general, how does someone implementing a SegmentCache know what value to return?
I feel like these are basic issues that should be covered in the documentation, but they are not (as far as I can find). Perhaps we can correct the lack of available information here. Thanks!
is SegmentHeader.getUniqueID the appropriate key to use in the cache?
Yes and no. The UUID is convenient on systems like memcached, where everything boils down to a key/value match. If you use the UUID, you'll need to implement supportsRichIndex() as false. The reason for this is that excluded regions are not part of the UUID. That's by design, for good reasons.
What we recommend is an implementation that serializes the SegmentHeader (it implements Serializable and hashCode() & equals()) and use that directly as a binary key that you propagate, so that it will retain the invalidated regions and keep everything nicely in sync.
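For illustration, a minimal sketch of building such a binary key inside the cache implementation above (Base64-encoding the serialized header is just one convenient way to turn it into a Redis string key; it is not the only option):

private def headerKey(header: SegmentHeader): String = {
  val bytes = new java.io.ByteArrayOutputStream
  val out = new java.io.ObjectOutputStream(bytes)
  out.writeObject(header)   // works because SegmentHeader implements Serializable
  out.close()
  java.util.Base64.getEncoder.encodeToString(bytes.toByteArray)
}
// get/put/remove would then use headerKey(header) instead of header.getUniqueID,
// so invalidated/excluded regions survive the round trip.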
You should look at how we've implemented it in the default memory cache.
There is also an implementation using Hazelcast.
We at Pentaho have also used Infinispan with great success.
how should getSegmentHeaders be implemented?
Again, take a look at the default in-memory implementation. You simply need to return the list of all the currently known SegmentHeaders. If you can't provide that list for whatever reason, either because you've used the UUID only, or because your storage backend doesn't support obtaining a list (like memcached), you return an empty list. Mondrian won't be able to use in-memory rollup and won't be able to share the segments, unless it hits the right UUIDs in cache.
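A minimal sketch of that bookkeeping, keeping a local index of headers alongside the backing store (names are illustrative; put and remove would add to and remove from this set):

private val knownHeaders: java.util.Set[SegmentHeader] =
  java.util.Collections.newSetFromMap(
    new java.util.concurrent.ConcurrentHashMap[SegmentHeader, java.lang.Boolean]())

def getSegmentHeaders: java.util.List[SegmentHeader] =
  new java.util.ArrayList[SegmentHeader](knownHeaders)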
how are addListener and removeListener meant to be used?
Mondrian needs to be notified when new elements appear in the cache. These could be created by other nodes. Mondrian maintains an index of all the segments it should know about (thus enabling in-memory operations), so that's a way to propagate the updates. You need to bridge the backend with the Mondrian instances here. Take a look at how the Hazelcast implementation does it.
The idea behind this is that Mondrian maintains a spatial index of the currently known cells and will only query the necessary/missing cells from SQL if it absolutely needs to. This is necessary to achieve greater scalability. Fetching cells from SQL is extremely slow compared to objects which we maintain in an in-memory data grid.
How do we make the SegmentCache re-use existing cache records on startup?
This is a caveat. Currently this is possible by applying this patch. It wasn't ported to the master codeline because it is a mess and is tangled with the fixes for another case. It has been reported to work, but wasn't tested internally by us. The relevant code is about here. If you get around to testing this, we always welcome contributions. Let us know if you're interested on the mailing list. There are a ton of people who will gladly help.
One workaround is to update the local index through the listener when your cache implementation starts.