Implementing a Mondrian shared SegmentCache in Scala

I am trying to implement a Mondrian SegmentCache. The cache is to be shared by multiple JVMs running the Mondrian library. We are using Redis as the backing store; for the purposes of this question, though, any persistent key-value store should do.
Will the StackOverflow community help complete this implementation? The documentation and Google searches are not yielding enough detail. Here we go:
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util
import com.redis.RedisClient
import com.typesafe.scalalogging.Logger // assuming scala-logging
import mondrian.spi.{SegmentBody, SegmentCache, SegmentHeader}
import mondrian.spi.SegmentCache.SegmentCacheListener

new SegmentCache {
  private val logger = Logger("my-segment-cache")
  import logger._

  import com.redis.serialization.Parse
  import Parse.Implicits.parseByteArray

  // One connection per call; pooling is omitted for brevity.
  private def redis = new RedisClient("localhost", 6379)

  def get(header: SegmentHeader): SegmentBody = {
    val result = redis.get[Array[Byte]](header.getUniqueID) map { bytes =>
      val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
      try in.readObject.asInstanceOf[SegmentBody] finally in.close()
    }
    info(s"cache get\nHEADER $header\nRESULT $result")
    result.orNull
  }

  def getSegmentHeaders: util.List[SegmentHeader] = ???

  def put(header: SegmentHeader, body: SegmentBody): Boolean = {
    info(s"cache put\nHEADER $header\nBODY $body")
    val bytes = new ByteArrayOutputStream
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(body) finally out.close()
    redis.set(header.getUniqueID, bytes.toByteArray)
    true
  }

  def remove(header: SegmentHeader): Boolean = ???
  def tearDown(): Unit = {}
  def addListener(listener: SegmentCacheListener): Unit = {}
  def removeListener(listener: SegmentCacheListener): Unit = {}
  def supportsRichIndex(): Boolean = true
}
Some immediate questions:
is SegmentHeader.getUniqueID the appropriate key to use in the cache?
how should getSegmentHeaders be implemented? The current implementation above just throws an exception, and doesn't ever seem to be called by Mondrian. How do we make the SegmentCache re-use existing cache records on startup?
how are addListener and removeListener meant to be used? I assume they have something to do with coordinating cache changes across nodes sharing the cache. But how?
what should supportsRichIndex return? In general, how does someone implementing a SegmentCache know what value to return?
I feel like these are basic issues that should be covered in the documentation, but they are not (as far as I can find). Perhaps we can correct the lack of available information here. Thanks!

is SegmentHeader.getUniqueID the appropriate key to use in the cache?
Yes and no. The UUID is convenient on systems like memcached, where everything boils down to a key/value match. If you use the UUID, you'll need to implement supportsRichIndex() as false. The reason for this is that excluded regions are not part of the UUID. That's by design, for good reasons.
What we recommend is an implementation that serializes the SegmentHeader (it implements Serializable, hashCode() and equals()) and uses that directly as a binary key that you propagate, so that it retains the invalidated regions and keeps everything nicely in sync.
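A minimal sketch of that recommendation against the Redis example above: the serialized SegmentHeader itself becomes the binary key, so the invalidated regions survive round-trips and supportsRichIndex() can legitimately return true (headerKey is my name, not Mondrian's):
// Serialize the full header and use the bytes as the cache key.
private def headerKey(header: SegmentHeader): Array[Byte] = {
  val bytes = new ByteArrayOutputStream
  val out = new ObjectOutputStream(bytes)
  try out.writeObject(header) finally out.close()
  bytes.toByteArray
}
// e.g. in put: redis.set(headerKey(header), serializedBody)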
You should look at how we've implemented it in the default memory cache.
There is also an implementation using Hazelcast.
We at Pentaho have also used Infinispan with great success.
how should getSegmentHeaders be implemented?
Again, take a look at the default in-memory implementation. You simply need to return the list of all the currently known SegmentHeaders. If you can't provide that list for whatever reason, either because you've used the UUID only or because your storage backend doesn't support obtaining a list (like memcached), you return an empty list. Mondrian won't be able to use in-memory rollup and won't be able to share the segments, unless it hits the right UUIDs in cache.
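For the Redis example above, a sketch of such an index; knownHeaders is my name, and put/remove are assumed to keep it up to date:
import java.util
import java.util.concurrent.ConcurrentHashMap

// Thread-safe local index of every header this node knows about.
private val knownHeaders: util.Set[SegmentHeader] =
  util.Collections.newSetFromMap(new ConcurrentHashMap[SegmentHeader, java.lang.Boolean]())

def getSegmentHeaders: util.List[SegmentHeader] =
  new util.ArrayList(knownHeaders)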
how are addListener and removeListener meant to be used?
Mondrian needs to be notified when new elements appear in the cache. These could be created by other nodes. Mondrian maintains an index of all the segments it should know about (thus enabling in-memory operations), so that's a way to propagate the updates. You need to bridge the backend with the Mondrian instances here. Take a look at how the Hazelcast implementation does it.
The idea behind this is that Mondrian maintains a spatial index of the currently known cells and will only query the necessary/missing cells from SQL if it absolutely needs to. This is necessary to achieve greater scalability. Fetching cells from SQL is extremely slow compared to objects which we maintain in an in-memory data grid.
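A sketch of the bookkeeping half of that contract; constructing and firing the actual SegmentCacheEvents (and subscribing to the backing store, e.g. via Redis keyspace notifications, so that writes from other nodes are seen) is deliberately left out, since it depends on Mondrian's listener contract:
import java.util.concurrent.ConcurrentHashMap

// Registered listeners; each should be notified whenever a segment is
// created or deleted, locally or on another node.
private val listeners = ConcurrentHashMap.newKeySet[SegmentCacheListener]()

def addListener(listener: SegmentCacheListener): Unit = listeners.add(listener)
def removeListener(listener: SegmentCacheListener): Unit = listeners.remove(listener)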
How do we make the SegmentCache re-use existing cache records on startup?
This is a caveat. Currently this is possible by applying this patch. It wasn't ported to the master codeline because it is a mess and is tangled with the fixes for another case. It has been reported to work, but wasn't tested internally by us. The relevant code is about here. If you get around to testing this, we always welcome contributions. Let us know on the mailing list if you're interested. There are a ton of people who will gladly help.
One workaround is to update the local index through the listener when your cache implementation starts.
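A sketch of that workaround for the Redis example, assuming the keys are serialized SegmentHeaders as recommended above and that the Redis database holds nothing else (warmUp and deserializeHeader are my names):
private def deserializeHeader(bytes: Array[Byte]): SegmentHeader = {
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try in.readObject.asInstanceOf[SegmentHeader] finally in.close()
}

// Call once at startup to rebuild the local index from the shared store.
def warmUp(): Unit =
  redis.keys[Array[Byte]]("*").getOrElse(Nil).flatten
    .foreach(bytes => knownHeaders.add(deserializeHeader(bytes)))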

Is it possible when using MongoTemplate to dynamically set read preference for a particular query?

In our application, we manage a number of MongoTemplate instances, each representing a client database. For the majority of database operations, we want to use the secondaryPreferred read preference in order to leverage our cluster's read replicas and distribute load. However, in at least one case we need to read from the primary to get the most recent data. I don't see any way to override the read preference for this single query. I see this issue on the JIRA board, but it's been open for 6 years and the associated StackOverflow link is dead. Assuming that won't be implemented, I'm trying to figure out some alternate solutions. Does this seem like a correct assessment of the possible options?
1. Create two MongoClients with different read preferences, and use them to create separate sets of MongoTemplates for primary and secondary reads. I'm concerned that this probably doubles the number of connections to the cluster (although perhaps that's not a concern if the additional connections all go to the secondaries).
2. Use the MongoTemplate.setReadPreference() method to temporarily change the read preference before performing the operation, then reset it once finished. This seems vulnerable to race conditions, however.
3. Sidestep the Spring Data framework and use executeCommand() directly, which supports a readPreference argument. This means we'd lose all of the benefits and abstraction of Spring Data and have to manipulate the BSON objects directly.
4. The Query class has a slaveOk() method, but this is the inverse of what I'm looking for, and it seems to be deprecated.
Any further information is appreciated as well. Thanks!
As a workaround, we can override the method prepareCollection(MongoCollection<Document> collection) in MongoTemplate, change the read preference for the needed query alone, and let the rest of the cases follow the default read preference.
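A sketch of that idea in Scala, assuming Spring Data MongoDB 3.x; only prepareCollection itself comes from the library, while the subclass, the ThreadLocal flag, and withReadPreference are illustrative:
import com.mongodb.ReadPreference
import com.mongodb.client.MongoCollection
import org.bson.Document
import org.springframework.data.mongodb.MongoDatabaseFactory
import org.springframework.data.mongodb.core.MongoTemplate

class ReadPreferenceAwareTemplate(factory: MongoDatabaseFactory)
    extends MongoTemplate(factory) {

  // Per-thread override; None means "use the template's default".
  private val preference = new ThreadLocal[Option[ReadPreference]] {
    override def initialValue(): Option[ReadPreference] = None
  }

  // Run `body` with every collection it touches reading from `rp`.
  def withReadPreference[T](rp: ReadPreference)(body: => T): T = {
    preference.set(Some(rp))
    try body finally preference.remove()
  }

  override protected def prepareCollection(
      collection: MongoCollection[Document]): MongoCollection[Document] = {
    val prepared = super.prepareCollection(collection)
    preference.get() match {
      case Some(rp) => prepared.withReadPreference(rp)
      case None     => prepared
    }
  }
}
For the one query that must see the most recent data, something like template.withReadPreference(ReadPreference.primary()) { template.findOne(query, classOf[Account]) } would apply, while everything else keeps following the default.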
As a side note, it seems like slaveOk() does literally nothing:
https://github.com/spring-projects/spring-data-mongodb/blob/f00991dc293dceee172b1ece6613dde599a0665d/spring-data-mongodb/src/main/java/org/springframework/data/mongodb/core/MongoTemplate.java#L3328
switch (option) {
  case NO_TIMEOUT:
    cursorToUse = cursorToUse.noCursorTimeout(true);
    break;
  case PARTIAL:
    cursorToUse = cursorToUse.partial(true);
    break;
  case SECONDARY_READS:
  case SLAVE_OK:
    break;
  default:
    throw new IllegalArgumentException(String.format("%s is no supported flag.", option));
}

Monitoring runtime use of concrete collections

Background:
Our Scala software consists of various components, developed by different teams, that pass Scala collections back and forth. The APIs usually use abstract collections such as Seq[T] and Set[T], and developers are currently essentially free to choose any implementation they like: e.g. when creating new instances, some go with List() or Vector(), others with Seq.empty.
Problem:
Different implementations have different performance characteristics, e.g. List might have been a good choice locally (for one component) because the collection is only sequentially iterated over or modified at the head, but it could have been a poor choice globally, because another component performs loads of random accesses.
Question:
Are there any tools (ideally Scala-specific, though general JVM tools might also be OK) that can monitor runtime use of collections and record the information necessary to detect and report undesirable access or usage patterns of collections?
My feeling is that runtime monitoring would be more fruitful than static analysis (including simple linting) because (i) statically detecting usage patterns in hot code is virtually impossible, and (ii) static analysis would most likely miss collections that are created internally, e.g. when performing complex filter/map/fold/etc. operations on immutable collections.
Edits/Clarifications:
Changing the interfaces to enforce specific types such as List isn't an option; it would also not prevent purely internal use of "wrong" collections/usage patterns.
The goal is identifying a globally optimal (over many runs of the software) collection type rather than locally optimising for each applied algorithm.
You don't need linting for this, let alone runtime monitoring. This is exactly what having a strictly-typed language does for you out of the box. If you want to ensure a particular collection type is passed to the API, just declare that the API accepts that collection type (e.g., def foo(x: Stream[Bar]), not def foo(x: Seq[Bar]), etc.).
Alternatively, when practical, just convert to the desired type as part of implementation: def foo(x: List[Bar]) = { val y = x.toArray ; lotsOfRandomAccess(y); }
Collections that are "internally created" are typically the same type as the parent object: List(1,2,3).map(_ + 1) returns a List etc.
Again, if you want to ensure you are using a particular type, just say so:
val mapped: List[Int] = List(1,2,3).map(_ + 1)
You can actually change the type this way if there is a need for that (this relies on scala.collection.breakOut, available before Scala 2.13):
import scala.collection.breakOut
val mappedStream: Stream[Int] = List(1,2,3).map(_ + 1)(breakOut)
As discussed in the comments, this is a problem that needs to be solved at a local level rather than via global optimisation.
Each algorithm in the system will work best with a particular data type, so using a single global structure will never be optimal. Instead, each algorithm should ensure that the incoming data is in a format that can be processed efficiently. If it is not in the right format, the data should be converted to a better format as the first part of the process. Since the algorithm works better on the right format, this conversion is always a performance improvement.
The output data format is more of a problem if the system does not know which algorithm will be used next. The solution is to use the most efficient output format for the algorithm in question, and rely on other algorithms to re-format the data if required.
If you do want to monitor the whole system, it would be better to track the algorithms rather than the collections. If you monitor which algorithms are called and in which order you can create multiple traces through the code. You can then play back those traces with different algorithms and data structures to see which is the most efficient configuration.
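As a toy sketch of that trace idea (every name here is illustrative, not an existing tool): record which algorithm ran on how much data, then replay the recorded sequence against different collection types to find the globally best configuration.
import scala.collection.mutable.ArrayBuffer

// One trace entry per algorithm invocation.
final case class TraceEntry(algorithm: String, inputSize: Int, nanos: Long)

object AlgorithmTrace {
  private val entries = ArrayBuffer.empty[TraceEntry]

  // Time `body` and record which algorithm ran on how much data.
  def timed[A](name: String, inputSize: Int)(body: => A): A = {
    val start = System.nanoTime()
    try body
    finally entries.synchronized {
      entries += TraceEntry(name, inputSize, System.nanoTime() - start)
    }
  }

  def dump(): Seq[TraceEntry] = entries.synchronized(entries.toList)
}

// e.g.: AlgorithmTrace.timed("rank", xs.size) { rank(xs) }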

Scala Event Sourcing with Kafka

For a microservice I need the functionality to persist state (changes). Essentially, the following happens:
case class Item(i: Int)
val item1 = Item(0)
val item2 = exec(item1)
Where exec is user defined and hence not known in advance. As an example, let's assume this implementation:
def exec(item: Item) = item.copy(i = item.i + 1)
After each call to exec, I want to log the state changes (here: item.i: 0 -> 1) so that:
there is a history (e.g. a list of tuples like (timestamp, what changed, old value, new value); see the sketch after this list)
state changes and snapshots can be persisted efficiently to a local file system and sent to a journal
arbitrary consumers (not only the specific producer where the changes originated) can be restored from the journal/snapshots
as few dependencies on libraries and infrastructure as possible (it is a small project; complex infrastructure/server installations & maintenance are not possible)
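To make the first requirement concrete, here is a sketch of what such change records could look like; Change and logged are illustrative names, not from any library:
// Illustrative change record: when, which field, and the old/new values.
final case class Change(
  timestamp: Long,
  field: String,
  oldValue: Any,
  newValue: Any
)

// Wrap a user-supplied exec and capture the diff alongside the new state.
def logged(item: Item)(exec: Item => Item): (Item, Seq[Change]) = {
  val next = exec(item)
  val changes =
    if (next.i != item.i)
      Seq(Change(System.currentTimeMillis(), "i", item.i, next.i))
    else Seq.empty[Change]
  (next, changes)
}
The resulting Seq[Change] is what would be appended to the journal (e.g. a Kafka topic) after each call.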
I know that the EventStore DB is probably the best solution; however, in the given environment (a huge enterprise with a lot of policies), it is not possible for me to install and run it. The only infrastructural options are an RDBMS or Kafka. I'd like to go with Kafka, as it seems to be the natural fit for this event sourcing use case.
I also noticed that Akka Persistence seems to handle all of the requirements well. But I have a couple of questions:
Are there any alternatives I missed?
Akka Persistence's Kafka integration is only available through a community plugin that is not maintained regularly. This suggests it is not a common use case. Is there any reason the outlined architecture is not widespread?
Is cloning possible? In the Akka documentation it says:
"So, if two different entities share the same persistenceId,
message-replaying behavior is corrupted."
So, let's assume two application instances, one and two, both have unique persistenceIds. Could two be restored (cloned) from one's journal? Even if they don't share the same Id (which is not allowed)?
Are there any complete examples of this architecture available?

Object cache on Spark executors

A good question for Spark experts.
I am processing data in a map operation (RDD). Within the mapper function, I need to look up objects of class A to be used in processing the elements of an RDD.
Since this will be performed on executors, and since creating elements of type A (which will be looked up) is an expensive operation, I want to pre-load and cache these objects on each executor. What is the best way of doing it?
One idea is to broadcast a lookup table, but class A is not serializable (no control over its implementation).
Another idea is to load them up in a singleton object. However, I want to control what gets loaded into that lookup table (e.g. possibly different data on different Spark jobs).
Ideally, I want to specify what will be loaded on executors once (including the case of Streaming, so that the lookup table stays in memory between batches), through a parameter that will be available on the driver during its start-up, before any data gets processed.
Is there a clean and elegant way of doing it or is it impossible to achieve?
This is exactly the targeted use case for broadcast. Broadcast variables are transmitted once, use a torrent-like protocol to move efficiently to all executors, and stay in memory / on local disk until you no longer need them.
Serialization often pops up as an issue when using others' interfaces. If you can enforce that the objects you consume are serializable, that's going to be the best solution. If this is impossible, your life gets a little more complicated. If you can't serialize the A objects, then you have to create them on the executors for each task. If they're stored in a file somewhere, this would look something like:
rdd.mapPartitions { it =>
  val lookupTable = loadLookupTable(path)
  it.map(elem => fn(lookupTable, elem))
}
Note that if you're using this model, then you have to load the lookup table once per task -- you can't benefit from the cross-task persistence of broadcast variables.
EDIT: Here's another model, which I believe lets you share the lookup table across tasks per JVM.
class BroadcastableLookupTable[A] extends Serializable {
  // Not serialized; starts out null in each executor JVM.
  @transient private var lookupTable: LookupTable[A] = _

  def get: LookupTable[A] = synchronized {
    if (lookupTable == null)
      lookupTable = ??? // < load lookup table from disk >
    lookupTable
  }
}
This class can be broadcast (nothing substantial is transmitted) and the first time it's called per JVM, you'll load the lookup table and return it.
In case serialisation turns out to be impossible, how about storing the lookup objects in a database? It's not the easiest solution, granted, but it should work just fine. I could recommend checking e.g. spark-redis, but I am sure there are better solutions out there.
Since A is not serializable, the easiest solution is to create your own serializable type A1 with all the data from A required for the computation, then use the new lookup table in a broadcast.
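A sketch of that wrapper idea; A1's fields, A's accessors, loadAll and fn are assumptions about your domain, not a fixed recipe:
// Case classes are Serializable by default; carry only what fn needs.
final case class A1(key: String, weight: Double)

// Assumed accessors on the non-serializable A.
def toA1(a: A): A1 = A1(a.getKey, a.getWeight)

// Build the table once on the driver, then ship it via broadcast.
val lookup: Map[String, A1] = loadAll().map(a => a.getKey -> toA1(a)).toMap
val bcLookup = sc.broadcast(lookup)

val results = rdd.map(elem => fn(bcLookup.value, elem))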

Cassandra auto create table code in production

Is it good practice to keep the table generation in your production code?
I'm referring to:
Await.ready(database.autocreate().future(), 2 minutes)
Are there any potential issues with leaving this in? Just looking for some explanation on if it is a good idea or not.
Is it better to keep this type of work outside in some sort of a script to run during initial rollout and migrations?
I fully and very strongly disagree with all the advice given above. The whole point of phantom is to never have to write CQL, and we have complete mechanisms that allow you to control how your schema gets initialised, inclusive of all possible properties.
Have a look at the tests here or at the default Cassandra initialisation, there's pretty much nothing you can't do.
Custom settings in autocreation
If you want to provide all those params instead of defaults during database.autocreate, that's really simple too:
class MyTable extends CassandraTable[MyTable, MyRecord] {
  override def autocreate(
    keySpace: KeySpace
  ): CreateQuery.Default[MyTable, MyRecord] = {
    create.ifNotExists()(keySpace)
      .`with`(compaction eqs LeveledCompactionStrategy.sstable_size_in_mb(50))
      .and(compression eqs LZ4Compressor.crc_check_chance(0.5))
  }
}
Later when you do this:
class MyDB(override val connector: KeySpaceDef) extends Database {
  object myTable extends MyTable with connector.Connector
}
And you do:
val database = new MyDB(ContactPoint.local.keySpace("whatever"))
When you run database.createAsync or database.create, all the settings you defined above will be respected.
Custom keyspace autocreation
Phantom also supports specifying custom keyspace initialisation queries during keyspace autogeneration.
val init = KeySpaceSerializer("my_app").ifNotExists()
  .`with`(replication eqs SimpleStrategy.replication_factor(2))
  .and(durable_writes eqs true)

val connector = ContactPoint.local.keySpace(
  "my_app",
  (session: Session, space: KeySpace) => init.queryString
)
This way you can benefit from any known form of customisation you can think of while still not having to deal with CQL. If you use phantom-pro, which will shortly be available for subscription, there will also be automated schema migration capability, so holding your schema in any kind of CQL is a very big no-no.
Phantom also transparently handles CQL variations between Cassandra versions. I've never seen a bash script that does that, so you can get into unpleasant territory quite quickly with a simple Cassandra upgrade/downgrade; and why would you, when you can just automate things?
The table creation/modification logic may not be useful after installation or upgrade, and it may not be safe or necessary to keep it in the production code. So keep your code/scripts at the bootstrap or installer level.
As already pointed out, from my experience this is not something you would like to have.
I have been using phantom in production for more than a year, and the only place I left table creation automated was inside my tests, running with an embedded Cassandra.
You can find out more here: https://github.com/iamthiago/cassandra-phantom/blob/master/src/test/scala/com/cassandra/phantom/modeling/test/service/SongsTest.scala
To push it further, there is a similar discussion with hibernate. You can take a look here: Hibernate/JPA DB Schema Generation Best Practices