Cassandra auto create table code in production - scala

Is it a good practise to keep the table generation in your production code?
I'm referring to:
Await.ready(database.autocreate().future(), 2 minutes)
Are there any potential issues with leaving this in? I'm just looking for some explanation of whether or not it is a good idea.
Is it better to keep this type of work outside in some sort of a script to run during initial rollout and migrations?

I fully and very strongly disagree with all the advice given above. The whole point of phantom is to never have to write CQL, and we have complete mechanisms that allow you to control how your schema gets initialised, including all possible properties.
Have a look at the tests here or at the default Cassandra initialisation; there's pretty much nothing you can't do.
Custom settings in autocreation
If you want to provide all those parameters instead of the defaults during database.autocreate, that's really simple too:
class MyTable extends CassandraTable[MyTable, MyRecord] {
  override def autocreate(
    keySpace: KeySpace
  ): CreateQuery.Default[MyTable, MyRecord] = create.ifNotExists()(keySpace)
    .`with`(compaction eqs LeveledCompactionStrategy.sstable_size_in_mb(50))
    .and(compression eqs LZ4Compressor.crc_check_chance(0.5))
}
Later when you do this:
class MyDB(override val connector: KeySpaceDef) extends Database {
  object myTable extends MyTable with connector.Connector
}
And you do:
val database = new MyDB(ContactPoint.local.keySpace("whatever"))
When you run database.createAsync or database.create, all the settings you defined above will be respected.
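For instance, a minimal bootstrap sketch (the blocking Await and the timeout are my own choices, not something phantom requires):
import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Run once at startup, before serving traffic. The generated CREATE statements
// carry the ifNotExists, compaction and compression options declared on MyTable.
Await.ready(database.createAsync(), 1.minute)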
Custom keyspace autocreation
Phantom also supports specifying custom keyspace initialisation queries during keyspace autogeneration.
val init = KeySpaceSerializer("my_app").ifNotExists()
  .`with`(replication eqs SimpleStrategy.replication_factor(2))
  .and(durable_writes eqs true)

val connector = ContactPoint.local.keySpace(
  "my_app",
  (session: Session, space: KeySpace) => init.queryString
)
This way you can benefit from any form of customisation you can think of while still not having to deal with CQL. If you use phantom-pro, which will shortly be available for subscription, there will also be automated schema migration capability, so holding your schema in any kind of CQL is a very big no-no.
Phantom also transparently handles CQL variations between Cassandra versions. I've never seen a bash script that does that, so you can run into unpleasant surprises quite quickly with a simple Cassandra upgrade or downgrade. And why would you, if you can just automate things?

The table creation/modification logic is typically only useful during installation or upgrade. Beyond that, it may be neither safe nor necessary to keep it in the production code. So keep such code/scripts at the bootstrap or installer level.

As already pointed out, in my experience this is not something you would like to have.
I have been using phantom in production for more than a year, and the only place where I let table creation happen automatically is inside my tests, which run against an embedded Cassandra.
You can find out more here: https://github.com/iamthiago/cassandra-phantom/blob/master/src/test/scala/com/cassandra/phantom/modeling/test/service/SongsTest.scala
To push it further, there is a similar discussion with hibernate. You can take a look here: Hibernate/JPA DB Schema Generation Best Practices

Related

Customise generated Slick SQL for debugging

I want to customise the SQL that Slick generates for a standard insert before it is sent to the DBMS, so that I can add extra DBMS-specific debugging options that Slick doesn't natively support. How can I do that?
At the action level (i.e., with a DBIO), you can replace the SQL Slick will use via overrideStatements. Combined with statements to access the SQL Slick generates, that would give you a place to jump in and customize the SQL.
Bear in mind that you'll be working with Strings with these two API calls.
A simple example would be:
val regularInsert = table += row
// Switching the generated SQL to all-caps is a terrible idea,
// and may not run in your database, but it will do as an example:
val modifiedSQL = regularInsert.statements.map(_.toUpperCase())
val modifiedInsert = regularInsert.overrideStatements(modifiedSQL)
// run modifiedInsert action as normal
The next step up from this would be to implement a custom database profile to override the way inserts are created to include debugging.
This is more involved: you'd want to extend the profile you're currently using, and dive into the Slick APIs to override various methods to change the insert behaviour. For example, you might start by exploring the existing Postgres profile if that's the database you're using.
However, the above example can be applied per-insert as needed which may be enough for what you need.
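For instance, a small reusable helper along those lines (the helper name and the comment-prefix idea are mine; only statements and overrideStatements come from Slick, assuming Slick 3.2+):
import slick.dbio.{DBIOAction, Effect, NoStream}
import slick.sql.SqlAction

// Prefixes every statement an SQL-backed action would run with an SQL comment,
// e.g. a tracing tag that DBMS-side tooling can pick up.
def withSqlComment[R, S <: NoStream, E <: Effect](
    action: SqlAction[R, S, E],
    comment: String
): DBIOAction[R, S, E] =
  action.overrideStatements(action.statements.map(sql => s"/* $comment */ $sql"))

// Usage (table and row as in the example above):
// db.run(withSqlComment(table += row, "debug: insert path"))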
If you are using a connection pool such as HikariCP, you can put a Java breakpoint on the ProxyConnection.prepareStatement(String sql) method, or the equivalent method in whatever connection pool library you are using. Then when the SQL of interest is about to be prepared by that method, use your debugger's "evaluate expression" functionality to modify/replace the value of sql.
This won't work if the library you are setting the breakpoint on is not open source, or for some other reason is compiled without debugging information.

Is it possible when using MongoTemplate to dynamically set read preference for a particular query?

In our application, we manage a number of MongoTemplate instances, each representing a client database. For the majority of database operations, we want to use the secondaryPreferred read preference in order to leverage our cluster's read replicas and distribute load. However, in at least one case we need to read from the primary to get the most recent data. I don't see any way to override the read preference for this single query. I see this issue on the JIRA board, but it's been open for 6 years and the associated StackOverflow link is dead. Assuming that won't be implemented, I'm trying to figure out some alternate solutions. Does this seem like a correct assessment of the possible options?
Create two MongoClients with the different read preferences, and use them to create a separate set of MongoTemplates for primary and secondary reads. I'm concerned that this probably creates double the number of connections to the cluster (although perhaps it's not a concern, if the additional connections all go to the secondaries).
Use the MongoTemplate.setReadPreference() method to temporarily change the read preference before performing the operation, then reset it once finished. It seems like this would be vulnerable to race conditions, however.
Sidestep the Spring Data framework and use executeCommand() directly, which supports a readPreference argument. This means we'd lose all of the benefits and abstraction of Spring Data and have to manipulate the BSON objects directly.
The Query class has a slaveOk() method, but this is the inverse of what I'm looking for and it seems like it's deprecated.
Any further information is appreciated as well. Thanks!
As a workaround, we can override the prepareCollection(MongoCollection<Document> collection) method in MongoTemplate (refer: here), change the read preference for just the queries that need it, and let the rest follow the default read preference. For example:
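A sketch of that idea in Scala (the subclass name, the ThreadLocal trick and the factory/converter constructor are my own; prepareCollection and MongoCollection.withReadPreference are the Spring Data / driver hooks being relied on, and exact constructor and factory types vary across Spring Data versions):
import com.mongodb.ReadPreference
import com.mongodb.client.MongoCollection
import org.bson.Document
import org.springframework.data.mongodb.MongoDbFactory // MongoDatabaseFactory in newer Spring Data versions
import org.springframework.data.mongodb.core.MongoTemplate
import org.springframework.data.mongodb.core.convert.MongoConverter

class ReadPreferenceAwareMongoTemplate(factory: MongoDbFactory, converter: MongoConverter)
    extends MongoTemplate(factory, converter) {

  // Per-thread override so concurrent requests don't interfere with each other.
  private val preferenceOverride = new ThreadLocal[ReadPreference]

  // Runs `body` with every collection used inside it prepared with the given read preference.
  def withReadPreference[T](preference: ReadPreference)(body: => T): T = {
    preferenceOverride.set(preference)
    try body finally preferenceOverride.remove()
  }

  override def prepareCollection(collection: MongoCollection[Document]): MongoCollection[Document] = {
    val prepared = super.prepareCollection(collection)
    Option(preferenceOverride.get()) match {
      case Some(pref) => prepared.withReadPreference(pref) // only this thread's queries are overridden
      case None       => prepared                          // everything else keeps the default read preference
    }
  }
}

// Usage (query and MyEntity are placeholders):
// template.withReadPreference(ReadPreference.primary()) { template.findOne(query, classOf[MyEntity]) }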
As a side note, it seems like slaveOk() does literally nothing:
https://github.com/spring-projects/spring-data-mongodb/blob/f00991dc293dceee172b1ece6613dde599a0665d/spring-data-mongodb/src/main/java/org/springframework/data/mongodb/core/MongoTemplate.java#L3328
switch (option) {
    case NO_TIMEOUT:
        cursorToUse = cursorToUse.noCursorTimeout(true);
        break;
    case PARTIAL:
        cursorToUse = cursorToUse.partial(true);
        break;
    case SECONDARY_READS:
    case SLAVE_OK:
        break;
    default:
        throw new IllegalArgumentException(String.format("%s is no supported flag.", option));
}

Scala Event Sourcing with Kafka

For a microservice I need the functionality to persist state (changes). Essentially, the following happens:
case class Item(i: Int)
val item1 = Item(0)
val item2 = exec(item1)
Where exec is user defined and hence not known in advance. As an example, let's assume this implementation:
def exec(item: Item) = item.copy(i = item.i + 1)
After each call to exec, I want to log the state changes (here: item.i: 0 -> 1) so that... (a small sketch of such a change event follows the list below)
there is a history (e.g. list of tuples like (timestamp, what has changed, old value, new value))
state changes and snapshots could be persisted efficiently to a local file system and sent to a journal
arbitrary consumers (not only the specific producer where the changes originated) could be restored from the journal/snapshots
as few dependencies on libraries and infrastructure as possible (it is a small project; complex infrastructure/server installation & maintenance is not possible)
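For illustration, here is a minimal sketch of what I mean by a change event (all names are mine; no library involved):
import java.time.Instant

// Uses the Item(i: Int) case class defined above.
// One history entry: when, what changed, old value, new value.
final case class ChangeEvent(timestamp: Instant, field: String, oldValue: String, newValue: String)

// Wraps a user-supplied exec and records the resulting change, if any.
def execLogged(item: Item)(exec: Item => Item): (Item, List[ChangeEvent]) = {
  val next = exec(item)
  val events =
    if (next.i != item.i) List(ChangeEvent(Instant.now(), "i", item.i.toString, next.i.toString))
    else Nil
  (next, events)
}

// execLogged(Item(0))(item => item.copy(i = item.i + 1))
// => (Item(1), List(ChangeEvent(..., "i", "0", "1")))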
I know that the EventStore DB is probably the best solution; however, in the given environment (a huge enterprise with a lot of policies), it is not possible for me to install & run it. The only infrastructural options are an RDBMS or Kafka. I'd like to go with Kafka as it seems to be the natural fit for this event sourcing use case.
I also noticed that Akka Persistence seems to handle all of the requirements well. But I have a couple of questions:
Are there any alternatives I missed?
Akka Persistence's Kafka integration is only available through a community plugin that is not maintained regularly. It seems to me that this is not a common use case. Is there any reason the outlined architecture is not widespread?
Is cloning possible? In the Akka documentation it says:
"So, if two different entities share the same persistenceId,
message-replaying behavior is corrupted."
So, let's assume two application instances, one and two, both have unique persistenceIds. Could two be restored (cloned) from one's journal? Even if they don't share the same Id (which is not allowed)?
Are there any complete examples of this architecture available?

Getting Spring Data MongoDB to recreate indexes

I'd like to be able to purge the database of all data between integration test executions. My first thought was to use an org.springframework.test.context.support.AbstractTestExecutionListener registered using the @TestExecutionListeners annotation to perform the necessary cleanup between tests.
In the afterTestMethod(TestContext testContext) method I tried getting the database from the test context and using the com.mongodb.DB.drop() method. This worked OK, apart from the fact that it also destroys the indexes that were automatically created by Spring Data when it first bound my managed @Document objects.
For now I have fixed this by resorting to iterating through the collection names and calling remove as follows:
for (String collectionName : database.getCollectionNames()) {
    if (collectionIsNotASystemCollection(collectionName)) {
        database.getCollection(collectionName).remove(new BasicDBObject());
    }
}
This works and achieves the desired result - but it'd be nice if there was a way I could simply drop the database and just ask Spring Data to "rebind" and perform the same initialisation that it did when it started up to create all of the necessary indexes. That feels a bit cleaner and safer...
I tried playing around with the org.springframework.data.mongodb.core.mapping.MongoMappingContext but haven't yet managed to work out if there is a way to do what I want.
Can anyone offer any guidance?
See this ticket for an explanation of why it currently works the way it does, and why working around this issue creates more problems than it solves.
Suppose you were working with Hibernate and triggered a call to delete the database: would you even dream of assuming that the tables and all indexes would magically reappear? If you drop a MongoDB database/collection you remove all metadata associated with it. Thus, you need to set it up the way you'd like it to work.
P.S.: I am not sure we did ourselves a favor by adding automatic indexing support, as this of course triggers the expectations that you now have :). Feel free to comment on the ticket if you have suggestions on how this could be achieved without the downsides I outlined in my initial comment.

Implementing a Mondrian shared SegmentCache

I am trying to implement a Mondrian SegmentCache. The cache is to be shared by multiple JVMs running the Mondrian library. We are using Redis as the backing store; however, for the purposes of this question, any persistent key-value store should be fine.
Will the Stack Overflow community help complete this implementation? The documentation and Google searches are not yielding enough detail. Here we go:
new SegmentCache {
  private val logger = Logger("my-segment-cache")
  import logger._

  import com.redis.serialization.Parse
  import Parse.Implicits.parseByteArray

  private def redis = new RedisClient("localhost", 6379)

  def get(header: SegmentHeader): SegmentBody = {
    val result = redis.get[Array[Byte]](header.getUniqueID) map { bytes ⇒
      val st = new ByteArrayInputStream(bytes)
      val o = new ObjectInputStream(st)
      o.readObject.asInstanceOf[SegmentBody]
    }
    info(s"cache get\nHEADER $header\nRESULT $result")
    result.orNull
  }

  def getSegmentHeaders: util.List[SegmentHeader] = ???

  def put(header: SegmentHeader, body: SegmentBody): Boolean = {
    info(s"cache put\nHEADER $header\nBODY $body")
    val s = new ByteArrayOutputStream
    val o = new ObjectOutputStream(s)
    o.writeObject(body)
    redis.set(header.getUniqueID, s.toByteArray)
    true
  }

  def remove(header: SegmentHeader): Boolean = ???

  def tearDown() {}

  def addListener(listener: SegmentCacheListener) {}
  def removeListener(listener: SegmentCacheListener) {}

  def supportsRichIndex(): Boolean = true
}
Some immediate questions:
is SegmentHeader.getUniqueID the appropriate key to use in the cache?
how should getSegmentHeaders be implemented? The current implementation above just throws an exception, and doesn't ever seem to be called by Mondrian. How do we make the SegmentCache re-use existing cache records on startup?
how are addListener and removeListener meant to be used? I assume they have something to do with coordinating cache changes across nodes sharing the cache. But how?
what should supportsRichIndex return? In general, how does someone implementing a SegmentCache know what value to return?
I feel like these are basic issues that should be covered in the documentation, but they are not (as far as I can find). Perhaps we can correct the lack of available information here. Thanks!
is SegmentHeader.getUniqueID the appropriate key to use in the cache?
Yes and no. The UUID is convenient on systems like memcached, where everything boils down to a key/value match. If you use the UUID, you'll need to implement supportsRichIndex() as false. The reason for this is that excluded regions are not part of the UUID. That's by design, for good reasons.
What we recommend is an implementation that serializes the SegmentHeader (it implements Serializable, hashCode() and equals()) and uses that serialized form directly as a binary key that you propagate, so that it will retain the invalidated regions and keep everything nicely in sync.
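For example, a small helper in the spirit of the snippet above (my own code, assuming the SPI types live in mondrian.spi) that could replace header.getUniqueID as the Redis key:
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import mondrian.spi.SegmentHeader

// SegmentHeader is Serializable, so its serialized form can serve as the binary
// cache key; unlike getUniqueID, it retains the excluded/invalidated regions.
def headerKey(header: SegmentHeader): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out   = new ObjectOutputStream(bytes)
  try out.writeObject(header) finally out.close()
  bytes.toByteArray
}

// e.g. in put: redis.set(headerKey(header), serializedBody); use the same key in get/remove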
You should look at how we've implemented it in the default memory cache.
There is also an implementation using Hazelcast.
We at Pentaho have also used Infinispan with great success.
how should getSegmentHeaders be implemented?
Again, take a look at the default in-memory implementation. You simply need to return the list of all the currently known SegmentHeaders. If you can't provide that list for whatever reason, either because you've only used the UUID, or because your storage backend doesn't support obtaining a list of keys, like memcached, you return an empty list. Mondrian won't be able to use in-memory rollup and won't be able to share the segments unless it hits the right UUIDs in cache.
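A rough sketch of that advice (the local index class is my own addition, not part of the SPI): track the headers your implementation has stored or been told about, and return those, or an empty list when nothing is known:
import java.util
import mondrian.spi.SegmentHeader

// A tiny, thread-safe local index of the headers this node knows about.
// Feed it from put()/remove() and from the SegmentCacheListener, then return
// its contents from getSegmentHeaders (an empty list is also acceptable).
final class LocalHeaderIndex {
  private val headers = java.util.concurrent.ConcurrentHashMap.newKeySet[SegmentHeader]()

  def add(header: SegmentHeader): Unit  = { headers.add(header); () }
  def drop(header: SegmentHeader): Unit = { headers.remove(header); () }
  def all: util.List[SegmentHeader]     = new util.ArrayList[SegmentHeader](headers)
}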
how are addListener and removeListener meant to be used?
Mondrian needs to be notified when new elements appear in the cache. These could be created by other nodes. Mondrian maintains an index of all the segments it should know about (thus enabling in-memory operations), so that's a way to propagate the updates. You need to bridge the backend with the Mondrian instances here. Take a look at how the Hazelcast implementation does it.
The idea behind this is that Mondrian maintains a spatial index of the currently known cells and will only query the necessary/missing cells from SQL if it absolutely needs to. This is necessary to achieve greater scalability. Fetching cells from SQL is extremely slow compared to objects which we maintain in an in-memory data grid.
How do we make the SegmentCache re-use existing cache records on startup?
This is a caveat. Currently this is possible by applying this patch. It wasn't ported to the master codeline because it is a mess and is tangled with the fixes for another case. It has been reported to work, but wasn't tested internally by us. The relevant code is about here. If you get around to testing this, we always welcome contributions. Let us know on the mailing list if you're interested. There are a ton of people who will gladly help.
One workaround is to update the local index through the listener when your cache implementation starts.