Is there any way we can monitor all data modifications in intersystems cache? - intersystems-cache

I'm new to InterSystems Caché. We have an old system that uses a Caché database. We want to first extract and transform all of its data into a different database such as PostgreSQL, and then monitor all modifications to the original Caché data so that we can apply them (inserts or updates) to the transformed data in PostgreSQL in a timely manner.
Is there any way we can monitor all data modifications in Caché?
Does Caché have a modification/replication log like MongoDB's oplog?
Any idea would be appreciated, thanks!

In short, yes, there is a way to monitor data modifications. Caché uses journaling for several purposes, mostly related to keeping data consistent: it is used to roll back transactions, to restore data after an unexpected shutdown, for backups, for mirroring, and so on. Depending on the situation, journals may contain more or fewer records.
I think it may help in your situation. Most older Caché applications do not use objects and SQL tables and work directly with globals. Journals contain only globals, so if your Caché application does use objects and tables, you need to know where and how they store their data in globals. With the journal API you can read every change to the data. All changes are recorded, and if the application uses transactions, the records carry flags for that as well. Each record changes a single value and is either a set or a kill.
Be aware that Caché purges outdated journal files according to its settings, so you may have to increase the number of days after which they are purged.
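To make the idea concrete, here is a minimal sketch of replaying set/kill records against a target store while honouring transaction flags. The `JournalRecord` shape, the operation names, and the `replay` function are all assumptions made for illustration; they are not the real Caché journal format or API, and a plain dict stands in for the PostgreSQL side.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class JournalRecord:
    """Hypothetical shape of one journal entry (not the real Caché format)."""
    op: str               # "SET", "KILL", "BEGIN", "COMMIT" or "ROLLBACK"
    global_ref: str = ""  # e.g. '^Person(42,"name")'
    value: Any = None     # new value for SET, unused otherwise


def _apply(rec: JournalRecord, target: Dict[str, Any]) -> None:
    if rec.op == "SET":
        target[rec.global_ref] = rec.value
    elif rec.op == "KILL":
        # Simplification: only the exact node is removed here. A real replay
        # would also have to remove the node's descendants.
        target.pop(rec.global_ref, None)


def replay(records: List[JournalRecord], target: Dict[str, Any]) -> None:
    """Apply records in order; changes inside a transaction wait for COMMIT."""
    pending: Optional[List[JournalRecord]] = None  # None = not in a transaction
    for rec in records:
        if rec.op == "BEGIN":
            pending = []
        elif rec.op == "COMMIT":
            for buffered in pending or []:
                _apply(buffered, target)
            pending = None
        elif rec.op == "ROLLBACK":
            pending = None  # discard everything buffered since BEGIN
        elif pending is not None:
            pending.append(rec)
        else:
            _apply(rec, target)
```

In a real setup the records would come from the journal files and `_apply` would translate each global reference into an insert, update, or delete on the matching PostgreSQL row.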

Related

How to do caching if you can't afford misses at all?

I'm developing an app that is processing incoming data and currently needs to hit the database for each incoming datapoint. The problem is twofold:
the database can't keep up with the load
the database returns results for less than 5% of the queries
The first idea is to cache the data from the relational database into something like Redis to improve lookup speed. But all the regular caching strategies rely on the fact that you can fall back to the database if needed and fetch data from there. This is problematic in my case because for 95% of the queries there is nothing in the database and I don't have anything to store in the cache. I can of course store the empty results in the cache but that would mean that 95% (or even more, depending on the composition of data) of my cache storage would be rubbish.
The preferred way to do it would be to implement a caching system that doesn't have any misses: everything from the database is always present in the cache and therefore if it's not in the cache, then it's not in the database. After looking around though I found that the consistency of Redis does not seem reliable enough to always make that assumption - if the key doesn't exist in Redis, how can I be 100% sure that it doesn't exist in the database (assuming that we're not in the midst of an update)? It is a strong requirement that if there is a row in the database about an incoming datapoint, then it needs to be found and can't just be missed out on.
How do I go about designing a caching system that will always have the same data as the relational database - without having a fallback to look the data up in the database? Redis might not be the tool but what would you recommend? Is there a pattern or a keyword that I should look up that I haven't thought of?
There already is such a cache in the database: shared buffers. So all you have to do is to set shared_buffers big enough to contain the whole database and restart. Soon the whole database will be cached, and reading will cause no more I/O and will be fast.
That also works if you cannot cache the whole database, as long as you only need to access part of it: PostgreSQL will then just cache those 8kB-pages that are in use.
In my opinion, adding another external caching system can never do better than that. That is particularly true if data are ever modified: any external caching system would have to make sure that its data are not stale, which would introduce an additional overhead.

MongoDB disk space reclamation

I am familiar both with the MongoDB repairDatabase and compact commands, but these both seem to lock the database and/or collection. Is there another way to reclaim deleted disk space without essentially shutting down the database? What are best practices in this area? Thanks!
Best practice probably depends on your schema and on what your application does. Here is my use case; perhaps you can learn something from it. My application stores very large amounts of timestamped data samples. Deleting data from a very large store is a very expensive operation, and it gets more complicated when you try to do it on live systems. MongoDB had several issues in the past with returning disk space to the OS and we had to dance around them; I'm not sure how well it works now. What we did solved the problem for good: we partitioned the data in such a way that we could dispose of old data by simply dropping an entire database. Dropping a MongoDB database is a very cheap and efficient operation, almost instantaneous even when you drop a terabyte. Note that dropping a collection is not as effective as dropping a database; this was actually the key to the solution. To do this we had to redesign the schema. Your case could of course be different, but the lesson learned is that deleting data from large storage is very expensive.
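A rough sketch of what such partitioning could look like, assuming one database per calendar month and a `samples_YYYY_MM` naming scheme. Both the monthly granularity and the function names are my assumptions for illustration; the answer does not say how the data was actually split.

```python
from datetime import date
from typing import Iterable, List


def _name(year: int, month: int, prefix: str) -> str:
    # Zero-padded so that names sort chronologically as plain strings.
    return f"{prefix}_{year:04d}_{month:02d}"


def db_name_for(day: date, prefix: str = "samples") -> str:
    """Database that holds the samples for the month containing `day`."""
    return _name(day.year, day.month, prefix)


def expired_databases(existing: Iterable[str], today: date,
                      keep_months: int, prefix: str = "samples") -> List[str]:
    """Partition databases that fall outside the retention window."""
    # Step back (keep_months - 1) months to find the oldest partition to keep.
    months = today.year * 12 + (today.month - 1) - (keep_months - 1)
    oldest_kept = _name(months // 12, months % 12 + 1, prefix)
    return sorted(
        name for name in existing
        if name.startswith(prefix + "_") and name < oldest_kept
    )
```

Writers pick the target database with `db_name_for(sample_date)`, and a periodic job drops whatever `expired_databases(...)` returns, for example with pymongo's `MongoClient.drop_database(name)`.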
The best method currently is to run a master/slave setup.
Shut down one mongod instance and let it resync.
More details here: Reducing MongoDB database file size

solution to synch back cache value to database?

Here is the scenario:
This is an access statistics system, similar to Blogger's overview stats feature.
The statistics are stored persistently in a database (such as MySQL), while a key-value cache (currently memcache) caches the access counts; each access only updates the value in the cache.
The question is how to sync the latest count values back to the database.
A common solution is to write them back at some interval, but memcache discards items when it runs out of space, so some updates may be lost.
So I think a better solution would be for memcache to send a message (like JMS) when it discards an item, so that I can sync that item to the database.
It seems that memcache does not provide this feature. Is there any other key-value cache that can do this?
Or are there any better solutions?
Memcached is a cache, so you need to use it as one. When you update the access counts in memcached, you should also enqueue the updates so they can be written asynchronously to the database. That way, counts that fall out of the cache can be reloaded from the database.
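A minimal sketch of that approach, with in-process stand-ins so it runs on its own: a dict plays the role of memcached, a `queue.Queue` plays the role of a durable message queue, and a `Counter` plays the role of the MySQL table. The class and method names are made up for illustration.

```python
import queue
from collections import Counter


class AccessCounter:
    def __init__(self) -> None:
        self.cache = {}               # stand-in for memcached
        self.pending = queue.Queue()  # stand-in for a durable queue
        self.database = Counter()     # stand-in for the statistics table

    def record_access(self, page: str) -> int:
        if page not in self.cache:
            # Cache miss (e.g. after eviction): reload the last persisted
            # count. It may lag behind until the queue has been flushed.
            self.cache[page] = self.database[page]
        self.cache[page] += 1
        self.pending.put(page)  # the increment itself is never lost
        return self.cache[page]

    def flush(self) -> int:
        """Drain the queue into the database; returns the number applied."""
        applied = 0
        while True:
            try:
                page = self.pending.get_nowait()
            except queue.Empty:
                return applied
            self.database[page] += 1
            applied += 1
```

Because every increment goes through the queue, an evicted cache entry costs nothing but a reload; the database total stays correct once `flush()` has run.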
I like the idea of memcached enqueuing items that are about to be discarded, but it's probably not going to happen in the main project due to performance considerations.

How does memcache store data?

I am a newbie to caching and have no idea how data is stored in a cache. I have tried to read a few examples online, but everybody provides code snippets for storing and getting data rather than explaining how data is cached using memcache. I have read that it stores data in key-value pairs, but I am unable to understand where those key-value pairs are stored.
Also, could someone explain why data going into the cache is hashed or encrypted? I am a little confused between serialising data and hashing data.
A couple of quotes from the Memcache page on Wikipedia:
"Memcached's APIs provide a giant hash table distributed across multiple machines. When the table is full, subsequent inserts cause older data to be purged in least recently used (LRU) order."
And
"The servers keep the values in RAM; if a server runs out of RAM, it discards the oldest values. Therefore, clients must treat Memcached as a transitory cache; they cannot assume that data stored in Memcached is still there when they need it."
The rest of the page on Wikipedia is pretty informative, and it might help you get started.
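To illustrate the "hash table with LRU purging" behaviour from the quotes, here is a small single-process sketch. It is a simplified model for intuition only, not how memcached is implemented internally, and the `LRUCache` class is my own name.

```python
from collections import OrderedDict


class LRUCache:
    """In-memory key-value store that evicts the least recently used key."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default  # a miss: the caller must fetch the data elsewhere
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" is now more recent than "b"
cache.set("c", 3)  # the cache is full, so "b" is evicted
```

After the last call, `cache.get("b")` returns `None`, which is exactly why clients cannot assume that stored data is still there.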
They are stored in memory on the server. That way, if you use the same key/value pair often and you know it won't change for a while, you can keep it in memory for faster access.
I'm not deeply familiar with memcached, so take what I have to say with a grain of salt :-)
Memcached is a separate process or set of processes that store a key-value store in-memory so they can be easily accessed later. In a sense, they provide another global scope that can be shared by different aspects of your program, enabling a value to be calculated once, and used in many distinct and separate areas of your program. In another sense, they provide a fast, forgetful database that can be used to store transient data. The data is not stored permanently, but in general it will be stored beyond the life of a particular request (it is possible for Memcached to never store your data, so every read will be a miss, but that's generally an indication that you do not have it set up correctly for your use case).
The data going into cache does not have to be hashed or encrypted (but both things can happen to the data, depending on the caching mechanism.)
Serializing data actually has nothing to do with either concept -- instead, it is the process of changing data from one format (generally one suited for in-memory storage) to another one (generally suitable for storage in a persistent medium.) Another term for this process is marshalling and unmarshalling.
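A short sketch of the difference, using only the standard library. Serialization is reversible (you get the object back), while a hash is a fixed-size fingerprint that cannot be turned back into the data. The server names and the "hash the key to pick a server" scheme at the end are assumptions for illustration; real client libraries use their own hashing schemes.

```python
import hashlib
import json

record = {"user": "alice", "visits": 3}

# Serialization: object -> bytes -> object, nothing is lost.
serialized = json.dumps(record, sort_keys=True).encode("utf-8")
restored = json.loads(serialized.decode("utf-8"))

# Hashing: bytes -> fixed-size digest, with no way back to the original.
digest = hashlib.sha256(serialized).hexdigest()

# One place where hashing shows up in caching: choosing which server
# holds a key (hypothetical server names, simplified scheme).
servers = ["cache-1", "cache-2", "cache-3"]
key = b"user:alice"
server = servers[int(hashlib.md5(key).hexdigest(), 16) % len(servers)]
```

So the value is serialized so that it can be sent to and stored on the cache server, and the key may be hashed to decide where it goes; neither step is encryption.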

Main Memory DB vs Object DB

I'm currently trying to pick a database vendor.
I'm just seeking some personal opinions from fellow database developers out there.
My question is especially targeted towards people who:
1) have used Main Memory DB (MMDB) that supports replicating to disk (hybrid) before (i.e. ExtremeDB)
or
2) have used Versant Object Database and/or Objectivity Database and/or Progress ObjectStore
and the question is really: if you could recommend a database vendor, based on your experience, that would suit my application.
My application is a commercial real-time (read: high-performance) object-oriented C++ GIS kind of app, where we need to do a lot of lat/lon search (i.e. given an area, find all matching targets within the area...R-Tree index).
The types of data that I would like to store in the database are all modeled as objects and they make use of std::list and std::vector, so naturally, an Object Database seems to make sense. I have read through enough articles to convince myself that a traditional RDBMS probably isn't what I'm really looking for in terms of:
performance (joins or multiple tables for dynamic-length data like list/vector)
ease of programming (impedance mismatch)
However, in terms of performance,
Input data is being fed into the system at about 40 MB/s.
Hence, the system will also be inserting into the database at a rate of roughly 350 inserts per second (where each object varies from 64KB to 128KB),
The database will constantly be searched and updated via multiple threads.
From my understanding, all of the Object DBs I have listed here use a cache for storing database objects. ExtremeDB claims that since it's designed especially for memory, it can avoid the overhead of caching logic, etc. See more by googling: Main Memory vs. RAM-Disk Databases: A Linux-based Benchmark
So I'm just a bit confused. Can Object DBs be used in a real-time system? Are they as "fast" as an MMDB?
Fundamentally, the difference between an MMDB and an OODB is that the MMDB expects all of its data to reside in RAM and to be persisted to disk at some point, whereas an OODB is more conventional in that there's no expectation of the entire DB fitting into RAM.
The MMDB can leverage this by giving up on the idea that the persisted data has to "match" the in-RAM data at all times.
The way anything with persistence is going to work, is that it has to write the data to disk on update in some fashion.
Almost all DBs use some kind of log for this. These logs are basically "raw" pages of data, or perhaps individual transactions, appended to a file. When the file gets "too big", a new file is started.
Once the logs are properly consolidated in to the main store, the logs are discarded (or reused).
Now, a crude in-RAM DB can exist simply by appending transactions to a log file; when it's restarted, it just loads the log into RAM. So, in essence, the log file IS the database.
The downside of this technique is that the longer the DB runs and the more transactions you have, the bigger your log/DB is, and thus the longer the DB startup time. But, ideally, you can also "snapshot" the current state, which replaces all of the logs up to that point and effectively compresses them.
In this manner, all the routine operations of the DB have to manage is appending pages to logs, rather than updating other disk pages, index pages, etc. Since, ideally, most systems don't need to "Start up" that often, perhaps start up time is less of an issue.
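The crude "the log file IS the database" design described above can be sketched in a few lines. This is a toy under stated assumptions (a JSON-lines log, string keys, a single process, a `LogDB` class name of my own choosing), not how any of the products mentioned actually work.

```python
import json
import os


class LogDB:
    """Toy in-RAM key-value store whose only on-disk form is an append-only log."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.data = {}
        if os.path.exists(path):
            # Startup: rebuild the in-RAM state by replaying the whole log.
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    self._apply(json.loads(line))
        self._log = open(path, "a", encoding="utf-8")

    def _apply(self, entry: dict) -> None:
        if entry["op"] == "set":
            self.data[entry["key"]] = entry["value"]
        else:
            self.data.pop(entry["key"], None)

    def _append(self, entry: dict) -> None:
        # The only disk work on an update is one append to the log.
        self._log.write(json.dumps(entry) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self._apply(entry)

    def set(self, key: str, value) -> None:
        self._append({"op": "set", "key": key, "value": value})

    def delete(self, key: str) -> None:
        self._append({"op": "delete", "key": key})

    def snapshot(self) -> None:
        """Replace the log with one 'set' per live key, shrinking startup time."""
        self._log.close()
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            for key, value in self.data.items():
                f.write(json.dumps({"op": "set", "key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)  # swap in the compacted log
        self._log = open(self.path, "a", encoding="utf-8")

    def close(self) -> None:
        self._log.close()
```

Startup cost grows with the number of log lines, and `snapshot()` is the "maintenance" step that brings it back down to one line per live key.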
So, in this way, an MMDB can be faster than an OODB, which has a different contract with the disk: it maintains both logs and disk pages. An OODB can therefore be slower even if the entire DB fits into RAM and is properly cached, simply because it incurs disk operations beyond the log writes during normal operation, whereas in an MMDB these operations happen as a "maintenance" task that can be scheduled during downtime and/or quiet time.
As to whether either of these systems can meet your actual performance needs, I can't say.
The back ends of databases (reader and writer processes, caching, lock management, txn log files, ACID semantics) are the same, so RDBs and OODBs are actually very similar here. The difference is the interface to the application programmer. Is your data model complicated, consisting of lots of classes with real inheritance relationships? Then OO is good. Is it relatively flat and simple? Then go RDB. What is the nature of the relationships? Are they pointer-like and set-like? Then go RDB. Are they more complicated, like (ordered) lists, arrays, or maps? Then you should go OO. Also, do you have a stand-alone application with no need to integrate with other apps? Then OO is OK. Do you have to share data with other apps (i.e. several apps access the same database)? Then that's a deal-breaker for OO, and you should stick with RDB. Is the schema of your database stable or do you expect it to evolve frequently? OODBs are bad at schema evolution, so if you expect frequent changes, stick with RDBs.