I have been investigating the experimental Akka Persistence Query module and am very interested in implementing a custom read journal for my application. The documentation describes two main flavors of queries: ones that return the current state of the journal (e.g. CurrentPersistenceIdsQuery), and ones that return a subscribe-able stream that emits events as they are committed to the journal via the write side of the application (e.g. AllPersistenceIdsQuery).
For my contrived application, I am using Postgres and Slick 3.1.1 to drive the guts of these queries. I can successfully stream database query results by doing something like:
override def allPersistenceIds = {
  val db = Database.forConfig("postgres")
  val metadata = TableQuery[Metadata]
  val query = for (m <- metadata) yield m.persistenceId
  Source.fromPublisher(db.stream(query.result))
}
However, the stream is signaled as complete as soon as the underlying Slick DB action is completed. This doesn't seem to fulfill the requirement of a perpetually open stream that is capable of emitting new events.
My questions are:
Is there a way to do it purely using the Akka Streams DSL? That is, can I set up a flow that cannot be closed?
I have explored how the LevelDB read journal works, and it seems to handle new events by having the read journal subscribe to the write journal. This seems reasonable, but I must ask: in general, is there a recommended approach for dealing with this requirement?
The other approach I have thought about is polling (e.g. periodically have my read journal query the DB and check for new events / ids). Would someone with more experience than I be able to offer some advice?
Thanks!
It's not as trivial as that one line of code, but you're on the right track already.
To implement an "infinite" stream you'll need to query multiple times - i.e. implement polling - unless the underlying database allows an infinite query (which, as far as I can see, it does not here).
The polling needs to keep track of an "offset": if you're querying by some tag and you issue another poll, that second query needs to start from the "last emitted element" rather than from the beginning of the table again. So you need something, most likely an Actor, that keeps this offset.
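For illustration only, here is a minimal sketch of such a polling source built with Source.unfoldAsync; the Row case class, the queryEventsFrom DAO method, and the poll interval are all hypothetical placeholders, not part of any plugin:

import akka.NotUsed
import akka.actor.ActorSystem
import akka.pattern.after
import akka.stream.scaladsl.Source
import scala.concurrent.Future
import scala.concurrent.duration._

// Hypothetical row type and DAO method: fetch events with an offset greater than `offset`.
final case class Row(offset: Long, persistenceId: String, payload: Array[Byte])
def queryEventsFrom(offset: Long): Future[Seq[Row]] = ???

def pollingSource(pollInterval: FiniteDuration)(implicit system: ActorSystem): Source[Row, NotUsed] = {
  import system.dispatcher
  Source
    .unfoldAsync(0L) { offset =>
      // Wait one poll interval, then fetch everything newer than the current offset.
      after(pollInterval, system.scheduler) {
        queryEventsFrom(offset).map { rows =>
          val nextOffset = rows.lastOption.map(_.offset).getOrElse(offset)
          Some((nextOffset, rows)) // never None, so the stream never completes
        }
      }
    }
    .mapConcat(_.toList)
}

The offset could equally live inside an Actor, as described above; the point is only that each poll resumes from the last emitted element.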
The Query Side LevelDB plugin is not the best role model for other implementations as it assumes much about the underlying journal and how those work. Also, LevelDB is not meant for production with Akka Persistence – it's a Journal we ship in order to have a persistent journal you can play around with out of the box (without starting Cassandra etc).
If you're looking for inspiration, the MongoDB plugins should actually be a pretty good source for that, as they have very similar limitations to the SQL stores. I'm not sure whether any of the SQL journals currently implement the Query side.
One can use the Postgres replication API to get an 'infinite' stream of database events. It's supported by the Postgres JDBC driver starting from version 42.0.0; see the related pull request.
However, it's not a real stream but rather a buffered, synchronous reader of the database WAL.
// Assumes the connection was opened with the "replication=database" property and
// that a logical replication slot named "test_decoding" already exists.
PGConnection pgConnection = connection.unwrap(PGConnection.class);

PGReplicationStream stream =
    pgConnection
        .getReplicationAPI()
        .replicationStream()
        .logical()
        .withSlotName("test_decoding")
        .withSlotOption("include-xids", false)
        .withSlotOption("skip-empty-xacts", true)
        .start();

while (true) {
    ByteBuffer buffer = stream.read(); // blocks until the next chunk of WAL arrives
    // process logical changes
}
It would be nice to have an Akka Streams adapter (Source) for this reader in the Alpakka project.
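As a rough illustration of what such an adapter might look like (openReplicationStream is a hypothetical factory that performs the connection and slot setup shown above), Source.unfoldResource fits this kind of blocking reader well:

import java.nio.ByteBuffer
import akka.NotUsed
import akka.stream.scaladsl.Source
import org.postgresql.replication.PGReplicationStream

// Hypothetical factory: opens the replication connection and starts the
// PGReplicationStream exactly as in the Java snippet above.
def openReplicationStream(): PGReplicationStream = ???

val walSource: Source[ByteBuffer, NotUsed] =
  Source.unfoldResource[ByteBuffer, PGReplicationStream](
    () => openReplicationStream(),    // create: open the connection and start the stream
    stream => Option(stream.read()),  // read: blocks until the next WAL chunk is available
    stream => stream.close()          // close: release the replication connection
  )

unfoldResource runs the blocking read on a dedicated IO dispatcher and only completes the stream if the read function returns None, so the source stays open for as long as the replication slot does.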
Background
I have implemented a Snowflake data pipeline (S3 log file > SNS > pipe > stage table > stream > task > stored proc/UDF > final table) in our production Snowflake database.
While things were working on a smaller scale in our dev database, it seems the production pipeline has stopped working given the amount of data (6416006096 records and growing) attempting to flow through it.
Problem
After some investigation so far, it looks like s3 log > SNS > pipe > stage table is OK, but things are stuck where the task retrieves records from the stream. The stream is NOT stale. I have spent a lot of time reading the docs on streams and have not found any help in there for my current issue.
It looks like the stream has too much data to return: when I try to get a count(*), or * with limit 10, from the stream, it has not returned after 8 minutes (and counting)...
Even if I could limit the data returned, I have found through experimenting that once you select from the stream within a transaction, you can lose all changes even if you don't want them all (i.e., even if you use a where clause to filter)...
Question
Is there any way to get anything to return from the stream without resetting it?
Is there any way to chunk the results from the stream without losing all changes within a transaction?
Is there some undocumented limit with streams--have I hit that?
Concern
I don't want to shut down the data pipeline because that means I may have to start all over, but I guess I will have to if I get no answers (I have contacted support too but have yet to hear back). Given that streams and tasks are still only in preview, I guess this shouldn't be a surprise, but I was told by Snowflake that they would be GA by now.
Is there any way to get anything to return from the stream without resetting it?
You should be able to select from the stream without resetting it. Only using it in a DML (e.g. insert into mytable select * from stream) will reset it.
Is there any way to chunk the results from the stream without losing all changes within a transaction?
No, streams don't support chunking.
Is there some undocumented limit with streams--have I hit that?
I don't think there are undocumented limits; streams are essentially ranges on a table, so if there's a lot of data in the underlying table, it could take a while to scan it.
Some other considerations:
Are you using the right sized warehouse? If you have a lot of data in the stream, and a lot of DMLs consisting of updates, deletes, and inserts you might want to reconsider your warehouse size. I believe Snowflake does some partition level comparisons to reconcile added and deleted data.
Can you "tighten" up how often you read from the stream so that there's less data to process each time?
Depending on the type of data you're interested in, Snowflake offers an append only stream type, which only shows added data. This makes scanning much faster.
There is a Kafka Streams component that fetches JSON data from a topic. Now I have to do the following:
1. Parse that input JSON data and fetch the value of a certain ID (identifier) attribute
2. Do a lookup against a particular table in the Oracle database
3. Enrich that input JSON with data from the lookup table
4. Publish the enriched JSON data to another topic
What is the best design approach to achieve Step#2? I have a fair idea on how I can do the other steps. Any help is very much appreciated.
Depending on the size of the dataset you're talking about, and on the volume of the stream, I'd try to cache the database as much as possible (assuming it doesn't change that often). Augmenting data by querying a database on every record is very expensive in terms of latency and performance.
The way I've done this before is instantiating a thread whose only task is to maintain a fresh local cache (usually a ConcurrentHashMap), and making that available to the process that requires it. In this case, you'll probably want to create a processor, give it the reference to the ConcurrentHashMap described above, and when a Kafka record comes in, look up the data by the key, augment the record, and send it to either a sink processor or another Streams processor, depending on what you want to do with it.
In case the lookup fails, you can fall back to actually doing a query on demand against the database, but you probably want to test and profile this, because in the worst-case scenario of 100% cache misses, you're going to be querying the database a lot.
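A hedged sketch of the caching part only (topology wiring omitted); LookupRow, loadAllRows, and loadRowById are hypothetical stand-ins for your Oracle access code:

import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

// Hypothetical lookup row and Oracle access methods; replace with your own DAO.
final case class LookupRow(id: String, extraField: String)
def loadAllRows(): Map[String, LookupRow] = ???        // full-table read, used by the refresher
def loadRowById(id: String): Option[LookupRow] = ???   // single-row query, used on a cache miss

val cache = new ConcurrentHashMap[String, LookupRow]()

// Background thread whose only job is to keep the local cache fresh.
val refresher = Executors.newSingleThreadScheduledExecutor()
refresher.scheduleAtFixedRate(new Runnable {
  def run(): Unit = loadAllRows().foreach { case (id, row) => cache.put(id, row) }
}, 0, 5, TimeUnit.MINUTES)

// Called from the Streams processor (or a mapValues step) for every incoming record;
// merging the row back into the JSON is left to whatever JSON library you use.
def lookup(id: String): Option[LookupRow] =
  Option(cache.get(id)).orElse {
    // Cache miss: fall back to a direct database query and remember the result.
    loadRowById(id).map { row => cache.put(id, row); row }
  }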
We are planning to run a Kafka Streams application distributed across two machines. Each instance stores its KTable data on its own machine.
The challenge we face here is:
We have a million records pushed to the KTable. We need to iterate over the whole KTable (RocksDB) data and generate the report.
Let's say 500K records are stored in each instance. It's not possible to get all records from the other instance in a single GET over HTTP (unless there is some streaming TCP technique available). Basically we need both instances' data in a single call to generate the report.
Proposed Solution:
We are thinking of having a shared location (state.dir) for these two instances, so that they store the KTable data in the same directory. The idea is to get all the data from a single instance, without interactive queries, by just calling:
final ReadOnlyKeyValueStore<Key, Result> allDataFromTwoInstance =
    streams.store("result", QueryableStoreTypes.<Key, Result>keyValueStore());

KeyValueIterator<Key, Result> iterator = allDataFromTwoInstance.all();
while (iterator.hasNext()) {
    KeyValue<Key, Result> entry = iterator.next();
    // append to excel report
}
Question:
Will the above solution work without any issues? If not, is there any alternative solution for this?
Please suggest. Thanks in advance.
GlobalKTable is the most natural first choice, but it means each node where the global table is defined contains the entire dataset.
The other alternative that comes to mind is indeed to stream the data between the nodes on demand. This makes sense especially if creating the report is an infrequent operation or if the dataset cannot fit on a single node. Basically, you can follow the documentation guidelines for querying remote Kafka Streams nodes here:
http://kafka.apache.org/0110/documentation/streams/developer-guide#streams_developer-guide_interactive-queries_discovery
and for RPC use a framework that supports streaming, e.g. akka-http.
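For the discovery part, a hedged sketch (it assumes each instance sets the application.server property and exposes the HTTP endpoint sketched further below; "result" is the store name from the question, "/local-results" is made up):

import scala.collection.JavaConverters._
import org.apache.kafka.streams.KafkaStreams

// Discover every instance that hosts a shard of the "result" store and build the
// URLs of the per-instance streaming endpoints.
def instanceEndpoints(streams: KafkaStreams): Seq[String] =
  streams.allMetadataForStore("result").asScala.toSeq.map { m =>
    s"http://${m.host()}:${m.port()}/local-results"
  }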
Server-side streaming:
http://doc.akka.io/docs/akka-http/current/java/http/routing-dsl/source-streaming-support.html
Consuming a streaming response:
http://doc.akka.io/docs/akka-http/current/java/http/implications-of-streaming-http-entity.html#client-side-handling-of-streaming-http-entities
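On the serving side, a minimal hedged sketch of streaming a store's local contents with Akka HTTP (Key and Result are the question's own types; the path and the line-per-entry formatting are assumptions):

import akka.http.scaladsl.model.{ContentTypes, HttpEntity}
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route
import akka.stream.scaladsl.Source
import akka.util.ByteString
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.state.QueryableStoreTypes
import scala.collection.JavaConverters._

// Streams this instance's local shard of the "result" store, one line per entry,
// without materializing all of it in memory at once.
def localStoreRoute(streams: KafkaStreams): Route =
  path("local-results") {
    get {
      val store = streams.store("result", QueryableStoreTypes.keyValueStore[Key, Result]())
      val rows = Source
        .fromIterator(() => store.all().asScala)
        .map(kv => ByteString(s"${kv.key},${kv.value}\n"))
      complete(HttpEntity(ContentTypes.`text/plain(UTF-8)`, rows))
    }
  }

The report generator can then consume each instance's response as a stream and merge them, instead of trying to pull everything in a single in-memory GET.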
This will not work. Even if you have a shared state.dir, each instance only loads its own share/shard of the data and is not aware of the other instance's data.
I think you should use GlobalKTable to get a full local copy of the data.
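For illustration, a hedged sketch with the newer StreamsBuilder API; the topic name "result-topic", the store name "result-global-store", and the String serdes are all assumptions made for brevity:

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{Consumed, GlobalKTable, Materialized}
import org.apache.kafka.streams.state.{KeyValueStore, QueryableStoreTypes}
import scala.collection.JavaConverters._

// Every instance materializes the full topic locally, so any single instance can
// iterate all records when building the report.
val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "report-app")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

val builder = new StreamsBuilder()
val globalResults: GlobalKTable[String, String] =
  builder.globalTable(
    "result-topic",
    Consumed.`with`(Serdes.String(), Serdes.String()),
    Materialized.as[String, String, KeyValueStore[Bytes, Array[Byte]]]("result-global-store")
  )

val streams = new KafkaStreams(builder.build(), props)
streams.start()

// Once the global store has been restored (streams is RUNNING), query it on any instance:
val store = streams.store("result-global-store", QueryableStoreTypes.keyValueStore[String, String]())
store.all().asScala.foreach { kv =>
  // append kv.key / kv.value to the Excel report
}

The trade-off is that each node now holds the entire dataset on disk, as noted above.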
I have used MongoDB but am new to Cassandra. I have worked on applications which use MongoDB and are not very large; their read and write operations are not very intensive, and MongoDB worked well for me in that scenario. Now I am building a new application (with features like Stack Overflow's voting, total views, suggestions, comments, etc.) with lots of concurrent write operations on the same item in the database (in the future!). According to the information I gathered online, MongoDB is not the best choice for that (but Cassandra is). The problem I am finding with Cassandra, though, is picking the right data model:
Construct models around your queries, not around relations and objects.
I also looked at the solution of using Mongo + Redis. Is it efficient to update the Mongo database first and then update Redis for multiple write requests to the same data item?
I want to verify which one will be the better fit for this problem: Mongo + Redis, or Cassandra?
Any help would be highly appreciated.
Picking a database is very subjective. I'd say that modern MongoDB 3.2+ using the new WiredTiger Storage Engine handles concurrency pretty well.
When selecting a distributed NoSQL (or SQL) datastore, you can generally only pick two of these three:
Consistency (all nodes see the same data at the same time)
Availability (every request receives a response about whether it succeeded or failed)
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
This is called the CAP Theorem.
MongoDB has C and P, Cassandra has A and P. Cassandra is also a Column-Oriented Database, and will take a bit of a different approach to storing and retrieving data than, say, MongoDB does (which is a Document-Oriented Database). The reality is that either database should be able to scale to your needs easily. I would worry about how well the data storage and retrieval semantics fit your application's data model, and how useful the features provided are.
Deciding which database is best for your app is highly subjective, and borders on an "opinion-based question" on Stack Overflow.
Using Redis as an LRU cache is definitely a component of an effective scaling strategy. The typical model is, when reading cacheable data, to first check if the data exists in the cache (Redis), and if it does not, to query it from the database, store the result in the cache, and return it. While maybe appropriate in some cases, it's not common to just write everything to both Redis and the database. You need to figure out what's cacheable and how long each cached item should live, and either cache it at read time as I explained above, or at write time.
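As a hedged sketch of that read-through pattern with the Jedis client (fetchFromDatabase is a hypothetical query against whichever primary store you choose, and the TTL is arbitrary):

import redis.clients.jedis.Jedis

// Hypothetical database read; this would hit MongoDB or Cassandra.
def fetchFromDatabase(key: String): Option[String] = ???

val jedis = new Jedis("localhost", 6379)
val ttlSeconds = 3600 // how long a cached item stays valid

// Cache-aside read: try Redis first, fall back to the database, then populate the cache.
def cachedRead(key: String): Option[String] =
  Option(jedis.get(key)).orElse {
    fetchFromDatabase(key).map { value =>
      jedis.setex(key, ttlSeconds, value)
      value
    }
  }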
It mostly depends on what your application is for. For write-intensive apps it is way better to go with Cassandra.
I'm starting a new application and I want to use CQRS and event sourcing. I understand the idea of replaying events to recreate aggregates, and snapshotting to speed things up if needed, using in-memory models, caching, etc.
My question is regarding large read models I don't want to hold in memory. Suppose I have an application where I sell products, and I want to listen to a stream of events like "ProductRegistered" and "ProductSold" and build a table in a relational database that will be used for reporting or integration with another system. Suppose there are lots of records and this table may take from a few seconds to minutes to truncate/rebuild, and the application exports dozens of these projections for multiple purposes.
How does one handle the consistency of the projections in this scenario?
With in-memory data, it's quite simple and fast to replay the events. But I feel that external projections kept on disk will be much slower to rebuild.
Should I always start my application with a TRUNCATE TABLE + rebuild for every external projection? This seems impractical to me over time, but I may be worrying about a problem I don't have yet.
Since the table is itself like a snapshot, I could keep a "control table" to tell which event was the last one handled for that projection, so I can replay only what's needed. But I'm worried about inconsistencies if the application or database crashes. It seems that checking the consistency of the table and rebuilding would amount to the same thing, which points back to solution 1 again.
How would you handle that in a way that is maintainable over time? Are there better solutions?
Thank you very much.
One way to handle this is the concept of checkpointing. Essentially either your event stream or your whole system has a version number (checkpoint) that increments with each event.
For each projection, you store the last committed checkpoint that was applied. At startup, you pull events greater than the last checkpoint number that was applied to the projection, and continue building your projection from there. If you need to rebuild your projection, you delete the data AND the checkpoint and rerun the whole stream (or set of streams).
Caution: the last applied checkpoint and the projection's read models need to be persisted in a single transaction to ensure they do not get out of sync.
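A hedged sketch of that idea with plain JDBC; the control table projection_checkpoints, the Event shape, and the helper methods are all made up for illustration:

import java.sql.Connection

// Hypothetical event shape and projection helpers.
final case class Event(checkpoint: Long, payload: String)
def applyEventToReadModel(conn: Connection, event: Event): Unit = ???
def loadEventsAfter(checkpoint: Long): Seq[Event] = ???

// Read the projection's last committed checkpoint from the control table.
def lastCheckpoint(conn: Connection, projection: String): Long = {
  val ps = conn.prepareStatement("SELECT checkpoint FROM projection_checkpoints WHERE projection = ?")
  ps.setString(1, projection)
  val rs = ps.executeQuery()
  if (rs.next()) rs.getLong(1) else 0L
}

// Apply one event and advance the checkpoint in the SAME transaction,
// so the read model and the checkpoint cannot get out of sync.
def project(conn: Connection, projection: String, event: Event): Unit = {
  conn.setAutoCommit(false)
  try {
    applyEventToReadModel(conn, event)
    val ps = conn.prepareStatement("UPDATE projection_checkpoints SET checkpoint = ? WHERE projection = ?")
    ps.setLong(1, event.checkpoint)
    ps.setString(2, projection)
    ps.executeUpdate()
    conn.commit()
  } catch {
    case e: Exception => conn.rollback(); throw e
  }
}

// On startup: resume from the stored checkpoint instead of truncating and rebuilding.
def catchUp(conn: Connection, projection: String): Unit =
  loadEventsAfter(lastCheckpoint(conn, projection)).foreach(project(conn, projection, _))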