I'm trying to do a join of records emitted from a KafkaSpout with records in an Oracle table (not a streaming join).
What is the best way to implement this?
I can use a cache to hold the records from the DB table and then join each tuple emitted from the spout with the cached data.
I would like to get suggestions on this.
The simplest way is to open a JDBC connection to the database in open() or prepare() (depending on whether you want to do this in a spout or a bolt) and query the database for each tuple to be processed to retrieve the corresponding join tuples.
Of course, you can additionally use a cache (maybe a simple HashMap) within your spout/bolt code to avoid querying the same data over and over again. For this, I would populate the cache lazily and also limit the number of entries to avoid out-of-memory errors. You might want to implement an LRU strategy to evict entries from the cache in case it reaches its limit.
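A minimal sketch of such a bolt, assuming Storm's org.apache.storm API and an illustrative Oracle lookup table dim_table with columns dim_key/dim_value (those names, the connection string, the tuple fields "key"/"payload", and the 10,000-entry cache limit are placeholders, not taken from your setup):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class OracleJoinBolt extends BaseRichBolt {

    private static final int CACHE_SIZE = 10_000;     // illustrative limit

    private transient Connection connection;
    private transient PreparedStatement lookup;
    private transient Map<String, String> cache;      // lazily populated LRU cache
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            connection = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
            lookup = connection.prepareStatement(
                "SELECT dim_value FROM dim_table WHERE dim_key = ?");
        } catch (SQLException e) {
            throw new RuntimeException("Could not open JDBC connection", e);
        }
        // access-ordered LinkedHashMap + removeEldestEntry gives a basic LRU eviction policy
        cache = new LinkedHashMap<String, String>(CACHE_SIZE, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > CACHE_SIZE;
            }
        };
    }

    @Override
    public void execute(Tuple tuple) {
        String key = tuple.getStringByField("key");
        String dimValue = cache.get(key);
        if (dimValue == null) {
            try {
                lookup.setString(1, key);
                try (ResultSet rs = lookup.executeQuery()) {
                    if (rs.next()) {
                        dimValue = rs.getString(1);
                        cache.put(key, dimValue);   // populate lazily, only on a miss
                    }
                }
            } catch (SQLException e) {
                collector.fail(tuple);
                return;
            }
        }
        collector.emit(tuple, new Values(key, tuple.getStringByField("payload"), dimValue));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("key", "payload", "dim_value"));
    }
}

The same pattern works in a spout's open() method; keeping the connection and cache transient and creating them in prepare()/open() means the component still serializes cleanly when the topology is submitted.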
We have an IoT app that receives data on Kafka, processes it, and saves it in an RDBMS. The database we are using (MemSQL) supports more than 20,000 inserts per second, but with the regular repository.save method I have been able to achieve only 50 inserts per second. I have written a simple test that I am running on an AWS EC2 instance with high network speed:
void saveA() {
    for (int i = 0; i < 1000; i++) {
        A obj = new A();
        aRepository.save(obj);
    }
}
This takes 20 seconds to complete. I wish to achieve around 1000k inserts per second. How do I increase this ingestion speed? Should I create a thread pool of size 1000 and call save from separate threads? In that case, do I need to care about properties like spring.datasource.tomcat.max-active to increase the number of connections in the pool? Would Spring Data automatically pick a separate connection from the pool for each thread?
I can't do batch inserts because I am reading data from Kafka one record at a time, and also because there could be some duplicate data that I need to catch as a DataIntegrityViolationException and update instead.
You don't describe how complex the objects are that you are saving, but it sounds like they are fairly simple, i.e. the ratio of inserts per save operation is close to 1, and you also don't seem to do many updates, if any.
If that is the case, I'd recommend ditching JPA and going straight for JDBC (using the JdbcTemplate).
The reason is that JPA does a lot of things to make the typical JPA process work: load an entity graph, manipulate it, and flush it back to the database.
But you don't do that, so JPA might not help much and instead makes your life harder because you need to tune both JPA and JDBC.
Start with performing the inserts directly using JdbcTemplate.
The next step would be to perform batch inserts.
You write that you can't do that, but I don't see why you couldn't collect a couple of rows before writing them to the database.
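For example, a sketch along those lines using JdbcTemplate.batchUpdate, assuming MemSQL's MySQL-compatible INSERT ... ON DUPLICATE KEY UPDATE to handle the duplicates instead of catching DataIntegrityViolationException (the table a, its columns, the getters on A, and the flush threshold of 500 are all illustrative):

import java.util.ArrayList;
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

@Component
public class ABatchWriter {

    private static final int FLUSH_THRESHOLD = 500;   // tune against your throughput

    private final JdbcTemplate jdbcTemplate;
    private final List<A> buffer = new ArrayList<>();

    public ABatchWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // called once per record consumed from Kafka
    public synchronized void add(A entity) {
        buffer.add(entity);
        if (buffer.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    // writes the buffered records in a single JDBC batch round trip
    public synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        jdbcTemplate.batchUpdate(
            "INSERT INTO a (id, payload) VALUES (?, ?) "
                + "ON DUPLICATE KEY UPDATE payload = VALUES(payload)",
            buffer,
            buffer.size(),
            (ps, entity) -> {
                // getters on the A entity are assumed for illustration
                ps.setLong(1, entity.getId());
                ps.setString(2, entity.getPayload());
            });
        buffer.clear();
    }
}

You would also need a periodic or idle-time flush() so the tail of the buffer doesn't sit unwritten; even modest batch sizes typically improve throughput by orders of magnitude over row-by-row saves.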
We are on PostgreSQL 12 and looking to partition a group of tables that are all related by data source name. A source can have tens of millions of records, and the whole dataset takes up about 900 GB across the 2,000 data sources. We don't have a good way to update these records, so we are looking at a full dump and reload any time we need to update the data for a source. This is why we are looking at partitioning: we can load the new data into a new partition, detach (and later drop) the partition that currently houses the data, and then attach the new partition with the latest data.

Queries will be performed via a single ID field. My concern is that, since we are partitioning by source name but querying by an ID that isn't used in the partition definition, we won't be able to utilize any partition pruning and our queries will suffer for it.

How concerned should we be about query performance for this use case? There will be an index defined on the ID that is being queried, but based on the Postgres documentation, queries that have to look at many partitions can add a lot of planning time and use a lot of memory.
Performance will suffer, but how much depends on the number of partitions. The more partitions you have, the slower both planning and execution get, so keep the number low.
You can save on query planning time by defining a prepared statement and reusing it.
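For example, with the PostgreSQL JDBC driver a sketch could look like the following; the table and column names are made up, and the driver switches to server-side prepared statements after a few executions, at which point the server can cache the plan instead of re-planning against every partition on each call:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PartitionedLookup {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
             // prepare once, bind and execute many times
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT source_name, payload FROM readings WHERE record_id = ?")) {

            long[] idsToLookUp = {42L, 43L, 44L};   // placeholder IDs
            for (long id : idsToLookUp) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("source_name"));
                    }
                }
            }
        }
    }
}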
I am new to Spring Batch and trying to design a new application which has to read 20 million records from a database and process them.
I don't think we can do this with one single job and step (running sequentially with one thread).
I was thinking we could do this with partitioning, where the step is divided into a master and multiple workers (each worker is a thread that does its own processing and can run in parallel).
We have to read an existing table with 20 million records and process them, but this table does not have any auto-generated sequence number; its primary key is something like a 10-digit employer number.
I checked a few sample codes for partitioning where a range is passed to each worker and the worker processes that range, e.g. worker 1 gets 1 to 100 and worker 2 gets 101 to 200. In my case that is not going to work because we don't have a sequence number to pass as a range to each worker.
With partitioning, can the master read the data from the database (say 1,000 records at a time) and pass it to each worker instead of sending a range?
Or do you suggest any other, better approach for the above scenario?
In principle any query that returns result rows in a deterministic order is amenable to partitioning as in the examples you mentioned by means of OFFSET and LIMIT options. The ORDER BY may considerably increase the query execution time, although if you order by the table's primary key then this effect should be less noticeable as the table's index will already be ordered. So I would give this approach a try first, as it is the most elegant IMHO.
Note however that you might run into other problems processing a huge result set straight from a JdbcCursorItemReader, because some RDBMSs (like MySQL) won't be happy with the rate at which you'd be fetching rows interlocked with processing. So depending on the complexity of your processing I would recommend validating the design in that regard early on.
Unfortunately it is not possible to retrieve a partition's entire set of table rows and pass it as a parameter to the worker step as you suggested, because the parameter must not serialize to more than a kilobyte (or something in that order of magnitude).
An alternative would be to retrieve each partition's data and store it somewhere (in a map entry in memory if size allows, or in a file) and pass the reference to that resource in a parameter to the worker step which then reads and processes it.
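As a sketch of the OFFSET/LIMIT approach from the first paragraph, a custom Partitioner could compute one offset/limit pair per worker and put it into the step execution context; each worker's reader then orders by the primary key and appends the offset and limit it receives via #{stepExecutionContext[...]}. The table and column names below are placeholders:

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.jdbc.core.JdbcTemplate;

public class OffsetLimitPartitioner implements Partitioner {

    private final JdbcTemplate jdbcTemplate;

    public OffsetLimitPartitioner(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        long rowCount = jdbcTemplate.queryForObject(
            "SELECT COUNT(*) FROM source_table", Long.class);
        long partitionSize = (rowCount + gridSize - 1) / gridSize;   // rows per worker

        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            // each worker reads: SELECT ... ORDER BY employer_number OFFSET :offset LIMIT :limit
            context.putLong("offset", i * partitionSize);
            context.putLong("limit", partitionSize);
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}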
I have a topology (see below) that reads off a very large topic (over a billion messages per day). The memory usage of this Kafka Streams app is pretty high, and I was looking for some suggestions on how I might reduce the footprint of the state stores (more details below). Note: I am not trying to scapegoat the state stores; I just think there may be a way for me to improve my topology (see below).
// stream receives 1 billion+ messages per day
stream
    .flatMap((key, msg) -> rekeyMessages(msg))
    .groupBy((key, value) -> key)
    .reduce(new MyReducer(), MY_REDUCED_STORE)
    .toStream()
    .to(OUTPUT_TOPIC);

// stream the compacted topic as a KTable
KTable<String, String> rekeyedTable = builder.table(OUTPUT_TOPIC, REKEYED_STORE);

// aggregation 1
rekeyedTable.groupBy(...).aggregate(...)

// aggregation 2
rekeyedTable.groupBy(...).aggregate(...)

// etc
More specifically, I'm wondering if streaming the OUTPUT_TOPIC as a KTable is causing the state store (REKEYED_STORE) to be larger than it needs to be locally. For changelog topics with a large number of unique keys, would it be better to stream these as a KStream and do windowed aggregations? Or would that not reduce the footprint like I think it would (i.e., only a subset of the records, those in the window, would exist in the local state store)?
Anyways, I can always spin up more instances of this app, but I'd like to make each instance as efficient as possible. Here's my question:
Are there any config options, general strategies, etc that should be considered for Kafka Streams app with this level of throughput?
Are there any guidelines for how much memory a single instance should use? Even if you only have a somewhat arbitrary guideline, it may be helpful to share it with others. One of my instances is currently utilizing 15 GB of memory; I have no idea if that's good, bad, or doesn't matter.
Any help would be greatly appreciated!
With your current pattern
stream.....reduce().toStream().to(OUTPUT_TOPIC);
builder.table(OUTPUT_TOPIC, REKEYED_STORE)
you get two stores with the same content. One for the reduce() operator and one for reading the table() -- this can be reduced to one store though:
KTable rekeyedTable = stream.....reduce(...);
rekeyedTable.toStream().to(OUTPUT_TOPIC); // in case you need this output topic; otherwise you can also omit it completely
This should reduce your memory usage notably.
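Concretely, with the identifiers from your snippet, the reworked topology could look like this (same older DSL overloads as your code; newer Kafka Streams releases express the store via Materialized.as(...) instead):

KTable<String, String> rekeyedTable = stream
    .flatMap((key, msg) -> rekeyMessages(msg))
    .groupBy((key, value) -> key)
    .reduce(new MyReducer(), REKEYED_STORE);   // one store backs both the reduce and the table

// only needed if downstream consumers still read the compacted topic
rekeyedTable.toStream().to(OUTPUT_TOPIC);

// the aggregations now run directly on the reduced table
rekeyedTable.groupBy(...).aggregate(...)
rekeyedTable.groupBy(...).aggregate(...)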
About windowing vs non-windowing:
It's a matter of your required semantics, so simply switching from a non-windowed to a windowed reduce seems questionable.
Even if you can also go with windowed semantics, you would not necessarily reduce memory. Note that in the aggregation case, Streams does not store the raw records but only the current aggregate result (i.e., key + current aggregate). Thus, for a single key, the storage requirement is the same for both cases (a single window has the same storage requirement). At the same time, if you go with windows, you might actually need more memory, because you get one aggregate per key per window (while you get just a single aggregate per key in the non-windowed case). The only scenario in which you might save memory is when your 'key space' is spread out over a long period of time. For example, you might not get any input records for some keys for a long time. In the non-windowed case, the aggregates for those keys are stored the whole time, while in the windowed case the key/aggregate record will be dropped and a new entry re-created if records with that key occur again later on (but keep in mind that you lose the previous aggregate in this case).
Last but not least, you might want to have a look into the guidelines for sizing an application: http://docs.confluent.io/current/streams/sizing.html
I have a Spring Batch job with multiple steps, some sequential and some parallel. Some of these steps involve fetching millions of rows, and the query has multiple joins and left joins. I tried using JdbcPagingItemReader, but the ORDER BY clause simply hangs the query; I don't get results even after 30 minutes of waiting. So I switched to JdbcCursorItemReader.
Is that approach fine? I understand that the JdbcCursorItemReader fetches all the data at once and writes it out based on the commit interval. Is there any option to tell the reader to fetch, say, 50,000 records at a time, so that my application and the system are not overloaded?
Thank you for your response, Michael. I have 22 customized item readers which extend JdbcCursorItemReader. If there are multiple threads, how would Spring Batch handle the ResultSet? Is there a possibility of multiple threads reading from the same ResultSet in this case as well?
The JdbcCursorItemReader has the ability to configure the fetchSize (how many records are returned from the database with each request); however, that depends on your database and its configuration. For most databases you can configure the fetch size and it's honored. However, MySQL requires you to set the fetch size to Integer.MIN_VALUE in order to stream results, and SQLite is another that has special requirements.
That being said, it is important to know that the JdbcCursorItemReader is not thread safe (multiple threads would be reading from the same ResultSet).
I personally would advocate for tuning your query, but assuming the above conditions, you should be able to use the JdbcCursorItemReader just fine.
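For example, a sketch of setting the fetch size, assuming Spring Batch 4's builder (on older versions you would call setFetchSize directly on the reader) and a made-up Customer bean with id and name properties:

import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

@Configuration
public class ReaderConfiguration {

    @Bean
    public JdbcCursorItemReader<Customer> customerReader(DataSource dataSource) {
        return new JdbcCursorItemReaderBuilder<Customer>()
            .name("customerReader")
            .dataSource(dataSource)
            .sql("SELECT id, name FROM customer")
            .rowMapper(new BeanPropertyRowMapper<>(Customer.class))
            // hint for the driver to stream rows in chunks rather than buffering the
            // whole result set; on MySQL use Integer.MIN_VALUE to enable streaming
            .fetchSize(50_000)
            .build();
    }
}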