Spring Batch - preload chunk-related data

I am reading records from a file and I need to associate them with records that are already in the database.
The related database record is specified within each line of the file (the line contains that record's id). Each item read should have exactly one related record in the database. I do not want to read a single record from the database per item because of the performance issues that might cause.
I would therefore like to read, in one go, all database records related to the lines currently being processed within a chunk. Is there a way to do that? Or is there a way to access all items that are being processed as part of a single chunk (they should all be in memory anyway)?
I know that I could load all records that are likely to be needed, but assume there are millions of such records in the database and I am only processing a file with a few thousand lines.

This is clearly a case for a custom reader - remember that Spring Batch is simply a framework that tries to give structure to your code and infrastructure, but it doesn't impose many restrictions on what logic or code you write yourself, as long as it conforms to the interfaces.
Having said that, if you are not transforming the read items in an ItemProcessor, a List of read items is available at the ItemWriter, and those are exactly the items read from the file as part of the chunk (a rough sketch of this follows at the end of this answer).
If your file is really small, you can read all items in one go using your own custom file reader / parser instead of reading them one by one with the API-provided reader, and then load only the matching records from the DB in one go.
Instead of a single-step job, you can have a two-step job where the first step dumps the records read from the file into a DB table, and the second step does an SQL join between the two tables to find the common records.
These are simply broad ideas and the implementation is up to you. It would become hard if you start looking for ready-made APIs for all the custom cases encountered in practical scenarios.
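A rough sketch of the first idea, assuming a Spring Batch 4-style ItemWriter (List-based write method) and a Spring NamedParameterJdbcTemplate; FileLine, RelatedRecord and RelatedRecordRowMapper are hypothetical types and the query/table names are made up, so treat this as one possible shape rather than a ready-made API:

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class ChunkAwareWriter implements ItemWriter<FileLine> {

    private final NamedParameterJdbcTemplate jdbcTemplate;

    public ChunkAwareWriter(NamedParameterJdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(List<? extends FileLine> chunk) {
        // collect the ids referenced by the whole chunk and load them in ONE query
        List<Long> ids = chunk.stream()
                .map(FileLine::getRelatedId)
                .collect(Collectors.toList());

        Map<Long, RelatedRecord> relatedById = jdbcTemplate.query(
                "SELECT * FROM related_record WHERE id IN (:ids)",
                Map.of("ids", ids),
                new RelatedRecordRowMapper())           // hypothetical RowMapper
            .stream()
            .collect(Collectors.toMap(RelatedRecord::getId, Function.identity()));

        // now associate each line with its pre-loaded record and persist as needed
        for (FileLine line : chunk) {
            RelatedRecord related = relatedById.get(line.getRelatedId());
            // ... write the association ...
        }
    }
}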

I do not want to read a single record from the database per item because of the performance issues that might cause.
What if you read all related records at once for the current item? You can achieve that with the driving query pattern: use an item processor that queries the database to fetch all records related to the current item.
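A minimal sketch of that idea, with hypothetical FileLine / EnrichedItem / RelatedRecordRowMapper types and a made-up query, using a plain Spring JdbcTemplate:

import java.util.List;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

public class EnrichingItemProcessor implements ItemProcessor<FileLine, EnrichedItem> {

    private final JdbcTemplate jdbcTemplate;

    public EnrichingItemProcessor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public EnrichedItem process(FileLine item) {
        // one round trip fetching everything related to the current (driving) item
        List<RelatedRecord> related = jdbcTemplate.query(
                "SELECT * FROM related_record WHERE id = ?",
                new RelatedRecordRowMapper(),            // hypothetical RowMapper
                item.getRelatedId());
        return new EnrichedItem(item, related);
    }
}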


Create tables using a predefined schema in a REST API call in Spring Boot

There is a scenario where I need to add an entry for every user in a table. There will be around 5-10 records per user, and there are approximately 1000 users. So, if I add the data of every user each day to a single table, the table becomes very heavy, and the read/write operations on the table would take some time to return the data (which would mostly be for a particular user).
The tech stack for the back-end is Spring Boot and PostgreSQL.
Is there any way to create a new table for every user dynamically from the Java code, and is that really a good way to manage the data, or should all the data be in a single table?
I'm concerned about the performance of the queries once there are many records, in the case of a single table holding the data for every user.
The model will contain similar fields such as userName, userData, time, etc.
Thank you for your time!
Creating one table per user is not a good practice. Based on the information you provided, roughly 10,000 rows are created per day. Any RDBMS will be able to handle this amount of data without any performance issues.
By making use of indexing and partitioning, you will be able to address any potential performance issues.
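As an illustration of the single-table route, a JPA entity might declare the indexes like this (javax.persistence as used by Spring Boot 2.x; the entity, table and column names are hypothetical, and table partitioning itself would be set up on the PostgreSQL side):

import java.time.Instant;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Index;
import javax.persistence.Table;

@Entity
@Table(name = "user_data",
       indexes = {
           // reads are "mostly for a particular user", so index by user and by user + time
           @Index(name = "idx_user_data_user", columnList = "user_name"),
           @Index(name = "idx_user_data_user_time", columnList = "user_name, created_at")
       })
public class UserData {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "user_name", nullable = false)
    private String userName;

    @Column(name = "user_data")
    private String userData;

    @Column(name = "created_at", nullable = false)
    private Instant createdAt;

    // getters and setters omitted
}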
PS: It is always recommended to define a retention period for the data you want to keep in an operational database. I am not sure about your use case, but if possible define a retention period and move older data out of the operational table into backup storage.

Kafka Streams enrichment - sourcing a lookup table [duplicate]

This question already has an answer here: Is it a good practice to do sync database query or restful call in Kafka streams jobs? (1 answer)
Closed 4 years ago.
There is a Kafka stream component that fetches JSON data from a topic. Now I have to do the following:
1. Parse that input JSON data and fetch the value of a certain ID (identifier) attribute
2. Do a lookup against a particular table in an Oracle database
3. Enrich that input JSON with data from the lookup table
4. Publish the enriched JSON data to another topic
What is the best design approach to achieve step 2? I have a fair idea of how to do the other steps. Any help is very much appreciated.
Depending on the size of the dataset you're talking about, and on the volume of the stream, I'd try to cache the database as much as possible (assuming it doesn't change that often). Augmenting data by querying a database on every record is very expensive in terms of latency and performance.
The way I've done this before is instantiating a thread whose only task is to maintain a fresh local cache (usually a ConcurrentHashMap), and make that available to the process that requires it. In this case, you'll probably want to create a processor, give it a reference to the ConcurrentHashMap described above, and when the Kafka record comes in, look up the data with the key, augment the record, and send it to either a sink processor or another Streams processor, depending on what you want to do with it.
In case the lookup fails, you can fall back to querying the database on demand, but you probably want to test and profile this, because in the worst-case scenario of 100% cache misses, you're going to be querying the database a lot.
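A bare-bones sketch of that processor, assuming a Kafka Streams 2.x-style Processor API; the EnrichmentProcessor name, the JSON/lookup helpers, and the cache-refresh thread (not shown) are placeholders, not part of any library:

import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;

public class EnrichmentProcessor implements Processor<String, String> {

    // shared map; a separate background thread periodically reloads it from the Oracle table
    private final ConcurrentHashMap<String, String> cache;
    private ProcessorContext context;

    public EnrichmentProcessor(ConcurrentHashMap<String, String> cache) {
        this.cache = cache;
    }

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public void process(String key, String json) {
        String id = extractId(json);                    // step 1: pull the identifier out of the JSON
        String lookupValue = cache.get(id);             // step 2: in-memory lookup, no DB round trip
        if (lookupValue == null) {
            lookupValue = queryDatabase(id);            // fallback on a cache miss (expensive, keep it rare)
            if (lookupValue != null) {
                cache.put(id, lookupValue);
            }
        }
        context.forward(key, merge(json, lookupValue)); // steps 3-4: enrich and pass downstream
    }

    @Override
    public void close() { }

    // hypothetical helpers, stubbed so the sketch compiles
    private String extractId(String json) { return json; }            // real code: parse the JSON
    private String queryDatabase(String id) { return null; }          // real code: JDBC lookup in Oracle
    private String merge(String json, String lookup) { return json; } // real code: add the looked-up fields
}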

Idempotent streams or preventing duplicate rows using PipelineDB

My application produces rotating log files containing multiple application metrics. The log file is rotated once a minute, but each file is still relatively large (over 30 MB, with hundreds of thousands of rows).
I'd like to feed the logs into PipelineDB (running on the same single machine), whose continuous views can produce exactly the aggregations I need over the metrics.
I can easily ship the logs to PipelineDB using COPY FROM STDIN, which works great.
However, a machine might occasionally power off unexpectedly (e.g. due to a power outage) during the copy of a log file, which means that once it is back online there is uncertainty about how much of the file has been inserted into PipelineDB.
How could I ensure that each row in my logs is inserted exactly once in such cases? (It's very important that I get complete and accurate aggregations.)
Note that each row in the log file has a unique identifier (a serial number created by my application), but I can't find in the docs an option to define a unique field on the stream. I assume that PipelineDB's design is not meant to handle unique fields in stream rows.
Nonetheless, are there any alternative solutions to this issue?
Exactly-once semantics in a streaming (infinite rows) context is a very complex problem. Most large PipelineDB deployments use some kind of message bus infrastructure (e.g. Kafka) in front of PipelineDB for delivery semantics and reliability, as that's not PipelineDB's core focus.
That being said, there are a couple of approaches you could use here that may be worth thinking about.
First, you could maintain a regular table in PipelineDB that keeps track of each logfile and the last line number that has been successfully written to PipelineDB. When beginning to ship a new logfile, check it against this table to determine which line number to start at (see the sketch below).
Second, you could separate your aggregations by logfile (by including the file path or something similar in the grouping) and simply DELETE any existing rows for that logfile before sending it. Then use combine to aggregate over all logfiles at read time, possibly with a VIEW.
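A rough sketch of the first approach in plain JDBC; the shipped_progress tracking table, the metrics_stream stream, and the per-line insert are all illustrative (in practice you would likely keep COPY and record progress in batches rather than per row):

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ResumableShipper {

    // shipped_progress(logfile TEXT PRIMARY KEY, last_line BIGINT) is a regular table
    public void ship(Connection conn, String logfile) throws Exception {
        long lastShipped = 0;
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT last_line FROM shipped_progress WHERE logfile = ?")) {
            ps.setString(1, logfile);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    lastShipped = rs.getLong(1);     // resume after the last recorded line
                }
            }
        }

        long lineNo = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(logfile));
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO metrics_stream (line) VALUES (?)");
             PreparedStatement progress = conn.prepareStatement(
                     // upsert assumes PostgreSQL 9.5+ semantics underneath
                     "INSERT INTO shipped_progress (logfile, last_line) VALUES (?, ?) "
                     + "ON CONFLICT (logfile) DO UPDATE SET last_line = EXCLUDED.last_line")) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (lineNo <= lastShipped) {
                    continue;                        // already written before the crash
                }
                insert.setString(1, line);
                insert.executeUpdate();
                progress.setString(1, logfile);      // record progress (batch this in real code)
                progress.setLong(2, lineNo);
                progress.executeUpdate();
            }
        }
    }
}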

Neo4j's MERGE command on big datasets

Currently, I am working on a project implementing a Neo4j (v2.2.0) database in the field of web analytics. After loading some samples, I'm trying to load a big data set (>1 GB, >4M lines). The problem I am facing is that the usage of the MERGE command takes exponentially more time as the data size grows. Online sources are ambiguous about the best way to load big sets of data when not every line has to be loaded as a node, and I would like some clarity on the subject. To emphasize, in this situation I am just loading the nodes; the relationships are the next step.
Basically there are three methods:
i) Set a uniqueness constraint for a property, and create all nodes. This method was used mainly before the MERGE command was introduced.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
followed by
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
CREATE (:Book {isbn: row.isbn, title: row.title, ...})
In my experience, this will return an error if a duplicate is found, which stops the query.
ii) Merging the nodes with all their properties.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (:Book {isbn: row.isbn, title: row.title, ...})
I have tried loading my set in this manner, but after letting the process run for over 36 hours and coming to a grinding halt, I figured there should be a better alternative, as ~200K of my eventual ~750K nodes were loaded.
iii) Merging nodes based on one property, and setting the rest after that.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
etc
I am running a test now (~20K nodes) to see if switching from method ii to iii will improve execution time, as a smaller sample gave conflicting results. Are there methods that I am overlooking which could improve execution time? If I am not mistaken, the batch inserter only works for the CREATE command, and not the MERGE command.
I have permitted Neo4j to use 4GB of RAM, and judging from my task manager this is enough (uses just over 3GB).
Method iii) should be the fastest solution since you MERGE against a single property. Do you create the uniqueness constraint before you do the MERGE? Without an index (constraint or normal index), the process will take a long time with a growing number of nodes.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
Followed by:
USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
This should work; you can also increase the PERIODIC COMMIT size.
I can add a few hundred thousand nodes within minutes this way.
In general, make sure you have indexes in place. Merge a node first on the basis of the properties that are indexed (to exploit fast lookup) and then modify that node's properties as needed with SET.
Beyond that, both of your approaches are going through the transaction layer. If you need to jam a lot of data into the DB really quickly, you probably don't want to use transactions to do that, because they're giving you functionality you might not need, and they require overhead that's slowing you down. So a larger solution would be to not insert data with LOAD CSV but go another route entirely.
If you're using the 2.2 series of Neo4j, you can go for the batch inserter via Java, or the neo4j-import tool, which is sadly not available prior to 2.2. What they both have in common is that they don't use transactions.
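For reference, a hedged sketch of the batch inserter route against the Neo4j 2.2-era API (class and package names from that version; the TSV-reading helper is a placeholder). Note that the batch inserter has no MERGE, so de-duplication has to happen in application code, here with a simple in-memory set of seen ISBNs:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.Label;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class BookBatchLoad {

    public static void main(String[] args) throws Exception {
        Label book = DynamicLabel.label("Book");
        BatchInserter inserter = BatchInserters.inserter("C:\\path\\books.db");
        // the constraint (and its backing index) is built when the inserter shuts down
        inserter.createDeferredConstraint(book).assertPropertyIsUnique("isbn").create();

        Set<String> seenIsbns = new HashSet<>();
        for (String[] row : readTsvRows()) {          // hypothetical TSV-reading helper
            String isbn = row[0];
            if (!seenIsbns.add(isbn)) {
                continue;                             // skip duplicates ourselves, no MERGE here
            }
            Map<String, Object> props = new HashMap<>();
            props.put("isbn", isbn);
            props.put("title", row[1]);
            props.put("author", row[2]);
            inserter.createNode(props, book);
        }
        inserter.shutdown();                          // flushes everything to the store
    }

    private static Iterable<String[]> readTsvRows() {
        // real code: stream the TSV file and split each line on '\t'
        return new java.util.ArrayList<>();
    }
}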
Finally, whichever way you go, you should read Michael Hunger's article on importing data into Neo4j, as it provides a good conceptual discussion of what's happening and why you should skip transactions when loading huge piles of data into Neo4j.

Side effects of many reads on other queries

Imagine one database collection with many records (traffic data) and two separate services that are using it.
The first service just reads data using the aggregation framework, and it takes pretty long because not all fields are indexed, so let's say it generates lots of reads on that collection. These reads are done every 15 minutes.
The second service is usually doing nothing, but at some moments it can start doing lots of aggregation queries, and at other moments lots of update queries.
My question is: do the reads from the first service have any kind of side effect on the second service, i.e. do they slow it down? The same question applies to the effect of the second service's massive reads on the first service.
I understand that updates from the second service will slow down every process because of MongoDB's write lock, but reads are a little mysterious to me.
Ivan
There would be no direct impact of the reads upon each other. As you mentioned, MongoDB uses a one-writer, many-readers model, so as long as your workload is read-only, the two services should not directly impact each other.
That said, if the queries require more data than can be held in RAM, they will compete for the available RAM. This competition means fetching data that is not in RAM from disk, which can also lead to one set of queries impacting the other, as both must read from disk.