Using luigi to update Postgres table

I've just started using the luigi library. I am regularly scraping a website and inserting any new records into a Postgres database. As I'm trying to rewrite parts of my scripts to use luigi, it's not clear to me how the "marker table" is supposed to be used.
Workflow:
Scrape data
Query DB to check if new data differs from old data.
If so, store the new data in the same table.
However, using luigi's postgres.CopyToTable, if the table already exists, no new data will be inserted. I guess I should be using the inserted column in the table_updates table to figure out what new data should be inserted, but it's unclear to me what that process looks like and I can't find any clear examples online.

You don't have to worry about the marker table much: it's an internal table luigi uses to track which tasks have already been successfully executed. To do so, luigi uses the update_id property of your task. If you didn't declare one, luigi will use the task_id, which is a concatenation of the task family name and the first three parameters of your task.
The key here is to override the update_id property of your task and return a custom string that you know will be unique for each run of your task. Usually you build it from the significant parameters of your task, something like:
@property
def update_id(self):
    return ":".join([str(self.param1), str(self.param2), str(self.param3)])
By significant I mean parameters that change the output of your task. I imagine parameters like the website URL or ID, and the scraping date. Parameters like the hostname, port, username or password of your database will be the same for any of these tasks, so they shouldn't be considered significant.
Notice that without details about your tables and the data you're trying to save, it's pretty hard to say exactly how you should build that update_id string, so please be careful.


Why doesn't Spring Data support returning an entity for modifying queries?

When implementing a system which creates tasks that need to be resolved by some workers, my idea would be to create a table which would hold the task definition along with a status, e.g. for document review we'd have something like reviewId, documentId, reviewerId, reviewTime.
When documents are uploaded to the system, we'd just store the documentId along with a generated reviewId and leave reviewerId and reviewTime empty. When the next reviewer comes along and starts the review, we'd set their ID and the current time to mark the job as "in progress" (I deliberately skip the case where the reviewer takes a long time, or dies during the review).
When implementing such a use case in e.g. PostgreSQL, we could use UPDATE review SET reviewerId = :reviewerId, reviewTime = :reviewTime WHERE reviewId = (SELECT reviewId FROM review WHERE reviewerId IS NULL AND reviewTime IS NULL LIMIT 1 FOR UPDATE SKIP LOCKED) RETURNING reviewId, documentId, reviewerId, reviewTime (so basically update the first non-taken row, using SKIP LOCKED to skip any rows already being processed).
But when moving from native solution to JDBC and beyond, I'm having troubles implementing this:
Spring Data JPA and Spring Data JDBC don't allow a @Modifying query to return anything other than void/boolean/int, forcing us to perform two queries in a single transaction - one for the first pending row, and a second one with the update
one alternative would be to use a stored procedure, but I really hate the idea of storing such logic so far away from the code
another alternative would be to use a persistent queue and skip the database altogether, but this introduces additional infrastructure components that need to be maintained and learned. Any suggestions are welcome though.
Am I missing something? Is it possible to have it all or do we have to settle for multiple queries or stored procedures?
Why doesn't Spring Data support returning an entity for modifying queries?
Because it seems like a rather special thing to do and Spring Data JDBC tries to focus on the essential stuff.
Is it possible to have it all or do we have to settle for multiple queries or stored procedures?
It is certainly possible to do this.
You can implement a custom method using an injected JdbcTemplate.
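For illustration, a minimal sketch of that approach against the review table from the question; the fragment class, method name and Review record below are assumptions for this example, not Spring Data API:

import java.sql.Timestamp;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

record Review(long reviewId, long documentId, long reviewerId, Timestamp reviewTime) {}

// Custom fragment implementation, picked up next to your Spring Data repository interface.
public class ReviewRepositoryCustomImpl {

    private final NamedParameterJdbcTemplate jdbc;

    public ReviewRepositoryCustomImpl(NamedParameterJdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Claims the first non-taken row and returns the updated entity in one round trip.
    public Optional<Review> claimNextReview(long reviewerId) {
        String sql = "UPDATE review SET reviewerId = :reviewerId, reviewTime = now() "
                   + "WHERE reviewId = (SELECT reviewId FROM review "
                   + "                   WHERE reviewerId IS NULL "
                   + "                   LIMIT 1 FOR UPDATE SKIP LOCKED) "
                   + "RETURNING reviewId, documentId, reviewerId, reviewTime";
        List<Review> claimed = jdbc.query(sql,
                Map.of("reviewerId", reviewerId),
                (rs, rowNum) -> new Review(
                        rs.getLong("reviewId"),
                        rs.getLong("documentId"),
                        rs.getLong("reviewerId"),
                        rs.getTimestamp("reviewTime")));
        return claimed.stream().findFirst();
    }
}

Because the whole statement is a single SQL command, the row is locked, updated and returned atomically; no second query or stored procedure is needed.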

CQRS - How to handle if a command requires data from db (query)

I am trying to wrap my head around the best way to approach this problem.
I am importing a file that contains a bunch of users, so I created a handler called ImportUsersCommandHandler, and my command is ImportUsersCommand, which has a List<User> as one of its parameters.
In the handler, for each user I need to import, I have to make sure that the UserType is valid; this is where the confusion comes in. I need to run a query against the database to get a list of all possible user types, and then for each user I am importing, verify that the user type id in the import matches one that is in the db.
I have 3 options.
Create a query GetUserTypesQuery, get its result, and then pass it on to the ImportUsersCommand as a list, verifying inside the command handler
Call the GetUserTypesQuery from the command handler itself and not pass it in (a command calling another query)
Do not create a GetUserTypesQuery at all and just run the query within the command handler (still a query, but with no query/handler involved)
I feel like all these are dirty solutions and not the correct way to apply CQRS.
I agree option 1 sounds the best, but I would maybe suggest adding a pre-handler to validate your input?
So ImportUsersCommandHandler deals with importing your data (and only that), and you add a handler that runs before it which validates the input (in your example, checks the user types and maybe other stuff) and bails out if it does not pass. So it queries the db, checks the user types, and does whatever it needs to if the check fails. Otherwise it just passes down to your business handler (ImportUsersCommandHandler).
I am used to using MediatR in .NET Core and this pattern works well (this is what we do), so sorry if this does not fit your environment/setup!
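A minimal sketch of that pre-handler idea, translated to Java since MediatR is .NET-specific; the CommandHandler abstraction and UserTypeQueries facade are hypothetical stand-ins for whatever pipeline you use:

import java.util.List;
import java.util.Set;

// Hypothetical MediatR-style handler abstraction; substitute your pipeline's own interface.
interface CommandHandler<C> {
    void handle(C command);
}

// Hypothetical query-side facade for reading the valid user types.
interface UserTypeQueries {
    Set<Long> allUserTypeIds();
}

record User(String name, long userTypeId) {}
record ImportUsersCommand(List<User> users) {}

// Runs before the business handler: checks the user types and bails out if invalid.
class ValidatingImportUsersHandler implements CommandHandler<ImportUsersCommand> {

    private final CommandHandler<ImportUsersCommand> next; // the real ImportUsersCommandHandler
    private final UserTypeQueries userTypeQueries;

    ValidatingImportUsersHandler(CommandHandler<ImportUsersCommand> next,
                                 UserTypeQueries userTypeQueries) {
        this.next = next;
        this.userTypeQueries = userTypeQueries;
    }

    @Override
    public void handle(ImportUsersCommand command) {
        Set<Long> validTypeIds = userTypeQueries.allUserTypeIds(); // one query up front
        for (User user : command.users()) {
            if (!validTypeIds.contains(user.userTypeId())) {
                throw new IllegalArgumentException("Unknown user type: " + user.userTypeId());
            }
        }
        next.handle(command); // input is valid, hand off to the business handler
    }
}

This keeps the query out of ImportUsersCommandHandler itself while still making only one trip to the database for the whole import.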

How to add a list of Steps to Job in spring batch

I'm extending an existing Job. What I need to do is update a list of records from the database with data fetched from an external service. I don't know how to do it in a loop, so I thought about creating a list of Steps, each consisting of a reader, processor and writer, and simply adding them via the next() method of the JobBuilder. Looking at the documentation, it's only possible to add one Step at a time, and I have several thousand rows in the database, thus several thousand Steps. How should I do this?
edit:
in short I need to:
read a list of ids from the db,
for every id, call an external service to get information relevant to that id,
process the data from it,
save the updated row to the db
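Rather than one Step per row, this workflow usually maps onto a single chunk-oriented Step: the reader streams the ids, the processor calls the external service for each id, and the writer persists the updated rows in batches. A minimal sketch in Spring Batch 5 style; ExternalService, EnrichedRecord, the table and the SQL are assumptions for illustration:

import javax.sql.DataSource;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.transaction.PlatformTransactionManager;

// Hypothetical client for the external call and a simple bean for the enriched row.
interface ExternalService {
    EnrichedRecord fetchDetails(Long id);
}

class EnrichedRecord {
    private final Long id;
    private final String details;
    public EnrichedRecord(Long id, String details) { this.id = id; this.details = details; }
    public Long getId() { return id; }
    public String getDetails() { return details; }
}

@Bean
Step enrichStep(JobRepository jobRepository,
                PlatformTransactionManager txManager,
                DataSource dataSource,
                ExternalService externalService) {

    // Reader: streams the ids with a cursor instead of loading them all at once.
    JdbcCursorItemReader<Long> idReader = new JdbcCursorItemReaderBuilder<Long>()
            .name("idReader")
            .dataSource(dataSource)
            .sql("SELECT id FROM records")
            .rowMapper((rs, rowNum) -> rs.getLong("id"))
            .build();

    // Processor: one external-service call per id.
    ItemProcessor<Long, EnrichedRecord> enricher = id -> externalService.fetchDetails(id);

    // Writer: batched UPDATEs bound from EnrichedRecord's bean properties.
    JdbcBatchItemWriter<EnrichedRecord> writer = new JdbcBatchItemWriterBuilder<EnrichedRecord>()
            .dataSource(dataSource)
            .sql("UPDATE records SET details = :details WHERE id = :id")
            .beanMapped()
            .build();

    return new StepBuilder("enrichStep", jobRepository)
            .<Long, EnrichedRecord>chunk(100, txManager) // several thousand rows, still one Step
            .reader(idReader)
            .processor(enricher)
            .writer(writer)
            .build();
}

One Step then handles all the rows; the chunk size controls how many updates are written per transaction.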

How to match a list of values from Database1 with a column in Database2 using JDBC Request in JMeter?

I am quite new to JMeter, so I am looking for the best approach to do this: I want to get a list of messageIDs from Database1, then check whether these messageID values are found in Database2, and then check the ErrorMessage column for these IDs against what I expect.
I have the JDBC Request working for extracting the list of messageIDs from Database1. JMeter returns the list to me, but now I'm stuck. I am not sure how to handle the "Variable names" and "Result variable names" fields in the JDBC Request, and how to use them in the next throughput controller loop for the JDBC Request against Database2.
My JDBC request looks like this (PostgreSQL):
SELECT messageID FROM database1
ORDER BY created DESC
FETCH FIRST 20 ROWS ONLY
Variable names: messageid
Result variable names: resultDB1
Then I use the BeanShell Assertion to see whether the connection to the database is present, or whether the response is empty.
But now I have to connect to a different database, so I need to make a new throughput controller with a new JDBC configuration, Request, etc. in there, but I don't know how to pass the messageid list on to this new request.
What I thought about was writing the list of results from Database1 into a file and then reading the values from that file for Database2, but that seems unnecessarily complicated to me; it feels like there should already be a solution for this in JMeter. Also, I am running my JMeter tests on a remote Linux server, so I don't want to make it more complicated by creating new files and saving them somewhere.
You can convert your resultDB1 into a JMeter Property like:
props.put("resultDB1", vars.getObject("resultDB1"));
As per JMeter Documentation:
Properties are not the same as variables. Variables are local to a thread; properties are common to all threads
So basically JMeter Properties are a subset of Java Properties, which are global for the whole JVM
Once done you will be able to access the value in other Thread Groups like:
ArrayList resultDB1 = (ArrayList) props.get("resultDB1");      // stored by the first Thread Group
ArrayList resultDB2 = (ArrayList) vars.getObject("resultDB2"); // from this Thread Group's JDBC Request
//your code to compare 2 result sets here
Also be aware that since JMeter 3.1 you should be using JSR223 Test Elements and the Groovy language for scripting, so consider migrating to the JSR223 Assertion at the next available opportunity.
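For instance, the empty-response check could move into a JSR223 Assertion in the second Thread Group; a minimal sketch in Java-style syntax (which Groovy accepts), relying on JMeter's standard props, vars and AssertionResult bindings:

// resultDB1 was stored as a property in the first Thread Group;
// resultDB2 comes from this Thread Group's JDBC Request.
List resultDB1 = (List) props.get("resultDB1");
List resultDB2 = (List) vars.getObject("resultDB2");

if (resultDB1 == null || resultDB1.isEmpty() || resultDB2 == null || resultDB2.isEmpty()) {
    AssertionResult.setFailure(true);
    AssertionResult.setFailureMessage("One of the JDBC result sets is empty");
}
// your comparison of the two result sets would go here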

How to get list of aggregates using JOliver's CommonDomain and EventStore?

The repository in the CommonDomain only exposes GetById(). So what do I do if my Handler needs a list of Customers, for example?
Taking your question at face value: if you needed to perform operations on multiple aggregates, you would just provide the IDs of each aggregate in your command (which the client would obtain from the query side), then get each aggregate from the repository.
However, looking at one of your comments in response to another answer I see what you are actually referring to is set based validation.
This very question has raised quite a lot of debate about how to do this, and Greg Young has written a blog post on it.
The classic question is 'how do I check that the username hasn't already been used when processing my 'CreateUserCommand'. I believe the suggested approach is to assume that the client has already done this check by asking the query side before issuing the command. When the user aggregate is created the UserCreatedEvent will be raised and handled by the query side. Here, the insert query will fail (either because of a check or unique constraint in the DB), and a compensating command would be issued, which would delete the newly created aggregate and perhaps email the user telling them the username is already taken.
The main point is, you assume that the client has done the check. I know this approach is difficult to grasp at first - but it's the nature of eventual consistency.
Also you might want to read this other question which is similar, and contains some wise words from Udi Dahan.
In the classic event sourcing model, queries like "get all customers" would be carried out by a separate query handler which listens to all events in the domain and builds a query model to satisfy the relevant questions.
If you need to query customers by last name, for instance, you could listen to all customer created and customer name change events and just update one table of last-name to customer-id pairs. You could hold other information relevant to the UI that is showing the data, or you could simply hold IDs and go to the repository for the relevant customers in order to work further with them.
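A minimal sketch of such a projection, with hypothetical event types and an in-memory map standing in for whatever store the query side actually uses:

import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical domain events; the real types would come from your event store.
record CustomerCreated(UUID customerId, String lastName) {}
record CustomerNameChanged(UUID customerId, String newLastName) {}

// Query-side projection: keeps a last-name -> customer-id table up to date.
class CustomerByLastNameProjection {

    private final Map<UUID, String> lastNameById = new ConcurrentHashMap<>();

    void on(CustomerCreated event) {
        lastNameById.put(event.customerId(), event.lastName());
    }

    void on(CustomerNameChanged event) {
        lastNameById.put(event.customerId(), event.newLastName());
    }

    // The query side answers "all customers named X" without touching any aggregate.
    List<UUID> customersWithLastName(String lastName) {
        return lastNameById.entrySet().stream()
                .filter(e -> e.getValue().equalsIgnoreCase(lastName))
                .map(Map.Entry::getKey)
                .toList();
    }
}

The handler (or UI) can then take the returned IDs to the repository's GetById() if it needs the full aggregates.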
You don't need a list of customers in your handler. Each aggregate MUST be processed in its own transaction. If you want to show this list to the user - just build an appropriate view.
Your command needs to contain the id of the aggregate root it should operate on.
This id will be looked up by the client sending the command, using a view in your read model. That view will be populated with data from the events that your AR emits.