I am working on a Java/Java EE application that has 12 database tables and the same number of methods exposed via web services. I have Quartz configured to run some jobs that insert data into these tables through another operation. I am trying to estimate the maximum number of records that would reside in the Quartz tables, i.e.
qrtz_blob_triggers
qrtz_calendars
qrtz_cron_triggers
qrtz_fired_triggers
qrtz_job_details
qrtz_locks
qrtz_paused_trigger_grps
qrtz_scheduler_state
qrtz_simple_triggers
qrtz_simprop_triggers
qrtz_triggers
I might not have given complete information to arrive at a logical estimate. Please let me know if there is any other information I can provide.
If somebody can at least suggest an approach to arrive at an estimate, that would be great.
This depends on many factors. Generally, you can expect one record in the trigger tables for each job, assuming your implementation is correct and deleting a job also deletes the trigger. In qrtz_job_details you should have one set of records for each job. Scheduler state is implementation-dependent, too. These tables don't tend to be very large (unless you have a very large implementation with thousands of jobs). The log tables (assuming you write logs to the database) are the ones that get big.
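As a rough starting point, you can load a representative set of jobs on a test system and simply watch the busiest tables grow, e.g. with a query along these lines (plain SQL against the Quartz tables listed above):

SELECT 'qrtz_job_details' AS tbl, count(*) AS row_count FROM qrtz_job_details
UNION ALL
SELECT 'qrtz_triggers', count(*) FROM qrtz_triggers
UNION ALL
SELECT 'qrtz_fired_triggers', count(*) FROM qrtz_fired_triggers;

That way you can see directly which tables grow with the number of scheduled jobs and which stay essentially flat.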
I am unaware of the best practices for automating the addition of future partitions to a table. The situation is this: in, say, December 2021, we want to create partitions for the next year (2022) for some tables in Postgres. We can obviously do this manually, but we want to automate it. So far, I could think of (and found by researching, talking to some people, etc.) the following ways:
Using PL/pgSQL (In my opinion, there could be issues related to version control and deployment here)
Writing a script (in Python, say) and executing it as a cron job annually
Adding the partitioning logic to the application code that inserts records into the database, i.e., whenever a record is inserted into a table, check whether the partition corresponding to that record exists and, if not, create it (fetching the table's metadata for each incoming record can be expensive, but in my opinion we could try to optimize that)
Is there any other way that I am missing? If not, which of the above would be the best way to automate the addition of future partitions to a table in Postgres?
Also, please point out if this is not the right platform for such questions (it would be great if you could direct me to the right one).
Thank you for reading this.
Option 1 is not sufficient, because you need a way to run the code automatically (that's the hard part). It doesn't matter much whether you use PL/pgSQL or a client-side language for the procedural parts of the operation.
Option 3 is not easy to achieve, and certainly not in an efficient fashion.
I would say that the best way is to schedule a job for partition creation, either with the operating system scheduler (cron) or with a PostgreSQL extension like pg_timetable or pg_cron.
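For illustration, a minimal sketch of that approach, assuming a range-partitioned table called measurements keyed on a measured_at column and a Postgres with pg_cron installed (all object names here are hypothetical):

-- function that creates next year's partition if it does not exist yet
CREATE OR REPLACE FUNCTION create_next_year_partition() RETURNS void AS $$
DECLARE
    y int := extract(year FROM now())::int + 1;
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS measurements_%s PARTITION OF measurements
             FOR VALUES FROM (%L) TO (%L)',
        y, make_date(y, 1, 1), make_date(y + 1, 1, 1));
END;
$$ LANGUAGE plpgsql;

-- run it every year on 1 December at 03:00
SELECT cron.schedule('0 3 1 12 *', 'SELECT create_next_year_partition()');

The same function could just as well be called from an operating-system cron job via psql; the scheduling mechanism is the part that matters, not where the procedural code lives.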
I have gone nuts searching for information about this. MS SQL Server offers this option, as described at https://www.mssqltips.com/sqlservertip/4116/sql-server-transactional-replication-static-row-and-column-filters/ but I could not find a way to do this in Postgres.
This is the case:
There is a central server storing the information of resources to be consumed locally by other local servers.
Each of the local servers is interested only in the resources that belong to it (e.g., there is a central repository of books, and I only want the books written in my language).
Additionally, as each server is a separate client, there should be a "great wall" to prevent them from accessing other clients' information.
We have thought of several approaches for developing this:
Use a socket (already implemented) to push changes from central to local via API.
Use triggers to push changes from central to local on database level.
Use logical replication, as explained in the question.
I also have no information about which method would be more efficient computationally or in terms of I/O. The table is small: fewer than 15 columns and fewer than 10,000 rows, so I guess there should not be a problem, although updates to this table may happen several times per second (an estimated average of 2 or 3 per second).
Logical replication (publisher + subscribers at the database level) seems like the proper solution, but I am stuck.
Ideas?
One way to do it with logical replication would be to partition the table by client and replicate only the respective partition.
But with 10000 rows that seems like overkill.
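If you did go that route, the moving parts would be roughly these (partition names and connection details are hypothetical):

-- central server: one partition (or plain table) per client, published separately
CREATE PUBLICATION pub_client_a FOR TABLE resources_client_a;

-- client A: a table named resources_client_a with the same columns must already exist
CREATE SUBSCRIPTION sub_client_a
    CONNECTION 'host=central.example.com dbname=central user=repl'
    PUBLICATION pub_client_a;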
I would enable row-level security on the table so that each client can only see the data that belong to it. Then define a foreign table in each client's database and pull the whole thing over. You can either
truncate and replace the table every time
or
if you have foreign key dependencies, first remove the rows that no longer exist in the central database, then insert or update the rows that are new or were modified
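A rough sketch of that setup, assuming the central table is called resources, has a client_id column, and each client connects with its own database role (all names here are hypothetical):

-- central server: each client role only sees its own rows
ALTER TABLE resources ENABLE ROW LEVEL SECURITY;
CREATE POLICY client_isolation ON resources
    FOR SELECT
    USING (client_id = current_user::text);

-- each local server: pull the visible rows over a foreign table
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER central FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'central.example.com', dbname 'central');
CREATE USER MAPPING FOR CURRENT_USER SERVER central
    OPTIONS (user 'client_a', password 'secret');
CREATE FOREIGN TABLE central_resources (
    id        bigint,
    client_id text,
    payload   jsonb
) SERVER central OPTIONS (schema_name 'public', table_name 'resources');

-- simplest refresh: truncate the local copy and re-pull everything
BEGIN;
TRUNCATE local_resources;
INSERT INTO local_resources SELECT * FROM central_resources;
COMMIT;

Note that the role used in the user mapping must not be the table's owner or a superuser, since those bypass row-level security by default.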
The pglogical extension works for this purpose, as stated by @a_horse_with_no_name.
Latency seems good for now; for us, it did the job.
There is a scenario where I need to add entries for every user in a table. There will be around 5-10 records per user, and there are approximately 1,000 users. So, if I add every user's data to a single table each day, the table becomes very heavy, and read/write operations on the table (which would mostly be for a particular user) would take some time to return the data.
The tech stack for back-end is Spring-boot and PostgreSQL.
Is there any way to create a new table for every user dynamically from the Java code, and is that really a good way to manage the data, or should all the data be in a single table?
I'm concerned about the performance of the queries once there are many records, if a single table holds the data for every user.
The model will contain things like userName, userData, time, etc.
Thank you for your time!
Creating one table per user is not a good practice. Based on the information you provided, about 10,000 rows are created per day. Any RDBMS will be able to handle this amount of data without any performance issues.
By making use of indexing and partitioning, you will be able to address any potential performance issues.
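As a sketch of what that could look like for the model you describe (table, column, and partition names here are just for illustration):

-- one table for all users; monthly partitions keep old data easy to detach
CREATE TABLE user_entries (
    user_name  text        NOT NULL,
    user_data  jsonb,
    created_at timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);

CREATE TABLE user_entries_2024_01 PARTITION OF user_entries
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- reads are "mostly for a particular user", so index on the user column first
CREATE INDEX ON user_entries (user_name, created_at);

With this in place, a query for one user's recent rows touches only the matching partitions and walks a small index range instead of scanning the whole table.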
PS: It is always recommended to define a retention period for the data you want to keep in the operational database. I am not sure about your use case, but if possible, define a retention period and move older data out of the operational table into backup storage.
My software runs a cronjob every 30 minutes, which pulls data from Google Analytics / Social networks and inserts the results into a Postgres DB.
The data looks like this:
url text NOT NULL,
rangeStart timestamp NOT NULL,
rangeEnd timestamp NOT NULL,
createdAt timestamp DEFAULT now() NOT NULL,
...
(various integer columns)
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table. At this rate, the cronjob will generate about 480 000 records a day and about 14.5 million a month.
I think the solution would be to use several tables; for example, I could use a specific table to store data generated in a given month: stats_2015_09, stats_2015_10, stats_2015_11, etc.
I know Postgres supports table partitioning. However, I'm new to this concept, so I'm not sure of the best way to do this. Do I need partitioning in this case, or should I just create these tables manually? Or maybe there is a better solution?
The data will be queried later in various ways, and those queries are expected to run fast.
EDIT:
If I end up with 12-14 tables, each storing 10-20 million rows, Postgres should still be able to run select statements quickly, right? Inserts don't have to be super fast.
Partitioning is a good idea under various circumstances. Two that come to mind are:
Your queries have a WHERE clause that can be readily mapped onto one or a handful of partitions.
You want a speedy way to delete historical data (dropping a partition is faster than deleting records).
Without knowledge of the types of queries that you want to run, it is difficult to say if partitioning is a good idea.
I think I can say that splitting the data into different tables is a bad idea because it is a maintenance nightmare:
You can't have foreign key references into the table.
Queries spanning multiple tables are cumbersome, so simple questions are hard to answer.
Maintaining tables becomes a nightmare (adding/removing a column).
Permissions have to be carefully maintained, if you have users with different roles.
In any case, the place to start is with Postgres's documentation on partitioning, which is here. I should note that Postgres's implementation is a bit more awkward than in other databases, so you might want to review the documentation for MySQL or SQL Server to get an idea of what it is doing.
Firstly, I would like to challenge the premise of your question:
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table.
As far as I know, there is no fundamental reason why the database would not cope fine with a single table of many millions of rows. At the extreme, if you created a table with no indexes, and simply appended rows to it, Postgres could simply carry on writing these rows to disk until you ran out of storage space. (There may be other limits internally, I'm not sure; but if so, they're big.)
The problems only come when you try to do something with that data, and the exact problems - and therefore exact solutions - depend on what you do.
If you want to regularly delete all rows that were inserted more than a fixed timescale ago, you could partition the data on the createdAt column. The DELETE would then become a very efficient DROP TABLE, and all INSERTs would be routed through a trigger to the "current" partition (or could even bypass it if your import script was aware of the partition naming scheme). SELECTs, however, would probably not be able to specify a range of createdAt values in their WHERE clause, and would thus need to query all partitions and combine the results. The more partitions you keep around at a time, the less efficient this would be.
Alternatively, you might examine the workload on the table and see that all queries either already do, or easily can, explicitly state a rangeStart value. In that case, you could partition on rangeStart, and the query planner would be able to eliminate all but one or a few partitions when planning each SELECT query. INSERTs would need to be routed through a trigger to the appropriate table, and maintenance operations (such as deleting old data that is no longer needed) would be much less efficient.
Or perhaps you know that once rangeEnd becomes "too old" you will no longer need the data, and can get both benefits: partition by rangeEnd, ensure all your SELECT queries explicitly mention rangeEnd, and drop partitions containing data you are no longer interested in.
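As a rough sketch of what the inheritance-based variant partitioned on rangeStart could look like (table, function, and trigger names here are hypothetical), one monthly child plus the routing trigger would be along these lines:

-- child table holding one month of data
CREATE TABLE stats_2015_09 (
    CHECK (rangeStart >= '2015-09-01' AND rangeStart < '2015-10-01')
) INHERITS (stats);

-- trigger function that redirects INSERTs on the parent to the right child
CREATE OR REPLACE FUNCTION stats_insert_router() RETURNS trigger AS $$
BEGIN
    IF NEW.rangeStart >= '2015-09-01' AND NEW.rangeStart < '2015-10-01' THEN
        INSERT INTO stats_2015_09 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for rangeStart %', NEW.rangeStart;
    END IF;
    RETURN NULL;  -- the row was redirected, so nothing is stored in the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stats_partition_insert
    BEFORE INSERT ON stats
    FOR EACH ROW EXECUTE PROCEDURE stats_insert_router();

With constraint exclusion enabled, SELECTs whose WHERE clause pins down rangeStart can then skip the children whose CHECK constraints rule them out.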
To borrow Linus Torvalds's terminology from git, the "plumbing" for partitioning is built into Postgres in the form of table inheritance, as documented here, but there is little in the way of "porcelain" other than examples in the manual. However, there is a very good extension called pg_partman which provides functions for managing partition sets based on either IDs or date ranges; it's well worth reading through the documentation to understand the different modes of operation. In my case, none quite matched, but forking that extension was significantly easier than writing everything from scratch.
Remember that partitioning does not come free, and if there is no obvious candidate for a column to partition by based on the kind of considerations above, you may actually be better off leaving the data in one table, and considering other optimisation strategies. For instance, partial indexes (CREATE INDEX ... WHERE) might be able to handle the most commonly queried subset of rows; perhaps combined with "covering indexes", where Postgres can return the query results directly from the index without reference to the main table structure ("index-only scans").
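For instance, if most queries only ever look at recent data for a given url, a partial index along these lines (dates and names purely illustrative) keeps the hot subset small; if a query only reads the indexed columns, it can often be answered by an index-only scan:

-- partial index over the slice of rows most queries actually touch
CREATE INDEX stats_recent_url_idx ON stats (url, rangeStart)
    WHERE rangeStart >= DATE '2015-09-01';

Queries must repeat a compatible rangeStart condition in their own WHERE clause for the planner to consider this index.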
Currently, I am working on a project implementing a Neo4j (v2.2.0) database in the field of web analytics. After loading some samples, I'm trying to load a big data set (>1 GB, >4M lines). The problem I am facing is that the MERGE command takes exponentially more time as the data size grows. Online sources are ambiguous about the best way to load big data sets when not every line has to be loaded as a node, and I would like some clarity on the subject. To emphasize: in this situation I am just loading the nodes; relationships are the next step.
Basically there are three methods
i) Set a uniqueness constraint for a property, and create all nodes. This method was used mainly before the MERGE command was introduced.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
followed by
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
CREATE (:Book {isbn: row.isbn, title: row.title, etc})
In my experience, this will return an error if a duplicate is found, which stops the query.
ii) Merging the nodes with all their properties.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (:Book {isbn: row.isbn, title: row.title, etc})
I have tried loading my set in this manner, but after the process had run for over 36 hours and come to a grinding halt, I figured there should be a better alternative, as only ~200K of my eventual ~750K nodes had been loaded.
iii) Merging nodes based on one property, and setting the rest after that.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
etc
I am running a test now (~20K nodes) to see if switching from method ii to iii will improve execution time, as a smaller sample gave conflicting results. Are there methods I am overlooking that could improve execution time? If I am not mistaken, the batch inserter only works for the CREATE command, not the MERGE command.
I have permitted Neo4j to use 4GB of RAM, and judging from my task manager this is enough (uses just over 3GB).
Method iii) should be the fastest solution since you MERGE against a single property. Do you create the uniqueness constraint before you do the MERGE? Without an index (constraint or normal index), the process will take a long time with a growing number of nodes.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
Followed by:
USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
This should work; you can also increase the PERIODIC COMMIT value.
I can add a few hundred thousand nodes within minutes this way.
In general, make sure you have indexes in place. Merge a node first on the basis of the properties that are indexed (to exploit fast lookup) and then modify that node's properties as needed with SET.
Beyond that, both of your approaches are going through the transaction layer. If you need to jam a lot of data into the DB really quickly, you probably don't want to use transactions to do that, because they're giving you functionality you might not need, and they require overhead that's slowing you down. So a larger solution would be to not insert data with LOAD CSV but go another route entirely.
If you're using the 2.2 series of Neo4j, you can go for the batch inserter via Java, or the neo4j-import tool, which sadly is not available prior to 2.2. What they both have in common is that they don't use transactions.
Finally, whichever way you go, you should read Michael Hunger's article on importing data into Neo4j, as it provides a good conceptual discussion of what's happening and why you need to skip transactions if you're going to load huge piles of data into Neo4j.