I need to know whether the results of an SQL query have changed between two executions.
The solution I came up with is to calculate and compare a hash value based on the ResultSet content.
What is the preferred way?
There is no special hashCode method for ResultSet that is calculated from all the retrieved data. You definitely cannot use the default hashCode method.
To be 100% sure that you will take into account all the changes in the data,
you have to retrieve all columns of all rows from the ResultSet one by one and calculate a hash code over them in whatever way suits you (for example, put everything into a single String and take its hashCode).
But that is a very time-consuming operation. I would suggest executing an extra query that calculates a checksum itself. For example, it could return the row count and the sum of all columns/rows, or something like that.
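To make the "read every column of every row" approach concrete, here is a minimal sketch. It is written in Python against an in-memory SQLite database purely so that it runs on its own; the function name `result_fingerprint` and the `items` table are made up for illustration and are not part of JDBC. It uses SHA-256 rather than a 32-bit String hashCode, which makes accidental collisions far less likely. The query needs a stable ORDER BY, otherwise the same data returned in a different order produces a different fingerprint.

```python
import hashlib
import sqlite3


def result_fingerprint(conn, query, params=()):
    """Return a SHA-256 hex digest computed over every column of every row."""
    digest = hashlib.sha256()
    for row in conn.execute(query, params):
        for value in row:
            # repr() keeps None, 1 and '1' distinct; the separator byte stops
            # adjacent values from running together.
            digest.update(repr(value).encode("utf-8"))
            digest.update(b"\x1f")
        digest.update(b"\x1e")  # row separator
    return digest.hexdigest()


# Hypothetical table, used only to demonstrate the idea.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, "a"), (2, "b")])

query = "SELECT id, name FROM items ORDER BY id"  # ORDER BY keeps row order stable
before = result_fingerprint(conn, query)
unchanged = result_fingerprint(conn, query)

conn.execute("UPDATE items SET name = 'c' WHERE id = 2")
after = result_fingerprint(conn, query)
```

Comparing `before` with `after` tells you whether anything in the result changed. The cost is still a full read of the result on every check, which is the drawback described above.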
Related
I have a system with a large number of tables that contain historical data. Each table has a ts_from and ts_to column which are of type timestamptz. These represent the time period in which the data for a particular row was valid.
These columns are indexed.
If I want to query all rows that were valid at a particular timestamp, it is trivial to write the ts_from <= #at_timestamp AND ts_to >= #at_timestamp WHERE clause to utilise the index.
However, I wanted to create a function called Temporal.at which would take the #at_timestamp column and the ts_from / ts_to columns and do this by hiding the complexity of the comparison from the query that uses it. You might think this is trivial, but I would also like to extend the concept to create a function called Temporal.between which would take a #from_timestamp and #to_timestamp and select all rows that were valid between those two periods. That function would not be trivial, as one would have to check where rows partially overlap the period rather than always being fully enclosed by it.
The issue is this: I have written these functions but they do not cause the index to be used. The query performance is woefully slow on the history tables, some of which have hundreds of millions of rows.
The questions therefore are:
a) Is there a way to write these functions so that we can be sure the indexes will be used?
b) Am I going about this completely the wrong way and is there a better way to proceed?
This is complicated if you model ts_from and ts_to as two different timestamp columns. Instead, you should use a range type: tstzrange. Then everything will become simple:
for containment in an interval, use #at_timestamp <@ from_to
for interval overlap, use tstzrange(#from_timestamp, #to_timestamp) && from_to
Both queries can be supported by a GiST index on the range column.
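As an illustration of what the two checks mean, here is a small sketch in plain Python rather than SQL. The function names `valid_at` and `overlaps` are hypothetical, and the bounds are closed on both ends to match the `<=` / `>=` comparisons in the question. A PostgreSQL range built with the two-argument constructor includes its lower bound and excludes its upper bound by default, so the edge cases differ slightly there.

```python
from datetime import datetime, timezone


def valid_at(ts_from, ts_to, at):
    """True if `at` lies inside the closed period [ts_from, ts_to]."""
    return ts_from <= at <= ts_to


def overlaps(ts_from, ts_to, period_from, period_to):
    """True if [ts_from, ts_to] shares at least one instant with the query period."""
    # Two closed intervals overlap exactly when each one starts
    # no later than the other one ends.
    return ts_from <= period_to and ts_to >= period_from


def utc(year, month, day):
    """Helper: a timezone-aware UTC timestamp at midnight."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# A hypothetical history row valid from 1 January to 30 June 2020.
row_from, row_to = utc(2020, 1, 1), utc(2020, 6, 30)

inside = valid_at(row_from, row_to, utc(2020, 3, 1))
partial = overlaps(row_from, row_to, utc(2020, 6, 1), utc(2020, 12, 31))
enclosing = overlaps(row_from, row_to, utc(2019, 1, 1), utc(2021, 1, 1))
disjoint = overlaps(row_from, row_to, utc(2020, 7, 1), utc(2020, 12, 31))
```

The point is that partial overlap needs no special cases: one pair of comparisons covers rows that start before the period, end after it, or sit entirely inside it.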
I would like to transpose table a to table b without knowing exactly how many procedures there are. Is there a way to include a loop inside a query?
Thank you in advance!
So far I am just checking what the maximum number of 'procedures' is, putting all the procedures in an array, and then querying all elements of this array. However, I would like a query that always works without first defining the maximum number of procedures.
I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything if I stored these 100 values in an integer[] column in PostgreSQL, as compared to storing them in separate columns?
Plus, since we can add indexes to array elements,
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference for queries using such an index compared to a regular index on a column?
As far as I've read, this performance difference would come into play for arrays with variable-length elements, but I'm not sure about fixed-length ones.
Don't go for the lazy way.
If you need to store 100 or more values as an array, that is OK, provided an array makes sense for your application and your data.
If you need to query for a specific element of the array, then this design is not good, regardless of performance, and you should use columns. This will also help when you have to delete a "column" in the middle or redesign it.
Anyway, as Frank wrote in the comments, if the values all have the same type, consider modelling them in another table (provided they also share the same meaning).
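A rough sketch of the "another table" option follows. It is written in Python with an in-memory SQLite database only so that it is runnable as-is; the table `test_value` and its columns are invented for illustration, and in PostgreSQL the DDL would look much the same. Each former array element becomes one row, and an ordinary index on (position, value) takes over from the expression index on foo[1].

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (id INTEGER PRIMARY KEY);
    -- One row per former array element; position replaces the array subscript.
    CREATE TABLE test_value (
        test_id  INTEGER NOT NULL REFERENCES test (id),
        position INTEGER NOT NULL,
        value    INTEGER NOT NULL,
        PRIMARY KEY (test_id, position)
    );
    CREATE INDEX test_value_lookup ON test_value (position, value);
""")
conn.execute("INSERT INTO test (id) VALUES (1), (2)")
conn.executemany(
    "INSERT INTO test_value (test_id, position, value) VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 20), (2, 1, 10), (2, 2, 99)],
)

# Equivalent of filtering on foo[2] = 20 in the array design.
matching_ids = [
    row[0]
    for row in conn.execute(
        "SELECT test_id FROM test_value"
        " WHERE position = ? AND value = ? ORDER BY test_id",
        (2, 20),
    )
]
```

With this layout, dropping or adding a "column" is a matter of deleting or inserting rows, not altering the table.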
I am getting my data from a denormalized table, where I keep names and actions (among other things). I want to create a calculated field that will return the sum of workgroup names, but only when there are more than five actions present in the DB for a given workgroup.
Here's how I have done it when I wanted to check whether a certain action has been registered for a workgroup:
WINDOW_SUM(COUNTD(IF [action] = "ADD" THEN [workgroup_name] END))
When I try to do a similar thing with a count, I get "Cannot mix aggregate and non-aggregate arguments":
WINDOW_SUM(COUNTD(IF COUNT([Number of Records]) > 5 THEN [workgroup_name] END))
I know that there's a problem with the IF clause, but I don't know how to fix it.
How can I change the IF to be valid? Maybe there's an easier way that I am missing?
EDIT:
(after Inox's response)
I know that my problem is mixing aggregate with non-aggregate fields. I can't use a filter to do it, because I want to use it later as part of a more complicated view, and filtering would destroy the whole idea.
No, the problem is mixing aggregated arguments (e.g., sum, count) with non-aggregated ones (e.g., any field used directly). And that's what you're doing when you mix COUNT([Number of Records]) with [workgroup_name].
If your goal is to know how many unique workgroup_name values have more than 5 records (which seems to be the idea of your code), I think it's easier to filter and then count.
So first you drag workgroup_name to Filters, go to the Condition tab, select By field, Number of Records, Count, >, 5
This way you'll keep only the workgroup_name values that have more than 5 records.
Now you can go with a simple COUNTD(workgroup_name)
EDIT: After clarification
Okay, then you need to add a marker that is fixed in your database, so table calculations won't help you.
By definition, a table calculation depends on the fields that are on the worksheet (and how you decide to use those fields to partition or address), and it's only calculated AFTER being called in a sheet. That way, each time you call the function it will recalculate, and for some analyses you may want to do, the fields you need to make the table calculation correct won't be there.
The same applies to aggregations (counts, sums, ...): the aggregation depends, well, on the level of aggregation you have.
In this case it's better to manipulate your data before connecting it to Tableau. I don't see a direct way (a single calculated field that would solve your problem). What can be done is to generate a db from Tableau (with the aggregation of number of records for each workgroup_name), export it to csv or mdb, and then reconnect it to Tableau. But if you can manipulate your database outside Tableau, that's usually a better solution.
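If you go the "manipulate the data outside Tableau" route, the aggregation itself is a plain GROUP BY ... HAVING query. Below is a minimal sketch, in Python with an in-memory SQLite database so it runs on its own. The `actions` table and its sample rows are made up; only the column names workgroup_name and action come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions (workgroup_name TEXT, action TEXT)")

# Hypothetical sample data: alpha has 6 records, beta 5, gamma 7.
rows = [("alpha", "ADD")] * 6 + [("beta", "ADD")] * 5 + [("gamma", "EDIT")] * 7
conn.executemany("INSERT INTO actions VALUES (?, ?)", rows)

# The inner query keeps workgroups with more than five records;
# the outer query counts how many such workgroups there are.
(busy_workgroups,) = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT workgroup_name
        FROM actions
        GROUP BY workgroup_name
        HAVING COUNT(*) > 5
    )
""").fetchone()
```

The inner query on its own gives you the list of qualifying workgroups, which you could store as the fixed marker and join back to the original data before loading it into Tableau.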
I'm trying to write a stored procedure that takes some input parameters, runs multiple queries using them, does calculations on the output of those queries, and from those calculations and the original query results produces a formatted text string like:
Number of Rows for max(Z) matching condition x and y of total rows matching x (x&y/x*100).
To explain the max(Z) bit: this will be the username field. It won't matter which actual entry is picked, because the where clause will filter the results by user id. Is there a saner way to do this?
For starters break the code up into multiple procedures. Don't create one procedure that does all of these things.