Postgres Query Shows one Item twice with offset and limit set - postgresql

I have this query written in Postgresql.
SELECT "api_issue".id
FROM "api_issue"
LEFT JOIN api_issue_categories ON api_issue_categories.issue_id = api_issue.id
AND api_issue_categories.categories_id = '1126'
WHERE api_issue_categories.categories_id = '1126'
ORDER BY api_issue.published_date LIMIT '20' OFFSET '40'
This query returns the following.
ID
313279
312740
.....
313953
The key here is ID 313953
Now I change the query's offset to 60.
SELECT "api_issue".id
FROM "api_issue"
LEFT JOIN api_issue_categories ON api_issue_categories.issue_id = api_issue.id
AND api_issue_categories.categories_id = '1126'
WHERE api_issue_categories.categories_id = '1126'
ORDER BY api_issue.published_date LIMIT '20' OFFSET '60'
And the following results are returned.
ID
313953
.....
312740
313454
Notice that 313953 is returned as the first result.
So the problem is that ID 313953 is returned as the last result in the initial query and the first result in the second query. I've verified that there is only 1 entry for this record in the JOIN table.
The strange thing is that you would expect this to happen consistently, with the last ID of one page always reappearing as the first ID of the next. But it only happens when the first query uses an offset of 40 and the second an offset of 60.
This query is used on the front end as a paging result and this is the only entry out of 175 that shows up twice for some reason.
Does anyone have any idea?? I'm baffled.

There are two ways this can happen.
One is that api_issue.id is not unique in the result. This can happen if the same issue is in the same category multiple times.
More likely, though, there are multiple issues with the same published_date. Rows that tie on the sort key have no guaranteed order relative to each other, so the same row can land on different pages from one query to the next. Adding a secondary sort key gives a stable order, e.g. ORDER BY api_issue.published_date, api_issue.id
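A minimal sketch of the tie-breaker idea, using Python's built-in sqlite3 as a stand-in for Postgres. The table contents are hypothetical: every row shares one published_date, so only the secondary key id fixes the order and keeps adjacent pages from overlapping.

```python
import sqlite3

# Hypothetical stand-in for api_issue: 100 rows that all share one publish date.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE api_issue (id INTEGER PRIMARY KEY, published_date TEXT)")
conn.executemany(
    "INSERT INTO api_issue (id, published_date) VALUES (?, ?)",
    [(i, "2014-01-01") for i in range(1, 101)],
)


def page(offset, limit=20):
    """Return one page of ids, ordered by date with id as the tie-breaker."""
    rows = conn.execute(
        "SELECT id FROM api_issue "
        "ORDER BY published_date, id "  # id makes the sort order total
        "LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return [r[0] for r in rows]


page_3 = page(40)
page_4 = page(60)
# With a total order, consecutive pages cannot share a row.
overlap = set(page_3) & set(page_4)
```

Without the id in the ORDER BY, the database would be free to return the tied rows in any order, which is how one row can end one page and start the next.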

Related

Push query to ksqlDB not returning final result in first result row

I'm trying to get the count of events in a ksqlDB table within an arbitrary time window.
The table my_table was created with a WINDOW SESSION.
It is important to note the query is being run after all data was processed, and the ksqlDB server is basically doing nothing.
My query looks something like this
count(*) as count
FROM my_table
WHERE WINDOWSTART < (1602010972370 + 5000) AND WINDOWEND > 1602010972370
group by 1 emit changes;
Running this kind of query will very often return one result row, and immediately afterwards a second result row with the actual "final" result.
It doesn't look like it's a result of values in the table not being "settled" yet, because if I repeat the same query (as many times as I want) I get exactly the same behavior.
I'm assuming there is some configuration value that lets ksqlDB wait just a little longer (on the order of one second) before it returns the result, so that I get the final result in the first row?
BTW, using emit final will not work on the query itself, since it only applies to "windowed queries".

Slow query with order and limit clause but only if there are no records

I am running the following query:
SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC LIMIT 25 OFFSET 1
Because I have records in the table with name = 'Bob' the query time is fast on a table of 10M records (<.5 seconds)
However, if I search for name = 'Susan' the query takes over 45 seconds. I have no records in the table where name = 'Susan'.
I have an index on each of name and address. I've vacuumed the table, analyzed it and have even tried to re-write the query:
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC) f LIMIT 25 OFFSET 1
and can't find any solution. I'm not really sure how to proceed. Please note this is different than this post as my slowness only happens when there are no records.
EDIT:
If I take out the ORDER BY address then it runs quickly. Obviously, I need that there. I've tried re-writing it (with no success):
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob') f ORDER BY address DESC LIMIT 25 OFFSET 1
Examine the execution plan to see which index is being used. In this case, the separate indexes for name and address are not enough. You should create a combined index of name, then address for this query.
Think of an index as a system maintained copy of certain columns, in a different order from the original. In this case, you want to first find matches by name, then tie-break on address, then take until you have enough or run out of name matches.
By making name first in the multi-column index, the index will be sorted by name first. Then address will serve as our tie-breaker.
Under the original indexes, if the address index is the one chosen then the query's speed will vary based on how quickly it can find matches.
The plan (in English) would be: Proceed through all of the rows which happen to already be sorted by address, discard any that do not match the name, keep going until we have enough.
So if you do not get 25 matches, you read the whole table!
With my proposed multi-column index, the plan (in English) would be: Proceed through all of the name matching rows which happen to already be sorted by address. Start with the first one and take them until you have enough. If you run out, stop.
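As a rough illustration of the multi-column index, here is a sketch using Python's built-in sqlite3 rather than Postgres. The index name foo_name_address_idx and the sample rows are made up for the example, and the planner output format is SQLite's; in Postgres the same CREATE INDEX statement applies and EXPLAIN would show the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
# Composite index: name first (equality filter), then address (sort order).
conn.execute("CREATE INDEX foo_name_address_idx ON foo (name, address)")
conn.executemany(
    "INSERT INTO foo (name, address) VALUES (?, ?)",
    [("Bob", f"{n} Main St") for n in range(100)],
)

query = "SELECT * FROM foo WHERE name = ? ORDER BY address DESC LIMIT 25 OFFSET 1"

# Ask the planner how it would run the query; the last column is a text description.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("Susan",))]

# No 'Susan' rows exist, so the index lookup ends immediately with no results.
susan_rows = conn.execute(query, ("Susan",)).fetchall()
bob_rows = conn.execute(query, ("Bob",)).fetchall()
```

Printing plan should show the composite index being used for the name lookup, which is the thing to check for in the real execution plan as well.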
Since the query without the ORDER BY is much faster than the one with it, I'd run two queries:
- One without the ORDER BY and with LIMIT 1, to find out whether at least one record exists.
- If there is at least one, it's safe to run the query with ORDER BY.
- If there is no record, there's no need to run the second query.
Yes, it's not a solution, but it will let you deliver your project. Just ensure you create a ticket to handle the technical debt after delivery ;) otherwise your lead developer will set you on fire.
Then, to solve the real technical problem, it will be useful to know which indices you have created. Without these it will be very hard to give you a proper solution!
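The two-query workaround could look roughly like this. It is a sketch in Python with sqlite3 standing in for the real database; the helper name fetch_page and the sample rows are hypothetical.

```python
import sqlite3


def fetch_page(conn, name, limit=25, offset=1):
    """Run the cheap existence check first; sort only if a match exists."""
    exists = conn.execute(
        "SELECT 1 FROM foo WHERE name = ? LIMIT 1", (name,)
    ).fetchone()
    if exists is None:
        return []  # no matches, so skip the expensive ordered query
    return conn.execute(
        "SELECT * FROM foo WHERE name = ? ORDER BY address DESC LIMIT ? OFFSET ?",
        (name, limit, offset),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [("Bob", "1 Elm"), ("Bob", "2 Elm"), ("Bob", "3 Elm")],
)
```

Calling fetch_page(conn, "Susan") returns an empty list after only the existence check, while fetch_page(conn, "Bob") runs the ordered query as before.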

SQL Server Doesn't Find Record Containing Float Value That It Returned

I have a query similar to this:
SELECT SpecimenID, TestPeriodID, Grams, ConsumptionRate FROM LabData
WHERE TestPeriodID = 255
AND TestID = 1
AND Grams = 728560
The record that is returned has a value of 16.5667068820687 for the FLOAT column ConsumptionRate.
I now add the following to the end of my query:
AND ConsumptionRate = 16.5667068820687
Executing the new query returns zero records, even though the additional criteria are exactly what SQL Server itself reported. I assume that this is a rounding error. However, I have a CLR function that is executing the 2nd query based on the results returned by the first.
What can I do in my generated search criteria to maintain an accurate representation of the first result, but not miss existing records in the second result?
What if you compare against a small range instead of an exact value? Does this return any result?
ConsumptionRate > 16.56 AND ConsumptionRate < 16.57
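The range in that answer is quite wide; a tighter tolerance around the displayed value follows the same idea. The sketch below uses Python with sqlite3 in place of SQL Server, and it assumes the stored double carries more digits than the 15 significant digits that were displayed. The full-precision value and the tolerance of 1e-9 are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LabData (SpecimenID INTEGER, ConsumptionRate REAL)")

# Hypothetical full-precision value behind the displayed 16.5667068820687.
stored = 16.566706882068712
conn.execute("INSERT INTO LabData VALUES (1, ?)", (stored,))

# Round-trip through a 15-significant-digit display, as a client tool might.
displayed = float(f"{stored:.15g}")

# Exact equality against the displayed value misses the row.
exact = conn.execute(
    "SELECT SpecimenID FROM LabData WHERE ConsumptionRate = ?", (displayed,)
).fetchall()

# Comparing within a small tolerance finds it again.
tolerance = 1e-9
near = conn.execute(
    "SELECT SpecimenID FROM LabData WHERE ABS(ConsumptionRate - ?) < ?",
    (displayed, tolerance),
).fetchall()
```

Choosing the tolerance is a judgment call: it should be larger than the display rounding error but smaller than the gap between genuinely different measurements.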

COUNT(field) returns correct amount of rows but full SELECT query returns zero rows

I have a UDF in my database which tries to find a station (e.g. bus/train) based on some input data (geographic/name/type). Inside this function I check whether any rows match the given values:
SELECT
COUNT(s.id)
INTO
firsttry
FROM
geographic.stations AS s
WHERE
ST_DWithin(s.the_geom,plocation,0.0017)
AND
s.name <-> pname < 0.8
AND
s.type ~ stype;
The firsttry variable now contains the value 1. If I use the following (slightly extended) SELECT statement, I get no results:
RETURN query SELECT
s.id, s.name, s.type, s.the_geom,
similarity(
regexp_replace(s.name::text,'(Hauptbahnhof|Hbf)','Hbf'),
regexp_replace(pname::text,'(Hauptbahnhof|Hbf)','Hbf')
)::double precision AS sml,
st_distance(s.the_geom,plocation) As dist from geographic.stations AS s
WHERE ST_DWithin(s.the_geom,plocation,0.0017) and s.name <-> pname < 0.8
AND s.type ~ stype
ORDER BY dist asc,sml desc LIMIT 1;
the parameters are as follows:
stype = '^railway'
pname = 'Amsterdam Science Park'
plocation = ST_GeomFromEWKT('SRID=4326;POINT(4.9492530 52.3531670)')
The tuple I need returned is:
id name type geom (displayed as ST_AsText)
909658;"Amsterdam Sciencepark";"railway_station";"POINT(4.9482893 52.352904)"
The same UDF returns quite well for a lot of other stations, but this is one (of more) which just won't work. Any suggestions?
P.S. The use of the <-> operator is coming from the pg_trgm module.
Some ideas on how to troubleshoot this:
Break your troubleshooting into steps. Start with the simplest query possible. No aggregates, just joins and no filters. Then add filters. Then add order by, then add aggregates. Look at exactly where the change occurs.
Try reindexing the database.
One possibility that occurs to me based on this is that it could be a corrupted index used in the second query but not the first. I have seen corrupted indexes in the past and usually they throw errors but at least in theory they should be able to create a problem like this.
If this is correct, your query will suddenly return rows if you remove the ORDER BY clause.
If you have a corrupted index, then you need to pay close attention to hardware. Is the RAM ECC? Is the processor overheating? How are your disks doing?
A second possibility is that there is a typo in a join condition or filter statement. Normally this is something I would suspect first, but index problems are easy enough to rule out that it makes sense to start there. If removing the ORDER BY doesn't change things, then chances are it is a typo. If you can't find a typo, then try reindexing.

Does DataReader.NextResult always retrieve the results in the same order

I have a SELECT query that yields multiple results and do not have any ORDER BY clause.
If I execute this query multiple times and then iterate through results using DataReader.NextResult(), would I be guaranteed to get the results in the same order?
For example, if I execute the following query that returns 199 rows:
SELECT * FROM products WHERE productid < 200
would I always get the first result with productid = 1 and so on?
As far as I have observed it always return the results in same order, but I cannot find any documentation for this behavior.
======================================
As per my research:
Check out this blog: Conor vs. SQL. What I actually wanted to ask is whether the query result can change even if the data in the table remains the same (i.e. no updates or deletes). It seems that for a large table, when SQL Server employs parallelism, the order can be different.
First of all, to iterate the rows in a DataReader, you should call Read, not NextResult.
Calling NextResult will move to the next result set if your query has multiple SELECT statements.
To answer your question, you must not rely on this.
A query without an ORDER BY clause will return rows in SQL Server's default iteration order.
For small tables, this will usually be the order in which the rows were added, but this is not guaranteed and is liable to change at any time. For example, if the table is indexed or partitioned, the order will be different.
No, DataReader will return the results in the order they come back from SQL. If you don't specify an ORDER BY clause, that will be the order that they exist in the table.
It is possible, perhaps even likely, that they will always come back in the same order, but this isn't guaranteed. The order is determined by the query plan (at least in SQL Server) on the database server. If something changes that query plan, the order could change. You should always use ORDER BY if the order of results is in any way important to your processing of the data.