COUNT(field) returns the correct number of rows but the full SELECT query returns zero rows - PostgreSQL

I have a UDF in my database which basically tries to get a station (e.g. bus/train) based on some input data (geographic/name/type). Inside this function I try to check if there are any rows matching the given values:
SELECT COUNT(s.id)
INTO firsttry
FROM geographic.stations AS s
WHERE ST_DWithin(s.the_geom, plocation, 0.0017)
  AND s.name <-> pname < 0.8
  AND s.type ~ stype;
The firsttry variable now contains the value 1. If I use the following (slightly extended) SELECT statement I get no results:
RETURN QUERY SELECT
    s.id, s.name, s.type, s.the_geom,
    similarity(
        regexp_replace(s.name::text, '(Hauptbahnhof|Hbf)', 'Hbf'),
        regexp_replace(pname::text, '(Hauptbahnhof|Hbf)', 'Hbf')
    )::double precision AS sml,
    ST_Distance(s.the_geom, plocation) AS dist
FROM geographic.stations AS s
WHERE ST_DWithin(s.the_geom, plocation, 0.0017)
  AND s.name <-> pname < 0.8
  AND s.type ~ stype
ORDER BY dist ASC, sml DESC
LIMIT 1;
The parameters are as follows:
stype = '^railway'
pname = 'Amsterdam Science Park'
plocation = ST_GeomFromEWKT('SRID=4326;POINT(4.9492530 52.3531670)')
The tuple I need to be returned is:
id name type geom (displayed as ST_AsText)
909658;"Amsterdam Sciencepark";"railway_station";"POINT(4.9482893 52.352904)"
The same UDF works fine for a lot of other stations, but this is one (of several) which just won't work. Any suggestions?
P.S. The use of the <-> operator is coming from the pg_trgm module.

Some ideas on how to troubleshoot this:
Break your troubleshooting into steps. Start with the simplest query possible. No aggregates, just joins and no filters. Then add filters. Then add order by, then add aggregates. Look at exactly where the change occurs.
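For example, a minimal sketch of that stepping process against the question's table (substituting literal values for plocation and pname, since those parameters only exist inside the function):

-- Step 1: no filters, just confirm the row exists
SELECT s.id, s.name, s.type FROM geographic.stations AS s WHERE s.id = 909658;
-- Step 2: the spatial filter alone
SELECT s.id, s.name FROM geographic.stations AS s
WHERE ST_DWithin(s.the_geom, ST_GeomFromEWKT('SRID=4326;POINT(4.9492530 52.3531670)'), 0.0017);
-- Step 3: add the trigram filter
SELECT s.id, s.name FROM geographic.stations AS s
WHERE ST_DWithin(s.the_geom, ST_GeomFromEWKT('SRID=4326;POINT(4.9492530 52.3531670)'), 0.0017)
  AND s.name <-> 'Amsterdam Science Park' < 0.8;
-- Step 4: add the regex filter, then the ORDER BY, then the LIMIT,
-- and note exactly where the expected row disappears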
Try reindexing the database.
One possibility that occurs to me based on this is that it could be a corrupted index used in the second query but not the first. I have seen corrupted indexes in the past; usually they throw errors, but at least in theory they could create a problem like this.
If this is correct, your query will suddenly return rows if you remove the ORDER BY clause.
If you have a corrupted index, then you need to pay close attention to hardware. Is the RAM ECC? Is the processor overheating? How are your disks doing?
A second possibility is that there is a typo in a join condition or filter clause. Normally this is something I would suspect first, but it is easy enough to weed out index problems, so start there. If removing the ORDER BY doesn't change things, then chances are it is a typo. If you can't find a typo, then try reindexing.
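If you do want to rule out corruption, rebuilding the table's indexes is a one-liner (PostgreSQL syntax; note REINDEX takes locks on the table's indexes, so schedule it accordingly):

REINDEX TABLE geographic.stations;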

Related

where column in (single value) performance

I am writing dynamic SQL code and it would be easier to use a generic where column in (<comma-separated values>) clause, even when the clause might have 1 term (it will never have 0).
So, does this query:
select * from table where column in (value1)
have any different performance than
select * from table where column=value1
?
All my tests result in the same execution plans, but if there is some knowledge/documentation that sets it in stone, it would be helpful.
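(One way to check this yourself, with placeholder table and column names; both PostgreSQL and MySQL accept EXPLAIN in this form:

EXPLAIN SELECT * FROM mytable WHERE col IN (1);
EXPLAIN SELECT * FROM mytable WHERE col = 1;

If both statements show the same plan, the optimizer has normalized the single-value IN into an equality.)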
This might not hold true for each and every RDBMS, nor for each and every query with its specific circumstances.
The engine will translate WHERE id IN(1,2,3) to WHERE id=1 OR id=2 OR id=3.
So your two ways to articulate the predicate will (probably) lead to exactly the same interpretation.
As always: we should not really bother about the way the engine "thinks". That part was done pretty well by the developers :-) We tell the engine, through a statement, what we want to get, not how we want to get it.
Some more details here, especially the first part.
I think this will depend on the platform you are using (the optimizer of the given SQL engine).
I did a little test using MySQL Server and:
When I query select * from table where id = 1; I get 1 total, and the query took 0.0043 seconds.
When I query select * from table where id IN (1); I get 1 total, and the query took 0.0039 seconds.
I know this depends on the server and PC and so on, but the results are very close.
But you have to remember that IN is non-sargable (not search-ARGument-able), so it will not use the index to resolve the query, while = is sargable and supports the index.
If you want to find the best one to use, you should test them in your environment, because they both work so well!

SphinxQL Variables Deprecated, Alternate Query?

I had what I thought was a fairly straightforward SphinxQL query, but it turns out @ variables are deprecated (see example below):
SELECT *, @weight AS m FROM test1 WHERE MATCH('tennis') ORDER BY m DESC LIMIT 0,1000 OPTION ranker=bm25, max_matches=3000, field_weights=(title=10, content=5);
I feel like there must be a way to sort the results by strength of match. What is the replacement?
On another note, what if I want to include in it a devaluation if certain other words appear. For example, let's say I wanted to devalue results that had the word "apparel" in them. Could that be executed in the same query?
Thanks!
Well, results are by default in descending weight order, so just do...
SELECT * FROM test1 WHERE MATCH('tennis') LIMIT 0,1000 OPTION ...
But otherwise, it's just that the @ variables are replaced by 'functions', mainly because it's more 'SQL-like'. So @weight is WEIGHT():
SELECT * FROM test1 WHERE MATCH('tennis') ORDER BY WEIGHT() DESC ...
or
SELECT *,WEIGHT() AS m FROM test1 WHERE MATCH('tennis') ORDER BY m DESC ...
For reference, @group is instead GROUPBY(), @count is COUNT(*), @distinct is COUNT(DISTINCT ...), @geodist is GEODIST(...), and @expr doesn't really have an equivalent; either just use the expression directly, or use your own custom named alias.
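For instance, applying that mapping to a grouped query might look like this (illustrative only; category_id is an assumed attribute on the test1 index):

SELECT *, COUNT(*) AS c FROM test1 WHERE MATCH('tennis') GROUP BY category_id ORDER BY c DESC LIMIT 0,1000;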
As for the second question: kinda tricky, there isn't really a 'negative' weighter. There is a keyword boost operator, but as far as I know you can't use it to specifically devalue.
The only way I can think it might work is if the negative match was made against a specific field; then you could build a complex ranking expression. Basically, to apply a negative weight you would need a dedicated field, so the ranking expression can use it to select that column:
... MATCH('#!(negative) tennis #negative apparel')
... OPTION ranker=expr('SUM(word_count*IF(user_weight=99,-1,1))'), field_weights=(negative=99)
That's a very basic demo expression for illustrative purposes; a real one would probably be a lot more complex. It's just showing the use of 99 as a placeholder for 'negative' multiplication.
You would need to create the new negative field, which could just be a duplicate of the other field(s).

Slow query with order and limit clause but only if there are no records

I am running the following query:
SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC LIMIT 25 OFFSET 1
Because I have records in the table with name = 'Bob', the query is fast on a table of 10M records (<0.5 seconds).
However, if I search for name = 'Susan' the query takes over 45 seconds. I have no records in the table where name = 'Susan'.
I have an index on each of name and address. I've vacuumed the table, analyzed it and have even tried to re-write the query:
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC) f LIMIT 25 OFFSET 1
and can't find any solution. I'm not really sure how to proceed. Please note this is different from this post, as my slowness only happens when there are no records.
EDIT:
If I take out the ORDER BY address then it runs quickly. Obviously, I need that there. I've tried re-writing it (with no success):
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob') f ORDER BY address DESC LIMIT 25 OFFSET 1
Examine the execution plan to see which index is being used. In this case, the separate indexes for name and address are not enough. You should create a combined index of name, then address for this query.
Think of an index as a system-maintained copy of certain columns, kept in a different order from the original. In this case, you want to first find matches by name, then tie-break on address, then take rows until you have enough or run out of name matches.
By making name first in the multi-column index, the index will be sorted by name first. Then address will serve as our tie-breaker.
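A minimal sketch of that index (the index name is illustrative; the engine can scan a (name, address) index backwards to satisfy ORDER BY address DESC):

CREATE INDEX foo_name_address_idx ON foo (name, address);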
Under the original indexes, if the address index is the one chosen then the query's speed will vary based on how quickly it can find matches.
The plan (in English) would be: Proceed through all of the rows, which happen to already be sorted by address; discard any that do not match the name; keep going until we have enough.
So if you do not get 25 matches, you read the whole table!
With my proposed multi-column index, the plan (in English) would be: Proceed through all of the name matching rows which happen to already be sorted by address. Start with the first one and take them until you have enough. If you run out, stop.
Since a query without the ORDER BY is much faster than the one with the ORDER BY clause, I'd make 2 queries (sketched below):
- One without the ORDER BY, with LIMIT 1, just to know if you have at least one record.
- If you have at least one, it's safe to run the query with the ORDER BY.
- If there's no record, there is no need to run the second query.
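A minimal sketch of that two-step approach, using the query from the question:

-- Step 1: cheap existence check; no ORDER BY, so it can stop at the first match
SELECT 1 FROM foo WHERE name = 'Susan' LIMIT 1;
-- Step 2: run only if step 1 returned a row
SELECT * FROM foo WHERE name = 'Susan' ORDER BY address DESC LIMIT 25 OFFSET 1;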
Yes, it's not a solution, but it will let you deliver your project. Just ensure you create a ticket to handle the technical debt after delivery ;) otherwise your lead developer will set you on fire.
Then, to solve the real technical problem, it will be useful to know which indices you have created. Without these it will be very hard to give you a proper solution!

Converting complex query with inner join to Tableau

I have a query like this, which we use to generate data for our custom dashboard (a Rails app):
SELECT AVG(wait_time) FROM (
    SELECT TIMESTAMPDIFF(MINUTE, a.finished_time, b.start_time) wait_time
    FROM (
        SELECT MAX(start_time + INTERVAL avg_time_spent SECOND) finished_time, branch
        FROM mytable
        WHERE name IN ('test_name')
          AND status = 'SUCCESS'
        GROUP BY branch
    ) a
    INNER JOIN (
        SELECT MIN(start_time) start_time, branch
        FROM mytable
        WHERE name IN ('test_name_specific')
        GROUP BY branch
    ) b ON a.branch = b.branch
    HAVING avg_time_spent BETWEEN 0 AND 1000
) t
GROUP BY week
Now I am trying to port this to Tableau, and I am not able to find a way to represent this data in Tableau. I am stuck on how to represent the inner GROUP BY in a calculated field. I could also just use a custom SQL data source, but I am already using another data source.
columns in mytable -
start_time
avg_time_spent
name
branch
status
I think this could be achieved with the new Level of Detail formulas, but unfortunately I am stuck on version 8.3.
Save custom SQL for rare cases. This doesn't look like a rare case. Let Tableau generate the SQL for you.
If you simply connect to your table, then you can usually write calculated fields to get the information you want. I'm not exactly sure why you have test_name in one part of your query but test_name_specific in another, so ignoring that, here is a simplified example of a similar query.
If you define a calculated field called worst_case_test_time as
datediff('minute', min(start_time), max(dateadd('second', avg_time_spent, start_time)))
that seems close to what your original query says. (Note Tableau's DATEADD takes the date part first, then the interval, then the date.)
It would help if you explained what exactly you are trying to compute. It appears to be some sort of worst-case bound for average test time. There may be an even simpler formula, but it's hard to know without a little context.
You could filter on status = 'SUCCESS' and avg_time_spent < 1000, and place branch and WEEK(start_time) on, say, the row and column shelves.
P.S. Your query seems a little off. Don't you need an aggregation function like MAX or AVG after the HAVING keyword?

How to avoid T-SQL function being called more times when needing combined results?

I have two T-SQL scalar functions that both perform calculations over large sums of data (taking 'a lot' of time) and return a value, e.g. CalculateAllIncomes(EmployeeID) and CalculateAllExpenditures(EmployeeID).
I run a select statement that calls these and returns results for each Employee. I also need the balance of each employee calculated as AllIncomes-AllExpenditures.
I have a function GetBalance(EmployeeID) that calls the two above mentioned functions and returns the result {CalculateAllIncomes(EmployeeID) - CalculateAllExpenditures(EmployeeID)}. But if I do:
Select CalculateAllIncomes(EmployeeID), CalculateAllExpenditures(EmployeeID), GetBalance(EmployeeID) .... the functions CalculateAllIncomes() and CalculateAllExpenditures() get called twice (once explicitly and once inside the GetBalance function), and so the resulting query takes twice as long as it should.
I'd like to find some better solution. I tried:
select CalculateAllIncomes(EmployeeID) AS Incomes, CalculateAllExpenditures(EmployeeID) AS Expenditures, (Incomes - Expenditures) AS Balance....
but it throws errors:
Invalid column name Incomes and
Invalid column name Expenditures.
I'm sure there has to be a simple solution, but I cannot figure it out. For some reason it seems that I am not able to use column aliases in the SELECT clause. Is that so? And if so, what could be the workaround in this case?
Thanks for any suggestions.
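(One common workaround for the alias errors, sketched under the assumption that these are scalar UDFs in the dbo schema and that the source table is named Employees: an alias defined in a SELECT list cannot be referenced from that same list, but a derived table makes it visible to the outer query, and each function is then typically evaluated once per row:

select Incomes, Expenditures, Incomes - Expenditures AS Balance
from (
    select EmployeeID,
           dbo.CalculateAllIncomes(EmployeeID) AS Incomes,
           dbo.CalculateAllExpenditures(EmployeeID) AS Expenditures
    from Employees
) t;

SQL Server does not strictly guarantee single evaluation of scalar UDFs, though, which is why the set-based rewrite in the answer below is the sounder fix.)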
Forget function calls: you can probably do everything in one normal query.
Function calls misused (trying for OO encapsulation) force you into this situation. In addition, if you have GetBalance(EmployeeID) per row in the Employee table then you are CURSORing over the table. And you've now compounded this by multiple calls too.
What you need is something like this:
;WITH cSUMs AS
(
    SELECT
        EmployeeID,
        SUM(CASE WHEN type = 'Incomes' THEN SomeValue ELSE 0 END) AS Income,
        SUM(CASE WHEN type = 'Expenditures' THEN SomeValue ELSE 0 END) AS Expenditure
    FROM
        MyTable
    WHERE
        EmployeeID = @empID -- optional; omit for all employees
    GROUP BY
        EmployeeID
)
SELECT
    Income, Expenditure, Income - Expenditure AS Balance
FROM
    cSUMs;
I once got a query down from a weekend to under a second by eliminating this kind of OO thinking from a bog-standard set-based aggregate query.