Oracle 10g, how to query numerical value, (years) with specific limits on results - oracle10g

So the question I am posed with is to take the years produced of all of the movies in two genre's, (SH and CH) an then print out a list of all the movies, (title and year), that were produced before any of the movies in my specific genre were produced. I have this:
SELECT x.title "Title", x.yr "Year"
FROM movies x
WHERE EXISTS (SELECT x FROM movies y
WHERE y.genre IN ('SH', 'CH') AND y.yr < x.yr)
ORDER BY yr;
but it's producing all sorts of titles that were produced during and after the two genres had any of their movies produced. I would think that the less than would limit the results to anything under 1965, (the oldest move in either genre), but it doesn't, but if I use the greater than operator it does, (although it still pumps out newer results as well, so that doesn't work either)
Does anybody see what it is I am missing here? Thanks for any help.

Got it!
I just needed to use the operator differently
WHERE yr <
(SELECT insert stuff here); // this is homework so I can't post the full code
Seems I was just over complicating things.

Related

Postgres...how to improve ilike results (quality not speed)

I have a list of chemicals in my database and I provide our users with the ability to do a live search via our website. I use SQLAlchemy and the query I use looks something like this:
Compound.query.filter(Compound.name.ilike(f'%{name}%')).limit(50).all()
When someone searches for toluene, for example, they don't get the result they're looking for because there are many chemicals that have the word toluene in them, such as:
2, 4 Dinitrotoluene
2-Chloroethyl-p-toluenesulfonate
4-Bromotoluene
6-Amino-m-toluenesulfonic acid
a,2,4-trichlorotoluene
a,o-Dichlorotoluene
a-Bromtoluene
etc...
I realize I could increase my limit but I feel like 50 is more than enough. Or, I could change the ilike(f'%{name}%')) to something like ilike(f'{name}%')) but our business requirements don't want this. What I'd rather do is improve the ability for Postgres to return results so that toluene is at the top of the search results.
Any ideas on how Postgres' ilike capability?
Thanks in advance.
One option is to better rank the results. Postgres text search allows you to rank the results.
A cheap and dirty version of preferential ranking is to do multiple queries for name = ?, ilike(f'{name}%')), and ilike(f'%{name}%')) using a union. That way the ilike(f'{name}%')) results come first.
And rather than a hard limit, offer pagination. SQLAlchemy has paginate to help.
ILIKE yields a boolean. It doesn't specify what order to return the results, just whether to return them at all (you can order by a boolean, but if you only return trues there is nothing left to order by). So by the time you are done improving it, it would no longer be ILIKE at all but something else completely.
You might be looking for something like <-> from pg_trgm, which provides a distance score which can be sorted on. Although really, you could just order the result based on the length of the compound name, and return the shortest 50 that contain the target.
something like ilike(f'{name}%')) but our business requirements don't want this
Isn't your business requirement to get better results?
But at least in my database, this could just return a bunch of names in inverted format, like toluene, 2,4-dinitro, so the results might not be much better, unless you avoid storing such inverted names. Sorting by either <-> or by length would overcome that problem. But they would also penalize toluene, ACS reagent grade 99.99% by HPLC, should you have names like that.

Sphinx Mulit-Level Sort with Randomize

Here is my challenge with Sphinx Sort where I have Vendors who pay for premium placement and those who don't:
I already do a multi-level order including the PaidVendorStatus which is either 0 or 1 as:
order by PaidVendorStatus,Weight()
So in essence I end up with multiple sort groups:
PaidVendorStatus=1, Weight1
....
PaidVendorStatus=1, WeightN
PaidVendorStatus=0, Weight1
...
PaidVendorStatus=0, WeightN
The problem is I have three goals:
Randomly prioritize each vendor in any given sort group
Have each vendor's 'odds' of being randomly assigned top position be equal regardless of how many records they have returned in the group (so if Vendor A has 50 results and VendorB has 2 results they still both have 50% odds of being randomly assigned any given spot)
Ideally, maintain the same results order in any given search (so that if the user searches again the same order will be displayed
I've tried various solutions:
Select CRC32(Vendor) as RANDOM...Order by PaidVendorStatus,Weight(),RANDOM
which solves 2 and 3 except due to the nature of CRC32 ALWAYS puts the same vendor first (and second, third, etc.) so in essence does not solve the issue at all.
I tried making a sphinx sql_attr_string in my Sphinx Configuration which was a concatenation of Vendor and the record Title (Select... concat(Vendor,Title) as RANDOMIZER..)` and then used that to randomize
Select CRC32(RANDOMIZER) as RANDOM...
which solves 1 and 3 as now the Title field gets thrown in the randomization mis so that the same Vendor does not always get first billing. However, it fails at 2 since in essence I am only sorting by Title and thus Vendor B with two results now has a very low change of being sorted first.
In an ideal world naturally I could just order this way;
Order by PaidVendorStatus,Weight(),RAND(Vendor)
but that is not possible.
Any thoughts on this appreciated. I did btw check out as per Barry Hunter's suggestion this thread on UDF but unless I am not understanding it at all (possible) it does not seem to be the solution for this problem.
Well one idea is:
SELECT * FROM (
SELECT *,uniqueserial(vendor_id) AS sorter FROM index WHERE MATCH(...)
ORDER BY PaidVendorStatus DESC ,Weight() DESC LIMIT 1000
) ORDER BY sorter DESC, WEIGHT() DESC:
This exploits SPhixnes 'multiple sort' function with pysudeo subquery.
This works wors becasuse the inner query is sorted by PaidVendor first, so their items are fist. Which works to affect the ordr that unqique serial is called in.
Its NOT really 'randomising' the results as such, seems you jsut randomising them to mix up the vendors (so a single vendor doesnt domninate results. Uniqueserial works by 'spreading' the particular vendors results out - the results will tend to cycle through the vendors.
This is tricky as it exploits a relative undocumented sphinx feature - subqueries.
For the UDF see http://svn.geograph.org.uk/svn/modules/trunk/sphinx/
Still dont have an answer for your biased random (as in 2.)
but just remembered another feature taht can help with 3. - can supply s specific seed to the random number. Typically random generators are seeded from the current time, which gives ever changing values, But using a specific seed.
Seed is however a number, so need a predictable, but changing number. Could CRC the query?
... sphinx doesnt support expressions in the OPTION so would have to caculate the hash in the app.
<?php
$query = $db->Quote($_GET['q']);
$crc = crc32($query);
$sql = "SELECT id,IDIV(WEIGHT(),100) as i,RAND() as r FROM index WHERE MATCH($query)
ORDER BY PaidVendorStatus DESC,i DESC,r ASC OPTION random_seed=$crc";
If wanted the results to only slowly evolve, add the current date, so each day is a new selection...
$crc = crc32($query.date('Ymd'));

COUNT(field) returns correct amount of rows but full SELECT query returns zero rows

I have a UDF in my database which basically tries to get a station (e.g. bus/train) based on some input data (geographic/name/type). Inside this function i try to check if there are any rows matching the given values:
SELECT
COUNT(s.id)
INTO
firsttry
FROM
geographic.stations AS s
WHERE
ST_DWithin(s.the_geom,plocation,0.0017)
AND
s.name <-> pname < 0.8
AND
s.type ~ stype;
The firsttry variable now contains the value 1. If i use the following (slightly extended) SELECT statement i get no results:
RETURN query SELECT
s.id, s.name, s.type, s.the_geom,
similarity(
regexp_replace(s.name::text,'(Hauptbahnhof|Hbf)','Hbf'),
regexp_replace(pname::text,'(Hauptbahnhof|Hbf)','Hbf')
)::double precision AS sml,
st_distance(s.the_geom,plocation) As dist from geographic.stations AS s
WHERE ST_DWithin(s.the_geom,plocation,0.0017) and s.name <-> pname < 0.8
AND s.type ~ stype
ORDER BY dist asc,sml desc LIMIT 1;
the parameters are as follows:
stype = '^railway'
pname = 'Amsterdam Science Park'
plocation = ST_GeomFromEWKT('SRID=4326;POINT(4.9492530 52.3531670)')
the tuple i need to be returned is:
id name type geom (displayed as ST_AsText)
909658;"Amsterdam Sciencepark";"railway_station";"POINT(4.9482893 52.352904)"
The same UDF returns quite well for a lot of other stations, but this is one (of more) which just won't work. Any suggestions?
P.S. The use of the <-> operator is coming from the pg_trgm module.
Some ideas on how to troubleshoot this:
Break your troubleshooting into steps. Start with the simplest query possible. No aggregates, just joins and no filters. Then add filters. Then add order by, then add aggregates. Look at exactly where the change occurs.
Try reindexing the database.
One possibility that occurs to me based on this is that it could be a corrupted index used in the second query but not the first. I have seen corrupted indexes in the past and usually they throw errors but at least in theory they should be able to create a problem like this.
If this is correct, your query will suddenly return rows if you remove the ORDER BY clause.
If you have a corrupted index, then you need to pay close attention to hardware. Is the RAM ECC? Is the processor overheating? How are you disks doing?
A second possibility is that there is a typo on a join condition of filter statement. Normally this is something I would suspect first but it is easy enough to weed out index problems to start there. If removing the ORDER BY doesn't change things, then chances are it is a typo. If you can't find a typo, then try reindexing.

How to get a number of rows for given query?

I would like to get a number (like 5, 1000, etc.) of rows for given query. However method "count" for query gives me ColumnOps.CountAll -- and I don't know how to get the number.
See SQ wiki for example:
https://github.com/szeiger/scala-query/wiki/Counts
(for(...) yield ...).count
Obviously one step is missing, and the question is -- what is that step?
I use explicit query route because the count is for join.
You can get a number to do like this:
val q = (for(...) yield ...).count
q.first
I was confused by ScalaQuery document too.
I recommend that you check SQ's tests to know how it use.

Possible to rank partial matches in Postgres full text search?

I'm trying to calculate a ts_rank for a full-text match where some of the terms in the query may not be in the ts_vector against which it is being matched. I would like the rank to be higher in a match where more words match. Seems pretty simple?
Because not all of the terms have to match, I have to | the operands, to give a query such as to_tsquery('one|two|three') (if it was &, all would have to match).
The problem is, the rank value seems to be the same no matter how many words match. In other words, it's maxing rather than multiplying the clauses.
select ts_rank('one two three'::tsvector, to_tsquery('one')); gives 0.0607927.
select ts_rank('one two three'::tsvector, to_tsquery('one|two|three|four'));
gives the expected lower value of 0.0455945 because 'four' is not the vector.
But select ts_rank('one two three'::tsvector, to_tsquery('one|two'));
gives 0.0607927 and likewise
select ts_rank('one two three'::tsvector, to_tsquery('one|two|three'));
gives 0.0607927
I would like the result of ts_rank to be higher if more terms match.
Possible?
To counter one possible response: I cannot calculate all possible subsequences of the search query as intersections and then union them all in a query because I am going to be working with large queries. I'm sure there are plenty of arguments against this anyway!
Edit: I'm aware of ts_rank_cd but it does not solve the above problem.
Use the smlar extension (linux only AFAIK, written by the same guys that brought us text search).
It has functions for calculating TFIDF, cosine, or overlap similarity between arrays. It supports indexing so is fast.
Another way would be to "spell-check" the query prior to using it, basically removing any query terms that are not in your corpus.
The conclusion that I have come to is to & the items together for the ranking. In my select query (with which I'm doing the search) the items are |ed. This seems to work.