How to reproducibly query random rows with SQLAlchemy from PostgreSQL? - postgresql

I am trying to pseudo-randomly select rows from a PostgreSQL table using SQLAlchemy, but I need to use a seed to guarantee reproducibility of the query. The specific case concerns a publication being submitted that is tied to a codebase.
This answer does a great job introducing how one can leverage the following one-liner to select a single random row:
select.order_by(func.random()) # for PostgreSQL, SQLite
Further, one can select many pseudo-random rows via:
select.order_by(func.random()).limit(n)
But how do I ensure that I select the same pseudo-random rows every time I run the above query?

You can leverage PostgreSQL's setseed(n) function. Using SQLAlchemy, with ChEMBL as the sample database, the full solution looks like this:
from sqlalchemy import create_engine, func, select, text

SEED = 0.5  # any float between -1.0 and 1.0; change it to get a different reproducible sample
query = select([MoleculeRecord.molregno]).order_by(func.random()).limit(500)

e = create_engine("postgresql:///chembl_25")
conn = e.connect()

# setseed() only fixes the PRNG state going forward, so re-seed on the same
# connection immediately before each query you want to reproduce.
conn.execute(text(f"SELECT setseed({SEED})"))
firstQueryResults = [x.molregno for x in conn.execute(query)]

conn.execute(text(f"SELECT setseed({SEED})"))
secondQueryResults = [x.molregno for x in conn.execute(query)]

assert firstQueryResults == secondQueryResults
This returns 500 random rows, with the invariant that the same 500 rows come back each time, as long as setseed(SEED) is issued on the same connection immediately before each run of the query. To select a different set of random rows, change the SEED variable to a different value between -1.0 and 1.0.
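A different approach worth knowing about (plain SQL, not part of the original answer): PostgreSQL's TABLESAMPLE clause accepts a REPEATABLE seed, which gives a reproducible sample without sorting the whole table. It returns an approximate percentage of rows rather than an exact count, so it is not a drop-in replacement for LIMIT n; the table name below is only a stand-in for whatever MoleculeRecord maps to:
-- Sketch: reproducibly sample roughly 1% of rows; REPEATABLE (42) fixes the seed.
SELECT molregno
FROM molecule_record TABLESAMPLE BERNOULLI (1) REPEATABLE (42);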

Related

Optimising computing time on SQL query

I usually use an UPDATE query to change or update a column in my PostgreSQL database.
I create subqueries from the data in a second schema to integrate it with the column to be updated.
In this code I update the unit_source_concept_id column of the measurement table in the OMOP CDM schema from a table in another schema (the transform_semantique schema).
with subquery1 as (
    select unit_source_concept_id
    from transform_semantique.measurement
)
update omop.measurement m
set unit_source_concept_id = subquery1.unit_source_concept_id
from subquery1
where measurement_source_concept_id <> 5;
I don't know whether this UPDATE is the most suitable and optimal approach (in terms of computing time). The query took over 6000 seconds to execute.
Do you know a method to optimise this query?
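One thing worth noting: as written, subquery1 has no column that ties it back to omop.measurement, so every target row is paired with an arbitrary row of the CTE. A sketch of the usual keyed UPDATE ... FROM pattern, assuming (hypothetically, it is not shown in the question) that both tables share a measurement_id column, which also lets the planner use an index on the join key:
-- Sketch only: measurement_id as the join key is an assumption.
UPDATE omop.measurement m
SET    unit_source_concept_id = t.unit_source_concept_id
FROM   transform_semantique.measurement t
WHERE  t.measurement_id = m.measurement_id
  AND  m.measurement_source_concept_id <> 5;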

SELECT query returns more records than exist

Background
I have a table with raster data (grib_data) created by using raster2pgsql.
I have created a second table (turb_mod) with a subset of the points in grib_data that has a value above a certain threshold.
This subset table (turb_mod) has been created with the following query
WITH turb AS (
    SELECT rid, rast, (ST_PixelAsPoints(rast)).val AS val
    FROM grib_data
)
SELECT rid, rast INTO turb_mod
FROM turb
WHERE val > 0.5;
The response when creating the table is "SELECT 53", indicating that turb_mod now holds 53 rows.
Problem
If I now try to return the raster data from turb_mod using the query below, it returns all records from the original table, not the 53 that I am expecting.
SELECT (ST_PixelAsPoints(rast)).x AS x FROM turb_mod;
Questions
Why does my query not return only the 53 records?
Is there a better way to create a table with a selection of raster points from the original table? I want to use the subset to apply further geospatial functions like spatial clustering.
In your final SELECT, you're calling the function ST_PixelAsPoints, which is a set-returning function: an output row is generated for each element of the function's result set (as described in the PostgreSQL documentation on set-returning functions). This can result in a different row count to that of your source table, turb_mod.
Your query is functionally equivalent to this (preferred) syntax:
SELECT points.x
FROM turb_mod
JOIN LATERAL ST_PixelAsPoints(rast) points ON TRUE;
This syntax better shows what's happening, and also shows how you might choose to include more columns from the function's output, which may help to answer your second point.
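For example, a sketch (reusing the 0.5 threshold from the question and the function's default band) that exposes the pixel coordinates, value, and point geometry, which is a convenient starting point for further geospatial work such as clustering:
-- Keep only the high-valued pixels as points.
SELECT m.rid,
       points.x,
       points.y,
       points.val,
       points.geom
FROM   turb_mod m
JOIN LATERAL ST_PixelAsPoints(m.rast) points ON TRUE
WHERE  points.val > 0.5;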

PostgreSQL 11.5 doing sequential scan for SELECT EXISTS query

I have a multi-tenant environment where each tenant (customer) has its own schema to isolate their data. Not ideal, I know, but it was a quick port of a legacy system.
Each tenant has a "reading" table, with a composite index of 4 columns:
site_code char(8), location_no int, sensor_no int, reading_dtm timestamptz.
When a new reading is added, a function is called which first checks whether there has already been a reading in the last minute for the same site_code, location_no, and sensor_no:
IF EXISTS (
    SELECT
    FROM reading r
    WHERE r.site_code = p_site_code
      AND r.location_no = p_location_no
      AND r.sensor_no = p_sensor_no
      AND r.reading_dtm > p_reading_dtm - INTERVAL '1 minute'
) THEN
    RETURN;
END IF;
Now, bear in mind there are many tenants, all behaving fine except one. In one of the tenants, the call is taking nearly half a second rather than the usual few milliseconds, because it is doing a sequential scan on a table with nearly 2 million rows instead of an index scan.
My random_page_cost is set to 1.5.
I could understand a sequential scan if the query were returning many rows, but it is only checking for the existence of any.
I've tried ANALYZE on the table, VACUUM FULL, etc but it makes no difference.
If I put "SET LOCAL enable_seqscan = off" before the query, it works perfectly... but it feels wrong, but it will have to be a temporary solution as this is a live system and it needs to work.
What else can I do to help Postgres make what is clearly the better decision of using the index?
EDIT: If I do a similar query manually (outside of a function) it chooses an index.
My guess is that the engine is evaluating the predicate and considers it not selective enough (it thinks too many rows will be returned), so it decides to use a table scan instead.
I would do two things:
Make sure you have the correct index in place:
create index ix1 on reading (site_code, location_no,
sensor_no, reading_dtm);
Trick the optimizer by making the selectivity look better. You can do that by adding the extra [redundant] predicate and r.reading_dtm < :p_reading_dtm:
select 1
from reading r
where r.site_code = :p_site_code
and r.location_no = :p_location_no
and r.sensor_no = :p_sensor_no
and r.reading_dtm > :p_reading_dtm - interval '1 minute'
and r.reading_dtm < :p_reading_dtm
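Folded back into the original PL/pgSQL check, that might look like the sketch below (keeping the answer's assumption that the new reading's timestamp is the latest, so the extra upper bound is redundant):
IF EXISTS (
    SELECT
    FROM reading r
    WHERE r.site_code = p_site_code
      AND r.location_no = p_location_no
      AND r.sensor_no = p_sensor_no
      AND r.reading_dtm > p_reading_dtm - INTERVAL '1 minute'
      AND r.reading_dtm < p_reading_dtm  -- redundant, but improves the selectivity estimate
) THEN
    RETURN;
END IF;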

Update statement where one column depends on another updated column

I want to update two columns in my table, one of which depends on the calculation of the other updated column. The calculation is rather complex, so I don't want to repeat it every time; I just want to use the newly updated value.
CREATE TABLE test (
    A int,
    B int,
    C int,
    D int
);

INSERT INTO test VALUES (0, 0, 5, 10);

UPDATE test
SET B = C * D * 100,
    A = B / 100;
So my question: is it even possible to get 50 as the value for column A in just one query?
Another option would be to use persistent computed columns, but will that work when I have dependencies on another computed column?
You can't achieve what you are trying to do in a single query. This is due to a concept called "all-at-once operations": operations that appear in the same logical phase are evaluated at the same time. (The quoted rule is stated for SQL Server, but PostgreSQL behaves the same way: expressions in an UPDATE's SET list always see the old row values.)
The operations below won't yield the result you are expecting:
insert into table1
values (t1, t1 + 100, t1 + 200) -- sql won't use the newly incremented t1 value
The same goes for UPDATE:
update t1
set t1 = t1 * 100,
    t2 = t1 -- sql won't use the updated (*100) value of t1
References:
T-SQL Querying by Itzik Ben-Gan
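In PostgreSQL (the tag on this page), one way to evaluate the expensive expression only once is to compute it in a CTE and join it back. This is only a sketch; it uses ctid purely for illustration because the example table has no primary key, and a real key column would be preferable:
WITH calc AS (
    SELECT ctid, C * D * 100 AS new_b
    FROM test
)
UPDATE test t
SET B = calc.new_b,
    A = calc.new_b / 100
FROM calc
WHERE calc.ctid = t.ctid;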

PostgreSQL - PostGIS query optimization

I have a query which creates an input to the pgRouting pgr_drivingDistance function:
CREATE TEMP TABLE tmp_edge AS
SELECT
e."Id" as id,
e."Source" as source,
e."Target" as target,
e."Length" / (1000*LEAST("Speed", "SpeedMin")/60) as cost
FROM "Edge" e,
"SpeedLimit" sl
WHERE sl."VehicleKindId" = 1
AND e.the_geom &&
ST_MakeEnvelope(
x1-(1000*GREATEST("Speed", "SpeedMax")/60)*13,
y1-(1000*GREATEST("Speed", "SpeedMax")/60)*13,
x1+(1000*GREATEST("Speed", "SpeedMax")/60)*13,
y1+(1000*GREATEST("Speed", "SpeedMax")/60)*13, 3857)
AND sl."RoadCategoryId" = e."CategoryId";
In the WHERE clause I calculate the same thing several times to get the bounding box coordinates.
I tried to move the calculation into the FROM part and use an alias for the calculated column, but then the whole execution time doubles.
The Edge table is quite large (1 million rows) and SpeedLimit has several dozen records.
Is there any way to enhance this query?
The recommended way is to join the tables with explicit JOIN syntax and then restrict the result set with WHERE. Since you are filtering on expressions (such as the ST_MakeEnvelope bounding box test), you might also benefit from expression indexes in PostgreSQL. And you can use EXPLAIN ANALYZE to find the bottlenecks in the query.
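A sketch of one possible rewrite that computes the bounding-box radius once per SpeedLimit row using a LATERAL subquery (x1 and y1 are the centre coordinates from the original query, assumed to be supplied as parameters). Whether it actually helps is something EXPLAIN ANALYZE on both variants should decide, since the asker reports that a similar rewrite was slower:
CREATE TEMP TABLE tmp_edge AS
SELECT
    e."Id"     AS id,
    e."Source" AS source,
    e."Target" AS target,
    e."Length" / (1000 * LEAST(sl."Speed", sl."SpeedMin") / 60) AS cost
FROM "SpeedLimit" sl
-- compute the search radius once per speed-limit row
CROSS JOIN LATERAL (
    SELECT (1000 * GREATEST(sl."Speed", sl."SpeedMax") / 60) * 13 AS radius
) r
JOIN "Edge" e
  ON  sl."RoadCategoryId" = e."CategoryId"
  AND e.the_geom && ST_MakeEnvelope(x1 - r.radius, y1 - r.radius,
                                    x1 + r.radius, y1 + r.radius, 3857)
WHERE sl."VehicleKindId" = 1;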