How do I add another column to a PostgreSQL subquery?

I wasn't too sure how to phrase this question, so here are the details. I'm using a trick to compute the hamming distance between two bitstrings. Here's the query:
select length(replace(x::text,'0',''))
from (
select code # '000111101101001010' as x
from codeTable
) as foo
Essentially it computes the xor between the two strings, removes all 0's, then returns the length. This is functionally equivalent to the hamming distance between two bitstrings. Unfortunately, this only returns the hamming distance, and nothing else. In the codeTable table, there is also a column called person_id. I want to be able to return the minimum hamming distance and the id associated with that. Returning the minimum hamming distance is simple enough, just add a min() around the 'length' part.
select min(length(replace(x::text,'0','')))
from (
select code # '000111101101001010' as x
from codeTable
) as foo
This is fine, however, it only returns the hamming distance, not the person_id. I have no idea what I would need to do to return the person_id associated with that hamming distance.
Does anybody have any idea on how to do this?

Am I missing something? Why the subquery? Looks to me like the following should work too:
select length(replace((code # '000111101101001010')::text,'0',''))
from codeTable
Going from there I get:
select person_id,length(replace((code # '000111101101001010')::text,'0','')) as x
from codeTable
order by x
limit 1
I replaced the min with an order by and limit 1 because min alone gives no direct way of getting the person_id that corresponds to the value it returns. In general Postgres will be smart enough not to sort the whole intermediate result; with limit 1 it only has to scan for the row with the lowest value.
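For anyone who wants to sanity-check the result outside the database, here is a minimal Python sketch of the same idea: XOR the two bitstrings, count the 1s, and keep the id next to the distance so the minimum can be picked per row. The sample rows and the function names are made up for illustration; they are not from the original table.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count differing positions between two equal-length bitstrings."""
    if len(a) != len(b):
        raise ValueError("bitstrings must have the same length")
    # XOR leaves a 1 wherever the inputs differ; counting the 1s gives the distance.
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def closest_person(rows, target):
    """Return (person_id, distance) for the row whose code is nearest to target."""
    scored = ((person_id, hamming_distance(code, target)) for person_id, code in rows)
    # Taking the minimum over (id, distance) pairs keeps the id attached,
    # which is what ORDER BY ... LIMIT 1 does in the query above.
    return min(scored, key=lambda pair: pair[1])


# Hypothetical sample rows standing in for codeTable (person_id, code).
rows = [
    (1, "000111101101001011"),
    (2, "111111101101001010"),
    (3, "000111101101001010"),
]
print(closest_person(rows, "000111101101001010"))
```

The point of the sketch is only that the distance and the id travel together as one row, so no separate lookup is needed after finding the minimum.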

Related

Esper: Take the last value from each id and take the mean of all but the most extreme

I have 5 temperature sensors. I want to calculate the mean temperature of 4 - excluding the most extreme value (high or low).
Firstly: will std:unique(id) create a window of the last temperature readings for each id 1-5?
select
avg(tempEvent.temp) as meantemp
from
Event(id in (1, 2, 3, 4, 5)).std:unique(id) as tempEvent
Secondly: how could I change the select statement (possibly using an expression if necessary) to only calculate the mean of four values excluding the most extreme?
The background is, I want to know the deviations of each temperature from the average, but I don't want the average to include an anomalous id. Otherwise all temperatures will look like they are deviating from the average but really only one is.
Finding the average of the middle four values is simple enough, though not as elegant as your solution. The code below will work for any number of temps.
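SELECT
AVG(temp) AS meantemp
FROM (
SELECT
temp,
-- total number of readings, repeated on every row
COUNT(*) OVER () AS c,
-- position of each reading when sorted by temperature
ROW_NUMBER() OVER (ORDER BY temp) AS r
FROM
[table]
) AS ranked
WHERE
-- drop the lowest (r = 1) and the highest (r = c) reading
r > 1
AND r < c
;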
As for your second question, I'm not sure I understand. Do you want the value from among the four middle values that has the greatest absolute deviation from the mean of those four values?
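If the goal is to drop only the single most extreme reading (high or low) rather than both ends, the logic could look something like the Python sketch below. It is not Esper code, and it assumes "most extreme" means "farthest from the median", which is my reading of the question rather than something the question states.

```python
from statistics import mean, median


def mean_without_most_extreme(temps):
    """Average all readings except the one farthest from the median."""
    if len(temps) < 2:
        raise ValueError("need at least two readings")
    centre = median(temps)
    # Index of the reading with the largest absolute deviation from the median
    # (ties resolve to the first such reading).
    extreme = max(range(len(temps)), key=lambda i: abs(temps[i] - centre))
    kept = [t for i, t in enumerate(temps) if i != extreme]
    return mean(kept)


# Hypothetical latest readings for sensors 1-5; sensor 5 is the anomaly.
print(mean_without_most_extreme([20.0, 21.0, 20.5, 19.5, 35.0]))
```

Using the median as the reference point keeps the anomalous sensor from dragging the reference towards itself, which is the problem described in the question.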

Dividing AVG of column1 by AVG of column2

I am trying to divide the average value of column1 by the average value of column2, which will give me an average price from my data. I believe there is a problem with the syntax or structure of my code, or I am making a rookie mistake.
I have searched Stack Overflow and checked the Postgres documentation, but cannot find many examples of dividing two averaged columns.
The individual average query is working fine (as shown here):
SELECT (AVG(CAST("Column1" AS numeric(4,2))),2) FROM table1
But when I combine two of them in an attempt to divide, it simply does not work:
SELECT (AVG(CAST("Column1" AS numeric(4,2))),2) / (AVG(CAST("Column2" AS numeric(4,2))),2) FROM table1
I am receiving the following error: "ERROR: row comparison operator must yield type boolean, not type numeric". I have tried a few other variations, which have mostly given me syntax errors.
I don't know what you are trying to do with your current approach; note that (AVG(...),2) is parsed as a two-element row, not as a rounded average, which is where the row comparison error comes from. However, if you want to take the ratio of two averages, you could also just take the ratio of the sums:
SELECT SUM(CAST("Column1" AS numeric(4,2))) / SUM(CAST("Column2" AS numeric(4,2)))
FROM table1;
Note that SUM() just takes a single input, not two inputs. The reason why we can use the sums is that the average would normalize both the numerator and denominator by the same amount, which is the number of rows in table1, so this factor just cancels out. This assumes neither column contains NULLs, since AVG() and SUM() skip them and the two row counts could then differ.
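To convince yourself that the row count really cancels, here is a small Python check. The numbers are invented, and Fraction is used only so the comparison is exact rather than subject to floating-point rounding.

```python
from fractions import Fraction


def ratio_of_averages(xs, ys):
    """avg(xs) / avg(ys), computed the long way."""
    return (sum(xs) / len(xs)) / (sum(ys) / len(ys))


def ratio_of_sums(xs, ys):
    """sum(xs) / sum(ys); the shared 1/n factor has already cancelled."""
    return sum(xs) / sum(ys)


# Hypothetical column values; both lists have the same number of rows.
column1 = [Fraction("12.50"), Fraction("7.25"), Fraction("30.00")]
column2 = [Fraction("2.00"), Fraction("1.00"), Fraction("4.00")]

print(ratio_of_averages(column1, column2) == ratio_of_sums(column1, column2))
```

The equality only holds when both lists have the same length, which mirrors the requirement that both columns have the same number of non-missing rows.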

MATLAB Extracting Column Number

My goal is to create a random, 20 by 5 array of integers, sort them by increasing order from top to bottom and from left to right, and then calculate the mean in each of the resulting 20 rows. This gives me a 1 by 20 array of the means. I then have to find the column whose mean is closest to 0. Here is my code so far:
RandomArray= randi([-100 100],20,5);
NewArray=reshape(sort(RandomArray(:)),20,5);
MeanArray= mean(transpose(NewArray(:,:)))
X=min(abs(x-0))
How can I store the column number whose mean is closest to 0 into a variable? I'm only about a month into coding so this probably seems like a very simple problem. Thanks
You're almost there. All you need is a find:
RandomArray= randi([-100 100],20,5);
NewArray=reshape(sort(RandomArray(:)),20,5);
% MeanArray= mean(transpose(NewArray(:,:))) %// gives means per row, not column
ColNum = find(abs(mean(NewArray,1))==min(abs(mean(NewArray,1)))); %// gives you the column number of the minimum
MeanColumn = NewArray(:,ColNum); %// take the column from the sorted array that ColNum refers to
find will give you the index of the entry where abs(mean(NewArray,1)), i.e. the absolute value of the mean per column, equals the minimum of that same array, thus the index of the column whose mean is closest to 0.
Note that you don't need your MeanArray: it transposes first (which can also be done with NewArray.') and then takes the mean per column, i.e. the mean of your old rows. I chucked everything in the find statement.
As suggested in the comment by Matthias W. it's faster to use the second output of min directly instead of a find:
RandomArray= randi([-100 100],20,5);
NewArray=reshape(sort(RandomArray(:)),20,5);
% MeanArray= mean(transpose(NewArray(:,:))) %// gives means per row, not column
[~,ColNum] = min(abs(mean(NewArray,1)));
MeanColumn = NewArray(:,ColNum); %// take the column from the sorted array that ColNum refers to
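In case it helps to see the same "index of the minimum" idea outside MATLAB, here is a rough Python equivalent using only the standard library. The function name and the tiny test matrix are mine, and note that Python indices start at 0 whereas MATLAB's start at 1.

```python
from statistics import mean


def column_closest_to_zero(matrix):
    """Return (column_index, column_mean) for the column whose mean is nearest 0."""
    columns = list(zip(*matrix))          # transpose: rows -> columns
    means = [mean(col) for col in columns]
    # Same role as the second output of min(abs(...)) in MATLAB.
    best = min(range(len(means)), key=lambda j: abs(means[j]))
    return best, means[best]


# Hypothetical 2-by-3 matrix; column means are -2, 0.5 and 6.
print(column_closest_to_zero([[-3, 1, 5], [-1, 0, 7]]))
```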

How to count up a value until all geometry features from one table are selected

For example, I have this query to find the minimum distance between two geometries (stored in 2 tables) with a PostGIS function called ST_Distance.
With thousands of geometries in both tables, it takes too much time without using ST_DWithin. ST_DWithin returns true if the geometries are within the specified distance of one another (here 2000 m).
SELECT DISTINCT ON
(a.id)
a.id,
b.id,
min(ST_Distance(a.geom, b.geom)) AS distance
FROM table1 a, table2 b
WHERE ST_DWithin(a.geom, b.geom, 2000.0)
GROUP BY a.id, b.id
ORDER BY a.id, distance
But you have to estimate the distance value so that all geometries (e.g. those stored in table1) are fetched. To do that, you have to inspect your data in a GIS in some way, or calculate the maximum distance over all geometries (and that takes a lot of time).
At the moment I approximate the distance value by trial and error until all features from table1 are returned.
Would it be efficient for my query to automatically increase the distance value (by a reasonable step) until the count of all geometries (e.g. for table1) is reached? How can I implement this?
Would it slow everything down, since the query might need many attempts to find the distance value?
Do I have to use a recursive query for this purpose?
See this post here: K-Nearest Neighbor Query in PostGIS
Basically, the <-> operator is a bit unusual in that it works in the order by clause, but it avoids having to guess how far you want to search with ST_DWithin. There is a major gotcha with this operator though: the geometry in the order by clause must be a constant. That is, you CANNOT write:
select a.id, b.id from table1 a, table2 b order by a.geom <-> b.geom limit 1;
Instead you would have to create a loop, substituting a constant value for b.geom in the query above.
More information can be found here: http://boundlessgeo.com/2011/09/indexed-nearest-neighbour-search-in-postgis/
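To illustrate the shape of that loop (one nearest-neighbour lookup per geometry, with no search radius to guess), here is a brute-force Python sketch. It uses plain Euclidean distance on coordinate tuples rather than PostGIS, and the dictionaries standing in for table1 and table2 are hypothetical.

```python
import math


def nearest_neighbours(table1, table2):
    """For each id in table1, find the closest id in table2 and its distance."""
    result = {}
    for a_id, a_point in table1.items():
        # One lookup per geometry: a_point is fixed (the "constant") while we
        # order the other table by distance and keep only the first hit.
        b_id = min(table2, key=lambda k: math.dist(a_point, table2[k]))
        result[a_id] = (b_id, math.dist(a_point, table2[b_id]))
    return result


# Hypothetical point tables: id -> (x, y).
table1 = {1: (0.0, 0.0), 2: (10.0, 10.0)}
table2 = {"a": (3.0, 4.0), "b": (10.0, 12.0)}
print(nearest_neighbours(table1, table2))
```

Every row of table1 gets an answer no matter how far away its neighbour is, which is the property the question is trying to reach by growing the ST_DWithin distance. In the database the index does the ordering instead of a full scan.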

Select rows randomly distributed around a given mean

I have a table that has a value field. The records have values somewhat evenly distributed between 0 and 100.
I want to query this table for n records, given a target mean, x, so that I'll receive a weighted random result set where avg(value) will be approximately x.
I could easily do something like
SELECT TOP n * FROM table ORDER BY abs(x - value)
... but that would give me the same result every time I run the query.
What I want to do is to add weighting of some sort, so that any record may be selected, but with diminishing probability as the distance from x increases, so that I'll end up with something like a normal distribution around my given mean.
I would appreciate any suggestions as to how I can achieve this.
Why not use the RAND() function?
SELECT TOP n * FROM table ORDER BY abs(x - value) + RAND()
EDIT
Using RAND() won't work, because in SQL Server a call to RAND() without a seed is evaluated once per query, so every row gets the same number. Heximal was right to use NEWID(), but it needs to be used directly in the ORDER BY:
SELECT Top N value
FROM table
ORDER BY
abs(X - value) + (cast(cast(Newid() as varbinary) as integer))/10000000000
The large divisor 10000000000 is used to keep the avg(value) closer to X while keeping the AVG(x-value) low.
With that all said maybe asking the question (without the SQL bits) on https://stats.stackexchange.com/ will get you better results.
Try:
SELECT TOP n * FROM table ORDER BY abs(x - value), newid()
or
select * from (
SELECT TOP n * FROM table ORDER BY abs(x - value)
) a order by newid()
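If you end up doing the selection in application code instead, one possible approach is weighted sampling without replacement, where each record's weight falls off with its distance from x. The Python sketch below is only an illustration under my own assumptions: a Gaussian weight with a made-up spread parameter sigma, and the "exponential clock" trick (draw -log(u)/weight per record and keep the n smallest) to do the weighted draw. None of these names come from the question.

```python
import math
import random


def sample_around(values, target, n, sigma=15.0, rng=None):
    """Pick n values at random, favouring those close to target."""
    rng = rng or random.Random()
    keyed = []
    for v in values:
        # Gaussian weight: 1.0 at the target, shrinking with distance.
        weight = math.exp(-((v - target) ** 2) / (2 * sigma ** 2))
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        # Smaller key = earlier "arrival"; heavier records tend to arrive first.
        key = -math.log(u) / weight if weight > 0 else math.inf
        keyed.append((key, v))
    keyed.sort(key=lambda kv: kv[0])
    return [v for _, v in keyed[:n]]


# Hypothetical data: values spread evenly between 0 and 100.
values = [v for v in range(101) for _ in range(50)]
picked = sample_around(values, target=30, n=200, sigma=5.0)
print(sum(picked) / len(picked))
```

A smaller sigma pulls the average of the result closer to x at the cost of less variety; a larger one does the opposite. Any record can still be selected, just with diminishing probability as it gets farther from x.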