postgres ANY() with BETWEEN Condition - postgresql

In case someone is wondering: I am recycling a different question I answered myself, because I realized that my problem has a different root cause than I thought.
My question actually seems pretty simple, but I cannot find a way:
How do I query Postgres to find out whether any element of an array is between two values?
The documentation states that a BETWEEN b AND c is equivalent to a >= b AND a <= c.
This however does not work on arrays, as
ANY({1, 101}) BETWEEN 10 and 20 has to be false
while
ANY({1,101}) > 10 AND ANY({1,101}) < 20 has to be true.
{1,101} meaning an array containing the two elements 1 and 101.
How can I solve this problem without resorting to workarounds?
regards,
BillDoor
EDIT, for clarity:
The scenario I have is that I am querying an XML document via xpath(), but for this problem a column containing an array of type int[] does the job:
id::int | numbers::int[] | name::text
--------|----------------|-----------
1       | {1,3,200}      | Alice
2       | {21,100}       | Bob
I want all names where there is a number between 20 and 30 - so I want Bob.
The query
SELECT name FROM table WHERE 20 < ANY(numbers) AND 30 > ANY(numbers)
will return Alice and Bob, because Alice has numbers > 20 as well as other numbers < 30.
A BETWEEN syntax is not allowed with ANY, even though BETWEEN only gets mapped to two plain comparisons internally anyway.
Quoting the docs on how BETWEEN maps to plain comparisons:
There is no difference between the two respective forms apart from the
CPU cycles required to rewrite the first one into the second one
internally.
PS: just to avoid adding a new question for this, how can I solve
id::int | numbers::int[] | name::text
--------|----------------|-----------
1       | {1,3,200}      | Alice
2       | {21,100}       | Alicia
SELECT id FROM table WHERE ANY(name) LIKE 'Alic%'
desired result: 1, 2
I can only find examples of matching one value against multiple patterns, but not of matching one pattern against a set of values. Besides, the shown syntax is invalid: ANY has to be the second operand, but the second operand of LIKE has to be the pattern.

exists (select * from (select unnest(array[1,101]) x ) q1 where x between 10 and 20 )
You can create a function based on this query.
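As a rough sketch, that exists/unnest pattern applied to the example table from the question could look like this (the table name people is just an assumption for illustration):
SELECT name
FROM people p
WHERE EXISTS (
    SELECT 1
    FROM unnest(p.numbers) AS n   -- one row per array element
    WHERE n BETWEEN 20 AND 30
);
-- should return only Bob, because only {21,100} has an element inside [20,30]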
second approach:
select int4range(10,20,'[]') @> any(array[1, 101])
For timestamps and dates it looks like:
select tsrange('2015-01-01'::timestamp, '2015-05-01'::timestamp, '[]') @> any(array['2015-05-01', '2015-05-02']::timestamp[])
For more info, read about range operators.
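Applied to the example table from the question (again assuming a table called people), the range approach could be written as:
SELECT name
FROM people
WHERE int4range(20, 30, '[]') @> ANY (numbers);
-- Bob matches via 21; none of Alice's elements fall inside [20,30]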

Related

How to split one column into two separate columns according to a third column in knex.js?

I have a task that I have been cracking my head over.
I have this table transactions with two columns, bonus and type, like:
bonus | type
------|-----
20    | 1
15    | -1
What I want is to have a query with bonus column divided into two columns bonus_spent and bonus_left by type.
It should probably look like this one:
bonus_left | bonus_spent
-----------|------------
20         | 15
I know I can duplicate the table and join the copies with a WHERE clause, but is there any way I can do this operation in a single query?
In vanilla SQL you would use conditional aggregation. We use the user_id column which indicates who the bonus belongs to and I've used SUM for aggregation to allow for there being more than one of each type of bonus:
SELECT user_id,
SUM(CASE WHEN type = 1 THEN bonus ELSE 0 END) AS bonus_left,
SUM(CASE WHEN type = -1 THEN bonus ELSE 0 END) AS bonus_spent
FROM transactions
GROUP BY user_id
Output:
user_id | bonus_left | bonus_spent
--------|------------|------------
1       | 20         | 15
Demo on dbfiddle
I agree with Nick and you should mark that answer correct IMHO. For completeness and some Knex:
knex('users AS u')
  .join('transactions AS t', 'u.id', 't.user_id')
  .select('u.id', 'u.name')
  .select(knex.raw('SUM(CASE WHEN t.type = 1 THEN t.bonus ELSE 0 END) AS bonus_left'))
  .select(knex.raw('SUM(CASE WHEN t.type = -1 THEN t.bonus ELSE 0 END) AS bonus_spent'))
  .groupBy('u.id', 'u.name')
Note that, lacking your table schema, this is untested, but it'll look roughly like this. The two SUMs are embedded as knex.raw in the select list (rather than going through Knex's .sum() helper) because Postgres won't accept an alias inside the aggregate itself, and the GROUP BY is needed once non-aggregated columns appear alongside the aggregates.
Consider creating the type as a Postgres enum. This would allow you to avoid having to remember what a 'magic number' is in your table, instead writing comparisons like:
CASE WHEN type = 'bonus_left'
It also stops you from accidentally entering some other integer, like 99, because Postgres will type-check the insertion.
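A hypothetical migration along those lines might look roughly like this; the type name bonus_kind and the two labels are made up for illustration, not taken from the question:
CREATE TYPE bonus_kind AS ENUM ('bonus_left', 'bonus_spent');
ALTER TABLE transactions
    ALTER COLUMN type TYPE bonus_kind
    USING (CASE WHEN type = 1 THEN 'bonus_left' ELSE 'bonus_spent' END)::bonus_kind;
After that, the CASE in the aggregation can compare against a readable label instead of 1 or -1.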
I have a nagging concern that having bonus 'left' vs 'spent' in the same table reflects a wider problem with the schema (for example, why isn't the total amount of bonus remaining the only value we need to track?) but perhaps that's just my paranoia!

Designing a database to save and query a dynamic range?

I need to design a (postgres) database table which can save a dynamic range of something.
Example:
We have a course table. Each course can have (a minimum AND a maximum) OR (a specific amount) of participants.
A math course can be started with 4 to 10 students while a physics course needs to have exactly 8 students to start.
After that, I want to be able to query that.
Let's say, I want all courses which can take 6 students. The math course should be returned, the physics course shouldn't as it requires exactly 8 students.
When I query for 8 students, both courses should be returned.
For the implementation I thought about two simple fields: min_students and max_students. Then I could simply check if the number is equal to or between these numbers.
The issue is: I have to fill both columns every time, even for the physics course, which requires exactly 8 students.
example:
name | min_students | max_students
--------|--------------|-------------
math | 4 | 10
physics | 8 | 8
Is there a more elegant/efficient way? I also thought about making the max_students column nullable so I could check for
min_students = X OR (min_students >= X AND max_students <= Y)
Would that be more efficient? What about the performance?
Each course can have (a minimum AND a maximum) OR (a specific amount) of participants.
All courses have a minimum and a maximum; for some courses they happen to be the same value. It might seem trivial, but thinking about it that way lets you define the problem in a simpler way.
Instead of:
min_students = X OR (min_students >= X AND max_students <= Y)
you can express it as:
num_students BETWEEN min_students AND max_students
BETWEEN is inclusive, so 8 BETWEEN 8 and 8 is true
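As a small sketch against the two-column layout (assuming the table is simply called course):
-- all courses which can take 6 students
SELECT name
FROM course
WHERE 6 BETWEEN min_students AND max_students;
-- returns math (4-10) but not physics (8-8); querying with 8 returns both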
Regarding optimizations
Additional conditionals make queries much harder for humans to understand, which leads to missed edge cases and usually results in inefficient queries anyway. Focus on making the code easy to understand, or "elegant", and never sacrifice readability for performance unless you are really sure that you have a performance issue in the first place and that your optimization actually helps.
If you have a table with 10M rows it might be worth looking at aggressively optimizing disk usage if you run on extremely limited hardware, but shrinking a table by even 20 MB is almost certainly a waste of time in any normal circumstance, even when it doesn't make the code more complicated.
Besides, each row carries a header of 23-24 bytes in addition to any actual data it contains, so shaving off a byte or two wouldn't make a big difference. Setting values to NULL can actually increase disk usage in some situations.
Alternative solution
When using a range data type the comparison would look like this:
num_students @> x
where num_students represents a range (for example 4 to 10) and @> means "contains the value"
create table num_sequence (num int);
create table courses_range (name text, num_students int4range);
insert into num_sequence select generate_series(3,10);
insert into courses_range values
('math', '[4,4]'), ('physics', '[6,7]'), ('dance', '[7,9]');
select * from num_sequence
left join courses_range on num_students @> num;
num | name | num_students
-----+---------+--------------
3 | |
4 | math | [4,5)
5 | |
6 | physics | [6,8)
7 | physics | [6,8)
7 | dance | [7,10)
8 | dance | [7,10)
9 | dance | [7,10)
10 | |
Note that the ranges are output formatted like [x,y); hard brackets mean inclusive while parentheses mean exclusive, and for integers [4,4] = [4,5) = (3,5).
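If the table grows large, the range column can also be indexed; range types support GiST indexes, which containment queries can take advantage of (the index name below is just illustrative):
CREATE INDEX courses_range_num_students_idx
    ON courses_range USING gist (num_students);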

Best way to filter int columns in Postgres

Imagine I have a table like this:
id | value
----|------
1 | 1200
2 | 3450
3 | 1230
4 | 1245
5 | 4512
Both id and value are integers. Now I want to filter all rows whose value starts with 12; in this case I want these ids: 1, 3, 4.
I can think of a few different ways to do this:
Cast the value field to a string and then use LIKE or a regex to filter
Use div to divide those values by 100 and then compare
Use shift-right to shift those values by 2 and then compare
In the first case I'm not really sure about the performance, because working with strings tends to be slower.
In the third case I don't know how to do it, or whether something like that is even possible.
Generally I want to know what the best practice is and how to do it. Is there a better way that I couldn't find?
In addition, I'm using Postgres and working with the SQLAlchemy ORM,
so either a SQL query or a SQLAlchemy query will be accepted.
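For reference, the first two options from the list above would look roughly like this in plain SQL (the table name is assumed, and the division trick only works while every value has the same number of digits, as in the sample data):
-- 1) cast to text and pattern-match
SELECT id FROM the_table WHERE value::text LIKE '12%';
-- 2) integer division (12xx / 100 = 12 for four-digit values)
SELECT id FROM the_table WHERE value / 100 = 12;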

How to count frequency of columns in row in a typedpipe in scalding?

I'm currently working on a mapreduce job using scalding. I'm trying to threshold based on how many times I see a particular value among the rows in my typedpipe. For example, if I had these rows in my typedpipe:
Column 1 | Column 2
'hi' | 'hey'
'hi' | 'ho'
'hi' | 'ho'
'bye' | 'bye'
I would want to append to each row the frequency with which its value in column 1 and its value in column 2 occur across all rows. Meaning the output would look like:
Column 1 | Column 2 | Column 1 Freq | Column 2 Freq
'hi' | 'hey'| 3 | 1
'hi' | 'ho' | 3 | 2
'hi' | 'ho' | 3 | 2
'bye' | 'bye' | 1 | 1
Currently, I'm doing that by grouping the typed pipe by each column, like so:
val key2Freqs = input.groupBy('key2) {
_.size('key2Freq)
}.rename('key2 -> 'key2Right).project('key2Right, 'key2Freq);
Then joining the original input with key2Freqs like so:
.joinWithSmaller('key2 -> 'key2Right, key2Freqs, joiner = new LeftJoin)
However, this is really slow and seems to me to be pretty inefficient for what is essentially a pretty simple task. It gets especially long b/c I have 6 different keys that I want to get these values for, and I am currently mapping and joining 6 different times in my job. There must be a better way to do this, right?
If the number of distinct values in each column is small enough to fit them all into memory, you could .map your columns into a Map[String,Int], and then .groupAll.sum to count them all in one go (I am using the "typed api" notation; I don't quite remember how exactly this is done in the fields api, but you get the idea). You'll need the MapMonoid from Algebird, or you can just write your own if you don't want to add a dependency for this one thing; it is not hard.
You'd then end up with a pipe, containing a single entry for the resulting Map. Now, you can get your original pipe, and do .crossWithTiny to bring the map with counts into it, and then .map to extract individual counts.
Otherwise, if you can't keep all that in memory, then what you are doing now seems like the only way ... unless you are actually looking for an approximation of "top hitters" rather than exact counts over the entire universe, in which case check out Algebird's SketchMap.

PostgreSQL Fuzzy Searching multiple words with Levenshtein

I am working out a PostgreSQL query to allow for fuzzy searching capabilities when searching for a company's name in an app that I am working on. I have found and have been working with Postgres' Levenshtein method (part of the fuzzystrmatch module) and for the most part it is working. However, it only seems to work when the company's name is one word, for example:
With Apple (which is stored in the database as simply apple) I can run the following query and have it work near perfectly (it returns a levenshtein distance of 0):
SELECT * FROM contents
WHERE levenshtein(company_name, 'apple') < 4;
However when I take the same approach with Sony (which is stored in the database as Sony Electronics INC) I am unable to get any useful results (entering Sony gives a levenshtein distance of 16).
I have tried to remedy this problem by breaking the company's name down into individual words and inputting each one individually, resulting in something like this:
user input => 'sony'
SELECT * FROM contents
WHERE levenshtein('Sony', 'sony') < 4
OR levenshtein('Electronics', 'sony') < 4
OR levenshtein('INC', 'sony') < 4;
So my question is this: is there some way that I can accurately implement a multi-word fuzzy search with the current general approach that I have now, or am I looking in the complete wrong place?
Thanks!
Given your data and the following query with wild values for the Levenshtein Insertion (10000), Deletion (100) and Substitution (1) cost:
with sample_data as (
    select 101 "id", 'Sony Entertainment Inc' as "name"
    union
    select 102 "id", 'Apple Corp' as "name"
)
select sample_data.id, sample_data.name, components.part,
       levenshtein(components.part, 'sony', 10000, 100, 1) ld_sony
from sample_data
inner join (
    select sd.id,
           lower(unnest(regexp_split_to_array(sd.name, E'\\s+'))) part
    from sample_data sd
) components on components.id = sample_data.id
The output looks like this:
id | name | part | ld_sony
-----+------------------------+---------------+---------
101 | Sony Entertainment Inc | sony | 0
101 | Sony Entertainment Inc | entertainment | 903
101 | Sony Entertainment Inc | inc | 10002
102 | Apple Corp | apple | 104
102 | Apple Corp | corp | 3
(5 rows)
Row 1 - no changes
Row 2 - 9 deletions and 3 changes
Row 3 - 1 insertion and 2 changes
Row 4 - 1 deletion and 4 changes
Row 5 - 3 changes
I've found that splitting the words out causes a lot of false positives when you give a threshold. You can order by the Levenshtein distance to position the better matches close to the top. Maybe tweaking the Levenshtein cost variables will help you to order the matches better. Sadly, Levenshtein doesn't weight earlier changes differently than later changes.
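For example, ordering by distance against the contents table from the question, rather than filtering on a hard threshold, might look something like this:
SELECT company_name,
       levenshtein(lower(company_name), 'sony') AS distance
FROM contents
ORDER BY distance
LIMIT 10;
-- lower() matters because levenshtein() is case-sensitive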