Complex SphinxQL Query - sphinx

I'm trying to write a SphinxQL query that would replicate the following MySQL query against a Sphinx RT index:
SELECT id FROM table WHERE colA LIKE 'valA' AND (colB = valB OR colC = valC OR ... colX = valX ... OR colY LIKE 'valY' .. OR colZ LIKE 'valZ')
As you can see, I'm trying to get all the rows where one string column matches a certain value AND any one of a list of conditions is true (mixing and matching string and integer columns/values).
This is what I've gotten so far in SphinxQL:
SELECT id, (intColA = intValA OR intColB = intValB ...) as intCheck FROM rt_index WHERE MATCH('@requiredMatch = requiredValue');
The problem I'm running into is in matching all of the potential optional string values. The best possible query (if multiple MATCH statements were allowed and they were allowed as expressions) would be something like
SELECT id, (intColA = intValA OR MATCH('@checkColA valA|valB') OR ...) as optionalMatches FROM rt_index WHERE optionalMatches = 1 AND MATCH('@requireCol requiredVal')
I can see a potential way to do this with CRC32 string conversions and MVA attributes, but these aren't supported with RT indexes and I REALLY would prefer not to switch away from them.

One way would be to simply convert all your columns to normal full-text fields. Then you can put all this logic inside the MATCH(...), i.e. not using attributes at all.
Yes, you can only have one MATCH per query.
Otherwise, yes, you could use the CRC trick to turn string attributes into integer ones, so they can be used for filtering.
Not sure why you would need MVAs, but they are now supported in RT indexes as of 2.0.2.
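For example, if the optional columns were indexed as full-text fields rather than attributes (field names below are just the placeholders from the question, and the exact field-operator grouping may need adjusting for your Sphinx version), the whole OR list could live inside the single MATCH():
SELECT id
FROM rt_index
WHERE MATCH('@requiredCol requiredVal ((@colB valB) | (@colC valC) | (@colZ valZ))');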

Related

In Postgres, how can I efficiently filter using the inner numbers of this jsonb structure?

So I work with Postgres SQL, and I have a jsonb column with the following structure:
{
  "Store1": [
    {
      "price": 5.99,
      "seller": "seller"
    },
    {
      "price": 56.43,
      "seller": "seller"
    }
  ],
  "Store2": [
    {
      "price": 45.65,
      "seller": "seller"
    },
    {
      "price": 44.66,
      "seller": "seller"
    }
  ]
}
I have a jsonb like this for every product in the database. I want to run an SQL query that will answer the following question:
For each product, is one of the prices in this JSON bigger than/equal to/smaller than X?
Basically, filter the products to include only the ones that have at least one price satisfying a mathematical condition.
How can I do it efficiently? What's the best way in Postgres to iterate a JSON like this, with a relatively complex inner structure?
Also, if I could control the way the data is structured (to an extent, I can), what changes can I do to make this query more efficient?
Thanks!
Use a json path expression:
WHERE col @@ '$.*[*].price < 20'
or
WHERE col @? '$.*[*] ? (@.price < 20)'
If you need to compare to another column or make the query parameterised, you can either build the jsonpath dynamically
WHERE col @@ format('$.*[*].price < %s', $1)::jsonpath
WHERE col @? format('$.*[*] ? (@.price < %s)', $1)::jsonpath
or you can use the respective function and pass variables as an object:
WHERE jsonb_path_match(col, '$.*[*].price < $limit', jsonb_build_object('limit', $1))
WHERE jsonb_path_exists(col, '$.*[*] ? (@.price < $limit)', jsonb_build_object('limit', $1))
I admit I had to check my cheat sheet to figure out the right combination of operator and expression. Takeaways:
if a comparison operator needs to work with multiple values, it generally functions as an ANY
@@ does not work with ? (@ …) filter expressions since they don't return a boolean,
@? does not work with predicates since they always return a value (even if it's false)
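Both pitfalls can be seen with made-up values (for illustration only):
SELECT '{"Store1":[{"price":99}]}'::jsonb @? '$.*[*].price < 20';       -- true: the predicate itself is the returned item
SELECT '{"Store1":[{"price":99}]}'::jsonb @@ '$.*[*] ? (@.price < 20)'; -- NULL: the filter returns objects, not a boolean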
What changes can I do to make this query more efficient?
As @jjanes commented on my other answer, the jsonpath match col @@ '$.*[*].price < $limit' isn't going to be fast and needs a full table scan, at least for < and >. To make a useful index, a different approach is required: an index can only compare against a single value, not an arbitrary number of them. For that, we need to change the condition from EXISTS(SELECT prices_of(col) WHERE price < $limit) to (SELECT MIN(prices_of(col))) < $limit.
With this idea it is possible to build an expression index on the result of a custom immutable function:
CREATE FUNCTION min_price(data jsonb) RETURNS float
LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT
RETURN (
SELECT min((offer ->> 'price')::float)
FROM jsonb_each(data) AS entries(name, store),
LATERAL jsonb_array_elements(store) AS elements(offer)
);
CREATE INDEX example_min_data_price_idx ON example (min_price(data));
which you can use as
SELECT * FROM example WHERE min_price(data) < 20;
Looking for rows with a price larger than a certain number requires a separate index on max_price(data). If you want to use the index in a JOIN with more conditions, consider making it a multi-column index.
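That index can follow the same pattern as min_price above (a sketch along the same lines, not tested):
CREATE FUNCTION max_price(data jsonb) RETURNS float
LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT
RETURN (
SELECT max((offer ->> 'price')::float)
FROM jsonb_each(data) AS entries(name, store),
LATERAL jsonb_array_elements(store) AS elements(offer)
);
CREATE INDEX example_max_data_price_idx ON example (max_price(data));
SELECT * FROM example WHERE max_price(data) > 20;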
Looking for rows with a price equal to a certain number can be optimised by indexing the jsonb column and using a jsonpath:
CREATE INDEX example_data_idx ON example USING GIN (data jsonb_ops);
SELECT * FROM example WHERE data @@ '$.*[*].price == 20';
SELECT * FROM example WHERE data @? '$.*[*] ? (@.price == 20)';
Unfortunately you can't use jsonb_path_ops here since that doesn't support the wildcard.

How to reference a column in the select clause in the order clause in SQLAlchemy like you do in Postgres instead of repeating the expression twice

In Postgres, if one of your columns is a big complicated expression, you can just say ORDER BY 3 DESC, where 3 is the position of the column containing the complicated expression. Is there any way to do this in SQLAlchemy?
As Gord Thompson observes in this comment, you can pass the column index as a text object to group_by or order_by:
q = sa.select(sa.func.count(), tbl.c.user_id).group_by(sa.text('2')).order_by(sa.text('2'))
serialises to
SELECT count(*) AS count_1, posts.user_id
FROM posts GROUP BY 2 ORDER BY 2
There are other techniques that don't require re-typing the expression.
You could use the selected_columns property:
q = sa.select(tbl.c.col1, tbl.c.col2, tbl.c.col3)
q = q.order_by(q.selected_columns[2]) # order by col3
You could also order by a label (but this will affect the names of result columns):
q = sa.select(tbl.c.col1, tbl.c.col2, tbl.c.col3.label('c')).order_by('c')
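which serialises to something like
SELECT tbl.col1, tbl.col2, tbl.col3 AS c
FROM tbl ORDER BY c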

How to format a number in Entity Framework LINQ (without trailing zeroes)?

In a SQL Server database I have a column of decimal datatype defined something like this:
CREATE TABLE MyTable
(
Id INT,
Number DECIMAL(9, 4)
)
I use Entity Framework and I would like to return column Number converted to a string with only the digits right of the decimal separator that are actually needed. A strict constraint is that the result must be an IQueryable.
So my query is:
IQueryable queryable = (
from myTable in MyDatabase.NyTable
select new
{
Id = myTable.Id,
Number = SqlFunctions.StringConvert(myTable.Number,9,4)
}
);
The problem with this is that it always converts the number to a string with 4 decimals, even if they are 0.
Examples:
3 is converted to "3.0000"
1.2 is converted to "1.2000"
If I use other parameters for StringConvert, e.g.
SqlFunctions.StringConvert(myTable.Number, 9, 2)
the results are also not OK:
0.375 gets rounded to 0.38.
StringConvert() function is translated into SQL Server function STR.
https://learn.microsoft.com/en-us/sql/t-sql/functions/str-transact-sql?view=sql-server-2017
This explains the weird results.
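Roughly, STR always emits the requested number of decimals and pads the result to the requested total length, rounding when needed; for illustration:
SELECT STR(3, 9, 4);     -- '   3.0000'
SELECT STR(0.375, 9, 2); -- '     0.38'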
In the realm of Entity Framework and LINQ I was not able to find a working solution.
What I'm looking for is something like the C# expression
String.Format("{0:0.####}", number)
but this cannot be used in a LINQ query.
In plain simple SQL I could write my query like this
SELECT
Id,
Number = CAST(CAST(Number AS REAL) AS VARCHAR(15))
FROM
MyTable
I have not managed to massage LINQ into producing a query like that.
A workaround would be to forget about doing this in LINQ (which is quite inflexible and messy here, borderline useless), just return the DECIMAL from the database, and do the formatting on the client side before displaying. But that is additional, unnecessary code and I would hate to do it that way if there is a simpler way via LINQ.
Is it possible to format numbers in LINQ queries?
I would absolutely return a decimal from the database and format it when needed, possibly directly after the query. But usually this is done at display time to take into account culture-specific formatting on the client.
var q =
(from myTable in MyDatabase.NyTable
select new
{
Id = myTable.Id,
Number = myTable.Number
})
.AsEnumerable()
.Select(x => new { Id = x.Id, Number = x.Number.ToString("G29") }); // "G29" drops trailing zeroes from the decimal

filtering on a range of values in a db column with sqlalchemy orm

I have a PostgreSQL database, and one particular table in it has many rows. One column in this table, called data, is a float array (REAL[]) and gets filled with an array of ~4500 elements. I want to access this table through some query via SQLAlchemy and the ORM.
How do I select all rows in the table where a subset of this column satisfies some condition, e.g. contains a range of values? For example, I want to select all rows where the data contains values >= 10, or values between >= 10 and <= 20.
Can I do this with a straight session query like
rows = session.query(Table).filter(Table.data.(some conditional)).all()
where my conditional is something like "VALUES >= 10 and VALUES <= 20"?
Or do I need to define some special methods, or setup, when I'm defining my SQLAlchemy table class? For example, I have my table set up as
class Table(Base):
    __tablename__ = 'table'
    __table_args__ = {'autoload': True, 'schema': 'testdb', 'extend_existing': True}
    data = deferred(Column(ARRAY(Float)))

    def __repr__(self):
        return '<Table (pk={0})>'.format(self.pk)
Ideally I'd like to set it up so I can just do simple filtering in my session.query calls. Is this possible? I'm not super familiar with the ORM, so maybe it is?
I've had a look at the ARRAY Comparator sqlalchemy docs but those only seem to work on exact values. My data is precise to 6 sigfigs, and I don't know the exact values ahead of time.
What's the best way to do this? Thanks.
EDIT:
Based on the below comment, here is the code I'm using in attempting to select all rows (out of 1000) that have data (from 1 column) >= 1.0. There should be 537 rows.
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This gives the correct subset number: len(rows) = 537. However, I don't understand the logic of this operator: to select data >= 1.0, I use the le operator? Also, along the same lines, there should be 234 rows that have data with values >= 1.0 and <= 1.2, but this statement fails to give the correct subset.
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
EDIT 2:
Here's an example of my database Table with a few rows. pk is an integer, and data is a real[].
db datadb
schema Table
pk data
0 [0.0,0.0,0.5,0.3,1.3,1.9,0.3,0.0,0.0]
1 [0.1,0.0,1.0,0.7,1.1,1.5,1.2,0.3,1.4]
2 [0.0,0.6,0.4,0.3,1.6,1.7,0.4,1.3,0.0]
3 [0.0,0.1,0.2,0.4,1.0,1.1,1.2,0.9,0.0]
4 [0.0,0.0,0.5,0.3,0.2,0.1,0.7,0.3,0.1]
I have 5 rows; 4 of them have data with values >= 1.0, while just 2 have values in the range >= 1.0 to <= 1.2. The query I would use to grab the rows in the first case is
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This should return the 4 rows, at pk=0,1,2,3. This query does what I expect. The second case
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
and should return the 2 rows at pk=1,3. However this query just returns the 4 rows from the first query. For the second query, I also tried
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le),datadb.Table.data.any(1.2,operator=operators.ge)).all()
which also didn't work.
Please read the documentation on ARRAY.Comparator, according to which you should be able to do the following:
from sqlalchemy.sql import operators

rows = (session.query(Table)
        .filter(Table.data.any(10, operator=operators.le))
        .filter(Table.data.any(20, operator=operators.ge))
        ).all()
Note that data.any(10, operator=operators.le) renders as 10 <= ANY (data), with the literal on the left-hand side, which is why le is the operator that selects elements >= 10.
EDIT:
# combined filter does not work,
# but applying one or the other is still useful as it reduces the result set
q = (session.query(MyTable)
     .filter(MyTable.data.any(1.0, operator=operators.le))
     # .filter(MyTable.data.any(1.2, operator=operators.ge))
     )

# filter in memory
items = [_row for _row in q.all()
         if any(1.0 <= item <= 1.2 for item in _row.data)]
for item in items:
    print(item)

Select from any of multiple values from a Postgres field

I've got a table that resembles the following:
WORD WEIGHT WORDTYPE
a 0.3 common
the 0.3 common
gray 1.2 colors
steeple 2 object
I need to pull the weights for several different words out of the database at once. I could do:
SELECT * FROM word_weight WHERE WORD = 'a' OR WORD = 'steeple' OR WORD='the';
but it feels ugly and the code to generate the query is obnoxious. I'm hoping that there's a way I can do something like (pseudocode):
SELECT * FROM word_weight WHERE WORD = 'a','the';
You are describing the functionality of the IN clause.
select * from word_weight where word in ('a', 'steeple', 'the');
If you want to pass the whole list in a single parameter, use array datatype:
SELECT *
FROM word_weight
WHERE word = ANY('{a,steeple,the}'); -- or ANY('{a,steeple,the}'::TEXT[]) to make explicit array conversion
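With an actual bind parameter (the $1 placeholder is just an example; the syntax depends on your driver), that might look like:
SELECT *
FROM word_weight
WHERE word = ANY($1::TEXT[]); -- pass '{a,steeple,the}' or a native array as the parameter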
If you are not sure of the values, or whether the field might be an empty string or even NULL, then:
.where("column_1 ILIKE ANY(ARRAY['','%abc%','%xyz%']) OR column_1 IS NULL")
The above query will cover all the possibilities.