Thinking Sphinx(v4.4) - Error while indexing after including 'where' condition in index definition - sphinx

Thinking sphinx gem version - 4.4.1
Sphinx version - 3.3.1
When I rebuild the index after adding a where condition to the index definition, I get:
ERROR: index 'article_core': no full text fields in schema, nothing to index!
The index definition is below:
ThinkingSphinx::Index.define :article, :with => :active_record do
  indexes title
  # where "text = 'Past Simple'" # type of text column is text: rebuild successful
  # where "id > 1" # type of id column is int: rebuild successful
  where "photo = 'photo'" # type of photo column is String : rebuild fails
end
The issue occurs when the where condition inside the index definition targets a string column (character varying).
It does not occur when the where condition targets columns of other data types.

I'm wondering if you're using PostgreSQL as your application database? And maybe you don't have any articles which match the where condition? There's an issue with Sphinx 3.x releases where empty indices raise an error when indexing (and it seems it's not fixed in v3.3.1, which only came out last week).
I've lodged this issue with the Sphinx team, but there's not been any response thus far.
If you really want to use SQL-backed indices, I'm afraid you're going to have to downgrade to Sphinx 2.2.11, or consider switching to Manticore (a drop-in fork of Sphinx that doesn't have this issue). Or, you could instead use real-time indices, which work fine with Sphinx v3. If so, you can use the scope method within an index to limit the results:
ThinkingSphinx::Index.define :article, :with => :real_time do
  indexes title
  scope { Article.where(:photo => "photo") }
end


Search jsonb fields in postgresql with Hasura

Is it possible to do a greater-than search across a jsonb field using Hasura?
It looks to be possible in PostgreSQL itself (see "How can I do less than, greater than in JSON Postgres fields?").
in postgres I'm storing a table
asset
name: string
version: int
metadata: jsonb
The metadata looks like this:
{"length": 5}
I am able to find assets that match exactly using _contains.
{
  asset(where: {metadata: {_contains: {length: 5}}}) {
    name
    metadata
  }
}
I would like to be able to find asset with a length over 10.
I tried:
{
  asset(where: {metadata: {_gt: {length: 10}}}) {
    name
    metadata
  }
}
A. Doing it directly at the GraphQL level
The Hasura documentation on JSONB operators (_contains, _has_key, etc.) mentions only five operators:
The _contains, _contained_in, _has_key, _has_keys_any and _has_keys_all operators are used to filter based on JSONB columns.
So the direct answer to your question is no: it is not possible at the GraphQL level in Hasura.
(At least not yet. More operators may be implemented in future releases.)
B. Using derived views
But there is another way, explained in https://hasura.io/blog/postgres-json-and-jsonb-type-support-on-graphql-41f586e47536/#derived-data
This recommendation is repeated in https://github.com/hasura/graphql-engine/issues/6331:
We don't have operators like that for JSONB (might be solved by something like #5211) but you can use a view or computed field to flatten the text field from the JSONB column into a new column and then do a like on that.
The recipe is:
1. Create a view
CREATE VIEW assets -- note the plural here; name the view according to your style guide
AS
SELECT
  name,
  version,
  metadata,
  (metadata->>'length')::int AS meta_len -- cast to another numeric type if needed
FROM asset;
2. Register this view in Hasura
3. Use it in GraphQL queries like a regular table
E.g.
query {
  assets(where: {meta_len: {_gt: 10}}) {
    name
    metadata
  }
}
C. Using SETOF-functions
1. Create a SETOF function
CREATE FUNCTION get_assets(min_length int DEFAULT 0)
RETURNS SETOF asset
LANGUAGE SQL
STABLE
AS $$
  SELECT * FROM asset
  WHERE (metadata->>'length')::int > min_length;
$$;
2. Register it in Hasura
3. Use it in queries
query {
  get_assets(args: {min_length: 10}) {
    name
    metadata
  }
}
I think that is the last available option.
It will not give you the full "schemaless freedom" you may be looking for, but I don't know of any other way.

Firestore permutation explosion of composite index

I'm stuck on composite indexes in Firestore. I have several fields under a user, such as A (string), B (string), C (string), D (array), E (array), F (array), G (array). Users can be searched and queried by different combinations of these fields. For example, the query "A == 'Male', B == '2020'" tells me to create a composite index, and after I create it, the query "A == 'Male', B == '2020', C == 'Ontario'" still needs a new composite index.
What I'm wondering is: do I have to create a composite index for every permutation?
There are more than two array fields, but the SDK only allows one "array-contains" clause per query. What can I do about this? I have tried converting an array [element1, element2] into a structure like "element1: true, element2: true", which can be queried with "==" clauses. But the problem is that the array is dynamic: every time I append a "==" clause, the SDK tells me I need to create a new composite index.
Does anyone have any ideas about this?
There is no tooling that will create all the desired combinations automatically. However, the documentation suggests that you can use the Firebase CLI to deploy indexes that are defined using its JSON configuration. This configuration file is not documented, so you will have to reverse engineer it based on indexes that you create manually. An example of one such index configuration is here. What you can do is manually create an index, then run firebase init, choose Firestore, and it will dump the indexes to its JSON config, which you can edit and redeploy. As of today, you will have to run firebase init in a fresh folder to get new indexes from the server.
Once you know how to deploy indexes like this, you can write code to create all the combinations of indexes in that JSON config. It's not pretty, but it's doable.
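As a rough illustration of "write code to create all the combinations", here is a minimal Python sketch. It assumes the JSON layout that the Firebase CLI dumps (an "indexes" list whose entries have "collectionGroup", "queryScope" and "fields"); verify that against your own dumped file before relying on it. The collection name "users", the field names and the function names are hypothetical. Since only one array-contains clause is allowed per query, each generated index contains at most one array field.

```python
import itertools
import json


def _index(collection, fields):
    # One composite index entry, in the shape assumed for firestore.indexes.json.
    return {"collectionGroup": collection, "queryScope": "COLLECTION", "fields": fields}


def build_index_config(collection, equality_fields, array_fields):
    """Build one composite index per combination of equality fields,
    optionally extended with a single array-contains field."""
    indexes = []
    for size in range(1, len(equality_fields) + 1):
        for combo in itertools.combinations(equality_fields, size):
            base = [{"fieldPath": name, "order": "ASCENDING"} for name in combo]
            # A composite index needs at least two fields.
            if len(base) >= 2:
                indexes.append(_index(collection, base))
            for array_name in array_fields:
                extra = {"fieldPath": array_name, "arrayConfig": "CONTAINS"}
                indexes.append(_index(collection, base + [extra]))
    return {"indexes": indexes, "fieldOverrides": []}


config = build_index_config("users", ["A", "B", "C"], ["D", "E"])
print(json.dumps(config, indent=2))
```

The number of generated indexes grows exponentially with the number of equality fields, so it is worth generating only the combinations your queries actually use.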

ERROR: data type tstzrange[] has no default operator class for access method "gist" in Postgres 10

I am trying to create an index on a tstzrange[] column in PostgreSQL 10. I created the column via the pgAdmin 4 GUI, set its name, set its data type to tstzrange[] and set it as not null, nothing more.
I then ran CREATE EXTENSION btree_gist; on the database and it worked.
Then I saw in the documentation that I should index the range, so I ran:
CREATE INDEX era_ac_range_idx ON era_ac USING GIST (era_ac_range);
...but then I get:
ERROR: data type tstzrange[] has no default operator class for
access method "gist"
Frankly, I don't know what this means or how to solve it. What should I do?
PS: that column is currently empty, it has no data yet.
PS2: This table describes chronological eras: there is an id, the era name (e.g. the sixties) and the time range (e.g. 1960-1969).
A date is entered by the user and I want to check which era it belongs to.
Well, you have an array of timestamp ranges as a single column. You can index an array with a GIN index and a range with GiST or SP-GiST. However, I'm not sure how an index on a column that is both would operate. I guess you could model it as an N-dimensional r-tree or some such.
I'm assuming you want to check for overlapping ranges. Could you normalise the data and have a linked table with one range in each row?
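To make the normalised shape concrete, here is a minimal Python sketch of the lookup it enables. It only models the data; the row layout (era id, lower bound, upper bound), the sample eras and the function name are assumptions for illustration. In PostgreSQL, the equivalent linked table would hold a plain tstzrange column per row, which GiST can index.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Hypothetical rows of the linked table: one (era_id, lower, upper) range per row.
ERA_RANGES = [
    (1, datetime(1960, 1, 1, tzinfo=UTC), datetime(1970, 1, 1, tzinfo=UTC)),
    (2, datetime(1970, 1, 1, tzinfo=UTC), datetime(1980, 1, 1, tzinfo=UTC)),
]


def eras_containing(moment, ranges=ERA_RANGES):
    """Return the ids of every era whose half-open range [lower, upper) contains moment."""
    return [era_id for era_id, lower, upper in ranges if lower <= moment < upper]


print(eras_containing(datetime(1965, 6, 1, tzinfo=UTC)))
```

With one range per row, an era that spans several disjoint periods simply gets several rows with the same era id.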

Does ActiveRecord#first method always return record with minimal ID?

Env: Rails 4.2.4, Postgres 9.4.1.0
Is there a guarantee that ActiveRecord#first method will always return a record with minimal ID and ActiveRecord#last - with maximum ID?
I can see from the Rails console that the appropriate ORDER BY ... ASC/DESC is added to the generated SQL for these two methods. But the author of another SO thread, "Rails with Postgres data is returned out of order", says that the first method did NOT return the first record...
ActiveRecord first:
2.2.3 :001 > Account.first
Account Load (1.3ms) SELECT "accounts".* FROM "accounts" ORDER BY "accounts"."id" ASC LIMIT 1
ActiveRecord last:
2.2.3 :002 > Account.last
Account Load (0.8ms) SELECT "accounts".* FROM "accounts" ORDER BY "accounts"."id" DESC LIMIT 1
==========
ADDED LATER:
So, I did my own investigation (based on D-side's answer) and the answer is NO. Generally speaking, the only guarantee is that the first method will return the first record from a collection. As a side effect it may add an ORDER BY on the primary key to the SQL, but that depends on whether the records were already loaded into memory or not.
Here's methods extraction from Rails 4.2.4:
/activerecord/lib/active_record/relation/finder_methods.rb
# Find the first record (or first N records if a parameter is supplied).
# If no order is defined it will order by primary key.
# ---> NO, IT IS NOT. <--- This comment is WRONG.
def first(limit = nil)
  if limit
    find_nth_with_limit(offset_index, limit)
  else
    find_nth(0, offset_index) # <---- When we get there - `find_nth_with_limit` method will be triggered (and will add `ORDER BY`) only when its `loaded?` is false
  end
end

def find_nth(index, offset)
  if loaded?
    @records[index] # <--- Here's the `problem` where record is just returned by index, no `ORDER BY` is applied to SQL
  else
    offset += index
    @offsets[offset] ||= find_nth_with_limit(offset, 1).first
  end
end
Here's a few examples to be clear:
Account.first                        # True, records are ordered by ID
a = Account.where('free_days > 1')   # False, no ordering
a.first                              # False, no ordering, record simply returned by @records[index]
Account.where('free_days > 1').first # True, ordered by ID
a = Account.all                      # False, no ordering
a.first                              # False, no ordering, record simply returned by @records[index]
Account.all.first                    # True, ordered by ID
Now examples with has-many relationship:
Account has_many AccountStatuses, AccountStatus belongs_to Account
a = Account.first
a.account_statuses # No ordering
a.account_statuses.first
# Here is the tricky part: sometimes it returns the @records[index] entry, sometimes it may add ORDER BY ID (if records were not loaded before)
Here is my conclusion:
Treat the first method as returning the first record from an already loaded collection (which may have been loaded in any order, i.e. unordered). If I want to be sure that first returns the record with the minimal ID, then the collection I call first on should be explicitly ordered beforehand.
And the Rails documentation about the first method is just wrong and needs to be rewritten.
http://guides.rubyonrails.org/active_record_querying.html
1.1.3 first
The first method finds the first record ordered by the primary key. <--- No, it is not!
If sorting is not chosen, the rows will be returned in an unspecified
order. The actual order in that case will depend on the scan and join
plan types and the order on disk, but it must not be relied on. A
particular output ordering can only be guaranteed if the sort step is
explicitly chosen.
http://www.postgresql.org/docs/9.4/static/queries-order.html (emphasis mine)
So ActiveRecord actually adds ordering by primary key, whichever that is, to keep the result deterministic. Relevant source code is easy to find using pry, but here are extracts from Rails 4.2.4:
# show-source Thing.all.first
def first(limit = nil)
  if limit
    find_nth_with_limit(offset_index, limit)
  else
    find_nth(0, offset_index)
  end
end

# show-source Thing.all.find_nth
def find_nth(index, offset)
  if loaded?
    @records[index]
  else
    offset += index
    @offsets[offset] ||= find_nth_with_limit(offset, 1).first
  end
end

# show-source Thing.all.find_nth_with_limit
def find_nth_with_limit(offset, limit)
  relation = if order_values.empty? && primary_key
    order(arel_table[primary_key].asc) # <-- ATTENTION
  else
    self
  end
  relation = relation.offset(offset) unless offset.zero?
  relation.limit(limit).to_a
end
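The two code paths in the extracts above can be mimicked with a toy model. This is a Python sketch, not Rails code: ToyRelation and its methods are hypothetical names, and the list passed in stands for rows in whatever order the database happens to return them.

```python
class ToyRelation:
    """Toy stand-in for a relation; it only mirrors the loaded/unloaded branching."""

    def __init__(self, rows):
        self._table = rows      # rows in arbitrary "database" order
        self._records = None    # populated once the relation is loaded

    def load(self):
        # Models an unordered SELECT: keep the order the table yields.
        self._records = list(self._table)
        return self

    def first(self):
        if self._records is not None:
            # Loaded branch: plain index lookup, no ordering applied.
            return self._records[0] if self._records else None
        # Unloaded branch: models ORDER BY primary key ASC LIMIT 1.
        return min(self._table, key=lambda row: row["id"], default=None)


rows = [{"id": 3}, {"id": 1}, {"id": 2}]
print(ToyRelation(rows).first())         # unloaded: lowest id
print(ToyRelation(rows).load().first())  # loaded: whatever came back first
```

In this model the unloaded relation yields the row with id 1, while the loaded one yields the row with id 3, which matches the observation that first only guarantees the minimal ID when the ordering is part of the query.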
It may depend on your database engine. With MySQL, first always returned the record with the minimal ID for me, but it did not work the same way with PostgreSQL. I had several issues with this when I was a newbie: my app worked as expected locally with MySQL, but everything was messed up when deployed to Heroku with PostgreSQL. To avoid such issues, always order your records by id explicitly in the query:
Account.order(:id).first
The above ensures the minimal ID for MySQL, PostgreSQL and any other database engine, as you can see in the generated query:
SELECT `accounts`.* FROM `accounts` ORDER BY `accounts`.`id` ASC LIMIT 1
I don't think that answer you reference is relevant (even to the question it is on), as it refers to non-ordered querying, whereas first and last do apply an order based on id.
In some cases, where you are applying your own group on the query, you cannot use first or last because an order by cannot be applied if the grouping does not include id, but you can use take instead to just get the first row returned.
There have been versions where first and/or last did not apply the order (one of the late Rails 3 on PostgreSQL as I recall), but they were errors.

PostgreSQL: Full text search multitenant site, plus only parts of site

I'm developing a multitenant web application, and I want to add full text search, so that people will be able to:
1) search only the site they are currently visiting (but not all sites), and
2) search only a section of that site (e.g. restrict search to a blog or a forum on the site), and
3) search a single forum thread only.
I wonder what indexes should I add?
Please assume the database is huge (so that e.g. index-scanning-by-site-ID and then filtering-by-full-text-search is too slow).
I can think of three approaches:
Create three indexes. 1) One that indexes everything on a per site basis.
And 2) one that indexes everything on a per-site plus site-section basis.
And 3) one that indexes everything on a per-site and page-id basis.
Create one single index, and insert into [the text to index] magic words like:
"site_<site-id>"
and "section_<section-id>" and "page_<page-id>", and then when I search
for section XX in site YYY I could prefix the search query like so:
"site_YYY AND section_XX AND ...".
Dynamically add database indexes when a new site or site section is created:
create index dw1_posts__search_site_YYY
on dw1_posts using gin(to_tsvector('english', approved_text))
where site_id = 'YYY';
Does any of these three approaches above make sense? Are there better alternatives?
(Details: However, perhaps approach 1 is impossible? Attempting to index-a-column and also index-for-full-text-searching at the same time, results in syntax errors:
> create index dw1_posts__search_site
on dw1_posts (site_id)
using gin(to_tsvector('english', approved_text));
ERROR: syntax error at or near "using"
LINE 1: ...dex dw1_posts__search_site on dw1_posts(site_id) using gin(...
^
> create index dw1_posts__search_site
on dw1_posts
using gin(to_tsvector('english', approved_text))
(site_id);
ERROR: syntax error at or near "("
LINE 1: ... using gin(to_tsvector('english', approved_text)) (site_id);
If approach 1 were possible, then I could do queries like:
select ... from ... where site_id = ... and <full-text-search-column> @@ <query>;
and have PostgreSQL first check site_id and then the full-text-search column, using one single index.
/ End details.)
Update, one week later: I'm using ElasticSearch instead. I got the impression that no scalable solution exists for faceted search with relational databases / PostgreSQL. And integrating with ElasticSearch seems to be roughly as simple as implementing, testing and tweaking the approaches suggested here. (For example, PostgreSQL's stemmer/whatever-it's-called might split "section_NNN" into two words, "section" and "NNN", and thus index words that don't exist on the page! Such small annoying issues are tricky to fix.)
The normal approach would be to create:
one full text index:
CREATE INDEX idx1
  ON dw1_posts USING gin(to_tsvector('english', approved_text));
a simple index on the site_id:
CREATE INDEX idx2
  ON dw1_posts(site_id);
another simple index on the page_id:
CREATE INDEX idx3
  ON dw1_posts(page_id);
Then it's the SQL planner's business to decide which ones to use if any, and in what order depending on the queries and the distribution of values in the columns. There is no point in trying to outsmart the planner before you've actually witnessed slow queries.
Another alternative, similar to the "site_<site-id>" and "section_<section-id>" and "page_<page-id>" approach, would be to prefix the text to index with:
SiteSectionPage_<site-id>_<section-id>_<subsection-id>_<page-id>
And then use prefix matching (i.e. :*) when searching:
select ... from .. where .. @@ 'SiteSectionPage_NN_MMM:* & (the search phrase)'
where NN is the site ID and MMM is the section ID.
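A minimal Python sketch of how an application might assemble such a query string before passing it to to_tsquery as a bound parameter. The function name is hypothetical, and the sketch assumes the tag survives tokenization as a single lexeme, which the question's update suggests may not hold for names containing underscores. Note that to_tsquery syntax uses & rather than the word AND.

```python
import re


def scoped_tsquery(site_id, section_id, phrase):
    """Combine the prefix tag with the user's search words into one tsquery string."""
    tag = f"SiteSectionPage_{site_id}_{section_id}:*"
    # Keep only word characters so user input cannot inject tsquery operators.
    terms = re.findall(r"\w+", phrase)
    if not terms:
        return tag
    return f"{tag} & ({' & '.join(terms)})"


print(scoped_tsquery(12, 345, "full text search"))
```

The result would be used as in `... @@ to_tsquery('english', %s)`, with the string supplied as a query parameter rather than interpolated into the SQL.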
But this won't work with Chinese? I think trigrams are appropriate when indexing Chinese, but then SiteSectionPage... will be split into: Sit, ite, teS, eSe, which makes no sense.