Sphinx / Manticore - base one plain index off another? - sphinx

I have a plain text index that sucks data from MySQL and inserts it into Manticore in a format I need (e.g. converting datetime strings to timestamp, CONCATing some fields etc.
I then want to create a second plain text index based off this data to group it further. This will save me having to either re-run the normalisation that's done to the first index on INSERT or make it easier for me to query in the future.
For example, my first index is a list of all phone calls that have been made / received (telephone number, duration, agent). The second index should group by Year-Month-Date in such a way that I can see how many calls each agent made on that day. This means I end up with idx_phone_calls and idx_phone_calls_by_date.
Currently, I generate the first index from MySQL, then get Manticore to query itself (by setting the MySQL host to localhost. It works, but it feels as though I should be able to query Manticore directly from within the index. However, I'm struggling to find if that's possible.
Is there a better way to do it?

Well Sphinx/Manticore, has its own GROUP BY function. So maybe can just run the final query against the original index anyway, avoid the need for the second index.
Sphinx's Aggregation (in some way) is more powerful than MySQL, and can do some 'super aggregation' functions (like with WITHIN GROUP ORDER BY)
But otherwise there is no direct way to create an off another (eg there is no CREATE TABLE idx_phone_calls_by_date SELECT ... FROM idx_phone_calls ... )
Your 'solution' of directing indexer to query the data from searchd is good. In general this should be pretty efficent, particully on localhost, there is little overhead. Maintains the logical seperation of searchd being for queries, indexer being for well building indexes.

Related

Postgresql best index for datetime ranges

I have a Postgre table “tasks” with the fields “start”:timestamptz, “finish”:timestamptz, “type”:int (and a lot of others). It contains about 200m records. Start, finish and type fields have a separate b-tree indexes.
I’d like to build a report “Tasks for a period” and need to get all tasks which lay (fully or partially) inside the reporting period. Report could be built for all task types or for the specific one.
So I wrote the SQL:
SELECT * FROM tasks
WHERE start<={report_to}
AND finish>={report_from}
AND ({report_tasktype} IS NULL OR type={report_tasktype})
and it runs for ages even on short reporting periods.
Please advice if there a way to improve performance by altering the query or by creating new indexes on the table? For some reasons I can’t change the structure of the “tasks” table
You would want a GiST index on the range. Since you already have it stored as two end points rather than as a range, you could do a functional index to convert them on the fly.
ON task USING GIST (tstzrange(start,finish))
And then compare the ranges for overlap with &&
It may also improve things to add "type" as a second column to the index, which would require the btree_gist extension.

Tableau: Create a table calculation that sums distinct string values (names) when condition is met

I am getting my data from denormalized table, where I keep names and actions (apart from other things). I want to create a calculated field that will return sum of workgroup names but only when there are more than five actions present in DB for given workgroup.
Here's how I have done it when I wanted to check if certain action has been registered for workgroup:
WINDOW_SUM(COUNTD(IF [action] = "ADD" THEN [workgroup_name] END))
When I try to do similar thing with count, I am getting "Cannot mix aggregate and non-aggregate arguments":
WINDOW_SUM(COUNTD(IF COUNT([Number of Records]) > 5 THEN [workgroup_name] END))
I know that there's problem with the IF clause, but don't know how to fix it.
How to change the IF to be valid? Maybe there's an easier way to do it, that I am missing?
EDIT:
(after Inox's response)
I know that my problem is mixing aggregate with non-aggregate fields. I can't use filter to do it, because I want to use it later as a part of more complicated view - filtering would destroy the whole idea.
No, the problem is to mix aggregated arguments (e.g., sum, count) with non aggregate ones (e.g., any field directly). And that's what you're doing mixing COUNT([Number of Records]) with [workgroup_name]
If your goal is to know how many workgroup_name (unique) has more than 5 records (seems like that by the idea of your code), I think it's easier to filter then count.
So first you drag workgroup_name to Filter, go to tab conditions, select By field, Number of Records, Count, >, 5
This way you'll filter only the workgroup_name that has more than 5 records.
Now you can go with a simple COUNTD(workgroup_name)
EDIT: After clarification
Okay, than you need to add a marker that is fixed in your database. So table calculations won't help you.
By definition table calculation depends on the fields that are on the worksheet (and how you decide to use those fields to partition or address), and it's only calculated AFTER being called in a sheet. That way, each time you call the function it will recalculate, and for some analysis you may want to do, the fields you need to make the table calculation correct won't be there.
Same thing applies to aggregations (counts, sums,...), the aggregation depends, well, on the level of aggregation you have.
In this case it's better that you manipulate your data prior to connecting it to Tableau. I don't see a direct way (a single calculated field that would solve your problem). What can be done is to generate a db from Tableau (with the aggregation of number of records for each workgroup_name) then export it to csv or mdb and then reconnect it to Tableau. But if you can manipulate your database outside Tableau, it's usually a better solution

PostgreSQL: Full text search multitenant site, plus only parts of site

I'm developing a multitenant web application, and I want to add full text search, so that people will be able to:
1) search only the site they are currently visiting (but not all sites), and
2) search only a section of that site (e.g. restrict search to a blog or a forum on the site), and
3) search a single forum thread only.
I wonder what indexes should I add?
Please assume the database is huge (so that e.g. index-scanning-by-site-ID and then filtering-by-full-text-search is too slow).
I can think of three approaches:
Create three indexes. 1) One that indexes everything on a per site basis.
And 2) one that indexes everything on a per-site plus site-section basis.
And 3) one that indexes everything on a per-site and page-id basis.
Create one single index, and insert into [the text to index] magic words like:
"site_<site-id>"
and "section_<section-id>" and "page_<page-id>", and then when I search
for section XX in site YYY I could prefix the search query like so:
"site_XX AND section_YYY AND ...".
Dynamically add database indexes when a new site or site section is created:
create index dw1_posts__search_site_YYY
on dw1_posts using gin(to_tsvector('english', approved_text))
where site_id = 'YYY';
Does any of these three approaches above make sense? Are there better alternatives?
(Details: However, perhaps approach 1 is impossible? Attempting to index-a-column and also index-for-full-text-searching at the same time, results in syntax errors:
> create index dw1_posts__search_site
on dw1_posts (site_id)
using gin(to_tsvector('english', approved_text));
ERROR: syntax error at or near "using"
LINE 1: ...dex dw1_posts__search_site on dw1_posts(site_id) using gin(...
^
> create index dw1_posts__search_site
on dw1_posts
using gin(to_tsvector('english', approved_text))
(site_id);
ERROR: syntax error at or near "("
LINE 1: ... using gin(to_tsvector('english', approved_text)) (site_id);
(If approach 1 was possible, then I could do queries like:
select ... from ... where site_id = ... and <full-text-search-column> ## <query>;
and have PostgreSQL first check site_id and then the full-text-search column, using one single index.)
)
/ End details.)
Update, one week later: I'm using ElasticSearch instead. I got the impression that no scalable solution exists, for faceted search, with relational databases / PostgreSQL. And integrating with ElasticSearch seems to be roughly as simple as implementing and testing and tweaking the approaches suggested here. (For example, PostgreSQL's stemmer/whatever-it's-called might split "section_NNN" into two words: "section" and "NNN" and thus index words that doesn't exist on the page! Tricky to fix such small annoying issues.)
The normal approach would be to create:
one full text index:
CREATE INDEX idx1
ON dw1_posts USING gin(to_tsvector('english', approved_text));
a simple index on the site_id:
CREATE INDEX idx2
on dw1_posts(page_id);
another simple index on the page_id:
CREATE INDEX idx3
on dw1_posts(site_id);
Then it's the SQL planner's business to decide which ones to use if any, and in what order depending on the queries and the distribution of values in the columns. There is no point in trying to outsmart the planner before you've actually witnessed slow queries.
Another alternative, which is similar to the "site_<site-id>" and "section_<section-id>" and "page_<page-id>" alternative, should be to prefix the text to index with:
SiteSectionPage_<site-id>_<section-id>_<subsection-id>_<page-id>
And then use prefix matching (i.e. :*) when searching:
select ... from .. where .. ## 'SiteSectionPage_NN_MMM:* AND (the search phrase)'
where NN is the site ID and MMM is the section ID.
But this won't work with Chinese? I think trigrams are appropriate when indexing Chinese, but then SiteSectionPage... will be split into: Sit, ite, teS, eSe, which makes no sense.

I have a massive table that I need to optimize. I think I need to use indexes, but I was hoping for some more information about them

So I have a large table that I query (select only) quite frequently. The table is around 12,000 rows long. Since the advent of iOS, the time that it is taking to run these select queries has gone up 4-5x.
I was told that I need to add an index to my table. The query that I am using looks like this:
SELECT * FROM book_content WHERE book_id = ? AND chapter = ? ORDER BY verse ASC
How can I create an index for this table? Is it a command I just run once? What exactly is the index going to do? I didn't learn about these in school so they still seem like some sort of magic to me at this point, so I was hoping to get a little instruction.
Thanks!
You want an index on book_id and chapter. Without an index, a server would do a table scan and essentially load the entire table into memory to do its search. Do a quick search on the CREATE INDEX command for the RDBMS that you are using. You create the index once and every time you do an INSERT or DELETE or UPDATE, the server will update the index automatically. An index can be UNIQUE and it can be on multiple fields (in your case, book_id and chapter). If you make it UNIQUE, the database will not allow you to insert a second row with the same key (in this case, book_id and chapter). On most servers, having one index on two fields is different from having two individual indexes on single fields each.
A Mysql example would be:
CREATE INDEX id_chapter_idx ON book_content (book_id,chapter);
If you want only one record for each book_id, chapter combination, use this command:
CREATE UNIQUE INDEX id_chapter_idx ON book_content (book_id,chapter);
A PRIMARY INDEX is a special index that is UNIQUE and NOT NULL. Each table can only have one primary index. In fact, each table should have one primary index to ensure table integrity, especially during joins.
You don't have to think of indexes as "magic".
An index on an SQL table is much like the index in a printed book - it lets you find what you're looking for without reading the entire book cover-to-cover.
For example, say you have a cookbook, and you're looking for recipes that involve chicken. The index in the back of the book might say something like:
chicken: 30,34,72,84
letting you know that you will find chicken recipes on those 4 pages. It's much faster to find this information in the index than by reading through the whole book, because the index is shorter, and (more importantly) it's in alphabetical order, so you can quickly find the right place in the index.
So, in general you want to create indexes on columns that you will regularly need to query (book_id and chapter, in your example).
When you declare a column as primary key automatically generates an index on that column. In your case for using more often select an index is ideal, because they improve time of selection queries and degrade the time of insertion. So you can create the indexes you think you need without worrying about the performance
Indexes are a very sensitive subject. If you consider using them, you need to be very careful how many you make. The primary key, or id, of each table should have a clustered index. All the rest, it depends on how you plan to use them. I'm very fuzzy in the subject of indexes, and have actually never worked with them, but from a seminar I just watched actually yesterday, you don't want too many indexes - because they can actually slow things down when you don't need to use them.
Let's say you put an index on 5 out of 8 fields on a table. Each index is designated for a particular query somewhere in your software. Well, when 1 query is run, it uses that 1 index, and doesn't need the other 4. So that's unneeded weight on this 1 query. If you need an index, be sure that this is an index which could be useful in many places, not just 1 place.

Is it correct to scan a table in MySQL using "SELECT * .. LiMIT start, count" without an ORDER BY clause?

Suppose Table X has a 100 tuples.
Will the following approach to scanning X generate all the tuples in TABLE X, in MySQL?
for start in [0, 10, 20, ..., 90]:
print results of "select * from X LIMIT start, 10;"
I ask, because I've been using PostgreSQL, which clearly says that this approach need not work, but there seems to be no such info for MySQL. If it won't, is there a way to return results in a fixed ordering without knowing any other info about the table (like what the primary key fields are)?
I need to scan each tuple in a table in an application, and I want a way to do it without using too much memory in the application (so simply doing a "select * from X" is out).
No, that isn't a safe assumption. Without an ORDER BY clause, there is no guaranteeing that your query will return unique results each time. If this table is properly indexed, adding an ORDER BY (for the index) shouldn't be too expensive.
Edit: Non-ORDER BYed results will sometimes be in the order of the clustered index, but I wouldn't put any money on that!
If you are using Innodb or MyISAM table types, a better approach is to use the HANDLER interface. Only MySQL supports this, but it does what you want:
http://dev.mysql.com/doc/refman/5.0/en/handler.html
Also, the MySQL API supports two modes of retrieving data from the server:
store result: in this mode, as soon as a query is executed, the API retrieves the entire result set before returning to the user code. This can use up a lot of client memory buffering results, but minimises the use of resources on the server.
use result: in this mode, the API pulls results row-by-row and returns control to the user code more frequently. This minimises the use of memory on the client, but can hold locks on the server for longer.
Most of the MySQL APIs for various languages support this in oneform or another. It is usually an argument that can be supplied as when creating the connection, and / or a separate call that can be used against an existing connection to switch it to that mode.
So, in answer to your question - I would do the following:
set the connection to "use result" mode;
select * from X