I am a newbie to sphinxsearch. But I learned to configure the conf and delta conf.
My doubt: I recently I came through a conf which has the same query in mainconf as well as deltaconf.
sample
main_conf:
{
#...
sql_query = select a,b,c from documents group by c
#...
}
delta_conf : main_conf
{
#...
sql_query_killlist = select a from documents where date_updated > '2015-08-01 23:30:00'
sql_query = select a,b,c from documents group by c
#...
}
I need the understand how this works exactly. Will there be a duplication of data in the delta conf which was there in the main conf. How this works on indexing.
Thanks in advance.
Please be kind enough to answer this even if the question is basic.
The answer I got from sphinxsearch forum. And I think it's clear
Will there be a duplication of data
Yes. As it stands.
You need to modify the delta sql_query to only include the new/updated records - the ones
NOT in the main index.
(the kill list is there to 'remove' the ones that need to be duplicated (because they
updated) from the main. If a record needs to be in both indexes, use killlist to return
it back to one)
Related
I have a large index definition that takes too long to index. I suspect the main problem is caused by the many LEFT OUTER JOINs generated.
I saw this question, but can't find documentation about using source: :query, which seems to be part of the solution.
My index definition and the resulting query can be found here: https://gist.github.com/jonsgold/fdd7660bf8bc98897612
How can I optimize the generated query to run faster during indexing?
The 'standard' sphinx solution to this would be to use ranged queries.
http://sphinxsearch.com/docs/current.html#ex-ranged-queries
... splitting up the query into lots of small parts, so the database server has a better chance of being able to run the query (rather than one huge query)
But I have no idea how to actully enable that in Thinking Sphinx. Can't see anything in the documentation. Could help you edit the sphinx.conf, but also not sure how TS will cope with you manually editing the config file.
This is the solution that worked best (from the linked question). Basically, you can remove a piece of the main query sql_query and define it separately as a sql_joined_field in the sphinx.conf file.
It's important to add all relevant sql conditions to each sql_joined_field (such as sharding indexes by modulo on the ID). Here's the new definition:
ThinkingSphinx::Index.define(
:incident,
with: :active_record,
delta?: false,
delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
...
has
...
end
ThinkingSphinx::Index.define(
:incident,
with: :active_record,
delta?: true,
delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND incidents.delta = 1 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
...
has
...
end
The magic that defines the field site as a separate query is the option source: :query at the end of the line.
Notice the core index definition has the parameter delta?: false, while the delta index definition has the parameter delta?: true. That's so I could use the condition WHERE incidents.delta = 1 in the delta index and filter out irrelevant records.
I found sharding didn't perform any better, so I reverted to one unified index.
See the whole index definition here: https://gist.github.com/jonsgold/05e2aea640320ee9d8b2.
Important to remember!
The Sphinx document ID offset must be handled manually. That is, whenever an index for another model is added or removed, my calculated document ID will change. This must be updated.
So, in my example, if I added an index for a different model (not :incident), I would have to run rake ts:configure to find out my new offset and change incidents.id * 51 + 7 accordingly.
I have a query like this, which we use to generate data for our custom dashboard (A Rails app) -
SELECT AVG(wait_time) FROM (
SELECT TIMESTAMPDIFF(MINUTE,a.finished_time,b.start_time) wait_time
FROM (
SELECT max(start_time + INTERVAL avg_time_spent SECOND) finished_time, branch
FROM mytable
WHERE name IN ('test_name')
AND status = 'SUCCESS'
GROUP by branch) a
INNER JOIN
(
SELECT MIN(start_time) start_time, branch
FROM mytable
WHERE name IN ('test_name_specific')
GROUP by branch) b
ON a.branch = b.branch
HAVING avg_time_spent between 0 and 1000)t
GROUP BY week
Now I am trying to port this to tableau, and I am not being able to find a way to represent this data in tableau. I am stuck at how to represent the inner group by in a calculated field. I can also try to just use a custom sql data source, but I am already using another data source.
columns in mytable -
start_time
avg_time_spent
name
branch
status
I think this could be achieved new Level Of Details formulas, but unfortunately I am stuck at version 8.3
Save custom SQL for rare cases. This doesn't look like a rare case. Let Tableau generate the SQL for you.
If you simply connect to your table, then you can usually write calculated fields to get the information you want. I'm not exactly sure why you have test_name in one part of your query but test_name_specific in another, so ignoring that, here is a simplified example to a similar query.
If you define a calculated field called worst_case_test_time
datediff(min(start_time), dateadd('second', max(start_time), avg_time_spent)), which seems close to what your original query says.
It would help if you explained what exactly you are trying to compute. It appears to be some sort of worst case bound for avg test time. There may be an even simpler formula, but its hard to know without a little context.
You could filter on status = "Success" and avg_time_spent < 1000, and place branch and WEEK(start_time) on say the row and column shelves.
P.S. Your query seems a little off. Don't you need an aggregation function like MAX or AVG after the HAVING keyword?
I am trying to figure out what would be the best way to go ahead and locate duplicates in a 5 column csv data. The real data has more than million rows in it.
Following is the content of mentioned 6 columns.
Name, address, city, post-code, phone number, machine number
Data does not have fixed length, data might in certain columns might be missing in certain instances.
I am thinking of using perl to first normalize all the short forms used in names, city and address. Fellow perl enthusiasts from stackoverflow have helped me a lot.
But there would still be a lot of data which would be difficult to match.
So I am wondering is it possible to match content based on "LIKELINESS / SIMILARITY" (eg. google similar to gugl) the likeliness would be required to overcome errors that creeped in while collecting data.
I have 2 tasks in hand w.r.t. the data.
Flag duplicate rows with certain identifier
Mention the percentage match between similar rows.
I would really appreciate if I could get suggestions as to what all possible methods could be employed and which would propbably be best because of their certain merits.
You could write a Perl program to do this, but it will be easier and faster to put it into a SQL database and use that.
Most SQL databases have a way to import CSV. For this answer, I suggest PostgreSQL because it has very powerful string functions which you will need to find your fuzzy duplicates. Create your table with an auto incremented ID column if your CSV data doesn't already have unique IDs.
Once the import is done, add indexes on the columns you want to check for duplicates.
CREATE INDEX name ON whatever (name);
You can do a self-join to look for duplicates in whatever way you like. Here's an example that finds duplicate names.
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name = t2.name
PostgreSQL has powerful string functions including regexes to do the comparisons.
Indexes will have a hard time working on things like lower(t1.name). Depending on the sorts of duplicates you want to work with, you can add indexes for these transforms (this is a feature of PostgreSQL). For example, if you wanted to search case insensitively you can add an index on the lower-case name. (Thanks #asjo for pointing that out)
CREATE INDEX ON whatever ((lower(name)));
// This will be muuuuuch faster
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE lower(t1.name) = lower(t2.name)
A "likeness" match can be achieved in several ways, a simple one would be to use the fuzzystrmatch functions like metaphone(). Same trick as before, add a column with the transformed row and index it.
Other simple things like data normalization are better done on the data itself before adding indexes and looking for duplicates. For example, trim out and squish extra whitespace.
UPDATE whatever SET name = trim(both from name);
UPDATE whatever SET name = regexp_replace(name, '[[:space:]]+', ' ');
Finally, you can use the Postgres Trigram module to add fuzzy indexing to your table (thanks again to #asjo).
there are a few topics about this already with accepted answers but I couldn't figure out a solution based on those:
Eg:
Ruby on Rails: must appear in the GROUP BY clause or be used in an aggregate function
GroupingError: ERROR: column must appear in the GROUP BY clause or be used in an aggregate function
PGError: ERROR: column "p.name" must appear in the GROUP BY clause or be used in an aggregate function
My query is:
Idea.unscoped.joins('inner join likes on ideas.id = likes.likeable_id').
select('likes.id, COUNT(*) AS like_count, ideas.id, ideas.title, ideas.intro, likeable_id').
group('likeable_id').
order('like_count DESC')
This is fine in development with sqlite but breaks on heroku with PostgreSQL.
The error is:
PG::GroupingError: ERROR: column "likes.id" must appear in the GROUP BY clause or be used in an aggregate function
If I put likes.id in my group by then the results make no sense. Tried to put group before select but doesn't help. I even tried to take the query into two parts. No joy. :(
Any suggestions appreciated. TIA!
I don't know why you want to select likes.id in the first place. I see that you basically want the like_count for each Idea; I don't see the point in selecting likes.id. Also, when you already have the ideas.id, I don't see why you would want to get the value of likes.likeable_id since they'll both be equal. :/
Anyway, the problem is since you're grouping by likeable_id (basically ideas.id), you can't "select" likes.id since they would be "lost" by the grouping.
I suppose SQLite is lax about this. I imagine it wouldn't group things properly.
ANYWAY(2) =>
Let me propose a cleaner solution.
# model
class Idea < ActiveRecord::Base
# to save you the effort of specifying the join-conditions
has_many :likes, foreign_key: :likeable_id
end
# in your code elsewhere
ideas = \
Idea.
joins(:likes).
group("ideas.id").
select("COUNT(likes.id) AS like_count, ideas.id, ideas.title, ideas.intro").
order("like_count DESC")
If you still want to get the IDs of likes for each item, then after the above, here's what you could do:
grouped_like_ids = \
Like.
select(:id, :likeable_id).
each_with_object({}) do |like, hash|
(hash[like.likeable_id] ||= []) << like.id
end
ideas.each do |idea|
# selected previously:
idea.like_count
idea.id
idea.title
idea.intro
# from the hash
like_ids = grouped_like_ids[idea.id] || []
end
Other readers: I'd be very interested in a "clean" one-query non-sub-query solution. Let me know in the comments if you leave a response. Thanks.
I have a UDF in my database which basically tries to get a station (e.g. bus/train) based on some input data (geographic/name/type). Inside this function i try to check if there are any rows matching the given values:
SELECT
COUNT(s.id)
INTO
firsttry
FROM
geographic.stations AS s
WHERE
ST_DWithin(s.the_geom,plocation,0.0017)
AND
s.name <-> pname < 0.8
AND
s.type ~ stype;
The firsttry variable now contains the value 1. If i use the following (slightly extended) SELECT statement i get no results:
RETURN query SELECT
s.id, s.name, s.type, s.the_geom,
similarity(
regexp_replace(s.name::text,'(Hauptbahnhof|Hbf)','Hbf'),
regexp_replace(pname::text,'(Hauptbahnhof|Hbf)','Hbf')
)::double precision AS sml,
st_distance(s.the_geom,plocation) As dist from geographic.stations AS s
WHERE ST_DWithin(s.the_geom,plocation,0.0017) and s.name <-> pname < 0.8
AND s.type ~ stype
ORDER BY dist asc,sml desc LIMIT 1;
the parameters are as follows:
stype = '^railway'
pname = 'Amsterdam Science Park'
plocation = ST_GeomFromEWKT('SRID=4326;POINT(4.9492530 52.3531670)')
the tuple i need to be returned is:
id name type geom (displayed as ST_AsText)
909658;"Amsterdam Sciencepark";"railway_station";"POINT(4.9482893 52.352904)"
The same UDF returns quite well for a lot of other stations, but this is one (of more) which just won't work. Any suggestions?
P.S. The use of the <-> operator is coming from the pg_trgm module.
Some ideas on how to troubleshoot this:
Break your troubleshooting into steps. Start with the simplest query possible. No aggregates, just joins and no filters. Then add filters. Then add order by, then add aggregates. Look at exactly where the change occurs.
Try reindexing the database.
One possibility that occurs to me based on this is that it could be a corrupted index used in the second query but not the first. I have seen corrupted indexes in the past and usually they throw errors but at least in theory they should be able to create a problem like this.
If this is correct, your query will suddenly return rows if you remove the ORDER BY clause.
If you have a corrupted index, then you need to pay close attention to hardware. Is the RAM ECC? Is the processor overheating? How are you disks doing?
A second possibility is that there is a typo on a join condition of filter statement. Normally this is something I would suspect first but it is easy enough to weed out index problems to start there. If removing the ORDER BY doesn't change things, then chances are it is a typo. If you can't find a typo, then try reindexing.