Is there a way to insert into a distributed index? - sphinx

i.e.:
index main {
    type = distributed
    local = rt
    agent = 10.0.0.2:3312:rt
    agent = 10.0.0.3:3312:rt
    agent = 10.0.0.4:3312:rt
    agent_connect_timeout = 200
    agent_query_timeout = 1000
}
Is there a way to insert into a distributed index?

Not right now.
The main issue is that Sphinx has no concept of which agent(s) an insert should be propagated to. You probably have a custom sharding scheme in mind, which Sphinx doesn't know about.
For now the application has to know how to connect to the appropriate agent(s) directly and run the insert against the real index.
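For example, assuming each agent also exposes a SphinxQL listener (the MySQL-protocol port, commonly 9306 - an assumption here, since the config above only shows the API port 3312), the application can connect to the chosen agent and insert straight into its local rt index. A minimal sketch, with illustrative column names that must match the rt index's rt_field/rt_attr definitions:
-- connect to the chosen agent's SphinxQL port first, e.g. mysql -h 10.0.0.2 -P 9306
-- the id is mandatory for RT indexes; title and content are illustrative columns
INSERT INTO rt (id, title, content) VALUES (123, 'hello title', 'hello body');
Which agent receives the insert is entirely up to your own sharding logic (for example, id modulo the number of agents).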

Related

SQLAlchemy - Bulk insert with multiple cascade & back_populates relationships

I have been trying to optimize our insertions into the database, which is currently the bottleneck slowing down our pipeline. I decided to start by speeding up the data_generator we use for testing; all the tables are empty at first and are then populated and used in various tests. I thought it would be an easy place to start.
Currently, we do pretty much all insertions with Session.add(entry), or in some cases with bulked entries via add_all(entries), which does not improve the speed that much.
The goal was to do more insertions at once and spend less time communicating back and forth with the database, so I tried various bulk insert methods (bulk_save_objects, bulk_insert_mappings, and ORM/Core approaches with INSERT INTO, COPY, IMPORT ...), but I couldn't get anything to work properly: foreign key constraints, duplicated keys, or tables not getting populated.
I will show an example of a table that would previously have been added with add_all() in a run_transaction.
class News(NewsBase):
    __tablename__ = 'news'

    news_id = Column(UUID(as_uuid=True), primary_key=True, nullable=False)
    url_visit_count = Column('url_visit_count', Integer, default=0)

    # One to many
    sab_news = relationship("sab_news", back_populates="news")
    sent_news = relationship("SenNews", back_populates="news")
    scope_news = relationship("ScopeNews", back_populates="news")
    news_content = relationship("NewsContent", back_populates="news")

    # One to one
    other_news = relationship("other_news", uselist=False, back_populates="news")

    # Many to many
    companies = relationship('CompanyNews', back_populates='news', cascade="all, delete")
    aggregating_news_sources = relationship("AggregatingNewsSource", secondary=NewsAggregatingNewsSource,
                                            back_populates="news")

    def __init__(self, title, language, news_url, publish_time):
        self.news_id = uuid4()
        super().__init__(title, language, news_url, publish_time)
We have many tables built like this, some with more relations, and my conclusion now is that having many different relationships that back_populate and update each other does not allow for fast bulk insertions. Am I wrong?
One of my current solutions, which decreased our execution time from 120s to 15s for a regular data_generator run for testing, looks like this:
def write_news_to_db(news, news_types, news_sources, company_news):
    write_bulk_in_chunks(news_types)
    write_bulk_in_chunks(news_sources)

    def write_news(session):
        enable_batch_inserting(session)
        session.add_all(news)

    def write_company_news(session):
        session.add_all(company_news)

    engine = create_engine(
        get_connection_string("name"),
        echo=False,
        executemany_mode="values")

    run_transaction(create_session(engine=engine), lambda s: write_news(s))
    run_transaction(create_session(), lambda s: write_company_news(s))
I used the library sqlalchemy_batch_inserts from GitHub, together with Psycopg2 Fast Execution Helpers, setting executemany_mode="values".
I did this by creating a new engine just for these insertions - it did work, but this in itself seems like bad practice. It works with the same database.
Anyway, this does seem to work, but it is still not the execution speed I want - especially when we are initially working with empty tables.
Ideally, I wouldn't want to rely on this hacky solution, and I'd like to avoid bulk insertions since SQLAlchemy does not recommend using them - precisely to avoid the problems I have faced.
But how does one construct queries to properly do bulk insertions for complex tables like these - should we redesign our tables, or is it possible?
Using multi-row insertions within the run_transaction with the ORM or Core would be ideal, but I haven't been able to do it.
Any recommendations or help would be much appreciated!
TL;DR: Bulk insertion with multiple relationships, back_populates, cascade. How is it supposed to be done?
CockroachDB supports bulk insertions using multi-row insert for existing tables as well as IMPORT statements for new tables - https://www.cockroachlabs.com/docs/stable/insert.html. Have you considered using these options directly?
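For what it's worth, a minimal sketch of such a multi-row INSERT, using the News table from the question (the column list and values are illustrative and assume CockroachDB syntax):
-- insert parent rows first, many per statement
INSERT INTO news (news_id, title, language, news_url, publish_time)
VALUES
    (gen_random_uuid(), 'title 1', 'en', 'https://example.com/1', now()),
    (gen_random_uuid(), 'title 2', 'en', 'https://example.com/2', now()),
    (gen_random_uuid(), 'title 3', 'en', 'https://example.com/3', now());
Rows that reference these (e.g. company_news) would then go into a second multi-row statement, so every foreign key target already exists when it is needed.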

Thinking Sphinx indexing performance

I have a large index definition that takes too long to index. I suspect the main problem is caused by the many LEFT OUTER JOINs generated.
I saw this question, but can't find documentation about using source: :query, which seems to be part of the solution.
My index definition and the resulting query can be found here: https://gist.github.com/jonsgold/fdd7660bf8bc98897612
How can I optimize the generated query to run faster during indexing?
The 'standard' Sphinx solution to this would be to use ranged queries:
http://sphinxsearch.com/docs/current.html#ex-ranged-queries
... splitting the query up into lots of small parts, so the database server has a better chance of being able to run each one (rather than one huge query).
But I have no idea how to actually enable that in Thinking Sphinx; I can't see anything in the documentation. I could help you edit the sphinx.conf directly, but I'm also not sure how TS would cope with you manually editing the config file.
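For reference, this is roughly what a ranged source looks like in a hand-written sphinx.conf (a sketch only; the table, columns, connection details and step size are illustrative, and Thinking Sphinx normally generates these directives itself):
source incidents_core
{
    type            = pgsql
    sql_host        = localhost
    sql_user        = user
    sql_pass        = pass
    sql_db          = app_production

    # fetch the id bounds once, then run sql_query in chunks of sql_range_step rows
    sql_query_range = SELECT MIN(id), MAX(id) FROM incidents
    sql_range_step  = 10000
    sql_query       = SELECT id, description FROM incidents WHERE id BETWEEN $start AND $end
}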
This is the solution that worked best (from the linked question). Basically, you can remove a piece of the main query sql_query and define it separately as a sql_joined_field in the sphinx.conf file.
It's important to add all relevant sql conditions to each sql_joined_field (such as sharding indexes by modulo on the ID). Here's the new definition:
ThinkingSphinx::Index.define(
  :incident,
  with: :active_record,
  delta?: false,
  delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
  indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
  ...
  has
  ...
end

ThinkingSphinx::Index.define(
  :incident,
  with: :active_record,
  delta?: true,
  delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
  indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND incidents.delta = 1 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
  ...
  has
  ...
end
The magic that defines the field site as a separate query is the option source: :query at the end of the line.
Notice the core index definition has the parameter delta?: false, while the delta index definition has the parameter delta?: true. That's so I could use the condition WHERE incidents.delta = 1 in the delta index and filter out irrelevant records.
I found sharding didn't perform any better, so I reverted to one unified index.
See the whole index definition here: https://gist.github.com/jonsgold/05e2aea640320ee9d8b2.
Important to remember!
The Sphinx document ID offset must be handled manually. That is, whenever an index for another model is added or removed, my calculated document ID will change. This must be updated.
So, in my example, if I added an index for a different model (not :incident), I would have to run rake ts:configure to find out my new offset and change incidents.id * 51 + 7 accordingly.

Can't update specific columns? Only large string columns?

I'm trying to run some simple update statements on Cache 2008, logging into the web portal. I'm able to run queries like:
update testspace.clients
set requires_attention = 'Yes'
, notes = 'testsdfsd'
where id = '1||CPL62549.001'
The web portal runs the statement and it looks like it updated things, but when I do a select statement, requires_attention is updated while notes isn't.
Both fields are of type string. The only difference is that notes has MAXLEN = 32700.
I've tested this on other columns in other tables: any string column with MAXLEN = 32700 won't let me update it. That seems odd. Perhaps it's a coincidence and something else is going on, but it seems strange that I can update some fields of a record and not others.
Any ideas?
I'm new to Cache, but I have experience with SQL Server, Oracle, MySQL, etc.
Strings in Cache are limited to 32000 characters. Setting the MAXLEN to a number greater than that is going to cause problems.
Set the MAXLEN to 32000 and it should be fine.

Postgres: updating not-changed rows

Say I have the following query:
UPDATE table_name
SET column_name1 = column_value1, ..., column_nameN = column_valueN
WHERE id = M
The thing is that column_value1, ..., column_valueN have not changed. Will this query really be executed, and how does its performance compare to an update with genuinely changed data? What if I have about 50 such queries per page with unchanged data?
You need to help PostgreSQL here by specifying only the changed columns and rows. It will go ahead and perform the update on whatever you specify, without checking whether the data has actually changed.
P.S. This is where an ORM comes in handy.
EDIT: You may also be interested in How can I speed up update/replace operations in PostgreSQL?, where the OP went through a lot of trouble to speed up UPDATE performance, when the best performance can be achieved by updating changed data only.
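If the statement itself can't easily be skipped in the application, one hedged workaround (plain PostgreSQL, reusing the placeholder names from the question) is to guard the UPDATE so the row is only rewritten when something actually differs:
UPDATE table_name
SET    column_name1 = column_value1, ..., column_nameN = column_valueN
WHERE  id = M
AND    (column_name1, ..., column_nameN)
       IS DISTINCT FROM (column_value1, ..., column_valueN);
With the extra predicate, PostgreSQL still matches the row but performs no update (and writes no new row version) when the values are already identical.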

Why is one query consistently ~25ms faster than another in postgres?

A friend wrote a query with the following condition:
AND ( SELECT count(1) FROM users_alerts_status uas
WHERE uas.alert_id = context_alert.alert_id
AND uas.user_id = 18309
AND uas.status = 'read' ) = 0
Seeing this, I suggested we change it to:
AND NOT EXISTS ( SELECT 1 FROM users_alerts_status uas
WHERE uas.alert_id = context_alert.alert_id
AND uas.user_id = 18309
AND uas.status = 'read' )
But in testing, the first version of the query is consistently between 20 and 30ms faster (we tested after restarting the server). Conceptually, what am I missing?
My guess would be that the first one can short circuit; as soon as it sees any rows that match the criteria, it can return the count of 1. The second one needs to check every row (and it returns a row of "1" for every result), so doesn't get the speed benefit of short circuiting.
That being said, doing an EXPLAIN (or whatever your database supports) might give a better insight than my guess.
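For example (the outer query is not shown in the question, so context_alerts here is a hypothetical table behind the context_alert alias; substitute your real query):
EXPLAIN ANALYZE
SELECT ca.alert_id
FROM   context_alerts ca
WHERE  NOT EXISTS (
           SELECT 1
           FROM   users_alerts_status uas
           WHERE  uas.alert_id = ca.alert_id
           AND    uas.user_id  = 18309
           AND    uas.status   = 'read'
       );
Run the same EXPLAIN ANALYZE over the count(1) = 0 version and compare the plans and actual timings; the node types chosen (for example a per-row subplan versus an anti join) usually explain the difference.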
Conceptually, I'd say that your option is at least as good as the other, and a little more elegant. I'm not sure whether it should be slower or faster, or whether those 25ms are relevant.
The definitive answer, usually, comes from looking at the EXPLAIN output.
What PostgreSQL version is that? PG 8.4 is said to have some optimizations regarding NOT EXISTS.