I'm having a problem indexing a large amount of content data and am searching for a suitable solution.
The logic is as follows:
A robot uploads content to the database every day.
The Sphinx index must index only the new (daily) data, i.e. previously indexed content never changes.
Sphinx delta indexing looks like the exact solution for this, but with too much content the following error is raised: too many string attributes (current index format allows up to 4 GB).
Distributed indexing seems usable, but how can the indexed data be added and split dynamically (without dirty hacks)?
For example: on day 1 there are 10,000 rows in total, on day 2 there are 20,000 rows, and so on. The index throws the >4 GB error at about 60,000 rows.
The expected index flow: on days 1-5 there is one index (distributed or not), on days 6-10 one distributed (composite) index (50,000 + 50,000 rows), and so on.
The question is: how can a distributed index be filled dynamically?
Daily iteration sample:
main index
chunk1 - 50000 rows
chunk2 - 50000 rows
chunk3 - 35000 rows
delta index
10000 new rows
rotate "delta"
merge "delta" into "main"
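For reference, the rotate and merge steps above correspond to the stock indexer commands; a minimal sketch of driving them (assuming indexes named main and delta as above and the indexer binary on the PATH):

import subprocess

# Rebuild the delta index from the day's new rows and hot-swap it into searchd.
subprocess.run(["indexer", "delta", "--rotate"], check=True)

# Fold the freshly built delta into main; --rotate swaps the merged main in.
subprocess.run(["indexer", "--merge", "main", "delta", "--rotate"], check=True)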
Please advise.
Thanks to #barryhunter
RT indexes are the solution here.
A good manual is here: https://www.sphinxconnector.net/Tutorial/IntroductionToRealTimeIndexes
I've tested match queries on 3,000,000,000 characters; the speed is close to the same as for the "plain" index type. Total index size on HDD is about 2 GB.
Populating the Sphinx RT index:
CPU usage: ~50% of 1 core (of 8 cores)
RAM usage: ~0.5% of 32 GB
Speed: as quick as a usual select/insert (mostly depends on whether you use batch inserts or row-by-row inserts)
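A minimal population sketch, using pymysql as the client since searchd speaks the MySQL protocol. The SphinxQL port 9306, the RT index rt_content, the source table documents, and the counter table sph_counter are all assumptions for illustration (see the NOTE below on why the counter lives in MySQL):

import pymysql

# Regular MySQL database holding the source rows and the max-id counter.
mysql = pymysql.connect(host="localhost", user="user", password="pass", db="content")
# searchd speaking the SphinxQL (MySQL) protocol; 9306 is the usual port.
sphinx = pymysql.connect(host="127.0.0.1", port=9306, autocommit=True)

with mysql.cursor() as src:
    # Counter kept in MySQL, as in the delta-index scheme (see NOTE below).
    src.execute("SELECT max_indexed_id FROM sph_counter WHERE index_name = 'rt_content'")
    last_id = src.fetchone()[0]
    src.execute("SELECT id, title, body FROM documents WHERE id > %s", (last_id,))
    rows = src.fetchall()

with sphinx.cursor() as dst:
    # Batch inserts are noticeably faster than row-by-row.
    batch = 1000
    for i in range(0, len(rows), batch):
        dst.executemany(
            "INSERT INTO rt_content (id, title, body) VALUES (%s, %s, %s)",
            rows[i:i + batch],
        )

if rows:
    with mysql.cursor() as src:
        src.execute(
            "UPDATE sph_counter SET max_indexed_id = %s WHERE index_name = 'rt_content'",
            (max(r[0] for r in rows),),
        )
    mysql.commit()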
NOTE:
"SELECT MAX(id) FROM sphinx_index_name" will produce the error "fullscan requires extern docinfo", and setting docinfo = extern will not solve this, so simply keep the counter in a MySQL table (as with the Sphinx delta-index scheme: http://sphinxsearch.com/docs/current.html#delta-updates).
Related
I have a list of search terms I'm reading into Pandas:
import numpy as np
terms_list = np.genfromtxt('terms_list.csv', delimiter=',', dtype=str)
I need to pull what will be about 2.5 million rows from three joined tables in my PostgreSQL database and loop through them in Pandas to check the text in several columns of each row for a match against my terms_list. In the past I have used a chunking function to query and text-search 500 rows at a time; however, for 12,000 rows this took six minutes, so that strategy won't scale to 2.5 million rows.
I'm not very familiar with big data, so I'm wondering whether there are any good strategies for iterating through large volumes of data on a low-performance Postgres database.
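One strategy, sketched below under some assumptions (SQLAlchemy + psycopg2, and a hypothetical join query, connection string, and column names): stream the joined result server-side in large chunks, and match with a single regex alternation via the vectorized str.contains instead of looping over rows in Python.

import re
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

terms_list = np.genfromtxt('terms_list.csv', delimiter=',', dtype=str)

# One regex alternation over all terms; str.contains is vectorized,
# which avoids a Python-level loop over rows.
pattern = '|'.join(re.escape(t) for t in terms_list)

# Hypothetical connection string and join query.
engine = create_engine('postgresql+psycopg2://user:pass@host/dbname')
query = """
    SELECT a.id, a.text_col1, b.text_col2, c.text_col3
    FROM table_a a
    JOIN table_b b ON b.a_id = a.id
    JOIN table_c c ON c.a_id = a.id
"""

matches = []
text_cols = ['text_col1', 'text_col2', 'text_col3']

# stream_results=True asks for a server-side cursor, so the whole
# 2.5M-row result never has to sit in memory at once.
with engine.connect().execution_options(stream_results=True) as conn:
    for chunk in pd.read_sql(query, conn, chunksize=50_000):
        hit = np.zeros(len(chunk), dtype=bool)
        for col in text_cols:
            hit |= chunk[col].str.contains(pattern, case=False, na=False).to_numpy()
        matches.append(chunk[hit])

result = pd.concat(matches, ignore_index=True) if matches else pd.DataFrame()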
I have a simple question.
If we do a db.collection.find({_id:ObjectId("an id")}) on a collection of 1 million documents, does it take the same time as on 1 billion documents?
If possible, please explain why it does or does not take the same time, given that _id is an indexed field.
MongoDB uses B-trees for indexes, which have a time complexity of O(log n) for search, so the lookup cost grows with the depth of the tree rather than with the number of documents.
log10(1,000,000) = 6
log10(1,000,000,000) = 9
So a search over 1 billion docs will take roughly 9/6 = 1.5 times as long, i.e. about 50% longer, than a search over 1 million docs.
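A quick sanity check on that arithmetic (the base of the logarithm doesn't matter, since it scales both depths by the same constant):

import math

# B-tree lookup cost grows with tree depth, i.e. with log(n).
depth_1m = math.log10(1_000_000)       # 6.0
depth_1b = math.log10(1_000_000_000)   # 9.0

# Relative cost of the lookup on the larger collection.
print(depth_1b / depth_1m)             # 1.5, i.e. roughly 50% longer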
I have a table, say T, in kdb which has over 6 billion rows. When I try to execute a query like this
select from T where i < 10
it throws a wsfull exception. Is there any way I can execute queries like this on a table containing such a large amount of data?
10#T
The expression as you wrote it first builds a boolean vector flagging every position where i (the virtual row number) is less than 10, and that vector is as long as one of your columns. It then applies where to it (which yields just til 10) and finally pulls those positions from each column. You can jump straight to that final indexing step with:
T[til 10]
but 10#T is shorter.
Assuming you have a partitioned table here, it is normally beneficial to have the partitioning column (date, int etc.) as the first item in the where clause of your query - otherwise as mentioned previously you are reading a six billion item list into memory, which will result in a 'wsfull signal for any machine with less than the requisite amount of RAM.
Bear in mind that row index starts at 0 for each partition, and is not reflective of position in the overall table. The query that you gave as an example in your question would return the first ten rows of each partition of table T in your database.
In order to do this without reaching your memory limit, you can try running the following (if your database is date-partitioned):
raze{10#select from T where date=x}each date
I tested two scenarios, a single huge collection vs. multiple small collections, and found a huge difference in query performance. Here is what I did.
Case 1: I created a product collection containing 10 million records for 10 different product types, exactly 1 million records per product type, and I created an index on ProductType. When I ran a sample query with the condition ProductType=1 and ProductPrice>100 and limit(10), to return 10 records of ProductType=1 whose price is greater than 100, it took about 35 milliseconds when the collection had a lot of products priced above 100, and the same query took about 8000 milliseconds (8 seconds) when there were very few products in ProductType=1 priced above 100.
Case 2: I created 10 different product collections, one per ProductType, each containing 1 million records. In collection 1, which contains the records for ProductType 1, I ran the same sample query with the condition ProductPrice>100 and limit(10), to return 10 records of products whose price is greater than 100. It took about 2.5 milliseconds when the collection had a lot of products priced above 100, and the same query took about 1500 milliseconds (1.5 seconds) when there were very few products priced above 100.
So why is there such a big difference? The only difference between case one and case two is one huge collection vs. multiple smaller collections, but I created the index on ProductType in the first case (the single huge collection). I guess the performance difference is caused by that index, and I need the index in the first case, otherwise performance would be even worse. I expected the first case to be somewhat slower because of the index, but I didn't expect a difference of roughly 10 times.
So: 8000 milliseconds vs. 1500 milliseconds, one huge collection vs. multiple small collections. Why?
Separating the collections gives you a free index without any real overhead. There is overhead for an index scan, especially if the index is not really helping you cut down on the number of results it has to scan (if you have a million results in the index, but you have to scan them all and inspect them, it's not going to help you much).
In short, separating them out is a valid optimization, but you should make your indexes better for your queries before you actually decide to take that route, which I consider a drastic measure (an index on product price might help you more in this case).
Using explain() can help you understand how queries work. Some basics are: You want a low nscanned to n ratio, ideally. You don't want scanAndOrder = true, and you don't want BasicCursor, usually (this means you're not using an index at all).
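To make that concrete, here is a minimal sketch (assuming pymongo, a database and collection named shop/products, and the field names from the question; all of these are illustrative): a compound index on ProductType plus ProductPrice lets the price filter be answered inside the index instead of by inspecting every ProductType=1 document, and explain() shows how many documents were actually examined.

from pymongo import MongoClient, ASCENDING

client = MongoClient()          # assumes a local mongod
db = client["shop"]             # hypothetical database name
products = db["products"]       # hypothetical collection name

# Compound index: equality on ProductType first, then range on ProductPrice,
# so the price predicate is resolved within the index rather than by
# scanning every ProductType=1 document.
products.create_index([("ProductType", ASCENDING), ("ProductPrice", ASCENDING)])

cursor = (products.find({"ProductType": 1, "ProductPrice": {"$gt": 100}})
                  .limit(10))

# Inspect the query plan; compare the documents-examined figure
# before and after adding the compound index.
print(cursor.explain())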
Is it possible to predict the amount of disk space/memory that will be used by a basic index in PostgreSQL 9.0?
E.g. if I have a default B-tree index on an integer column in a table of 1 million rows, how much space would be taken up by the index? Is the entire index held in memory at all times?
Not really a definitive answer, but I looked at a 9.0 test system I have with a couple of int indexes on a table of 280k rows. The indexes all report a size of 6232 kB, so roughly 22 bytes per row.
There is no way to say in advance. It depends on the kind of operations you will perform, because PostgreSQL stores many different versions of the same row, including row versions stored in index files.
Just create the table you are interested in, populate it, and check.
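For example, a throwaway check along those lines (a sketch assuming psycopg2 and disposable object names; pg_relation_size and pg_size_pretty are standard functions):

import psycopg2

# Hypothetical connection parameters.
conn = psycopg2.connect("dbname=test user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Build a 1-million-row table with a plain B-tree index on an integer column.
cur.execute("CREATE TABLE idx_size_test AS "
            "SELECT g AS n FROM generate_series(1, 1000000) AS g;")
cur.execute("CREATE INDEX idx_size_test_n ON idx_size_test (n);")

# pg_relation_size reports the on-disk size of the index itself.
cur.execute("SELECT pg_size_pretty(pg_relation_size('idx_size_test_n'));")
print(cur.fetchone()[0])   # roughly consistent with the ~22 bytes/row figure above

cur.close()
conn.close()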