Using xmlpipe2 with Sphinx

I'm attempting to load large amounts of data directly into Sphinx from Mongo, and currently the best method I've found is xmlpipe2.
I'm wondering, however, if there is a way to apply only updates to the dataset, as a full reindex of hundreds of thousands of records can take a while and be fairly intensive on the system.
Is there a better way to do this?
Thank you!

Use the main plus delta scheme, where all the updates go to a separate, smaller index, as described here:
http://sphinxsearch.com/docs/current.html#delta-updates
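For reference, here is a minimal sketch of what such a pipeline could look like: a Python script that streams MongoDB documents to stdout in xmlpipe2 format and only emits documents changed since a given timestamp, so the same script can feed both the main and the delta index. The database, collection, and field names (mydb, articles, title, body, updated_at, sphinx_id) are assumptions for illustration, not anything from the question.

    #!/usr/bin/env python
    # Sketch: stream MongoDB documents to Sphinx in xmlpipe2 format.
    # Collection and field names are illustrative assumptions.
    import sys
    from xml.sax.saxutils import escape
    from pymongo import MongoClient

    def dump(since_ts=0):
        coll = MongoClient()["mydb"]["articles"]          # assumed db/collection
        out = sys.stdout
        out.write('<?xml version="1.0" encoding="utf-8"?>\n')
        out.write('<sphinx:docset>\n')
        out.write('<sphinx:schema>\n'
                  '  <sphinx:field name="title"/>\n'
                  '  <sphinx:field name="body"/>\n'
                  '  <sphinx:attr name="updated_at" type="timestamp"/>\n'
                  '</sphinx:schema>\n')
        # For the delta index, pass the timestamp of the last main rebuild
        # so that only new or changed documents are emitted.
        for doc in coll.find({"updated_at": {"$gt": since_ts}}):
            out.write('<sphinx:document id="%d">\n' % doc["sphinx_id"])   # assumed integer id
            out.write('  <title>%s</title>\n' % escape(doc.get("title", "")))
            out.write('  <body>%s</body>\n' % escape(doc.get("body", "")))
            out.write('  <updated_at>%d</updated_at>\n' % doc["updated_at"])
            out.write('</sphinx:document>\n')
        out.write('</sphinx:docset>\n')

    if __name__ == "__main__":
        dump(int(sys.argv[1]) if len(sys.argv) > 1 else 0)

In sphinx.conf, the main index's xmlpipe_command would run the script with no argument (full export), while the delta index's xmlpipe_command passes the timestamp of the last main rebuild; you then reindex only the small delta frequently and rebuild or merge the main index much less often.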

Related

Efficient storage of UNICODED text for processing with Blaze/Pandas

I have about 5 million (and growing) rows of Twitter feed data, and I want to store them efficiently for faster read/write access using Pandas (preferably Blaze). From the huge metadata of a single tweet, I am storing only [username, tweet time, tweet & tweet ID], so it's not much. Also, all the tweets are Unicode encoded. Now what's the best way to store this data? I am currently storing them in a bunch of CSVs, but I don't find that a viable solution as the data grows, and so I plan to move to a DB. I first thought of HDF5, but it still has issues storing Unicode columns (even in Python 3).
Since Blaze has excellent support for databases (and I think it's great for analytics too), what would be a good architectural solution (at production level, if possible) to my problem? As my data is also structured, I don't feel the need for a NoSQL solution, but I'm open to suggestions.
Currently, those 5 million rows occupy only about 1 GB of space, and I don't think the data will ever cross a few tens of GB. So is using Postgres the best idea?
Thanks
Yes, PostgreSQL is a perfectly fine choice for your tens-of-GB application. I've had an easy time using SQLAlchemy with the psycopg2 driver, and the psql command-line tool is fine.
There is an incredible command-line interface to PostgreSQL called pgcli that offers tab-completion for table and column names. I highly recommend it, and this tool alone might be enough to push you toward PostgreSQL.
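If it helps, here is a minimal sketch of that setup with SQLAlchemy on top of psycopg2. The table layout simply mirrors the four fields mentioned in the question, and the connection string is a placeholder.

    # Sketch: store tweets in PostgreSQL via SQLAlchemy + psycopg2.
    # The DSN and table/column names are placeholders.
    from datetime import datetime, timezone
    from sqlalchemy import (create_engine, MetaData, Table, Column,
                            BigInteger, DateTime, Text)

    engine = create_engine("postgresql+psycopg2://user:pass@localhost/tweets")
    metadata = MetaData()

    tweets = Table(
        "tweets", metadata,
        Column("tweet_id", BigInteger, primary_key=True),
        Column("username", Text, nullable=False),
        Column("tweet_time", DateTime(timezone=True), index=True),
        Column("tweet", Text),      # text is Unicode-aware in PostgreSQL
    )
    metadata.create_all(engine)

    with engine.begin() as conn:    # commits on success
        conn.execute(tweets.insert(), [
            {"tweet_id": 1, "username": "alice",
             "tweet_time": datetime(2015, 1, 1, tzinfo=timezone.utc),
             "tweet": "héllo world"},
        ])

Given a UTF-8 database encoding, PostgreSQL's text columns handle Unicode natively, so the issue HDF5 has with Unicode columns simply doesn't arise.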

Log viewing utility database choice

I will be implementing a log viewing utility soon, but I'm stuck on the DB choice. My requirements are as follows:
Store 5 GB data daily
Total size of 5 TB data
Search in this log data in less than 10 sec
I know that PostgreSQL will work if I partition the tables, but will I be able to get the performance described above? As I understand it, NoSQL is a better choice for log storage, since logs are not very structured. I saw an example using Hadoop, HBase, and Lucene that seems promising:
http://blog.mgm-tp.com/2010/03/hadoop-log-management-part1/
But before deciding, I wanted to ask whether anybody has made a choice like this before and could give me an idea. Which DBMS will fit this task best?
My logs are very structured :)
I would say you don't need a database, you need a search engine:
Solr: based on Lucene, and it packages everything you need together
ElasticSearch: another Lucene-based search engine
Sphinx: a nice thing is that you can use multiple sources per search index -- you can enrich your raw logs with other events
Scribe: Facebook's way to search and collect logs
Update for @JustBob:
Most of the mentioned solutions can work with flat files without affecting performance. All of them need an inverted index, which is the hardest part to build and maintain. You can update the index in batch mode or online (see the sketch below). The index can be stored in an RDBMS, NoSQL, or a custom "flat file" storage format (custom meaning maintained by the search engine application).
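To make the batch-mode point concrete, here is a rough sketch of bulk-loading raw log lines into ElasticSearch (one of the engines listed above) using the official Python client's bulk helper. The index name and log path are assumptions.

    # Sketch: batch-index raw log lines into ElasticSearch.
    # Index name and file path are assumptions.
    from datetime import datetime
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch("http://localhost:9200")

    def actions(path):
        with open(path) as fh:
            for line in fh:
                yield {
                    "_index": "logs",
                    "_source": {"message": line.rstrip("\n"),
                                "ingested_at": datetime.utcnow().isoformat()},
                }

    bulk(es, actions("/var/log/app.log"))   # one batch-mode index update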
You can find a lot of information here:
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
See which fits your needs.
Anyway, for such a task NoSQL is the right choice.
You should also consider the learning curve: MongoDB / CouchDB, even though they don't perform as well as Cassandra or Hadoop, are easier to learn.
MongoDB is used by Craigslist to store old archives: http://www.10gen.com/presentations/mongodb-craigslist-one-year-later

Database for Reporting / analytics on 1TB data with a simple model

Big data = 1 TB, increasing by 10% every year.
The model is simple: one table with 25 columns.
No joins with other tables.
I'm looking to do simple query filtering on a subset of the 25 columns.
I'd guess a traditional SQL store with indexes on the filtered columns is what's necessary. Hadoop is overkill and won't make sense, as this is for a real-time service. Mongo? A BI engine like Pentaho?
Any recommendations?
A traditional solution indeed sounds fine, as long as there are no significant changes to the really simple model you've described.
NoSQL does not sound like the best choice for BI / reporting.
Get good hardware. Spend time on performance tests and build all the required indexes. Implement a proper strategy for uploading new data. Implement table-level partitioning in PostgreSQL according to your needs and performance tests (see the sketch after this answer).
P.S. If I had the chance now to switch from Oracle/DB2, I would definitely go for PostgreSQL.
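As an illustration of the partitioning suggestion, here is a minimal sketch using psycopg2 and PostgreSQL's declarative range partitioning (PostgreSQL 10 or newer; older versions would use inheritance-based partitioning instead). The table and column names are made up for the example.

    # Sketch: monthly range partitioning for a single wide reporting table.
    # Table/column names and the DSN are placeholders.
    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS events (
        id         bigint,
        event_date date NOT NULL,
        col01      text
        -- ... the remaining filter columns ...
    ) PARTITION BY RANGE (event_date);

    CREATE TABLE IF NOT EXISTS events_2024_01
        PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

    -- index only the columns you actually filter on
    CREATE INDEX IF NOT EXISTS events_2024_01_col01_idx
        ON events_2024_01 (col01);
    """

    with psycopg2.connect("dbname=reporting") as conn:   # placeholder DSN
        with conn.cursor() as cur:
            cur.execute(DDL)

New data for a given month lands in its own partition, so loading and dropping old data is cheap, and queries that filter on event_date only touch the relevant partitions.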
I'd suggest investigating Infobright here. It's column-based and compressing, so you won't store the full TB, and it has an open-source version, so you can try it out without being called by a bunch of salespeople (but last time I looked, the OSS version was missing some really useful stuff, so you may end up wanting a license). Last time I tried it, it looked to the outside world like MySQL, so it's not hard to integrate. When I last checked it out, it was single-server-oriented and claims to work with up to 50 TB on a single server. I think Infobright can sit behind Pentaho if you decide to move in that direction.
Something Infobright had going for it was that it was pretty close to no-admin: there's no manual indexing or index maintenance.
Sounds like a column store would help; it depends on how you're handling inserts, and whether you ever have to do updates. But as well as Infobright, if you're going commercial, check out Vectorwise: it's quicker and similarly priced.
If you want free/open source, then check out LucidDB. There aren't many docs, but it is very good at what it does!
If you want unbelievable speed, then check out Vectorwise. I believe it's about the same price as Infobright, but much faster.

PostgreSQL and S3QL for storing/accessing lots of data

We're currently using Postgres 9 on Amazon's EC2 and are very satisfied with the performance. Now we're looking at adding ~2TB of data to Postgres, which is larger than our EC2 small instance can hold.
I found S3QL and am considering using it in conjunction with moving the Postgres data directory to S3 storage. Has anyone had experience with doing this? I'm mainly concerned with performance (frequent reads, less frequent writes). Any advice is welcome, thanks.
My advice is "don't do that". I don't know anything about the context of your problem, but I guess that a solution doesn't have to involve doing bulk data processing through PostgreSQL. The whole reason grid processing systems were invented was to solve the problem of analyzing large data sets. I think you should consider building a system that follows standard BI practices around extracting dimensional data. Then take that normalized data and, assuming it's still pretty large, load it into Hadoop/Pig. Do your analysis and aggregation there. Dump the resulting aggregate data into a file and load that into your PG database alongside the dimensions.
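If you do go that route, the final load step can be as simple as a COPY of the aggregate dump into PostgreSQL; here is a rough sketch with psycopg2 (the table name, file path, and DSN are assumptions).

    # Sketch: bulk-load the aggregate file produced by the Hadoop/Pig job.
    # Table name, file path, and DSN are placeholders.
    import psycopg2

    with psycopg2.connect("dbname=analytics") as conn:
        with conn.cursor() as cur, open("/tmp/aggregates.csv") as fh:
            cur.copy_expert(
                "COPY daily_aggregates FROM STDIN WITH (FORMAT csv)", fh)
    # leaving the connection block commits the transaction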

Analyse Database Table and Usage

I just got into a new company, and my task is to optimize the database's performance. One possible (and suggested) way would be to use multiple servers instead of one. As there are many possible ways to do that, I need to analyse the DB first. Is there a tool with which I can measure how many inserts/updates/deletes are performed for each table?
I agree with Surfer513 that the DMV is going to be much better than CDC. Adding CDC is fairly complex and will add a load to the system. (See my article here for statistics.)
I suggest first setting up a SQL Server Trace to see which commands are long-running.
If your system makes heavy use of stored procedures (which hopefully it does), also check out sys.dm_exec_procedure_stats. That will help you to concentrate on the procedures/tables/views that are being used most often. Look at execution_count and total_worker_time.
The point is that you want to determine which parts of your system are slow (using Trace) so that you know where to spend your time.
One way would be to utilize Change Data Capture (CDC) or Change Tracking. Not sure how in-depth you want to go with this, but there are other, simpler ways to get a rough estimate (it doesn't look like you want exact numbers, just ballpark figures?).
Assuming that there are indexes on your tables, you can query sys.dm_db_index_operational_stats to get data on inserts/updates/deletes that affect the indexes. Again, this is a rough estimate but it'll give you a decent idea.
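For a ballpark per-table breakdown along those lines, something like the query below against sys.dm_db_index_operational_stats works; the sketch wraps it in pyodbc with a placeholder connection string. Keep in mind the counters only cover activity since the index metadata was last cached (for example, since the instance restarted), so it really is an estimate.

    # Sketch: rough per-table insert/update/delete counts from the DMV.
    # The connection string is a placeholder.
    import pyodbc

    QUERY = """
    SELECT  OBJECT_NAME(s.object_id)      AS table_name,
            SUM(s.leaf_insert_count)      AS inserts,
            SUM(s.leaf_update_count)      AS updates,
            SUM(s.leaf_delete_count)      AS deletes
    FROM    sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS s
    WHERE   OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
    GROUP BY s.object_id
    ORDER BY SUM(s.leaf_insert_count + s.leaf_update_count
                 + s.leaf_delete_count) DESC;
    """

    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes;")
    for table_name, inserts, updates, deletes in conn.execute(QUERY):
        print(table_name, inserts, updates, deletes)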