Using MySQLProxy with Sphinx - sphinx

Is there a relatively easy way to insert into a Sphinx distributed index using MySQL Proxy?
E.g. you connect to MySQL proxy and send something like:
insert into my_ft_index values(1000, 'harry potter');
and MySQL Proxy then somehow calculates a hash of 1000 and decides where to forward the insert?

See:
http://sphinxsearch.com/forum/view.html?id=13632
From what I understand there, MySQL Proxy wouldn't work (but I haven't tried it).
Frankly, it's pretty trivial to implement in application code (i.e. 'picking' which server to send the request to). It would be less trivial in an HA setup with multiple servers per shard.
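A minimal sketch of 'picking' the server in application code, assuming documents are assigned to shards by ID modulo the shard count. The host names and the `pick_shard` helper are hypothetical, not part of Sphinx or MySQL Proxy:

```python
from typing import List

# Hypothetical shard endpoints; replace with the real searchd hosts.
SHARDS = ["sphinx-a:9306", "sphinx-b:9306", "sphinx-c:9306"]


def pick_shard(doc_id: int, shards: List[str]) -> str:
    """Return the shard that owns doc_id (simple modulo placement)."""
    if not shards:
        raise ValueError("at least one shard is required")
    # The same ID always maps to the same shard as long as the list is stable.
    return shards[doc_id % len(shards)]


if __name__ == "__main__":
    # The application would open a connection to this host and send
    # the INSERT there itself, instead of going through a proxy.
    print(pick_shard(1000, SHARDS))
```

Note that modulo placement moves most IDs to a different shard when the number of shards changes, so this only fits a fixed shard count.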

Related

Choosing DB for my big table

I have a table of roughly 60 GB in my DB (held in RAM) that I want to move to another DB. It's only one table with two columns (ID and text), and the only thing I do with it is simple read and write requests (one object at a time, by ID). This DB will get a lot of read and write requests (web application). I'm wondering whether I should use MongoDB or Elasticsearch, or maybe there is something better for me?
Sounds like a key-value store problem. Though both MongoDB and Elasticsearch should work well for it too. What are you using now — Redis or Memcached maybe?
Do you know the size of the value (text)?
I assume you have (at least) one other datastore for other workloads in your system. Maybe you can use the same for multiple tasks? Reusing operational knowledge is definitely a win.
Other than that there is not enough information to make a hard argument for any system IMO.

How to transport and index Cassandra data on Elastic Search?

I'm starting a Node.js application in which I want to index Cassandra data in Elasticsearch, but what would be the best way to do that? I took a look at Storm for this, but it doesn't seem to be the solution. Initially I was thinking of using one client for Cassandra and one client for Elasticsearch and applying every insert/update/delete twice in my application, once per client, but that doesn't appear to be the way to go, and I'm worried about consistency. Is there a better way to transport Cassandra data to be indexed in Elasticsearch? Would Storm help me accomplish that? Could someone suggest any techniques for transporting data from one database to another? I'm really in doubt here, with nowhere to look.
Do you want to move the data from Cassandra to Elasticsearch once and only once, or do you want to keep them in sync?
In both cases, I think Storm is a good fit. I used it in the past to move data from our RDBMS into Apache Solr. One thing to keep in mind is the limit on the writes that Solr/Elasticsearch can handle: if you increase the parallelism too far, you will bring them to their knees.
Another option could be Apache Hadoop, but it is only suitable for one-time copying, or if you want to re-copy the data every day (yesterday's data plus whatever is new).
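One way to respect the write limit mentioned above is to send rows in fixed-size batches from a single writer, so the load on the search index stays bounded. This is only a sketch: `copy_rows` and the `index_batch` callback are hypothetical names, and the real bulk call of whatever Solr or Elasticsearch client is used would be plugged in there.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` rows until the source is exhausted."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def copy_rows(rows: Iterable[dict],
              index_batch: Callable[[List[dict]], None],
              batch_size: int = 500) -> int:
    """Feed rows to index_batch one batch at a time; return rows sent."""
    total = 0
    for chunk in batched(rows, batch_size):
        index_batch(chunk)  # assumption: one bulk request per batch
        total += len(chunk)
    return total
```

The batch size is the tuning knob: larger batches mean fewer requests, but more work for the index per request.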

Architecture to create an uptime monitor in Node.js

What's the best solution for using Node.js and Redis to create an uptime monitoring system? Can I use Redis as a queue but is not the best way to save information, maybe MongoDB is?
It seems pretty simple but needing to have more than 1 server to guarantee the server is down and make everything work together is not so easy.
To monitor uptime, you would use a cron job on the system. On each run, you would check whether the host is up and how long it takes to respond, and in that script you would save your data in Redis.
To do this in Node.js, you would create a script that checks the status of the server: just make an HTTP request to the server (or ping it, whatever) and record whether it fails or not. Then I would just record the result in Redis. How you do it does not matter much, because the script (if you run the cron job every 30 seconds) has 30 seconds before the next run, so you don't have to worry about how long the request to the server takes. How you save your data is up to you, but in this case even MySQL would work (if you are only monitoring a small number of sites).
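The check itself is small. Below is a rough sketch of it in Python rather than Node.js, using only the standard library; `check_host` and its result fields are illustrative names, and storing the result (in Redis or elsewhere) is left to the caller.

```python
import time
import urllib.error
import urllib.request


def check_host(url, timeout=10.0, opener=urllib.request.urlopen):
    """Make one HTTP request and report whether it succeeded and how long it took."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            up = 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        up = False
    return {"url": url, "up": up, "seconds": time.monotonic() - started}
```

The `opener` parameter exists only so the function can be exercised without network access; a scheduled run would call `check_host(url)` and save the returned dictionary.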
More on Cron # Wikipedia
"Can I use Redis as a queue but is not the best way to save information, maybe MongoDB is?"
You can (and should) use Redis as your queue. It is going to be extremely fast.
I also think it is a very good option to save the information in Redis. Unfortunately Redis does not do any timing (yet). I think you could/should use Beanstalkd to put messages on the queue that get delivered when needed (every x seconds). I also think cron is not a very good idea, because you would need a lot of cron jobs, and with a queue you could also do the work faster (by sharing the load among multiple processes).
Also, I don't think you need that much memory to keep everything in memory (which makes the site fast), because the dataset is going to be relatively simple. Even if you aren't able to fit the entire dataset in memory (it would be smart to get more memory, if you ask me), you can rely on Redis's virtual memory.
"It seems pretty simple but needing to have more than 1 server to guarantee the server is down and make everything work together is not so easy."
Sharding/replication is what I think you should read up on to solve this (hard) problem. Luckily Redis supports replication (sharding can also be achieved). MongoDB supports sharding/replication out of the box. To be honest, I don't think you need sharding yet, and your dataset is rather simple, so Redis is going to be faster:
http://redis.io/topics/replication
http://www.mongodb.org/display/DOCS/Sharding+Introduction
http://www.mongodb.org/display/DOCS/Replication
http://ngchi.wordpress.com/2010/08/23/towards-auto-sharding-in-your-node-js-app/
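Separately from where the data is stored, the "more than 1 server" concern can be handled by letting several probes vote before a site is declared down. The sketch below is one possible approach, not something the linked pages describe; the probe names and the `confirmed_down` helper are made up for illustration.

```python
from typing import Dict


def confirmed_down(reports: Dict[str, bool], quorum: int) -> bool:
    """reports maps probe name -> True if that probe saw the site up.

    The site counts as down only when at least `quorum` probes failed,
    so a single probe with a broken network cannot raise a false alarm.
    """
    failures = sum(1 for is_up in reports.values() if not is_up)
    return failures >= quorum


if __name__ == "__main__":
    latest = {"probe-us": False, "probe-eu": False, "probe-asia": True}
    print(confirmed_down(latest, quorum=2))
```

Each probe would write its latest result to the shared store, and one process would evaluate the vote.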

MongoDB on EC2 server or AWS SimpleDB?

Which scenario makes more sense: hosting several EC2 instances with MongoDB installed, or rather using the Amazon SimpleDB web service?
With several EC2 instances running MongoDB, I have the problem of setting up the instances myself.
With SimpleDB, I have the problem of locking myself into Amazon's data structure, right?
What differences are there development-wise? Shouldn't I be able to just switch the DAO of my service layers, to either write to MongoDB or AWS SimpleDB?
SimpleDB has some scalability limitations. You can only scale by sharding, which is manual. It has higher latency than MongoDB or Cassandra, it has a throughput limit, and it is priced higher than other options.
If you need richer query options, have a high read rate, and don't have that much data, MongoDB is better. For durability, however, you need at least 2 MongoDB server instances as master/slave; otherwise you can lose the last minute of your data. Scalability is manual. It's much faster than SimpleDB. Autosharding is implemented in version 1.6.
Cassandra has weak query options but is as durable as PostgreSQL. It is as fast as MongoDB, and faster at larger data sizes. Write operations are faster than read operations on Cassandra. It can scale automatically by firing up EC2 instances, but you have to modify config files a bit (if I remember correctly). If you have terabytes of data, Cassandra is your best bet. There is no need to shard your data; it was designed to be distributed from day one. You can have any number of copies of all your data, and if some servers are dead it will automatically return results from the live ones and distribute the dead servers' data to others. It's highly fault tolerant. You can add any number of instances, and it's much easier to scale than the other options. It has strong .NET and Java client options, with connection pooling, load balancing, marking of dead servers, ...
Another option for big data is Hadoop, but it's not as real-time as the others; you can use Hadoop for data warehousing. Neither Cassandra nor MongoDB has transactions, so if you need transactions PostgreSQL is a better fit. Another option is Amazon RDS, but its performance is bad and its price is high. If you want to use relational databases or SimpleDB, you may also need data caching (e.g. memcached).
For web apps, I recommend MongoDB if your data is small; if it is large, Cassandra is better. You don't need a caching layer with MongoDB or Cassandra, as they are already fast. I don't recommend SimpleDB; it also locks you in to Amazon, as you said.
If you are using C#, Java or Scala, you can write an interface for the data access layer and implement it for MongoDB, MySQL, Cassandra or anything else. It's simpler in dynamic languages (e.g. Ruby, Python, PHP). You can write a provider for two of them if you want and switch the storage, maybe even at runtime, with only a configuration change; all of this is possible. Development with MongoDB, Cassandra and SimpleDB is easier than with a relational database, and they are schema-free; it also depends on the client library/connector you're using. The simplest one is MongoDB. There's only one index per table in Cassandra, so you have to manage other indexes yourself, but as far as I know secondary indexes will be possible with the 0.7 release of Cassandra. You can also start with any of them and replace it in the future if you have to.
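The interface-plus-provider idea might look roughly like this in Python. Everything here is an illustrative assumption: `DocumentStore`, `InMemoryStore`, `BACKENDS` and `make_store` are hypothetical names, and an in-memory dictionary stands in for a real MongoDB or SimpleDB provider.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class DocumentStore(ABC):
    """The only storage API the service layer is allowed to use."""

    @abstractmethod
    def put(self, key: str, doc: dict) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[dict]: ...


class InMemoryStore(DocumentStore):
    """Stand-in provider; a MongoDB or SimpleDB provider would subclass the same way."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def put(self, key: str, doc: dict) -> None:
        self._docs[key] = dict(doc)  # copy so callers cannot mutate stored data

    def get(self, key: str) -> Optional[dict]:
        doc = self._docs.get(key)
        return dict(doc) if doc is not None else None


# The backend name would come from configuration.
BACKENDS = {"memory": InMemoryStore}


def make_store(name: str) -> DocumentStore:
    return BACKENDS[name]()
```

Switching stores then means registering another provider in `BACKENDS` and changing one configuration value, as long as the service layer sticks to `put` and `get`.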
I think you have a question of both time and speed.
MongoDB / Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to set up and run server instances for all of them and figure out how they work.
On the other hand, you don't pay a "per transaction" cost directly; you just pay for the hardware, which is probably more efficient for larger services.
In the Cassandra / MongoDB fight here's what you'll find (based on testing I'm personally involved with over the last few days).
Cassandra:
Scaling / Redundancy is very core
Configuration can be very intense
To do reporting you need map-reduce, for that you need to run a hadoop layer. This was a pain to get configured and a bigger pain to get performant.
MongoDB:
Configuration is relatively easy (even for the new sharding, this week)
Redundancy is still "getting there"
Map-reduce is built-in and it's easy to get data out.
Honestly, given the configuration time required for our 10s of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get these running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the "SimpleDB" route.
In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.
Now 5 years later it is not hard to set up Mongo on any OS. Documentation is easy to follow, so I do not see setting up Mongo as a problem. Other answers addressed the questions of scalability, so I will try to address the question from the point of view of a developer (what limitations each system has):
I will use S for SimpleDB and M for Mongo.
M is written in C++, S is written in Erlang (not the fastest language)
M is open source and can be installed anywhere; S is proprietary and can run only on Amazon AWS. You also have to pay for a whole bunch of stuff with S
S has a whole bunch of strange limitations. M's limitations are way more reasonable. The strangest limitations are:
maximum size of domain (table) is 10 GB
attribute value length (size of field) is 1024 bytes
maximum items in Select response - 2500
maximum response size for Select (the maximum amount of data S can return to you) - 1 MB
S supports only a few languages (java, php, python, ruby, .net), M supports way more
both support REST
S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax which looks like JSON (although it is straightforward to learn the basics)
with M you have to learn how to architect your database. Many people think that schemaless means you can throw any junk into the database and extract it with ease, so they might be surprised that the "junk in, junk out" maxim still applies. I assume the same holds for S, but cannot claim it with certainty.
neither allows case-insensitive search. In M you can use a regex to work around this limitation (ugly, and it uses no index) without introducing an additional lowercase field or application logic.
in S sorting can be done only on one field
because of the 5 s time limit, count in S can behave strangely. If 5 seconds pass and the query has not finished, you end up with a partial number and a token which allows you to continue the query. Application logic is responsible for collecting all these partial counts and summing them up.
everything is a UTF-8 string, which makes it a pain in the ass to work with non string values (like numbers, dates) in S. M type support is way richer.
neither has transactions or joins
M supports compression which is really helpful for nosql stores, where the same field name is stored all-over again.
S supports just a single index; M has single, compound, multi-key, geospatial etc.
both support replication and sharding
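The partial-count behaviour in the list above means the application has to loop until no continuation token is returned. A minimal sketch of that loop, where `run_count` is a hypothetical callable wrapping whatever client library issues the count query (it is not a real SimpleDB API):

```python
from typing import Callable, Optional, Tuple

# run_count(token) -> (partial_count, next_token); next_token is None when done.
CountCall = Callable[[Optional[str]], Tuple[int, Optional[str]]]


def total_count(run_count: CountCall) -> int:
    """Sum partial counts until the service stops returning a continuation token."""
    total = 0
    token: Optional[str] = None
    while True:
        partial, token = run_count(token)
        total += partial
        if token is None:
            return total
```

The first call passes `None` as the token; each later call passes back the token from the previous response.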
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like group by, sum, average and distinct, as well as data manipulation, are not supported, so the functionality is not really much richer than Redis/Memcached. On the other hand, Mongo supports a rich query language.

Are MongoDB and CouchDB perfect substitutes?

I haven't gotten my hands dirty yet with either CouchDB or MongoDB, but I would like to do so soon. I have also read a bit about both systems, and it looks to me like they cover the same cases. Or am I missing a key distinguishing feature?
I would like to use a document based storage instead of a traditional RDBMS in my next project. I also need the datastore to
handle large binary objects (images and videos)
automatically replicate itself to physically separate nodes
make an additional RDBMS superfluous
Are both equally well suited for these requirements?
Thanks!
I've actually used both pretty extensively, both for very different projects.
I'd say they are equally well suited for the requirements you list, however there are quite a lot of differences between the two. IMO the biggest is their query-ability. CouchDB doesn't have 'queries' in the RDBMS sense (select * from ...) but instead uses 'views' which are more like stored procedures (essentially, static queries defined in the database (1)). MongoDB has much more 'usual' querying.
Essentially it comes down to your application requirements. If you give more information I might be able to shed some more light on what might matter in that situation.
(1): you can have temporary, non-static queries in CouchDB, but they aren't recommended for production use
Mongo uses more "traditional" queries. You turn on indexing on a per-key basis and use a SQLish query syntax.
CouchDB's views can do much deeper indexing and relationships but require you to do a little more work and understand the way the key sorting works for doing the queries.
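To make the view idea concrete, here is a toy imitation in Python. It is not CouchDB's API (real views are map functions stored in design documents, usually written in JavaScript); it only shows the principle of a map function emitting keys into a sorted index that is then queried by key range. All names and sample documents are made up.

```python
from bisect import bisect_left, bisect_right


def build_view(docs, map_fn):
    """Run map_fn over every document and keep the emitted rows sorted by key."""
    rows = []
    for doc in docs:
        for key, value in map_fn(doc):
            rows.append((key, doc["_id"], value))
    rows.sort(key=lambda row: (row[0], row[1]))
    return rows


def query_range(view, start_key, end_key):
    """Return rows whose key lies in [start_key, end_key], using the sort order."""
    keys = [row[0] for row in view]
    return view[bisect_left(keys, start_key):bisect_right(keys, end_key)]


def by_author_and_year(doc):
    # Compound key: rows group by author first, then sort by year.
    if "author" in doc:
        yield [doc["author"], doc["year"]], doc["title"]
```

With a compound key like `[author, year]`, "all books by one author in year order" becomes a single range scan from `[author, 0]` to `[author, 9999]`, which is why understanding the key sorting matters.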
There is a big difference in the replication systems as well. Mongo's replication looks a lot like most RDBMS solutions with masters and slaves and all that. CouchDB's replication is more peer to peer, no master/slave, every CouchDB is a node.
CouchDB's replication is made for keeping geographically separated sites in sync. It handles network and other errors gracefully by restarting replication where it left off. Participating nodes can even be taken offline deliberately.
Before using MongoDB, I'd recommend that you take a look at the following: http://groups.google.com/group/mongodb-user/browse_thread/thread/460dbd49a5b6b267. MongoDB has a small chance of corrupting data because it does not fsync on each write.
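For readers unfamiliar with the term, "fsync with each write" means forcing the operating system to push buffered data to disk before the write is reported as done. The sketch below shows the general pattern with the Python standard library; it illustrates the concept only and says nothing about how MongoDB is implemented. `durable_append` is a made-up helper name.

```python
import os


def durable_append(path: str, record: bytes) -> None:
    """Append a record and do not return until the OS has been asked to persist it."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()             # move data from Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush its cache to the device
```

Skipping the `os.fsync` call makes writes much faster, at the cost that recently written data can be lost if the machine crashes before the OS flushes it.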
http://nosql.mypopescu.com/post/298557551/couchdb-vs-mongodb
From a developer's point of view, the biggest difference is Mongo's live queries vs. Couch's views (which must be "compiled").
From an operational point of view, Couch works entirely over HTTP/REST. If you're able to configure HTTP servers, you know how to set up Couch. With Mongo, you instead have to learn how to set up config servers, replica sets and mongos (a kind of balancer).