Multiple Siddhi apps or one big one - complex-event-processing

We are building an application on top of Siddhi (using the Java library) that allows users to dynamically add rules and have all incoming information from then on be run against those rules. My question is whether it's better to have one large app with many queries, streams, windows, and partitions, or to break each query out into its own application.
We have been including everything in one single Siddhi app (SiddhiAppRuntime), but this is starting to become large and I fear things may start interacting with each other in unintended ways. We are also snapshotting the SiddhiAppRuntime and restoring state whenever our application gets restarted. This could likely lead to massive restores if we have hundreds of pattern queries to re-run.
I am considering making a separate SiddhiAppRuntime from a single SiddhiManager for each query. The benefits (as I see them) would be reducing the risk of unintentional interactions, making each query able to function on its own, and making restores after a shutdown much simpler, since each runtime would only need to restore a single query. The potential downside is the increased overhead of having potentially hundreds of SiddhiAppRuntimes.
What is considered best practice for our scenario? What will offer better performance, both for running data through the rules and for restoring the rules in case we have to restart?
(If this is too broad or any clarification is needed I will do my best to update this question accordingly)

From the lengthy description you have given, I assume the rules that users add do not interact with each other, meaning rules added by user1 will not interact with rules added by user2.
In such a case it is recommended to use a separate Siddhi app (SiddhiAppRuntime) for each user. This won't add much additional performance overhead, as the apps won't be interacting with each other. It will also improve the snapshotting process, since we will be taking separate snapshots for each app.
This also gives you a clear separation between each collection of rules, which makes them easier to manage.
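A minimal sketch of that per-rule layout, assuming Siddhi 5's io.siddhi packages (the 4.x org.wso2.siddhi classes have the same shape) and a placeholder stream definition shared by every rule; this is an illustration of the approach, not a drop-in implementation:

```java
import io.siddhi.core.SiddhiAppRuntime;
import io.siddhi.core.SiddhiManager;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RuleEngine {
    private final SiddhiManager siddhiManager = new SiddhiManager();
    // One runtime per user-defined rule (the same pattern works per user), keyed by a rule id we assign.
    private final Map<String, SiddhiAppRuntime> runtimes = new ConcurrentHashMap<>();

    // Deploy a single rule as its own Siddhi app.
    public void addRule(String ruleId, String streamDefinition, String query) {
        String app = streamDefinition + "\n" + query;   // "define stream ..." + "from ... select ... insert into ..."
        SiddhiAppRuntime runtime = siddhiManager.createSiddhiAppRuntime(app);
        runtime.start();
        runtimes.put(ruleId, runtime);
    }

    // Fan each incoming event out to every rule; assumes every rule app defines the same input stream.
    public void send(String streamId, Object[] event) throws InterruptedException {
        for (SiddhiAppRuntime runtime : runtimes.values()) {
            runtime.getInputHandler(streamId).send(event);
        }
    }

    // Snapshot each rule independently, so a restart only replays small, per-rule state.
    public Map<String, byte[]> snapshotAll() {
        Map<String, byte[]> snapshots = new ConcurrentHashMap<>();
        runtimes.forEach((id, runtime) -> snapshots.put(id, runtime.snapshot()));
        return snapshots;
    }

    // restore(...) may throw a checked exception (CannotRestoreSiddhiAppStateException)
    // depending on the Siddhi version, hence the broad throws clause here.
    public void restore(String ruleId, byte[] snapshot) throws Exception {
        runtimes.get(ruleId).restore(snapshot);
    }
}
```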

Best way to keep data in sync between two different applications

I have two closed-source applications that must share the same data at some point. Both expose REST APIs.
A concrete example is helpdesk tickets: they can be created in both applications, and I need to update the data in one application when the user adds a new ticket or closes a ticket in the other application, and vice versa.
Since they are closed source, I can't really modify the code.
I was thinking I could create a third application that, every 5 minutes or so, lists both applications' tickets, compares them against the previous call, and if the data differs from the previous call, updates the other application too.
Is there a better way of doing this?
With closed-source applications it's nearly impossible to get something out of them, unless they have some plugin-based setup that you can hook into.
The most efficient way in terms of cost would be to have the first application publish a message on a queue, or call a webhook that you configure, whenever the event is triggered. But as I mentioned, the application needs to support that.
So yeah, your solution is pretty much everything you can do for now, but keep in mind the challenges that you may encounter over time:
What if the results of both APIs are too large to be compared directly? Maybe you need to think about paging the results.
What if your app crashes and you lose the previous state? You need to somehow back it up in an external source.
How often should you poll the APIs to make sure you're getting the updates you need, while keeping good performance for the existing traffic?
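For illustration, here is a rough Java sketch of that polling-and-diffing loop. The Ticket and TicketApi types are hypothetical stand-ins for the two REST APIs, and it deliberately ignores paging and persisting the previous state, i.e. the first two challenges above:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical, minimal view of a ticket shared by both systems.
record Ticket(String id, String status) {}

// Hypothetical wrapper around each application's REST API.
interface TicketApi {
    List<Ticket> listTickets();          // e.g. GET /tickets
    void upsertTicket(Ticket ticket);    // e.g. POST or PUT /tickets
}

class TicketSyncJob {
    private final TicketApi appA;
    private final TicketApi appB;
    private Map<String, Ticket> previousA = new HashMap<>();
    private Map<String, Ticket> previousB = new HashMap<>();

    TicketSyncJob(TicketApi appA, TicketApi appB) {
        this.appA = appA;
        this.appB = appB;
    }

    void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::syncOnce, 0, 5, TimeUnit.MINUTES);
    }

    private void syncOnce() {
        // A ticket pushed to the other app shows up there as a "change" on the next cycle,
        // but since the content is then equal on both sides the echo stops after one round trip.
        previousA = propagateChanges(appA.listTickets(), previousA, appB);
        previousB = propagateChanges(appB.listTickets(), previousB, appA);
    }

    // Push every ticket that is new or changed since the previous poll to the other app.
    private Map<String, Ticket> propagateChanges(List<Ticket> current,
                                                 Map<String, Ticket> previous,
                                                 TicketApi target) {
        Map<String, Ticket> snapshot = new HashMap<>();
        for (Ticket ticket : current) {
            snapshot.put(ticket.id(), ticket);
            if (!ticket.equals(previous.get(ticket.id()))) {
                target.upsertTicket(ticket);
            }
        }
        return snapshot;
    }
}
```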

How to connect a Postgres read replica for reads without affecting the application source code

I have an application with inline read and write queries in the code, and I am facing a challenge in pointing the read and write queries to their respective databases. Is there a best way of doing this for a Go application?
My thought is to have two ORM instances, one for the read database and one for the write database, and to select the appropriate one based on the operation, e.g. ReadDbMap.Select("query"); WriteDbMap.Update("query");
But this change affects the entire application, and that is my concern.
I am afraid that there is no simpler way.
Streaming replication is not primarily a load balancing feature. For one, you'll have to be aware that a change you made on the primary server is not immediately visible on the standby, so your application will have to deal with these temporary inconsistencies.
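The question is about Go, but the split itself is language-agnostic. As a rough illustration, here is the same idea in Java with two plain JDBC DataSources behind one small router (the names and wiring are hypothetical); keeping the read-or-write decision in a single place limits how far the change spreads through the application:

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

// Routes connections to the primary or a read replica behind one interface,
// so call sites only have to say whether they intend to write.
class RoutingDataSource {
    private final DataSource primary;   // read/write: points at the Postgres primary
    private final DataSource replica;   // read-only: points at the streaming replica

    RoutingDataSource(DataSource primary, DataSource replica) {
        this.primary = primary;
        this.replica = replica;
    }

    Connection connection(boolean forWrite) throws SQLException {
        // Caveat from the answer above: the replica lags the primary slightly, so reads
        // that must see a just-committed write should also be routed to the primary.
        return forWrite ? primary.getConnection() : replica.getConnection();
    }
}

// Usage sketch: try (Connection c = router.connection(false)) { /* run a SELECT */ }
```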

Are "best practices" regarding connection handle re-use and database user design mutually exclusive?

SO says this may be subjective. I'm hoping not: I just can't seem to understand how this works in practice, and it seems like a specific enough technical question with, I hope, a definitive answer.
Context: LAPP stack.
I've read that using a single database user as the login for all connections to the database, and handling security yourself from there, is a bad idea. Databases have sufficient security models and it makes sense to use them.
Database handles have some resource cost associated with them, hence the existence of Apache::DBI, DBIx::Connector, and DBI::connect_cached(), to re-use a recent connection to a database. Making use of them should make a web app faster by avoiding the cost of connecting to a database.
The reason these seem to be mutually exclusive best practices is that, in my understanding, #1 implies that any database connection will be made with separate per-user credentials, which implies (as Apache::DBI documents) that re-using such connections will likely quickly cause your database backend to run out of connections.
The default maximum number of connections for PostgreSQL is 100.
The default number of server processes, multiplied by the number of subprocesses allowed for each, for Apache 2 running with the prefork MPM, far exceeds that, so it seems Apache::DBI's docs are right.
Thus the question: What do people do then, in practice?
Does this mean people using a LAPP stack generally connect using a single database user, and implement their own security/permissions model? Or does it mean they don't pool connections? Or do they choose between these two strategies based on speed vs security needs if they go with a LAPP stack, and if they need both, go with a desktop app or some other connection model?
Or if these are not, in fact, mutually exclusive strategies, what am I missing in my understanding here?
I've read that using a single database user as the login for all connections to the database, and handling security yourself from there, is a bad idea. Databases have sufficient security models and it makes sense to use them.
You probably misread this, or read it in a highly biased location. A more balanced view is (hopefully) this:
Managing perms (ACL or RBAC or other) within the application, using the database, is a bloody mess and hard to get right. It can cripple performance, too, if done improperly (think: "select * from table join perms where convoluted_permission_scenario".) Depending on who you ask, you'll get more or less extreme viewpoints, e.g. here's (the very controversial) Zed Shaw: http://vimeo.com/2723800.
Managing perms at the DB level is just as much of a bloody mess. Not all engines implement row-level permissions, and even then there occasionally are leaks. For instance, calling a function in a where clause could (can?) leak rows in Postgres (until a recent version?) if raise gets called. And frankly, if you go past a superficial analysis of what is going on, it basically amounts to the former — just standardized and (usually) in C.
Managing perms at the app level without a database is also a bloody mess. It'll cripple performance no matter what you do from the moment where you need to join outside of SQL, unless you're dealing with trivial amounts of data. If you try it, you'll do fine… until your database grows too large and you basically don't.
So, in short: it's a bloody mess no matter where you manage it. Because permissions are a mess. In addition to the casual and idealistic "Joe needs write access to this set of nodes", you also need to cope with more down to earth scenarios such as "John is going off on vacation for Christmas and needs to temporarily delegate his write permissions on this set of nodes to his assistant Jane". Moreover, whichever scenario you do pick, you need to manage read access (which is usually the most frequent) in such a way that it's fast so you can scale. There's no silver bullet.
Moreover, even in the first and last of the above scenarios, it's ideal to have three DB users. One for reads, one for read/writes, and one for schema changes. Most apps don't, because it's yet another bloody mess to configure your ORM that way, hence the typical one DB user per app.
Anyway, getting back to your question: what people do in practice is one or two database users (read vs read/write/modify), implement RBAC or ACL within the database itself, and avoid access restriction logic like the plague on public-facing pages for performance reasons.
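To make the connection arithmetic from the question concrete, here is a rough Java analogue of DBI::connect_cached (hypothetical sketch code, not any real pooling library). The cache is keyed by credentials, so per-end-user logins multiply the number of open connections by the number of distinct users, while the one-or-two-application-users practice described above keeps it at a couple of connections per worker:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Rough analogue of DBI::connect_cached: reuse one open connection per (url, user).
class CachedConnections {
    private final Map<String, Connection> cache = new ConcurrentHashMap<>();

    Connection get(String url, String user, String password) throws SQLException {
        String key = url + "|" + user;
        Connection existing = cache.get(key);
        if (existing != null && !existing.isClosed()) {
            return existing;
        }
        Connection fresh = DriverManager.getConnection(url, user, password);
        cache.put(key, fresh);
        return fresh;
    }
}

// With per-end-user credentials, every worker ends up holding one cached connection per
// distinct user it has served: workers x users, which quickly exceeds Postgres's default
// max_connections of 100. With one or two application-level users (say, "app_read" and
// "app_readwrite", both hypothetical names), the total stays at roughly workers x 2.
```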

High Volume MongoDB with Twitter Streaming API, Ruby on Rails, Heroku setup

I'm looking to re-code an application to better handle spikes in tweets. I'm moving to Heroku and MongoDB (either MongoLab or MongoHQ) for the database solution.
During certain news events, tweet volume might spike to 15,000 per second. Typically, with each tweet, I parse the tweet and store various pieces of data such as user data, etc. My idea is to store the raw tweets in a separate collection and have a separate process grab raw tweets and parse them. The goal here is that when there is a massive spike in tweets, my application isn't trying to parse all of them immediately, but is essentially backlogging the raw tweets in another collection. As the volume slows, the process can work through the backlog over time.
My question is threefold:
Can MongoDB handle this type of volume with regards to inserts into a collection at a rate of 15,000 tweets per second?
Any idea on the better setup: MongoHQ or MongoLab?
Any feedback on the overall setup?
Thanks!
The write volume that it will handle depends on lots of factors: hardware, indexes, size of each document, etc. Your best bet is to test it in the environment you're planning to use. If the demands of the write load exceed the capacity of a single mongo server, you can always use multiple shards.
They are very similar, but there are some differences in pricing and the actual site design has a bunch of differences. There's a thread of discussion about it here: https://webmasters.stackexchange.com/questions/20782/mongodb-hosting-mongolab-vs-mongohq-vs-mongomachine
Overall it seems to make sense. It sounds like you will probably want to flesh out some details about how you will be processing the backlog: will you be polling it by querying periodically, deleting tweets from the backlog as they are processed, etc.?
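For what it's worth, here is a rough sketch of that ingest-then-drain flow using the MongoDB Java driver rather than Ruby. The collection and field names are made up, and real code would batch writes and handle errors:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class TweetBacklog {
    private final MongoCollection<Document> rawTweets;
    private final MongoCollection<Document> parsedTweets;

    public TweetBacklog(String uri) {
        MongoClient client = MongoClients.create(uri);
        rawTweets = client.getDatabase("twitter").getCollection("raw_tweets");
        parsedTweets = client.getDatabase("twitter").getCollection("parsed_tweets");
    }

    // Ingest path: do the bare minimum during a spike, just store the raw payload.
    public void ingest(String rawJson) {
        rawTweets.insertOne(new Document("payload", rawJson));
    }

    // Worker path: drain the backlog at its own pace, deleting each raw tweet once parsed.
    public void drainBatch(int batchSize) {
        for (Document raw : rawTweets.find().limit(batchSize)) {
            parsedTweets.insertOne(parse(raw.getString("payload")));
            rawTweets.deleteOne(eq("_id", raw.get("_id")));
        }
    }

    private Document parse(String rawJson) {
        // Placeholder for the real parsing (user data, entities, etc.).
        return Document.parse(rawJson);
    }
}
```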
Completely agree on the need to test this. In general, mongo can handle that many writes, but in practice it depends on the size of your set up, other operations, indexes, etc.
I had to do a similar approach for collecting tons of metrics data. I used a lightweight event-machine process to accept incoming requests in parallel, and store them in a simple format, then another process would take those requests and send them up to a central server. The main goal was to make sure no data was lost if the central server was down, but it also allowed me to put in some throttling logic so that the spikes in data wouldn't overwhelm the system.
I'd be interested to see how this works out for you price-wise, vs. a VPS like Linode. (I'm a huge Heroku fan, but with certain architectures it can get pricey quickly.)

What technical considerations must a system/network administrator worry about when a site gets onto social bookmarking/sharing sites?

The reason I ask is that Stack Overflow has been Slashdotted, and Redditted.
First, what kinds of effect does this have on the servers that power a website? Second, what can be done by system administrators to ensure that their sites remain up and running as best as possible?
Unfortunately, if you haven't planned for this before it happens, it's probably too late and your users will have a poor experience.
Scalability is your first immediate concern. You may start getting more hits per second than you were getting per month. Your first line of defense is good programming and design. Make sure you're not doing anything stupid like reloading data from a database multiple times per request instead of caching it. Before the spike happens, you need to do some fairly realistic load tests to see where the bottlenecks are.
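To illustrate the caching point above, here is a rough sketch (written in Java, with hypothetical names) of a tiny TTL cache placed in front of an expensive database query; in practice you would reach for memcached, Redis, or your framework's built-in cache rather than rolling your own:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Caches the result of an expensive lookup (e.g. a DB query) for a short TTL,
// so a burst of identical requests hits the database once instead of thousands of times.
class TtlCache<V> {
    private record Entry<T>(T value, Instant expiresAt) {}

    private final Map<String, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Duration ttl;

    TtlCache(Duration ttl) {
        this.ttl = ttl;
    }

    V get(String key, Supplier<V> loader) {
        Entry<V> entry = entries.get(key);
        if (entry == null || Instant.now().isAfter(entry.expiresAt())) {
            entry = new Entry<>(loader.get(), Instant.now().plus(ttl));
            entries.put(key, entry);
        }
        return entry.value();
    }
}

// Usage (hypothetical): frontPageCache.get("hot_questions", () -> db.queryHotQuestions());
```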
For absurdly high traffic, consider the ability to switch some dynamic pages over to static pages.
Having a server architecture that can scale also helps. Shared hosts generally don't scale. A single dedicated machine generally doesn't scale. Using something like Amazon's EC2 to host can help, especially if you plan for a cluster of servers from the beginning (even if your cluster is a single computer).
Your next major concern is security. You're suddenly a much bigger target for the bad guys. Make sure you have a good security plan in place. This is something you should always have, but it becomes more important with high usage.
Firstly, ask if you really want to spend weeks and thousands of $ on planning for something that might not even happen, and if it does happen, lasts about 5 hours.
Easiest solution is to have a good way to switch to a page simply allowing a signup. People will sign up and you can email them when the storm has passed.
More elaborate solutions rely on being able to scale quickly. That's firstly a software issue (can you connect to a db on another server, can you do load balancing). Secondly, your hosting solution needs to support fast expansion. Amazon EC2 comes to mind, or maybe slicehost. With both services you can easily start new instances ("Let's move the database to a different server") and expand your instances ("Let's upgrade the db server to 4GB RAM").
If you keep all data in the db (including sessions), you can easily have multiple front-end servers. For the database I'd usually try a single server with the highest resources available, but only because I haven't worked with db replication and it used to be quite hard to do, at least with mysql. Things might have improved.
The app designer needs to think about scaling up (larger machines with more cores and higher performance) and/or scaling out (distributing workload across multiple systems). The IT guy needs to work out how to best support that. The network is what you look at first, because obviously everything rides on top of it. Starting at the border, that usually means network load balancers and redundant routers being served by multiple providers. You can also look at geographic caching services and apps such as cachefly.
You want to reduce your bottlenecks as much as possible. You also want to design the environment such that it can be scaled out as needed without much work. Do the design work up front and it'll mean less headaches when you do get dugg.
Some ideas (of what I used in the past and current projects):
For boosting performance (if needed) you can put a reverse-proxying, caching Squid in front of your server. Of course, that only works if you don't have session keys and if the pages are somewhat static (meaning they change only once an hour or so) and are not personalised.
With Squid you can speed up a bloated and slow CMS like TYPO3, giving you the performance of static websites with the comfort of a CMS.
You can outsource large files to external services like Amazon S3, saving your server's bandwidth.
And if you are able to spend some money (three figures per month), you can also use a Content Delivery Network. With that in place you automatically get scaling, high availability, and low latencies for your users. Of course, your pages must be cacheable, so session keys and personalised pages are a no-no. If designed carefully and with CDNs in mind, you can at least cache SOME content, like pics, videos, and static stuff.
The load goes up, as other answers have mentioned.
You'll also get an influx of new users/blog comments/votes from bored folks who are only really interested in vandalism. This is mostly a problem for blogs which allow completely anonymous commenting, where some dreadful stuff will be entered. The blog platform might have spam filters sufficient to block it, but manual intervention is frequently required to clean up remaining drivel.
Even a little barrier to entry, like requiring a user name or email address even if no verification is done, will dramatically reduce the volume of the vandalism.