How to scale custom made analytics engine? - mongodb

We have a mid size analytics engine built on top of Elastic Search cluster.
We store send data to our servers in form of json, very similar to what Google Analytics might be doing. We push this entire data in ES cluster. As of now which amounts to ~60GB per day(Approx 2TB per month).
We have a data retention policy of few months lets say 6 months(As per pricing plan).
We provide dynamic reports like ....
all the users who are coming from United States and are using the chrome browser and are using the browser on an iPhone.
the sum of clicks on a particular button of all the users who are coming from referrer matching regex “http://www.google.com” and are based out of India and are using Desktop.
PROBLEM
It has worked for us pretty good till now, but we are facing a problem to scale. As we have already deployed 100s of servers to handle this amount of data & show near real time analytics.
What I am looking for here is that how can I optimise data storage and still show near real time slicing and dicing of data. Imagine how google analytics or mix panel might be storing and showing data in real time.
I am open any technology shift. Suggestions please. (Something similar to GA or Mix Panel is what we have in term of feature)
Do you guys thing storing this huge amount of data in some NO-SQL like mongodb will work and running MAP-Reduce on that data? But that might not be real time(We can expect a delay of 5-10 mins in showing data)
Tech Stack Used(As of now)
Apache/Nginx as webserver + application code
Programming Language(Ruby/PHP etc)
Log collection/parsing via logstash
Elasticsearch cluster to store and query data
SDK written in Javascript which pushes events to our server(Like GA)
We store event payload which looks something like this.
{
"query_params":[
],
"device_type":"Desktop",
"browser_string":"Chrome 47.0.2526",
"ip":"62.82.34.0",
"screen_colors":"24",
"os":"Mac OS X",
"browser_version":"47.0.2526",
"session":1,
"country_code":"ES",
"document_encoding":"UTF-8",
"city":"Palma De Mallorca",
"tz":"Europe/Madrid",
"uuid":"A37F2D3A4B99FF003132D662EFEEAFCA",
"combination_goals_facet_term":"c2_g1",
"ts":1452015428,
"hour_of_day":17,
"os_version":"10.11.2",
"experiment":465,
"user_time":"2016-01-05T17:37:10.675000",
"direct_traffic":false,
"combination":"2",
"search_traffic":false,
"returning_visitor":false,
"hit_time":"2016-01-05T17:37:08",
"user_language":"es",
"device":"Other",
"active_goals":[
1
],
"account":196,
"url":"http://someurl.com",
"action":"click",
"country":"Spain",
"region":"Islas Baleares",
"day_of_week":"Tuesday",
"converted_goals":[
],
"social_traffic":false,
"converted_goals_info":[
],
"referrer":"http://www.google.com",
"browser":"Chrome",
"ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
"email_traffic":false
}
EDIT
"optimize data storage" means for every event we receive 70% data same in the json payload. However we keep on creating the new document in ES for event. I was hoping if somehow we stop storing the repeated keys of json and store only what changed in subsequent event payload. Thus optimizing storage space.
We are using SSDs on all our servers. What I am worried about is that what happens we talk about the scale of GA and similar amount of data. I doubt above mentioned Architecture or Tech will survive. Looking for suggestions for that sorta scale.

I think you are already using the best-suited stack for such kind of use case. What I would suggest to work on fine tuning the elasticsearch optimizations if already not done.
Some suggestions could be
Think of using SSD's instead of HDD for elastic search cluster.
Think of using fine tuning parameters like "refresh_intervals"
Using auto scaling via cloud some load balancers in order to handle proper requests.
Hope this helps.

Related

Need advice: How to share a potentially large report to remote users?

I am asking for advice on possibly better solutions for the part of the project I'm working on. I'll first give some background and then my current thoughts.
Background
Our clients can use my company's products to generate potentially large data sets for use in their industry. When the data sets are generated, the clients will file a processing request to us.
We want to send the clients a summary email which contains some statistical charts as well as sampling points from the data sets so they can do some initial quality control work. If the data sets are of bad quality, they don't need to file any request.
One problem is that the charts and sampling points can be potentially too large to be sent in an email. The charts and the sampling points we want to include in the emails are pictures. Although we can use low-quality format such as JPEG to save space, we cannot control how many data sets would be included in the summary email, so the total size could still exceed the normal email size limit.
In terms of technologies, we are mainly developing in Python on Ubuntu 14.04.
Goals of the Solution
In general, we want to present a report-like thing to the clients to do some initial QA. The report may contains external links but does not need to be very interactive. In other words, a static report should be fine.
We want to reduce the steps or things that our clients must do to read the report. For example, if the report can be just an email, the user only needs to 1). log in and 2). open the email. If they use a client software, they may skip 1). and just open and begin to read.
We also want to minimize the burden of maintaining extra user accounts for both us and our clients. For example, if the solution requires us to register a new user account, this solution is, although still acceptable, not ranked very high.
Security is important because our clients don't want their reports to be read by unauthorized third parties.
We want the process automated. We want the solution to provide programming interface so that we can automate the report sending/sharing process.
Performance is NOT a critical issue. Our user base is not large. I think at most in hundreds. They also don't generate data that frequently, at most once a week. We don't need real-time response. Even a delay of a few hours is still acceptable.
My Current Thoughts of Solution
Possible solution #1: In-house web service. I can set up a server machine and develop our own web service. We put the report into our database and the clients can then query via the Internet.
Possible solution #2: Amazon Web Service. AWS is quite mature but I'm not sure if they could be expensive because so far we just wanna share a report with our remote clients which doesn't look like a big deal to use AWS.
Possible solution #3: Google Drive. I know Google Drive provides API to do uploading and sharing programmatically, but I think we need to register a dedicated Google account to use that.
Any better solutions??
You could possibly use AWS S3 and Cloudfront. Files can easily be loaded into S3 using the AWS SDK's and API. You can then use the API to generate secure links to the files that can only be opened for a specific time and optionally from a specific IP.
Files on S3 can also be automatically cleaned up after a specific time if needed using lifecycle rules.
Storage and transfer prices are fairly cheap with AWS and remember that the S3 storage cost indicated is by the month so if you only have an object loaded for a few days then you only pay for a few days.
S3: http://aws.amazon.com/s3/pricing
Cloudfront: https://aws.amazon.com/cloudfront/pricing/
Here's a list of the SDK's for AWS:
https://aws.amazon.com/tools/#sdk
Or you can use their command line tools for Windows batch or powershell scripting:
https://aws.amazon.com/tools/#cli
Here's some info on how the private content urls are created:
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html
I will suggest to built this service using mix of your #1 and #2 options. You can do the processing and for transferring the data leverage AWS S3 which is quiet cheap.
Example: 100GB costs like approx $3.
Also AWS S3 will be beneficial as you are covered for any disaster on your local environment your data will be safe in S3.
For security you can leverage data encryption and signed URLS in AWS S3.

Using CouchDB as interface. Is it appropriate way?

our devices (microscopes with cameras) produce images and additional information to each image.
Now a middleware supplies wants to connect these devices to lab automation system. They have to acquire the data and we have to provide it. An astonishing thing for me was their interface suggestion - a very cryptical token separated format (ASTM E1394-97). Unfortunatelly, they even can't accomodate images in their protocol, and are aiming to get file-paths.
I thought it is not the up-to date approach. While lookink for alternatives, I saw CoachDB.
So, my idea was, our devices would import data including images in CoachDB and they could get the data. It seems even, that using mustache, we could produce the format they want (ascii-text) and placing URLs as image references instead of path's.
My question is, did someone applied CoachDB for such a use case already? It seems to be a little-bit misuse of CoachDB, as the main intention is interface not data storage. Another point disturbing me is, that the inventor of CoachDB went to other project Coachbase. Could it mean lack of support for CoachDB in the future?
Thank you very much for any insights and suggestions!
It's ok use-case and actually we're using CouchDB in such way - as proxing middleware between medical laboratory analyzers and LIS. Some of them publish images or pdf data on shared folders and we'd just loading them into related document as attachments.
More over you'd like to know, CouchDB is able to serve external processes (aka os_daemons) and take care about their lifespan: restarting if someone had terminated and starting right after you update config options through HTTP interface. This helps to setup ASTM client and server processes since this protocol is different from HTTP (which is native for CouchDB) which communicates with devices and creates documents as regular CouchDB clients. In same way you may setup daemons to monitor shared folders for specific files. And all this is just CouchDB with few "low bounded" plugins.

Realtime backend platform for reporting / dashboards?

I will build a dashboard system for my apps, where a page will have several widgets that draw charts, tables and glyphs representing potentially unrelated data.
The client will be HTML5 and I can push for only modern web browser.
My big problem is what backend use for this. I want to store "tables" for use in the charts and in real-time update the widgets.
For example, a invoicing widget will show how much $$ have been collected today. In the "table" will have a row for each total of the invoice:
inv = 1; total = 50
Total: 50
and the widget will draw that. When new data is pushed:
inv = 2; total = 100
Total: 150
The widget will show in realtime the total to the end-user.
The data is private for the user company. Eventually I will need to purge too old data (ie: I only need to keep as much data is necessary to proper evaluation of the info need for the end-user. For example, only keep 1 month of invoicing totals).
I'm thinking in use something like http://www.firebase.com/ or http://pusher.com/ but I suspect only solve the "notify in realtime" part of the equation. As far as I understand, they not let me get past data (ie: If the data is update in the weekend and the user open his dashboard to see what happened)
Then I see http://derbyjs.com/ and the possibility to use mongodb.
I wonder which backend/platform will bring me closer to the build of this system. I have experience with python/django/.net/postgress but could accept the use of something else if solve best this kind of app behavior.
Firebase offers both the "notify in relatime" part that you mention, as well as persistent data storage. Take a look at the tutorial, which walks you through building a real-time persisted chat app (the past chat messages are stored in Firebase and are sent back to the client every time you reload). And you can do much more complicated stuff like the real-time charts / widgets that you mention as well.
The big limitation with Firebase right now is that we're in closed beta and the data is currently unprotected (anybody can read and write your data). The security features are coming soon though.
Some other backend platforms you may want to evaluate are: Meteor and Simperium. Firebase and Simperium are cloud services where your data is stored in the cloud and you don't have to manage any servers of your own, while Meteor and DerbyJS are platforms that you have to install and run on your own server.
I would recommend signalR. It's amazing and you can literally do anything with it. Check it out: www.signalr.net and if you have any problems simply go to www.jabbr.net You will find a very helpful community there. I implemented a notification mechanism similar to facebook together with real time monitoring and a small chat in the same web site.

High Volume MongoDB with Twitter Streaming API, Ruby on Rails, Heroku setup

I'm looking to re-code an application to better handle spikes in tweets. I'm moving to Heroku and MongoDB (either MongoLab or MongoHQ) for the database solution.
During certain news events, tweet volume might spike to 15,000 / second. Typically with each tweet, I parse the tweet and store various pieces of data such as user data, etc. My idea is to store the raw tweets in a separate collection, and have a separate process grab raw tweets and parse them. The goal here is when there is a massive spike in tweets, my application isn't trying to parse all of these, but is essentially backlogging the raw tweets in another collection. As the volume slows, the process can take care of the backlog over time.
My question is three fold:
Can MongoDB handle this type of volume with regards to inserts into a collection at a rate of 15,000 tweets per second?
Any idea on the better setup: MongoHQ or MongoLab?
Any feedback on the overall setup?
Thanks!
The write volume that it will handle depends on lots of factors - hardware, indexes, size of each document, etc. Your best bet is to test it in the environment you're planning to use. If the demands of the write load exceed the capacity of a single mongo server, you can always use just multiple shards.
They are very similar, but there are some differences in pricing and the actual site design has a bunch of differences. There's a thread of discussion about it here: https://webmasters.stackexchange.com/questions/20782/mongodb-hosting-mongolab-vs-mongohq-vs-mongomachine
Overall it seems to make sense. Sounds like you will probably want to flesh out some details about how you will be processing the backlog. Will you be polling it by querying periodically, deleting tweets from the backlog as it processes them, etc.
Completely agree on the need to test this. In general, mongo can handle that many writes, but in practice it depends on the size of your set up, other operations, indexes, etc.
I had to do a similar approach for collecting tons of metrics data. I used a lightweight event-machine process to accept incoming requests in parallel, and store them in a simple format, then another process would take those requests and send them up to a central server. The main goal was to make sure no data was lost if the central server was down, but it also allowed me to put in some throttling logic so that the spikes in data wouldn't overwhelm the system.
I'd be interested to see how this works out for you price-wise, vs. a vps like linode. (I'm a huge Heroku fan, but with certain architectures it can get pricey quickly)

Hosting needs for a turn based iPhone game

So i've been spending some time developing an iPhone app - it's a simple little game and is similar to "Words with friends" in that it:
1) is turn based
2) contacts a web service API to store the "game data" (turns, user info, etc).
In my case, i'm using .NET MVC and a SQL Server backend to develop the API. We're not talking an immense amount of data here - small images will be transferred back and forth and stored in the database though. A typical request would see a few records added or changed in the database.
I mostly don't have much concept of when things would start to get overloaded - my concern, of course, is that this thing takes off (obviously wishful thinking) and then my server gets so overwhelmed that it dies. That being said, I don't want to spend time and money on Windows Azure or something when my hosting needs may be totally trivial.
So, my somewhat general question is this - does anyone have any firsthand knowledge of when things start to get overloaded? Like...just a general estimate of number of requests or something for a time period, assuming each request hits the .NET app which then hits the database a reasonable number of times.
Even some anecdotal "My similar API gets hit 10,000 times a minute and is hosted on crappy shared hosting" would be awesome just so I get some concept.
Thanks in advance!
It is very hard to give a good answer to your question as it greatly depends on what precisely the backend does for each request. Even "trivial" services as you describe can easily differ greatly in performance depending on the actual implementation.
As a rough guideline based on our projects, if your API is a single HTTP request (no HTTPS), hitting a bare-bones controller, being translated into a single, simple SQL statement ("SELECT * FROM foo WHERE bar") returning less than 100 Bytes of data, you can serve about 750 requests per minute on a 32 Bit, 1 Gigahertz box with 512MB ram.
But this number will be reduced to 75 or less if any of those factors go up.
That said:
This is the poster-child case for cloud computing.
If Azure is too much hassle / cost for you (which is not an uncommon complaint from independent developers) you have three main alternatives:
1) Ditch .NET in favor of Python and host within Google App Engine
Python is quick to learn and GAE scales beautifully without you ever needing to care. Best of all, there is a huge free-tier so unless your app really takes off, you won't pay a cent. As you are developing for iOS, I assume you aren't hell bent on .NET to begin with.
2) If you need .NET, go with AWS
They also have a rather large free-tier. Either throw everything on top of a Mono stack (completely free for the 1st year) or shell out the money for a Windows EC2 instance. This takes more planning than GAE but with a little work you can make it scale to wherever your app goes.
If cost is a concern, use the same AWS cluster to host several of your Apps' APIs.
3) Go with OpenFeint's Multiplayer API
OpenFeint supports basic multiplayer games. If you can implement the needed functionality using it, then this might be the best solution. If not, look into (1) and (2).
How long is a piece of string? It all depends with the hosting and connection speeds. .Net is more than capable of handling LARGE amounts of requests. The simplest solution is to monitor the server (or if you cannot, monitor your web services performance) and get better hosting if your app starts to suffer.