Event-based analytics package that won't break the bank with high volume - postgresql

I'm working on a very interactive application that is now at the point of requiring a real analytics solution. We generate roughly 2.5-3 million events per month (and growing), and would like to build reports to analyze user cohorts, funnels, etc. The reports are standard enough that it seems feasible to use an existing service.
However, given the volume of data, I am worried that a hosted analytics solution like Mixpanel will become very expensive very quickly. I've also looked into building a traditional star-schema data warehouse with offline background processes (I know very little about data warehousing).
This is a Ruby application with a PostgreSQL backend.
What are my options, both build and buy, to answer such questions?

Why not build your own?
Check this open source project as an example:
http://www.warefeed.com
It is very basic, and you will have to build the data-mart features you need for your case.
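To make "build your own" concrete, here is a minimal sketch of what it could look like on your existing PostgreSQL backend: one append-only events table plus a cohort-style query. The table, column and event names are just assumptions, and at 2.5-3 million events per month a table like this stays well within Postgres' comfort zone:

```python
# Sketch of a roll-your-own event store on the existing PostgreSQL backend.
# The table, column and event names ('signed_up', 'upgraded') are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=myapp")  # hypothetical connection string
cur = conn.cursor()

# One append-only table; index on (name, occurred_at) for report queries.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id          bigserial   PRIMARY KEY,
        user_id     bigint      NOT NULL,
        name        text        NOT NULL,
        properties  jsonb       NOT NULL DEFAULT '{}',
        occurred_at timestamptz NOT NULL DEFAULT now()
    );
    CREATE INDEX IF NOT EXISTS events_name_time_idx
        ON events (name, occurred_at);
""")
conn.commit()

# Weekly signup cohorts and how many users in each cohort later upgraded.
cur.execute("""
    WITH cohorts AS (
        SELECT user_id, date_trunc('week', min(occurred_at)) AS cohort_week
        FROM events
        WHERE name = 'signed_up'
        GROUP BY user_id
    )
    SELECT c.cohort_week,
           count(*)         AS signed_up,
           count(u.user_id) AS upgraded
    FROM cohorts c
    LEFT JOIN (SELECT DISTINCT user_id FROM events WHERE name = 'upgraded') u
           USING (user_id)
    GROUP BY c.cohort_week
    ORDER BY c.cohort_week;
""")
for cohort_week, signed_up, upgraded in cur.fetchall():
    print(cohort_week, signed_up, upgraded)
```

Funnels work the same way: one query per step over the same table, filtered by event name and joined on user_id.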

Related

AnyLogic: Behavioural Experiment using AnyLogic

I would like to do an online experiment with human participants backed up by simulation. In the experiment, participants need to change the parameters and run simulations based on their changed parameters. Their changes need to be saved for data analysis.
I wonder if it is feasible to do this using AnyLogic or AnyLogic Cloud. The essential features are:
The model needs to be shared with participants; ideally, they do not have to download AnyLogic to complete the experiment.
The changes that participants made need to be saved and downloaded.
Has anyone had experience doing something similar?
Many thanks,
I reached out to AnyLogic's salespeople. They believed it was feasible but hadn't seen anyone do it before...
I did this... but in order to do it, you need either the AnyLogic Professional version, an AnyLogic Cloud subscription, or AnyLogic Private Cloud (the best option).
This is not possible using PLE/public cloud, because you need either to be able to export the executable file (which requires the Professional version) and/or to export the data to a centralized database (which requires at least an AnyLogic Cloud subscription to have access to this feature).
Option with Professional version using exported executable
You export your model and send all participants a link to download it.
You use a centralized database to save every action the user makes.
Con of this option: I would say 30% of users will have issues with their Java version, their computers, or something else, and you will have to troubleshoot with them on an individual basis. If you have 1000 users, this becomes virtually impossible to manage.
Option with AnyLogic cloud subscription
You export your model to AnyLogic cloud
You use a centralized database (via Zapier or your own SQL database) that saves the data every time a user changes anything (see the sketch after these options).
Con of this option: you will need all users to have a cloud subscription, OR all users to share the same username/password to access the model. I am not sure what would happen with 1000 people using the same credentials... according to AnyLogic, there's no limitation there.
Option with AnyLogic Private Cloud
With this option you have absolute freedom to do what you want. I haven't done it myself, but it's also the most expensive option.
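In either the exported-executable or Cloud-subscription setup, the "centralized database" side can be as small as a web endpoint that the model calls with an ordinary HTTP POST whenever a participant changes a parameter. A hedged sketch of that receiving end (Flask plus SQLite purely for illustration; the endpoint name and payload fields are assumptions):

```python
# Hypothetical receiving end for the centralized database described above.
# The AnyLogic model would POST a small JSON payload whenever a participant
# changes a parameter; the endpoint and field names are assumptions.
import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)
DB = "experiment_log.db"

def init_db():
    with sqlite3.connect(DB) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS parameter_changes (
                id          INTEGER PRIMARY KEY AUTOINCREMENT,
                participant TEXT NOT NULL,
                parameter   TEXT NOT NULL,
                new_value   TEXT NOT NULL,
                recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)

@app.route("/log", methods=["POST"])
def log_change():
    payload = request.get_json(force=True)
    with sqlite3.connect(DB) as conn:
        conn.execute(
            "INSERT INTO parameter_changes (participant, parameter, new_value) "
            "VALUES (?, ?, ?)",
            (payload["participant"], payload["parameter"], str(payload["value"])),
        )
    return jsonify(status="ok")

if __name__ == "__main__":
    init_db()
    app.run(host="0.0.0.0", port=8000)
```

The table then gives you one row per change, which is exactly what you need to download for the data analysis afterwards.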

Is Apache Zeppelin suitable for presenting dashboard for several users?

In other words, can Zeppelin be used as a Tableau replacement at small scale?
I have a new UI/UX design of reporting dashboard. Data for dashboard comes from relational database (SQL Server). This dashboard is to be viewed by ~300 colleagues in my company. Perhaps up to ten of them will be viewing it at the same time.
Currently the dashboard is implemented in Kibana, with data being imported into Elasticsearch from SQL Server on a regular basis. However, the new design requires certain widgets and data aggregations that go beyond the dashboarding capabilities of Kibana. Additionally, my organization wants to migrate this dashboard to a technology that is more familiar to the data scientists who work with us (Kibana isn't considered such).
This report and dashboard could be migrated to Tableau, which is powerful enough to perform the desired data aggregations and present all the desired widgets. However, we can't afford the license costs, though we can invest as much developer time as needed.
I have evaluated a couple of open-source dashboarding tools (Metabase and Superset), and they lack the aggregations and widgets we need. I won't go into details because the question is not about specifics; it is clear that Metabase and Superset are not powerful enough for our needs.
My impression is that Apache Zeppelin is powerful enough, with its support for arbitrary Python code (I would use Pandas for the data aggregations), graphs and widgets. However, I am not sure whether a single Zeppelin instance can handle a number of concurrent viewers well.
We'd like to build a set of notebooks and make them available to all colleagues in the organization (access control is not an issue, we trust each other). Notebooks will be interactive with data filters and date range pickers.
It looks like Zeppelin has switchable interpreter isolation modes which we can use to isolate different users' sessions from each other. My question is whether a single t2.large AWS instance hosting Zeppelin can sustain up to ten users viewing a report aggregated over a 300k-row dataset. Also, are there any usability concerns that make the idea of multi-user viewing of a reporting dashboard impractical for Zeppelin?
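For a sense of the per-viewer work, each refresh of the dashboard would run roughly a paragraph like the one below. The connection string, table and column names are placeholders, and the date variables would be wired to the notebook's date-range pickers:

```python
# Rough shape of one Zeppelin %python paragraph for the dashboard.
# Connection string, table and column names are placeholders; start/end
# stand in for the dashboard's date-range pickers (dynamic forms).
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(
    "mssql+pyodbc://user:password@dashboard-dsn"  # placeholder SQL Server DSN
)

start, end = "2019-01-01", "2019-12-31"  # stand-ins for the date pickers

df = pd.read_sql(
    "SELECT department, metric_date, metric_value FROM reporting.metrics",
    engine,
    parse_dates=["metric_date"],
)
df = df[(df["metric_date"] >= start) & (df["metric_date"] <= end)]

# ~300k rows aggregates comfortably in memory.
summary = (
    df.groupby(["department", pd.Grouper(key="metric_date", freq="M")])
      ["metric_value"]
      .agg(["sum", "mean", "count"])
      .reset_index()
)
print(summary.head())
```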
I see a couple questions you're asking:
Can Zeppelin replace Tableau on a small scale? This depends on what features you are using in Tableau. Every platform has its own set of features that the others may or may not have, and Tableau has a lot of customization options that you won't find elsewhere. Aim to get as much of your dashboard as possible converted 1:1, then warm everyone up to the idea that it will look and operate a little differently since it's on a different platform.
Can a t2.large hosting Zeppelin sustain up to 10 concurrent users viewing a report aggregated on 300k rows? A t2.large should be more than big enough to run Zeppelin, Tableau, Superset, etc. with 10 concurrent users pulling a report with 300k rows. 300k isn't really that much.
A good way to speed things up and squeeze more concurrent users onto your existing infrastructure is to speed up your data sources, since that is where a lot of the aggregation calculation happens. Taking a look at your ETLs and trying to aggregate ahead of time can help, as can making sure your data scientists aren't running massive queries that slow down your database server.
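As a concrete version of "aggregate ahead of time", a small ETL step can roll the raw rows up into a summary table on a schedule, so the dashboard reads a few hundred pre-computed rows instead of re-aggregating 300k rows on every view. A rough sketch, with the connection string and table names as assumptions:

```python
# Hypothetical pre-aggregation step, run on a schedule (names are placeholders):
# roll the raw rows up into a small summary table that the dashboard reads.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(
    "mssql+pyodbc://user:password@dashboard-dsn"  # placeholder DSN
)

raw = pd.read_sql(
    "SELECT department, metric_date, metric_value FROM reporting.metrics",
    engine,
    parse_dates=["metric_date"],
)

daily = (
    raw.groupby(["department", raw["metric_date"].dt.date])["metric_value"]
       .agg(["sum", "count"])
       .reset_index()
)

# Replace the summary table in one shot; the dashboard queries this instead.
daily.to_sql("metrics_daily_summary", engine, if_exists="replace", index=False)
```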

How to provide my application as SaaS

I currently have a few apps that I provide to clients by setting them up on their own website. The app uses their own SQL database to record any transactions.
Recently, the number of customers I supply the app to has increased, leading to a higher maintenance workload, as each installation must be managed separately.
I'm ready to move to the next level and want to host the app in a single cloud-based environment so that I only have to maintain one instance. I would then provide access to that app for each client site, for example by embedding it in an iframe or perhaps delivering it via a sub-domain. I am not sure where the DB would sit.
However, this is new territory for me and I'm not sure where to begin. The app is very small and quite simple. I've read a lot about SaaS, but most of it seems quite enterprise-level; I'm really looking for a simple, easy-to-use starting point.
What's the current best practice for this kind of setup and what might be a good guide to read or platform to use?
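To make the question concrete, the sort of pattern I think I'm asking about is a single hosted app that resolves each client from the sub-domain and scopes all data by a tenant id, roughly like the sketch below (framework choice and all names are purely illustrative, not something I'm committed to):

```python
# Illustrative multi-tenant pattern for a single hosted instance: resolve the
# tenant from the sub-domain, then scope every query by tenant_id.
# Framework (Flask), database (SQLite) and all names are placeholders.
import sqlite3
from flask import Flask, request, g, jsonify

app = Flask(__name__)
BASE_DOMAIN = ".myapp.example.com"  # hypothetical base domain

def db():
    conn = sqlite3.connect("saas.db")
    conn.row_factory = sqlite3.Row
    return conn

@app.before_request
def resolve_tenant():
    # e.g. "acme.myapp.example.com" -> tenant slug "acme"
    host = request.host.split(":")[0]
    slug = host[:-len(BASE_DOMAIN)] if host.endswith(BASE_DOMAIN) else None
    row = db().execute(
        "SELECT id FROM tenants WHERE slug = ?", (slug,)
    ).fetchone() if slug else None
    g.tenant_id = row["id"] if row else None

@app.route("/transactions")
def transactions():
    if g.tenant_id is None:
        return jsonify(error="unknown tenant"), 404
    rows = db().execute(
        "SELECT * FROM transactions WHERE tenant_id = ?", (g.tenant_id,)
    ).fetchall()
    return jsonify(transactions=[dict(r) for r in rows])

if __name__ == "__main__":
    app.run()
```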

Developing Geo-location apps for the iPhone

How does one build a directory of 'Spots' for users to check in to in a native iPhone app? Or does the developer borrow data from, say, Google Maps?
When you use data obtained from another network or source, you take the risk that the data may change, may not be accurate, or may cease to exist (more so with Google: one minute they are there like gangbusters, the next they are just gone, no explanation, no apologies, missing in action). If you're developing an application for a business, it's always best to use your own data sources.
That may be more expensive, but it's the only way you will have any kind of control over your application's resources.
You can go both ways; it depends on what you want to do and how you design it. You can have a pre-recorded, static database of spots, you can update it occasionally by connecting to a server, or you can do everything dynamically by loading the data from the internet each time.
Which one to choose? First, design your app with questions like these in mind:
How much of this data will change?
How frequently will those changes happen?
How much will it cost to do an update?
and so on
Developing your own database of places is likely to be quite an undertaking (and your competitors have a big head start). Google is beginning to provide their Places API for "check-in" style applications, so you may be able to get in on their beta.
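For a sense of what borrowing data looks like in practice, a nearby search against Google's Places web service is just an HTTP request with a latitude/longitude, radius and API key. A hedged sketch in Python for brevity (on the iPhone you would issue the same request natively); the key is a placeholder and the category filter is only an example:

```python
# Hedged sketch of a Places "nearby search" for check-in spots; endpoint and
# parameters follow Google's public Places web service. The key is a
# placeholder, and the 'cafe' type filter is only an example.
import requests

API_KEY = "YOUR_API_KEY"

resp = requests.get(
    "https://maps.googleapis.com/maps/api/place/nearbysearch/json",
    params={
        "location": "40.7484,-73.9857",  # device latitude,longitude
        "radius": 250,                   # metres
        "type": "cafe",
        "key": API_KEY,
    },
    timeout=10,
)
resp.raise_for_status()

for place in resp.json().get("results", []):
    print(place["name"], place.get("vicinity"))
```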

Designing a Stress Testing Framework

I do a sort of integration/stress test on a very large product (think operating-system size), and recently my team and I have been discussing ways to better organize our test workloads. Up until now, we've been content to have all of our (custom) workload applications in a series of batch-type jobs, each of which represents a single stress-test run. Now that we're at a point where the average test run involves upwards of 100 workloads running across 13 systems, we think it's time to build something a little more advanced.
I've seen a lot out there about unit testing frameworks, but very little for higher-level stress type tests. Does anyone know of a common (or uncommon) way that the problem of managing large numbers of workloads is solved?
Right now we would like to keep a database of each individual workload and provide a front-end to mix and match them into test packages depending on what kind of stress we need on a given day, but we don't have any examples of the best way to do more advanced things like ranking the stress that each individual workload places on a system.
What are my fellow stress testers on large products doing? For us, a few handrolled scripts just won't cut it anymore.
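To make the "database of workloads" idea concrete, the minimal core I'm picturing is something like the sketch below (the schema, the per-subsystem stress ratings and the selection rule are all placeholders, not a working design):

```python
# Placeholder sketch: a workload catalogue with per-subsystem stress ratings
# and a composer that picks a test package for a given day's focus.
import sqlite3

conn = sqlite3.connect("workloads.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS workloads (
        id         INTEGER PRIMARY KEY,
        name       TEXT UNIQUE NOT NULL,
        command    TEXT NOT NULL,      -- how the batch-type job is launched
        cpu_stress INTEGER NOT NULL,   -- rough 1-10 rating per subsystem
        io_stress  INTEGER NOT NULL,
        mem_stress INTEGER NOT NULL
    );
""")

def compose_package(subsystem, min_rating, limit):
    """Pick the heaviest workloads for one subsystem, at or above a rating."""
    column = {"cpu": "cpu_stress", "io": "io_stress", "mem": "mem_stress"}[subsystem]
    return conn.execute(
        f"SELECT name, command FROM workloads "
        f"WHERE {column} >= ? ORDER BY {column} DESC LIMIT ?",
        (min_rating, limit),
    ).fetchall()

# e.g. today's run: the 20 heaviest I/O workloads
for name, command in compose_package("io", min_rating=7, limit=20):
    print(f"queue {name}: {command}")
```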
My own experience was initially on the IIS platform with WCAT, and latterly with JMeter and Selenium.
WCAT and JMeter both allow you to walk through a web site as a route, completing forms etc., and record the process as a script. The script can then be played back singly or simulating multiple clients and multiple threads, randomising the playback to simulate lumpy and unpredictable use.
The scripts can be edited or written by hand once you know where you are going. WCAT will let you play back from the log files as well, allowing you to simulate real-world usage.
Both of the above are installed on a PC or server.
Selenium is a Firefox add-in, but it works in a similar way, recording and playing back scripts and allowing for scaling up.
Somewhat harder is working out the scenarios you are testing for and then designing the tests to fit them. Interaction with the database and other external resources also needs to be factored in. Expect to spend a lot of time looking at log files, so good graphical output is essential.
The most difficult thing for me was collecting and organizing performance metrics from servers under load. Performance Monitor was my main tool. When the Visual Studio Tester edition came out, I was astonished at how easy it was to work with performance counters. They pre-packaged lists of counters for a web server, SQL Server, an ASP.NET application, etc. I learned about a bunch of performance counters I did not even know existed. In addition, you can collect your own counters, store metrics after each run, and connect to production servers to see how they are feeling today. When I see all those graphs in real time, I feel empowered! :)
If you need more load, you can get VS Load Agents and create a load-generating rig (or should I call it a botnet?). Compared to other products on the market, it is relatively inexpensive: the MS license is per processor, not per concurrent request, which means you can produce as much load as your hardware can handle. On average, I was able to get about 3,000 concurrent web requests on a dual-core machine with 2 GB of memory. You can also incorporate performance tests into your builds.
Of course, this all applies to Windows only. And the price tag of about $6K for the tool could be a bit high, plus the same amount per additional load agent.