StatsD-type server for advanced real-time analytics

My company is growing, and so is our need for analytics. At the same time we need greater speed and more complex analysis.
I'm looking for recommendations for FOSS stats servers similar to StatsD but capable of more complexity: things like cohort charts, grouping actions by category, and extracting unique vs. total counts.

Related

What is better for Reports, Data Analytics and BI: AWS Redshift or AWS Elasticsearch?

We have 4 applications: A, B, C, and D. The applications scrape different social-network data from different sources, and each application has its own database.
Application A scrapes e.g. Instagram accounts, Instagram posts, and Instagram stories from external source X.
Application B scrapes e.g. Instagram account follower and following count history from external source Y.
Application C scrapes e.g. Instagram account audience data (gender, age, and country statistics, etc.) from external source Z.
Application D scrapes TikTok data from external source W.
Our data analytics team has to create different kinds of analysis:
e.g. a table of Instagram post engagement (likes + posts / total number of followers for that month) for specific Instagram accounts;
e.g. Instagram account development: total number of followers per month, total number of posts per month, average post engagement per month, etc.;
e.g. account follower insights: we analyze only a sample of an Instagram account's followers, e.g. 5,000 out of 1,000,000, and look at who our followers follow besides us (top 10 followings);
and lots of other similar reports.
Right now we have 3 TB of data in our OLTP Postgres DB, and it is no longer a workable solution for us. We are running really heavy queries for reporting and BI, and we want to move the social-network data to a data warehouse or OpenSearch.
We are on AWS and we want to use Redshift or OpenSearch for our data analysis.
We don't need real-time processing. Which is the better solution for us, Redshift or OpenSearch?
Any ideas are welcome.
I expect to have infrastructure that can run heavy queries for the data analytics team for reporting and BI.
Based on what you've described, it sounds like AWS Redshift would be a better fit for your needs. Redshift is designed for data warehousing and can handle large-scale data processing, analysis, and reporting, which aligns with your goal of analyzing large amounts of data from multiple sources. Redshift also offers advanced query optimization capabilities, which can help your team run complex queries more efficiently.
OpenSearch, on the other hand, is a search and analytics engine designed for full-text search, log analytics, and near-real-time queries. It is not a data warehouse, so it may not be the best fit for your use case, which involves joining and aggregating structured data from different sources.
When it comes to infrastructure, it's important to consider the size of your data, the complexity of your queries, and the number of users accessing the system. Redshift can scale to handle large amounts of data, and you can choose the appropriate node type and cluster size based on your needs. You can also use features such as Amazon Redshift Spectrum to analyze data in external data sources like Amazon S3.
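If Spectrum is of interest, the rough sketch below (Python with psycopg2; the cluster endpoint, IAM role, bucket, and table layout are all hypothetical placeholders) shows the general shape: register an external schema over Parquet files in S3 and join it against a table stored inside the cluster.

```python
# Rough sketch only: querying S3 data from Redshift via Spectrum.
# Endpoint, IAM role ARN, bucket, and columns are made up for illustration.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.eu-west-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="analyst", password="...",
)
conn.autocommit = True  # external DDL cannot run inside a transaction block
cur = conn.cursor()

# One-time setup: expose an S3-backed schema through the Glue data catalog.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'social_raw'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
""")

# Hypothetical external table over Parquet dumps written by the scrapers.
cur.execute("""
    CREATE EXTERNAL TABLE spectrum.instagram_posts (
        account_id BIGINT,
        posted_at  TIMESTAMP,
        like_count INT
    )
    STORED AS PARQUET
    LOCATION 's3://my-social-data/instagram/posts/';
""")

# Join the cold S3 data with a follower-count table kept inside the cluster.
cur.execute("""
    SELECT p.account_id,
           DATE_TRUNC('month', p.posted_at) AS month,
           SUM(p.like_count)::FLOAT / MAX(f.follower_count) AS engagement
    FROM spectrum.instagram_posts p
    JOIN follower_counts f ON f.account_id = p.account_id
    GROUP BY 1, 2;
""")
for row in cur.fetchall():
    print(row)
```

This lets you keep rarely queried raw scrapes in S3 and only load the hot, frequently joined tables into the cluster itself.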
It's worth noting that moving data to a data warehouse like Redshift may involve some initial setup and data migration costs. However, in the long run, having a dedicated data warehouse can improve the efficiency and scalability of your data analytics processes.

Monitor MongoDB Atlas data transfer costs

I have a MongoDB Atlas cluster that serves many customers. Each customer has its own database on the cluster.
I would like to reduce my application's impact on MongoDB data transfer costs, which have been increasing for the last few days, but the billing info provided by Atlas does not break down prices per database. Therefore, I have no way of knowing which customers are costly and which queries are the most costly in terms of data transfer.
Moreover, looking at the daily prices alongside a few queries, I cannot correlate the insertion of resources in my application with the charges. For example, say my resources are Cats: one day it costs $5 of data transfer with 5,000 Cats inserted in total across the databases, but the next day it costs $13 with only 1,500 Cats inserted.
Do you know of any tools, or something in the Atlas dashboard I might have missed, that could help me better track costs per customer, or even a cost per Cat (in my example), so that I can build a pricing model for my customers?
Thank you
You are most likely going to need separate projects and deployments.
A MongoDB client instance is generally capable of using any database on the server (subject to authorization rules and the APIs provided in the language in question). To get a breakdown of data transfer by database, the server would therefore need to track bytes transferred per operation and then aggregate those counts, and as far as I know this isn't a feature that currently exists.
The most practical way of tracking this today is probably to write a layer on top of the driver on the client side that measures the data actually sent and received.
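As a rough illustration of that client-side idea, here is a minimal sketch using pymongo's command monitoring hooks; the byte counts are approximations (commands and replies are re-encoded, so wire framing and compression are ignored), and the URI and database names are placeholders.

```python
# Rough sketch: approximate data transfer per database on the client side
# by re-encoding command and reply documents. Not exact wire bytes, but
# good enough to compare customers and query patterns.
from collections import defaultdict

import bson
from pymongo import MongoClient, monitoring


class TransferEstimator(monitoring.CommandListener):
    def __init__(self):
        self.bytes_by_db = defaultdict(int)
        self._db_for_request = {}

    def started(self, event):
        # Remember which database each request targets; count outgoing bytes.
        self._db_for_request[event.request_id] = event.database_name
        self.bytes_by_db[event.database_name] += len(bson.encode(event.command))

    def succeeded(self, event):
        db = self._db_for_request.pop(event.request_id, "unknown")
        self.bytes_by_db[db] += len(bson.encode(event.reply))

    def failed(self, event):
        self._db_for_request.pop(event.request_id, None)


estimator = TransferEstimator()
client = MongoClient("mongodb+srv://cluster.example.mongodb.net",  # placeholder URI
                     event_listeners=[estimator])

client["customer_a"].cats.insert_one({"name": "Felix"})
client["customer_a"].cats.find_one({"name": "Felix"})

for db, nbytes in sorted(estimator.bytes_by_db.items()):
    print(f"{db}: ~{nbytes} bytes")
```

Running something like this for a day per customer database should at least show which customers and operations dominate transfer, even if the numbers won't match Atlas's billed figures exactly.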

Can a Mobile Operator use NetFlow the same way as a Wireline Operator?

Can a mobile operator use NetFlow the same way a wireline operator does, to gain insights into subscriber behavior? I am asking because the network topologies are very different.
I've implemented solutions for both types of operator, and you are correct about the different requirements. The number of NetFlow collection points increases dramatically when deploying for a mobile provider, as they have a lot of Points of Presence (PoPs) for the last mile. The network complexity of many routes requires that the NetFlow product be smart about de-duplication of flow records for accurate visibility. Different network topologies also require different mitigation architectures. I would recommend a NetFlow solution that:
1. Handles de-duplication for complex networks easily and accurately, identifying core/edge roles and so on (a toy de-duplication sketch follows this list).
2. Does not charge by users or objects protected, so you get greater value in a large PoP deployment. Billing by overall flow rate is the easiest to budget and maximizes value.
3. Plays nice with others: to get the most from NetFlow, it needs to be shared and propagated to existing ops systems.
4. Scales sustainably: the growth of mobile has increased flow rates, so having a system that can scale to very high flow rates and handle all the storage and load balancing automatically is key.
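To make point 1 a bit more concrete, here is a toy sketch of de-duplicating flows exported from multiple collection points by keying on the 5-tuple within a time window; it illustrates the idea only, and is not any vendor's implementation.

```python
# Toy illustration of NetFlow de-duplication: when the same flow is exported
# by several routers/PoPs, keep a single record per 5-tuple and time bucket
# so traffic is not double-counted. Real products also weigh exporter role
# (core vs. edge), sampling rates, and direction.
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    start_ts: float   # flow start time, seconds
    bytes_seen: int
    exporter: str     # which router/PoP exported the record


def dedupe(records, window_s=60):
    """Keep one record per (5-tuple, time bucket); prefer the largest byte count."""
    best = {}
    for r in records:
        key = (r.src_ip, r.dst_ip, r.src_port, r.dst_port, r.protocol,
               int(r.start_ts // window_s))
        if key not in best or r.bytes_seen > best[key].bytes_seen:
            best[key] = r
    return list(best.values())


flows = [
    FlowRecord("10.0.0.1", "8.8.8.8", 51000, 443, 6, 1000.0, 12_000, "edge-pop-17"),
    FlowRecord("10.0.0.1", "8.8.8.8", 51000, 443, 6, 1002.5, 12_400, "core-1"),  # same flow, seen twice
]
print(len(dedupe(flows)))  # -> 1
```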
I highly recommend the power of NetFlow to improve QoS and reduce costs for a mobile operator. There is some additional detail in this blog post: https://www.flowtraq.com/network-flow-analysis-for-maximum-security/
Bottom line: a NetFlow product with topology awareness.
-Gurdev

Amazon Redshift for SaaS application

I am currently testing Redshift for a SaaS near-realtime analytics application.
Query performance is fine on a 100M-row dataset.
However, the concurrency limit of 15 queries per cluster will become a problem as more users use the application at the same time.
I cannot cache all aggregated results, since we allow filters to be customized on each query (ad-hoc querying).
The requirements for the application are:
queries must return results within 10s
ad-hoc queries with filters on more than 100 columns
From 1 to 50 clients connected to the application at the same time
dataset growing at a rate of 10M rows per day
typical queries are SELECTs with aggregate functions (COUNT, AVG) and 1 or 2 joins
Is Redshift unsuitable for this use case? What other technologies would you consider for these requirements?
This question was also posted on the Redshift Forum. https://forums.aws.amazon.com/thread.jspa?messageID=498430&#498430
I'm cross-posting my answer for others who find this question via Google. :)
In the old days we would have used an OLAP product for this, something like Essbase or Analysis Services. If you want to look into OLAP, there is a very nice open-source implementation called Mondrian that can run over a variety of databases (including Redshift, AFAIK). Also check out Saiku for an OSS browser-based OLAP query tool.
I think you should test the behaviour of Redshift with more than 15 concurrent queries. I suspect the impact will not be noticeable to users, as the queries will simply queue for a second or two.
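For example, one rough way to check this (a sketch only; the DSN and query below are placeholders, not your schema) is to fire a batch of identical queries concurrently and time them end to end, queue time included:

```python
# Rough concurrency probe: submit 30 identical ad-hoc queries at once and
# record how long each takes end to end, including any WLM queue time.
# If queuing only adds a second or two, users may never notice it.
import time
from concurrent.futures import ThreadPoolExecutor

import psycopg2

DSN = "host=my-cluster.xxxx.redshift.amazonaws.com port=5439 dbname=app user=bench password=..."  # placeholder
QUERY = "SELECT customer_id, COUNT(*), AVG(amount) FROM events GROUP BY customer_id;"  # placeholder


def timed_query(i):
    start = time.time()
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            cur.fetchall()
    finally:
        conn.close()
    return i, time.time() - start


with ThreadPoolExecutor(max_workers=30) as pool:
    for i, elapsed in pool.map(timed_query, range(30)):
        print(f"query {i}: {elapsed:.1f}s")
```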
If you prove that Redshift won't work you could test Vertica's free 3-node edition. It's a bit more mature than Redshift (i.e. it will handle more concurrent users) and much more flexible about data loading.
Hadoop/Impala is overly complex for a dataset of your size, in my opinion. It is also not designed for a large number of concurrent queries or short duration queries.
Shark/Spark is designed for the case where your data is arriving quickly and you have a limited set of metrics that you can pre-calculate. Again, this does not seem to match your requirements.
Good luck.
Redshift is very sensitive to the keys used in joins and in GROUP BY/ORDER BY. There are no dynamic indexes, so you usually define your table structure to suit the tasks.
What you need to ensure is that your joins match that structure 100%. Look at the explain plans: you should not see any redistribution or broadcasting, and no leader-node activities (such as sorting). This sounds like the most critical requirement, considering the number of queries you are going to run.
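As an illustration of what "matching the structure" can look like (table and column names here are hypothetical, not taken from your application), distributing and sorting both tables on the join key keeps the join local to each slice, which the plan should confirm with DS_DIST_NONE and no leader-node steps:

```python
# Illustration only: hypothetical tables distributed and sorted on the join key,
# so the join runs on each slice with no DS_BCAST/DS_DIST_* redistribution.
import psycopg2

conn = psycopg2.connect("host=... port=5439 dbname=app user=admin password=...")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        customer_id BIGINT,
        event_ts    TIMESTAMP,
        amount      DOUBLE PRECISION
    )
    DISTKEY (customer_id)
    SORTKEY (customer_id, event_ts);
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        customer_id BIGINT,
        segment     VARCHAR(32)
    )
    DISTKEY (customer_id)
    SORTKEY (customer_id);
""")

# Check the plan: look for DS_DIST_NONE on the join and no leader-node steps.
cur.execute("""
    EXPLAIN
    SELECT c.segment, COUNT(*), AVG(e.amount)
    FROM events e
    JOIN customers c ON c.customer_id = e.customer_id
    GROUP BY c.segment;
""")
for (line,) in cur.fetchall():
    print(line)
```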
The requirement to filter/aggregate on any of 100 columns can be a problem as well. If the structure (dist keys, sort keys) doesn't match the columns most of the time, you won't be able to take advantage of Redshift's optimisations. However, these are scalability problems: you can increase the number of nodes to match your performance needs, you just might be surprised by the cost of the optimal solution.
This may not be a serious problem if the number of projected columns is small; otherwise Redshift will have to hold large amounts of data in memory (and eventually spill to disk) while sorting or aggregating (even in a distributed manner), and that can again hurt performance.
Beyond scaling, you can always implement sharding or mirroring to overcome some queue/connection limits, or contact AWS support to have some limits lifted.
You should also consider pre-aggregation. Redshift can scan billions of rows in seconds as long as it does not need to do transformations like reordering, and it can store petabytes of data, so it's fine to store data redundantly.
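A minimal version of that pre-aggregation idea (again with hypothetical table and column names) is a daily rollup rebuilt after each batch load, which most dashboards and ad-hoc filters can hit instead of the raw table:

```python
# Sketch of pre-aggregation: a small daily rollup derived from the raw events
# table. Drop and recreate it after each batch load; names are hypothetical.
ROLLUP_SQL = """
    CREATE TABLE events_daily
    DISTKEY (customer_id)
    SORTKEY (event_date)
    AS
    SELECT customer_id,
           TRUNC(event_ts) AS event_date,
           COUNT(*)        AS event_count,
           AVG(amount)     AS avg_amount
    FROM events
    GROUP BY 1, 2;
"""
# e.g. cur.execute("DROP TABLE IF EXISTS events_daily;") then cur.execute(ROLLUP_SQL)
```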
So, in summary, I don't think your use case is unsuitable based solely on the description you provided. It might require some work, and the details depend on the exact usage patterns.

Good data store for millions of events?

We have a number of systems that together generate around 5M events per day. Currently we keep these for around 10 days, totalling around 40-50M events. We're using an RDBMS as the persistence layer with a web GUI slapped onto it, but we are experiencing certain performance problems.
An event consists of 20-30 fields composed of the following:
fields representing the event itself (e.g. OrderReceived)
fields representing the system that generated the event (e.g. ERP system)
fields representing the business context in which the event was generated (e.g. OrderManagement)
fields representing other details that we consider relevant/important
Roughly 5-6 of the fields are identifiers, most of them unique, representing the event itself, the business entity/object, the context and similar. Using these identifiers we can also relate events to each other chaining them together. The time difference in an event chain may be hours or in rare cases even days.
Currently we use the solution for analysis of individual event chains, mostly for error and outlier analysis (where did my order go?). In the future we may also want to gather statistics about events and event chains (how many orders per day? how many orders are handled by system X?). If possible, the solution should be able to grow to at least double its current size (we foresee an increase in the number of events as new systems are enabled). Analysis is currently performed by humans, so searching needs to be tolerably fast (finding an event chain should take seconds, not minutes). The data store should also allow stale events to be cleaned out.
As mentioned at the beginning, we're using a standard RDBMS for this. We were using a fairly normalized structure, which we've now started denormalizing to try to increase performance. I can't help wondering whether some other solution might be better, though. I've started looking at different NoSQL databases (in my opinion MongoDB seems promising) and am also trying to gather information about search engines and the like (e.g. Solr and Elasticsearch).
The question is: what type of data store/solution would be a good fit for these events? Should we head into the NoSQL space, is a search engine perhaps what we want, or are we barking up the wrong tree when what we really need is someone who's really good at optimizing RDBMSs?
I would suggest a hybrid solution with a conventional SQL server for the actual storage and a Lucene-based search front end that is populated from the SQL database by some automatic or timed job. The web layer queries the Lucene layer and writes to the SQL layer.
An SQL backend keeps your options open for the future (OLAP, etc.) and also provides a standard, scalable, multi-user way to accept data from the outside world through DB connection libraries and UI tools. In short, if your data is stored in SQL you cannot go wrong...
The Lucene layer provides extreme query performance, provided the query capabilities it offers suffice. In a nutshell: field-value search for numbers, dates, strings, etc., range search, and multi-field-value search (a field is actually an array), all with logical operators and binary expressions, plus sorting and paging. HOWEVER, it cannot do groupings or aggregate functions such as SUM and AVG.
UPDATE: several years have passed. Solr now has statistical capabilities like sum, avg, etc.
Query performance: in a database of 100M records, selecting a couple of hundred items with a multi-field query predicate takes under 100 ms.
Populating the index takes roughly constant time (it does not slow down as the index grows) because of the internal split-file implementation. It is possible to build a 5-million-entry index in minutes, 20 at most, depending mainly on your storage controller. Lucene also supports real-time updates to the index, a feature we have used extensively and successfully on high-load websites.
Lucene supports splitting an index into sub-indexes and index hierarchies, so you can create an index per day but search all of them (or a specific subset) with a single query (using the multi-index adapter). I tried it with 2000 unique index files and the performance was amazing.
This architecture can be implemented without much effort in Java or .NET; both have great SQL and Lucene support.
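As a small sketch of that pattern (using Elasticsearch as a stand-in for the Lucene layer, since the question already mentions it; the index, table, and field names are hypothetical): the search layer answers the multi-field filter and returns IDs, and the RDBMS, which remains the source of truth, serves the full rows.

```python
# Sketch of the hybrid pattern described above: the Lucene-based layer
# (Elasticsearch here) resolves the multi-field filter quickly and returns
# event IDs; the RDBMS stays the source of truth and serves the full rows.
# Index, table, and field names are hypothetical.
import psycopg2
from elasticsearch import Elasticsearch  # elasticsearch-py 8.x style calls below

es = Elasticsearch("http://localhost:9200")
pg = psycopg2.connect("dbname=events user=app password=... host=localhost")  # placeholder DSN


def find_event_chain(order_id, limit=200):
    # 1) Ask the search layer which events match (fast multi-field filtering).
    resp = es.search(
        index="events",
        query={"bool": {"filter": [{"term": {"order_id": order_id}}]}},
        size=limit,
        sort=[{"event_time": "asc"}],
    )
    ids = [hit["_id"] for hit in resp["hits"]["hits"]]
    if not ids:
        return []

    # 2) Hydrate the full, authoritative records from the RDBMS.
    with pg.cursor() as cur:
        cur.execute(
            "SELECT * FROM events WHERE event_id = ANY(%s) ORDER BY event_time",
            (ids,),
        )
        return cur.fetchall()
```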