What is better for Reports, Data Analytics and BI: AWS Redshift or AWS OpenSearch?

We have 4 applications: A, B, C, and D. The applications scrape different social-network data from different sources, and each application has its own database.
Application A scrapes, e.g., Instagram accounts, Instagram posts, and Instagram stories from external source X.
Application B scrapes, e.g., Instagram account follower and following count history from external source Y.
Application C scrapes, e.g., Instagram account audience data (gender split: male vs. female, age statistics, country statistics, etc.) from external source Z.
Application D scrapes TikTok data from external source W.
Our data analytics team has to create different kinds of analyses:
e.g., data (a table) with Instagram post engagement (likes per post / total number of followers for that month) for specific Instagram accounts.
e.g., Instagram account development: total number of followers per month, total number of posts per month, average post engagement per month, etc.
e.g., account follower insights: we analyze just a sample of an Instagram account's followers, say 5,000 out of 1,000,000, and look at who our followers follow besides us (top 10 followings).
and a lot of other similar reports.
Right now we have 3 TB of data in our OLTP Postgres DB, and that is no longer workable for us: we are running really heavy queries for reporting and BI, and we want to move the social-network data to a data warehouse or OpenSearch.
We are on AWS and we want to use Redshift or OpenSearch for our data analysis.
We don't need real-time processing. Which is the better solution for us, Redshift or OpenSearch?
Any ideas are welcome.
I expect to have infrastructure that can run heavy queries for the data analytics team for reporting and BI.

Based on what you've described, it sounds like AWS Redshift would be a better fit for your needs. Redshift is designed for data warehousing and can handle large-scale data processing, analysis, and reporting, which aligns with your goal of analyzing large amounts of data from multiple sources. Redshift also offers advanced query optimization capabilities, which can help your team run complex queries more efficiently.
OpenSearch, on the other hand, is a search and analytics engine designed for full-text search, log analytics, and near-real-time use cases. It is not a data warehouse, and it is not well suited to the joins and aggregations over structured data from multiple sources that your reports require, so it is probably not the best fit for this use case.
When it comes to infrastructure, it's important to consider the size of your data, the complexity of your queries, and the number of users accessing the system. Redshift can scale to handle large amounts of data, and you can choose the appropriate node type and cluster size based on your needs. You can also use features such as Amazon Redshift Spectrum to analyze data in external data sources like Amazon S3.
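As a concrete (hypothetical) illustration of what the analytics workflow could look like on Redshift, here is a minimal Python sketch that runs one of the monthly-engagement reports through the Redshift Data API via boto3. The cluster identifier, database, user, and the analytics.instagram_posts table with its columns are all assumptions, and the engagement formula is only an example:

import time
import boto3

client = boto3.client("redshift-data", region_name="eu-west-1")

SQL = """
SELECT account_id,
       DATE_TRUNC('month', posted_at) AS month,
       AVG(likes::float / NULLIF(followers_at_post, 0)) AS avg_engagement
FROM analytics.instagram_posts   -- hypothetical table
GROUP BY 1, 2
ORDER BY 1, 2;
"""

resp = client.execute_statement(
    ClusterIdentifier="reporting-cluster",  # assumption: a provisioned cluster
    Database="analytics",
    DbUser="bi_reader",
    Sql=SQL,
)

# Poll until the statement finishes, then fetch the result set.
while True:
    desc = client.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    records = client.get_statement_result(Id=resp["Id"])["Records"]
    print(len(records), "rows returned")
else:
    print("Query failed:", desc.get("Error"))

The same pattern works for data registered in S3 through Redshift Spectrum; only the external schema referenced in the SQL changes.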
It's worth noting that moving data to a data warehouse like Redshift may involve some initial setup and data migration costs. However, in the long run, having a dedicated data warehouse can improve the efficiency and scalability of your data analytics processes.

Related

How to overcome API/WebSocket limitations with OHLC data for a trading platform with lots of users?

I'm using CCXT for some REST API calls for information and for websockets. It's OK for 1 user, but if I wanted many users on the platform, how would I go about an in-house solution?
Currently each chart uses either websockets or REST calls; if I have 20 charts then that's 20 calls, and if I add users, that's 20x however many users. If I fetch a complete coin list with real-time prices from one exchange, that alone slows everything down.
Some ideas I have thought about so far are:
Use proxies with REST/Websockets
Use TimescaleDB to store the data and serve that, OR
Use caching on the server, and serve that to the users
Would this be a solution? There's got to be a way to overcome rate limiting and reduce the number of calls to the exchanges.
Probably, it's good to think about having separate layers to:
receive market data (a single connection that broadcasts data to the OHLC processors)
process OHLC histograms (subscribe to the internal market data)
serve histogram data (subscribe to the processed data)
The market data stream is huge, and if you think about these layers independently, it will be easier to scale and even decouple the components later if necessary (see the sketch below).
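A minimal sketch of that layering in Python with asyncio, assuming the upstream feed is a single simulated connection standing in for the real CCXT/websocket client; the symbol, queue sizes, and bar interval are placeholders:

import asyncio
import random
import time

subscribers = []

def subscribe():
    q = asyncio.Queue(maxsize=1000)
    subscribers.append(q)
    return q

async def market_data_receiver():
    # Single upstream connection: broadcasts every tick to all internal subscribers.
    while True:
        tick = {"symbol": "BTC/USDT",
                "price": 30000 + random.random() * 100,  # simulated tick
                "ts": time.time()}
        for q in subscribers:
            if not q.full():
                q.put_nowait(tick)
        await asyncio.sleep(0.1)

async def ohlc_processor(interval=1.0):
    # Aggregates ticks from its own queue into OHLC bars for the serving layer.
    q = subscribe()
    bar = None
    while True:
        tick = await q.get()
        bucket = int(tick["ts"] // interval)
        if bar is None or bar["bucket"] != bucket:
            if bar is not None:
                print("completed bar:", bar)  # hand off to the serving layer here
            bar = {"bucket": bucket, "open": tick["price"], "high": tick["price"],
                   "low": tick["price"], "close": tick["price"]}
        else:
            bar["high"] = max(bar["high"], tick["price"])
            bar["low"] = min(bar["low"], tick["price"])
            bar["close"] = tick["price"]

async def main():
    await asyncio.gather(market_data_receiver(), ohlc_processor())

asyncio.run(main())

Each layer only talks to the one before it, so the receiver can later be swapped for a message broker (Redis pub/sub, Kafka, etc.) without touching the processors or the serving layer.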
With TimescaleDB, you can build materialized views (continuous aggregates) that make the information easy to access and retrieve. Each materialized view can have a continuous aggregate policy matching the interval of the histograms, as in the sketch below.
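For example, a hedged sketch of a 1-minute OHLC continuous aggregate driven from Python with psycopg2; the ticks table (assumed to be a hypertable), its columns, and the connection string are assumptions:

import psycopg2

conn = psycopg2.connect("dbname=market user=postgres password=secret host=localhost")
conn.autocommit = True  # continuous aggregates cannot be created inside a transaction

with conn.cursor() as cur:
    # 1-minute OHLC bars maintained by TimescaleDB from a hypothetical ticks hypertable.
    cur.execute("""
        CREATE MATERIALIZED VIEW ohlc_1m
        WITH (timescaledb.continuous) AS
        SELECT time_bucket('1 minute', ts) AS bucket,
               symbol,
               first(price, ts) AS open,
               max(price)       AS high,
               min(price)       AS low,
               last(price, ts)  AS close,
               sum(amount)      AS volume
        FROM ticks
        GROUP BY bucket, symbol;
    """)
    # Keep the recent buckets refreshed automatically.
    cur.execute("""
        SELECT add_continuous_aggregate_policy('ohlc_1m',
            start_offset      => INTERVAL '1 hour',
            end_offset        => INTERVAL '1 minute',
            schedule_interval => INTERVAL '1 minute',
            if_not_exists     => true);
    """)

The serving layer then queries ohlc_1m (paginated) instead of the raw ticks.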
Fetching all data all the time for all the users is not a good idea.
Pagination can help by bringing the visible histograms first and limiting the query results, avoiding heavy I/O and big chunks of memory on the server.

Monitor MongoDB Atlas data transfer costs

I have a MongoDB Atlas cluster that serves many customers. Each customer has its own database on the cluster.
I would like to reduce my application's impact on MongoDB data transfer costs, which have been increasing for the last few days, but the billing info provided by Atlas does not break down prices per database. Therefore, I have no way of knowing which customers are costly and what are the most costly queries in terms of data transfer.
Moreover, looking at daily prices and a few queries, I cannot correlate the insertion of resources in my application with the costs. For example, let's say my resources are Cats: one day it costs $5 of data transfer with 5,000 Cats inserted in total across the databases, but the next day it costs $13 with only 1,500 Cats inserted.
Do you know of tools, or something in the Atlas dashboard I might've missed, that could help me better track costs per customer, or, say, a cost per Cat (in my example), so that I can build a pricing model for my customers?
Thank you
You are most likely going to need separate projects and deployments.
A MongoDB client instance is generally capable of using any database on the server (subject to authorization rules and the APIs provided by the language in question), so getting a breakdown of data transfer by database would require the server to track bytes transferred per operation and then aggregate those counts. As far as I know, this isn't a feature that currently exists.
The most practical way of tracking this today is probably to write a layer on top of the driver on the client side that looks at the data actually received.
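One hedged way to approximate this from the client side in Python is PyMongo's command monitoring: record which database each command targets and tally the size of the server replies. This measures decompressed BSON reply sizes, not exact wire bytes, so treat it as a relative indicator; the connection string is a placeholder.

from collections import defaultdict

import bson
from pymongo import MongoClient, monitoring

bytes_per_db = defaultdict(int)
_db_by_request = {}

class TransferEstimator(monitoring.CommandListener):
    def started(self, event):
        # Remember which database this request targets.
        _db_by_request[event.request_id] = event.database_name

    def succeeded(self, event):
        db = _db_by_request.pop(event.request_id, "unknown")
        bytes_per_db[db] += len(bson.encode(event.reply))

    def failed(self, event):
        _db_by_request.pop(event.request_id, None)

client = MongoClient("mongodb+srv://user:pass@cluster.example.net",  # placeholder URI
                     event_listeners=[TransferEstimator()])

# ... run the normal workload, then inspect the counters, e.g.:
# sorted(bytes_per_db.items(), key=lambda kv: -kv[1])

Since each customer has its own database, the per-database totals map directly to per-customer transfer estimates.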

AWS database solution for storing non-relational data

What's the best AWS database for the requirements below?
I need to store around 50,000 - 100,000 entries in the database.
Each entry would have a string as the key and a JSON array as the value.
I should be able to retrieve the JSON array using the key.
The size of the JSON data is around 20-30 KB.
I expect around 10,000 - 40,000 reads per hour.
Around 50,000 - 100,000 writes per week.
I have to consider the cost as well.
Ease of integration/development.
I am a bit confused between MongoDB, DynamoDB and PostgreSQL. Please share your thoughts on this.
DynamoDB:-
DynamoDB is a fully managed proprietary NoSQL database service that supports key-value and document data structures. For the typical use case that you have described in OP, it would serve the purpose.
DynamoDB can handle more than 10 trillion requests per day and support peaks of more than 20 million requests per second.
DynamoDB has good AWS SDK for all operations. The read and write capacity units can be configured for the table.
DynamoDB tables using on-demand capacity mode automatically adapt to your application's traffic volume. On-demand capacity mode instantly accommodates up to double the previous peak traffic on a table. For example, if your application's traffic pattern varies between 25,000 and 50,000 strongly consistent reads per second, where 50,000 reads per second is the previous traffic peak, on-demand capacity mode instantly accommodates sustained traffic of up to 100,000 reads per second. If your application sustains traffic of 100,000 reads per second, that peak becomes your new previous peak, enabling subsequent traffic to reach up to 200,000 reads per second.
One point to note is that it doesn't allow querying the table on non-key attributes. This means that if you don't know the hash key of an item, you may need to do a full table scan to get the data. However, there is a secondary index option which you can explore to get around the problem. You should have all the query access patterns of your use case mapped out before you design the table, so you can make an informed decision.
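For the access pattern in the OP (string key in, JSON array out), a minimal boto3 sketch could look like the following; the table name, partition key, and attribute names are made up, and the 20-30 KB values fit comfortably under DynamoDB's 400 KB item limit:

import json
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("entries")  # assumed: partition key "entry_key" (String)

def put_entry(key, values):
    # Store the JSON array as a string to keep the item shape simple;
    # it could also be stored as a native DynamoDB list.
    table.put_item(Item={"entry_key": key, "payload": json.dumps(values)})

def get_entry(key):
    resp = table.get_item(Key={"entry_key": key})
    item = resp.get("Item")
    return json.loads(item["payload"]) if item else None

put_entry("user#123", [{"event": "login", "ts": 1700000000}])
print(get_entry("user#123"))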
MongoDB:-
MongoDB is not a fully managed service on AWS. However, you can set up the database yourself using AWS services such as EC2, VPC, IAM, EBS, etc. This requires some AWS cloud experience. The other option is to use the MongoDB Atlas service.
MongoDB is more flexible in terms of querying, and it has a powerful aggregation framework. There are lots of tools available to query the database directly and explore the data, much like SQL.
In terms of a Java API, Spring Data MongoDB can be used to perform typical database operations. There are lots of open-source frameworks for MongoDB in various languages (for example, Mongoose for Node.js) as well.
MongoDB has support for many programming languages, and the APIs are mature as well.
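For comparison, the same key-to-JSON-array pattern plus a small use of the aggregation framework with PyMongo might look like this; the database, collection, and field names are illustrative:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # or a MongoDB Atlas URI
col = client["appdb"]["entries"]

# One document per key; the JSON array lives in a regular field.
col.update_one(
    {"_id": "user#123"},
    {"$set": {"events": [{"event": "login", "ts": 1700000000}]}},
    upsert=True,
)

# Lookup by key.
doc = col.find_one({"_id": "user#123"})
print(doc["events"] if doc else None)

# Aggregation framework example: the 10 keys with the most events.
pipeline = [
    {"$project": {"n_events": {"$size": "$events"}}},
    {"$sort": {"n_events": -1}},
    {"$limit": 10},
]
for row in col.aggregate(pipeline):
    print(row)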
PostgreSQL:-
PostgreSQL is available as a fully managed database on AWS through Amazon RDS (and Amazon Aurora).
PostgreSQL has become the preferred open source relational database for many enterprise developers and start-ups, powering leading geospatial and mobile applications. Amazon RDS makes it easy to set up, operate, and scale PostgreSQL deployments in the cloud.
I think I don't need to write much about this database and its API. It is very mature database and has good APIs.
Points to consider:-
Query Access Pattern
Easy setup
Database maintenance
API and frameworks
Community support

Real-time statistics (example). NoSQL

Task
Hi, I have 2-3 thousand users online. I also have groups, teams and other (2-3) entities which contain users. About every 10 seconds I want to show online statistics (querying various parameters of the users and the other entities). About every 5-30 seconds a user can change his status, and about every hour a user moves to another group or team or whatever. What NoSQL database should I use? I don't have experience with NoSQL; I just know it is quite fast, and I have read a little about Redis, MongoDB and Cassandra.
Of course, I also store this data model in an RDBMS (except for the online status and statistics).
I am thinking about the following solution:
Store all data as JSON in Redis, prepending an id prefix (e.g. 'user_' + userId):
user_id:{"status":"123", "group":"group_id", "team":"team_id", "firstname":"firstname", "lastname":"lastname", ... other attributes}
group_id:{users:[user_id,user_id,...], ... other group attributes}
team_id:{users:[user_id,user_id,...], ... other team attributes}
...
What would you recommend or propose? Will it be convenient to query such data?
Maybe I can use some popular standard algorithms to compute the statistics (e.g. a Monte Carlo algorithm for percentage statistics, I don't know). Thanks
You could use Redis HyperLogLog, a feature added in Redis 2.8.9.
This blog post describes how to calculate very efficiently some statistics that look quite similar to the ones you need.
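A minimal sketch in Python with redis-py, counting approximate online users overall and per group/team; the key names and ids are illustrative, and HyperLogLog counts are approximate (roughly 0.81% standard error) while each key stays tiny (about 12 KB) regardless of cardinality:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def mark_online(user_id, group_id, team_id):
    # PFADD is O(1); add the user to the global and per-entity counters.
    r.pfadd("online:all", user_id)
    r.pfadd("online:group:" + group_id, user_id)
    r.pfadd("online:team:" + team_id, user_id)

def online_stats(group_id, team_id):
    # PFCOUNT returns an approximate number of distinct users.
    return {
        "total": r.pfcount("online:all"),
        "group": r.pfcount("online:group:" + group_id),
        "team": r.pfcount("online:team:" + team_id),
    }

mark_online("user_42", "group_7", "team_3")
print(online_stats("group_7", "team_3"))

If the statistics need to reset periodically (e.g. a fresh 10-second window), the keys can simply be suffixed with the window timestamp and given a short TTL.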

Amazon Redshift for SaaS application

I am currently testing Redshift for a SaaS near-real-time analytics application.
Query performance is fine on a 100M-row dataset.
However, the concurrency limit of 15 queries per cluster will become a problem when more users are using the application at the same time.
I cannot cache all aggregated results, since we allow customers to customize the filters on each query (ad-hoc querying).
The requirements for the application are:
queries must return results within 10s
ad-hoc queries with filters on more than 100 columns
From 1 to 50 clients connected to the application at the same time
dataset growing at 10M rows / day rate
typical queries are SELECTs with aggregate functions (COUNT, AVG) and 1 or 2 joins
Is Redshift not correct for this use case? What other technologies would you consider for those requirements?
This question was also posted on the Redshift Forum. https://forums.aws.amazon.com/thread.jspa?messageID=498430&#498430
I'm cross-posting my answer for others who find this question via Google. :)
In the old days we would have used an OLAP product for this, something like Essbase or Analysis Services. If you want to look into OLAP, there is a very nice open-source implementation called Mondrian that can run over a variety of databases (including Redshift, AFAIK). Also check out Saiku for an OSS browser-based OLAP query tool.
I think you should test the behaviour of Redshift with more than 15 concurrent queries. I suspect that it will not be noticeable to users, as the queries will simply queue for a second or two.
If you prove that Redshift won't work you could test Vertica's free 3-node edition. It's a bit more mature than Redshift (i.e. it will handle more concurrent users) and much more flexible about data loading.
Hadoop/Impala is overly complex for a dataset of your size, in my opinion. It is also not designed for a large number of concurrent queries or short duration queries.
Shark/Spark is designed for the case where your data is arriving quickly and you have a limited set of metrics that you can pre-calculate. Again, this does not seem to match your requirements.
Good luck.
Redshift is very sensitive to the keys used in joins and group by/order by. There are no dynamic indexes, so usually you define your structure to suit the tasks.
What you need to ensure is that your joins match that structure 100%. Look at the explain plans: you should not have any redistribution or broadcasting, and no leader-node activities (such as sorting). This sounds like the most critical requirement considering the number of queries you are going to have.
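As a hedged illustration, you can pull the plan from Python with psycopg2 (which works against Redshift) and flag the steps that indicate data movement; the hostname, credentials, and the orders/customers query are placeholders:

import psycopg2

QUERY = """
SELECT c.region, COUNT(*), AVG(o.amount)
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.region;
"""

conn = psycopg2.connect(host="my-cluster.example.eu-west-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="bi_reader", password="...")

with conn.cursor() as cur:
    cur.execute("EXPLAIN " + QUERY)
    plan = [row[0] for row in cur.fetchall()]

for line in plan:
    print(line)

# DS_DIST_* / DS_BCAST_INNER steps mean rows are being redistributed or broadcast
# between nodes; DS_DIST_NONE is what you want on the hot joins.
movement = [l for l in plan
            if "DS_BCAST" in l or ("DS_DIST" in l and "DS_DIST_NONE" not in l)]
print("data movement steps:", len(movement))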
The requirement to be able to filter/aggregate on an arbitrary 100 columns can be a problem as well. If the structure (dist keys, sort keys) doesn't match the columns most of the time, you won't be able to take advantage of Redshift's optimisations. However, these are scalability problems: you can increase the number of nodes to match your performance, you just might be surprised by the cost of the optimal solution.
This may not be a serious problem if the number of projected columns is small; otherwise Redshift will have to hold large amounts of data in memory (and eventually spill) while sorting or aggregating (even in a distributed manner), and that can again impact performance.
Beyond scaling, you can always implement sharding or mirroring to overcome some queue/connection limits, or contact AWS support to have some limits lifted.
You should consider pre-aggregation. Redshift can scan billions of rows in seconds as long as it does not need to do transformations like reordering, and it can store petabytes of data, so it's OK if you store data in excess.
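A sketch of what such a pre-aggregated table could look like, built from Python with psycopg2; the table, column names, and connection details are assumptions, and the DISTKEY/SORTKEY should be chosen to match the most common filters:

import psycopg2

conn = psycopg2.connect(host="my-cluster.example.eu-west-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl", password="...")
conn.autocommit = True

with conn.cursor() as cur:
    # Rebuild a daily summary table; collocated by customer, sorted by date.
    cur.execute("DROP TABLE IF EXISTS daily_summary;")
    cur.execute("""
        CREATE TABLE daily_summary
        DISTKEY (customer_id)
        SORTKEY (event_date)
        AS
        SELECT customer_id,
               DATE_TRUNC('day', event_ts)::date AS event_date,
               COUNT(*)    AS events,
               AVG(amount) AS avg_amount
        FROM raw_events
        GROUP BY 1, 2;
    """)

Most dashboards can then hit daily_summary and only fall back to the raw table when they really need row-level detail.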
So, in summary, I don't think your use case is unsuitable based on just the definition you provided. It might require work, and the details depend on the exact usage patterns.