Is there any way to download the entire Google Analytics 4 data for a certain period? - google-analytics-api

I was reading some API documents, including https://developers.google.com/analytics/devguides/reporting/data/v1/basics. However, this API only lets you download specific dimensions and metrics per request:
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange, Dimension, Metric, RunReportRequest

client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property=f"properties/{property_id}",
    dimensions=[Dimension(name="country")],
    metrics=[Metric(name="activeUsers")],
    date_ranges=[DateRange(start_date="2020-09-01", end_date="2020-09-15")],
)
response = client.run_report(request)
Is there any way to download the entire dataset as JSON or something similar?

BigQuery Export lets you download all of your raw events. Using the Data API, you could instead create individual reports with, say, 5 dimensions & metrics each; then you could download your data in slices through, say, 10 of those reports.
BigQuery and the Data API have different schemas. For example, BigQuery gives you the event timestamp, whereas the most precise time granularity the Data API exposes is the hour. So your decision between the Data API & BigQuery may depend on which dimensions & metrics you need.
What dimensions & metrics are most important to you?
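If you stay on the Data API, here is a rough sketch of that slicing approach; the dimension/metric groupings below are only illustrative placeholders, and property_id stands in for your GA4 property:

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange, Dimension, Metric, RunReportRequest

property_id = "123456789"  # your GA4 property ID
client = BetaAnalyticsDataClient()

# Illustrative slices: each report uses a small, compatible set of
# dimensions & metrics; together they cover the combinations you care about.
report_slices = [
    (["country", "city"], ["activeUsers", "sessions"]),
    (["deviceCategory", "operatingSystem"], ["screenPageViews", "eventCount"]),
]

data = {}
for dims, mets in report_slices:
    response = client.run_report(RunReportRequest(
        property=f"properties/{property_id}",
        dimensions=[Dimension(name=d) for d in dims],
        metrics=[Metric(name=m) for m in mets],
        date_ranges=[DateRange(start_date="2020-09-01", end_date="2020-09-15")],
    ))
    data[tuple(dims)] = [
        {**{d: dv.value for d, dv in zip(dims, row.dimension_values)},
         **{m: mv.value for m, mv in zip(mets, row.metric_values)}}
        for row in response.rows
    ]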

"Donwnload the entire data" is a vague statement. First you need to decide what kind of data you need. Aggregated data or Raw data?
Aggregated data e.g. daily / hourly activeUsers, events, sessions etc. can be extracted using Data API as you have already tried. Each API requests accepts fixed number of dimensions and metrics. So as per your business requirements you should decide the dimension-metric combinations and extract the data using the API.
Raw events can be extracted from BigQuery if you have linked GA4 property to it. Here is the BigQuery table schema for that where each row has clientId, timestamp and other event level details which are not available with Data API. Again based on business requirements you can write queries and extract the data.
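For the BigQuery route, here is a minimal sketch using the google-cloud-bigquery client; the project name is a placeholder, and analytics_123456789 stands in for your export dataset (analytics_<property_id>, sharded daily as events_YYYYMMDD):

from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials/project

query = """
    SELECT
      event_date,
      event_timestamp,
      event_name,
      user_pseudo_id,
      device.category AS device_category,
      geo.country     AS country
    FROM `my-project.analytics_123456789.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20200901' AND '20200915'
"""

for row in client.query(query).result():
    print(row["event_timestamp"], row["event_name"], row["user_pseudo_id"])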

Related

What is better for Reports, Data Analytics and BI: AWS Redshift or AWS ElasticSearch?

We have 4 applications: A, B, C, and D. The applications scrape different social-network data from different sources. Each application has its own database.
Application A scrapes e.g. Instagram accounts, Instagram posts, and Instagram stories from external source X.
Application B scrapes e.g. Instagram account follower and following count history from external source Y.
Application C scrapes e.g. Instagram account audience data (e.g. gender statistics: male vs. female, age statistics, country statistics, etc.) from external source Z.
Application D scrapes TikTok data from external source W.
Our data analytics team has to create different kinds of analyses:
e.g. data (a table) with Instagram post engagement (likes + posts / total number of followers for that month) for specific Instagram accounts.
e.g. Instagram account development: total number of followers per month, total number of posts per month, average post engagement per month, etc.
e.g. account follower insights: we analyze only a sample of an Instagram account's followers, e.g. 5,000 out of 1,000,000, to see who our followers follow besides us (top 10 followings).
and a lot of other similar kinds of reports.
Right now we have 3 TB of data in our OLTP Postgres DB, and it is no longer workable for us. We are running really heavy queries for reporting and BI, and we want to move the social-network data to a data warehouse or OpenSearch.
We are on AWS and we want to use Redshift or OpenSearch for our data analysis.
We don't need real-time processing. Which is the better solution for us, Redshift or OpenSearch?
Any ideas are welcome.
I expect to end up with infrastructure that can run heavy queries for the data analytics team for reporting and BI.
Based on what you've described, it sounds like AWS Redshift would be a better fit for your needs. Redshift is designed for data warehousing and can handle large-scale data processing, analysis, and reporting, which aligns with your goal of analyzing large amounts of data from multiple sources. Redshift also offers advanced query optimization capabilities, which can help your team run complex queries more efficiently.
OpenSearch, on the other hand, is a search and analytics engine designed for full-text search, log analytics, and near-real-time use cases. It is not optimized for complex joins and aggregations over structured data from multiple sources, which is what your reporting workload involves, so it is likely not the best fit here.
When it comes to infrastructure, it's important to consider the size of your data, the complexity of your queries, and the number of users accessing the system. Redshift can scale to handle large amounts of data, and you can choose the appropriate node type and cluster size based on your needs. You can also use features such as Amazon Redshift Spectrum to analyze data in external data sources like Amazon S3.
It's worth noting that moving data to a data warehouse like Redshift may involve some initial setup and data migration costs. However, in the long run, having a dedicated data warehouse can improve the efficiency and scalability of your data analytics processes.
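As a rough sketch of the kind of monthly aggregation Redshift handles well, run through AWS's redshift_connector Python driver; the connection details, table names, and columns here are hypothetical stand-ins for your scraped data:

import redshift_connector

# Placeholder connection details.
conn = redshift_connector.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    database="social",
    user="analyst",
    password="secret",
)

# Hypothetical schema: instagram_posts(account_id, posted_at, likes)
# and instagram_followers(account_id, snapshot_month, follower_count).
sql = """
    SELECT
        p.account_id,
        DATE_TRUNC('month', p.posted_at)            AS month,
        SUM(p.likes)::float / MAX(f.follower_count) AS engagement
    FROM instagram_posts p
    JOIN instagram_followers f
      ON f.account_id = p.account_id
     AND f.snapshot_month = DATE_TRUNC('month', p.posted_at)
    GROUP BY 1, 2
    ORDER BY 1, 2;
"""

cursor = conn.cursor()
cursor.execute(sql)
for account_id, month, engagement in cursor.fetchall():
    print(account_id, month, engagement)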

How can I get the range of values, min & max for each of the columns in the micro-partition in Snowflake?

The following thread says Snowflake stores metadata about all rows in a micro-partition, including the range of values for each of the columns in the micro-partition: https://community.snowflake.com/s/question/0D53r00009kz6HpCAI/are-min-max-values-stored-in-a-micro-partitions-metadata-. What function can I use to retrieve this information? I tried running SYSTEM$CLUSTERING_INFORMATION and it returns total_partition_count, depth, and overlap-related information, but nothing about the column values in the micro-partitions. Thanks!
Snowflake stores this metadata about each partition internally to optimize queries, but it does not publish it.
Part of the reason is security: metadata about each partition can reveal data that should be masked for some users, or hidden through row-level security.
But if there is an interesting business use case behind wanting this data, Snowflake is listening.
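If what you need in practice is the min/max per column, the closest workaround today is to compute it yourself with an ordinary query, alongside the partition statistics that SYSTEM$CLUSTERING_INFORMATION does expose. Here is a minimal sketch with snowflake-connector-python; the connection parameters, table, and column names are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="secret",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = conn.cursor()

# Partition-level statistics Snowflake does expose (depth, overlaps, ...).
cur.execute("SELECT SYSTEM$CLUSTERING_INFORMATION('my_table', '(order_date)')")
print(cur.fetchone()[0])  # JSON string

# Column-level min/max computed directly; the per-partition pruning
# metadata itself stays internal to Snowflake.
cur.execute("SELECT MIN(order_date), MAX(order_date) FROM my_table")
print(cur.fetchone())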

How to overcome API/websocket limitations with OHLC data for a trading platform with lots of users?

I'm using CCXT for some REST API calls and websockets. It's OK for 1 user, but if I wanted to have many users on the platform, how would I go about an in-house solution?
Currently each chart uses either websockets or REST calls; if I have 20 charts, that's 20 calls, and if I add users, that's 20 × however many users. If I fetch a complete coin list with real-time prices from one exchange, that just slows everything down.
Some ideas I have thought about so far are:
Use proxies with REST/Websockets
Use TimescaleDB to store the data and serve that, OR
Use caching on the server, and serve that to the users
Would one of these be a solution? There has to be a way to overcome rate limiting and reduce the number of calls to the exchanges.
Probably, it's good to think about having separate layers to:
receive market data (a single connection that broadcasts data to the OHLC processors)
process OHLC histograms (subscribing to the internal market data)
serve histogram data (subscribing to the processed data)
The market data stream is huge, and if you think about these layers independently, it will be easier to scale and even decouple the components later if necessary.
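Here is a rough sketch of the first two layers, assuming ccxt's websocket interface (ccxt.pro) with watch_ohlcv; error handling, per-symbol routing, and backpressure are left out:

import asyncio
import ccxt.pro as ccxtpro  # assumption: a recent ccxt build that ships the Pro/websocket API

subscribers: list[asyncio.Queue] = []

async def market_data_layer(symbol: str = "BTC/USDT", timeframe: str = "1m"):
    """Single upstream websocket connection; broadcasts candles internally."""
    exchange = ccxtpro.binance()
    try:
        while True:
            candles = await exchange.watch_ohlcv(symbol, timeframe)
            latest = candles[-1]  # [timestamp, open, high, low, close, volume]
            for queue in subscribers:
                queue.put_nowait(latest)
    finally:
        await exchange.close()

async def ohlc_processor(name: str):
    """Internal consumer: would aggregate/persist instead of printing."""
    queue: asyncio.Queue = asyncio.Queue()
    subscribers.append(queue)
    while True:
        candle = await queue.get()
        print(name, candle)

async def main():
    await asyncio.gather(market_data_layer(), ohlc_processor("worker-1"))

asyncio.run(main())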
With TimescaleDB, you can build materialized views that make the information easy to access and retrieve. Each materialized view can have a continuous aggregate policy set based on the interval of the histograms.
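A minimal sketch of such a continuous aggregate, assuming a hypothetical trades(time, symbol, price, amount) hypertable; it is run here through psycopg2, but the SQL is the interesting part:

import psycopg2

conn = psycopg2.connect("dbname=market user=postgres")  # placeholder DSN
conn.autocommit = True  # continuous aggregates cannot be created inside a transaction
cur = conn.cursor()

# 1-minute OHLCV buckets maintained by TimescaleDB.
cur.execute("""
    CREATE MATERIALIZED VIEW ohlc_1m
    WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 minute', time) AS bucket,
           symbol,
           first(price, time) AS open,
           max(price)         AS high,
           min(price)         AS low,
           last(price, time)  AS close,
           sum(amount)        AS volume
    FROM trades
    GROUP BY bucket, symbol;
""")

# Keep the aggregate refreshed on the candle interval.
cur.execute("""
    SELECT add_continuous_aggregate_policy('ohlc_1m',
        start_offset      => INTERVAL '1 hour',
        end_offset        => INTERVAL '1 minute',
        schedule_interval => INTERVAL '1 minute');
""")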
Fetching all data all the time for all the users is not a good idea.
Pagination can help by bringing the visible histograms in first and limiting the query results, avoiding heavy I/O on the server and big chunks of memory.

Tableau extract vs live

I just need a bit more clarity around Tableau extract vs. live. I have 40 people who will use Tableau and a bunch of custom SQL scripts. If we go down the extract path, will the custom SQL queries only run once, with all instances of Tableau using a single result set, or will each instance of Tableau run the custom SQL separately and only cache those results locally?
There are some aspects of your configuration that aren't completely clear from your question. Tableau extracts are a useful tool - they are essentially a temporary, but persistent, cache of query results. They act similarly to a materialized view in many respects.
You will usually want to employ your extract in a central location, often on Tableau Server, so that it is shared by many users. That's typical. With some work, you can make each individual Tableau Desktop user have a copy of the extract (say by distributing packaged workbooks). That makes sense in some environments, say with remote disconnected users, but is not the norm. That use case is similar to sending out data marts to analysts each month with information drawn from a central warehouse.
So the answer to your question is that Tableau provides features that you can employ as you choose to best serve your particular use case -- either replicated or shared extracts. The trick is then just to learn how extracts work and employ them as desired.
The easiest way to have a shared extract, is to publish it to Tableau Server, either embedded in a workbook or separately as a data source (which is then referenced by workbooks). The easiest way to replicate extracts is to export your workbook as a packaged workbook, after first making an extract.
A Tableau data source is the meta data that references an original source, e.g. CSV, database, etc. A Tableau data source can optionally include an extract that shadows the original source. You can refresh or append to the extract to see new data. If published to Tableau Server, you can have the refreshes happen on schedule.
Storing the extract centrally on Tableau Server is beneficial, especially for data that changes relatively infrequently. You can capture the query results, offload work from the database, reduce network traffic and speed your visualizations.
You can further improve performance by filtering (and even aggregating) extracts to have only the data needed to display your viz. Very useful for large data sources like web server logs to do the aggregation once at extract creation time. Extracts can also just capture the results of long running SQL queries instead of repeating them at visualization time.
If you do make aggregated extracts, just be careful that any further aggregation you do in the visualization makes sense. SUMs of SUMs and MINs of MINs are well defined; averages of averages, etc., are not always meaningful.
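A tiny illustration of that last point, with made-up numbers: when group sizes differ, the average of per-group averages drifts away from the true average over all rows.

# Two groups whose averages were pre-computed in an aggregated extract.
group_a = [10] * 98     # 98 rows, average 10
group_b = [100, 100]    # 2 rows, average 100

avg_of_avgs = (sum(group_a) / len(group_a) + sum(group_b) / len(group_b)) / 2
true_avg = sum(group_a + group_b) / len(group_a + group_b)

print(avg_of_avgs)  # 55.0 -- misleading
print(true_avg)     # 11.8 -- the actual average over all 100 rows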
If you use an extract, it will behave like a materialized SQL table, so changes upstream of the extract will not influence the result until the extract is refreshed.
An extract is used when the data needs to be processed very fast. In this case, a copy of the source data is stored in Tableau's in-memory engine, so query execution is very fast compared to a live connection. The only problem with this method is that the data won't update automatically when the source data is updated.
A live connection is used when handling real-time data. Each query runs against the source data, so performance won't be as good as with an extract.
If you need to work on a static database, use an extract; otherwise, use live.
I get the feeling from your question that you are worried about performance, which is why you are wondering whether your users should use a Tableau extract or a live connection.
In my opinion, for both cases (live vs extract) it all depends on your infrastructure and the size of the table. It makes no sense to make an extract of a huge table that would take hours to download (for example, 1 billion rows and 400 columns).
If all your users connect directly to a database (not to a Tableau Server), you may run into different issues. If the tables they connect to are relatively small and your database handles multiple concurrent users well, that may be OK. But if your database has to run many resource-intensive queries in parallel, on big tables, on a database that is not optimized for many simultaneous users and is located far away with high latency, it will be a nightmare to find a solution. In the worst-case scenario you may have to change your data structures and upgrade your infrastructure to allow 40 users to access the data simultaneously.

Maximum number of rows with web data connector as data source

How many rows can the web data connector handle when importing data into Tableau? Or, what is the maximum number of rows I can generally import?
There are no limits on how many rows of data you can bring back with your web data connector; performance scales pretty well as you bring back more and more rows, so it's really just a matter of how much time you are willing to wait.
The total time will be a combination of:
1. The time it takes for you to retrieve the data from the API.
2. The time it takes our database to create an extract with that data once your web data connector passes it back to Tableau.
#2 will be comparable to the time it would take to create an extract from an Excel file with the same schema and size as the data in your web data connector.
On a related note, the underlying database (the Tableau Data Engine) handles a large number of rows well, but it is not as well suited to a large number of columns, so our guidance is to bring back fewer than 60 columns if possible.