I want to return a large result-set of Google Analytics data across a two month period.
However, the total number of results reported is not accurate, or at least not what I expect.
If I narrow the start-date and end-date to a single day, it returns roughly 40k results, which over a two-month period should amount to about 2.4 million records. However, the total results found reported by the API suggests only 350k.
There is a discrepancy and the numbers do not add up when selecting a larger date range. I can confirm there is no gap in the GA data over the two-month period.
It would be great if someone has come across this issue and has found a reason for it.
In your query you need to supply a sampling level:
samplingLevel=DEFAULT Optional.
Use this parameter to set the sampling level (i.e. the number of visits used to
calculate the result) for a reporting query. The allowed values are consistent with
the web interface and include:
•DEFAULT — Returns response with a sample size that balances speed and accuracy.
•FASTER — Returns a fast response with a smaller sample size.
•HIGHER_PRECISION — Returns a more accurate response using a large sample size,
but this may result in the response being slower.
If not supplied, the DEFAULT sampling level will be used.
There is no way to completely remove sampling; large requests will still return sampled data even if you have set the level to HIGHER_PRECISION. Make your requests smaller, going day by day if you have to.
If you want to pay for a premium Google Analytics account, you can extract your data into BigQuery and you will have access to unsampled reports.
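For reference, a minimal sketch of how the sampling level might be set with the Core Reporting API v3 via the google-api-python-client library; the view ID, date range, metrics and dimensions below are placeholders, and credentials are assumed to have been obtained already:

    from googleapiclient.discovery import build

    # Assumes `credentials` was obtained beforehand (OAuth2 or a service account)
    # and that 'ga:XXXXXXXX' is a placeholder for the real view (profile) ID.
    service = build('analytics', 'v3', credentials=credentials)

    response = service.data().ga().get(
        ids='ga:XXXXXXXX',
        start_date='2017-01-01',
        end_date='2017-01-01',              # keep the range small to limit sampling
        metrics='ga:sessions',
        dimensions='ga:dateHour',
        samplingLevel='HIGHER_PRECISION',   # request the largest sample size
        max_results=10000
    ).execute()

    # These response fields indicate whether sampling was applied to the result.
    print(response.get('containsSampledData'))
    print(response.get('sampleSize'), response.get('sampleSpace'))

Checking containsSampledData on each day-sized request is a quick way to confirm whether the totals are still being estimated from a sample.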
Related
I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Context:
We collect data every second so that we can have a real-time view of the sensor measurements.
This, however, is very demanding on storage and is not a requirement beyond enabling real-time monitoring. For example, there is no need to check measurements made last week at such granularity (1-second intervals), hence there is no need to keep such large volumes of data occupying resources. The average value for every 5 minutes would be perfectly fine when consulting the history of values from previous days.
Question:
This poses the question of how to delete existing rows from the database while aggregating the data being deleted and inserting a new row that averages the deleted data over a given interval. For example, I would like to keep raw data (measurements every second) for the present day and aggregated data (an average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked to see if PostgreSQL had anything resembling this functionality but didn't find anything. My main idea is to use a cron job to periodically perform the aggregations/deletions from raw data to aggregated data. Can anyone think of a better option? I very much welcome any suggestions and input.
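For what it's worth, a minimal sketch of that cron-job idea, assuming a hypothetical raw table sensor_data(ts, device_id, value) and a rollup table sensor_data_5min(bucket_start, device_id, avg_value); Thingsboard's actual telemetry schema will differ:

    import psycopg2

    # Roll raw per-second rows older than one day up into 5-minute averages,
    # then delete the raw rows that were just aggregated.
    AGGREGATE_SQL = """
        INSERT INTO sensor_data_5min (bucket_start, device_id, avg_value)
        SELECT to_timestamp(floor(extract(epoch FROM ts) / 300) * 300) AS bucket_start,
               device_id,
               avg(value) AS avg_value
        FROM sensor_data
        WHERE ts < now() - interval '1 day'
        GROUP BY 1, 2
    """

    DELETE_SQL = "DELETE FROM sensor_data WHERE ts < now() - interval '1 day'"

    def rollup():
        # Both statements run in one transaction (now() is fixed at transaction
        # start, so both use the same cutoff); the connection context manager
        # commits on success and rolls back on error.
        with psycopg2.connect("dbname=thingsboard user=postgres") as conn:
            with conn.cursor() as cur:
                cur.execute(AGGREGATE_SQL)
                cur.execute(DELETE_SQL)

    if __name__ == "__main__":
        rollup()   # schedule this script from cron, e.g. hourly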
I am recording and monitoring SLOs (server-side request duration) of Kubernetes Pods via Prometheus using a HistogramVec within a Golang HTTP server. Every request’s duration is timed and persisted as described in the Prometheus practices and partitioned by status code, method and HTTP path.
I am running autoscaling experiments therefore Pods are created & terminated. After each experiment I fetch the metrics for all pods (including the ones already deleted) and plot a cumulative distribution, e.g.:
In order to make these plots more “accurate”, I opted for many smaller histogram buckets, and I aggregate & analyze the data locally rather than using the built-in histogram quantiles. The ideal query would therefore return only the most recent value for all time series that have existed over a specified time range (green + red circles).
Currently, I am using a range query within the script generating all the plots, e.g.:
http://localhost:9090/api/v1/query_range?query=http_request_duration_milliseconds_bucket{path="/service/login"}&start=1591803898&end=1591804801&step=5s
However, I am aware that this is highly inefficient and costly as it retrieves a huge amount of surplus data even though I am only interested in the very last value for each individual time series. On the other hand, if I use an instant query, I only get the values for a specified moment, thus I’d need to shoot multiple queries & first find out when some time series (red circles) were marked stale - which doesn’t seem great either.
So, basically I'm looking for a way to work around the Prometheus basics on staleness, and stop stale time series from "disappearing":
If no sample is found (by default) 5 minutes before a sampling timestamp, no value is returned for that time series at this point in time. This effectively means that time series "disappear" from graphs at times where their latest collected sample is older than 5 minutes or after they are marked stale.
I am almost certain that there is a way to do this (e.g. an option to simply include stale time series), but I haven’t been able to put it together so far.
The solution is to use the last_over_time() function. For example, the following query returns the last values seen during the last hour for each histogram bucket:
last_over_time(http_request_duration_milliseconds_bucket{path="/service/login"}[1h])
This query must be sent to /api/v1/query instead of /api/v1/query_range, since /api/v1/query evaluates the query only once, at the given time timestamp, while /api/v1/query_range evaluates the query 1+(end-start)/step times, at every point in the time range [start ... end] with interval step.
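As an illustration, a small sketch of how such an instant query might be issued from the plotting script (the endpoint and label values are taken from the question; the evaluation timestamp is an assumption standing in for the end of an experiment):

    import requests

    PROM_URL = "http://localhost:9090/api/v1/query"

    resp = requests.get(PROM_URL, params={
        "query": 'last_over_time(http_request_duration_milliseconds_bucket{path="/service/login"}[1h])',
        "time": "1591804801",   # evaluate once, at the end of the experiment window
    })
    resp.raise_for_status()

    # One entry per time series (bucket/pod combination), each carrying only the
    # most recent sample seen within the 1h lookback window.
    for series in resp.json()["data"]["result"]:
        print(series["metric"].get("le"), series["value"][1])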
Note also that a big number of histogram buckets multiplied by a big number of unique path label values may result in too many time series, which is known as high cardinality. See this article for more details.
See also VictoriaMetrics histograms, which solve common issues in Prometheus histograms.
I found another way to do this, following the input in this thread as well as increasing the lookbackDelta QuerySpec option.
Now shooting queries such as
http://localhost:9090/api/v1/query?query=max_over_time(http_request_duration_milliseconds_bucket{path="/service/login",le="101"}[2h])
return the desired result.
I'm working on a tool to fetch about 3 years of historic data from a site, in order to perform some data analysis & machine learning.
The dimensions of the report I am requesting are:
[ ga:cityId, ga:dateHour, ga:userType, ga:deviceCategory ]
And my starting point is to import to a postgres db (the data may live elsewhere eventually but we have Good Reasons for starting with a relational database).
I've defined a unique index on the [ ga:cityId, ga:dateHour, ga:userType, ga:deviceCategory ] tuple for the Postgres table, and my import job routinely fails every 30,000-50,000 rows due to a duplicate of that tuple.
What would cause google to return duplicate rows?
I'm batching the inserts by 1000 rows / statement because row-at-a-time would be very time consuming, so I think my best workaround is to disable the unique index for the duration of the initial import, de-dupe, and then re-enable it and do daily imports of fresh data row-at-a-time. Other strategies?
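One alternative strategy, sketched below under assumed table and column names: since PostgreSQL 9.5 the batch insert itself can skip conflicting rows with ON CONFLICT DO NOTHING, so the unique index can stay in place during the initial import:

    import psycopg2
    from psycopg2.extras import execute_values

    # Table and column names are assumptions standing in for the real schema;
    # the unique index is on (city_id, date_hour, user_type, device_category).
    INSERT_SQL = """
        INSERT INTO ga_report (city_id, date_hour, user_type, device_category, sessions)
        VALUES %s
        ON CONFLICT (city_id, date_hour, user_type, device_category) DO NOTHING
    """

    def insert_batch(conn, rows):
        # rows: a list of up to ~1000 tuples, matching the current batch size
        with conn.cursor() as cur:
            execute_values(cur, INSERT_SQL, rows, page_size=1000)
        conn.commit()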
There shouldn't be duplicate reports coming back from Google if the time ranges are unique.
Are you using absolute or relative (to the present) dates? If the latter, you should ensure that changes in the time period caused by the progression of the relative time (i.e. the present) don't cause an overlap.
Using a relative time period could also cause gaps in your data.
I have a requirement to develop a reporting solution for a system which has a large number of data items, a significant number of which are free-text fields. Almost every value in the tables needs to be accessible to a team of analysts who carry out reporting, analysis and data provision.
It has been suggested that an OLAP solution would be appropriate for delivering this; however, the general need is to retrieve records, not aggregates, and each cube would have a large number of dimensions (~150) and very few measures (number of records, length of time). I have been told that this approach will let us answer any question we ask of it, but we do not have many repeated business questions; rather, we need to list the raw records out.
Is OLAP really a logical way to go with this, or will the cubes take too long to process and limit the level of access to the data that the users require?
I would like to create huge data sets (25 ints a row, 30 rows per second, multiply that by 60).
On the other hand, I want to query it for rows that match a certain condition (e.g. rows in which no more than 5 of the 25 ints are out of a certain range).
And I want it all in real time, i.e. inserting and querying continuously.
Does someone know how to do this, preferably using a cloud service (Amazon? Google?)?
Thanks
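For illustration only, the filter described above (keep rows in which no more than 5 of the 25 ints fall outside a given range) could be expressed roughly like this, independent of whichever storage engine ends up being used; the bounds are placeholders:

    # Keep rows in which no more than 5 of the 25 ints fall outside [LOW, HIGH].
    LOW, HIGH = 0, 100            # placeholder bounds
    MAX_OUT_OF_RANGE = 5

    def row_matches(row):
        out_of_range = sum(1 for v in row if v < LOW or v > HIGH)
        return out_of_range <= MAX_OUT_OF_RANGE

    sample = [3, 250, 7, -1, 42] + [10] * 20   # 25 ints, two of them out of range
    print(row_matches(sample))                  # True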
Try OneTick; though it's designed for market pricing data, I think it can be adapted to other forms of high-frequency data. You can then build filtering/aggregation rules that help you analyze the data.