Not all of the [GA4] BigQuery Export schema is available via the GA4 API?

I was looking for the number of IDFA / user consents collected per app release version.
I saw that it exists in the [GA4] BigQuery Export schema (device.is_limited_ad_tracking).
But I couldn't find it in the GA4 API.
Is there any alternative?

The API, the web UI, and the BigQuery export are different sources of data. Not only do they have different schemas (available dimensions and metrics), but when compared, the data often will not match. This is by design.
This article compares the data sources:
https://analyticscanvas.com/4-ways-to-export-ga4-data/
This article explains why they don't match:
https://analyticscanvas.com/3-reasons-your-ga4-data-doesnt-match/
In most cases, you'll find the solution is to use the BigQuery export. It has the richest set of data and doesn't have quota limits.
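If you go the export route, a query along the following lines can answer the original question. This is a minimal sketch, assuming the standard GA4 export layout; the project, dataset (analytics_123456789), and date range are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Count distinct users per app version, split by the limited-ad-tracking flag.
# Project, dataset, and date range are placeholders.
sql = """
SELECT
  app_info.version AS app_version,
  device.is_limited_ad_tracking AS limited_ad_tracking,
  COUNT(DISTINCT user_pseudo_id) AS users
FROM `my-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
GROUP BY app_version, limited_ad_tracking
ORDER BY app_version
"""
for row in client.query(sql).result():
    print(row.app_version, row.limited_ad_tracking, row.users)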

Related

Data Import in GA4 for dimensions

We are planning to use GA4 for the analytics use case of an eCommerce website. We have use cases to enrich our view_item event data, and all other events, with custom parameters which will then be used as dimensions in the Analysis Hub. I have the following questions:
Size Limit - The data import CSV file limit is 1 GB; in our use case we may have more than 10 GB of data. Instead of uploading a CSV file, is there a way to enrich event data using BigQuery, etc.?
Custom Parameters - If I introduce a custom parameter during my enrichment process, can I use that as a dimension in the Analysis Hub?
Thanks
Yes. BigQuery is a database, so you can upload whatever you want, for example into a separate table, and build an ad hoc query for whatever you are interested in obtaining (see the sketch below).
Yes, but you have to register it as a custom dimension in the GA4 interface first.
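To make the first answer concrete, here is a minimal sketch of the "separate table plus ad hoc query" approach, assuming the property's BigQuery export is already enabled. The bucket, dataset, table, and column names (item_enrichment, margin_bucket) are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# 1) Load the (possibly >1 GB) enrichment CSV straight into BigQuery,
#    bypassing the GA4 Data Import size limit. Names are placeholders.
client.load_table_from_uri(
    "gs://my-bucket/item_enrichment.csv",
    "my-project.enrichment.item_enrichment",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
).result()

# 2) Join it to view_item events from the GA4 export in an ad hoc query.
sql = """
WITH views AS (
  SELECT event_date, item.item_id
  FROM `my-project.analytics_123456789.events_*`, UNNEST(items) AS item
  WHERE event_name = 'view_item'
)
SELECT v.event_date, v.item_id, enr.margin_bucket, COUNT(*) AS view_items
FROM views AS v
JOIN `my-project.enrichment.item_enrichment` AS enr
  ON enr.item_id = v.item_id
GROUP BY v.event_date, v.item_id, enr.margin_bucket
"""
for row in client.query(sql).result():
    print(row.event_date, row.item_id, row.margin_bucket, row.view_items)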

BigQuery view is not working if I use the BigQuery plugin

I've been using the BigQuery plugin under the source category. When I used a BigQuery view, the pipeline threw an error saying views are not allowed. Also, if I used a permanent table that contains repeated columns, it threw an "unsupported mode 'repeated'" error while retrieving the schema. Does anyone have any information on this?
The BigQuery source exports the data from the table into temporary GCS buckets and then reads it in the pipeline. Since BigQuery views cannot be exported (please see the limitations here - https://cloud.google.com/bigquery/docs/views), the pipeline fails.
Also, the BigQuery source does not currently support repeated columns. The work is in progress - https://issues.cask.co/browse/CDAP-15256. Is this what you are looking for?
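As a workaround (not part of the answer above, just a sketch), you can materialize the view and flatten the REPEATED columns into a plain table first, and point the BigQuery source plugin at that table. Project, dataset, table, and column names below are placeholders.
from google.cloud import bigquery

# Materialize a view and flatten a REPEATED column into a plain table
# that the CDAP BigQuery source can export. All names are placeholders.
sql = """
CREATE OR REPLACE TABLE `my-project.staging.orders_flat` AS
SELECT
  o.order_id,
  o.customer_id,
  item.sku,        -- one output row per element of the REPEATED `items` field
  item.quantity
FROM `my-project.reporting.orders_view` AS o,
     UNNEST(o.items) AS item
"""
bigquery.Client().query(sql).result()  # the flat table can now be used by the plugin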

How to run Apache Ignite SQL queries over the REST API

I am trying to access Apache Ignite over the HTTP REST API. I see that the API mostly provides the ability to request data by a specific key (meaning you should always know/have the key to query data).
However, I would like to understand:
1) whether we have the ability to, say, query a set of records filtered on one or more of the value fields of the POJO value;
2) whether we can run join-like SQL queries through the REST API when my data is spread across more than one cache and some fields in them have common values to create the relation among the caches.
Take a look at the REST API documentation for the list of the available commands - there are commands that allow you to execute SQL.
Also take a look at the Ignite SQL documentation for the syntax reference and some examples.
Finally, please see the Ignite SQL examples - you can find them in the Ignite distribution or in the git repository. E.g. SqlDmlExample should give you an idea of how to execute various SQL queries on an Ignite cache.
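For reference, here is a minimal sketch of both cases through the REST API using the qryfldexe (SQL fields query) command. The host, cache, table, and field names are placeholders; adjust them to your own data model.
import requests

IGNITE_REST = "http://localhost:8080/ignite"   # default REST connector endpoint

params = {
    "cmd": "qryfldexe",          # execute a SQL fields query
    "pageSize": 100,
    "cacheName": "PersonCache",  # cache whose SQL schema the query runs against
    # (1) filter on value fields and (2) join two caches in one statement:
    "qry": (
        "SELECT p.name, o.name "
        "FROM Person p JOIN \"OrgCache\".Organization o ON p.orgId = o.id "
        "WHERE p.salary > ?"
    ),
    "arg1": 50000,               # bound to the '?' placeholder
}

resp = requests.get(IGNITE_REST, params=params)
print(resp.json())               # result rows are returned under response.items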

Does IBM DataWorks support CSV delimiters?

Does the IBM DataWorks Data Load API support CSV files as an input source?
The answer is yes. To accomplish this, you have to provide the structure of the file in the request payload. This is explained in the API documentation, Creating a Data Load Activity. This is an excerpt of the documentation:
Within the columns array, specify the columns to provision data
from. If Analytics for Hadoop, Amazon S3, or SoftLayer Object Storage
is the source, you must specify the columns. If you specify columns,
only the columns that you specify are provisioned to the target...
The Data Load application included in DataWorks is provided just as an example and assumes the input file has 2 columns, the first being an INTEGER and the second one a VARCHAR.
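For illustration only, such a payload might be shaped roughly like the sketch below; the real field names and structure are defined in the Creating a Data Load Activity documentation, so treat everything here as hypothetical.
import json

# Hypothetical sketch of a Data Load request payload with a columns array;
# the actual field names are defined by the DataWorks API documentation.
payload = {
    "source": {
        "connection": "my-object-storage-connection",        # hypothetical
        "file": "sales.csv",
        "format": {"type": "CSV", "delimiter": ",", "header": True},
        "columns": [                                          # structure of the CSV
            {"name": "ORDER_ID", "type": "INTEGER"},
            {"name": "ORDER_DESC", "type": "VARCHAR", "length": 256},
        ],
    },
    "target": {"table": "SALES"},                             # hypothetical
}
print(json.dumps(payload, indent=2))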
Note: This question was answered on dW Answers by user emalaga.

MongoDB to Redshift

We have a few collections in mongodb that we wish to transfer to redshift (on an automatic incremental daily basis).
How can we do it? Should we export the mongo to csv?
I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo, but we found Redshift offered very large performance improvements for querying. So first of all we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.
To store JSON data in Redshift, you first need to create SQL DDL defining the schema in Redshift, i.e. a CREATE TABLE script.
You can use a tool like Variety to help as it can give you some insight into your Mongo schema. However it does struggle with big datasets - you might need to subsample your dataset.
Alternatively DDLgenerator can generate DDL from various sources including CSV or JSON. This also struggles with large datasets (well the dataset I was dealing with was 120GB).
So in theory you could use MongoExport to generate CSV or JSON from Mongo and then run it through DDL generator to get a DDL.
In practice I found using JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:
mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
head data.json > sample.json
ddlgenerator postgresql sample.json
Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
Next you upload the data into Redshift.
If you have exported JSON, you need to use Redshift's COPY from JSON feature. You need to define a JSONPaths file to do this.
For more information check out the Snowplow blog - they use JSONpaths to map the JSON on to a relational schema. See their blog post about why people might want to read JSON to Redshift.
Turning the JSON into columns allows much faster queries than other approaches such as using JSON_EXTRACT_PATH_TEXT.
For incremental backups, it depends if data is being added or data is changing. For analytics, it's normally the former. The approach I used is to export the analytic data once a day, then copy it into Redshift in an incremental fashion.
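As a rough sketch of that daily step (not the exact code used for this project), assuming each day's JSON export lands under a dated S3 prefix and a JSONPaths file maps keys to columns; the bucket, table, IAM role, and connection details are placeholders.
from datetime import date
import psycopg2

# COPY one day's JSON export from S3 into Redshift via a JSONPaths file.
day = date.today().strftime("%Y/%m/%d")
copy_sql = f"""
COPY analytics.events
FROM 's3://my-bucket/mongo-export/{day}/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
JSON 's3://my-bucket/jsonpaths/events_jsonpaths.json'
TIMEFORMAT 'auto';
"""

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="loader", password="***")
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)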
Here are some related resources although in the end I did not use them:
Spotify has an open-source project called Luigi - this code claims to upload JSON to Redshift but I haven't used it, so I don't know if it works.
Amiato have a web page that says they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
Related Reddit question
Blog post about dealing with JSON arrays in Redshift
Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It'll take your mongo collections and flatten them into their own tables in redshift.
AWS Database Migration Service (DMS) added support for MongoDB and Amazon DynamoDB, so I think from now on the best option to migrate from MongoDB to Redshift is DMS.
MongoDB versions 2.6.x and 3.x as a database source
Document Mode and Table Mode supported
Supports change data capture (CDC)
Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
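As a rough sketch of what setting this up with boto3 might look like; identifiers, ARNs, hosts, and credentials are placeholders, and a replication instance plus a Redshift target endpoint are assumed to already exist.
import json
import boto3

dms = boto3.client("dms")

# Source endpoint pointing at the MongoDB deployment (placeholder values).
source = dms.create_endpoint(
    EndpointIdentifier="mongo-source",
    EndpointType="source",
    EngineName="mongodb",
    ServerName="mongo.example.internal",
    Port=27017,
    DatabaseName="analytics",
    Username="dms_user",
    Password="***",
)

# Include every collection in the "analytics" database.
table_mappings = json.dumps({
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "analytics", "table-name": "%"},
        "rule-action": "include",
    }]
})

task = dms.create_replication_task(
    ReplicationTaskIdentifier="mongo-to-redshift",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn="arn:aws:dms:...:endpoint:redshift-target",   # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep:my-instance",       # placeholder
    MigrationType="full-load-and-cdc",   # initial load plus ongoing CDC
    TableMappings=table_mappings,
)
print(task["ReplicationTask"]["Status"])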
A few questions that would be helpful to know would be:
Is this an add-only, always-increasing incremental sync, i.e. data is only being added and not updated / removed, or is your Redshift instance interested only in additions?
Is data inconsistency due to deletes / updates happening at the source and not being fed to the Redshift instance acceptable?
Does it need to be a daily incremental batch, or can it be real-time as it happens?
Depending on your situation, maybe mongoexport works for you, but you have to understand its shortcomings, which can be found at http://docs.mongodb.org/manual/reference/program/mongoexport/.
I had to tackle the same issue (not on a daily basis though).
As mentioned above, you can use mongoexport in order to export the data, but keep in mind that Redshift doesn't support arrays, so if your collections' data contains arrays you'll find it a bit problematic.
My solution to this was to pipe the mongoexport output into a small utility program I wrote that transforms the mongoexport JSON rows into my desired CSV output (see the sketch at the end of this answer).
Piping the output also allows you to parallelize the process.
mongoexport allows you to add a MongoDB query to the command, so if your collection data supports it, you can spawn N different mongoexport processes, pipe their results into the other program, and decrease the total runtime of the migration process.
Later on, I uploaded the files to S3, and performed a COPY into the relevant table.
This should be a pretty easy solution.
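Here is a minimal sketch of that kind of transform utility, reading mongoexport JSON from stdin and writing CSV to stdout; the field names are hypothetical and the array handling is just one possible choice.
import csv
import json
import sys

FIELDS = ["_id", "user_id", "created_at", "tags"]   # hypothetical schema

writer = csv.writer(sys.stdout)
writer.writerow(FIELDS)

for line in sys.stdin:
    line = line.strip().rstrip(",")      # tolerate plain or --jsonArray output
    if line in ("", "[", "]"):
        continue
    doc = json.loads(line)
    writer.writerow([
        str(doc.get("_id", "")),
        doc.get("user_id", ""),
        doc.get("created_at", ""),
        "|".join(map(str, doc.get("tags", []))),   # collapse the array into one column
    ])

# Usage (pipe it, and parallelize by running several exports with different --query ranges):
#   mongoexport --db mydb --collection events --query '{"shard": 1}' \
#     | python transform.py > events_1.csv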
Stitch Data is the best tool I've ever seen to replicate incrementally from MongoDB to Redshift within a few clicks and minutes.
It automatically and dynamically detects DML and DDL for the tables being replicated.