Does the IBM DataWorks Data Load API support CSV files as input source?
The answer is yes. To accomplish this, you have to provide the structure of the file in the request payload. This is explained in the API documentation under Creating a Data Load Activity. This is an excerpt of the documentation:
Within the columns array, specify the columns to provision data
from. If Analytics for Hadoop, Amazon S3, or SoftLayer Object Storage
is the source, you must specify the columns. If you specify columns,
only the columns that you specify are provisioned to the target...
The Data Load application included in DataWorks is provided just as an example and assumes the input file has two columns: the first an INTEGER and the second a VARCHAR.
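As a minimal sketch only, and assuming the endpoint path and payload field names below (they are illustrative, not taken from the documentation; check the Creating a Data Load Activity reference for the actual schema), the request could look roughly like this:

```python
import requests

# Hypothetical sketch: the endpoint path, credentials, and payload field names
# are assumptions for illustration only. Consult the DataWorks API documentation
# ("Creating a Data Load Activity") for the real payload schema.
DATAWORKS_URL = "https://<dataworks-host>/dc/v1/activities"  # placeholder host/path

payload = {
    "name": "load-csv-example",
    "source": {
        "connection": "<my-object-storage-connection>",  # placeholder connection name
        "tables": [{
            "name": "input.csv",
            # Describe the CSV structure explicitly, as required for object-storage sources.
            "columns": [
                {"name": "ID",   "type": "INTEGER"},
                {"name": "NAME", "type": "VARCHAR", "length": 255},
            ],
        }],
    },
    "target": {
        "connection": "<my-target-connection>",  # placeholder connection name
    },
}

resp = requests.post(DATAWORKS_URL, json=payload, auth=("<user>", "<password>"))
resp.raise_for_status()
print(resp.json())
```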
Note: This question was answered on dW Answers by user emalaga.
Related
I want to update a source Excel column with a particular string.
My source contains n columns. I need to check whether the string apple exists in any one of the columns. If the value exists in any column, I need to replace apple with the string orange and output the Excel file. How can I do this in ADF?
Note: I cannot use Data Flows since we are using a self-hosted VM.
Excel files have a lot of limitations in ADF; for example, Excel is not supported as a sink in the Copy activity or in Data Flow.
You can raise a feature request for that with the ADF team.
So, try the above operation with a CSV, copy the result to a CSV in Blob storage, and later convert it to Excel on your local machine.
For operations like this, Data Flow is a better option than ordinary activities, since Data Flow is designed for transformations.
However, Data Flow does not support self-hosted linked services.
So, as a workaround, first copy the Excel file as CSV to Blob storage using a Copy activity, and create a Blob storage linked service for it to use in the Data Flow.
Now follow the process below in the Data Flow.
Source CSV from Blob:
Derived column transformation:
Give the condition for each column, e.g. case(col1=="apple", "orange", col1), repeating the expression for col2, col3, and so on.
Sink:
In the Sink settings, choose Output to single file.
After pipeline execution, a CSV will be generated in the Blob storage. You can convert it to Excel on your local machine.
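For that last local step, a small pandas sketch (file names are placeholders; writing .xlsx requires the openpyxl package) could look like this:

```python
import pandas as pd

# Placeholder file name: use the CSV you downloaded from Blob storage.
df = pd.read_csv("dataflow_output.csv")

# Write it back out as an Excel workbook (uses the openpyxl engine for .xlsx).
df.to_excel("dataflow_output.xlsx", index=False)
```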
We are performing data ingestion of Dataverse [Common Data Service apps] entities into ADLS Gen2 using Azure Data Factory. We see a few columns from the Dataverse source missing, i.e. not copied into ADLS, specifically those with the Dataverse data type Choice.
Are all Dataverse column data types supported by the ADF linked service? Please suggest a fix or any workaround.
Are all Dataverse column data types supported by the ADF linked service?
Yes, all Dataverse column data types are supported.
For the missing columns, consider the following points:
When you copy data from Dynamics, explicit column mapping from Dynamics to sink is optional. But we highly recommend the mapping to ensure a deterministic copy result.
When the service imports a schema in the authoring UI, it infers the schema. It does so by sampling the top rows from the Dynamics query result to initialize the source column list. In that case, columns with no values in the top rows are omitted. The same behavior also applies to data preview and copy executions if there is no explicit mapping. You can review and add more columns into the mapping, which are honored during copy runtime.
To consume Dataverse Choice columns using ADF, use a Data Flow activity with a Derived Column transformation, because Choice values are written as an integer label rather than a text label (to maintain consistency during edits). The integer-to-text label mapping is stored in the Microsoft.Athena.TrickleFeedService/table-EntityMetadata.json file.
Refer to this official Microsoft document to implement the same.
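As an illustrative sketch only, here is what mapping the integer choice codes back to text labels after the data lands could look like in Python. The key names inside the EntityMetadata.json file are assumptions; inspect your own exported metadata file before relying on them.

```python
import json
import pandas as pd

# Assumption: the metadata file exposes a list of option-set entries with the
# option value and its localized label. The key names below are assumed;
# verify them against your actual <table>-EntityMetadata.json.
with open("account-EntityMetadata.json") as f:   # placeholder file name
    metadata = json.load(f)

# Hypothetical helper: build {option set name: {int code: text label}}.
def build_choice_map(meta: dict) -> dict:
    choice_map = {}
    for entry in meta.get("OptionSetMetadata", []):   # assumed key name
        name = entry.get("OptionSetName")             # assumed key name
        code = entry.get("Option")                    # assumed key name
        label = entry.get("LocalizedLabel")           # assumed key name
        choice_map.setdefault(name, {})[code] = label
    return choice_map

choice_map = build_choice_map(metadata)

# Apply the mapping to the exported data (placeholder file and column names).
df = pd.read_csv("account.csv")
df["industrycode_label"] = df["industrycode"].map(choice_map.get("industrycode", {}))
df.to_csv("account_with_labels.csv", index=False)
```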
I was looking for the number of IDFA user consents collected per app release version.
I saw it exists in the [GA4] BigQuery Export schema (device.is_limited_ad_tracking).
But I couldn't find it in GA4 itself.
Is there any alternative?
The API, web UI, and BigQuery export are different sources of data. Not only do they have different schemas (available dimensions and metrics); when compared, the data often will not match. This is by design.
This article compares the data sources:
https://analyticscanvas.com/4-ways-to-export-ga4-data/
This article explains why they don't match:
https://analyticscanvas.com/3-reasons-your-ga4-data-doesnt-match/
In most cases, you'll find the solution is to use the BigQuery export. It has the richest set of data and doesn't have quota limits.
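For instance, a rough sketch of counting users per app version split by the limited-ad-tracking flag directly from the export could look like this. The project and dataset names and the date range are placeholders; app_info.version, device.is_limited_ad_tracking, and user_pseudo_id are fields from the GA4 export schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder project/dataset: replace analytics_123456789 with your export dataset.
query = """
SELECT
  app_info.version AS app_version,
  device.is_limited_ad_tracking AS is_limited_ad_tracking,
  COUNT(DISTINCT user_pseudo_id) AS users
FROM `my-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
GROUP BY app_version, is_limited_ad_tracking
ORDER BY app_version
"""

for row in client.query(query).result():
    print(row.app_version, row.is_limited_ad_tracking, row.users)
```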
I'm used to using Dataprep to build recipes on JSON and CSV files from Cloud Storage, but today I tried to ingest a table from BigQuery and could not parameterize it.
Is it possible to do that?
Here are some screenshots to illustrate my question:
The prefix that I need
The standard does not work
From Cloud Storage works
In order to ingest a table from BigQuery, you can directly create a dataset with SQL. I am not sure what you would like to achieve with the 'Search' input, but it does not accept regular expressions. So the '*' is not needed; just by writing 'event_', the interface will filter the matching entries.
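If the goal is to pull in everything sharing the event_ prefix through a SQL-backed dataset, a sketch of the kind of wildcard query you could use is below, wrapped in a small Python script so it can be tested with the BigQuery client first. The project and dataset names are placeholders, and to_dataframe needs the pandas/db-dtypes extras installed.

```python
from google.cloud import bigquery

# Sketch only: project and dataset names are placeholders. The wildcard
# `event_*` matches every table in the dataset whose name starts with "event_",
# which is the same kind of query you could back a Dataprep SQL dataset with.
query = """
SELECT *
FROM `my-project.my_dataset.event_*`
"""

client = bigquery.Client()
df = client.query(query).to_dataframe()
print(df.head())
```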
I have 3 questions for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in the AWS Glue forum; here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown predicates feature; however, it currently works with partitioned data on S3 only. There is a feature request to support it for JDBC connections, though.
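In the meantime, one common workaround is to push the date filter into the query itself by reading through Spark's plain JDBC reader inside the Glue job. A sketch, with the connection details, table, column, and date all as placeholders:

```python
# Sketch of a Glue job that pushes a date filter down to PostgreSQL by wrapping
# the query in Spark's JDBC reader. Connection details and names are placeholders.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

jdbc_url = "jdbc:postgresql://my-rds-host:5432/mydb"  # placeholder
# The subquery below is executed on the database side, so only matching rows are loaded.
query = "(SELECT * FROM my_table WHERE created_at >= DATE '2020-01-01') AS t"  # placeholder filter

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", query)
      .option("user", "my_user")          # placeholder credentials
      .option("password", "my_password")
      .option("driver", "org.postgresql.Driver")
      .load())

# Convert back to a DynamicFrame if you want to keep using Glue transforms/sinks.
dyf = DynamicFrame.fromDF(df, glue_context, "filtered_source")
```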
It's not possible to specify the names of the output files. However, it looks like there is an option of renaming the files afterwards (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
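A sketch of that rename-after-write approach with boto3, with the bucket, prefix, and target naming scheme as placeholders; keep in mind it is a copy plus delete, so it doubles the write cost and is not atomic:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-output-bucket"          # placeholder
prefix = "glue-output/run-1/"        # placeholder prefix the Glue job wrote to

# List the part files Glue produced and copy each one to a friendlier name.
# (list_objects_v2 returns at most 1000 keys per call; paginate for larger outputs.)
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for i, obj in enumerate(resp.get("Contents", [])):
    new_key = f"glue-output/renamed/my_table_part_{i:04d}.parquet"  # placeholder naming scheme
    s3.copy_object(Bucket=bucket, Key=new_key,
                   CopySource={"Bucket": bucket, "Key": obj["Key"]})
    s3.delete_object(Bucket=bucket, Key=obj["Key"])
```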
You can't really control the size of the output files. There is an option to control the minimum number of files using coalesce, though. Also, starting from Spark 2.2, it is possible to set the maximum number of records per file via the config spark.sql.files.maxRecordsPerFile.
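Both options look roughly like this, continuing from the DataFrame df and SparkSession spark in the earlier sketch; the file counts, record cap, and output paths are placeholders:

```python
# Option 1: reduce the number of output files with coalesce before writing.
coalesced = df.coalesce(4)  # at most 4 output files
coalesced.write.mode("overwrite").parquet("s3://my-output-bucket/coalesced/")

# Option 2 (Spark 2.2+): cap the number of records written per output file.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 100000)
df.write.mode("overwrite").parquet("s3://my-output-bucket/capped/")
```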