I am trying to insert data into BigQuery using Google Cloud Dataprep. I created a recipe and set the first row as the header row, but when I run it on multiple files, it also inserts the header rows into my BigQuery table.
Is anybody else facing this problem?
Welcome to StackOverflow, Andy!
I think I'm correctly understanding your problem, but I want to make sure since I'm making some assumptions:
You have multiple files imported in Dataprep
You created a recipe for the first file and converted row 1 to a header
You applied a UNION step to merge the additional files
Your output contains the header rows from the additional files
If that's correct, the issue is that the header rows in the other files aren't being removed simply because Dataprep doesn't know what they are. In most cases, Dataprep will detect the file structure and you won't have to manually specify the header row. When that fails, however, UNION steps get a little funny like this—but you can definitely fix it in Dataprep.
Workarounds:
Apply a Recipe to Each Input File
Simply add a recipe to each file that converts the first row to a header—then instead of selecting your original file in the main recipe's UNION, select the other recipes (Dataprep will run them before merging the data).
While this takes some extra effort, it's doable for a small number of files. The advantage here is that you don't have to worry about whether your data may contain the header value—but I'd recommend using the other option if you're able to.
Use a Custom Filter Formula to Delete All Header Rows
The other option is a bit more dependent on your data, but lets you do everything in the main recipe. For example, after setting headers from the first file and applying your UNION, you would add a "Filter rows using custom formula" step (or click Filter Rows > On column values > Custom Filter...), then match using a column that wouldn't otherwise contain the header string (e.g. CustomerID == "CustomerID"). Integer columns work great since you don't have to worry about whether a real value could contain the header string. The resulting wrangle script should look something like this:
header sourcerownumber: 1
[union step goes here]
filter type: custom rowType: single row: CustomerID == 'CustomerID' action: Delete
Note: You may be tempted to do this by using $sourcerownumber, but that doesn't exist due to the union. I'm hoping that they'll eventually support it for this use case though.
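For anyone who wants to see the logic of the filter step outside Dataprep, here is a minimal Python sketch of the same idea. It is not Dataprep code: the function name union_without_headers and the sample data are made up for illustration, and it assumes plain comma-separated inputs that all share the same header.

```python
import csv
import io


def union_without_headers(files, key_column):
    """Concatenate CSV inputs, keeping the header only from the first one."""
    header = None
    key_index = None
    rows = []
    for f in files:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            if header is None:
                # First row of the first file becomes the header.
                header = row
                key_index = header.index(key_column)
                continue
            # Same idea as: filter ... row: CustomerID == 'CustomerID' action: Delete
            if row[key_index] == key_column:
                continue
            rows.append(row)
    return header, rows


# Hypothetical sample inputs standing in for two imported files.
file_a = io.StringIO("CustomerID,Name\n1,Ann\n2,Bob\n")
file_b = io.StringIO("CustomerID,Name\n3,Cy\n")
header, rows = union_without_headers([file_a, file_b], "CustomerID")
```

As with the Dataprep filter, this only works if a real data row can never contain the header string in the chosen column.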
These aren't the only ways you could eliminate the headers, but should provide two easy options for you.
As a pro-tip, you can copy a line of the wrangle script above and paste it after clicking "New Step" in your recipe and it'll set up the filter the same way that I did so you don't have to start from scratch. Just change the column name/value and you should be good to go.
Again, welcome to the site. If any of the assumptions above are incorrect, update your original question with the additional details and let me know in a comment, and I'll be happy to help you out further.
There's a working report, but I want to change the visibility of certain rows based on a completely new DB select that should be executed when the report is created.
It would be ideal if I could load the values of said select into an array or a list and then simply toggle a row's visibility by comparing, e.g., the Row Id with the values in the array.
I usually solve a problem like this by creating a view that delivers all the essential information in each row and is used as the main data source, but I was wondering if there's an elegant way to solve such a task within Crystal Reports.
I can think of three ways to include control data like this into the report:
One row of config data: If you can arrange it that your config data query returns one row of data, it can be just added to the data sources of the main report, without any links to tables and views already there.
Config resultset for matching: If you have to match main data results to config values row by row, e.g. by the Row Id you mentioned, add this config query to your report and link it to the main data source accordingly. (You are probably already doing this within your pre-created view on database side.)
Query config by subreport: The most flexible but also the most time-consuming option is to add a subreport to the report header, add the config data query to it, and store the config results in (shared) variables as needed in the subreport. The shared variable values can then be used in the main report to control section visibility.
Yes, one of the 3rd-party Crystal Reports UFLs (User Function Libraries) listed here provides such a function.
I have a csv file in my ADLS:
a,A,b
1,1,1
2,2,2
3,3,3
When I load this data into a delimited text Dataset in ADF with first row as header the data preview seems correct (see picture below). The schema has the names a, A and b for columns.
However, now I want to use this dataset in a Mapping Data Flow, and here the Data Preview mode breaks. The second column name (A) is seen as a duplicate and no preview can be loaded.
All other functionality in Data Flow keeps working fine; it is only the Data Preview tab that gives an error. All subsequent transformation nodes also give this error in the Data Preview.
Moreover, if the data contains two "exact" same column names (e.g. a, a, b), then the Dataset recognizes the columns as duplicates and puts a "1" and "2" after each name. It is only when they are case-sensitive unequal and case-insensitive equal that the Dataset doesn't get an error and Data Flow does.
Is this a known error? Is it possible to change a specific column name in the dataset before loading it into Data Flow? Or is there just something I'm missing?
I tested it and got the same error in the source Data Preview:
I asked Azure Support for help and they are testing it now. Please wait for my update.
Update:
I sent Azure Support the test.csv file. They tested it and replied. If you insist on using "first row as header", Data Factory cannot resolve the error. The solution is to edit the csv file. Even Azure SQL Database doesn't support creating a table with two columns of the same name, and column names are case-insensitive.
For example, this code is not supported:
Here's the full email message:
Hi Leon,
Good morning! Thanks for your information.
I have tested the sample file you share with me and reproduce the issue. The data preview is alright by default when I connect to your sample file.
But I noticed when we did the troubleshooting session that a, A, b are column names, so you have checked "first row as header" in your source connection. Please confirm that's right and you want to use a, A, b as column headers. If so, it should be an error because there's no text-transform for "A" in schema.
Hope you can understand the column name doesn’t influence the data transform and it’s okay to change it to make sure no errors block the data flow.
There are two tips for you to remove the block: one is to change the column name in your source csv directly, or you can click the import schema button (in the screenshot below) in the schema tab and choose a sample file to redefine the schema, which also allows you to change the column name.
Hope this helps.
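If you go with the first tip (changing the column names in the source csv), the renaming could be scripted before the file reaches Data Factory. This is only a rough Python sketch: the function name dedupe_case_insensitive and the "_2" suffix convention are my own choices, not anything Data Factory does.

```python
import csv
import io


def dedupe_case_insensitive(names):
    """Return column names that stay unique under case-insensitive comparison."""
    used = set()  # lowercased names already taken
    result = []
    for name in names:
        candidate = name
        suffix = 1
        # Keep adding a numeric suffix until the name no longer collides.
        while candidate.lower() in used:
            suffix += 1
            candidate = f"{name}_{suffix}"
        used.add(candidate.lower())
        result.append(candidate)
    return result


# The sample file from the question, rewritten with a safe header row.
source = io.StringIO("a,A,b\n1,1,1\n2,2,2\n3,3,3\n")
reader = csv.reader(source)
new_header = dedupe_case_insensitive(next(reader))

fixed = io.StringIO()
writer = csv.writer(fixed, lineterminator="\n")
writer.writerow(new_header)
writer.writerows(reader)
fixed_text = fixed.getvalue()
```

The data rows are copied unchanged; only the header row is rewritten.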
I am trying to clean up the columns of a table to create a clear attribution/reference for reporting on my digital marketing campaigns. The goal is to keep one part of a string while deleting all others. All strings within my marketing campaigns have symbols separating each substring.
Attached are pictures of my current table and of the desired table.
I am essentially trying to keep only one part of each string and delete all other substrings. I have already managed to do this successfully by applying the following statement, given to me in a separate thread.
update adwords
set campaign = substring(campaign from '%-%-#"%#"' for '#')
where campaign like '%-%-%';
This worked perfectly, however, I do not fully understand why and have not found a clear answer thus far on this forum.
How would I apply this to future rows? Ad group and match type can be used for this purpose.
Many Thanks.
First thing: you do not modify source data. Do ETL instead, and transform it to a final stage. Do that periodically, which also takes care of new data.
You could just create a trigger which should work for all new data, but there are 2 caveats with that:
Failure will lead to missing data and you not being able to QA it.
If you modify the source data in an incorrect way by mistake, you cannot undo it unless you have a backup, and even then it's just too hard.
So instead look at ETL tools like Talend or Pentaho Kettle; create your own ETL scripts, or whatever. Use Jenkins to schedule all of this periodically and you're set.
Now, about the transformation itself.
for '#'
indicates that # is the escape character. Each #" (the escape character followed by a double quote) then acts as a marker, and the two markers delimit the part of the match that is returned.
substring(campaign from '%-%-#"%#"' for '#')
thus selects everything between the two #" markers in the pattern. % is a wildcard, the same as in LIKE comparisons, so everything in the last group is returned. This can be done more cleanly with regular expressions:
substring(campaign from '.*?-.*?-(.*)')
For the second column the regex would be ^(.*?)\s*\{
And for the third one - similar: ^(.*?)\s*\}
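If you want to try these patterns before running them against the table, here is a small Python sketch of what the regex form of substring() does: it returns the first parenthesized group, or NULL (None here) when nothing matches. The helper name first_group and the sample values are made up, since I can't see your data, and Python's re engine is not identical to PostgreSQL's, although these simple patterns behave the same way in both.

```python
import re


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Hypothetical values in the style described in the question.
campaign = "UK-Brand-Summer Sale"
ad_group = "Shoes {exact}"

campaign_part = first_group(r".*?-.*?-(.*)", campaign)
ad_group_part = first_group(r"^(.*?)\s*\{", ad_group)
no_match = first_group(r".*?-.*?-(.*)", "no dashes here")
```

Rows that do not match yield None, which is why the statements that follow restrict the rows with a WHERE clause.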
I would create the new table like this:
CREATE TABLE aw_final AS
SELECT
substring(campaign FROM '^\w{2}-\w+-(.*)$') AS campaign,
substring(ad_group FROM '^(\w+)\s*\{\w+\}$') AS ad_group,
substring(match_type FROM '^(\w+)\s*\}$') AS match_type
FROM adwords
WHERE campaign ~ '^\w{2}-\w+-(.*)$'
But if you must do an update, this would be how:
UPDATE adwords SET
campaign = substring(campaign FROM '^\w{2}-\w+-(.*)$'),
ad_group = substring(ad_group FROM '^(\w+)\s*\{\w+\}$'),
match_type = substring(match_type FROM '^(\w+)\s*\}$')
WHERE campaign ~ '^\w{2}-\w+-(.*)$'
I have hundreds of rows of raw data in a CSV. It is stacked like this:
ID.....SS.....P.....Window
S1235...345...48.....Fall
S1235...460....68....Winter
S1123....389....50....Fall
S1123.....598....98.....Winter
What I would like to do is match it up so that all of the IDs match with similar ID on one row only (without doing a manual copy and paste) so it looks like this:
ID.....SS.....P.....Window
S1235...345...48.....Fall..... S1235...460....68....Winter
S1123....389....50....Fall.....S1123.....598....98.....Winter
I don't care if there are multiple columns...because I will manually delete what I don't need. What I need is for all the data with the same ID to appear on one row only. Each piece of information needs to be in its own cell.
I appreciate the help in advance...I'm not very good with this stuff...but it would be a HUGE time saver if you can help me out!
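One possible way to do this without copy and paste is a short script. The sketch below is in Python and assumes the file is an ordinary comma-separated CSV (the dots in the sample above only stand in for column spacing) with the ID in the first column; the function name widen_by_id is made up for illustration.

```python
import csv
import io
from collections import defaultdict


def widen_by_id(rows, id_index=0):
    """Group rows by ID and lay each group's rows side by side on one line."""
    grouped = defaultdict(list)  # dicts keep the order in which IDs first appear
    for row in rows:
        grouped[row[id_index]].extend(row)
    return list(grouped.values())


# The sample data from the question, assumed to be comma-separated.
raw = """ID,SS,P,Window
S1235,345,48,Fall
S1235,460,68,Winter
S1123,389,50,Fall
S1123,598,98,Winter
"""
reader = csv.reader(io.StringIO(raw))
header = next(reader)
wide = widen_by_id(reader)
```

Each entry in wide is one output row containing every value for that ID, each in its own cell; it can be written back out with csv.writer and the extra ID columns deleted afterwards.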
I'm trying to copy some field values to a duplicate database. One record at a time. This is used for history and so I can delete some records in the original database to keep it fast.
I don't want to manually save the values in a variable because there are hundreds of fields. So I want to go to the first field, save the field name and value and then go over to the other database and save the data. Then run a 'Go to Next Field' and loop through all the fields.
This works perfectly, but here is the problem: When a field is a calculation you cannot tab into it and therefore 'Go to Next Field' doesn't work. It skips it.
I thought of doing a 'Go to Object' but then I need to name all the objects and I can't find a script to name objects.
Can anyone out there think of a solution?
Thanks!
This is one of those problems where I always found it easier to do an export/import.
Export all the data you want from the one database, and then import it into the other database. All you need to do is:
Manually specify which fields you want to copy
Map the data from the export to the right fields in the new database/table
You can even write a script to do these things for you.
There are several ways to achieve this.
To make a "history file", I have found there are several cases out there, so let's take a look.
CASE ONE
Single file: I just want to keep a very large file with historical data, because I need to erase all data in my main file.
In this case, you should create a "clone" table (in the same file or in another file, it makes no difference). Then change every calculation field to the type of its calculation result (number, text, date, and so on). Remove any auto-entered value or calculation from every field (auto number, auto creation date, etc.). You will then have a "plain table" with no calculations or auto-entered data.
Then add a field to control duplicate data. If you have, let's say, a unique invoice number for each record, you can use it for this. But if you do not have a field that uniquely identifies the record, then you have to create one.
To create such a field, I recommend adding a new field to the clone table, setting it as an auto-entered calculation, and building a field combination that is unique, something like this: invoiceNumber & "-" & lineNumber & "-" & date.
On the clone table, make sure that validation for this field is set to "always", that empty values are not allowed, and that the value must be unique.
Once you have set up the clone table, you can import your records, making sure that the auto-enter option is on. You can do it as many times as you like; new records will be added and duplicates will not.
If you want, you can write a script that moves all the current records to the historical table before deleting them.
NOTE:
This technique works fine when the data you are trying to keep does not change over time, that is, once a record is created it is never modified.
CASE TWO
A historical table must be created but some fields are updated.
In the beginning I thought historical data never changes. In some cases I found this is not true, for example when I want to keep historical invoices but at the same time track whether they are paid or not.
In this case you can use the same technique as above, but instead of importing data, you must update data based on the "unique" field that identifies the record.
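To make the difference between the two cases concrete, here is a rough sketch of the logic in Python rather than FileMaker. It is only an illustration: the history table is modelled as a dictionary keyed by the unique field, the paid field is a hypothetical example, and the function names are mine.

```python
def unique_key(record):
    # Mirrors the calculation: invoiceNumber & "-" & lineNumber & "-" & date
    return f"{record['invoiceNumber']}-{record['lineNumber']}-{record['date']}"


def archive(history, records, update_existing=False):
    """Copy records into history; return how many were added or updated.

    update_existing=False is CASE ONE (duplicates are skipped),
    update_existing=True is CASE TWO (matching records are updated).
    """
    changed = 0
    for record in records:
        key = unique_key(record)
        if key in history and not update_existing:
            continue  # duplicate rejected, like the "unique" validation
        if history.get(key) != record:
            history[key] = dict(record)
            changed += 1
    return changed


history = {}
batch = [{"invoiceNumber": 1001, "lineNumber": 1, "date": "2020-01-15", "paid": False}]
first_import = archive(history, batch)   # record is added
second_import = archive(history, batch)  # same record again: nothing added

paid_batch = [{"invoiceNumber": 1001, "lineNumber": 1, "date": "2020-01-15", "paid": True}]
skipped = archive(history, paid_batch)                        # CASE ONE ignores the change
updated = archive(history, paid_batch, update_existing=True)  # CASE TWO applies it
```

In FileMaker itself the same behaviour comes from the validation on the unique field (case one) or from an update import matched on that field (case two).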
Hope this technique helps
FileMaker's FieldNames() function, along with GetField(), can give you a list of field names and then their values.