AWS Glue+Athena skip header row - aws-cloudformation

As of the January 19, 2018 update, Athena can skip the header row of files:
Support for ignoring headers. You can use the skip.header.line.count property when defining tables, to allow Athena to ignore headers.
I use AWS Glue in CloudFormation to manage my Athena tables. Using the Glue TableInput, how can I tell Athena to skip the header row?
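For reference, the underlying Athena feature looks like this in plain DDL (a minimal sketch; the table name, columns, and S3 location are made up):
CREATE EXTERNAL TABLE my_table (
  col1 string,
  col2 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/my-prefix/'
TBLPROPERTIES ('skip.header.line.count' = '1');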

Basing this off the full template for AWS::Glue::Table here, make the following change. From:
Resources:
  ...
  MyGlueTable:
    ...
    Properties:
      ...
      TableInput:
        ...
        StorageDescriptor:
          ...
          SerdeInfo:
            Parameters: { "separatorChar" : "," }
To:
Parameters:
  separatorChar: ","
  "skip.header.line.count": 1
That does the trick.
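For context, a hypothetical fuller table definition with that change in place might look like the following; the database name, table name, columns, S3 location, and SerDe class are illustrative guesses, not taken from the linked template:
Resources:
  MyGlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: my_database                 # illustrative name
      TableInput:
        Name: my_table                          # illustrative name
        TableType: EXTERNAL_TABLE
        StorageDescriptor:
          Location: s3://my-bucket/my-prefix/   # illustrative location
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
            - Name: col1
              Type: string
          SerdeInfo:
            # OpenCSVSerde is assumed here because separatorChar is one of its parameters
            SerializationLibrary: org.apache.hadoop.hive.serde2.OpenCSVSerde
            Parameters:
              separatorChar: ","
              "skip.header.line.count": 1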

Related

Invalid value Error on AWS Redshift delivery by Firehose

I am using Kinesis Firehose to deliver to a Redshift database. I am stuck where Firehose tries to execute the COPY query from the saved stream on the S3 bucket.
The error is
ERROR:Invalid value.
That's all. To narrow down this error, I tried to reproduce it without the manifest:
COPY firehose_test_table FROM 's3://xx/terraform-kinesis-firehose-test-stream-2-1-2022-05-19-14-37-02-53dc5a65-ae25-4089-8acf-77e199fd007c.gz' CREDENTIALS 'aws_iam_role=arn:aws:iam::xx' format as json 'auto ignorecase';
The data inside the .gz is the default AWS streaming data:
{"CHANGE":0.58,"PRICE":13.09,"TICKER_SYMBOL":"WAS","SECTOR":"RETAIL"}{"CHANGE":1.17,"PRICE":177.33,"TICKER_SYMBOL":"BNM","SECTOR":"TECHNOLOGY"}{"CHANGE":-0.78,"PRICE":29.5,"TICKER_SYMBOL":"PPL","SECTOR":"HEALTHCARE"}{"CHANGE":-0.5,"PRICE":41.47,"TICKER_SYMBOL":"KFU","SECTOR":"ENERGY"}
and the target table is defined as:
Create table firehose_test_table
(
ticker_symbol varchar(4),
sector varchar(16),
change float,
price float
);
I am not sure what to do next; the error is too unrevealing to understand the problem. I also tried JSONPaths by defining
{
"jsonpaths": [
"$['change']",
"$['price']",
"$['ticker_symbol']",
"$['sector']"
]
}
however, the same error was raised. What am I missing?
A few things to try...
Specify GZIP in the COPY options configuration; this is explicitly called out in the Kinesis Delivery Stream documentation (see the example COPY command below):
Parameters that you can specify in the Amazon Redshift COPY command. These might be required for your configuration. For example, "GZIP" is required if Amazon S3 data compression is enabled.
Explicitly specify Redshift column names in the Kinesis Delivery Stream configuration. The order of the comma-separated list of column names must match the order of the fields in the message: change,price,ticker_symbol,sector.
Query the STL_LOAD_ERRORS Redshift table (STL_LOAD_ERRORS docs) to view the error details of the COPY command; it should show the exact failure. Example:
select * from stl_load_errors order by starttime desc limit 10;
Verify that no varchar field exceeds its column size limit. You can specify the TRUNCATECOLUMNS COPY option if truncation is acceptable for your use case (TRUNCATECOLUMNS docs).
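Putting the GZIP suggestion into the COPY from the question (the S3 key and the truncated role ARN are the values from above, so treat them as placeholders), it would look roughly like:
COPY firehose_test_table
FROM 's3://xx/terraform-kinesis-firehose-test-stream-2-1-2022-05-19-14-37-02-53dc5a65-ae25-4089-8acf-77e199fd007c.gz'
CREDENTIALS 'aws_iam_role=arn:aws:iam::xx'
FORMAT AS JSON 'auto ignorecase'
GZIP;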

Using the Stringify activity in Azure Data Factory

I need to sync a Cosmos DB container to a SQL database. The objects in Cosmos DB look like this:
[
{
id: "d8ab4619-eb3d-4e25-8663-925bd33b9b1e",
buyerIds: [
"4a7c169f-0642-42a9-b5a7-214a646d6c59",
"87a956b3-2aef-43a1-a0f0-29c07519dfbc",
...
]
},
{...}
]
On the SQL side, the sink table contains 2 columns: Id and BuyerId.
What I want is to convert the buyerIds array to a string joined by commas, for instance, so that I can then pass it to a SQL stored procedure.
The SQL stored procedure will then split the string and insert as many rows into the table as there are buyerIds.
In Azure ADF, I tried using a Stringify transformation in a data flow, but I get this error and don't understand what I need to change: Stringify expressions must be a complex type or an array of complex types.
My Stringify transformation takes the buyerIds column as input and performs the following to create the string:
reduce(buyerIds, '', #acc + ',' + #item, #result)
Do you know what I am missing, or is there a simpler way to do it?
Because your property is an array, you'll want to use Flatten; that will allow you to unroll your array for the target relational destination (a rough sketch follows below). Stringify is for turning structures into strings.
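A rough sketch of that Flatten in data flow script terms; the stream name CosmosSource and the transformation name are illustrative, and the exact script your data flow generates may differ:
CosmosSource foldDown(unroll(buyerIds),
    mapColumn(
        id,
        buyerIds
    )) ~> FlattenBuyers
After the unroll, buyerIds holds a single value per row, so it can be mapped to the BuyerId column in the SQL sink and the stored-procedure string split is no longer needed.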

How to configure an AWS Glue crawler to read a CSV file that has a comma in the data?

I have data as follow in csv file in S3 bucket:
"Name"|"Address"|"Age"
----------------------
"John"|"LA,USA"|"27"
I have created the crawler, which has created the table, but when I query the data in Athena, the value containing the comma ("LA,USA") is split across the wrong columns.
How do I configure the AWS Glue crawler to create a catalog table that reads the above data correctly?
You must have figured it out already, but this answer may benefit anyone who visits this question.
This can be resolved either by using a crawler classifier or by modifying the table properties after the table is created.
Using a classifier:
Create a CSV classifier with the "Quote symbol" option set (to " for this data).
Add the classifier to the crawler you create.
Or you can modify the table's SerDe properties by editing the table after the crawler creates it:
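For example, settings along these lines handle the pipe-delimited, quoted sample above (this assumes the OpenCSV SerDe; adjust to the SerDe your table actually uses):
Serde serialization library: org.apache.hadoop.hive.serde2.OpenCSVSerde
Serde parameters:
  separatorChar = |
  quoteChar     = "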

Is there a way to SkipLinesAtEnd in a TextFormat Azure Data Factory

We receive text files from an external partner.
They claim to be CSV but have some awkward pre-header and footer lines.
In an ADF TextFormat I can use "skipLineCount": 6, but at the end I'm running into trouble.
Any suggestions?
I can't find anything like SkipLinesAtEnd.
This is the sample:
TITLE : Liste de NID_C_BG_NPIG configuré.
FILE NAME : Ines_bcn_npig_net_f.csv
CREATION DATE : 09/10/2019 23:18:43
ENVIRONMENT : Production 12c
<Begin of file>
"NID_C";"NID_BG";"N_PIG"
"253";"0";"0"
"253";"0";"1"
"253";"1";"0"
"253";"1";"1"
"253";"2";"0"
"253";"2";"1"
"253";"3";"0"
<End of file>
It seems that you are using the skipLineCount setting in Data Flow. There is no feature like skipLinesAtEnd in ADF.
You could follow the suggestion mentioned by @Joel to use Alter Row.
However, based on the official documentation, it only supports database-type sinks.
So, if you are limited by that, I would suggest you clean the file before the copy job. For example, add an Azure Function activity to cut the extra rows if you know the specific location of the header and footer. Inside the Azure Function, just use code to alter the file; a sketch follows below.
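A minimal Python sketch of that idea, assuming the file sits in Blob Storage and using the begin/end markers from the sample; the connection string, container, and blob names are placeholders:
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",  # placeholder
    container_name="incoming",               # placeholder
    blob_name="Ines_bcn_npig_net_f.csv",
)

# Download the raw file and keep only the rows between the markers.
lines = blob.download_blob().readall().decode("utf-8").splitlines()
start = lines.index("<Begin of file>") + 1
end = lines.index("<End of file>")
cleaned = "\n".join(lines[start:end])

# Overwrite (or write to a staging blob) so the copy activity sees a clean CSV.
blob.upload_blob(cleaned, overwrite=True)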
Jay & Joel are correct in pointing you toward Data Flows to solve this problem. Use the Copy activity in ADF for data movement-focused operations and Data Flows for data transformation.
You'll find the price for data movement similar to that of data transformation.
I would solve this in Data Flow and use a Filter transformation to filter out any row that contains the footer marker (here, "<End of file>"); a sketch of the expression follows below.
You should not need an Alter Row in this case. HTH!
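A possible Filter expression for that, assuming the footer row lands in the first column (NID_C from the sample header); this is a sketch, not the exact expression your flow will need:
!startsWith(NID_C, '<')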

How to make the first row of a csv file the column names when loading into AWS Athena?

I am pipelining CSVs from an S3 bucket to AWS Athena using Glue, and the titles of the columns are just the defaults 'col0', 'col1', etc., while the true column titles are found in the first row entry. Is there a way, either in the pipeline process or in an early PostgreSQL query, to make the first row entry the column names? Ideally avoiding hardcoding the column names in the Glue crawler.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html
Use withHeader=True while reading the data using the Glue API.
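A minimal sketch of that read in a Glue (PySpark) job; the S3 path is a placeholder:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# withHeader tells the csv reader to take column names from the first row.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/my-prefix/"]},  # placeholder path
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)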