Get s3 folder partitions with AWS Glue create_dynamic_frame.from_options - pyspark

I'm creating a dynamic frame with create_dynamic_frame.from_options that pulls data directly from s3.
However, I have multiple partitions on my raw data that don't show up in the schema this way, because they aren't actually part of the data, they're part of the s3 folder structure.
If the partitions were part of the data I could use partitionKeys=["date"], but date is not a column, it is a folder.
How can I detect these partitions using from_options or some other mechanism?
I can't use a glue crawler to detect the full schema including partitions, because my data is too nested for glue crawlers to handle.

Another way to deal with partitions is to register them in the Glue Data catalog either manually or programmatically.
You can use the glue client in a lambda function.
The example below is from a Lambda function, triggered when a file lands in the data lake, that parses the file name to create partitions in a target table in the Glue catalog. (The code is adapted from an Adobe Analytics use case in the AWS archives: amazon-archives/athena-adobe-datafeed-splitter.)
import os
import boto3

def create_glue_client():
    """Create and return a Glue client for the region this AWS Lambda job is running in"""
    current_region = os.environ['AWS_REGION']  # With Glue, we only support writing to the region where this code runs
    return boto3.client('glue', region_name=current_region)

def does_partition_exist(glue_client, database, table, part_values):
    """Test if a specific partition exists in a database.table"""
    try:
        glue_client.get_partition(DatabaseName=database, TableName=table, PartitionValues=part_values)
        return True
    except glue_client.exceptions.EntityNotFoundException:
        return False

def add_partition(glue_client, database_name, s3_report_base, key):
    """Add a partition to the target table for a specific date"""
    # key_example = 01-xxxxxxxx_2019-11-07.tsv.gz
    partition_date = key.split("_")[1].split(".")[0]
    # partition_date = "2019-11-07"
    year, month, day = partition_date.split('-')
    # year, month, day = ["2019", "11", "07"]
    part = key[:2]
    # part = "01"
    if does_partition_exist(glue_client, database_name, "target_table", [year, month, day, part]):
        return
    # get headers python list from csv file in S3
    headers = get_headers(s3_headers_path)
    # create new partition
    glue_client.create_partition(
        DatabaseName=database_name,
        TableName="target_table",
        PartitionInput={
            "Values": [year, month, day, part],
            "StorageDescriptor": storage_descriptor(
                [{"Type": "string", "Name": name} for name in headers],  # columns
                '%s/year=%s/month=%s/day=%s/part=%s' % (s3_report_base, year, month, day, part)  # location
            ),
            "Parameters": {}
        }
    )

def storage_descriptor(columns, location):
    """Data Catalog storage descriptor with the desired columns and S3 location"""
    return {
        "Columns": columns,
        "Location": location,
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
            "Parameters": {
                "separatorChar": "\t"
            }
        },
        "BucketColumns": [],  # Required or SHOW CREATE TABLE fails
        "Parameters": {}  # Required or create_dynamic_frame.from_catalog fails
    }
In your case, with an existing file structure, you can use the function below to list the file keys in your raw bucket and parse them to create partitions (a sketch tying the two together follows the snippet):
s3 = boto3.client('s3')  # list_objects_v2 is a client method, not a resource method

def get_s3_keys(bucket):
    """Get a list of keys in an S3 bucket (note: list_objects_v2 returns at most 1000 keys per call)."""
    keys = []
    resp = s3.list_objects_v2(Bucket=bucket)
    for obj in resp['Contents']:
        keys.append(obj['Key'])
    return keys
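For example, a minimal sketch that ties the two functions together (not from the original answer; the bucket name, database name, and S3 base path below are placeholders, and it assumes the keys follow the 01-xxxxxxxx_YYYY-MM-DD.tsv.gz pattern that add_partition above expects):
glue_client = create_glue_client()

# Backfill partitions for every file already sitting in the raw bucket.
# "raw-bucket" and "database_name" are placeholders; the base path should point
# at the root of the partitioned data.
for key in get_s3_keys("raw-bucket"):
    file_name = key.split("/")[-1]  # drop any folder prefix, keep only the file name
    add_partition(glue_client, "database_name", "s3://raw-bucket/reports", file_name)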
Then you can build your dynamic frame as follows:
datasource = glueContext.create_dynamic_frame.from_catalog(
    database = "database_name",
    table_name = "target_table",
    transformation_ctx = "datasource1",
    push_down_predicate = "(year='2020' and month='5' and day='4')"
)

Another workaround is to expose the file path of each record as a column using the attachFilename option (similar to input_file_name in Spark) and parse the partition values manually from the path strings (see the sketch after the snippet below).
df = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options = {"paths": paths, "groupFiles": "none"},
    format="csv",
    format_options = {
        "withHeader": True,
        "multiLine": True,
        "quoteChar": '"',
        "attachFilename": "source_file"
    },
    transformation_ctx="table_source_s3"
)
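As a rough sketch of the parsing step (not from the original answer; it assumes the S3 paths contain a Hive-style date=YYYY-MM-DD folder and that converting to a Spark DataFrame is acceptable):
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import regexp_extract

# Convert to a Spark DataFrame and derive a "date" column from the attached path,
# assuming paths look like s3://bucket/prefix/date=2020-05-04/file.csv
spark_df = df.toDF().withColumn(
    "date",
    regexp_extract("source_file", r"date=(\d{4}-\d{2}-\d{2})", 1)
)

# If a DynamicFrame is needed again afterwards, convert it back
df_with_date = DynamicFrame.fromDF(spark_df, glueContext, "df_with_date")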

Related

Flink SQL CLI: bring header records

I'm new to the Flink SQL CLI and I want to create a sink from my Kafka cluster.
I've read the documentation and, as I understand it, the headers are of type MAP<STRING, BYTES> and they carry all the important information.
When I use the SQL CLI I try to create a sink table with this command:
CREATE TABLE KafkaSink (
  `headers` MAP<STRING, BYTES> METADATA
) WITH (
  'connector' = 'kafka',
  'topic' = 'MyTopic',
  'properties.bootstrap.servers' = 'LocalHost',
  'properties.group.id' = 'MyGroypID',
  'scan.startup.mode' = 'earliest-offset',
  'value.format' = 'json'
);
But when I try to read the data with select * from KafkaSink limit 10; it returns null records.
I've tried to run queries like
select headers.col1 from a limit 10;
And I've also tried to create the sink table with different types for the headers column:
...
`headers` STRING
...
...
`headers` MAP<STRING, STRING>
...
...
`headers` ROW(COL1 VARCHAR, COL2 VARCHAR...)
...
But it returns nothing; however, when I bring in the offset column from the Kafka cluster it gives me the offset but not the headers.
Can someone explain my error?
I want to create a Kafka sink with the Flink SQL CLI.
OK, as far as I can tell, when I changed to
'format' = 'debezium-json'
I could see the JSON more clearly.
I followed the JSON schema, which in my case was
{
  "data": {...},
  "metadata": {...}
}
So instead of bringing in the headers I'm bringing in the data with all the columns that I need: the data as a string, and the columns as, for example,
data.col1, data.col2
To see the records, just
select
  json_value(data, '$.Col1') as Col1
from Table;
and it works!

How to create a nested json schema with Stream

The INPUT_STREAM in Kafka was created with ksql statement below:
CREATE STREAM INPUT_STREAM (year STRUCT<month STRUCT<day STRUCT<hour INTEGER, minute INTEGER>>>) WITH (KAFKA_TOPIC = 'INPUT_TOPIC', VALUE_FORMAT = 'JSON');
It defines a four-level nested JSON schema with the fields year, month, day, hour and minute, like the one below:
{
  "year": {
    "month": {
      "day": {
        "hour": string,
        "minute": string
      }
    }
  }
}
I want to create a second OUTPUT_STREAM that will read the messages from INPUT_STREAM and re-map its field names to custom ones. I want to grab the hour and minute values and place them in a nested JSON under the fields one and two, like the one below:
{
  "one": {
    "two": {
      "hour": string,
      "minute": string
    }
  }
}
I go ahead and put together a ksqlDB statement to create OUTPUT_STREAM:
CREATE STREAM OUTPUT_STREAM WITH (KAFKA_TOPIC='OUTPUT_TOPIC', REPLICAS=3) AS SELECT YEAR->MONTH->DAY->HOUR ONE->TWO->HOUR FROM INPUT_STREAM EMIT CHANGES;
The statement fails with an error. Is there a syntax error in this statement? Is it possible to specify the destination field name like I do here with
...AS SELECT YEAR->MONTH->DAY->HOUR ONE->TWO->HOUR FROM... ?
I've tried to use STRUCT instead of ONE->TWO->HOUR:
CREATE STREAM OUTPUT_STREAM WITH (KAFKA_TOPIC='OUTPUT_TOPIC', REPLICAS=3) AS SELECT YEAR->MONTH->DAY->HOUR ONE STRUCT<TWO STRUCT<HOUR VARCHAR>> FROM INPUT_STREAM EMIT CHANGES;
That errors out too.
To create a struct in ksqlDB:
SELECT
  STRUCT(
    COLUMN_NEW_NAME_1 := OLD_COLUMN_NAME_1,
    COLUMN_NEW_NAME_2 := OLD_COLUMN_NAME_2
  ) as STRUCT_COLUMN_NAME
FROM OLD_TABLE_OR_STREAM
So the answer to your question is: yes, you have a syntax error.
SELECT
  STRUCT(
    TWO := STRUCT(
      HOUR := YEAR->MONTH->DAY->HOUR,
      MINUTE := YEAR->MONTH->DAY->MINUTE
    )
  ) as ONE
FROM INPUT_STREAM;

ClickHouse JSON parse exception: Cannot parse input: expected ',' before

I'm trying to add JSON data to ClickHouse from Kafka. Here's simplified JSON:
{
  ...
  "sendAddress": {
    "sendCommChannelTypeId": 4,
    "sendCommChannelTypeCode": "SMS",
    "sendAddress": "789345345945"
  },
  ...
}
Here are the steps: create a table in ClickHouse, create another table using the Kafka engine, create a MATERIALIZED VIEW to connect the two tables, and connect ClickHouse to Kafka.
Creating the first table
CREATE TABLE tab
(
...
sendAddress Tuple (sendCommChannelTypeId Int32, sendCommChannelTypeCode String, sendAddress String),
...
)Engine = MergeTree()
PARTITION BY applicationId
ORDER BY (applicationId);
Creating a second table with Kafka Engine SETTINGS:
CREATE TABLE tab_kfk
(
...
sendAddress Tuple (sendCommChannelTypeId Int32, sendCommChannelTypeCode String, sendAddress String),
...
)ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'topk2',
kafka_group_name = 'group1',
kafka_format = 'JSONEachRow',
kafka_row_delimiter = '\n';
Create MATERIALIZED VIEW
CREATE MATERIALIZED VIEW tab_mv TO tab AS
SELECT ... sendAddress, ...
FROM tab_kfk;
Then I try to SELECT all or specific items from the first table, tab, and get nothing. The logs show the parse exception from the question title.
OK. I added '[]' around the curly braces in sendAddress, like this:
"authkey": "some_value",
"sendAddress": [{
  "sendCommChannelTypeId": 4,
  "sendCommChannelTypeCode": "SMS",
  "sendAddress": "789345345945"
}]
And I still get an error, but a slightly different one.
What should I do to fix this problem? Thanks!
There are three ways to fix it:
1. Don't use nested objects and flatten messages before inserting them into the Kafka topic. For example:
{
  ..
  "authkey": "key",
  "sendAddress_CommChannelTypeId": 4,
  "sendAddress_CommChannelTypeCode": "SMS",
  "sendAddress": "789345345945",
  ..
}
2. Use a Nested data structure, which requires changing both the JSON message schema and the table schema:
{
  ..
  "authkey": "key",
  "sendAddress.sendCommChannelTypeId": [4],
  "sendAddress.sendCommChannelTypeCode": ["SMS"],
  "sendAddress.sendAddress": ["789345345945"],
  ..
}
CREATE TABLE tab_kfk
(
applicationId Int32,
..
sendAddress Nested(
sendCommChannelTypeId Int32,
sendCommChannelTypeCode String,
sendAddress String),
..
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'topk2',
kafka_group_name = 'group1',
kafka_format = 'JSONEachRow',
kafka_row_delimiter = '\n',
input_format_import_nested_json = 1 /* <--- */
Take into account the setting input_format_import_nested_json.
3. Interpret the input JSON message as a string and parse it manually (see GitHub issue #16969):
CREATE TABLE tab_kfk
(
message String
)
ENGINE = Kafka
SETTINGS
..
kafka_format = 'JSONAsString', /* <--- */
..
CREATE MATERIALIZED VIEW tab_mv TO tab
AS
SELECT
..
JSONExtractString(message, 'authkey') AS authkey,
JSONExtract(message, 'sendAddress', 'Tuple(Int32,String,String)') AS sendAddress,
..
FROM tab_kfk;
Reading this commit, I believe that, as of release 23.1, a new setting input_format_json_read_objects_as_strings allows reading nested JSON objects as String.
Example:
SET input_format_json_read_objects_as_strings = 1;
CREATE TABLE test (id UInt64, obj String, date Date) ENGINE=Memory();
INSERT INTO test FORMAT JSONEachRow {"id" : 1, "obj" : {"a" : 1, "b" : "Hello"}, "date" : "2020-01-01"};
SELECT * FROM test;
Result:

id | obj                      | date
1  | {"a" : 1, "b" : "Hello"} | 2020-01-01

(docs)
Of course, you can still use the materialized view to convert the String object into the correct column types, using the same techniques as for parsing the JSONAsString format.

How to convert map<anydata> to json

In my CRUD REST service I do an insert into a DB and want to respond to the caller with the newly created record. I am looking for a nice way to convert the map to JSON.
I am running Ballerina 0.991.0 and using PostgreSQL.
The return of the update ("INSERT ...") is a map.
I tried convert and stamp, but it did not work for me.
import ballerinax/jdbc;
...
jdbc:Client certificateDB = new({
url: "jdbc:postgresql://localhost:5432/certificatedb",
username: "USER",
password: "PASS",
poolOptions: { maximumPoolSize: 5 },
dbOptions: { useSSL: false }
}); ...
var ret = certificateDB->update("INSERT INTO certificates(certificate, typ, scope_) VALUES (?, ?, ?)", certificate, typ, scope_);
// here is the data, it is map<anydata>
ret.generatedKeys
A map should know which data type it is, right?
Then it should be easy to convert it to JSON, like this:
{"certificate":"{certificate:
"-----BEGIN
CERTIFICATE-----\nMIIFJjCCA...tox36A7HFmlYDQ1ozh+tLI=\n-----END
CERTIFICATE-----", typ: "mqttCertificate", scope_: "QARC", id_:
223}"}
Right now I do a foreach and build the JSON manually. Quite ugly. Maybe somebody has tips on how to do this in a nicer way.
It can't be ruled out that this is due to my lack of programming skills :-)
The return value of the JDBC update remote function is sql:UpdateResult|error.
The sql:UpdateResult is a record with two fields (refer to https://ballerina.io/learn/api-docs/ballerina/sql.html#UpdateResult):
updatedRowCount of type int - the number of rows that were affected/updated by the given statement execution
generatedKeys of type map - a map of the auto-generated column values resulting from the update operation (only if the corresponding table has auto-generated columns). The data is given as key-value pairs of column name and column value, so this map contains only the auto-generated column values.
But your requirement is to get the entire row that was inserted by the given update function. That can't be returned by the update operation itself. To get it, you have to execute a JDBC select operation with matching criteria. The select operation will return a table or an error. That table can be converted to JSON easily using the convert() function.
For example: let's say the certificates table has an auto-generated primary key column named 'cert_id'. Then you can retrieve that id value using the code below.
int generatedID = <int>updateRet.generatedKeys.CERT_ID;
Then use that generated id to query the data.
var ret = certificateDB->select("SELECT certificate, typ, scope_ FROM certificates where id = ?", (), generatedID);
json convertedJson = {};
if (ret is table<record {}>) {
    var jsonConversionResult = json.convert(ret);
    if (jsonConversionResult is json) {
        convertedJson = jsonConversionResult;
    }
}
Refer to the example https://ballerina.io/learn/by-example/jdbc-client-crud-operations.html for more details.

ServiceNow REST API: Get list of column names

From the admin UI, there is a Tables and Columns explorer that dutifully shows all available columns for a table, such as the incident table:
My ultimate goal is to be able to query all fields for a given table that I can insert data to (mostly centered around incident and problem tables), match that against what data I have, and then insert the record with a PUT to the table. The immediate problem I am having is that when I query sys_dictionary as various forums suggest, I only get returned a subset of the columns the UI displays.
Postman query:
https://{{SNOW_INSTANCE}}.service-now.com/api/now/table/sys_dictionary?sysparm_fields=internal_type,sys_name,name,read_only,max_length,active,mandatory,comments,sys_created_by,element&name={{TableName}}&sysparm_display_value=all
I understand that the reduced result set has something to do with them being real columns in the table vs. links to other tables, but I can't find any documentation describing how to get the result set that the UI shows using the REST API.
The follow-on problem is that I can't find an example payload where all the standard fields have been filled out for the incident table, so that I can populate as many fields as I have data for.
The reason you don't get all the columns back is because the table you are querying inherits from another table. You need to go through all the inheritance relationships first, finding all parent tables, then query the sys_dictionary for all of those tables.
In the case of the incident table, you need to query the sys_db_object table (table of all tables) to find the parent, which is the task table. Then query the sys_db_object table again to find its parent, which is empty, so we have all the relevant tables: incident and task. Obviously, you would want to write this code as a loop, building up a list of tables by querying the table at the end of the list.
Once you have this list, you can query sys_dictionary with the query: sysparm_query=name=incident^ORname=task, which should return your full list of columns.
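As a rough illustration of that loop (not part of the original answer; the use of the requests library, the placeholder instance URL and credentials, and the dot-walked super_class.name field are assumptions):
import requests

BASE = "https://{{SNOW_INSTANCE}}.service-now.com/api/now/table"
AUTH = ("admin", "password")  # placeholder credentials

def get_table_chain(table):
    """Follow sys_db_object super_class links to collect a table and all of its parents."""
    chain = []
    while table:
        chain.append(table)
        resp = requests.get(
            f"{BASE}/sys_db_object",
            auth=AUTH,
            params={"sysparm_query": f"name={table}",
                    "sysparm_fields": "name,super_class.name"},
        )
        results = resp.json().get("result", [])
        parent = results[0].get("super_class.name", "") if results else ""
        table = parent or None  # dot-walked field is empty when there is no parent
    return chain

def get_columns(table):
    """Query sys_dictionary for the table and every parent table in its chain."""
    tables_query = "^OR".join(f"name={t}" for t in get_table_chain(table))
    resp = requests.get(
        f"{BASE}/sys_dictionary",
        auth=AUTH,
        params={"sysparm_query": tables_query,
                "sysparm_fields": "name,element,internal_type"},
    )
    return resp.json().get("result", [])

print(get_columns("incident"))  # e.g. incident + task columns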
I think you could do this by creating your own Scripted REST API and iterating over and inspecting the fields:
(function process(/*RESTAPIRequest*/ request, /*RESTAPIResponse*/ response) {
    var queryParams = request.queryParams;
    var table = queryParams.table;
    var t = new GlideRecord(table);
    t.initialize();
    var fields = t.getElements(); // or getFields() if global scope
    var fieldList = [];
    for (var i = 0; i < fields.length; i++) {
        var glideElement = fields[i]; // or fields.get(i) if global scope
        var descriptor = glideElement.getED();
        var fldName = glideElement.getName().toString();
        var fldLabel = descriptor.getLabel().toString();
        var fldType = descriptor.getInternalType().toString();
        var canWrite = glideElement.canWrite();
        if (canWrite) {
            fieldList.push({
                name: fldName,
                type: fldType,
                label: fldLabel,
                writable: canWrite
            });
        }
    }
    return fieldList;
})(request, response);
This should save you the hassle of determining the inheritance of fields. Here's the sample output:
{
  "result": [
    {
      "name": "parent",
      "type": "reference",
      "label": "Parent",
      "writable": true
    },
    {
      "name": "made_sla",
      "type": "boolean",
      "label": "Made SLA",
      "writable": true
    },
    ...