How to validate whether a field's data type is integer or string in AWS Glue (Scala) and throw errors for invalid data - scala

I want to read data from S3, apply a mapping (ApplyMapping) to it, and then write it to another S3 location.
I want to check, field by field, whether the data matches the data type declared in the mapping.
For example, in the mapping I declared username as a string.
Now, before I write it to S3, I need to check whether the username field contains only strings or whether it has some odd values.
How can I achieve this?
Any help would be really appreciated.
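One way to approach this (a sketch, not from the original thread): since Glue runs on Spark, you can cast each field to the type declared in your mapping and treat rows where the cast fails as invalid; a failed cast produces null, so "raw value present but cast result is null" flags bad data. The field name age, the paths, and the error handling below are hypothetical placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Hypothetical input path - replace with your own bucket/prefix.
val df = spark.read.json("s3://input-bucket/path/")

// A failed cast yields null, so a non-null raw value with a null cast is invalid.
val invalid = df.filter(col("age").isNotNull && col("age").cast("int").isNull)

if (invalid.count() > 0) {
  invalid.show(20, truncate = false)  // surface the offending rows
  throw new IllegalStateException("Field 'age' contains non-integer values")
} else {
  df.write.mode("overwrite").json("s3://output-bucket/path/")  // hypothetical output path
}

Note that for a string field like username every value casts successfully, so this kind of check is mainly useful for the numeric and date fields declared in the mapping.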

Related

jOOQ DSL for batch insert of maps, arrays and so forth

I'm hoping to use the jOOQ DSL to do batch inserts into Postgres. I know it's possible, but I'm having issues getting the data formatted properly.
dslContext.loadInto(table).loadJSON(jsonData).fields(...).execute();
is where I'm starting from. The tricky part seems to be getting a Map<String, String> into a jsonb column.
I have the data formatted according to this description, and jOOQ seems to be OK with it... until the map/JSON-in-JSON shows up.
Another JSON-array column still needs to be dealt with too.
Questions:
Is this a reasonable approach?
If not, what would you recommend instead?
Error(s) I'm seeing:
ERROR: column "changes_to" is of type jsonb but expression is of type character varying
Hint: You will need to rewrite or cast the expression.
Edit:
try (DSLContext context = DSL.using(pgClient.getDataSource(), SQLDialect.POSTGRES_10)) {
    context.loadInto(table(RECORD_TABLE))
           .loadJSON(jsonData)
           .fields(field(name(RECORD_ID_COLUMN)),
                   field(name(OTHER_ID_COLUMN)),
                   field(name(CHANGES_TO_COLUMN)),
                   field(name(TYPE_COLUMN)),
                   IDS_FIELD)
           .execute();
} catch (IOException e) {
    throw new RuntimeException(e);
}
with json data:
{"fields":[{"name":"rec_id","type":"VARCHAR"},{"name":"other_id","type":"VARCHAR"},{"name":"changes_to","type":"jsonb"},{"name":"en_type","type":"VARCHAR"},{"name":"ids","type":"BIGINT[]"}],"records":[["recid","crmid","{\"key0\":\"val0\"}","ent type",[10,11,12]],["recid2","crmid2","{\"key0\":\"val0\"}","ent type2",[10,11,12]]]}
The problem(s) being how to format the 'changes_to' and 'ids' columns.
There's a certain price to pay if you're not using jOOQ's code generator (and you should!). jOOQ doesn't know what data type your columns are if you create a field(name("...")), so it won't be able to bind your values correctly. Granted, the Loader API could read the JSON header information, but it currently doesn't.
Instead, why not do either of the following:
Provide explicit type information to your column references, like field(name(CHANGES_TO_COLUMN), SQLDataType.JSONB) - see the sketch below.
Much better: use the code generator, in which case you already have all the type information associated with your Field expressions.
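A sketch of the first option, reusing the column names from the question's JSON (written in Scala to match this page's main topic; the jOOQ API is the same from Java, and SQLDataType.JSONB assumes jOOQ 3.12+):

import org.jooq.impl.DSL.{field, name, table}
import org.jooq.impl.SQLDataType

context.loadInto(table("record_table"))  // placeholder table name
  .loadJSON(jsonData)
  .fields(
    field(name("rec_id"), SQLDataType.VARCHAR),
    field(name("other_id"), SQLDataType.VARCHAR),
    field(name("changes_to"), SQLDataType.JSONB),             // bound as jsonb
    field(name("en_type"), SQLDataType.VARCHAR),
    field(name("ids"), SQLDataType.BIGINT.getArrayDataType))  // bound as bigint[]
  .execute()

With the data types attached, jOOQ can bind the map-as-JSON string to jsonb and the number array to bigint[] instead of sending everything as character varying.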

How to validate data for a fixed-length file in Azure Data Factory

I am reading a fixed-width file in a Mapping Data Flow and loading it into a table. I want to validate the fields, data types, and lengths of the fields that I am extracting in the Derived Column using substring.
How can I achieve this in ADF?
Use a Conditional Split and add a condition for each property of the field that you wish to test for. For data type checking, we literally just landed new isInteger(), isString(), ... functions today. The docs are still in the printing press, but you'll find them in the expression builder. For length, use length().
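As a small illustration (hedged: myCol is a hypothetical column name, and the functions are the ones named above), a Conditional Split condition combining a type check and a length check could look like:

isInteger(myCol) && length(myCol) <= 8

Rows matching the condition flow down that split stream; everything else falls through to the default stream.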

Convert String to Int in Azure data factory Derived column expression

I've created a data flow task in Azure Data Factory and used the Derived Column transformation. One of the source derived column values is '678396', which is extracted through the substring function and is of data type "String" by default. I want to convert it into "Integer" because my target column's data type is "Integer".
I've converted the column with this expression:
toInteger(substring(Column_1, 1, 8))
Please help me with the correct expression.
Kind regards,
Rakesh
You don't need to build the expression. If your column's data are all integer strings like "678396", i.e. the output of substring(Column_1, 1, 8) is an integer string, then Data Factory can convert the int string to the integer data type directly from source to sink. We don't need to convert again.
Make sure you set the column mapping correctly in the sink settings, and all will work well.
Update:
This is my CSV dataset:
You can set the Quote character to a single quote; that should solve the problem. See the source data preview in the Copy activity and the Data Flow:
Copy activity source:
Data Flow overview:
In the Data Flow we will get the alert you mentioned in your comment; we can ignore it and debug the data flow directly:
HTH.
You don't even need to strip the quotes '', as the toInteger function can convert numbers held as string type.

IBM DataStage assumes a column is WVARCHAR while it's a date

I'm doing an ETL job. For the data source stage, I input a custom select statement. In the output tab of the data source stage, I defined the INCEPTION column's data type as Timestamp. The right data type for INCEPTION is Date; I checked it via DBeaver. But somehow IBM DataStage assumes that it is WVARCHAR. It says: ODBC_Connector_0: Schema reconciliation detected a type mismatch for field INCEPTION. When moving data from field type WVARCHAR(min=0,max=10) into DATETIME(fraction=6), data corruption can occur (CC_DBSchemaRules::reportTypeMismatch, file CC_DBSchemaRules.cpp, line 1,838). I don't know why this is, since the database shows that INCEPTION is definitely a Date column. What did I do wrong, and how do I fix it?
Where did DataStage get its table definition? DataStage is a computer program; it can't "decide" anything. If you import the table definition from the source, what data type is INCEPTION? If it is either Date or Timestamp, load that table definition into your DataStage job. Otherwise, explicitly convert the string using the StringToTimestamp() function in a Transformer stage.
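As a sketch of that last step (hedged: the input link name lnk_in and the date mask are hypothetical and must match your actual data), a Transformer derivation could be:

StringToTimestamp(lnk_in.INCEPTION, "%yyyy-%mm-%dd")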

ADF copy task field type boolean as lowercase

In ADF I have a Copy activity that copies data from JSON to delimited text. I get the result as:
A | B | C
"name"|False|"description"
The JSON record is like:
{"A":"name","B":"false","C":"description"}
The expected result is as below:
A | B | C
"name"|false|"description"
The boolean value has to be lowercase in the resulting delimited text file. What am I missing?
I can reproduce this. The reason is that you are converting the string to the ADF data type "Boolean", which for some reason renders the values in Proper case.
Do you really have a receiving process which is case-sensitive? If you need to maintain the case of the source value, simply remove the mapping.
If you do need some kind of custom mapping, then simply change the mapping data type to String and not Boolean.
UPDATE after new JSON provided:
OK, so your first JSON sample has the "false" value in quotes, so it is treated as a string. In your second example, your JSON "true" is not in quotes, so it is a genuine JSON boolean value. ADF auto-detects this at run time, and as far as I can tell it cannot be overridden; happy to be corrected. As an alternative, consider altering your original JSON to a string, as per your original example, OR copying the file to Blob Store or Azure Data Lake, running some transform on it (e.g. Databricks) and then outputting the file. Alternatively, consider Mapping Data Flows.