jOOQ DSL for batch insert of maps, arrays and so forth - PostgreSQL

I'm hoping to use the jOOQ DSL to do batch inserts into Postgres. I know it's possible, but I'm having trouble getting the data formatted properly.
dslContext.loadInto(table).loadJSON(jsonData).fields(...).execute();
is where I'm starting from. The tricky part seems to be getting a Map<String, String> into a jsonb column.
I have the data formatted according to this description, and jOOQ seems to be OK with it... until the map/JSON-in-JSON shows up.
Another JSON-array column still needs to be dealt with too.
Questions:
Is this a reasonable approach?
If not, what would you recommend instead?
Error(s) I'm seeing:
ERROR: column "changes_to" is of type jsonb but expression is of type character varying
Hint: You will need to rewrite or cast the expression.
Edit:
try (DSLContext context = DSL.using(pgClient.getDataSource(), SQLDialect.POSTGRES_10)) {
    context.loadInto(table(RECORD_TABLE))
           .loadJSON(jsonData)
           .fields(field(name(RECORD_ID_COLUMN)),
                   field(name(OTHER_ID_COLUMN)),
                   field(name(CHANGES_TO_COLUMN)),
                   field(name(TYPE_COLUMN)),
                   IDS_FIELD)
           .execute();
} catch (IOException e) {
    throw new RuntimeException(e);
}
with JSON data:
{"fields":[{"name":"rec_id","type":"VARCHAR"},{"name":"other_id","type":"VARCHAR"},{"name":"changes_to","type":"jsonb"},{"name":"en_type","type":"VARCHAR"},{"name":"ids","type":"BIGINT[]"}],"records":[["recid","crmid","{\"key0\":\"val0\"}","ent type",[10,11,12]],["recid2","crmid2","{\"key0\":\"val0\"}","ent type2",[10,11,12]]]}
The problem(s) being how to format the 'changes_to' and 'ids' columns.

There's a certain price to pay if you're not using jOOQ's code generator (and you should!): jOOQ doesn't know what data type your columns are if you create a field(name("...")), so it won't be able to bind your values correctly. Granted, the Loader API could read the JSON header information, but it currently doesn't.
Instead, why not just either:
Provide explicit type information to your column references, like field(name(CHANGES_TO_COLUMN), SQLDataType.JSONB) (see the sketch below)
Much better: use the code generator, in which case you already have all the type information associated with your Field expressions.
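For illustration, here is a minimal sketch of option 1, using the column names from the JSON header in the question. It is untested against the asker's schema and assumes a jOOQ version that ships SQLDataType.JSONB (3.12+); the string literals stand in for the asker's RECORD_TABLE / *_COLUMN constants and for whatever name IDS_FIELD refers to:
import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.name;
import static org.jooq.impl.DSL.table;

import java.io.IOException;
import org.jooq.DSLContext;
import org.jooq.impl.SQLDataType;

public class TypedLoadSketch {
    // Attach explicit data types to the ad-hoc field references so the Loader
    // can bind the jsonb and bigint[] values instead of sending plain varchar.
    static void load(DSLContext context, String jsonData) throws IOException {
        context.loadInto(table("record_table"))
               .loadJSON(jsonData)
               .fields(field(name("rec_id"), SQLDataType.VARCHAR),
                       field(name("other_id"), SQLDataType.VARCHAR),
                       field(name("changes_to"), SQLDataType.JSONB),
                       field(name("en_type"), SQLDataType.VARCHAR),
                       field(name("ids"), SQLDataType.BIGINT.getArrayDataType()))
               .execute();
    }
}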

Redshift Spectrum table doesn't recognize array

I ran a crawler on a JSON S3 file to update an existing external table.
Once it finished, I checked SVL_S3LOG to see the structure of the external table and saw it was updated: I have a new column with the Array<int> type, as expected.
When I tried to execute select * on the external table I got this error: "Invalid operation: Nested tables do not support '*' in the SELECT clause.;"
So I tried to spell out the select statement with all column names:
select name, date, books.... (books is the Array<int> type)
from external_table_a1
and got this error:
"Invalid operation: column "books" does not exist in external_table_a1;"
I have also checked the table external_table_a1 under "AWS Glue" and saw that the column "books" is recognized and has the type Array<int>.
Can someone explain why my simple query is wrong?
What am I missing?
Querying JSON data is a bit of a hassle with Redshift: when parsing is enabled (e.g. using the appropriate SerDe configuration) the JSON is stored as a SUPER type. In your case that's the Array<int>.
The AWS documentation on Querying semistructured data seems pretty straightforward, mentioning that PartiQL uses "dotted notation and array subscript for path navigation when accessing nested data". This didn't work for me, although I can't find any reason for that in their SUPER limitations documentation.
Solution 1
What I had to do was set the flags set json_serialization_enable to true; and set json_serialization_parse_nested_strings to true;, which serialize the SUPER type back to JSON. I can then use the JSON functions to query the data. Unnesting the data gets even crazier, because on SUPER types you can only use the unnest syntax select item from table as t, t.items as item. I genuinely don't think this is the intended way to query and unnest SUPER objects, but it's the only approach that worked for me.
They described that in some older "Amazon Redshift Developer Guide".
Solution 2
When you write a query, Redshift will try to fit the output into one of the basic column data types. If the result of your query does not match any of those types, Redshift will not process the query. Hence, in order to convert a SUPER value to a compatible type you have to unnest it (using the rather peculiar Redshift unnest syntax).
For me this works in certain cases, but I'm not always able to properly index arrays, nor can I access the array index (using the my_table.array_column as array_entry at array_index syntax).
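To make the two solutions concrete, here is a rough JDBC sketch that follows the approach described above. The connection string and credentials are placeholders, the table and columns come from the question, and this mirrors the answer's experience rather than an officially documented recipe:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RedshiftSuperSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:redshift://example-cluster:5439/dev", "user", "password");
             Statement stmt = conn.createStatement()) {

            // Solution 1: session flags that serialize SUPER values back to JSON text,
            // so the whole "books" array can be read as a string (and fed to JSON functions)
            stmt.execute("set json_serialization_enable to true;");
            stmt.execute("set json_serialization_parse_nested_strings to true;");
            try (ResultSet rs = stmt.executeQuery("select name, books from external_table_a1")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " -> " + rs.getString("books"));
                }
            }

            // Solution 2: unnest the SUPER array with the Redshift unnest syntax
            try (ResultSet rs = stmt.executeQuery(
                    "select name, book from external_table_a1 as t, t.books as book")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " -> " + rs.getString("book"));
                }
            }
        }
    }
}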

Spring Data JPA issue with Postgres - Tried to send an out-of-range integer as a 2-byte value

quoteEntitiesPage = quoteRepository.findAllByQuoteIds(quoteIds, pageRequest);
The above query gives me the error "Tried to send an out-of-range integer as a 2-byte value" when the number of elements in the quoteIds parameter is above Short.MAX_VALUE.
What is the best approach to get all quote entities here? My Quote class has id(long) and quoteId(UUID) fields.
When using a query of the type "select ... where x in (list)", such as yours, Spring adds a bind parameter for each list element. The PostgreSQL protocol limits the number of bind parameters in a query to Short.MAX_VALUE (the parameter count is sent as a 2-byte value), so when the list is longer than that, you get this exception.
A simple solution to this problem is to partition the list into blocks, query for each of them, and combine the results.
Something like this, using Guava:
List<QuoteEntity> result = new ArrayList<>();
// Split the id list into chunks small enough to stay below the bind parameter limit
List<List<Long>> partitionedQuoteIds = Lists.partition(quoteIds, 10000);
for (List<Long> partitionQuoteIds : partitionedQuoteIds) {
    result.addAll(quoteRepository.findAllByQuoteIds(partitionQuoteIds));
}
This is very wasteful when paginating, but it might be enough for your use case.
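If you'd rather not add Guava just for Lists.partition, the same chunking can be done with plain java.util.List.subList. A sketch reusing the variables from the snippet above (and the question's repository method):
int batchSize = 10_000; // keep each chunk well below the Short.MAX_VALUE bind parameter limit
List<QuoteEntity> result = new ArrayList<>();
for (int from = 0; from < quoteIds.size(); from += batchSize) {
    int to = Math.min(from + batchSize, quoteIds.size());
    result.addAll(quoteRepository.findAllByQuoteIds(quoteIds.subList(from, to)));
}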

How to get the values in DataFrame with the correct DataType?

When I tried to get some values in a DataFrame, like:
df.select("date").head().get(0) // type: Any
The result type is Any, which is not expected.
Since a DataFrame contains the schema of the data, it should know the DataType of each column, so when I try to get a value using get(0), it should return the value with the correct type. However, it does not.
Instead, I need to specify which DataType I want, using getDate(0), which seems weird, inconvenient, and makes me mad.
Since I already specified the schema with the correct DataType for each column when I created the DataFrame, I don't want to use different getXXX() methods for different columns.
Are there some convenient ways that I can get the values with their own correct types? That is to say, how can I get the values with the correct DataType specified in the schema?
Thank you!
Scala is a statically typed language, so the generic get method defined on Row can only have a single return type, which is Any. It cannot return an Int for one call and a String for another.
You should call getInt, getDate, and the other getters provided for each type, or the getAs method, to which you can pass the type as a parameter (for example row.getAs[Int](0)).
As mentioned in the comments, other options are:
use a Dataset instead of a DataFrame
use Spark SQL
You can call the generic getAs method as getAs[Int](columnIndex) or getAs[String](columnIndex), or use specific methods like getInt(columnIndex) and getString(columnIndex).
Link to the Scaladoc for org.apache.spark.sql.Row.
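For completeness, here is a small runnable sketch of the same Row accessors using Spark's Java API (the question is in Scala, but the Row methods are the same); the schema and data are made up for the example:
import java.sql.Date;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RowTypedAccessSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("typed-access").master("local[*]").getOrCreate();

        // A tiny DataFrame with an explicit schema: one DATE column called "date"
        StructType schema = new StructType().add("date", DataTypes.DateType);
        List<Row> rows = Arrays.asList(RowFactory.create(Date.valueOf("2024-01-01")));
        Dataset<Row> df = spark.createDataFrame(rows, schema);

        Row first = df.select("date").head();
        Object untyped = first.get(0);        // static type is Object (Any in Scala)
        Date typed = first.getDate(0);        // typed accessor for the DATE column
        Date viaGetAs = first.<Date>getAs(0); // generic accessor with an explicit type argument
        System.out.println(untyped + " / " + typed + " / " + viaGetAs);

        spark.stop();
    }
}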

Talend Data Integration: Avoid nulls coming out of tExtractXMLField?

I have this simple flow in Talend DI 6 (simplified for posting on SO):
The last step crashes with a NullPointerException, because missing XML attributes are returned as null.
Is there a way to get empty string values instead of nulls?
For now I'm using a tReplace step to remove nulls as a workaround, but it's tedious and adds to the cost of maintenance by creating one more place where the list of attributes needs to be maintained.
In Talend DI 5.6.2 it is possible to add default data values to the schema. The column in the schema is called "Default". If you expect strings, you can set an empty string, which is used whenever the column value is null:
(Screenshot: Talend schema view with the Default column)
This also works for other data types. Talend DI 6 should still be able to do this, although the field might have been renamed.
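If the Default column isn't available, Talend component expressions are plain Java, so the same null-to-empty-string defaulting can be written explicitly in a tMap expression or a tJavaRow step. A rough sketch with a hypothetical column name (input_row/output_row are the variables Talend generates for tJavaRow):
// tJavaRow body: replace a missing XML attribute (null) with an empty string
output_row.myAttribute = input_row.myAttribute == null ? "" : input_row.myAttribute;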

'xx' property on 'yyy' could not be set to a 'String' value. You must set this property to a non-null value of type 'Int32'

I am facing this problem for an unknown reason, and I have tried every forum and blog post about it, but could not find a satisfactory answer.
Let me describe the scenario.
I have a view in the database which consists of columns from two tables. Neither table has any column with data type "int", hence the resulting view (let's name it "MyRecord") also does not have any column with an "int" data type. All the columns in the view have varchar as their data type.
Now, in my .edmx I add this view, and the model "MyRecord" is created fine, with all the properties created with data type "String". I am using Silverlight with RIA Services, so after building the application the related proxies are also created fine, without any conflict.
The problem starts when I try to query "MyRecord" using my domain context; I get the following error.
Load operation failed for query 'GetMyRecords'. The 'CenterCode' property on 'MyRecord' could not be set to a 'String' value. You must set this property to a non-null value of type 'Int32'.
As seen in the error, it is clearly forcing me to convert the data type of the "string" column "CenterCode" to "Int32", which is totally useless and unnecessary for me. The "String"/"varchar" columns are there because they have some business importance, and changing them to "Int32"/"int" might break the application in the future. It's true that the "CenterCode" column currently holds only numeric data, but there may be character data in the future; that's why it was created with the 'varchar' data type.
I cannot change the type of my data just because EF does not support it.
I used SQL Server Profiler; the query being executed is correct, and I can run the same query in SSMS without any error. The error occurs in the application only when EF is building objects from the data returned by the query.
I fail to understand why Entity Framework is throwing this error; it simply is not converting "varchar" to "String", and is unnecessarily bringing "Int32" into the picture, making life difficult. I have been struggling with this issue for the last 4 hours and have tried every possible way to resolve it, but everything is in vain.
Please share any information or solution if you have run into this.
EF team, you must have some answer to this question or a workaround for this problem.
I had the same problem with a double data type.
Solution:
Change your view/procedure and cast the column, e.g. CAST(columnName AS int) (SQL Server's type is called int, not int32).
Not sure if you solved this problem or not, but I just ran into something like this while working with multiple result sets in EF. In my case, a reader.NextResult() call was causing the problem because I hadn't read all the records from the previous result, and I think EF was failing because it tried to map data from the second result set onto the first object.
CAST(columnName AS type) solved my problem in a stored procedure.