I want to return a mocked result for the query:
return dslContext.fetchExists(
dslContext
.selectFrom(CARDS)
.where(CARDS.CARD_ID.eq(cardId)));
The returned value is
+---------+
|exists |
+---------+
|true |
+---------+
The Mocking Connection manual page has code examples showing how to create a record from a generated Record class (like CARDS). How do I generate a custom record for the return value of select exists (... some query ...)?
This manual page shows how to create arbitrary Field references in jOOQ, e.g.
Field<Boolean> exists = DSL.field("exists", SQLDataType.BOOLEAN)
That's just one way to do it (probably the simplest). There are many others. There isn't any fundamental difference between this kind of Field and a generated one.
Of course, you could also just reproduce the exists expression instead:
Field<Boolean> exists = DSL.exists(...);
Or really, just any dummy expression.
I have the following delta table
+-+----+
|A|B   |
+-+----+
|1|10  |
|1|null|
|2|20  |
|2|null|
+-+----+
I want to fill the null values in column B based on the A column.
I came up with the following to do so:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

var df = spark.sql("select * from MyDeltaTable")
val w = Window.partitionBy("A")
df = df.withColumn("B", last("B", true).over(w))
Which gives me the desired output:
+-+----+
|A|B   |
+-+----+
|1|10  |
|1|10  |
|2|20  |
|2|20  |
+-+----+
Now, my question is:
What is the best way to write the result back to my delta table correctly?
Should I merge? Rewrite with the overwrite option?
My delta table is huge and it will keep on increasing, so I am looking for the best possible method to achieve this.
Thank you
It depends on the distribution of the rows that contain the null values you'd like to fill (i.e. are they all in one file or spread across many?).
MERGE will rewrite entire files, so you may end up rewriting enough of the table to justify simply overwriting it instead. You'll have to test this to determine what's best for your use case.
Also, to use MERGE, you need to filter the dataset down to only the changes. Your example "desired output" table has all the data, which would fail to MERGE in its current state because there are duplicate keys.
Check the Important! section in the docs for more.
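If you do filter the dataset down to just the rows you backfilled and you have a genuinely unique key to match on (which the example table does not have), the MERGE itself is simple. A minimal sketch in Delta Lake SQL, assuming a hypothetical unique id column and a temp view filled_changes holding only the backfilled rows:
MERGE INTO MyDeltaTable AS t
USING filled_changes AS s
    ON t.id = s.id              -- id is an assumed unique key, not present in the example table
WHEN MATCHED THEN
    UPDATE SET t.B = s.B        -- only B changes
Delta rewrites only the files that contain matched rows, so the fewer files your null rows are spread across, the more attractive MERGE becomes compared to a full overwrite.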
TLDR: If I want to save arrays of integers in a Postgres table, are there any pros or cons to using an array column (integer[]) vs. using a JSON column (eg. does one perform better than the other)?
Backstory:
I'm using a PostgreSQL database, and Node/Knex to manage it. Knex doesn't have any way of directly defining a PostgreSQL integer[] column type, so someone filed a Knex bug asking for it ... but one of the Knex devs closed the ticket, essentially saying that there was no need to support PostgreSQL array column types when anyone can instead use the JSON column type.
My question is, what downsides (if any) are there to using a JSON column type to hold a simple array of integers? Are there any benefits, such as improved performance, to using a true array column, or am I equally well off by just storing my arrays inside a JSON column?
EDIT: Just to be clear, all I'm looking for in an answer is either of the following:
A) an explanation of how JSON columns and integer[] columns in PostgreSQL work, including either how one is better than the other or how the two are (at least roughly) equal.
B) no explanation, but at least a reference to some benchmarks that show that one column type or the other performs better (or that the two are equal)
An int[] is a lot more efficient in terms of the storage it requires. Consider the following query, which returns the size of an array with 500 elements:
select pg_column_size(array_agg(i)) as array_size,
pg_column_size(jsonb_agg(i)) as jsonb_size,
pg_column_size(json_agg(i)) as json_size
from generate_series(1,500) i;
returns:
array_size | jsonb_size | json_size
-----------+------------+----------
      2024 |       6008 |      2396
(I am quite surprised that the JSON value is so much smaller than the JSONB, but that's a different topic)
If you always use the array as a single value, it does not really matter in terms of query performance. But if you do need to look into the array and search for specific value(s), that will be a lot more efficient with a native array.
There are a lot more functions and operators available for native arrays than there are for JSON arrays. You can easily search for a single value in a JSON array, but searching for multiple values requires workarounds.
The following query demonstrates that:
with array_test (id, int_array, json_array) as (
  values
    (1, array[1,2,3], '[1,2,3]'::jsonb)
)
select id,
       int_array @> array[1] as array_single,
       json_array @> '1' as json_single,
       int_array @> array[1,2] as array_all,
       json_array ?& array['1','2'] as json_all,
       int_array && array[1,2] as array_any,
       json_array ?| array['1','2'] as json_any
from array_test;
You can easily check whether an array contains one specific value. This also works for JSON arrays. Those are the expressions array_single and json_single. With a native array you could also use 1 = any(int_array) instead.
But checking whether an array contains all values from a list, or any value from a list, does not work with JSON arrays.
The above test query returns:
id | array_single | json_single | array_all | json_all | array_any | json_any
---+--------------+-------------+-----------+----------+-----------+---------
 1 | true         | true        | true      | false    | true      | false
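As noted above, checking a JSON array for several values needs a workaround. One option, sketched here with the same sample data as the test query, is jsonb containment with @>, which compares against another jsonb array instead of a text array:
select '[1,2,3]'::jsonb @> '[1,2]'::jsonb as json_all_workaround;   -- returns true
There is no similarly simple operator for the "any value" case with numeric elements; that typically means unnesting with jsonb_array_elements first.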
I am using Postgres 9.5. If I update certain values of a row and commit, is there any way to fetch the old value afterwards? I am thinking of something like a flashback, but a selective one: I don't want to roll back the entire database, I just need to revert one row.
Short answer - it is not possible.
But for future readers, you can create an array field with historical data that will look something like this:
Column         | Type
---------------+-----------
value          | integer
value_history  | integer[]
For more info, read the docs about arrays.
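One way you might populate such a column is with a trigger that appends the previous value on every update. A minimal sketch, where the table name my_table and the exact column names are assumptions for the illustration:
create or replace function keep_value_history() returns trigger as $$
begin
    -- append the outgoing value to the history array whenever it changes
    if new.value is distinct from old.value then
        new.value_history := old.value_history || old.value;
    end if;
    return new;
end;
$$ language plpgsql;

create trigger trg_keep_value_history
    before update on my_table
    for each row execute procedure keep_value_history();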
I'm using SQL Server 2014 Developer Edition Service Pack 2 on Windows 7 Enterprise machine.
The question
Is there a set-based way that I can create an integer field based on a string input, while ensuring that the Entity ID field is never duplicated for different inputs?
Hypothetical table structure
|ID|Entity ID|Entity Code|Field One|From Date|To Date |
|1 |1        |CodeOne    |ValueOne |20160731 |20160801|
|2 |1        |CodeOne    |ValueTwo |20160802 |NULL    |
|3 |2        |CodeTwo    |ValueSix |20160630 |NULL    |
Given the above table, I'm trying to find a way to create the Entity ID based on the Entity Code field (it is possible that we would use a combination of fields)
What I've tried so far
Using a sequence object (don't like this because it is too easy for the sequence to be dropped and reset the count)
Creating a table to track the Entities, creating a new Entity ID each time a new Entity is discovered (don't like this because it is not a set based operation)
Creating a HASHBYTES hash of the Entity Code field and converting this to a BIGINT (I have no proof that this won't work, but it doesn't feel like a robust solution)
Thanks in advance all.
Your concerns over HashBytes collisions are understandable, but I think you can put your worries aside. See How many random elements before MD5 produces collisions?
I've used this technique when masking tens of thousands of customer account numbers. I've yet to witness a collision.
Select cast(HashBytes('MD5', 'V5H 3K3') as int)
Returns
-381163718
(Note: as illustrated above, you may see negative values. We didn't mind)
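To make it set-based, you simply apply the same expression to the whole column in a single statement. A quick sketch against throwaway sample values, using BIGINT (as mentioned in the question) rather than INT so more of the hash is kept:
Select EntityCode,
       cast(HashBytes('MD5', EntityCode) as bigint) as EntityID
from   (values ('CodeOne'), ('CodeTwo'), ('CodeOne')) as v(EntityCode)
-- the two 'CodeOne' rows produce the same EntityID, 'CodeTwo' a different one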
What is the best way to delete from a table using Talend?
I'm currently using a tELTJDBCoutput with the action on Delete.
It looks like Talend always generates a DELETE ... WHERE EXISTS (<your generated query>) query.
So I am wondering if we have to use the field values or just put a fixed value of 1 (even in only one field) in the tELTmap mapping.
To me, putting real values looks useless, as in the WHERE EXISTS only the WHERE clause matters.
Is there a better way to delete using ELT components?
My current job is set up like so:
The tELTMAP component with real data values looks like:
But I can also do the same thing with the following configuration:
Am I missing the reason why we should put something in the fields?
The following answer is a demonstration of how to perform deletes using ETL operations, where the data is extracted from the database, read into memory, transformed and then fed back into the database. After clarification, the OP specifically wants information about how this would differ for ELT operations.
If you need to delete certain records from a table then you can use the normal database output components.
In the following example, the use case is to take some updated database and check which records are no longer in the new data set compared to the old data set, and then delete the relevant rows in the old data set. This might be used for refreshing data from one live system to a non-live system, or some other use case where you need to manually move data deltas from one database to another.
We set up our job like so:
Which has two tMySqlConnection components that connect to two different databases (potentially on different hosts), one containing our new data set and one containing our old data set.
We then select the relevant data from the old data set and inner join it using a tMap against the new data set, capturing any rejects from the inner join (rows that exist in the old data set but not in the new data set):
We are only interested in the key for the output as we will delete with a WHERE query on this unique key. Notice as well that the key has been selected for the id field. This needs to be done for updates and deletes.
And then we simply need to tell Talend to delete these rows from the relevant table by configuring our tMySqlOutput component properly:
Alternatively you can simply specify some constraint that would be used to delete the records as if you had built the DELETE statement manually. This can then be fed in as the key via a main link to your tMySqlOutput component.
For instance I might want to read in a CSV with a list of email addresses, first names and last names of people who are opting out of being contacted and then make all of these fields a key and connect this to the tMySqlOutput and Talend will generate a DELETE for every row that matches the email address, first name and last name of the records in the database.
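In that opt-out scenario, the statement issued for each incoming row boils down to something like the following, where contacts and the column names are assumptions for the illustration:
DELETE FROM contacts
WHERE  email = ?          -- bound from the CSV row
AND    first_name = ?
AND    last_name = ?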
In the first example shown in your question:
you are specifically only selecting (for the deletion) products where SOME_TABLE.CODE_COUNTRY is equal to JS_OPP.CODE_COUNTRY and SOME_TABLE.FK_USER is equal to JS_OPP.FK_USER in your WHERE clause, and then the data you send to the delete statement sets CODE_COUNTRY equal to JS_OPP.CODE_COUNTRY and FK_USER equal to JS_OPP.FK_USER.
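Since, as you noted, the ELT output wraps the generated query in a DELETE ... WHERE EXISTS, the statement produced by that first example should look roughly like this (I have not checked the exact SQL Talend emits; table and column names are taken from your mapping):
DELETE FROM SOME_TABLE
WHERE EXISTS (
    SELECT 1
    FROM   JS_OPP
    WHERE  SOME_TABLE.CODE_COUNTRY = JS_OPP.CODE_COUNTRY
    AND    SOME_TABLE.FK_USER = JS_OPP.FK_USER
)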
If you were to put a tLogRow (or some other output) directly after your tELTxMap you would be presented with something that looks like:
.------------+-------.
|     tLogRow_1      |
|=-----------+------=|
|CODE_COUNTRY|FK_USER|
|=-----------+------=|
|GBR         |1      |
|GBR         |2      |
|USA         |3      |
'------------+-------'
In your second example:
You are setting CODE_COUNTRY to an integer of 1 (your database will then translate this to a VARCHAR "1"). This would then mean the output from the component would instead look like:
.------------.
| tLogRow_1  |
|=----------=|
|CODE_COUNTRY|
|=----------=|
|1           |
|1           |
|1           |
'------------'
In your use case this would mean that the deletion should only delete the rows where the CODE_COUNTRY is equal to "1".
You might want to test this a bit further though because the ELT components are sometimes a little less straightforward than they seem to be.