Group by within LISTAGG - amazon-redshift

Our database was PostgreSQL and we were using the array_agg function in one SQL query to get output similar to the following:
{"Basic:1","Basic:1","Basic:1","Basic:1","Basic:1","Basic:1","Paying:1","Paying:1","Paying:1"}
We have migrated to Redshift and used the LISTAGG function. It is fine if the data set is small, and the data is stored as
Basic:1,Basic:1,Basic:1,Basic:1,Basic:1,Basic:1,Paying:1,Paying:1,Paying:1
but we are getting the following error if the dataset is large:
Invalid operation: Result size exceeds LISTAGG limit
The thing is that we need to achieve a final result of Basic:6,Paying:3. Is there any alternative to LISTAGG?

How big is the column you used in LISTAGG? LISTAGG can only handle up to 64K characters.
Documentation has this:
VARCHAR(MAX). If the result set is larger than the maximum VARCHAR
size (64K – 1, or 65535), then LISTAGG returns the following error:
Invalid operation: Result size exceeds LISTAGG limit
https://docs.amazonaws.cn/en_us/redshift/latest/dg/r_WF_LISTAGG.html
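If the goal is the aggregated Basic:6,Paying:3 form, you can avoid the long string entirely by counting first and concatenating only the aggregated rows. A minimal sketch, assuming a hypothetical source table subscriptions with a plan_name column (add your real grouping key to both GROUP BY clauses if you need one result per group):
-- subscriptions and plan_name are placeholder names
SELECT LISTAGG(plan_name || ':' || plan_count::varchar, ',')
       WITHIN GROUP (ORDER BY plan_name) AS plan_summary
FROM (
    SELECT plan_name, COUNT(*) AS plan_count
    FROM subscriptions
    GROUP BY plan_name
) counts;
Because LISTAGG now only sees one short row per plan, the 64K limit is never reached.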

Redshift: Cannot use aggregate function inside UDFs?

I have written the below code:
create or replace function max_price()
returns real
volatile
as $$
    select max(main_amount)
    from table
$$
language sql;
I am receiving this error message:
ERROR: The select expression can not have aggregate or window function.
CONTEXT: Create SQL function "max_price" body
How can I work around this?
No, Redshift UDFs are scalar - each "row" of input values returns one output.
https://docs.aws.amazon.com/redshift/latest/dg/udf-creating-a-scalar-sql-udf.html
You may be able to use a Stored Procedure to obtain the result you are looking for.
https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-create.html
A scalar User-Defined Function in Amazon Redshift cannot issue a SELECT command that retrieves data from a table. It is intended as a means of calculating a number, rather than querying the database.
From Creating a scalar SQL UDF - Amazon Redshift:
The SELECT clause can't include any of the following types of clauses: FROM, INTO, WHERE, GROUP BY, ORDER BY, LIMIT
If you need to consult another table as part of the function, use a Stored procedure.
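For the original goal (the maximum of a column), a stored procedure sketch could look like the following; my_table is a placeholder for the real table name, and the OUT parameter is returned as the result of the CALL:
-- my_table is a placeholder; Redshift stored procedures are written in PL/pgSQL
CREATE OR REPLACE PROCEDURE max_price_sp(max_out OUT real)
AS $$
BEGIN
    SELECT MAX(main_amount) INTO max_out FROM my_table;
END;
$$ LANGUAGE plpgsql;

CALL max_price_sp();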

Snowflake : Unsupported subquery type cannot be evaluated

I am using Snowflake as a data warehouse. I have a CSV file in AWS S3. I am writing a merge SQL statement to merge the data received in the CSV into a table in Snowflake. I have a column in the time dimension table with data type NUMBER(38,0) in Snowflake. This table holds all date times; one example row is
time_id= 232 and time=12:00
In the CSV I am getting a column labeled time with a value such as 12:00.
In the merge SQL I am fetching this value and trying to get the time_id for it.
update table_name set start_time_dim_id = (select time_id from time_dim t where t.time_name = csv_data.start_time_dim_id)
On this statement I am getting this error "SQL compilation error: Unsupported subquery type cannot be evaluated"
I am struggling to solve it. While googling I found one reference for it:
https://github.com/snowflakedb/snowflake-connector-python/issues/251
So I want to check whether anyone has encountered this issue. If yes, I will appreciate any pointers.
It seems like a conversion issue. I suggest you check the data in the CSV file; maybe there is a wrong or missing value. Please check your data and make sure it contains numeric values. For example:
create table simpleone ( id number );
insert into simpleone values ( True );
The last statement fails with:
SQL compilation error: Expression type does not match column data type, expecting NUMBER(38,0) but got BOOLEAN for column ID
If you provide sample data and the SQL that produces this error, maybe we can provide a solution.
Unfortunately, correlated and nested subqueries in Snowflake are a bit limited at this stage.
I would try running something like this:
update table_name
set start_time_dim_id = t.time_id
from time_dim t, csv_data
where t.time_name = csv_data.start_time_dim_id
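If the lookup needs to happen inside the MERGE itself, another option is to resolve time_id with an ordinary join in the USING clause, so no correlated subquery is required. This is only a sketch: fact_table, csv_data, start_time and some_key are placeholder names.
MERGE INTO fact_table f
USING (
    -- csv_data stands for the staged CSV; start_time holds values like '12:00'
    SELECT c.some_key, t.time_id AS start_time_dim_id
    FROM csv_data c
    LEFT JOIN time_dim t ON t.time_name = c.start_time
) src
ON f.some_key = src.some_key
WHEN MATCHED THEN
    UPDATE SET f.start_time_dim_id = src.start_time_dim_id
WHEN NOT MATCHED THEN
    INSERT (some_key, start_time_dim_id)
    VALUES (src.some_key, src.start_time_dim_id);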

PostgreSQL insert with nested query fails with large numbers of rows

I'm trying to insert data into a PostgreSQL table using a nested SQL statement. I'm finding that my inserts work when a small number of rows (a few thousand) is returned from the nested query. For instance, this succeeds:
insert into the_target_table (a_few_columns, average_metric)
SELECT a_few_columns, AVG(a_metric)
FROM a_table
GROUP BY a_few_columns LIMIT 5000
However, this same query fails when I remove my LIMIT (the inner query without limit returns about 30,000 rows):
ERROR: Integer out of range
a_metric is double precision, and a_few_columns are text. I've played around with the LIMIT value, and it seems like the maximum number of rows it can insert without throwing an error is around 14,000. I don't know if this is non-deterministic or a constant threshold before the error is thrown.
I've looked through a few other SO posts on this topic, including this one, and changed my table primary key data type to BIGINT. I still get the same error. I don't think it's an issue w/ numerical overflow, however, as the number of inserts I'm making is small and nowhere even close to hitting the threshold.
Anyone have any clues what is causing this error?
The issue here was an improper definition of the average_metric field in the table I wanted to insert into. I had accidentally defined it as an integer. This normally isn't a huge issue, but I also had a handful of infinity values (inf). Once I switched the field's data type to double precision, I was able to insert successfully. Of course, it would have been best if my application had checked for finite values before attempting the insert; normally I'd do this programmatically via asserts, but with a nested query I hadn't bothered to check.
The final query I used was:
insert into the_target_table (a_few_columns, average_metric)
SELECT a_few_columns, CASE WHEN AVG(a_metric) = 'inf' THEN NULL ELSE AVG(a_metric) END
FROM a_table
GROUP BY a_few_columns LIMIT 5000
An even better solution would have been to go through a_table and replace all inf values first.
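A clean-up pass over the source table could look like this sketch, assuming a_metric is double precision and that NULL is an acceptable replacement for the infinite values:
-- 'Infinity' and '-Infinity' are PostgreSQL's literal spellings of the special float values
UPDATE a_table
SET a_metric = NULL
WHERE a_metric = 'Infinity'::double precision
   OR a_metric = '-Infinity'::double precision;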

SQL Server 2008 R2, "string or binary data would be truncated" error

In SQL Server 2008 R2, I am trying to insert 30 million records from a source table into a target table. Out of these 30 million records, a few have bad data that exceeds the length of the target field. Because of this bad data, the whole insert gets aborted with a "string or binary data would be truncated" error, without loading any rows into the target table, and SQL Server also does not specify which row had the problem. Is there a way to insert the rest of the rows and catch the bad data rows without a big impact on performance (performance is the main concern in this case)?
You can use the len function in your where condition to filter out long values:
select ...
from ...
where len(yourcolumn) <= 42
gives you the "good" records
select ...
from ...
where len(yourcolumn) > 42
gives you the "bad" records. You can use such where conditions in an insert select syntax as well.
You can also just truncate your strings instead, like:
select left(col, 42) as col
from yourtable
In the examples I assumed that 42 is your character limit.
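Putting the two together, an insert-select sketch might look like this; target_table, source_table and bad_rows are placeholder names, and 42 again stands for the target column's length:
-- load only the rows that fit
INSERT INTO target_table (yourcolumn)
SELECT yourcolumn
FROM source_table
WHERE LEN(yourcolumn) <= 42;

-- keep the oversized rows aside for later review
INSERT INTO bad_rows (yourcolumn)
SELECT yourcolumn
FROM source_table
WHERE LEN(yourcolumn) > 42;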
You did not mention how you are inserting the data, i.e. bulk insert or SSIS.
In this situation I prefer SSIS, where you have more control and can solve your issue: you can insert the proper data as @Lajos suggests, and for the bad data you can create a temporary table and capture the bad rows.
You can model the flow of your logic via transformations and also handle errors there. You can search for more on this topic, too.
https://www.simple-talk.com/sql/reporting-services/using-sql-server-integration-services-to-bulk-load-data/
https://www.mssqltips.com/sqlservertip/2149/capturing-and-logging-data-load-errors-for-an-ssis-package/
http://www.techbrothersit.com/2013/07/ssis-how-to-redirect-invalid-rows-from.html

Postgresql and BLOBs - maximum size of bytea?

I'm currently trying to store images in a psql table and was following this guide here, using a bytea column for the image. The problem is that the image I'm trying to insert is ~24 kB and I keep getting an error that the maximum size is 8191, though I've read in other places that a bytea should be able to store up to 1 GB. Surely I should be able to raise this max limit somehow?
Code:
String query = "INSERT INTO " + tableName + " VALUES(?);";
try {
PreparedStatement stmt = conn.prepareStatement(query);
File file = new File(location);
FileInputStream fi = new FileInputStream(file);
stmt.setBinaryStream(1, fi, (int)file.length());
boolean res = stmt.execute();
stmt.close();
fi.close
return res;
}
The database table only consists of a bytea at the moment.
Error message:
org.postgresql.util.PSQLException: ERROR: index row requires 23888 bytes, maximum size is 8191
Max size of bytea
According to this old thread, the maximum size for a field in Postgres is 1 GB.
The PostgreSQL protocol (as of version 12) limits row size to 2 GiB minus message header when it is sent to the client (SELECTed). (The protocol uses 32-bit signed integers to denote message size.)
No other limits were found (another topic).
But large objects are stored as multiple bytea records, so they are not limited in this way. See these docs for them.
Apparently you have an index on that column (to be honest I'm surprised that you could create it - I would have expected Postgres to reject that).
An index on a bytea column does not really make sense. If you remove that index, you should be fine.
The real question is: why did you create an index on a column that stores binary data?
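If you are not sure which index is involved, a quick sketch to find and remove it (images is a placeholder table name; take the DROP INDEX name from the query's output):
-- list the indexes defined on the table
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'images';

-- then drop the offending one (placeholder name)
DROP INDEX the_offending_index;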
If you need to ensure that you don't upload the same image twice, you can create a unique index on the md5 (or some other hash) of the bytea:
create table a(a bytea);
create unique index a_bytea_unique_hash on a (md5(a));
insert into a values ('abc');
INSERT 0 1
insert into a values ('abc');
ERROR: duplicate key value violates unique constraint "a_bytea_unique_hash"
DETAIL: Key (md5(a))=(900150983cd24fb0d6963f7d28e17f72) already exists.