Redshift join with varchar(40) and 2.3 billion rows - amazon-redshift

I am new to Amazon Redshift and trying to figure out the best way to join two tables.
I have one table with 2.3 billion records whose id column has datatype varchar(40) and is both the sort key and the dist key.
I am doing a left join to another table of 23 million records on that same id column, which is also its sort key and dist key.
The query takes hours to execute. Is there anything I am doing wrong here?

Check whether you have alerts in the STL_ALERT_EVENT_LOG table. You can also run EXPLAIN on your query and verify that it is using the typically fastest join type (a merge join). You should also identify tables with data skew or unsorted rows (see the Redshift documentation).
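A quick way to run those checks (the big_table/small_table names below are placeholders for your two tables):

```sql
-- Recent alerts recorded by Redshift, newest first
SELECT query, trim(event) AS event, trim(solution) AS solution
FROM stl_alert_event_log
ORDER BY event_time DESC
LIMIT 20;

-- Inspect the join strategy; look for "Merge Join" rather than
-- "Hash Join", and watch for DS_BCAST_INNER / DS_DIST_BOTH
-- redistribution steps, which indicate the dist keys are not being used
EXPLAIN
SELECT b.id
FROM big_table b
LEFT JOIN small_table s ON b.id = s.id;

-- Per-table skew and percentage of unsorted rows
SELECT "table", skew_rows, unsorted
FROM svv_table_info
ORDER BY skew_rows DESC;
```

If a table shows a high unsorted percentage, running VACUUM on it can restore the merge join path.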

Related

How do you manage UPSERTs on PostgreSQL partitioned tables for unique constraints on columns outside the partitioning strategy?

This question is for a database using PostgreSQL 12.3; we are using declarative partitioning and ON CONFLICT against the partitioned table is possible.
We had a single table representing application event data from client activity. Therefore, each row has fields client_id int4 and a dttm timestamp field. There is also an event_id text field and a project_id int4 field which together formed the basis of a composite primary key. (While rare, it was possible for two event records to have the same event_id but different project_id values for the same client_id.)
The table became non-performant, and we saw that queries most often targeted a single client in a specific timeframe. So we shifted the data into a partitioned table: first by LIST (client_id) and then each partition is further partitioned by RANGE(dttm).
We are running into problems shifting our upsert strategy to work with this new table. We used to perform a query of INSERT INTO table SELECT * FROM staging_table ON CONFLICT (event_id, project_id) DO UPDATE ...
But since the columns that determine uniqueness (event_id and project_id) are not part of the partitioning strategy (dttm and client_id), I can't do the same thing with the partitioned table. I thought I could get around this by building UNIQUE indexes on each partition on (project_id, event_id) but the ON CONFLICT is still not firing because there is no such unique index on the parent table (there can't be, since it doesn't contain all partitioning columns). So now a single upsert query appears impossible.
I've found two solutions so far but both require additional changes to the upsert script that seem like they'd be less performant:
I can still do an INSERT INTO table_partition_subpartition ... ON CONFLICT (event_id, project_id) DO UPDATE ... but that requires explicitly determining the name of the partition for each row instead of just INSERT INTO table ... once for the entire dataset.
I could implement the "old way" UPSERT procedure: https://www.postgresql.org/docs/9.4/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE but this again requires looping through all rows.
Is there anything else I could do to retain the cleanliness of a single, one-and-done INSERT INTO table SELECT * FROM staging_table ON CONFLICT () DO UPDATE ... while still keeping the partitioning strategy as-is?
Edit: if it matters, concurrency is not an issue here; there's just one machine executing the UPSERT into the main table from the staging table on a schedule.
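A minimal sketch of the setup described above (table and column names assumed from the question). The parent table cannot carry a unique constraint on (event_id, project_id) because PostgreSQL requires unique constraints on a partitioned table to include all partition key columns, which is why the single-statement ON CONFLICT fails:

```sql
CREATE TABLE events (
    client_id  int4 NOT NULL,
    dttm       timestamp NOT NULL,
    event_id   text NOT NULL,
    project_id int4 NOT NULL
    -- no UNIQUE (event_id, project_id) here: it would have to
    -- include the partition keys client_id and dttm as well
) PARTITION BY LIST (client_id);

CREATE TABLE events_c1 PARTITION OF events
    FOR VALUES IN (1) PARTITION BY RANGE (dttm);
CREATE TABLE events_c1_2020 PARTITION OF events_c1
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

-- A per-partition unique index is allowed, but does not help the parent:
CREATE UNIQUE INDEX ON events_c1_2020 (project_id, event_id);

-- This fails with "there is no unique or exclusion constraint
-- matching the ON CONFLICT specification":
INSERT INTO events SELECT * FROM staging_table
ON CONFLICT (event_id, project_id) DO UPDATE SET dttm = EXCLUDED.dttm;
```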

Redshift CDC or delta load

Does anyone know the best way to load delta or CDC data using any tools?
I have a big table with billions of records and want to update or insert rows, like MERGE in SQL Server or Oracle, but in Amazon Redshift, loading from S3.
We also have a lot of columns, so we cannot compare all of them.
e.g.
TableA
Col1 Col2 Col3 ...
The table already has records, so when inserting new records I need to check whether each one already exists: if it exists unchanged, skip it; if it does not exist, insert it; and if it has changed, update it.
I do have key id and date columns, but since the table has 200+ columns it is not easy to compare them all, and doing so takes a long time.
Many thanks in advance
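A common pattern for this is the staging-table merge that the Redshift documentation recommends: load the delta into a staging table, delete matching target rows by key, then insert all staged rows. Because whole rows are replaced, you never have to compare the 200+ non-key columns. A sketch, with placeholder table names, S3 path, and IAM role, and assuming id plus a date column form the key:

```sql
BEGIN;

-- Load the delta/CDC extract from S3 into a staging table
COPY stage_tablea
FROM 's3://my-bucket/delta/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV;

-- Remove target rows that appear in the delta (matched on the key only)
DELETE FROM tablea
USING stage_tablea s
WHERE tablea.id = s.id
  AND tablea.load_date = s.load_date;

-- Re-insert full rows from staging: changed and brand-new rows alike
INSERT INTO tablea
SELECT * FROM stage_tablea;

COMMIT;

TRUNCATE stage_tablea;  -- TRUNCATE commits, so keep it outside the transaction
```

If the source extract contains only changed/new rows, this replaces updated rows and adds new ones in two set-based statements, with no per-column comparison.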

How do I verify in CQL that all the rows have successfully copied from a CSV to a Cassandra table? ***SELECT statements are not returning all results

I am trying to understand Cassandra by playing with a public dataset.
I inserted 1.5M rows from a CSV into a table on my local Cassandra instance, created WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }.
The table was created with one field as the partition key and one more as a clustering column (together forming the primary key).
I got confirmation that 1.5M rows were processed:
COPY Completed
But when I run SELECT or SELECT COUNT(*) on the table, I always get a maximum of 182 rows.

Secondly, the number of records returned when querying on clustering columns seems to be higher than when querying on single columns, which does not make sense to me. What am I missing about Cassandra's architecture and query model?
Lastly, I have also tried reading the same Cassandra table from the pyspark shell, and it reads 182 rows too.
Your primary key is PRIMARY KEY (state, severity). With this definition, all rows for accidents in the same state with the same severity overwrite each other. You probably have only 182 distinct (state, severity) combinations in your dataset.
You could add another clustering column that identifies each accident uniquely, such as an accident_id.
This blog post highlights the importance of the primary key and has some examples:
https://www.datastax.com/blog/2016/02/most-important-thing-know-cassandra-data-modeling-primary-key
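A sketch of that fix in CQL, with assumed column names: state stays the partition key, while severity and a unique accident_id become clustering columns, so every accident maps to its own row instead of overwriting:

```sql
-- Original definition (rows with the same (state, severity) collide):
--   PRIMARY KEY (state, severity)

CREATE TABLE accidents (
    state       text,
    severity    int,
    accident_id text,
    description text,
    PRIMARY KEY ((state), severity, accident_id)
);
```

After reloading, SELECT COUNT(*) should reflect the full 1.5M rows rather than one row per (state, severity) pair.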

PostgreSQL: how to improve performance when sorting a large joined table?

Assuming I have a query:
SELECT product.id, product.name, sku.usd FROM product LEFT OUTER JOIN sku on product.sku_id=sku.id ORDER BY sku.usd;
where the product table contains several million rows. In this case, is there any way to add an index to help the sort? Furthermore, if I add a WHERE clause on this joined table, is there any way to improve performance?
If your database is running on an HDD, you can add a B-tree index on sku.usd and CLUSTER the table using that index.
Then the rows will be physically sorted on disk.
If you often insert data into this table, you only need to re-cluster it periodically, for example once a month.
Note that the CLUSTER command locks the table while it runs.
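The steps above can be sketched as follows (table and column names taken from the question; the index name is assumed):

```sql
-- B-tree index on the sort column used by ORDER BY
CREATE INDEX sku_usd_idx ON sku (usd);

-- Physically reorder the table to match the index.
-- CLUSTER takes an ACCESS EXCLUSIVE lock, so the table is
-- unavailable to reads and writes while it runs.
CLUSTER sku USING sku_usd_idx;

-- Later re-clusters can reuse the remembered index:
-- CLUSTER sku;

-- Refresh planner statistics after the reorder
ANALYZE sku;
```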

Redshift select * vs select single column

I'm having the following Redshift performance issue:
I have a table with ~ 2 billion rows, which has ~100 varchar columns and one int8 column (intCol). The table is relatively sparse, although there are columns which have values in each row.
The following query:
select colA from tableA where intCol = '111111';
returns approximately 30 rows and runs relatively quickly (~2 mins)
However, the query:
select * from tableA where intCol = '111111';
takes an undetermined amount of time (gave up after 60 mins).
I know pruning the columns in the projection is usually better but this application needs the full row.
Questions:
Is this just a fundamentally bad thing to do in Redshift?
If not, why is this particular query taking so long? Is it related to the structure of the table somehow? Is there some Redshift knob to tweak to make it faster? I haven't yet messed with the distkey and sortkey on the table, but it's not clear that those should matter in this case.
The main reason the first query is faster is that Redshift is a columnar database: it stores table data per column, writing data for the same column into the same blocks on storage. This differs from a row-based database like MySQL or PostgreSQL. Since the first query selects only the colA column, Redshift does not need to read the other columns at all, while the second query reads all ~100 columns, causing far more disk access.
To improve the performance of the second query, you may need to set the table's sortkey to intCol, the column in the WHERE condition. With a sortkey, that column's data is stored in sorted order on disk, which reduces the cost of disk access when fetching records filtered on that column.
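A sketch of that change, assuming the table names from the question:

```sql
-- Sort the table on the column used in the WHERE predicate so Redshift
-- can skip blocks via zone maps instead of scanning every block
ALTER TABLE tableA ALTER SORTKEY (intCol);

-- Keep the sort order fresh after subsequent loads
VACUUM SORT ONLY tableA;
ANALYZE tableA;
```

With the table sorted on intCol, an equality filter on it touches only the few blocks whose min/max range contains the value, for every column in the projection, which is what makes select * viable again.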