I'm using AWS Redshift Spectrum to query a Hudi table. As we know, filtering by the partition column when querying data in Spectrum can reduce the amount of data Spectrum scans and speed up the query.
My question is: if I use Spectrum to run a query like select a, b from my_table where a = 3, does this query perform differently depending on whether I set hoodie.datasource.write.partitionpath.field = a or hoodie.datasource.write.partitionpath.field = b? Can Spectrum use Hudi's partition path to reduce the amount of data scanned?
Absolutely, partitioning is useful for all Hudi clients. When you partition your table by column a and then use column a as a filter in your query, Spectrum will only scan the files in the partition a=3.
But if you partition the table by column b and filter on column a, query performance may even suffer slightly: Spectrum still needs to scan all the files, and since the table is partitioned it contains more files, which means more I/O streams opened by Spectrum to read them. So choose a column that you filter on frequently in your queries as the partition key.
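One way to see the effect is to compare the bytes Spectrum scanned for the same filter under each partitioning choice. A minimal sketch is below; spectrum_schema is a hypothetical external schema name, and it assumes column a is registered as a partition column of the external Hudi table.

select a, b
from spectrum_schema.my_table
where a = 3;

-- After running the query, check how much S3 data Spectrum actually scanned;
-- with a as the partition path, only files under a=3 should be read.
select query, s3_scanned_rows, s3_scanned_bytes
from svl_s3query_summary
order by query desc
limit 5;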
I am trying to read from a partitioned Delta table, perform some narrow transformations, and write the result into a new Delta table that is partitioned on the same fields.
Table A (partitioned on col1, col2) -> Table B (partitioned on col1, col2)
Since the partitioning strategy is the same and there are no wide transformations, my assumption is that a shuffle is not needed here.
Do I need to specify any special options while reading or writing to ensure that a shuffle operation is not triggered?
I tried reading the data normally and writing it back using df_B.write.partitionBy("col1", "col2")..., but the shuffle still seems to be the bottleneck.
I found the issue. The shuffle was happening because of the Delta optimized-writes setting spark.databricks.delta.optimizeWrite.enabled. It may not be needed here, since the partitioning strategy of the source and the destination is the same.
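If that setting is the culprit, a minimal sketch of turning it off is below (assuming a Databricks environment; table_b is a placeholder name). The session setting is the one named above; delta.autoOptimize.optimizeWrite is its table-level counterpart.

-- Disable optimized writes for the current session, so Delta does not
-- repartition (shuffle) the data before writing:
SET spark.databricks.delta.optimizeWrite.enabled = false;

-- Or disable it only for the target table via its table properties:
ALTER TABLE table_b SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'false');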
I tried to load a Redshift table but it failed on one column: The length of the data column 'column_description' is longer than the length defined in the table. Table: 65535, Data: 86555.
I tried to increase the length of the column in the Redshift table, but it looks like 65535 is the maximum length Redshift supports.
Do we have any alternatives to store this value in Redshift?
The answer is that Redshift doesn't support anything larger, and that one shouldn't store large artifacts in an analytic database anyway. If you are using Redshift for its analytic power to find specific artifacts (images, files, etc.), then those should be stored in S3 and the object key (pointer) should be stored in Redshift.
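A minimal sketch of that pattern, with hypothetical table and column names: the full text lives in S3, and Redshift keeps only the pointer (plus, optionally, a truncated copy for searching).

CREATE TABLE artifact_metadata (
    artifact_id          BIGINT,
    description_s3_key   VARCHAR(1024),    -- pointer to the full text object in S3
    description_preview  VARCHAR(65535)    -- optional truncated copy, within Redshift's limit
);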
I have an application that stores and queries time-series data from multiple sensors. Readings spanning several months need to be stored, and more sensors will be added over time, so I need to consider scalability in two dimensions: time and sensor ID. We are using a PostgreSQL database for storage. In addition, to simplify the design of the query layer, we want to query all the data through a single table name.
To improve query efficiency and scalability, I am considering partitioning for this use case, and I want to create the partitions based on two columns: RANGE on the event time of the readings, and a value-based partition on the sensor ID. Under the partitioned table I would like sub-tables such as sensor_readings_1week_Oct_2020_ID1, sensor_readings_2week_Oct_2020_ID1, and sensor_readings_1week_Oct_2020_ID2. I know PostgreSQL supports multi-column partitioning, but in most examples I can only see RANGE used for all the columns, as in the example below. How can I create partitions on multiple columns, one using a time RANGE and the other based on the specific sensor ID? Thanks!
CREATE TABLE p1 PARTITION OF tbl_range
FOR VALUES FROM (1, 110, 50) TO (20, 200, 200);
Or are there better solutions besides partitioning for this use case?
Two-level partitioning is a good solution for my use case. It improves query efficiency a lot.
-- Parent table, list-partitioned by sensor ID
CREATE TABLE sensor_readings (
    id bigserial NOT NULL,
    create_date_time timestamp NULL DEFAULT now(),
    value int8 NULL
) PARTITION BY LIST (id);

-- First level: one partition per sensor ID, itself range-partitioned by time
CREATE TABLE sensor_readings_id1
    PARTITION OF sensor_readings
    FOR VALUES IN (111) PARTITION BY RANGE (create_date_time);

-- Second level: one partition per time window for that sensor
CREATE TABLE sensor_readings_id1_wk1
    PARTITION OF sensor_readings_id1
    FOR VALUES FROM ('2020-10-01 00:00:00') TO ('2020-10-20 00:00:00');
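With this layout, a query that filters on both the sensor ID and a time range is pruned down to the single matching leaf partition. An illustrative check:

-- The plan should show only sensor_readings_id1_wk1 being scanned
EXPLAIN
SELECT *
FROM sensor_readings
WHERE id = 111
  AND create_date_time >= '2020-10-05 00:00:00'
  AND create_date_time <  '2020-10-10 00:00:00';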
I have a table that is initially partitioned by day. At the end of every day no more records will be added to that partition, so I cluster it on an index and then do a lot of number crunching and aggregation on that table (using the index I clustered on):
CLUSTER table_a_20181104 USING table_a_20181104_index1;
After a few days (typically a week) I merge the partition for one day into a larger partition that contains all of that month's data. I use this SQL to achieve this:
WITH moved_rows AS (
    DELETE FROM table_a_20181104
    RETURNING *
)
INSERT INTO table_a_201811
SELECT * FROM moved_rows;
After maybe a month or two I change the tablespace to move the data from an SSD to a conventional magnetic hard disk.
ALTER TABLE ... SET TABLESPACE ...
My initial clustering of the index at the end of the day definitely improves the performance of the queries run against it.
I know that clustering is a one-off command and needs to be repeated if new records are added/removed.
My questions are:
Do I need to repeat the clustering after merging the 'day' partition into the 'month' partition?
Do I need to repeat the clustering after altering the tablespace?
Do I need to repeat the clustering if I VACUUM the partition?
Moving the data from one partition to the other will destroy the clustering, so you'll need to re-cluster after it.
ALTER TABLE ... SET TABLESPACE will just copy the table files as they are, so clustering will be preserved.
VACUUM does not move the rows, so clustering will also be preserved.
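So only the merge requires action; the tablespace change and routine VACUUMs do not. A minimal sketch of re-clustering the month partition after the merge, assuming a corresponding index exists on it (the index name here is hypothetical):

-- Re-cluster the month partition after moving a day's rows into it
CLUSTER table_a_201811 USING table_a_201811_index1;

-- Later runs can omit USING; PostgreSQL remembers the index last clustered on
CLUSTER table_a_201811;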
I'm having the following Redshift performance issue:
I have a table with ~2 billion rows, ~100 varchar columns, and one int8 column (intCol). The table is relatively sparse, although there are columns that have a value in every row.
The following query:
select colA from tableA where intCol = '111111';
returns approximately 30 rows and runs relatively quickly (~2 mins)
However, the query:
select * from tableA where intCol = '111111';
takes an undetermined amount of time (gave up after 60 mins).
I know pruning the columns in the projection is usually better but this application needs the full row.
Questions:
Is this just a fundamentally bad thing to do in Redshift?
If not, why is this particular query taking so long? Is it related to the structure of the table somehow? Is there some Redshift knob to tweak to make it faster? I haven't yet messed with the distkey and sortkey on the table, but it's not clear that those should matter in this case.
The main reason the first query is faster is that Redshift is a columnar database. A columnar database stores table data per column, writing values of the same column into the same blocks on storage. This is different from a row-based database like MySQL or PostgreSQL. Because of this, the first query selects only the colA column, so Redshift does not need to touch the other columns at all, while the second query reads every column, causing a huge amount of disk access.
To improve the performance of the second query, you may want to set the sortkey on the intCol column (the column in the filter). When a column is the sortkey, its data is stored in sorted order on disk, which reduces the cost of disk access when fetching records with a condition on that column, because Redshift can skip blocks whose value ranges cannot match.
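A hedged sketch of making that change on an existing table; it assumes the table currently has no interleaved sort key (the table and column names are taken from the question):

-- Sort the table on the filter column so equality/range predicates on intCol
-- can skip blocks via zone maps
ALTER TABLE tableA ALTER COMPOUND SORTKEY (intCol);

If the table is being recreated anyway, the same effect comes from declaring the sort key up front in the CREATE TABLE statement, and choosing a distkey at the same time is worth considering for join performance.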