Does Redshift distribute by DISTKEY sequentially?

I have a Redshift table with page hits, like so:
CREATE TABLE hits
(
    user_id INT,
    ts      TIMESTAMP,
    page    VARCHAR(255)
)
SORTKEY(user_id, ts)
DISTKEY(user_id);
Since I'll be running a bunch of window functions over user_id, I thought it would be a good idea to distribute the table by user_id so nodes don't have to exchange data on users before being able to execute the query.
But the users are only ever active for some time and are numbered sequentially. user_id and time are therefore correlated, so whenever I run a query that subsets by time (ts), this will lead to skew if Redshift also distributes by user_id sequentially. This would be less of a problem if it distributed by the DISTKEY randomly. My question is: does it?
(I'm new to Redshift so all of this may just be a total misunderstanding of how things work in general. In that case, apologies in advance!)

Amazon Redshift uses a hash of the DISTRIBUTION KEY (DISTKEY) to distribute data records amongst nodes.
Thus, records will be distributed differently on a 3-node cluster than a 4-node cluster.
If you are seeking evenly-distributed data, use the EVEN distribution method, which simply spreads records evenly between nodes. (However, this is unlikely to be optimal for your use-case.)
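If you want to check this for yourself, here is a minimal sketch, assuming a hypothetical hits_even table name: it creates an EVEN-distributed copy of the table and queries the svv_diskusage system view to eyeball how many values each slice holds for the original table (restricting to col = 0 keeps the numbers roughly proportional to row counts rather than per-column blocks).
-- Sketch only: an EVEN-distributed variant of the table from the question.
CREATE TABLE hits_even
(
    user_id INT,
    ts      TIMESTAMP,
    page    VARCHAR(255)
)
DISTSTYLE EVEN
SORTKEY(user_id, ts);

-- Rough per-slice value counts for the original KEY-distributed table,
-- useful for spotting skew.
SELECT slice, SUM(num_values) AS values_on_slice
FROM svv_diskusage
WHERE name = 'hits'
  AND col = 0
GROUP BY slice
ORDER BY slice;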
See documentation:
Choose the Best Distribution Style
Amazon Redshift Best Practices for Designing Tables

Related

Performant Redshift query to return year and shipping mode with max count?

I have a Redshift table lineitem with 303 million rows. The sortkey is on l_receiptdate.
l_receiptdate    l_shipmode
1992-01-03       TRUCK
1992-01-03       TRUCK
1992-03-03       SHIP
1993-02-03       AIR
1993-05-03       SHIP
1993-07-03       AIR
1993-09-05       AIR
Ultimate goal: find what shipmode was used the most for each year. Return year, shipmode, and count for that most popular ship mode.
Expected output:
receiptyear    shipmode    ship_mode_count
1992           TRUCK       2
1993           AIR         3
I'm new to Redshift and its nuances. I know 303 million rows isn't considered big data, but I'd like to start learning Redshift best query practices from the beginning. Below is what I have so far; I'm not sure how to move forward:
SELECT DATE_TRUNC('year', l_receiptdate) AS receiptyear,
       l_shipmode AS shipmode,
       COUNT(*) AS ship_mode_count
FROM lineitem
GROUP BY 1, 2
Your query is fine, in a general sense. The missing piece of information is the distribution key of the table. You see, Redshift is a clustered (distributed) database, and this distribution is controlled by the DISTSTYLE and DISTKEY of the table.
Here's a simple way to think about the performance of a Redshift query. Given the nature of Redshift, there are a few aspects that tend to dominate poorly performing queries:
Too much network redistribution of data
Scanning too much data from disk
Spilling to disk, making more data than needed through cross or looped joins, and a whole bunch of other baddies.
Your query has no joins so #3 isn't an issue. Your query needs to scan the entire table from disk, so there is nothing that can be improved for #2. However, #1 is where you could get into trouble, especially as your data grows.
Your query needs to group by the ship mode and the year. This means that all the data for each unique combination of these needs to be brought together. So if your table was distributed by ship mode (don't do this), then all the data for each value would reside on a single "slice" of the database and no network data transmission would be needed to perform the count. However, you don't need to do this in this case, since you are just dealing with a COUNT() function and Redshift is smart enough to count locally and then ship the partial results, which are much smaller than the original data, to one place for the final count.
If more complicated actions were being performed that can't be done in parts, then the distribution of the table could make a big difference to the query. Having the data all in one place when rows need to be combined (join, group by, partition, etc.) can prevent a lot of data from needing to be shipped around the cluster via the network.
Your query will work fine but hopefully walking through this mental exercise helps you understand Redshift better.
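For completeness, here is one way (a sketch, not tuned for any particular cluster) to extend the query from the question so it returns only the most popular ship mode per year: group first, then rank each year's counts with a window function and keep the top row.
-- Sketch: rank each year's ship-mode counts and keep the most frequent one.
-- EXTRACT(YEAR ...) is used instead of DATE_TRUNC so receiptyear is a plain
-- number, matching the expected output; ties are broken arbitrarily by
-- ROW_NUMBER() (use RANK() to return all tied ship modes instead).
SELECT receiptyear, shipmode, ship_mode_count
FROM (
    SELECT receiptyear, shipmode, ship_mode_count,
           ROW_NUMBER() OVER (PARTITION BY receiptyear
                              ORDER BY ship_mode_count DESC) AS rn
    FROM (
        SELECT EXTRACT(YEAR FROM l_receiptdate) AS receiptyear,
               l_shipmode AS shipmode,
               COUNT(*) AS ship_mode_count
        FROM lineitem
        GROUP BY 1, 2
    ) yearly_counts
) ranked
WHERE rn = 1;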

When should I use the postgresql 10 declarative partitioning function?

The official explanation is:
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
When is a table very large? How do I judge whether a table is very large?
And what does the rule of thumb "the size of the table should exceed the physical memory of the database server" mean?
The typical use cases for table partitioning (not limited to Postgres) are:
Cleanup data
If you need to delete rows from large tables that can be identified by a single partition.
In that case drop partition would be a lot faster than using delete. A typical use case is a range-partitioned table on a timespan (week, month, year)
Improve queries
If all (or nearly all) queries you use, contain a condition on the partition key.
A typical use case would be partitioning an "orders" table on e.g. the country, where all queries involve a condition like where country_code = 'de' or something similar (see the sketch at the end of this answer). Queries that do not include the partitioning key will, however, be slower than a query on a non-partitioned table.
What is "large"? That depends very much on your hardware and system, but I would not consider a table with fewer than 100 million rows "large". Indexing (including partial indexes) can get you a long way in Postgres.
Note that Postgres 10 partitioning is still severely limited compared to e.g. Oracle or SQL Server. One of the biggest limitations is the lack of support for foreign keys and global indexes (i.e. a primary key ensuring uniqueness across all partitions). So if you need that, partitioning is not for you.
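To make the two use cases above concrete, here is a minimal sketch of Postgres 10 declarative range partitioning (all table, column, and partition names are invented for illustration):
-- Sketch: a table range-partitioned by month on its date column.
CREATE TABLE measurements (
    logdate date NOT NULL,
    value   numeric
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_2018_01 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');

CREATE TABLE measurements_2018_02 PARTITION OF measurements
    FOR VALUES FROM ('2018-02-01') TO ('2018-03-01');

-- Cleanup: removing a whole month is a cheap metadata operation.
DROP TABLE measurements_2018_01;

-- Improved queries: a condition on the partition key lets the planner
-- skip every partition except measurements_2018_02.
SELECT count(*)
FROM measurements
WHERE logdate >= '2018-02-01' AND logdate < '2018-03-01';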

Difference between DISTRIBUTION and PARTITION in db2

Two performance enhancement features of DB2, PARTITION and DISTRIBUTION, are confusing me. How can I understand the exact difference between them? And what type of field should be used for PARTITION and what for DISTRIBUTION?
Please refer to the online DB2 Knowledge Centre for your version and operating-system platform, which explains this in depth and gives the syntax. Below is only a summary.
For DB2 on Linux/Unix/Windows, a partitioned DB2 instance can run on multiple physical or logical hostnames, but a database in that partitioned instance appears as a single database to applications. There can be logical partitions (running on the same hostname) or physical partitions (running on different hostnames) in a shared-nothing arrangement, i.e. different CPUs, different disks, different RAM, etc.
In a partitioned DB2 instance, tables can be distributed by hash ("hash partitioning") on a column chosen by the designer to distribute the table data equally over all chosen partitions. So a column with only 2 discrete values would be unsuitable. The designer can group partitions in as many groupings (partition groups) as make sense for the workload.
To partition your DB2 instance you need a special licence for DB2; this configuration is also known as DPF (Database Partitioning Feature), and IBM sells (or at least used to sell) a hardware/software solution (the IBM Smart Analytics series) with configurations to suit a particular workload. This configuration is common for some warehouse and decision-support/OLAP workloads on very large databases.
On large warehouses it is common to combine hash-partitioning and range-partitioning, but they can also be implemented separately.
Range-partitioning (partition by range) is a common technique to logically split up a table into multiple separate tables (which can be in different tablespaces/storage objects). In this case it is the table that is partitioned, as distinct from the DB2 instance. The designer chooses a partitioning column that suits the workload; often the column has geographic scope, or temporal scope (one partition per day/week/month/hour, etc.), or whatever makes sense logically. The designer usually arranges for the indexes to also be partitioned, although global indexes are allowed. Range-partitioning supports easier roll-in of new partitions on demand, and roll-out of old partitions (as part of table cleanup) with minimal concurrency overheads. This is crucial if databases need to remain within a certain size, with regular archival of old content that can be sent to tape or long-term, less costly storage outside of DB2.
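As an illustration only (table, column, and partition names are invented, and exact clause syntax varies by DB2 version), a single CREATE TABLE on DB2 for Linux/Unix/Windows can use both concepts: DISTRIBUTE BY HASH spreads rows across the database partitions, while PARTITION BY RANGE splits the table into range partitions.
-- Sketch: hash distribution (DPF) combined with monthly range partitions.
CREATE TABLE sales
(
    sale_id   BIGINT  NOT NULL,
    store_id  INTEGER NOT NULL,
    sale_date DATE    NOT NULL,
    amount    DECIMAL(12, 2)
)
DISTRIBUTE BY HASH (sale_id)       -- spreads rows over the database partitions
PARTITION BY RANGE (sale_date)     -- splits the table by date range
(
    STARTING '2023-01-01' ENDING '2023-12-31' EVERY 1 MONTH
);

-- Roll-out of an old range partition (partition and archive-table names
-- are hypothetical):
-- ALTER TABLE sales DETACH PARTITION part0 INTO sales_archive;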

Slow select from one billion rows GreenPlum DB

I've created the following table on GreenPlum:
CREATE TABLE data."CDR"
(
mcc text,
mnc text,
lac text,
cell text,
from_number text,
to_number text,
cdr_time timestamp without time zone
)
WITH (
OIDS = FALSE, appendonly = true, orientation = column, compresstype = quicklz, compresslevel = 1
)
DISTRIBUTED BY (from_number);
I've loaded one billion rows into this table, but every query runs very slowly.
I need to run queries on all fields (not only one).
What can I do to speed up my queries?
Should I use PARTITION? Indexes?
Maybe a different DB like Cassandra or Hadoop?
This highly depends on the actual queries you are doing and what your hardware setup looks like.
Since you are querying all the fields, the selectivity gained by the columnar orientation is probably hurting you more than helping, as you need to scan all the data anyway. I would remove the columnar orientation.
Generally speaking, indexes don't help in a Greenplum system. Usually the amount of hardware that is involved tends to make scanning the data directly faster than doing index lookups.
Partitioning could be a great help, but there would need to be a better understanding of the data. You are probably accessing specific time intervals, so creating a partitioning scheme around cdr_time could eliminate the scan of data not needed for the result. The last thing I would worry about is indexes.
Your distribution by from_number could have an impact on query speed. The system will hash the data based on from_number, so if you are querying selectively on from_number, the data will only be returned by the node that has it, and you won't be leveraging the parallel nature of the system by spreading the request across all of the nodes. Unless you are joining to other tables on from_number, which allows the joins to be collocated and performed within each node, I would change the table to be DISTRIBUTED RANDOMLY.
On top of all of that, there is the question of what the hardware is and whether you have a proper number of segments set up and resources to feed them. Essentially every segment is a database. Good hardware can handle multiple segments per node, but if you are doing this on light hardware you need to find the sweet spot where the number of segments matches what the underlying system can provide.
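Putting those suggestions together, a sketch of an alternative DDL might look like the following (untested; the table name and date range are assumptions, and the column orientation has been dropped as discussed above):
-- Sketch: random distribution plus monthly range partitions on cdr_time.
CREATE TABLE data."CDR_partitioned"
(
    mcc         text,
    mnc         text,
    lac         text,
    cell        text,
    from_number text,
    to_number   text,
    cdr_time    timestamp without time zone
)
WITH (appendonly = true, compresstype = quicklz, compresslevel = 1)
DISTRIBUTED RANDOMLY
PARTITION BY RANGE (cdr_time)
(
    START (timestamp '2015-01-01') INCLUSIVE
    END   (timestamp '2016-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);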
#Dor,
I have the same type of data: CDR info stored for a telecom company, with 10-12 million rows inserted daily and heavy queries running on the CDR-related tables. I was facing the same issue last year, and I created partitions on those tables on the CDR timestamp column.
As per my understanding, Greenplum creates physical tables for each partition, whereas other RDBMSs create logical tables. After this I got better performance with all SELECTs on these tables. I also think you should convert the text datatype to character varying for all columns (if text is not really required); I felt DB operations on text fields were very slow (especially ORDER BY and GROUP BY).
Whether an index will help depends on your queries; in my case I have huge inserts, so I haven't tried indexes yet.
If you are selecting all the columns, there is no need for a column-oriented table.
Regards

Postgres partitioning?

My software runs a cronjob every 30 minutes, which pulls data from Google Analytics / Social networks and inserts the results into a Postgres DB.
The data looks like this:
url text NOT NULL,
rangeStart timestamp NOT NULL,
rangeEnd timestamp NOT NULL,
createdAt timestamp DEFAULT now() NOT NULL,
...
(various integer columns)
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table. At this rate, the cronjob will generate about 480 000 records a day and about 14.5 million a month.
I think the solution would be using several tables, for example I could use a specific table to store data generated in a given month: stats_2015_09, stats_2015_10, stats_2015_11 etc.
I know Postgres supports table partitioning. However, I'm new to this concept, so I'm not sure what's the best way to do this. Do I need partitioning in this case, or should I just create these tables manually? Or maybe there is a better solution?
The data will be queried later in various ways, and those queries are expected to run fast.
EDIT:
If I end up with 12-14 tables, each storing 10-20 million rows, Postgres should still be able to run select statements quickly, right? Inserts don't have to be super fast.
Partitioning is a good idea under various circumstances. Two that come to mind are:
Your queries have a WHERE clause that can be readily mapped onto one or a handful of partitions.
You want a speedy way to delete historical data (dropping a partition is faster than deleting records).
Without knowledge of the types of queries that you want to run, it is difficult to say if partitioning is a good idea.
I think I can say that splitting the data into different tables is a bad idea because it is a maintenance nightmare:
You can't have foreign key references into the table.
Queries spanning multiple tables are cumbersome, so simple questions are hard to answer.
Maintaining tables becomes a nightmare (adding/removing a column).
Permissions have to be carefully maintained, if you have users with different roles.
In any case, the place to start is with Postgres's documentation on partitioning, which is here. I should note that Postgres's implementation is a bit more awkward than in other databases, so you might want to review the documentation for MySQL or SQL Server to get an idea of what it is doing.
Firstly, I would like to challenge the premise of your question:
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table.
As far as I know, there is no fundamental reason why the database would not cope fine with a single table of many millions of rows. At the extreme, if you created a table with no indexes, and simply appended rows to it, Postgres could simply carry on writing these rows to disk until you ran out of storage space. (There may be other limits internally, I'm not sure; but if so, they're big.)
The problems only come when you try to do something with that data, and the exact problems - and therefore exact solutions - depend on what you do.
If you want to regularly delete all rows which were inserted more than a fixed timescale ago, you could partition the data on the createdAt column. The DELETE would then become a very efficient DROP TABLE, and all INSERTs would be routed through a trigger to the "current" partition (or could even bypass it if your import script was aware of the partition naming scheme). SELECTs, however, would probably not be able to specify a range of createdAt values in their WHERE clause, and would thus need to query all partitions and combine the results. The more partitions you keep around at a time, the less efficient this would be.
Alternatively, you might examine the workload on the table and see that all queries either already do, or easily can, explicitly state a rangeStart value. In that case, you could partition on rangeStart, and the query planner would be able to eliminate all but one or a few partitions when planning each SELECT query. INSERTs would need to be routed through a trigger to the appropriate table, and maintenance operations (such as deleting old data that is no longer needed) would be much less efficient.
Or perhaps you know that once rangeEnd becomes "too old" you will no longer need the data, and can get both benefits: partition by rangeEnd, ensure all your SELECT queries explicitly mention rangeEnd, and drop partitions containing data you are no longer interested in.
To borrow Linus Torvalds's terminology from git, the "plumbing" for partitioning is built into Postgres in the form of table inheritance, as documented here, but there is little in the way of "porcelain" other than examples in the manual. However, there is a very good extension called pg_partman which provides functions for managing partition sets based on either IDs or date ranges; it's well worth reading through the documentation to understand the different modes of operation. In my case, none quite matched, but forking that extension was significantly easier than writing everything from scratch.
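For anyone unfamiliar with that plumbing, here is a minimal sketch (using hypothetical table and trigger names based on the question's schema) of inheritance-based partitioning with an insert-routing trigger; pg_partman automates essentially this kind of setup.
-- Sketch: parent table, one monthly child, and a trigger routing inserts.
CREATE TABLE stats (
    url        text NOT NULL,
    rangeStart timestamp NOT NULL,
    rangeEnd   timestamp NOT NULL,
    createdAt  timestamp DEFAULT now() NOT NULL
);

CREATE TABLE stats_2015_09 (
    CHECK (createdAt >= '2015-09-01' AND createdAt < '2015-10-01')
) INHERITS (stats);

CREATE OR REPLACE FUNCTION stats_insert_router() RETURNS trigger AS $$
BEGIN
    -- Route each row to the partition matching createdAt; a real setup
    -- would cover many partitions, e.g. by building the child table name
    -- dynamically from the date.
    IF NEW.createdAt >= '2015-09-01' AND NEW.createdAt < '2015-10-01' THEN
        INSERT INTO stats_2015_09 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'No partition for createdAt = %', NEW.createdAt;
    END IF;
    RETURN NULL;  -- prevents the row from also landing in the parent table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stats_insert_trigger
    BEFORE INSERT ON stats
    FOR EACH ROW EXECUTE PROCEDURE stats_insert_router();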
Remember that partitioning does not come free, and if there is no obvious candidate for a column to partition by based on the kind of considerations above, you may actually be better off leaving the data in one table, and considering other optimisation strategies. For instance, partial indexes (CREATE INDEX ... WHERE) might be able to handle the most commonly queried subset of rows; perhaps combined with "covering indexes", where Postgres can return the query results directly from the index without reference to the main table structure ("index-only scans").
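For example, if most queries only touch recent rows, a partial index along these lines might be enough (the index name and cutoff date are made up for illustration, and the index would need to be recreated periodically as the cutoff moves):
-- Sketch: index only the most recently created rows.
CREATE INDEX stats_recent_idx
    ON stats (url, createdAt)
    WHERE createdAt >= '2015-09-01';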