Slow select from one billion rows GreenPlum DB - postgresql

I've created the following table on GreenPlum:
CREATE TABLE data."CDR"
(
mcc text,
mnc text,
lac text,
cell text,
from_number text,
to_number text,
cdr_time timestamp without time zone
)
WITH (
OIDS = FALSE,appendonly=true, orientation=column,compresstype=quicklz, compresslevel=1
)
DISTRIBUTED BY (from_number);
I've loaded one billion rows into this table, but every query is very slow.
I need to run queries on all fields (not only one).
What can I do to speed up my queries?
Use partitioning? Use indexes?
Maybe use a different DB like Cassandra or Hadoop?

This highly depends on the actual queries you are doing and what your hardware setup looks like.
Since you are querying all the fields, the selectivity gained from columnar orientation is probably hurting you more than helping, as you need to scan all the data anyway. I would remove columnar orientation.
Generally speaking, indexes don't help in a Greenplum system. The amount of hardware involved usually makes scanning the data directly faster than doing index lookups.
Partitioning could be a great help but there would need to be a better understanding of the data. You are probably accessing specific time intervals so creating a partitioning scheme around cdr_time could eliminate the scan of data not needed for the result. The last thing I would worry about is indexes.
Your distribution by from_number could have an impact on query speed. The system will hash the data based on from_number so if you are querying selectively on the from_number the data will only be returned by the node that has it and you won't be leveraging the parallel nature of the system and spreading the request across all of the nodes. Unless you are joining to other tables on from_number, which allows the joins to be collocated and performed within the node, I would change that to be distributed RANDOMLY.
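As a rough sketch of the two suggestions above (partitioning on cdr_time and random distribution), the table could be declared along these lines. The table name data.cdr_by_month and the 2015 date range are placeholders, not values from the question, and the syntax assumes the same classic Greenplum release as the original DDL.

```sql
-- Hypothetical rewrite of the CDR table: row-oriented append-only storage,
-- random distribution, and one partition per month of cdr_time.
CREATE TABLE data.cdr_by_month
(
    mcc         text,
    mnc         text,
    lac         text,
    cell        text,
    from_number text,
    to_number   text,
    cdr_time    timestamp without time zone
)
WITH (appendonly=true, compresstype=quicklz, compresslevel=1)
DISTRIBUTED RANDOMLY
PARTITION BY RANGE (cdr_time)
(
    -- Placeholder range; pick bounds that cover the real data.
    START (timestamp '2015-01-01') INCLUSIVE
    END   (timestamp '2016-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);

-- A query restricted on cdr_time only needs the matching partitions.
SELECT from_number, to_number, cdr_time
FROM data.cdr_by_month
WHERE cdr_time >= timestamp '2015-03-01'
  AND cdr_time <  timestamp '2015-03-08';
```

Running EXPLAIN on the SELECT is a quick way to check which partitions are actually scanned. Rows outside the declared range would be rejected unless a default partition is added.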
On top of all of that, there is the question of what the hardware is and whether you have a proper number of segments set up and the resources to feed them. Essentially every segment is a database. Good hardware can handle multiple segments per node, but if you are doing this on light hardware you need to find the sweet spot where the number of segments matches what the underlying system can provide.

#Dor,
I have the same type of data, with CDR info stored for a telecom company: 10-12 million rows are inserted daily, and heavy queries run on the CDR-related tables. I faced the same issue last year, and I partitioned those tables on the CDR time column.
As I understand it, GP creates a physical table for each partition, whereas other RDBMSs create logical tables. After this I got better performance for all SELECTs on these tables. I also think you should convert the text datatype to character varying for all columns (if text is not really required); I found DB operations on text fields to be very slow (especially ORDER BY and GROUP BY).
Whether an index will help depends on your queries. In my case I have huge inserts, so I haven't tried one yet.
If you are selecting all the columns, there is no need for a column-oriented table.
Regards

Related

Performant Redshift query to return year and shipping mode with max count?

I have a Redshift table lineitem with 303 million rows. The sortkey is on l_receiptdate.
l_receiptdate   l_shipmode
1992-01-03      TRUCK
1992-01-03      TRUCK
1992-03-03      SHIP
1993-02-03      AIR
1993-05-03      SHIP
1993-07-03      AIR
1993-09-05      AIR
Ultimate goal: find what shipmode was used the most for each year. Return year, shipmode, and count for that most popular ship mode.
Expected output:
receiptyear   shipmode   ship_mode_count
1992          TRUCK      2
1993          AIR        3
I'm new to Redshift and its nuances. I know 303 million rows isn't considered big data, but I'd like to start learning Redshift best query practices from the beginning. Below is what I have so far; I'm not sure how to move forward:
select DATE_TRUNC('year', l_receiptdate) as receiptyear,
l_shipmode as shipmode,
count(*) as ship_mode_count
FROM lineitem
group by 1,2
Your query is fine, in a general sense. The missing piece of information is the distribution key of the table. Redshift is a clustered (distributed) database, and this distribution is controlled by the DISTSTYLE and DISTKEY of the table.
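If the distribution settings are not known, one way to look them up is the SVV_TABLE_INFO system view. This is a small sketch, assuming the table is visible to the current user under the name lineitem:

```sql
-- Show how the table is distributed and sorted, plus row skew across slices.
SELECT "table", diststyle, sortkey1, skew_rows
FROM svv_table_info
WHERE "table" = 'lineitem';
```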
Here's a simple way to think about the performance of a Redshift query. Given the nature of Redshift, a few aspects tend to dominate poorly performing queries:
1. Too much network redistribution of data
2. Scanning too much data from disk
3. Spilling to disk, making more data than needed through cross or looped joins, and a whole bunch of other baddies.
Your query has no joins, so #3 isn't an issue. Your query needs to scan the entire table from disk, so nothing can be improved in #2. However, #1 is where you could get into trouble, especially as your data grows.
Your query needs to group by the ship mode and the year. This means that all the data for each unique combination of these needs to be brought together. So if your table were distributed by ship mode (don't do this), then all the data for each value would reside on a single "slice" of the database and no network data transmission would be needed to perform the count. However, you don't need to do this here, since you are just using a COUNT() function and Redshift is smart enough to count locally and then ship the partial results, which are much smaller than the original data, to one place for the final count.
If more complicated operations were being performed that can't be done in parts, then the distribution of the table could make a big difference to the query. Having the data all in one place when rows need to be combined (join, group by, partition, etc.) can prevent a lot of data from being shipped around the cluster via the network.
Your query will work fine but hopefully walking through this mental exercise helps you understand Redshift better.
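The query in the question stops at the per-year counts. One possible way to finish it, which is not part of the answer above, is to rank the counts within each year using a window function and keep the top row. The CTE names mode_counts and ranked are made up for this sketch.

```sql
-- Count rows per (year, ship mode), then keep the most frequent mode per year.
WITH mode_counts AS (
    SELECT EXTRACT(year FROM l_receiptdate) AS receiptyear,
           l_shipmode                       AS shipmode,
           COUNT(*)                         AS ship_mode_count
    FROM lineitem
    GROUP BY 1, 2
),
ranked AS (
    SELECT receiptyear,
           shipmode,
           ship_mode_count,
           ROW_NUMBER() OVER (PARTITION BY receiptyear
                              ORDER BY ship_mode_count DESC) AS rn
    FROM mode_counts
)
SELECT receiptyear, shipmode, ship_mode_count
FROM ranked
WHERE rn = 1
ORDER BY receiptyear;
```

ROW_NUMBER() picks one row arbitrarily when two ship modes tie within a year; RANK() could be used instead if all tied modes should be returned. The ranking step runs over the small aggregated result, so it should add little on top of the table scan.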

Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create a dimensional model on flat OLTP tables (not in 3NF).
Some people think a dimensional model is not required because most of the data for the report is present in a single table. But that table contains more than we need: around 300 columns. Should I still separate the flat table into dimensions and facts, or just use the flat tables directly in the reports?
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across, and each node typically does a portion of the work required to answer each query. Therefore the way data is distributed across the nodes becomes important; the aim is usually to have the data distributed fairly evenly so that each node does about the same amount of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database, data is stored in a way that makes large aggregation-type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attributes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
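The distribution choices described above might look like the following in Redshift DDL. All table and column names here (fact_events, dim_user, dim_country and their columns) are invented for illustration and do not come from the question.

```sql
-- Fact table distributed on the join key, sorted by time.
CREATE TABLE fact_events (
    user_id      BIGINT NOT NULL,
    country_code CHAR(2),
    event_time   TIMESTAMP,
    amount       DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (event_time);

-- Largest dimension shares the distribution key, so joins on user_id
-- can be resolved on each node without moving rows.
CREATE TABLE dim_user (
    user_id   BIGINT NOT NULL,
    user_name VARCHAR(256)
)
DISTSTYLE KEY
DISTKEY (user_id);

-- Small dimension copied in full to every node.
CREATE TABLE dim_country (
    country_code CHAR(2) NOT NULL,
    country_name VARCHAR(128)
)
DISTSTYLE ALL;
```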
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for casual users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

When should I use the postgresql 10 declarative partitioning function?

The official explanation is:
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
When is a table "very large"? How do I judge that?
"A rule of thumb is that the size of the table should exceed the physical memory of the database server": what does this sentence mean?
The typical use cases for table partitioning (not limited to Postgres) are:
Cleanup data
If you need to delete rows from large tables that can be identified by a single partition.
In that case drop partition would be a lot faster than using delete. A typical use case is a range-partitioned table on a timespan (week, month, year)
Improve queries
If all (or nearly all) queries you use, contain a condition on the partition key.
A typical use case would be partitioning an "orders" table on e.g. the country, where all queries involve a condition like where country_code = 'de' or something similar. Queries not including the partitioning key will, however, be slower compared to a query on a non-partitioned table.
What is "large"? That depends very much on your hardware and system. But I would not consider a table with fewer than 100 million rows "large". Indexing (including partial indexes) can get you a long way in Postgres.
Note that Postgres 10 partitioning is still severely limited compared to e.g. Oracle or SQL Server. One of the biggest limitations is the lack of support for foreign keys and global indexes (i.e. a primary key ensuring uniqueness across all partitions). So if you need that, partitioning is not for you.
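For reference, a minimal sketch of the Postgres 10 declarative syntax for the range-on-timespan case mentioned above. The measurements table, its columns, and the 2018 dates are hypothetical.

```sql
-- Parent table: holds no rows itself, only routes to partitions.
CREATE TABLE measurements (
    device_id   integer   NOT NULL,
    recorded_at timestamp NOT NULL,
    reading     numeric
) PARTITION BY RANGE (recorded_at);

-- One partition per month; the upper bound is exclusive.
CREATE TABLE measurements_2018_01 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');
CREATE TABLE measurements_2018_02 PARTITION OF measurements
    FOR VALUES FROM ('2018-02-01') TO ('2018-03-01');

-- A condition on the partition key lets the planner skip other partitions.
SELECT avg(reading)
FROM measurements
WHERE recorded_at >= '2018-02-01'
  AND recorded_at <  '2018-02-15';

-- Cleanup: dropping a partition instead of running a large DELETE.
DROP TABLE measurements_2018_01;
```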

Postgres partitioning?

My software runs a cronjob every 30 minutes, which pulls data from Google Analytics / Social networks and inserts the results into a Postgres DB.
The data looks like this:
url text NOT NULL,
rangeStart timestamp NOT NULL,
rangeEnd timestamp NOT NULL,
createdAt timestamp DEFAULT now() NOT NULL,
...
(various integer columns)
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table. At this rate, the cronjob will generate about 480 000 records a day and about 14.5 million a month.
I think the solution would be using several tables, for example I could use a specific table to store data generated in a given month: stats_2015_09, stats_2015_10, stats_2015_11 etc.
I know Postgres supports table partitioning. However, I'm new to this concept, so I'm not sure what's the best way to do this. Do I need partitioning in this case, or should I just create these tables manually? Or maybe there is a better solution?
The data will be queried later in various ways, and those queries are expected to run fast.
EDIT:
If I end up with 12-14 tables, each storing 10-20 million rows, Postgres should still be able to run select statements quickly, right? Inserts don't have to be super fast.
Partitioning is a good idea under various circumstances. Two that come to mind are:
Your queries have a WHERE clause that can be readily mapped onto one or a handful of partitions.
You want a speedy way to delete historical data (dropping a partition is faster than deleting records).
Without knowledge of the types of queries that you want to run, it is difficult to say if partitioning is a good idea.
I think I can say that splitting the data into different tables is a bad idea because it is a maintenance nightmare:
You can't have foreign key references into the table.
Queries spanning multiple tables are cumbersome, so simple questions are hard to answer.
Maintaining tables becomes a nightmare (adding/removing a column).
Permissions have to be carefully maintained, if you have users with different roles.
In any case, the place to start is with Postgres's documentation on partitioning, which is here. I should note that Postgres's implementation is a bit more awkward than in other databases, so you might want to review the documentation for MySQL or SQL Server to get an idea of what it is doing.
Firstly, I would like to challenge the premise of your question:
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table.
As far as I know, there is no fundamental reason why the database would not cope fine with a single table of many millions of rows. At the extreme, if you created a table with no indexes, and simply appended rows to it, Postgres could simply carry on writing these rows to disk until you ran out of storage space. (There may be other limits internally, I'm not sure; but if so, they're big.)
The problems only come when you try to do something with that data, and the exact problems - and therefore exact solutions - depend on what you do.
If you want to regularly delete all rows which were inserted more than a fixed timescale ago, you could partition the data on the createdAt column. The DELETE would then become a very efficient DROP TABLE, and all INSERTs would be routed through a trigger to the "current" partition (or could even bypass it if your import script was aware of the partition naming scheme). SELECTs, however, would probably not be able to specify a range of createdAt values in their WHERE clause, and would thus need to query all partitions and combine the results. The more partitions you keep around at a time, the less efficient this would be.
Alternatively, you might examine the workload on the table and see that all queries either already do, or easily can, explicitly state a rangeStart value. In that case, you could partition on rangeStart, and the query planner would be able to eliminate all but one or a few partitions when planning each SELECT query. INSERTs would need to be routed through a trigger to the appropriate table, and maintenance operations (such as deleting old data that is no longer needed) would be much less efficient.
Or perhaps you know that once rangeEnd becomes "too old" you will no longer need the data, and can get both benefits: partition by rangeEnd, ensure all your SELECT queries explicitly mention rangeEnd, and drop partitions containing data you are no longer interested in.
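To make the trigger-based routing concrete, here is a minimal sketch of inheritance partitioning on rangeStart with a single monthly child table. It follows the general pattern from the Postgres manual; the names stats, stats_2015_09 and stats_insert_router are assumptions, and a real setup would need one branch (or dynamic SQL) per partition.

```sql
-- Parent table; columns taken from the question (integer columns omitted).
CREATE TABLE stats (
    url        text      NOT NULL,
    rangeStart timestamp NOT NULL,
    rangeEnd   timestamp NOT NULL,
    createdAt  timestamp DEFAULT now() NOT NULL
);

-- Child table for one month; the CHECK constraint is what lets the
-- planner exclude this table when a query's rangeStart cannot match.
CREATE TABLE stats_2015_09 (
    CHECK (rangeStart >= '2015-09-01' AND rangeStart < '2015-10-01')
) INHERITS (stats);

-- Route inserts on the parent to the matching child.
CREATE OR REPLACE FUNCTION stats_insert_router() RETURNS trigger AS $$
BEGIN
    IF NEW.rangeStart >= '2015-09-01' AND NEW.rangeStart < '2015-10-01' THEN
        INSERT INTO stats_2015_09 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for rangeStart %', NEW.rangeStart;
    END IF;
    RETURN NULL;  -- the row has already been stored in the child
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stats_insert_trigger
    BEFORE INSERT ON stats
    FOR EACH ROW EXECUTE PROCEDURE stats_insert_router();

-- Removing a month of old data is then a metadata operation.
DROP TABLE stats_2015_09;
```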
To borrow Linus Torvalds's terminology from git, the "plumbing" for partitioning is built into Postgres in the form of table inheritance, as documented here, but there is little in the way of "porcelain" other than examples in the manual. However, there is a very good extension called pg_partman which provides functions for managing partition sets based on either IDs or date ranges; it's well worth reading through the documentation to understand the different modes of operation. In my case, none quite matched, but forking that extension was significantly easier than writing everything from scratch.
Remember that partitioning does not come free, and if there is no obvious candidate for a column to partition by based on the kind of considerations above, you may actually be better off leaving the data in one table, and considering other optimisation strategies. For instance, partial indexes (CREATE INDEX ... WHERE) might be able to handle the most commonly queried subset of rows; perhaps combined with "covering indexes", where Postgres can return the query results directly from the index without reference to the main table structure ("index-only scans").
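A small sketch of the partial and covering index idea on a single unpartitioned table. The table name page_stats, the pageviews column, and the cutoff date are assumptions made for the example.

```sql
-- Single table holding all rows; pageviews stands in for the integer columns.
CREATE TABLE page_stats (
    url        text      NOT NULL,
    rangeStart timestamp NOT NULL,
    rangeEnd   timestamp NOT NULL,
    createdAt  timestamp DEFAULT now() NOT NULL,
    pageviews  integer
);

-- Partial index: only rows from the recent period are indexed. It also
-- carries pageviews so the query below can be answered from the index.
CREATE INDEX page_stats_recent_idx
    ON page_stats (url, rangeStart, pageviews)
    WHERE rangeStart >= TIMESTAMP '2015-09-01';

-- The WHERE clause implies the index predicate, so the planner may use
-- the partial index, and possibly an index-only scan.
SELECT url, rangeStart, pageviews
FROM page_stats
WHERE url = 'http://example.com/'
  AND rangeStart >= TIMESTAMP '2015-10-01';
```

Whether an index-only scan is actually chosen depends on the planner and on how recently the table has been vacuumed, so it is worth confirming with EXPLAIN.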

Search Engine Database (Cassandra) & Best Practise

I'm currently storing rankings in MongoDB (+ nodejs as API) . It's now at 10 million records, so it's okay for now but the dataset will be growing drastically in the near future.
At this point I see two options:
MongoDB Sharding
Change Database
The queries performed on the database will not be text searches, but for example:
domain, keyword, language, start date, end date
keyword, language, start date, end date
A rank contains a:
1. domain
2. url
3. keyword
4. keyword language
5. position
6. date (unix)
Requirement is to be able to query and analyze the data without caching. For example get all data for domain x, between dates y, z and analyze the data.
I've noticed a performance decrease lately and I'm looking into other databases. The one that seems to fit the job best is Cassandra. I did some testing and it looked promising; performance is good. Using Amazon EC2 + Cassandra seems like a good solution, since it's easily scalable.
Since I'm no expert on Cassandra I would like to know if Cassandra is the way to go. Secondly, what would be the best practice / database model.
Make a collection for (simplified):
domains (domain_id, name)
keywords (keyword_id, name, language)
rank (domain_id, keyword_id, position, url, unix)
Or put all in one row:
domain, keyword, language, position, url, unix
Any tips, insights would be greatly appreciated.
Cassandra relies heavily on query-driven modelling. It's very restrictive in how you can query, but it is possible to fit an awful lot of requirements within those capabilities. For any large-scale database, knowing your queries is important, but for Cassandra it's almost vital.
Cassandra has the notion of primary keys. Each primary key consists of one or more keys (read columns). The first column (which may be a composite) is referred to as the partition key. Cassandra keeps all "rows" for a partition in the same place (on disk, in mem, etc.), and a partition is the unit of replication, etc.
Additional keys in the primary key are called clustering keys. Data within a partition are ordered according to successive clustering keys. For instance, if your primary key is (a, b, c, d) then data will be partitioned by hashing a, and within a partition, data will be ordered by b, c and d.
For efficient querying, you must hit one (or very few) partitions. So your query must have a partition key. This MUST be exact equality (no starts with, contains, etc.). Then you need to filter down to your targets. This can get interesting too:
Your query can specify exact equality conditions for successive clustering keys, and a range (or equality) for the last key in your query. So, in the previous example, this is allowed:
select * from tbl where a=a1 and b=b1 and c > c1;
This is not:
select * from tbl where a=a1 and b>20 and c=c1;
[You can use ALLOW FILTERING for this]
or
select * from tbl where a=a1 and c > 20;
Once you understand the data storage model, this makes sense. One of the reasons Cassandra is so fast for queries is that it pinpoints data in a range and splats it out. If it needed to pick and choose, it'd be slower. You can always grab data and filter client side.
You can also have secondary indexes on columns. These would allow you to filter on exact equality on non-key columns. Be warned, never use a query with a secondary index without specifying a partition key. You'll be doing a cluster query which will time out in real usage. (The exception is if you're using Spark and locality is being honoured, but that's a different thing altogether).
In general, it's good to limit partition sizes to less than 100 MB, or at most a few hundred MB. Any larger, and you'll have problems. Usually, a need for larger partitions suggests a bad data model.
Quite often, you'll need to denormalise data into multiple tables to satisfy all your queries in a fast manner. If your model allows you to query for all your needs with the fewest possible tables, that's a really good model. Often that might not be possible though, and denormalisation will be necessary. For your question, whether everything should go in one row depends on whether you can still run your queries and keep partition sizes under 100 MB with that layout.
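Applied to the two query shapes in the question, a denormalised layout could look roughly like this in CQL (Cassandra's SQL-like query language). This is only a sketch: the table names, the shortened column names lang and rank_date, and the choice of partition keys are assumptions, and the partition sizes would have to be checked against the real data.

```sql
-- Query 1: domain + keyword + language + date range.
CREATE TABLE ranks_by_domain_keyword (
    domain    text,
    keyword   text,
    lang      text,
    rank_date timestamp,
    position  int,
    url       text,
    PRIMARY KEY ((domain, keyword, lang), rank_date)
);

-- Query 2: keyword + language + date range, across all domains.
CREATE TABLE ranks_by_keyword (
    keyword   text,
    lang      text,
    rank_date timestamp,
    domain    text,
    position  int,
    url       text,
    PRIMARY KEY ((keyword, lang), rank_date, domain)
);

-- Equality on the full partition key, then a range on the first clustering key.
SELECT rank_date, position, url
FROM ranks_by_domain_keyword
WHERE domain = 'example.com' AND keyword = 'shoes' AND lang = 'en'
  AND rank_date >= '2015-01-01' AND rank_date < '2015-02-01';

SELECT rank_date, domain, position
FROM ranks_by_keyword
WHERE keyword = 'shoes' AND lang = 'en'
  AND rank_date >= '2015-01-01' AND rank_date < '2015-02-01';
```

Each rank would be written to both tables. A query such as "all data for domain x between dates y and z" is not served by either table and would need a third one, or a time bucket in the partition key, depending on how large a single domain's data gets.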
For OLTP, Cassandra will be awesome IF you can build a data model that works the way Cassandra does. Quite often OLAP requirements won't be satisfied by this. The current tool of choice for OLAP with Cassandra data is the DataStax Spark connector + Apache Spark. It's quite simple to use, and is really powerful.
That's quite a brain dump. But it should give you some idea of the things you might need to learn if you intend to use Cassandra for a real-world project. I'm not trying to put you off Cassandra or anything. It's an awesome data store. But you have to learn what it's doing to harness its power. It works very differently from Mongo, and you should expect a mind shift when switching. It's most definitely NOT like switching from MySQL to SQL Server.