Redshift table design for efficiency - amazon-redshift

I have a Redshift cluster with a single dc1.large node. Data is being written into it at a rate on the order of 50 million records a day, each consisting of a timestamp, a user ID and an item ID. The item ID (varchar) is unique; the user ID (varchar) and the timestamp (timestamp) are not.
In my Redshift DB of about 110m records, if I have a table with no sort key, it takes about 30 seconds to search for a single item ID.
If I have a table with a sort key on item ID, I get a single item ID search time of about 14-16 seconds.
If I have a table with an interleaved sort key on all three columns, the single item ID search time is still 14-16 seconds.
What I'm hoping to achieve is the ability to query for the records of thousands or tens of thousands of item IDs in a time on the order of a second.
The query just looks like
select count(*) from rs_table where itemid = 'id123';
or
select count(*) from rs_table where itemid in ('id123','id124','id125');
This query comes back in 541ms
select count(*) from rs_table;
AWS documentation suggests that there is a compile time for queries the first time they're run, but I don't think that's what I'm seeing (and it would not be ideal if it was, since each unique set of 10,000 IDs might never be queried in exactly the same order again).
I have to assume I'm doing something wrong with either the sort key design, the query, or some combination of the two. For only ~10 GB of table space, something like Redshift shouldn't take this long to query, right?

Josh,
We probably need a few additional pieces of information to give you a good recommendation.
Here are some things to start thinking about.
Are most of your queries record lookups as you describe above?
What is your distribution key?
Do you join this table with other large fact tables?
If you load 50M records per day and you only have 110M records in the table, does that mean that you only store 2 days?
Do you do massive deletes and then load another 50M records per day?
Do you run ANALYZE after your loads?
If you deleted a large number of records, did you run VACUUM?
If all of your queries are similar to the ones that you describe, why are you using Redshift? Amazon DynamoDB or MongoDB (even Cassandra) would be great database choices for the types of queries that you describe.
If you run analytical workloads, Redshift is an excellent platform. If you are more interested in "record lookups", the NoSQL options, as well as MySQL or MariaDB, might give you better performance.
Also, if this is a dev/test environment and you have loaded and deleted large amounts of data without ever running a VACUUM you would see significant performance degradation.

Related

Statistics of all/many tables in FileMaker

I'm writing a kind of summary page for my FileMaker solution.
For this, I have defined a "statistics" table, which uses formula fields with ExecuteSQL to gather info from most tables, such as the number of records, recently changed records, etc.
This strangely takes a long time - around 10 seconds when I have a total of about 20k records in about 10 tables. The same SQL on any database system shouldn't take more than some fractions of a second.
What could the reason be, what can I do about it and where can I start debugging to figure out what's causing all this time?
The actual code looks like this:
SQLAusführen ( "SELECT COUNT(*) FROM " & _Stats::Table ; "" ; "" )
SQLAusführen ( "SELECT SUM(\"some_field_name\") FROM " & _Stats::Table ; "" ; "" )
Where "_Stats" is my statistics table, and it has a string field "Table" where I store the name of the other tables.
So each row in this _Stats table should have the stats for the table named in the "Table" field.
Update: I'm not using FileMaker server, this is a standalone client application.
We can definitely talk about why it may be slow. Usually this has mostly to do with the size and complexity of your schema. That is "usually", as you have found.
Can you use the DDR (Database Design Report) instead? Much will depend on what you are actually doing with this data. Tools like FMPerception will also give you many of the stats you are looking for. Again, it depends on what you are doing with it.
Also, can you post your actual calculation? Is the statistic table using unstored calculations? Is the statistics table related to any of the other tables? These are a couple things that will affect how ExecuteSQL performs.
One thing to keep in mind, whether ExecuteSQL, a Perform Find, or relationship, it's all the same basic query under-the-hood. So if it would be slow doing it one way, it's going to likely be slow with any other directly related approach.
Taking these one at a time:
All records count.
Placing an unstored calc in the target table allows you to get the count of the records through the relationship, without triggering a transfer of all records to the client. You can get the value from the first record in the relationship. Super light way to get that info vs using Count which requires FileMaker to touch every record on the other side.
Sum of Records Matching a Value.
Using a field on the _Stats table with a relationship to the target table will reduce how much work FileMaker has to do to give you an answer.
Then having a Summary field in the target table to sum the records may prove to be more efficient than using an aggregate function. The summary field will also only sum the records that match the relationship. (Just don't show that field on any of your layouts if you don't need it.)
ExecuteSQL is fastest when it can just rely on a simple index lookup. Once you get outside of that, it's primarily about testing to find the sweet-spot. Typically, I will use ExecuteSQL for retrieving either a JSON object from a user table, or verifying a single field value. Once you get into sorting and aggregate functions, you step outside of the optimizations of the function.
Also note, if you have an open record ( that means you as the current user ), FileMaker Server doesn't know what data you have on the client side, and so it sends ALL of the records. That's why I asked if you were using unstored calcs with ExecuteSQL. It can seem slow when you can't control when the calculations fire. Often I will put the updating of that data into a scheduled script.

How to index already created data in rails?

I am trying to index existing columns in a table that already holds over 5 million rows. My question is: if I add an index with a migration, will the existing data be indexed as well? Or do I need to re-index the existing data, and if so, how?
This is my migration
add_index :data_prods, :date_field
add_index :data_prods, :entity_id
Thank you.
Edit
I am using PostgreSQL dbms.
The process of adding an index indexes the entire table's existing contents. On a table with 5 million rows this may take some time, so I suggest testing in a staging environment (with a similar amount of data) to see how long the migration will take, as well as its impact on the application.
Re: your comment about improving query times
Indexes make queries faster when the indexed columns are commonly referenced in "where" clauses. In your case, any query that filters by date_field OR entity_id will be faster, but other queries will not be improved. Note that PostgreSQL can combine several single-column indexes in one query (via a bitmap scan), but if the majority of your queries filter on both date_field AND entity_id at the same time, you will usually be better off with a composite index. I'd check out this post for further reading on composite indexes:
Index on multiple columns in Ruby on Rails
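To make the composite-index point concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, purely so that it runs without a server; that substitution is an assumption, and PostgreSQL's planner and its EXPLAIN output differ. The table and column names come from the question, while the index name idx_entity_date is made up for illustration. The rows are inserted before the index is created, which mirrors running the migration against an existing table.

```python
import sqlite3

# In-memory SQLite database: a stand-in for PostgreSQL (assumption),
# used only so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_prods ("
    "id INTEGER PRIMARY KEY, entity_id INTEGER, date_field TEXT)"
)

# Insert rows BEFORE creating the index, as with a migration on existing data.
rows = [(i % 100, f"2020-01-{(i % 28) + 1:02d}") for i in range(10_000)]
conn.executemany(
    "INSERT INTO data_prods (entity_id, date_field) VALUES (?, ?)", rows
)

# Hypothetical composite index covering both filter columns.
# Building it indexes every row that already exists in the table.
conn.execute(
    "CREATE INDEX idx_entity_date ON data_prods (entity_id, date_field)"
)

query = "SELECT COUNT(*) FROM data_prods WHERE entity_id = ? AND date_field = ?"
params = (7, "2020-01-08")

# EXPLAIN QUERY PLAN reports which index (if any) SQLite intends to use;
# the human-readable description is the fourth column of each row.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " ".join(row[3] for row in plan)
match_count = conn.execute(query, params).fetchone()[0]

print(plan_text)
print(match_count)
```

In a Rails migration the equivalent would be a single add_index call with both columns given as an array; the column order matters, since a composite index is most useful for queries that constrain its leading column.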

Data Lake Analytics - Large vertex query

I have a simple query that performs a GROUP BY on two fields:
#facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM #table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4100 rows, but the query takes several minutes to finish. In the graph explorer I can see that it uses 2500 vertices, because I have 2500 unique CodFactura+DateKey rows. I don't know if this is normal ADLA behaviour. Is there any way to reduce the number of vertices and execute this query faster?
First: I am not sure your query actually will compile. You would need the Convert expression in your GROUP BY or do it in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL Table, your table is over partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns, and you either have a predicate that reduces the number of rows per partition and/or you do not have up-to-date statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a big number of individual files. This will "scale-out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into fewer, larger files.
You can also try to hint a small number of rows in your query that creates #table_facturas if the above does not help by adding OPTION(ROWCOUNT=4000).

AWS RDS postgresql performance

We have around 90 million rows in a new PostgreSQL table in an RDS instance. It contains 2 numbers, start_num and end_num (bigint, mostly finance related), and details related to those numbers. The PK is on start_num and end_num, and the table is CLUSTERed on it. The query will always be a range query: the input is a number, and the output is the range in which this number falls, along with its details. For example, there is a row which has start_num=112233443322 and end_num=112233543322. The input comes in as 112233443645, so the row containing 112233443322, 112233543322 needs to be returned.
select start_num, end_num from ipinfo.ipv4 where input_value between start_num and end_num;
This always results in a seq scan, and the PK is not used. I have tried creating separate indexes on start_num and on end_num desc, but there was not much change in time. We are looking for a response time of less than 300 ms. Now I am wondering whether that is even possible in PostgreSQL for range queries on large data sets, or whether this is due to PostgreSQL running on AWS RDS.
Looking forward to some advice on steps to improve the performance.

Efficient way of insert millions of rows, convert data and deal with it, on PostgreSQL+PostGIS

I have a big collection of data I want to use for user search later.
Currently I have 200 million resources (~50GB). For each, I have latitude+longitude. The goal is to create a spatial index to be able to do spatial queries on it.
So for that, the plan is to use PostgreSQL + PostGIS.
My data is in a CSV file. I tried using a custom function to avoid inserting duplicates, but after days of processing I gave up. I found a fast way to load it into the database: with COPY it takes less than 2 hours.
Then I need to convert latitude+longitude to the geometry format. For that I just need to do:
ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326)
After some checking, I saw that the 200 million resources share only 50 million distinct points. So, I think the best way is to have a table "TABLE_POINTS" that will store all the points, and a table "TABLE_RESOURCES" that will store resources with point_key.
So I need to fill "TABLE_POINTS" and "TABLE_RESOURCES" from temporary table "TABLE_TEMP" and not keeping duplicates.
For "POINTS" I did:
INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326)
FROM TABLE_TEMP
I don't remember how much time it took, but I think it was a matter of hours.
Then, to fill "RESOURCES", I tried:
INSERT INTO TABLE_RESOURCES (...,point_key)
SELECT DISTINCT ...,point_key
FROM TABLE_TEMP, TABLE_POINTS
WHERE ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326) = point;
but again it takes days, and there is no way to see how far along the query is.
Also, something important: the number of resources will continue to grow, currently by about 100K per day, so storage should be optimized to keep access to the data fast.
So if you have any idea for the loading or the optimization of the storage you are welcome.
Look into optimizing Postgres first (i.e. search for the Postgres unlogged, WAL and fsync options). Second, do you really need points to be unique? Maybe just have one table with resources and points combined and don't worry about duplicate points, as it seems your duplicate lookup may be what's slow.
For DISTINCT to work efficiently, you'll need a database index on those columns for which you want to eliminate duplicates (e.g. on the latitude/longitude columns, or even on the set of all columns).
So first insert all data into your temp table, then CREATE INDEX (this is usually faster than creating the index beforehand, as maintaining it during insertion is costly), and only afterwards do the INSERT INTO ... SELECT DISTINCT.
An EXPLAIN <your query> can tell you whether the SELECT DISTINCT now uses the index.
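The load order suggested above (bulk load, then index, then deduplicate) can be sketched as follows. This uses Python's built-in sqlite3 module with plain lat/longi columns as a stand-in for PostgreSQL + PostGIS, an assumption made so that the example runs without a server. The table names follow the question; the name column, the sample rows, and the index names are hypothetical. The unique index on the points table is an addition of this sketch rather than something stated above; it is there so that the join that fills the resources table can look points up by coordinates.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL + PostGIS (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_temp (name TEXT, lat REAL, longi REAL);
CREATE TABLE table_points (point_key INTEGER PRIMARY KEY, lat REAL, longi REAL);
CREATE TABLE table_resources (name TEXT, point_key INTEGER);
""")

# Step 1: bulk load into the staging table while it has no indexes
# (the role COPY plays in PostgreSQL). The sample rows are made up.
staged = [
    ("a", 48.85, 2.35),
    ("b", 48.85, 2.35),   # same point as "a"
    ("c", 51.5, -0.12),
    ("c", 51.5, -0.12),   # exact duplicate resource
]
conn.executemany("INSERT INTO table_temp VALUES (?, ?, ?)", staged)

# Step 2: create the index only after the load has finished.
conn.execute("CREATE INDEX idx_temp_coords ON table_temp (lat, longi)")

# Step 3: deduplicate the points.
conn.execute(
    "INSERT INTO table_points (lat, longi) "
    "SELECT DISTINCT lat, longi FROM table_temp"
)

# Index the lookup side before the join (an addition of this sketch).
conn.execute("CREATE UNIQUE INDEX idx_points_coords ON table_points (lat, longi)")

# Step 4: fill the resources table, resolving each row to its point_key.
conn.execute("""
INSERT INTO table_resources (name, point_key)
SELECT DISTINCT t.name, p.point_key
FROM table_temp AS t
JOIN table_points AS p ON p.lat = t.lat AND p.longi = t.longi
""")

point_count = conn.execute("SELECT COUNT(*) FROM table_points").fetchone()[0]
resource_count = conn.execute("SELECT COUNT(*) FROM table_resources").fetchone()[0]
print(point_count, resource_count)
```

In PostGIS the same ordering would apply to the geometry column, with the caveat that the index type and the equality comparison on geometries behave differently from the plain numeric columns used here.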