We have a transaction table of sales to customers with over 2,000 million (2 billion) rows on Redshift. Each month's transactions add about 5 million rows. For MIS reporting (on the monthly 5 million rows only), I need to check whether a customer is new based on mobile number, i.e. whether the mobile number already exists among the 2,000 million rows, without joining on the full table, so that my query remains efficient.
What I have tried:
newtable = SELECT DISTINCT mobile_no as mobile_no, 'old' as category FROM full_table;

SELECT maintable.*, coalesce(nq.category, 'new') as category
FROM maintable as maintable
LEFT JOIN (newtable) as nq ON nq.mobile_no = maintable.mobile_no;
This is very slow; it takes over 50 minutes. I also tried a correlated subquery:
SELECT exists (SELECT 1 FROM newtable WHERE mobile_no = maintable.mobile_no LIMIT 1) as category
but this gives an 'out of memory' error.
Amazon Redshift is a data warehouse, so by design it won't be fast on queries like this. If you will be doing analysis on the data and expect faster results, you might want to explore other products they offer, such as EMR, to run your queries faster.
Here is a reference on what each service is intended for: https://aws.amazon.com/big-data/datalakes-and-analytics/
Related
I have a table which contains about 900K rows. I want to delete about 90% of the rows. I tried using TABLESAMPLE to select them randomly, but didn't get much of a performance improvement. Here are the queries I tried and their times:
sql> DELETE FROM users WHERE id IN (
SELECT id FROM users ORDER BY random() LIMIT 5000
)
[2017-11-22 11:35:39] 5000 rows affected in 1m 11s 55ms
sql> DELETE FROM users WHERE id IN (
SELECT id FROM users TABLESAMPLE BERNOULLI (5)
)
[2017-11-22 11:55:07] 5845 rows affected in 1m 13s 666ms
sql> DELETE FROM users WHERE id IN (
SELECT id FROM users TABLESAMPLE SYSTEM (5)
)
[2017-11-22 11:57:59] 5486 rows affected in 1m 4s 574ms
Deleting only 5% of the data takes about a minute, so this is going to take very long for the full amount. Please suggest whether I am doing things right, or if there is a better way to do this.
Deleting a large number of rows is always going to be slow. The way you identify them won't make much difference.
Instead of deleting a large number of rows, it's usually a lot faster to create a new table that contains the rows you want to keep, e.g.:
create table users_to_keep
as
select *
from users
tablesample system (10);
then truncate the original table and insert the rows that you stored away:
truncate table users;
insert into users
select *
from users_to_keep;
If you want, you can do that in a single transaction.
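As a minimal sketch of the single-transaction version (assuming the users table from the question and no concurrent writers during the run):

```sql
BEGIN;

-- keep a random ~10% sample of the rows
CREATE TABLE users_to_keep AS
SELECT *
FROM users
TABLESAMPLE SYSTEM (10);

-- empty the original table, then restore the kept rows
TRUNCATE TABLE users;

INSERT INTO users
SELECT *
FROM users_to_keep;

DROP TABLE users_to_keep;

COMMIT;
```

Note that TRUNCATE takes an ACCESS EXCLUSIVE lock on the table and will fail if other tables reference users via foreign keys (unless you use TRUNCATE ... CASCADE), so check those constraints first.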
As a_horse_with_no_name pointed out, the random selection itself is a relatively minor factor. And much of the cost associated with a deletion (e.g. foreign key checks) is not something you can avoid.
The only thing which stands out as an unnecessary overhead is the id-based lookup in the DELETE statement; you just visited the row during the random selection step, and now you're looking it up again, presumably via an index on id.
Instead, you can perform the lookup using the row's physical location, represented by the hidden ctid column:
DELETE FROM users WHERE ctid = ANY(ARRAY(
SELECT ctid FROM users TABLESAMPLE SYSTEM (5)
))
This gave me a ~6x speedup in an artificial test, though it will likely be dwarfed by other costs in most real-world scenarios.
I'm using SQL Server 2008 R2 on my development machine (not a server box).
I have a table with 12.5 million records. It has 126 columns, half of which are int. Most columns in most rows are NULL. I've also tested with an EAV design which seems 3-4 times faster to return the same records (but that means pivoting data to make it presentable in a table).
I have a website that paginates the data. When the user tries to go to the last page of records (last 25 records), the resulting query is something like this:
select * from (
select
A.Id, part_id as PartObjectId,
Year_formatted 'year', Make_formatted 'Make',
Model_formatted 'Model',
row_number() over ( order by A.id ) as RowNum
FROM vehicles A
) as innerQuery where innerQuery.RowNum between 775176 and 775200
... but this takes nearly 3 minutes to run. That seems excessive? Is there a better way to structure this query? In the browser front-end I'm using jqGrid to display the data. The user can navigate to the next, previous, first, or last page. They can also filter and order data (example: show all records whose Make is "Bugatti").
vehicles.Id is int and is the primary key (clustered ASC). part_id is int, Make and Model are varchar(100) and typically only contain 20 - 30 characters.
Table vehicles is updated ~100 times per day in individual transactions, and 20 - 30 users use the webpage to view, search, and edit/add vehicles 8 hours/day. It gets read from and updated a lot.
Would it be wise to shard the vehicles table into multiple tables only containing say 3 million records each? Would that have much impact on performance?
I see lots of videos and websites talking about people having tables with 100+ million rows that are read from and updated often without issue.
Note that the performance issues I observe are on my own development computer. The database has a dedicated 16GB of RAM. I'm not using SSD or even SCSI for that matter. So I know hardware would help, but 3 minutes to retrieve the last 25 records seems a bit excessive no?
Though I'm running these tests on SQL Server 2008 R2, I could also use 2012 if there is much to be gained from doing so.
Yes, there is a better way, even on older releases of MS SQL Server, but it is involved. First, this process should be done in a stored procedure. The stored procedure should take, as two of its input parameters, the page requested (@page) and the page size (the number of records per page, @pgSiz).
In the stored procedure,
Create a table variable and put into it a sorted list of the integer primary keys for all the records, with a rowNo column that is itself an indexed, integer primary key for the table variable:
Declare @PKs table
(rowNo integer identity(1,1) not null primary key,
vehicleId integer not null)
Insert @PKs (vehicleId)
Select vehicleId from Vehicles
Order By --[Here put sort criteria as you want pages sorted]
--[Try to only include columns that are in an index]
then, based on which page (and the page size) the user requested (@page, @pgSiz), the stored procedure selects the actual data for that page by joining to this table variable:
Select [The data columns you want]
From @PKs p join Vehicles v
on v.vehicleId = p.vehicleId
Where p.rowNo between @page*@pgSiz+1 and (@page+1)*@pgSiz
order by p.rowNo -- if you want the page of records sorted on the server
assuming @page is 0-based. Also, the stored procedure will need some input-argument validation to ensure that the @page and @pgSiz values are reasonable (and do not take the code past the end of the records).
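Putting the pieces together, a minimal sketch of such a stored procedure might look like this (the procedure name is hypothetical; the column names follow the question's query, and sorting by Id is an assumption — substitute your own sort criteria):

```sql
CREATE PROCEDURE dbo.GetVehiclePage
    @page  int,   -- 0-based page number
    @pgSiz int    -- records per page
AS
BEGIN
    SET NOCOUNT ON;

    -- basic input validation
    IF @page IS NULL OR @page < 0 SET @page = 0;
    IF @pgSiz IS NULL OR @pgSiz < 1 SET @pgSiz = 25;

    DECLARE @PKs table
        (rowNo int identity(1,1) not null primary key,
         vehicleId int not null);

    -- sorted list of primary keys; ideally sort on an indexed column
    INSERT @PKs (vehicleId)
    SELECT Id
    FROM vehicles
    ORDER BY Id;

    -- fetch only the requested page via the rowNo range
    SELECT v.Id, v.part_id, v.Year_formatted, v.Make_formatted, v.Model_formatted
    FROM @PKs p
    JOIN vehicles v ON v.Id = p.vehicleId
    WHERE p.rowNo BETWEEN @page * @pgSiz + 1 AND (@page + 1) * @pgSiz
    ORDER BY p.rowNo;
END
```

The caller would then run, e.g., EXEC dbo.GetVehiclePage @page = 31006, @pgSiz = 25; to get the last page of ~775K rows.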
I am trying to get the date of the last insert performed on a table (on Amazon Redshift). Is there any way to do this using the metadata? The tables do not store any timestamp column, and even if they did, we would need to find this out for ~3,000 tables, so a per-table approach would be impractical; a metadata approach is our strategy. Any tips?
All insert execution steps for queries are logged in STL_INSERT. This query should give you the information you're looking for:
SELECT sti.schema, sti.table, sq.endtime, sq.querytxt
FROM
(SELECT MAX(query) as query, tbl, MAX(i.endtime) as last_insert
FROM stl_insert i
GROUP BY tbl
ORDER BY tbl) inserts
JOIN stl_query sq ON sq.query = inserts.query
JOIN svv_table_info sti ON sti.table_id = inserts.tbl
ORDER BY inserts.last_insert DESC;
Note: The STL tables only retain approximately two to five days of log history.
I have the following huge tables.
Table_1 with 500000 (0.5m) rows
Table_8 with 20000000 (20m) rows
Table_13 with 4000000 (4m) rows
Table_6 with 500000 (0.5m) rows
Table_15 with 200000 (0.2m) rows
I need to pull out many records (recent events) to show on a Google map, by joining about 28 tables.
How can I speed up a SELECT query over these huge tables?
I searched Google and found advice to use clustered and non-clustered indexes. Following the suggestions from the DTA (Database Engine Tuning Advisor), I built those clustered and non-clustered indexes. But it still takes a long time.
I have 2 views and 1 stored procedure, as in the following gist:
https://gist.github.com/LaminGitHub/57b314b34599b2a47e65
Please kindly give me some ideas.
Best Regards
I'm trying to set up an environment for performance testing. Currently we have a table with 8 million records, and we want to duplicate these records across 30 days.
In other words:
- Table 1
--Partition1(8 million records)
--Partition2(0 records)
.
.
--Partition30(0 records)
Now I want to take the 8 million records in Partition1 and duplicate them across the rest of the partitions. The only difference between the copies is a column that contains a DATE; this column should vary by 1 day in each copy.
Partition1(DATE)
Partition2(DATE+1)
Partition3(DATE+2)
And so on.
The last restrictions are that there are 2 indexes on the original table which must be preserved in the copies, and that the Oracle DB is 10g.
How can I duplicate this content?
Thanks!
It seems to me to be as simple as running as efficient an insert as possible.
If you cross-join the existing data to a list of integers, 1 .. 29, then you can generate the new dates you need:
with list_of_numbers as (
select rownum day_add
from dual
connect by level <= 29)
insert /*+ append */ into ...
select date_col + day_add, ...
from ...,
list_of_numbers;
You might want to set NOLOGGING on the table, since this is test data.
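As a concrete sketch with the placeholders filled in (the table name events and the columns event_date and payload are hypothetical stand-ins for your actual schema):

```sql
-- generate the integers 1..29, cross-join them onto the existing rows,
-- and insert one copy per integer with the date shifted by that many days
INSERT /*+ append */ INTO events (event_date, payload)
WITH list_of_numbers AS (
    SELECT ROWNUM AS day_add
    FROM dual
    CONNECT BY LEVEL <= 29
)
SELECT e.event_date + n.day_add,
       e.payload
FROM events e,
     list_of_numbers n;
```

Because this is a single statement, the SELECT reads a consistent snapshot of the pre-insert rows, so the new copies are not themselves re-copied. The indexes on the table are maintained automatically by the insert, though rebuilding them afterwards may be faster for a bulk load of this size.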