Yii2 and TimescaleDB: GridView very slow on large table (PostgreSQL)

Using Yii2 with TimescaleDB (a PostgreSQL extension for time-series data) gets very slow on large tables (> 1 billion rows).
I think this comes from the function getTotalCount() in yii\data\BaseDataProvider, which is called to provide pagination on every page.
We know from TimescaleDB that row counting on large tables is very slow, so we can replace the function getTotalCount() in the model with a "correct" solution such as:
public function getTotalCount()
{
    $connection = Yii::$app->getDb();
    $command = $connection->createCommand("
        SELECT h.schema_name,
               h.table_name,
               h.id AS table_id,
               h.associated_table_prefix,
               row_estimate.row_estimate
        FROM _timescaledb_catalog.hypertable h
        CROSS JOIN LATERAL (
            SELECT sum(cl.reltuples) AS row_estimate
            FROM _timescaledb_catalog.chunk c
            JOIN pg_class cl ON cl.relname = c.table_name
            WHERE c.hypertable_id = h.id AND h.table_name = 'value'
            GROUP BY h.schema_name, h.table_name
        ) row_estimate
        ORDER BY \"schema_name\", \"table_name\";
    ");
    $result = $command->queryAll();
    return floatval($result[0]["row_estimate"]);
}
But do we really have to perform this query for pagination?
It seems quite complex. How can this be optimized? Maybe just by returning a large but fixed number, since nobody needs pagination over a billion pages?
Example:
const LARGE_NUM = 1000;

public function getTotalCount()
{
    if (Yii::$app->session->get("mytable-rows") > MyClass::LARGE_NUM) {
        return MyClass::LARGE_NUM;
    }
    $totalRows = parent::getTotalCount();
    Yii::$app->session->set("mytable-rows", $totalRows);
    return $totalRows;
}
Is this a valid and fast solution? Are there dangerous side effects?
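As an alternative to querying the TimescaleDB catalog by hand, newer TimescaleDB versions ship a helper that returns the planner's row estimate directly. A minimal sketch, assuming TimescaleDB 2.x (older releases exposed a similar hypertable_approximate_row_count()):

-- Estimate for the hypertable 'value' based on planner statistics;
-- fast because it never scans the chunks, but only as fresh as the
-- last autovacuum/ANALYZE run.
SELECT approximate_row_count('value');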

Related

Column is not updating in PostgreSQL

I tried to update my table like below:
$query = "select *
FROM sites s, companies c, tests t
WHERE t.test_siteid = s.site_id
AND c.company_id = s.site_companyid
AND t.test_typeid = '20' AND s.site_id = '1337'";
$queryrow = $db->query($query);
$results = $queryrow->as_array();
foreach ($results as $key => $val) {
    $update = "update tests set test_typeid = ? , test_testtype = ? where test_siteid = ?";
    $queryrow = $db->query($update, array('10', 'Meter Calibration Semi Annual', $val->site_id));
}
The above code runs, but the update does not set test_typeid to '10'; the column ends up empty instead. The other columns update fine. I don't know why this one column is not updating; its type is integer. I am using PostgreSQL.
What did I do wrong in the code? Kindly advise me on this.
Thanks in advance.
First, learn to use proper JOIN syntax: never use commas in the FROM clause; always use proper explicit JOIN syntax.
You can write the update in one statement:
update tests t
    set test_typeid = '10',
        test_testtype = 'Meter Calibration Semi Annual'
from sites s join
     companies c
     on c.company_id = s.site_companyid
where t.test_siteid = s.site_id and
      t.test_typeid = 20 and s.site_id = 1337;
I assume the ids are numbers, so there is no need to use single quotes for the comparisons.
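For comparison, here is the original lookup written with explicit JOIN syntax (a sketch reusing the table and column names from the question):

-- Same result as the comma-join SELECT, with the join conditions
-- stated where they belong:
SELECT *
FROM tests t
JOIN sites s ON t.test_siteid = s.site_id
JOIN companies c ON c.company_id = s.site_companyid
WHERE t.test_typeid = 20
  AND s.site_id = 1337;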

Optimizing Postgres query with timestamp filter

I have a query:
SELECT DISTINCT ON (analytics_staging_v2s.event_type, sent_email_v2s.recipient, sent_email_v2s.sent) sent_email_v2s.id, sent_email_v2s.user_id, analytics_staging_v2s.event_type, sent_email_v2s.campaign_id, sent_email_v2s.recipient, sent_email_v2s.sent, sent_email_v2s.stage, sent_email_v2s.sequence_id, people.role, people.company, people.first_name, people.last_name, sequences.name as sequence_name
FROM "sent_email_v2s"
LEFT JOIN analytics_staging_v2s ON sent_email_v2s.id = analytics_staging_v2s.sent_email_v2_id
JOIN people ON sent_email_v2s.person_id = people.id
JOIN sequences on sent_email_v2s.sequence_id = sequences.id
JOIN users ON sent_email_v2s.user_id = users.id
WHERE "sent_email_v2s"."status" = 1
AND "people"."person_type" = 0
AND (sent_email_v2s.sequence_id = 1888) AND (sent_email_v2s.sent >= '2016-03-18')
AND "users"."team_id" = 1
When I run EXPLAIN ANALYZE on it, I get:
Then, if I change that to the following (just removing the sent_email_v2s.sent >= '2016-03-18' condition):
SELECT DISTINCT ON (analytics_staging_v2s.event_type, sent_email_v2s.recipient, sent_email_v2s.sent) sent_email_v2s.id, sent_email_v2s.user_id, analytics_staging_v2s.event_type, sent_email_v2s.campaign_id, sent_email_v2s.recipient, sent_email_v2s.sent, sent_email_v2s.stage, sent_email_v2s.sequence_id, people.role, people.company, people.first_name, people.last_name, sequences.name as sequence_name
FROM "sent_email_v2s"
LEFT JOIN analytics_staging_v2s ON sent_email_v2s.id = analytics_staging_v2s.sent_email_v2_id
JOIN people ON sent_email_v2s.person_id = people.id
JOIN sequences on sent_email_v2s.sequence_id = sequences.id
JOIN users ON sent_email_v2s.user_id = users.id
WHERE "sent_email_v2s"."status" = 1
AND "people"."person_type" = 0
AND (sent_email_v2s.sequence_id = 1888) AND "users"."team_id" = 1
when I run EXPLAIN ANALYZE on this query, the results are:
EDIT:
The results above from today are about as I expected. When I ran this last night, however, the difference created by including the timestamp filter was about 100x slower (0.5s -> 59s). The EXPLAIN ANALYZE from last night showed all of the time increase to be attributed to the first unique/sort operation in the query plan above.
Could there be some kind of caching issue here? I am worried that something else might be going on, perhaps transiently, that makes this query take 100x longer, since it happened at least once.
Any thoughts are appreciated!
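One way to check for a cold-cache effect is the BUFFERS option of EXPLAIN, which reports how many blocks came from the buffer cache versus disk. A minimal sketch, reusing the filters from the query above:

EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM sent_email_v2s
WHERE status = 1
  AND sequence_id = 1888
  AND sent >= '2016-03-18';
-- "shared hit" blocks came from the buffer cache, "read" blocks from
-- disk; a large read count on the slow run would point to caching.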

Update a big amount of data in PostgreSQL

The main table used is transaction, and it can store millions of rows (let's say 4-5 million max). I need to update a status as fast as possible.
The update query looks like this :
UPDATE transaction SET status = 'TO_EXECUTE'
WHERE transaction.id IN (SELECT transaction.id FROM transaction
                         JOIN anotherTable ON transaction.id = anotherTable.id
                         JOIN anotherTable2 ON transaction.serviceId = anotherTable2.id
                         WHERE transaction.status = :filter1 AND transaction.filter2 = :filter2 ...)
Do you have a better solution? Could it be better to create another table to store the status and the id? (I read that updating large tables can be really slow.)
The IN part of your query could likely be rewritten as EXISTS to potentially get improvements, depending on the other table layouts and volume. Also, it's highly possible that you do not need the transaction table mentioned yet again in the subquery (EXISTS or IN):
UPDATE transaction tx SET status = 'TO_EXECUTE'
WHERE EXISTS (SELECT *
              FROM anotherTable
              JOIN anotherTable2 ON tx.serviceId = anotherTable2.id
              WHERE anotherTable.id = tx.id
                AND tx.status = :filter1 AND tx.filter2 = :filter2
                ...)
try this:
-- In PostgreSQL the UPDATE target cannot be referenced inside a
-- FROM-list JOIN condition, so all correlations go in the WHERE:
UPDATE transaction
SET status = 'TO_EXECUTE'
FROM anotherTable, anotherTable2
WHERE transaction.id = anotherTable.id
  AND transaction.serviceId = anotherTable2.id
  AND transaction.status = :filter1
  AND transaction.filter2 = :filter2
  ...
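As a side note, if the UPDATE touches a large share of those 4-5 million rows, it can help to run it in batches so each transaction stays small and vacuum can keep up. A minimal sketch, assuming transaction.id is an integer key (the window bounds are hypothetical and would be advanced by the caller in a loop):

UPDATE transaction
SET status = 'TO_EXECUTE'
WHERE id BETWEEN 1 AND 100000  -- advance this window per batch
  AND status = :filter1;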

Additional conditions in JOIN

I have tables with articles and users; both have a many-to-many mapping to a third table, reads.
What I am trying to do here is to get all unread articles for a particular user (user_id not present in the reads table).
My query gets all articles, but the read ones are marked, which is fine as I can filter them out (the user_id field contains the id of the user in question).
I have an SQL query like this:
SELECT articles.id, reads.user_id
FROM articles
LEFT JOIN reads
    ON articles.id = reads.article_id AND reads.user_id = 9
ORDER BY articles.last_update DESC LIMIT 5;
Which yields following:
 articles.id | reads.user_id
-------------+---------------
    57125839 |             9
    57065456 |
    56945065 |
    56945066 |
    56763090 |
(5 rows)
This is fine. This is what I want.
I'd like to get the same result in Catalyst using my article model, but I cannot find any option to add conditions to a JOIN clause.
Do you know any way to add AND X = Y to a DBIx::Class JOIN?
I know this can be done with a custom result source and a virtual view, but I have some other queries that could benefit from it, and I'd like to avoid creating a virtual view for each of them.
Thanks,
Canto
I don't even know what Catalyst is, but I can hack the SQL query:
select articles.id, reads.user_id
from articles
left join
(
    select *
    from reads
    where user_id = 9
) reads on articles.id = reads.article_id
order by articles.last_update desc
limit 5;
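Since only the unread articles are wanted, the same idea can be taken one step further with an anti-join that filters the read ones out in SQL directly (a sketch over the same tables):

SELECT articles.id
FROM articles
LEFT JOIN reads
    ON articles.id = reads.article_id AND reads.user_id = 9
WHERE reads.article_id IS NULL  -- keep only rows with no matching read
ORDER BY articles.last_update DESC
LIMIT 5;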
I got a solution.
It's not straightforward, but it's better than a virtual view.
http://search.cpan.org/dist/DBIx-Class/lib/DBIx/Class/Relationship/Base.pm#condition
The page above describes how to use conditions in a JOIN clause.
However, my case needs a variable in those conditions, which is not available by default in the model.
So, bending the model concept a bit and introducing a variable into it, we have the following.
In the model file:

our $USER_ID;

__PACKAGE__->has_many(
    pindols => "My::MyDB::Result::Read",
    sub {
        my $args = shift;
        die "no user_id specified!" unless $USER_ID;
        return ({
            # join condition: reads.article_id = articles.id
            "$args->{self_alias}.id" => { -ident => "$args->{foreign_alias}.article_id" },
            # extra condition: reads.user_id = $USER_ID (a bound value, not an identifier)
            "$args->{foreign_alias}.user_id" => $USER_ID,
        });
    }
);
In the controller:

$My::MyDB::Result::Article::USER_ID = $c->user->id;

$articles = $channel->search(
    { "pindols.user_id" => undef },
    {
        page     => int($page),
        rows     => 20,
        order_by => 'last_update DESC',
        prefetch => "pindols",
    }
);
This fetches all unread articles and yields the following SQL:
SELECT me.id, me.url, me.title, me.content, me.last_update, me.author, me.thumbnail,
       pindols.article_id, pindols.user_id
FROM (
    SELECT me.id, me.url, me.title, me.content, me.last_update, me.author, me.thumbnail
    FROM articles me
    LEFT JOIN reads pindols ON (me.id = pindols.article_id AND pindols.user_id = 9)
    WHERE (pindols.user_id IS NULL)
    GROUP BY me.id, me.url, me.title, me.content, me.last_update, me.author, me.thumbnail
    ORDER BY last_update DESC
    LIMIT ?
) me
LEFT JOIN reads pindols ON (me.id = pindols.article_id AND pindols.user_id = 9)
WHERE (pindols.user_id IS NULL)
ORDER BY last_update DESC
-- bind values: '20'
Of course you can skip the paging, but I had it in my code so I included it here.
Special thanks goes to deg from #dbix-class on irc.perl.org and https://blog.afoolishmanifesto.com/posts/dbix-class-parameterized-relationships/.
Thanks,
Canto

How to determine the size of a Full-Text Index on SQL Server 2008 R2?

I have a SQL Server 2008 R2 database with some tables on it, some of which have a Full-Text Index defined. I'd like to know how to determine the size of the index of a specific table, in order to control and predict its growth.
Is there a way of doing this?
The catalog view sys.fulltext_index_fragments keeps track of the size of each fragment, regardless of catalog, so you can take the SUM this way. This assumes the limitation of one full-text index per table is going to remain the case. The following query will get you the size of each full-text index in the database, again regardless of catalog, but you could use the WHERE clause if you only care about a specific table.
SELECT
    [table] = OBJECT_SCHEMA_NAME(table_id) + '.' + OBJECT_NAME(table_id),
    size_in_KB = CONVERT(DECIMAL(12,2), SUM(data_size/1024.0))
FROM sys.fulltext_index_fragments
-- WHERE table_id = OBJECT_ID('dbo.specific_table_name')
GROUP BY table_id;
Also note that if the count of fragments is high, you might consider a reorganize.
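A reorganize merges the closed full-text index fragments back together. A minimal sketch, assuming a full-text catalog named MyCatalog (a hypothetical name; replace it with your own):

-- Merges closed full-text index fragments in the catalog
ALTER FULLTEXT CATALOG MyCatalog REORGANIZE;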
If you are after a specific catalog, use SSMS:
- Click on [Database] and expand the objects
- Click on [Storage]
- Right-click on the specific catalog
- Choose Properties
In the General tab you will find the catalog size.
I use something similar to this (which will also calculate the size of XML indexes, etc., if present):
SELECT S.name,
       SO.name,
       SIT.internal_type_desc,
       rows = CASE WHEN GROUPING(SIT.internal_type_desc) = 0 THEN SUM(SP.rows) END,
       TotalSpaceGB  = SUM(SAU.total_pages) * 8 / 1048576.0,
       UsedSpaceGB   = SUM(SAU.used_pages) * 8 / 1048576.0,
       UnusedSpaceGB = SUM(SAU.total_pages - SAU.used_pages) * 8 / 1048576.0,
       TotalSpaceKB  = SUM(SAU.total_pages) * 8,
       UsedSpaceKB   = SUM(SAU.used_pages) * 8,
       UnusedSpaceKB = SUM(SAU.total_pages - SAU.used_pages) * 8
FROM sys.objects SO
INNER JOIN sys.schemas S ON S.schema_id = SO.schema_id
INNER JOIN sys.internal_tables SIT ON SIT.parent_object_id = SO.object_id
INNER JOIN sys.partitions SP ON SP.object_id = SIT.object_id
INNER JOIN sys.allocation_units SAU ON (SAU.type IN (1, 3) AND SAU.container_id = SP.hobt_id)
                                    OR (SAU.type = 2 AND SAU.container_id = SP.partition_id)
WHERE S.name = 'schema'
--AND SO.name IN ('TableName')
GROUP BY GROUPING SETS(
    (S.name, SO.name, SIT.internal_type_desc),
    (S.name, SO.name),
    (S.name),
    ())
ORDER BY S.name,
         SO.name,
         SIT.internal_type_desc;
This will generally give numbers higher than sys.fulltext_index_fragments, but when combined with the sys.partitions of the table, it adds up to the numbers returned by EXEC sys.sp_spaceused @objname = N'schema.TableName';.
Tested with SQL Server 2016, but the documentation says these views should be present since 2008.