I am having a problem with log searching speed and disk size.
The log is extremely big: it has about 220 million rows, takes about 25 gigabytes on disk, and some selects take several minutes to complete.
How does it work?
The log is saved in the database using SQL Anywhere, currently version 9, and will soon be migrated to 11 (we tried 12, but due to driver and other problems we went back to 11).
The log consists of two tables (names translated to English so people here can understand them):
LogTable
Id, DateTime, User, Url, Action and TableName.
Action is what the user did: insert/delete/update.
TableName is which table in the database was affected.
LogTableFields
Id, LogTable_Id, FieldName, NewValue, OldValue.
LogTable_Id is foreign key from LogTable.
FieldName is the field of the affected table.
It is important to note that NewValue and OldValue are varchar, because every kind of field from the other tables (datetime, int, etc.) gets recorded in them.
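For reference, here is a sketch of the two log tables as described above. The column types and lengths are assumptions (the question does not give them); "DateTime" and "User" are quoted because they clash with keywords.
CREATE TABLE LogTable (
    Id         INTEGER PRIMARY KEY,
    "DateTime" TIMESTAMP,
    "User"     VARCHAR(50),
    Url        VARCHAR(255),
    Action     VARCHAR(10),      -- insert / delete / update
    TableName  VARCHAR(128)
);
CREATE TABLE LogTableFields (
    Id          INTEGER PRIMARY KEY,
    LogTable_Id INTEGER REFERENCES LogTable (Id),
    FieldName   VARCHAR(128),
    NewValue    VARCHAR(255),    -- lengths assumed
    OldValue    VARCHAR(255)
);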
Why it was made this way?
Because we must record everything important. The system is made for a governmental traffic department (I'm not sure of the proper name in English, but you get an idea of what this is about), and sometimes they demand some kind of ad-hoc report.
Until now, we have built our reports simply by running SQL selects. However, they take several minutes to complete, even when filtered by datetime. That isn't much of an issue to complain about when reports aren't requested often.
But they are demanding more and more reports, to the point where we need to create a feature in the software with a nice, polished report. As we never know their needs in advance, we must go back to the log and unbury the data.
Some of the requested information exists only in the log (e.g., which user improperly granted access to a vehicle to someone).
Some ideas suggested so far:
Idea 1:
I did some research and was told to use NoSQL with CouchDB. But from the little I have read, I feel NoSQL isn't a solution for my problem. I can't argue why, since I have no experience with it.
Idea 2:
Separate the log tables physically from the database or move them to another machine.
Idea 3: Create a mirror of every table with a version field to keep history.
I'd like a macro-level optimization, or an architecture change if needed.
This seems like a pretty standard audit table. I'm not sure you need to go to a NoSQL solution for this. 220 million rows will be comfortably handled by most RDBMSes.
It seems that the biggest problem is the table structure. Generally you flatten the table to improve logging speed and normalize it to improve reporting speed. As you can see, these goals conflict.
If you were using something like MS SQL, you could build a single flat table for logging performance, then build a simple Analysis Services cube on top of it.
Another option would be to just optimize for reporting assuming you could maintain sufficient logging throughput. To do that, you may want to create a structure like this:
create table LogTable (
LogTableID int identity(1,1),
TableName varchar(100),
Url varchar(200)
)
create table LogUser (
LogUserID int identity(1,1),
UserName varchar(100)
)
create table LogField (
LogFieldID int identity(1,1),
FieldName varchar(100)
)
create table LogData (
LogDataID bigint identity(1,1),
LogDate datetime,
LogTableID int references LogTable(LogTableID),
LogFieldID int references LogField(LogFieldID),
LogUserID int references LogUser(LogUserID),
Action char(1), -- U = update, I = insert, D = delete
OldValue varchar(100),
NewValue varchar(100)
)
This should still be fast enough to log data quickly, but provide enough performance for reporting. Index design is also important, generally done in order of increasing cardinality, so something like LogData(LogTableID, LogFieldID, LogDate). You can also get fancy with partitioning to allow for parallelized queries.
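As a minimal sketch of that composite index (the index name is just illustrative):
create index IX_LogData_Reporting on LogData (LogTableID, LogFieldID, LogDate);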
Adding proper indices is going to be the biggest improvement you can make. You don't mention having any indices, so I assume you don't have any. That would make it very slow.
For example, limiting your query to a particular range of DateTime doesn't help at all unless you have an index on DateTime. Without an index, the database still needs to touch nearly all 25GB of data to find the few rows that are in the right time range. But with an index, it could quickly identify the few rows that are in the time range you care about.
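For instance, a sketch of indexes that would cover the datetime filter and the join to the detail table, using the table and column names from the question ("DateTime" is quoted because of the keyword clash):
CREATE INDEX idx_LogTable_DateTime ON LogTable ("DateTime");
CREATE INDEX idx_LogTableFields_LogTable_Id ON LogTableFields (LogTable_Id);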
In general, you should always ask your database what plan it is using to execute a query that is taking too long. I'm not particularly familiar with SQL Anywhere, but I know it has a Plan Viewer that can do this. You want to identify big sequential scans and add indices on the fields they touch.
I doubt you would see a measurable improvement from breaking up the table and using integer foreign keys. To the extent that your queries touch many columns, you'll just end up joining all those tables back together anyway.
Related
I have a "services" table for detailing services that we provide. Among the data that needs recording are several small one-to-many relationships (all with a foreign key constraint to the service_id) such as:
service_owners -- user_ids responsible for delivery of service
service_tags -- e.g. IT, Records Management, Finance
customer_categories -- ENUM value
provider_categories -- ENUM value
software_used -- self-explanatory
The problem I have is that I want to keep a history of updates to a service, for which I'm using an update trigger on the table that performs an insert into a history table matching the original columns. However, if a normalized approach to the above data is used, with separate tables and foreign keys for each one-to-many relationship, any update on those tables will not be recognised in the history of the service.
Does anyone have any suggestions? It seems like I need to store the child keys in the service table to maintain the integrity of the service history. Is a delimited text field a valid approach here or, as I am using PostgreSQL, perhaps arrays are also a valid option? These feel somewhat dirty though!
Thanks.
If your table is:
create table T (
ix int identity primary key,
val nvarchar(50)
)
And your history table is:
create table THistory (
ix int identity primary key,
val nvarchar(50),
updateType char(1), -- C=Create, U=Update or D=Delete
updateTime datetime,
updateUsername sysname
)
Then you just need to put an update trigger on all tables of interest. You can then find out what the state of any/all of the tables were at any point in history, to determine what the relationships were at that time.
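A minimal sketch of such an update trigger (SQL Server style T-SQL, matching the example tables above; in PostgreSQL the equivalent would be a plpgsql trigger function). In practice you would probably also store T's key in THistory so a history row can be joined back to its source row.
create trigger trg_T_History on T
after update
as
begin
    -- copy the pre-update state of each modified row into the history table
    insert into THistory (val, updateType, updateTime, updateUsername)
    select d.val, 'U', getdate(), suser_sname()
    from deleted d;
end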
I'd avoid using arrays in any database whenever possible.
I don't like updates for the exact reason you are describing here: you lose information as it's overwritten. My answer is quite simple: don't update. I'm not sure if you're at a point where this can be implemented, but if you can, I'd recommend using the main table itself to store history (no need for a second set of history tables).
Add a column to your main header table called 'active'. This can be a character or a bit (0 is off and 1 is on). Then it's a bit of trigger magic: when an update is performed, you insert a row into the table identical to the record being overwritten, with a status of '0' (inactive), and then update the existing row. This process keeps the ID column on the active record the same; the newly inserted record is the inactive one with a new ID.
This way no data is ever lost (admittedly you are storing quite a few rows...) and the history can easily be viewed with a select where active = 0.
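A sketch of that trigger magic, using a hypothetical table and column names (T-SQL here; the same idea works as a plpgsql trigger):
create table MainHeader (
    ID     int identity(1,1) primary key,
    Name   varchar(100),
    Active bit not null default 1
);
go
create trigger trg_MainHeader_History on MainHeader
after update
as
begin
    -- re-insert the overwritten values as an inactive row; the updated row keeps its ID
    insert into MainHeader (Name, Active)
    select d.Name, 0
    from deleted d;
end;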
The pain here is if you are working on something already implemented: every existing query that hits this table will need to be updated to include a check on the active column. That makes this solution very easy to implement if you are designing a new system, but a pain for a long-standing application. Unfortunately, existing reports will include both inactive and active records (without throwing an error) until you can modify the where clauses.
I am trying to redesign a Pg database to gain more performance. The DB is for an ERP information system and it holds a large amount of data (four years). Every year was in a separate database, which was a bad solution (building reports was a pain), so I consolidated all four DBs into one... but... some tables are just too large! In order to gain some performance I decided to divide the data in those tables. I have two ways to do this.
First: dividing tables into "arch_table" and "working_table" and using views for reporting.
or
Second: using partitioning (say separate partition for every year).
So, my question is: which way is better, partitioning or some archiving system?
PostgreSQL's partitioning is, effectively, a bunch of child tables that use a check constraint to verify that only the correct data is in each partition. A parent table is created, and additional partitions are created that inherit from the master:
CREATE TABLE measurement (
city_id int not null,
logdate date not null,
peaktemp int,
unitsales int
);
CREATE TABLE measurement_y2006m02 ( ) INHERITS (measurement);
CREATE TABLE measurement_y2006m03 ( ) INHERITS (measurement);
...
CREATE TABLE measurement_y2007m11 ( ) INHERITS (measurement);
CREATE TABLE measurement_y2007m12 ( ) INHERITS (measurement);
CREATE TABLE measurement_y2008m01 ( ) INHERITS (measurement);
Obviously, I've omitted a bit of code, but you can check out the documentation on PostgreSQL table partitioning. The most important part of partitioning is to make sure you build automatic scripts to create new partitions into the future as well as merge old partitions.
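A sketch of the omitted pieces, following the PostgreSQL documentation's example: each child table would be declared with a CHECK constraint so the planner can prune it, and a trigger on the parent routes inserts to the correct child.
CREATE TABLE measurement_y2006m02 (
    CHECK ( logdate >= DATE '2006-02-01' AND logdate < DATE '2006-03-01' )
) INHERITS (measurement);

CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.logdate >= DATE '2006-02-01' AND NEW.logdate < DATE '2006-03-01' ) THEN
        INSERT INTO measurement_y2006m02 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range; add a partition and update measurement_insert_trigger()';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER insert_measurement_trigger
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE measurement_insert_trigger();
Also make sure constraint_exclusion is enabled in postgresql.conf (it defaults to 'partition' in recent versions) so queries that filter on logdate only scan the relevant children.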
Operationally, when PostgreSQL runs a query like SELECT * FROM measurement WHERE logdate BETWEEN '2006-02-13' AND '2006-02-22', the optimizer goes "Aha! I know what's up here, there's a partition. I'll just look at table measurement_y2006m02 and pull back the appropriate data."
As you age data out of the main partitions, you can either just drop the old tables or else merge them into an archive partition. Much of this work can be automated through scripting - all you really need to do is write the scripts once and test it. A side benefit is that older data tends to not change - many partitions will require no index maintenance or vacuuming.
Keep in mind that partitioning is largely a data management solution and may not provide the performance benefit that you're looking for. Tuning queries, applying indexes, and examining the PostgreSQL configuration (postgresql.conf, storage configuration, and OS configuration) may lead to far bigger performance gains than partitioning your data.
You should use partitioning. It's exactly what you need.
I have seen performance tweaks for deletes on normal tables in T-SQL.
But are there performance tweaks to be made for deletes on table variables?
EDIT
Here's an example:
The plot thickens, as UserExclusionsEvaluate is actually a CTE, but I'm going to try and optimise it around the table variable first (if possible). The CTE itself runs very quickly; it's just the delete that is slow.
DELETE FROM @UsersCriteria
FROM @UsersCriteria UsersCriteria
WHERE UserId IN (SELECT UserID FROM UserExclusionsEvaluate WHERE PushRuleExclusionMet = 1)
In its current incarnation, @UsersCriteria is:
DECLARE @UsersCriteria TABLE
(
UserId int primary key,
PushRuleUserCriteriaType int
)
I've tried @UsersCriteria without the primary key and experimented with clustered and non-clustered options.
It's also possible the problem is with the IN; I've tried a JOIN on a subquery as well.
EDIT:
GOOD NEWS!
After lots of playing with the SQL, including moving the subquery into a chained CTE, attempting table hints, etc., a simple change from the table variable to a temp table dramatically improved the performance.
Which is really interesting, as the delete ran fine by itself and the subquery (on the CTE) ran fine by itself, but mixing the two ran mega slow.
I'm guessing that the optimiser can't kick in when a CTE is used in a subquery, or maybe only when it's mixed with the delete.
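For reference, a minimal sketch of that change, replacing the table variable declared above with a temp table of the same shape:
CREATE TABLE #UsersCriteria
(
    UserId int PRIMARY KEY,
    PushRuleUserCriteriaType int
);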
Not really.
Unless you define a PK in the DECLARE, which may help: there are no statistics for table variables, and the table is assumed to contain only one row.
Well there is a limited amount you can do. However, if you have a large data set in the table variable, you should be using a temp table instead if you need better performance.
You could also do the deletes in batches (say 1,000 at a time), as sketched below.
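A sketch of that batching, assuming the exclusion set has first been materialized into a hypothetical temp table #Exclusions (a CTE cannot be referenced across statements):
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) FROM #UsersCriteria
    WHERE UserId IN (SELECT UserID FROM #Exclusions WHERE PushRuleExclusionMet = 1);

    IF @@ROWCOUNT = 0 BREAK;
END;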
Otherwise, show us your delete statement and we'll see if we spot anything that can be improved.
NO.
Table variables are unindexable and transient. They have no statistics.
They are not intended to store very large amounts of data.
If you have a table variable that is big enough to give you performance problems when you delete from it, you're using them in an unintended way. Put that data into a #Temp table or a real table so you have more control.
Imagine a table with the following structure on PostgreSQL 9.0:
create table raw_fact_table (text varchar(1000));
For the sake of simplification I only mention one text column, in reality it has a dozen. This table has 10 billion rows and each column has lots of duplicates. The table is created from a flat file (csv) using COPY FROM.
To increase performance I want to convert to the following star schema structure:
create table dimension_table (id int, text varchar(1000));
The fact table would then be replaced with a fact table like the following:
create table fact_table (dimension_table_id int);
My current method is to essentially run the following query to create the dimension table:
Create table dimension_table (id int, text varchar(1000), primary key(id));
then, to fill the dimension table, I use:
insert into dimension_table (select null, text from raw_fact_table group by text);
Afterwards I need to run the following query:
select id into fact_table from dimension_table inner join raw_fact_table on (dimension_table.text = raw_fact_table.text);
Just imagine the horrible performance I get by comparing all strings to all other strings several times.
On MySQL I could run a stored procedure during the COPY FROM. This could create a hash of a string and all subsequent string comparison is done on the hash instead of the long raw string. This does not seem to be possible on PostgreSQL, what do I do then?
Sample data would be a CSV file containing something like this (I use quotes also around integers and doubles):
"lots and lots of text";"3";"1";"2.4";"lots of text";"blabla"
"sometext";"30";"10";"1.0";"lots of text";"blabla"
"somemoretext";"30";"10";"1.0";"lots of text";"fooooooo"
Just imagine the horrible performance I get by comparing all strings to all other strings several times.
When you've been doing this a while, you stop imagining performance, and you start measuring it. "Premature optimization is the root of all evil."
What does "billion" mean to you? To me, in the USA, it means 1,000,000,000 (or 1e9). If that's also true for you, you're probably looking at between 1 and 7 terabytes of data.
My current method is to essentially run the following query to create the dimension table:
Create table dimension_table (id int, text varchar(1000), primary key(id));
How are you gonna fit 10 billion rows into a table that uses an integer for a primary key? Let's even say that half the rows are duplicates. How does that arithmetic work when you do it?
Don't imagine. Read first. Then test.
Read Data Warehousing with PostgreSQL. I suspect these presentation slides will give you some ideas.
Also read Populating a Database, and consider which suggestions to implement.
Test with a million (1e6) rows, following a "divide and conquer" process. That is, don't try to load a million at a time; write a procedure that breaks it up into smaller chunks. Run
EXPLAIN <sql statement>
You've said you estimate at least 99% duplicate rows. Broadly speaking, there are two ways to get rid of the dupes
Inside a database, not necessarily the same platform you use for production.
Outside a database, in the filesystem, not necessarily the same filesystem you use for production.
If you still have the text files that you loaded, I'd consider first trying outside the database. This awk one-liner will output unique lines from each file. It's relatively economical, in that it makes only one pass over the data.
awk '!arr[$0]++' file_with_dupes > file_without_dupes
If you really have 99% dupes, by the end of this process you should have reduced your 1 to 7 terabytes down to about 50 gigs. And, having done that, you can also number each unique line and create a tab-delimited file before copying it into the data warehouse. That's another one-liner:
awk '{printf("%d\t%s\n", NR, $0);}' file_without_dupes > tab_delimited_file
If you have to do this under Windows, I'd use Cygwin.
If you have to do this in a database, I'd try to avoid using your production database or your production server. But maybe I'm being too cautious. Moving several terabytes around is an expensive thing to do.
But I'd test
SELECT DISTINCT ...
before using GROUP BY. I might be able to do some tests on a large data set for you, but probably not this week. (I don't usually work with terabyte-sized files. It's kind of interesting. If you can wait.)
Just two questions:
- Is it necessary to convert your data in one or two steps?
- May we modify the table while converting?
Running simpler queries may improve your performance (and reduce the server load while doing it).
One approach would be:
Generate dimension_table (if I understand correctly, you don't have performance problems with this), maybe with an additional temporary boolean field...
Repeat: choose one previously unselected entry from dimension_table, select every row from raw_fact_table containing it, and insert those rows into fact_table. Mark the dimension_table record as done, and move on to the next one. You can write this as a stored procedure, and it can convert your data in the background, eating minimal resources...
Or another (probably better) approach:
Create fact_table with EVERY record from raw_fact_table AND a dimension id (so it includes both a dimension_text and a dimension_id column).
Create dimension_table.
Create an insert trigger for fact_table (sketched below) which:
searches for dimension_text in dimension_table,
if not found, creates a new record in dimension_table,
and updates dimension_id to this id.
In a simple loop, insert every record from raw_fact_table into fact_table.
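A sketch of that trigger in PostgreSQL, under a few assumptions: fact_table temporarily has dimension_text and dimension_table_id columns, dimension ids come from a hypothetical sequence dimension_table_id_seq, and the trigger fires BEFORE INSERT so it can set the id on the row being inserted.
CREATE OR REPLACE FUNCTION resolve_dimension() RETURNS trigger AS $$
DECLARE
    dim_id int;
BEGIN
    -- look up the text in the dimension table
    SELECT id INTO dim_id FROM dimension_table WHERE text = NEW.dimension_text;
    IF dim_id IS NULL THEN
        -- not found: create a new dimension record
        INSERT INTO dimension_table (id, text)
        VALUES (nextval('dimension_table_id_seq'), NEW.dimension_text)
        RETURNING id INTO dim_id;
    END IF;
    NEW.dimension_table_id := dim_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER fact_table_resolve_dimension
    BEFORE INSERT ON fact_table
    FOR EACH ROW EXECUTE PROCEDURE resolve_dimension();
The bulk load itself can then be a single INSERT INTO fact_table (dimension_text) SELECT text FROM raw_fact_table, or a batched loop if you want to limit transaction size.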
You are omitting some details there at the end, but I don't see that there necessarily is a problem. It is not in evidence that all strings are actually compared to all other strings. If you do a join, PostgreSQL could very well pick a smarter join algorithm, such as a hash join, which might give you the same hashing that you are implementing yourself in your MySQL solution. (Again, your details are hazy on that.)
-- add a unique index (PostgreSQL hash indexes cannot be declared UNIQUE, so use btree for the uniqueness constraint)
CREATE UNIQUE INDEX uidx ON dimension_table USING btree (text);
-- for case-insensitive matching, index upper(text) instead
For the join itself, try a non-unique hash(text) index and btree(text) to see which one is faster.
I can see several ways of solving your problem.
There is an md5 function in PostgreSQL:
md5(string) calculates the MD5 hash of string, returning the result in hexadecimal.
insert into dimension_table (select null, md5(text), text from raw_fact_table group by text)
Add an md5 field to raw_fact_table as well:
select id into fact_table from dimension_table inner join raw_fact_table on (dimension_table.md5 = raw_fact_table.md5);
Indexes on the MD5 field might help as well.
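A sketch of the supporting DDL this approach assumes (the md5 columns are not part of the original table definitions):
ALTER TABLE raw_fact_table ADD COLUMN md5 char(32);
ALTER TABLE dimension_table ADD COLUMN md5 char(32);

-- on a table this size you would more likely compute the hash at load time, as noted below
UPDATE raw_fact_table SET md5 = md5(text);
UPDATE dimension_table SET md5 = md5(text);

CREATE INDEX idx_raw_fact_md5 ON raw_fact_table (md5);
CREATE INDEX idx_dimension_md5 ON dimension_table (md5);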
Or you can calculate MD5 on the fly while loading the data.
For example, our ETL tool Advanced ETL Processor can do it for you.
Plus it can load data into multiple tables at the same time.
There are a number of online tutorials available on our web site.
For example, this one demonstrates loading a slowly changing dimension:
http://www.dbsoftlab.com/online-tutorials/advanced-etl-processor/advanced-etl-processor-working-with-slow-changing-dimension-part-2.html
If you're creating a temporary table within a stored procedure and want to add an index or two on it, to improve the performance of any additional statements made against it, what is the best approach? Sybase says this:
"the table must contain data when the index is created. If you create the temporary table and create the index on an empty table, Adaptive Server does not create column statistics such as histograms and densities. If you insert data rows after creating the index, the optimizer has incomplete statistics."
but recently a colleague mentioned that if I create the temp table and indices in a different stored procedure from the one which actually uses the temporary table, then the Adaptive Server optimiser will be able to make use of them.
On the whole, I'm not a big fan of wrapper procedures that add little value, so I've not actually got around to testing this, but I thought I'd put the question out there, to see if anyone had any other approaches or advice?
A few thoughts:
If your temporary table is so big that you have to index it, then is there a better way to solve the problem?
You can force it to use the index (if you are sure that the index is the correct way to access the table) by giving an optimiser hint, of the form:
SELECT *
FROM #table (index idIndex)
WHERE id = #id
If you are interested in performance tips in general, I've answered a couple of other questions about that at some length here:
Favourite performance tuning tricks
How do you optimize tables for specific queries?
What's the problem with adding the indexes after you put data into the temp table?
One thing you need to be mindful of is the visibility of the index to other instances of the procedure that might be running at the same time.
I like to add a guid to these kinds of temp tables (and to the indexes), to make sure there is never a conflict. The other benefit of this approach is that you could simply make the temp table a real table.
Also, make sure that you will need to query the data in these temp tables more than once during the running of the stored procedure, otherwise the cost of index creation will outweigh the benefit to the select.
In Sybase, if you create a temp table and then use it in the same proc, the plan for the select is built using an estimate of 100 rows in the table. (The plan is built when the procedure starts, before the tables are populated.) This can result in the temp table being table-scanned since it is only "100 rows". Calling another proc causes Sybase to build the plan for the select with the actual number of rows; this allows the optimizer to pick a better index to use. I have seen significant improvements using this approach, but test on your database, as sometimes there is no difference. A sketch of the pattern follows.
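A rough sketch of the two-procedure pattern, with hypothetical object names. ASE resolves the temp table when the inner procedure is created, so #work is created in the session first and dropped once the procedure exists.
create table #work (id int, val varchar(100))
go

create procedure inner_proc
as
    -- the plan for this select is built when inner_proc is first executed,
    -- at which point #work is already populated and indexed
    select w.id, w.val
    from #work w
    where w.id > 100
go

drop table #work
go

create procedure outer_proc
as
begin
    create table #work (id int, val varchar(100))

    insert #work
    select id, val
    from source_table   -- hypothetical source

    create index idx_work_id on #work (id)

    exec inner_proc
end
go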