Is there a way to tell how large bug numbers can get in Bugzilla? In other words, how can I find the upper limit? Is it regulated by the processor or by the tables? Thanks.
It depends on how the abstract data type MEDIUMSERIAL is implemented on your database.
The abstract schema is in Bugzilla::DB::Schema, and the corresponding implementation is in the appropriate Bugzilla::DB::Schema::* module.
On MySQL, MEDIUMSERIAL is a mediumint, so the largest value for bug_id is 8388607.
There's only one known Bugzilla instance with as many as 1M rows in bugs, so I'm not aware of any drive to increase the size in the vanilla source.
http://lpsolit.wordpress.com/bugzilla-usage-worldwide/
At Yahoo!, where we have more than 5 million rows in bugs, we've already increased bug_id to a 4-byte signed integer, so our new maximum bug_id is 2147483647.
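If you ever need to widen the column yourself, the change described above boils down to altering bug_id's type. A minimal MySQL sketch, not the official upgrade path (any other tables that hold bug IDs would need the same treatment):

    -- widen bugs.bug_id from MEDIUMINT to a 4-byte signed INT
    ALTER TABLE bugs MODIFY bug_id INT NOT NULL AUTO_INCREMENT;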
Related
I am running PostgreSQL version 9.6.13. I have two questions:
What is the max storage range/size of the data types 'jsonb' and 'text'?
How do I find the max storage range/size for other fields?
I looked into the 'pg_type' catalog table that PostgreSQL provides. For both the 'text' and 'jsonb' data types the field typlen = -1, which means they are variable-length types, but nowhere can I find the max storage size for either.
There is more than one limit:
There is a theoretical limit of 1 GB, imposed by TOAST storage.
But the practical limit is significantly lower: processing is memory intensive, and for long values you can run into problems with free memory. There can be performance problems too: a jsonb value is an immutable atomic value, so every update generates a completely new value and every read has to read the complete value. If your values are smaller than about 200 MB, you usually won't have problems.
A database server should not use swap intensively. That means the real limit depends on the number of active queries (active users). A higher max_connections means a lower practical limit for large values.
Types with typlen == -1 are varlena types. You can find the maximum size in the documentation, but again, this is a theoretical limit. Practical limits are lower and depend on the memory available to Postgres and probably on the structure (properties) of the stored objects. You have to test it; there is no other way.
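If you want to poke at this yourself, here is a small sketch. pg_type and pg_column_size() are standard, but my_table and its jsonb column payload are made-up names:

    -- typlen = -1 marks a variable-length (varlena) type
    SELECT typname, typlen
    FROM pg_type
    WHERE typname IN ('text', 'jsonb');

    -- measure how much space individual stored values actually take
    SELECT pg_column_size(payload) AS stored_bytes
    FROM my_table
    ORDER BY stored_bytes DESC
    LIMIT 10;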
orientdb 2.0.5
I have a database in which we create a non-unique index on 2 properties of a class called indexstat.
The two properties which make up the index are a string identifier plus a long timestamp.
Data is created in batches of a few hundred records every 5 minutes. After a few hours old records are deleted.
This file listing shows the files related to that table.
Question:
Why is the .irs file, which according to the documentation is related to non-unique indexes, so monstrously huge after a few hours? It is 298056704 bytes larger than the actual data (.irs size - .sbt size - .cpm size).
I would think the index would be smaller than the actual data.
Second question:
What is best practice here? Should I be using unique indexes instead of non-unique? Should I find a way to make the data in the index smaller (e.g. use longs instead of strings as identifiers)?
Below are file names and the sizes of each.
indexstat.cpm 727778304
indexstatidx.irs 1799095296
indexstatidx.sbt 263168
indexstat.pcl 773260288
This is repeated for a few tables where the index size is larger than the database data.
The internals of *.irs files are organised in such a way that when you delete something from an index, an unused hole is left in the file. At some point, when about half of the file space is wasted, those unused holes come into play again and become available for reuse and allocation. That is done for performance reasons, to lower index data fragmentation. In your case this means that sooner or later the *.irs file will stop growing, and its maximum size should be around 2-3 times the maximum observed size of the corresponding *.pcl file, assuming a single stat record is not much bigger than the id-timestamp pair.
Regarding the second question, in the long run it is almost always better to use the most specific/strict data types to model the data and the most specific/strict index types to index it.
At this link there is a discussion about the index file which may help you.
For the second question, the index should be chosen according to your purpose and your data (not vice versa). The data type (long, string) should be the one that best represents your fields: if, for example, an integer is sufficient for your values, there is no point in using a long. The same goes for the index: if you need to prevent duplicates, choose a unique index, otherwise non-unique; if you need range queries, choose an SB-Tree index rather than a hash index, and so on.
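As an illustration of choosing the most specific types, here is a minimal OrientDB SQL sketch; the property names statId and ts are assumptions, while the class and index names come from the question:

    -- declare strict property types, then index them with a non-unique (SB-Tree) index
    CREATE PROPERTY indexstat.statId LONG
    CREATE PROPERTY indexstat.ts LONG
    CREATE INDEX indexstatidx ON indexstat (statId, ts) NOTUNIQUE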
Our current PostgreSQL database is using GUID's as primary keys and storing them as a Text field.
My initial reaction to this is that trying to perform any kind of minimal cartesian join would be an indexing nightmare when trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.
I'm thinking that we should be using uuid instead, since it is stored as a binary representation of the GUID, whereas text is not, and the indexing you get on a text column seems minimal.
It would be a significant project to change these, and I'm wondering if it would be worth it?
When dealing with UUID numbers store them as data type uuid. Always. There is simply no good reason to even consider text as alternative. Input and output is done via text representation by default anyway. The cast is very cheap.
The data type text requires more space in RAM and on disk, is slower to process, and is more error prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.
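If you do migrate, the column change itself is a one-liner per column; here is a minimal sketch with hypothetical table and column names:

    -- convert an existing text column holding GUID strings to the native uuid type
    ALTER TABLE my_table
      ALTER COLUMN id TYPE uuid USING id::uuid;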
This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
What is the optimal data type for an MD5 field?
bigint?
Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:
-9223372036854775808 to +9223372036854775807
That's 9.2 million million million positive numbers. IOW, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six something billion.
If you burn 1 million IDs per second (which is an insanely high number) you can keep doing so for 292471 years. And then another 292471 years for negative numbers. "Tens or hundreds of millions" is not even close.
UUID is really just for distributed systems and other special cases.
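If a plain server-side sequence is acceptable for your system, a minimal sketch (the table and column names are made up):

    CREATE TABLE my_table (
      id bigserial PRIMARY KEY,   -- a bigint column backed by a sequence
      payload text
    );
    -- on PostgreSQL 10+ you can instead use: id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY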
As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string is either the primary key in a table or part of a unique index.
What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the index comparisons will take a bit longer, but I wouldn't advocate premature optimization if doing so would be painful.
In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found that it tends to be other factors about a query that result in performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.
There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).
Re: your concern about indexing of text: while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, SHA-1s, etc., should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).
One of the big factors about how an index would perform is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically is going to end up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collision (in theory definitionally, since they are GUIDs).
Edit in response to comment from OP:
So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgre recognizes it as a Guid. Just a string that happens to be unique.
Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.
You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.
A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.
So, with that in mind, indexes on text-based UUIDs will certainly be larger, since the values are larger, and comparing two strings versus two numerical values is in theory less efficient, but it is not something that's likely to make a huge difference in this case, at least not in usual cases.
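A quick way to see the size difference for yourself (the literal below is just an example GUID):

    SELECT pg_column_size('123e4567-e89b-12d3-a456-426614174000'::uuid) AS uuid_bytes,  -- 16 bytes
           pg_column_size('123e4567-e89b-12d3-a456-426614174000'::text) AS text_bytes;  -- varlena header plus 36 characters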
I would not optimize up front when doing so would be a significant cost and is likely never to be needed. That bridge can be crossed if that time does come (although I would pursue other query optimizations first, as I mentioned above).
Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.
e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.
There's also a module available for generating uuids, uuid-ossp.
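A minimal sketch of using it, assuming you have (or can create) the extension:

    CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
    SELECT uuid_generate_v4();   -- random (version 4) UUID
    -- on PostgreSQL 13+ the built-in gen_random_uuid() works without any extension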
I need a suggestion on which data type would give better performance if we set one of these as the primary key in DB2: BIGINT or DECIMAL(13,0).
I suspect DECIMAL(13,0) will have issues once the key grows very large, but I wanted a better answer/understanding of this.
Thanks.
Decimal does not have issues. The only thing is that DB2 has to do more operations to retrieve the data once it has been read: DB2 reads the data and then has to work out the decimal part (the scale), even though it is 0.
On the other hand, DB2 reads a BIGINT and needs no further processing; the number is usable as-is in the bufferpool.
If you are actually going to use numbers of up to 13 digits (most of them), DECIMAL will probably be better, because you are not storing unused bytes, although decimals carry extra handling for the precision. By using DECIMAL this way you optimize storage, which translates into better I/O and better performance. However, it depends on the other columns of your table; you have to test which of the two gives you better performance.
When using compression, more CPU cycles are needed to recover the information. You have to test whether performance is affected.
Use BIGINT:
Can store ~19 digits (versus 13)
Will take 8 bytes (versus maybe either 7 or 13 - see next)
Depending on the platform, DECIMAL will be stored as a form of Binary Coded Decimal - for example, on the iSeries (and I can't remember if it's Packed or Zoned). I can't speak to other deployments, unfortunately.
You aren't doing math on these values (things like "next entry" don't count) - save DECIMAL/NUMERIC for measurements/values.
Note that, really, IDs are just a sequence of bits - the fact that an ID happens to be an integer (usually) is irrelevant. It's best to consider them random data; sequential assignment is an optimization detail, there are often gaps (rollbacks, system crashes, whatever), and they're meaningless for anything other than joining.
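A minimal DB2 sketch of the BIGINT approach; the table and column names are made up for illustration:

    CREATE TABLE events (
      event_id BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
      payload  VARCHAR(255),
      PRIMARY KEY (event_id)
    );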
What is the maximum number of placeholders allowed in a single statement? I.e., what is the upper limit of the attribute NUM_OF_PARAMS?
I'm experiencing an odd issue while trying to tune the maximum number of rows in a multi-row insert: setting the number to 20,000 gives me an error because $sth->{NUM_OF_PARAMS} becomes negative.
Reducing the max inserts to 5000 works fine.
Thanks.
As far as I am aware, the only limitation in DBI is that the value is placed into a Perl scalar, so the limit is whatever can be held in that. However, for DBDs it is totally different; I doubt many, if any, databases support 20,000 parameters. BTW, NUM_OF_PARAMS is read-only, so I've no idea what you mean by "set the number to 20,000". I presume you just mean you create a SQL statement with 20,000 parameters, then read NUM_OF_PARAMS and it gives you a negative value. If so, I suggest you report that (with an example) on rt.cpan.org, as it does not sound right at all.
I cannot imagine that creating a SQL statement with 20,000 parameters is going to be very efficient in any database. It would be far better to try to reduce that to a range or something like it if you can. In ODBC, 20,000 parameters would mean 20,000 IPDs and APDs, and those are quite big structures. Since the DB2 CLI library is very like ODBC, I would imagine you are going to eat up loads of memory.
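As a sketch of the "reduce it to a range or something like it" idea, with hypothetical table and column names, the same logical work can usually be expressed with far fewer placeholders:

    -- a 20,000-row multi-row insert carries 20,000 x N placeholders:
    --   INSERT INTO events (id, payload) VALUES (?, ?), (?, ?), ...
    -- smaller batches keep the count well under any driver limit:
    INSERT INTO events (id, payload) VALUES (?, ?), (?, ?), (?, ?);   -- a few hundred rows per execute
    -- or load a staging table in batches and copy it over with no per-row placeholders:
    INSERT INTO events (id, payload)
    SELECT id, payload FROM staging_events;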
Given that 20,000 rows causes negative values and 5,000 doesn't, there is probably a signed 16-bit integer somewhere in the system. That caps the total number of placeholders at 32,767, so the upper bound in rows is roughly 32,767 divided by the number of placeholders per row (around 16,383 rows if each row binds two values).
However, the limit depends on the underlying DBMS and the API used by the DBD module for the DBMS (and possibly the DBD code itself); it is not affected by DBI.
Are you sure that's the best way to deal with your problem?