Relational database management system (RDBMS)

Suppose a column in a database table is defined as char and already contains data. If I change the column definition from char to number, will the change be allowed? If so, how will it affect the data present in the column?

It will attempt to cast the existing content from char to number. If it can't on even one row, the change will fail. (This does depend on which DBMS you are using; I describe the ones I know, and I don't know them all.)
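For example, a minimal sketch in SQL Server syntax (the table, column, and data are hypothetical):

-- Hypothetical table with char data, some of it non-numeric.
CREATE TABLE t (val CHAR(10));
INSERT INTO t (val) VALUES ('123'), ('abc');

-- SQL Server tries to convert every existing value; this fails
-- with a conversion error because 'abc' cannot be cast to INT.
ALTER TABLE t ALTER COLUMN val INT;

-- After deleting or fixing the offending row, the same ALTER succeeds.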

Related

DECIMAL types in T-SQL

I have a table with about 200 million rows and multiple columns of type DECIMAL(p,s) with varying precisions and scales.
Now, as far as I understand, DECIMAL(p,s) is a fixed-size column, with a size depending on the precision; see:
https://learn.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver16
Now, when altering the table and changing a column from DECIMAL(15,2) to DECIMAL(19,6), for example, I would have expected there to be almost no work to be done on the side of the SQL Server, as the bytes required to store the value are the same. Yet the alteration itself takes a long time, so what exactly does the server do when I execute the ALTER statement?
Also, is there any benefit (other than having constraints on a column) to storing a DECIMAL(15,2) instead of, for example, a DECIMAL(19,2)? It seems to me the storage requirements would be the same, but I could store larger values in the latter.
Thanks in advance!
The precision and scale of a decimal/numeric type matter considerably.
As far as SQL Server is concerned, decimal(15,2) is a different data type from decimal(19,6), and is stored differently. You therefore cannot assume that just because the overall storage requirements do not change, nothing else does.
SQL Server stores decimal data types in byte-reversed (little-endian) format, with the scale being the first incrementing value. Changing the definition can therefore require the data to be rewritten; SQL Server will use an internal worktable to safely convert the data and update the values on every page.
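A minimal sketch of the operation in question (table and column names are hypothetical). Per the documentation linked above, precisions 1-9 take 5 bytes, 10-19 take 9 bytes, 20-28 take 13 bytes, and 29-38 take 17 bytes, so both types below occupy 9 bytes per value:

-- DECIMAL(15,2) and DECIMAL(19,6) both store in 9 bytes,
-- yet this is still a size-of-data operation on a large table,
-- because every value must be converted and rewritten:
ALTER TABLE dbo.Trades
    ALTER COLUMN Amount DECIMAL(19,6);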

Is there any case of a table rewrite when adding a nullable column?

On the one hand the documentation states clearly that
When a column is added with ADD COLUMN and a non-volatile DEFAULT is
specified, the default is evaluated at the time of the statement and
the result stored in the table's metadata. That value will be used for
the column for all existing rows. If no DEFAULT is specified, NULL is
used. In neither case is a rewrite of the table required.
https://www.postgresql.org/docs/12/sql-altertable.html
On the other hand, I've heard that, because of the tuple header structure, in some cases a table might be rewritten.
The documentation states that there are three fields responsible for null value storage behavior:
The null bitmap is only present if the HEAP_HASNULL bit is set in
t_infomask. If it is present it begins just after the fixed header and
occupies enough bytes to have one bit per data column (that is, the
number of bits that equals the attribute count in t_infomask2). In
this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When
the bitmap is not present, all columns are assumed not-null.
https://www.postgresql.org/docs/12/storage-page-layout.html
I used this information and the pageinspect extension to see what happens to tuple headers when nullable columns are added or deleted. In every case, when I add or delete a nullable column, I see that existing tuples never change their headers.
So, I cannot find any evidence for the claim that the table might be rewritten. The fields t_infomask, t_infomask2, and t_bits never change their values.
Am I missing something? Can I say that adding nullable columns to high-volume tables is absolutely safe in terms of disk activity (excluding the exclusive lock that is taken)?
The behavior changed in PostgreSQL v11. Before that, adding a column with a non-NULL default would rewrite the table.
The point is that from v11 on, the tuple and its header don't have to be modified; adding such a column is just a metadata change. You won't see that with pageinspect. NULL values are never stored in the tuple; there is just the NULL bitmap, which is present if one of the table columns is nullable.
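A minimal sketch of the experiment (PostgreSQL v11 or later; the table name is hypothetical):

CREATE TABLE demo (id integer);
INSERT INTO demo SELECT generate_series(1, 1000000);

-- Metadata-only on v11+; before v11 this rewrote the whole table:
ALTER TABLE demo ADD COLUMN status text DEFAULT 'new';

-- The existing tuple headers are unchanged, as observed in the question:
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT t_infomask, t_infomask2, t_bits
FROM heap_page_items(get_raw_page('demo', 0));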

Char vs symbol in a KDB parted splay table

I am creating a new table in a KDB database as a parted splay (parted by date). The new table schema has a column called CCYY, which has a lot of repeating values. I am unsure whether I should save it as char or symbol. My main goal is to use the least amount of memory.
Which one should I use, then? What are the benefits and disadvantages of saving repeating values as either a char array or a symbol in a parted splayed setup?
It sounds like you should use symbol.
There's a guide to symbols/enumerations here: http://www.timestored.com/kdb-guides/strings-symbols-enumeration#when-to-use Quote:
Typically you should follow these guidelines:
If the column is used in where-clause equality comparisons, e.g. select from t where sym in `AB -> Symbol
Short, often-repeated strings -> Symbol
Otherwise (long, non-repeated strings) -> String
When evaluating whether or not to use symbol for a column, the cardinality of that column is key. The length of individual values matters less; if anything, longer values might be better off as symbols, as they will be stored only once in the sym file but repeated in a char vector. That consideration is pretty much moot if you compress your data on disk, though.
If your values are short enough, don't forget about the possibility of using .Q.j10, .Q.x10, .Q.j12 and .Q.x12. These will use less space than a char vector. And they don't rely on a sym file, which in complex environments can save you from having to re-enumerate if you are, say, copying tables between HDBs whose sym files are not in sync.
If space is a concern, always compress the data on disk.

How to query Cassandra by date range

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.
My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?
How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?
I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.
Any pointers in this case would be appreciated.
When sorting data, only the column keys are important. The data stored is of no consequence, and neither is the auto-generated timestamp. The CompareWith attribute is what matters here. If you set CompareWith to UTF8Type, then the keys will be interpreted as UTF8 strings. If you set CompareWith to TimeUUIDType, then the keys are automatically interpreted as timestamps. You do not have to specify the data type.
Look at the SlicePredicate and SliceRange definitions on this page: http://wiki.apache.org/cassandra/API This is a good place to start. Also, you might find this article useful: http://www.sodeso.nl/?p=80 In the third part or so, he talks about slice-ranging his queries.
Doug,
Writing to a single column family can sometimes create a hot spot if you are using an order-preserving partitioner, but not if you are using the default RandomPartitioner (unless a subset of users creates vastly more data than all other users!).
If you sorted your rows by time (using an order-preserving partitioner), then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
Columns and keys can be of any type, since the row key is just the first column.
Virtually, the cluster is a circular hash-key ring, and keys get hashed by the partitioner to be distributed around the cluster.
Beware of using dates as row keys, however, since even the randomization of the default RandomPartitioner is limited and you could end up cluttering your data.
What's more, if that date changes, you would have to delete the previous row, since you can only do inserts in C*.
Here is what we know:
A slice range is a range of columns in a row, with a start value and an end value. This is used mostly for wide rows, as columns are ordered. Known column names defined in the CF are indexed, however, so they can be retrieved by specifying their names.
A key slice is a key associated with the sliced column range, as returned by Cassandra.
The equivalent of a WHERE clause uses secondary indexes. You may use inequality operators there; however, there must be at least ONE equality clause in your statement (also see https://issues.apache.org/jira/browse/CASSANDRA-1599).
Using a key range is ineffective with the RandomPartitioner, as the MD5 hash of your key doesn't preserve lexical ordering.
What you want to use is a column-family-based index using a wide row:
CompositeType(TimeUUID | UserID)
In order for this not to become hot, add a first meaningful key (a "shard key") that splits the data across nodes, such as the user type or the region.
Having more data than necessary in Cassandra is not a problem; it's how it is designed. So what you must ask yourself is "what do I need to query?" and then design a column family for it, rather than trying to fit everything in one CF like you would in an RDBMS.
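In modern CQL terms (not the 0.6-era Thrift API the question uses), that wide-row design might look like the following sketch; the table, column, and bucket names are all hypothetical:

CREATE TABLE new_entries (
    bucket  text,      -- shard key, e.g. region or user type, to avoid a hot row
    ts      timeuuid,  -- time component of the composite
    user_id uuid,
    payload text,
    PRIMARY KEY (bucket, ts, user_id)
);

-- Range query over a time window within one shard:
SELECT * FROM new_entries
WHERE bucket = 'eu-west'
  AND ts > maxTimeuuid('2020-01-01')
  AND ts < minTimeuuid('2020-01-02');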

How to alter Postgres table data based on its contents?

This is probably a super simple question, but I'm struggling to come up with the right keywords to find it on Google.
I have a Postgres table that has among its contents a column of type text named content_type. That stores what type of entry is stored in that row.
There are only about 5 different types, and I decided I want to change one of them to display as something else in my application (I had been directly displaying these).
It struck me that it's funny that my view is being dictated by my database model, and I decided I would convert the types being stored in my database as strings into integers, and enumerate the possible types in my application with constants that convert them into their display names. That way, if I ever got the urge to change any category names again, I could just change it with one alteration of a constant. I also have the hunch that storing integers might be somewhat more efficient than storing text in the database.
First, a quick threshold question of, is this a good idea? Any feedback or anything I missed?
Second, and my main question, what's the Postgres command I could enter to make an alteration like this? I'm thinking I could start by renaming the old content_type column to old_content_type and then creating a new integer column content_type. However, what command would look at a row's old_content_type and fill in the new content_type column based off of that?
If you're finding that you need to change the display values, then yes, it's probably a good idea not to store them in a database. Integers are also more efficient to store and search, but I really wouldn't worry about it unless you've got millions of rows.
You just need to run an update to populate your new column:
UPDATE table_name
SET content_type = CASE
        WHEN old_content_type = 'a' THEN 1
        WHEN old_content_type = 'b' THEN 2
        ELSE 3
    END;
If you're on Postgres 8.4 then using an enum type instead of a plain integer might be a good idea.
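If you go the enum route, a minimal sketch (the type name and labels are hypothetical):

CREATE TYPE content_type_enum AS ENUM ('article', 'photo', 'video');

-- Works only if every existing value matches an enum label:
ALTER TABLE my_table
    ALTER COLUMN content_type TYPE content_type_enum
    USING content_type::content_type_enum;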
Ideally you'd have these fields refer to a table containing the definitions of each type, via a foreign key constraint. That way you know that your database is clean and has no invalid values (i.e. you have referential integrity).
There are many ways to handle this:
1. Having a table for each field that can contain a number of values (i.e. like an enum) is the most obvious, but it breaks down when you have a table that requires many attributes (see the sketch after this list).
2. You can use the Entity-Attribute-Value model, but beware that this is easy to abuse and causes problems when things grow.
3. You can use, or refer to, my implementation solution, PET (Parameter Enumeration Tables). This is a halfway house between 1 and 2.
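For option 1, a minimal sketch in PostgreSQL (all table, column, and value names are hypothetical):

CREATE TABLE content_types (
    id   integer PRIMARY KEY,
    name text NOT NULL UNIQUE
);

INSERT INTO content_types (id, name)
VALUES (1, 'article'), (2, 'photo'), (3, 'video');

-- The foreign key guarantees referential integrity:
ALTER TABLE my_table
    ADD CONSTRAINT fk_content_type
    FOREIGN KEY (content_type) REFERENCES content_types (id);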