ROWID and RECID - progress-4gl

What are ROWID and RECID actually in Progress? Can we use RECID instead of ROWID? What is the difference between them?

Both RECID and ROWID are unique pointers to a specific record in the database.
Both are more-or-less physical pointers into the database itself, except for non-OpenEdge tables, where there is no equivalent on the underlying platform. In those cases, the pointer may be composed of the values making up the primary key.
RECIDs were 32-bit integers up through 10.1A, which was fine when the database was an OpenEdge database and had only one area. From 10.1B forward they were upgraded to 64-bit integers.
In v6 the capacity was added to connect to non-OpenEdge databases, and in v8 to create OpenEdge databases of more than one storage area. At that point, RECIDs were insufficient to address all of the records in a table uniquely in all circumstances.
So the ROWID structure was born. Its actual architecture depends on the type of database underneath, but it does not suffer from the limitations of being an integer.
The documentation is fairly clear in stating that RECIDs should not be used going forward, except for code that manipulates the OpenEdge database metaschema.

RECID has been deprecated for a couple of versions now; ROWID is its replacement. I understand that what it actually returns is the physical address of the DB block containing your record. From memory, they introduced ROWID when they wanted to support different DB engines - Oracle / SQL Server et al. - from the 4GL, which implies that there is more in a ROWID than in a RECID.
I'd stay away from RECID; you might get away with it short term, but you're giving yourself a potential problem that you could avoid altogether.

Related

PostgreSQL indexes and replication techniques

Background: so far I've always used Django with its ORM to build small websites, so which database (MySQL vs PostgreSQL) was doing all the work behind the curtains wasn't really an issue.
Recently I decided to learn more about the differences between those two. I've just finished reading this (long) article, which explores how indexes work in PostgreSQL, and I am really shocked by the following fact:
"For instance, if we have a table with a dozen indexes defined on it, an update to a field that is only covered by a single index must be propagated into all 12 indexes to reflect the ctid for the new row."
I'm not an expert at all, but it sounds insane to me that such overhead should happen by design when updating fields not involved in any index.
Moreover, the article goes on to explain how PostgreSQL's replication strategy does not work at the logical level but at the on-disk level, i.e. the master sends the slaves a list (byte by byte) of all changes to apply on disk rather than more abstract instructions such as UPDATE <fields> ON <table> WHERE ....
Although many short articles on the web comparing MySQL and PostgreSQL generally tend to claim that PostgreSQL is technically more advanced (ACID, JSON support, etc.), these two problems seem like serious drawbacks to me. Can you confirm those statements and possibly point out further resources about these issues?
Thank you.
About indexes and performance
It is certainly true that PostgreSQL has to do more work on indexes when a row is updated. This is due to the fact that an UPDATE actually creates a new row version in the table, and the indexes have to point to that new row version.
There is, however, a way to mitigate the impact: if you set fillfactor to less than 100, so that there is free space in the data pages, and no indexed column is updated, PostgreSQL can create a “heap only tuple”, and such a HOT update will not need to touch any index.
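For illustration, here is a rough sketch of what that looks like in practice (table, column and index names are invented); the pg_stat_user_tables view lets you check how many updates were HOT:
CREATE TABLE accounts (
    id      bigint PRIMARY KEY,
    email   text,
    balance numeric
) WITH (fillfactor = 90);   -- leave ~10% of each page free

CREATE INDEX accounts_email_idx ON accounts (email);

-- "balance" is not covered by any index, so with free page space this
-- update can be a heap-only-tuple (HOT) update and touch no index:
UPDATE accounts SET balance = balance + 100 WHERE id = 1;

-- Check how many updates were HOT:
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'accounts';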
MySQL's InnoDB with its secondary indexes (that reference the primary key index) has to do less work updating indexes. You pay the price for that with every index scan: first, you have to scan the secondary index to find the primary key, then you have to scan the primary key index to find the table row.
So there is a trade-off, but I think it is one-sided to unconditionally say that one solution is better.
About replication
MySQL has had a replication solution for much longer than PostgreSQL. It uses the binary log for replication, which is a slightly deceptive name, as it actually contains SQL statements.
Version 9.0 of PostgreSQL introduced streaming replication, which ships the transaction log to the standby. This information is on the physical level, so primary and standby are kept physically identical. This is often more wasteful than shipping SQL statements (index updates!), but it is a very stable solution that leaves no room for replication conflicts.
PostgreSQL v10 has introduced logical replication, which generates an abstract description of the change, similar to an SQL statement. This allows for more flexible replication scenarios.
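As a minimal sketch (publication, subscription and connection details are made up), logical replication in v10+ is set up in plain SQL:
-- On the publishing (primary) server:
CREATE PUBLICATION my_pub FOR TABLE accounts;

-- On the subscriber, where a table with the same schema already exists:
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=primary.example.com dbname=mydb user=replicator'
    PUBLICATION my_pub;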
So the article you are referencing has become somewhat outdated in this respect.

Can Postgres tables be optimized by the column order?

I recently had to propose a set of new Postgres tables to our DB team that will be used by an application I am writing. They failed the design because my table had fields that were listed like so:
my_table
my_table_id : PRIMARY KEY, AUTO INCREMENT INT
some_other_table_id : FOREIGN KEY INT
some_text : CHARACTER VARYING(100)
some_flag : BOOLEAN
They said that the table would not be optimal because some_text appears before some_flag: since CHARACTER VARYING fields are slower to search than BOOLEANs, when doing a table scan it is faster to have a table structure whose columns are sequenced from greatest precision to least precision, like this:
my_table
my_table_id : PRIMARY KEY, AUTO INCREMENT INT
some_other_table_id : FOREIGN KEY INT
some_flag : BOOLEAN
some_text : CHARACTER VARYING(100)
These DBAs come from a Sybase background and have only recently switched over as our Postgres DBAs. I am thinking that this is perhaps a Sybase optimization that doesn't apply to Postgres (I would think Postgres is smart enough to somehow not care about column sequence).
Either way, I can't find any Postgres documentation that confirms or denies it. Looking for any battle-worn Postgres DBAs to weigh in on whether this is a valid or bogus (or conditionally valid!) claim.
Speaking from my experience with Oracle on similar issues (there was a big change in behaviour between versions 9 and 10, or 8 and 9 if memory serves, due to CPU overhead in finding column data within a row), I don't believe you should rely on documented behaviour for an issue like this when a practical experiment would be fairly straightforward and conclusive.
So I'd suggest that you create a test case for this. Create two tables with exactly the same data and the columns in a different order, and run repeated and varied tests. Try to implement the entire test as a single script that can be run on a development or test system and which tells you the answer. Maybe the DBAs are right, and you can say, "Hey, confirmed your thoughts on this, thanks a lot", or you might find no measurable and significant difference. In the latter case you can hand the entire test to the DBAs and explain how you can't reproduce the problem. Let them run the tests.
Either way, someone is going to learn something, and you've got a test case you can apply to future (or past) versions.
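As a rough sketch of such a test (all names invented), you could load the same data into two tables that differ only in column order and compare sequential scans:
CREATE TABLE t_text_first (
    id        serial PRIMARY KEY,
    other_id  int,
    some_text varchar(100),
    some_flag boolean
);

CREATE TABLE t_flag_first (
    id        serial PRIMARY KEY,
    other_id  int,
    some_flag boolean,
    some_text varchar(100)
);

-- Same data in both, only the column order differs:
INSERT INTO t_text_first (other_id, some_text, some_flag)
SELECT g, md5(g::text), g % 2 = 0
FROM generate_series(1, 1000000) AS g;

INSERT INTO t_flag_first (other_id, some_flag, some_text)
SELECT other_id, some_flag, some_text FROM t_text_first;

-- Compare full scans that filter on the boolean column:
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t_text_first WHERE some_flag;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t_flag_first WHERE some_flag;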
Lastly, post here on what you found ;)
What your DBAs are probably referring to is the access strategy for "getting to" the boolean value in a given tuple (/row).
In THEIR proposed design, a system can "get to" that value by looking at byte 9.
In YOUR proposed design, the system must first inspect the LENGTH field of all varying-length columns [that come before your boolean column], before it can know the byte offset where the boolean value can be found. That is ALWAYS going to be slower than "their" way.
Their consideration is one of PHYSICAL design (and it is a correct one). Damir's answer is also correct, but it is an answer from the perspective of LOGICAL design.
If the remark by your DBAs is really intended as criticism of a "bad" design, then it should be pointed out to them that LOGICAL design is your job (and column order doesn't matter at that level), and PHYSICAL design is theirs. And if they expect you to do the PHYSICAL design (their job) as well, then there is no longer any reason for the boss to keep them employed.
From a database design point of view, there is no difference between your design and what your DBAs suggest -- your application should not care. In relational databases there is (logically) no such thing as column order; in fact, if the order of columns matters logically, the design fails 1NF.
So, simply pass all the CREATE TABLE scripts to your DBAs and let them implement (reorder columns) in whatever way they feel is optimal at the physical level. You simply continue with the application.
Database design cannot fail on the order of columns -- it is simply not part of the design process.
Future users of large data banks must be protected from having to know how the data is organized in the machine ... the problems treated here are those of data independence -- the independence of application programs and terminal activities from growth in data types and changes ...
E.F. Codd ~ 1979
Changes to the physical level ... must not require a change to an application ...
Rule 8: Physical data independence (E.F. Codd ~ 1985)
So here we are -- 33 years later ...

How to create HBase columns / table for related but separated entities

I saw video tutorial on HBase, where data got stored in a table like this:
EmployeeName - Height - ProjectInfo
------------------------------------
Jdoe - 5'7" - ProjA-TeamLead, ProjB-Contributor
What happens when some business requirement comes up that the name of ProjA has to be changed to ProjX?
Wouldn't there be a separate table where Project information is stored?
In a relational database, yes: you'd have a project table, and the employee table would refer to it via a foreign key and only store the immutable project id (rather than the name). Then when you want to query it (in a relational database), you'd do a JOIN like:
SELECT
employee.name,
employee.height,
project.name,
employee_project_role.role_name
FROM
employee
INNER JOIN employee_project_role
ON employee_project_role.employee_id = employee.employee_id
INNER JOIN project
ON employee_project_role.project_id = project.project_id
This isn't how things are done in HBase (and other NoSQL databases): because these databases are geared towards extremely large data sets distributed over many machines, the algorithms needed to transparently execute complex joins like this become a lot harder to pull off in ways that perform well. Thus, HBase doesn't even have built-in joins.
Instead, the general approach with systems like this is that you denormalize your data, and store things in a single table. So in this case, there might be one row per employee, and denormalized into that row is all of the employee's project role info (probably in separate columns -- the contents of a row in HBase is actually a key/value map, so you can represent repeating things like all of their different roles easily).
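Purely for comparison with the join above, the denormalized shape might look roughly like this (names invented; written as flat SQL only for familiarity - in HBase the roles would simply be columns in one column family of the employee's row, e.g. roles:ProjA = 'TeamLead'):
CREATE TABLE employee_denormalized (
    employee_id   BIGINT PRIMARY KEY,
    name          VARCHAR(100),
    height        VARCHAR(10),
    project_roles VARCHAR(1000)  -- e.g. 'ProjA-TeamLead, ProjB-Contributor'
);

-- Reads need no join at all:
SELECT name, height, project_roles
FROM employee_denormalized
WHERE employee_id = 42;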
You're absolutely right, though: if you change the name of the project, that means you'd need to change the data that's stored for every employee. In this respect, the relational model is "cleaner". But if you're dealing with Petabytes of data or trillions of rows, the "clean" abstraction of a relational database becomes a lot messier, because you end up having to shard it all manually. The point of systems like HBase is to pay these costs up front in the design process, and not just assume the relational database will magically solve problems like this for you at scale. (Because it won't).
That said: if you don't expect to have at least terabytes of data (that's a million MB, remember), just do it in a relational database. It'll be much easier.
I think going through this presentation will give you some perspective:
http://ianvarley.com/coding/HBaseSchema_HBaseCon2012.pdf
And for a more programmatic treatment, have a look at:
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

How do I reset the primary key count/max in Core Data?

I've managed to delete all entities stored using Core Data (following this answer).
The problem is, I've noticed the primary key is still counting upwards. Is there a way (without manually writing a SQL query) to reset the Z_MAX value for the entity? Screenshot below to clarify what I mean.
The value itself isn't an issue, but I'm just concerned that at some point in the future the maximum integer may be reached and I don't want this to happen. My application syncs data with a web service and caches it using Core Data, so potentially the primary key may increase by hundreds/thousands at a time. Deleting the entire SQLite DB isn't an option as I need to retain some of the information for other entities.
I've seen the 'reset' method, but surely that will reset the entire SQLite DB? How can I reset the primary key for just this one set of entities? There are no relationships to other entities with the primary key I want to reset.
I'm just concerned that at some point in the future the maximum integer may be reached and I don't want this to happen.
Really? What type is your primary key? Because if it's anything other than an Int16 you really don't need to care about that. A signed 32-bit integer gives you 2,147,483,647 values. A 64-bit signed integer gives you 9,223,372,036,854,775,807 values.
If you think you're going to use all those up, you probably have more important things to worry about than having an integer overflow.
More importantly, if you're using Core Data you shouldn't need to care about or really use primary keys. Core Data isn't a database - when using Core Data you are meant to use relationships and not really care about primary keys. Core Data has no real need for them.
Core Data uses 64-bit integer primary keys. Unless I/O systems get many orders of magnitude faster (which, unlike CPUs, they have not in recent years), you could save records as fast as possible for millions of years.
Please file a bug with bugreport.apple.com when you run out.
Ben
From the SQLite FAQ:
If the largest possible integer key, 9223372036854775807, is in use, then an unused key value is chosen at random.
9,223,372,036,854,775,807 / (1024^4) = 8,388,608 tera-rows. I suspect you will run into other limits first. :) http://www.sqlite.org/limits.html reviews the more practical limits you'll run into.
asking sqlite3 about a handy core data store yields:
sqlite> .schema zbookmark
CREATE TABLE ZBOOKMARK ( Z_PK INTEGER PRIMARY KEY, ...
Note the lack of "autoincrement", which in SQLite means never reuse a key. So Core Data does allow old keys to be reused, meaning you're pretty safe even if you manage to add (and remove most of) that many rows over time.
If you really do want to reset it, poking around in apple's z_ tables is really the only way. [This is not to say that this is a thing you should in fact do. It is not (at least in any code you want to ship), even if it seems to work.]
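For the curious only, this is roughly what that poking around looks like from the sqlite3 command line (assuming the bookmark store from the .schema example above; do not ship anything like this):
-- Core Data tracks the per-entity counter in the Z_PRIMARYKEY table;
-- Z_MAX is the highest Z_PK handed out so far for each entity.
SELECT Z_ENT, Z_NAME, Z_MAX FROM Z_PRIMARYKEY;

-- "Resetting" it would mean editing Z_MAX by hand, e.g.:
-- UPDATE Z_PRIMARYKEY SET Z_MAX = 0 WHERE Z_NAME = 'Bookmark';
-- which is exactly the kind of manual store surgery to avoid.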
Besides the fact that directly/manually editing a Core Data store is a horrendously stupid idea, the correct answer is:
Delete the database and re-create it.
Of course, you're going to lose all your data doing that, but if you're that concerned about this little number, then that's ok, right?
Oh, and Core Data will make sure you don't have primary key collisions.

Is it a good idea to store attributes in an integer column and perform bitwise operations to retrieve them?

In a recent CODE Magazine article, John Petersen shows how to use bitwise operators in TSQL in order to store a list of attributes in one column of a db table.
Article here.
In his example he's using one integer column to hold how a customer wants to be contacted (email,phone,fax,mail). The query for pulling out customers that want to be contacted by email would look like this:
SELECT C.*
FROM dbo.Customers C,
     (SELECT 1  AS donotcontact,
             2  AS email,
             4  AS phone,
             8  AS fax,
             16 AS mail) AS contacttypes
WHERE ( C.contactmethods & contacttypes.email <> 0 )
  AND ( C.contactmethods & contacttypes.donotcontact = 0 )
Afterwards he shows how to encapsulate this into a table function.
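Presumably the table function looks something like this (my own sketch of the shape, not the article's actual code):
CREATE FUNCTION dbo.ContactTypes()
RETURNS TABLE
AS
RETURN
(
    SELECT 1 AS donotcontact, 2 AS email, 4 AS phone, 8 AS fax, 16 AS mail
);
GO

SELECT C.*
FROM dbo.Customers C
CROSS JOIN dbo.ContactTypes() AS contacttypes
WHERE ( C.contactmethods & contacttypes.email <> 0 )
  AND ( C.contactmethods & contacttypes.donotcontact = 0 );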
My questions are these:
1. Is this a good idea? Any drawbacks? What problems might I run into using this approach of storing attributes versus storing them in two extra tables (Customer_ContactType, ContactType) and doing a join with the Customer table (sketched below, after these questions)? I guess one problem might be if my attribute list gets too long: if the column is an integer, then my attribute list could be at most 32 items.
2. What is the performance of doing these bitwise operations in queries as you move into the tens of thousands of records? I'm guessing that it would not be any more expensive than any other comparison operation.
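For reference, the normalized layout from question 1 might look roughly like this (table and column names are my own guesses):
CREATE TABLE dbo.ContactType
(
    ContactTypeId   int PRIMARY KEY,
    ContactTypeName varchar(50) NOT NULL
);

CREATE TABLE dbo.Customer_ContactType
(
    CustomerId    int NOT NULL,
    ContactTypeId int NOT NULL,
    PRIMARY KEY (CustomerId, ContactTypeId)
);

-- Customers who want to be contacted by email:
SELECT C.*
FROM dbo.Customers C
JOIN dbo.Customer_ContactType CCT ON CCT.CustomerId = C.CustomerId
JOIN dbo.ContactType CT ON CT.ContactTypeId = CCT.ContactTypeId
WHERE CT.ContactTypeName = 'email';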
If you wish to filter your query based on the value of any of those bit values, then yes this is a very bad idea, and is likely to cause performance problems.
Besides, there simply isn't any need - just use the bit data type.
The reason why using bitwise operators in this way is a bad idea is that SQL Server maintains statistics on various columns in order to improve query performance - for example, if you have an email column, SQL Server can tell you roughly what percentage of values in that email column are true and select an appropriate execution plan based on that knowledge.
If, however, you have a flags column, SQL Server will have absolutely no idea how many records in a table match flags & 2 (email) - it doesn't maintain that kind of statistics. Without this sort of information available to it, SQL Server is far more likely to choose a poor execution plan.
And don't forget the maintenance problems using this technique would cause. As it is not standard, all new devs will probably be confused by the code and not know how to adjust it properly. Errors will abound and be hard to find. It is also hard to do reporting type queries from. This sort of trick stuff is almost never a good idea from a maintenance perspective. It might look cool and elegant, but all it really is - is clunky and hard to work with over time.
One major performance implication is that there will not be a lookup operator for indexes that works in this way. If you said WHERE contact_email=1 there might be an index on that column and the query would use it; if you said WHERE (contact_flags & 1)=1 then it wouldn't.
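A minimal sketch of that alternative, with invented column names - each preference gets its own bit column, which the optimizer can keep statistics on and which can be indexed directly:
CREATE TABLE dbo.Customers2
(
    CustomerId   int IDENTITY(1,1) PRIMARY KEY,
    CustomerName varchar(100) NOT NULL,
    DoNotContact bit NOT NULL DEFAULT 0,
    ContactEmail bit NOT NULL DEFAULT 0,
    ContactPhone bit NOT NULL DEFAULT 0,
    ContactFax   bit NOT NULL DEFAULT 0,
    ContactMail  bit NOT NULL DEFAULT 0
);

-- A query the optimizer can estimate (and index) properly,
-- unlike WHERE (contactmethods & 2) <> 0:
SELECT CustomerId, CustomerName
FROM dbo.Customers2
WHERE ContactEmail = 1
  AND DoNotContact = 0;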
One column stores one piece of information only - it's the database way.
(Didn't see it earlier - Kragen's answer also states this point, way before mine.)
In opposite order: The best way to know what your performance is going to be is to profile.
This is, most definitely, an "It Depends" question. I personally would never store such things as integers. For one thing, as you mention, there's the conversion factor. For another, at some point you, some other DBA, or someone else is going to have to type:
Select CustomerName, CustomerAddress, ContactMethods, [etc]
From Customer
Where CustomerId = xxxxx
because some data has become corrupt, or because someone entered the wrong data, or something. Having to do a join and/or a function call just to get at that basic information is way more trouble than it's worth, IMO.
Others, however, will probably point to the diversity of your options, or the ability to store multiple value types (email, vs phone, vs fax, whatever) all in the same column, or some other advantage to this approach. So you would really need to look at the problem you're attempting to solve and determine which approach is the best fit.