Internal working of RDBMS - oracle10g

I have a table with 10 columns and I want to select column 1 and column 9 from it. How many columns does the RDBMS read internally?

One of the fundamental ideas behind the relational model is that RDBMS users characterize their problems in terms of tables and queries, which represent abstract application relationships and describe abstract application states, while the RDBMS interface hides (as much as possible) how tables and queries are implemented, whether by another "logical" RDBMS layer or by other, "physical", layers. (Hence logical and physical data independence.)
Your question can only be answered for a particular implementation of a particular version of a particular DBMS. DBMS implementation is discussed in textbooks and lecture slides, many of which are available online as web pages and PDFs.
If this is a performance concern: it isn't the sort of thing you should worry about until you are familiar with schema design and querying, as well as basic performance issues like indexing. Try searching for Oracle's Oracle Database SQL Tuning Guide (PDF) and Bookboon's free downloadable ebook Database Design and Implementation: A practical introduction using Oracle SQL.

An RDBMS stores table data in rows, and rows in pages.
An RDBMS can index table data.
Internally, the RDBMS reads row by row, from either an index or the table.
So the question is whether a suitable index exists.
If the table has no index at all, all the columns have to be read, because a table row is made up of all its columns.
If an index contains exactly the two columns, only those two columns plus a reference to the table row's location are read.
If an index contains the two columns plus some others, the optimizer can choose this index, which means reading somewhat more data, but still less than the full table row.
In some special cases, depending on the optimizer (in the "big" RDBMSs like SQL Server…), if there are two indexes, one containing the first column and the other containing the second, both indexes can be read concurrently and the results joined at the end.
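The index cases above can be observed with a small sketch. It uses SQLite (through Python's sqlite3 module) as a stand-in engine, not Oracle, so the plan wording and the optimizer's choices are only illustrative; the table t and the columns c1..c10 are made-up names.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the EXPLAIN QUERY PLAN description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable description is the last field of each plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical 10-column table; c1..c10 are placeholder names.
columns = ", ".join(f"c{i} TEXT" for i in range(1, 11))
conn.execute(f"CREATE TABLE t ({columns})")
conn.executemany(
    "INSERT INTO t VALUES (" + ", ".join("?" * 10) + ")",
    [tuple(f"r{r}c{c}" for c in range(1, 11)) for r in range(100)],
)

query = "SELECT c1, c9 FROM t"

# No index: the only access path is a scan of the full table rows.
plan_without_index = query_plan(conn, query)

# An index holding exactly the two requested columns.
conn.execute("CREATE INDEX idx_c1_c9 ON t (c1, c9)")
plan_with_index = query_plan(conn, query)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

With the index in place SQLite reports a covering-index scan, meaning the query is answered from the index entries without visiting the table rows. Other engines make the same kind of decision but report it differently.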

Related

Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create a dimensional model on top of flat OLTP tables (not in 3NF).
Some people think a dimensional model is not required because most of the data for the report is present in a single table. But that table contains more than we need, around 300 columns. Should I still separate the flat table into dimensions and facts, or just use the flat tables directly in the reports?

You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across, and each node typically does a portion of the work required to answer each query. Therefore the way data is distributed across the nodes becomes important; the aim is usually to have the data distributed fairly evenly so that each node does about the same amount of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database data is stored in a way that makes large aggregation type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attributes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
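To make the point about distribution keys concrete, here is a toy simulation in plain Python. It is only a sketch: the node count, the tables and the crc32-based placement rule are assumptions for illustration, and Redshift's real hashing is internal and different.

```python
import zlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def node_for(key, nodes=NODES):
    """Toy placement rule: hash the distribution key, modulo the node count."""
    return zlib.crc32(str(key).encode("utf-8")) % nodes


# Hypothetical fact table: 1000 events belonging to 50 users.
events = [{"event_id": i, "user_id": i % 50, "amount": i} for i in range(1000)]


def rows_to_move(events, event_dist_key):
    """Count events stored on a different node than their matching user row.

    The user (dimension) table is assumed to be distributed on user_id.
    """
    return sum(
        1
        for e in events
        if node_for(e[event_dist_key]) != node_for(e["user_id"])
    )


# Fact table distributed on the join column: every event is already
# on the node that holds its user row, so nothing has to be copied.
moved_same_key = rows_to_move(events, "user_id")

# Fact table distributed on another column: many events sit on a
# different node than their user row and must be shipped for the join.
moved_other_key = rows_to_move(events, "event_id")

# Rows per node when distributing on event_id, to check evenness.
rows_per_node = Counter(node_for(e["event_id"]) for e in events)

print("rows to move, same key: ", moved_same_key)
print("rows to move, other key:", moved_other_key)
print("rows per node:", dict(rows_per_node))
```

The same reasoning explains why small dimensions are often distributed as ALL: a full copy on every node means no rows ever need to move for the join.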
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).

When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for casual users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
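A minimal sketch of such a flag column, using SQLite through Python's sqlite3 module rather than Redshift. The table customer_report, its columns, the dates and the refresh_flags helper are all made up for illustration; the 30-day rule is the one mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_report ("
    " customer_id INTEGER PRIMARY KEY,"
    " last_used_on TEXT,"           # ISO date of the last service usage
    " customer_is_active INTEGER)"  # precomputed flag, 1 or 0
)
conn.executemany(
    "INSERT INTO customer_report (customer_id, last_used_on) VALUES (?, ?)",
    [(1, "2024-03-28"), (2, "2024-01-15"), (3, "2024-03-05")],
)


def refresh_flags(conn, as_of):
    """Daily job: recompute the flag relative to the given ISO date."""
    conn.execute(
        "UPDATE customer_report SET customer_is_active ="
        " (julianday(?) - julianday(last_used_on) <= 30)",
        (as_of,),
    )


refresh_flags(conn, "2024-04-01")

# Report queries filter on the flag instead of repeating the date logic.
active_ids = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM customer_report"
        " WHERE customer_is_active ORDER BY customer_id"
    )
]
print(active_ids)
```

The date arithmetic lives in one place (the daily refresh), and every report can use the simple WHERE customer_is_active filter.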
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

Slow select from one billion rows GreenPlum DB

I've created the following table on GreenPlum:
CREATE TABLE data."CDR"
(
    mcc text,
    mnc text,
    lac text,
    cell text,
    from_number text,
    to_number text,
    cdr_time timestamp without time zone
)
WITH (
    OIDS = FALSE,
    appendonly = true,
    orientation = column,
    compresstype = quicklz,
    compresslevel = 1
)
DISTRIBUTED BY (from_number);
I've loaded one billion rows into this table, but every query runs very slowly.
I need to query on all fields (not only one).
What can I do to speed up my queries?
Using PARTITION? Using indexes?
Maybe using a different DB like Cassandra or Hadoop?

This highly depends on the actual queries you are doing and what your hardware setup looks like.
Since you are querying all the fields, the benefit of columnar orientation (reading only the columns a query needs) is lost, and it is probably hurting you more than helping, as you need to scan all the data anyway. I would remove columnar orientation.
Generally speaking, indexes don't help in a Greenplum system. The amount of hardware involved usually makes scanning the data directly faster than doing index lookups.
Partitioning could be a great help but there would need to be a better understanding of the data. You are probably accessing specific time intervals so creating a partitioning scheme around cdr_time could eliminate the scan of data not needed for the result. The last thing I would worry about is indexes.
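The effect of partitioning on cdr_time can be sketched with a toy model in plain Python. This is not Greenplum syntax or its actual planner; monthly partitions, the sample rows and the helper names are assumptions chosen only to show how a time-range query skips partitions it cannot match.

```python
from collections import defaultdict
from datetime import datetime


def partition_key(ts):
    """Assumed scheme: one partition per calendar month of cdr_time."""
    return (ts.year, ts.month)


# Hypothetical CDR rows: one call per day (days 1-28) for each month of 2023.
partitions = defaultdict(list)
for month in range(1, 13):
    for day in range(1, 29):
        row = {"from_number": "555-0100", "cdr_time": datetime(2023, month, day, 12)}
        partitions[partition_key(row["cdr_time"])].append(row)

total_rows = sum(len(rows) for rows in partitions.values())


def months_between(start, end):
    """All (year, month) partition keys that can overlap [start, end)."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


def query(start, end):
    """Return matching rows and how many rows had to be scanned."""
    hits, scanned = [], 0
    for key in months_between(start, end):
        for row in partitions.get(key, []):
            scanned += 1
            if start <= row["cdr_time"] < end:
                hits.append(row)
    return hits, scanned


hits, scanned = query(datetime(2023, 3, 10), datetime(2023, 3, 20))
print(f"matched {len(hits)} rows, scanned {scanned} of {total_rows}")
```

Only the March partition is read for this query; without partitioning every row would be scanned to find the same matches.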
Your distribution by from_number could have an impact on query speed. The system will hash the data based on from_number so if you are querying selectively on the from_number the data will only be returned by the node that has it and you won't be leveraging the parallel nature of the system and spreading the request across all of the nodes. Unless you are joining to other tables on from_number, which allows the joins to be collocated and performed within the node, I would change that to be distributed RANDOMLY.
On top of all of that, there is the question of what the hardware is and whether you have a proper number of segments set up, with the resources to feed them. Essentially every segment is a database. Good hardware can handle multiple segments per node, but on light hardware you need to find the sweet spot where the number of segments matches what the underlying system can provide.

@Dor,
I have the same type of data, CDR information stored for a telecom company, with 10-12 million rows inserted daily and heavy queries running on the CDR-related tables. I faced the same issue last year, and I created partitions on those tables on the CDR time column.
As far as I understand, GP creates a physical table for each partition, whereas other RDBMSs create logical tables. After this I got better performance with all SELECTs on these tables. I also think you should convert the text datatype to character varying for all columns (if text is not really required); I felt that DB operations on text fields were very slow (especially ORDER BY and GROUP BY).
Whether an index will help depends on your queries. In my case I have huge inserts, so I haven't tried one yet.
If you are selecting all the columns in your SELECT, there is no need for a column-oriented table.
Regards

PostgreSQL: What is the maximum number of tables that can be stored in a PostgreSQL database?

Q1: What is the maximum number of tables that can be stored in a database?
Q2: What is the maximum number of tables that can be unioned in a view?

Q1: There's no explicit limit in the docs. In practice, some operations are O(n) on number of tables; expect planning times to increase, and problems with things like autovacuum as you get to many thousands or tens of thousands of tables in a database.
Q2: It depends on the query. Generally, huge unions are a bad idea. Table inheritance will work a little better, but if you're using constraint_exclusion it will result in greatly increased planning times.
Both these questions suggest an underlying problem with your design. You shouldn't need massive numbers of tables or giant unions.
Going by the comment in the other answer, you should really just be creating a few tables. You seem to want to create one table per phone number, which is nonsensical, and to create views per number on top of that. Do not do this, it's mismodelling the data and will make it harder, not easier, to work with. Indexes, where clauses, and joins will allow you to use the data more effectively when it's logically structured into a few tables. I suggest studying basic relational modelling.
If you run into scalability issues later, you can look at partitioning, but you won't need thousands of tables for that.

Both are, in a practical sense, without limit.
The number of tables a database can hold is restricted by the space on your disk system. However, having a database with more than a few thousand tables is probably more an expression of an incorrect analysis of your application domain. Same goes for unions: if you have to union more than a handful of tables you probably should look at your table structure.
One practical scenario where this can happen is with PostGIS: having many tables with similar attributes that could be joined in a single view (this is a flaw in the design of PostGIS, IMHO), but that would typically be handled on the application side (e.g. in a GIS).
Can you explain your scenario where you would need a very large number of tables that need to be queried in one sweep?

Entity Framework Code First - Table name

Hi, I am building a database with a couple of tables (products, orders, customers), and I would like to know whether it is possible to do the following trick: generate a table every day, based on the orders table and named after the current day, because the orders table will gain about 1000 or more rows every day and that will hurt application speed.

1000 rows is nothing. What database are you using? Most modern databases can handle millions of rows with no issue, as long as you put some effort into proper indexing of the table.
From your comment, I'm assuming you don't know about database table indexing.
A database index is a data structure that improves the speed of data
retrieval operations on a database table at the cost of slower writes
and increased storage space. Indices can be created using one or more
columns of a database table, providing the basis for both rapid random
lookups and efficient access of ordered records.
From http://en.wikipedia.org/wiki/Database_index
You need to add indexes to your database tables to ensure they can be searched optimally.
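As a rough illustration, here is a single orders table with an index on the order date, using SQLite through Python's sqlite3 module. The schema, the index name and the data volume are assumptions; Entity Framework and your actual database will have their own way of declaring the index.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the EXPLAIN QUERY PLAN description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER,"
    " order_date TEXT,"
    " total REAL)"
)
# Hypothetical load: 1000 orders a day for 30 days, all in one table.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (day * 1000 + n, n % 100, f"2024-01-{day:02d}", 9.99)
        for day in range(1, 31)
        for n in range(1000)
    ],
)

one_day = "SELECT order_id, total FROM orders WHERE order_date = '2024-01-15'"

# Without an index the whole table is scanned to find one day's orders.
plan_before = query_plan(conn, one_day)

# With an index on order_date the engine jumps straight to that day.
conn.execute("CREATE INDEX idx_orders_order_date ON orders (order_date)")
plan_after = query_plan(conn, one_day)

orders_that_day = len(conn.execute(one_day).fetchall())

print("before:", plan_before)
print("after: ", plan_after)
print("orders on 2024-01-15:", orders_that_day)
```

One table plus one index gives the "only today's orders" access pattern without generating a new table every day.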
What you are suggesting is a bad idea IMO, and it's going to make working with the application a pain. Instead, if you really fill this table with vast amounts of data you could consider periodically archiving old data, but don't do this until you really need to.

How to create HBase columns / table for related but separated entities

I saw a video tutorial on HBase, where data was stored in a table like this:
EmployeeName - Height - ProjectInfo
------------------------------------
Jdoe - 5'7" - ProjA-TeamLead, ProjB-Contributor
What happens when a business requirement comes up that the name of ProjA has to be changed to ProjX?
Wouldn't there be a separate table where Project information is stored?

In a relational database, yes: you'd have a project table, and the employee table would refer to it via a foreign key and only store the immutable project id (rather than the name). Then when you want to query it (in a relational database), you'd do a JOIN like:
SELECT
    employee.name,
    employee.height,
    project.name,
    employee_project_role.role_name
FROM
    employee
    INNER JOIN employee_project_role
        ON employee_project_role.employee_id = employee.employee_id
    INNER JOIN project
        ON employee_project_role.project_id = project.project_id
This isn't how things are done in HBase (and other NoSQL databases); the reason is that since these databases are geared towards extremely large data sets, and distributed over many machines, the actual algorithms to transparently execute complex joins like this become a lot harder to pull off in ways that perform well. Thus, HBase doesn't even have built-in joins.
Instead, the general approach with systems like this is that you denormalize your data, and store things in a single table. So in this case, there might be one row per employee, and denormalized into that row is all of the employee's project role info (probably in separate columns -- the contents of a row in HBase is actually a key/value map, so you can represent repeating things like all of their different roles easily).
You're absolutely right, though: if you change the name of the project, that means you'd need to change the data that's stored for every employee. In this respect, the relational model is "cleaner". But if you're dealing with Petabytes of data or trillions of rows, the "clean" abstraction of a relational database becomes a lot messier, because you end up having to shard it all manually. The point of systems like HBase is to pay these costs up front in the design process, and not just assume the relational database will magically solve problems like this for you at scale. (Because it won't).
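A toy model of the trade-off, in plain Python rather than the HBase client API. Each row is represented as a key/value map, with column names following HBase's family:qualifier convention; the info and project families, the second employee and the rename_project helper are made up for illustration.

```python
# Each row key maps to a key/value map of column name to value.
employees = {
    "jdoe": {
        "info:height": "5'7\"",
        "project:ProjA": "TeamLead",
        "project:ProjB": "Contributor",
    },
    "asmith": {  # hypothetical second employee
        "info:height": "6'0\"",
        "project:ProjA": "Contributor",
    },
}


def rename_project(table, old, new):
    """Rename a project by rewriting every row that mentions it.

    Returns the number of rows that had to be touched.
    """
    old_col, new_col = f"project:{old}", f"project:{new}"
    touched = 0
    for row in table.values():
        if old_col in row:
            # Move the role stored under the old name to the new name.
            row[new_col] = row.pop(old_col)
            touched += 1
    return touched


rows_touched = rename_project(employees, "ProjA", "ProjX")
print("rows rewritten:", rows_touched)
print(employees["jdoe"])
```

Reading an employee together with all of their roles is a single row lookup, but the rename has to visit every employee row that mentions the project, which is exactly the cost described above.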
That said: if you don't expect to have at least terabytes of data (that's a million MB, remember), just do it in a relational database. It'll be much easier.

I think going through this presentation will give you some perspective:
http://ianvarley.com/coding/HBaseSchema_HBaseCon2012.pdf
And for a more programmatic representation, have a look at:
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
