Mondrian: Aggregate tables for columnar DB

I'm working with Mondrian on a columnar DB, meaning I have a big, flat, fully denormalized fact table that contains all facts and dimensions. Unfortunately, I am not able to use an aggregate table. When I collapse all dimensions in the aggregate table, Mondrian successfully recognizes the aggregate table. But when I keep e.g. the time dimension, Mondrian does not:
14:06:14,859 WARN [AggTableManager] Recognizer.checkUnusedColumns: Candidate aggregate table 'agg_days_flattened' for fact table 'flattened' has a column 'dayofMonth' with unknown usage.
14:06:14,860 WARN [AggTableManager] Recognizer.checkUnusedColumns: Candidate aggregate table 'agg_days_flattened' for fact table 'flattened' has a column 'month' with unknown usage.
14:06:14,860 WARN [AggTableManager] Recognizer.checkUnusedColumns: Candidate aggregate table 'agg_days_flattened' for fact table 'flattened' has a column 'year' with unknown usage.
Furthermore, the aggregate table is not used when I run a corresponding MDX query. When I model the same cube with a classical star schema, everything works fine.
To me it looks like Mondrian needs "true" foreign key/primary key mappings to work with aggregate tables, which do not exist in my scenario (a big, flat, fully denormalized fact table).
Does anyone have an idea?
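One thing that may be worth trying (not verified against this exact setup): Mondrian's schema lets you declare an aggregate table explicitly with AggName instead of relying on the default recognizer, which sidesteps the naming-convention and key matching. A sketch, where the measure name, its column, and the fact-count column are all assumptions to be adapted:
<Table name="flattened">
  <!-- explicit declaration: tells Mondrian the usage of every column,
       so the recognizer does not have to guess (names below are assumed) -->
  <AggName name="agg_days_flattened">
    <AggFactCount column="fact_count"/>
    <AggMeasure name="[Measures].[Amount]" column="amount"/>
    <AggLevel name="[Time].[Year]" column="year"/>
    <AggLevel name="[Time].[Month]" column="month"/>
    <AggLevel name="[Time].[Day]" column="dayofMonth"/>
  </AggName>
</Table>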

Related

distkey and sortkey on temporary tables - Redshift

I am starting to do some research on query tuning, and have been experimenting with using distkey and sortkey. From what I've read, if I set the distkey to the joining column, the query planner will use a merge join instead of a hash join, which should be faster in Redshift. I was wondering if this also applies to temporary tables? Our production tables are actually views, so they do not have any keys already set. I'm not sure why we don't use the actual warehouse tables.
Yes, keys can be set for temporary tables:
create temp table fred DISTKEY (1) as ...
This is easily done by column position (the first column in this example). You can also set the distribution style on temp tables if you so desire. Doing this can force data to stay "on node" for intermediate results in very large and complex queries. Redshift does a good job of making reasonable decisions on how to distribute intermediate results, but it isn't perfect and doesn't understand the nature of the data. I've done this with good results when large amounts of data are in play.
As to your second point about using views instead of tables: in Redshift, standard views are basically SQL macros that are flattened and optimized by the Redshift query compiler, so using views instead of tables is not bad in itself. Views, especially complex ones, can hide what a query is doing, and this can add unneeded and unexpected complexity. The keys are set in the tables referenced by the views. (I'm assuming that the views are not referencing external/Spectrum tables.)
Lastly, you state you are looking to achieve merge join behavior to improve performance. While it is true that this is the fastest type of join, the time and work required to get merge joins to happen on temp tables will not be offset by the performance gain (in my experience). Redshift will only use a merge join when it is sure that the data being joined will "zipper" together without issue; if it isn't completely sure this is the case, it has to perform a hash join, which is a more general process. To get Redshift to do a merge join you will need to sort and analyze your temp tables, which will cost more time than you will save. It is far more important to have your joins be "DIST NONE" (no network distribution of the data) than to move from a hash join to a merge join.
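As a concrete sketch (all table and column names here are made up), this is what creating a temp table distributed on the join column and then checking the plan could look like:
-- distribute the temp table on the column it will be joined on
create temp table recent_orders distkey(customer_id) sortkey(customer_id) as
select customer_id, order_date, amount
from orders
where order_date > current_date - 30;

-- refresh statistics so the planner can make a good choice
analyze recent_orders;

-- if customers is also distributed on customer_id, the join step
-- should show DS_DIST_NONE (no network redistribution)
explain
select c.customer_name, sum(o.amount)
from recent_orders o
join customers c on c.customer_id = o.customer_id
group by c.customer_name;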
Yes, it can be done. Just put the distkey before the start of the table query:
create temp table a distkey(column_name) as
(select query .....)

select all columns except two in q kdb historical database

In the output, I want to select all columns except two from a table in a q/kdb historical database.
I tried running the below query, but it does not work on the HDB.
delete colid,coltime from table where date=.z.d-1
but it fails with the below error:
ERROR: 'par
(trying to update a physically partitioned table)
I referred to https://code.kx.com/wiki/Cookbook/ProgrammingIdioms#How_do_I_select_all_the_columns_of_a_table_except_one.3F but it didn't help.
How can we display all columns except two in a kdb historical database?
The reason you are getting the 'par error is that the table is partitioned.
The error is documented here: "trying to update a physically partitioned table".
You cannot directly update or delete anything on a partitioned table (there is a separate DB maintenance script for that).
The query you have used as a fix works because it first selects the data into memory (temporarily) and then deletes the columns:
delete colid,coltime from select from table where date=.z.d-1
You can try the following functional form:
c:cols[table] except `colid`coltime
?[table;enlist(=;`date;.z.d-1);0b;c!c]
Could try a functional select:
?[table;enlist(=;`date;.z.d);0b;{x!x}cols[table]except`colid`coltime]
Here the last argument is a dictionary mapping output column names to the expressions that produce them, which tells the query what to extract. Instead of deleting the columns you specified, this selects all but those two, which is more or less the same query.
To see what the functional form of a query is you can run something like:
parse"select colid,coltime from table where date=.z.d"
And it will output the arguments to the functional select.
You can read more on functional selects at code.kx.com.
Only select queries work on partitioned tables. You worked around this by first selecting the table into memory and then deleting the columns you did not want.
If you have a large number of columns and don't want to create a bulky select query, you could use a functional select:
?[table;();0b;{x!x}((cols table) except `colid`coltime)]
This shows all columns except the given subset. The column clause expects a dictionary, hence the function {x!x} to convert the list of column names into a dictionary. See more information here:
https://code.kx.com/q/ref/funsql/
As nyi mentioned, if you want to permanently delete columns from a historical database you can use the deletecol function in the dbmaint tools: https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md

Understanding indexes and performance as they relate to indexed column and non-indexed column data in the same row

I have some tables that are around 100 columns wide. I haven't normalized them because putting them back together would require almost three dozen joins, and I am not sure it would perform any better. I haven't tested that yet (I will), so I can't say for sure.
Anyway, that really isn't the question. I have been indexing columns in these tables that I know will be pulled frequently, so something like 50 indexes per table.
I got to thinking though. These columns will never be pulled by themselves and are meaningless without the primary key (basically an item number). The PK will always be used for the join and even in simple SELECT queries, it will have to be a specified column so the data makes sense.
That got me thinking further about indexes and how they work. As I understand them, the locations of the values in an indexed column are recorded so that they can be found quickly in a query.
For example, if you have:
SELECT itemnumber, expdate
FROM items;
And both itemnumber and expdate are indexed: is that excessive, and is it really adding any benefit? Is it sufficient to index just itemnumber, with the index knowing that expdate, or anything else queried for that item, is in the same row?
Secondly, if multiple columns constitute a primary key, should the index include them grouped together, or is individually sufficient?
For example,
CREATE INDEX test_index ON table (pk_col1, pk_col2, pk_col3);
vs.
CREATE INDEX test_index1 ON table (pk_col1);
CREATE INDEX test_index2 ON table (pk_col2);
CREATE INDEX test_index3 ON table (pk_col3);
Thanks for clearing that up in advance!
Uh oh, there is a mountain of basics that you still have to learn.
I'd recommend that you read the PostgreSQL documentation and the excellent book “SQL Performance Explained”.
I'll give you a few pointers to get you started:
Whenever you create a PRIMARY KEY or UNIQUE constraint, PostgreSQL automatically creates a unique index over all the columns of that constraint. So you don't have to create that index explicitly (but if it is a multicolumn index, it sometimes is useful to create another index on any but the first column).
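For example, a minimal PostgreSQL sketch of that last point (table and column names are placeholders):
-- the PRIMARY KEY constraint creates a unique index on (pk_col1, pk_col2, pk_col3)
CREATE TABLE test_table (
    pk_col1 integer,
    pk_col2 integer,
    pk_col3 integer,
    payload text,
    PRIMARY KEY (pk_col1, pk_col2, pk_col3)
);

-- only needed if you also filter on pk_col2 alone, i.e. without pk_col1
CREATE INDEX test_table_pk_col2_idx ON test_table (pk_col2);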
Indexes are relevant to conditions in the WHERE clause and the GROUP BY clause and to some extent for table joins. They are irrelevant for entries in the SELECT list. An index provides an efficient way to get the part of a table that satisfies a certain condition; an (unsorted) access to all rows of a table will never benefit from an index.
Don't sprinkle your schema with indexes randomly, since indexes use space and make all data modification slow.
Use them where you know that they will do good: on columns on which a foreign key is defined, on columns that appear in WHERE clauses and contain many different values, on columns where your examination of the execution plan (with EXPLAIN) suggests that you can expect a performance benefit.
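To make that concrete with the question's items table (the index name is made up), compare the plan EXPLAIN prints before and after adding the index:
-- an index on itemnumber lets the WHERE clause use an index scan;
-- listing expdate in the SELECT clause needs no index of its own
CREATE INDEX items_itemnumber_idx ON items (itemnumber);

EXPLAIN
SELECT itemnumber, expdate
FROM items
WHERE itemnumber = 12345;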

Redshift select * vs select single column

I'm having the following Redshift performance issue:
I have a table with ~ 2 billion rows, which has ~100 varchar columns and one int8 column (intCol). The table is relatively sparse, although there are columns which have values in each row.
The following query:
select colA from tableA where intCol = '111111';
returns approximately 30 rows and runs relatively quickly (~2 mins)
However, the query:
select * from tableA where intCol = '111111';
takes an undetermined amount of time (gave up after 60 mins).
I know pruning the columns in the projection is usually better but this application needs the full row.
Questions:
Is this just a fundamentally bad thing to do in Redshift?
If not, why is this particular query taking so long? Is it related to the structure of the table somehow? Is there some Redshift knob to tweak to make it faster? I haven't yet messed with the distkey and sortkey on the table, but it's not clear that those should matter in this case.
The main reason the first query is faster is that Redshift is a columnar database. A columnar database stores table data per column, writing the data of a single column into the same blocks on storage. This is different from a row-based database like MySQL or PostgreSQL. Because of this, the first query selects only the colA column, so Redshift does not need to access the other columns at all, while the second query accesses all of the columns, causing a huge amount of disk access.
To improve the performance of the second query, you may need to set a "sortkey" on the intCol column (the column in the WHERE condition). By setting a sortkey on a column, that column's data is stored in sorted order on disk, which reduces the cost of disk access when fetching records with a condition on that column.
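A sketch of two ways to do that (tableA and intCol as in the question; ALTER SORTKEY requires a reasonably recent Redshift release):
-- option 1: rebuild the table sorted on the filter column
create table tableA_sorted sortkey(intCol) as
select * from tableA;

-- option 2: change the sort key in place
alter table tableA alter sortkey (intCol);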

T-SQL: SELECT INTO sparse table?

I am migrating a large quantity of mostly empty tables into SQL Server 2008.
The tables are vertical partitions of one big logical table.
Problem is this logical table has more than 1024 columns.
Given that most of the fields are null, I plan to use a sparse table.
For all of my tables so far I have been using SELECT...INTO, which has been working really well.
However, now I get the error "CREATE TABLE failed because column 'xyz' in table 'MyBigTable' exceeds the maximum of 1024 columns."
Is there any way I can do SELECT...INTO so that it creates the new table with sparse support?
What you probably want to do is create the table manually and populate it with an INSERT ... SELECT statement.
To create the table, I would recommend scripting the different component tables and merging their definitions, making them all SPARSE as necessary. Then just run your single CREATE TABLE statement.
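A minimal sketch of that approach (all names and column types here are placeholders):
-- hand-written definition with SPARSE on the mostly-NULL columns
CREATE TABLE dbo.MyBigTable (
    id int NOT NULL PRIMARY KEY,
    col1 int SPARSE NULL,
    col2 varchar(50) SPARSE NULL
    -- ... merge in the remaining columns from the partition tables
);

-- populate it instead of using SELECT ... INTO
INSERT INTO dbo.MyBigTable (id, col1, col2)
SELECT id, col1, col2
FROM dbo.SourceTable1;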
You cannot (and probably don't want to anyway). See INTO Clause (TSQL) for the MSDN documentation.
The problem is that sparse storage is a physical characteristic, not a logical one, so there is no way the DBMS engine would know to copy that characteristic over; moreover, the SELECT can have multiple underlying source tables with conflicting settings. See the Remarks section of the page I linked, where it discusses how the new table only gets the default organization details.