Column filtering in cassandra - nosql

Hello I was wondering if someone with some knowledge of Cassandra could help me. Right now i'm investigating the read path of Cassandra using a debugger but for some reason cannot find the specific place where the columns for a row are filtered with respect to the query.
I have issued both queries:
"select * from tabl where name1='rowkey8888';"
and
"select name9 from tabl where name1='rowkey999211';"
What is confusing to me is that they are both considered SliceFromReadCommand commands and the
SliceQueryFilter object associated with the command is the same for both: [reversed=false, slices=[[, ]], count=10000, toGroup = 0]
My question is that can anyone explain to me the reason for this behaviour, as I was under the impression when a column name was given in the query (name9 in the above case) then
the command would be a SliceByNamesReadCommand and the specified columns would be in the QueryFilter slices array. Additionally would anyone be able to point out where this filtering is done as I cannot find it?

For reasons related to TTL handling, C* actually has to query a "slice" of the partition even when you give it a specific name.
CollationController is a good place to start code diving.

Related

What could be causing this 'invalid host' error on kdb query?

I get an odd error when trying to query too many dates from a date-partitioned historical database:
q)eod: h"select from eod where date within 2018.01.01 2018.04.22"
'/tablepath/2018.04.04/eod/somecolumn: invalid host
q)eod: h"select from eod where date within 2018.01.17 2018.04.20"
'/tablepath/2018.04.20/eod/othercolumn: invalid host
q)eod: h"select from eod where date within 2018.01.18 2018.04.20"
q)
Note that both dates mentioned in the error messages are within the date range that we manage to extract in the end, and that it fails on a different column each time. This seems to indicate it's something to do with the size of the table being pulled, but when we check the size of the largest table we managed to get:
q)(-22!eod) % 1024 * 1024
646.9043
q)count eod
2872546
we find that it's not particularly large by either memory size nor by number of rows.
Googling for "invalid host" errors doesn't seem to turn up anything relevant, and I'm not seeing anything in the kdb docs about size limits that would be relevant. Anyone got any ideas?
Edit:
When loading the table in a session and making the queries directly, we get what appears to be the same error, but with a different message. For instance:
q)jj: select from eod where date within 2018.01.01 2018.04.22
Too many compressed files open
k){0!(?).#[x;0;p1[;y;z]]}
'./2018.04.04/eod/settlecab: No such file or directory
.
?
(+`exch`date`class..
q.Q))
Note that the file ./2018.04.04/eod/settlecab does in fact exist, and contains data:
I have no problem loading the data for just the date mentioned in the error, and the column mentioned has meaningful values:
q)jj: select from eod where date=2018.04.04
q)select count i by settlecab from jj
settlecab| x
---------| -----
0 | 41573
1 | 2269
The key point seems to be the Too many compressed files open message, but what can I do about this?
Edit for Summary/Solutions:
The table in question had many columns, all stored in a compressed format. When issuing a query against too many dates at once, kdb would try to mmap all of those columns at once, running into a limit on how many compressed files could be open at once.
Once I understood the problem, several solutions were available:
I could pull only certain columns from the database, reducing the number of files that kdb needed to keep open,
I could force kdb to pull all the data into memory by adding a dummy where clause to the query, such as (null column) | not null column (hacky, but it works),
I could have upgraded the kdb version and lifted OS limits (not practical in my case).
I still have no idea why this resulted in an invalid host error when querying the database remotely.
First off, can we just clarify the database structure you're working with. It seems from the filepaths returned in your errors that you've got a date-partitioned database. Did you mean non-segmented database when you said non-partitioned in your original query?
In terms of a fix for your issue, have you tried loading your database into a session, and making those queries directly? If so do you get the same issues?
If that seems to be working alright, the problem might lie with how you're defining your database handle. How is h defined in your original example?
It might also be worth trying to select individual dates from your database, to try and isolate the problem, and to determine if it lies with your on-disk data. Try specifically querying the dates that are mentioned in your errors.
You could also try performing your original queries with a subset a columns, again to try and pinpoint where your issue is coming from.
Let us know if you get any further with this.
Joseph

Informatica SQ returns different result

I am trying to pull data from DB2 via informatica, I have a SQ query that pulls few fields based on joins for 4 different tables.
When I run the query directly in the database, it returns the expected result, however when I run it in informatica and run a debugger, I see something else.
Please note all the columns data perfectly match, except one single column.
Weird thing is, this is a calculated field from the table based on a case statement:
CASE WHEN Column1='3' THEN 'N' ELSE 'Y' END.
Since this is a calculated field with a length of one string, I have connected from the source to SQ from one of the sources having 1 character length.
This returns 'Y' when executed in the database, the same query when I copy paste in SQ of information and run it, I get a data 'E', and this data can never be possible as I expect only a N or a Y. I have verified the column order, that its in the right place. This is very strange, is something going wrong because of the CASE Statement?
Save yourself the hassle, put an expression transformation after tge source qualifier and calculate, port value there then forget about it
I think i got the issue. We use Informatica PowerExchange to connect to a as400 system(DB2), and it seems that when we are trying to set a flag information in AS400, and pass it to informatica via PowerExchange, it converts it to binary, and to solve this, there needs to be an entry in the PowerExchange configuration file.
Unfortunately, i myself was not aware that it could be related to PowerExchange instead of powercenter itself.!!
Thanks for your assistance! Below is the KB about it.
https://kb.informatica.com/solution/4/Pages/17498.aspx

Can I have more than 250 columns in the result of a PostgreSQL query?

Note that PostgreSQL website mentions that it has a limit on number of columns between 250-1600 columns depending on column types.
Scenario:
Say I have data in 17 tables each table having around 100 columns. All are joinable through primary keys. Would it be okay if I select all these columns in a single select statement? The query would be pretty complex but can be programmatically generated. The reason for doing this is to get denormalised data to populate a web page. Please do not ask why though :)
Quite obviously if I do create table table1 as (<the complex select statement>), I will be hitting the limit mentioned in the website. But do simple queries also face the same restriction?
I could probably find this out by doing the exercise myself. In the next few days I probably will. However, if someone has an idea about this and the problems I might face by doing a single query, please share the knowledge.
I can't find definitive documentation to back this up, but I have
received the following error using JDBC on Postgresql 9.1 before.
org.postgresql.util.PSQLException: ERROR: target lists can have at most 1664 entries
As I say though, I can't find the documentation for that so it may
vary by release.
I've found the confirmation. The maximum is 1664.
This is one of the metrics that is available for confirmation in the INFORMATION_SCHEMA.SQL_SIZING table.
SELECT * FROM INFORMATION_SCHEMA.SQL_SIZING
WHERE SIZING_NAME = 'MAXIMUM COLUMNS IN SELECT';

DB2 subselect with 'in', limitations?

Somebody mentioned to me, that when performing a subselect with 'in' in DB2, there may be a limit to how many results can be returned by the subselect. If so, does anybody know what that limit is? Or if it might be dependent on the version of the DB, how to find this information? Thanks in advance.
The best place to find such information is on IBM's website. For instance, here are the limitations for DB2 on z/OS
I didn't see anything about there being a limit to the number of values in an "IN" clause however the "Maximum number of columns that are in a table or view (the value depends on the complexity of the CREATE VIEW statement) or columns returned by a table function." is 750.
Unrelated to your question - the DB2 SQL Cookbook is an excellent reference for working with DB2.

Wildcard query expansion resulted in too many terms

I am receiving a "wildcard query expansion resulted in too many terms" error when executing a query similar to the following:
SELECT *
FROM table_a
WHERE contains(clob_field, '%a%') > 0;
Does anyone know a workaround/solution to this problem?
According to this, you may need to increase the wildcard_maxterms parameter, or take further steps. See the link for details (I'm not an expert in Oracle Text though).