Cassandra CompositeType as row key Validator - nosql

I'm working on a POC.
I have a column family that stores server events. To avoid oversized rows, we split each row into N sub-rows, using a CompositeType row key:
CREATE COLUMN FAMILY logs with comparator='ReversedType(TimeUUIDType)' and key_validation_class='CompositeType(UTF8Type,IntegerType)' and default_validation_class=UTF8Type;
So for each server name we have N rows, and we write data across them using a very simple round-robin algorithm.
Writing data to any of these rows works fine:
Mutator<Composite> mutator = HFactory.createMutator(keySpace, CompositeSerializer.get());
// column name is a TimeUUID, the value is the log line
HColumn<UUID, String> col =
        HFactory.createColumn(TimeUUIDUtils.getUniqueTimeUUIDinMillis(), log);
// row key is a composite of (serverName, round-robin bucket)
Composite rowName = new Composite();
rowName.addComponent(serverName, StringSerializer.get());
rowName.addComponent(this.roundRobinDestributor.getRow(), IntegerSerializer.get());
mutator.insert(rowName, columnFamilyName, col);
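The roundRobinDestributor used above isn't shown in the question. Purely as an illustration, a minimal sketch of such a distributor might look like the following; the class name, the N constructor argument, and the use of an AtomicInteger are all assumptions, not code from the question:

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper matching the roundRobinDestributor.getRow() call above:
// it cycles through bucket numbers 0..N-1 so writes are spread over the N sub-rows.
public class RoundRobinDistributor {
    private final int numRows;                          // N sub-rows per server
    private final AtomicInteger counter = new AtomicInteger();

    public RoundRobinDistributor(int numRows) {
        this.numRows = numRows;
    }

    public int getRow() {
        // counter modulo N gives the next bucket; abs() guards against int overflow
        return Math.abs(counter.getAndIncrement() % numRows);
    }
}

Any scheme that cycles deterministically through 0..N-1 works here; the only requirement is that readers later know which N buckets to query.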
So far so good, but now I have two questions:
1) Since fetching all logs for a given serverName means scanning over the row keys, should I use ByteOrderedPartitioner?
2) Can anybody help me, or point me to some documentation, on how to create a Hector query that brings back all rows for server1 ({server1:0}, {server1:1}, {server1:2}, etc.)? I have seen a lot of examples using CompositeType as a comparator, but none using it as a key validator.
Any help or comment is highly appreciated.

First of all, row oversizing shouldn't be a problem in Cassandra. That said, it can still be worth splitting rows, since data will be distributed more evenly across the cluster.
ByteOrderedPartitioner doesn't look like a good option here: it would be hard to achieve a uniform distribution of rows across the cluster, which leads to hotspots.
There's no way to query a range of keys when using RandomPartitioner. However, if the maximum N is reasonably small (up to about 256), a MultigetSliceQuery can be used to fetch the whole set of rows for a server.
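For the second question, a minimal Hector sketch of that multiget approach might look like the following. The wrapper method, the n and sliceSize parameters, and the class name are assumptions; keySpace, columnFamilyName and the serializers follow the insert code from the question:

import java.util.UUID;

import me.prettyprint.cassandra.serializers.CompositeSerializer;
import me.prettyprint.cassandra.serializers.IntegerSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.serializers.UUIDSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Composite;
import me.prettyprint.hector.api.beans.Rows;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.MultigetSliceQuery;

public class LogReader {

    // Fetch up to sliceSize newest columns from every sub-row of one server.
    public static Rows<Composite, UUID, String> readAllRows(
            Keyspace keySpace, String columnFamilyName,
            String serverName, int n, int sliceSize) {

        // build the n composite keys {serverName:0} .. {serverName:n-1}
        Composite[] keys = new Composite[n];
        for (int i = 0; i < n; i++) {
            Composite key = new Composite();
            key.addComponent(serverName, StringSerializer.get());
            key.addComponent(i, IntegerSerializer.get());
            keys[i] = key;
        }

        MultigetSliceQuery<Composite, UUID, String> query =
                HFactory.createMultigetSliceQuery(keySpace,
                        CompositeSerializer.get(), UUIDSerializer.get(), StringSerializer.get());
        query.setColumnFamily(columnFamilyName);
        query.setKeys(keys);
        // the comparator is ReversedType(TimeUUIDType), so an unreversed slice
        // returns the newest columns first
        query.setRange(null, null, false, sliceSize);

        return query.execute().get();
    }
}

The returned Rows object contains one entry per bucket; merging them by TimeUUID on the client side gives a single, time-ordered log stream for the server.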

Related

Maintain N latest points in Postgres

My goal is to keep the N latest points for each user in a single Postgres row, and when a new point is added, remove the oldest one.
For example, say a row holds one userid and N integers.
Now, when a new integer arrives for a user, I want to remove the oldest entry and add the new one. The issue here is obviously performance: since I am only adding one integer, I want a fast way to do it.
I tried a very naive approach with two columns,
userid, json
where the json column holds the list of all integers. I remove the first entry, append the new one, and write the json back to Postgres. Unsurprisingly, it does not perform well.
Please suggest a good way to do this. Does Postgres have some min-heap-like data structure that can do it in better than linear time?

kdb+/q optimize union function

To give you a bit of background: I have a process, running on a timer, that does a large, complex calculation which takes a while to complete. After some investigation I realised that what is causing the slowness isn't the actual calculation but the built-in q function union.
I am trying to union two simple tables, A and B. A has approximately 5m rows and B has 500. Both tables have only two columns, the first of which is a symbol. Table A is actually the compound primary key of a table. (Also, how do you copy directly from the console?)
n:5000000
big:([]n?`4;n?100)
small:([]500?`4;500?100)
\ts big union small
I tried keying both columns and upserting, join followed by distinct, and "big, small where not small in big", but nothing seems to work :(
Any help will be appreciated!
If you want to upsert into the big table, it has to be keyed and the upsert operator should be used. For example:
n:5000000
//big ids are unique numbers from 0 to 4999999
//the table is keyed with the 1! operator
big:1!([]id:(neg n)?n;val:n?100)
//small ids are unique numbers: 250 from the 0-4999999 interval and 250 from the 5000000-9999999 interval
small:([]id:(-250?n),(n+-250?n);val:500?100)
If big is a global variable, it is efficient to upsert in place:
`big upsert small
If big is local:
big: big upsert small
As a result, big will have 5000250 rows, because there are 250 keys (the id column) common to the big and small tables.
This may not be relevant, but just a quick thought: if your big table has a column of type `sym, and that column does not really show up much throughout your program, why not cast it to string or some other type? If you are doing this update every single day, then as the data gets packed into your partitioned HDB, the kdb+ process has to reassign/rewrite its sym file whenever new data is added, and I believe this is the part that actually takes a lot of time, not the union calculation itself.
If the above is true, I'd suggest either rewriting the schema for that table to minimise the amount of rehashing (not sure if that is the right term!) on your sym file, or, as mentioned above, trying to assign an attribute to your table; this may reduce the time too.

Creating the optimum index for my database

I have a table in postgresql with the following information:
rawData (fileID integer references otherTable, lineNum integer, data1 double, ...)
When I am searching this table, I do so with the following query:
SELECT lineNum, data1, ...other data FROM rawData WHERE
fileID = ? AND data1 < ? ORDER BY lineNum;
In general, the data in this table consists of a number of entries for each fileID; each fileID has lineNum values from 0 to x, with lineNum never repeating within a fileID (though it does repeat across different fileIDs). data1 is effectively a random number that may or may not overlap.
In order to speed up the reading of this data, I am trying to create an index on it, but am having trouble figuring out the best way to index it. Currently I am looking at one of the following two index methods, and am wondering which would be better for my search, or if there is another option that I haven't thought of that would be better than either of them.
index idea 1:
CREATE INDEX searchIndex ON rawData (fileID, data1, lineNum);
index idea 2:
CREATE INDEX searchIndex ON rawData (fileID, lineNum, data1);
Note that at this time, this and a search not constrained by data1 are the only searches that I run on this table, so I'm not too concerned about this index slowing down other searches.
Lastly, would I have to change my search query to use the index, or would it automatically use that index when I search the table?
You should look at using this instead:
CREATE INDEX searchIndex ON rawData (fileID, lineNum);
A few things:
In particular, as per the docs, indexes with more than three columns are unlikely to be helpful unless usage of the table is extremely stylized.
Since your second search query filters without the data1 column, keeping lineNum as the second column should be sufficient (you mention data1 is quasi-random), and in the rare case of repeats, the table fetches will ensure correctness. What this means is that the index would be a third smaller, which is a big win (think of an index small enough to stay in memory, index-only scans, etc.).
Either index can be used. Which is faster will depend on many things, like how many rows are in the table, how many lineNum values there are per fileID, how selective the data1 < ? clause is, what your hardware is, what your config settings are, which version of PostgreSQL you are using, what physical order the table rows lie in, etc.
The only way to know for sure is to try it with your own data on your own system and see.
I'd just build an index on (fileID, lineNum, data1), or even just (fileID, lineNum), because that seems more natural, and then forget about it. Most likely it will be fast enough. Once there is a demonstrable performance problem, then you will have the test case at hand that is needed to come to a real conclusion.

ETL Process when and how to add in Foreign Keys T-SQL SSIS

I am in the early stages of creating a Data Warehouse based loosely on the Kimball methodology.
I am currently investigating my source data. I understand that adding a primary key (a surrogate rather than a natural key) will then allow me to make the connections between the facts and the dimensions.
It sounds like a silly question, but how exactly is this done? Are there any good articles that walk through the process?
I would imagine we bring in all of the dimensions first, and when the fact data is brought over, a lookup is performed that "pushes" the foreign key into the fact table? At what point is this done? Within SSIS, what is the "best practice" method? Is this all done in one package, for example?
Is that roughly how it happens?
In this case, do we have to be particularly careful about the order in which we load our data, or could we end up loading facts for which there is no corresponding dimension?
I would imagine we bring in all of the dimensions first, and when the fact data is brought over, a lookup is performed that "pushes" the foreign key into the fact table? At what point is this done? Within SSIS, what is the "best practice" method? Is this all done in one package, for example?
It would depend on your schema and table design.
Assuming it's a star schema and the FK is based on the data value itself:
DIM1 <- FACT1 -> DIM2
  ^
  |
FACT2 -> DIM3
you'll first fill DIM1 and DIM2 before inserting into FACT1, as you need the FKs.
Assuming it's a snowflake schema:
DIM1_1
  ^
  |
DIM1 <- FACT1 -> DIM2
you'll first fill DIM1_1, then DIM1 and DIM2, before inserting into FACT1.
Assuming the FK relation is based on something else (usually a number) instead of the data value itself (an optimization when dealing with huge amounts of data and/or strings as dimension values), you won't need to wait until the data is inserted into the DIM table. I'm sure that sounds confusing :), so I'll try to explain briefly. The steps involved would be something like this (assume a simple star schema with two tables, FACT1 and DIMENSION1):
1. Extract the FACT and DIMENSION values from the data set you are processing.
2. Generate a unique number based on the DIMENSION's value (which, say, is a string), using a reproducible algorithm (e.g. SHA-1: given the same string, it always produces the same number).
3. Insert the number and the FACT values into the FACT1 table.
4. Insert the number and the DIMENSION values into the DIMENSION1 table.
Steps 3 & 4 can be done in parallel, as long as there is no FK constraint in place. A join on a numeric column is also more efficient than one on a string.
And there is no need to store the mapping for step 2, because it's reproducible (just make sure you pick the right algorithm).
Obviously this can be extended for snowflake schema and/or multiple dimensions.
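Purely as an illustration of step 2, here is a hedged Java sketch of deriving a reproducible numeric key from a dimension value with SHA-1. Taking the first 8 bytes of the digest as a signed long is an assumption, not something prescribed above:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SurrogateKey {

    // The same input string always yields the same key, so FACT and DIMENSION rows
    // can be loaded in parallel without storing a lookup table.
    public static long keyFor(String dimensionValue) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(dimensionValue.getBytes(StandardCharsets.UTF_8));
        // take the first 8 bytes of the 20-byte digest as a signed 64-bit key
        return ByteBuffer.wrap(digest, 0, 8).getLong();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(keyFor("Germany"));   // prints the same number every run
        System.out.println(keyFor("Germany"));
    }
}

With an 8-byte key, collisions are unlikely but not impossible, so it is worth checking the size of the dimension domain before settling on the key width.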
HTH

How to list all row keys in an hbase table?

How do I list all row keys in an hbase table?
I need to do this using PHP with a REST interface.
If you are listing all of the keys in an HBase table, then you are using the wrong tool. HBase is for large data systems where it is impractical to list all of the keys.
What may be more sensible is to start at a given key and list the next N keys (for values of N less than 10K). There are nice Java interfaces for doing this type of thing with a scan -- setting a start key and/or an end key.
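As a hedged illustration of that scan-based approach with the 0.90-era Java client, the sketch below lists the next keys starting from a given row; the table name, the start key, and the 10,000-key cap are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ListNextKeys {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "tablename");        // placeholder table name
        Scan scan = new Scan(Bytes.toBytes("startKey"));     // begin at a given row key
        scan.setFilter(new FirstKeyOnlyFilter());            // only the first KeyValue per row
        ResultScanner scanner = table.getScanner(scan);
        int n = 0;
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));  // print the row key
            if (++n >= 10000) break;                         // stop after N keys
        }
        scanner.close();
        table.close();
    }
}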
Most HBase functionality is exposed via the Thrift interface. I would suggest looking there.
I have found a way:
http://localhost:8080/tablename/* returns XML data, which I can preg_match to get the rows.
Better suggestions are welcome.
This...
http://localhost:8080/tablename/*/columnfamily:columnid
...will return all values in your table for that column, rather like applying a column filter in the scanner.
Also, if you're looking for multiple columns, separate them with a comma.
So: /tablename/*/columnfamily:columnid,columnfamily:columnid2
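Since the question asks for PHP over a REST interface, the endpoints above can be called from any HTTP client. Purely as a hedged illustration, here is a small Java sketch that fetches the same URL and prints the raw XML; the URL path is copied from the examples above, and the Accept header is an assumption about the gateway's content negotiation:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestRowDump {
    public static void main(String[] args) throws Exception {
        // same endpoint pattern as above; "tablename" and the column spec are placeholders
        URL url = new URL("http://localhost:8080/tablename/*/columnfamily:columnid");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "text/xml");   // ask the gateway for XML
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);                // raw XML; parse or pattern-match as needed
            }
        } finally {
            conn.disconnect();
        }
    }
}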
I don't know what the REST interface is like, but you probably want to filter some data out on the server side to avoid large RPC responses. You can do this by adding server-side filters to your scan:
Scan s = new Scan();
FilterList fl = new FilterList();
// returns the first KeyValue of each row, then skips to the next row
fl.addFilter(new FirstKeyOnlyFilter());
// only return the key, don't return the value
fl.addFilter(new KeyOnlyFilter());
s.setFilter(fl);
// the table name here is a placeholder
HTable myTable = new HTable(HBaseConfiguration.create(), "tablename");
ResultScanner rs = myTable.getScanner(s);
Result row = rs.next();
while (row != null) {
    // row.getRow() holds the row key
    row = rs.next();
}
rs.close();
http://svn.apache.org/repos/asf/hbase/branches/0.90/src/main/java/org/apache/hadoop/hbase/filter/