I'm new to HBase and want to save multiple values for a row key in HBase.
Is this possible?
For example
RowKey | Values
1212 | 12
1213 | 12, 13, 14
Yes, this is possible. You can think of the HBase data model as several nested maps:
Map<RowKey, Map<ColumnFamilyKey, Map<ColumnKey, Map<Version, Value>>>>. All keys, like the values, are byte arrays, except the version, which is a long (64-bit integer). The column families must be predefined for the table, and there should be no more than 3-4 of them for performance reasons. This gives you two ways to store multiple values per row: in different columns, or in a single column with different versions.
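The two variants can be sketched with plain nested dictionaries. This is only a toy model of the nested-map idea, not the HBase client API; the family name `cf`, the qualifiers `v0`..`v2` and `val`, and the helper names are made up for illustration.

```python
def put(table, row, family, qualifier, version, value):
    """Store value at table[row][family][qualifier][version]."""
    columns = table.setdefault(row, {}).setdefault(family, {})
    columns.setdefault(qualifier, {})[version] = value


values = [b"12", b"13", b"14"]

# Variant 1: one column (qualifier) per value, all at version 1.
by_column = {}
for i, value in enumerate(values):
    put(by_column, b"1213", b"cf", b"v%d" % i, 1, value)

# Variant 2: a single column holding one version per value.
by_version = {}
for version, value in enumerate(values, start=1):
    put(by_version, b"1213", b"cf", b"val", version, value)


def values_by_column(table, row, family):
    """Latest version of every column in the family, ordered by qualifier."""
    columns = table[row][family]
    return [columns[q][max(columns[q])] for q in sorted(columns)]


def values_by_version(table, row, family, qualifier):
    """All versions of one column, oldest first."""
    versions = table[row][family][qualifier]
    return [versions[v] for v in sorted(versions)]
```

With the second variant on a real cluster, the column family has to be configured to retain more than one version, otherwise older values are not kept.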
We often have columns that can contain values of varying sizes. For these, I like to set the data type to VARCHAR with a size way beyond the current maximum length. For example, if I have a column where the current minimum length for a value is 10 and the maximum length is 35, I might set the data type to VARCHAR(64). My rationale is that Db2 stores the 2-byte length followed by the exact value, so from a storage perspective there is no difference between defining the data type as VARCHAR(64) and VARCHAR(35). And I don't get an error if a value with a length of 36 comes along.
Is there a nuance that I'm missing and should I not be so glib about my VARCHAR assignments?
The exact formula to calculate row length is described in the docs for CREATE TABLE. VARCHAR(64) or VARCHAR(35) should not make a difference.
Be aware that rows are stored in data pages in tablespaces. Database systems usually pre-allocate pages for performance reasons. Moreover, pages might not be completely filled, or compression may be in use. You might also have defined indexes, which require their own pages and structures. Plus there is metadata in the system catalog.
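As a rough illustration of the point about declared length, here is a sketch that uses only the simplified model from the question (a 2-byte length prefix followed by the actual bytes). It deliberately ignores null indicators, page overhead, and compression, so it is not the exact formula from the CREATE TABLE docs; the function name is hypothetical.

```python
def varchar_stored_bytes(value: str, declared_max: int) -> int:
    """Bytes used by one VARCHAR value under the simplified model:
    2-byte length prefix + actual data length (assumption, not the Db2 formula)."""
    data = value.encode("utf-8")
    if len(data) > declared_max:
        # Mirrors the error a too-long value would raise on insert.
        raise ValueError(f"value of {len(data)} bytes exceeds VARCHAR({declared_max})")
    return 2 + len(data)


value = "x" * 35
same_size = varchar_stored_bytes(value, 35) == varchar_stored_bytes(value, 64)
```

Under this model the declared maximum only decides whether a 36-byte value is rejected; it does not change what a 35-byte value costs.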
I'm working in Pentaho 4.4.1-GA (Kettle / PDI). The database is Postgres.
I need to be able to insert multiple records into a fact table based on the fields that come from a single record. The single record contains fields:
productcode1, price1
productcode2, price2
productcode3, price3
...
productcode10,price10
So if there was a value for each of the 10 productcode / prices then I'd need to insert a total of 10 records into the fact table. If there were values for 4 of the combinations, then I'd need to insert 4 records into the fact table, etcetera. All field values for the fact records would be identical except for the PK (generated by sequence), product codes, and prices.
I figure that I need some type of looping construct which would let me check whether or not a value was present for each productx field, and if so, do an insert/update step on the fact table with the desired field values. I'm just not sure how to do this in Pentaho.
Any ideas? All suggestions are welcome :)
Thank You,
Rakesh
Could you give a sample input and output for your scenario?
From your example data I can infer that if there are 10 different product codes and only 4 product prices you want to have 4 records inserted into your table. Is that so?
Well, for a start you can filter for NOT NULL, add a constant value of 1 to the records that pass, and then use a Group By step to count the 1's. This would give you the count. BTW, it would be helpful if you could provide more details on which columns you would be loading, as there are ways to make a PDI transformation execute multiple times.
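Outside of PDI, the expansion described in the question (one output record per filled productcode/price pair, with all other fields copied) can be sketched as below. This is not a Kettle transformation, just the logic; the field `order_id` and the rule that a pair is skipped when either half is missing are assumptions. As far as I know, PDI's Row Normaliser step is meant for this kind of reshaping.

```python
def expand_products(record, max_pairs=10):
    """Turn one wide record into one row per filled productcodeN/priceN pair."""
    # Fields that are identical on every output row.
    shared = {
        key: value
        for key, value in record.items()
        if not key.startswith(("productcode", "price"))
    }
    rows = []
    for i in range(1, max_pairs + 1):
        code = record.get(f"productcode{i}")
        price = record.get(f"price{i}")
        if code is None or price is None:
            continue  # assumption: an incomplete pair produces no fact row
        rows.append({**shared, "productcode": code, "price": price})
    return rows


example = {
    "order_id": 7,  # hypothetical shared field
    "productcode1": "A", "price1": 1.0,
    "productcode2": "B", "price2": 2.5,
    "productcode3": None, "price3": None,
}
fact_rows = expand_products(example)
```

The primary key is left out on purpose, since the question says it comes from a database sequence.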
I am building a database with a couple of million records, and I've got a question regarding one of the relational tables which will be used to store two searchable reference numbers. I am new to this, so I apologize if this has been asked before.
id digit1 digit2
varchar(9) varchar(9) varchar(9)
Is it better to a) keep two separate optional columns in one table or b) use two separate tables for digit1 and digit2?
What kind of MySQL character type should I use if digit1 always consists of 6 to 9 digits and digit2 always consists of the same 3 letters followed by 6 digits? How do I limit the input to values that follow these rules?
Thanks!
Actually, if you're going to store numbers and you don't want to query by digit1 and digit2 at the same time, it's better to keep them apart in different tables. Otherwise, it's better to keep them in the same table, or you'll have a painful join. It also depends on how sparse your data is (I mean, if there are many values in column 2 and just a few in column 3, it's probably better to keep them apart too).
Now, what will make a bigger difference here, if you want to store numbers, is to use a numeric field for the values instead of varchar, which will be smaller and faster to search and index (and so to retrieve).
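On the question of limiting the input, one option is to validate in the application before inserting. A minimal sketch of the two rules as stated in the question follows; the prefix `ABC` is a placeholder for the fixed three letters, which the question does not name.

```python
import re

DIGIT1_PATTERN = re.compile(r"[0-9]{6,9}")


def is_valid_digit1(text: str) -> bool:
    """digit1: 6 to 9 digits and nothing else."""
    return DIGIT1_PATTERN.fullmatch(text) is not None


def is_valid_digit2(text: str, prefix: str = "ABC") -> bool:
    """digit2: the fixed 3-letter prefix (hypothetical 'ABC') plus 6 digits."""
    return re.fullmatch(re.escape(prefix) + r"[0-9]{6}", text) is not None
```

Since the three letters are always the same, only the six digits carry information, which fits the suggestion above to store the value in a numeric column.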
I have about a billion pieces of data that I would like to store in Cassandra. The data items are ordered by time, and one of the main queries I'll be doing is to find the items between two times, in order. I'd really prefer to use the RandomPartitioner, if at all possible. Is there a way to do this in Cassandra?
At first, since I'm coming from SQL, I assumed I should create each event as a row, but then it occurred to me that I was thinking about it the wrong way and I should really use columns. Columns in Cassandra seem to be ordered, but I'm confused as to just how ordered they are. If I use a time as the column name, is there a way for me to get all of the columns from one time to another in order?
Another thing I looked at was the 0.7 feature of secondary indices, but I've had trouble finding documentation for whether I can use these to view a range of things in order.
All I want is the Cassandra equivalent of this SQL: "Select * from Stuff where date > X and date < Y order by date asc". How can I do this?
The partitioner only affects the distribution of keys around the ring, not the order of columns within a key. Columns are always ordered according to the Column Comparator defined for the column family.
You can call get_slice with a SlicePredicate that specifies a SliceRange to get all the columns of a key within a range.
To model your data, you can create 1 row for each day (or suitable time shard) and have a column for each piece of data. Something like,
"yyyy-mm-dd" : { #key, one for each day
timeStampMillis1:dataid1 : "value1" # one column for each piece of data
timeStampMillis2:dataid2 : "value2"
timeStampMillis3:dataid3 : "value3"
}
The column names should be binary, using the binary comparator. The first 8 bytes are the timestamp, while the rest of the bytes are the id of the data.
Assuming X and Y are on the same day, to find all items between X and Y, do a get_slice on the day key, with a SlicePredicate containing a SliceRange that specifies a start of X and a finish of Y+1. Both start and finish are byte arrays of 8 bytes.
To find data over multiple days, read from multiple keys.
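To show why this column layout gives an ordered range read, here is a small in-memory simulation. It does not call Cassandra or Thrift; it only mimics byte-wise column ordering on a sorted list. The helper names are invented, and big-endian encoding of the timestamp is an assumption needed so that byte order matches numeric order.

```python
import struct
from bisect import bisect_left, bisect_right


def column_name(timestamp_millis: int, data_id: bytes) -> bytes:
    """8-byte big-endian timestamp followed by the data id."""
    return struct.pack(">Q", timestamp_millis) + data_id


def slice_between(columns, x_millis: int, y_millis: int):
    """Columns of one row with names from X up to Y+1, in name order.

    `columns` is a list of (name, value) pairs sorted by name, standing in
    for a row sorted by a byte-wise comparator.
    """
    start = struct.pack(">Q", x_millis)
    finish = struct.pack(">Q", y_millis + 1)
    names = [name for name, _ in columns]
    return columns[bisect_left(names, start):bisect_right(names, finish)]


# One "day" row with three columns (timestamps and ids are made up).
day_row = sorted([
    (column_name(3000, b"dataid3"), "value3"),
    (column_name(1000, b"dataid1"), "value1"),
    (column_name(2000, b"dataid2"), "value2"),
])
found = [value for _, value in slice_between(day_row, 1500, 3000)]
```

The finish of Y+1 matters because a column at timestamp Y has extra id bytes after the timestamp, so its name sorts after the bare 8-byte encoding of Y.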
Is it possible to predict the amount of disk space/memory that will be used by a basic index in PostgreSQL 9.0?
E.g. If I have a default B-tree index on an integer column in a table of 1 million rows, how much space would be taken up by the index? Is the entire index held in memory at all times?
Not really a definitive answer, but I looked at a table in a 9.0 test system I have, with a couple of int indexes on a table of 280k rows. The indexes all report a size of 6232 kB, so roughly 22 bytes per row.
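The arithmetic behind that estimate, and a naive scale-up to the 1 million rows in the question, looks like this. It assumes "kB" means 1024 bytes and that index size grows linearly with row count, neither of which is guaranteed.

```python
INDEX_SIZE_KB = 6232   # reported size of one int index
ROW_COUNT = 280_000    # rows in the test table

# Assumption: 1 kB = 1024 bytes.
bytes_per_row = INDEX_SIZE_KB * 1024 / ROW_COUNT


def estimate_index_mib(rows: int) -> float:
    """Linear extrapolation of index size in MiB (rough guess only)."""
    return rows * bytes_per_row / (1024 * 1024)


estimate_for_million = estimate_index_mib(1_000_000)
```

This lands a little under 23 bytes per row and somewhere above 20 MiB for a million rows. On a real system, measuring the index with `pg_relation_size` is more reliable than extrapolating.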
There is no way to say that. It depends on the type of operations you will make, as PostgreSQL stores many different versions of the same row, including row versions stored in index files.
Just make the table you are interested in and check it.