Scan HTable rows for specific column value using HBase shell - nosql

I want to scan rows in an HTable from the HBase shell where a column (e.g., tweet:user_id) has a particular value.
I want to find all rows where tweet:user_id has the value test1, i.e. rows containing a cell like:
column=tweet:user_id, timestamp=1339581201187, value=test1
I can scan the table for a particular column using
scan 'tweetsTable',{COLUMNS => 'tweet:user_id'}
but I did not find any way to scan for rows with a particular value.
Is it possible to do this via HBase Shell?
I checked this question as well.

It is possible without Hive:
scan 'filemetadata',
{ COLUMNS => 'colFam:colQualifier',
LIMIT => 10,
FILTER => "ValueFilter( =, 'binaryprefix:<someValue, e.g. test1 as defined in the question>' )"
}
Note: to find all rows that contain test1 as the value, as specified in the question, use binaryprefix:test1 in the filter (see this answer for more examples).
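For the question's table, that looks like the following (a minimal sketch; tweetsTable and tweet:user_id are taken from the question):
scan 'tweetsTable', { COLUMNS => 'tweet:user_id', FILTER => "ValueFilter( =, 'binaryprefix:test1' )" }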

Nishu,
here is a solution I use periodically. It is actually much more powerful than you need right now, but I think you will use its power some day. Yes, it is for the HBase shell.
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'yourTable', {LIMIT => 10, FILTER => SingleColumnValueFilter.new(Bytes.toBytes('family'), Bytes.toBytes('field'), CompareFilter::CompareOp.valueOf('EQUAL'), Bytes.toBytes('AAA')), COLUMNS => 'family:field' }
Only the family:field column is returned, with the filter applied. This filter could be improved to perform more complicated comparisons, for example as sketched below.
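A substring match instead of an exact comparison could look like this (a sketch reusing the SubstringComparator import above; the table, family, field and value names are placeholders):
scan 'yourTable', {LIMIT => 10, FILTER => SingleColumnValueFilter.new(Bytes.toBytes('family'), Bytes.toBytes('field'), CompareFilter::CompareOp.valueOf('EQUAL'), SubstringComparator.new('AAA')), COLUMNS => 'family:field' }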
Here are also some hints that I consider most useful:
http://hadoop-hbase.blogspot.com/2012/01/hbase-intra-row-scanning.html - Intra-row scanning explanation (Java API).
https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/filter/FilterBase.html - JavaDoc for the FilterBase class, with links to descendants which can be used in the same style. The shell syntax will be slightly different, but with the example above you can adapt them.

As there were multiple requests to explain this answer, this additional answer has been posted.
Example 1
If
scan '<table>', { COLUMNS => '<column>', LIMIT => 3 }
would return:
ROW COLUMN+CELL
ROW1 column=<column>, timestamp=<timestamp>, value=hello_value
ROW2 column=<column>, timestamp=<timestamp>, value=hello_value2
ROW3 column=<column>, timestamp=<timestamp>, value=hello_value3
then this filter:
scan '<table>', { COLUMNS => '<column>', LIMIT => 3, FILTER => "ValueFilter( =, 'binaryprefix:hello_value2') OR ValueFilter( =, 'binaryprefix:hello_value3')" }
would return:
ROW COLUMN+CELL
ROW2 column=<column>, timestamp=<timestamp>, value=hello_value2
ROW3 column=<column>, timestamp=<timestamp>, value=hello_value3
Example 2
Negation ('not equal') is supported as well:
scan '<table>', { COLUMNS => '<column>', LIMIT => 3, FILTER => "ValueFilter( !=, 'binaryprefix:hello_value2' )" }
would return:
ROW COLUMN+CELL
ROW1 column=<column>, timestamp=<timestamp>, value=hello_value
ROW3 column=<column>, timestamp=<timestamp>, value=hello_value3

An example of a text search for the value BIGBLUE in table t1, in column d:a_content (family d, qualifier a_content). A plain scan of the table shows all the available values:
scan 't1'
...
column=d:a_content, timestamp=1404399246216, value=BIGBLUE
...
To search just for the value BIGBLUE with a limit of 1, try the command below:
scan 't1',{ COLUMNS => 'd:a_content', LIMIT => 1, FILTER => "ValueFilter( =, 'regexstring:BIGBLUE' )" }
COLUMN+CELL
column=d:a_content, timestamp=1404399246216, value=BIGBLUE
Obviously, removing the limit will show all occurrences in that table/column family.
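For instance, the same scan without the LIMIT (a sketch using the table and column from above):
scan 't1',{ COLUMNS => 'd:a_content', FILTER => "ValueFilter( =, 'regexstring:BIGBLUE' )" }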

To scan a table in HBase based on any column value, SingleColumnValueFilter can be used as follows:
scan 'tablename' ,
{
FILTER => "SingleColumnValueFilter('column_family','col_name',>, 'binary:1')"
}

From the HBase shell I think it is not possible, because this is somewhat like a query for finding specific data. As we all know, HBase is NoSQL, so when we want to apply a query, or in a case like yours, I think you should use Hive or Pig; Hive is the better approach, because with Pig we need to mess with scripts.
Anyway, you can get good guidance about Hive from here: HIVE integration with HBase, and from here.
If your only purpose is to view the data, not to fetch it from code (of any client), then you can use HBase Explorer, or a new and very good product, still in its beta release, called "HBase Manager". You can get it from HBase Manager.
It is simple and, more importantly, it helps to insert and delete data and to apply filters on column qualifiers from a UI, like other DB clients. Have a try.
I hope it will be helpful for you :)

Slightly different question, but if you want to query a specific column which is not present in all rows, DependentColumnFilter is your best friend:
import org.apache.hadoop.hbase.filter.DependentColumnFilter
scan 'orgtable2', {FILTER => "DependentColumnFilter('cf1','lan',false,=,'binary:fre')"}
The previous scan will return all columns for the rows in which the lan column is present and for which its associated value is equal to fre. The third argument is dropDependentColumn; if set to true, it prevents the lan column itself from being displayed in the results, as in the sketch below.
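For instance (same table and filter as above, with only the third argument changed to true), the matching rows are returned but the lan column itself is omitted from the output:
scan 'orgtable2', {FILTER => "DependentColumnFilter('cf1','lan',true,=,'binary:fre')"}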

Related

In Postgres, how can I efficiently filter using the inner numbers of this jsonb structure?

So I work with PostgreSQL, and I have a jsonb column with the following structure:
{
  "Store1": [
    { "price": 5.99,  "seller": "seller" },
    { "price": 56.43, "seller": "seller" }
  ],
  "Store2": [
    { "price": 45.65, "seller": "seller" },
    { "price": 44.66, "seller": "seller" }
  ]
}
I have a jsonb like this for every product in the database. I want to run an SQL query that will answer the following question:
For each product, is one of the prices in this JSON bigger than / equal to / smaller than X?
Basically, filter the products to include only the ones that have at least one price satisfying a mathematical condition.
How can I do it efficiently? What's the best way in Postgres to iterate a JSON like this, with a relatively complex inner structure?
Also, if I could control the way the data is structured (to an extent, I can), what changes can I do to make this query more efficient?
Thanks!
Use a json path expression:
WHERE col @@ '$.*[*].price < 20'
or
WHERE col @? '$.*[*] ? (@.price < 20)'
If you need to compare to another column or make the query parameterised, you can either build the jsonpath dynamically
WHERE col @@ format('$.*[*].price < %s', $1)::jsonpath
WHERE col @? format('$.*[*] ? (@.price < %s)', $1)::jsonpath
or you can use the respective function and pass variables as an object:
WHERE jsonb_path_match(col, '$.*[*].price < $limit', jsonb_build_object('limit', $1))
WHERE jsonb_path_exists(col, '$.*[*] ? (@.price < $limit)', jsonb_build_object('limit', $1))
I admit I had to check my cheat sheet to figure out the right combination of operator and expression. Takeaways:
if a comparison operator needs to work with multiple values, it generally functions as an ANY
@@ does not work with ? (@ …) filter expressions since they don't return a boolean,
@? does not work with predicates since they always return a value (even if it's false)
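A minimal worked example of the two forms (a sketch; the products table and its rows are hypothetical stand-ins for the real schema, and jsonpath requires PostgreSQL 12+):
CREATE TABLE products (id int, col jsonb);
INSERT INTO products VALUES
  (1, '{"Store1": [{"price": 5.99, "seller": "seller"}]}'),
  (2, '{"Store1": [{"price": 56.43, "seller": "seller"}]}');
-- predicate form: does any price satisfy "< 20"?
SELECT id FROM products WHERE col @@ '$.*[*].price < 20';        -- returns 1
-- filter-expression form: does an element with price < 20 exist?
SELECT id FROM products WHERE col @? '$.*[*] ? (@.price < 20)';  -- returns 1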
What changes can I do to make this query more efficient?
As @jjanes commented on my other answer, the jsonpath match col @@ '$.*[*].price < $limit' isn't going to be fast and needs a full table scan, at least for < and >. To make a useful index, a different approach is required. An index can only compare against a single value, not against any number of them. For that, we need to change the condition from EXISTS(SELECT prices_of(col) WHERE price < $limit) to (SELECT MIN(prices_of(col))) < $limit.
With this idea it is possible to build an expression index on the result of a custom immutable function:
CREATE FUNCTION min_price(data jsonb) RETURNS float
LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT
RETURN (
SELECT min((offer ->> 'price')::float)
FROM jsonb_each(data) AS entries(name, store),
LATERAL jsonb_array_elements(store) AS elements(offer)
);
CREATE INDEX example_min_data_price_idx ON example (min_price(data));
which you can use as
SELECT * FROM example WHERE min_price(data) < 20;
Looking for rows with a price larger than a certain number requires a separate index on max_price(data), sketched below. If you want to use the index in a JOIN with more conditions, consider making it a multi-column index.
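The max_price counterpart could look like this (a sketch mirroring the min_price function above; it is not part of the original answer):
CREATE FUNCTION max_price(data jsonb) RETURNS float
LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT
RETURN (
  SELECT max((offer ->> 'price')::float)
  FROM jsonb_each(data) AS entries(name, store),
  LATERAL jsonb_array_elements(store) AS elements(offer)
);
CREATE INDEX example_max_data_price_idx ON example (max_price(data));
-- usage: SELECT * FROM example WHERE max_price(data) > 20;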
Looking for rows with a price equal to a certain number can be optimised by indexing the jsonb column and using a jsonpath:
CREATE INDEX example_data_idx ON example USING GIN (data jsonb_ops);
SELECT * FROM example WHERE data @@ '$.*[*].price == 20';
SELECT * FROM example WHERE data @? '$.*[*] ? (@.price == 20)';
Unfortunately you can't use jsonb_path_ops here since that doesn't support the wildcard.

Indexing a jsonb column in PostgreSQL

I have a column in a PostgreSQL table with type jsonb.
{
  .....
  "type": "car",
  "vehicleIds": [
    "980e3761-935a-4e52-be77-9f9461dec4d1",
    "980e3761-935a-4e52-be77-9f9461dec4d2"
  ]
  .....
}
The application runs queries against these fields to fetch records. I need to index this column only for these fields.
How can this be done?
This is the query structure, with properties as the column name:
SELECT *
FROM Vehicle f
WHERE f.properties::text @@ CONCAT('$.vehicleIds[*] >', :vehicleId ) = true
AND f.properties::text @@ CONCAT('$.type >', :type ) = true
The query you are using is highly confusing, as it boils down to a text search query: the @@ operator is applied to a text value.
I also don't understand the '$.type > ... condition. With values like car I would expect an equality operator, rather than "greater than". Using > together with a UUID also doesn't seem to make sense.
If you want to search for rows whose type is car and whose vehicleIds list contains a given ID, using the "contains" operator @> is a better way to do that:
SELECT *
FROM Vehicle f
WHERE f.properties @> '{"type": "car", "vehicleIds": ["980e3761-935a-4e52-be77-9f9461dec4d1"]}'
The above could make use of a GIN index on the properties column:
create index on vehicles using gin (properties);
If the type key is always queried with equality (which I assume), a combined index might be more efficient:
create index on vehicles using gin ( (properties ->> 'type'), (properties -> 'vehicleIds') );
You need to install the btree_gin extension in order to create that index.
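For example (btree_gin ships with the standard PostgreSQL contrib extensions):
CREATE EXTENSION IF NOT EXISTS btree_gin;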
That index would be a bit smaller but needs a different query:
SELECT *
FROM Vehicle f
WHERE f.properties ->> 'type' = 'car'
AND f.properties -> 'vehicleIds' @> '["980e3761-935a-4e52-be77-9f9461dec4d1"]'
You will need to validate whether the indexes are used, and which one is more efficient, by looking at the execution plan.
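For example (a sketch reusing the query from above):
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM Vehicle f
WHERE f.properties ->> 'type' = 'car'
  AND f.properties -> 'vehicleIds' @> '["980e3761-935a-4e52-be77-9f9461dec4d1"]';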

Is there a Scala collection that maintains the order of insert?

I have a List hdtList which contains the columns of a Hive table:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string
I have a List: partition_columns which contains two elements: source_system_name, period_year
Using the List partition_columns, I am trying to match the columns and move the corresponding entries in hdtList to the end of it, as below:
val (pc, notPc) = hdtList.partition(c => partition_columns.contains(c.takeWhile(x => x != ' ')))
But when I print them as: println(notPc.mkString(",") + "," + pc.mkString(","))
I see the output unordered as below:
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,period_year bigint,source_system_name string
The column period_year comes first and source_system_name last. Is there any way I can produce the data as below, so that the order of columns in partition_columns is maintained?
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,source_system_name string,period_year bigint
I know there is an option to reverse a List but I'd like to learn if I can implement a collection that maintains that order of insert.
It doesn't matter which collections you use; you only use partition_columns to call contains which doesn't depend on its order, so how could it be maintained?
But your code does maintain order: it's just hdtList's.
Something like
// get is ugly, but safe here
val pc1 = partition_columns.map(x => pc.find(y => y.startsWith(x)).get)
after your code will give you the desired order, though there's probably a more efficient way to do it.
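Putting it together (a minimal, self-contained sketch with a shortened column list, not the full one from the question):
object ReorderPartitionColumns extends App {
  val hdtList = List("forecast_id bigint", "period_year bigint", "source_system_name string", "year string")
  val partition_columns = List("source_system_name", "period_year")

  // split off the partition columns, keeping everything else in hdtList order
  val (pc, notPc) = hdtList.partition(c => partition_columns.contains(c.takeWhile(_ != ' ')))
  // re-order pc so it follows the order of partition_columns rather than of hdtList
  val pc1 = partition_columns.map(x => pc.find(_.startsWith(x)).get) // get is ugly, but safe here

  println((notPc ++ pc1).mkString(","))
  // forecast_id bigint,year string,source_system_name string,period_year bigint
}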

Minus logic implementation not working with spark/scala

Minus Logic in Hive:
The below Hive query will return only records available in the left-side table (Full_Table ft), but not in both:
Select ft.* from Full_Table ft left join Stage_Table stg on ft.primary_key1 = stg.primary_key1 and ft.primary_key2 = stg.primary_key2 where stg.primary_key1 IS null and stg.primary_key2 IS null
I tried to implement the same in Spark/Scala using the following method (to support both single and composite primary keys), but the joined result set does not have the columns from the right table, so I am not able to apply the stg.primary_key2 IS null condition to it.
ft.join(stg, usingColumns, "left_outer") // used seq to support composite key column join
Please suggest how to implement minus logic in Spark/Scala.
Thanks,
Saravanan
https://www.linkedin.com/in/saravanan303/
If your tables have the same columns, you can use the except method from Dataset:
fullTable.except(stageTable)
If they don't, but you are interested only in a subset of columns that exists in both tables, you can first select those columns using the select transformation and then use except:
val fullTableSelectedColumns = fullTable.select("c1", "c2", "c3")
val stageTableSelectedColumns = stageTable.select("c1", "c2", "c3")
fullTableSelectedColumns.except(stageTableSelectedColumns)
Otherwise, you can use the join and filter transformations; to get the "minus" semantics, the filter must keep only the rows where the right-hand key is null:
fullTable
  .join(stageTable, fullTable("primary_key") === stageTable("primary_key"), "left")
  .filter(stageTable("primary_key").isNull)
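A left anti join expresses the same "minus" logic more directly and also handles composite keys (a sketch, not from the original answer; the key column names are placeholders):
// rows of fullTable that have no match in stageTable on the given key columns
val minusResult = fullTable.join(stageTable, Seq("primary_key1", "primary_key2"), "left_anti")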

Scan with filter using HBase shell

Does anybody know how to scan records based on some scan filter, e.g.:
column:something = "somevalue"
Something like this, but from HBase shell?
Try this. It's kind of ugly, but it works for me.
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 't1', { COLUMNS => 'family:qualifier', FILTER =>
  SingleColumnValueFilter.new(Bytes.toBytes('family'),
    Bytes.toBytes('qualifier'),
    CompareFilter::CompareOp.valueOf('EQUAL'),
    SubstringComparator.new('somevalue')) }
The HBase shell will include whatever you have in ~/.irbrc, so you can put something like this in there (I'm no Ruby expert, improvements are welcome):
# imports like above
def scan_substr(table, family, qualifier, substr, *cols)
  scan table, { COLUMNS => cols, FILTER =>
    SingleColumnValueFilter.new(Bytes.toBytes(family), Bytes.toBytes(qualifier),
      CompareFilter::CompareOp.valueOf('EQUAL'),
      SubstringComparator.new(substr)) }
end
and then you can just say in the shell:
scan_substr 't1', 'family', 'qualifier', 'somevalue', 'family:qualifier'
scan 'test', {COLUMNS => ['F'],FILTER => \
"(SingleColumnValueFilter('F','u',=,'regexstring:http:.*pdf',true,true)) AND \
(SingleColumnValueFilter('F','s',=,'binary:2',true,true))"}
More information can be found here. Note that multiple examples reside in the attached Filter Language.docx file.
Use the FILTER param of scan, as shown in the usage help:
hbase(main):002:0> scan
ERROR: wrong number of arguments (0 for 1)
Here is some help for this command:
Scan a table; pass table name and optionally a dictionary of scanner
specifications. Scanner specifications may include one or more of:
TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH,
or COLUMNS. If no columns are specified, all columns will be scanned.
To scan all members of a column family, leave the qualifier empty as in
'col_family:'.
Some examples:
hbase> scan '.META.'
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
For experts, there is an additional option -- CACHE_BLOCKS -- which
switches block caching for the scanner on (true) or off (false). By
default it is enabled. Examples:
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

Scan scan = new Scan();
// In case you have multiple SingleColumnValueFilters, you would want the row
// to pass MUST_PASS_ALL conditions or the MUST_PASS_ONE condition.
FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ALL);
SingleColumnValueFilter filter_by_name = new SingleColumnValueFilter(
    Bytes.toBytes("SOME COLUMN FAMILY"),
    Bytes.toBytes("SOME COLUMN NAME"),
    CompareOp.EQUAL,
    Bytes.toBytes("SOME VALUE"));
// Set this if you don't want the rows that have the column missing.
// Remember that adding the column filter doesn't mean that rows lacking the
// column will be excluded from the result set; they will be included unless
// you add this statement.
filter_by_name.setFilterIfMissing(true);
list.addFilter(filter_by_name);
scan.setFilter(list);
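To actually execute the scan (a sketch using the standard HBase client API; the connection setup and the table name t1 are placeholders, not part of the original answer):
// imports: org.apache.hadoop.hbase.{HBaseConfiguration, TableName}, org.apache.hadoop.hbase.client.*
try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
     Table table = connection.getTable(TableName.valueOf("t1"));
     ResultScanner scanner = table.getScanner(scan)) {
    for (Result result : scanner) {
        System.out.println(result); // each Result holds the matching row's cells
    }
}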
One of the filters is ValueFilter, which can be used to filter on all column values.
hbase(main):067:0> scan 'dummytable', {FILTER => "ValueFilter(=,'binary:2016-01-26')"}
binary is one of the comparators used within the filter. You can use different comparators within the filter based on what you want to do.
You can refer following url: http://www.hadooptpoint.com/filters-in-hbase-shell/.
It provides good examples on how to use different filters in HBase Shell.
Set setFilterIfMissing(true) on the filter before passing it to the scan:
hbase(main):009:0> import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.Filter;
f = SingleColumnValueFilter.new(Bytes.toBytes('account'), Bytes.toBytes('ACCOUNT_NUMBER'),
  CompareFilter::CompareOp.valueOf('EQUAL'), BinaryComparator.new(Bytes.toBytes('0003000587')))
f.setFilterIfMissing(true)
scan 'test:test8', { FILTER => f }