Scan with filter using HBase shell - nosql

Does anybody know how to scan records based on some scan filter i.e.:
column:something = "somevalue"
Something like this, but from HBase shell?

Try this. It's kind of ugly, but it works for me.
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 't1', { COLUMNS => 'family:qualifier', FILTER =>
  SingleColumnValueFilter.new(
    Bytes.toBytes('family'),
    Bytes.toBytes('qualifier'),
    CompareFilter::CompareOp.valueOf('EQUAL'),
    SubstringComparator.new('somevalue')) }
The HBase shell will include whatever you have in ~/.irbrc, so you can put something like this in there (I'm no Ruby expert, improvements are welcome):
# imports like above
def scan_substr(table, family, qualifier, substr, *cols)
  scan table, { COLUMNS => cols, FILTER =>
    SingleColumnValueFilter.new(
      Bytes.toBytes(family), Bytes.toBytes(qualifier),
      CompareFilter::CompareOp.valueOf('EQUAL'),
      SubstringComparator.new(substr)) }
end
and then you can just say in the shell:
scan_substr 't1', 'family', 'qualifier', 'somevalue', 'family:qualifier'

scan 'test', {COLUMNS => ['F'],FILTER => \
"(SingleColumnValueFilter('F','u',=,'regexstring:http:.*pdf',true,true)) AND \
(SingleColumnValueFilter('F','s',=,'binary:2',true,true))"}
More information can be found here. Note that multiple examples reside in the attached Filter Language.docx file.

Use the FILTER param of scan, as shown in the usage help:
hbase(main):002:0> scan
ERROR: wrong number of arguments (0 for 1)
Here is some help for this command:
Scan a table; pass table name and optionally a dictionary of scanner
specifications. Scanner specifications may include one or more of:
TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH,
or COLUMNS. If no columns are specified, all columns will be scanned.
To scan all members of a column family, leave the qualifier empty as in
'col_family:'.
Some examples:
hbase> scan '.META.'
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
For experts, there is an additional option -- CACHE_BLOCKS -- which
switches block caching for the scanner on (true) or off (false). By
default it is enabled. Examples:
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

Scan scan = new Scan();
// In case you have multiple SingleColumnValueFilters, decide whether a row must
// pass ALL conditions (MUST_PASS_ALL) or just ONE condition (MUST_PASS_ONE).
FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ALL);
SingleColumnValueFilter filter_by_name = new SingleColumnValueFilter(
    Bytes.toBytes("SOME COLUMN FAMILY"),
    Bytes.toBytes("SOME COLUMN NAME"),
    CompareOp.EQUAL,
    Bytes.toBytes("SOME VALUE"));
// Skip rows where the column is missing. Remember that adding the column filter
// alone does not exclude rows that lack the column; they will still be put into
// the result set unless you include this statement.
filter_by_name.setFilterIfMissing(true);
list.addFilter(filter_by_name);
scan.setFilter(list);
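For completeness, here is a minimal sketch of actually executing such a filtered scan through the client API. It is shown in Scala here (the HBase shell is JRuby over the same Java API), and the table name 'mytable' as well as the family/qualifier/value strings are placeholders:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.filter.{CompareFilter, FilterList, SingleColumnValueFilter}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("mytable")) // placeholder table name

// Same filter as above: keep only rows whose column equals the given value.
val filter = new SingleColumnValueFilter(
  Bytes.toBytes("SOME COLUMN FAMILY"), Bytes.toBytes("SOME COLUMN NAME"),
  CompareFilter.CompareOp.EQUAL, Bytes.toBytes("SOME VALUE"))
filter.setFilterIfMissing(true)

val scan = new Scan()
scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL, filter))

val scanner = table.getScanner(scan)
try {
  // Print the row key of every row that passed the filter.
  scanner.asScala.foreach(result => println(Bytes.toString(result.getRow)))
} finally {
  scanner.close()
  table.close()
  connection.close()
}
MUST_PASS_ALL behaves like an AND across the filters in the list; use MUST_PASS_ONE for an OR.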

One of the filters is ValueFilter, which can be used to filter on all column values.
hbase(main):067:0> scan 'dummytable', {FILTER => "ValueFilter(=,'binary:2016-01-26')"}
Here binary is one of the comparators that can be used within the filter; you can use a different comparator depending on what you want to match.
You can refer to the following URL: http://www.hadooptpoint.com/filters-in-hbase-shell/.
It provides good examples of how to use different filters in the HBase shell.

Add setFilterIfMissing(true) at the end of the query:
hbase(main):009:0> import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.Filter;
scan 'test:test8', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('account'),
Bytes.toBytes('ACCOUNT_NUMBER'), CompareFilter::CompareOp.valueOf('EQUAL'),
BinaryComparator.new(Bytes.toBytes('0003000587'))).setFilterIfMissing(true)}

Related

Indexing a jsonb column in PostgreSQL

I have a column in a PostgreSQL table with type jsonb.
{
.....
"type": "car",
"vehicleIds": [
"980e3761-935a-4e52-be77-9f9461dec4d1","980e3761-935a-4e52-be77-9f9461dec4d2"
]
.....
}
The application runs queries against these fields to fetch records, and I need to index this column only for these fields.
How can this be done?
This is the query structure, with properties as the column name:
SELECT *
FROM Vehicle f
WHERE f.properties::text @@ CONCAT('$.vehicleIds[*] > ', :vehicleId) = true
AND f.properties::text @@ CONCAT('$.type > ', :type) = true
The query you are using is highly confusing, as it boils down to a text search query: the @@ operator is applied to a text value.
I also don't understand the '$.type > ...' condition. With values like car I would expect an equality operator rather than "greater than". Using > together with a UUID doesn't seem to make sense either.
If you want to search for rows whose type is car and whose vehicleIds list contains a given ID, the "contains" operator @> is a better way to do that:
SELECT *
FROM Vehicle f
WHERE f.properties @> '{"type": "car", "vehicleIds": ["980e3761-935a-4e52-be77-9f9461dec4d1"]}'
The above could make use of a GIN index on the properties column:
create index on vehicles using gin (properties);
If the type key is always queried with equality (which I assume), a combined index might be more efficient:
create index on vehicles using gin ( (properties ->> 'type'), (properties -> 'vehicleIds') );
You need to install the btree_gin extension (CREATE EXTENSION btree_gin;) in order to create that index.
That index would be a bit smaller but needs a different query:
SELECT *
FROM Vehicle f
WHERE f.properties ->> 'type' = 'car'
AND f.properties -> 'vehicleIds' @> '["980e3761-935a-4e52-be77-9f9461dec4d1"]'
You will need to validate whether the indexes are used, and which one is more efficient, by looking at the execution plan (e.g. with EXPLAIN ANALYZE).

Index created for PostgreSQL jsonb column not utilized

I have created an index for a field in jsonb column as:
create index on Employee using gin ((properties -> 'hobbies'))
Query generated is:
CREATE INDEX employee_expr_idx ON public.employee USING gin (((properties -> 'hobbies'::text)))
My search query has structure as:
SELECT * FROM Employee e
WHERE e.properties @> '{"hobbies": ["trekking"]}'
AND e.department = 'Finance'
Running EXPLAIN command for this query gives:
Seq Scan on employee e (cost=0.00..4452.94 rows=6 width=1183)
Filter: ((properties @> '{"hobbies": ["trekking"]}'::jsonb) AND (department = 'Finance'::text))
Going by this, I am not sure the index is being used for the search.
Is this entire setup OK?
The expression you use in the WHERE clause must match the expression in the index exactly. Your index uses the expression ((properties -> 'hobbies'::text)), but your query only uses e.properties on the left-hand side.
To make use of that index, your WHERE clause needs to use the same expression as was used in the index:
SELECT *
FROM Employee e
WHERE (properties -> 'hobbies') @> '["trekking"]'
AND e.department = 'Finance'
However: your execution plan shows that the table employee is really tiny (rows=6). With a table as small as that, a Seq Scan is always going to be the fastest way to retrieve data, no matter what kind of indexes you define.

Is there a Scala collection that maintains the order of insert?

I have a List hdtList which contains the columns of a Hive table:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string
I have a List partition_columns which contains two elements: source_system_name and period_year.
Using partition_columns, I am trying to match those names and move the corresponding columns in hdtList to the end of it, as below:
val (pc, notPc) = hdtList.partition(c => partition_columns.contains(c.takeWhile(x => x != ' ')))
But when I print them as: println(notPc.mkString(",") + "," + pc.mkString(","))
I see the output unordered as below:
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,period_year bigint,source_system_name string
The column period_year comes first and source_system_name last. Is there any way I can produce the data as below, so that the order of the columns in partition_columns is maintained?
forecast_id bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string,source_system_name string,period_year bigint
I know there is an option to reverse a List, but I'd like to learn whether I can use a collection that maintains insertion order.
It doesn't matter which collections you use; you only use partition_columns to call contains which doesn't depend on its order, so how could it be maintained?
But your code does maintain order: it's just hdtList's.
Something like
// get is ugly, but safe here
val pc1 = partition_columns.map(x => pc.find(y => y.startsWith(x)).get)
after your code will give you the desired order, though there's probably a more efficient way to do it.
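To make that concrete, here is a small self-contained sketch; the column strings are abbreviated stand-ins for the asker's real Hive column list:
val hdtList = List(
  "forecast_id bigint", "period_year bigint", "period_num bigint",
  "source_system_name string", "year string")
val partition_columns = List("source_system_name", "period_year")

// Split off the partition columns (their order here still follows hdtList).
val (pc, notPc) = hdtList.partition(c => partition_columns.contains(c.takeWhile(_ != ' ')))

// Reorder pc to follow the order of partition_columns instead.
val orderedPc = partition_columns.map(p => pc.find(_.startsWith(p)).get) // .get is safe: every name was matched

println((notPc ++ orderedPc).mkString(","))
// forecast_id bigint,period_num bigint,year string,source_system_name string,period_year bigint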

Apache Spark Count by Group Method

I want to get a listing of values and counts for a specific column (column "a") in a Cassandra table using Datastax and Spark, but I'm having trouble determining the correct method of performing that request.
I'm essentially trying to do the equivalent of a T-SQL
SELECT a, COUNT(a)
FROM mytable
GROUP BY a
I've tried the following using DataStax and Spark on Cassandra:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val rdd = sc.cassandraTable("mykeyspace", "mytable").select("a")
rdd.groupBy(row => row.getString("a")).count()
This looks to just give me the count of distinct values in the a column, but I was more after a listing of the values and the counts of those values (so val1:10, val2:5, val3:12, and so forth). I've tried .collect and similar, but I'm not sure how to get that listing; any help would be appreciated.
The code snippet below will fetch the rows for the partition key "a", read the column "column_name", and count the occurrences of each of its values.
val cassandraPartitionKeys = List("a")
val partitionKeyRdd = sc.parallelize(cassandraPartitionKeys)
val cassandraRdd = partitionKeyRdd.joinWithCassandraTable(keyspace,table).map(x => x._2)
cassandraRdd.map(row => (row.getString("column_name"), 1)).countByKey().foreach(println)
It seems like this could be a partial answer (it provides the correct data, but there likely is a better solution)
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val rdd = sc.cassandraTable("mykeyspace", "mytable").groupBy(row => row.getString("a"))
rdd.foreach{ row => { println(row._1 + " " + row._2.count(x => true)) } }
I'm assuming there is a better solution, but this looks to work in terms of getting results.
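If you just need the value/count listing, a common alternative is to map each row to a (value, 1) pair and reduce by key, which avoids building whole groups in memory. A minimal sketch, reusing the keyspace, table, and column names from the question:
import com.datastax.spark.connector._

// One (value, count) pair per distinct value of column "a".
val counts = sc.cassandraTable("mykeyspace", "mytable")
  .select("a")
  .map(row => (row.getString("a"), 1L))
  .reduceByKey(_ + _)

// Bring the (presumably small) result back to the driver and print it as value:count.
counts.collect().foreach { case (value, count) => println(s"$value:$count") }
If the number of distinct values is small, calling countByValue() on the mapped values gives the same result directly as a driver-side Map.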

Scan HTable rows for specific column value using HBase shell

I want to scan rows in an HTable from the HBase shell where a column in a column family (i.e., tweet:user_id) has a particular value.
Now I want to find all rows where tweet:user_id has the value test1, as in this cell:
column=tweet:user_id, timestamp=1339581201187, value=test1
Though I can scan the table for a particular column using
scan 'tweetsTable',{COLUMNS => 'tweet:user_id'}
but I did not find any way to scan for rows with a particular value.
Is it possible to do this via the HBase shell?
I checked this question as well.
It is possible without Hive:
scan 'filemetadata',
{ COLUMNS => 'colFam:colQualifier',
LIMIT => 10,
FILTER => "ValueFilter( =, 'binaryprefix:<someValue.e.g. test1 AsDefinedInQuestion>' )"
}
Note: in order to find all rows that contain test1 as the value, as specified in the question, use binaryprefix:test1 in the filter (see this answer for more examples).
Nishu,
here is a solution I use periodically. It is actually much more powerful than you need right now, but I think you will use its power some day. Yes, it is for the HBase shell.
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'yourTable', {LIMIT => 10, FILTER => SingleColumnValueFilter.new(Bytes.toBytes('family'), Bytes.toBytes('field'), CompareFilter::CompareOp.valueOf('EQUAL'), Bytes.toBytes('AAA')), COLUMNS => 'family:field' }
Only the family:field column is returned, with the filter applied. This filter could be improved to perform more complicated comparisons.
Here are also hints for you that I consider most useful:
http://hadoop-hbase.blogspot.com/2012/01/hbase-intra-row-scanning.html - Intra-row scanning explanation (Java API).
https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/filter/FilterBase.html - JavaDoc for the FilterBase class, with links to descendants which can be used in the same style. The shell syntax will be slightly different, but with the example above you can use them the same way.
As there were multiple requests to explain this answer, this additional answer has been posted.
Example 1
If
scan '<table>', { COLUMNS => '<column>', LIMIT => 3 }
would return:
ROW COLUMN+CELL
ROW1 column=<column>, timestamp=<timestamp>, value=hello_value
ROW2 column=<column>, timestamp=<timestamp>, value=hello_value2
ROW3 column=<column>, timestamp=<timestamp>, value=hello_value3
then this filter:
scan '<table>', { COLUMNS => '<column>', LIMIT => 3, FILTER => "ValueFilter( =, 'binaryprefix:hello_value2') OR ValueFilter( =, 'binaryprefix:hello_value3')" }
would return:
ROW COLUMN+CELL
ROW2 column=<column>, timestamp=<timestamp>, value=hello_value2
ROW3 column=<column>, timestamp=<timestamp>, value=hello_value3
Example 2
A "not" condition is supported as well:
scan '<table>', { COLUMNS => '<column>', LIMIT => 3, FILTER => "ValueFilter( !=, 'binaryprefix:hello_value2' )" }
would return:
ROW COLUMN+CELL
ROW1 column=<column>, timestamp=<timestamp>, value=hello_value
ROW3 column=<column>, timestamp=<timestamp>, value=hello_value3
Here is an example of a text search for the value BIGBLUE in table t1, column d:a_content. A scan of the table will show all the available values:
scan 't1'
...
column=d:a_content, timestamp=1404399246216, value=BIGBLUE
...
To search just for the value BIGBLUE with a limit of 1, try the command below:
scan 't1',{ COLUMNS => 'd:a_content', LIMIT => 1, FILTER => "ValueFilter( =, 'regexstring:BIGBLUE' )" }
COLUMN+CELL
column=d:a_content, timestamp=1404399246216, value=BIGBLUE
Obviously removing the limit will show all occurrences in that table/cf.
To scan a table in HBase on the basis of any column value, SingleColumnValueFilter can be used as:
scan 'tablename' ,
{
FILTER => "SingleColumnValueFilter('column_family','col_name',>, 'binary:1')"
}
From the HBase shell I think it is not possible, because this is essentially a query to find specific data. As we all know, HBase is NoSQL, so when you want to apply a query in a case like yours, I think you should use Hive or Pig; Hive is the better approach here, because with Pig you need to mess with scripts.
Anyway, you can get good guidance about Hive from HIVE integration with HBase and from here.
If your only purpose is to view data, not to fetch it from code (of any client), then you can use HBase Explorer, or a new and very good product that is still in beta, "HBase Manager". You can get it from HBase Manager.
It's simple and, more importantly, it helps you insert and delete data and apply filters on column qualifiers from a UI, like other DB clients. Give it a try.
I hope it will be helpful for you :)
Slightly different question, but if you want to query a specific column which is not present in all rows, DependentColumnFilter is your best friend:
import org.apache.hadoop.hbase.filter.DependentColumnFilter
scan 'orgtable2', {FILTER => "DependentColumnFilter('cf1','lan',false,=,'binary:fre')"}
The previous scan will return all columns for the rows in which the lan column is present and for which its associated value is equal to fre. The third argument is dropDependentColumn; it would prevent the lan column itself from being displayed in the results if set to true.