Transform HBase Scan to RowFilter - scala

I'm using scio from Spotify for my Dataflow jobs.
In the latest scio version, the new Bigtable Java API is used (com.google.bigtable.v2).
The scio Bigtable entry point now requires a "RowFilter" for filtering instead of an HBase "Scan". Is there a simple way to transform a "Scan" into a "RowFilter"? I looked for adapters in the source code but I'm not sure how to use them.
I can't find documentation that makes it easy to migrate from the HBase API to the "new" API.
A simple scan I use in my code that I need to transform:
val scan = new Scan()
scan.setRowPrefixFilter("helloworld".getBytes)
scan.addColumn("family".getBytes, "qualifier".getBytes)
scan.setMaxVersions()

In theory, you can add the bigtable-hbase dependency to the project and call com.google.cloud.bigtable.hbase.adapters.Adapters.SCAN_ADAPTER.adapt(scan) to convert the Scan to a RowFilter, or more specifically a ReadRowsRequest which contains a RowFilter. (The protobuf definitions of those objects contain the fields and extensive comments.)
That said, the bigtable-hbase dependency adds quite a few transitive dependencies. I would use the bigtable-hbase SCAN_ADAPTER in a standalone project, and then print the RowFilter to see how it's constructed.
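A rough standalone sketch of that idea, assuming the adapt() call as described above (the exact signature varies between bigtable-hbase versions and may take an extra ReadHooks argument, so adjust it to the version you actually pull in):

import com.google.cloud.bigtable.hbase.adapters.Adapters
import org.apache.hadoop.hbase.client.Scan

object PrintAdaptedScan {
  def main(args: Array[String]): Unit = {
    val scan = new Scan()
    scan.setRowPrefixFilter("helloworld".getBytes)
    scan.addColumn("family".getBytes, "qualifier".getBytes)
    scan.setMaxVersions()

    // Adapt and print; the filter field of the printed request is the RowFilter
    // you would hand to the scio/BigtableIO entry point.
    val request = Adapters.SCAN_ADAPTER.adapt(scan)
    println(request)
  }
}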
In the specific case that you mention, the RowFilter is quite simple, but there may be additional complications. You have three parts to your scan, so I'll give a breakdown of how to achieve them:
scan.setRowPrefixFilter("helloworld".getBytes). This translates to a start key and end key on BigtableIO. "helloworld" is the start key, and you can calculate the end key with RowKeyUtil. calculateTheClosestNextRowKeyForPrefix. The default BigtableIO does not expose set start key and set end key, so the scio version will have to change to make those setters public.
scan.addColumn("family".getBytes, "qualifier".getBytes) translates to two RowFilters added to a RowFilter with a Chain (mostly analogous to an AND). The first RowFilter has familyNameRegexFilter set, and the second has columnQualifierRegexFilter set.
scan.setMaxVersions() converts to a RowFilter with cellsPerColumnLimitFilter set. It would need to be added to the chain from #2 (see the sketch below). Warning: if you use a timestampRangeFilter or a value filter in a RowFilter to limit the range of the columns, make sure to put the cellsPerColumnLimitFilter at the end of the chain.
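For illustration only (not taken from scio or bigtable-hbase), here is how that chain could be assembled by hand with the com.google.bigtable.v2 protobuf builders; the setter names follow the v2 RowFilter proto, so double-check them against the proto definition of the client version you depend on:

import com.google.bigtable.v2.RowFilter
import com.google.protobuf.ByteString

// Point 2: family and qualifier filters chained together (AND semantics).
val familyFilter = RowFilter.newBuilder()
  .setFamilyNameRegexFilter("family")
  .build()
val qualifierFilter = RowFilter.newBuilder()
  .setColumnQualifierRegexFilter(ByteString.copyFromUtf8("qualifier"))
  .build()

// Point 3: setMaxVersions() with no argument asks for all versions, so the
// per-column cell limit is effectively unbounded here; setMaxVersions(1) would
// become setCellsPerColumnLimitFilter(1). Keep this filter at the end of the
// chain, as warned above.
val versionsFilter = RowFilter.newBuilder()
  .setCellsPerColumnLimitFilter(Int.MaxValue)
  .build()

val rowFilter = RowFilter.newBuilder()
  .setChain(
    RowFilter.Chain.newBuilder()
      .addFilters(familyFilter)
      .addFilters(qualifierFilter)
      .addFilters(versionsFilter))
  .build()

// Point 1 (the row prefix) is not part of the RowFilter: it becomes a start key
// ("helloworld") and an end key (the closest next row key for that prefix) on
// the ReadRowsRequest / BigtableIO source.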

Related

Deploy Knowledge Studio dictionary pre-annotator to Natural Language Understanding

I'm getting started with Knowledge Studio and Natural Language Understanding.
I'm able to deploy a machine-learning model to Natural Language Understanding and use the API to query it.
I would like to know if there's a way to deploy only the pre-annotator.
I read from Knowledge Studio's documentation that
You can deploy or export a machine-learning annotator. A dictionary pre-annotator can only be used to pre-annotate documents within Watson Knowledge Studio.
Is there a workaround to create a model that simply does the job of the pre-annotator, i.e. uses dictionaries to find entities instead of the machine-learning model?
You may need to explain better what you need.
WKS allows you to pre-annotate documents with dictionaries you upload. Once you have created an ML model, you can alternatively use that to annotate your training documents and then manually correct them. As you continue, the amount of manual work will decrease with each model iteration.
The assumption is that you are creating a model with a reasonable number of examples. In your model results, you will want the mentions/relations to be outside, or close to outside, the gray area of the report.
The other interpretation I took of your request is that you want to create a dictionary-based model only. This is possible using the "Rule-Based Model" functionality. You have to create the parsing rules, but you just map what you want to find to the dictionary/rule.
Using this in production, though, is still limited. You should get a warning when you deploy these kinds of models.
It's slightly better than just a keyword search, as you can map items to parts of speech.
One last point: the purpose of WKS is to create a machine-learning model which will discover new terms you haven't seen before. With the rule-based engine it can only find what you explicitly tell it to find.
If all you want is just dictionary entries, then you can create a very simple string comparison solution, but you lose the linguistic features.

How does the built-in Apache Hive hash function work and where can I find that documentation?

I'm working with Apache Hive and need to be certain of how the built-in hash function works. I found this page that lists hash under the Misc. Functions section. It says that hash has been available "As of Hive 0.4".
I would just like to see some documentation on what it's doing exactly. Is it deterministic? Will it always produce the same output given the same input? How many collisions should I expect?
A hash function is deterministic, by definition, cf. https://en.wikipedia.org/wiki/Hash_function#Determinism
So if the implementation of hash() was not deterministic, then it would be a bug, and someone would have noticed!
Caveat: that implementation is subject to change (and bug fixes) hence determinism stands only for a given version of Hive.
Hive is Open Source. Documentation is not bad by Apache standards, but still incomplete. Just inspect the source code => https://github.com/apache/hive
For Hive 2.1 for example:
the hash() function (a UDF in Hive jargon) is defined here
it just calls ObjectInspectorUtils.getBucketHashCode(), which calls ObjectInspectorUtils.hashCode() on each argument and then merges each hash into a global "bucket" hash, as defined here
a comment shows that the (crude) hashing method implemented by Hive is derived from String.hashCode() (sketched below)
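As an illustration of that scheme (a sketch, not the actual Hive source): the per-argument hashes are folded into one "bucket" hash with the same 31-multiplier recurrence that java.lang.String.hashCode() uses.

// Sketch of the merge step described above: hash each argument on its own,
// then fold the hashes together with acc = 31 * acc + h.
def bucketHashCode(argumentHashes: Seq[Int]): Int =
  argumentHashes.foldLeft(0)((acc, h) => 31 * acc + h)

// Conceptually, hash(a, b, c) becomes
// 31 * (31 * (31 * 0 + hash(a)) + hash(b)) + hash(c)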
For alternative hashing functions in Hive, see Calculate hash without using existing hash function in Hive, but the answer basically points to the same documentation page that you already found.

How to get column name and data type returned by a custom query in postgres?

How to get the column names and data types returned by a custom query in Postgres? We have built-in functions for tables/views but not for custom queries. To clarify: I need a Postgres function that takes an SQL string as a parameter and returns the column names and their data types.
I don't think there's any built-in SQL function which does this for you.
If you want to do this purely at the SQL level, the simplest and cheapest way is probably to CREATE TEMP VIEW <name> AS <your_query>, dig the column definitions out of the catalog tables, and drop the view when you're done. However, this can have non-trivial overhead depending on how often you do it (as it needs to write view definitions to the catalogs), can't be run in a read-only transaction, and can't be done on a standby server.
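A sketch of that approach over JDBC from Scala (the connection details, probe view name and query are placeholders): the view exists only so the query's output columns show up in the catalogs, which are then read via information_schema before the view is dropped.

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass")
val st = conn.createStatement()
st.execute("CREATE TEMP VIEW probe_view AS SELECT id, created_at FROM some_table")

// The temp view's columns are now visible in the catalogs for this session.
val rs = st.executeQuery(
  """SELECT column_name, data_type
    |FROM information_schema.columns
    |WHERE table_name = 'probe_view'
    |ORDER BY ordinal_position""".stripMargin)
while (rs.next()) println(s"${rs.getString("column_name")} -> ${rs.getString("data_type")}")

st.execute("DROP VIEW probe_view")
conn.close()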
The ideal solution, if it fits your use case, is to build a prepared query on the client side and make use of the metadata returned by the server (in the form of a RowDescription message passed as part of the query protocol). Unfortunately, this depends very much on which client library you're using and how much of this information it chooses to expose. For example, libpq will give you access to everything, whereas the JDBC driver limits you to the public methods on its ResultSetMetaData object (though you could probably pull more information from its private fields via reflection, if you're determined enough).
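For the JDBC case specifically, a minimal sketch (connection details and the query are placeholders): preparing the statement is enough to get the server's description of the result columns, without running the query for its rows.

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass")
val ps = conn.prepareStatement("SELECT id, created_at, price * quantity AS total FROM orders")
val meta = ps.getMetaData // ResultSetMetaData, built from the server's RowDescription

for (i <- 1 to meta.getColumnCount)
  println(s"${meta.getColumnLabel(i)} -> ${meta.getColumnTypeName(i)}")

conn.close()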
If you want a read-only, low-overhead, client-independent solution, then you could also write a server-side C function to prepare and describe the query via SPI. Writing and building C functions comes with a bit of a learning curve, but you can find numerous examples on PGXN, or within Postgres' own contrib modules.

Handling arrays in SOAP webservice testing using FitNesse

Is there a way to dynamically create tables in wiki?
Use case: I'm trying to mimic something similar to SOAP Sonar in FitNesse.
SOAP Sonar:
1. Once we import the WSDL, SOAP Sonar generates inputs for the operations in the WSDL.
2. Choose an operation, enter input, and then execute the operation.
3. In the case of arrays, we can select the size of the array and enter values for the respective array elements.
FitNesse:
1. I'm able to achieve point 1 using the SoapUI jars.
2. Point 2 I'm able to achieve using the XmlHttpTest fixture.
I'm stuck on the 3rd point. Is there a way I can do this in FitNesse? (My idea is that from point 1 I can get a sample input for each operation, from which I will know that there are arrays/complex types present in the input XML, but how do we represent this in the wiki dynamically?)
Thanks in advance
What I've done in the past is use ListFixture (and MapFixture) to dynamically fill a list (and maps/hashes for each element's properties) and then use these as input values to XmlHttpTest's feature to create the body to be sent using a FreeMarker template (which allows iteration over a list, which I use to create elements in the array based on the list).
But this gets quite complex quickly. Is that level of flexibility truly required? I found that quite often hard coding the number of elements in arrays/lists in the wiki is simpler to do and makes the test far easier to understand/maintain.
In most cases I prefer to create a script (or scenario) with the right number of elements for the test case(s), along with the request, in the wiki page. The use of scenarios allows me to test with different values (but the same number of elements). Another element count gets its own script/scenario.
Being able to dynamically change the number of elements is only worthwhile if you need to test for many different counts, otherwise the added complexity of dynamically creating the body is just not worth it.

In salesforce.com can you have multivalued attributes?

I am developing a Novell Identity Manager driver for Salesforce.com, and am trying to understand the Salesforce.com platform better.
I have had really good success to date. I can read pretty much arbitrary object classes out of SFDC, and create eDirectory objects for them, and what not. This is all done and working nicely. (Publisher Channel). Once I got Query events mapped out, most everything started working in the Publisher Channel.
I am now working on sending events back to SFDC (Subscriber channel) when changes occur in eDirectory.
I am using the upsert() function in the SOAP API, and with Novell Identity Manager you basically build the SOAP doc and can see the results as you build it. (You can do it in XSLT, or you can use the various allowed tokens to build the document in DirXML Script. I am using DirXML Script, which has been working well so far.)
The upshot is that I can build the SOAP document and see it, to be sure I get it right. Which is usually different than the Java/C++ approach that the sample code usually provides. Much more visual this way.
There are several things about upsert() that I do not entirely understand. I know how to blank a value, should I get that sort of event. Inside the <urn:sObjects> node, add a node like this (assuming you have your namespaces declared already):
<urn1:fieldsToNull>FieldName</urn1:fieldsToNull>
I know how to add a value (AttrValue) to the attribute (FieldName), add a node like:
<FieldName>AttrValue</FieldName>
All this works and is pretty straight forward.
The question I have is: can a value in SFDC be multi-valued? In eDirectory, a change to a multi-valued attribute can happen in two ways:
All values can be removed, and the new set re-added.
A single value can be removed and sent as that sort of event (remove-value), or many values can be removed in one operation.
Looking at SFDC, I only ever see multi-picklist attributes, which seem to be stored in a single entry, ':' or ';' delimited. Is there another kind of multi-valued attribute managed differently in SFDC? And if so, how would one manipulate it via the SOAP API?
I still have to decide if I want to map those multi-picklists to a single string or to a multi-valued attribute of strings. The first way is easier, the second way is more useful... Hmmm... Choices...
Some references:
I have been using the page Sample SOAP messages to understand what the docs should look like.
Apex Explorer is a kicking tool for browsing the database and testing queries. Much like DBVisualizer does for JDBC connected databases. This would have been so much harder without it!
SoapUi is also required, and a lovely tool!
As far as I know there's no multi-value field other than multi-select picklists (and they map to a semicolon-separated string). Generally the platform encourages you to create a proper relationship with another (possibly new, custom) table if you need to associate multiple values with your data.
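As a small illustration of that semicolon convention (the custom field name here is made up): the selected picklist values travel as one string, joined with ';', inside a single node of the upsert() body.

// Hypothetical multi-select picklist field Interests__c; the values are
// sent as a single semicolon-delimited string.
val selected = Seq("Books", "Music", "Travel")
val picklistValue = selected.mkString(";") // "Books;Music;Travel"
// In the SOAP body this would appear as one node, e.g.
// <Interests__c>Books;Music;Travel</Interests__c>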
Only other "unusual" thing I can think of is how the OwnerId field on certain objects (Case, Lead, maybe something else) can be used to point to User or Queue record. Looks weird when you are used to foreign key relationships from traditional databases. But this is not identical with what you're asking as there will be only one value at a time.
Of course you might be surprised sometimes by the values you'll see in the database depending on the viewing user's locale (stuff like the System Administrator profile becoming Systeembeheerder in Dutch). But this will still be a single value, translated on the fly just before the query results are sent back to you.
When I had to perform SOAP integration with SFDC, I always used the WSDL files and most of the time was fine with the Java code generated out of them with Apache Axis. Hand-crafting the SOAP message yourself seems... wow, a bit hardcore. Are you sure you prefer visualisation of the XML over the creation of classes, exceptions and all the other stuff ready for use with one of several out-of-the-box integration methods? If they ever change the WSDL, I just need to regenerate the classes from it; whereas changes to your SOAP message creation library might be painful...