What hash function does Vertica use?

I'm looking for a way to assign devices to different groups for an A/B test.
To identify unique devices, we assign each one a unique string as a key - I have no control over this.
I thought about hashing: we're using a Vertica DB, and it has a built-in function for hashing. But because I don't know what algorithm the function uses, I can't reproduce it in the controller that assigns the devices to the A/B test groups.
I'm looking to apply the function to the unique device key.
I looked for the function in the Vertica documentation, but to no avail.
Help would be appreciated.

The HASH() function is proprietary; however, there are plans to make it open source in an upcoming release.
For segmentation, any SQL function can be used as long as it's immutable.
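If you need an assignment you can reproduce outside the database, one option is to skip HASH() entirely and derive the group from a documented hash such as MD5. A minimal sketch, assuming an MD5() string function is available in your Vertica version and using made-up table and column names:

-- Split devices roughly 50/50 by the last hex digit of MD5(device_key).
SELECT device_key,
       CASE WHEN RIGHT(MD5(device_key), 1) IN ('0','1','2','3','4','5','6','7')
            THEN 'A'
            ELSE 'B'
       END AS ab_group
FROM devices;

Because MD5 is a published algorithm, the controller can compute the same value on the same key and arrive at the same group, and the expression is immutable, so it could also be used for segmentation.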

Related

Deploy Knowledge Studio dictionary pre-annotator to Natural Language Understanding

I'm getting started with Knowledge Studio and Natural Language Understanding.
I'm able to deploy a machine-learning model to Natural Language Understanding and use the API to query it.
I would like to know if there's a way to deploy only the pre-annotator.
I read from Knowledge Studio's documentation that
You can deploy or export a machine-learning annotator. A dictionary pre-annotator can only be used to pre-annotate documents within Watson Knowledge Studio.
Does exist a workaround to create a model that simply does the job of the pre-annotator, i.e. use dictionaries to find entities instead of the machine-learning model?
You may need to explain better what you need.
WKS allows you to pre-annotate documents with dictionaries you upload. Once you have created an ML model, you can alternatively use that to annotate your training documents and then manually correct. As you continue, the amount of manual work will decrease with each model iteration.
The assumption is that you are creating a model with a reasonable number of examples. In your model results, you will want the mentions/relations to be outside, or close to outside, the gray area of the report.
The other interpretation of your request I took was you want to create a dictionary based model only. This is possible using the "Rule-Based Model" functionality. You would have to create the parsing rules but you just map what you want to find to the dictionary/rule.
Using this in production though is still limited. You should get a warning when you deploy these kinds of models.
It's slightly better than just a keyword search as you can map items to parts of speech.
One last point: the purpose of WKS is to create a machine-learning model that will discover new terms you haven't seen before. The rule-based engine can only find what you explicitly tell it to find.
If all you want is just dictionary entries, then you can create a very simple string comparison solution, but you lose the linguistic features.

PostgreSQL: how to define and use "global" constants

I am writing a few stored procs that process some batch upload data. Each input line can be flagged for a variety of application errors. I have nearly 100 different types of errors in all, and over a dozen different file load procedures.
In C/C++ the idiom for error codes is a bunch of #define or const declarations in a project-wide include (class) file, with the symbolic names used in application code. The compiler catches wayward spellings. Java/C# offer similar constructs. How does one get a similar effect in PL/pgSQL? I have toyed with setting these up in postgresql.conf, but is that a sound approach? It obviously will not work at compile time, and I don't want to grant write privileges on conf files to application developers. Further, it would require a reload of the conf for every application change, possibly a system stability issue. I am sure there are many other drawbacks.
In a similar vein, I also need plain "user-defined" types where I would like to fix the representation of certain application data types, such as "part_number" being varchar(20), "currency_code" being char(3), and so on. Again, in C/C++ one would use typedef or struct as the case may be. So I tried creating a TYPE in PostgreSQL for consistent usage across tables, views, and function headers. But with the UDTs I ran into a new set of issues: specifying primary keys, and CSV input specs where the value must now be given in parentheses. Is there a different way of dealing with such objectives in PostgreSQL?
I am new to PostgreSQL. We are using 9.2 on Linux. I am tempted to use a pre-processor but then it will not be compatible with any design tool I have seen.
For your first question you could potentially use an ENUM type.
CREATE TYPE flag AS ENUM ('ok', 'bad', 'superbad');
That would at least allow sanity checking of your spellings for each of the flag states.
For your second question (and please post separate questions in the future, since it keeps things on topic), you might want to look at DOMAINs.
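A minimal sketch of what a DOMAIN-based approach could look like, with made-up names and hypothetical format checks:

-- Domains behave as ordinary scalar columns, so they work in primary keys
-- and plain CSV COPY input, unlike composite TYPEs.
CREATE DOMAIN part_number AS varchar(20)
    CHECK (VALUE ~ '^[A-Za-z0-9-]+$');   -- hypothetical format rule

CREATE DOMAIN currency_code AS char(3)
    CHECK (VALUE ~ '^[A-Z]{3}$');

CREATE TABLE order_line (
    line_id  serial PRIMARY KEY,
    part     part_number NOT NULL,
    currency currency_code NOT NULL DEFAULT 'USD'
);

You define the representation once and reuse it across tables, views, and function signatures, and the CHECK constraints travel with it.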

How to organize parameters for a postgres application

I am working on a Postgres application. At the moment I am not sure how best to manage constant application parameters. For example, I want to define a threshold variable which I am going to use in several functions.
One idea is to make a "config" table and query the variable every time I need it, and as a shortcut to wrap the SQL query in another function, i.e.: t := get_Config('Threshold');
But I am not really happy with this. What is the best way to handle custom application configuration parameters? They should be easy to maintain, and I want to avoid querying for constants every time. In Oracle, for example, you can compile constants into package specs. Are there better ways to deal with such configuration parameters?
I have organized global parameters just the way you describe it for some years now. It seems a bit awkward but it works just fine.
I have got quite a number of those, so I added an integer plus index to my config table and use get_config($my_id) (plus comment) - which is slightly faster but less readable.
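A minimal sketch of that config-table pattern, with illustrative names (the asker's get_Config corresponds to get_config here):

CREATE TABLE config (
    name  text PRIMARY KEY,
    value numeric NOT NULL
);

INSERT INTO config (name, value) VALUES ('Threshold', 42);

-- STABLE tells the planner the result does not change within a statement.
CREATE OR REPLACE FUNCTION get_config(cfg_name text) RETURNS numeric AS $$
    SELECT value FROM config WHERE name = $1;
$$ LANGUAGE sql STABLE;

-- Inside another function:  t := get_config('Threshold');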
OR you can use custom_variable_classes. See:
How to declare variable in PostgreSQL

a simple/practical example of fuzzy c-means algorithm

I am writing my master thesis on the subject of dynamic keystroke authentication. To support ongoing research, I am writing code to test out different methods of feature extraction and feature matching.
My current simple approach just checks if the reference password keycodes matches the currently typed in keycodes and also checks if the keypress times (dwell) and the key-to-key times (flight) are the same as reference times +/- 100ms (tolerance). This is of course very limited and I want to extend it with some sort of fuzzy c-means pattern matching.
For each key the features look like: keycode, dwelltime, flighttime (first flighttime is always 0).
Obviously the keycodes can be taken out of the fuzzy algorithm because they have to be exactly the same.
In this context, what would a practical implementation of fuzzy c-means look like?
Generally, you would do the following:
1. Determine how many clusters you want (2? "Authentic" and "Fake"?)
2. Determine what elements you want to cluster (individual keystrokes? login attempts?)
3. Determine what your feature vectors will look like (dwell time, flight time?)
4. Determine what distance metric you will be using (how will you measure the distance of each sample from each cluster?)
5. Create exemplar training data for each cluster type (what does an authentic login look like?)
6. Run the FCM algorithm on the training data to generate the clusters
7. To create the membership vector for any given login attempt sample, run it through the FCM algorithm using the clusters you found in step 6
8. Use the resulting membership vector to determine (based on some threshold criteria) whether the login attempt is authentic
I'm not an expert, but this seems like an odd approach to determining whether a login attempt is authentic or not. I've seen FCM used for pattern recognition (e.g. which facial expression am I making?), which makes sense because you're dealing with several categories (e.g. happy, sad, angry, etc.) with defining characteristics. In your case, you really only have one category (authentic) with defining characteristics. Non-authentic keystrokes are simply "not like" authentic keystrokes, so they won't cluster.
Perhaps I am missing something?
I don't think you really want to do clustering here. You might want to do some proper fuzzy matching though instead of just allowing some delta on each value.
For clustering, you need to have many data points. Additionally, you'd need to know the proper number of means you need.
But what are these multiple objects meant to be? You have one data point for every keycode. You don't want to have the user type the password 100 times to see if he can do it consistently. And even then, what do you expect the clusters to be? You already know which keycode comes at which position; you don't want to find out what keycodes the user uses for his password...
Sorry, I really don't see any clustering here. The term "fuzzy" seems to have misled you to this clustering algorithm. Try "fuzzy logic" instead.

Hashes vs Numeric IDs

When creating a web application that somehow displays a unique identifier for a recurring entity (videos on YouTube, or book sections on a site like mine), would it be better to use a uniform-length identifier like a hash, or the unique key of the item in the database (1, 2, 3, etc.)?
Besides revealing a little (and, I think, immaterial) information about the internals of your app, why would using a hash be better than just using the unique ID?
In short: which is better to use as a publicly displayed unique identifier - a hash value, or a unique key from the database?
Edit: I'm opening up this question again because Dmitriy brought up the good point of not tying the naming to a DB-specific property. Will this sort of tie-down prevent me from optimizing/normalizing the database in the future?
The platform uses PHP/Python with ISAM/MySQL.
Unless you're trying to hide the state of your internal object ID counter, hashes are needlessly slow (to generate and to compare), needlessly long, needlessly ugly, and needlessly capable of colliding. GUIDs are also long and ugly, making them just as unsuitable for human consumption as hashes are.
For inventory-like things, just use a sequential (or sharded) counter instead. If you migrate to a different database, you will just have to initialize the new counter to a value at least as large as your largest existing record ID. Pretty much every database server gives you a way to do this.
If you are trying to hide the state of your counter, perhaps because you're counting users and don't want competitors to know how many you have, I suggest avoiding the display of your internal IDs. If you insist on displaying them and don't want the drawbacks of a hash, you might consider using a maximal-period linear feedback shift register to generate IDs.
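As an illustration of that last idea, here is a small sketch of a maximal-period 16-bit Galois LFSR written as a PostgreSQL function. This is purely illustrative (the asker's stack is PHP/Python with MySQL; the same bit operations port directly), and a real deployment would use a wider register so IDs don't repeat after 65535 values:

-- Feed the previous public ID in, get the next one out; every nonzero
-- 16-bit value is visited exactly once before the sequence repeats.
CREATE OR REPLACE FUNCTION lfsr_next(state integer) RETURNS integer AS $$
DECLARE
    lsb integer := state & 1;
    nxt integer := state >> 1;
BEGIN
    IF lsb = 1 THEN
        nxt := nxt # 46080;   -- 0xB400: taps 16,14,13,11 give maximal period
    END IF;
    RETURN nxt;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

The published IDs then hop around the 16-bit space instead of counting up, while the internal sequential key stays private.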
I typically use hashes if I don't want the user to be able to guess the next ID in the series. But for your book sections, I'd stick with numerical IDs.
Using hashes is preferable if, for example, you ever need to rebuild your database and the ordering changes. The ordinal numbers will move around, but the hashes will stay the same.
Not relying on the order you put things into a box, but on properties of the things themselves, just seems... safer.
But watch out for collisions, obviously.
With hashes you:
Are free to merge the database with a similar one (or a backup), if necessary
Are not doing something that could help guessing attacks even a bit
Are not disclosing more private information about the user than necessary, e.g. if somebody sees user number 2 log in to your current database, they learn that he is an oldie.
Are (provided that you use a long hash or a GUID) greatly helping yourself in case you're bought by YouTube and they decide to integrate your databases.
Are helping yourself in case a search engine appears that indexes by GUID.
Please let us know if the last 6 months brought you some clarity on this question...
Hashes aren't guaranteed to be unique, nor, I believe, consistent.
Will your users have to remember/use the value? Or are you looking at it from a security POV?
From a security perspective, it shouldn't matter - since you shouldn't just be relying on people not guessing a different but valid ID of something they shouldn't see in order to keep them out.
Yeah, I don't think you're looking for a hash - you're more likely looking for a GUID. If you're on the .NET platform, try System.Guid.
However, the most important reason not to use a GUID is performance. Doing database joins and lookups on (long) strings is very suboptimal. Numbers are fast. So, unless you really need it, don't do it.
Hashes have the advantage that you can check whether they are valid BEFORE querying your database to see whether they exist. This can help you fend off attacks with random hashes, as you don't need to burden your database with fake lookups.
Therefore, if your hash has some kind of well-defined format, for example with a checksum at the end, you can check whether it's correct without needing to go to the database.
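A small sketch of that idea in PostgreSQL, using an entirely hypothetical token scheme where the public token is a body followed by the first 4 hex characters of md5(body) as a checksum:

-- Reject obviously forged tokens before ever touching a table.
CREATE OR REPLACE FUNCTION token_looks_valid(token text) RETURNS boolean AS $$
    SELECT length($1) > 4
       AND right($1, 4) = left(md5(left($1, length($1) - 4)), 4);
$$ LANGUAGE sql IMMUTABLE;

Random strings fail the checksum test cheaply, so only plausible tokens ever reach the real lookup.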