How to add an index in the database using the ELKI Java API for a custom POJO with String-type fields

I am using DBSCAN to cluster some categorical data using a POJO. My class looks like this:
public class Dimension {
private String app;
private String node;
private String cluster;
.............
All my fields are String instead of Integer or Float because they hold discrete/categorical values. The rest of my code is as follows:
final SimpleTypeInformation<Dimension> dimensionTypeInformation = new SimpleTypeInformation<>(Dimension.class);
PrimitiveDistanceFunction<Dimension> dimensionPrimitiveDistanceFunction = new PrimitiveDistanceFunction<Dimension>() {
    public double distance(Dimension d1, Dimension d2) {
        return simpleMatchingCoefficient(d1, d2);
    }

    public SimpleTypeInformation<? super Dimension> getInputTypeRestriction() {
        return dimensionTypeInformation;
    }

    public boolean isSymmetric() {
        return true;
    }

    public boolean isMetric() {
        return true;
    }

    public <T extends Dimension> DistanceQuery<T> instantiate(Relation<T> relation) {
        return new PrimitiveDistanceQuery<>(relation, this);
    }
};
DatabaseConnection dbc = new DimensionDatabaseConnection(dimensionList);
Database db = new StaticArrayDatabase(dbc, null);
db.initialize();
DBSCAN<Dimension> dbscan = new DBSCAN<>(dimensionPrimitiveDistanceFunction, 0.6, 20);
Result result = dbscan.run(db);
Now, as expected, this code works fine for a small dataset but gets very slow as my dataset grows. So I want to add an index to speed up the process. But all the indexes I could think of require me to implement NumberVector, and my class has only Strings, not numbers.
What index can I use in this case? Can I use my distance function, double simpleMatchingCoefficient(Dimension d1, Dimension d2), to create an IndexFactory?
Thanks in advance.

There are (at least) three broad families of indexes:
Coordinate-based indexes, such as the k-d-tree and the R-tree. These work well on dense, continuous variables.
Metric indexes, which require the distance function to satisfy the triangle inequality. These can work on any kind of data, but may still need a fairly smooth distribution of distance values (e.g., they will not help with the discrete metric, which is 0 if x=y and 1 otherwise).
Inverted lookup indexes. These are mostly used for text search, and exploit the fact that for each attribute only a small subset of the data is relevant. They work well for high-cardinality discrete attributes.
In your case, I'd consider an inverted index. If you have a lot of attributes, a metric index may work, but I doubt that holds, because you use POJOs with strings to store your data.
And of course, profile your code and check whether you can improve the implementation of your distance function! E.g., string interning may help: it can reduce string matching to a reference-equality test rather than comparing each character.
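For example, a minimal sketch of what interning buys you (the getter/setter names on the Dimension POJO are assumptions, and the fields are assumed non-null):
// Sketch only: intern each string once while loading the data
// (getter/setter names are assumptions; adapt to your POJO).
static void internFields(Dimension d) {
    d.setApp(d.getApp().intern());           // String.intern() returns the canonical instance
    d.setNode(d.getNode().intern());
    d.setCluster(d.getCluster().intern());
}

// Afterwards, matching inside the distance function can use reference equality:
static int countMatches(Dimension d1, Dimension d2) {
    int m = 0;
    if (d1.getApp() == d2.getApp()) m++;      // safe only because both sides were interned
    if (d1.getNode() == d2.getNode()) m++;
    if (d1.getCluster() == d2.getCluster()) m++;
    return m;
}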

First of all, note that the SMC is usually defined as a similarity function, not a distance function, but 1-SMC is the usual transformation. Just don't confuse these two.
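For illustration, a minimal sketch of the two, restricted to the three attributes shown in the question (getter names are assumptions):
// Sketch: SMC as a similarity over the three shown attributes, and 1-SMC as the distance.
static double simpleMatchingCoefficient(Dimension d1, Dimension d2) {
    int attributes = 3, matches = 0;
    if (java.util.Objects.equals(d1.getApp(), d2.getApp())) matches++;
    if (java.util.Objects.equals(d1.getNode(), d2.getNode())) matches++;
    if (java.util.Objects.equals(d1.getCluster(), d2.getCluster())) matches++;
    return (double) matches / attributes;            // similarity in [0, 1]
}

static double smcDistance(Dimension d1, Dimension d2) {
    return 1.0 - simpleMatchingCoefficient(d1, d2);  // distance = 1 - similarity
}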
For the simple matching coefficient, you probably will want to build your own inverted index for your particular POJO data type. Because of your POJO design (Dimension sounds like a very bad name, btw.), this cannot easily be implemented in a generic, reusable way. That would require expensive introspection, and would still require customization: should string matches be case sensitive? Do they need trimming? Should they be tokenized?
Your inverted index will then likely contain a series of maps specific to your POJO:
Map<String, DBIDs> by_app;
Map<String, DBIDs> by_node;
Map<String, DBIDs> by_cluster;
...
and for each attribute, you get the matching DBIDs, and count how often they appear. The most often returned DBIDs have the highest SMC (and hence lowest distance).
At some point, you can stop counting candidates that can no longer make it into the result set. Just look up how such searches work in an information retrieval book.
Such an index is beneficial if the average number of matches for each attribute is low. You can further speed this up with bitmap index compression and similar techniques, but that is likely not necessary (at some point, it can be more attractive to build upon existing tools such as Apache Lucene to handle the search).
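To make this concrete, here is a minimal sketch of such a hand-rolled index, using plain Java collections and integer row ids instead of ELKI's DBIDs (the getter names on the POJO are assumptions):
import java.util.*;

// Sketch only: an inverted index over the POJO's attributes. In ELKI you would store
// DBIDs rather than list positions; here plain integers stand in for them.
class DimensionInvertedIndex {
    private final Map<String, List<Integer>> byApp = new HashMap<>();
    private final Map<String, List<Integer>> byNode = new HashMap<>();
    private final Map<String, List<Integer>> byCluster = new HashMap<>();

    void build(List<Dimension> data) {
        for (int i = 0; i < data.size(); i++) {
            Dimension d = data.get(i);
            byApp.computeIfAbsent(d.getApp(), k -> new ArrayList<>()).add(i);
            byNode.computeIfAbsent(d.getNode(), k -> new ArrayList<>()).add(i);
            byCluster.computeIfAbsent(d.getCluster(), k -> new ArrayList<>()).add(i);
        }
    }

    // Count, per candidate id, how many attributes match the query. The count is
    // proportional to the SMC, so the highest counts are the nearest neighbors.
    Map<Integer, Integer> matchCounts(Dimension query) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int id : byApp.getOrDefault(query.getApp(), Collections.emptyList())) counts.merge(id, 1, Integer::sum);
        for (int id : byNode.getOrDefault(query.getNode(), Collections.emptyList())) counts.merge(id, 1, Integer::sum);
        for (int id : byCluster.getOrDefault(query.getCluster(), Collections.emptyList())) counts.merge(id, 1, Integer::sum);
        return counts;
    }
}
Objects that never show up in matchCounts share no attribute with the query, i.e. they have the maximum distance, so they never need to be touched at all.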

Related

Hibernate / Postgres - Changing the default column type mappings

I am starting a new project using Postgres and Hibernate (5.5.7) as the ORM;
however, I have recently read the following wiki page:
https://wiki.postgresql.org/wiki/Don%27t_Do_This
Based on that I would like to change some of the default column mappings, specifically:
Use timestamptz instead of timestamp
Use varchar instead of varchar(255) when the column length is unspecified.
Increase the scale of numeric types so that the default is numeric(19,5) - The app uses BigDecimals to store currency values.
Reading through the Hibernate code, it appears that the length, precision and scale are hardcoded in the class org.hibernate.mapping.Column, specifically:
public static final int DEFAULT_LENGTH = 255;
public static final int DEFAULT_PRECISION = 19;
public static final int DEFAULT_SCALE = 2;
For the 2nd and 3rd cases (varchar and numeric) I don't see any easy way to change the defaults (length, precision and scale), so the best option I have been able to come up with is to create a new custom "Dialect" extending from PostgreSQL95Dialect, whose constructor redefines the mappings as follows:
registerColumnType(Types.TIMESTAMP, "timestamptz");
registerColumnType(Types.VARCHAR, "varchar");
registerColumnType(Types.NUMERIC, "numeric($p, 5)");
Using this overridden dialect I can generate a schema which includes the changes I am trying to achieve.
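Put together, the dialect I am describing looks roughly like this (the class name is just illustrative):
import java.sql.Types;
import org.hibernate.dialect.PostgreSQL95Dialect;

// Sketch of the overridden dialect described above; it is registered via the
// hibernate.dialect configuration property.
public class MyPostgresDialect extends PostgreSQL95Dialect {
    public MyPostgresDialect() {
        super();
        registerColumnType(Types.TIMESTAMP, "timestamptz");
        registerColumnType(Types.VARCHAR, "varchar");
        registerColumnType(Types.NUMERIC, "numeric($p, 5)");
    }
}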
I am happy with the timestamp change since I don't see any use case where I would need to store a timestamp without the time zone (I typically use Instants (UTC) in my model).
I can't foresee a problem with the varchar change since all validation occurs when data is sent into the system (REST service).
However, I have lost the ability to use the @Column scale attribute; I have to use an explicit columnDefinition if I want to specify a different scale.
This still leaves me with the following questions:
Is there a better solution than I have described?
Can you foresee any problems using the custom dialect that I haven't listed here?
Would you recommend using the custom dialect for schema generation ONLY or should it be used for both schema generation and when the application is running (why)?
Well, if you really must do this, it's fine, but I wouldn't recommend it. The default values usually come from the @Column annotation, so I would recommend you simply set proper values everywhere in your model. IMO the only OK-ish change is the switch to timestamptz, but note how the type actually works: it does not store the time zone, it normalizes to UTC instead.
The next problem IMO is that you switch to an unbounded varchar. It might be "discouraged" to use a length limit, but believe me, it will save you a lot of trouble in the long run. There was just recently a blog post about switching back to length-limited varchar due to users saving overly large texts. So even if you believe that you validated the lengths, you probably won't get this right for all future uses. Increasing the limit is easy and doesn't require any expensive locks, so if you already define some limits on your REST API, it would IMO be stupid not to model this in the database as well. Or are you omitting foreign key constraints just because your REST API also validates that? No, you are not, and every DBA will tell you that you should never omit such constraints. These constraints are valuable.
As for numerics, just bite the sour apple and apply the values on all @Column annotations. You can use some kind of constant-holder class to avoid inconsistencies:
public class MyConstants {
    public static final int VARCHAR_SMALL = 1000;
    public static final int VARCHAR_MEDIUM = 5000;
    public static final int VARCHAR_LARGE = 10000;
    public static final int PRICE_PRECISION = 19;
    public static final int PRICE_SCALE = 5;
}
and use it like @Column(precision = MyConstants.PRICE_PRECISION, scale = MyConstants.PRICE_SCALE).
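For example (a minimal sketch; the entity and field names are illustrative, and the javax.persistence imports assume the non-Jakarta Hibernate artifact):
import java.math.BigDecimal;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;

// Illustrative entity using the shared constants so the column definitions stay consistent.
@Entity
public class Invoice {
    @Id
    private Long id;

    @Column(length = MyConstants.VARCHAR_SMALL)
    private String customerName;

    @Column(precision = MyConstants.PRICE_PRECISION, scale = MyConstants.PRICE_SCALE)
    private BigDecimal total;
}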

How to read probability distributions from a database and save them in collections

I'm migrating from Arena to AnyLogic and have a question about distributions. I need to use different distributions based on some agent parameters. I have seen the suggestion here, but the number of distributions is too large and I prefer not to hard-code them:
How to associate a probability distribution to Agents - Anylogic
In Arena it was possible to create expression arrays, link them to a database (e.g. Excel), and use those parameters to obtain distributions from the expression arrays. I tried to use collections in AnyLogic to do the same but could not convert the strings (e.g. "uniform(100,120)") to distributions.
Is there any way in AnyLogic to store distributions in collections?
Is there any way in AnyLogic to read distributions from a database?
Thanks
Everything you say is possible; there are at least four ways of doing it: creating agents with the distribution, creating a collection of distribution classes, executing the string expression you mentioned, or reading and calculating directly from the database. In this particular case, I like the option with classes, and the expression one will be the simplest for you, but I may write down the other options later too:
Using ExecuteExpression
If you managed to create a collection with strings that represent your distributions, you can do this:
executeExpression("uniform(100,200)");
or, in your case, with a collection (whatever you choose i to be):
executeExpression(collection.get(i));
But this is ugly, so I will show the more involved (and cooler) way.
Using Databases
The first thing, obviously, is to create a database with your information. I will assume, since that seems to be your case, that you want a collection of distributions that are all uniform. So the database table (called distributions below) will have the columns cum_probability, minimum and maximum, where cum_probability is the cumulative probability for that distribution to be chosen and minimum and maximum represent the parameters of your uniform(minimum, maximum) distribution.
Collection of Distributions using a Class
Now we will create a class with all that information:
public class Distribution implements Serializable {
    public double probability;
    public double min;
    public double max;

    /**
     * Default constructor
     */
    public Distribution(double probability, double min, double max) {
        this.probability = probability;
        this.min = min;
        this.max = max;
    }

    public double getDistributionResult() {
        return uniform(this.min, this.max, new Random());
    }
}
You will also create a collection (called distributionsArray below), and you will initialize it in Main, on startup:
List<Tuple> theList = selectFrom(distributions).list();
for (Tuple t : theList) {
    distributionsArray.add(
        new Distribution(t.get(distributions.cum_probability),
                         t.get(distributions.minimum),
                         t.get(distributions.maximum))
    );
}
OK, now you have a collection of distributions. Great. The only thing remaining is to create a function that picks a distribution according to the cumulative probabilities and returns its result:
double rand = uniform();
List<Distribution> filtered = filter(distributionsArray, d -> d.probability >= rand);
return top(filtered, d -> -d.probability).getDistributionResult();
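If you prefer not to rely on AnyLogic's filter/top helpers, the same selection can be written as a plain loop inside the function (a sketch, assuming distributionsArray is sorted by cumulative probability in ascending order):
// Plain-Java sketch of the same cumulative-probability selection,
// assuming distributionsArray is sorted ascending by cum_probability.
double rand = uniform();                          // uniform(0,1) from the AnyLogic engine
for (Distribution d : distributionsArray) {
    if (d.probability >= rand) {                  // first bucket whose cumulative probability covers rand
        return d.getDistributionResult();
    }
}
// Fallback for rounding issues: use the last entry.
return distributionsArray.get(distributionsArray.size() - 1).getDistributionResult();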

Generate C* bucket hash from multipart primary key

I will have C* tables that will be very wide. To prevent them from becoming too wide, I have encountered a strategy that could suit me well. It was presented in this video:
Bucket Your Partitions Wisely
The good thing about this strategy is that there is no need for a "look-up table" (it is fast); the bad part is that one needs to know the maximum number of buckets and may eventually end up with no more buckets to use (not scalable). I know my max bucket size, so I will try this.
By calculating a hash from the table's primary key parts, that hash can be used as the bucket part together with the rest of the primary key.
I have come up with the following method to be sure (I think?) that the hash will always be the same for a specific primary key.
Using Guava Hashing:
public static String bucket(List<String> primKeyParts, int maxBuckets) {
    StringBuilder combinedHashString = new StringBuilder();
    primKeyParts.forEach(part -> {
        combinedHashString.append(
            String.valueOf(
                Hashing.consistentHash(
                    Hashing.sha512().hashBytes(part.getBytes()), maxBuckets)
            )
        );
    });
    return combinedHashString.toString();
}
The reason I use SHA-512 is to be able to handle strings of up to 256 characters (512 bits); otherwise the result will not always be the same (or so it seems, according to my tests).
I am far from being a hashing guru, hence I'm asking the following questions.
Requirement: between different JVM executions on different nodes/machines, the result should always be the same for a given Cassandra primary key.
Can I rely on the mentioned method to do the job?
Is there a better solution for hashing large strings so they always produce the same result for a given string?
Do I always need to hash from strings, or could there be a better way of doing this for a C* primary key that always produces the same result?
Please, I don't want to discuss data modeling for a specific table, I just want to have a bucket strategy.
EDIT:
I elaborated further and came up with this, so the string length can be arbitrary. What do you say about this one?
public static int murmur3_128_bucket(int maxBuckets, String... primKeyParts) {
    List<HashCode> hashCodes = new ArrayList<>();
    for (String part : primKeyParts) {
        hashCodes.add(Hashing.murmur3_128().hashString(part, StandardCharsets.UTF_8));
    }
    return Hashing.consistentHash(Hashing.combineOrdered(hashCodes), maxBuckets);
}
I currently use a similar solution in production. So for your method I would change to:
public static int bucket(List<String> primKeyParts, int maxBuckets) {
    String keyParts = String.join("", primKeyParts);
    return Hashing.consistentHash(
            Hashing.murmur3_32().hashString(keyParts, Charsets.UTF_8),
            maxBuckets);
}
So the differences:
Send all the PK parts into the hash function at once.
We actually set the max buckets as a code constant, since the hash is only consistent if the max buckets stay the same.
We use the Murmur3 hash since we want it to be fast, not cryptographically strong.
For your direct questions: 1) Yes, the method should do the job. 2) I think with the tweaks above you should be set. 3) The assumption is that you need the whole PK?
I'm not sure you need to use the whole primary key, since the expectation is that the partition part of your primary key will be the same for many rows, which is why you are bucketing. You could just hash the bits that will provide you with good buckets to use in your partition key. In our case, we just hash some of the clustering-key parts of the PK to generate the bucket id we use as part of the partition key.
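For example, a sketch of deriving the bucket from selected clustering columns only (which columns you pick, and the MAX_BUCKETS value, are illustrative):
import java.nio.charset.StandardCharsets;
import com.google.common.hash.Hashing;

public final class BucketIds {
    // Must never change once data has been written, or rows will map to different buckets.
    private static final int MAX_BUCKETS = 32;

    // Sketch: hash only the clustering-key parts that give a good spread,
    // and use the result as the bucket component of the partition key.
    public static int bucket(String... clusteringParts) {
        String joined = String.join("", clusteringParts);
        return Hashing.consistentHash(
                Hashing.murmur3_32().hashString(joined, StandardCharsets.UTF_8),
                MAX_BUCKETS);
    }
}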

MongoDB custom and unique IDs

I'm using MongoDB, and I would like to generate unique and cryptic IDs for blog posts (to be used in RESTful URLs), such as s52ruf6wst or xR2ru286zjI.
What do you think is the best and most scalable way to generate these IDs?
I was thinking of the following architecture:
a periodic (daily?) batch running to generate a lot of random and unique IDs and insert them into a dedicated MongoDB collection with InsertIfNotPresent
and each time I want to create a new blog post, I take an ID from this collection and mark it as "taken" with the UpdateIfCurrent atomic operation
WDYT?
This is exactly why the developers of MongoDB constructed their ObjectIds (the _id) the way they did: to scale across nodes, etc.
A BSON ObjectID is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter. Note that the timestamp and counter fields must be stored big-endian, unlike the rest of BSON. This is because they are compared byte-by-byte and we want to ensure a mostly increasing order.
Here's the schema:
bytes 0-3: time | bytes 4-6: machine | bytes 7-8: pid | bytes 9-11: inc
Traditional databases often use monotonically increasing sequence numbers for primary keys. In MongoDB, the preferred approach is to use Object IDs instead. Object IDs are more synergistic with sharding and distribution.
http://www.mongodb.org/display/DOCS/Object+IDs
So I'd say just use ObjectIds.
They are not that bad when converted to a string (these were inserted right after each other) ...
For example:
4d128b6ea794fc13a8000001
4d128e88a794fc13a8000002
At first glance they look "guessable", but they really aren't that easy to guess ...
4d128 b6e a794fc13a8000001
4d128 e88 a794fc13a8000002
And for a blog, I don't think it's that big of a deal ... we use it in production all over the place.
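If you want to see for yourself how similar consecutive ids look, a small sketch using the Java driver's org.bson.types.ObjectId (the class and method names here are the driver's; the demo class itself is just illustrative):
import org.bson.types.ObjectId;

// Sketch: generate two ObjectIds back to back and print their hex form.
public class ObjectIdDemo {
    public static void main(String[] args) {
        ObjectId first = new ObjectId();
        ObjectId second = new ObjectId();
        System.out.println(first.toHexString());   // 24 hex characters
        System.out.println(second.toHexString());  // consecutive ids differ mostly in the trailing counter
    }
}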
What about using UUIDs?
http://www.famkruithof.net/uuid/uuidgen as an example.
Make a web service that returns a globally unique ID, so that you can have many web servers participate and know you won't hit any duplicates? What if your daily batch didn't allocate enough items? Do you run it midday?
I would implement the web-service client as a queue that can be watched by a local process and refilled as needed (when the server is less busy), and that keeps enough items in the queue so it doesn't need to run during peak usage. Makes sense?
This is an old question, but here is another solution for anyone who might be searching for one.
One way is to use a simple and fast substitution cipher. (The code below is based on someone else's code; I forgot where I took it from, so I cannot give proper credit.)
class Array
  def shuffle_with_seed!(seed)
    prng = (seed.nil?) ? Random.new() : Random.new(seed)
    size = self.size
    while size > 1
      # random index
      a = prng.rand(size)
      # last index
      b = size - 1
      # switch last element with random element
      self[a], self[b] = self[b], self[a]
      # reduce size and do it again
      size = b
    end
    self
  end

  def shuffle_with_seed(seed)
    self.dup.shuffle_with_seed!(seed)
  end
end

class SubstitutionCipher
  def initialize(seed)
    normal = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a + [' ']
    shuffled = normal.shuffle_with_seed(seed)
    @map = normal.zip(shuffled).inject(:encrypt => {}, :decrypt => {}) do |hash, (a, b)|
      hash[:encrypt][a] = b
      hash[:decrypt][b] = a
      hash
    end
  end

  def encrypt(str)
    str.split(//).map { |char| @map[:encrypt][char] || char }.join
  end

  def decrypt(str)
    str.split(//).map { |char| @map[:decrypt][char] || char }.join
  end
end
You use it like this:
MY_SECRET_SEED = 3429824
cipher = SubstitutionCipher.new(MY_SECRET_SEED)
id = hash["_id"].to_s
encrypted_id = cipher.encrypt(id)
decrypted_id = cipher.decrypt(encrypted_id)
Note that it'll only encrypt a-z, A-Z, 0-9 and spaces, leaving other characters intact. That's sufficient for BSON ids.
The "correct" answer, which is not really a great solution IMHO, is to generate a random ID and then check the DB for a collision. If there is a collision, do it again. Repeat until you've found an unused one. Most of the time the first attempt will work (assuming that your generation process is sufficiently random).
It should be noted that this process is only necessary if you are concerned about the security implications of a time-based UUID or a counter-based ID. Either of these will lead to "guessability", which may or may not be an issue in any given situation. I would consider a time-based or counter-based ID to be sufficient for blog posts, though I don't know the details of your situation and reasoning.
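A sketch of that generate-and-retry approach in Java (the isTaken check is a placeholder for whatever uniqueness lookup your datastore provides, e.g. a query against a unique index):
import java.security.SecureRandom;
import java.util.function.Predicate;

public final class ShortIds {
    private static final char[] ALPHABET =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
    private static final SecureRandom RANDOM = new SecureRandom();

    // Sketch: generate a random short id and retry on collision.
    // isTaken is a placeholder; in practice you would also enforce a unique index
    // and retry on a duplicate-key error to close the check-then-insert race.
    public static String newId(int length, Predicate<String> isTaken) {
        while (true) {
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                sb.append(ALPHABET[RANDOM.nextInt(ALPHABET.length)]);
            }
            String candidate = sb.toString();
            if (!isTaken.test(candidate)) {        // usually succeeds on the first attempt
                return candidate;
            }
        }
    }
}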

Pseudorandom seed methodology for lookup tables

Could someone please suggest a good way of taking a global seed value, e.g. "Hello World", and using that value to look up values in arrays or tables?
I'm thinking of something like the classic spacefaring game "Elite", where the planets had different attributes, but these were not random, simply derived from the seed value for the universe.
I was thinking of running MD5 on the input value and then using bytes from the hash, casting them to integers and modding them into acceptable indexes for lookup tables, but I suspect there must be a better way. I read something about Mersenne twisters, but maybe that would be overkill.
I'm hoping for something which will give a good distribution over the values in my lookup tables, e.g. Red, Orange, Yellow, Green, Blue, Purple.
Also, to emphasize: I'm not looking for random values but for consistent values each time.
Update: Perhaps I'm having difficulty in expressing my own problem domain. Here is an example of a site which uses generators and can generate X number of values: http://www.seventhsanctum.com
Additional criteria
I would prefer to work from first principles rather than making use of library functions such as System.Random
My approach would be to use your key as a seed for a random number generator:
public StarSystem(long systemSeed) {
    java.util.Random r = new Random(systemSeed);
    Color c = colorArray[r.nextInt(colorArray.length)]; // generates a pseudo-random number based on your seed
    PoliticalSystem politics = politicsArray[r.nextInt(politicsArray.length)];
    ...
}
For a given seed this will produce the same color and the same political system every time.
For getting the starting seed from a string, you could just use an MD5 sum and grab the first/last 64 bits for your long; the other approach would be to just use a number for each planet. Elite also generated the names for each system using its pseudo-random generator.
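A minimal sketch of the MD5 route (the helper's name is mine):
// Sketch: derive a 64-bit seed from a string by taking the first 8 bytes of its MD5 digest.
static long seedFrom(String globalSeed) {
    try {
        byte[] digest = java.security.MessageDigest.getInstance("MD5")
                .digest(globalSeed.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        return java.nio.ByteBuffer.wrap(digest).getLong();   // first 64 bits of the 128-bit digest
    } catch (java.security.NoSuchAlgorithmException e) {
        throw new IllegalStateException("MD5 not available", e);  // every standard JRE ships MD5
    }
}
new StarSystem(seedFrom("Hello World")) will then give the same "universe" on every run.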
for(long seed=1; seed<NUMBER_OF_SYSTEMS; seed++){
starSystems.add(new StarSystem(seed));
}
By setting the seed to a known value, the Random will return the same sequence every time it is used; this is why a good seed is so important when you want good random values. However, in your case a known seed will produce exactly the results you're looking for.
The C# equivalent is:
public StarSystem(int systemSeed) {
    System.Random r = new Random(systemSeed);
    Color c = colorArray[r.Next(colorArray.Length)]; // generates a pseudo-random number based on your seed
    PoliticalSystem politics = politicsArray[r.Next(politicsArray.Length)];
    ...
}
Notice a difference? No, nor did I.
Many common random number generators will generate the same sequence given the same seed value, so it seems that all you need to do is convert your name into a number. There are any number of hashing functions that will do that.
Supplementary question: is it required that all unique strings generate unique hashes, and so (probably) unique pseudo-random sequences?