I need to count how many values of one of the columns of df1 are present in one of the columns of df2. (I just need the number of matched values)
I wouldn't be asking this question if efficiency weren't such a big concern:
df1 contains 100,000,000+ records
df2 contains 1,000,000,000+ records
Just an off-the-top-of-my-head idea, in case the built-in intersection won't cut it:
For the datatype that is contained in the columns, find two hash functions h1, h2 such that
h1 produces hashes roughly uniformly between 0 and N
h2 produces hashes roughly uniformly between 0 and M
such that M * N is approximately 1B, e.g. M = 10k, N = 100k,
then:
map each entry x from the column from df1 to (h1(x), x)
map each entry x from the column from df2 to (h1(x), x)
group both by h1 into buckets with xs
join on h1 (that's gonna be the nasty shuffle)
then locally, for each pair of buckets (b1, b2) that came from df1 and df2 and had the same h1 hash code, do essentially the same:
compute h2 for all entries from b1 and from b2,
group by the hash code h2
Compare the small sub-sub-buckets that remain by converting everything to a Set (toSet) and computing the intersection directly.
Everything that remains after intersection is present in both df1 and df2, so compute size and sum the results across all partitions.
The idea is to choose N so that the buckets of roughly M entries each still comfortably fit on a single node, while at the same time preventing the whole application from dying on the first shuffle as it tries to work out where everything is by sending every key to everyone else. For example, using SHA-256 as the "hash code" for h1 wouldn't help much, because the keys would be essentially unique, so you might as well take the original data directly and try to shuffle that. However, if you restrict N to some reasonably small number, e.g. 10k, you get a rough approximation of where everything is, so you can then regroup the buckets and start the second stage with h2.
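To make that concrete, here is a rough, untested sketch in Scala/Spark. It assumes the join keys have already been extracted into plain RDD[String]s (keys1, keys2); those names, and the hashCode-based stand-ins for h1 and h2, are mine and only illustrative.

import org.apache.spark.rdd.RDD

// Untested sketch of the two-stage bucketing; counts distinct matched values.
def countMatches(keys1: RDD[String], keys2: RDD[String], n: Int, m: Int): Long = {
  val h1 = (x: String) => (x.hashCode & Int.MaxValue) % n         // coarse bucket, 0..N
  val h2 = (x: String) => ((x + "#").hashCode & Int.MaxValue) % m // finer sub-bucket, 0..M

  keys1.map(x => (h1(x), x))
    .cogroup(keys2.map(x => (h1(x), x)))   // the shuffle on h1
    .map { case (_, (b1, b2)) =>
      // locally split both buckets by h2 and intersect the small sub-buckets as Sets
      val g2 = b2.toSeq.groupBy(h2).mapValues(_.toSet)
      b1.toSeq.groupBy(h2).map { case (k, xs) =>
        (xs.toSet intersect g2.getOrElse(k, Set.empty[String])).size.toLong
      }.sum
    }
    .sum()                                 // add up the per-bucket counts
    .toLong
}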
Essentially this is just a random guess; I didn't test it. It could well be that the built-in intersection is smarter than anything I could come up with.
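For reference, the built-in route alluded to above would look something like this (the column name "key" is a placeholder):

// INTERSECT already deduplicates, so the count is the number of distinct matched values.
val matched = df1.select("key")
  .intersect(df2.select("key"))
  .count()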
Are there any standardized operators for data from arrays of sensors?
I normally deal with sensor data in the form of time + channels. The time is a timestamp, and the channels are the available data for these timestamps. All these fields are numeric, no strings involved.
Normally I have to mix those data objects in different ways. Let's suppose M1 is size m1xn1 and M2 is size m2xn2:
Combine rows of data from the same channels and different timestamps (i.e. n1 == n2). This leads to a vertical concatenation [M1; M2].
Combine columns of data from the same timestamps and different channels, (i.e. m1 == m2). This leads to a horizontal concatenation [M1 M2].
These operators are trivial and well defined.
When I have slight differences, for example a few additional samples in M1 or M2, everything gets complicated and I have to come up with awkward schemes to perform such operations, such as these:
Removing the excess samples from M1 or M2 so that the dimensions match.
Calculating an aggregated time base by taking unique(sort()) of the timestamps, and then applying a union as in a SQL JOIN statement.
Aggregating the data in M1 or M2, that is, reducing m1 or m2 to a smaller number of rows by resampling the timescale, and then applying an aggregation as in a SQL GROUP BY statement.
I cannot think of a unique and definite function to combine this sort of data. How can I do this?
Let's say you have an m1-element vector of time values t1 and an n1-element vector of channel values c1 for your m1-by-n1 matrix M1 (and likewise for M2). First and foremost, you will likely need to convert your time and channel values into equivalent index values. You can do this by expanding your time and channel values into grids using ndgrid, then converting them to index values using unique:
[t1, c1] = ndgrid(t1, c1);
[t2, c2] = ndgrid(t2, c2);
[tUnion, ~, tIndex] = unique([t1(:); t2(:)]);
[cUnion, ~, cIndex] = unique([c1(:); c2(:)]);
Now there are two approaches you can take for aggregating the data using the above indices. If you know for certain that the matrices M1 and M2 will never contain repeated measurements (i.e. the same combination of time and channel will not appear in both), then you can build the final joined matrix by creating a linear index from tIndex and cIndex and combining the values from M1 and M2 like so:
MUnion = zeros(numel(tUnion), numel(cUnion));
MUnion(tIndex+numel(tUnion).*(cIndex-1)) = [M1(:); M2(:)];
If the matrices M1 and M2 could contain repeated measurements at the same combination of time and channel values, then accumarray will be the way to go. You will have to decide how you want to combine the repeated measurements, such as taking the mean as shown here:
MUnion = accumarray([tIndex cIndex], [M1(:); M2(:)], [], @mean);
For an experiment I need to pseudo-randomize a vector of 100 trials of stimulus categories, 80% of which are category A, 10% B, and 10% C. The B trials must have at least two non-B trials between them, and the C trials must come after two A trials and have two A trials following them.
At first I tried building a script that randomized a vector and sort of "popped" out the trials that were not where they should be, putting them into a spot in the vector where there was a long run of A trials. I'm worried, though, that this is overcomplicated, that it will create an endless series of unforeseen errors that will need to be debugged, and that the result won't be random enough.
After that I tried building a script that simply reshuffles the vector until it meets the criteria, which seems to require less code. However, now that I have spent several hours on it, I am wondering whether the criteria are too strict for this to make sense, i.e. whether it would take forever for a shuffle to come up that actually meets them.
What do you think is the simplest way to handle this problem? Additionally, which would be the best shuffle function to use, since Shuffle in psychtoolbox does not seem to be working correctly?
The scope of this question goes well beyond language-specific constructs and requires a good understanding of probability and permutations/combinations.
An approach to solving this question is:
Create blocks of vectors, such that each block can be placed anywhere independently of the others.
Randomly allocate these blocks to get a final random vector satisfying all constraints.
Part 0: Category A
Since category A has no constraints imposed on it, we will go to the next category.
Part 1: Make category C independent
The only constraint on category C is that it must have two A's before and after. Hence, we first create random groups of 5 vectors, of the pattern A A C A A.
At this point, we have an array of A vectors (excluding blocks), blocks of A A C A A vectors, and B vectors.
Part 2: Resolving placement of B
The constraint on B is that two consecutive Bs must have at least 2 non-B vectors between them.
Visualize as follows: Let's pool A and A A C A A in one array, X. Let's place all Bs in a row (suppose there are 3 Bs):
s0 B s1 B s2 B s3
where each si is the number of vectors placed in that slot. Hence, we require that s1 and s2 each be at least 2, and that s0 + s1 + s2 + s3 equal the number of vectors in X.
The task is then to choose random vectors from X and assign them to each s. At the end, we finally have a random vector with all categories shuffled, satisfying the constraints.
P.S. This can be mapped to the classic problem of finding a set of random numbers that add up to a certain sum, with constraints.
It is easier to reduce the constrained sum problem to one with no constraints. This can be done as:
s0 B s1 t1 B s2 t2 B s3
where t1 and t2 take just enough vectors from X (two each) to satisfy the constraint on B, and s0 + s1 + s2 + s3 must equal the number of vectors of X not used in the t slots.
Implementation
Implementing this in MATLAB could benefit from using cell arrays, and from this algorithm for generating random numbers with a constant sum.
You would also need to maintain separate pools for each category, and keep building blocks and piece them together.
This is not trivial, but it is not impossible either. It is an approach you could try if you want to step away from the brute-force search you have tried before.
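As a minimal, untested sketch of the whole pipeline, here it is in Scala for concreteness (a MATLAB version would use cell arrays, as noted above); the 80/10/10 counts and the slot-splitting trick come from the description, everything else is an assumption:

import scala.util.Random

def buildSequence(rng: Random = new Random()): Vector[Char] = {
  val nA = 80; val nB = 10; val nC = 10

  // Part 1: each C consumes four As to form an independent "A A C A A" block.
  val cBlocks = Vector.fill(nC)(Vector('A', 'A', 'C', 'A', 'A'))
  val looseAs = Vector.fill(nA - 4 * nC)(Vector('A'))

  // Pool X of placeable units (single As and AACAA blocks), in random order.
  val x = rng.shuffle(cBlocks ++ looseAs)

  // Part 2: reserve 2 units for each of the nB - 1 gaps between consecutive Bs
  // (the t slots), then split the remaining units randomly over the nB + 1 free slots.
  val reserved = 2 * (nB - 1)
  val free = x.size - reserved
  require(free >= 0, "not enough non-B units to satisfy the spacing constraint")

  // Random non-negative slot sizes s0..s_nB summing to `free` (random cut points).
  val cuts = (Vector.fill(nB)(rng.nextInt(free + 1)) :+ 0 :+ free).sorted
  val slotSizes = cuts.zip(cuts.tail).map { case (a, b) => b - a }

  // Stitch: s0 B (t1 s1) B (t2 s2) ... B s_nB, drawing units from X in order.
  val it = x.iterator
  val out = Vector.newBuilder[Vector[Char]]
  out ++= Iterator.fill(slotSizes(0))(it.next())
  for (i <- 1 to nB) {
    out += Vector('B')
    if (i < nB) out ++= Iterator.fill(2)(it.next())  // mandatory t slot
    out ++= Iterator.fill(slotSizes(i))(it.next())   // free s slot
  }
  out.result().flatten
}

A cheap sanity check afterwards (scan the result for B spacing and the two As around each C) is still worth keeping.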
The key idea of locality-sensitive hashing (LSH) is that neighboring points are more likely to be mapped to the same bucket, while points far from each other are more likely to be mapped to different buckets. When using random projection, if the database contains N samples, each of high dimension d, then the theory says we must create k randomly generated hash functions, where k is the target reduced dimension, and combine them as g(v) = (h_1(v), h_2(v), ..., h_k(v)). So any vector point v is mapped by a g-function to a k-dimensional vector; this hash code of reduced length/dimension k is regarded as a bucket. Now, to increase the probability of collision, the theory says we should have L such g-functions g_1, g_2, ..., g_L, chosen at random. This is the part that I do not understand.
Question: How do I create multiple hash tables? How many buckets are contained in a hash table?
I am following the code given in the paper Sparse Projections for High-Dimensional Binary Codes by Yan Xia et al. (Link to Code).
In the file Coding.m
dim = size(X_train, 2);       % data dimensionality d
R = randn(dim, bit);          % a single d-by-bit random projection matrix
% coding
B_query = (X_query*R >= 0);   % each query row becomes bit sign bits
B_base = (X_base*R >= 0);     % each database row becomes bit sign bits
X_query is the set of query data, each point of dimension d, and there are 1000 query samples; R is the random projection and bit is the target reduced dimensionality. The outputs B_query and B_base are N strings of length k taking 0/1 values.
Does this create multiple hash tables, i.e. is N the number of hash tables? I am confused as to how. A detailed explanation would be very helpful.
How to create multiple hash tables?
LSH creates a hash table using an (amplified) hash function built by concatenation:
g(p) = [h1(p), h2(p), ..., hk(p)], where each hi is drawn at random from the hash family H
g() is a hash function and it corresponds to one hash table. So we map the data via g() into that hash table and, with high probability, close points will fall into the same bucket while non-close points will fall into different buckets.
We do that L times, thus creating L hash tables. Note that every g() should most likely be different from the other g() hash functions.
Note: a larger k gives a larger gap between P1 and P2, and a smaller P1 requires a larger L to find the neighbors. A practical choice is L = 5 (or 6). Here P1 is the collision probability for near points and P2 the collision probability for far points.
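A hedged sketch of that structure (not the paper's code): L tables, each with its own k random hyperplanes used as a sign-random-projection g(); the class and parameter names are mine.

import scala.collection.mutable
import scala.util.Random

class LshIndex(d: Int, k: Int, L: Int, seed: Long = 42L) {
  private val rng = new Random(seed)

  // L independent g-functions: for table l, projections(l) is a k x d matrix.
  private val projections = Array.fill(L, k, d)(rng.nextGaussian())

  // One hash table per g-function: bucket key (k sign bits) -> ids of the points stored there.
  private val tables =
    Array.fill(L)(mutable.Map.empty[Vector[Boolean], mutable.Buffer[Int]])

  // g_l(v) = (h_1(v), ..., h_k(v)), each h_i the sign of one random projection.
  private def bucketKey(l: Int, v: Array[Double]): Vector[Boolean] =
    projections(l).map(h => (0 until d).map(i => h(i) * v(i)).sum >= 0.0).toVector

  def insert(id: Int, v: Array[Double]): Unit =
    for (l <- 0 until L)
      tables(l).getOrElseUpdate(bucketKey(l, v), mutable.Buffer.empty[Int]) += id

  // Candidate neighbours: the union of the L buckets the query falls into.
  def candidates(q: Array[Double]): Set[Int] =
    (0 until L).flatMap(l => tables(l).getOrElse(bucketKey(l, q), mutable.Buffer.empty[Int])).toSet
}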
How many buckets are contained in a hash table?
Wish I knew! That's a difficult question; how about sqrt(N), where N is the number of points in the dataset? Check this: Number of buckets in LSH.
The code of Yan Xia
I am not familiar with that, but from what you said, I believe that the query data you see are 1000 in number, because we wish to pose 1000 queries.
k is the length of the strings, because we have to hash each query to see into which bucket of a hash table it will be mapped. The points inside that bucket are the potential (approximate) nearest neighbors.
I know several questions have been asked on similar topics, but I couldn't apply any of the answers to my problem; I am also wondering about best practices.
I have loaded a dataset for ML into a SQL database and I want to apply MLlib's clustering functions to it. I have loaded the SQL table into a DataFrame using sqlContext and dropped the irrelevant columns. Then comes the problematic part: I create a vector by parsing each row of the DataFrame.
The vectors are then transformed into an RDD using the toJavaRDD function.
Here is the code (works):
val usersDF = sqlContext.read.format("jdbc").option("url","jdbc:mysql://localhost/database").
option("driver","com.mysql.jdbc.Driver").option("dbtable","table").
option("user","woot").option("password","woot-password").load()
val cleanDF = usersDF.drop("id").drop("username")
cleanDF.show()
val parsedData = cleanDF.map(s => Vectors.dense(s.toString().replaceAll("[\\[\\]]", "").trim.split(',').map(_.toDouble))).cache()
val splits = parsedData.randomSplit(Array(0.6,0.4), seed = 11L)
val train_set = splits(0).cache()
val gmm = new GaussianMixture().setK(2).run(train_set)
My main question concerns what I read in the Spark documentation about the local vector. In my understanding, the DataFrame mapping will be performed on the workers and the result later sent to the driver when creating the vector (is that the meaning of "local vector"?), only to be sent back to the workers again later. Isn't there a better way to achieve this?
Another thing is that it seems a little odd to load SQL data into a DataFrame only to turn each row into a string and parse it again. Are there any other best-practice suggestions?
From the link you suggested
A local vector has integer-typed and 0-based indices and double-typed
values, stored on a single machine. MLlib supports two types of local
vectors: dense and sparse.
A distributed matrix has long-typed row and column indices and
double-typed values, stored distributively in one or more RDDs.
Local vectors behave like any other object you would use in your RDD (String, Integer, Array): they are created and stored on a single machine, the worker node, and only if you collect them will they be sent to the driver node.
If you consider a vector x of size 2n, storing it distributively means splitting it into two halves of length n, x1 and x2 (x = x1::x2). To perform the dot product with another vector y, the workers compute r1 = x1*y1 (on machine 1) and r2 = x2*y2 (on machine 2), and then you need to combine the partial results, giving r = r1 + r2. Your vector x is distributed, while the vectors x1 and x2 are again local vectors. If you have x as a local vector, then you can compute r = x*y on a single worker node in one step.
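A toy illustration of the distinction, assuming a SparkContext sc is in scope (the numbers are made up):

// Local: the whole vector lives in one JVM, so the dot product is a single step.
val x = Array(1.0, 2.0, 3.0, 4.0)
val y = Array(4.0, 3.0, 2.0, 1.0)
val localDot = x.zip(y).map { case (a, b) => a * b }.sum

// Distributed: each worker computes a partial dot product on its slice (x1*y1, x2*y2),
// and the partial results are combined by the reduce.
val distributedDot = sc.parallelize(x.zip(y), numSlices = 2)
  .map { case (a, b) => a * b }
  .reduce(_ + _)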
For your second question, I do not see why you would store the vectors in SQL format. Having a CSV file like this would be sufficient:
label, feature1, feature2, ...
1, 0.5, 1.2, ...
0, 0.2, 0.5, ...
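If you go that route, a hedged sketch of loading such a CSV straight into MLlib structures (the file name and the presence of a header row are my assumptions) could look like:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Read the CSV directly into labelled vectors, skipping the header row.
val raw = sc.textFile("features.csv")
val header = raw.first()
val data = raw.filter(_ != header)
  .map(_.split(',').map(_.trim.toDouble))
  .map(a => LabeledPoint(a.head, Vectors.dense(a.tail)))
  .cache()

This avoids the toString/re-parse round trip from the question, since each row is parsed once, on the workers.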
I would like to store an input of tab-separated values where, say, C1, C2, C3 and C4 represent the columns of the data and there are N rows of data. If so, I could do lookups in the hash to see if some given values for C1, C2, C3, C4 exist. Someone suggested to me that, in the worst case, the space complexity of this was N^4. I would like help formulating a clear explanation as to why that is not true.
The other person is thinking that if you tried to store the full N x N x N x N grid of all possible combinations, there would be N^4 points to store.
But if you have N rows, then you're just storing N entries in a hash, and a hash with N data points typically takes O(N) space. (Technically it takes the size of the hash table plus the space for the data, but people usually size the hash table dynamically to be the same order of magnitude as the data set.)
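A minimal sketch of the point, in Scala, with a made-up file name data.tsv:

import scala.io.Source

// Storing N four-column rows as keys of a hash set takes O(N) space, not O(N^4).
val rows: Set[(String, String, String, String)] =
  Source.fromFile("data.tsv").getLines()
    .map(_.split('\t'))
    .collect { case Array(c1, c2, c3, c4) => (c1, c2, c3, c4) }
    .toSet  // one entry per row: space grows linearly with N

// Looking up a specific (C1, C2, C3, C4) combination is then O(1) on average.
val present = rows.contains(("a", "b", "c", "d"))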