Using variable length data inputs with EM algorithm clustering

We have a set of sequences with taxi positions. We want to cluster the data by considering the sequential patterns in the data lines.
For example:
Let T1, T2, T3, T4 be the travels and a, b, c, d, e be the set of places.
The data we have looks like this:
T1 b c b a d
T2 a
T3 a b a b a b c e d
T4 b c d c b d c a
But the problem is that the lengths of the sequences are not fixed. How can we cluster this type of data using EM? Since it does not accept variable-length data, is there a way we can customize it?

EM is a general principle. You can use it with very different models.
Probably the most popular model for EM is Gaussian Mixture Modeling, GMM.
Naturally, if you use covariances, GMM requires a fixed dimensionality.
But if you use other models, there is no reason it cannot work with variable-length vectors. For example, there are EM variants that process text data, and texts usually have different lengths.
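To make this concrete, here is a minimal sketch of EM fitting a mixture of first-order Markov chains, one model (among several) that naturally accepts variable-length sequences like the travels above. Everything in it is illustrative: the cluster count K, the encoding a=1..e=5, and all variable names are made up, not part of any library API:

seqs = {[2 3 2 1 4], [1], [1 2 1 2 1 2 3 5 4], [2 3 4 3 2 4 3 1]};  % T1..T4, a=1..e=5
K = 2; S = 5;                           % number of clusters, number of places
w  = ones(1, K) / K;                    % mixture weights
P  = rand(S, S, K); P = P ./ sum(P, 2); % per-cluster transition matrices
p0 = ones(K, S) / S;                    % per-cluster initial-place distributions
R  = zeros(numel(seqs), K);             % responsibilities
for it = 1:50
    % E-step: responsibility of each cluster for each sequence,
    % computed in log space so short and long sequences both work
    for i = 1:numel(seqs)
        s = seqs{i};
        for k = 1:K
            ll = log(w(k)) + log(p0(k, s(1)));
            for t = 2:numel(s)
                ll = ll + log(P(s(t-1), s(t), k));
            end
            R(i, k) = ll;
        end
        R(i, :) = exp(R(i, :) - max(R(i, :)));
        R(i, :) = R(i, :) / sum(R(i, :));
    end
    % M-step: re-estimate weights, initial distributions, transitions
    w = sum(R, 1) / numel(seqs);
    for k = 1:K
        C = zeros(S, S); c0 = zeros(1, S);
        for i = 1:numel(seqs)
            s = seqs{i};
            c0(s(1)) = c0(s(1)) + R(i, k);
            for t = 2:numel(s)
                C(s(t-1), s(t)) = C(s(t-1), s(t)) + R(i, k);
            end
        end
        p0(k, :)   = (c0 + 1) / sum(c0 + 1);   % Laplace smoothing keeps log() finite
        P(:, :, k) = (C + 1) ./ sum(C + 1, 2);
    end
end
[~, labels] = max(R, [], 2)             % hard cluster assignment for T1..T4

The E-step/M-step pattern is exactly the same as in GMM; only the per-cluster likelihood changes, and nothing in it requires the sequences to have equal length.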

Combination Operators on Sensor Array Data

Are there any standardized operators for data from arrays of sensors?
I normally deal with sensor data in the form of time + channels. The time is a timestamp, and the channels are the available data for those timestamps. All these fields are numeric; no strings involved.
Normally I have to mix those data objects in different ways. Let's suppose M1 is size m1xn1 and M2 is size m2xn2:
Combine rows of data from the same channels and different timestamps (i.e. n1 == n2). This leads to a vertical concatenation [M1; M2].
Combine columns of data from the same timestamps and different channels, (i.e. m1 == m2). This leads to a horizontal concatenation [M1 M2].
These operators are trivial and well defined.
When there are slight differences, for example a few additional samples in M1 or M2, everything becomes complicated and I have to come up with odd schemes to perform such operations, such as these:
Removing the excess samples from M1 or M2 to make the dimensions match.
Calculating an aggregated timestamp, obtaining a unique(sort()) of the timestamps, and then applying a union as in a SQL JOIN statement.
Aggregating the data in M1 or M2, that is, reducing m1 or m2 to a smaller figure by resampling the timescale, and then applying an aggregation as in a SQL GROUP BY statement.
I cannot think of a unique and definite function to combine this sort of data. How can I do this?
Let's say you have an m1-element vector of time values t1 and an n1-element vector of channel values c1 for your m1-by-n1 matrix M1 (and likewise for M2). First and foremost, you will likely need to convert your time and channel values into equivalent index values. You can do this by expanding your time and channel values into grids using ndgrid, then converting them to index values using unique:
[t1, c1] = ndgrid(t1, c1);
[t2, c2] = ndgrid(t2, c2);
[tUnion, ~, tIndex] = unique([t1(:); t2(:)]);
[cUnion, ~, cIndex] = unique([c1(:); c2(:)]);
Now there are two approaches you can take for aggregating the data using the above indices. If you know for certain that the matrices M1 and M2 will never contain repeated measurements (i.e. the same combination of time and channel will not appear in both), then you can build the final joined matrix by creating a linear index from tIndex and cIndex and combining the values from M1 and M2 like so:
MUnion = zeros(numel(tUnion), numel(cUnion));
MUnion(tIndex+numel(tUnion).*(cIndex-1)) = [M1(:); M2(:)];
If the matrices M1 and M2 could contain repeated measurements at the same combination of time and channel values, then accumarray will be the way to go. You will have to decide how you want to combine the repeated measurements, such as taking the mean as shown here:
MUnion = accumarray([tIndex cIndex], [M1(:); M2(:)], [], @mean);
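As a quick check, here is a toy run of the accumarray version with made-up numbers, where the two recordings overlap at two timestamps on channel 2:

t1 = (1:3).'; c1 = [1 2];  M1 = [10 20; 11 21; 12 22];  % times 1:3, channels 1 and 2
t2 = (2:4).'; c2 = 2;      M2 = [31; 32; 33];           % times 2:4, channel 2 only
[t1, c1] = ndgrid(t1, c1);
[t2, c2] = ndgrid(t2, c2);
[tUnion, ~, tIndex] = unique([t1(:); t2(:)]);
[cUnion, ~, cIndex] = unique([c1(:); c2(:)]);
MUnion = accumarray([tIndex cIndex], [M1(:); M2(:)], [], @mean)
% MUnion is 4-by-2 (times 1:4 by channels 1:2). The overlapping samples
% at (t=2,c=2) and (t=3,c=2) are averaged to 26 and 27, and cells that
% were never measured default to 0.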

What is the correct approach when decomposing dependencies

I am struggling with Canonical Cover, Dependency Preservation, and Lossless Decomposition.
Are the approach and thoughts here correct?
R(ABCDEFG)
Provided is the following set of dependencies after a canonical cover has been made. I did not do the canonical cover myself but the assignment said I had to assume it had been done.
Fc:
A -> C
E -> A
C -> ABF
F -> CDG
A+ = ABCDFG
E+ = ABCDEFG
C+ = ABCDFG
F+ = ABCDFG
E = Candidate Key.
The schema is in 2NF since there are no partial dependencies. It is, however, not in 3NF since there are transitive dependencies.
However, decomposing it into the following 4 relations will result in it being not only in 3NF but also in BCNF:
R1 = {E,A}
E -> A
R2 = {A, C}
A -> C
R3 = {C,A,B,F}
C -> ABF
R4 = {F,C,D,G}
F -> CDG
I use A in R1 as a foreign key to R2 and C in R2 as a foreign key to R3 etc.
There are no transitive dependencies, and since all left-hand sides are candidate keys in their respective relations, it is in BCNF.
Is it also lossless and dependency preserving?
What is decomposed
In the title you say:
What is the correct approach when decomposing dependencies
but one does not decompose dependencies; one decomposes relation schemas. So, in this case, there is a relation schema R(ABCDEFG) with a set of functional dependencies, and one must decompose that schema.
What is a decomposition
A decomposition produces a set of relation schemas with the following properties: a) every attribute of the original schema is present in some (possibly more than one) subschema; b) no other attributes are present. Moreover, a decomposition is redundant when a relation subschema is contained in another. In your case this is true for R2, which is contained in R3: there is no need to have both relations, since keeping both would only introduce useless data redundancy.
What is a good decomposition
To be really useful, a decomposition should satisfy two important properties: it should preserve the functional dependencies and preserve the data (lossless decomposition). But another property characterizes a good decomposition: it should be as small as possible. There is no point in decomposing a schema into too many subschemas, since this would produce an unnatural and complex database.
Actually your decomposition is lossless and preserves the dependencies.
How to decompose
The final objective of all this is to produce a decomposition (lossless and dependency preserving) in which the subschemas are in BCNF or 3NF. The simple solution of decomposing by using the attributes of the functional dependencies is not, however, a good one. For this there are algorithms, described in textbooks, that produce decompositions either for BCNF or for 3NF (the so-called “analysis” algorithm for BCNF and the “synthesis” algorithm for 3NF), trying not to produce too many subschemas. For instance, the “analysis” algorithm in this case produces the following decomposition in BCNF, with only two subschemas:
R1 < (A B C D F G) ,
{ F → C
F → D
F → G
C → A
C → B
C → F
A → C } >
R2 < (A E) ,
{ E → A } >
This decomposition is lossless and preserves the dependencies (which is not always true for the analysis algorithm).
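For reference, the core subroutine of both algorithms is the computation of attribute closures (the X+ sets computed in the question). Below is a minimal MATLAB sketch; the encoding of Fc as two parallel cell arrays and the closure helper are made up for illustration:

lhs = {'A', 'E', 'C', 'F'};          % left-hand sides of Fc
rhs = {'C', 'A', 'ABF', 'CDG'};      % right-hand sides of Fc
closure('E', lhs, rhs)               % returns 'ABCDEFG', confirming E is a key

function clo = closure(attrs, lhs, rhs)
    clo = attrs;                     % start from the given attribute set
    changed = true;
    while changed                    % apply FDs until a fixpoint is reached
        changed = false;
        for i = 1:numel(lhs)
            if all(ismember(lhs{i}, clo)) && ~all(ismember(rhs{i}, clo))
                clo = unique([clo rhs{i}]);   % add the newly implied attributes
                changed = true;
            end
        end
    end
end

The same helper can be used to verify BCNF for each subschema: the left-hand side of every applicable FD must determine all attributes of that subschema.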

Decomposition into ABC & CDE and preserving functional dependencies

Consider a relation R with five attributes ABCDE. Now
assume that R is decomposed into two smaller relations ABC and CDE.
Define S to be the relation (ABC NaturalJoin CDE).
a) Assume that the above decomposition is lossless join. What is the
dependency that guarantees the lossless join property.
b) Give an additional FD such that “dependency preserving” property is
violated by this decomposition.
c) Give two additional FD's that would be preserved by this
decomposition.
This question seems different to me because no FDs are given, and it is asking:
a)
R1 = (A,B,C), R2 = (C,D,E), R1 ∩ R2 = C (how can I check the dependencies now?)
F1' = {A->B,A->C,B->C,B->A,C->A,C->B,AB->C,AC->B,BC->A...}
F2' = {C->D,C->E,D->E....}
Then will I find F'?
b, c) How do I check? Do I need to look at all possible FDs for R1 and R2?
The question is definitely assuming things it hasn't said clearly. ABCDE could be subject to the JD *{ABC,CDE} while not being subject to any nontrivial FDs at all.
But suppose that the relation is subject to some FDs and isn't subject to any JDs other than the ones they imply. If C is a CK then the join is lossless. But then C -> ABCDE holds, because a CK determines all attributes, and C -> ABDE holds, because a CK determines all other attributes. No other FD's holding would imply that the join is lossless, although showing that requires either tedium (checking every possible case of CK) or inspiration.
Either of these FDs guarantees losslessness, and if one of them holds then the other holds: they express the same condition. So the question is sloppy. Or the question might consider the two expressions to express the same FD in the sense of a condition; but an FD is an expression, not a condition, so that would also be sloppy.
I suspect that the questioner really just wanted you to give some FD whose holding would guarantee losslessness. That would get rid of the complications.
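To see why a key on the shared attribute matters, here is a toy instance (all values invented for illustration) in which C is not a CK, so the decomposition into ABC and CDE is lossy:

R  = table([1;2], [10;20], [5;5], [7;8], [9;9], ...
           'VariableNames', {'A','B','C','D','E'});   % two tuples sharing C = 5
R1 = R(:, {'A','B','C'});
R2 = R(:, {'C','D','E'});
S  = innerjoin(R1, R2, 'Keys', 'C')
% S has 4 rows instead of 2: the join manufactures the spurious tuples
% (1,10,5,8,9) and (2,20,5,7,9), so information has been lost.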

RSA-Calculate d without p and q

I have been given a task to decrypt a message. However, I am only given the values of n and e. So, is it still possible to find the value of d? Is there any shortcut formula that can calculate d without knowing p and q?
The security of RSA is derived from the difficulty in calculating d from e and n (the public key). It sounds like the task you have been set is essentially to break RSA by factoring n into its prime factors p and q, and then using these to calculate d. Assuming n is not too large, factorization should be relatively easy (Wolfram|Alpha may be able to do it for example).
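For a small n the whole pipeline fits in a few lines of MATLAB; the numbers below are a made-up textbook-sized example, not values from the question:

n = 3233; e = 17;            % toy public key
f = factor(n);               % brute-force factorization; feasible only for small n
p = f(1); q = f(2);          % 53 and 61
phi = (p - 1) * (q - 1);     % Euler's totient: 3120
[g, x] = gcd(e, phi);        % extended Euclid gives g = e*x + phi*y
d = mod(x, phi)              % modular inverse of e: 2753, since mod(e*d, phi) == 1

For realistically sized n (1024 bits and up) the factorization step is exactly what makes RSA secure, so no shortcut formula exists.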

How to allocate the memory for b in LAPACK sgelsd routine

According to the official user guide, sgelsd is used to solve the least-squares problem
min_x || b - Ax ||_2
and allows the matrix A to be rectangular and rank-deficient. According to the interface description in the sgelsd source code, b is used as an input-output parameter: when sgelsd finishes, b stores the solution. So b occupies m*sizeof(float) bytes, while the solution x needs n*sizeof(float) bytes (assume A is an m*n matrix and b is an m*1 vector).
However, when n > m, the memory allocated for b is too small to store the solution x. How do I deal with this situation? I could not work it out from the comments in the sgelsd source code. Can I just allocate n*sizeof(float) bytes for b and use the first m*sizeof(float) bytes to store the b vector?
Thanks.
This example from Intel MKL has the answer. B is allocated as LDB-by-NRHS (LDB = max(M,N)) and zero-padded. Note that the input B is not necessarily a 1-vector; SGELSD can handle multiple least-squares problems at the same time (hence NRHS).
From the Lapack docs for SGELSD:
[in,out] B
B is REAL array, dimension (LDB,NRHS)
On entry, the M-by-NRHS right hand side matrix B.
On exit, B is overwritten by the N-by-NRHS solution
matrix X. If m >= n and RANK = n, the residual
sum-of-squares for the solution in the i-th column is given
by the sum of squares of elements n+1:m in that column.
[in] LDB
LDB is INTEGER
The leading dimension of the array B. LDB >= max(1,max(M,N)).
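To make the sizes concrete, here is a MATLAB sketch of the same allocation pattern with made-up dimensions; pinv stands in for what sgelsd computes (the minimum-norm least-squares solution), since the point is only where the input and output live inside B:

M = 3; N = 5; NRHS = 2;             % underdetermined case: n > m
A = rand(M, N, 'single');
rhs = rand(M, NRHS, 'single');

LDB = max(M, N);                    % sgelsd requires LDB >= max(1, max(M, N))
B = zeros(LDB, NRHS, 'single');     % allocate the larger buffer up front...
B(1:M, :) = rhs;                    % ...with the right-hand sides in the top M rows

B(1:N, :) = pinv(A) * B(1:M, :);    % sgelsd overwrites B in place like this
X = B(1:N, :)                       % the N-by-NRHS solution fits because LDB >= N

So the answer is yes: when n > m, allocating n*sizeof(float) bytes per column and storing the input in the first m of them is exactly right; in general, allocate max(m,n) rows, write the input into the first m, and read the solution from the first n after the call.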