Most appropriate analysis method - Clustering? - merge

I have 2 large data frames with similar variables representing 2 separate surveys. Some rows (participants) in each data frame correspond to the other and I would like to link these two together.
There is an index in both data frames, though this index indicates the locality of the survey (i.e. region) and not individual IDs.
A direct merge is not possible because, in most cases, different participants share the same index value.
Since merging on the index alone is not possible, I wish to compare similar (binary) variables from both data frames, in addition to the index values common to both, to find the most likely match for each row. I can then (with some margin of error) match the rows with the most similar values on these variables and merge them.
What do you think would be the appropriate method for doing this? Clustering?
Best,
James

That obviously is not clustering. You don't want large groups of records.
What you want to do is an approximate JOIN.
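To make the idea concrete, here is a minimal Python sketch of one way an approximate join could look: block on the shared region index, score candidate pairs by the number of binary variables that differ (Hamming distance), and greedily accept the closest pairs. The function name, the column names and the `max_mismatches` threshold are all assumptions made for illustration, not something taken from the question.

```python
from collections import defaultdict

def approximate_join(left, right, key, features, max_mismatches=1):
    """Greedily pair rows that share `key` and differ in at most
    `max_mismatches` of the binary `features`."""
    # Block on the shared index so only rows from the same region are compared.
    blocks = defaultdict(list)
    for j, row in enumerate(right):
        blocks[row[key]].append(j)

    # Score every candidate pair by Hamming distance over the binary variables.
    candidates = []
    for i, lrow in enumerate(left):
        for j in blocks.get(lrow[key], []):
            dist = sum(lrow[f] != right[j][f] for f in features)
            if dist <= max_mismatches:
                candidates.append((dist, i, j))

    # Accept the closest pairs first; each row is matched at most once.
    used_left, used_right, matches = set(), set(), []
    for dist, i, j in sorted(candidates):
        if i not in used_left and j not in used_right:
            used_left.add(i)
            used_right.add(j)
            matches.append((i, j, dist))
    return matches

# Hypothetical toy surveys; the variable names are made up.
survey_a = [
    {"region": 1, "smoker": 1, "employed": 0, "urban": 1},
    {"region": 1, "smoker": 0, "employed": 1, "urban": 1},
    {"region": 2, "smoker": 1, "employed": 1, "urban": 0},
]
survey_b = [
    {"region": 1, "smoker": 0, "employed": 1, "urban": 1},
    {"region": 1, "smoker": 1, "employed": 0, "urban": 0},
    {"region": 2, "smoker": 0, "employed": 0, "urban": 1},
]
matches = approximate_join(survey_a, survey_b, "region", ["smoker", "employed", "urban"])
```

Each entry of `matches` is (row in the first survey, row in the second survey, number of mismatching variables). Rows with no sufficiently close partner stay unmatched, which is where the margin of error comes in; the greedy pairing is a simplification and an optimal assignment method may do better.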

Related

Generate subset of data with known mean

I have a dataset of n observations (nx1 vector) and would like to create a subset of this data, whose mean is known in advance, by selecting at random only n/3 observations (or within some constraint, ie where the mean of the data subset is within a range about the known mean).
Can someone please help me with the code to do this in MATLAB?
Note, I don't want to use the rand function to create random data as I already have my data collected.
For example on a smaller scale: If I had the following dataset of 12 observations:
data = [8;7;4;6;9;6;4;7;3;2;1;1];
but then wanted to randomly select a subset of this data containing only 4 observations with a mean of 4 (or with a mean between 3.5-4.5 for example):
Then the answer might be datasubset=[7;3;2;4] but the answer could also be datasubset=[6;4;2;4] or datasubset=[6;4;3;4].
It doesn't matter if there are several possible solutions, I just need one of them, though I'd like to know the alternative solutions also.
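One possible approach, sketched here in Python rather than MATLAB, is rejection sampling: keep drawing random subsets of the existing data until one has a mean in the target range. For a dataset this small, all qualifying subsets can also be enumerated exhaustively. The function name, the fixed seed and the `max_tries` limit are assumptions made for this sketch.

```python
import random
from itertools import combinations

data = [8, 7, 4, 6, 9, 6, 4, 7, 3, 2, 1, 1]
k, low, high = 4, 3.5, 4.5

def random_subset_with_mean(values, k, low, high, rng, max_tries=100_000):
    """Draw k values without replacement until their mean lies in [low, high]."""
    for _ in range(max_tries):
        subset = rng.sample(values, k)
        if low <= sum(subset) / k <= high:
            return subset
    return None  # no qualifying subset found within max_tries

subset = random_subset_with_mean(data, k, low, high, random.Random(0))

# Exhaustive alternative, feasible only for small inputs:
# every combination of k positions whose mean falls in the range.
all_solutions = [
    [data[i] for i in idx]
    for idx in combinations(range(len(data)), k)
    if low <= sum(data[i] for i in idx) / k <= high
]
```

Rejection sampling only reuses the collected observations, so no synthetic data is generated. The exhaustive list grows combinatorially and is unlikely to be practical for large n.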

Intersecting two tables with one common row elements in matlab

I have two different tables (.csv files) as:
I need to merge these two tables in MATLAB by intersecting the first columns of both tables. I want a new, separate table with six columns (the combined columns of both tables), with as many rows as there are elements common to the first columns of both tables.
How should I do the intersection and merging of these two tables?
I'm proposing an answer. I'm not claiming it is the best answer. In fact, I think there are probably much faster ones! Also note that I do not have MATLAB in front of me right now and can't test this. There might be some mistakes.
First of all, read the .csv files into memory. In table 1, convert the first column into numeric data (currently, it looks like they are strings). After this step, you want to have two double matrices I'll call table1 (which is 3296x5) and table2 (which is 3184x3).
Second (this is where it gets mildly interesting; step 1 was the boring stuff), find all IDs that are common to both tables. This can be done by calling commonIDs=intersect(table1(:,1),table2(:,1)).
Third, get the indices of the common rows for table1. Then repeat for table2. This is done using the ismember function as follows:
goodEntries1=ismember(table1(:,1),commonIDs);
goodEntries2=ismember(table2(:,1),commonIDs);
Lastly, we extract data and combine to get a result. Note that I only include the ID column once:
result=[table1(goodEntries1,:) table2(goodEntries2,2:end)];
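You will need to test this to make sure it is robust. ismember returns a logical mask, so the selected rows of each table stay in their original order. The rows therefore pair up correctly only if the IDs are unique within each table and both tables are sorted by ID; otherwise you might end up combining mismatched rows (for instance, table1's ID=5 with table2's ID=6).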
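To sidestep the ordering concern, rows can be paired through a lookup keyed on the ID instead of through two independent masks. A minimal sketch of that idea in Python (not MATLAB), with made-up data; the function name is hypothetical and IDs are assumed to be unique within each table:

```python
def inner_join_on_first_column(table1, table2):
    """Join rows whose first-column IDs appear in both tables.

    Assumes IDs are unique within each table.
    """
    # Index table2 by ID so each table1 row is paired with the row carrying the same ID.
    lookup = {row[0]: row[1:] for row in table2}
    # Keep table1's row order; the ID column is included only once.
    return [row + lookup[row[0]] for row in table1 if row[0] in lookup]

# Toy data: the tables are deliberately sorted differently.
table1 = [[5, 0.1, 0.2], [6, 0.3, 0.4], [9, 0.5, 0.6]]
table2 = [[9, 10.0], [5, 20.0], [7, 30.0]]
result = inner_join_on_first_column(table1, table2)
```

Because each row is looked up by its own ID, the result stays correct even when the two tables are in different orders.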

Selecting the proper db index

I have a table with 10+ million tuples in my Postgres database that I will be querying. There are 3 fields: "layer" (integer), "time", and "cnt". Many records share the same values for "layer" (distributed from 0 to about 5 or so, heavily concentrated between 0-2). "time" has relatively unique values, but during queries the values will be manipulated such that some will be duplicates, and then they will be grouped to account for those duplicates. "cnt" is just used to count.
I am trying to query records from certain layers (WHERE layer = x) between certain times (WHERE time <= y AND time >= z), and I will be using "time" as my GROUP BY field. I currently have 4 indexes, one each on (time), (layer), (time, layer), and (layer, time), and I believe this is too many (I copied this from a template provided by my supervisor).
From what I have read online, fields with relatively unique values, as well as fields that are frequently-searched, are good candidates for indexing. I have also seen that having too many indexes will hinder the performance of my query, which is why I know I need to drop some.
This leads me to believe that the best index choice would be on (time, layer) (I assume a btree is fine because I have not seen reason to use anything else), because while I query slightly more frequently on layer, time better fits the criterion of having more relatively unique values. Or, should I just have 2 indices, 1 on layer and 1 on time?
Also, is an index on (time, layer) any different from (layer, time)? Because that is one of the confusions that led me to have so many indices. The provided template has multiple indices with the same 3 attributes, just arranged in different orders...
Your where clause appears to be:
WHERE layer = x and time <= y AND time >= z
For this query, you want an index on (layer, time). You could include cnt in the index so the index covers the query -- that is, all data columns are in the index so the original data pages don't need to be accessed for the data (they may be needed for locking information).
Your original four indexes are redundant, because the single-column indexes are not needed. The advice to create all four is not good advice. However, (layer, time) and (time, layer) are different indexes and under some circumstances, it is a good idea to have both.
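As a small illustration of the (layer, time) ordering, here is a sketch using Python's built-in sqlite3 module. Note that SQLite is only standing in for Postgres here, so the planner output and exact behaviour will differ; the table name, index name and sample data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (layer INTEGER, time INTEGER, cnt INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(t % 3, t, 1) for t in range(1000)],  # toy data: layer cycles through 0, 1, 2
)
# Equality column first, range column second; cnt is appended so the index covers the query.
conn.execute("CREATE INDEX idx_layer_time_cnt ON readings (layer, time, cnt)")

query = """
    SELECT time, SUM(cnt)
    FROM readings
    WHERE layer = ? AND time >= ? AND time <= ?
    GROUP BY time
"""
rows = conn.execute(query, (1, 100, 110)).fetchall()

# The last column of each EXPLAIN QUERY PLAN row is the human-readable plan step.
plan = " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + query, (1, 100, 110)))
```

With the equality column leading, all rows for one layer within a time range sit next to each other in the index, so they can be read in a single range scan. In Postgres the equivalent check would be done with EXPLAIN on the real table.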

Database solution to store and aggregate vectors?

I'm looking for a way to solve a data storage problem for a project.
The Data:
We have a batch process that generates 6000 vectors of size 3000 each daily. Each element in the vectors is a DOUBLE. For each of the vectors, we also generate tags like "Country", "Sector", "Asset Type" and so on (It's financial data).
The Queries:
What we want to be able to do is see aggregates by tag of each of these vectors. So for example if we want to see the vectors by sector, we want to get back a response that gives us all the unique sectors and a 3000x1 vector that is the sum of all the vectors of each element tagged by that sector.
What we've tried:
It's easy enough to implement a normalized star schema with 2 tables, one with the tagging information and an ID and a second table that has "VectorDate, ID, ElementNumber, Value" which will have a row to represent each element for each vector. Unfortunately, given the size of the data, it means we add 18 million records to this second table daily. And since our queries need to read (and add up) all 18 million of these records, it's not the most efficient of operations when it comes to disk reads.
Sample query:
SELECT T1.country, T2.ElementNumber, SUM(T2.Value)
FROM T1 INNER JOIN T2 ON T1.ID=T2.ID
WHERE VectorDate = 20140101
GROUP BY T1.country, T2.ElementNumber
I've looked into NoSQL solutions (which I don't have experience with) and have seen that some, like MongoDB, allow storing an entire vector as part of a single document. However, I'm unsure whether they would efficiently support the aggregations we need (adding each element of the vector in a document to the corresponding element of other documents' vectors). I read that the $unwind operation this would require isn't that efficient either?
It would be great if someone could point me in the direction of a database solution that can help us solve our problem efficiently.
Thanks!
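For reference, the aggregation described above amounts to an element-wise sum of whole vectors grouped by a tag value. A minimal Python sketch of that computation, with a hypothetical record layout and tiny 3-element vectors standing in for the real 3000-element ones:

```python
def sum_vectors_by_tag(records, tag):
    """Return {tag value: element-wise sum of all vectors carrying that value}."""
    totals = {}
    for rec in records:
        key = rec["tags"][tag]
        vec = rec["vector"]
        if key not in totals:
            totals[key] = list(vec)  # copy so the input record is not modified
        else:
            acc = totals[key]
            for i, v in enumerate(vec):
                acc[i] += v
    return totals

# Made-up records; each vector is kept whole rather than split into one row per element.
records = [
    {"tags": {"Country": "US", "Sector": "Tech"}, "vector": [1.0, 2.0, 3.0]},
    {"tags": {"Country": "US", "Sector": "Energy"}, "vector": [0.5, 0.5, 0.5]},
    {"tags": {"Country": "DE", "Sector": "Tech"}, "vector": [10.0, 20.0, 30.0]},
]
by_sector = sum_vectors_by_tag(records, "Sector")
```

Keeping each vector as a single unit means a day's data is 6000 records rather than 18 million rows. Whether a particular database can run this aggregation efficiently is a separate question that this sketch does not answer.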

In preprocessing data with high cardinality, do you hash first or one-hot-encode first?

Hashing reduces dimensionality while one-hot-encoding essentially blows up the feature space by transforming multi-categorical variables into many binary variables. So it seems like they have opposite effects. My questions are:
What is the benefit of doing both on the same dataset? I read something about capturing interactions but not in detail - can somebody elaborate on this?
Which one comes first and why?
Binary one-hot-encoding is needed for feeding categorical data to linear models and SVMs with the standard kernels.
For example, you might have a feature which is the day of the week. Then you create a one-hot encoding with one binary indicator per day:
1000000 Sunday
0100000 Monday
0010000 Tuesday
...
0000001 Saturday
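The encoding above can be reproduced with a few lines of Python; the `one_hot` helper is a name chosen for this sketch, and the Sunday-first ordering simply follows the listing above.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value` in `categories`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]

# Render each day as a bit string, e.g. "1000000" for Sunday.
encoded = {day: "".join(map(str, one_hot(day, DAYS))) for day in DAYS}
```

Each category gets its own column, so a feature with many distinct values produces an equally wide vector.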
Feature hashing is mostly used to allow significant storage compression for parameter vectors: one hashes the high-dimensional input vectors into a lower-dimensional feature space. The parameter vector of the resulting classifier can then live in the lower-dimensional space instead of the original input space. This serves as a method of dimensionality reduction, so you usually expect to trade a small loss in performance for a significant storage benefit.
The example in Wikipedia is a good one. Suppose you have three documents:
John likes to watch movies.
Mary likes movies too.
John also likes football.
Using a bag-of-words model, you first create a document-term matrix (each row is a document, and each entry in the matrix indicates whether a word appears in that document).
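As a rough illustration in Python, the matrix for these three documents could be built as follows. The names `tokenize`, `vocab` and `matrix` are chosen for this sketch, and ordering the columns by first appearance is an assumption.

```python
docs = [
    "John likes to watch movies.",
    "Mary likes movies too.",
    "John also likes football.",
]

def tokenize(text):
    """Split on whitespace and strip the trailing period; case is kept as-is."""
    return [w.strip(".") for w in text.split()]

# Vocabulary (the "dictionary") in order of first appearance across the documents.
vocab = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

# One row per document, one column per vocabulary word; 1 means the word occurs.
matrix = [[1 if w in tokenize(doc) else 0 for w in vocab] for doc in docs]
```

Here `vocab` holds 9 distinct words, so every document becomes a 9-dimensional binary vector, and the dictionary has to be kept around to map words to columns.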
The problem with this process is that such dictionaries take up a large amount of storage space, and grow in size as the training set grows.
Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words) in the items under consideration, then using the hash values directly as feature indices and updating the resulting vector at those indices.
Suppose you generate the hashed features below with 3 buckets. (You apply k different hash functions to the original features and count how many times the hashed values hit each bucket.)
       bucket1  bucket2  bucket3
doc1:     3        2        0
doc2:     2        2        0
doc3:     1        0        2
You have now reduced the features from 9 dimensions to 3.
A more interesting application of feature hashing is to do personalization. The original paper of feature hashing contains a nice example.
Imagine you want to design a spam filter that is customized to each user. The naive way of doing this is to train a separate classifier for each user, which is infeasible with regard to both training (training and updating every personalized model) and serving (holding all classifiers in memory). A smart way is illustrated below:
Each token is duplicated and one copy is individualized by concatenating each word with a unique user id. (See USER123_NEU and USER123_Votre).
The bag-of-words model now holds the common keywords and also user-specific keywords.
All words are then hashed into a low-dimensional feature space, where the classifier is trained and documents are classified.
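The three steps above can be sketched in Python as follows. The function name, the MD5-based hash and the choice of 16 buckets are assumptions made for illustration and are not taken from the paper.

```python
import hashlib

def personalized_hashed_features(tokens, user_id, n_buckets):
    """Hash each token twice: once as-is (shared) and once prefixed with the user id."""
    vector = [0] * n_buckets
    for token in tokens:
        # The shared copy and the user-specific copy go into the same hashed space.
        for feature in (token, f"{user_id}_{token}"):
            digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
            vector[int(digest, 16) % n_buckets] += 1
    return vector

features = personalized_hashed_features(["NEU", "Votre"], "USER123", 16)
```

A single classifier trained on such vectors can learn global weights through the shared copies and per-user adjustments through the prefixed copies, without storing one model per user.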
Now to answer your questions:
Yes. One-hot encoding should come first, since it transforms a categorical feature into binary features that linear models can consume.
You can certainly apply both on the same dataset, as long as there is a benefit to using the compressed feature space. Note that if you can tolerate the original feature dimension, feature hashing is not required. For example, in a common digit recognition problem such as MNIST, each image is represented by 28x28 pixels, so the input dimension is only 784. Feature hashing brings no benefit in this case.