Database solution to store and aggregate vectors? - mongodb

I'm looking for a way to solve a data storage problem for a project.
The Data:
We have a batch process that generates 6000 vectors of size 3000 each daily. Each element in the vectors is a DOUBLE. For each of the vectors, we also generate tags like "Country", "Sector", "Asset Type" and so on (It's financial data).
The Queries:
What we want to be able to do is see aggregates by tag of each of these vectors. So for example if we want to see the vectors by sector, we want to get back a response that gives us all the unique sectors and a 3000x1 vector that is the sum of all the vectors of each element tagged by that sector.
What we've tried:
It's easy enough to implement a normalized star schema with 2 tables, one with the tagging information and an ID and a second table that has "VectorDate, ID, ElementNumber, Value" which will have a row to represent each element for each vector. Unfortunately, given the size of the data, it means we add 18 million records to this second table daily. And since our queries need to read (and add up) all 18 million of these records, it's not the most efficient of operations when it comes to disk reads.
Sample query:
SELECT T1.country, T2.ElementNumber, SUM(T2.Value)
FROM T1 INNER JOIN T2 ON T1.ID=T2.ID
WHERE VectorDate = 20140101
GROUP BY T1.country, T2.ElementNumber
I've looked into NoSQL solutions (which I don't have experience with) but seen that some, like MongoDB allow for storing entire vectors as part of a single document - but I'm unsure if they would allow aggregations like we're trying efficiently (adding each element of the vector in a document to the corresponding element of other documents' vectors). I read the $unwind operation required isn't that efficient either?
It would be great if someone could point me in the direction of a database solution that can help us solve our problem efficiently.
Thanks!

Related

Most appropriate analysis method - Clustering?

I have 2 large data frames with similar variables representing 2 separate surveys. Some rows (participants) in each data frame correspond to the other and I would like to link these two together.
There is an index in both dataframes though this index indicates locality of the survey (i.e region) and not individual IDs.
Merging is not possible as in most cases there is an identical index values for different participants.
Given that merging on an index value from the 2 data frames is not possible, I wish to compare similar variables (binary) from both data frames that (in addition to the index values common to both data frame) in order to give me a highest likelihood of a match. I can then (with some margin of error) match rows with the most similar values for similar variables and merge them together.
What do you think would be the appropriate method for doing this? Clustering?
Best,
James
That obviously is not clustering. You don't want large groups of records.
What you want to do is an approximate JOIN.

Are there alternative solution without cross-join in Spark 2?

Stackoverflow!
I wonder if there is a fancy way in Spark 2.0 to solve the situation below.
The situation is like this.
Dataset1 (TargetData) has this schema and has about 20 milion records.
id (String)
vector of embedding result (Array, 300 dim)
Dataset2 (DictionaryData) has this schema and has about 9,000 records.
dict key (String)
vector of embedding result (Array, 300 dim)
For each vector of records in dataset 1, I want to find the dict key that will be the maximum when I compute cosine similarity it with dataset 2.
Initially, I tried cross-join dataset1 and dataset2 and calculate cosine simliarity of all records, but the amount of data is too large to be available in my environment.
I have not tried it yet, but I thought of collecting dataset2 as a list and then applying udf.
Are there any other method in this situation?
Thanks,
There might be two options the one is to broadcast Dataset2 since you need to scan it for each row of Dataset1 thus avoid the network delays by accessing it from a different node. Of course in this case you need to consider first if your cluster can handle the memory cost which 9000rows x 300cols(not too big in my opinion). Also you still need your join although with broadcasting should be faster. The other option is to populate a RowMatrix from your existing vectors and leave spark do the calculations for you

Best way to store data : Many columns vs many rows for a case of 10,000 new rows a day

after checking a lot of similar questions on stackoverflow, it seems that context will tell which way is the best to hold the data...
Short story, I add over 10,000 new rows of data in a very simple table containing only 3 columns. I will NEVER update the rows, only doing selects, grouping and making averages. I'm looking for the best way of storing this data to make the average calculations as fast as possible.
To put you in context, I'm analyzing a recorded audio file (Pink Noise playback in a sound mixing studio) using FFTs. The results for a single audio file is always in the same format: The frequency bin's ID (integer) and its value in decibels (float value). I'm want to store these values in a PostgreSQL DB.
Each bin (band) of frequencies (width = 8Hz) gets an amplitude in decibels. The first bin is ignored, so it goes like this (not actual dB values):
bin 1: 8Hz-16Hz, -85.0dB
bin 2: 16Hz-32Hz, -73.0dB
bin 3: 32Hz-40Hz, -65.0dB
...
bin 2499: 20,000Hz-20,008Hz, -49.0dB
The goal is to store an amplitude of each bin from 8Hz through 20,008Hz (1 bin covers 8Hz).
Many rows approach
For each analyzed audio file, there would be 2,499 rows of 3 columns: "Analysis UID", "Bin ID" and "dB".
For each studio (4), there is one recording daily that is to be appended in the database (that's 4 times 2,499 = 9,996 new rows per day).
After a recording in one studio, the new 2,499 rows are used to show a plot of the frequency response.
My concern is that we also need to make a plot of the averaged dB values of every bin in a single studio for 5-30 days, to see if the frequency response tends to change significantly over time (thus telling us that a calibration is needed in a studio).
I came up with the following data structure for the many rows approach:
"analysis" table:
analysisUID (serial)
studioUID (Foreign key)
analysisTimestamp
"analysis_results" table:
analysisUID (Foreign key)
freq_bin_id (integer)
amplitude_dB (float)
Is this the optimal way of storing data? A single table holding close to 10,000 new rows a day and making averages of 5 or more analysis, grouping by analysisUIDs and freq_bin_ids? That would give me 2,499 rows (each corresponding to a bin and giving me the averaged dB value).
Many columns approach:
I thought I could do it the other way around, breaking the frequency bins in 4 tables (Low, Med Low, Med High, High). Since Postgres documentation says the column limit is "250 - 1600 depending on column types", it would be realistic to make 4 tables containing around 625 columns (2,499 / 4) each representing a bin and containing the "dB" value, like so:
"low" table:
analysisUID (Foreign key)
freq_bin_id_1_amplitude_dB (float)
freq_bin_id_2_amplitude_dB (float)
...
freq_bin_id_625_amplitude_dB (float)
"med_low" table:
analysisUID (Foreign key)
freq_bin_id_626_amplitude_dB (float)
freq_bin_id_627_amplitude_dB (float)
...
freq_bin_id_1250_amplitude_dB (float)
etc...
Would the averages be computed faster if the server only has to Group by analysisUIDs and make averages of each column?
Rows are not going to be an issue, however, the way in which you insert said rows could be. If insert time is one of the primary concerns, then make sure you can bulk insert them OR go for a format with fewer rows.
You can potentially store all the data in a jsonb format, especially since you will not be doing any updates to the data-- it may be convenient to store it all in one table at a time, however the performance may be less.
In any case, since you're not updating the data, the (usually default) fillfactor of 100 is appropriate.
I would NOT use the "many column" approach, as the
amount of data you're talking about really isn't that much. Using your first example of 2 tables and few columns is very likely the optimal way to do your results.
It may be useful to index the following columns:
analysis_results.freq_bin_id
analysis.analysisTimestamp
As to breaking the data into different sections, it'll depend on what types of queries you're running. If you're looking at ALL freq bins, using multiple tables will just be a hassle and net you nothing.
If only querying at some freq_bin's at a time, it could theoretically help, however, you're basically doing table partitions and once you've moved into that land, you might as well make a partition for each frequency band.
If I were you, I'd create your first table structure, fill it with 30 days worth of data and query away. You may (as we often do) be overanalyzing the situation. Postgres can be very, very fast.
Remember, the raw data you're analyzing is something on the order of a few (5 or less) meg per day at an absolute maximum. Analyzing 150 mb of data is no sweat for a DB running with modern hardware if it's indexed and stored properly.
The optimizer is going to find the correct rows in the "smaller" table really, really fast and likely cache all of those, then go looking for the child rows, and it'll know exactly what ID's and ranges to search for. If your data is all inserted in chronological order, there's a good chance it'll read it all in very few reads with very few seeks.
My main concern is with the insert speed, as a doing 10,000 inserts can take a while if you're not doing bulk inserts.
Since the measurements seem well behaved, you could use an array, using the freq_bin as an index (Note: indices are 1-based in sql)
This has the additional advantage of the aray being stored in toasted storage, keeping the fysical table small.
CREATE TABLE herrie
( analysisUID serial NOT NULL PRIMARY KEY
, studioUID INTEGER NOT NULL REFERENCES studio(studioUID)
, analysisTimestamp TIMESTAMP NOT NULL
, decibels float[] -- array with 625 measurements
, UNIQUE (studioUID,analysisTimestamp)
);

MongoDB and using DBRef with Spatial Data

I have a collection with 100 million documents of geometry.
I have a second collection with time data associated to each of the other geometries. This will be 365 * 96 * 100 million or 3.5 trillion documents.
Rather than store the 100 million entries (365*96) times more than needed, I want to keep them in separate collections and do a type of JOIN/DBRef/Whatever I can in MongoDB.
First and foremost, I want to get a list of GUIDs from the geometry collection by using a geoIntersection. This will filter it down to 100 million to 5000. Then using those 5000 geometries guids I want to filter the 3.5 trillion documents based on the 5000 goemetries and additional date criteria I specify and aggregate the data and find the average. You are left with 5000 geometries and 5000 averages for the date criteria you specified.
This is basically a JOIN as I know it in SQL, is this possible in MongoDB and can it be done optimally in say less than 10 seconds.
Clarify: as I understand, this is what DBrefs is used for, but I read that it is not efficient at all, and with dealing with this much data that it wouldn't be a good fit.
If you're going to be dealing with a geometry and its time series data together, it makes sense to store them in the same doc. A years worth of data in 15 minute increments isn't killer - and you definitely don't want a document for every time-series entry! Since you can retrieve everything you want to operate on as a single geometry document, it's a big win. Note that this also let's you sparse things up for missing data. You can encode the data differently if it's sparse rather than indexing into a 35040 slot array.
A $geoIntersects on a big pile of geometry data will be a performance issue though. Make sure you have some indexing on (like 2dsphere) to speed things up.
If there is any way you can build additional qualifiers into the query that could cheaply eliminate members from the more expensive search, you may make things zippier. Like, say the search will hit states in the US. You could first intersect the search with state boundaries to find the states containing the geodata and use something like a postal code to qualify the documents. That would be a really quick pre-search against 50 documents. If a search boundary was first determined to hit 2 states, and the geo-data records included a state field, you just winnowed away 96 million records (all things being equal) before the more expensive geo part of the query. If you intersect against smallish grid coordinates, you may be able to winnow it further before the geo data is considered.
Of course, going too far adds overhead. If you can correctly tune the system to the density of the 100 million geometries, you may be able to get the times down pretty low. But without actually working with the specifics of the problem, it's hard to know. That much data probably requires some specific experimentation rather than relying on a general solution.

Different Aggregation calculations of a measure using two dimensions in Tableau

It is a Tableau 8.3 Desktop Edition question.
I am trying to aggregate data using two different dimensions. So, I want to aggregate twice: first I want to sum over all the rows and then multiply the results in a cummulative manner (so I can build a graph). How do I do that? Ok, too vague, here follow some more details:
I have a set of historical data. The columns are the date, the rows are the categories.
Easy part: I would like to sum all the rows.
Hard part: Given this those summations I want to build a graph that for each date it shows the product of all the summations from the earlier date till this date.
In another words:
Take the sum of all rows, call it x_i, where i is the date.
For each date i find y_i such that y_i = x_0 * x_1 * ... * x_i (if there is missing data, consider it to be one)
Then show a line graph for the y values versus the date.
I have searched for a solution for this and tried to figure it out by myself, but failed.
Thank you very much for your time and help :)
You need n calculated fields (number of columns you have), and manually do the calculation you need:
y_i = sum(field0)*sum(field1)
Basically because you cannot iterate on columns. For tableau, each column represent a different dimension or measure. So it won't consider that there is a logic order among them, meaning, it won't assume that column A comes before column B. It will assume A and B are different things.
Tableau works better with tables organized as databases. So if you have year columns, you should reorganize your data, eliminate all those columns and create a single field called 'Date', which will identify the value of your measure for that date. Yes, you will have less columns but far more rows. But Tableau works better this way (for very good reasons).
Tableau 9.0 allows you to do that directly. I only watched a demo (it was launched yesterday), but I understand that now there is an option to selected those columns (in the Data Connection tab) and convert them to a database format.
With that done, you can use a PREVIOUS_VALUE function to help you. I'm not with Tableau right now. As soon as I get to it I'll update this with the final answer . Unless you take the lead and discover yourself before that ;)