Using aggregate to create a two-column matrix

I wanted to politely ask if anyone knew how to solve this exercise using aggregate. I was able to solve it with sapply, but I don't know how to do it with aggregate. Your help is appreciated, thank you.
Using the mtcars dataset and aggregate, create a two-column matrix that stores in the first column the median wt by gear and in the second column the hp of the group (in the first row there is the median wt for cars with 3 gears, in the second row the median wt for cars with 4 gears, etc.).
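A minimal sketch of what such an aggregate call might look like (assuming the exercise wants the median hp of each gear group in the second column, which the wording leaves ambiguous):

# The formula interface computes the median of both columns per gear group.
res <- aggregate(cbind(wt, hp) ~ gear, data = mtcars, FUN = median)
# Drop the grouping column to get the two-column matrix the exercise asks for.
mat <- as.matrix(res[, c("wt", "hp")])
rownames(mat) <- res$gear   # rows correspond to 3, 4, and 5 gears
mat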

Related

Power BI: Finding average of averages and STDEV.P of averages

All,
My overall objective is to find outliers within an aggregated data set vs. the underlying detail for different date ranges. The issue I am having is that Power BI is averaging the SalesPerDay and finding the STDEV.P at the daily level, which is the grain of the raw data. I need to first find the average Sales, then find the average of those averages for that "rolled up" data set. The same goes for STDEV.P: I need to find the STDEV of the "rolled up" averages. The screenshot below depicts how I need the tool to aggregate.
I have brought the Sales column into my dashboard, dimensionalized by user, and set it to AVERAGE to get the average SalesPerDay.
Then I created the new measure
newavg = CALCULATE(AVERAGE(SalesPerDay[Sales]),ALLSELECTED())
This finds the overall average, but at the daily level rather than at the aggregated level.
I also tried
newSTDV = CALCULATE(STDEV.P(AVERAGE(SalesPerDay[Sales])),ALLSELECTED())
But you cannot find the STDEV.P of a calculation.
Thank you.
What you are looking for are the iterator functions, which take a table or column of data as a grouping and then apply a calculation to each group.
An example of one is SUMX. In the example below, it groups by Product. Within each product it takes the total of Qty and multiplies it by the sum of x. It then sums the results of those per-product calculations into a total.
SUMX( VALUES( table1[Product] ), [Qty] * [x] )
There are also AVERAGEX, MINX, and MAXX, and for the statistical functions there are STDEVX.P and STDEVX.S.
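For the question above, a hedged sketch of that pattern (assuming "rolled up" means one average per user and that SalesPerDay[User] identifies the user; the measure names are illustrative):

AvgOfUserAverages =
    AVERAGEX(
        VALUES( SalesPerDay[User] ),
        CALCULATE( AVERAGE( SalesPerDay[Sales] ) )
    )

StdevOfUserAverages =
    STDEVX.P(
        VALUES( SalesPerDay[User] ),
        CALCULATE( AVERAGE( SalesPerDay[Sales] ) )
    )

The CALCULATE wrapper forces context transition, so the inner AVERAGE is evaluated once per user before the outer iterator averages (or takes the standard deviation of) those per-user results.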

Are there alternative solutions without cross-join in Spark 2?

Stackoverflow!
I wonder if there is a fancy way in Spark 2.0 to solve the situation below.
The situation is like this.
Dataset1 (TargetData) has this schema and about 20 million records:
id (String)
embedding vector (Array, 300 dimensions)
Dataset2 (DictionaryData) has this schema and about 9,000 records:
dict key (String)
embedding vector (Array, 300 dimensions)
For each vector in dataset 1, I want to find the dict key in dataset 2 whose vector gives the maximum cosine similarity.
Initially, I tried cross-joining dataset1 and dataset2 and calculating the cosine similarity of all record pairs, but the amount of data is too large for my environment.
I have not tried it yet, but I thought of collecting dataset2 as a list and then applying a UDF.
Are there any other methods for this situation?
Thanks,
There might be two options. One is to broadcast Dataset2: since you need to scan it for each row of Dataset1, broadcasting avoids the network delays of fetching it from a different node. Of course, in this case you first need to consider whether your cluster can handle the memory cost, which is 9,000 rows x 300 columns (not too big, in my opinion). You still need the join, although with broadcasting it should be faster. The other option is to populate a RowMatrix from your existing vectors and let Spark do the calculations for you.
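A minimal sketch of the broadcast approach in Scala (assuming both datasets are Dataset[(String, Array[Double])]; the names targetData and dictionaryData are illustrative, and this is a sketch rather than a tuned implementation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Plain cosine similarity between two equal-length dense vectors.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  var dot = 0.0; var na = 0.0; var nb = 0.0
  var i = 0
  while (i < a.length) {
    dot += a(i) * b(i); na += a(i) * a(i); nb += b(i) * b(i)
    i += 1
  }
  dot / (math.sqrt(na) * math.sqrt(nb))
}

// ~9,000 x 300 doubles is small, so collect the dictionary and broadcast it.
val dictLocal = dictionaryData.collect()            // Array[(String, Array[Double])]
val dictB = spark.sparkContext.broadcast(dictLocal)

// For each target vector, scan the broadcast dictionary and keep the best key.
val best = targetData.map { case (id, vec) =>
  val (bestKey, bestSim) = dictB.value.iterator
    .map { case (key, dvec) => (key, cosine(vec, dvec)) }
    .maxBy(_._2)
  (id, bestKey, bestSim)
}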

How to store a huge matrix in a database

I plan to use PostgreSQL to store a huge matrix.
The structure of the dataset is like below:
It's a 20,000 x 20,000 matrix.
Each element in the matrix has about 5 records that describe the features of the interactions between two nodes.
Is there any way to construct the database to make it easy to store and efficient to query?
Thanks in advance!
My first advice would be to design your table with the row and column of each matrix element.
E.g. table = {row, column, record1, record2, record3, record4, record5}, with {row, column} being the primary key.
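A minimal sketch of that design in PostgreSQL (column names and types are illustrative, and the five record columns assume the "about 5 records per element" from the question):

-- One row per matrix element, keyed by its coordinates.
CREATE TABLE matrix_element (
    row_id  integer NOT NULL,
    col_id  integer NOT NULL,
    record1 double precision,
    record2 double precision,
    record3 double precision,
    record4 double precision,
    record5 double precision,
    PRIMARY KEY (row_id, col_id)
);

-- Point lookups use the primary-key index; so do row slices on row_id.
SELECT * FROM matrix_element WHERE row_id = 42 AND col_id = 17;
SELECT record1, record2 FROM matrix_element WHERE row_id = 42;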
Hope it helps.

Different Aggregation calculations of a measure using two dimensions in Tableau

This is a Tableau 8.3 Desktop Edition question.
I am trying to aggregate data using two different dimensions, so I want to aggregate twice: first sum over all the rows, then multiply the results in a cumulative manner (so I can build a graph). How do I do that? OK, too vague; here are some more details:
I have a set of historical data. The columns are the date, the rows are the categories.
Easy part: I would like to sum all the rows.
Hard part: given those summations, I want to build a graph that, for each date, shows the product of all the summations from the earliest date up to that date.
In another words:
Take the sum of all rows, call it x_i, where i is the date.
For each date i find y_i such that y_i = x_0 * x_1 * ... * x_i (if there is missing data, consider it to be one)
Then show a line graph for the y values versus the date.
I have searched for a solution for this and tried to figure it out by myself, but failed.
Thank you very much for your time and help :)
You would need n calculated fields (as many as the number of columns you have), and to manually do the calculation you need:
y_i = sum(field0)*sum(field1)
Basically, that's because you cannot iterate over columns. For Tableau, each column represents a different dimension or measure, so it won't consider that there is a logical order among them; that is, it won't assume that column A comes before column B. It will treat A and B as different things.
Tableau works better with tables organized as databases. So if you have year columns, you should reorganize your data: eliminate all those columns and create a single field called 'Date', which will identify the value of your measure for that date. Yes, you will have fewer columns but far more rows, but Tableau works better this way (for very good reasons).
Tableau 9.0 allows you to do that directly. I have only watched a demo (it was launched yesterday), but I understand that there is now an option to select those columns (in the Data Connection tab) and convert them to a database format.
With that done, you can use the PREVIOUS_VALUE function to help you. I'm not with Tableau right now; as soon as I get to it I'll update this with the final answer. Unless you take the lead and discover it yourself before that ;)
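In the meantime, a hedged sketch of that table calculation, assuming the data has been reshaped to one row per Date/Category with a measure called [Value] (the field names are illustrative):

// Calculated field "Running Product", computed along Date:
PREVIOUS_VALUE(1) * IFNULL(SUM([Value]), 1)

PREVIOUS_VALUE(1) returns this calculation's result from the previous row (and 1 on the first row), so multiplying it by each date's sum builds the running product y_i = x_0 * x_1 * ... * x_i, with IFNULL treating missing data as 1, as the question requires.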

Database solution to store and aggregate vectors?

I'm looking for a way to solve a data storage problem for a project.
The Data:
We have a batch process that generates 6,000 vectors of size 3,000 each, daily. Each element in the vectors is a DOUBLE. For each of the vectors, we also generate tags like "Country", "Sector", "Asset Type" and so on (it's financial data).
The Queries:
What we want to be able to do is see aggregates of these vectors by tag. So, for example, if we want to see the vectors by sector, we want to get back a response that gives us all the unique sectors and, for each one, a 3000x1 vector that is the element-wise sum of all the vectors tagged with that sector.
What we've tried:
It's easy enough to implement a normalized star schema with 2 tables: one with the tagging information and an ID, and a second table with "VectorDate, ID, ElementNumber, Value", which has a row for each element of each vector. Unfortunately, given the size of the data, that means we add 18 million records to this second table daily. And since our queries need to read (and add up) all 18 million of those records, it's not the most efficient of operations when it comes to disk reads.
Sample query:
SELECT T1.country, T2.ElementNumber, SUM(T2.Value)
FROM T1
INNER JOIN T2 ON T1.ID = T2.ID
WHERE T2.VectorDate = 20140101
GROUP BY T1.country, T2.ElementNumber
I've looked into NoSQL solutions (which I don't have experience with) and have seen that some, like MongoDB, allow storing entire vectors as part of a single document. But I'm unsure whether they would support the aggregations we're trying to do efficiently (adding each element of one document's vector to the corresponding element of other documents' vectors). I've also read that the $unwind operation this requires isn't that efficient either.
It would be great if someone could point me in the direction of a database solution that can help us solve our problem efficiently.
Thanks!