for (index, item) in users.enumerated() {
    //countUserIds[item.name] = [Float(item.id), avgs[index]]
    print("Name:", item.name, "ID:", item.id, "Score:", avgs[index])
}
I'm trying to sort my output so that the highest averages are listed first, but if I sort the averages array, it will no longer be linked to the right name and ID. How can I sort the averages and display the users with the highest average first?
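One common approach is to pair each user with their average before sorting, so the values stay attached. Here is a minimal sketch of the idea in Python rather than Swift (the sample data is made up); the same pattern exists in Swift via `zip` and `sorted(by:)`:

```python
# Sketch of the pair-then-sort idea; sample data is hypothetical.
users = [("Ann", 1), ("Ben", 2), ("Cal", 3)]   # (name, id)
avgs = [72.5, 91.0, 85.25]

# Zip users with their averages so each average stays attached to its user,
# then sort the pairs by average, highest first.
ranked = sorted(zip(users, avgs), key=lambda p: p[1], reverse=True)

for (name, user_id), avg in ranked:
    print("Name:", name, "ID:", user_id, "Score:", avg)

names_in_order = [name for (name, _id), _avg in ranked]
```

Because each tuple carries both the user and the score, no index bookkeeping is needed after the sort.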
I have a spark dataframe as below.
val df = Seq(("a",1,1400),("a",1,1250),("a",2,1200),("a",4,1250),("a",4,1200),("a",4,1100),("b",2,2500),("b",2,1250),("b",2,500),("b",4,250),("b",4,200),("b",4,100),("b",4,100),("b",5,800)).
toDF("id","hierarchy","amount")
I am working in Scala with this DataFrame and am trying to get the result shown below.
val df = Seq(("a",1,1400),("a",4,1250),("a",4,1200),("a",4,1100),("b",2,2500),("b",2,1250),("b",4,250),("b",4,200),("b",4,100),("b",5,800)).
toDF("id","hierarchy","amount")
Rules: grouped by id, if min(hierarchy) == 1 then I take the one row with the highest amount, and then I go on to analyze hierarchy >= 4 and take 3 of each of them in descending order of amount. On the other hand, if min(hierarchy) == 2 then I take the two rows with the highest amount, and then again analyze hierarchy >= 4 and take 3 of each of them in descending order of amount. And so on for all the ids in the data.
Thanks for any suggestions.
You may use window functions to generate the criteria on which you will filter, e.g.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min, rank, row_number, when}

val results = df
  .withColumn("minh", min("hierarchy").over(Window.partitionBy("id")))
  .withColumn("rnk", rank().over(Window.partitionBy("id").orderBy(col("amount").desc())))
  .withColumn(
    "rn4",
    when(col("hierarchy") >= 4, row_number().over(
      Window.partitionBy("id", when(col("hierarchy") >= 4, 1).otherwise(0))
        .orderBy(col("amount").desc())
    )).otherwise(5)
  )
  .filter("rnk <= minh or rn4 <= 3")
  .select("id", "hierarchy", "amount")
NB: a more verbose (and more restrictive) filter would be .filter("(rnk <= minh or rn4 <= 3) and (minh in (1,2))")
The temporary columns generated by the window functions above, used in the filtering criteria, are:
minh: the minimum hierarchy for a group id, used to select the top minh rows from the group
rnk: ranks the rows by highest amount in each group
rn4: ranks the rows by highest amount in each group with hierarchy >= 4
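For readers without a Spark shell handy, the selection rules can be sketched in plain Python (no Spark; the helper name `pick` is mine) against the question's sample data. Note this follows the question's stated rules, taking the top 3 of *each* hierarchy level >= 4:

```python
from collections import defaultdict

rows = [("a",1,1400),("a",1,1250),("a",2,1200),("a",4,1250),("a",4,1200),
        ("a",4,1100),("b",2,2500),("b",2,1250),("b",2,500),("b",4,250),
        ("b",4,200),("b",4,100),("b",4,100),("b",5,800)]

def pick(rows):
    by_id = defaultdict(list)
    for r in rows:
        by_id[r[0]].append(r)
    out = []
    for gid in sorted(by_id):
        grp = by_id[gid]
        minh = min(h for _, h, _ in grp)
        keep = set()
        # take the top `minh` rows of the whole group, by amount descending
        for i in sorted(range(len(grp)), key=lambda i: -grp[i][2])[:minh]:
            keep.add(i)
        # then take the top 3 rows (by amount) within each hierarchy level >= 4
        for h in {h for _, h, _ in grp if h >= 4}:
            idxs = [i for i in range(len(grp)) if grp[i][1] == h]
            for i in sorted(idxs, key=lambda i: -grp[i][2])[:3]:
                keep.add(i)
        out += sorted((grp[i] for i in keep), key=lambda r: (r[1], -r[2]))
    return out
```

Running `pick(rows)` reproduces the expected DataFrame from the question, which makes the rules easy to verify before translating them into window functions.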
I have imported a DB from a CSV that has info about:
country
region
commodity
price
date
(This is the csv: https://www.kaggle.com/jboysen/global-food-prices)
The rows in the CSV are ordered in this way:
country 1, region 1.1, commodity X, price, dateA
country 1, region 1.1, commodity X, price, dateB
country 1, region 1.1, commodity Y, price, dateA
country 1, region 1.1, commodity Y, price, dateB
...
country 1, region 1.2, commodity X, price, dateA
country 1, region 1.2, commodity X, price, dateB
country 1, region 1.2, commodity Y, price, dateA
country 1, region 1.2, commodity Y, price, dateB
...
country 2, region 2.1, commodity X, price, dateA
...
I need to show, for each country and each commodity, the highest price.
I wrote:
1) a map with key country+commodity and value price
var map = function() {
    emit({country: this.country_name, commodity: this.commodity_name}, {price: this.price});
};
2) a reduce that scans the prices related to a key and checks which is the highest price
var reduce = function(key, values) {
    var maxPrice = 0.0;
    values.forEach(function(doc) {
        var thisPrice = parseFloat(doc.price);
        if (typeof doc.price != "undefined") {
            if (thisPrice > maxPrice) {
                maxPrice = thisPrice;
            }
        }
    });
    return {max_price: maxPrice};
};
3) I send the output of a map reduce to a collection "mr"
db.prices.mapReduce(map, reduce, {out: "mr"});
PROBLEM:
For example, if I open the csv and manually order by:
country (increasing order)
commodity (increasing order)
price (decreasing order)
I can check that (to give an example of data) in Afghanistan the highest price for the commodity Bread is 65.25
When I check the M-R output, though, it reports 0 as the max price of Bread in Afghanistan.
WHAT HAPPENS:
There are 10 regions in the csv in which Bread is logged for Afghanistan.
I've added, on the last line of the reduce:
print("reduce with key: " + key.country + ", " + key.commodity + "; max price: " + maxPrice);
Theoretically, if I search the mongodb log, I should find only ONE entry with "reduce with key: Afghanistan, Bread; max price: ???".
Instead I see TEN lines (same numbers of the regions), each one with a different max price.
The last one has "max price 0".
MY HYPOTHESIS:
It seems that, after the emit, when the reduce is called, instead of looking for ALL k-v pairs with the same key, it considers sub-groups that are close together.
So, recalling my starting example on the csv structure:
until the reduce scans emit outputs related to "Afghanistan, region 1, bread", it does a reduce on them;
then it does a reduce on the outputs related to "Afghanistan, region 1, commodityX";
then it does another reduce on the outputs related to "Afghanistan, region 2, bread" (instead of reducing ALL the k-v pairs for Afghanistan+bread in a single reduce)
Do I have to do a re-reduce to work on all the partial reduce jobs?
I've managed to solve this.
MongoDB doesn't necessarily do the reducing of all k-v pairs with the same key in one go.
It can happen (as in this case) that MongoDB performs a reduce on a subset of k-v pairs for a given key, and then feeds the output of that first reduce in as a value when it performs a second reduce on another subset with the same key.
My code didn't work because:
MongoDB performed a reduce on a subset of k-v pairs related to the key "Afghanistan, Bread", with a variable in output named "maxPrice"
MongoDB would proceed to reduce other subsets
MongoDB, when faced with another subset of "Afghanistan, Bread", would take the output of the first reduce, and use it as a value
The output of a reduce is named "maxPrice", but the other values are named "price"
Since I ask for the value "doc.price", when I scan the doc that contains "maxPrice", it gets ignored
There are 2 approaches to solve this:
1) You use the same name for the reduce output variable as the emit output value
2) You index the properties chosen as key, and you use the "sort" option on mapReduce() so that all k-v pairs related to a key get reduced in one go
The second approach is for when you don't want to give up using a different name for the reduce output (plus it has better performance, since it only does a single reduce per key).
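The pitfall can be simulated in plain Python (the helper names are mine, not MongoDB's): a reduce whose output field differs from the emitted field loses data on re-reduce, while a consistent field name survives:

```python
def reduce_buggy(values):
    max_price = 0.0
    for doc in values:
        if "price" in doc:   # re-reduced docs only carry "max_price", so they get ignored
            max_price = max(max_price, float(doc["price"]))
    return {"max_price": max_price}

def reduce_fixed(values):
    max_price = 0.0
    for doc in values:
        if "price" in doc:
            max_price = max(max_price, float(doc["price"]))
    return {"price": max_price}   # same field name as the emitted values

emitted = [{"price": "65.25"}, {"price": "10.0"}, {"price": "42.0"}]

# MongoDB may reduce a subset first, then re-reduce its output with the rest:
partial_buggy = reduce_buggy(emitted[:2])            # {"max_price": 65.25}
final_buggy = reduce_buggy([partial_buggy, emitted[2]])

partial_fixed = reduce_fixed(emitted[:2])            # {"price": 65.25}
final_fixed = reduce_fixed([partial_fixed, emitted[2]])
```

In the buggy version the 65.25 from the first pass is discarded on re-reduce, exactly the symptom described above; the fixed version carries it through.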
I am trying to create a report which has addresses in the form of house number and street name. I want to group all addresses by street name and then order them by house number, which is a string but should sort like a number. Ideally, I would like the odds ascending and then the evens descending, so that my list would look like
1,3,5,7,9 .... 8,6,4,2
How would I go about this? I created the first group on street name and then a second group on house number, with a formula for sorting the numbers.
I created a formula field OddEven with
ToNumber({tbl_FarmMaster.sano}) MOD 2
but I am having a hard time applying that to my group.
Create two formulas like below. Let's call them oddFirst and negativeEven.
oddFirst formula:
if ToNumber({tbl_FarmMaster.sano}) MOD 2 = 1 then
    1 //it is odd
else
    2 //it is even
negativeEven formula:
if ToNumber({tbl_FarmMaster.sano}) MOD 2 = 1 then
    ToNumber({tbl_FarmMaster.sano}) //it is odd
else
    -ToNumber({tbl_FarmMaster.sano}) //it is even, note the negative sign
Then create two groups to sort:
first by the formula oddFirst
second by the formula negativeEven
Show the {tbl_FarmMaster.sano} field.
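The two formulas amount to a compound sort key. A plain-Python sketch (not Crystal syntax; sample house numbers are made up) of the same ordering, odds ascending then evens descending:

```python
house_numbers = ["4", "1", "8", "3", "2", "7", "6", "5", "9"]

def sort_key(s):
    n = int(s)
    odd_first = 1 if n % 2 == 1 else 2      # the oddFirst formula: odds sort before evens
    negative_even = n if n % 2 == 1 else -n  # the negativeEven formula: negation flips
                                             # the evens into descending order
    return (odd_first, negative_even)

ordered = sorted(house_numbers, key=sort_key)
```

Grouping first by `oddFirst` and then by `negativeEven` in the report reproduces exactly this tuple comparison.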
I want to filter out those entries that have operation_id equal to "0".
val operations_seen_qty = parsed.flatMap(_.lift("operation_id")).toSet.size.toString
parsed is List[Map[String,String]].
How can I do it?
This is my draft, but I think it instead selects only those entries that have operation_id equal to "0":
val operations_seen_qty = parsed.flatMap(_.lift("operation_id")).filter(p=>p.equals("0")).toSet.size.toString
The final objective is to count the number of unique operation_id values that are not equal to "0".
If I understand correctly, you only want to retain those entries whose operation_id is NOT equal to "0". In this case, the function in the filter should be p => !p.equals("0") or p => p != "0".
filter retains the entries that fulfill the predicate; what you did is exactly the opposite.
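The same count-of-distinct-non-zero logic can be sketched in Python (the sample data is made up) to check the intent before fixing the Scala:

```python
parsed = [{"operation_id": "7"}, {"operation_id": "0"},
          {"operation_id": "7"}, {"other_key": "x"}, {"operation_id": "9"}]

# flatMap(_.lift("operation_id")) ~ pull the value wherever the key exists;
# then keep only ids that are NOT "0" and count the distinct ones.
ids = [m["operation_id"] for m in parsed if "operation_id" in m]
operations_seen_qty = str(len({i for i in ids if i != "0"}))
```

Here the "0" entry and the map without the key are both excluded, leaving two distinct ids.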
I have a dataset of stock prices called 'stocks'. Each column is a different stock. Each row is the date of the stock prices.
How can I rank the stock price of a given date?
I tried
tiedrank(stocks.yhoo)
And it successfully ranked the prices of YHOO stock. However, I would like to rank by row, not column.
Also, when I tried
tiedrank(stocks(1,:))
or to delete the date column in column 1
tiedrank(stocks(1,2:3))
I got the error message: Dataset array subscripts must be two-dimensional.
Am I doing something wrong? Or am I better off using matrices?
If I understand correctly, you want to rank the stocks according to price at a given date, where dates are rows, and stocks are columns. To use tiedrank across a row, you need to convert that part of the dataset to double, and then use the output index list to sort:
%# create index for sorting
idx = tiedrank( double( stocks(1,:) ));
%# reorder columns with index
sortedStocks = stocks(:,idx);
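For reference, `tiedrank` assigns each element its 1-based rank, giving tied values the mean of the ranks they span. A plain-Python sketch of that behavior (not MATLAB, and without the dataset-array subscripting):

```python
def tiedrank(values):
    # rank each value 1..n; tied values share the mean of their ranks
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the 1-based ranks i+1 .. j+1
        for k in order[i:j + 1]:
            ranks[k] = avg
        i = j + 1
    return ranks
```

For example, `tiedrank([10, 20, 20, 30])` gives the two tied 20s the averaged rank 2.5, which matches MATLAB's documented tie handling.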