How to compute the total number of bills per client age? - pyspark

I have two datasets. The first holds the clients' bills, with the fields "bill number", "date", "client", "amount". The second lists each client with their age.
An example:
1st Dataset
u'F1,01/01/2013,C1,11'
2nd Dataset
u'C1,20'
I have parsed both datasets to keep only the fields that matter for my problem. Here is the code:
def parseClients(clients):
    # "C1,20" -> (client id, age)
    fields = clients.split(",")
    return (fields[0], fields[1])

def parseBill(bill):
    # "F1,01/01/2013,C1,11" -> (client id, full bill line)
    fields = bill.split(",")
    return (fields[2], bill)

new_bills = bills.map(parseBill)
new_clients = clients.map(parseClients)
# Join on client id: (client, (bill line, age))
Age_Bills = new_bills.join(new_clients)
Here is a sample of the joined result, from Age_Bills.take(10):
(u'C856', (u'F2982,06/01/2013,C856,88', u'81'))
(u'C856', (u'F11953,22/01/2013,C856,87', u'81'))
(u'C856', (u'F12893,24/01/2013,C856,10', u'81'))
(u'C856', (u'F12913,24/01/2013,C856,41', u'81'))
(u'C856', (u'F17883,02/02/2013,C856,45', u'81'))
(u'C856', (u'F17895,02/02/2013,C856,75', u'81'))
(u'C856', (u'F18867,04/02/2013,C856,105', u'81'))
(u'C856', (u'F21864,09/02/2013,C856,26', u'81'))
(u'C856', (u'F30889,26/02/2013,C856,154', u'81'))
(u'C856', (u'F49990,02/04/2013,C856,90', u'81'))
Now I'd like to count the number of bills per age, but I don't know how to continue. I have thought about using reduceByKey or flatMap. I would be grateful if you could help me.
Thanks,

This should work:
Age_Bills.map(lambda x: (x[1][1], 1)).reduceByKey(lambda x, y: x + y)
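To see what that one-liner does without a Spark cluster, here is a minimal plain-Python sketch of the same two steps: map each joined record to (age, 1), then add up the ones per age. The helper name count_bills_per_age and the client C12 are made up for illustration; the two C856 records come from the sample in the question.

```python
from collections import defaultdict

def count_bills_per_age(joined_records):
    """Count bills per age from (client, (bill_line, age)) tuples.

    Mirrors map(lambda x: (x[1][1], 1)) followed by
    reduceByKey(lambda x, y: x + y), but on a local list.
    """
    counts = defaultdict(int)
    for _client, (_bill, age) in joined_records:
        counts[age] += 1  # the "reduce" step: add 1 for every bill seen
    return dict(counts)

# Two records from the question's sample plus one hypothetical client (C12).
sample = [
    ("C856", ("F2982,06/01/2013,C856,88", "81")),
    ("C856", ("F11953,22/01/2013,C856,87", "81")),
    ("C12", ("F1,01/01/2013,C12,11", "20")),
]

result = count_bills_per_age(sample)
print(result)
```

Note that the ages stay strings, because they were split out of text lines and never converted. On the real RDD, calling .collect() on the result of the one-liner brings the (age, count) pairs back to the driver as a list.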

How to get Min, Max and Length between dates for each year?

I have an RDD of type RDD[String]. Here is part of it as an example:
1990,1990-07-08
1994,1994-06-18
1994,1994-06-18
1994,1994-06-22
1994,1994-06-22
1994,1994-06-26
1994,1994-06-26
1954,1954-06-20
2002,2002-06-26
1954,1954-06-23
2002,2002-06-29
1954,1954-06-16
2002,2002-06-30
...
result:
(1982,52)
(2006,64)
(1962,32)
(1966,32)
(1986,52)
(2002,64)
(1994,52)
(1974,38)
(1990,52)
(2010,64)
(1978,38)
(1954,26)
(2014,64)
(1958,35)
(1998,64)
(1970,32)
I group it nicely, but my problem is the v.size part: I do not know how to calculate that length.
Just to put it in perspective, here are the expected results:
It is not a mistake that 2002 appears twice, but ignore that.

Define the date format:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
and an ordering, so that min and max can compare LocalDate values:
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
Create a function that receives "v" and returns MAX(date_of_matching_year) - MIN(date_of_matching_year) = LENGTH (in days):
def f(v: Iterable[Array[String]]): Int = {
  val parsedDates = v.map(arr => LocalDate.parse(arr(1), formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}
Then replace v.size with f(v).
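The same idea (group by year, then take max minus min) can be checked on the sample rows with a small plain-Python sketch. The function name year_spans is made up for illustration, and the sketch subtracts the dates directly instead of comparing day-of-year values; the two give the same length as long as all dates in a group fall in the same year.

```python
from collections import defaultdict
from datetime import date

def year_spans(lines):
    """Return {year: (min_date, max_date, length_in_days)} for 'year,yyyy-mm-dd' lines."""
    by_year = defaultdict(list)
    for line in lines:
        year, day = line.split(",")
        by_year[year].append(date.fromisoformat(day))
    return {
        year: (min(days), max(days), (max(days) - min(days)).days)
        for year, days in by_year.items()
    }

# The sample rows shown in the question.
rows = [
    "1990,1990-07-08",
    "1994,1994-06-18", "1994,1994-06-18",
    "1994,1994-06-22", "1994,1994-06-22",
    "1994,1994-06-26", "1994,1994-06-26",
    "1954,1954-06-20", "2002,2002-06-26",
    "1954,1954-06-23", "2002,2002-06-29",
    "1954,1954-06-16", "2002,2002-06-30",
]

spans = year_spans(rows)
for year in sorted(spans):
    first, last, length = spans[year]
    print(year, first, last, length)
```

A year with a single date, such as 1990 in this sample, gets a length of 0.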

Iterate through array of JSON objects in coffeescript

Hi, I have a dataset that looks like this:
[
{ date:"somedatehere", series1:"series1Value", series2:"series2Value" ..., seriesX:"seriesXValue"},
{ date:"anotherDateHere", series1:"anotherseries1Value", series2:"anotherseries2Value"...,seriesX:"anotherseriesXValue"},...
]
I'd like to loop through this in CoffeeScript and extract arrays, so that I end up with an array of dates (somedatehere, anotherDateHere, etc.), an array of series1 values, an array of series2 values, and so on up to seriesX.
Preferably all of these arrays would keep the same order, such that dates[0] === somedatehere, series1[0] === series1Value, series2[0] === series2Value, seriesX[1] === anotherseriesXValue, etc.
Is there an easy way to go about doing this in CoffeeScript?

dates = (obj.date for obj in my_array)
series1 = (obj.series1 for obj in my_array)
In case you have a lot of series and don't want to enumerate them manually:
types = (k for k, v of my_array[0])
result = {}
result[type] = (obj[type] for obj in my_array) for type in types
For example, with
my_array = [{date: 1, x: 2}, {date: 123, x: 2134}]
this will give you
result = {
  date: [ 1, 123 ],
  x: [ 2, 2134 ]
}
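For comparison, the same "list of objects to one array per key" reshaping can be sketched in Python. The function name columns_from_records is made up for illustration, and like the answer above it assumes every object has the same keys as the first one.

```python
def columns_from_records(records):
    """Turn a list of dicts into a dict of lists, keeping the row order."""
    if not records:
        return {}
    keys = list(records[0])  # take the key names from the first object
    # One list per key; position i in every list comes from records[i].
    return {key: [record[key] for record in records] for key in keys}

my_array = [{"date": 1, "x": 2}, {"date": 123, "x": 2134}]
result = columns_from_records(my_array)
print(result)
```

If a later object is missing one of the keys, this sketch raises a KeyError rather than silently leaving a gap, which keeps the arrays aligned by index.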

Ranking of a cell array

Take the following example:
clear all
Name1 = {'Data1','Data2','Data3','Data4'};
Data = {6.2,6,3.2,8};
CombnsName = nchoosek(Name1,2);
CombnsData = nchoosek(Data,2);
for i = 1:length(CombnsData)
    multiplied{i} = CombnsData{i,1}.*CombnsData{i,2};
end
multiplied = multiplied';
Final = [CombnsName, multiplied];
Rankd = sort(cell2mat(multiplied));
Here, Final holds the products obtained by multiplying every possible pair of values in Data, labelled with the matching names from Name1. Now I'm trying to reorder the rows of Final to follow the ranking given by Rankd. For example, the first row of Final should read 'Data2' 'Data3' 19.2, and the last row should read 'Data1' 'Data4' 49.6.
Is there a method for doing this?

There are a couple of options. First, you could use the second output of sort, which gives you the indices of the sorted entries in the original array:
>> [Rankd, Index] = sort(cell2mat(multiplied));
and then do
>> Final(Index,:)
ans =
'Data2' 'Data3' [19.200000000000003]
'Data1' 'Data3' [19.840000000000003]
'Data3' 'Data4' [25.600000000000001]
'Data1' 'Data2' [37.200000000000003]
'Data2' 'Data4' [ 48]
'Data1' 'Data4' [49.600000000000001]
However, an even easier method is to use the function sortrows which was designed for exactly this situation:
>> sortrows(Final,3)
ans =
'Data2' 'Data3' [19.200000000000003]
'Data1' 'Data3' [19.840000000000003]
'Data3' 'Data4' [25.600000000000001]
'Data1' 'Data2' [37.200000000000003]
'Data2' 'Data4' [ 48]
'Data1' 'Data4' [49.600000000000001]
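Both options (sort indices first, or sort the rows on one column) can be sketched in Python on the same data. This is only an illustration of the two techniques, not a translation of the MATLAB functions; the variable names are made up.

```python
from itertools import combinations

names = ["Data1", "Data2", "Data3", "Data4"]
values = [6.2, 6, 3.2, 8]

# One row per unordered pair: (name_a, name_b, product of the two values).
rows = [
    (names[i], names[j], values[i] * values[j])
    for i, j in combinations(range(len(names)), 2)
]

# Option 1: compute the sort order first, then index the rows with it.
order = sorted(range(len(rows)), key=lambda k: rows[k][2])
ranked_by_index = [rows[k] for k in order]

# Option 2: sort the rows directly on the third column.
ranked_by_column = sorted(rows, key=lambda row: row[2])

for row in ranked_by_column:
    print(row)
```

Option 1 is useful when the same ordering has to be applied to several arrays; option 2 is shorter when there is only one table to reorder.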

matlab: grouping variables for observations that can be in multiple groups

I would like to use MATLAB group statistics functions (like grpstats) on data where each observation can be in multiple groups. For example, pizzas can have {'pepperoni', 'mushroom','onions'} or {'pepperoni'} or whatever and then I want group stats by topping: all of the pizzas with 'pepperoni', all of them with 'mushroom', etc.
Alternatively, if you know a way to do this by hand without iterating like an idiot, that would also be helpful.

Just put repeated measures in different rows, so that an observation belonging to several groups appears once per group. For example:
store = repmat(cellstr(num2str((1:3)')), 3, 1);
type = repmat({'pepperoni', 'mushrooms', 'onions'}, 3, 1);
type = type(:);
score = dataset({randn(9,3), 'taste', 'looks', 'price'});
data = [dataset(store, type) score];
grpstats(data(:,2:end), 'type')
Raw data:
>> data
data =
    store    type              taste       looks       price
    '1'      'pepperoni'    -0.19224    -0.44463    -0.50782
    '2'      'pepperoni'    -0.27407    -0.15594    -0.32058
    '3'      'pepperoni'      1.5301     0.27607    0.012469
    '1'      'mushrooms'    -0.24902    -0.26116     -3.0292
    '2'      'mushrooms'     -1.0642     0.44342    -0.45701
    '3'      'mushrooms'      1.6035     0.39189      1.2424
    '1'      'onions'         1.2347     -1.2507     -1.0667
    '2'      'onions'       -0.22963    -0.94796     0.93373
    '3'      'onions'        -1.5062    -0.74111     0.35032
Group stats:
>> grpstats(data(:,2:end), 'type')
ans =
                 type           GroupCount    mean_taste    mean_looks    mean_price
    pepperoni    'pepperoni'    3                0.35459      -0.10817      -0.27197
    mushrooms    'mushrooms'    3                0.09674       0.19138      -0.74791
    onions       'onions'       3               -0.16704      -0.97992      0.072449
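The "one row per group membership" idea can be sketched in plain Python for the pizza case from the question. The data, the taste scores, and the function name stats_by_topping are all hypothetical; the point is only that a pizza with several toppings contributes to each topping's statistics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: each pizza has one taste score and one or more toppings.
pizzas = [
    {"toppings": ["pepperoni", "mushroom", "onions"], "taste": 4.0},
    {"toppings": ["pepperoni"], "taste": 2.0},
    {"toppings": ["mushroom", "onions"], "taste": 3.0},
]

def stats_by_topping(pizzas):
    """Return {topping: (count, mean taste)}; a pizza counts once per topping."""
    scores = defaultdict(list)
    for pizza in pizzas:
        for topping in pizza["toppings"]:  # one "row" per (pizza, topping) pair
            scores[topping].append(pizza["taste"])
    return {topping: (len(vals), mean(vals)) for topping, vals in scores.items()}

stats = stats_by_topping(pizzas)
for topping, (count, avg) in stats.items():
    print(topping, count, avg)
```

Because observations are repeated across groups, the group counts add up to more than the number of pizzas, which is expected here.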

Fastest way to find the union and intersection of two lists

Which is the fastest way to find the union and intersection of two lists?
I mean, I have two lists, say
List<1>
1
2
3
4
List<2>
2
3
Finally I need to get the output as
List<3>
Not Defined
2
3
Not Defined
I hope my requirement is clear.
Please let me know if anything is confusing.
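
LINQ already has Union and Intersect. Your example is neither: it keeps the positions of the first list and blanks out the items that are missing from the second.
var set = new HashSet<int>(list2);
// Cast to int? so that missing entries can be null
var list3 = list1.Select(x => set.Contains(x) ? (int?)x : null).ToList();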
Or you could do the following, which just gives you the intersection:
HashSet<int> list1 = new HashSet<int>() { 1, 2, 3, 4 };
HashSet<int> list2 = new HashSet<int>() { 2, 3 };
List<int> list3 = list1.Intersect(list2).ToList();
for (int i = 0; i < list3.Count; i++)
{
    Console.WriteLine(list3[i]);
}
Console.ReadLine();
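To make the difference between the three results concrete, here is a short Python sketch on the lists from the question. It shows the union, the intersection, and the position-preserving output the question actually asks for, with None standing in for "Not Defined"; the variable names are made up for illustration.

```python
list1 = [1, 2, 3, 4]
list2 = [2, 3]

lookup = set(list2)  # set membership tests are O(1) on average

union = sorted(set(list1) | lookup)
intersection = sorted(set(list1) & lookup)

# The output asked for in the question: keep list1's positions and
# put None ("Not Defined") where the item is missing from list2.
masked = [x if x in lookup else None for x in list1]

print(union)
print(intersection)
print(masked)
```

Building the set once and testing membership against it is the same approach as the HashSet in the first snippet above, and it avoids rescanning list2 for every item of list1.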