Struggling to handle deduplication after aggregation in Spark Streaming (Scala)

1. Streaming data is coming from Kafka.
2. I am consuming it through Spark Streaming.
3. Each record has firstname, lastname, userid and member names; from the member names I compute the member count.
E.g. for mark,tyson,2,chris,lisa,iwanka (firstname mark, lastname tyson, userid 2, members chris, lisa and iwanka) the member count is 3.
I have to compute the count somehow, it's the requirement. My concern is how to deduplicate after the aggregation.
val df2 = df.select("firstname", "lastname", "membercount", "userid")
df2.writeStream.format("console").start().awaitTermination()
or
df3.select("*").where("membercount >= 3").dropDuplicates("userid")
// This is not working, but it is what I need: after the count, the same
// userid should not come again in later batches; I want only its first occurrence.
Batch-1 output
firstname lastname member-count userid
john smith 5 1
mark boucher 8 2
shawn pollock 3 3
Batch-2 output
firstname lastname member-count userid
john smith 7 (prev. count 5) 1
shawn pollock 12 (prev. count 3) 3
chris jordan 6 4
But here is the batch-2 output I actually want. The counts for john smith and shawn pollock will likely increase again in later batches, but I don't want to show or keep them in the output of those batches. That is, based on userid, I want each user to appear in the batch output only once (the first time) and to be ignored in every later batch output (one possible direction is sketched after the desired output below).
firstname lastname member-count userid
chris jordan 6 4
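One possible direction (a minimal, untested sketch; it assumes df3 is the aggregated streaming DataFrame with the columns above): in Structured Streaming, dropDuplicates keeps state across micro-batches, so a userid that has already been emitted is suppressed in later batches. Note that the method is dropDuplicates (dropDuplication does not exist), that chaining it after a streaming aggregation is restricted in older Spark releases, and that without a watermark the deduplication state grows without bound.

val query = df3
  .where("membercount >= 3")
  .dropDuplicates("userid")  // state is kept across batches, so only the first occurrence of each userid is emitted
  .writeStream
  .outputMode("update")
  .format("console")
  .start()

query.awaitTermination()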

Your question is hard to read, but as I understand it you want a while loop with a condition?
var a = 10
while (a < 20) {
  println("Value of a: " + a)
  a = a + 1
}
For example, this will print:
Value of a: 10
Value of a: 11
Value of a: 12
Value of a: 13
Value of a: 14
Value of a: 15
Value of a: 16
Value of a: 17
Value of a: 18
Value of a: 19

Related

How to filter data with starting and ending conditions?

I'm trying to filter my data based on two conditions that depend on sequential dates.
I am looking for values of 2 or less for 5+ sequential dates,
with a "cushion period" of values from 2 to 5 for up to 3 sequential days.
It would look something like this (sorry for the terrible Excel attempt here):
Day 1 to Day 10 would be included and Day 11 would not be. Days 6 to 8 would be considered the "cushion period". I hope this makes sense!
Right now I am able to get the cushion period (in the reprex) only, but I can't figure out how to add the starting and ending condition of values under 2 for 5 sequential dates (the 5 days could be broken up by the cushion period in between, but I feel like that might complicate things).
Any help would be GREATLY appreciated!
For my reprex (below), the dates that would be included in the final df are the ones in blue (1/1/2000 to 1/9/2000 and 1/22/2000 to 1/30/2000); the dates in grey would not be.
Reprex:
library("dplyr")

# Goal: include all values of 2 or less for 5 consecutive days, allowing a
# "cushion" period of values of 2 to 5 for up to 3 days
data <- data.frame(
  Date = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04", "2000-01-05",
           "2000-01-06", "2000-01-07", "2000-01-08", "2000-01-09", "2000-01-10",
           "2000-01-11", "2000-01-12", "2000-01-13", "2000-01-14", "2000-01-15",
           "2000-01-16", "2000-01-17", "2000-01-18", "2000-01-19", "2000-01-20",
           "2000-01-21", "2000-01-22", "2000-01-23", "2000-01-24", "2000-01-25",
           "2000-01-26", "2000-01-27", "2000-01-28", "2000-01-29", "2000-01-30"),
  Value = c(2, 3, 4, 5, 2, 2, 1, 0, 1, 8, 7, 9, 4, 5, 2, 3, 4, 5, 7, 2,
            6, 0, 2, 1, 2, 0, 3, 4, 0, 1)
)
head(data)

# Goal: the result should include dates from 1/1/2000 to 1/9/2000 and 1/22/2000 to 1/30/2000.
# I am able to subset the "cushion period" but I'm not sure how to add the
# starting and ending conditions around it.
attempt1 <- data %>%
  group_by(group_id = as.integer(gl(n(), 3, n()))) %>%
  filter(Value <= 5 & Value >= 3) %>%
  ungroup() %>%
  select(-group_id)
head(attempt1)
If I understand correctly, you need to keep groups of consecutive values that are below or equal to 5 which contain at least 5 consecutive values below or equal to 2. Here's a way to do that, with some explanation:
library(dplyr)

data %>%
  mutate(under_three = Value <= 2) %>%
  # under_three = TRUE if Value is below or equal to 2
  group_by(rl_two = data.table::rleid(Value <= 2)) %>%
  # group by runs of consecutive values that are under_three (or not)
  mutate(big = n() >= 5 & all(under_three)) %>%
  # big = TRUE if the run consists of 5 or more consecutive values below or equal to 2
  group_by(rl_five = data.table::rleid(Value <= 5)) %>%
  # regroup by rl_five, i.e. runs of consecutive values that are below or equal to 5
  filter(any(big)) %>%
  # keep the rl_five groups that contain at least one big = TRUE; drop the others
  ungroup() %>%
  select(Date, Value)

Output:
Date Value
1 2000-01-01 2
2 2000-01-02 3
3 2000-01-03 4
4 2000-01-04 5
5 2000-01-05 2
6 2000-01-06 2
7 2000-01-07 1
8 2000-01-08 0
9 2000-01-09 1
10 2000-01-22 0
11 2000-01-23 2
12 2000-01-24 1
13 2000-01-25 2
14 2000-01-26 0
15 2000-01-27 3
16 2000-01-28 4
17 2000-01-29 0
18 2000-01-30 1

Find the digit at the nth position in the infinite sequence

Given the infinite sequence s = 1234567891011...,
find the digit at position n (n <= 10^18).
E.g. n = 12 => 1; n = 15 => 2.
import Foundation

func findNumber(n: Int) -> Character {
    var i = 1
    var z = ""
    while i < n + 1 {
        z.append(String(i))
        i += 1
    }
    print(z)
    return z[z.index(z.startIndex, offsetBy: n - 1)]
}
print(findNumber(n: 12))
That's my code, but when I look for the digit at the 100,000th position it fails; I suspect I append too many numbers to the string z.
Can anyone help me, in Swift?
The problem here looks fairly straightforward: take the list of all the numbers from 1 to infinity, concatenate them into a string, and find the nth digit. Straightforward to understand. The issue you are seeing, though, is that we have neither infinite memory nor infinite time, so we cannot reasonably build that string in a program. We must find a way around this that does not simply append the numbers to a string and then find the nth digit.
The first thing we can say is that we know what the entire list is. It will always be the same. So can we use any properties of this list to help us?
Let's call the input number n. This is the position of the digit that we want to find. Let's call the output digit d.
Well, first off, let's look at some examples.
We know all the single digit numbers are just in the same position as the number itself.
So, for n<10 ... d = n
What about for two digit numbers?
Well, we know that 10 starts at position 10. (Because there are 9 single digit numbers before it). 9 + 1 = 10
11 starts at position 12. Again, 9 single digits + one 2 digit number before it. 9 + 2 + 1 = 12
So how about, say... 25? Well that has 9 single digit numbers and 15 two digit numbers before it. So 25 starts at 9*1 + 15*2 + 1 = 40 (+ 1 as the sum gets us to the end of 24 not the start of 25).
So... 99 starts at? 9*1 + 89*2 + 1 = 188.
Then we do the same for the three-digit numbers...
100... 9*1 + 90*2 + 1 = 190
300... 9*1 + 90*2 + 200*3 + 1 = 790
1000...? 9*1 + 90*2 + 900*3 + 1 = 2890
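In other words (a general formula implied by the examples above, where L is the number of digits of n and positions are 1-based):
start(n) = 9*1 + 90*2 + ... + 9*10^(L-2)*(L-1) + (n - 10^(L-1))*L + 1
Sanity check with 1000 (L = 4): (9*1 + 90*2 + 900*3) + 0*4 + 1 = 2890, matching the line above.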
OK... so now I'm seeing a pattern here that needs the number of digits in each number. Well... I can get the number of digits in a number by rounding up the log (base 10) of that number (with one caveat: for exact powers of 10 this is off by one, so floor(log10(x)) + 1 is the safer formula).
rounding up log base 10 of 5 = 1
rounding up log base 10 of 23 = 2
rounding up log base 10 of 99 = 2
rounding up log base 10 of 627 = 3
OK... so I think I need something like...
func startPosition(of n: Int) -> Int {
    let length = String(n).count         // number of digits in n
    var result = 0
    var power = 1                        // 10^i
    for i in 0..<(length - 1) {
        result += 9 * power * (i + 1)    // 9*1 + 90*2 + 900*3 + ...
        power *= 10
    }
    result += (n - power) * length       // the length-digit numbers before n
    return result + 1                    // positions in the sequence are 1-based
}
So, the function above takes any number and returns the position in the sequence at which that number starts.
This isn't exactly the problem you are trying to solve, and I don't want to solve it for you.
This is just a leg up on how I would go about solving it. Hopefully, it gives you some guidance on how you can take this further and solve the problem that you are trying to solve.

Table sort by month

I have a table in MATLAB with attributes in the first three columns and data from the fourth column onwards. I was trying to sort the entire table based on the first three columns. However, one of the columns (Column C) contains months ('January', 'February' ...etc). The sortrows function would only let me choose 'ascend' or 'descend' but not a custom option to sort by month. Any help would be greatly appreciated. Below is the code I used.
sortrows(Table, {'Column A','Column B','Column C'} , {'ascend' , 'ascend' , '???' } )
As @AnonSubmitter85 suggested, the best thing you can do is to convert your month names to numeric values from 1 (January) to 12 (December), as follows:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t.ColumnC = month(datenum(t.ColumnC,'mmmm'));
This also gives you a standard sorting criterion for ColumnC (in this example, ascending):
t = sortrows(t,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
If, for any reason that is unknown to us, you are forced to keep your months as literals, you can use a workaround: sort a clone of the table using the approach described above, and then apply the resulting indices to the original:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t_original = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t_clone = t_original;
t_clone.ColumnC = month(datenum(t_clone.ColumnC,'mmmm'));
[~,idx] = sortrows(t_clone,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
t_original = t_original(idx,:);

How to use groupByKey in Spark for a nonlinear groupBy task

I have a table that looks like this:
Time ID Value1 Value2
1 a 1 4
2 a 2 3
3 a 5 9
1 b 6 2
2 b 4 2
3 b 9 1
4 b 2 5
1 c 4 7
2 c 2 0
Here are the tasks and requirements:
I want to set the column ID as the key, not the column Time, but I don't want to delete the column Time. Is there a way in Spark to set Primary Key?
The aggregation function is non-linear, which means you cannot use reduceByKey. All the data with one key must be shuffled to a single node before the calculation. For example, the aggregation function may look like the root-N of the sum of the values, where N is the number of records (the count) for each ID:
output = root(sum(value1), count(*)) + root(sum(value2), count(*))
To make it clear, for ID="a", the aggregated output value should be
output = root(1 + 2 + 5, 3) + root(4 + 3 + 9, 3)
the latter 3 is because we have 3 records for a. For ID='b', it is:
output = root(6 + 4 + 9 + 2, 4) + root(2 + 2 + 1 + 5, 4)
The combination is non-linear; therefore, in order to get correct results, all the data with the same ID must be on one executor.
I checked UDFs and Aggregators in Spark 2.0. Based on my understanding, they all assume a "linear combination".
Is there a way to handle such a nonlinear combination calculation, ideally while still taking advantage of Spark's parallel computing?
The function you use doesn't require any special treatment. You can use a plain aggregation with a join:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{count, lit, pow, sum}

def root(l: Column, r: Column) = pow(l, lit(1) / r)

val out = root(sum($"value1"), count("*")) + root(sum($"value2"), count("*"))

df.groupBy("id").agg(out.alias("outcome")).join(df, Seq("id"))
or window functions:
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("id")

val outw = root(sum($"value1").over(w), count("*").over(w)) +
  root(sum($"value2").over(w), count("*").over(w))

df.withColumn("outcome", outw)
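For reference, here is a self-contained sketch of the join-based variant (a sketch only; the object name, master setting and lowercase column names are illustrative, not from the question). For id = "a" it yields root(8, 3) + root(16, 3) ≈ 2 + 2.52 ≈ 4.52, matching the hand calculation in the question:

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{count, lit, pow, sum}

object NonlinearAgg {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("nonlinear-agg").getOrCreate()
    import spark.implicits._

    // The sample table from the question
    val df = Seq(
      (1, "a", 1, 4), (2, "a", 2, 3), (3, "a", 5, 9),
      (1, "b", 6, 2), (2, "b", 4, 2), (3, "b", 9, 1), (4, "b", 2, 5),
      (1, "c", 4, 7), (2, "c", 2, 0)
    ).toDF("time", "id", "value1", "value2")

    def root(l: Column, r: Column): Column = pow(l, lit(1) / r)
    val out = root(sum($"value1"), count("*")) + root(sum($"value2"), count("*"))

    // One outcome per id, joined back so the Time column is preserved
    df.groupBy("id").agg(out.alias("outcome")).join(df, Seq("id")).show()

    spark.stop()
  }
}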

Use the lag function in SAS to find the difference and delete if the value is less than 30

E.g.
Subject Date
1 2/10/13
1 2/15/13
1 2/27/13
1 3/15/13
1 3/29/13
2 1/11/13
2 1/31/13
2 2/15/13
I need to keep, within each subject, only the dates that are more than 30 days apart.
Required output:
Subject Date
1 2/10/13
1 3/15/13
2 1/11/13
2 2/15/13
This is a very interesting problem. I'll use the retain statement in the DATA step.
Since we are comparing dates across different observations, it's a bit more difficult. We can take advantage of the fact that SAS converts dates to SAS date values (the number of days since Jan 1, 1960) and then compare those numeric values with conditional statements.
data work.test;
    input Subject Date :anydtdte15.;         /* the : modifier reads Date with a date informat */
    sasdate = Date;
    retain x;                                /* x carries the last kept date across observations */
    /* on the first observation x is missing, so the condition below is false and the row is kept */
    if -30 <= sasdate - x <= 30 then delete; /* drop rows within 30 days of the last kept date */
    else x = sasdate;                        /* otherwise keep the row and remember its date */
datalines;
1 2/10/13
1 2/15/13
1 2/27/13
1 3/15/13
1 3/29/13
2 1/11/13
2 1/31/13
2 2/15/13
;
run;
proc print data=test;
format Date mmddyy8.;
var Subject Date;
run;
OUTPUT as required:
Obs Subject Date
1 1 02/10/13
2 1 03/15/13
3 2 01/11/13
4 2 02/15/13