So I am trying to create basic pagination in Gatling but failing miserably.
My current scenario is as follows:
1) First call is always a POST request, the body includes page index 1 and page size 50
{"pageIndex":1,"pageSize":50}
I am then receiving 50 objects plus the total count of objects on the environment:
"totalCount":232
Since I need to iterate through all objects on the environment, I will need to POST this call 5 times, each time with an updated pageIndex.
My current (failing) code looks like:
def getAndPaginate(jsonBody: String) = {
  val pageSize = 50;
  var totalCount: Int = 0
  var currentPage: Int = 1
  var totalPages: Int = 0

  exec(session => session.set("pageIndex", currentPage))
  exec(http("Get page")
    .post("/api")
    .body(ElFileBody("json/" + jsonBody)).asJson
    .check(jsonPath("$.result.objects[?(#.type == 'type')].id").findAll.saveAs("list")))
    .check(jsonPath("$.result.totalCount").saveAs("totalCount"))
    .exec(session => {
      totalCount = session("totalCount").as[Int]
      totalPages = Math.ceil(totalCount/pageSize).toInt
      session})
    .asLongAs(currentPage <= totalPages)
    {
      exec(http("Get assets action list")
        .post("/api")
        .body(ElFileBody("json/" + jsonBody)).asJson
        .check(jsonPath("$.result.objects[?(#.type == 'type')].id").findAll.saveAs("list")))
      currentPage = currentPage + 1
      exec(session => session.set("pageIndex", currentPage))
      pause(Config.minDelayValue seconds, Config.maxDelayValue seconds)
    }
}
Currently the pagination values are not assigned to the variables I created at the beginning of the function. If I create the variables at the object level they are assigned, but in a manner I don't understand: for example, the result of Math.ceil(totalCount/pageSize).toInt is 4 while it should be 5 (it is 5 if I execute it in the immediate window... I don't get it). I would then expect asLongAs(currentPage <= totalPages) to repeat 5 times, but it only repeats twice.
I tried to create the function in a class rather than an object because, as far as I understand, there is only one object instance. (To prevent multiple users from accessing the same variable I also ran with only a single user, with the same result.)
I am obviously missing something basic here (I'm new to Gatling and Scala), so any help would be highly appreciated :)
Using regular Scala variables to hold the values isn't going to work - the Gatling DSL defines builders that are only executed once at startup, so lines like
.asLongAs(currentPage <= totalPages)
will only ever execute with the initial values.
So you just need to handle everything using session variables:
def getAndPaginate(jsonBody: String) = {
  val pageSize = 50

  exec(session => session.set("notDone", true))
    .asLongAs("${notDone}", "index") {
      exec(http("Get assets action list")
        .post("/api")
        .body(ElFileBody("json/" + jsonBody)).asJson
        .check(
          jsonPath("$.result.totalCount")
            // transform lets us take the result of a check (and an optional session) and modify it
            // before storing - so we can use it to store a boolean that reflects whether we're on the last page
            .transform((totalCount, session) => ((session("index").as[Int] + 1) * pageSize) < totalCount.toInt)
            .saveAs("notDone")
        )
      )
      .pause(Config.minDelayValue seconds, Config.maxDelayValue seconds)
    }
}
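For completeness, here is a hedged sketch of how this chain could be wired into a simulation. The scenario name, JSON file name, injection profile, and httpProtocol below are assumptions, not part of the original post; the ElFileBody template under json/ would read the pagination state from the session (for example a pageIndex attribute derived from the "index" loop counter).
// Hypothetical wiring - all names below are assumptions for illustration.
val scn = scenario("Paginated fetch")
  .exec(getAndPaginate("getObjects.json"))

setUp(
  scn.inject(atOnceUsers(1))
).protocols(httpProtocol)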
This question is related to Slick.
I have three tables:
1) Users
2) Team2members
3) Team2Owners
In my POST request to users I am passing values for memberOf and managerOf; these values will be inserted into the Team2members and Team2Owners tables respectively, not into the Users table. The other values of the POST request will be inserted into the Users table.
My POST request body looks like this:
{
  "kind": "via#user",
  "userReference": {"userId": "priya16"},
  "user": "preferredNameSpecialChar#domain1.com",
  "memberOf": {"teamReference": {"organizationId": "airtel", "teamId": "supportteam"}},
  "managerOf": {"teamReference": {"organizationId": "airtel", "teamId": "supportteam"}},
  "firstName": "Special_fn1",
  "lastName": "specialChar_ln1",
  "preferredName": [{"locale": "employee1", "value": "##$%^&*(Z0FH"}],
  "description": " preferredNameSpecialChar test "
}
I am forming the query shown below. It works fine when only memberInsert is defined; when I define both values, i.e. memberInsert and managerInsert, the insertion happens only for the second value.
val query = config.api.customerTableDBIO(apiRequest.parameters.organizationId).flatMap { tables =>
  val userInsert = tables.Users returning tables.Users += empRow

  val memberInsert = inputObject.memberOf.map(m => m.copy(teamReference = m.teamReference.copy(organizationId = apiRequest.parameters.organizationId))).map { r =>
    for {
      team2MemberRow <- tables.Team2members returning tables.Team2members += Teams2MembersEntity.fromEmtToTeams2Members(r, empRow.id)
      team <- tables.Teams.filter(_.id === r.teamReference.teamId.toLowerCase).map(_.friendlyName).result.headOption
    } yield (team2MemberRow, team)
  }

  val managerInsert = inputObject.managerOf.map(m => m.copy(teamReference = m.teamReference.copy(organizationId = apiRequest.parameters.organizationId))).map { r =>
    for {
      team2OwnerRow <- tables.Team2owners returning tables.Team2owners += Teams2OwnersEntity.fromEmtToTeam2owners(r, empRow.id)
      team <- tables.Teams.filter(_.id === r.teamReference.teamId.toLowerCase).map(_.friendlyName).result.headOption
    } yield (team2OwnerRow, team)
  }

  userInsert.flatMap { userRow =>
    val user = UserEntity.fromDbEntity(userRow)
    if (memberInsert.isDefined) memberInsert.get
      .map(r => user.copy(memberOf = Some(Teams2MembersEntity.fromEmtToMemberRef(r._1, r._2.map(TeamEntity.toApiFriendlyName).getOrElse(List.empty)))))
    else DBIO.successful(user)
    if (managerInsert.isDefined) managerInsert.get
      .map(r => user.copy(managerOf = Some(Teams2OwnersEntity.fromEmtToManagerRef(r._1, r._2.map(TeamEntity.toApiFriendlyName).getOrElse(List.empty)))))
    else DBIO.successful(user)
  }
}
The problem looks to be with the final call to flatMap.
That should return a DBIO[T]. However, your expression generates a DBIO[T] in various branches, but only one value will be returned from flatMap. That would explain why you don't see all the actions being run.
Instead, what you could do is assign each step to a value and sequence them. There are lots of ways you could do that, such as using DBIO.seq or andThen.
Here's a sketch of one approach that might work for you....
val maybeInsertMember: Option[DBIO[User]] =
  member.map( /* your code for constructing an action here */ )

val maybeInsertManager: Option[DBIO[User]] =
  manager.map( /* your code for constructing an action here */ )

DBIO.sequenceOption(maybeInsertMember) andThen
  DBIO.sequenceOption(maybeInsertManager) andThen
  DBIO.successful(user)
The result of that expression is a DBIO[User] which combines three queries together.
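As a follow-up (not part of the original answer), the combined action can then be run, and made atomic, in the usual Slick way. A minimal sketch, where db and combinedAction are assumed names and the profile API import (which provides .transactionally) is in scope:
// Hedged usage sketch - "db" and "combinedAction" are assumed names.
// Assumes the profile API is imported, e.g. slick.jdbc.PostgresProfile.api._
import scala.concurrent.Future

val result: Future[User] = db.run(combinedAction.transactionally)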
I'm fetching a list of items from a REST API. The user interacts with each one via a click, and when there are only, say, a couple left unused, I'd like to repeat the request to get more items. I'm trying to do this using a proper RxJS 5 stream-oriented approach.
So, something like:
var userClick$ = Observable.fromEvent(button.nativeElement, 'click');
var needToExtend$ = new BehaviorSubject(1);
var list$ = needToExtend$
.flatMap( () => this.http.get("http://myserver/get-list") )
.flatMap( x => x['list'] );
var itemsUsed$ = userClick$.zip(list$, (click, item) => item);
itemsUsed$.subscribe( item => use(item) );
and then, to trigger a re-load when necessary:
list$.subscribe(
if (list$.isEmpty()) {
needToExtend$.next(1);
}
)
This last bit is wrong, and manually re-triggering doesn't seem very "stream-oriented" even if it did work as intended. Any ideas?
This is similar to "Rxjs - Consume API output and re-query when cache is empty", but I can't make assumptions about the length of the list returned by the API, and I'd like to re-request before the list is completely consumed. Also, the solution there feels a bit too clever. There must be a more readable way, right?
How about something like this:
const LIST_LIMIT = 3;
userClick$ = Observable.fromEvent(button.nativeElement, 'click');
list$ = this.http.get("http://myserver/get-list").map(r => r.list);
clickCounter$ = this.userClick$.scan((acc: number, val) => acc + 1, 0);
getList$ = new BehaviorSubject([]);
this.getList$
.switchMap(previousList => this.list$)
.switchMap(list => this.clickCounter$, (list, clickCount) => { return {list, clickCount}; })
.filter(({list, clickCount}) => clickCount >= list.length - LIST_LIMIT)
.map(({list, clickCount}) => list)
.subscribe(this.getList$);
The logic here is that you define a list getter stream and a signal to trigger it.
First, the signal causes switchMap to fetch a new list, which is then fed into another switchMap that resubscribes to a click counter. You combine the result of both streams and feed that to filter, which only emits when the click count is greater than or equal to the list length minus 3 (or whatever limit you want). Then the signal is subscribed to this whole stream so that it retriggers itself.
Edit: the biggest weakness of this is that you need to set the list value (for display) in a side effect rather than in a subscription or with the async pipe. You can rearrange it and multicast, though:
const LIST_LIMIT = 3;
userClick$ = Observable.fromEvent(button.nativeElement, 'click');
list$ = this.http.get("http://myserver/get-list").map(r => r.list);
clickCounter$: Observable<number> = this.userClick$.scan((acc: number, val) => acc + 1, 0).startWith(0);
getList$ = new BehaviorSubject([]);
refresh$ = this.getList$
    .switchMap(list => this.clickCounter$
        .filter(clickCount => list.length <= clickCount + LIST_LIMIT)
        .first(),
      (list, clickCount) => list)
    .switchMap(previousList => this.list$)
    .multicast(() => this.getList$);

this.refresh$.connect();
this.refresh$.subscribe(e => console.log(e));
This way has a few advantages, but may be a little less "readable". The pieces are mostly the same, but instead you go to the counter first and let that lead into the switch to the list fetch, and you multicast it to restart the counter.
I'm not clear on how you are tracking when to get the next set of items, so I will assume it is some form of paging for my answer. I also assume that you don't know the total number of items.
console.clear();

const pageSize = 5;
const pageBuffer = 2;
const data = [...Array(17).keys()];

function getData(page) {
    const begin = pageSize * page;
    const end = begin + pageSize;
    return Rx.Observable.of(data.slice(begin, end));
}

const clicks = Rx.Observable.interval(400);

clicks
    .scan(count => ++count, 0)
    .do(() => console.log('click'))
    .map(count => {
        const page = Math.floor(count / pageSize) + 1;
        const total = page * pageSize;
        return { total, page, count };
    })
    .filter(x => x.total - pageBuffer === x.count)
    .startWith({ page: 0 })
    .switchMap(x => getData(x.page))
    .takeWhile(x => x.length > 0)
    .subscribe(
        x => { console.log('next: ', x); },
        x => { console.log('error: ', x); },
        () => { console.log('completed'); }
    );
<script src="https://cdnjs.cloudflare.com/ajax/libs/rxjs/5.5.3/Rx.min.js"></script>
Here is an explanation:
Rx.Observable.interval(#): simulates the client click events
.scan(...): accumulates the click events
.map(...): calculates the page index and potential total item count (the actual count could be less, but it doesn't matter for our purposes).
.filter(...): only allows a new page of data to be fetched when the click count has just hit the page buffer.
.startWith(...): gets the first page without waiting for clicks. The +1 on the page calculation in the .map accounts for this.
.switchMap(...): get the next page of data.
.takeWhile(...): keep the stream open till we get an empty list.
So it will get an initial page and then go get a new page whenever the number of clicks comes within the designated buffer. Once all items have been retrieved (signalled by an empty list) it will complete.
One thing I didn't figure out how to do is to complete the stream when the returned page is shorter than the page size. Not sure if it matters to you.
I've got a simple program here that displays the number of letters in various words. It works as expected.
static void Main(string[] args) {
    var word = new Subject<string>();
    var wordPub = word.Publish().RefCount();
    var length = word.Select(i => i.Length);

    var report =
        wordPub
            .GroupJoin(length,
                s => wordPub,
                s => Observable.Empty<int>(),
                (w, a) => new { Word = w, Lengths = a })
            .SelectMany(i => i.Lengths.Select(j => new { Word = i.Word, Length = j }));

    report.Subscribe(i => Console.WriteLine($"{i.Word} {i.Length}"));

    word.OnNext("Apple");
    word.OnNext("Banana");
    word.OnNext("Cat");
    word.OnNext("Donkey");
    word.OnNext("Elephant");
    word.OnNext("Zebra");

    Console.ReadLine();
}
And the output is:
Apple 5
Banana 6
Cat 3
Donkey 6
Elephant 8
Zebra 5
I used Publish().RefCount() because "wordPub" is included in "report" twice. Without it, when a word is emitted, first one part of the report would be notified by a callback and then the other part would be notified, doubling the notifications. That is kind of what happens: the output ends up having 11 items rather than 6. At least that is what I think is going on. I think of using Publish().RefCount() in this situation as simultaneously updating both parts of the report.
However if I change the length function to ALSO use the published source like this:
var length = wordPub.Select(i => i.Length);
Then the output is this:
Apple 5
Apple 6
Banana 6
Cat 3
Banana 3
Cat 6
Donkey 6
Elephant 8
Donkey 8
Elephant 5
Zebra 5
Why can't the length function also use the same published source?
This was a great challenge to solve!
The conditions under which this happens are so subtle.
Apologies in advance for the long explanation, but bear with me!
TL;DR
Subscriptions to the published source are processed in order, but before any subscription made directly to the unpublished source, i.e. you can jump the queue!
With GroupJoin subscription order is important to determine when windows open and close.
My first concern would be that you are publish refcounting a subject.
This should be a no-op.
Subject<T> has no subscription cost.
So when you remove the Publish().RefCount():
var word = new Subject<string>();
var wordPub = word;//.Publish().RefCount();
var length = word.Select(i => i.Length);
then you get the same issue.
So then I look to the GroupJoin (because my intuition suggests that Publish().RefCount() is a red herring).
For me, eyeballing this alone was too hard to rationalise, so I lean on a simple debugging tool I have used dozens of times over the years - a Trace or Log extension method.
public interface ILogger
{
    void Log(string input);
}

public class DumpLogger : ILogger
{
    public void Log(string input)
    {
        // LinqPad `Dump()` extension method.
        // Could use Console.Write instead.
        input.Dump();
    }
}

public static class ObservableLoggingExtensions
{
    private static int _index = 0;

    public static IObservable<T> Log<T>(this IObservable<T> source, ILogger logger, string name)
    {
        return Observable.Create<T>(o =>
        {
            var index = Interlocked.Increment(ref _index);
            var label = $"{index:0000}{name}";
            logger.Log($"{label}.Subscribe()");
            var disposed = Disposable.Create(() => logger.Log($"{label}.Dispose()"));
            var subscription = source
                .Do(
                    x => logger.Log($"{label}.OnNext({x.ToString()})"),
                    ex => logger.Log($"{label}.OnError({ex})"),
                    () => logger.Log($"{label}.OnCompleted()")
                )
                .Subscribe(o);
            return new CompositeDisposable(subscription, disposed);
        });
    }
}
When I add the logging to your provided code it looks like this:
var logger = new DumpLogger();

var word = new Subject<string>();
var wordPub = word.Publish().RefCount();
var length = word.Select(i => i.Length);

var report =
    wordPub.Log(logger, "lhs")
        .GroupJoin(word.Select(i => i.Length).Log(logger, "rhs"),
            s => wordPub.Log(logger, "lhsDuration"),
            s => Observable.Empty<int>().Log(logger, "rhsDuration"),
            (w, a) => new { Word = w, Lengths = a })
        .SelectMany(i => i.Lengths.Select(j => new { Word = i.Word, Length = j }));

report.Subscribe(i => ($"{i.Word} {i.Length}").Dump("OnNext"));

word.OnNext("Apple");
word.OnNext("Banana");
word.OnNext("Cat");
word.OnNext("Donkey");
word.OnNext("Elephant");
word.OnNext("Zebra");
This will then output in my log something like the following
Log with Publish().RefCount() used
0001lhs.Subscribe()
0002rhs.Subscribe()
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe()
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe()
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose()
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Banana 6
...
However when I remove the usage Publish().RefCount() the new log output is as follows:
Log without only Subject
0001lhs.Subscribe()
0002rhs.Subscribe()
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe()
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe()
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Apple 6
OnNext
Banana 6
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose()
...
This gives us some insight; however, the issue really becomes clear when we start annotating our logs with a logical list of subscriptions.
In the original (working) code with the RefCount, our annotations might look like this:
//word.Subscribers.Add(wordPub)
0001lhs.Subscribe() //wordPub.Subscribers.Add(0001lhs)
0002rhs.Subscribe() //word.Subscribers.Add(0002rhs)
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe() //wordPub.Subscribers.Add(0003lhsDuration)
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe() //wordPub.Subscribers.Add(0005lhsDuration)
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose() //wordPub.Subscribers.Remove(0003lhsDuration)
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Banana 6
So in this example, when word.OnNext("Banana"); is executed the chain of observers is linked in this order
wordPub
0002rhs
However, wordPub has child subscriptions!
So the real subscription list looks like
wordPub
0001lhs
0003lhsDuration
0005lhsDuration
0002rhs
If we annotate the Subject only log we see where the subtlety lies
0001lhs.Subscribe() //word.Subscribers.Add(0001lhs)
0002rhs.Subscribe() //word.Subscribers.Add(0002rhs)
0001lhs.OnNext(Apple)
0003lhsDuration.Subscribe() //word.Subscribers.Add(0003lhsDuration)
0002rhs.OnNext(5)
0004rhsDuration.Subscribe()
0004rhsDuration.OnCompleted()
0004rhsDuration.Dispose()
OnNext
Apple 5
0001lhs.OnNext(Banana)
0005lhsDuration.Subscribe() //word.Subscribers.Add(0005lhsDuration)
0002rhs.OnNext(6)
0006rhsDuration.Subscribe()
0006rhsDuration.OnCompleted()
0006rhsDuration.Dispose()
OnNext
Apple 6
OnNext
Banana 6
0003lhsDuration.OnNext(Banana)
0003lhsDuration.Dispose()
So in this example, when word.OnNext("Banana"); is executed the chain of observers is linked in this order
1. 0001lhs
2. 0002rhs
3. 0003lhsDuration
4. 0005lhsDuration
As the 0003lhsDuration subscription is activated after the 0002rhs one, it won't see the "Banana" value that terminates its window until after the rhs has been sent the value, thus yielding it in the still-open window.
Whew
As @francezu13k50 points out, the obvious and simple solution to your problem is to just use word.Select(x => new { Word = x, Length = x.Length });, but as I think you have given us a simplified version of your real problem (appreciated), I understand why this isn't suitable.
However, as I don't know what your real problem space is, I am not sure what to suggest as a solution, except that you already have one with your current code, and now you should know why it works the way it does.
RefCount returns an Observable that stays connected to the source as long as there is at least one subscription to the returned Observable. When the last subscription is disposed, RefCount disposes its connection to the source, and reconnects when a new subscription is made. It might be the case with your report query that all subscriptions to 'wordPub' are disposed before the query is fulfilled.
Instead of the complicated GroupJoin query you could simply do:
var report = word.Select(x => new { Word = x, Length = x.Length });
Edit:
Change your report query to this if you want to use the GroupJoin operator:
var report =
    wordPub
        .GroupJoin(length,
            s => wordPub,
            s => Observable.Empty<int>(),
            (w, a) => new { Word = w, Lengths = a })
        .SelectMany(i => i.Lengths.FirstAsync().Select(j => new { Word = i.Word, Length = j }));
Because GroupJoin seems to be very tricky to work with, here is another approach for correlating the inputs and outputs of functions.
static void Main(string[] args) {
    var word = new Subject<string>();
    var length = new Subject<int>();

    var report =
        word
            .CombineLatest(length, (w, l) => new { Word = w, Length = l })
            .Scan((a, b) => new { Word = b.Word, Length = a.Word == b.Word ? b.Length : -1 })
            .Where(i => i.Length != -1);

    report.Subscribe(i => Console.WriteLine($"{i.Word} {i.Length}"));

    word.OnNext("Apple"); length.OnNext(5);
    word.OnNext("Banana");
    word.OnNext("Cat"); length.OnNext(3);
    word.OnNext("Donkey");
    word.OnNext("Elephant"); length.OnNext(8);
    word.OnNext("Zebra"); length.OnNext(5);

    Console.ReadLine();
}
This approach works if every input has 0 or more outputs subject to the constraints that (1) outputs only arrive in the same order as the inputs AND (2) each output corresponds to its most recent input. This is like a LeftJoin - each item in the first list (word) is paired with items in the right list (length) that subsequently arrive, up until another item in the first list is emitted.
Trying to use a regular Join instead of GroupJoin. I thought the problem was that when a new word was created there was a race condition inside Join between creating a new window and ending the current one. So here I tried to eliminate that by pairing every word with a null signifying the end of the window. It doesn't work, just like the first version did not. How is it possible that a new window is created for each word without the previous one being closed first? Completely confused.
static void Main(string[] args) {
    var lgr = new DelegateLogger(Console.WriteLine);

    var word = new Subject<string>();

    var wordDelimited =
        word
            .Select(i => Observable.Return<string>(null).StartWith(i))
            .SelectMany(i => i);

    var wordStart = wordDelimited.Where(i => i != null);
    var wordEnd = wordDelimited.Where(i => i == null);

    var report = Observable
        .Join(
            wordStart.Log(lgr, "word"), // starts window
            wordStart.Select(i => i.Length),
            s => wordEnd.Log(lgr, "expireWord"), // ends current window
            s => Observable.Empty<int>(),
            (l, r) => new { Word = l, Length = r });

    report.Subscribe(i => Console.WriteLine($"{i.Word} {i.Length}"));

    word.OnNext("Apple");
    word.OnNext("Banana");
    word.OnNext("Cat");
    word.OnNext("Zebra");
    word.OnNext("Elephant");
    word.OnNext("Bear");

    Console.ReadLine();
}
I have implemented a daily computation. Here is some pseudo-code.
"newUser" might also be called "first activated user".
// Get today log from hbase or somewhere else
val log = getRddFromHbase(todayDate)
// Compute active user
val activeUser = log.map(line => ((line.uid, line.appId), line)).reduceByKey(distinctStrategyMethod)
// Get history user from hdfs
val historyUser = loadFromHdfs(path + yesterdayDate)
// Compute new user from active user and historyUser
val newUser = activeUser.subtractByKey(historyUser)
// Get new history user
val newHistoryUser = historyUser.union(newUser)
// Save today history user
saveToHdfs(path + todayDate)
Computation of "activeUser" can be converted to Spark Streaming easily. Here is some code:
val transformedLog = sdkLogDs.map(sdkLog => {
  val time = System.currentTimeMillis()
  val timeToday = ((time - (time + 3600000 * 8) % 86400000) / 1000).toInt
  ((sdkLog.appid, sdkLog.bcode, sdkLog.uid), (sdkLog.channel_no, sdkLog.ctime.toInt, timeToday))
})

val activeUser = transformedLog.groupByKeyAndWindow(Seconds(86400), Seconds(60)).mapValues(x => {
  var firstLine = x.head
  x.foreach(line => {
    if (line._2 < firstLine._2) firstLine = line
  })
  firstLine
})
But the approach for "newUser" and "historyUser" is confusing me.
I think my question can be summarized as "how to count new elements from a stream". As in my pseudo-code above, "newUser" is part of "activeUser", and I must maintain a set of "historyUser" to know which part is "newUser".
I have considered an approach, but I think it may not work the right way:
Load the history users as an RDD. For each DStream batch of "activeUser", find the elements that don't exist in "historyUser". One problem here is when I should update this "historyUser" RDD to make sure I get the right "newUser" for a window.
Updating the "historyUser" RDD means adding "newUser" to it, just like what I did in the pseudo-code above. The "historyUser" is updated once a day in that code. Another problem is how to do this RDD update operation from a DStream. I think updating "historyUser" when the window slides is proper, but I haven't found a suitable API to do this.
So what is the best practice to solve this problem?
updateStateByKey would help here, as it allows you to set initial state (your historical users) and then update it on each interval of your main stream. I put some code together to explain the concept:
val historyUsers = loadFromHdfs(path + yesterdayDate).map(UserData(...))
case class UserStatusState(isNew: Boolean, values: UserData)
// this will prepare the RDD of already known historical users
// to pass into updateStateByKey as initial state
val initialStateRDD = historyUsers.map(user => UserStatusState(false, user))
// stateful stream
val trackUsers = sdkLogDs.updateStateByKey(updateState, new HashPartitioner(sdkLogDs.ssc.sparkContext.defaultParallelism), true, initialStateRDD)
// only new users
val newUsersStream = trackUsers.filter(_._2.isNew)
def updateState(newValues: Seq[UserData], prevState: Option[UserStatusState]): Option[UserStatusState] = {
  // Group all values for specific user as needed
  val groupedUserData: UserData = newValues.reduce(...)

  // prevState is defined only for users previously seen in the stream
  // or loaded as initial state from historyUsers RDD
  // For new users it is None
  val isNewUser = !prevState.isDefined

  // as you return state here for the user - prevState won't be None on next iterations
  Some(UserStatusState(isNewUser, groupedUserData))
}
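One practical note to add (not part of the original answer): stateful transformations such as updateStateByKey require checkpointing to be enabled on the StreamingContext, or the job will fail at startup. A minimal sketch, where the ssc variable name and the directory are assumptions:
// updateStateByKey keeps per-key state across batches, so Spark Streaming
// requires a checkpoint directory; "ssc" (the StreamingContext) and the path
// below are assumed names - use a reliable store such as HDFS in practice.
ssc.checkpoint("hdfs:///checkpoints/new-user-job")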
I would like to do a little project that does some calculations and adds the calculated results to a listbox.
My code:
int SumLoop(int lowLimit, int highLimit)
{
    int idx;
    int totalSum = 0;
    for (idx = lowLimit; idx <= highLimit; idx = idx + 1)
    {
        totalSum += idx;
    }
    return totalSum;
}

private void button1_Click(object sender, EventArgs e)
{
    var test2 = Observable.Interval(TimeSpan.FromMilliseconds(1000)).Select(x => (int)x).Take(10);
    test2.Subscribe(n =>
    {
        this.BeginInvoke(new Action(() =>
        {
            listBox1.Items.Add("input:" + n);
            listBox1.Items.Add("result:" + SumLoop(n, 99900000));
        }));
    });
}
The result:
input:0
result:376307504
(stop a while)
input:1
result:376307504
(stop a while)
input:2
result:376307503
(stop a while)
input:3
result:376307501
(stop a while)
....
...
..
.
input:9
result:376307468
If I modify the interval constant from 1000 --> 10,
var test2 = Observable.Interval(TimeSpan.FromMilliseconds(10)).Select(x=>(int)x).Take(10);
The display behavior becomes different: the listbox displays all inputs and results in one shot. It seems to wait for all results to complete and then display everything in the listbox. Why?
If I keep using this constant (interval: 10), I don't want to display everything in one shot. I want to display "input:0" --> wait for the calculation --> display "result:376307504"....
So, how can I do this?
Thanks for your help.
If I understand you correctly, you want to run the sum loop off the UI thread; here's how you would do that:
Observable
    .Interval(TimeSpan.FromMilliseconds(1000))
    .Select(x => (int)x)
    .Select(x => SumLoop(x, 99900000))
    .Take(10)
    .ObserveOn(listBox1) // or ObserveOnDispatcher() if you're using WPF
    .Subscribe(r =>
    {
        listBox1.Items.Add("result:" + r);
    });
You should see the results trickle in on an interval of 10ms + ~500ms.
Instead of doing control.Invoke/control.BeginInvoke, you'll want to call .ObserveOnDispatcher() to get your action invoked on the UI thread:
Observable
    .Interval(TimeSpan.FromMilliseconds(1000))
    .Select(x => (int)x)
    .Take(10)
    .Subscribe(x =>
    {
        listBox1.Items.Add("input:" + x);
        listBox1.Items.Add("result:" + SumLoop(x, 99900000));
    });
You said that if you change the interval from 1000 ms to 10ms, you observe different behavior.
The listbox will display all inputs and results just a shot.
I suspect this is because 10ms is so fast, all the actions you're executing are queued up. The UI thread comes around to execute them, and wham, executes everything that's queued.
In contrast, posting them every 1000ms (one second) allows the UI thread to execute one, rest, execute another one, rest, etc.