I have the following code which uses a PublishSubject.
val subject = PublishSubject.create<Int>()

val o1: Observable<String> =
    subject.observeOn(Schedulers.newThread()).map { i: Int ->
        println("${Thread.currentThread()} | ${Date()} | map => $i")
        i.toString()
    }

o1.subscribe {
    println("${Thread.currentThread()} | ${Date()} | direct subscription (1) => $it")
}

o1.subscribe {
    println("${Thread.currentThread()} | ${Date()} | direct subscription (2) => $it")
}

o1.subscribe {
    println("${Thread.currentThread()} | ${Date()} | direct subscription (3) => $it")
}

println("${Thread.currentThread()} | ${Date()} | submitting 1")
subject.onNext(1)
1) I create an Observable from it and map it (for the purpose of this example I am just converting to a String) => o1.
2) I then subscribe to o1 3 times.
3) Finally I "publish" an event by calling subject.onNext(1).
To my surprise I am getting the following output:
Thread[main,5,main] | Mon Jun 19 09:46:37 PDT 2017 | submitting 1
Thread[RxNewThreadScheduler-1,5,main] | Mon Jun 19 09:46:37 PDT 2017 | map => 1
Thread[RxNewThreadScheduler-2,5,main] | Mon Jun 19 09:46:37 PDT 2017 | map => 1
Thread[RxNewThreadScheduler-3,5,main] | Mon Jun 19 09:46:37 PDT 2017 | map => 1
Thread[RxNewThreadScheduler-1,5,main] | Mon Jun 19 09:46:37 PDT 2017 | direct subscription (1) => 1
Thread[RxNewThreadScheduler-2,5,main] | Mon Jun 19 09:46:37 PDT 2017 | direct subscription (2) => 1
Thread[RxNewThreadScheduler-3,5,main] | Mon Jun 19 09:46:37 PDT 2017 | direct subscription (3) => 1
map ends up being called 3 times and I don't understand why, since I am subscribing to o1, which should already sit downstream of map. I must be missing something. Any help would be appreciated.
Thanks
Yan
From the comments:
You subscribe to o1 three times, each creating an independent sequence up until the PublishSubject that will dispatch the onNext to all 3 chains.
From the perspective of all 3 subscribers, PublishSubject is a multicasting source that signals events to them through independent chains established by the subscribe() calls.
Applying operators to a Subject generally doesn't make the whole chain hot, because those operators only get attached to the source Subject when they are subscribed to. Thus, multiple subscriptions yield multiple channels to the same upstream Subject.
Use publish to get a ConnectableObservable (or another PublishSubject at the very end) to make the sequence hot from that point on.
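A minimal sketch of that fix, assuming RxJava 2 (io.reactivex packages; adjust the imports for RxJava 1) and reusing the question's setup. publish() returns a ConnectableObservable, so a single subscription feeds the Subject-to-map chain and the mapped value is multicast to all three subscribers:

import io.reactivex.observables.ConnectableObservable
import io.reactivex.schedulers.Schedulers
import io.reactivex.subjects.PublishSubject

val subject = PublishSubject.create<Int>()

// publish() shares everything upstream of it: one subscription to the Subject
// feeds all downstream subscribers, so map is invoked once per onNext.
val o1: ConnectableObservable<String> =
    subject.observeOn(Schedulers.newThread())
        .map { i: Int ->
            println("${Thread.currentThread()} | map => $i") // printed once per event
            i.toString()
        }
        .publish()

o1.subscribe { println("direct subscription (1) => $it") }
o1.subscribe { println("direct subscription (2) => $it") }
o1.subscribe { println("direct subscription (3) => $it") }

o1.connect() // start the shared upstream subscription

subject.onNext(1)

Instead of calling connect() manually, refCount() or autoConnect(3) on the ConnectableObservable can start the shared subscription automatically once subscribers arrive.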
I have a dataframe that looks like this:
+--------+-------------------------------------+-----------+
| Worker | Schedule | Overtime |
+--------+-------------------------------------+-----------+
| 1 | 23344--23344--23344--23344--23344-- | 3 |
+--------+-------------------------------------+-----------+
| 2 | 34455--34455--34455--34455--34455-- | 2 |
+--------+-------------------------------------+-----------+
| 3 | 466554-466554-466554-466554-466554- | 1 |
+--------+-------------------------------------+-----------+
Each digit in the 35-character Schedule string is the worker's work hours for one day of a 35-day window.
Here is how to read each row:
Worker #1 works 2hr on Monday, 3hr on Tuesday, 3hr on Wednesday, 4hr on Thursday, 4hr on Friday, then off on Saturday and Sunday... (same for the following weeks in that 35-day window)
Worker #3 works 4hr on Monday, 6hr on Tuesday, 6hr on Wednesday, 5hr on Thursday, 5hr on Friday, 4hr on Saturday, then off on Sunday... (same for the following weeks in that 35-day window)
I would like to implement the following operation:
- For each day of a worker's schedule, if the hours he works that day + Overtime is <= 6, add the overtime hours to that day. No change is applied to days off (marked with -).
For example:
Worker #1's updated schedule would look like:
56644--56644--56644--56644--56644--
2+3 <= 6 -> add 3 hrs
3+3 <= 6 -> add 3 hrs
4+3 !<= 6 -> no edit
-- -> days off, no edit
Using the same logic, Worker #2's updated schedule would look like:
56655--56655--56655--56655--56655--
Worker #3's updated schedule would look like:
566665-566665-566665-566665-566665-
How do I perform this operation in PySpark?
Much appreciation for your help!
The shortest way (and probably the best-performing) is to use Spark SQL's transform function, which loops through an array built from your schedule and performs the comparison element-wise, even though the code looks a bit cryptic.
from pyspark.sql import functions as F
(df
    .withColumn('recalculated_schedule', F.expr('concat_ws("", transform(split(schedule, ""), x -> case when x = "-" then "-" when x + overtime <= 6 then cast(x + overtime as int) else x end))'))
    .show(10, False)
)
+------+-----------------------------------+--------+-----------------------------------+
|worker|schedule |overtime|recalculated_schedule |
+------+-----------------------------------+--------+-----------------------------------+
|1 |23344--23344--23344--23344--23344--|3 |56644--56644--56644--56644--56644--|
|2 |34455--34455--34455--34455--34455--|2 |56655--56655--56655--56655--56655--|
|3 |466554-466554-466554-466554-466554-|1 |566665-566665-566665-566665-566665-|
+------+-----------------------------------+--------+-----------------------------------+
I need help writing this SQL query (PostgreSQL) to display results in the form below:
--------------------------------------------------------------------------------
State | Jan '17 | Feb '17 | Mar '17 | Apr '17 | May '17 ... Dec '18
--------------------------------------------------------------------------------
Principal Outs. |700,839 |923,000 |953,000 |6532,293 | 789,000 ... 913,212
Disbursal Amount |23,000 |25,000 |23,992 | 23,627 | 25,374 ... 23,209
Interest |113,000 |235,000 |293,992 |322,627 |323,374 ... 267,209
There are multiple tables but I would be okay joining them.
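It is hard to give an exact query without your table definitions, but one common way to get this month-per-column layout in PostgreSQL (9.4+) is conditional aggregation with FILTER, with one output row per metric. Below is a minimal sketch under that assumption; every table and column name (principal_balances, disbursals, interest_accruals, as_of_date, and so on) is a placeholder for whatever your joined tables actually contain:

-- Hypothetical sketch: pivot (metric, month, amount) rows into one column per month.
-- All table and column names are assumptions standing in for the real schema.
WITH loan_summary (metric, txn_month, amount) AS (
    SELECT 'Principal Outs.',  date_trunc('month', p.as_of_date),   p.principal_outstanding FROM principal_balances p
    UNION ALL
    SELECT 'Disbursal Amount', date_trunc('month', d.disbursed_on), d.amount                FROM disbursals d
    UNION ALL
    SELECT 'Interest',         date_trunc('month', i.accrued_on),   i.amount                FROM interest_accruals i
)
SELECT
    metric AS "State",
    SUM(amount) FILTER (WHERE txn_month = DATE '2017-01-01') AS "Jan '17",
    SUM(amount) FILTER (WHERE txn_month = DATE '2017-02-01') AS "Feb '17",
    SUM(amount) FILTER (WHERE txn_month = DATE '2017-03-01') AS "Mar '17"
    -- ...one FILTER column per month through Dec '18
FROM loan_summary
GROUP BY metric;

The tablefunc extension's crosstab() is the other common route, but FILTER-based aggregation stays in plain SQL and is usually easier to adjust when the column list changes.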
I have the stream shown below and I want to create another stream from it. I am running the command below and getting the following error. Am I missing something?
ksql> create stream down_devices_stream as select * from fakedata119 where deviceProperties['status']='false';
Failed to generate code for SqlPredicate.filterExpression: (FAKEDATA119.DEVICEPROPERTIES['status'] = 'false')schema:org.apache.kafka.connect.data.SchemaBuilder#6e18dbbfisWindowedKey:false
Caused by: Line 1, Column 180: Operator "<=" not allowed on reference operands
ksql> select * from fakedata119;
1529505497087 | null | 19 | visibility sensors | Wed Jun 20 16:38:17 CEST 2018 | {visibility=74, status=true}
1529505498087 | null | 7 | fans | Wed Jun 20 16:38:18 CEST 2018 | {temperature=44, rotationSense=1, status=false, frequency=49}
1529505499088 | null | 28 | air quality monitors | Wed Jun 20 16:38:19 CEST 2018 | {coPpm=257, status=false, Co2Ppm=134}
1529505500089 | null | 4 | fans | Wed Jun 20 16:38:20 CEST 2018 | {temperature=42, rotationSense=1, status=true, frequency=51}
1529505501089 | null | 23 | air quality monitors | Wed Jun 20 16:38:21 CEST 2018 | {coPpm=158, status=true, Co2Ppm=215}
ksql> describe fakedata119;
Field | Type
---------------------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
DEVICEID | INTEGER
CATEGORY | VARCHAR(STRING)
TIMESTAMP | VARCHAR(STRING)
DEVICEPROPERTIES | MAP[VARCHAR(STRING),VARCHAR(STRING)]
Without seeing your input data, I have guessed that it looks something like this:
{
  "id": "a42",
  "category": "foo",
  "timestamp": "2018-06-21 10:04:57 BST",
  "deviceID": 42,
  "deviceProperties": {
    "status": "false",
    "foo": "bar"
  }
}
And if so, you are better off using EXTRACTJSONFIELD to access the nested values and build predicates.
CREATE STREAM test (Id VARCHAR, category VARCHAR, timeStamp VARCHAR, \
deviceID INTEGER, deviceProperties VARCHAR) \
WITH (KAFKA_TOPIC='test_map2', VALUE_FORMAT='JSON');
ksql> SELECT EXTRACTJSONFIELD(DEVICEPROPERTIES,'$.status') AS STATUS FROM fakeData223;
false
ksql> SELECT * FROM fakeData223 \
WHERE EXTRACTJSONFIELD(DEVICEPROPERTIES,'$.status')='false';
1529572405759 | null | a42 | foo | 2018-06-21 10:04:57 BST | 42 | {"status":"false","foo":"bar"}
I've logged the error you've found as a bug to track here: https://github.com/confluentinc/ksql/issues/1474
I've added a test to cover this usecase:
https://github.com/confluentinc/ksql/pull/1476/files
Interestingly, this passes on our master and upcoming 5.0 branches, but fails on 4.1.
So... it looks like this is an issue on the version you're using, but the good news is it's fixed in the upcoming release. Plus you can use Robin's workaround above for now.
Happy querying!
Andy
I'm new to both Scala and Spark, so hopefully someone can let me know where I'm going wrong here.
I have a three-column dataset (id, name, year) and I want to find the most recent year for each name. In other words:
BEFORE                        AFTER
| id_1 | name_1 | 2015 |      | id_2 | name_1 | 2016 |
| id_2 | name_1 | 2016 |      | id_4 | name_2 | 2015 |
| id_3 | name_1 | 2014 |
| id_4 | name_2 | 2015 |
| id_5 | name_2 | 2014 |
I thought groupByKey and reduceGroups would get the job done:
val latestYears = ds
  .groupByKey(_.name)
  .reduceGroups((left, right) => if (left.year > right.year) left else right)
  .map(group => group._2)
But it gives this error, and spits out a lot of generated Java code:
ERROR CodeGenerator: failed to compile:
org.codehaus.commons.compiler.CompileException:
File 'generated.java', Line 21, Column 101: Unknown variable or type "value4"
Interestingly, if I create a dataset with just the name and year columns, it works as expected.
Here's the full code I'm running:
import org.apache.spark.sql.SparkSession

object App {
  case class Record(id: String, name: String, year: Int)

  def main(args: Array[String]) {
    val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
    import spark.implicits._

    val ds = spark.createDataset[String](Seq(
      "id_1,name_1,2015",
      "id_2,name_1,2016",
      "id_3,name_1,2014",
      "id_4,name_2,2015",
      "id_5,name_2,2014"
    ))
    .map(line => {
      val fields = line.split(",")
      new Record(fields(0), fields(1), fields(2).toInt)
    })

    val latestYears = ds
      .groupByKey(_.name)
      .reduceGroups((left, right) => if (left.year > right.year) left else right)
      .map(group => group._2)

    latestYears.show()
  }
}
EDIT: I believe this may be a bug with Spark v2.0.1. After downgrading to v2.0.0, this no longer occurs.
Your groupByKey and reduceGroups functions are experimental. Why not use reduceByKey (api)?
Pros:
It should be easy to translate from the code you have.
It is more stable (not experimental).
It should be more efficient because it does not require a complete shuffle of all the items in each group (which can also create network I/O slowdowns and overflow the memory on a node).
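A minimal sketch of that suggestion, reusing the spark session, Record case class and ds Dataset from the question above (it drops down to the RDD API, so treat it as an illustration of the idea rather than a drop-in Dataset equivalent):

// Key each record by name on the underlying RDD, keep the record with the
// largest year per key via reduceByKey, then lift the result back to a Dataset.
import spark.implicits._

val latestYears = ds.rdd
  .keyBy(_.name)
  .reduceByKey((left, right) => if (left.year > right.year) left else right)
  .values
  .toDS()

latestYears.show()

Because reduceByKey combines records map-side before shuffling, only one Record per name and partition crosses the network, which is the efficiency point made above.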
I have an entity LoginHistory that saves a user's login and logout times. I am building a query to get users who logged in today.
id | user_id | login_at | logout_at | created_at
----+---------+---------------------+---------------------+---------------------
44 | 38 | 2016-01-18 02:02:26 | | 2016-01-18 02:02:26
45 | 38 | | 2016-01-18 02:02:35 | 2016-01-18 02:02:35
51 | 38 | 2016-01-18 14:33:10 | | 2016-01-18 14:33:10
52 | 38 | | 2016-01-18 14:33:24 | 2016-01-18 14:33:24
Now in the Twig template I have to display the user's login and logout time, one session per line.
I wrote this query builder:
$loggedInUsers = $this->createQueryBuilder('logs')
    ->where('logs.user =:userID')
    ->andwhere('logs.loginAt >= :todayStartDate or logs.logoutdAt <= :todayEndDate')
    ->setParameter('userID', $userID)
    ->setParameter('todayStartDate', '2016-01-18 00:00:00')
    ->setParameter('todayEndDate', '2016-01-18 23:59:59')
    ->orderBy('logs.createdAt', 'ASC')
    ->getQuery()
    ->getResult();
and the Twig code is below:
{% for data in login_history %}
Login At {{ data.loginat|date('d M Y h:i:s A') }} & Logout at {{ data.logoutdAt|date('d M Y h:i:s A') }} <br />
{% endfor %}
and it outputs:
Login At 18 Jan 2016 03:17:08 PM & Logout at 13 Jan 2016 05:27:46 PM
Login At 18 Jan 2016 03:17:08 PM & Logout at 15 Jan 2016 11:27:35 AM
Login At 18 Jan 2016 02:02:26 AM & Logout at 18 Jan 2016 03:17:08 PM
Login At 18 Jan 2016 03:17:08 PM & Logout at 18 Jan 2016 02:02:35 AM
Login At 18 Jan 2016 02:33:10 PM & Logout at 18 Jan 2016 03:17:08 PM
Login At 18 Jan 2016 03:17:08 PM & Logout at 18 Jan 2016 02:33:24 PM
Instead it should return:
Login At 18 Jan 2016 02:02:26 AM & Logout at 13 Jan 2016 02:02:35 AM
Login At 18 Jan 2016 02:33:10 PM & Logout at 13 Jan 2016 02:33:24 PM
I don't know where I am going wrong...
Your query will find all users that logged in today as well as any user that logged out any time previously up until the end of today.
If you updated your query like...
$qb = $this->createQueryBuilder('logs');

$logHistory = $qb
    ->where('logs.user =:userID')
    // and where multiple "OR's"
    ->andWhere($qb->expr()->orX(
        // logs.loginAt BETWEEN :todayStartDate AND :todayEndDate
        $qb->expr()->between('logs.loginAt', ':todayStartDate', ':todayEndDate'),
        // logs.logoutAt BETWEEN :todayStartDate AND :todayEndDate
        $qb->expr()->between('logs.logoutAt', ':todayStartDate', ':todayEndDate')
    ))
    ->setParameter('userID', $userID)
    ->setParameter('todayStartDate', new \DateTime('today'))
    ->setParameter('todayEndDate', new \DateTime('tomorrow - 1 second'))
    ->orderBy('logs.createdAt', 'ASC')
    ->getQuery()
    ->getResult();
That would give you all users that have either logged in or out today, but only in the format of your original table (one record per row, as opposed to aggregated down to a login and a logout per row).
You could do the following to aggregate your results (It's not pretty and I'm not at all happy with it).
$userLog = array();

foreach ($logHistory as $log) {
    if (null !== $loginAt = $log->getLoginAt()) {
        $userLog[] = array(
            'userId'   => $log->getUserId(),
            'loginAt'  => $loginAt,
            'logoutAt' => null,
        );
    }

    if (null !== $logoutAt = $log->getLogoutAt()) {
        $updated = false;

        // walk the log backwards, as the most recent entries are the ones still waiting for a logout
        for ($k = count($userLog) - 1; $k >= 0; $k--) {
            if ($log->getUserId() === $userLog[$k]['userId'] && null === $userLog[$k]['logoutAt']) {
                $userLog[$k]['logoutAt'] = $logoutAt;
                $updated = true;
                break;
            }
        }

        // for logout events without a login, i.e. users that logged in yesterday
        if (false === $updated) {
            $userLog[] = array(
                'userId'   => $log->getUserId(),
                'loginAt'  => null,
                'logoutAt' => $logoutAt,
            );
        }
    }
}