no output from aggregateMessages in GraphFrames - scala

I am just starting with GraphFrames, and though I am following the documentation, I am not able to get any result from the aggregateMessages function (it returns an empty DataFrame). Here is a simplified example of my problem: I have a GraphFrame called testGraph whose vertexRDD consists of only a single vertex Y with no vertex attributes, and whose edgeRDD consists of two records like this:
| src | dst | min_ts1 | min_ts2 |
| X | Y | 20 | null |
| Y | X | null | -10 |
Now, I want to implement a simple algorithm that sends the value of min_ts1 to dst, and sends min_ts2 to src. The code I am using to implement this algorithm is:
import org.graphframes.lib.AggregateMessages
import org.apache.spark.sql.functions._
val AM = AggregateMessages
val msgToSrc = AM.edge("min_ts2")
val msgToDst = AM.edge("min_ts1")
val delay = testGraph
.aggregateMessages
.sendToSrc(msgToSrc)
.sendToDst(msgToDst)
.agg(sum(AM.msg).as("avg_time_delay"))
I realize there are some null values here, but regardless I would expect the message passing algorithm to do the following: look at the first record, and send a message of 20 to Y and a message of null to X. Then look at the second record, and send a message of null to X and a message of -10 to Y. Finally I would expect the result to show that the sum of messages for Y is 10, and for there to be no record for X in the result, since it was not included in the vertexRDD. And if X were included in the vertexRDD, I would expect the result to be simply null, since both of the messages were null.
However, what I am getting is an empty DataFrame. Could someone please help me understand why I am getting an empty result?

Ok, it appears that the reason for this issue is indeed that I did not have X in my vertexRDD. Even though there are edges going to and from that vertex in my edgeRDD, and my aggregateMessages call depends only on edge attributes, the algorithm cannot send messages to or from a vertex that is missing from the vertex DataFrame.
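For anyone hitting the same wall, here is a minimal sketch of the fix described above: rebuild the graph so that every endpoint referenced in the edge DataFrame also appears in the vertex DataFrame. (The variable names and the use of Option for the nullable timestamp columns are my own illustrative choices.)
import org.graphframes.GraphFrame
// Both X and Y must be present as vertices, even though only Y carries data.
val vertices = spark.createDataFrame(Seq(
  Tuple1("X"), Tuple1("Y")
)).toDF("id")
// Option(...) / Option.empty model the null timestamps in the edge table above.
val edges = spark.createDataFrame(Seq(
  ("X", "Y", Option(20), Option.empty[Int]),
  ("Y", "X", Option.empty[Int], Option(-10))
)).toDF("src", "dst", "min_ts1", "min_ts2")
val testGraph = GraphFrame(vertices, edges)
With X present, the original aggregateMessages query should return a row for Y with a sum of 10, plus a row for X with a null sum, since both of X's messages are null.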

Related

Find points inside the intersection of polygons in PostgreSQL/PostGIS

I want to find the points inside the intersection (Figure 1) of polygons in PostgreSQL.
[Figure 1: example of two intersecting polygons]
I use psycopg2 and the code that I used is:
intersects = """select ST_Intersects( ST_GeographyFromText('SRID=4326; POLYGON(( 32.0361328 33.6877818, 31.9042969 33.5780147,33.5742188 11.3507967,66.2695313 20.4270128, 51.9433594 34.270836, 32.0361328 33.6877818))'),
ST_GeographyFromText('SRID=4326; POLYGON((33.7060547 37.1953306,36.6943359 16.0880422,64.9072266 12.4258478,64.8632813 37.0551771,33.5742188 37.1953306,33.7060547 37.1953306))')), col.vessel_hash,ST_X(col.the_geom) AS long, ST_Y(col.the_geom) AS lat
from samplecol as col"""
cursor.execute(intersects)
pointsINtw = cursor.fetchall()
count = 0;
shipsrecords = open("/home/antonis/Desktop/testme1.txt", "w")
for ex in pointsINtw:
    if str(ex[0]) == 'True':
        count = count + 1
        shipsrecords.write(str(ex) + "\n")
print (CBLUE + "Number of returned results: " + CBLUEEND), count
Example record:
vessel_hash  | speed | latitude    | longitude   | course | heading | timestamp                | the_geom
-------------+-------+-------------+-------------+--------+---------+--------------------------+----------------------------------------------------
103079215239 | 5     | -5.41844510 | 36.12160900 | 314    | 511     | 2016-06-12T06:31:04.000Z | 0101000020E61000001BF33AE2900F424090AF4EDF7CAC15C0
The problem is that the above code does not work properly. I create two polygons like in Figure 1, and I know that some points exist inside the intersection, but the code always returns all points from the db.
If I create two polygons that do not intersect, then the algorithm seems to work properly, as it does not return any points.
Does anyone know what I am doing wrong?
demo:db<>fiddle (of your query, with your polygons, own points),
visualisation of the situation (maybe Chrome necessary)
ST_Intersects() only checks whether the two given polygons share some space. It is true, they do. But no part of your query checks the points against that shared area: you only call the intersection check, without ever using your point column.
I believe you need to calculate the intersection polygon (ST_Intersection()) instead of only checking for its existence. After that you can take this result and check whether each of your points lies inside it (ST_Contains()):
Pseudocode (adapted to your samplecol table; note that ST_Intersection/ST_Contains operate on geometry, so cast the geography values):
SELECT
    col.vessel_hash,
    ST_X(col.the_geom::geometry) AS long,
    ST_Y(col.the_geom::geometry) AS lat
FROM samplecol AS col
WHERE ST_Contains(
    ST_Intersection(my_geometry1, my_geometry2),
    col.the_geom::geometry
);
demo:db<>fiddle
(The demo uses geometry instead of geography, and the polygon needs to be made valid for some reason, so you will need to adapt this to your use case.)

Parallelizing sequential for-loop for GPU

I have a for-loop where the current element of a vector depends on the previous elements, and I am trying to parallelize it for a GPU in MATLAB.
A is an nx1 known vector
B is an nx1 output vector that is initialized to zeros.
The code is as follows:
for n = 1:numel(A)
    B(n+1) = B(n) + A(n)*B(n) + A(n)^k + B(n)^2;
end
I have looked at this similar question and tried to find a simple closed form for the recurrence relation, but couldn't find one.
I could do a prefix sum as mentioned in the first link over the A(n)^k term, but I was hoping there would be another method to speed up the loop.
Any advice is appreciated!
P.S. My real code involves 3D arrays that index and sum along 2D slices, but any help for the 1D case should transfer to a 3D scaling.
A word "Parallelizing" sounds magically, but scheduling rules apply:
Your problem is not in spending efforts on trying to convert a pure SEQ-process into it's PAR-re-representation, but in handling the costs of doing so, if you indeed persist into going PAR at any cost.
m = numel(A);
%%            +------+------+------+------+ .. +------+
%% const A[]: | A(1) | A(2) | A(3) | A(4) |    | A(m) |   known input, read-only
%%            +------+------+------+------+ .. +------+
%%            +------+------+------+------+ .. +------+
%% var   B[]: |  0   |  0   |  0   |  0   |    |  0   |   output; every B(n+1) must wait for B(n)
%%            +------+------+------+------+ .. +------+
for n = 1:m
    B(n+1) = ( ...             % .STO NEXT n+1
               B(n)^2 ...      % .GET LAST n   { FMA B, B, B }  ( in SEQ :: OK, local data, ALWAYS )
             + B(n) ...        %               ( in PAR :: non-local data. CSP + bcast + many distributed-cache invalidates )
             + B(n) * A(n) ... %               { FMA B, A, B }
             + A(n)^k ...      %               { ApK }
             );
end
Once the SEQ-process has a recurrent data-dependency (the LAST B(n) must be re-used for the assignment of the NEXT B(n+1)), any attempt to make such a SEQ calculation work in PAR has to introduce system-wide communication of known values. A "new" value can only be computed after the respective "previous" B(n) has been evaluated and assigned, through the purely serial SEQ chain of recurrent evaluation, so not before all the previous cells have been processed serially, because the LAST piece is always needed for the NEXT step (see the data-dependency annotated in the loop above). All the other elements have to wait in a "queue" until they can perform their two primitive FMA-s plus the .STO of the result for the next one in that recurrence-dictated "queue".
Yes, one can "force" the formula into PAR execution, but the very cost of communicating those LAST values "across" the PAR-execution fabric (towards the NEXT step) is typically prohibitively expensive in terms of resources and accrued delays: it either damages the latency-masking of the SIMT-optimised scheduler, or it blocks all threads until they receive the "neighbour"-assigned LAST value they rely on and cannot proceed without. Either effect devastates any potential benefit from all the effort invested into going PAR.
Even a pair of FMA-s is not enough code to justify the add-on costs; it is indeed an extremely small amount of work for all the PAR effort.
Unless some very mathematically "dense" processing is in place, the additional costs are not easily amortised, and such an attempt to introduce a PAR mode of computing exhibits nothing but a negative (adverse) effect instead of any wished-for speedup. In all professional cases, one ought to express all the add-on costs during a Proof-of-Concept phase (PoC), before deciding whether any feasible PAR approach is possible at all, and how to achieve a speedup of >> 1.0 x.
Relying on advertised theoretical GFLOPS and TFLOPS is nonsense. Your actual GPU kernel will never be able to repeat the advertised tests' performance figures (unless you run exactly the same optimised layout and code, which you do not need, do you?). One typically needs one's own specific algorithmisation, related to one's own problem domain, without artificially aligning all the toy-problem elements so that the GPU silicon never has to wait for real data and can enjoy tweaked cache/register-based ILP artifacts that are practically unachievable in most real-world solutions. If there is one step to recommend: always evaluate an overhead-fair PoC first, to see whether any chance for a speedup exists at all, before sinking resources, time, and money into prototyping a detailed design and testing it.
Recurrent and compute-weak GPU kernel payloads will, in almost every case, struggle just to pay back their additional overhead times (the bidirectional data transfers (H2D + D2H) plus the kernel-code-related loads).
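To see the serial chain in a functional form: the recurrence is exactly a left-scan, where step n+1 cannot begin before step n has produced its value. A minimal Scala sketch (the values of A and k are illustrative choices of mine):
val k = 2.0
val a = Vector(1.0, 2.0, 3.0, 4.0) // the known input A
// scanLeft threads each freshly computed B(n) into B(n+1) -- the very
// data-dependency that no PAR schedule can remove without extra communication.
val b = a.scanLeft(0.0) { (bPrev, aN) =>
  bPrev + aN * bPrev + math.pow(aN, k) + bPrev * bPrev
}
// b == Vector(0.0, 1.0, 8.0, 105.0, 11566.0)
A plain prefix sum parallelizes because its operator is associative; the nonlinear B(n)^2 term destroys that structure, which is consistent with the lack of a simple closed form noted in the question.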

conditional statement with logical operators for thingspeak data

I am working on ThingSpeak code in MATLAB Analysis for my weather station. It checks the last 24 readings and then raises an alert on the basis of the given conditions. I wrote the condition below, but I guess I am messing something up, because I am getting wrong results. I want the answer to be an overall logical 1 or 0, but I get 1's even for values that should not give one, and the answer for both variables is a 24x1 logical array. And even then, the tweets are not being generated. Here's my code:
t =thingSpeakRead(293182,'Fields',1,'NumPoints',24,'OutputFormat','matrix');
h =thingSpeakRead(293182,'Fields',2,'NumPoints',24,'OutputFormat','matrix');
DangerAlert = ((t>42.5)&(t<43.5)&(h>17)&(h<21)) | ((t>40.5)&(t<43.5)&(h>21)&(h<27)) | ((t>39.5)&(t<43.5)&(h>27)&(h<31)) | ...
    ((t>38.5)&(t<43.5)&(h>31)&(h<37)) | ((t>37.5)&(t<42.5)&(h>37)&(h<41)) | ((t>36.5)&(t<40.5)&(h>41)&(h<47)) | ...
    ((t>35.5)&(t<39.5)&(h>47)&(h<51)) | ((t>34.5)&(t<38.5)&(h>51)&(h<57)) | ((t>33.5)&(t<38.5)&(h>57)&(h<68)) | ...
    ((t>33.5)&(t<37.5)&(h>63)&(h<68)) | ((t>32.5)&(t<38.5)&(h>68)&(h<73)) | ((t>31.5)&(t<35.5)&(h>73)&(h<83)) | ...
    ((t>30.5)&(t<33.5)&(h>83)&(h<88)) | ((t>29.5)&(t<33.5)&(h>83)&(h<93)) | ((t>29.5)&(t<32.5)&(h>93)&(h<100))
HeatStrokeAlert = ((t>42.5)&(t<43.5)&(h>37)&(h<41)) | ((t>40.5)&(t<42.5)&(h>41)&(h<47)) | ((t>39.5)&(t<41.5)&(h>47)&(h<51)) | ...
    ((t>38.5)&(t<40.5)&(h>51)&(h<57)) | ((t>38.5)&(t<39.5)&(h>57)&(h<63)) | ((t>37.5)&(t<38.5)&(h>63)&(h<68)) | ...
    ((t>36.5)&(t<38.5)&(h>68)&(h<78)) | ((t>35.5)&(t<37.5)&(h>73)&(h<83)) | ((t>34.5)&(t<36.5)&(h>83)&(h<88)) | ...
    ((t>33.5)&(t<36.5)&(h>88)&(h<93)) | ((t>33.5)&(t<35.5)&(h>93)&(h<97)) | ((t>32.5)&(t<34.5)&(h>97))
if DangerAlert
    webwrite('http://api.thingspeak.com/apps/thingtweet/1/statuses/update', 'api_key', 'XXXXXXXXXXXXX', 'status', 'Alert!Dangerously High temperature tomorrow!')
end
if HeatStrokeAlert
    webwrite('http://api.thingspeak.com/apps/thingtweet/1/statuses/update', 'api_key', 'XXXXXXXXX', 'status', 'Alert!Heat Stroke alert tomorrow!')
end
I know the blunder is minor, but it needs to be solved.
Your range values for t go from 29.5 to 43.5, and for h from 17 to 100. So almost any value inside those ranges will give you a 1, because you are chaining the clauses with OR (|): if ANY one of them is true, the whole expression comes back true (=1).
Also, for the website, make sure you follow these directions:
https://www.mathworks.com/help/matlab/ref/webwrite.html
Make sure you have a ThingSpeak account, and try changing your URL to match their format:
[thingSpeakURL 'update'];
So add the 'update' string and use brackets.
Also, collapse your if condition to a single logical value; in MATLAB, if on a vector only passes when every element is true. So:
if any(DangerAlert)

Spark HashingTF result explanation

I tried the standard Spark HashingTF example on Databricks.
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
val sentenceData = spark.createDataFrame(Seq(
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
)).toDF("label", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
display(featurizedData)
I have difficulty understanding the result below. When numFeatures is 20, the rawFeatures column contains:
[0,20,[0,5,9,17],[1,1,1,2]]
[0,20,[2,7,9,13,15],[1,1,3,1,1]]
[0,20,[4,6,13,15,18],[1,1,1,1,1]]
If [0,5,9,17] are hash values and [1,1,1,2] are frequencies, then 17 has frequency 2, 9 has 3 (it has 2), and 13 and 15 have 1 while they should have 2.
Probably I am missing something. I could not find any documentation with a detailed explanation.
As mcelikkaya notes, the output frequencies are not what you would expect. This is due to hash collisions in a small feature space, 20 features in this case. I have added some words to the input data (for illustration purposes) and upped the number of features to 20,000, and then the correct frequencies are produced:
+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
|label|sentence |words |rawFeatures |
+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
|0 |Hi hi hi hi I i i i i heard heard heard about Spark Spark|[hi, hi, hi, hi, i, i, i, i, i, heard, heard, heard, about, spark, spark]|(20000,[3105,9357,11777,11960,15329],[2.0,3.0,1.0,4.0,5.0]) |
|0 |I i wish Java could use case classes spark |[i, i, wish, java, could, use, case, classes, spark] |(20000,[495,3105,3967,4489,15329,16213,16342,19809],[1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0])|
|1 |Logistic regression models are neat |[logistic, regression, models, are, neat] |(20000,[286,1193,9604,13138,18695],[1.0,1.0,1.0,1.0,1.0]) |
+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
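For reference, a minimal sketch of how such an output can be reproduced from the question's own pipeline; only numFeatures changes, and wordsData is assumed to be the tokenized DataFrame from the question:
import org.apache.spark.ml.feature.HashingTF
// Same transformer as in the question, but with a feature space large enough
// that two different words are very unlikely to hash to the same bucket.
val hashingTFBig = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(20000)
val featurizedBig = hashingTFBig.transform(wordsData)
featurizedBig.select("words", "rawFeatures").show(truncate = false)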
Your guesses are correct:
20 is the vector size
the first list is a list of indices
the second list is a list of values
The leading 0 is just an artifact of the internal representation.
There is nothing more to learn here.

how to handle type F in a table in Q/KDB

I started learning q/KDB a while ago, so forgive me in advance for a trivial question, but I am facing the following problem that I don't know how to solve.
I have a table named "res" showing the side, the summation of orders, and the average price of some symbols:
sym side | sum_order avg_price
----------| -------------------
ALPHA B | 95109 9849.73
ALPHA S | 91662 9849.964
BETA B | 47 9851.638
BETA S | 60 9853.383
with these types
c | t f a
---------| -----
sym | s p
side | s
sum_order| f
avg_price| f
I would like to calculate close and open positions, average point, made by close position, and average price of the open position.
I have used this query, which I believe is pretty bizarre (I am sure there is a more professional way to do it), but it works as expected:
position_summary:select
close_position:?[prev[sum_order]>sum_order;sum_order;prev[sum_order]],
average_price:avg_price-prev[avg_price],
open_pos:prev[sum_order]-sum_order,
open_wavgprice:?[sum_order>next[sum_order];avg_price;next[avg_price]][0]
by sym from res
giving me the following table
sym | close_position average_price open_pos open_wavgprice
----------| ----------------------------------------------------
ALPHA | 91662 0.2342456 3447 9849.73
BETA | 47 1.745035 -13 9853.38
and types are
c | t f a
--------------| -----
sym | s s
close_position| F
average_price | F
open_pos | F
open_wavgprice| f
Now my problem starts here: imagine I join the position_summary table with another table, appending a column "current_price" of type f.
What I want to do is to determinate the points of the open positions.
I have tried this way:
select
?[open_pos>0;open_price-open_wavgprice;open_wavgprice-open]
from position_summary
but I got a 'type error,
surely because close_position (and the other aggregated columns) are type F while open_wavgprice and current_price are f. I have searched on the internet but did not find much about the F type.
First: how can I handle this? I have tried casting and raze, but with no effect, and moreover I am not sure they are right for this particular occasion.
Second: is there a better way to express "if-then" logic when querying tables (for example, in plain English: if this row of this column, then take the previous/next of another column, or the second or third previous/next of that column)?
Thank you for your help.
Let me rephrase your question using a slightly simpler table:
q)show res:([sym:`A`A`B`B;side:`B`S`B`S]size:95 91 47 60;price:49.7 49.9 51.6 53.3)
sym side| size price
--------| ----------
A B | 95 49.7
A S | 91 49.9
B B | 47 51.6
B S | 60 53.3
You are trying to find the closing position for each symbol using a query like this:
q)show summary:select close:?[prev[size]>size;size;prev[size]] by sym from res
sym| close
---| -----
A | 91
B | 47
The result seems to have one number in each row of the "close" column, but in fact it has two. You may notice an extra space before each number in the display above, or you can display the first row:
q)first 0!summary
sym | `A
close| 0N 91
and see that the first row in the "close" column is 0N 91. Since the missing values such as 0N are displayed as a space, it was hard to see them in the earlier display.
It is not hard to understand how you've got these two values. Since you select by sym, each column gets grouped by symbol and for the symbol A, you have
q)show size:95 91
95 91
and
q)prev size
0N 95
that leads to
q)?[prev[size]>size;size;prev[size]]
0N 91
(Recall that 0N is smaller than any other integer.)
As a side note, ?[a>b;b;a] is element-wise minimum and can be written as a & b in q, so your conditional expression could be written as
q)size & prev size
0N 91
Now we can see why ? gave you the type error
q)close:exec close from summary
q)close
91
47
While the display is deceiving, "close" above is a list of two vectors:
q)first close
0N 91
and
q)last close
0N 47
The vector conditional does not support that:
q)?[close>0;10;20]
'type
[0] ?[close>0;10;20]
^
One can probably cure that by using each:
q)?[;10;20]each close>0
20 10
20 10
But I don't think this is what you want. Your problem started when you computed the summary table. I would expect the closing position to be the sum of "B" orders minus the sum of "S" orders that can be computed as
q)select close:sum ?[side=`B;size;neg size] by sym from res
sym| close
---| -----
A | 4
B | -13
Now you should be able to fix the rest of the columns in your summary query. Just make sure that you use an aggregation function such as sum in the expression for every column.
Type F means the "cell" in the column contains a vector of floats rather than an atom, so your column is actually a vector of vectors rather than a flat vector.
In your case each cell holds a vector of size 1, so you could just do:
select first each close_position, first each average_price.....
which will give you a type f.
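If it helps to see the same shape outside q, here is a purely illustrative Scala analogy of an F-typed column versus an f-typed one:
// A column of type F: each "cell" is itself a vector of floats.
val closeF: Vector[Vector[Double]] = Vector(Vector(91.0), Vector(47.0))
// The analogue of `first each`: take the head of every cell,
// flattening the column to plain floats (q type f).
val closef: Vector[Double] = closeF.map(_.head) // Vector(91.0, 47.0)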
I'm not 100% sure what you were trying to do in the first query, and I don't have a q terminal to hand to check, but you could put this into your query:
select close_position:?[prev[sum_order]>sum_order;last sum_order; last prev[sum_order].....
i.e. take the last sum_order in the list.