Spark Iterated Function CUSUM - scala

I'm still fairly new to Spark and I'm struggling to implement an iterated function. I'm hoping someone can help me out?
In particular, I'm trying to implement the CUSUM control statistic:
$ S_i = \max(0,\ S_{i-1} + x_i - Target - w) $ with $ S_0 = 0 $ and $ w $, $ Target $ being fixed parameters.
The challenge is that the CUSUM statistic is defined as an iterated function which requires ordered data and the previous function value.
The following data frame shows the desired output for $ Target = 1 $ and $ w = 0.1 $:
 i |   x |   S
---+-----+-----
 1 | 1.3 | 0.2
 2 | 1.8 | 0.9
 3 | 0.5 | 0.3
 4 | 0.6 | 0.0
 5 | 1.2 | 0.1
 6 | 1.8 | 0.8
On a different note: I assume CUSUM itself can't be computed in a fully distributed fashion, since each value depends on the previous one? My data set is fairly large but contains multiple groups, so I hope I can still achieve some concurrency. Do I have to repartition my data so that each group sits in a single partition, and then run the CUSUM algorithm per group concurrently?
I hope this makes sense and any pointers are highly appreciated!
Ideally I am looking for a solution in Scala and Spark 2.1
Thanks a lot!

After a lot of Google research I found a solution to the problem using mapPartitions:
import spark.implicits._   // needed for .toDS outside the spark-shell

val dataset = Seq(1.3, 1.8, 0.5, 0.6, 1.2, 1.8).toDS

dataset.repartition(1).mapPartitions(iterator => {
  var s = 0.0          // running CUSUM statistic, S_0 = 0
  val target = 1.0
  val w = 0.1
  iterator.map(x => {
    s = Math.max(0.0, s + x - target - w)
    Math.round(10.0 * s) / 10.0   // round to one decimal place for display
  })
}).show()
+-----+
|value|
+-----+
| 0.2|
| 0.9|
| 0.3|
| 0.0|
| 0.1|
| 0.8|
+-----+
I hope this will save someone some time in the future.
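On the grouping question from the original post: the same mapPartitions idea extends to many groups by repartitioning on the group key (so each group lives entirely in one partition), sorting within partitions, and keeping one running statistic per group. A rough sketch in PySpark (the columns group, i and x are hypothetical; the repartition/sortWithinPartitions/mapPartitions pattern carries over directly to Scala):

from pyspark.sql import functions as F

target, w = 1.0, 0.1

def cusum_per_partition(rows):
    state = {}  # running S per group within this partition
    for row in rows:
        s = max(0.0, state.get(row["group"], 0.0) + row["x"] - target - w)
        state[row["group"]] = s
        yield (row["group"], row["i"], row["x"], s)

result = (df
          .repartition("group")                 # all rows of a group end up in the same partition
          .sortWithinPartitions("group", "i")   # CUSUM needs the original order within each group
          .rdd.mapPartitions(cusum_per_partition)
          .toDF(["group", "i", "x", "S"]))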

Related

statistical test to compare 1st/2nd differences based on output from ggpredict / ggeffect

I want to conduct a simple two-sample t-test in R to compare marginal effects generated by ggpredict (or ggeffect).
Both ggpredict and ggeffect provide nice outputs: (1) a table (predicted probability / standard error / CIs) and (2) a plot. However, they do not provide p-values for assessing the statistical significance of the marginal effects (i.e., is the difference between two predicted probabilities different from zero?). Further, since I'm working with interaction effects, I'm also interested in two-sample t-tests for the first differences (between two marginal effects) and the second differences.
Is there an easy way to run the relevant t-tests with ggpredict/ggeffect output? Other options?
Attaching:
- reprex code with fictitious data
- To be specific, I want to test the following "1st differences":
--> .67 - .33 = .34 (diff from zero?)
--> .5 - .5 = 0 (diff from zero?)
...and the following second difference:
--> .34 - 0.0 = .34 (diff from zero?)
See also Figure 12 / Table 3 in Mize 2019 (interaction effects in nonlinear models)
Thanks Scott
library(mlogit)
#> Loading required package: dfidx
#>
#> Attaching package: 'dfidx'
#> The following object is masked from 'package:stats':
#>
#> filter
library(sjPlot)
library(ggeffects)
# create ex. data set. 1 row per respondent (dataset shows 2 resp). Each resp answers 3 choice sets, w/ 2 alternatives in each set.
cedata.1 <- data.frame(id     = c(1,1,1,1,1,1,2,2,2,2,2,2),  # respondent ID
                       QES    = c(1,1,2,2,3,3,1,1,2,2,3,3),  # choice set (with 2 alternatives)
                       Alt    = c(1,2,1,2,1,2,1,2,1,2,1,2),  # Alt 1 or Alt 2 in choice set
                       LOC    = c(0,0,1,1,0,1,0,1,1,0,0,1),  # attribute describing alternative; binary categorical variable
                       SIZE   = c(1,1,1,0,0,1,0,0,1,1,0,1),  # attribute describing alternative; binary categorical variable
                       Choice = c(0,1,1,0,1,0,0,1,0,1,0,1),  # whether the alternative is chosen (1) or not (0)
                       gender = c(1,1,1,1,1,1,0,0,0,0,0,0)   # male or female (repeats for each individual)
)
# convert dep var Choice to factor as required by sjPlot
cedata.1$Choice <- as.factor(cedata.1$Choice)
cedata.1$LOC <- as.factor(cedata.1$LOC)
cedata.1$SIZE <- as.factor(cedata.1$SIZE)
# estimate model.
glm.model <- glm(Choice ~ LOC*SIZE, data=cedata.1, family = binomial(link = "logit"))
# estimate MEs for use in IE assessment
LOC.SIZE <- ggpredict(glm.model, terms = c("LOC", "SIZE"))
LOC.SIZE
#>
#> # Predicted probabilities of Choice
#> # x = LOC
#>
#> # SIZE = 0
#>
#> x | Predicted |   SE |       95% CI
#> -----------------------------------
#> 0 |      0.33 | 1.22 | [0.04, 0.85]
#> 1 |      0.50 | 1.41 | [0.06, 0.94]
#>
#> # SIZE = 1
#>
#> x | Predicted |   SE |       95% CI
#> -----------------------------------
#> 0 |      0.67 | 1.22 | [0.15, 0.96]
#> 1 |      0.50 | 1.00 | [0.12, 0.88]
#> Standard errors are on the link-scale (untransformed).
# plot
# plot(LOC.SIZE, connect.lines = TRUE)

Odds and Rate Ratio CIs in Hurdle Models with Factor-Factor Interactions

I am trying to build hurdle models with factor-factor interactions but can't figure out how to calculate the CIs of the odds or rate ratios among the various factor-factor combinations.
library(glmmTMB)
data(Salamanders)
m3 <- glmmTMB(count ~ spp + mined + spp * mined,
              zi = ~ spp + mined + spp * mined,
              family = truncated_poisson, data = Salamanders)  # added in the interaction
pred_dat <- data.frame(spp   = rep(unique(Salamanders$spp), 2),
                       mined = rep(unique(Salamanders$mined), each = length(unique(Salamanders$spp))))
pred_dat  # all factor-factor combos
Does anyone know how to appropriately calculate the CI around the ratios among these various factor-factor combos? I know how to calculate the actual ratio estimates (which consist of exponentiating the sum of 1-3 model coefficients, depending on the exact comparison being made), but I just can't seem to find any info on how to get the corresponding CI when an interaction is involved. If the ratio in question only requires exponentiating a single coefficient, the CI can easily be calculated; I just don't know how to do it when two or three coefficients are involved. Any help would be much appreciated.
EDIT:
I need the actual odds and rate ratios and their CIs, not the predicted values and their CIs. For example: exp(confint(m3)[2,3]) gives the rate ratio of sppPR/minedYes vs sppGP/minedYes, and c(exp(confint(m3)[2,1]), exp(confint(m3)[2,2])) gives the CI of that rate ratio. However, a number of the potential comparisons among the spp/mined combinations require summing multiple coefficients, e.g. exp(confint(m3)[2,3] + confint(m3)[8,3]), and in these cases I do not know how to calculate the rate ratio CI because multiple coefficients are involved, each with its own SE estimate. How can I calculate those CIs, given that multiple coefficients are involved?
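(For context on why this is not straightforward: on the link scale, the variance of a sum of coefficient estimates involves their covariance as well as their individual variances, $ \operatorname{Var}(\hat\beta_1 + \hat\beta_2) = \operatorname{Var}(\hat\beta_1) + \operatorname{Var}(\hat\beta_2) + 2\operatorname{Cov}(\hat\beta_1, \hat\beta_2) $, so a Wald-type CI for the combined rate ratio would look like $ \exp\big( (\hat\beta_1 + \hat\beta_2) \pm z_{1-\alpha/2} \sqrt{\operatorname{Var}(\hat\beta_1 + \hat\beta_2)} \big) $, with the covariance taken from the model's variance-covariance matrix. The per-coefficient intervals from confint() alone do not carry that covariance term.)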
If I understand your question correctly, this would be one way to obtain the uncertainty around the predicted/fitted values of the interaction term:
library(glmmTMB)
library(ggeffects)
data(Salamanders)
m3 <- glmmTMB(count ~ spp + mined + spp * mined,
              zi = ~ spp + mined + spp * mined,
              family = truncated_poisson, data = Salamanders)  # added in the interaction
ggpredict(m3, c("spp", "mined"))
#>
#> # Predicted counts of count
#> # x = spp
#>
#> # mined = yes
#>
#> x    | Predicted |   SE |       95% CI
#> --------------------------------------
#> GP   |      1.59 | 0.92 | [0.26, 9.63]
#> PR   |      1.13 | 0.66 | [0.31, 4.10]
#> DM   |      1.74 | 0.29 | [0.99, 3.07]
#> EC-A |      0.61 | 0.96 | [0.09, 3.96]
#> EC-L |      0.42 | 0.69 | [0.11, 1.59]
#> DF   |      1.49 | 0.27 | [0.88, 2.51]
#>
#> # mined = no
#>
#> x    | Predicted |   SE |       95% CI
#> --------------------------------------
#> GP   |      2.67 | 0.11 | [2.15, 3.30]
#> PR   |      1.59 | 0.28 | [0.93, 2.74]
#> DM   |      3.10 | 0.10 | [2.55, 3.78]
#> EC-A |      2.30 | 0.17 | [1.64, 3.21]
#> EC-L |      5.25 | 0.07 | [4.55, 6.06]
#> DF   |      2.68 | 0.12 | [2.13, 3.36]
#> Standard errors are on link-scale (untransformed).
plot(ggpredict(m3, c("spp", "mined")))
Created on 2020-08-04 by the reprex package (v0.3.0)
The ggeffects package calculates marginal effects / estimated marginal means (EMMs) with confidence intervals for your model terms. ggpredict() computes these EMMs based on predict(), ggemmeans() wraps the fantastic emmeans package, and ggeffect() uses the effects package.

How do I decrease iteration time when making data transformations?

I have a couple of data transformations that seem to run quite slowly while I iterate on them.
What general strategies can I use to increase performance?
Input Data:
+-----------+-------+
| key | val |
+-----------+-------+
| a | 1 |
| a | 2 |
| b | 1 |
| b | 2 |
| b | 3 |
+-----------+-------+
The code I'm iterating on is the following:
from pyspark.sql import functions as F

# Output = /my/function/output
# input_df = /my/function/input
def my_compute_function(input_df):
    """Compute difference from maximum of a value column by key

    Keyword arguments:
    input_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
    pyspark.sql.DataFrame
    """
    max_df = input_df \
        .groupBy("key") \
        .agg(F.max(F.col("val")).alias("max_val"))   # alias the aggregate, not the input column
    joined_df = input_df \
        .join(max_df, "key")
    diff_df = joined_df \
        .withColumn("diff", F.col("max_val") - F.col("val"))
    return diff_df
It took me 4 build iterations to get max_df right, 4 to get joined_df right, and 4 to get diff_df right.
This represents total work of:
pipeline_1:
  transform_A:
    work_1: input -> max_df
      (takes 4 iterations to get right): 4 * max_df
    work_2: max_df -> joined_df
      (takes 4 iterations to get right): 4 * joined_df + 4 * max_df
    work_3: joined_df -> diff_df
      (takes 4 iterations to get right): 4 * diff_df + 4 * joined_df + 4 * max_df
  total work:
    transform_A
      = work_1 + work_2 + work_3
      = 4 * max_df + (4 * joined_df + 4 * max_df) + (4 * diff_df + 4 * joined_df + 4 * max_df)
      = 12 * max_df + 8 * joined_df + 4 * diff_df
Output data:
+-----------+-------+--------+
| key | val | diff |
+-----------+-------+--------+
| a | 1 | 1 |
| a | 2 | 0 |
| b | 1 | 2 |
| b | 2 | 1 |
| b | 3 | 0 |
+-----------+-------+--------+
Refactoring
For experimentation / fast iteration, it's often a good idea to refactor your code into several smaller steps instead of a single large step.
This way, you compute the upstream pieces first, write the data back to Foundry, and use this pre-computed data in later steps. Otherwise, every iteration re-computes those early steps even though their logic hasn't changed, which is nothing but repeated extra work.
Concretely:
from pyspark.sql import functions as F

# output = /my/function/output_max
# input_df = /my/function/input
def my_compute_function(input_df):
    """Compute the max by key

    Keyword arguments:
    input_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
    pyspark.sql.DataFrame
    """
    max_df = input_df \
        .groupBy("key") \
        .agg(F.max(F.col("val")).alias("max_val"))
    return max_df


# output = /my/function/output_joined
# input_df = /my/function/input
# max_df = /my/function/output_max
def my_compute_function(max_df, input_df):
    """Compute the joined output of max and input

    Keyword arguments:
    max_df (pyspark.sql.DataFrame) : input DataFrame
    input_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
    pyspark.sql.DataFrame
    """
    joined_df = input_df \
        .join(max_df, "key")
    return joined_df


# Output = /my/function/output_diff
# joined_df = /my/function/output_joined
def my_compute_function(joined_df):
    """Compute difference from maximum of a value column by key

    Keyword arguments:
    joined_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
    pyspark.sql.DataFrame
    """
    diff_df = joined_df \
        .withColumn("diff", F.col("max_val") - F.col("val"))
    return diff_df
The work you perform would instead look like:
pipeline_2:
  transform_A:
    work_1: input -> max_df
      (takes 4 iterations to get right): 4 * max_df
  transform_B:
    work_2: max_df -> joined_df
      (takes 4 iterations to get right): 4 * joined_df
  transform_C:
    work_3: joined_df -> diff_df
      (takes 4 iterations to get right): 4 * diff_df
  total_work:
    transform_A + transform_B + transform_C
      = work_1 + work_2 + work_3
      = 4 * max_df + 4 * joined_df + 4 * diff_df
If you assume max_df, joined_df, and diff_df all cost the same amount to compute, pipeline_1.total_work = 24 * max_df, whereas pipeline_2.total_work = 12 * max_df so you can expect something on the order of 2x speed improvement on iteration.
Caching
For any 'small' datasets, you should cache them. This keeps the rows in memory for your pipeline rather than fetching them from the written-back dataset every time. 'Small' is somewhat arbitrary and depends on a number of factors, but Spark does a good job of trying to cache it regardless and warning you if it's too big.
In this case, you could cache the intermediate max_df and joined_df layers, depending on which step you are developing.
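A minimal sketch of what this can look like while iterating interactively (assuming a long-lived session such as a notebook or Code Workbook; the names reuse the example above):

from pyspark.sql import functions as F

max_df = input_df \
    .groupBy("key") \
    .agg(F.max(F.col("val")).alias("max_val")) \
    .cache()

max_df.count()   # force materialization so later iterations are served from the cache

# now iterate on the downstream logic; max_df is no longer recomputed each time
joined_df = input_df.join(max_df, "key")
diff_df = joined_df.withColumn("diff", F.col("max_val") - F.col("val"))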
Function Calls
You should stick to native PySpark methods as much as possible and never use Python methods directly on the data (i.e. looping over individual rows or executing a UDF). PySpark methods call the underlying Spark methods that are written in Scala and run directly against the data, instead of going through the Python runtime. If you use Python only as the layer to interact with this system, rather than as the thing that touches the data, you get all the performance benefits of Spark itself.
In the above example, only native PySpark methods are called, so this computation will be quite fast.
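To illustrate the difference, here is a hypothetical row-wise computation written both ways; the UDF version pushes every value through the Python runtime, while the column-expression version stays inside Spark:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# slow: a Python UDF evaluated row by row in the Python runtime
halve_udf = F.udf(lambda v: v / 2.0, DoubleType())
slow_df = input_df.withColumn("half_val", halve_udf(F.col("val")))

# fast: the same logic as a native column expression, executed by Spark itself
fast_df = input_df.withColumn("half_val", F.col("val") / 2.0)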
Downsampling
If you can derive a representative sample of a large input dataset, you can use it as the mock input for your transformations until you have perfected your logic and want to test it against the full set.
In the above case, we could downsample input_df to a single key before running any of the steps above.
I personally downsample and cache any dataset above 1M rows before ever writing a line of PySpark code; that way my turnaround times are very fast and I don't end up discovering syntax bugs slowly because of large dataset sizes.
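A sketch of what that can look like for the example input (the key and sample fraction here are illustrative):

from pyspark.sql import functions as F

# Option 1: keep only a single key while developing the logic
dev_df = input_df.filter(F.col("key") == "a")

# Option 2: take a small, reproducible random sample of a large input
dev_df = input_df.sample(withReplacement=False, fraction=0.01, seed=42)

# develop against dev_df, then swap input_df back in once the logic is right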
All Together
A good development pipeline looks like:
- Discrete chunks of code that produce the particular materializations you expect to re-use later, so they don't need to be recomputed over and over again
- Downsampled to 'small' sizes
- Cached 'small' datasets for very fast fetching
- PySpark-native code only, exploiting the fast underlying Spark libraries

How to implement cycle detection with pyspark graphframe pregel API

I am trying to implement the algorithm from Rocha & Thatte (http://cdsid.org.br/sbpo2015/wp-content/uploads/2015/08/142825.pdf) with PySpark and the pregel wrapper from graphframes.
Here I am getting stuck with the correct syntax for the message aggregation.
The idea is straightforward:
...In each pass, each active vertex of G sends a set of sequences of vertices to its out-neighbours as described next. In the first pass, each vertex v sends the message (v) to all its out-neighbours. In subsequent iterations, each active vertex v appends v to each sequence it received in the previous iteration. It then sends all the updated sequences to its out-neighbours. If v has not received any message in the previous iteration, then v deactivates itself. The algorithm terminates when all the vertices have been deactivated. ...
My idea is to send the vertex ids to the destination vertices (dst) and collect them into a list in the aggregation function. Then, in my vertex column "sequence", I would like to append/merge these new list items with the existing ones, and then check with when statements whether the current vertex id is already in the sequence. Then I could set the corresponding vertex column to true to flag those vertices as being part of a cycle.
But I can't find the correct Spark syntax for this concatenation.
Does anyone have an idea? Or has implemented something similar?
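As a side note on the concatenation piece by itself: with Spark 2.4+ the append and the membership check can be expressed with native array functions (the same ones the solution further down ends up using). A rough sketch with illustrative column names:

import pyspark.sql.functions as f

# assume df has an "id" column and an array column "message" holding the received sequence

# is the current vertex already part of the received sequence?
# (id is "contained" when removing the sequence's elements from [id] leaves an empty array)
df = df.withColumn(
    "is_cycle",
    f.size(f.array_except(f.array(f.col("id")), f.col("message"))) == 0
)

# append the current vertex id to the sequence before forwarding it
df = df.withColumn("message", f.concat(f.col("message"), f.array(f.col("id"))))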
My current code
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as f
from pyspark.sql.functions import coalesce, col, lit, sum, when
from graphframes import GraphFrame
from graphframes.lib import *

SimpleCycle = [
    ("1", "2"),
    ("2", "3"),
    ("3", "4"),
    ("4", "5"),
    ("5", "2"),
    ("5", "6")
]

edges = sqlContext.createDataFrame(SimpleCycle, ["src", "dst"]) \
    .withColumn("self_loop", when(col("src") == col("dst"), True).otherwise(False))
edges.show()
+---+---+---------+
|src|dst|self_loop|
+---+---+---------+
| 1| 2| false|
| 2| 3| false|
| 3| 4| false|
| 4| 5| false|
| 5| 2| false|
| 5| 6| false|
+---+---+---------+
vertices = edges.select("src").union(edges.select("dst")).distinct().withColumnRenamed("src", "id")
# vertices = spark.createDataFrame([[1], [2], [3], [4], [5], [6], [7], [8], [9]], ["id"])
# vertices.sort("id").show()
graph = GraphFrame(vertices, edges)
cycles = graph.pregel \
    .setMaxIter(5) \
    .withVertexColumn("is_cycle", lit(""), lit("logic to be added")) \
    .withVertexColumn("sequence", lit(""), Pregel.msg()) \
    .sendMsgToDst(Pregel.src("id")) \
    .aggMsgs(f.collect_list(Pregel.msg())) \
    .run()
cycles.show()
+---+-----------------+--------+
| id| is_cycle|sequence|
+---+-----------------+--------+
| 3|logic to be added| [2]|
| 5|logic to be added| [4]|
| 6|logic to be added| [5]|
| 1|logic to be added| null|
| 4|logic to be added| [3]|
| 2|logic to be added| [5, 1]|
+---+-----------------+--------+
Code that does not work, but what I think the logic should be:
cycles = graph.pregel \
    .setMaxIter(5) \
    .withVertexColumn("is_cycle", lit(""), \
        when(Pregel.src("id").isin(Pregel.src(sequence)), True).otherwise(False) \
    .withVertexColumn("sequence", lit("null"), Append_To_Existing_List(Pregel.msg()) \
    .sendMsgToDst(
        when(Pregel.src("sequence").isNull(), Pregel.src("id")) \
        .otherwise(Pregel.src("sequence")) \
    .aggMsgs(f.collect_list(Pregel.msg())) \
    .run()
# I would like to have a result like
+---+-----------------+---------+
| id| is_cycle|sequence |
+---+-----------------+---------+
| 1|false | [1] |
| 2|true |[2,3,4,5]|
| 3|true |[2,3,4,5]|
| 4|true |[2,3,4,5]|
| 5|true |[2,3,4,5]|
| 6|false | null |
+---+-----------------+---------+
Finally I implemented the Rocha-Thatte algorithm, not via pregel but with the underlying message-aggregation function of graphframes/GraphX. In case someone is interested, I'd like to share the solution.
This solution works correctly and can handle very large graphs without failing. However, it gets quite slow if the cycles or the graph get long. I'm not sure how to improve this right now, possibly by using checkpoints or broadcasting in a smart way.
Happy about any input for improvement.
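One concrete option along the checkpointing line (a sketch only, not part of the tested solution below; the path and interval are illustrative): truncating the DataFrame lineage every few iterations keeps the query plans from growing with each round, which is a common cause of progressive slowdown in iterative Spark/GraphFrames jobs.

# during setup (checkpoint location is illustrative)
sc.setCheckpointDir("/tmp/graphframes-checkpoints")

# inside the find_cycles() loop, e.g. every 10 iterations
if iter_ % 10 == 0:
    newVertices = newVertices.checkpoint()   # eager checkpoint cuts the accumulated lineage
    cycles = cycles.checkpoint()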
# spark modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.window import Window
import pyspark.sql.functions as f

# graphframes modules
from graphframes import GraphFrame
from graphframes.lib import *
AM = AggregateMessages

# stdlib modules used below for timing and logging
import time
import logging
_logger = logging.getLogger(__name__)
def find_cycles(sqlContext, sc, vertices, edges, max_iter=100000):
    # Cycle detection via message aggregation
    """
    This code is an implementation of the Rocha-Thatte algorithm for large-scale sparse graphs

    Source:
    ==============
    wiki:  https://en.wikipedia.org/wiki/Rocha%E2%80%93Thatte_cycle_detection_algorithm
    paper: https://www.researchgate.net/publication/283642998_Distributed_cycle_detection_in_large-scale_sparse_graphs

    The basic idea:
    ===============
    We propose a general algorithm for detecting cycles in a directed graph G by message passing among its vertices,
    based on the bulk synchronous message passing abstraction. This is a vertex-centric approach in which the vertices
    of the graph work together for detecting cycles. The bulk synchronous parallel model consists of a sequence of
    iterations, in each of which a vertex can receive messages sent by other vertices in the previous iteration, and
    send messages to other vertices.
    In each pass, each active vertex of G sends a set of sequences of vertices to its out-neighbours as described next.
    In the first pass, each vertex v sends the message (v) to all its out-neighbours. In subsequent iterations, each
    active vertex v appends v to each sequence it received in the previous iteration. It then sends all the updated
    sequences to its out-neighbours. If v has not received any message in the previous iteration, then v deactivates
    itself. The algorithm terminates when all the vertices have been deactivated.
    For a sequence (v1, v2, ..., vk) received by vertex v, the appended sequence is not forwarded in two cases:
    (i) if v = v1, then v has detected a cycle, which is reported (see line 9 of Algorithm 1);
    (ii) if v = vi for some i ∈ {2, 3, ..., k}, then v has detected a sequence that contains the cycle
    (v = vi, vi+1, ..., vk, vk+1 = v); in this case the sequence is discarded, since the cycle must have been detected
    in an earlier iteration (see line 11 of Algorithm 1); to be precise, this cycle must have been detected in
    iteration k − i + 1. Every cycle (v1, v2, ..., vk, vk+1 = v1) is detected by all vi, i = 1 to k, in the same
    iteration; it is reported by the vertex min{v1,...,vk} (see line 9 of Algorithm 1).
    The total number of iterations of the algorithm is the number of vertices in the longest path in the graph, plus
    a few more steps for deactivating the final vertices. During the analysis of the total number of iterations, we
    ignore the few extra iterations needed for deactivating the final vertices and detecting the end of the
    computation, since it is O(1).

    Pseudocode of the algorithm:
    ============================
    M(v): message received from vertex v
    N+(v): all dst vertices of v

    function COMPUTE(M(v)):
        if i = 0 then:
            for each w ∈ N+(v) do:
                send (v) to w
        else if M(v) = ∅ then:
            deactivate v and halt
        else:
            for each (v1,v2,...,vk) ∈ M(v) do:
                if v1 = v and min{v1,v2,...,vk} = v then:
                    report (v1 = v, v2, ..., vk, vk+1 = v)
                else if v not ∈ {v2,...,vk} then:
                    for each w ∈ N+(v) do:
                        send (v1,v2,...,vk,v) to w

    Scalability of the algorithm:
    ============================
    The number of iterations depends on the length of the longest path/cycle.
    It scales between O(log(n)) and at most O(n), where n = number of vertices,
    so the number of iterations is at most linear in the number of vertices.
    Additional edges (parallel edges etc.) do not affect the runtime.
    For more details please refer to the original publication.
    """
    _logger.warning("+++ find_cycles(): starting cycle search ...")
    start_time = time.time()  # was missing in the original snippet; needed for the timing log at the end

    # create empty dataframe to collect all cycles
    cycles = sqlContext.createDataFrame(
        sc.emptyRDD(),
        StructType([StructField("cycle", ArrayType(StringType()), True)])
    )

    # initialize the message column with the vertex's own id
    init_vertices = (vertices
        .withColumn("message", f.array(f.col("id")))
    )

    init_edges = (edges
        .where(f.col("src") != f.col("dst"))
        .select("src", "dst")
    )

    # create graph object that will be updated each iteration
    gx = GraphFrame(init_vertices, init_edges)

    # iterate until max_iter
    # max_iter is a safety net in case the break condition is never reached
    # default value = 100,000
    for iter_ in range(max_iter):

        # message that should be sent to the destination for aggregation
        msgToDst = AM.src["message"]

        # aggregate all messages that were received into a python set (drops duplicate edges)
        agg = gx.aggregateMessages(
            f.collect_set(AM.msg).alias("aggMess"),
            sendToSrc=None,
            sendToDst=msgToDst)

        # BREAK condition: if no more messages are received, all cycles were found
        # and we can quit the loop
        if len(agg.take(1)) == 0:
            # print("THE END: All cycles found in " + str(iter_) + " iterations")
            break

        # apply the algorithm logic:
        # filter for cycles that should be reported as found
        # compose the new message to be sent in the next iteration
        # _column names stand for temporary columns that are only used in the algorithm and then dropped again
        checkVerties = (
            agg
            # flatten the aggregated message from [[2]] to [2] in order to have proper 1D arrays
            .withColumn("_flatten1", f.explode(f.col("aggMess")))
            # take the first element of the array
            .withColumn("_first_element_agg", f.element_at(f.col("_flatten1"), 1))
            # take the minimum element of the array
            .withColumn("_min_agg", f.array_min(f.col("_flatten1")))
            # check if it is a cycle:
            # it is a cycle when v1 = v and min{v1,v2,...,vk} = v
            .withColumn("_is_cycle", f.when(
                (f.col("id") == f.col("_first_element_agg")) &
                (f.col("id") == f.col("_min_agg")),
                True)
                .otherwise(False)
            )
            # pick the cycle that should be reported = appended to the cycle list
            .withColumn("_cycle_to_report", f.when(f.col("_is_cycle") == True, f.col("_flatten1")).otherwise(None))
            # sort the array so duplicates look the same
            .withColumn("_cycle_to_report", f.sort_array("_cycle_to_report"))
            # create a column with the first element removed, to check whether the current vertex is part of (v2,...,vk)
            .withColumn("_slice", f.array_except(f.col("_flatten1"), f.array(f.element_at(f.col("_flatten1"), 1))))
            # check if the vertex is part of the slice and set a True/False column
            .withColumn("_is_cycle2", f.lit(f.size(f.array_except(f.array(f.col("id")), f.col("_slice"))) == 0))
        )
        # print("checked vertices")
        # checkVerties.show(truncate=False)

        # append found cycles to the result dataframe via union
        cycles = (
            # take the existing cycles dataframe
            cycles
            .union(
                # union = append all cycles that are in the current reporting column
                checkVerties
                .where(f.col("_cycle_to_report").isNotNull())
                .select("_cycle_to_report")
            )
        )

        # create the list of new messages that will be sent to the vertices in the next iteration
        newVertices = (
            checkVerties
            # append the current vertex id to the received sequence
            .withColumn("message", f.concat(
                f.coalesce(f.col("_flatten1"), f.array()),
                f.coalesce(f.array(f.col("id")), f.array())
            ))
            # only forward sequences that are not cycle duplicates
            .where(f.col("_is_cycle2") == False)
            .select("id", "message")
        )

        print("vertices to send forward")
        newVertices.sort("id").show(truncate=False)

        # cache the new vertices using the workaround for SPARK-1334
        cachedNewVertices = AM.getCachedDataFrame(newVertices)

        # update the graphframe object for the next round
        gx = GraphFrame(cachedNewVertices, gx.edges)

    # materialize results and get the number of found cycles
    # cycles_count = cycles.persist().count()
    _cycle_statistics = (
        cycles
        .withColumn("cycle_length", f.size(f.col("cycle")))
        .agg(f.count(f.col("cycle")), f.max(f.col("cycle_length")), f.min(f.col("cycle_length")))
    ).collect()

    cycle_statistics = {"count": _cycle_statistics[0]["count(cycle)"],
                        "max": _cycle_statistics[0]["max(cycle_length)"],
                        "min": _cycle_statistics[0]["min(cycle_length)"]}

    end_time = time.time()
    _logger.warning("+++ find_cycles(): " + str(cycle_statistics["count"]) + " cycles found in " + str(iter_) +
                    " iterations (min length=" + str(cycle_statistics["min"]) + ", max length=" + str(cycle_statistics["max"]) +
                    ") in " + str(end_time - start_time) + " seconds")
    _logger.warning("+++ #########################################################################################")
    return cycles, cycle_statistics
This function takes graphs like the following two examples (the original post showed figures of the SimpleCycle and NestedCycle graphs; their edge lists are below):
SimpleCycle = [
    ("0", "1"),
    ("1", "2"),
    ("2", "3"),
    ("3", "4"),
    ("3", "1")]

NestedCycle = [
    ("1", "2"),
    ("2", "3"),
    ("3", "4"),
    ("4", "1"),
    ("3", "1"),
    ("5", "1"),
    ("5", "2")]

# the edges.show() output and the cycles below correspond to the NestedCycle edge list
edges = sqlContext.createDataFrame(NestedCycle, ["src", "dst"])
vertices = edges.select("src").union(edges.select("dst")).distinct().withColumnRenamed("src", "id")
edges.show()
# +---+---+
# |src|dst|
# +---+---+
# | 1| 2|
# | 2| 3|
# | 3| 4|
# | 4| 1|
# | 3| 1|
# | 5| 1|
# | 5| 2|
# +---+---+
raw_cycles, cycle_stats = find_cycles(sqlContext, sc, vertices, edges, max_iter=1000)  # find_cycles returns (cycles, statistics)
raw_cycles.show()
# +------------+
# | cycle|
# +------------+
# | [1, 2, 3]|
# |[1, 2, 3, 4]|
# +------------+

Convert KMeans "centres" output to PySpark dataframe

I'm running a K-means clustering model and I want to analyse the cluster centroids. However, the centers output is a LIST of my 20 centroids, with their coordinates (8 each) as an ARRAY. I need it as a dataframe, with clusters 1:20 as rows and their attribute values (centroid coordinates) as columns, like so:
c1 | 0.85 | 0.03 | 0.01 | 0.00 | 0.12 | 0.01 | 0.00 | 0.12
c2 | 0.25 | 0.80 | 0.10 | 0.00 | 0.12 | 0.01 | 0.00 | 0.77
c3 | 0.05 | 0.10 | 0.00 | 0.82 | 0.00 | 0.00 | 0.22 | 0.00
The dataframe format is important because what I WANT to do is:
For each centroid:
- Identify the 3 strongest attributes
- Create a "name" for each of the 20 centroids that is a concatenation of the 3 most dominant traits in that centroid
For example:
c1 | milk_eggs_cheese
c2 | meat_milk_bread
c3 | toiletries_bread_eggs
This code is running in Zeppelin, EMR version 5.19, Spark 2.4. The model works great, but this is the boilerplate code from the Spark documentation (https://spark.apache.org/docs/latest/ml-clustering.html#k-means), which produces the list-of-arrays output that I can't really use.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
This is an excerpt of the output I get.
Cluster Centers:
[0.12391775 0.04282062 0.00368751 0.27282358 0.00533401 0.03389095
0.04220946 0.03213536 0.00895981 0.00990327 0.01007891]
[0.09018751 0.01354349 0.0130329 0.00772877 0.00371508 0.02288211
0.032301 0.37979978 0.002487 0.00617438 0.00610262]
[7.37626746e-02 2.02469798e-03 4.00944473e-04 9.62304581e-04
5.98964859e-03 2.95190585e-03 8.48736175e-01 1.36797882e-03
2.57451073e-04 6.13320072e-04 5.70559278e-04]
Based on How to convert a list of array to Spark dataframe I have tried this:
df = sc.parallelize(centers).toDF(['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
df.show()
But this throws the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
model.clusterCenters() gives you a list of numpy arrays, not a list of lists like in the answer you have linked. Just convert the numpy arrays to lists before creating the dataframe:
bla = [e.tolist() for e in centers]
df = sc.parallelize(bla).toDF(['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
# or: df = spark.createDataFrame(bla, ['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
df.show()
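For the follow-on step in the question (naming each centroid after its three strongest attributes), one possible sketch once the centers are available (the column and cluster names are the illustrative ones from above):

cols = ['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat',
        'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese']

rows = []
for i, center in enumerate(centers):
    # pair each attribute name with its coordinate and keep the 3 largest
    top3 = sorted(zip(cols, center.tolist()), key=lambda kv: kv[1], reverse=True)[:3]
    rows.append(("c{}".format(i + 1), "_".join(name for name, _ in top3)))

names_df = spark.createDataFrame(rows, ["cluster", "name"])
names_df.show(truncate=False)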