How to implement cycle detection with pyspark graphframe pregel API - pyspark

I am trying to implement the algorithm from Rocha & Thatte (http://cdsid.org.br/sbpo2015/wp-content/uploads/2015/08/142825.pdf) with PySpark and the Pregel wrapper from GraphFrames.
Here I am getting stuck with the correct syntax for the message aggregation.
The idea is straightforward:
...In each pass, each active vertex of G sends a set of sequences of vertices to its out-neighbours as described next. In the first pass, each vertex v sends the message (v) to all its out-neighbours. In subsequent iterations, each active vertex v appends v to each sequence it received in the previous iteration. It then sends all the updated sequences to its out-neighbours. If v has not received any message in the previous iteration, then v deactivates itself. The algorithm terminates when all the vertices have been deactivated. ...
My idea is to send the vertex ids to the destination vertices (dst) and collect them into a list in the aggregation function. Then, in my vertex column "sequence", I would like to append/merge the new list items with the existing ones, and check with a when statement whether the current vertex id is already in the sequence. If it is, I could set a vertex column to true to flag that vertex as being in a cycle.
But I can't find the correct Spark syntax for this concatenation.
Does anyone have an idea, or has implemented something similar?
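For illustration, this is the kind of array-column logic I have in mind, written as a plain DataFrame sketch outside of Pregel (assuming Spark >= 2.4 for the array functions; the data here is made up):
import pyspark.sql.functions as f

# hypothetical vertex state: the own id plus the sequence received so far
state = sqlContext.createDataFrame(
    [("2", ["1"]), ("2", ["2", "3", "4", "5"])],
    ["id", "sequence"])

updated = (state
    # a vertex is on a cycle if its own id already occurs in the received sequence
    .withColumn("is_cycle", f.expr("array_contains(sequence, id)"))
    # otherwise append the own id before forwarding the sequence
    .withColumn("sequence", f.concat(f.col("sequence"), f.array(f.col("id")))))
updated.show()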
My current code
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as f
from pyspark.sql.functions import coalesce, col, lit, sum, when
from graphframes import GraphFrame
from graphframes.lib import *
SimpleCycle=[
("1","2"),
("2","3"),
("3","4"),
("4","5"),
("5","2"),
("5","6")
]
edges = sqlContext.createDataFrame(SimpleCycle,["src","dst"]) \
.withColumn("self_loop",when(col("src")==col("dst"),True).otherwise(False))
edges.show()
+---+---+---------+
|src|dst|self_loop|
+---+---+---------+
| 1| 2| false|
| 2| 3| false|
| 3| 4| false|
| 4| 5| false|
| 5| 2| false|
| 5| 6| false|
+---+---+---------+
vertices=edges.select("src").union(edges.select("dst")).distinct().withColumnRenamed('src', 'id')
#vertices = spark.createDataFrame([[1], [2], [3], [4],[5],[6],[7],[8],[9]], ["id"])
#vertices.sort("id").show()
graph = GraphFrame(vertices, edges)
cycles=graph.pregel \
.setMaxIter(5) \
.withVertexColumn("is_cycle", lit(""),lit("logic to be added")) \
.withVertexColumn("sequence", lit(""),Pregel.msg()) \
.sendMsgToDst(Pregel.src("id")) \
.aggMsgs(f.collect_list(Pregel.msg())) \
.run()
cycles.show()
+---+-----------------+--------+
| id| is_cycle|sequence|
+---+-----------------+--------+
| 3|logic to be added| [2]|
| 5|logic to be added| [4]|
| 6|logic to be added| [5]|
| 1|logic to be added| null|
| 4|logic to be added| [3]|
| 2|logic to be added| [5, 1]|
+---+-----------------+--------+
Code that does not work, but shows what I think the logic should be (Append_To_Existing_List is a placeholder for the append/merge step I am missing):
cycles = (graph.pregel
    .setMaxIter(5)
    # flag the vertex as part of a cycle if its own id already occurs in the sequence
    .withVertexColumn("is_cycle", lit(""),
        when(Pregel.src("id").isin(Pregel.src("sequence")), True).otherwise(False))
    # Append_To_Existing_List is a placeholder for the missing append/merge logic
    .withVertexColumn("sequence", lit("null"), Append_To_Existing_List(Pregel.msg()))
    # send the own id on the first round, afterwards forward the accumulated sequence
    .sendMsgToDst(
        when(Pregel.src("sequence").isNull(), Pregel.src("id"))
        .otherwise(Pregel.src("sequence")))
    .aggMsgs(f.collect_list(Pregel.msg()))
    .run())
# I would like to have a result like
+---+--------+---------+
| id|is_cycle| sequence|
+---+--------+---------+
|  1|   false|      [1]|
|  2|    true|[2,3,4,5]|
|  3|    true|[2,3,4,5]|
|  4|    true|[2,3,4,5]|
|  5|    true|[2,3,4,5]|
|  6|   false|     null|
+---+--------+---------+

Finally I implemented the Rocha-Thatte algorithm not via Pregel but with the underlying
message-aggregation function of GraphFrames/GraphX. In case someone is interested, I'd like to share the solution.
This solution works correctly and can handle very large graphs without failing.
However, it gets quite slow if the cycles get long or the graph is large.
I am not sure how to improve this right now, possibly by using checkpoints or broadcasting in a smart way.
Happy about any input for improvement.
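One direction I have not benchmarked yet is truncating the growing query plan with a checkpoint every few iterations inside the loop below, roughly like this (a sketch only; the checkpoint directory is a placeholder):
sc.setCheckpointDir("/tmp/find_cycles_checkpoints")  # placeholder path

# inside the for-loop, e.g. right before rebuilding the GraphFrame:
if iter_ % 10 == 0:
    # checkpoint() materializes the vertices and cuts the plan that otherwise
    # keeps growing with every GraphFrame rebuild
    cachedNewVertices = cachedNewVertices.checkpoint()
gx = GraphFrame(cachedNewVertices, gx.edges)
The actual implementation I am using follows.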
# python modules
import time
import logging
# spark modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.window import Window
import pyspark.sql.functions as f
# graphframes modules
from graphframes import GraphFrame
from graphframes.lib import *
AM = AggregateMessages
_logger = logging.getLogger(__name__)
def find_cycles(sqlContext,sc,vertices,edges,max_iter=100000):
# Cycle detection via message aggregation
"""
This code is an implementation of the Rocha-Thatte algorithm for large-scale sparse graphs
Source:
==============
wiki: https://en.wikipedia.org/wiki/Rocha%E2%80%93Thatte_cycle_detection_algorithm
paper: https://www.researchgate.net/publication/283642998_Distributed_cycle_detection_in_large-scale_sparse_graphs
The basic idea:
===============
We propose a general algorithm for detecting cycles in a directed graph G by message passing among its vertices,
based on the bulk synchronous message passing abstraction. This is a vertex-centric approach in which the vertices
of the graph work together for detecting cycles. The bulk synchronous parallel model consists of a sequence of iterations,
in each of which a vertex can receive messages sent by other vertices in the previous iteration, and send messages to other
vertices.
In each pass, each active vertex of G sends a set of sequences of vertices to its out-neighbours as described next.
In the first pass, each vertex v sends the message (v) to all its out-neighbours. In subsequent iterations, each active vertex v
appends v to each sequence it received in the previous iteration. It then sends all the updated sequences to its out-neighbours.
If v has not received any message in the previous iteration, then v deactivates itself. The algorithm terminates when all the
vertices have been deactivated.
For a sequence (v1, v2, . . . , vk) received by vertex v, the appended sequence is not forwarded in two cases: (i) if v = v1,
then v has detected a cycle, which is reported (see line 9 of Algorithm 1); (ii) if v = vi for some i ∈ {2, 3, . . . , k},
then v has detected a sequence that contains the cycle (v = vi, vi+1, . . . , vk, vk+1 = v); in this case,
the sequence is discarded, since the cycle must have been detected in an earlier iteration (see line 11 of Algorithm 1);
to be precise, this cycle must have been detected in iteration k − i + 1. Every cycle (v1, v2, . . . , vk, vk+1 = v1)
is detected by all vi,i = 1 to k in the same iteration; it is reported by the vertex min{v1,...,vk} (see line 9 of Algorithm 1).
The total number of iterations of the algorithm is the number of vertices in the longest path in the graph, plus a few more steps
for deactivating the final vertices. During the analysis of the total number of iterations, we ignore the few extra iterations
needed for deactivating the final vertices and detecting the end of the computation, since it is O(1).
Pseudocode of the algorithm:
============================
M(v): Message received from vertex v
N+(v): all out-neighbours (dst vertices) of v
function COMPUTE(M(v)):
if i=0 then:
for each w ∈ N+(v) do:
send (v) to w
else if M(v) = ∅ then:
deactivate v and halt
else:
for each (v1,v2,...,vk) ∈ M(v) do:
if v1 = v and min{v1,v2,...,vk} = v then:
report (v1 = v,v2,...,vk,vk+1 = v)
else if v not ∈ {v2,...,vk} then:
for each w ∈ N+(v) do:
send (v1,v2,...,vk,v) to w
Scalability of the algorithm:
============================
The number of iterations depends on the length of the longest path in the graph.
It scales between O(log(n)) and at most O(n), where n = number of vertices,
so the number of iterations is at most linear in the number of vertices.
Additional (e.g. parallel) edges do not affect the runtime.
For more details please refer to the original publication.
"""
_logger.warning("+++ find_cycles(): starting cycle search ...")
start_time = time.time()
# create an empty dataframe to collect all cycles
cycles = sqlContext.createDataFrame(sc.emptyRDD(),StructType([StructField("cycle",ArrayType(StringType()),True)]))
# initialize the message column with the vertex's own id
init_vertices=(vertices
.withColumn("message",f.array(f.col("id")))
)
init_edges=(edges
.where(f.col("src")!=f.col("dst"))
.select("src","dst")
)
# create graph object that will be update each iteration
gx = GraphFrame(init_vertices, init_edges)
# iterate until max_iter
# max_iter is a safety net in case the break condition is never reached
# default value = 100,000
for iter_ in range(max_iter):
# message that should be sent to the destination for aggregation
msgToDst = AM.src["message"]
# aggregate all messages that were received into a python set (drops duplicate messages)
agg = gx.aggregateMessages(
f.collect_set(AM.msg).alias("aggMess"),
sendToSrc=None,
sendToDst=msgToDst)
# BREAK condition: if no more messages are received, all cycles were found
# and we can quit the loop
if(len(agg.take(1))==0):
#print("THE END: All cycles found in " + str(iter_) + " iterations")
break
# apply the algorithm logic
# filter for cycles that should be reported as found
# compose the new message to be sent in the next iteration
# a leading "_" marks temporary columns that are only used within the algorithm and then dropped again
checkVertices=(
agg
# flatten the aggregated message from [[2]] to [] in order to have proper 1D arrays
.withColumn("_flatten1",f.explode(f.col("aggMess")))
# take first element of the array
.withColumn("_first_element_agg",f.element_at(f.col("_flatten1"), 1))
# take the minimum element of the array
.withColumn("_min_agg",f.array_min(f.col("_flatten1")))
# check if it is a cycle
# it is cycle when v1 = v and min{v1,v2,...,vk} = v
.withColumn("_is_cycle",f.when(
(f.col("id")==f.col("_first_element_agg")) &
(f.col("id")==f.col("_min_agg"))
,True)
.otherwise(False)
)
# pick cycles that should be reported (appended to the cycle list)
.withColumn("_cycle_to_report",f.when(f.col("_is_cycle")==True,f.col("_flatten1")).otherwise(None))
# sort the array so that duplicate cycles look identical
.withColumn("_cycle_to_report",f.sort_array("_cycle_to_report"))
# create a column with the first element removed, to check whether the current vertex is part of (v2,...,vk)
.withColumn("_slice",f.array_except(f.col("_flatten1"), f.array(f.element_at(f.col("_flatten1"), 1))))
# check whether the vertex is part of the slice and set a True/False column
.withColumn("_is_cycle2",f.lit(f.size(f.array_except(f.array(f.col("id")), f.col("_slice"))) == 0))
)
#print("checked vertices")
#checkVertices.show(truncate=False)
# append found cycles to result dataframe via union
cycles=(
# take existing cycles dataframe
cycles
.union(
# union = append all cycles that are in the current reporting column
checkVertices
.where(f.col("_cycle_to_report").isNotNull())
.select("_cycle_to_report")
)
)
# create the list of new messages that will be sent to the vertices in the next iteration
newVertices=(
checkVertices
# append the current vertex id to the end of the received sequence
.withColumn("message",f.concat(
f.coalesce(f.col("_flatten1"), f.array()),
f.coalesce(f.array(f.col("id")), f.array())
))
# only forward sequences that are not duplicates of already detected cycles
.where(f.col("_is_cycle2")==False)
.select("id","message")
)
print("vertices to send forward")
newVertices.sort("id").show(truncate=False)
# cache new vertices using the workaround for SPARK-13346
cachedNewVertices = AM.getCachedDataFrame(newVertices)
# update graphframe object for next round
gx = GraphFrame(cachedNewVertices, gx.edges)
# materialize results and get number of found cycles
#cycles_count=cycles.persist().count()
_cycle_statistics=(
cycles
.withColumn("cycle_length",f.size(f.col("cycle")))
.agg(f.count(f.col("cycle")),f.max(f.col("cycle_length")),f.min(f.col("cycle_length")))
).collect()
cycle_statistics={"count":_cycle_statistics[0]["count(cycle)"],"max":_cycle_statistics[0]["max(cycle_length)"],"min":_cycle_statistics[0]["min(cycle_length)"]}
end_time = time.time()
_logger.warning("+++ find_cycles(): " + str(cycle_statistics["count"]) + " cycles found in " + str(iter_) + " iterations (min length=" + str(cycle_statistics["min"]) +", max length="+ str(cycle_statistics["max"]) +") in " + str(end_time-start_time) + " seconds")
_logger.warning("+++ #########################################################################################")
return cycles, cycle_statistics
This function takes graphs like the following two examples (SimpleCycle and NestedCycle):
SimpleCycle=[
("0","1"),
("1","2"),
("2","3"),
("3","4"),
("3","1")]
NestedCycle=[
("1","2"),
("2","3"),
("3","4"),
("4","1"),
("3","1"),
("5","1"),
("5","2")]
edges = sqlContext.createDataFrame(NestedCycle,["src","dst"])
vertices=edges.select("src").union(edges.select("dst")).distinct().withColumnRenamed('src', 'id')
edges.show()
# +---+---+
# |src|dst|
# +---+---+
# | 1| 2|
# | 2| 3|
# | 3| 4|
# | 4| 1|
# | 3| 1|
# | 5| 1|
# | 5| 2|
# +---+---+
raw_cycles, cycle_stats = find_cycles(sqlContext,sc,vertices,edges,max_iter=1000)
raw_cycles.show()
# +------------+
# | cycle|
# +------------+
# | [1, 2, 3]|
# |[1, 2, 3, 4]|
# +------------+

Related

Dataflow job doesn't emit messages after GroupByKey()

I have a streaming Dataflow pipeline that writes to BQ, and I want to window all the failed rows and do some further analysis. The pipeline looks like this. I'm getting all the error messages in the second step, but the messages get stuck at the beam.GroupByKey() and nothing moves downstream after that. Does anyone have any idea how to fix this?
data = (
| "Read PubSub Messages" >> beam.io.ReadFromPubSub(subscription=options.input_subscription,
with_attributes=True)
...
| "write to BQ" >> beam.io.WriteToBigQuery(
table=f"{options.bq_dataset}.{options.bq_table}",
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
method='STREAMING_INSERTS',
insert_retry_strategy=beam.io.gcp.bigquery_tools.RetryStrategy.RETRY_NEVER
)
)
(
data[beam.io.gcp.bigquery.BigQueryWriteFn.FAILED_ROWS]
| f"Window into: {options.window_size}m" >> GroupWindowsIntoBatches(options.window_size)
| f"Failed Rows for " >> beam.ParDo(BadRows(options.bq_dataset, 'table'))
)
and
class GroupWindowsIntoBatches(beam.PTransform):
"""A composite transform that groups Pub/Sub messages based on publish
time and outputs a list of dictionaries, where each contains one message
and its publish timestamp.
"""
def __init__(self, window_size):
# Convert minutes into seconds.
self.window_size = int(window_size * 60)
def expand(self, pcoll):
return (
pcoll
# Assigns window info to each Pub/Sub message based on its publish timestamp.
| "Window into Fixed Intervals" >> beam.WindowInto(window.FixedWindows(10))
# If the windowed elements do not fit into memory please consider using `beam.util.BatchElements`.
| "Add Dummy Key" >> beam.Map(lambda elem: (None, elem))
| "Groupby" >> beam.GroupByKey()
| "Abandon Dummy Key" >> beam.MapTuple(lambda _, val: val)
)
Also, I don't know if it's relevant, but the beam.DoFn.TimestampParam inside my GroupWindowsIntoBatches has an invalid (negative) timestamp.
OK, so the issue was that the messages coming from the BigQuery FAILED_ROWS were not timestamped. Adding | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, time.time())) seems to fix the group-by.
class GroupWindowsIntoBatches(beam.PTransform):
"""A composite transform that groups Pub/Sub messages based on publish
time and outputs a list of dictionaries, where each contains one message
and its publish timestamp.
"""
def __init__(self, window_size):
# Convert minutes into seconds.
self.window_size = int(window_size * 60)
def expand(self, pcoll):
return (
pcoll
| 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, time.time()))  # <-- added this line
| "Window into Fixed Intervals" >> beam.WindowInto(window.FixedWindows(30))
| "Add Dummy Key" >> beam.Map(lambda elem: (None, elem))
| "Groupby" >> beam.GroupByKey()
| "Abandon Dummy Key" >> beam.MapTuple(lambda _, val: val)
)

How do I decrease iteration time when making data transformations?

I have a couple of data transformations that seem to run quite slowly while I iterate on them.
What general strategies can I use to improve performance?
Input Data:
+-----------+-------+
| key | val |
+-----------+-------+
| a | 1 |
| a | 2 |
| b | 1 |
| b | 2 |
| b | 3 |
+-----------+-------+
My code I'm iterating on is the following:
from pyspark.sql import functions as F
# Output = /my/function/output
# input_df = /my/function/input
def my_compute_function(input_df):
"""Compute difference from maximum of a value column by key
Keyword arguments:
input_df (pyspark.sql.DataFrame) : input DataFrame
Returns:
pyspark.sql.DataFrame
"""
max_df = input_df \
.groupBy("key") \
.agg(F.max(F.col("val")).alias("max_val"))
joined_df = input_df \
.join(max_df, "key")
diff_df = joined_df \
.withColumn("diff", F.col("max_val") - F.col("val"))
return diff_df
It took me 4 build iterations to get max_df right, 4 to get joined_df right, and 4 to get diff_df right.
This represents total work of:
pipeline_1:
transform_A:
work_1 : input -> max_df
(takes 4 iterations to get right): 4 * max_df
work_2: max_df -> joined_df
(takes 4 iterations to get right): 4 * joined_df + 4 * max_df
= 4 joined_df + 4 max_df
work_3: joined_df -> diff_df
(takes 4 iterations to get right): 4 * diff_df + 4 * joined_df + 4 * max_df
total work:
transform_A
= work_1 + work_2 + work_3
= 4 * max_df + (4 * joined_df + 4 * max_df) + (4 * diff_df + 4 * joined_df + 4 * max_df)
= 12 * max_df + 8 * joined_df + 4 * diff_df
Output data:
+-----------+-------+--------+
| key | val | diff |
+-----------+-------+--------+
| a | 1 | 1 |
| a | 2 | 0 |
| b | 1 | 2 |
| b | 2 | 1 |
| b | 3 | 0 |
+-----------+-------+--------+
Refactoring
For experimentation / fast iteration, it's often a good idea to refactor your code into several smaller steps instead of a single large step.
This way, you compute the upstream cells first, write the data back to Foundry, and use this pre-computed data in later steps. If you kept re-computing these early steps even though their logic hasn't changed, you would be doing nothing but extra work again and again.
Concretely:
from pyspark.sql import functions as F
# output = /my/function/output_max
# input_df = /my/function/input
def my_compute_function(input_df):
"""Compute the max by key
Keyword arguments:
input_df (pyspark.sql.DataFrame) : input DataFrame
Returns:
pyspark.sql.DataFrame
"""
max_df = input_df \
.groupBy("key") \
.agg(F.max(F.col("val")).alias("max_val"))
return max_df
# output = /my/function/output_joined
# input_df = /my/function/input
# max_df = /my/function/output_max
def my_compute_function(max_df, input_df):
"""Compute the joined output of max and input
Keyword arguments:
max_df (pyspark.sql.DataFrame) : input DataFrame
input_df (pyspark.sql.DataFrame) : input DataFrame
Returns:
pyspark.sql.DataFrame
"""
joined_df = input_df \
.join(max_df, "key")
return joined_df
# Output = /my/function/output_diff
# joined_df = /my/function/output_joined
def my_compute_function(joined_df):
"""Compute difference from maximum of a value column by key
Keyword arguments:
joined_df (pyspark.sql.DataFrame) : input DataFrame
Returns:
pyspark.sql.DataFrame
"""
diff_df = joined_df \
.withColumn("diff", F.col("max_val") - F.col("val"))
return diff_df
The work you perform would instead look like:
pipeline_2:
transform_A:
work_1: input -> max_df
(takes 4 iterations to get right): 4 * max_df
transform_B:
work_2: max_df -> joined_df
(takes 4 iterations to get right): 4 * joined_df
transform_C:
work_3: joined_df -> diff_df
(takes 4 iterations to get right): 4 * diff_df
total_work:
transform_A + transform_B + transform_C
= work_1 + work_2 + work_3
= 4 * max_df + 4 * joined_df + 4 * diff_df
If you assume max_df, joined_df, and diff_df all cost the same amount to compute, pipeline_1.total_work = 24 * max_df, whereas pipeline_2.total_work = 12 * max_df so you can expect something on the order of 2x speed improvement on iteration.
Caching
For any 'small' datasets, you should cache them. This keeps the rows in memory for your pipeline and avoids fetching from the written-back dataset. 'small' is somewhat arbitrary given the many factors that must be considered, but Spark does a good job of trying to cache it regardless and will warn you if it's too big.
In this case, depending on your setup, you could cache the intermediate layers of max_df and joined_df depending on which step you are developing.
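For example, while iterating on the join step, the small aggregate from the first transform could be cached like this (just a sketch reusing the hypothetical names from the refactored example above):
from pyspark.sql import functions as F

max_df = input_df \
    .groupBy("key") \
    .agg(F.max(F.col("val")).alias("max_val")) \
    .cache()  # keep the small aggregate in memory while iterating on later steps

joined_df = input_df.join(max_df, "key")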
Function Calls
You should stick to native PySpark methods as much as possible and never use Python methods directly on the data (i.e. looping over individual rows or executing a UDF). PySpark methods call the underlying Spark methods that are written in Scala and run directly against the data instead of in the Python runtime. If you use Python only as the layer to interact with this system, instead of as the system that interacts with the data, you get all the performance benefits of Spark itself.
In the above example, only native PySpark methods are called, so this computation will be quite fast.
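To illustrate the difference (not part of the original answer), here is the same column computed once with a Python UDF and once natively, using the joined_df from the refactored example above; only the native version stays inside Spark's Scala engine:
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# slow: every row is serialized to the Python runtime and back
diff_udf = F.udf(lambda max_val, val: max_val - val, LongType())
slow_df = joined_df.withColumn("diff", diff_udf(F.col("max_val"), F.col("val")))

# fast: a native column expression, executed entirely by Spark
fast_df = joined_df.withColumn("diff", F.col("max_val") - F.col("val"))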
Downsampling
If you can derive your own accurate sample of a large input dataset, this can be used as the mock input for your transformations, until such time you perfect your logic and want to test it against the full set.
In the above case, we could downsample input_df to be a single key before executing any prior steps.
I personally down-sample and cache datasets above 1M rows before ever writing a line of PySpark code; that way my turnaround times are very fast and I don't spend ages catching syntax bugs because of large dataset sizes. A downsampling sketch follows below.
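A downsampling step can be as simple as filtering to a single key or taking a small random sample of the input (a sketch; the key value and fraction are arbitrary):
from pyspark.sql import functions as F

# option 1: keep just one key while developing the logic
dev_input_df = input_df.where(F.col("key") == "a")

# option 2: a small random sample of a large input
dev_input_df = input_df.sample(withReplacement=False, fraction=0.01, seed=42)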
All Together
A good development pipeline looks like:
Discrete chunks of code that do particular materializations you expect to re-use later but don't need to be recomputed over and over again
Downsampled to 'small' sizes
Cached 'small' datasets for very fast fetching
PySpark native code only that exploits the fast underlying Spark libraries

Collecting output from Apache Beam pipeline and displaying it to console

I have been working with Apache Beam for a couple of days. I wanted to quickly iterate on the application I am working on and make sure the pipeline I am building is error-free. In Spark we can use sc.parallelize, and when we apply some action we get back a value that we can inspect.
Similarly, when I was reading about Apache Beam, I found that we can create a PCollection and work with it using the following syntax:
with beam.Pipeline() as pipeline:
lines = pipeline | beam.Create(["this is test", "this is another test"])
word_count = (lines
| "Word" >> beam.ParDo(lambda line: line.split(" "))
| "Pair of One" >> beam.Map(lambda w: (w, 1))
| "Group" >> beam.GroupByKey()
| "Count" >> beam.Map(lambda (w, o): (w, sum(o))))
result = pipeline.run()
I actually wanted to print the result to console. But I couldn't find any documentation around it.
Is there a way to print the result to console instead of saving it to a file each time?
You don't need the temp list. In Python 2.7 the following should be sufficient:
def print_row(row):
print row
(pipeline
| ...
| "print" >> beam.Map(print_row)
)
result = pipeline.run()
result.wait_until_finish()
In Python 3.x, print is a function, so the following is sufficient:
(pipeline
| ...
| "print" >> beam.Map(print)
)
result = pipeline.run()
result.wait_until_finish()
After exploring further and understanding how to write test cases for my application, I figured out how to print the result to the console. Please note that I am currently running everything on a single-node machine, trying to understand the functionality provided by Apache Beam and how I can adopt it without compromising industry best practices.
So, here is my solution. At the very last stage of our pipeline we can introduce a map function that either prints the result to the console or accumulates the result in a variable; later we can print the variable to see the values.
import apache_beam as beam
# lets have a sample string
data = ["this is sample data", "this is yet another sample data"]
# create a pipeline
pipeline = beam.Pipeline()
counts = (pipeline | "create" >> beam.Create(data)
| "split" >> beam.ParDo(lambda row: row.split(" "))
| "pair" >> beam.Map(lambda w: (w, 1))
| "group" >> beam.CombinePerKey(sum))
# lets collect our result with a map transformation into output array
output = []
def collect(row):
output.append(row)
return True
counts | "print" >> beam.Map(collect)
# Run the pipeline
result = pipeline.run()
# lets wait until result a available
result.wait_until_finish()
# print the output
print output
Maybe log the elements instead of printing them?
import logging
def _logging(elem):
    logging.info(elem)
    return elem
P | "logging info" >> beam.Map(_logging)
Here is an example from PyCharm Edu:
import apache_beam as beam
class LogElements(beam.PTransform):
class _LoggingFn(beam.DoFn):
def __init__(self, prefix=''):
super(LogElements._LoggingFn, self).__init__()
self.prefix = prefix
def process(self, element, **kwargs):
print self.prefix + str(element)
yield element
def __init__(self, label=None, prefix=''):
super(LogElements, self).__init__(label)
self.prefix = prefix
def expand(self, input):
return input | beam.ParDo(self._LoggingFn(self.prefix))
class MultiplyByTenDoFn(beam.DoFn):
def process(self, element):
yield element * 10
p = beam.Pipeline()
(p | beam.Create([1, 2, 3, 4, 5])
| beam.ParDo(MultiplyByTenDoFn())
| LogElements())
p.run()
Output
10
20
30
40
50
Out[10]: <apache_beam.runners.portability.fn_api_runner.RunnerResult at 0x7ff41418a210>
I know it isn't what you asked for, but why don't you store it in a text file? It's better than printing it via stdout, and it isn't volatile.
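For completeness, a minimal sketch of writing the counts to a text file instead of printing them (the output path prefix is a placeholder):
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
     | "create" >> beam.Create(["this is sample data", "this is yet another sample data"])
     | "split" >> beam.FlatMap(lambda row: row.split(" "))
     | "pair" >> beam.Map(lambda w: (w, 1))
     | "count" >> beam.CombinePerKey(sum)
     | "format" >> beam.Map(str)
     | "write" >> beam.io.WriteToText("counts_output")  # placeholder path prefix
    )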

Spark Iterated Function CUSUM

I'm still fairly new to Spark and I'm struggling to implement an iterated function. I'm hoping someone can help me out?
In particular, I'm trying to implement the CUSUM control statistic:
$S_i = \max(0, S_{i-1} + x_i - \text{Target} - w)$ with $S_0 = 0$ and $w$, $\text{Target}$ being fixed parameters.
The challenge is that the CUSUM statistic is defined as an iterated function which requires ordered data and the previous function value.
The following data frame shows the desired output for $ Target = 1 $ and $ w = 0.1 $ :
i x S
--------------
1 1.3 0.2
2 1.8 0.9
3 0.5 0.3
4 0.6 0
5 1.2 0.1
6 1.8 0.8
On a different note: I guess it's not possible to run CUSUM in a distributed fashion? My data set is fairly large but contains multiple groups. I hope this means I can still achieve some concurrency. I guess I have to re-partition my data to have one single partition per group to run the CUSUM algorithm per group concurrently?
I hope this makes sense and any pointers are highly appreciated!
Ideally I am looking for a solution in Scala and Spark 2.1
Thanks a lot!
After a lot of Google research I found a solution to the problem using mapPartitions
val dataset = Seq(1.3, 1.8, 0.5, 0.6, 1.2, 1.8).toDS
dataset.repartition(1).mapPartitions(iterator => {
var s = 0.0
val target = 1.0
val w = 0.1
iterator.map(x => {
s = Math.max(0.0, s + x - target - w)
Math.round(10.0 * s) / 10.0
})
}).show()
+-----+
|value|
+-----+
| 0.2|
| 0.9|
| 0.3|
| 0.0|
| 0.1|
| 0.8|
+-----+
I hope this will save someone some time in the future.
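Not the Scala answer that was asked for, but since the question also mentions running CUSUM per group: in PySpark (3.0+) one way to sketch the same idea is a grouped-map pandas UDF, so each group is ordered and processed by a plain Python loop (the column names and the SparkSession called spark are assumptions here):
import pandas as pd

TARGET, W = 1.0, 0.1

def cusum(pdf: pd.DataFrame) -> pd.DataFrame:
    # order within the group, then apply S_i = max(0, S_{i-1} + x_i - Target - w)
    pdf = pdf.sort_values("i")
    s, out = 0.0, []
    for x in pdf["x"]:
        s = max(0.0, s + x - TARGET - W)
        out.append(round(s, 1))
    return pdf.assign(S=out)

df = spark.createDataFrame(
    [("g1", 1, 1.3), ("g1", 2, 1.8), ("g1", 3, 0.5),
     ("g1", 4, 0.6), ("g1", 5, 1.2), ("g1", 6, 1.8)],
    ["group", "i", "x"])

df.groupBy("group").applyInPandas(cusum, schema="group string, i long, x double, S double").show()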

Stack overflow error caused by long lineage RDD created after many recursions [duplicate]

I'm quite new to Spark and I'm trying to implement an iterative algorithm for clustering (expectation-maximization) with centroids represented by a Markov model. So I need to do iterations and joins.
One problem I experience is that the time of each iteration grows exponentially.
After some experimenting I found that when doing iterations it is necessary to persist the RDD that is going to be reused in the next iteration; otherwise, in every iteration Spark will create an execution plan that recalculates the RDD from the start, thus increasing calculation time.
init = sc.parallelize(xrange(10000000), 3)
init.cache()
for i in range(6):
print i
start = datetime.datetime.now()
init2 = init.map(lambda n: (n, n*3))
init = init2.map(lambda n: n[0])
# init.cache()
print init.count()
print str(datetime.datetime.now() - start)
Results in:
0
10000000
0:00:04.283652
1
10000000
0:00:05.998830
2
10000000
0:00:08.771984
3
10000000
0:00:11.399581
4
10000000
0:00:14.206069
5
10000000
0:00:16.856993
So adding cache() helps and iteration time become constant.
init = sc.parallelize(xrange(10000000), 3)
init.cache()
for i in range(6):
print i
start = datetime.datetime.now()
init2 = init.map(lambda n: (n, n*3))
init = init2.map(lambda n: n[0])
init.cache()
print init.count()
print str(datetime.datetime.now() - start)
0
10000000
0:00:04.966835
1
10000000
0:00:04.609885
2
10000000
0:00:04.324358
3
10000000
0:00:04.248709
4
10000000
0:00:04.218724
5
10000000
0:00:04.223368
But when making a join inside the iteration, the problem comes back.
Here is some simple code demonstrating the problem. Even caching each RDD transformation doesn't solve it:
init = sc.parallelize(xrange(10000), 3)
init.cache()
for i in range(6):
print i
start = datetime.datetime.now()
init2 = init.map(lambda n: (n, n*3))
init2.cache()
init3 = init.map(lambda n: (n, n*2))
init3.cache()
init4 = init2.join(init3)
init4.count()
init4.cache()
init = init4.map(lambda n: n[0])
init.cache()
print init.count()
print str(datetime.datetime.now() - start)
And here is the output. As you can see, the iteration time grows exponentially :(
0
10000
0:00:00.674115
1
10000
0:00:00.833377
2
10000
0:00:01.525314
3
10000
0:00:04.194715
4
10000
0:00:08.139040
5
10000
0:00:17.852815
I will really appreciate any help :)
Summary:
Generally speaking, iterative algorithms, especially ones with a self-join or self-union, require control over:
Length of the lineage (see for example Stackoverflow due to long RDD Lineage and unionAll resulting in StackOverflow); a short checkpoint sketch follows after this summary.
Number of partitions.
The problem described here is a result of the lack of the latter. In each iteration the number of partitions increases with the self-join, leading to an exponential pattern. To address that you have to either control the number of partitions in each iteration (see below) or use global tools like spark.default.parallelism (see the answer provided by Travis). In general, the first approach offers much more control and doesn't affect other parts of the code.
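For the lineage part, which the benchmarks below don't cover, a common pattern is to truncate it explicitly every few iterations with a checkpoint (a sketch; the checkpoint directory is a placeholder):
sc.setCheckpointDir("/tmp/rdd_checkpoints")  # placeholder path

init = sc.parallelize(xrange(10000), 8).cache()
for i in range(12):
    init = init.map(lambda n: (n, n * 3)).map(lambda n: n[0]).cache()
    if i % 3 == 0:
        init.checkpoint()  # the lineage is cut when the next action materializes the RDD
    init.count()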
Original answer:
As far as I can tell there are two interleaved problems here: a growing number of partitions, and shuffling overhead during joins. Both can be handled easily, so let's go step by step.
First lets create a helper to collect the statistics:
import datetime
def get_stats(i, init, init2, init3, init4,
start, end, desc, cache, part, hashp):
return {
"i": i,
"init": init.getNumPartitions(),
"init1": init2.getNumPartitions(),
"init2": init3.getNumPartitions(),
"init4": init4.getNumPartitions(),
"time": str(end - start),
"timen": (end - start).seconds + (end - start).microseconds * 10 **-6,
"desc": desc,
"cache": cache,
"part": part,
"hashp": hashp
}
another helper to handle caching/partitioning
def procRDD(rdd, cache=True, part=False, hashp=False, npart=16):
rdd = rdd if not part else rdd.repartition(npart)
rdd = rdd if not hashp else rdd.partitionBy(npart)
return rdd if not cache else rdd.cache()
extract pipeline logic:
def run(init, description, cache=True, part=False, hashp=False,
npart=16, n=6):
times = []
for i in range(n):
start = datetime.datetime.now()
init2 = procRDD(
init.map(lambda n: (n, n*3)),
cache, part, hashp, npart)
init3 = procRDD(
init.map(lambda n: (n, n*2)),
cache, part, hashp, npart)
# If part set to True limit number of the output partitions
init4 = init2.join(init3, npart) if part else init2.join(init3)
init = init4.map(lambda n: n[0])
if cache:
init4.cache()
init.cache()
init.count() # Force computations to get time
end = datetime.datetime.now()
times.append(get_stats(
i, init, init2, init3, init4,
start, end, description,
cache, part, hashp
))
return times
and create initial data:
ncores = 8
init = sc.parallelize(xrange(10000), ncores * 2).cache()
The join operation by itself, if the numPartitions argument is not provided, adjusts the number of partitions in the output based on the number of partitions of the input RDDs. That means a growing number of partitions with each iteration. If the number of partitions gets too large, things get ugly. You can deal with this by providing the numPartitions argument for join, or by repartitioning the RDDs in each iteration.
timesCachePart = sqlContext.createDataFrame(
run(init, "cache + partition", True, True, False, ncores * 2))
timesCachePart.select("i", "init1", "init2", "init4", "time", "desc").show()
+-+-----+-----+-----+--------------+-----------------+
|i|init1|init2|init4| time| desc|
+-+-----+-----+-----+--------------+-----------------+
|0| 16| 16| 16|0:00:01.145625|cache + partition|
|1| 16| 16| 16|0:00:01.090468|cache + partition|
|2| 16| 16| 16|0:00:01.059316|cache + partition|
|3| 16| 16| 16|0:00:01.029544|cache + partition|
|4| 16| 16| 16|0:00:01.033493|cache + partition|
|5| 16| 16| 16|0:00:01.007598|cache + partition|
+-+-----+-----+-----+--------------+-----------------+
As you can see, when we repartition, the execution time is more or less constant.
The second problem is that the above data is partitioned randomly. To ensure join performance we would like to have the same keys on a single partition. To achieve that we can use a hash partitioner:
timesCacheHashPart = sqlContext.createDataFrame(
run(init, "cache + hashpart", True, True, True, ncores * 2))
timesCacheHashPart.select("i", "init1", "init2", "init4", "time", "desc").show()
+-+-----+-----+-----+--------------+----------------+
|i|init1|init2|init4| time| desc|
+-+-----+-----+-----+--------------+----------------+
|0| 16| 16| 16|0:00:00.946379|cache + hashpart|
|1| 16| 16| 16|0:00:00.966519|cache + hashpart|
|2| 16| 16| 16|0:00:00.945501|cache + hashpart|
|3| 16| 16| 16|0:00:00.986777|cache + hashpart|
|4| 16| 16| 16|0:00:00.960989|cache + hashpart|
|5| 16| 16| 16|0:00:01.026648|cache + hashpart|
+-+-----+-----+-----+--------------+----------------+
Execution time is constant as before, and there is a small improvement over the basic partitioning.
Now let's use cache only as a reference:
timesCacheOnly = sqlContext.createDataFrame(
run(init, "cache-only", True, False, False, ncores * 2))
timesCacheOnly.select("i", "init1", "init2", "init4", "time", "desc").show()
+-+-----+-----+-----+--------------+----------+
|i|init1|init2|init4| time| desc|
+-+-----+-----+-----+--------------+----------+
|0| 16| 16| 32|0:00:00.992865|cache-only|
|1| 32| 32| 64|0:00:01.766940|cache-only|
|2| 64| 64| 128|0:00:03.675924|cache-only|
|3| 128| 128| 256|0:00:06.477492|cache-only|
|4| 256| 256| 512|0:00:11.929242|cache-only|
|5| 512| 512| 1024|0:00:23.284508|cache-only|
+-+-----+-----+-----+--------------+----------+
As you can see, the number of partitions (init2, init3, init4) for the cache-only version doubles with each iteration, and the execution time is proportional to the number of partitions.
Finally, we can check whether we can improve performance with a large number of partitions if we use the hash partitioner:
timesCacheHashPart512 = sqlContext.createDataFrame(
run(init, "cache + hashpart 512", True, True, True, 512))
timesCacheHashPart512.select(
"i", "init1", "init2", "init4", "time", "desc").show()
+-+-----+-----+-----+--------------+--------------------+
|i|init1|init2|init4| time| desc|
+-+-----+-----+-----+--------------+--------------------+
|0| 512| 512| 512|0:00:14.492690|cache + hashpart 512|
|1| 512| 512| 512|0:00:20.215408|cache + hashpart 512|
|2| 512| 512| 512|0:00:20.408070|cache + hashpart 512|
|3| 512| 512| 512|0:00:20.390267|cache + hashpart 512|
|4| 512| 512| 512|0:00:20.362354|cache + hashpart 512|
|5| 512| 512| 512|0:00:19.878525|cache + hashpart 512|
+-+-----+-----+-----+--------------+--------------------+
The improvement is not so impressive, but if you have a small cluster and a lot of data it is still worth trying.
I guess the take-away message here is that partitioning matters. There are contexts where it is handled for you (MLlib, SQL), but if you use low-level operations it is your responsibility.
The problem is (as zero323 pointed out in his thorough answer) that calling join without specifying the number of partitions may (does) result in a growing number of partitions. The number of partitions can grow (apparently) without bound. There are (at least) two ways to prevent the number of partitions from growing (without bound) when repeatedly calling join.
Method 1:
As zero323 pointed out, you can specify the number of partitions manually when you call join. For example
rdd1.join(rdd2, numPartitions)
This will ensure that the number of Partitions does not exceed numPartitions and in particular the number of partitions will not continually grow.
Method 2:
When you create your SparkConf you can specify the default level of parallelism. If this value is set, then when you call functions like join without specifying numPartitions, the default parallelism will be used instead, effectively capping the number of partitions and preventing them from growing. You can set this parameter as
conf = SparkConf().set("spark.default.parallelism", numPartitions)
sc = SparkContext(conf=conf)
RDDs are immutable. Try doing rdd = rdd.cache().