Dataflow job doesn't emit messages after GroupByKey() - streaming

I have a streaming Dataflow pipeline that writes to BigQuery, and I want to window all the failed rows for further analysis. The pipeline is shown below. All the error messages arrive at the second step, but they then get stuck at beam.GroupByKey() and nothing moves downstream. Does anyone have any idea how to fix this?
data = (
    | "Read PubSub Messages" >> beam.io.ReadFromPubSub(subscription=options.input_subscription,
                                                       with_attributes=True)
    ...
    | "write to BQ" >> beam.io.WriteToBigQuery(
        table=f"{options.bq_dataset}.{options.bq_table}",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        method='STREAMING_INSERTS',
        insert_retry_strategy=beam.io.gcp.bigquery_tools.RetryStrategy.RETRY_NEVER
    )
)
(
    data[beam.io.gcp.bigquery.BigQueryWriteFn.FAILED_ROWS]
    | f"Window into: {options.window_size}m" >> GroupWindowsIntoBatches(options.window_size)
    | f"Failed Rows for " >> beam.ParDo(BadRows(options.bq_dataset, 'table'))
)
and
class GroupWindowsIntoBatches(beam.PTransform):
    """A composite transform that groups Pub/Sub messages based on publish
    time and outputs a list of dictionaries, where each contains one message
    and its publish timestamp.
    """
    def __init__(self, window_size):
        # Convert minutes into seconds.
        self.window_size = int(window_size * 60)
    def expand(self, pcoll):
        return (
            pcoll
            # Assigns window info to each Pub/Sub message based on its publish timestamp.
            | "Window into Fixed Intervals" >> beam.WindowInto(window.FixedWindows(10))
            # If the windowed elements do not fit into memory please consider using `beam.util.BatchElements`.
            | "Add Dummy Key" >> beam.Map(lambda elem: (None, elem))
            | "Groupby" >> beam.GroupByKey()
            | "Abandon Dummy Key" >> beam.MapTuple(lambda _, val: val)
        )
Also, I don't know if it's relevant, but beam.DoFn.TimestampParam inside my GroupWindowsIntoBatches shows an invalid (negative) timestamp.

OK, so the issue was that the messages coming from the BigQuery FAILED_ROWS output were not timestamped. Adding | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, time.time())) before the window step seems to fix the GroupByKey (this needs import time).
class GroupWindowsIntoBatches(beam.PTransform):
    """A composite transform that groups Pub/Sub messages based on publish
    time and outputs a list of dictionaries, where each contains one message
    and its publish timestamp.
    """
    def __init__(self, window_size):
        # Convert minutes into seconds.
        self.window_size = int(window_size * 60)
    def expand(self, pcoll):
        return (
            pcoll
            # Added this line: stamp each failed row with the current processing time.
            | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, time.time()))
            | "Window into Fixed Intervals" >> beam.WindowInto(window.FixedWindows(30))
            | "Add Dummy Key" >> beam.Map(lambda elem: (None, elem))
            | "Groupby" >> beam.GroupByKey()
            | "Abandon Dummy Key" >> beam.MapTuple(lambda _, val: val)
        )
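To make the role of the timestamp more concrete, here is a minimal plain-Python sketch of how fixed windows bucket elements by timestamp. It only mimics the idea and is not Beam's implementation; the function names and the sample rows are made up for illustration.

```python
from collections import defaultdict


def fixed_window_start(timestamp, size):
    """Return the start of the fixed window [start, start + size) holding timestamp."""
    # Python's % always returns a non-negative remainder for a positive size,
    # so this also works for negative timestamps.
    return timestamp - (timestamp % size)


def group_by_fixed_window(timestamped_values, size):
    """Group (value, timestamp) pairs by the window their timestamp falls into."""
    windows = defaultdict(list)
    for value, timestamp in timestamped_values:
        windows[fixed_window_start(timestamp, size)].append(value)
    return dict(windows)


# Hypothetical failed rows with event times in seconds.
rows = [("row-a", 3), ("row-b", 29), ("row-c", 31), ("row-d", 65)]
batches = group_by_fixed_window(rows, size=30)
print(batches)
```

Each element ends up in exactly one bucket, and which bucket depends only on its timestamp, so elements without a meaningful timestamp cannot be batched into sensible windows.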

Dataflow streaming pipeline using TextIO does not autoscale

I created a streaming pipeline which does the following actions:
get pubsub messages containing addresses of csv files
read the corresponding csv files
write the content of the csv files to a BigQuery table
My pipeline code looks like this:
def run():
    options = PipelineOptions(save_main_session=True, streaming=False, autoscaling_algorithm='THROUGHPUT_BASED', max_num_workers=500)
    options.view_as(GoogleCloudOptions).project = 'my_project'
    options.view_as(GoogleCloudOptions).region = 'europe-west1'
    options.view_as(GoogleCloudOptions).staging_location = 'staging_address'
    options.view_as(GoogleCloudOptions).temp_location = 'temp_address'
    options.view_as(StandardOptions).runner = 'DataflowRunner'
    p = beam.Pipeline(options=options)
    road = (p | 'ReadFromPubSub' >> beam.io.ReadFromPubSub('projects/my_project/topics/my_topic')
            | 'ParseJson' >> beam.Map(parse_json)
            | 'GetFileAddress' >> beam.Map(lambda element: element["fileAddress"])
            | 'ReadCSVFiles' >> beam.io.ReadAllFromText(skip_header_lines=1)
            | 'FormatLines' >> beam.Map(read_line)
            | 'WriteAggToBQ1' >> beam.io.WriteToBigQuery(
                destination_table,
                schema=schema,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
            )
            )
    p.run()
if __name__ == '__main__':
    run()
When running the pipeline on Dataflow and trying to read big csv files (50 GB), the pipeline never scales and keeps being stuck at one worker.
Why does Dataflow refuse to scale in this situation?

unable to Write json to Pubsub topic using apache beam python

I am trying to read from a Pub/Sub topic, do some cleanup/transformation, and write the final result to another Pub/Sub topic. However, I end up with the error shown below. Please guide me.
Code:
Ingest = ( p
    | 'Read from Topic' >> beam.io.ReadFromPubSub(topic=known_args.topic).with_output_types(bytes)
    | 'Parse' >> beam.Map(parse_json)
    | 'Cleanup' >> beam.Map(cleanup)
    | 'write to pubsub' | beam.io.WriteToPubSub("projects/test/topics/cdp_aa_food" , with_attributes=False)
)
The error I am getting is:
raise TypeError("Expected a PTransform object, got %s" % transform)
TypeError: Expected a PTransform object, got write to pubsub
I'm not sure what I am doing wrong.
Ingest = ( p
    | 'Read from Topic' >> beam.io.ReadFromPubSub(topic=known_args.topic).with_output_types(bytes)
    | 'Parse' >> beam.Map(parse_json)
    | 'Cleanup' >> beam.Map(cleanup)
    | 'write to pubsub' >> beam.io.WriteToPubSub("projects/test/topics/cdp_aa_food" , with_attributes=False)
)
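There is a typo in your pipeline: in the write to pubsub step you need >> instead of | between the label and the transform. With |, the label string itself is applied as if it were a transform, which is why the error says it got "write to pubsub".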
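To see why the two operators behave so differently, here is a toy model in plain Python. ToyTransform and ToyPCollection are hypothetical stand-ins written for this illustration, not Beam's classes; the sketch only assumes that a label is attached to a transform through the reflected >> operator and that | checks the type of its right-hand side.

```python
class ToyTransform:
    """Hypothetical stand-in for a transform (not Beam's PTransform)."""

    def __init__(self, name):
        self.name = name
        self.label = None

    def __rrshift__(self, label):
        # Python calls this for `"label" >> transform`, because str does not
        # implement >> for this operand.
        self.label = label
        return self


class ToyPCollection:
    """Hypothetical stand-in for a collection that transforms are applied to."""

    def __init__(self, applied=()):
        self.applied = tuple(applied)

    def __or__(self, transform):
        # `collection | x` only accepts transforms; a bare string is rejected.
        if not isinstance(transform, ToyTransform):
            raise TypeError("Expected a PTransform object, got %s" % transform)
        return ToyPCollection(self.applied + ((transform.label, transform.name),))


pcoll = ToyPCollection()

# Correct: >> binds tighter than |, so the label is attached first.
ok = pcoll | "write to pubsub" >> ToyTransform("WriteToPubSub")
print(ok.applied)

# Typo: | is evaluated left to right, so the string is applied on its own.
try:
    pcoll | "write to pubsub" | ToyTransform("WriteToPubSub")
except TypeError as err:
    error_message = str(err)
    print(error_message)
```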

How to implement cycle detection with pyspark graphframe pregel API

I am trying to implement the algorithm from Rocha & Thatte (http://cdsid.org.br/sbpo2015/wp-content/uploads/2015/08/142825.pdf) with PySpark and the Pregel wrapper from graphframes.
I am getting stuck on the correct syntax for the message aggregation.
The idea is straightforward:
...In each pass, each active vertex of G sends a set of sequences of
vertices to its out-neighbours as described next. In the first pass,
each vertex v sends the message (v) to all its out-neighbours. In
subsequent iterations, each active vertex v appends v to each sequence
it received in the previous iteration. It then sends all the updated
sequences to its out-neighbours. If v has not received any message in
the previous iteration, then v deactivates itself. The algorithm
terminates when all the vertices have been deactivated. ...
My idea is to send the vertex ids to the destination vertices (dst) and collect them into a list in the aggregation function. Then I would like to append/merge the new list items into the existing list in my vertex column "sequence" and use when statements to check whether the current vertex id is already in the sequence. If it is, I could set the corresponding vertex column to true to flag the vertex as being in a cycle.
But I can't find the correct Spark syntax for this concatenation.
Does anyone have an idea, or has anyone implemented something similar?
My current code:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as f
from pyspark.sql.functions import coalesce, col, lit, sum, when
from graphframes import GraphFrame
from graphframes.lib import *
SimpleCycle=[
("1","2"),
("2","3"),
("3","4"),
("4","5"),
("5","2"),
("5","6")
]
edges = sqlContext.createDataFrame(SimpleCycle,["src","dst"]) \
.withColumn("self_loop",when(col("src")==col("dst"),True).otherwise(False))
edges.show()
+---+---+---------+
|src|dst|self_loop|
+---+---+---------+
| 1| 2| false|
| 2| 3| false|
| 3| 4| false|
| 4| 5| false|
| 5| 2| false|
| 5| 6| false|
+---+---+---------+
vertices=edges.select("src").union(edges.select("dst")).distinct().distinct().withColumnRenamed('src', 'id')
#vertices = spark.createDataFrame([[1], [2], [3], [4],[5],[6],[7],[8],[9]], ["id"])
#vertices.sort("id").show()
graph = GraphFrame(vertices, edges)
cycles=graph.pregel \
.setMaxIter(5) \
.withVertexColumn("is_cycle", lit(""),lit("logic to be added")) \
.withVertexColumn("sequence", lit(""),Pregel.msg()) \
.sendMsgToDst(Pregel.src("id")) \
.aggMsgs(f.collect_list(Pregel.msg())) \
.run()
cycles.show()
+---+-----------------+--------+
| id| is_cycle|sequence|
+---+-----------------+--------+
| 3|logic to be added| [2]|
| 5|logic to be added| [4]|
| 6|logic to be added| [5]|
| 1|logic to be added| null|
| 4|logic to be added| [3]|
| 2|logic to be added| [5, 1]|
+---+-----------------+--------+
Code that does not work but what I think the logic should be
cycles=graph.pregel \
.setMaxIter(5) \
.withVertexColumn("is_cycle", lit(""), \
when(Pregel.src("id").isin(Pregel.src(sequence)),True).otherwise(False) \
.withVertexColumn("sequence", lit("null"),Append_To_Existing_List(Pregel.msg()) \
.sendMsgToDst(
when(Pregel.src("sequence").isNull(),Pregel.src("id")) \
.otherwise(Pregel.src("sequence")) \
.aggMsgs(f.collect_list(Pregel.msg())) \
.run()
# I would like to have a result like
+---+-----------------+---------+
| id| is_cycle|sequence |
+---+-----------------+---------+
| 1|false | [1] |
| 2|true |[2,3,4,5]|
| 3|true |[2,3,4,5]|
| 4|true |[2,3,4,5]|
| 5|true |[2,3,4,5]|
| 6|false | null |
+---+-----------------+---------+
In the end I implemented the Rocha-Thatte algorithm not via Pregel but with the underlying message aggregation function of GraphFrames/GraphX. In case someone is interested, I'd like to share the solution.
This solution works correctly and can handle very large graphs without failing. However, it gets quite slow when the graph is large or the cycles are long.
I'm not sure how to improve this right now, possibly by using checkpoints or broadcasting in a smart way.
I'm happy about any input for improvement.
# standard library modules (used for logging and timing inside find_cycles)
import logging
import time
# spark modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.window import Window
import pyspark.sql.functions as f
# graphframes modules
from graphframes import GraphFrame
from graphframes.lib import *
AM=AggregateMessages
# module-level logger used by find_cycles()
_logger = logging.getLogger(__name__)
def find_cycles(sqlContext,sc,vertices,edges,max_iter=100000):
    # Cycle detection via message aggregation
    """
    This code is an implementation of the Rocha-Thatte algorithm for large-scale sparse graphs
    Source:
    ==============
    wiki: https://en.wikipedia.org/wiki/Rocha%E2%80%93Thatte_cycle_detection_algorithm
    paper: https://www.researchgate.net/publication/283642998_Distributed_cycle_detection_in_large-scale_sparse_graphs
    The basic idea:
    ===============
    We propose a general algorithm for detecting cycles in a directed graph G by message passing among its vertices,
    based on the bulk synchronous message passing abstraction. This is a vertex-centric approach in which the vertices
    of the graph work together for detecting cycles. The bulk synchronous parallel model consists of a sequence of iterations,
    in each of which a vertex can receive messages sent by other vertices in the previous iteration, and send messages to other
    vertices.
    In each pass, each active vertex of G sends a set of sequences of vertices to its out-neighbours as described next.
    In the first pass, each vertex v sends the message (v) to all its out-neighbours. In subsequent iterations, each active vertex v
    appends v to each sequence it received in the previous iteration. It then sends all the updated sequences to its out-neighbours.
    If v has not received any message in the previous iteration, then v deactivates itself. The algorithm terminates when all the
    vertices have been deactivated.
    For a sequence (v1, v2, . . . , vk) received by vertex v, the appended sequence is not forwarded in two cases: (i) if v = v1,
    then v has detected a cycle, which is reported (see line 9 of Algorithm 1); (ii) if v = vi for some i ∈ {2, 3, . . . , k},
    then v has detected a sequence that contains the cycle (v = vi, vi+1, . . . , vk, vk+1 = v); in this case,
    the sequence is discarded, since the cycle must have been detected in an earlier iteration (see line 11 of Algorithm 1);
    to be precise, this cycle must have been detected in iteration k − i + 1. Every cycle (v1, v2, . . . , vk, vk+1 = v1)
    is detected by all vi, i = 1 to k, in the same iteration; it is reported by the vertex min{v1,...,vk} (see line 9 of Algorithm 1).
    The total number of iterations of the algorithm is the number of vertices in the longest path in the graph, plus a few more steps
    for deactivating the final vertices. During the analysis of the total number of iterations, we ignore the few extra iterations
    needed for deactivating the final vertices and detecting the end of the computation, since it is O(1).
    Pseudocode of the algorithm:
    ============================
    M(v): messages received by vertex v
    N+(v): all dst vertices of v
    function COMPUTE(M(v)):
        if i=0 then:
            for each w ∈ N+(v) do:
                send (v) to w
        else if M(v) = ∅ then:
            deactivate v and halt
        else:
            for each (v1,v2,...,vk) ∈ M(v) do:
                if v1 = v and min{v1,v2,...,vk} = v then:
                    report (v1 = v,v2,...,vk,vk+1 = v)
                else if v not ∈ {v2,...,vk} then:
                    for each w ∈ N+(v) do:
                        send (v1,v2,...,vk,v) to w
    Scalability of the algorithm:
    ============================
    the number of iterations depends on the length of the longest path
    the scaling is between
    O(log(n)) and at most O(n), where n = number of vertices
    so the number of iterations is at most linear in the number of vertices,
    if there are more edges (parallel etc.) it will not affect the runtime
    for more details please refer to the original publication
    """
    _logger.warning("+++ find_cycles(): starting cycle search ...")
    # remember the start time so the total runtime can be logged at the end
    start_time = time.time()
    # create empty dataframe to collect all cycles
    cycles = sqlContext.createDataFrame(sc.emptyRDD(),StructType([StructField("cycle",ArrayType(StringType()),True)]))
    # initialize the message column with own source id
    init_vertices=(vertices
        .withColumn("message",f.array(f.col("id")))
    )
    init_edges=(edges
        .where(f.col("src")!=f.col("dst"))
        .select("src","dst")
    )
    # create graph object that will be updated each iteration
    gx = GraphFrame(init_vertices, init_edges)
    # iterate until max_iter
    # max_iter is used in case the break condition is never reached during this time
    # default value=100,000
    for iter_ in range(max_iter):
        # message that should be sent to destination for aggregation
        msgToDst = AM.src["message"]
        # aggregate all messages that were received into a set (drops duplicate edges)
        agg = gx.aggregateMessages(
            f.collect_set(AM.msg).alias("aggMess"),
            sendToSrc=None,
            sendToDst=msgToDst)
        # BREAK condition: if no more messages are received all cycles were found
        # and we can quit the loop
        if(len(agg.take(1))==0):
            #print("THE END: All cycles found in " + str(iter_) + " iterations")
            break
        # apply the algorithm logic
        # filter for cycles that should be reported as found
        # compose new message to be sent for next iteration
        # a leading underscore marks temporary columns that are only used in the algorithm and then dropped again
        checkVerties=(
            agg
            # flatten the aggregated message from [[2]] to [] in order to have proper 1D arrays
            .withColumn("_flatten1",f.explode(f.col("aggMess")))
            # take first element of the array
            .withColumn("_first_element_agg",f.element_at(f.col("_flatten1"), 1))
            # take minimum element of the array
            .withColumn("_min_agg",f.array_min(f.col("_flatten1")))
            # check if it is a cycle
            # it is a cycle when v1 = v and min{v1,v2,...,vk} = v
            .withColumn("_is_cycle",f.when(
                (f.col("id")==f.col("_first_element_agg")) &
                (f.col("id")==f.col("_min_agg"))
                ,True)
                .otherwise(False)
            )
            # pick cycle that should be reported=append to cycle list
            .withColumn("_cycle_to_report",f.when(f.col("_is_cycle")==True,f.col("_flatten1")).otherwise(None))
            # sort array so that duplicates look the same
            .withColumn("_cycle_to_report",f.sort_array("_cycle_to_report"))
            # create column where the first element is removed to check if the current vertex is part of (v2,...,vk)
            .withColumn("_slice",f.array_except(f.col("_flatten1"), f.array(f.element_at(f.col("_flatten1"), 1))))
            # check if vertex is part of the slice and set True/False column
            .withColumn("_is_cycle2",f.lit(f.size(f.array_except(f.array(f.col("id")), f.col("_slice"))) == 0))
        )
        #print("checked Vertices")
        #checkVerties.show(truncate=False)
        # append found cycles to result dataframe via union
        cycles=(
            # take existing cycles dataframe
            cycles
            .union(
                # union=append all cycles that are in the current reporting column
                checkVerties
                .where(f.col("_cycle_to_report").isNotNull())
                .select("_cycle_to_report")
            )
        )
        # create list of new messages that will be sent in the next iteration to the vertices
        newVertices=(
            checkVerties
            # append current vertex id at the end of the received sequence
            .withColumn("message",f.concat(
                f.coalesce(f.col("_flatten1"), f.array()),
                f.coalesce(f.array(f.col("id")), f.array())
            ))
            # only send where it is no cycle duplicate
            .where(f.col("_is_cycle2")==False)
            .select("id","message")
        )
        print("vertices to send forward")
        newVertices.sort("id").show(truncate=False)
        # cache new vertices using workaround for SPARK-13346
        cachedNewVertices = AM.getCachedDataFrame(newVertices)
        # update graphframe object for next round
        gx = GraphFrame(cachedNewVertices, gx.edges)
    # materialize results and get number of found cycles
    #cycles_count=cycles.persist().count()
    _cycle_statistics=(
        cycles
        .withColumn("cycle_length",f.size(f.col("cycle")))
        .agg(f.count(f.col("cycle")),f.max(f.col("cycle_length")),f.min(f.col("cycle_length")))
    ).collect()
    cycle_statistics={"count":_cycle_statistics[0]["count(cycle)"],"max":_cycle_statistics[0]["max(cycle_length)"],"min":_cycle_statistics[0]["min(cycle_length)"]}
    end_time =time.time()
    _logger.warning("+++ find_cycles(): " + str(cycle_statistics["count"]) + " cycles found in " + str(iter_) + " iterations (min length=" + str(cycle_statistics["min"]) +", max length="+ str(cycle_statistics["max"]) +") in " + str(end_time-start_time) + " seconds")
    _logger.warning("+++ #########################################################################################")
    return cycles, cycle_statistics
This function takes graphs like the following two examples, SimpleCycle and NestedCycle:
SimpleCycle=[
("0","1"),
("1","2"),
("2","3"),
("3","4"),
("3","1")]
NestedCycle=[
("1","2"),
("2","3"),
("3","4"),
("4","1"),
("3","1"),
("5","1"),
("5","2")]
# the output shown below is for NestedCycle
edges = sqlContext.createDataFrame(NestedCycle,["src","dst"])
vertices=edges.select("src").union(edges.select("dst")).distinct().withColumnRenamed('src', 'id')
edges.show()
# +---+---+
# |src|dst|
# +---+---+
# | 1| 2|
# | 2| 3|
# | 3| 4|
# | 4| 1|
# | 3| 1|
# | 5| 1|
# | 5| 2|
# +---+---+
# find_cycles() returns both the cycles dataframe and a statistics dict
raw_cycles, cycle_statistics = find_cycles(sqlContext,sc,vertices,edges,max_iter=1000)
raw_cycles.show()
# +------------+
# | cycle|
# +------------+
# | [1, 2, 3]|
# |[1, 2, 3, 4]|
#+------------+
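For cross-checking results on small graphs, a single-machine version of the same message-passing idea can be handy. The following is a minimal plain-Python sketch under my own assumptions: it is not the GraphFrames code above, the function name rocha_thatte_cycles is made up, and it keeps every message in memory, so it is only meant for tiny test graphs.

```python
from collections import defaultdict


def rocha_thatte_cycles(edges):
    """Return simple cycles of a directed graph, each reported once by its minimum vertex."""
    out_neighbours = defaultdict(set)
    vertices = set()
    for src, dst in edges:
        vertices.update((src, dst))
        if src != dst:  # self loops are ignored, as in the Spark version
            out_neighbours[src].add(dst)

    # First pass: every vertex v sends the sequence (v) to its out-neighbours.
    inbox = defaultdict(set)
    for v in vertices:
        for w in out_neighbours[v]:
            inbox[w].add((v,))

    cycles = []
    # A vertex without messages is inactive; stop when nobody received anything.
    while inbox:
        next_inbox = defaultdict(set)
        for v, sequences in inbox.items():
            for seq in sequences:
                if seq[0] == v:
                    # The sequence came back to its start: a cycle. Report it
                    # only at the minimum vertex so it is listed once.
                    if min(seq) == v:
                        cycles.append(list(seq))
                elif v not in seq:
                    # Still a simple path: append v and forward it.
                    for w in out_neighbours[v]:
                        next_inbox[w].add(seq + (v,))
                # Otherwise v is inside the sequence: discard it.
        inbox = next_inbox
    return cycles


nested_cycle = [("1", "2"), ("2", "3"), ("3", "4"), ("4", "1"),
                ("3", "1"), ("5", "1"), ("5", "2")]
found = sorted(rocha_thatte_cycles(nested_cycle))
print(found)
```

Unlike the Spark version, this sketch keeps each cycle in traversal order starting from its smallest vertex instead of sorting the array.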

Collecting output from Apache Beam pipeline and displaying it to console

I have been working with Apache Beam for a couple of days. I want to iterate quickly on the application I am working on and make sure the pipeline I am building is error free. In Spark we can use sc.parallelize, and when we apply an action we get a value that we can inspect.
Similarly, when I was reading about Apache Beam, I found that we can create a PCollection and work with it using the following syntax:
with beam.Pipeline() as pipeline:
    lines = pipeline | beam.Create(["this is test", "this is another test"])
    word_count = (lines
                  | "Word" >> beam.ParDo(lambda line: line.split(" "))
                  | "Pair of One" >> beam.Map(lambda w: (w, 1))
                  | "Group" >> beam.GroupByKey()
                  # MapTuple unpacks the (word, ones) pair; tuple-unpacking lambdas are Python 2 only
                  | "Count" >> beam.MapTuple(lambda w, o: (w, sum(o))))
result = pipeline.run()
I actually wanted to print the result to console. But I couldn't find any documentation around it.
Is there a way to print the result to console instead of saving it to a file each time?
You don't need the temp list. In python 2.7 the following should be sufficient:
def print_row(row):
    print row
(pipeline
    | ...
    | "print" >> beam.Map(print_row)
)
result = pipeline.run()
result.wait_until_finish()
In python 3.x, print is a function so the following is sufficient:
(pipeline
| ...
| "print" >> beam.Map(print)
)
result = pipeline.run()
result.wait_until_finish()
After exploring further and learning how to write test cases for my application, I figured out a way to print the result to the console. Please note that I am currently running everything on a single-node machine and trying to understand the functionality provided by Apache Beam and how I can adopt it without compromising on industry best practices.
So, here is my solution. At the very last stage of the pipeline we can introduce a map function that either prints the result to the console or accumulates the result in a variable, which we can print later to see the value.
import apache_beam as beam
# let's have some sample strings
data = ["this is sample data", "this is yet another sample data"]
# create a pipeline
pipeline = beam.Pipeline()
counts = (pipeline | "create" >> beam.Create(data)
          | "split" >> beam.ParDo(lambda row: row.split(" "))
          | "pair" >> beam.Map(lambda w: (w, 1))
          | "group" >> beam.CombinePerKey(sum))
# let's collect our result with a map transformation into the output list
output = []
def collect(row):
    output.append(row)
    return True
counts | "print" >> beam.Map(collect)
# run the pipeline
result = pipeline.run()
# let's wait until the result is available
result.wait_until_finish()
# print the output
print(output)
Maybe log the elements instead of printing them?
import logging
def _logging(elem):
    logging.info(elem)
    return elem
P | "logging info" >> beam.Map(_logging)
Following an example from PyCharm Edu:
import apache_beam as beam
class LogElements(beam.PTransform):
    class _LoggingFn(beam.DoFn):
        def __init__(self, prefix=''):
            super(LogElements._LoggingFn, self).__init__()
            self.prefix = prefix
        def process(self, element, **kwargs):
            # print each element and pass it through unchanged
            print(self.prefix + str(element))
            yield element
    def __init__(self, label=None, prefix=''):
        super(LogElements, self).__init__(label)
        self.prefix = prefix
    def expand(self, input):
        # return the result so that further transforms can be chained after LogElements
        return input | beam.ParDo(self._LoggingFn(self.prefix))
class MultiplyByTenDoFn(beam.DoFn):
    def process(self, element):
        yield element * 10
p = beam.Pipeline()
(p | beam.Create([1, 2, 3, 4, 5])
   | beam.ParDo(MultiplyByTenDoFn())
   | LogElements())
p.run()
Output
10
20
30
40
50
Out[10]: <apache_beam.runners.portability.fn_api_runner.RunnerResult at 0x7ff41418a210>
I know it isn't what you asked for, but why don't you store the result in a text file? That is usually better than printing to stdout, and unlike console output it is not lost afterwards.

How would I test that a PowerShell function properly streams input from the pipeline?

I know how to write a function that streams input from the pipeline. I can reasonably tell by reading the source for a function if it will perform properly. However, is there any method for actually testing for the correct behavior?
I accept any definition of "testing"... be that some manual test that I can run or something more automated.
If you need an example, let's say I have a function that splits text into words.
PS> Get-Content ./warandpeace.txt | Split-Text
How would I check that it streams input from the pipeline and begins splitting immediately?
You can write a helper function that indicates when each pipeline item is passed to it and when the next command has finished processing that item:
function Print-Pipeline {
    param($Name, [ConsoleColor]$Color)
    begin {
        $ColorParameter = if($PSBoundParameters.ContainsKey('Color')) {
            @{ ForegroundColor = $Color }
        } else {
            @{ }
        }
    }
    process {
        Write-Host "${Name}|Before|$_" @ColorParameter
        ,$_
        Write-Host "${Name}|After|$_" @ColorParameter
    }
}
Suppose you have some functions to test:
$Text = 'Some', 'Random', 'Text'
function CharSplit1 { $Input | % GetEnumerator }
filter CharSplit2 { $Input | % GetEnumerator }
And you can test them like that:
PS> $Text |
>>> Print-Pipeline Before` CharSplit1 |
>>> CharSplit1 |
>>> Print-Pipeline After` CharSplit1
Before CharSplit1|Before|Some
Before CharSplit1|After|Some
Before CharSplit1|Before|Random
Before CharSplit1|After|Random
Before CharSplit1|Before|Text
Before CharSplit1|After|Text
After CharSplit1|Before|S
S
After CharSplit1|After|S
After CharSplit1|Before|o
o
After CharSplit1|After|o
After CharSplit1|Before|m
m
After CharSplit1|After|m
After CharSplit1|Before|e
e
After CharSplit1|After|e
After CharSplit1|Before|R
R
After CharSplit1|After|R
After CharSplit1|Before|a
a
After CharSplit1|After|a
After CharSplit1|Before|n
n
After CharSplit1|After|n
After CharSplit1|Before|d
d
After CharSplit1|After|d
After CharSplit1|Before|o
o
After CharSplit1|After|o
After CharSplit1|Before|m
m
After CharSplit1|After|m
After CharSplit1|Before|T
T
After CharSplit1|After|T
After CharSplit1|Before|e
e
After CharSplit1|After|e
After CharSplit1|Before|x
x
After CharSplit1|After|x
After CharSplit1|Before|t
t
After CharSplit1|After|t
PS> $Text |
>>> Print-Pipeline Before` CharSplit2 |
>>> CharSplit2 |
>>> Print-Pipeline After` CharSplit2
Before CharSplit2|Before|Some
After CharSplit2|Before|S
S
After CharSplit2|After|S
After CharSplit2|Before|o
o
After CharSplit2|After|o
After CharSplit2|Before|m
m
After CharSplit2|After|m
After CharSplit2|Before|e
e
After CharSplit2|After|e
Before CharSplit2|After|Some
Before CharSplit2|Before|Random
After CharSplit2|Before|R
R
After CharSplit2|After|R
After CharSplit2|Before|a
a
After CharSplit2|After|a
After CharSplit2|Before|n
n
After CharSplit2|After|n
After CharSplit2|Before|d
d
After CharSplit2|After|d
After CharSplit2|Before|o
o
After CharSplit2|After|o
After CharSplit2|Before|m
m
After CharSplit2|After|m
Before CharSplit2|After|Random
Before CharSplit2|Before|Text
After CharSplit2|Before|T
T
After CharSplit2|After|T
After CharSplit2|Before|e
e
After CharSplit2|After|e
After CharSplit2|Before|x
x
After CharSplit2|After|x
After CharSplit2|Before|t
t
After CharSplit2|After|t
Before CharSplit2|After|Text
Add some Write-Verbose statements to your Split-Text function, and then call it with the -Verbose parameter. You should see output in real-time.
Ah, I've got a very simple solution. The concept is to insert your own step into the pipeline with obvious side-effects before the function that you're testing. For example...
PS> 1..10 | %{ Write-Host $_; $_ } | function-under-test
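If your function-under-test is "bad", it collects all of its input before producing anything, so you will see the whole of 1..10 printed first and then again as output, like this (shortened here to three items)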
1
2
3
1
2
3
If the function-under-test is processing items lazily from the pipeline, you'll see the output interleaved.
1
1
2
2
3
3
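The same probe idea can be sketched outside PowerShell as well. The following is only a loose Python analogy using generators, with made-up function names: a probe stage records when it hands an item downstream, and the order of the log entries shows whether the stage under test streams or buffers.

```python
def probe(items, log):
    """Side-effect stage placed before the stage under test."""
    for item in items:
        log.append(("probe", item))
        yield item


def streaming_stage(items, log):
    """Handles each item as soon as it arrives."""
    for item in items:
        log.append(("stage", item))
        yield item


def buffering_stage(items, log):
    """Collects the whole input before handling anything."""
    collected = list(items)
    for item in collected:
        log.append(("stage", item))
        yield item


def trace(stage):
    """Run a stage behind the probe and return the order of events."""
    log = []
    for _ in stage(probe([1, 2, 3], log), log):
        pass
    return log


print(trace(streaming_stage))  # probe and stage entries alternate
print(trace(buffering_stage))  # all probe entries come first
```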