Collecting output from Apache Beam pipeline and displaying it to console - apache-beam

I have been working on Apache Beam for a couple of days. I want to iterate quickly on the application I am working on and make sure the pipeline I am building is error free. In Spark we can use sc.parallelize, and when we apply some action we get a value back that we can inspect.
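For reference, the Spark pattern I mean is roughly the following (a minimal PySpark sketch; collect() pulls the results back to the driver so they can be inspected directly):
from pyspark import SparkContext

# Build a small word count locally and inspect the result on the driver.
sc = SparkContext("local", "inspect-demo")
counts = (sc.parallelize(["this is test", "this is another test"])
            .flatMap(lambda line: line.split(" "))
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.collect())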
Similarly, when I was reading about Apache Beam, I found that we can create a PCollection and work with it using the following syntax:
with beam.Pipeline() as pipeline:
    lines = pipeline | beam.Create(["this is test", "this is another test"])
    word_count = (lines
                  | "Word" >> beam.ParDo(lambda line: line.split(" "))
                  | "Pair of One" >> beam.Map(lambda w: (w, 1))
                  | "Group" >> beam.GroupByKey()
                  | "Count" >> beam.Map(lambda (w, o): (w, sum(o))))
    result = pipeline.run()
I actually wanted to print the result to the console, but I couldn't find any documentation around it.
Is there a way to print the result to the console instead of saving it to a file each time?

You don't need the temp list. In Python 2.7 the following should be sufficient:
def print_row(row):
    print row

(pipeline
 | ...
 | "print" >> beam.Map(print_row)
)
result = pipeline.run()
result.wait_until_finish()
In Python 3.x, print is a function, so the following is sufficient:
(pipeline
 | ...
 | "print" >> beam.Map(print)
)
result = pipeline.run()
result.wait_until_finish()
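For a completely self-contained check, something along these lines should work end to end (a minimal sketch using the default DirectRunner; beam.combiners.Count.PerElement() stands in for the manual GroupByKey/sum from the question):
import apache_beam as beam

# The with-block runs the pipeline on exit, and each (word, count) pair
# is printed as it reaches the final step.
with beam.Pipeline() as pipeline:
    (pipeline
     | beam.Create(["this is test", "this is another test"])
     | beam.FlatMap(lambda line: line.split(" "))
     | beam.combiners.Count.PerElement()
     | "print" >> beam.Map(print))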

After exploring further and learning how to write test cases for my application, I figured out a way to print the result to the console. Please note that right now I am running everything on a single-node machine, trying to understand the functionality Apache Beam provides and how I can adopt it without compromising industry best practices.
So, here is my solution. At the very last stage of our pipeline we can introduce a map function that will either print the result to the console or accumulate the result in a variable, which we can print later to see the values.
import apache_beam as beam

# let's have some sample data
data = ["this is sample data", "this is yet another sample data"]

# create a pipeline
pipeline = beam.Pipeline()
counts = (pipeline | "create" >> beam.Create(data)
                   | "split" >> beam.ParDo(lambda row: row.split(" "))
                   | "pair" >> beam.Map(lambda w: (w, 1))
                   | "group" >> beam.CombinePerKey(sum))

# let's collect our result with a map transformation into an output list
output = []
def collect(row):
    output.append(row)
    return True

counts | "print" >> beam.Map(collect)

# run the pipeline
result = pipeline.run()
# let's wait until the result is available
result.wait_until_finish()
# print the output
print(output)

Maybe use logging.info instead of print?
import logging

def _logging(elem):
    logging.info(elem)
    return elem

P | "logging info" >> beam.Map(_logging)
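A complete, runnable version of that idea might look like this (a minimal sketch, assuming the default DirectRunner; on a remote runner such as Dataflow, logging output goes to the worker logs):
import logging
import apache_beam as beam

# Make INFO-level messages visible when running locally.
logging.getLogger().setLevel(logging.INFO)

def _logging(elem):
    logging.info("element: %s", elem)
    return elem

with beam.Pipeline() as p:
    (p
     | beam.Create(["this is test", "this is another test"])
     | "logging info" >> beam.Map(_logging))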

Here is an example following the one from PyCharm Edu:
import apache_beam as beam

class LogElements(beam.PTransform):
    class _LoggingFn(beam.DoFn):
        def __init__(self, prefix=''):
            super(LogElements._LoggingFn, self).__init__()
            self.prefix = prefix

        def process(self, element, **kwargs):
            print(self.prefix + str(element))
            yield element

    def __init__(self, label=None, prefix=''):
        super(LogElements, self).__init__(label)
        self.prefix = prefix

    def expand(self, input):
        return input | beam.ParDo(self._LoggingFn(self.prefix))

class MultiplyByTenDoFn(beam.DoFn):
    def process(self, element):
        yield element * 10

p = beam.Pipeline()
(p | beam.Create([1, 2, 3, 4, 5])
   | beam.ParDo(MultiplyByTenDoFn())
   | LogElements())
p.run()
Output
10
20
30
40
50
Out[10]: <apache_beam.runners.portability.fn_api_runner.RunnerResult at 0x7ff41418a210>

I know it isn't what you asked for, but why don't you store it in a text file? It's always better than printing via stdout, and it isn't volatile.
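If that route is acceptable, a minimal sketch could look like this (the word_counts file name prefix is just an example; WriteToText writes one or more sharded text files under that prefix):
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
     | beam.Create(["this is sample data", "this is yet another sample data"])
     | beam.FlatMap(lambda row: row.split(" "))
     | beam.combiners.Count.PerElement()
     # Format each (word, count) pair as a line of text before writing.
     | beam.Map(lambda kv: "%s: %d" % kv)
     | beam.io.WriteToText("word_counts"))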

Related

I override variables with OMPython but the simulation starts with the default values

I'm working on a system made in OpenModelica and now I want to simulate it multiple times. On every iteration I have to change some values, and I'm using OMPython to do that. This is the part of the code where I override the parameters of interest:
# Overwrite parameters
with open("newValues.txt", 'wt') as f:
    f.write("const.N="+str(n)+"\n")
    f.write("const.nIntr="+str(intr)+"\n")
    f.write("const.nRocket="+str(miss)+"\n")
    f.write("const.nStatObs="+str(statObs)+"\n")
    for i in range(len(fault)):
        for j in range(len(fault[i])):
            f.write("fault.transMatrix["+str(i+1)+","+str(j+1)+"]="+str(fault[i][j])+"\n")
    for i in range(3):
        f.write("const.flyZone["+str(i+1)+"]="+str(flyZone)+"\n")
    f.flush()
    os.fsync(f)
os.system("./System -overrideFile=newValues.txt >> LogOverride.txt")
Now, if I try to read the overridden values from the .mat file, for example with this method:
print(str(omc.sendExpression("val(const.N," + str(stopTime) + ", \"System_res.mat\")")))
they are changed, but the simulation starts with the default value.
For example, if I set
const.N = 7
and in the .mo file it is
const.N = 5
the simulation counts N as 5 and not 7.
For more info, this is the github repository: https://github.com/BigMautone/Drones.git
The Python script is in pythonScripts/testing.py and most of the parameters changed are in constant.mo.
EDIT 1:
For more context, I'm explaining the purpose of the system and why I can't see the changes made with the script. The system has to simulate a swarm of drones and implements different algorithms for pathfinding and obstacle avoidance. The record class K holds all the parameters used in the system, like the N parameter, which defines the number of drones in the swarm. Because I want to simulate the system multiple times with different parameters, I have to change it. So, if the default value of K.N is 5 and I change it to 7 inside the Python script, I expect to see 7 drones doing something. Instead, the system is simulated with the default values!
I'm adding here the part of the code where I can't see the changes made with the overrideFile option.
"""
Here i load all the file and model needed...
"""
omc.sendExpression("buildModel(System, stopTime=180)")
omc.sendExpression("getErrorString()")
#This is the function i made for run the simulation multiple times
def startSimulation(n, intr, miss, statObs, fault, flyZone):
#Overwrite parameters
with open("newValues.txt", 'wt') as f:
f.write("const.N="+str(n)+"\n")
f.write("const.nIntr="+str(intr)+"\n")
f.write("const.nRocket="+str(miss)+"\n")
f.write("const.nStatObs="+str(statObs)+"\n")
for i in range(len(fault)):
for j in range(len(fault[i])):
f.write("fault.transMatrix["+str(i+1)+","+str(j+1)+"]="+str(fault[i][j])+"\n")
for i in range(3):
f.write("const.flyZone["+str(i+1)+"]="+str(flyZone)+"\n")
f.flush()
os.fsync(f)
os.system("./System -overrideFile=newValues.txt >> LogOverride.txt")
os.system("rm -f newValues.txt") # .... to be on the safe side
#Down there i extract some values...
for j in range(1,n+1):
arrivalTime = omc.sendExpression("val(sucMo.arrivalTime[" + str(j) + "]," + str(stopTime) + ", \"System_res.mat\")")
droneArrived = omc.sendExpression("val(sucMo.arrived[" + str(j) + "]," + str(stopTime) + ", \"System_res.mat\")")
droneInfo[j] = (droneArrived, arrivalTime)
#Here i call the function and try to execute it
startSimulation(7,1,1,1,noFault,100)
If I print the droneInfo dictionary, it will obviously have 7 keys, but if the default value of N is equal to 5, then I'll get this output:
{1: (1.0, 8.0), 2: (1.0, 5.0), 3: (1.0, 8.0), 4: (1.0, 9.0), 5: (1.0, 12.0), 6: ('NaN', 'NaN'), 7: ('NaN', 'NaN')} 0.0
EDIT 2: I have made some changes. I had not instantiated the K class, which I use a lot for array lengths and for loops. Now that I've done it in all the other models, the problem persists. Also, all the values of the K class now have the isValueChangeble flag set to false.
I tried to reproduce this. Made this small script to read the value at the end:
adrpo33@ida-0030 MINGW64 /c/home/adrpo33/dev/modelica/Drones
# cat val.mos
val(const.N, 9, "System_res.mat"); getErrorString();
Then ran the run.mos script:
adrpo33@ida-0030 MINGW64 /c/home/adrpo33/dev/modelica/Drones
# omc run.mos
record SimulationResult
resultFile = "C:/home/adrpo33/dev/modelica/Drones/System_res.mat",
simulationOptions = "startTime = 0.0, stopTime = 30.0, numberOfIntervals = 500, tolerance = 1e-06, method = 'dassl', fileNamePrefix = 'System', options = '', outputFormat = 'mat', variableFilter = '.*', cflags = '', simflags = ''",
messages = "LOG_SUCCESS | info | The initialization finished successfully without homotopy method.
[C:/home/adrpo33/dev/modelica/Drones/Monitors/MonitorSuccess.mo:52:28-52:116:writable]
stdout | info | Simulation call terminate() at time 9.000000
| | | | Message : Tutti i droni hanno raggiunto la destinazione oppure hanno avuto collisioni
LOG_SUCCESS | info | The simulation finished successfully.
",
timeFrontend = 0.3274193,
timeBackend = 1.4592339,
timeSimCode = 0.5092729,
timeTemplates = 0.1533226,
timeCompile = 24.1450323,
timeSimulation = 3.3621366,
timeTotal = 29.9572641
end SimulationResult;
"Warning: The initial conditions are not fully specified. For more information set -d=initialization. In OMEdit Tools->Options->Simulation->Show additional information from the initialization process, in OMNotebook call setCommandLineOptions(\"-d=initialization\").
"
Take the value out:
adrpo33@ida-0030 MINGW64 /c/home/adrpo33/dev/modelica/Drones
# omc val.mos
5.0
""
Override the value:
adrpo33@ida-0030 MINGW64 /c/home/adrpo33/dev/modelica/Drones
# ./System.exe -override const.N=7
LOG_SUCCESS | info | The initialization finished successfully without homotopy method.
[C:/home/adrpo33/dev/modelica/Drones/Monitors/MonitorSuccess.mo:52:28-52:116:writable]
stdout | info | Simulation call terminate() at time 9.000000
| | | | Message : Tutti i droni hanno raggiunto la destinazione oppure hanno avuto collisioni
LOG_SUCCESS | info | The simulation finished successfully.
Take the value out:
adrpo33@ida-0030 MINGW64 /c/home/adrpo33/dev/modelica/Drones
# omc val.mos
7.0
""
So I cannot really reproduce your behavior. When I change something the .mat file reflects the change.
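If it helps, the same check can also be driven from the Python script next to the existing omc session (a sketch only; it mirrors the shell commands above and assumes the compiled System binary is in the working directory):
import subprocess

# Run the already-built model binary with a direct -override instead of
# -overrideFile, then read the overridden value back from the result file.
def run_with_override(omc, n, stop_time=9):
    subprocess.run(["./System", "-override", "const.N=" + str(n)], check=True)
    return omc.sendExpression('val(const.N, %s, "System_res.mat")' % stop_time)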

Dataflow streaming pipeline using TextIO does not autoscale

I created a streaming pipeline which does the following actions:
get pubsub messages containing addresses of csv files
read the corresponding csv files
write the content of the csv files to a BigQuery table
My pipeline code looks like this:
def run():
    options = PipelineOptions(save_main_session=True, streaming=False,
                              autoscaling_algorithm='THROUGHPUT_BASED', max_num_workers=500)
    options.view_as(GoogleCloudOptions).project = 'my_project'
    options.view_as(GoogleCloudOptions).region = 'europe-west1'
    options.view_as(GoogleCloudOptions).staging_location = 'staging_address'
    options.view_as(GoogleCloudOptions).temp_location = 'temp_address'
    options.view_as(StandardOptions).runner = 'DataflowRunner'

    p = beam.Pipeline(options=options)
    road = (p | 'ReadFromPubSub' >> beam.io.ReadFromPubSub('projects/my_project/topics/my_topic')
              | 'ParseJson' >> beam.Map(parse_json)
              | 'GetFileAddress' >> beam.Map(lambda element: element["fileAddress"])
              | 'ReadCSVFiles' >> beam.io.ReadAllFromText(skip_header_lines=1)
              | 'FormatLines' >> beam.Map(read_line)
              | 'WriteAggToBQ1' >> beam.io.WriteToBigQuery(
                    destination_table,
                    schema=schema,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
                )
            )
    p.run()

if __name__ == '__main__':
    run()
When running the pipeline on Dataflow and trying to read big CSV files (50 GB), the pipeline never scales and stays stuck at one worker.
Why does Dataflow refuse to scale in this situation?

Dataflow job doesn't emit messages after GroupByKey()

I have a streaming Dataflow pipeline that writes to BQ, and I want to window all the failed rows and do some further analysis. The pipeline looks like this; I'm getting all the error messages in the 2nd step, but all the messages are getting stuck at the beam.GroupByKey(). Nothing moves downstream after that. Does anyone have any idea how to fix this?
data = (
    | "Read PubSub Messages" >> beam.io.ReadFromPubSub(subscription=options.input_subscription,
                                                       with_attributes=True)
    ...
    | "write to BQ" >> beam.io.WriteToBigQuery(
        table=f"{options.bq_dataset}.{options.bq_table}",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        method='STREAMING_INSERTS',
        insert_retry_strategy=beam.io.gcp.bigquery_tools.RetryStrategy.RETRY_NEVER
    )
)

(
    data[beam.io.gcp.bigquery.BigQueryWriteFn.FAILED_ROWS]
    | f"Window into: {options.window_size}m" >> GroupWindowsIntoBatches(options.window_size)
    | f"Failed Rows for " >> beam.ParDo(BadRows(options.bq_dataset, 'table'))
)
and
class GroupWindowsIntoBatches(beam.PTransform):
    """A composite transform that groups Pub/Sub messages based on publish
    time and outputs a list of dictionaries, where each contains one message
    and its publish timestamp.
    """

    def __init__(self, window_size):
        # Convert minutes into seconds.
        self.window_size = int(window_size * 60)

    def expand(self, pcoll):
        return (
            pcoll
            # Assigns window info to each Pub/Sub message based on its publish timestamp.
            | "Window into Fixed Intervals" >> beam.WindowInto(window.FixedWindows(10))
            # If the windowed elements do not fit into memory please consider using `beam.util.BatchElements`.
            | "Add Dummy Key" >> beam.Map(lambda elem: (None, elem))
            | "Groupby" >> beam.GroupByKey()
            | "Abandon Dummy Key" >> beam.MapTuple(lambda _, val: val)
        )
Also, I don't know if it's relevant, but the beam.DoFn.TimestampParam inside my GroupWindowsIntoBatches has an invalid (negative) timestamp.
OK, so the issue was that the messages coming from the BigQuery FAILED_ROWS were not timestamped. Adding | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, time.time())) seems to fix the GroupByKey.
class GroupWindowsIntoBatches(beam.PTransform):
    """A composite transform that groups Pub/Sub messages based on publish
    time and outputs a list of dictionaries, where each contains one message
    and its publish timestamp.
    """

    def __init__(self, window_size):
        # Convert minutes into seconds.
        self.window_size = int(window_size * 60)

    def expand(self, pcoll):
        return (
            pcoll
            | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, time.time()))  # <----- Added this line
            | "Window into Fixed Intervals" >> beam.WindowInto(window.FixedWindows(30))
            | "Add Dummy Key" >> beam.Map(lambda elem: (None, elem))
            | "Groupby" >> beam.GroupByKey()
            | "Abandon Dummy Key" >> beam.MapTuple(lambda _, val: val)
        )

Merging several vcf files using snakemake

I am trying to merge several VCF files by chromosome using Snakemake. My files are like this and, as you can see, have various coordinates. What is the best way to merge all chr1A and all chr1B files?
chr1A:0-2096.filtered.vcf
chr1A:2096-7896.filtered.vcf
chr1B:0-3456.filtered.vcf
chr1B:3456-8796.filtered.vcf
My pseudocode:
chromosomes = ["chr1A", "chr1B"]

rule all:
    input:
        expand("{sample}.vcf", sample=chromosomes)

rule merge:
    input:
        I1="path/to/file/{sample}.xxx.filtered.vcf",
        I2="path/to/file/{sample}.xxx.filtered.vcf",
    output:
        outf="{sample}.vcf"
    shell:
        """
        java -jar picard.jar GatherVcfs I={input.I1} I={input.I2} O={output.outf}
        """
EDIT:
workdir: "/media/prova/Maxtor2/vcf2/merged/"
import subprocess

d = {"chr1A": ["chr1A:0-2096.flanking.view.filtered.vcf", "chr1A:2096-7896.flanking.view.filtered.vcf"],
     "chr1B": ["chr1B:0-3456.flanking.view.filtered.vcf", "chr1B:3456-8796.flanking.view.filtered.vcf"]}

rule all:
    input:
        expand("{sample}.vcf", sample=d)

def f(w):
    return d.get(w.chromosome, "")

rule merge:
    input:
        f
    output:
        outf="{chromosome}.vcf"
    params:
        lambda w: "I=" + " I=".join(d[w.chromosome])
    shell:
        "java -jar /home/Documents/Tools/picard.jar GatherVcfs {params[0]} O={output.outf}"
I was able to reproduce your bug. When constraining the wildcards, it works:
d = {"chr1A": ["chr1A:0-2096.flanking.view.filtered.vcf", "chr1A:2096-7896.flanking.view.filtered.vcf"],
"chr1B": ["chr1B:0-3456.flanking.view.filtered.vcf", "chr1B:3456-8796.flanking.view.filtered.vcf"]}
chromosomes = list(d)
rule all:
input:
expand("{sample}.vcf", sample=chromosomes)
# these tell Snakemake exactly what values the wildcards may take
# we use "|" to create the regex chr1A|chr1B
wildcard_constraints:
chromosome = "|".join(chromosomes)
rule merge:
input:
# a lambda is an unnamed function
# the first argument is the wildcards
# we merely use it to look up the appropriate files in the dict d
lambda w: d[w.chromosome]
output:
outf = "{chromosome}.vcf"
params:
# here we create the string
# "I=chr1A:0-2096.flanking.view.filtered.vcf I=chr1A:2096-7896.flanking.view.filtered.vcf"
# for use in our command
lambda w: "I=" + " I=".join(d[w.chromosome])
shell:
"java -jar /home/Documents/Tools/picard.jar GatherVcfs {params[0]} O={output.outf}"
It should have worked without the constraints too; this seems like a bug in Snakemake.

How would I test that a PowerShell function properly streams input from the pipeline?

I know how to write a function that streams input from the pipeline. I can reasonably tell by reading the source for a function if it will perform properly. However, is there any method for actually testing for the correct behavior?
I accept any definition of "testing"... be that some manual test that I can run or something more automated.
If you need an example, let's say I have a function that splits text into words.
PS> Get-Content ./warandpeace.txt | Split-Text
How would I check that it streams input from the pipeline and begins splitting immediately?
You can write a helper function, which will give you some indication of pipeline items being passed to it and processed by the next command:
function Print-Pipeline {
    param($Name, [ConsoleColor]$Color)
    begin {
        $ColorParameter = if($PSBoundParameters.ContainsKey('Color')) {
            @{ ForegroundColor = $Color }
        } else {
            @{ }
        }
    }
    process {
        Write-Host "${Name}|Before|$_" @ColorParameter
        ,$_
        Write-Host "${Name}|After|$_" @ColorParameter
    }
}
Suppose you have some functions to test:
$Text = 'Some', 'Random', 'Text'
function CharSplit1 { $Input | % GetEnumerator }
filter CharSplit2 { $Input | % GetEnumerator }
And you can test them like that:
PS> $Text |
>>> Print-Pipeline Before` CharSplit1 |
>>> CharSplit1 |
>>> Print-Pipeline After` CharSplit1
Before CharSplit1|Before|Some
Before CharSplit1|After|Some
Before CharSplit1|Before|Random
Before CharSplit1|After|Random
Before CharSplit1|Before|Text
Before CharSplit1|After|Text
After CharSplit1|Before|S
S
After CharSplit1|After|S
After CharSplit1|Before|o
o
After CharSplit1|After|o
After CharSplit1|Before|m
m
After CharSplit1|After|m
After CharSplit1|Before|e
e
After CharSplit1|After|e
After CharSplit1|Before|R
R
After CharSplit1|After|R
After CharSplit1|Before|a
a
After CharSplit1|After|a
After CharSplit1|Before|n
n
After CharSplit1|After|n
After CharSplit1|Before|d
d
After CharSplit1|After|d
After CharSplit1|Before|o
o
After CharSplit1|After|o
After CharSplit1|Before|m
m
After CharSplit1|After|m
After CharSplit1|Before|T
T
After CharSplit1|After|T
After CharSplit1|Before|e
e
After CharSplit1|After|e
After CharSplit1|Before|x
x
After CharSplit1|After|x
After CharSplit1|Before|t
t
After CharSplit1|After|t
PS> $Text |
>>> Print-Pipeline Before` CharSplit2 |
>>> CharSplit2 |
>>> Print-Pipeline After` CharSplit2
Before CharSplit2|Before|Some
After CharSplit2|Before|S
S
After CharSplit2|After|S
After CharSplit2|Before|o
o
After CharSplit2|After|o
After CharSplit2|Before|m
m
After CharSplit2|After|m
After CharSplit2|Before|e
e
After CharSplit2|After|e
Before CharSplit2|After|Some
Before CharSplit2|Before|Random
After CharSplit2|Before|R
R
After CharSplit2|After|R
After CharSplit2|Before|a
a
After CharSplit2|After|a
After CharSplit2|Before|n
n
After CharSplit2|After|n
After CharSplit2|Before|d
d
After CharSplit2|After|d
After CharSplit2|Before|o
o
After CharSplit2|After|o
After CharSplit2|Before|m
m
After CharSplit2|After|m
Before CharSplit2|After|Random
Before CharSplit2|Before|Text
After CharSplit2|Before|T
T
After CharSplit2|After|T
After CharSplit2|Before|e
e
After CharSplit2|After|e
After CharSplit2|Before|x
x
After CharSplit2|After|x
After CharSplit2|Before|t
t
After CharSplit2|After|t
Before CharSplit2|After|Text
Add some Write-Verbose statements to your Split-Text function, and then call it with the -Verbose parameter. You should see output in real-time.
Ah, I've got a very simple solution. The concept is to insert your own step into the pipeline with obvious side-effects before the function that you're testing. For example...
PS> 1..10 | %{ Write-Host $_; $_ } | function-under-test
If your function-under-test is "bad", you will see all of the output from 1..10 twice, like this
1
2
3
1
2
3
If the function-under-test is processing items lazily from the pipeline, you'll see the output interleaved.
1
1
2
2
3
3