Transforming `PCollection` with many elements into a single element - apache-beam

I am trying to convert a PCollection, that has many elements, into a PCollection that has one element. Basically, I want to go from:
[1,2,3,4,5,6]
to:
[[1,2,3,4,5,6]]
so that I can work with the entire PCollection in a DoFn.
I've tried CombineGlobally(lambda x: x), but only a portion of the elements gets combined into an array at a time, giving me the following result:
[1,2,3,4,5,6] -> [[1,2],[3,4],[5,6]]
Or something to that effect.
This is the relevant portion of the script that I'm trying to run:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
raw_input = range(1024)
def run_test():
    # The pipeline runs when the `with` block exits, so no explicit run() call is needed.
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)
        def combine(x):
            print(x)
            return x
        (
            input
            | "Global aggregation" >> beam.CombineGlobally(combine)
        )
run_test()

I figured out a pretty painless way to do this, which I missed in the docs:
The more general way to combine elements, and the most flexible, is
with a class that inherits from CombineFn.
CombineFn.create_accumulator(): This creates an empty accumulator. For
example, an empty accumulator for a sum would be 0, while an empty
accumulator for a product (multiplication) would be 1.
CombineFn.add_input(): Called once per element. Takes an accumulator
and an input element, combines them and returns the updated
accumulator.
CombineFn.merge_accumulators(): Multiple accumulators could be
processed in parallel, so this function helps merging them into a
single accumulator.
CombineFn.extract_output(): It allows to do additional calculations
before extracting a result.
I suppose supplying a lambda function that simply passes its argument to the "vanilla" CombineGlobally wouldn't do what I expected initially. That functionality has to be specified by me (although I still think it's weird this isn't built into the API).
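One way to see why the identity lambda misbehaves: Beam's programming guide asks for a combining function that is commutative and associative, because it may be applied to subsets of the input and then again to the partial results. The plain-Python sketch below imitates that two-stage application. `two_stage_combine` and its `batch_size` parameter are hypothetical stand-ins for illustration, not Beam's actual implementation, and the real batching depends on the runner.

```python
def two_stage_combine(fn, values, batch_size):
    """Apply fn to each batch, then once more to the partial results.

    Hypothetical stand-in for how a runner may call a combining
    function; this is not Beam code.
    """
    values = list(values)
    batches = [values[i:i + batch_size] for i in range(0, len(values), batch_size)]
    partials = [fn(batch) for batch in batches]
    return fn(partials)


data = [1, 2, 3, 4, 5, 6]

# sum is associative and commutative, so batching does not change the result.
total = two_stage_combine(sum, data, batch_size=2)

# The identity function just hands back the batches it was given.
nested = two_stage_combine(lambda x: x, data, batch_size=2)

print(total)
print(nested)
```

With `sum` the batching is invisible in the result, while the identity function returns the batches themselves as a list of lists, which matches the shape of the output described in the question.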
You can find more about subclassing CombineFn here, which I found very helpful:
A CombineFn specifies how multiple values in all or part of a
PCollection can be merged into a single value—essentially providing
the same kind of information as the arguments to the Python “reduce”
builtin (except for the input argument, which is an instance of
CombineFnProcessContext). The combining process proceeds as follows:
1. Input values are partitioned into one or more batches.
2. For each batch, the create_accumulator method is invoked to create a fresh initial “accumulator” value representing the combination of zero values.
3. For each input value in the batch, the add_input method is invoked to combine more values with the accumulator for that batch.
4. The merge_accumulators method is invoked to combine accumulators from separate batches into a single combined output accumulator value, once all of the accumulators have had all the input value in their batches added to them. This operation is invoked repeatedly, until there is only one accumulator value left.
5. The extract_output operation is invoked on the final accumulator to get the output value.
Note: If this CombineFn is used with a transform that has defaults, apply will be called with an empty list at expansion time to get the default value.
So, by subclassing CombineFn, I wrote this simple implementation, Aggregated, that does exactly what I want:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
raw_input = range(1024)
class Aggregated(beam.CombineFn):
    def create_accumulator(self):
        # Start every batch with an empty list.
        return []
    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator
    def merge_accumulators(self, accumulators):
        # Flatten the per-batch lists into one list.
        merged = []
        for a in accumulators:
            for item in a:
                merged.append(item)
        return merged
    def extract_output(self, accumulator):
        return accumulator
def run_test():
    # The pipeline runs when the `with` block exits, so no explicit run() call is needed.
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)
        (
            input
            | "Global aggregation" >> beam.CombineGlobally(Aggregated())
            | "print" >> beam.Map(print)
        )
run_test()
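To see why this works no matter how the input is batched, the lifecycle can be traced in plain Python, without Beam. This is only a sketch: `ListAccumulator` mirrors the four methods of `Aggregated`, and `simulate_combine` with its `batch_size` parameter is a hypothetical driver, not how any Beam runner is actually implemented.

```python
class ListAccumulator:
    """Same four methods as Aggregated, minus the Beam base class."""

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = []
        for accumulator in accumulators:
            merged.extend(accumulator)
        return merged

    def extract_output(self, accumulator):
        return accumulator


def simulate_combine(combine_fn, values, batch_size):
    """Hypothetical driver: one accumulator per batch, then a single merge."""
    values = list(values)
    accumulators = []
    for start in range(0, len(values), batch_size):
        accumulator = combine_fn.create_accumulator()
        for element in values[start:start + batch_size]:
            accumulator = combine_fn.add_input(accumulator, element)
        accumulators.append(accumulator)
    merged = combine_fn.merge_accumulators(accumulators)
    return combine_fn.extract_output(merged)


result = simulate_combine(ListAccumulator(), range(6), batch_size=2)
print(result)
```

Whatever `batch_size` is chosen, the merge step flattens the per-batch lists, so the output is always one flat list. In a real pipeline the order of the elements inside that list is not guaranteed, since a PCollection is unordered.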

You can also accomplish what you want with side inputs, e.g.
with beam.Pipeline() as p:
    pcoll = ...
    (p
     # Create a PCollection with a single element.
     | beam.Create([None])
     # This will process the singleton exactly once,
     # with the entirety of pcoll passed in as a second argument as a list.
     | beam.Map(
         lambda _, pcoll_as_side: ...consume pcoll_as_side here...,
         pcoll_as_side=beam.pvalue.AsList(pcoll)))


How to access a collection of heterogenous functions randomly?

I am implementing an evolutionary algorithm with a numerical genetic encoding (0-n), where each number from 0 to n represents a function. I have implemented a numpy version in which it is possible to do the following. The actual implementation is a bit more complicated, but this snippet captures the core functionality.
import numpy as np
n = 3
max_ops = 10
# Generate random args and OPs
for i in range(number_of_iterations):
    args = np.random.randint(min_val_arg, max_val_arg, size=(arg_count, arg_shape[0], arg_shape[1]))
    gene_of_operations = np.random.randint(0, n, size=(max_ops))
# A collection of OP encodings and operations. Doesn't need to be a dict.
dict_of_n_OPs = {
    0: np.add,
    1: np.multiply,
    2: np.diff
}
# @njit
def execute_genome(gene_of_operations, args, dict_of_n_OPs):
    result = 0
    for op, arg in zip(gene_of_operations, args):
        # Look up the function that the numeric code stands for.
        result += dict_of_n_OPs[op](arg)
    return result
## executing the gene
results = execute_genome(gene_of_operations, args, dict_of_n_OPs)
print(results)
Adding the njit decorator requires a statically typed function, and heterogeneously typed collections such as my dict_of_n_OPs are not supported. I have tried rendering it as a numpy array, a numba.typed.Dict, and a numba.typed.List, but discovered that none of them supports heterogeneous types.
What would be a numba-compliant approach that allows for executing different functions based on a numerical encoding such as '00201', where the number 0 would execute function 0?
Is the only way an n-line if/else statement for n unique operations/functions?

Scala - Not enough arguments for method count

I am fairly new to Scala and Spark RDD programming. The dataset I am working with is a CSV file containing a list of movies (one row per movie) and their associated user ratings (a comma-delimited list of ratings). Each column in the CSV represents a distinct user and the rating he/she gave the movie. Thus, user 1's ratings for each movie are in the 2nd column from the left:
Sample Input:
Spiderman,1,2,,3,3
Dr.Sleep, 4,4,,,1
I am getting the following error:
Task4.scala:18: error: not enough arguments for method count: (p: ((Int, Int)) => Boolean)Int.
Unspecified value parameter p.
var moviePairCounts = movieRatings.reduce((movieRating1, movieRating2) => (movieRating1, movieRating2, movieRating1._2.intersect(movieRating2._2).count()
when I execute the few lines below. For the program below, the second line of code splits all values delimited by "," and produces this:
( Spiderman, [[1,0],[2,1],[-1,2],[3,3],[3,4]] )
( Dr.Sleep, [[4,0],[4,1],[-1,2],[-1,3],[1,4]] )
On the third line, taking the count() throws an error. For each movie (row), I am trying to get the number of common elements. In the above example, [-1, 2] is clearly a common element shared by both Spiderman and Dr.Sleep.
val textFile = sc.textFile(args(0))
var movieRatings = textFile.map(line => line.split(","))
  .map(movingRatingList => (movingRatingList(0), movingRatingList.drop(1)
    .map(ranking => if (ranking.isEmpty) -1 else ranking.toInt).zipWithIndex));
var moviePairCounts = movieRatings.reduce((movieRating1, movieRating2) => (movieRating1, movieRating2, movieRating1._2.intersect(movieRating2._2).count() )).saveAsTextFile(args(1));
My target output of line 3 is as follows:
( Spiderman, Dr.Sleep, 1 ) --> Between these 2 movies, there is 1 common entry.
Can somebody please advise?
To get the number of elements in a collection, use length or size. count takes a predicate and returns the number of elements that satisfy it.
Or you could avoid building the complete intersection by using count to count the elements of the first collection which the second contains:
movieRating1._2.count(movieRating2._2.contains(_))
The error message seems pretty clear: count takes one argument, but in your call, you are passing an empty argument list, i.e. zero arguments. You need to pass one argument to count.

Can operations on a numpy.memmap be deferred?

Consider this example:
import numpy as np
a = np.array(1)
np.save("a.npy", a)
a = np.load("a.npy", mmap_mode='r')
print(type(a))
b = a + 2
print(type(b))
which outputs
<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>
So it seems that b is not a memmap any more, and I assume that this forces numpy to read the whole a.npy, defeating the purpose of the memmap. Hence my question, can operations on memmaps be deferred until access time?
I believe subclassing ndarray or memmap could work, but don't feel confident enough about my Python skills to try it.
Here is an extended example showing my problem:
import numpy as np
# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))
# I want to print the first value using f and memmaps
def f(value):
    print(value[0])
# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)
# this is slow: b has to be read completely; converted into an array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)
Here's a simple example of an ndarray subclass that defers operations on it until a specific element is requested by indexing.
I'm including this to show that it can be done, but it almost certainly will fail in novel and unexpected ways, and require substantial work to make it usable.
For a very specific case it may be easier than redesigning your code to solve the problem in a better way.
I'd recommend reading over these examples from the docs to help understand how it works.
import numpy as np
class Defered(np.ndarray):
    """
    An array class that defers calculations applied to it, only
    calculating them when an index is requested
    """
    def __new__(cls, arr):
        arr = np.asanyarray(arr).view(cls)
        arr.toApply = []
        return arr
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        ## Convert all arguments to ndarray, otherwise arguments
        # of type Defered will cause infinite recursion
        # also store self as None, to be replaced later on
        newinputs = []
        for i in inputs:
            if i is self:
                newinputs.append(None)
            elif isinstance(i, np.ndarray):
                newinputs.append(i.view(np.ndarray))
            else:
                newinputs.append(i)
        ## Store function to apply and necessary arguments
        self.toApply.append((ufunc, method, newinputs, kwargs))
        return self
    def __getitem__(self, idx):
        ## Get index and convert to regular array
        sub = self.view(np.ndarray).__getitem__(idx)
        ## Apply stored actions
        for ufunc, method, inputs, kwargs in self.toApply:
            inputs = [i if i is not None else sub for i in inputs]
            sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)
        return sub
This will fail if modifications are made to it that don't use numpy's universal functions. For instance, percentile and median aren't based on ufuncs and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array, or that indexes a substantial portion of it, the entire array will be loaded.
This is just how Python works. By default numpy operations return a new array, so b never exists as a memmap; it is created when + is called on a.
There's a couple of ways to work around this. The simplest is to do all operations in place,
a += 1
This requires loading the memory mapped array for reading and writing,
a = np.load("a.npy", mmap_mode='r+')
Of course this isn't any good if you don't want to overwrite your original array.
In this case you need to specify that b should be memmapped.
b = np.memmap("b.npy", mode='w+', dtype=a.dtype, shape=a.shape)
Assigning can be done by using the out keyword provided by numpy ufuncs.
np.add(a, 2, out=b)

yield results in `finish_bundle` from a custom DoFn

One step of my pipeline involves fetching from an external data source and I'd like to do that in chunks (order doesn't matter). I couldn't find any class that does something similar so I've created the following:
class FixedSizeBatchSplitter(beam.DoFn):
    def __init__(self, size):
        self.size = size
    def start_bundle(self):
        self.current_batch = []
    def finish_bundle(self):
        if self.current_batch:
            yield self.current_batch
    def process(self, element):
        self.current_batch.append(element)
        if len(self.current_batch) >= self.size:
            yield self.current_batch
            self.current_batch = []
However, when I run this pipeline, I get a RuntimeError: Finish Bundle should only output WindowedValue type error:
with beam.Pipeline() as p:
    res = (p
           | beam.Create(range(10))
           | beam.ParDo(FixedSizeBatchSplitter(3))
           )
Why is that? How come I can yield outputs in process but not in finish_bundle? By the way, if I remove finish_bundle the pipeline works but obviously discards the leftovers.
A DoFn may be processing elements from multiple different windows. When you're in process(), the "current window" is unambiguous: it's the window of the element being processed. When you're in finish_bundle, it's ambiguous and you need to specify the window explicitly, by yielding something of the form WindowedValue(something, timestamp, [window]).
If all your data is in the global window, that makes it easier: window will be just GlobalWindow(). If you're using multiple windows, then you'll need to have 1 buffer per window; capture the window in process() so that you add to the proper buffer; and in finish_bundle emit each of them in the respective window.
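The per-window bookkeeping described above can be sketched in plain Python, without Beam. Everything here is an illustrative assumption: `PerWindowBatcher` is a hypothetical stand-in for the DoFn, the window is passed to `process` as a plain label, and the `(window, batch)` tuples only mark the places where a real DoFn would yield a `WindowedValue(batch, timestamp, [window])`.

```python
from collections import defaultdict


class PerWindowBatcher:
    """Plain-Python stand-in for a DoFn that keeps one buffer per window."""

    def __init__(self, size):
        self.size = size

    def start_bundle(self):
        self.batches = defaultdict(list)

    def process(self, element, window):
        # Capture the element's window so it lands in the right buffer.
        batch = self.batches[window]
        batch.append(element)
        if len(batch) >= self.size:
            del self.batches[window]
            yield window, batch

    def finish_bundle(self):
        # Emit the leftovers, each tagged with the window it came from.
        for window, batch in self.batches.items():
            if batch:
                yield window, batch
        self.batches = defaultdict(list)


batcher = PerWindowBatcher(size=3)
batcher.start_bundle()
emitted = []
for element, window in [(0, "w1"), (1, "w2"), (2, "w1"), (3, "w1"), (4, "w2")]:
    emitted.extend(batcher.process(element, window))
emitted.extend(batcher.finish_bundle())
print(emitted)
```

The full batch for "w1" is emitted from `process`, while the two leftover elements of "w2" only come out in `finish_bundle`, still associated with their own window.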

Scala Spark loop goes through without any error, but does not produce an output

I have a file in HDFS containing paths of various other files. Here is the file called file1:
path/of/HDFS/fileA
path/of/HDFS/fileB
path/of/HDFS/fileC
.
.
.
I am using a for loop in Scala Spark as follows to read each line of the above file and process it in another function:
val lines=Source.fromFile("path/to/file1.txt").getLines.toList
for(i<-lines){
  i.toString()
  val firstLines=sc.hadoopFile(i,classOf[TextInputFormat],classOf[LongWritable],classOf[Text]).flatMap {
    case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
  }
}
When I run the above loop, it runs through without returning any errors and I get the Scala prompt in a new line: scala>
However, when I try to see a few lines of output which should be stored in firstLines, it does not work:
scala> firstLines
<console>:38: error: not found: value firstLines
firstLine
^
What is the problem in the above loop that is not producing the output, however running through without any errors?
Additional info
The function hadoopFile accepts a String path name as its first parameter. That is why I am trying to pass each line of file1 (each line is a path name) as a String in the first parameter i. The flatMap functionality is taking the first line of the file that has been passed to hadoopFile and stores that alone and dumps all the other lines. So the desired output (firstLines) should be the first line of all the files that are being passed to hadoopFile through their path names (i).
I tried running the function for just a single file, without a loop, and that produces the output:
val firstLines=sc.hadoopFile("path/of/HDFS/fileA",classOf[TextInputFormat],classOf[LongWritable],classOf[Text]).flatMap {
  case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
}
scala> firstLines.take(3)
res27: Array[String] = Array(<?xml version="1.0" encoding="utf-8"?>)
fileA is an XML file, so you can see the resulting first line of that file. So I know the function works fine, it is just a problem with the loop that I am not able to figure out. Please help.
The variable firstLines is defined in the body of the for loop and its scope is therefore limited to this loop. This means you cannot access the variable outside of the loop, and this is why the Scala compiler tells you error: not found: value firstLines.
From your description, I understand you want to collect the first line of every file listed in lines.
The "every" here can translate into different constructs in Scala. We can use something like the for loop you wrote or, even better, adopt a functional approach and apply a map function to the list of files. In the code below I put inside the map the code you used in your description, which creates a HadoopRDD and applies flatMap with your function to retrieve the first line of a file.
We then obtain a list of RDD[String] of lines. At this stage, note that we have not started to do any actual work. To trigger the evaluation of the RDDs and collect the results, we need an additional call to the collect method for each of the RDDs in our list.
// Renamed "lines" to "fileNames" as it is more explicit.
val fileNames = Source.fromFile("path/to/file1.txt").getLines.toList
val firstLinesRDDs = fileNames.map(sc.hadoopFile(_,classOf[TextInputFormat],classOf[LongWritable],classOf[Text]).flatMap {
  case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
})
// firstLinesRDDs is a list of RDD[String]. Based on this code, each RDD
// should consist of a single String value. We collect them using RDD#collect:
val firstLines = firstLinesRDDs.map(_.collect)
However, this approach suffers from a flaw that prevents us from benefiting from the advantages Spark can provide.
When we apply the operation in map to fileNames, we are not working with an RDD, hence the file names are processed sequentially on the driver (the process which hosts your Spark session) and are not part of a parallelizable Spark job. This is equivalent to doing what you wrote in your second block of code, one file name at a time.
To address the problem, what can we do? A good thing to keep in mind when working with Spark is to try to push the declaration of the RDDs as early as possible in our code. Why? Because this allows Spark to parallelize and optimize the work we want to do. Your example could be a textbook illustration of this concept, though an additional complexity here is added by the requirement to manipulate files.
In our present case, we can benefit from the fact that hadoopFile accepts a comma-separated list of paths as input. Therefore, instead of sequentially creating an RDD for every file, we create one RDD for all of them:
val firstLinesRDD = sc.hadoopFile(fileNames.mkString(","), classOf[TextInputFormat],classOf[LongWritable],classOf[Text]).flatMap {
  case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
}
And we retrieve our first lines with a single collect:
val firstLines = firstLinesRDD.collect