How to flatten nested PCollection? - apache-beam

I have a PCollection<PCollection<T>> and I'm trying to flatten it to a PCollection<T>. org.apache.beam.sdk.transforms.Flatten has methods for flattening multiple PCollections, but not nested PCollections. Is it possible to flatten nested PCollections?

Related

How to read the nested elements from the xml in pyspark?

Does flatMap with distinct give me a list out of RDD array values?

This is my RDD, as you can see a single block can have multiple genre values:
[['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
['Adventure', 'Children', 'Fantasy'],
['Comedy', 'Romance'],
['Comedy', 'Drama', 'Romance'],
['Comedy']]
I wish to make a list out of the values in the arrays of the RDD, splitting all the values into multiple rows like this:
['Adventure',
'Drama',
'Comedy'] ....
and also make this collection distinct.
So far I have tried this
RDD.flatMap(lambda x: x).distinct().take(100)
But I don't know whether this code takes all the values out of all the arrays and makes a distinct list from them. The question is, does it perform the task I require it to do?
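As a rough way to check what that chain computes, the same two steps can be reproduced in plain Python without a Spark cluster. This is only a sketch of the semantics: `chain.from_iterable` stands in for `flatMap(lambda x: x)` and `set` stands in for `distinct()`. The names `rows`, `flattened` and `distinct_genres` are made up for the illustration, and the sorting is added only for readability, since Spark's `distinct()` does not promise any particular order.

```python
from itertools import chain

# Sample rows copied from the question; each row is one record's genre list.
rows = [
    ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
    ["Adventure", "Children", "Fantasy"],
    ["Comedy", "Romance"],
    ["Comedy", "Drama", "Romance"],
    ["Comedy"],
]

# Equivalent of flatMap(lambda x: x): every element of every inner list
# becomes its own row in one flat sequence.
flattened = list(chain.from_iterable(rows))

# Equivalent of distinct(): keep a single copy of each value.
# Sorted here only so the output is stable and easy to read.
distinct_genres = sorted(set(flattened))

print(len(flattened))
print(distinct_genres)
```

So the flatten step does visit every value in every array, and the distinct step then removes the repeats across all of them.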

How to loop through tuple in scala [duplicate]

This question already has answers here:
Scala: How to convert tuple elements to lists
(5 answers)
Closed 5 years ago.
I have a tuple in scala
val captainStuff = ("One", "Two", "Three", "Four", "Five")
How can I iterate over it with a for loop? It's easy to loop through a list or a map, but how do I loop through a tuple?
Thanks!!
You can convert it to an iterator like this:
val captainStuff = ("One", "Two", "Three", "Four", "Five")
captainStuff.productIterator.foreach(x => {
println(x)
})
This question is a duplicate btw:
Scala: How to convert tuple elements to lists
How can I iterate over it with a for loop? It's easy to loop through a list or a map, but how do I loop through a tuple?
Lists and maps are collections. Tuples are not. Iterating (aka "looping through") really only makes sense for collections, which tuples aren't.
Tuples are product types. They are a way of grouping multiple values of different types together into a single structure. Considering that the fields of a tuple may have different types, how exactly would you iterate over it? What would be the type of your element variable?
If you are familiar with other languages, you may be familiar with the concept of records (e.g. RECORD in Pascal or struct in C). Tuples are kind of like them, except the fields don't have names. How do you iterate over a record in Pascal or a struct in C? You don't, it makes no sense.
In fact, you can think of an object as a record. Again, how do you iterate over the fields of an object? You don't, it makes no sense.
Note #1: Yes, sometimes it does make sense to iterate over the fields of an object, namely when you are doing reflective metaprogramming.
Note #2: In Scala, tuples inherit from Product, which has a non-typesafe productIterator method that gives you an Iterator[Any] which allows you to iterate over a tuple, but without type-safety. Just don't do it.
tl;dr: tuples are not collections. You simply don't iterate over them. Period. If you think you have to, you're doing something wrong, i.e. you shouldn't have a tuple but maybe an array or a list instead.

Push all elements of array into new array

I am using the aggregation framework. One of the early steps produces an array via a group. In the next group step I would like to create an array that is an aggregate of the grouped arrays.
I have tried using $push : "$myArrayField" but this produces an array of arrays. Is there a way to create an array of the values in the array field without doing an unwind?
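To make the shape problem concrete, here is a small plain-Python sketch of what `$push` accumulates and what the flattened result would look like. The sample documents and the field name `nested` are made up for the illustration. The `pipeline` value at the end is one possible way to express the flattening with the `$reduce` and `$concatArrays` operators. It is written as plain dicts and is not executed here, so treat it as an assumption to verify against your server version rather than a confirmed answer.

```python
from functools import reduce

# Hypothetical output of the first $group stage: one array per group.
grouped = [
    {"_id": "a", "myArrayField": [1, 2]},
    {"_id": "b", "myArrayField": [3]},
    {"_id": "c", "myArrayField": [4, 5]},
]

# What {"$push": "$myArrayField"} accumulates: an array of arrays.
pushed = [doc["myArrayField"] for doc in grouped]

# Concatenating the inner arrays in order gives the single flat array.
flat = reduce(lambda acc, arr: acc + arr, pushed, [])

# Assumed pipeline shape (not run here): push the arrays, then fold them
# together with $reduce and $concatArrays instead of using $unwind.
pipeline = [
    {"$group": {"_id": None, "nested": {"$push": "$myArrayField"}}},
    {"$project": {"values": {"$reduce": {
        "input": "$nested",
        "initialValue": [],
        "in": {"$concatArrays": ["$$value", "$$this"]},
    }}}},
]

print(pushed)
print(flat)
```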

What are the pros and cons of using embedded documents versus arrays in mongoid?

What are the pros and cons of using embedded documents when you can simply use an array datatype? Both seem similar to me (and I couldn't find any information online via a Google search). Please provide example cases!
In terms of data structures, you can think of embedded documents as hashes or dictionaries, while arrays are lists of values.
With embedded documents in MongoDB:
embedded documents have named fields, and can embed other documents for rich data representation
you can reference fields directly using dotted notation
creating an index on an embedded document field only indexes that field
you can use field selection to retrieve a subset of fields.
With arrays in MongoDB:
you can manipulate arrays using operators such as $push, $pop, $pull, and $addToSet.
you can match array values using operators such as $all, $in, $nin.
you can also use multikey indexes
creating an index on an array element indexes each element of the array.
you can use the $slice operator to retrieve a subset of an array.
Mongoid's notion of relations expresses a few different combinations of embedded documents and arrays:
embeds_one - a single embedded document
embeds_many - an array of embedded documents
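The hash-versus-list distinction above can be sketched with plain Python data, which is roughly how a driver hands a document back. Everything here is hypothetical: the document contents and the `get_path` helper exist only for the illustration and are not part of MongoDB or Mongoid.

```python
# Hypothetical document: "address" is an embedded document, "tags" is an array.
doc = {
    "name": "Ada",
    "address": {"city": "London", "zip": "N1"},
    "tags": ["admin", "staff", "beta"],
}


def get_path(document, dotted):
    """Resolve a dotted path such as 'address.city' against nested dicts."""
    current = document
    for key in dotted.split("."):
        current = current[key]
    return current


# Embedded document: fields are addressed by name.
city = get_path(doc, "address.city")

# Array: values are addressed by position or by membership.
first_two = doc["tags"][:2]        # positional subset, similar in spirit to $slice
has_admin = "admin" in doc["tags"]  # membership test on the array's values

print(city, first_two, has_admin)
```

Named fields suit data with a fixed, known shape, while arrays suit an open-ended set of similar values.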