How to read nested elements from XML in PySpark?
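A minimal sketch of one common approach, using the third-party spark-xml package; the file path, rowTag and element names below are hypothetical:

from pyspark.sql import functions as F

# Requires the spark-xml package, e.g. --packages com.databricks:spark-xml_2.12:<version>
df = (spark.read
      .format("xml")
      .option("rowTag", "order")        # hypothetical repeating element
      .load("/data/orders.xml"))        # hypothetical path

# Nested elements arrive as struct columns; reach into them with dot notation
df.select("customer.name", "customer.address.city").show()

# Repeated child elements arrive as arrays; explode them into one row per element
df.select(F.explode("items.item").alias("item")).select("item.sku", "item.qty").show()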
I want to use the MongoDB to BigQuery Dataflow template and I have two questions:
Can I somehow configure partitioning for the destination table, e.g. if I want to dump my database every day?
Can I map nested fields in MongoDB to RECORD columns in BQ instead of columns with string values?
I see the User option with values FLATTEN and NONE, but FLATTEN flattens documents one level deep only.
Could either of these two approaches help?
Creating a destination table with the structure defined before running the Dataflow job
Using a UDF
I tried the MongoDB to BigQuery Dataflow template with the User option set to FLATTEN.
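For the first approach (creating the destination table up front), one option is to define a nested, day-partitioned schema with the BigQuery client library before launching the template; a rough sketch with hypothetical project, dataset and field names:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical nested schema: "address" is a RECORD rather than a flattened string column
schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("address", "RECORD", fields=[
        bigquery.SchemaField("city", "STRING"),
        bigquery.SchemaField("zip", "STRING"),
    ]),
    bigquery.SchemaField("ingested_at", "TIMESTAMP"),
]

table = bigquery.Table("my-project.my_dataset.customers", schema=schema)
# Daily partitioning, so each day's dump lands in its own partition
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="ingested_at",
)
client.create_table(table)

Whether the template actually writes into that nested structure still depends on how the documents are transformed on the way in, so this only covers the table setup, not the mapping itself.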
I need to produce an array of objects and save it to JSON in Data Factory:
[
{"abc":123},
{"bca":123}
]
I can save it to JSON but it omits the comma (,).
This is my flow
My aggregate function
collect(#(abc=abc, ...))
This gives me an array for each object, which is not what I want. I would like to wrap all the rows in one array.
Update
The image below shows the flattening of the incoming stream.
Thanks
You need to create the structure first in a Derived Column, then collect() that structure in an Aggregate.
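A rough sketch of that two-step setup, using the column names from the example above; the @( ) structure-literal syntax is my assumption of the usual Derived Column form:

Derived Column:  allCols = @(abc=abc, bca=bca)
Aggregate:       output = collect(allCols)

Sinking the single aggregated column to a JSON dataset should then give one array of objects rather than one object per row.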
I have a PySpark DataFrame that maps customers to a list of categories as well as a geo location. Its schema is:
[('customer', 'bigint'), ('category', 'array<int>'), ('geo_location', 'string')]
Each customer can map to more than one category so I capture it as a list.
I'd like to count up the number of customers in each category while preserving the geo information.
Is there a method in PySpark that will easily unpack the list values into a column so I can count them? Alternatively, is there a pattern in PySpark that accomplishes this in a better way?
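A minimal sketch of the usual pattern, using explode to give one row per (customer, category) pair before counting; df and the column names match the schema above:

from pyspark.sql import functions as F

# One row per customer/category pair, with geo_location carried along
exploded = df.withColumn("category", F.explode("category"))

# Count distinct customers per category while preserving the geo information
counts = (exploded
          .groupBy("category", "geo_location")
          .agg(F.countDistinct("customer").alias("num_customers")))
counts.show()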
I have a PCollection<PCollection<T>> and I'm trying to flatten it to a PCollection<T>. org.apache.beam.sdk.transforms.Flatten has methods for flattening multiple PCollections, but not nested PCollections. Is it possible to flatten nested PCollections?
I am using Hadoop to run MapReduce jobs against my MongoDB database.
I was able to execute the sample in this link.
Right now I only get a key/value pair in the output collection after the MapReduce job executes. I wonder if it is possible to save multiple columns in a MapReduce output collection,
or an embedded document in the value column?
Thanks.
Yes - use BSONWritable as your reducer output class, and create a BSONWritable object with as many columns as you need.
See example here:
https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldReducer.java