Grouping CSV columns in Apache Beam transform - apache-beam

I have a CSV with about 200 columns. I would like to group each column such that I get a PCollection of col_name:[column] pairs as elements. How would such a thing be done using the Beam Python SDK?

You can output tuples of (column_index, column_value) and then group them by the column index. If you have a name associated with each column, you can output (column_name, column_value) instead.
Reference: https://beam.apache.org/documentation/programming-guide/#core-beam-transforms
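A minimal sketch of that approach with the Beam Python SDK, assuming the CSV has a header row and using a hypothetical local path input.csv (the header supplies the column names):

import csv
import apache_beam as beam

def to_column_pairs(line, header):
    # Emit one (column_name, value) pair for every column of a CSV row.
    values = next(csv.reader([line]))
    return list(zip(header, values))

# Read the column names once, outside the pipeline (works for a local file).
with open('input.csv') as f:
    header = next(csv.reader(f))

with beam.Pipeline() as p:
    columns = (
        p
        | 'Read' >> beam.io.ReadFromText('input.csv', skip_header_lines=1)
        | 'PairByColumn' >> beam.FlatMap(to_column_pairs, header)
        | 'GroupByColumn' >> beam.GroupByKey()
    )
    # Each element of `columns` is now (column_name, iterable_of_values),
    # i.e. the col_name:[column] pairs the question asks for.

Note that GroupByKey gathers every value of a column into a single element, which is fine for moderate files but can be memory-heavy for very large ones.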

Related

Extract info from some columns, store into dataframe after flattening

I have one dataframe. My use case is to extract info from some columns, flatten it based on the column type, and store the result in a dataframe. What is an efficient way to do this?
Example: we have a dataframe with, say, 5 columns: a (string), b (string), c (string), d (JSON value), e (string). In the transformation I want to extract (flatten) some records from column d, the JSON value, so that in the resultant dataframe each value from the JSON becomes one row.
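A hedged sketch of one way to do this, assuming a Spark dataframe and a hypothetical JSON payload in d that is an array of strings (adjust the schema to the real payload):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, from_json
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a1", "b1", "c1", '["x", "y"]', "e1")],
    ["a", "b", "c", "d", "e"],
)

# Parse the JSON string in d, then explode it so each JSON value becomes
# its own row; the other columns are duplicated onto every new row.
flattened = (
    df.withColumn("d", from_json("d", ArrayType(StringType())))
      .withColumn("d", explode("d"))
)
flattened.show()  # two rows: (a1, b1, c1, x, e1) and (a1, b1, c1, y, e1)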

How to compute a variable or column of comma-separated values from multiple rows of the same column

Scenario: an Azure Data Flow is processing bulk records from a CSV dataset. For dependent jobs at the destination SQL, a comma-separated list of IDs from multiple rows of that CSV is required. Can someone help with how to do this?
I tried using a derived column step with the coalesce and concat functions, but didn't get the result I was looking for.
Use the collect() aggregate function; it acts like a string agg and was just released last week. See:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expression-functions#collect
https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-new-hierarchical-data-handling-and-new-flexibility-for/ba-p/1353956

How to find if one column is contained in another array column in pyspark

I have 2 columns with the following schema in a PySpark dataframe:
('pattern', 'array<struct<pattern:string,positions:array<int>>>')
('distinct_patterns', 'array<array<struct<pattern:string,positions:array<int>>>>')
I want to find those rows where pattern is present in distinct_patterns.
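A hedged sketch of one way to flag such rows, using hypothetical sample data that matches the question's schema. Spark SQL's array_contains supports complex element types, so it can test whether the whole pattern array equals any element of distinct_patterns:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

schema = (
    "pattern array<struct<pattern:string,positions:array<int>>>, "
    "distinct_patterns array<array<struct<pattern:string,positions:array<int>>>>"
)
df = spark.createDataFrame(
    [
        ([("p1", [1, 2])], [[("p1", [1, 2])], [("p2", [3])]]),  # contained
        ([("p3", [9])], [[("p1", [1, 2])], [("p2", [3])]]),     # not contained
    ],
    schema,
)

# array_contains(haystack, needle) uses full equality on the struct arrays.
result = df.withColumn(
    "is_contained", F.expr("array_contains(distinct_patterns, pattern)")
)
result.filter("is_contained").show(truncate=False)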

Spark agg to collect a single list for multiple columns

Here is my current code:
pipe_exec_df_final_grouped = pipe_exec_df_final.groupBy("application_id").agg(collect_list("table_name").alias("tables"))
However, in my collected list, I would like multiple column values, so the aggregated column would be an array of arrays. Currently the result looks like this:
1|[a,b,c,d]
2|[e,f,g,h]
However, I would also like to keep another column attached to the aggregation (let's call it the 'status' column). So my new output would be:
1|[[a,pass],[b,fail],[c,fail],[d,pass]]
...
I tried collect_list("table_name, status"), but collect_list only takes one column name. How can I accomplish what I am trying to do?
Use array to collect columns into an array column first, then apply collect_list:
df.groupBy(...).agg(collect_list(array("table_name", "status")))
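A minimal runnable example of that answer, with made-up sample rows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, collect_list

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a", "pass"), (1, "b", "fail"), (2, "e", "pass"), (2, "f", "fail")],
    ["application_id", "table_name", "status"],
)

# array() pairs the two values within each row; collect_list() then gathers
# those pairs per application_id, producing the array-of-arrays output.
result = df.groupBy("application_id").agg(
    collect_list(array("table_name", "status")).alias("tables")
)
result.show(truncate=False)
# 1 | [[a, pass], [b, fail]]
# 2 | [[e, pass], [f, fail]]

Note that collect_list does not guarantee element order after a shuffle.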

Pivot data in Talend

I have some data which I need to pivot in Talend. This is a sample:
brandname,metric,value
A,xyz,2
B,xyz,2
A,abc,3
C,def,1
C,ghi,6
A,ghi,1
Now I need this data to be pivoted on the metric column like this:
brandname,abc,def,ghi,xyz
A,3,null,1,2
B,null,null,null,2
C,null,1,6,null
Currently I am using tPivotToColumnsDelimited to pivot the data to a file and reading back from that file. However, having to store data in an external file and read it back is messy and adds unnecessary overhead.
Is there a way to do this in Talend without writing to an external file? I tried tDenormalize but, as far as I understand, it returns the rows as a single column, which is not what I need. I also looked for a third-party component on Talend Exchange but couldn't find anything useful.
Thank you for your help.
Assuming your metrics are fixed, you can use their names as the columns of the output. The pivot then has two parts: first, a tMap that transposes the value of each input row (in) into the corresponding column of the output row (out); second, a tAggregateRow that groups the tMap's output rows by brandname.
In the tMap you'd have to fill the columns conditionally like this, for example for the output column named "abc":
out.abc = "abc".equals(in.metric) ? in.value : null
In the tAggregateRow you'd group by brandname and aggregate each column as sum, ignoring nulls.