Extract info from some columns, store into dataframe after flattening - Scala

I have one dataframe. My use case is to extract info from some columns, flatten it based on the column type, and store the result in a dataframe. What is an efficient way to do this, and how?
Example: we have a dataframe with, let's say, 5 columns, for example: a (string), b (string), c (string), d (JSON value), e (string). In the transformation I want to extract (or, put another way, flatten) some records from column d (the JSON value), so that in the resulting dataframe each value from the JSON becomes one row.
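If the JSON in column d is an array of objects, one common approach (Spark 2.2+) is to parse it with from_json and then explode the parsed array so that each element becomes its own row. The schema and field names below (key, value) are only placeholders; replace them with the actual structure of your JSON:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Hypothetical schema: column d holds a JSON array of {"key": ..., "value": ...} objects
val dSchema = ArrayType(new StructType()
  .add("key", StringType)
  .add("value", StringType))

val flattened = df
  .withColumn("d_parsed", from_json(col("d"), dSchema))   // parse the JSON string
  .withColumn("d_item", explode(col("d_parsed")))         // one row per array element
  .select(col("a"), col("b"), col("c"), col("e"),
    col("d_item.key").as("d_key"), col("d_item.value").as("d_value"))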

Related

How to find if one column is contained in another array column in pyspark

I have 2 columns with the following schema in a pyspark dataframe
('pattern', 'array<struct<pattern:string,positions:array<int>>>')
('distinct_patterns', 'array<array<struct<pattern:string,positions:array<int>>>>')
I want to find the rows where pattern is present in distinct_patterns.
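One way to express this check, shown here in Scala (the same SQL expression can be used from PySpark via expr), assuming Spark 2.4+ where array_contains supports complex element types:

import org.apache.spark.sql.functions.expr

// Keep the rows whose `pattern` array appears as an element of `distinct_patterns`
val matched = df.filter(expr("array_contains(distinct_patterns, pattern)"))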

Scala - How to append a column to a DataFrame preserving the original column name?

I have a basic DataFrame containing all the data, and several derivative DataFrames that I've subsequently created from the basic DF by grouping, joining, etc.
Every time I want to append a column to the latest DataFrame containing the most relevant data, I have to do something like this:
val theMostRelevantFinalDf = olderDF.withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date"))
.cast(TimestampType), "UTC").cast(StringType)).drop($"new_date")
As you can see, I have to change the original column name to new_date_, but I want the column name to remain the same.
However, if I don't change the name, the column gets dropped, so renaming is just a not-too-pretty workaround.
How can I preserve the original column name when appending the column?
As far as I know you cannot create two columns with the same name in a DataFrame transformation, so I rename the new column to the older column's name, like this:
val theMostRelevantFinalDf = olderDF.withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date"))
.cast(TimestampType), "UTC").cast(StringType)).drop($"new_date").withColumnRenamed("new_date_", "new_date")
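Put together as a self-contained snippet (only the imports added), the workaround looks like this:

import org.apache.spark.sql.functions.{col, to_utc_timestamp, unix_timestamp}
import org.apache.spark.sql.types.{StringType, TimestampType}

val theMostRelevantFinalDf = olderDF
  .withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date")).cast(TimestampType), "UTC").cast(StringType))
  .drop(col("new_date"))                        // remove the old column...
  .withColumnRenamed("new_date_", "new_date")   // ...and give the new column its name back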

Grouping CSV columns in Apache Beam transform

I have a CSV with about 200 columns. I would like to group each column so that I get a PCollection whose elements are col_name:[column values] pairs. How would such a thing be done using the Beam Python SDK?
You can output tuples of (column_index, column_value) and then group them by the column index. If you have a name associated with each column, you can output (column_name, column_value) instead.
Reference: https://beam.apache.org/documentation/programming-guide/#core-beam-transforms
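The core idea, independent of the exact Beam API, is to emit one keyed record per cell and then group by the key (a ParDo/FlatMap followed by a GroupByKey in Beam terms). Illustrated with plain Scala collections and made-up data:

// Hypothetical CSV already split into a header and rows of equal width
val header = Seq("col_a", "col_b", "col_c")
val rows   = Seq(Seq("1", "x", "foo"), Seq("2", "y", "bar"))

// Emit (column_name, cell_value) pairs, then group them by column name
val grouped: Map[String, Seq[String]] =
  rows.flatMap(row => header.zip(row))   // one (name, value) pair per cell
    .groupBy(_._1)                       // group by column name
    .map { case (name, pairs) => name -> pairs.map(_._2) }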

splitting spark column into new columns based on fields in array<string>

I have a Spark 1.6 dataframe with a column of type array<string>. The column holds key/value pairs. I would like to flatten the column and use the keys to make new columns holding their values.
Here is what some of the rows in my dataframe look like:
[{"sequence":192,"id":8697413670252052,"type":["AimLowEvent","DiscreteEvent"],"time":527638582195}]
[{"sequence":194,"id":8702167944035041,"sessionId":8697340571921940,"type":["SessionCanceled","SessionEnded"],"time":527780267698,"duration":143863999}, {"sequence":1,"id":8697340571921940,"source":"iOS","schema":{"name":"netflixApp","version":"1.8.0"},"type":["Log","Session"],"time":527636403699}, 1]
I can use concat_ws to flatten the array, but how would I make new columns based on the data?
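One way to do this without concat_ws, sketched below assuming the array column is named events and each element is one JSON object, is to explode the array into one row per element and then pull individual keys out with get_json_object (both available in Spark 1.6):

import org.apache.spark.sql.functions.{col, explode, get_json_object}

// One row per JSON string in the array
val exploded = df.withColumn("event", explode(col("events")))

// Extract selected keys from each JSON string into their own columns
val result = exploded
  .withColumn("sequence", get_json_object(col("event"), "$.sequence"))
  .withColumn("id", get_json_object(col("event"), "$.id"))
  .withColumn("time", get_json_object(col("event"), "$.time"))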

Add list as column to Dataframe in pyspark

I have a list of integers and a SQLContext dataframe with the number of rows equal to the length of the list. I want to add the list as a column to this dataframe, maintaining the order. I feel like this should be really simple, but I can't find an elegant solution.
You cannot simply add a list as a dataframe column, since the list is a local object and the dataframe is distributed. You can try one of the following approaches:
convert the dataframe to a local collection with collect() or toLocalIterator() and, for each row, add the corresponding value from the list, OR
convert the list to a dataframe with an extra index column (matching keys in the original dataframe) and then join the two (see the sketch below)
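A sketch of the second (join) approach, written in Scala here but translating directly to PySpark. It assumes a SparkSession named spark is in scope and that ordering by monotonically_increasing_id() is acceptable for preserving the dataframe's current row order; df and myList are placeholder names:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}
import spark.implicits._

// Index the existing dataframe in its current order
val w = Window.orderBy(monotonically_increasing_id())
val dfIndexed = df.withColumn("row_idx", row_number().over(w))

// Turn the local list into a dataframe with matching 1-based indices
val listDf = myList.zipWithIndex
  .map { case (value, i) => (i + 1, value) }
  .toDF("row_idx", "new_col")

val result = dfIndexed.join(listDf, "row_idx").drop("row_idx")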