How to create a dynamic number of columns in a DataFrame in Spark using Scala?
For example: if a condition is true I want 4 columns in the DataFrame, and if it is false I want to create 3 columns, using .withColumn().
I want to avoid a plain if/else here and the use of a var for the DataFrame.
Is there a method that can be applied to the field itself to check the condition and create the columns dynamically?
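One common way to do this without a var is to build the list of column definitions first and then fold it onto the DataFrame. A minimal Scala sketch, where the column names and lit(...) values are placeholders:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.lit

// Build the column list up front, then fold it onto the DataFrame,
// so no mutable df variable is needed.
def addColumns(df: DataFrame, condition: Boolean): DataFrame = {
  val cols: Seq[(String, Column)] =
    if (condition)
      Seq("c1" -> lit(1), "c2" -> lit(2), "c3" -> lit(3), "c4" -> lit(4))
    else
      Seq("c1" -> lit(1), "c2" -> lit(2), "c3" -> lit(3))
  cols.foldLeft(df) { case (acc, (name, expr)) => acc.withColumn(name, expr) }
}

The condition still appears once when choosing the list, but the DataFrame itself is threaded through foldLeft instead of being reassigned.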
Related
I have one DataFrame. My use case is to extract info from some columns, flatten it based on a column's type, and store the result in a DataFrame. What is an efficient way to do this?
Ex: We have a DataFrame with, say, 5 columns: a (string), b (string), c (string), d (JSON value), e (string). In the transformation I want to extract, or rather flatten, some records from the d column (the JSON value), so that in the resulting DataFrame each value from the JSON becomes one row.
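Assuming Spark 2.2+ and that d holds a JSON array, one approach is to parse d with from_json and then explode the result, producing one row per element. The element schema below is hypothetical and would need to match the real payload:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Hypothetical element schema for column d; adjust to the real JSON shape.
val dSchema = ArrayType(StructType(Seq(
  StructField("key", StringType),
  StructField("value", StringType)
)))

val flattened = df
  .withColumn("d_parsed", from_json(col("d"), dSchema))
  .withColumn("d_elem", explode(col("d_parsed")))  // one output row per JSON element
  .select(col("a"), col("b"), col("c"), col("e"),
          col("d_elem.key").as("d_key"), col("d_elem.value").as("d_value"))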
I have 2 columns with the following schema in a PySpark DataFrame:
('pattern', 'array<struct<pattern:string,positions:array<int>>>')
('distinct_patterns', 'array<array<struct<pattern:string,positions:array<int>>>>')
I want to find the rows where pattern is present in distinct_patterns.
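The question is PySpark, but the idea is the same in Scala: since distinct_patterns is an array whose elements have the same type as pattern, array_contains can test the whole pattern array for membership. A sketch, assuming structural equality is what is wanted:

import org.apache.spark.sql.functions.{array_contains, col}

// array_contains compares elements by structural equality, so the whole
// pattern array can be tested against the outer distinct_patterns array.
val matching = df.filter(array_contains(col("distinct_patterns"), col("pattern")))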
I have two columns of type org.apache.spark.sql.Column
I need to create a DataFrame from these two columns such that it looks like [col1, col2]. Both columns contain data of double type.
Any suggestions on how to create the DataFrame?
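A Column is an expression rather than a container of data, so it can only be evaluated against the DataFrame it was derived from. A minimal sketch, where sourceDf is an assumed name for that parent DataFrame:

// col1 and col2 are the existing org.apache.spark.sql.Column values;
// selecting them materializes a two-column DataFrame of doubles.
val pairDf = sourceDf.select(col1.as("col1"), col2.as("col2"))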
I have a Spark 1.6 DataFrame with a column of type array<string>. The column holds key/value pairs. I would like to flatten the column and use the keys to make new columns with their values.
Here is what some of the rows in my DataFrame look like:
[{"sequence":192,"id":8697413670252052,"type":["AimLowEvent","DiscreteEvent"],"time":527638582195}]
[{"sequence":194,"id":8702167944035041,"sessionId":8697340571921940,"type":["SessionCanceled","SessionEnded"],"time":527780267698,"duration":143863999}, {"sequence":1,"id":8697340571921940,"source":"iOS","schema":{"name":"netflixApp","version":"1.8.0"},"type":["Log","Session"],"time":527636403699}, 1]
I can use concat_ws to flatten the array, but how would I make new columns based on the data?
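Spark 1.6 predates from_json, but explode plus get_json_object (both available in 1.6) can do the flattening. A sketch, where "events" is an assumed name for the array<string> column and only a few keys are extracted:

import org.apache.spark.sql.functions.{col, explode, get_json_object}

// explode yields one JSON string per row; get_json_object then pulls
// individual keys out as new columns (missing keys come back as null).
val exploded = df.withColumn("event", explode(col("events")))
val flat = exploded.select(
  get_json_object(col("event"), "$.sequence").as("sequence"),
  get_json_object(col("event"), "$.id").as("id"),
  get_json_object(col("event"), "$.time").as("time")
)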
I have a list of integers and an SQLContext DataFrame whose number of rows equals the length of the list. I want to add the list as a column to this DataFrame while maintaining the order. It feels like this should be really simple, but I can't find an elegant solution.
You cannot simply add a list as a DataFrame column, since a list is a local object while a DataFrame is distributed. You can try one of the following approaches:
convert the DataFrame to a local collection with collect() or toLocalIterator() and, for each row, append the corresponding value from the list, OR
convert the list to a DataFrame with an extra index column (keyed the same way as the original DataFrame) and then join the two (sketched below)
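A sketch of the join approach in Scala, assuming a SparkSession named spark. Note that zipWithIndex numbers rows in their current partition order, which matches the intended order only if no shuffle has happened in between:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import spark.implicits._

val values = List(10, 20, 30)  // example list to attach

// Tag every DataFrame row with its position.
val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val indexedDf = spark.createDataFrame(
  indexedRows,
  StructType(df.schema.fields :+ StructField("row_idx", LongType))
)

// Tag every list element the same way, then join on the index.
val listDf = values.zipWithIndex
  .map { case (v, i) => (v, i.toLong) }
  .toDF("new_col", "row_idx")

val result = indexedDf.join(listDf, "row_idx").drop("row_idx")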