I have a list of integers and a sqlcontext dataframe with the number of rows equal to the length of the list. I want to add the list as a column to this dataframe maintaining the order. I feel like this should be really simple but I can't find an elegant solution.
You cannot simply add a list as a DataFrame column, because the list is a local object and the DataFrame is distributed. You can try one of the following approaches:
convert the DataFrame to a local collection with collect() or toLocalIterator() and add the corresponding list value to each row, OR
convert the list to a DataFrame with an extra key column (using keys from the original DataFrame) and then join the two.
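The join-based approach could look roughly like this in Scala. This is only a sketch, assuming Spark 2.x and a local SparkSession. The sample data and the column names letter, row_idx and value are made up for illustration. Because the DataFrame here has no natural key, a positional index from zipWithIndex stands in for one, which is only meaningful if the DataFrame's row order is itself well defined.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Assumption: a local session for experimenting; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("add-list-column").getOrCreate()
import spark.implicits._

// Hypothetical inputs: a small DataFrame and a list of the same length.
val df = Seq("a", "b", "c").toDF("letter")
val values = List(10, 20, 30)

// Attach a positional index to every row of the DataFrame.
val indexedSchema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val indexedDf = spark.createDataFrame(indexedRdd, indexedSchema)

// Turn the list into a DataFrame keyed by the same positional index.
val valuesDf = values.zipWithIndex.map { case (v, i) => (i.toLong, v) }.toDF("row_idx", "value")

// Join on the index, restore the original order, and drop the helper column.
val result = indexedDf.join(valuesDf, Seq("row_idx")).orderBy("row_idx").drop("row_idx")
result.show()
```

If the DataFrame already has a unique key column, joining on that key avoids the round trip through the RDD.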
Related
I have a dataframe, and my use case is to extract information from some of its columns, flatten it depending on the column type, and store the result in a dataframe. What is an efficient way to do this, and how?
Example: say the dataframe has 5 columns: a (string), b (string), c (string), d (JSON value), e (string). In the transformation I want to extract, or rather flatten, the records held in column d (the JSON value), so that in the resulting dataframe each value from the JSON becomes one row.
I have a dataframe with many columns and want to do some changes for a specific column, while keeping all the other columns untouched.
More specifically, I want to explode a column.
Currently I am specifying all the column names in the select.
df.select($"col1", $"col2", ..., $"colN", explode($"colX"))
But I would prefer not to have to specify all the column names.
I guess I could use df.columns, filter out the one I want to explode, and use this array in the select.
Is there a cleaner way to achieve this?
Here is one way using filterNot. exp_col is the name of the column you want to use with explode:
import org.apache.spark.sql.functions.{col, explode}
val cols = df.columns.filterNot(_ == "exp_col").map(col(_)) :+ explode($"exp_col")
df.select(cols: _*).show
With filterNot we build an array of the columns that explode should not be applied to. Then we append the exploded column to it with :+ explode($"exp_col").
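Another option, not part of the answer above, is to overwrite the column with withColumn, which leaves every other column untouched and keeps the column order. A minimal sketch, assuming Spark 2.x and a local SparkSession; the sample data is made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Assumption: a local session for experimenting; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("explode-in-place").getOrCreate()
import spark.implicits._

// Hypothetical input with one array column to explode.
val df = Seq((1, "x", Seq("a", "b")), (2, "y", Seq("c"))).toDF("col1", "col2", "exp_col")

// Replacing exp_col with its exploded form keeps all other columns as they are.
val exploded = df.withColumn("exp_col", explode(col("exp_col")))
exploded.show()
```

Note that explode drops rows whose array is empty or null, so this behaves the same as the select version in that respect.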
From a given DataFrame, I'd like to group only a few rows together and keep the other rows in the same DataFrame.
My current solution is:
val aggregated = mydf.filter(col("check").equalTo("do_aggregate")).groupBy(...).agg()
val finalDF = aggregated.unionByName(mydf.filter(col("check").notEqual("do_aggregate")))
However, I'd like to find a more elegant and performant way.
Use a derived column to group by, depending on the check.
mydf.groupBy(when(col("check").equalTo("do_aggregate"), ...).otherwise(monotonically_increasing_id())).agg(...)
If you have a unique key in the dataframe, use that instead of monotonically_increasing_id.
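To make the idea concrete, here is a minimal sketch, assuming Spark 2.x and a local SparkSession. The columns key and amount, the sample rows, and the agg_/row_ prefixes are all made up for illustration. It is a slight variant of the one-liner above: the ID is materialised as a column first, and both branches are turned into strings so that they share one type and cannot collide.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, first, lit, monotonically_increasing_id, sum, when}

// Assumption: a local session for experimenting; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("partial-aggregate").getOrCreate()
import spark.implicits._

// Hypothetical data: only the rows marked do_aggregate should be merged per key.
val mydf = Seq(
  ("k1", "do_aggregate", 1),
  ("k1", "do_aggregate", 2),
  ("k2", "keep", 5),
  ("k2", "keep", 7)
).toDF("key", "check", "amount")

// Give every row its own ID so that rows outside the aggregation stay in singleton groups.
val withId = mydf.withColumn("row_id", monotonically_increasing_id())

// Rows to aggregate share a key-based group; every other row gets a group of its own.
val groupKey = when(col("check") === "do_aggregate", concat(lit("agg_"), col("key")))
  .otherwise(concat(lit("row_"), col("row_id").cast("string")))

val finalDF = withId
  .groupBy(groupKey.as("group_key"))
  .agg(first("key").as("key"), first("check").as("check"), sum("amount").as("amount"))
  .drop("group_key")
finalDF.show()
```

This scans the data once instead of filtering twice and unioning, at the cost of shuffling the rows that do not need aggregating.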
I have a Spark 1.6 dataframe with a column of type array<string>. Each element of the column holds key/value pairs. I would like to flatten the column and use the keys to make new columns with their values.
Here is what some of the rows in my dataframe look like:
[{"sequence":192,"id":8697413670252052,"type":["AimLowEvent","DiscreteEvent"],"time":527638582195}]
[{"sequence":194,"id":8702167944035041,"sessionId":8697340571921940,"type":["SessionCanceled","SessionEnded"],"time":527780267698,"duration":143863999}, {"sequence":1,"id":8697340571921940,"source":"iOS","schema":{"name":"netflixApp","version":"1.8.0"},"type":["Log","Session"],"time":527636403699}, 1]
I can use concat_ws to flatten the array, but how would I create new columns based on the data?
Problem statement
I have a table called employee from which I am creating a DataFrame. Some of its columns don't contain any records. I want to remove those columns from the DataFrame, but I don't know in advance how many columns are empty.
As far as I know, you cannot remove a column from a DataFrame in place, because DataFrames are immutable.
What you can do is create another DataFrame from the old one, selecting only the columns you actually want.
Example:
Say oldDF has the schema (id, name, badColumn, email),
then
val newDf=oldDF.select("id","name","email")
Alternatively, you can use
the .drop() function on the DataFrame, which takes the column names, drops them, and returns a new DataFrame.
You can read about it here: https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.sql.Dataset#drop(col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
I hope this solves your use case!
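Since the question says the empty columns are not known in advance, they can be found first and then passed to drop. The following is only a sketch, assuming Spark 2.x, a local SparkSession, and that "no record" means every value in the column is null. The employee data is made up, reusing the column names from the example above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count}

// Assumption: a local session for experimenting; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("drop-empty-columns").getOrCreate()
import spark.implicits._

// Hypothetical employee data in which badColumn is null for every row.
val employee = Seq(
  (1, "Ann", Option.empty[String], "ann@example.com"),
  (2, "Bob", Option.empty[String], "bob@example.com")
).toDF("id", "name", "badColumn", "email")

// count(column) counts only non-null values, so one pass yields a non-null count per column.
val nonNullCounts = employee.select(employee.columns.map(c => count(col(c)).as(c)): _*).first()

// Columns whose non-null count is zero have no data at all.
val emptyCols = employee.columns.filter(c => nonNullCounts.getAs[Long](c) == 0L)

// drop returns a new DataFrame without those columns; the original is unchanged.
val cleaned = employee.drop(emptyCols: _*)
cleaned.show()
```

This costs one extra pass over the data to compute the counts. Column names containing dots would need backtick quoting inside col(...).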