Spark SQL - select all columns while updating one - scala

I have a dataframe with many columns and want to do some changes for a specific column, while keeping all the other columns untouched.
More specifically, I want to explode a column.
Currently I am specifying all the column names in the select.
df.select($"col1", $"col2", ..., $"colN", explode($"colX"))
But I would prefer not to have to specify all the column names.
I guess I could use df.columns, filter out the one I want to explode, and use this array in the select.
Is there a cleaner way to achieve this?

Here is one way using filterNot. exp_col is the name of the column you want to use with explode:
import org.apache.spark.sql.functions.{col, explode}
val cols = df.columns.filterNot(_ == "exp_col").map(col(_)) :+ explode($"exp_col")
df.select(cols: _*).show
With filterNot we build the list of the columns that we don't want to explode, and then append the exploded column to it with :+ explode($"exp_col").
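For reference, a minimal self-contained run of this pattern on a toy dataframe (the column names col1, col2 and exp_col are made up for illustration):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "a", Seq(10, 20)),
  (2, "b", Seq(30))
).toDF("col1", "col2", "exp_col")

// Keep every column except exp_col, then append its exploded version.
val cols = df.columns.filterNot(_ == "exp_col").map(col(_)) :+ explode($"exp_col")
df.select(cols: _*).show()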

Related

Spark Scala, grabbing the max value of 1 column, but keep all columns

I have a dataframe with 3 columns (customer, associations, timestamp).
I want to grab the latest customer by looking at timestamps.
Attempt
val rdd = readRdd.select(col("value"))
val val_columns = Seq("value.timestamp").map(x => last(col(x)).alias(x))
rdd.orderBy("value.timestamp")
.groupBy("value.customer")
.agg(val_columns.head, val_columns.tail: _*)
.show()
I believe the above code works, but I am trying to figure out how to include all the other columns (i.e. associations). If I understand correctly, adding them to the groupBy would mean I am grabbing the latest combination of customer and associations together, but I only want to take the latest per customer, not group on multiple columns at once.
Edit:
I might be onto something by adding:
val val_columns = Seq("value.lastRefresh", "value.associations")
.map(x => last(col(x)).alias(x))
Curious on thoughts.
If you want to return the latest customer data by the timestamp column, you can simply order your dataframe by value.timestamp in descending order and apply limit(1):
import org.apache.spark.sql.functions._
df.orderBy(desc("value.timestamp")).limit(1).show()
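For illustration, here is the same idea on a toy dataframe with flat column names (customer, associations, timestamp) standing in for the nested value.* fields; the data is made up:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("c1", "a1", 1L),
  ("c1", "a2", 3L),
  ("c2", "b1", 2L)
).toDF("customer", "associations", "timestamp")

// The latest row overall, with every column kept.
df.orderBy(desc("timestamp")).limit(1).show()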

Spark: group only part of the rows in a DataFrame

From a given DataFrame, I'd like to group only a few rows together, and keep the other rows in the same dataframe.
My current solution is:
val aggregated = mydf.filter(col("check").equalTo("do_aggregate")).groupBy(...).agg()
val finalDF = aggregated.unionByName(mydf.filter(col("check").notEqual("do_aggregate")))
However, I'd like to find a more elegant and performant way.
Use a derived column to group by, depending on the check.
mydf.groupBy(when(col("check").equalTo("do_aggregate"), ...).otherwise(monotonically_increasing_id)).agg(...)
If you have a unique key in the dataframe, use that instead of monotonically_increasing_id.
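Here is a fuller sketch of that idea with made-up columns (key, check, amount); the sum is only an example aggregation:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, first, monotonically_increasing_id, sum, when}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val mydf = Seq(
  ("k1", "do_aggregate", 1),
  ("k1", "do_aggregate", 2),
  ("k2", "keep", 5),
  ("k3", "keep", 7)
).toDF("key", "check", "amount")

// Rows marked do_aggregate share their key as the grouping value; every other
// row gets its own unique group, so it passes through the aggregation unchanged.
val grouped = mydf
  .groupBy(
    when(col("check") === "do_aggregate", col("key"))
      .otherwise(monotonically_increasing_id().cast("string"))
      .alias("grp"))
  .agg(first("key").alias("key"), sum("amount").alias("amount"))
  .drop("grp")

grouped.show()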

spark dropDuplicates based on json array field

I have json files of the following structure:
{"names":[{"name":"John","lastName":"Doe"},
{"name":"John","lastName":"Marcus"},
{"name":"David","lastName":"Luis"}
]}
I want to read several such json files and deduplicate them based on the "name" field inside names.
I tried
df.dropDuplicates(Array("names.name"))
but it didn't do the magic.
This seems to be a regression introduced in Spark 2.0. If you bring the nested column to the top level you can drop the duplicates: create a new column based on the columns you want to dedup on, drop the duplicates using that column, and finally drop the helper column. The following approach works for composite keys as well.
import org.apache.spark.sql.functions.{col, concat_ws}
val columns = Seq("names.name")
df.withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))
  .dropDuplicates("DEDUP_KEY")
  .drop("DEDUP_KEY")
Just for future reference, the solution looks like:
import org.apache.spark.sql.functions.{col, explode}

val uniqueNames = allNames
  .withColumn("DEDUP_NAME_KEY", explode(col("names.name")))
  .cache()
  .dropDuplicates("DEDUP_NAME_KEY")
  .drop("DEDUP_NAME_KEY")

how to remove a column from a dataframe which doesn't have any value (Scala)

Problem statement
I have a table called employee from which I am creating a dataframe. There are some columns which don't have any records. I want to remove those columns from the dataframe. I also don't know how many columns of the dataframe have no records in them.
You cannot remove a column from an existing dataframe in place, AFAIK, since dataframes are immutable!
What you can do is create another dataframe from the old one, selecting only the column names that you actually want.
Example:
if the old DF's schema is (id, name, badColumn, email), then
val newDf = oldDF.select("id", "name", "email")
Alternatively, you can use the .drop() function on the dataframe, which takes column names, drops them, and returns a new dataframe.
You can read about it here: https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.sql.Dataset#drop(col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
I hope this solves your use case!
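One possible way to detect the empty columns automatically (a sketch with made-up data, not taken from the answer above): count the non-null values per column in a single pass and drop every column whose count is zero.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val oldDF = Seq(
  (1, "alice", null.asInstanceOf[String], "a@x.com"),
  (2, "bob", null.asInstanceOf[String], "b@x.com")
).toDF("id", "name", "badColumn", "email")

// Count the non-null values of every column in one job.
val counts = oldDF.select(oldDF.columns.map(c => count(col(c)).alias(c)): _*).first()

// Columns whose non-null count is zero carry no records at all.
val emptyCols = oldDF.columns.filter(c => counts.getAs[Long](c) == 0L)

val newDf = oldDF.drop(emptyCols: _*)
newDf.printSchema()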

Add list as column to Dataframe in pyspark

I have a list of integers and a sqlcontext dataframe with the number of rows equal to the length of the list. I want to add the list as a column to this dataframe maintaining the order. I feel like this should be really simple but I can't find an elegant solution.
You cannot simply add a list as a dataframe column, since a list is a local object and a dataframe is distributed. You can try one of the following approaches:
convert the dataframe to a local collection with collect() or toLocalIterator() and, for each row, add the corresponding value from the list, OR
convert the list to a dataframe with an extra index column (matching keys from the original dataframe) and then join the two.
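A minimal sketch of the second approach, written in Scala like the rest of this page (the PySpark version is analogous) and using made-up names: attach a position index to both sides and join on it.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")   // the existing dataframe
val extra = List(10, 20, 30)                // the list to attach, same length as df

// Add a row index to the dataframe via zipWithIndex on the underlying RDD.
val dfWithIdx = spark.createDataFrame(
  df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  df.schema.add(StructField("idx", LongType))
)

// Turn the list into a dataframe with the same index, then join and drop it.
val extraDF = extra.zipWithIndex
  .map { case (v, i) => (v, i.toLong) }
  .toDF("extra", "idx")

val result = dfWithIdx.join(extraDF, "idx").orderBy("idx").drop("idx")
result.show()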