Scala - How to append a column to a DataFrame preserving the original column name?

I have a basic DataFrame containing all the data, and several derivative DataFrames that I've subsequently created from the basic DF through groupings, joins, etc.
Every time I want to append a column to the latest DataFrame containing the most relevant data, I have to do something like this:
val theMostRelevantFinalDf = olderDF
  .withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date")).cast(TimestampType), "UTC").cast(StringType))
  .drop($"new_date")
As you can see, I have to change the original column name to new_date_.
But I want the column name to remain the same.
However, if I don't change the name, the column gets dropped. So renaming is just a not-too-pretty workaround.
How can I preserve the original column name when appending the column?

As far as I know, you cannot create two columns with the same name in a DataFrame transformation. I rename the new column to the old one's name, like this:
val theMostRelevantFinalDf = olderDF
  .withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date")).cast(TimestampType), "UTC").cast(StringType))
  .drop($"new_date")
  .withColumnRenamed("new_date_", "new_date")
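Worth noting: Spark's withColumn adds a column or replaces the existing column that has the same name, so a variant that writes the transform against the original name, skipping the drop and rename entirely, should also work. A minimal sketch:
import org.apache.spark.sql.functions.{col, to_utc_timestamp, unix_timestamp}
import org.apache.spark.sql.types.{StringType, TimestampType}

// withColumn overwrites the existing "new_date" column in place,
// so no temporary name, drop, or rename is needed
val theMostRelevantFinalDf = olderDF.withColumn(
  "new_date",
  to_utc_timestamp(unix_timestamp(col("new_date")).cast(TimestampType), "UTC").cast(StringType)
)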

Related

Spark agg to collect a single list for multiple columns

Here is my current code:
pipe_exec_df_final_grouped = pipe_exec_df_final.groupBy("application_id").agg(collect_list("table_name").alias("tables"))
However, I would like multiple column values in my collected list, so the aggregated column would be an array of arrays. Currently the result looks like this:
1|[a,b,c,d]
2|[e,f,g,h]
However, I would also like to keep another column attached to the aggregation (let's call it the 'status' column). So my new output would be:
1|[[a,pass],[b,fail],[c,fail],[d,pass]]
...
I tried collect_list("table_name, status"), but collect_list only takes one column name. How can I accomplish what I am trying to do?
Use array to collect columns into an array column first, then apply collect_list:
df.groupBy(...).agg(collect_list(array("table_name", "status")))
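A minimal end-to-end sketch using the column names from the question (shown in Scala; the PySpark call is the same apart from syntax), with an alias so the output column keeps a readable name:
import org.apache.spark.sql.functions.{array, collect_list}

val grouped = pipe_exec_df_final
  .groupBy("application_id")
  .agg(collect_list(array("table_name", "status")).alias("tables"))

// application_id | tables
// 1              | [[a, pass], [b, fail], [c, fail], [d, pass]]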

spark dropDuplicates based on json array field

I have json files of the following structure:
{"names":[{"name":"John","lastName":"Doe"},
{"name":"John","lastName":"Marcus"},
{"name":"David","lastName":"Luis"}
]}
I want to read several such JSON files and deduplicate the rows based on the "name" field inside names.
I tried
df.dropDuplicates(Array("names.name"))
but it didn't do the magic.
This seems to be a regression that was introduced in Spark 2.0. If you bring the nested column up to the top level, you can drop the duplicates: create a new column based on the columns you want to dedup on, drop the duplicates using it, and finally drop the helper column. The following approach works for composite keys as well.
val columns = Seq("names.name")
// concat_ws expects Column arguments, so map the names through col
df.withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))
  .dropDuplicates("DEDUP_KEY")
  .drop("DEDUP_KEY")
Just for future reference, the solution looks like:
import org.apache.spark.sql.functions.{col, explode}

val uniqueNames = allNames
  .withColumn("DEDUP_NAME_KEY", explode(col("names.name")))
  .cache()
  .dropDuplicates("DEDUP_NAME_KEY")
  .drop("DEDUP_NAME_KEY")

How to remove a column from a DataFrame which doesn't have any value (Scala)

Problem statement
I have a table called employee from which I am creating a DataFrame. Some columns don't have any records, and I want to remove those columns from the DataFrame. I also don't know in advance how many columns of the DataFrame have no records.
You cannot remove a column from a DataFrame in place, AFAIK!
What you can do is make another DataFrame from the old one, selecting only the columns that you actually want!
Example:
if oldDF's schema looks like (id, name, badColumn, email),
then
val newDf = oldDF.select("id", "name", "email")
Alternatively, you can use the .drop() function on the DataFrame: it takes the column names, drops them, and returns a new DataFrame!
You can read about it here: https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.sql.Dataset#drop(col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
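The question also asks how to detect which columns are empty in the first place. Here is a minimal sketch, assuming "no record" means every value in the column is null; it relies on count(col) ignoring nulls, so one aggregation pass over oldDF (a hypothetical name for the employee DataFrame) reveals the empty columns:
import org.apache.spark.sql.functions.{col, count}

// count(col) skips nulls, so a zero count means the column holds no records
val nonNullCounts = oldDF
  .select(oldDF.columns.map(c => count(col(c)).alias(c)): _*)
  .first()

val emptyColumns = oldDF.columns.filter(c => nonNullCounts.getAs[Long](c) == 0L)
val newDf = oldDF.drop(emptyColumns: _*)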
I hope this solves your use case!

Add list as column to Dataframe in pyspark

I have a list of integers and a SQLContext DataFrame with the number of rows equal to the length of the list. I want to add the list as a column to this DataFrame, maintaining the order. I feel like this should be really simple, but I can't find an elegant solution.
You cannot simply add a list as a DataFrame column, since a list is a local object and a DataFrame is distributed. You can try one of the following approaches:
convert the DataFrame to a local collection via collect() or toLocalIterator() and, for each row, add the corresponding value from the list, OR
convert the list to a DataFrame with an extra positional key column (matching keys from the original DataFrame) and then join the two (see the sketch after this list)
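Here is a minimal sketch of the second approach (written in Scala like the rest of this page; the PySpark version is analogous), assuming df is the existing DataFrame, spark is the active SparkSession, and values is a hypothetical local list; zipWithIndex supplies the positional key on both sides:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import spark.implicits._

val values = List(10, 20, 30) // hypothetical local list, one entry per row of df

// attach a positional key to every row of the original DataFrame
val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val indexedDF = spark.createDataFrame(
  indexedRows,
  StructType(df.schema.fields :+ StructField("row_idx", LongType))
)

// give the list the same positional key, then join and drop the key
val listDF = values.zipWithIndex.map { case (v, i) => (v, i.toLong) }.toDF("new_col", "row_idx")
val result = indexedDF.join(listDF, "row_idx").drop("row_idx")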

How can I choose which duplicate rows to be dropped?

I'm trying to merge a new dataset in with an old dataset. I have a Seq[String] of primary keys for each table type, plus an old DataFrame and a new DataFrame with the same schema.
If the primary key column values match, I want to replace the row in the old dataframe with the row in the new dataframe, if they don't match, I want to add the row in.
I have this so far:
val finalFrame: DataFrame = oldDF.withColumn("old/new", lit("1"))
  .union(newDF.withColumn("old/new", lit("2")))
  .dropDuplicates(primaryKeySet)
I add a literal column of 1's and 2's to keep track of which rows came from which DataFrame, union them together, and drop the duplicates based on the Seq[String] of primary key column names. The problem with this solution is that it doesn't let me specify which duplicates are dropped from the table. If I could specify that the duplicates marked "1" are dropped, that would be optimal, but I'm open to alternate solutions.
Pounded my head on it a little longer and figured out a trick. My primary keys were a sequence, and so couldn't be passed directly to partitionBy in a window function, so I did this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}

val windowFunction = Window.partitionBy(primaryKeySet.head, primaryKeySet.tail: _*).orderBy(desc("old/new"))
val duplicateFreeFinalDF = finalFrame
  .withColumn("rownum", row_number().over(windowFunction))
  .where("rownum = 1")
  .drop("rownum")
  .drop("old/new")
Essentially, I just used vararg expansion so partitionBy would accept my sequence, and then a row_number window function ordered by the old/new flag so I could be sure to keep the most recent copy in case of a duplicate.