Spark agg to collect a single list for multiple columns - scala

Here is my current code:
val pipe_exec_df_final_grouped = pipe_exec_df_final.groupBy("application_id").agg(collect_list("table_name").alias("tables"))
However, in my collected list, I would like multiple column values, so the aggregated column would be an array of arrays. Currently the result looks like this:
1|[a,b,c,d]
2|[e,f,g,h]
However, I would also like to keep another column attached to the aggregation (let's call it the 'status' column). So my new output would be:
1|[[a,pass],[b,fail],[c,fail],[d,pass]]
...
I tried collect_list("table_name, status"), but collect_list only takes one column name. How can I accomplish what I am trying to do?

Use array to collect columns into an array column first, then apply collect_list:
df.groupBy(...).agg(collect_list(array("table_name", "status")))
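For context, here is a minimal, self-contained sketch of the same idea; the sample data, alias, and SparkSession setup are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, collect_list}

val spark = SparkSession.builder().master("local[*]").appName("collect-pairs").getOrCreate()
import spark.implicits._

// Made-up rows mirroring the question's columns
val pipe_exec_df_final = Seq(
  (1, "a", "pass"),
  (1, "b", "fail"),
  (2, "e", "pass")
).toDF("application_id", "table_name", "status")

val grouped = pipe_exec_df_final
  .groupBy("application_id")
  .agg(collect_list(array("table_name", "status")).alias("tables"))

grouped.show(false)  // each 'tables' value is an array of [table_name, status] arrays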

Related

How to compute a variable or column of comma separated values from multiple rows of the same column

Scenario: an Azure Data Factory data flow processes bulk records from a CSV dataset. For dependent jobs at the destination SQL database, I need a comma-separated list of ids built from multiple rows of that CSV. Can someone help with how to do this?
I tried using a derived column step with the coalesce and concat functions, but didn't get the result I was looking for.
Use the collect() aggregate function. It acts like a string aggregation and was just released last week:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expression-functions#collect
https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-new-hierarchical-data-handling-and-new-flexibility-for/ba-p/1353956

Scala - How to append a column to a DataFrame preserving the original column name?

I have a base DataFrame containing all the data and several derivative DataFrames that I've been subsequently creating from the base DF via groupings, joins, etc.
Every time I want to append a column to the latest DataFrame (the one containing the most relevant data), I have to do something like this:
val theMostRelevantFinalDf = olderDF
  .withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date")).cast(TimestampType), "UTC").cast(StringType))
  .drop($"new_date")
As you can see, I have to change the original column name to new_date_.
But I want the column name to remain the same.
However, if I don't change the name, the column gets dropped, so renaming is just a not-too-pretty workaround.
How can I preserve the original column name when appending the column?
As far as I know, you cannot create two columns with the same name in a DataFrame transformation, so I rename the new column back to the old one's name, like this:
val theMostRelevantFinalDf = olderDF
  .withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date")).cast(TimestampType), "UTC").cast(StringType))
  .drop($"new_date")
  .withColumnRenamed("new_date_", "new_date")
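For reference, the snippets above assume imports along these lines (a sketch; `spark` is your SparkSession):

import org.apache.spark.sql.functions.{col, to_utc_timestamp, unix_timestamp}
import org.apache.spark.sql.types.{StringType, TimestampType}
import spark.implicits._  // enables the $"new_date" column syntax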

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But neither of the above returns the value of the column "name".
Spark version: 2.2.0
Scala version: 2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too large, it will cause the driver to fail.
So the alternative is to check a few items from the DataFrame. What I generally do is:
df.limit(10).select("name").as[String].collect()
This will return 10 elements, but the output doesn't look nice.
So the second alternative is:
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if the column values are long, it puts "..." instead of the actual values, which is annoying.
Hence there is a third option:
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
In all these cases you won't get a fair sample of the data, as only the first 10 rows are picked. So to truly pick rows at random from the DataFrame you can use:
df.select("name").sample(.2, true).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on dataframe
The first will do :)
val name = df.select("name") will return another DataFrame. You can do for example name.show() to show content of the DataFrame. You can also do collect or collectAsMap to materialize results on driver, but be aware, that data amount should not be too big for driver
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.
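Pulling these options together, here is a minimal runnable sketch; the DataFrame contents are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("read-column").getOrCreate()
import spark.implicits._  // required for .as[String]

val df = Seq(("Alice", 1), ("Bob", 2), ("Carol", 3)).toDF("name", "id")

val names: Array[String] = df.select("name").as[String].collect()  // materialize on the driver
names.foreach(println)

df.select("name").show(10)                        // tabular preview, long values get truncated
df.select("name").show(10, truncate = false)      // avoids the "..." truncation mentioned above
df.select("name").take(10).foreach(println)       // Array[Row] of at most 10 rows
df.select("name").sample(true, 0.2).show(10)      // sample with replacement, ~20% of rows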

spark dropDuplicates based on json array field

I have json files of the following structure:
{"names":[{"name":"John","lastName":"Doe"},
{"name":"John","lastName":"Marcus"},
{"name":"David","lastName":"Luis"}
]}
I want to read several such json files and deduplicate them based on the "name" field inside names.
I tried
df.dropDuplicates(Array("names.name"))
but it didn't do the magic.
This seems to be a regression introduced in Spark 2.0. If you bring the nested column up to the top level, you can drop the duplicates: create a new column from the columns you want to dedup on, drop duplicates on that key, and finally drop the key column. The following approach works for composite keys as well.
val columns = Seq("names.name")
df.withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))  // requires import org.apache.spark.sql.functions.{col, concat_ws}
  .dropDuplicates("DEDUP_KEY")
  .drop("DEDUP_KEY")
Just for future reference, the solution looks like this:
val uniqueNames = allNames
  .withColumn("DEDUP_NAME_KEY", org.apache.spark.sql.functions.explode(new Column("names.name")))
  .cache()
  .dropDuplicates("DEDUP_NAME_KEY")
  .drop("DEDUP_NAME_KEY")

How to do pandas groupby([multiple columns]) so its result can be looked up

I have two dataframes: tr is a training-set, ts is a test-set.
They contain columns uid (a user_id), categ (a categorical), and response.
response is the dependent variable I'm trying to predict in ts.
I am trying to compute the mean of response in tr, broken out by columns uid and categ:
avg_response_uid_categ = tr.groupby(['uid','categ']).response.mean()
This gives the result, but (unwantedly) the DataFrame index is a MultiIndex (this is the default groupby(..., as_index=True) behavior):
MultiIndex[--5hzxWLz5ozIg6OMo6tpQ SomeValueOfCateg, --65q1FpAL_UQtVZ2PTGew AnotherValueofCateg, ...
But instead I want the result to keep 'uid' and 'categ' as two separate columns.
Should I use aggregate() instead of groupby()?
Trying groupby(..., as_index=False) didn't help.
The result seems to differ depending on whether you do:
tr.groupby(['uid','categ']).response.mean()
or:
tr.groupby(['uid','categ'])['response'].mean() # RIGHT
i.e., whether you slice out a single Series or a DataFrame containing a single Series. Related: Pandas selecting by label sometimes returns a Series, sometimes a DataFrame