Sort every column of a dataframe in Spark Scala

I am working in Spark & Scala and have a dataframe with several hundred columns. I would like to sort the dataframe by every column. Is there any way to do this in Scala/Spark?
I have tried:
val sortedDf = actualDF.sort(actualDF.columns)
but .sort does not support Array[String] input.
This question has been asked before (Sort all columns of a dataframe), but there is no Scala answer.

Thank you to @blackbishop for the answer to this:
val dfSortedByAllItsColumns = actualDF.sort(actualDF.columns.map(col): _*)
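For completeness, a minimal runnable sketch of this approach (the example dataframe and its column names are made up for illustration):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
// hypothetical dataframe standing in for actualDF
val actualDF = Seq((3, "b", 2.0), (1, "a", 1.0), (2, "a", 3.0)).toDF("c1", "c2", "c3")
// map every column name to a Column and pass them as varargs to sort
val dfSortedByAllItsColumns = actualDF.sort(actualDF.columns.map(col): _*)
dfSortedByAllItsColumns.show()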

Related

How to do a groupby rank and add it as a column to existing dataframe in spark scala?

Currently this is what I'm doing:
val new_df = old_df.groupBy("column1").count().withColumnRenamed("count", "column1_count")
val new_df_rankings = new_df
  .withColumn(
    "column1_count_rank",
    dense_rank().over(Window.orderBy($"column1_count".desc)))
  .select("column1_count", "column1_count_rank")
But really all I'm looking to do is add a column to the original df (old_df) called "column1_count_rank" without going through all these intermediate steps and merging back.
Is there a way to do this?
Thanks and have a great day!
When you apply an aggregation, a new dataframe is created to hold the computed result.
Can you give a sample input and expected output?
old_df.groupBy("column1").agg(count("*").alias("column1_count"))
  .withColumn("column1_count_rank", dense_rank().over(Window.orderBy($"column1_count".desc)))
  .select("column1_count", "column1_count_rank")
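If the goal is just to append the rank to every row of old_df without joining back, one possible sketch (assuming old_df really has a column1 column; nothing else about its schema is known here) uses two window functions instead of a groupBy:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, dense_rank}
// count the rows per column1 value without collapsing the dataframe,
// then rank those counts; every original row of old_df is kept
val ranked_df = old_df
  .withColumn("column1_count", count("*").over(Window.partitionBy("column1")))
  .withColumn("column1_count_rank", dense_rank().over(Window.orderBy(col("column1_count").desc)))
Note that Window.orderBy without a partition moves all rows onto a single partition for the ranking step, so it can be slow on large data.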

Applying transformations with filter or map which one is faster Scala spark

I am trying to do some transformations on a dataset with Spark using Scala. I am currently using Spark SQL but want to shift the code to the native Scala API. I want to know whether to use filter or map for operations like matching values in a column and getting a single column after the transformation into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
I used to write something like this in Spark SQL. Can someone tell me an alternative way to write the same using map or filter on the dataset, and which one is faster?
You can read the documentation on the Apache Spark website; the API documentation is at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example:
// create a small dataframe from a local collection
val df = sc.parallelize(Seq((1, "ABC"), (2, "DEF"), (3, "GHI"))).toDF("col1", "col2")
// filter keeps only the rows matching the predicate, with the same columns
val df1 = df.filter("col1 > 1")
df1.show()
// map transforms each row into a new value, producing a Dataset[Int] here
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand your question correctly, you need to rewrite your SQL query with the DataFrame API. Your query reads all columns from table TABLE and filters the rows where COLUMN is empty. You can do this with a DataFrame in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as with your SQL. Use the dataFrame.explain(true) method to see what Spark will do.
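If you want the typed Dataset style rather than Column expressions, a rough sketch could look like the following (the Record case class and its fields are hypothetical stand-ins for the real table schema); filter is the right tool for dropping rows, while map is for transforming each row into a new value:
import org.apache.spark.sql.SparkSession
// hypothetical case class matching the table's schema
case class Record(COLUMN: String, OTHER: String)
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val ds = spark.read.table("TABLE").as[Record]
// equivalent of: SELECT * FROM TABLE WHERE COLUMN = ''
val filtered = ds.filter(_.COLUMN == "")
filtered.show(10)
Keep in mind that a Scala lambda inside filter is opaque to the Catalyst optimizer, so the Column/SQL versions can usually be optimized more aggressively; the typed version is generally not faster.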

How to parse a csv string into a Spark dataframe using scala?

I would like to convert a RDD containing records of strings, like below, to a Spark dataframe.
"Mike,2222-003330,NY,34"
"Kate,3333-544444,LA,32"
"Abby,4444-234324,MA,56"
....
The schema line is not inside the same RDD, but in another variable:
val header = "name,account,state,age"
So now my question is, how do I use the above two, to create a dataframe in Spark? I am using Spark version 2.2.
I did search and saw a post: Can I read a CSV represented as a string into Apache Spark using spark-csv. However, it's not exactly what I need and I can't figure out a way to modify that piece of code to work in my case.
Your help is greatly appreciated.
The easier way would probably be to start from the CSV file and read it directly as a dataframe (by specifying the schema). You can see an example here: Provide schema while reading csv file as a dataframe.
When the data already exists in an RDD you can use toDF() to convert to a dataframe. This function also accepts column names as input. To use this functionality, first import the spark implicits using the SparkSession object:
val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
Since the RDD contains strings it needs to first be converted to tuples representing the columns in the dataframe. In this case, this will be an RDD[(String, String, String, Int)] since there are four columns (the last age column is changed to Int to illustrate how it can be done).
Assuming the input data are in rdd:
val header = "name,account,state,age"
val df = rdd.map(row => row.split(","))
  .map { case Array(name, account, state, age) => (name, account, state, age.toInt) }
  .toDF(header.split(","): _*)
Resulting dataframe:
+----+-----------+-----+---+
|name| account|state|age|
+----+-----------+-----+---+
|Mike|2222-003330| NY| 34|
|Kate|3333-544444| LA| 32|
|Abby|4444-234324| MA| 56|
+----+-----------+-----+---+
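Since Spark 2.2 the CSV reader also accepts a Dataset[String] directly, so another possible sketch (reusing the spark session and implicits imported above, and assuming the records are in an RDD[String] named rdd) is to build the schema from the header and skip the manual tuple mapping:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
// schema matching the header "name,account,state,age" (age typed as integer)
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("account", StringType),
  StructField("state", StringType),
  StructField("age", IntegerType)))
// hand the raw CSV lines straight to the CSV reader
val df = spark.read.schema(schema).csv(rdd.toDS())
df.show()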

collect a list of Rows to a dataframe in Scala

I have a List[org.apache.spark.sql.Row]
which I got from a ListBuffer[org.apache.spark.sql.Row]()
How do I turn this into a dataframe?
Obviously .toDF (or converting to an RDD first) does not work...
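One possible sketch (assuming the rows share a known schema; the StructType and sample rows below are made up for illustration) is to pass the list together with an explicit schema to spark.createDataFrame, which has an overload that takes a java.util.List[Row]:
import scala.collection.JavaConverters._
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val spark = SparkSession.builder.getOrCreate()
// hypothetical schema matching the collected Rows
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))
val rows: List[Row] = List(Row("Mike", 34), Row("Kate", 32))
// createDataFrame accepts a java.util.List[Row] plus a StructType
val df = spark.createDataFrame(rows.asJava, schema)
df.show()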

Applying function to Spark Dataframe Column

Coming from R, I am used to easily doing operations on columns. Is there any easy way to take this function that I've written in scala
def round_tenths_place(un_rounded: Double): Double = {
  val rounded = BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble
  return rounded
}
And apply it to a one column of a dataframe - kind of what I hoped this would do:
bid_results.withColumn("bid_price_bucket", round_tenths_place(bid_results("bid_price")) )
I haven't found any easy way and am struggling to figure out how to do this. There's got to be an easier way than converting the dataframe to an RDD, then selecting from the RDD of rows to get the right field and mapping the function across all of the values, yeah? And also something more succinct than creating a SQL table and then doing this with a SparkSQL UDF?
You can define a UDF as follows:
import org.apache.spark.sql.functions.udf

val round_tenths_place_udf = udf(round_tenths_place _)
bid_results.withColumn("bid_price_bucket", round_tenths_place_udf($"bid_price"))
although the built-in round expression uses exactly the same logic as your function and should be more than enough, not to mention much more efficient:
import org.apache.spark.sql.functions.round
bid_results.withColumn("bid_price_bucket", round($"bid_price", 1))
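For reference, a small end-to-end sketch (the bid_results data here is made up) showing the UDF and the built-in round side by side:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{round, udf}
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
// hypothetical data standing in for bid_results
val bid_results = Seq(1.26, 2.34, 3.45).toDF("bid_price")
def round_tenths_place(un_rounded: Double): Double =
  BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble
val round_tenths_place_udf = udf(round_tenths_place _)
bid_results
  .withColumn("bid_price_bucket_udf", round_tenths_place_udf($"bid_price"))
  .withColumn("bid_price_bucket", round($"bid_price", 1))
  .show()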
See also the following:
Updating a dataframe column in spark
How to apply a function to a column of a Spark DataFrame?