Spark: groupBy with conditions - scala

I have a groupBy on a DataFrame which is based on 3 columns. I am doing something like this:
myDf.groupBy($"col1", $"col2", $"col3")
However, I am not sure how this works.
Does it ignore case? For each column I need "FOO" and "foo" to be considered the same, just like "" and null.
If this is not how it works by default, how can I add that behaviour? From the API doc I can see something with apply on a column, but I could not find any example.
Any idea?

You can call functions inside your groupBy statement. In this case it sounds like you will want to convert the strings to lower case when you group on them. Check out the lower function:
https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.sql.functions$
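For example, a minimal sketch (col1–col3 come from the question; the coalesce calls that fold null into "" are an assumption about the desired null handling, not part of the lower suggestion; it also assumes import spark.implicits._ for the $ syntax, as in the question):
import org.apache.spark.sql.functions.{coalesce, count, lit, lower}

// Normalize each key before grouping: lower-case the value and treat null as "".
val grouped = myDf.groupBy(
  lower(coalesce($"col1", lit(""))).as("col1"),
  lower(coalesce($"col2", lit(""))).as("col2"),
  lower(coalesce($"col3", lit(""))).as("col3")
).agg(count("*").as("cnt"))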

Related

How to avoid column names like 'sum(<column>)' in aggregation in Spark/Scala?

The aggregation
df.groupBy($"whatever").sum("A","B","C")
produces a DataFrame with column names like sum(A), sum(B) and sum(C). Often A, B and C are already the correct names for the final aggregates. Is there a way to avoid writing something like this:
df.groupBy($"whatever").agg(sum($"A").as("A"), sum($"B").as("B"), sum($"C").as("C"))
No, there is not.
You need to use an alias via .as, as you state yourself.
You can of course rename the columns afterwards. scala - how to substring column names after the last dot? provides good guidance here, using replaceAll on the column names.
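A minimal sketch of both options (A, B, C and "whatever" are taken from the question):
import org.apache.spark.sql.functions.sum

// Option 1: alias each aggregate explicitly via agg.
val agg1 = df.groupBy($"whatever")
  .agg(sum($"A").as("A"), sum($"B").as("B"), sum($"C").as("C"))

// Option 2: aggregate first, then strip the sum(...) wrapper from the column names.
val agg2 = df.groupBy($"whatever").sum("A", "B", "C")
val renamed = agg2.toDF(agg2.columns.map(_.replaceAll("""^sum\((.*)\)$""", "$1")): _*)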

Use NVL logic only on selected columns spark dataframe scala

I need to replace only the null values in selected columns of a dataframe. I know we have the df.na.fill option. How can we apply it only to selected columns, or is there a better option than df.na.fill?
Reading the Spark documentation, we can see that fill is well suited for your need. You can do something like:
df.na.fill(0, Seq("colA", "colB"))
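A slightly fuller sketch (colA, colB and colC are hypothetical column names); fill has overloads for a single value over selected columns and for a per-column map:
// Fill numeric nulls in colA/colB only, then string nulls in colC.
val filled = df.na.fill(0, Seq("colA", "colB"))
  .na.fill("unknown", Seq("colC"))

// Or one call with a per-column value map.
val filled2 = df.na.fill(Map("colA" -> 0, "colB" -> 0, "colC" -> "unknown"))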

How to create a DataFrame from an array in Scala?

I have a use case where I need to create a DataFrame from an array.
I've created a DataFrame that reads a CSV, and then I am using a map to process/transform it further.
var mapTransform = df1.collect.map { line =>
  // line.split(",") logic for field separation
  // transformation logic here for various fields
  (field1 + "," + field2 + "," + field3)
}
From this I get an array (Array[String]) which is the transformed result.
I want to further convert it into a DataFrame with separate columns, so that later it can be written to a DB or a file; however, I am facing an issue. Is it possible to do this? Any solutions?
This does your job:
spark.sparkContext.parallelize(mapTransform.toSeq)
But note that you should avoid methods that produce non-RDD results (such as collect), as they load all the contents of the array onto a single node, which is inefficient in the general case.
Also, by convention, turn vars into vals as much as possible.
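Note that parallelize only gives you an RDD[String]; to get a DataFrame with separate columns you still need to split and name the fields. A minimal sketch, assuming each element holds exactly three comma-separated fields and using hypothetical column names field1–field3:
import spark.implicits._  // needed for .toDF

val resultDf = spark.sparkContext
  .parallelize(mapTransform.toSeq)
  .map(_.split(",", -1))                               // -1 keeps trailing empty fields
  .map { case Array(f1, f2, f3) => (f1, f2, f3) }      // assumes exactly three fields per line
  .toDF("field1", "field2", "field3")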

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But none of the above is returning the value of column "name".
Spark version :2.2.0
Scala version :2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too large, it will cause the driver to fail.
So the alternative is to check a few items from the dataframe. What I generally do is
df.limit(10).select("name").as[String].collect()
This will give an output of 10 elements, but the output doesn't look good.
So, the 2nd alternative is
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if the column values are big, it puts "..." instead of the actual value, which is annoying.
Hence there is a third option:
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
Now, in all these cases you won't get a fair sample of the data, as only the first 10 rows are picked. To pick truly randomly from the dataframe you can use
df.select("name").sample(.2, true).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the "sample" function on dataframe
The first will do :)
val name = df.select("name") will return another DataFrame. You can, for example, do name.show() to show the content of the DataFrame. You can also do collect or collectAsMap to materialize results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.
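And if you only need a single value in a plain Scala variable, a minimal sketch (assuming "name" is a string column and the DataFrame is non-empty):
import spark.implicits._

// Take just the first value of the "name" column as a String.
val firstName: String = df.select("name").as[String].head()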

scala filter by column presence

I have many JSON string lines in many files. They are very similar in schema, but a few differ in some cases.
I made a DataFrame from them and want to see only the rows which have a specific column, something like
DF.filter("myColumn" is present).show
How can I do this?
You can use isNotNull in filter() (isNotNull is a method on Column, so no import from org.apache.spark.sql.functions is needed):
import spark.implicits._  // for the $"..." column syntax
df.filter($"myColumn".isNotNull)