Get the column names from a file in Spark Scala

I'm new to Spark/Scala. I have a file, say config, where I specify all the column names.
Config:
Id,
Emp_Name,
Dept,
Address,
Account
I have a dataframe where I select the column names like:
df.select("id","Emp_Name","Dept","Address","Account").show()
Instead of specifying the column names in select, I want to get the column names from the config file, like:
df.select(config-file_column_names).show()

You don't necessarily need the commas in your file if each column name is on a different line.
These are the definitions of select:
def select(col: String, cols: String*): DataFrame
def select(cols: org.apache.spark.sql.Column*): DataFrame
We are going to use the second definition here.
import org.apache.spark.sql.functions.col
// Read the file, strip the trailing commas, and turn each name into a Column
val colNames = sc.textFile("file").map(_.replaceAll(",", "")).map(col(_)).collect
// Unpacking the array in `select`
df.select(colNames: _*).show
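If the config file is small and sits on the local filesystem, a plain Scala read works just as well; a minimal sketch, assuming a local file named config.txt (the file name is illustrative, not from the question):
import scala.io.Source
import org.apache.spark.sql.functions.col

// Read one column name per line, dropping trailing commas and blank lines
val colNames = Source.fromFile("config.txt").getLines()
  .map(_.trim.stripSuffix(","))
  .filter(_.nonEmpty)
  .map(col(_))
  .toArray

df.select(colNames: _*).show()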

Related

List of columns for orderBy in spark dataframe

I have a list of variables that contains column names. I am trying to use that to call orderBy on a dataframe.
val l = List("COL1", "COL2")
df.orderBy(l.mkString(","))
But mkString combines the column names into one string, leading to this error:
org.apache.spark.sql.AnalysisException: cannot resolve '`COL1,COL2`' given input columns: [COL1, COL2, COL3, COL4];
How can I convert this list of strings into separate arguments so that it looks for "COL1", "COL2" instead of "COL1,COL2"?
Thanks,
You can call orderBy for a specific column:
import org.apache.spark.sql.functions._
df.orderBy(asc("COL1")) // df.orderBy(asc(l.headOption.getOrElse("COL1")))
// OR
df.orderBy(desc("COL1"))
If you want to sort by multiple columns, you can write something like this:
val l = List($"COL1", $"COL2".desc)
df.sort(l: _*)
Passing a single String argument tells Spark to sort the data frame using one column with the given name. There is a method that accepts multiple column names, and you can use it this way:
val l = List("COL1", "COL2")
df.orderBy(l.head, l.tail: _*)
If you care about the sort order (ascending vs. descending), use the Column version of orderBy instead:
val l = List($"COL1", $"COL2".desc)
df.orderBy(l: _*)
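If the column names only exist as plain strings at runtime, you can also build the Column list programmatically; a small sketch, assuming every column should be sorted in the same direction:
import org.apache.spark.sql.functions.col

val l = List("COL1", "COL2")

// Ascending on every column
df.orderBy(l.map(col(_)): _*)

// Descending on every column
df.orderBy(l.map(name => col(name).desc): _*)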

Converting Column of Dataframe to Seq[Columns] Scala

I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside agg should be a Seq[Column].
I then have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type, and I want to use the values of each row as input to agg, but I am not able to access them.
Any idea how to approach this?
If you can change the strings in the sequences column to be SQL expressions, then it is possible to solve. Spark provides a function expr that takes a SQL string and converts it into a Column. Example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to Seq[Column]s do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
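Putting it together, a self-contained sketch; the expression strings are taken from the example above, while the grouping column K, the dataframe df being aggregated, and the SparkSession value spark are assumptions for illustration:
import org.apache.spark.sql.functions.expr
import spark.implicits._ // for .toDF and .as[String]

// DataFrame holding the aggregation expressions as SQL strings
val exprDf = Seq(
  "sum(case when A like 2 then A end) as A",
  "count(B) as B"
).toDF("sequences")

// Collect the strings to the driver and turn each one into a Column
val aggCols = exprDf.as[String].collect().map(expr(_))

// agg takes one Column plus a varargs tail
df.groupBy("K").agg(aggCols.head, aggCols.tail: _*)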

How to join two dataframes in Scala and select on few columns from the dataframes by their index?

I have to join two dataframes, which is very similar to the task given here: Joining two DataFrames in Spark SQL and selecting columns of only one.
However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2.
I could not find anywhere how to select a dataframe's column by its numeric index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
  val df2cols = df2.columns
  val desiredDf2Col = df2cols(1) // the second column
  val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
    .select($"df1.*", $"df2.$desiredDf2Col")
  df3
}
And then I can apply this function in a reduce operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})
To select the second column in your dataframe you can simply do:
val df3 = df2.select(df2.columns(1))
This will first find the second column name and then select it.
If the join and select that you want to apply in the reduce function are similar to Joining two DataFrames in Spark SQL and selecting columns of only one, then you should do the following:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)
You will have to remember that the name of the column selected from d2 (index 1 in Seq(1), i.e. its second column) should not be the same as any of the other dataframe's column names, otherwise the reference becomes ambiguous.
You can select multiple columns as well, but remember the note above:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)

Making an SQL request on columns containing a dot

I have a dataframe whose column names contain ".".
I would like to filter the columns to get the column names containing "." and then select them. Any help will be appreciated.
Here is the dataset:
//dataset
time.1,col.1,col.2,col.3
2015-12-06 12:40:00,2,2,3
2015-12-07 12:41:35,3,3,4
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df1 = spark.read.option("inferSchema", "true").option("header", "true").csv("C:/Users/mhattabi/Desktop/dataTestCsvFile/dataTest2.txt")
val columnContainingdots=df1.schema.fieldNames.filter(p=>p.contains('.'))
df1.select(columnContainingdots)
Having a dot in column names requires you to enclose the names in backticks (`). See the code below; this should serve your purpose.
val columnContainingDots = df1.schema.fieldNames.collect({
  // since the column names contain the "." character, we must enclose them in backticks,
  // otherwise the dataframe select will throw an exception
  case column if column.contains('.') => s"`${column}`"
})
df1.select(columnContainingDots.head, columnContainingDots.tail:_*)
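An alternative sketch (not part of the answer above): rename the dotted columns once, replacing "." with "_", so that later selects do not need backticks at all:
// Replace dots in every column name so the rest of the pipeline needs no backticks
val renamed = df1.columns.foldLeft(df1) { (df, c) =>
  df.withColumnRenamed(c, c.replace('.', '_'))
}
renamed.select("time_1", "col_1").show()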

Is it possible to alias columns programmatically in spark sql?

In spark SQL (perhaps only HiveQL) one can do:
select sex, avg(age) as avg_age
from humans
group by sex
which would result in a DataFrame with columns named "sex" and "avg_age".
How can avg(age) be aliased to "avg_age" without using textual SQL?
Edit:
After zero323 's answer, I need to add the constraint that:
The column-to-be-renamed's name may not be known/guaranteed or even addressable. In textual SQL, using "select EXPR as NAME" removes the requirement to have an intermediate name for EXPR. This is also the case in the example above, where "avg(age)" could get a variety of auto-generated names (which also vary among spark releases and sql-context backends).
Let's suppose human_df is the DataFrame for humans. Since Spark 1.3:
human_df.groupBy("sex").agg(avg("age").alias("avg_age"))
If you prefer to rename a single column it is possible to use withColumnRenamed method:
case class Person(name: String, age: Int)
val df = sqlContext.createDataFrame(
  Person("Alice", 2) :: Person("Bob", 5) :: Nil)
df.withColumnRenamed("name", "first_name")
Alternatively you can use alias method:
import org.apache.spark.sql.functions.avg
df.select(avg($"age").alias("average_age"))
You can take it further with a small helper:
import org.apache.spark.sql.Column
def normalizeName(c: Column) = {
  val pattern = "\\W+".r
  c.alias(pattern.replaceAllIn(c.toString, "_"))
}
df.select(normalizeName(avg($"age")))
Turns out def toDF(colNames: String*): DataFrame does exactly that. Pasting from 2.11.7 documentation:
def toDF(colNames: String*): DataFrame
Returns a new DataFrame with columns renamed. This can be quite
convenient in conversion from a RDD of tuples into a DataFrame
with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame
// with column name _1 and _2
rdd.toDF("id", "name") // this creates a DataFrame with
// column name "id" and "name"
Anonymous columns, such as the one that would be generated by avg(age) without AS avg_age, get automatically assigned names. As you point out in your question, the names are implementation-specific, generated by a naming strategy. If needed, you could write code that sniffs the environment and instantiates an appropriate discovery & renaming strategy based on the specific naming strategy. There are not many of them.
In Spark 1.4.1 with HiveContext, the format is "_cN" where N is the position of the anonymous column in the table. In your case, the name would be _c1.
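If all you know is the position of the anonymous column, a hedged alternative is to rename it by index rather than by whatever name the backend generated; a short sketch using the example above:
import org.apache.spark.sql.functions.avg

val grouped = human_df.groupBy("sex").agg(avg("age"))

// Rename the second column (index 1) regardless of its auto-generated name
val renamed = grouped.withColumnRenamed(grouped.columns(1), "avg_age")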