Making SQL requests on columns containing a dot - Scala

I have a dataframe whose column names contain ".". I would like to filter the column names to those containing "." and then run a select on them. Any help will be appreciated.
Here is the dataset:
//dataset
time.1,col.1,col.2,col.3
2015-12-06 12:40:00,2,2,3
2015-12-07 12:41:35,3,3,4
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df1 = spark.read.option("inferSchema", "true").option("header", "true").csv("C:/Users/mhattabi/Desktop/dataTestCsvFile/dataTest2.txt")
val columnContainingdots=df1.schema.fieldNames.filter(p=>p.contains('.'))
df1.select(columnContainingdots)

Having dots in column names requires you to enclose the names in the "`" character. See the code below; this should serve your purpose.
val columnContainingDots = df1.schema.fieldNames.collect({
  // since the column names contain the "." character, we must enclose them in "`",
  // otherwise the dataframe select will throw an exception
  case column if column.contains('.') => s"`${column}`"
})
df1.select(columnContainingDots.head, columnContainingDots.tail:_*)
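With the sample dataset above, the select should produce something like the following (a sketch; the output is inferred from the data, not re-run):
df1.select(columnContainingDots.head, columnContainingDots.tail: _*).show()
// +-------------------+-----+-----+-----+
// |             time.1|col.1|col.2|col.3|
// +-------------------+-----+-----+-----+
// |2015-12-06 12:40:00|    2|    2|    3|
// |2015-12-07 12:41:35|    3|    3|    4|
// +-------------------+-----+-----+-----+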

Related

List of columns for orderBy in spark dataframe

I have a list of variables that contains column names. I am trying to use that to call orderBy on a dataframe.
val l = List("COL1", "COL2")
df.orderBy(l.mkString(","))
But mkString combines the column names into one string, leading to this error -
org.apache.spark.sql.AnalysisException: cannot resolve '`COL1,COL2`' given input columns: [COL1, COL2, COL3, COL4];
How can I pass this list of strings as separate column names, so that Spark looks for "COL1", "COL2" instead of "COL1,COL2"?
Thanks,
You can call orderBy for a specific column:
import org.apache.spark.sql.functions._
df.orderBy(asc("COL1")) // df.orderBy(asc(l.headOption.getOrElse("COL1")))
// OR
df.orderBy(desc("COL1"))
If you want to sort by multiple columns you can write something like this:
val l = List($"COL1", $"COL2".desc)
df.sort(l: _*)
Passing a single String argument tells Spark to sort the data frame using one column with the given name. There is a method that accepts multiple column names, and you can use it this way:
val l = List("COL1", "COL2")
df.orderBy(l.head, l.tail: _*)
If you care about the sort order of each column (ascending vs. descending), use the Column version of orderBy instead:
val l = List($"COL1", $"COL2".desc)
df.orderBy(l: _*)
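If the names and directions start out as plain data, here is a minimal sketch of building the Column list dynamically; the (name, ascending) pairs are made up for illustration:
import org.apache.spark.sql.functions.col

// hypothetical input: (column name, sort ascending?) pairs
val sortSpec = List(("COL1", true), ("COL2", false))
val sortCols = sortSpec.map { case (name, asc) =>
  if (asc) col(name).asc else col(name).desc
}
df.orderBy(sortCols: _*)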

scala - how to substring column names after the last dot?

After exploding a nested structure I have a DataFrame with column names like this:
sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3
When performing a select I'm getting the error:
cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]
How should I select from the DataFrame so the column names are parsed correctly?
I've tried the following: the substrings after the dots are extracted successfully, but since I also have columns without dots, like date, their names get removed completely.
import org.apache.commons.lang3.StringUtils

var salesDf_new = salesDf
for (col <- salesDf.columns) {
  salesDf_new = salesDf_new.withColumnRenamed(col, StringUtils.substringAfterLast(col, "."))
}
I want to leave just metric1, metric2, metric3
You can use backticks to select columns whose names include periods.
val df = (1 to 1000).toDF("column.a.b")
df.printSchema
// root
// |-- column.a.b: integer (nullable = false)
df.select("`column.a.b`")
Also, you can rename them easily like this. Basically starting with your current DataFrame, keep updating it with a new column name for each field and return the final result.
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replace(".", "_"))
)
EDIT: Get the last component
To rename with just the last name component, this regex will work:
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(".+\\.([^.]+)$", "$1"))
)
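As a quick sanity check on plain strings (no DataFrame needed), the regex leaves names without dots untouched, which covers the date case from the question:
val names = Seq("sales_data.metric1", "sales_data.type.metric2", "date")
names.map(_.replaceAll(".+\\.([^.]+)$", "$1"))
// => List(metric1, metric2, date) -- "date" has no dot, so the regex does not match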
EDIT 2: Get the last two components
This is a little more complicated, and there might be a cleaner way to write this, but here is a way that works:
val pattern = (
  ".*?" +         // lazily match leading chars so we ignore the bits we don't want
  "([^.]+\\.)?" + // optional 2nd-to-last group
  "([^.]+)$"      // last group
)
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(pattern, "$1$2"))
)
df2.printSchema

Converting Column of Dataframe to Seq[Columns] Scala

I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside the agg should be a Seq[Column].
I then have a dataframe containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type, and I want to use the value in each row as input to the agg, but I am not able to access them.
Any idea how to do this?
If you can change the strings in the sequences column to be valid SQL expressions, then this is possible to solve. Spark provides a function expr that takes a SQL string and converts it into a Column. Example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq[Column], do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
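Putting the pieces together, a minimal end-to-end sketch; the column names A and B and the grouping key are assumptions for illustration:
import org.apache.spark.sql.functions.expr
import spark.implicits._

val data = Seq((1, "x"), (2, "y"), (2, "z")).toDF("A", "B")
val df2 = Seq("count(B) as B_count", "sum(A) as A_sum").toDF("sequences")

// collect the strings, then turn each one into a Column with expr
val seqs = df2.as[String].collect().map(expr(_))
data.groupBy($"A").agg(seqs.head, seqs.tail: _*).show()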

Automatically apply backquotes to an Array[Column] - Spark

Hello guys, I have an Array[Column] where the column names include the "." character. I know that using backquotes (``) solves the issue of having ".". How can I automatically add backquotes to columToKeep in the select command?
val df = spark.read.option("header",true).option("inferSchema","false").csv("C:/data.csv")
val columToKeep = df.columns.map(c => stddev(c).as(c))
val new_Data = df.select(columToKeep: _*) // issue here because the name contains "."
Row.Number,Poids,Age,Taille,0M.I,Hmean,Cooc.Param,Ldp.Param,Test.2,Classe.2
0,87,72,160,5,0.6993,2.9421,2.3745,3,4
1,54,70,163,5,0.6301,2.7273,2.2205,3,4
2,72,51,164,5,0.6551,2.9834,2.3993,3,4
3,75,74,170,5,0.6966,2.9654,2.3699,3,4
I want to keep the columns whose value is constant. Expected output:
OM.I,Test.2,Classe.2
5,3,4
5,3,4
5,3,4
5,3,4
Thanks
This will do the trick: build each Column with the name already wrapped in backquotes, so that resolution succeeds.
val columToKeep = df.columns.map(c => stddev(s"`$c`").as(c))
val new_Data = df.select(columToKeep: _*)
Though, I did not get the purpose of stddev here.
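If the intent is the expected output above (keep only the columns whose values are constant), here is a hedged sketch of one way to do it, assuming every column is numeric so that a constant column has a standard deviation of 0.0:
import org.apache.spark.sql.functions.{col, stddev}

// one row holding the stddev of every column, aliased back to the original names
val stats = df.select(df.columns.map(c => stddev(col(s"`$c`")).as(c)): _*).first()
// keep the columns whose stddev is 0.0, i.e. the constant ones
val constantCols = df.columns.filter(c => stats.getAs[Double](c) == 0.0)
df.select(constantCols.map(c => col(s"`$c`")): _*).show()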

Applying a function (mkString) to an entire column in Spark dataframe, error if column name has "."

I'm attempting to apply a function over a column of a Spark dataframe in Scala. The column is a String type, and I'd like to concatenate each token in the string with an "_" delimiter (e.g. "A B" --> "A_B"). I'm doing this with:
val converter: (String => String) = (arg: String) => {arg.split(" ").mkString("_")}
val myUDF = udf(converter)
val newDF = oldDF
  .withColumn("TEST", myUDF(oldDF("colA.B")))
display(newDF)
This works for columns in the dataframe with names without a dot ("."). However, the dot in the column name "colA.B" seems to be breaking the code and throws the error:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "colA.B" among (colA.B, col1, col2);
I suppose a workaround would be to rename the column (similar to this), but I'd prefer not to do this.
You can try with backquotes, like in the example below (source):
val df = sqlContext.createDataFrame(Seq(
  ("user1", "task1"),
  ("user2", "task2")
)).toDF("user", "user.task")

df.select(df("user"), df("`user.task`")).show()
+-----+---------+
| user|user.task|
+-----+---------+
|user1|    task1|
|user2|    task2|
+-----+---------+
In your case, you need to backquote such columns before applying the function.
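Applied to the snippet from the question, that would look something like this (an untested sketch of the same code with backquotes added):
val newDF = oldDF.withColumn("TEST", myUDF(oldDF("`colA.B`")))
display(newDF)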