Automatically apply backquotes to an Array[Column] in Spark (Scala)

Hello guys, I have an Array[Column] where the column names include the "." character. I know that using backquotes `` solves the issue of having "." in a name. How can I automatically add backquotes to columToKeep in the select command?
val df = spark.read.option("header", true).option("inferSchema", "false").csv("C:/data.csv")
val columToKeep = df.columns.map(c => stddev(c).as(c))
val new_Data = df.select(columToKeep: _*) // issue here because the names contain "."
Row.Number,Poids,Age,Taille,0M.I,Hmean,Cooc.Param,Ldp.Param,Test.2,Classe.2
0,87,72,160,5,0.6993,2.9421,2.3745,3,4
1,54,70,163,5,0.6301,2.7273,2.2205,3,4
2,72,51,164,5,0.6551,2.9834,2.3993,3,4
3,75,74,170,5,0.6966,2.9654,2.3699,3,4
I want to keep only the columns with a constant value.
Expected output:
OM.I,Test.2,Classe.2
5,3,4
5,3,4
5,3,4
5,3,4
Thanks

This will do the trick: backquote each name before it goes into stddev, and keep the plain name as the alias.
val columToKeep = df.columns.map(c => stddev(s"`$c`").as(c))
val new_Data = df.select(columToKeep: _*)
Though, I did not get the purpose of stddev here.
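Presumably the stddev is there to detect the constant columns shown in the expected output. A minimal sketch that goes all the way: compute the stddev of every column in one pass, collect the single resulting row, and keep only the columns whose stddev is 0. This assumes the values are numeric; since the question reads with inferSchema set to "false", Spark has to implicitly cast the strings for the aggregation.

import org.apache.spark.sql.functions.stddev

val stats = df.select(df.columns.map(c => stddev(s"`$c`").as(c)): _*).first()
// a column is constant exactly when its standard deviation is 0
val constantCols = df.columns.zipWithIndex.collect {
  case (name, i) if !stats.isNullAt(i) && stats.getDouble(i) == 0.0 => s"`$name`"
}
val new_Data = df.select(constantCols.head, constantCols.tail: _*)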

Related

scala - how to substring column names after the last dot?

After exploding a nested structure I have a DataFrame with column names like this:
sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3
When performing a select I'm getting the error:
cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]
How should I select from the DataFrame so the column names are parsed correctly?
I've tried the following: the substrings after the dots are extracted successfully, but since I also have columns without dots, like date, their names get removed completely.
import org.apache.commons.lang3.StringUtils

var salesDf_new = salesDf
for (col <- salesDf.columns) {
  // substringAfterLast returns "" when the separator is absent, which wipes out names like "date"
  salesDf_new = salesDf_new.withColumnRenamed(col, StringUtils.substringAfterLast(col, "."))
}
I want to leave just metric1, metric2, metric3
You can use backticks to select columns whose names include periods.
import spark.implicits._ // needed for toDF outside the shell

val df = (1 to 1000).toDF("column.a.b")
df.printSchema
// root
//  |-- column.a.b: integer (nullable = false)
df.select("`column.a.b`")
Also, you can rename them easily like this. Basically starting with your current DataFrame, keep updating it with a new column name for each field and return the final result.
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replace(".", "_"))
)
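With the df defined above, the renamed schema should then look like:

df2.printSchema
// root
//  |-- column_a_b: integer (nullable = false)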
EDIT: Get the last component
To rename with just the last name component, this regex will work:
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(".+\\.([^.]+)$", "$1"))
)
EDIT 2: Get the last two components
This is a little more complicated, and there might be a cleaner way to write this, but here is a way that works:
val pattern = (
  ".*?" +         // lazily match the leading chars, i.e. the bits we don't want
  "([^.]+\\.)?" + // optional second-to-last group
  "([^.]+)$"      // last group
)
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(pattern, "$1$2"))
)
df2.printSchema
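As a quick sanity check (a sketch reusing the sample names from the question plus a dot-free date column), note that replaceAll returns its input unchanged when the pattern does not match, which is why dot-free names survive:

Seq("sales_data.metric1", "sales_data.type.metric2", "date").foreach { name =>
  println(name.replaceAll(".+\\.([^.]+)$", "$1")) // prints metric1, metric2, date
}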

Difference between val df = List(("amit","porwal")) and val df = List("amit","porwal")

What is the difference between
val df = List(("amit","porwal"))
and
val df = List("amit","porwal")
My question is how the extra parentheses make a difference, because on doing
scala > val df = List(("amit","porwal")).toDF("fname","lname")
it works, but on doing
scala > val df = List("amit","porwal").toDF("fname","lname")
scala throws me an error
java.lang.IllegalArgumentException: requirement failed:
The number of columns doesn't match.
Old column names (1): value
New column names (2): fname, lname
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.Dataset.toDF(Dataset.scala:393)
at org.apache.spark.sql.DatasetHolder.toDF(DatasetHolder.scala:44)
... 48 elided
Yes, they are different. The inner parentheses are treated as a tuple by the Scala compiler. Since there are two string values inside the nested brackets in your first example, it is treated as a Tuple2[String, String], while in the second example the string values inside the List are treated as separate String elements.
The first one, val df = List(("amit","porwal")), is a List[(String, String)]. There is only one element in df, and to get "porwal" you have to do df(0)._2.
And,
The second one, val df = List("amit","porwal"), is a List[String]. There are two elements in df, and to get "porwal" you have to do df(1).
Even though the question is not really related to Spark:
val df = List(("amit","porwal"))
Here df is a list of Tuple2, i.e. List[(String, String)]. To get the value "amit" you should use df(0)._1, and for "porwal", df(0)._2.
val df = List("amit","porwal")
Here df is simply a list of Strings, i.e. List[String]. In the List[String] case you can simply get the values as df(0) and df(1).
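A minimal sketch of how the two shapes behave with toDF (assuming a SparkSession named spark is in scope):

import spark.implicits._

// One row, two columns: the tuple supplies both fields
val df1 = List(("amit", "porwal")).toDF("fname", "lname")
df1.show()
// +-----+------+
// |fname| lname|
// +-----+------+
// | amit|porwal|
// +-----+------+

// Two rows, one column: each String is its own row
val df2 = List("amit", "porwal").toDF("name")
df2.show()
// +------+
// |  name|
// +------+
// |  amit|
// |porwal|
// +------+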
Hope this helps!

Making SQL requests on columns containing dots

I have a dataframe whose column names contain ".".
I would like to filter the columns to get the names containing "." and then run a select on them. Any help will be appreciated.
Here is the dataset:
//dataset
time.1,col.1,col.2,col.3
2015-12-06 12:40:00,2,2,3
2015-12-07 12:41:35,3,3,4
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df1 = spark.read.option("inferSchema", "true").option("header", "true")
  .csv("C:/Users/mhattabi/Desktop/dataTestCsvFile/dataTest2.txt")
val columnContainingdots = df1.schema.fieldNames.filter(p => p.contains('.'))
df1.select(columnContainingdots.head, columnContainingdots.tail: _*) // fails: the dots are read as struct field accessors
Having dots in the column names requires you to enclose the names in the "`" character. See the code below; this should serve your purpose.
val columnContainingDots = df1.schema.fieldNames.collect({
  // since the column names contain the "." character, we must enclose them in "`",
  // otherwise the dataframe select will throw an exception
  case column if column.contains('.') => s"`${column}`"
})
df1.select(columnContainingDots.head, columnContainingDots.tail:_*)
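Equivalently (a sketch), you can map the backquoted names to Column objects instead of using the head/tail string form; col parses the backquotes the same way:

import org.apache.spark.sql.functions.col

df1.select(columnContainingDots.map(c => col(c)): _*)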

Applying a function (mkString) to an entire column in Spark dataframe, error if column name has "."

I'm attempting to apply a function over a column of a Spark dataframe in Scala. The column is a String type, and I'd like to concatenate each token in the string with an "_" delimiter (e.g. "A B" --> "A_B"). I'm doing this with:
val converter: (String => String) = (arg: String) => { arg.split(" ").mkString("_") }
val myUDF = udf(converter)
val newDF = oldDF
  .withColumn("TEST", myUDF(oldDF("colA.B")))
display(newDF)
This works for dataframe columns whose names do not contain a dot ("."). However, the dot in the column name "colA.B" seems to break the code, throwing the error:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "colA.B" among (colA.B, col1, col2);
I suppose a work around would be to rename the column (similar to this), but I'd prefer not to do this.
You can try back quotes, as in the example below (source):
val df = sqlContext.createDataFrame(Seq(
  ("user1", "task1"),
  ("user2", "task2")
)).toDF("user", "user.task")

df.select(df("user"), df("`user.task`")).show()
+-----+---------+
| user|user.task|
+-----+---------+
|user1|    task1|
|user2|    task2|
+-----+---------+
In your case, you need to back quote the column before applying the function.
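Applied to the snippet from the question, only the column reference changes:

val newDF = oldDF.withColumn("TEST", myUDF(oldDF("`colA.B`")))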

Scala Spark DataFrame : dataFrame.select multiple columns given a Sequence of column names

val columnName = Seq("col1", "col2", ..., "coln")
Is there a way to do a dataframe.select operation to get a dataframe containing only the columns specified?
I know I can do dataframe.select("col1","col2"...), but columnName is generated at runtime.
I could call dataframe.select() repeatedly for each column name in a loop. Will that have any performance overhead? Is there a simpler way to accomplish this?
val columnNames = Seq("col1", "col2", ..., "coln")
// using the string column names:
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
// or, equivalently, using Column objects:
val result = dataframe.select(columnNames.map(c => col(c)): _*)
Since dataFrame.select() expects a sequence of Column objects and we have a sequence of strings, we need to map each name to a Column and pass the result as varargs. columnName.map(name => col(name)): _* produces a sequence of columns from the sequence of strings, and this can be passed as a parameter to select():
val columnName = Seq("col1", "col2")
val DFFiltered = DF.select(columnName.map(name => col(name)): _*)
Alternatively, you can also write it like this:
val columnName = Seq("col1", "col2")
val DFFiltered = DF.select(columnName.map(DF(_)): _*)
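A minimal runnable sketch of the select-by-name pattern, assuming a SparkSession named spark and hypothetical column names:

import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq((1, "a", true), (2, "b", false)).toDF("col1", "col2", "col3")
val wanted = Seq("col1", "col3") // generated at runtime in the real use case

df.select(wanted.map(col): _*).show()
// +----+-----+
// |col1| col3|
// +----+-----+
// |   1| true|
// |   2|false|
// +----+-----+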