Making SQL requests on columns containing a dot - Scala

I have a dataframe whose column names contain ".". I would like to filter the column names to those containing "." and then run a select on them. Any help will be appreciated.
Here is the dataset:
//dataset
time.1,col.1,col.2,col.3
2015-12-06 12:40:00,2,2,3
2015-12-07 12:41:35,3,3,4
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df1 = spark.read.option("inferSchema", "true").option("header", "true").csv("C:/Users/mhattabi/Desktop/dataTestCsvFile/dataTest2.txt")
val columnContainingdots=df1.schema.fieldNames.filter(p=>p.contains('.'))
df1.select(columnContainingdots)

Having dots in column names requires you to enclose the names in the "`" character. See the code below; this should serve your purpose.
val columnContainingDots = df1.schema.fieldNames.collect({
  // since the column names contain the "." character, we must enclose them in "`",
  // otherwise the dataframe select will throw an exception
  case column if column.contains('.') => s"`${column}`"
})
df1.select(columnContainingDots.head, columnContainingDots.tail:_*)
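With the sample dataset above, the select should produce something like the following (a sketch; the output is inferred from the data, not re-run):
df1.select(columnContainingDots.head, columnContainingDots.tail: _*).show()
// +-------------------+-----+-----+-----+
// |             time.1|col.1|col.2|col.3|
// +-------------------+-----+-----+-----+
// |2015-12-06 12:40:00|    2|    2|    3|
// |2015-12-07 12:41:35|    3|    3|    4|
// +-------------------+-----+-----+-----+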

Related

List of columns for orderBy in spark dataframe

I have a list of variables that contains column names. I am trying to use that to call orderBy on a dataframe.
val l = List("COL1", "COL2")
df.orderBy(l.mkString(","))
But mkString combines the column names into one string, leading to this error -
org.apache.spark.sql.AnalysisException: cannot resolve '`COL1,COL2`' given input columns: [COL1, COL2, COL3, COL4];
How can I pass this list of strings as separate column names, so that Spark looks for "COL1", "COL2" instead of "COL1,COL2"?
Thanks,
You can call orderBy for a specific column:
import org.apache.spark.sql.functions._
df.orderBy(asc("COL1")) // df.orderBy(asc(l.headOption.getOrElse("COL1")))
// OR
df.orderBy(desc("COL1"))
If you want to sort by multiple columns you can write something like this:
val l = List($"COL1", $"COL2".desc)
df.sort(l: _*)
Passing a single String argument tells Spark to sort the data frame using one column with the given name. There is a method that accepts multiple column names, and you can use it this way:
val l = List("COL1", "COL2")
df.orderBy(l.head, l.tail: _*)
If you care about the sort order of each column (ascending vs. descending), use the Column version of orderBy instead:
val l = List($"COL1", $"COL2".desc)
df.orderBy(l: _*)
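If the names and directions start out as plain data, here is a minimal sketch of building the Column list dynamically; the (name, ascending) pairs are made up for illustration:
import org.apache.spark.sql.functions.col

// hypothetical input: (column name, sort ascending?) pairs
val sortSpec = List(("COL1", true), ("COL2", false))
val sortCols = sortSpec.map { case (name, asc) =>
  if (asc) col(name).asc else col(name).desc
}
df.orderBy(sortCols: _*)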

scala - how to substring column names after the last dot?

After exploding a nested structure I have a DataFrame with column names like this:
sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3
When performing a select I'm getting the error:
cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]
How should I select from the DataFrame so the column names are parsed correctly?
I've tried the following: the substrings after the dots are extracted successfully, but since I also have columns without dots, like date, their names get removed completely.
import org.apache.commons.lang3.StringUtils

var salesDf_new = salesDf
for (col <- salesDf.columns) {
  salesDf_new = salesDf_new.withColumnRenamed(col, StringUtils.substringAfterLast(col, "."))
}
I want to leave just metric1, metric2, metric3
You can use backticks to select columns whose names include periods.
val df = (1 to 1000).toDF("column.a.b")
df.printSchema
// root
// |-- column.a.b: integer (nullable = false)
df.select("`column.a.b`")
Also, you can rename them easily like this. Basically starting with your current DataFrame, keep updating it with a new column name for each field and return the final result.
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replace(".", "_"))
)
EDIT: Get the last component
To rename with just the last name component, this regex will work:
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(".+\\.([^.]+)$", "$1"))
)
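As a quick sanity check on plain strings (no DataFrame needed), the regex leaves names without dots untouched, which covers the date case from the question:
val names = Seq("sales_data.metric1", "sales_data.type.metric2", "date")
names.map(_.replaceAll(".+\\.([^.]+)$", "$1"))
// => List(metric1, metric2, date) -- "date" has no dot, so the regex does not match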
EDIT 2: Get the last two components
This is a little more complicated, and there might be a cleaner way to write this, but here is a way that works:
val pattern = (
  ".*?" +         // lazily match leading chars so we ignore the bits we don't want
  "([^.]+\\.)?" + // optional 2nd-to-last group
  "([^.]+)$"      // last group
)
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(pattern, "$1$2"))
)
df2.printSchema

Converting Column of Dataframe to Seq[Columns] Scala

I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside the agg should be a Seq[Column].
I then have a dataframe containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type, and I want to use the value in each row as input to the agg, but I am not able to access them.
Any idea how to do this?
If you can change the strings in the sequences column to be valid SQL expressions, then this is possible to solve. Spark provides a function expr that takes a SQL string and converts it into a Column. Example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq[Column], do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
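Putting the pieces together, a minimal end-to-end sketch; the column names A and B and the grouping key are assumptions for illustration:
import org.apache.spark.sql.functions.expr
import spark.implicits._

val data = Seq((1, "x"), (2, "y"), (2, "z")).toDF("A", "B")
val df2 = Seq("count(B) as B_count", "sum(A) as A_sum").toDF("sequences")

// collect the strings, then turn each one into a Column with expr
val seqs = df2.as[String].collect().map(expr(_))
data.groupBy($"A").agg(seqs.head, seqs.tail: _*).show()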

Automatically apply backquotes to an Array[Column] - Spark

Hello guys, I have an Array[Column] where the column names include the "." character. I know that using backquotes (``) solves the issue of having ".". How can I automatically add backquotes to columToKeep in the select command?
val df = spark.read.option("header",true).option("inferSchema","false").csv("C:/data.csv")
val columToKeep = df.columns.map(c => stddev(c).as(c))
val new_Data = df.select(columToKeep: _*) // issue here because the name contains "."
Row.Number,Poids,Age,Taille,0M.I,Hmean,Cooc.Param,Ldp.Param,Test.2,Classe.2
0,87,72,160,5,0.6993,2.9421,2.3745,3,4
1,54,70,163,5,0.6301,2.7273,2.2205,3,4
2,72,51,164,5,0.6551,2.9834,2.3993,3,4
3,75,74,170,5,0.6966,2.9654,2.3699,3,4
I want to keep the columns whose value is constant. Expected output:
OM.I,Test.2,Classe.2
5,3,4
5,3,4
5,3,4
5,3,4
Thanks
This will do the trick: build each Column with the name already wrapped in backquotes, so that resolution succeeds.
val columToKeep = df.columns.map(c => stddev(s"`$c`").as(c))
val new_Data = df.select(columToKeep: _*)
Though, I did not get the purpose of stddev here.
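If the intent is the expected output above (keep only the columns whose values are constant), here is a hedged sketch of one way to do it, assuming every column is numeric so that a constant column has a standard deviation of 0.0:
import org.apache.spark.sql.functions.{col, stddev}

// one row holding the stddev of every column, aliased back to the original names
val stats = df.select(df.columns.map(c => stddev(col(s"`$c`")).as(c)): _*).first()
// keep the columns whose stddev is 0.0, i.e. the constant ones
val constantCols = df.columns.filter(c => stats.getAs[Double](c) == 0.0)
df.select(constantCols.map(c => col(s"`$c`")): _*).show()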

Applying a function (mkString) to an entire column in Spark dataframe, error if column name has "."

I'm attempting to apply a function over a column of a Spark dataframe in Scala. The column is a String type, and I'd like to concatenate each token in the string with an "_" delimiter (e.g. "A B" --> "A_B"). I'm doing this with:
val converter: (String => String) = (arg: String) => {arg.split(" ").mkString("_")}
val myUDF = udf(converter)
val newDF = oldDF
  .withColumn("TEST", myUDF(oldDF("colA.B")))
display(newDF)
This works for columns in the dataframe with names without a dot ("."). However, the dot in the column name "colA.B" seems to be breaking the code and throws the error:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "colA.B" among (colA.B, col1, col2);
I suppose a workaround would be to rename the column (similar to this), but I'd prefer not to do this.
You can try with backquotes, like in the example below (source):
val df = sqlContext.createDataFrame(Seq(
  ("user1", "task1"),
  ("user2", "task2")
)).toDF("user", "user.task")

df.select(df("user"), df("`user.task`")).show()
+-----+---------+
| user|user.task|
+-----+---------+
|user1|    task1|
|user2|    task2|
+-----+---------+
In your case, you need to backquote such columns before applying the function.
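Applied to the snippet from the question, that would look something like this (an untested sketch of the same code with backquotes added):
val newDF = oldDF.withColumn("TEST", myUDF(oldDF("`colA.B`")))
display(newDF)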