Display column names into a List[Column] in Scala - scala

I want to insert the list of columns from a DataFrame into a List[Column] so I can perform a select request. That is, I want to get the list of column names and insert them automatically into a List[Column]. Any help? Thanks
import org.apache.spark.sql.{Column, SparkSession}

object PCA extends App {
  val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
  val strPath = "C:/Users/mhattabi/Desktop/testBis2.txt"
  val initial_Data = spark.read.option("header", true).csv(strPath)
  // array of strings containing the column names
  val arrayList = initial_Data.columns
  var colsList = List[Column]()
  // I want to insert the column names into colsList
  arrayList.foreach(p => colsList.)
  // I want to have something like
  // val colsList = List(col("col1"), col("col2"))
  // initial_Data.select(colsList: _*).show
}

You could use the col function as follows:
var colsList = List[Column]()
arrayList.foreach { c => colsList :+= col(c) }
Remember to import the SQL functions to use col:
import org.apache.spark.sql.functions._

I would rather build an immutable list with a transformation than mutate a var, like below.
val arrayList = initial_Data.columns
val colsList = arrayList.map(col).toList
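Putting it together, a minimal usage sketch of building the column list and passing it to select (assuming initial_Data from the question):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

val colsList: Seq[Column] = initial_Data.columns.map(col).toSeq
initial_Data.select(colsList: _*).show()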

Related

How to use a custom function in a query on a Spark DataFrame using Scala

I load data from a database into a Spark DataFrame named DF, then I must extract some records from the DataFrame whose ID satisfies a special condition. So, I define this function:
def hash_id(id: String): Int = {
  val two_char = id.takeRight(2).toInt
  val hash_result = two_char % 4
  return hash_result
}
Then, I use the function in this query:
DF.filter(hash_id("ID")===3)
But I receive this error:
value === is not a member of Int
DF has an ID column.
Would you please guide me how to use a custom function in where/filter clause?
Any help would be really appreciated.
=== can only be used between Column objects. That's why you get the error value === is not a member of Int: the return type of your function hash_id is an Int, not a Column.
To be able to use your function, you should convert it to a user-defined function and apply it to a column object as follows:
import org.apache.spark.sql.functions.{col, udf}

def hash_id(id: String): Int = {
  val two_char = id.takeRight(2).toInt
  val hash_result = two_char % 4
  return hash_result
}

val hash_id_udf = udf((id: String) => hash_id(id))

DF.filter(hash_id_udf(col("ID")) === 3)
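As a side note, a UDF is not strictly required for this particular hash: the same condition can be expressed with built-in column functions, which lets Catalyst optimize the filter. A sketch, assuming ID is a string column whose last two characters are digits:
import org.apache.spark.sql.functions.{col, substring}

// take the last two characters, cast them to int, and apply the modulo condition
DF.filter(substring(col("ID"), -2, 2).cast("int") % 4 === 3)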

Dynamic conversion of Array of double columns into multiple columns in nested spark dataframe

My current DataFrame looks like below:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this DataFrame into the DataFrame below:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
This means that each field name inside 'values' (0.2, 0.4 and 0.6) is multiplied by 100 and prefixed with the letter 'v', and its array is extracted into a separate column.
What would the code look like to achieve this? I have tried withColumn but couldn't achieve this.
Try the code below; please see the inline comments for the explanation.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temp DataFrame for fetching the nested values
    val index = dfTemp.schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Join DataFrame with the list of nested columns
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mappings
    }).drop("inputs") // Drop the extra column
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Output the JSON file
  }
}
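As an aside, the same mapping can also be written as a single select instead of a foldLeft of withColumn calls; a sketch under the same schema assumptions, reusing propSchema from above:
// build one aliased Column per nested field, then select them together with id and id1
val vCols = propSchema.fields.map { field =>
  col("inputs.values.`" + field.name + "`").as("v" + (field.name.toDouble * 100).toInt)
}
val dfFinalAlt = df.select(col("id") +: col("id1") +: vCols.toSeq: _*)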
I would split the logic for the column name change into two parts: the part that is a numeric value, and the part that doesn't change.
def stringDecimalToVNumber(colName: String): String =
  "v" + (colName.toFloat * 100).toInt.toString
and form a single function that transforms the name according to the case:
val floatRegex = """(\d+\.?\d*)""".r

def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it
}
Now that we have the function to transform the column names, let's pick the schema dynamically.
val flattenDF = df.select("id", "inputs.values.*")
val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This will dynamically rename all the columns inside inputs.values and put them next to id.
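Note that the desired output in the question also keeps id1; if that column is needed, a small tweak (a sketch) is to carry it through the flattened selection:
// keep id1 alongside id when flattening the nested struct
val flattenDF = df.select("id", "id1", "inputs.values.*")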

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using a script for CDC Merge in Spark Streaming. I wish to pass column values to selectExpr through a parameter, as the column names would change for each table. When I pass the columns and struct field through a string variable, I get the error ==> mismatched input ',' expecting
Below is the piece of code I am trying to parameterize.
var filteredMicroBatchDF = microBatchOutputDF
  .selectExpr("col1", "col2", "struct(offset,KAFKA_TS) as otherCols")
  .groupBy("col1", "col2").agg(max("otherCols").as("latest"))
  .selectExpr("col1", "col2", "latest.*")
Reference to the script I am trying to emulate:
https://docs.databricks.com/_static/notebooks/merge-in-cdc.html
I have tried, as below, passing the column names in a variable and then reading them in the selectExpr from these variables:
val keyCols = "col1","col2"
val structCols = "struct(offset,KAFKA_TS) as otherCols"
var filteredMicroBatchDF=microBatchOutputDF
.selectExpr(keyCols,structCols )
.groupBy(keyCols).agg(max("otherCols").as("latest"))
.selectExpr(keyCols,"latest.*")
When I run the script it gives me this error:
org.apache.spark.sql.streaming.StreamingQueryException:
mismatched input ',' expecting <<EOF>>
EDIT
Here is what I have tried after the comments by Luis Miguel, which works fine:
import org.apache.spark.sql.{DataFrame, functions => sqlfun}
def foo(microBatchOutputDF: DataFrame)
(keyCols: Seq[String], structCols: Seq[String]): DataFrame =
microBatchOutputDF
.selectExpr((keyCols ++ structCols) : _*)
.groupBy(keyCols.head, keyCols.tail : _*).agg(sqlfun.max("otherCols").as("latest"))
.selectExpr((keyCols :+ "latest.*") : _*)
var keyColumns = Seq("COL1","COL2")
var structColumns = "offset,Kafka_TS"
foo(microBatchOutputDF)(keyCols = Seq(keyColumns: _*), structCols = Seq("struct(" + structColumns + ") as otherCols"))
Note: the following results in an error
foo(microBatchOutputDF)(keyCols = Seq(keyColumns), structCols = Seq("struct(" + structColumns + ") as otherCols"))
The caveat with the working code above is that keyColumns were hardcoded there. So I tried reading them (first) from a parameter file and (second) from a widget, which resulted in an error, and this is where I am looking for advice and suggestions:
First Method
import java.util.Properties
import scala.io.Source

def loadProperties(url: String): Properties = {
  val properties: Properties = new Properties()
  if (url != null) {
    val source = Source.fromURL(url)
    properties.load(source.bufferedReader())
  }
  return properties
}

var tableProp: Properties = new Properties()
tableProp = loadProperties("dbfs:/Configs/Databricks/Properties/table/Table.properties")
var keyColumns = Seq(tableProp.getProperty("keyCols"))
var structColumns = tableProp.getProperty("structCols")
keyCols and StructCols are defined in the parameter file as:
keyCols = Col1, Col2 (I also tried assigning these as "Col1","Col2")
StructCols = offset,Kafka_TS
Then finally,
foo(microBatchOutputDF)(keyCols = Seq(keyColumns: _*), structCols = Seq("struct(" + structColumns + ") as otherCols"))
The code throws an error pointing at the first comma (as if it were taking the columns field as a single argument):
mismatched input ',' expecting <EOF>
== SQL ==
"COL1","COL2""
-----^^^
If I pass just one column in the keyCols property, the code works fine.
E.g. keyCols = Col1
Second Method
Here I tried reading the key columns from the widget, and it's the same error again.
dbutils.widgets.text("prmKeyCols", "","")
val prmKeyCols = dbutils.widgets.get("prmKeyCols")
var keyColumns = Seq(prmKeyCols)
The widget is passed in as below
"Col1","Col2"
Then finally,
foo(microBatchOutputDF)(keyCols = Seq(keyColumns: _*), structCols = Seq("struct(" + structColumns + ") as otherCols"))
This also gives the same error.
Something like this should work:
import org.apache.spark.sql.{DataFrame, functions => sqlfun}
def foo(microBatchOutputDF: DataFrame)
(keyCols: Seq[String], structCols: Seq[String]): DataFrame =
microBatchOutputDF
.selectExpr((keyCols ++ structCols) : _*)
.groupBy(keyCols.head, keyCols.tail : _*).agg(sqlfun.max("otherCols").as("latest"))
.selectExpr((keyCols :+ "latest.*") : _*)
Which you can use like:
foo(microBatchOutputDF)(keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols"))
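If the column names arrive as a single comma-separated string (e.g. from the properties file or widget above), one hedged way to feed them to foo is to split and trim that string first. A sketch, assuming the property holds plain comma-separated names such as keyCols = Col1, Col2:
// split the raw property value into individual column names,
// dropping whitespace and any surrounding quotes
val keyColumns: Seq[String] = tableProp.getProperty("keyCols")
  .split(",")
  .map(_.trim.stripPrefix("\"").stripSuffix("\""))
  .toSeq

foo(microBatchOutputDF)(keyCols = keyColumns, structCols = Seq("struct(" + structColumns + ") as otherCols"))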

Apache Spark. UDF Column based on another column without passing its name as an argument.

There is a Dataset with a column firm; I'm adding another column, firm_id, to this Dataset. Here's an example:
private val firms: mutable.Map[String, Integer] = ...
private val firmIdFromCode: (String => Integer) = (code: String) => firms(code)
val firm_id_by_code: UserDefinedFunction = udf(firmIdFromCode)
...
val ds = dataset.withColumn("firm_id", firm_id_by_code($"firm"))
Is there a way to eliminate passing $"firm" as an argument (this column is always present in the DS)?
I am searching for something like this:
val ds = dataset.withColumn("firm_id", firm_id_by_code)
You could supply the column it will be using when you define the udf.
val someUdf = udf{ /*udf code*/}.apply($"colName")
// Usage in dataset
val ds = dataset.withColumn("newColName",someUdf)
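Applied to the question's example, the pattern would look roughly like this (a sketch, assuming firms and firmIdFromCode as defined above and spark.implicits._ in scope for the $ syntax):
// bind the udf to the firm column once, at definition time
val firm_id_by_code = udf(firmIdFromCode).apply($"firm")
val ds = dataset.withColumn("firm_id", firm_id_by_code)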

Scala to find common values between two lists

I have a text file in the following format
a,b,c,d,e
f,g,h,i,j
b,g,k,l,m
g,h,o,p,q
I want an output file that contains only those rows whose value in the first column appears somewhere in the second column. For example, in this case the values in the first column of the last two rows are "b" and "g", which are also available somewhere in the second column. So my required output has only two rows.
b,g,k,l,m
g,h,o,p,q
As per my solution, I got two lists with the distinct values of column 1 and column 2. Now, how can I check whether the values in column 1 are available in column 2? Related code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.commons.io.IOUtils
import scala.io.StdIn.{readLine, readInt}
import scala.io.Source

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "src/data/s1.txt"
    val sc = new SparkContext("spark://Hadoop1:7077", "Simple App", "/usr/local/spark",
      List("/usr/local/spark/SimpleSparkProject/target/scala-2.11/simple-project_2.11-1.0.jar"))
    val lD = sc.textFile(logFile).cache()

    val d2Map = lD map(col2)
    val Column2 = d2Map.distinct

    val d1Map = lD map(col1)
    val Column1 = d1Map.distinct

    // Now, here I want only those values in Column1 which are available in Column2
    //Column2.saveAsTextFile("hdfs://Hadoop1:9000/user/output/distDestination")
  }

  def col2(s: String): String = {
    val kv = s.split(",")
    kv(1)
  }

  def col1(s: String): String = {
    val kv = s.split(",")
    kv(0)
  }
}
This code is written in pure Scala, not using Spark, but I hope it will help you.
val str = "a,b,c,d,e\n" +
"f,g,h,i,j\n" +
"b,g,k,l,m\n" +
"g,h,o,p,q"
val rows = str.split("\n")
val splittedRows = rows.map(_.split(","))
val stringsInSecondColumn = splittedRows.map(_.apply(1)).toSet
val result = splittedRows.filter { row =>
stringsInSecondColumn.contains(row.apply(0))
}
result.foreach(x => println(x.mkString(",")))
Everything above the line defining stringsInSecondColumn is just string parsing.
Then we take every string from the second column and put them into a Set, so membership lookups are fast and the overall filtering stays linear.
And then we just need to filter all rows and check whether the first value can be found in the stringsInSecondColumn set.
In your code you may do the following:
val stringsInSecondColumn = lD.map(_.split(",")(1)).collect().toSet
val filteredRows = lD.filter(row => stringsInSecondColumn.contains(row.split(",")(0)))
Hope it will help you.
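For larger inputs, the collected set can also be shared explicitly as a broadcast variable so it is shipped to each executor only once. A sketch, reusing sc and lD from your code:
// collect the distinct second-column values on the driver and broadcast them
val secondColumnSet = sc.broadcast(lD.map(_.split(",")(1)).collect().toSet)
val filteredRows = lD.filter(row => secondColumnSet.value.contains(row.split(",")(0)))
filteredRows.saveAsTextFile("hdfs://Hadoop1:9000/user/output/distDestination")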