Matching column names from a CSV file in Spark Scala

I want to take the headers (column names) from my CSV file and match them with my existing headers.
I am using the code below:
val cc = sparksession.read.csv(filepath).take(1)
It gives me a value like:
Array([id,name,salary])
I have also created a static schema, which gives me a value like this:
val ss=Array("id","name","salary")
Then I try to compare the column names using an if condition:
if (cc == ss) {
  println("matched")
} else {
  println("not matched")
}
I guess that because of the [] and () mismatch it always goes to the else part. Is there any other way to compare these values without considering the [] and ()?

First, for convenience, set the header option to true when reading the file:
val df = sparksession.read.option("header", true).csv(filepath)
Get the column names and define the expected column names:
val cc = df.columns
val ss = Array("id", "name", "salary")
To check if the two match (not considering the ordering):
if (cc.toSet == ss.toSet) {
  println("matched")
} else {
  println("not matched")
}
If the order is relevant, then the comparison can be done as follows (Array equality compares references, so you can't use Array here, but Seq works):
cc.toSeq == ss.toSeq
or use a deep array comparison:
cc.deep == ss.deep

First of all, I think you are trying to compare an Array[org.apache.spark.sql.Row] with an Array[String]. I believe you should change how you load the headers to something like: val cc = spark.read.format("csv").option("header", "true").load(fileName).columns.toArray.
Then you could compare using cc.deep == ss.deep.
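Put together, a minimal sketch (using the question's filepath and assuming spark is the active SparkSession; on Scala 2.13+ use cc.toSeq == ss.toSeq instead of deep):
// cc: the actual header columns read from the file; ss: the expected header
val cc = spark.read.format("csv").option("header", "true").load(filepath).columns
val ss = Array("id", "name", "salary")
if (cc.deep == ss.deep) println("matched") else println("not matched")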

The code below worked for me.
val cc= spark.read.csv("filepath").take(1)(0).toString
The above code gives the output as a String: [id,name,salary].
I then created a static schema as a String in the same format:
val ss = "[id,name,salary]"
and then wrote the if/else conditions.
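For reference, the if/else mentioned above could look like the sketch below (variable names carried over from this answer; the plain string comparison only works because both sides use the same bracketed format):
val cc = spark.read.csv("filepath").take(1)(0).toString // e.g. "[id,name,salary]"
val ss = "[id,name,salary]" // expected header, written in the same format
if (cc == ss) println("matched") else println("not matched")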

Related

Scala - how to filter a StructType with a list of StructField names?

I'm writing a method to parse a schema, and I want to filter the resulting StructType with a list of column names, which is a subset of the StructField names of the original schema.
As a result, if a flag isFilteringReq = true, I want to return a StructType containing only the StructFields with the names from specialColumnNames, in the same order. If the flag is false, then return the original StructType.
val specialColumnNames = Seq("metric_1", "metric_2", "metric_3")
First I get the original schema with pattern matching.
val customSchema: StructType = schemaType match {
  case "type_1" => getType1chema()
  case "type_2" => getType2chema()
}
There are two problems:
1 - I wasn't able to apply .filter() directly to customSchema right after the closing curly brace, and I get a Cannot resolve symbol filter error. So I wrote a separate method, makeCustomSchema, but I don't really want a separate method for this. Is there a more elegant way to apply the filtering in this case?
2 - I could filter the originalStruct but only with a single hardcoded column name. How should I pass the specialColumnNames to contains()?
def makeCustomSchema(originalStruct: StructType, isFilteringReq: Boolean, updColumns: Seq[String]) = if (isFilteringReq) {
  originalStruct.filter(s => s.name.contains("metric_1"))
} else {
  originalStruct
}
val newSchema = makeCustomSchema(customSchema, isFilteringReq, specialColumnNames)
Instead of passing a Seq, pass a Set, and you can filter on whether each field is in the set or not.
Also, I wouldn't use a flag; instead, you could pass an empty Set when there's no filtering, or use an Option[Set[String]].
Anyway, you could also use the copy method that comes for free with case classes.
Something like this should work.
def makeCustomSchema(originalStruct: StructType, updColumns: Set[String]): StructType = {
  updColumns match {
    case s if s.isEmpty => originalStruct
    case _ => originalStruct.copy(
      fields = originalStruct.fields.filter(f => updColumns.contains(f.name)))
  }
}
Usually you don't need to build structs like this; have you tried using the drop() method on DataFrame/Dataset?
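For example, a sketch of the drop-based approach (df here is a hypothetical DataFrame read with the schema in question; drop simply ignores any column names that don't exist):
// drop every column that is not in the list of special columns
val columnsToDrop = df.columns.filterNot(specialColumnNames.contains)
val trimmedDF = df.drop(columnsToDrop: _*)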

Can I have a condition inside of a where or a filter?

I have a dataframe with many columns. To explain the situation, let's say there is a column with letters in it from a-z. I also have a list which includes some specific letters.
val testerList = List("a","k")
The dataframe has to be filtered to only include entries with the letters specified in the list. This is very straightforward:
val resultDF = df.where($"column".isin(testerList: _*))
So the problem is that the list is given to this function as a parameter, and it can be an empty list. That situation could be handled like this (resultDF is defined beforehand as an empty dataframe):
if (!testerList.isEmpty) {
  resultDF = df.where(some other stuff has to be filtered away)
    .where($"column".isin(testerList: _*))
} else {
  resultDF = df.where(some other stuff has to be filtered away)
}
Is there a way to do this more simply, something like this:
val resultDF = df.where(some other stuff has to be filtered away)
  .where((!(testerList.isEmpty)) && $"column".isin(testerList: _*))
This one throws an error though:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
.where( (!(testerList.isEmpty)) && (($"agent_account_homepage").isin(testerList:_*)))
^
So, thanks a lot for any kind of ideas for a solution!! :)
What about this?
val filtered1 = df.where(some other stuff has to be filtered away)
val resultDF = if (testerList.isEmpty)
  filtered1
else
  filtered1.where($"column".isin(testerList: _*))
Or if you don't want filtered1 to be available below and perhaps unintentionally used, it can be declared inside a block initializing resultDF:
val resultDF = {
  val filtered1 = df.where(some other stuff has to be filtered away)
  if (testerList.isEmpty) filtered1 else filtered1.where($"column".isin(testerList: _*))
}
Or, if you change the order:
val resultDF = (if (testerList.isEmpty)
    df
  else
    df.where($"column".isin(testerList: _*))
  ).where(some other stuff has to be filtered away)
Essentially, what Spark expects to receive in where is a plain Column object. This means that you can extract all your complicated where logic into a separate function:
def testerFilter(testerList: List[String]): Column = testerList match {
  // of course, you have to replace ??? with real conditions
  // just append them by joining with "and"
  case Nil => $"column".isNotNull and ???
  case tl => $"column".isin(tl: _*) and ???
}
And then you just use it like:
df.where(testerFilter(testerList))
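If you prefer not to fold an extra isNotNull condition into the empty case, another option (not from the answer above, just a common trick) is to return a literal true column, which where() treats as a no-op filter. A minimal sketch, assuming the implicits for $ are in scope and otherConditions stands for the other filtering from the question:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit
def testerFilter(testerList: List[String]): Column =
  if (testerList.isEmpty) lit(true) // no extra filtering when the list is empty
  else $"column".isin(testerList: _*) // otherwise keep only the listed letters
val resultDF = df
  .where(otherConditions) // hypothetical: the "other stuff" filter from the question
  .where(testerFilter(testerList))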
The solution I use now uses SQL code inside the where clause:
var testerList = s""""""
var cond = testerList.isEmpty().toString
testerList = if (cond == "true") "''" else testerList
val resultDF = df.where(some other stuff has to be filtered away)
  .where("('" + cond + "' = 'true') or (agent_account_homepage in (" + testerList + "))")
What do you think?

Null values in SCALA in an Array of String - ArrayIndexOutOfBoundsException

I have an Array of String as follows:
res17: Array[String] = Array(header - skip me, blk1|X|||||, a|b|c||||, d|e|f||||, x|y|z||||, blk2|X|||||, h|h|h|h|h|h|h, j|j|j|j|j|j|j, k|k|k|k|k|k|k, m|m|m|m|m|m|m, blk3|X|||||, 7|7|||||)
This is produced by a Scala program, not Spark with Scala:
for (line <- Source.fromFile(filename).getLines().drop(1).toVector) {
  val values = line.split("\\|").map(_.trim)
  ...
When I perform:
  ...
  println(values(0), values(1), values(2)) // gives an error on index 2, or indeed 1, if a null is found
}
I.e. it fails if there is nothing between the pipe |.
getOrElse does not help. How can I substitute the "nulls" when retrieving or saving? I cannot see it in the documentation. It must be quite simple!
Note that I am using Scala only, not Spark with Scala.
Thanks in advance
Well, that's not the behaviour I am experiencing; I may be doing something different.
Anyway, if you want to get rid of your nulls, you can run a filter like the one below:
val values = s.split("\\|").map(_.trim).filterNot(_.isEmpty)
If you don't want to get rid of them, but rather transform them into something else, you can run:
val values = s.split("\\|").map { x => val trimmed = x.trim; if (trimmed.isEmpty) None else Some(trimmed) }
EDIT:
val values = s.split("\\|").map{x => if (x == null) "" else x.trim}
EDIT (Again):
I can finally reproduce it, sorry for the inconvenience; I misunderstood something. The problem is the split function, which by default removes trailing empty values. You should pass a second parameter to the split function, as explained in the API:
val values = line.split("\\|", -1).map(_.trim)
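If some lines can still end up with fewer fields than expected, a defensive sketch using lift (which returns an Option instead of throwing) together with getOrElse avoids the ArrayIndexOutOfBoundsException entirely:
val values = line.split("\\|", -1).map(_.trim)
// lift(i) yields None when index i is out of range, so a default can be substituted
println(values.lift(0).getOrElse(""), values.lift(1).getOrElse(""), values.lift(2).getOrElse(""))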

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a stackoverflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (in this case "11849") and the number from Age (in this example "35"), and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD:
val linesWithAge = lines.filter(line => line.contains("Age=")) // filter out data which doesn't have Age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // split the data wherever there is a "
So when I split it on the quotation marks, the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats?
Also, I need to do this for every line in the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I added an index to the array, and now I've successfully managed to assign it to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name: String): Option[String] = {
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None. Hence the result of the for-comprehension will be None and the record will be dropped by the flatMap operation.
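A quick usage sketch, just to inspect a few of the extracted pairs:
// print the first few (Reputation, Age) tuples
reputationByAge.take(5).foreach { case (rep, age) => println(s"Reputation=$rep Age=$age") }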
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML
import scala.xml.XML._
// helper function to map Strings to Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] = if (str.isEmpty) None else Some(str)
val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
First, you need a function which extracts the value for a given key from your line (getValueForKeyAs), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs(line, "Age")(_.toFloat), getValueForKeyAs(line, "Reputation")(_.toFloat)))
This should give you an RDD of type RDD[(Float, Float)].
getValueForKeyAs could be implemented like this (note that a plain cast cannot turn the extracted String into a Float, so it takes a conversion function):
def getValueForKeyAs[A](line: String, key: String)(convert: String => A): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  // take the text between the first pair of quotes after key= and convert it
  convert(res(1).split("\"")(1))
}

Sort files by name

How can I sort a group of files in ascending/descending order based on their names, given the following naming convention: myPath\numberTheFileInt.ext?
I would like to obtain something like the following:
myPath\1.csv
myPath\02.csv
...
myPath\21.csv
...
myPath\101.csv
Here is what I have at the moment:
myFiles = getFiles(myFilesDirectory).sortWith(_.getName < _.getName)
Despite the files being sorted in the directory, they are unsorted in myFiles.
The output I get is:
myPath\1.csv
myPath\101.csv
myPath\02.csv
...
myPath\21.csv
I tried multiple things, but it always throws a NoSuchElementException.
Has anyone already done this?
Comparing strings yields an order based on the Unicode values of the strings being compared. What you need is to extract the file number and order by it as an integer.
import java.io.File
val extractor = "([\\d]+).csv$".r
val files = List(
  "myPath/1.csv",
  "myPath/101.csv",
  "myPath/02.csv",
  "myPath/21.csv",
  "myPath/33.csv"
).map(new File(_))
val sorted = files.sortWith { (l, r) =>
  val extractor(lFileNumber) = l.getName
  val extractor(rFileNumber) = r.getName
  lFileNumber.toInt < rFileNumber.toInt
}
sorted.foreach(println)
sorted.foreach(println)
Results:
myPath/1.csv
myPath/02.csv
myPath/21.csv
myPath/33.csv
myPath/101.csv
UPDATE
An alternative, as proposed by @dhg:
val sorted = files.sortBy { f => f.getName match {
  case extractor(n) => n.toInt
}}
A cleaner version of J.Romero's answer, using sortBy:
val Extractor = "([\\d]+)\\.csv".r
val sorted = files.map(_.getName).sortBy{ case Extractor(n) => n.toInt }