I have a list whose elements and sub-elements I want to parse and pass as variables to a DataFrame query,
but I get an error. Can anybody help? Here is my code:
val ListParser = ("age,15,20","revenue,1,2")
val vars = "category in (1,2,4)"
val resultQuery : Dataset[Row] =
if(ListParser.size == 0){
responses.filter(vars)
}else if(ListParser.size == 1){
responses.filter(vars +" AND " + responses(ListParser(0)).between(ListParser(1).toInt, ListParser(2).toInt))
}else if(ListParser.size >= 2){
responses.filter(vars + " AND " + {for(a <- ListParser){
val myInnerList : List[String] = a.split(",").map(_.trim).toList
responses(myInnerList(0)).between(myInnerList(1).toInt,myInnerList(2).toInt)
}})
}else{
responses.filter(vars)
}
I also have another question: I want resultQuery to hold only the value of responses.filter().
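It seems you are mixing SQL string syntax with Spark Column object syntax: concatenating the Column returned by between into a string only appends its string representation.
Try building the whole condition as one SQL string instead (assuming listParser is indeed a List and not a tuple). Note that the SQL form is BETWEEN lower AND upper:
responses.filter((listParser.map { s => val l = s.split(","); s"${l(0)} BETWEEN ${l(1)} AND ${l(2)}" } :+ vars).mkString(" AND "))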
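As a minimal sketch of that idea in plain Scala (no Spark needed), the condition string can be built and inspected before it is passed to filter. The object and helper names (FilterBuilder, buildFilter) are hypothetical, and the sketch assumes every entry has the form column,lower,upper with integer bounds.

```scala
object FilterBuilder {
  // Hypothetical helper: turns "age,15,20" into "age BETWEEN 15 AND 20"
  // and joins all clauses, plus the base condition, with AND.
  def buildFilter(ranges: List[String], base: String): String = {
    val clauses = ranges.map { entry =>
      val parts = entry.split(",").map(_.trim)
      // toInt fails fast if a bound is not an integer (assumption: integer bounds)
      s"${parts(0)} BETWEEN ${parts(1).toInt} AND ${parts(2).toInt}"
    }
    (clauses :+ base).mkString(" AND ")
  }
}
```

With the values from the question, the result could then be used as responses.filter(FilterBuilder.buildFilter(listParser, vars)). An empty list simply yields the base condition, so the if/else chain on the list size is no longer needed.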
I have a function which should take in a long string and separate it into a list of strings where each list element is a sentence of the article. I am going to achieve this by splitting on space and then grouping the elements from that split according to the tokens which end with a dot:
def getSentences(article: String): List[String] = {
val separatedBySpace = article
.map((c: Char) => if (c == '\n') ' ' else c)
.split(" ")
val splitAt: List[Int] = Range(0, separatedBySpace.size)
.filter(i => endsWithDot(separatedBySpace(i))).toList
// TODO
}
I have separated the string on space, and I've found each index that I want to group the list on. But how do I now turn separatedBySpace into a list of sentences based on splitAt?
Example of how it should work:
article = "I like donuts. I like cats."
result = List("I like donuts.", "I like cats.")
PS: Yes, I know that my algorithm for splitting the article into sentences has flaws; I just want a quick, naive method to get the job done.
I ended up solving this by using recursion:
def getSentenceTokens(article: String): List[List[String]] = {
val separatedBySpace: List[String] = article
.replace('\n', ' ')
.replaceAll(" +", " ") // regex
.split(" ")
.toList
val splitAt: List[Int] = separatedBySpace.indices
.filter(i => ( i > 0 && endsWithDot(separatedBySpace(i - 1)) ) || i == 0)
.toList
groupBySentenceTokens(separatedBySpace, splitAt, List())
}
def groupBySentenceTokens(tokens: List[String], splitAt: List[Int], sentences: List[List[String]]): List[List[String]] = {
if (splitAt.size <= 1) {
if (splitAt.size == 1) {
sentences :+ tokens.slice(splitAt.head, tokens.size)
} else {
sentences
}
}
else groupBySentenceTokens(tokens, splitAt.tail, sentences :+ tokens.slice(splitAt.head, splitAt.tail.head))
}
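For comparison, here is a shorter recursive sketch that skips the index list and cuts the token list directly with span. It is an alternative under the same naive assumption (a sentence ends at a token ending in a dot), not the code above; the endsWithDot helper is assumed to be a simple endsWith(".") check, and the object name is hypothetical.

```scala
import scala.annotation.tailrec

object SentenceGrouping {
  // Assumed definition of the helper used in the question.
  def endsWithDot(token: String): Boolean = token.endsWith(".")

  def getSentences(article: String): List[String] = {
    // Split on any whitespace run (spaces or newlines) and drop empty tokens.
    val tokens = article.split("\\s+").toList.filter(_.nonEmpty)

    @tailrec
    def loop(rest: List[String], acc: List[String]): List[String] =
      if (rest.isEmpty) acc.reverse
      else {
        // body: tokens before the next dot-terminated token; tail starts at that token.
        val (body, tail) = rest.span(t => !endsWithDot(t))
        val sentence = (body ++ tail.take(1)).mkString(" ")
        loop(tail.drop(1), sentence :: acc)
      }

    loop(tokens, Nil)
  }
}
```

Trailing words without a final dot are kept as one last sentence.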
val s: String = """I like donuts. I like cats
This is amazing"""
s.split("\\.|\n").map(_.trim).toList
//result: List[String] = List("I like donuts", "I like cats", "This is amazing")
To include the dots in the sentences (here only a word ending in "." closes a sentence; a newline alone does not):
val (pending, done) = s.replace("\n", " ").split(" ")
.foldLeft((List.empty[String], List.empty[String])) {
case ((temp, result), word) =>
if (word.endsWith(".")) {
// the current sentence is complete: move it to the result
(List.empty[String], result :+ (temp :+ word).mkString(" "))
} else {
(temp :+ word, result)
}
}
// keep trailing words that were never closed by a dot
val result = if (pending.isEmpty) done else done :+ pending.mkString(" ")
//result = List("I like donuts.", "I like cats This is amazing")
I'm trying to cast the columns of the DataFrame df_trial, in which every column is a string, to the types described in an XML file.
val columnList = sXml \\ "COLUMNS" \ "COLUMN"
val df_trial = sqlContext.createDataFrame(rowRDD, schema_allString)
columnList.foreach(i => {
var columnName = (i \\ "#ID").text.toLowerCase()
var dataType = (i \\ "#DATA_TYPE").text.toLowerCase()
if (dataType == "number") {
print("number")
var DATA_PRECISION: Int = (i \\ "#DATA_PRECISION").text.toLowerCase().toInt
var DATA_SCALE: Int = (i \\ "#DATA_SCALE").text.toLowerCase().toInt;
var decimalvalue = "decimal(" + DATA_PRECISION + "," + DATA_SCALE + ")"
val df_intermediate: DataFrame =
df_trial.withColumn(s"$columnName",
col(s"$columnName").cast(s"$decimalvalue"))
val df_trial: DataFrame = df_intermediate
} else if (dataType == "varchar2") {
print("varchar")
var DATA_LENGTH = (i \\ "#DATA_LENGTH").text.toLowerCase().toInt;
var varcharvalue = "varchar(" + DATA_LENGTH + ")"
val df_intermediate =
df_trial.withColumn(s"$columnName",
col(s"$columnName").cast(s"$varcharvalue"))
val df_trial: DataFrame = df_intermediate
} else if (dataType == "timestamp") {
print("time")
val df_intermediate =
df_trial.withColumn(s"$columnName", col(s"$columnName").cast("timestamp"))
val df_trial: DataFrame = df_intermediate
}
});
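In each branch of the if-else you use df_trial before it is defined: the val df_trial declared at the end of the branch shadows the outer one for the whole block, so the df_trial.withColumn(...) call above it is a forward reference. You'll need to rearrange the code so that values are defined before they are used.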
Note: the way you have it, the df_trial at the very top is not being used. Depending on what you are trying to do, you may want to change the first df_trial to a var and remove the val from the other usages. (This is probably still wrong since you will be overwriting the same variable multiple times as you loop over columnList).
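One way to avoid both the shadowing and the reassigned variable is to thread the frame through foldLeft, so each step receives the result of the previous cast. The sketch below runs without Spark: Frame is a hypothetical stand-in for a DataFrame (just a map from column name to type name), and ColumnSpec is a hypothetical holder for the attributes read from the XML.

```scala
object CastSketch {
  // Hypothetical stand-in for a DataFrame: column name -> type name.
  final case class Frame(types: Map[String, String]) {
    def withColumnType(name: String, tpe: String): Frame = Frame(types.updated(name, tpe))
  }

  // Hypothetical column metadata, as it might be read from the XML file.
  final case class ColumnSpec(id: String, dataType: String,
                              precision: Int = 0, scale: Int = 0, length: Int = 0)

  // Maps the XML data type to the cast string used in the question.
  def targetType(c: ColumnSpec): Option[String] = c.dataType.toLowerCase match {
    case "number"    => Some(s"decimal(${c.precision},${c.scale})")
    case "varchar2"  => Some(s"varchar(${c.length})")
    case "timestamp" => Some("timestamp")
    case _           => None // unknown types are left untouched
  }

  // Each step gets the frame produced by the previous one; nothing is reassigned.
  def castAll(start: Frame, specs: Seq[ColumnSpec]): Frame =
    specs.foldLeft(start) { (frame, spec) =>
      targetType(spec).fold(frame)(t => frame.withColumnType(spec.id.toLowerCase, t))
    }
}
```

With a real DataFrame, the step inside the fold would presumably be the call already used in the question, df.withColumn(name, col(name).cast(t)).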
I have an ArrayBuffer with data in the following format: period_name:character varying(15) year:bigint. Each element holds a table column name and its datatype. I need to extract the column name (period_name) and the datatype without the parenthesized part (just character varying, dropping everything from "(" to ")"), and then add each result to a ListBuffer. I came up with the following logic:
for(i <- receivedGpData) {
gpTypes = i.split("\\:")
if(gpTypes(1).contains("(")) {
gpColType = gpTypes(1).substring(0, gpTypes(1).indexOf("("))
prepList += gpTypes(0) + " " + gpColType
} else {
prepList += gpTypes(0) + " " + gpTypes(1)
}
}
The above code is working but I am trying to implement the same using Scala's Map and Filter functions. What I don't understand is how to use the if-else condition in the Scala Filter after the condition:
var reList = receivedGpData.map(element => element.split(":"))
.filter{ x => x(1).contains("(")
}
Could anyone let me know how I can implement the same logic as the for-loop using Scala's map and filter functions?
val receivedGpData = Array("bla:bla(1)","bla2:cat")
val res = receivedGpData
.map(_.split(":"))
.map(s=>(s(0),s(1).takeWhile(_!='(')))
.map(s => s"${s._1} ${s._2}").toList
println(res)
Using regex:
val p = "(\\w+):([.[^(]]*)(\\(.*\\))?".r
val res = data.map{case p(x,y,_)=>x+" "+y}
In Scala REPL:
scala> val data = Array("period_name:character varying(15)","year:bigint")
data: Array[String] = Array(period_name:character varying(15), year:bigint)
scala> val p = "(\\w+):([.[^(]]*)(\\(.*\\))?".r
p: scala.util.matching.Regex = (\w+):([.[^(]]*)(\(.*\))?
scala> val res = data.map{case p(x,y,_)=>x+" "+y}
res: Array[String] = Array(period_name character varying, year bigint)
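A small variation, sketched under the assumption that malformed entries should be skipped rather than fail: map with a single case throws a MatchError on input the pattern does not match, while collect drops such entries. The character class is also written as the simpler [^(], which matches the same characters. The object and method names are hypothetical.

```scala
object ColumnTypes {
  // name, then ":", then the type up to an optional "(...)" suffix
  private val Pattern = """(\w+):([^(]*)(\(.*\))?""".r

  // collect keeps only the entries the pattern matches completely.
  def clean(data: Seq[String]): Seq[String] =
    data.collect { case Pattern(name, tpe, _) => s"$name ${tpe.trim}" }
}
```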
I want to write a functional version of code that finds the pairs of elements with a given sum. Below is the imperative code:
object ArrayUtil{
def findPairs(arr:Array[Int],sum:Int) ={
val MAX = 50
val binmap:Array[Boolean] = new Array[Boolean](MAX)
for(i <- 0 until arr.length){
val temp:Int = sum-arr(i);
if (temp>=0 && binmap(temp))
{
println("Pair with given sum " + sum + " is (" + arr(i) +", "+temp+")");
}
binmap(arr(i)) = true;
}
}
}
Study the Standard Library.
def findPairs(arr:Array[Int],sum:Int): List[Array[Int]] =
arr.combinations(2).filter(_.sum == sum).toList
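The combinations approach checks every pair. A sketch that stays closer to the single pass of the imperative code replaces the boolean array with an immutable Set carried through foldLeft; it is an alternative, not the answer above, and it returns the pairs instead of printing them. The object name is hypothetical.

```scala
object PairSum {
  // One pass: for each element, check whether its complement was seen earlier.
  def findPairs(arr: Seq[Int], sum: Int): List[(Int, Int)] = {
    val (_, pairs) = arr.foldLeft((Set.empty[Int], List.empty[(Int, Int)])) {
      case ((seen, acc), x) =>
        val complement = sum - x
        val next = if (seen(complement)) (x, complement) :: acc else acc
        (seen + x, next)
    }
    pairs.reverse // restore the order in which pairs were found
  }
}
```

Unlike the fixed-size binmap, the Set has no upper bound on the values it can hold, so no MAX constant is needed.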
I am trying to append to an array but for some reason it is just appending blanks into my Array.
def schemaClean(x: Array[String]): Array[String] =
{
val array = Array[String]()
for(i <- 0 until x.length){
val convert = x(i).toString
val split = convert.split('|')
if (split.length == 5) {
val drop = split.dropRight(3).mkString(" ")
array :+ drop
}
else if (split.length == 4) {
val drop = split.dropRight(2).mkString(" ")
println(drop)
array :+ drop.toString
println(array.mkString(" "))
}
}
array
}
val schema1 = schemaClean(schema)
prints this:
record_id string
assigned_offer_id string
accepted_offer_flag string
current_offer_flag string
If I try to print schema1, it's just one blank line.
Scala's Array size is immutable. From Scala's reference:
def :+(elem: A): Array[A]
[use case] A copy of this array with an element appended.
Thus :+ returns a new array whose reference you are not using.
val array = ...
Should be:
var array = ...
And you should update that reference with the new arrays obtained after each append operation.
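A minimal sketch of the var approach, with hypothetical names (AppendWithVar, copyAll), to show where the reassignment goes:

```scala
object AppendWithVar {
  def copyAll(items: Seq[String]): Array[String] = {
    var array = Array.empty[String]
    for (item <- items) {
      // :+ returns a new array, so the result must be assigned back
      array = array :+ item
    }
    array
  }
}
```

Each append copies the whole array, which is acceptable for short inputs but gets expensive as the array grows.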
Since there are no variable-size arrays in Scala, the alternative to an Array var that is copied after each insertion is ArrayBuffer. Use its += method to append new elements, then obtain the resulting array from the buffer, e.g.:
import scala.collection.mutable.ArrayBuffer
val ab = ArrayBuffer[String]()
ab += "hello"
ab += "world"
ab.toArray
res2: Array[String] = Array(hello, world)
Applied to your code:
def schemaClean(x: Array[String]): Array[String] =
{
val arrayBuf = ArrayBuffer[String]()
for(i <- 0 until x.length){
val convert = x(i).toString
val split = convert.split('|')
if (split.length == 5) {
val drop = split.dropRight(3).mkString(" ")
arrayBuf += drop
}
else if (split.length == 4) {
val drop = split.dropRight(2).mkString(" ")
println(drop)
arrayBuf += drop.toString
println(arrayBuf.toArray.mkString(" "))
}
}
arrayBuf.toArray
}
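If no mutable buffer is wanted at all, the same selection can be sketched with map and collect. This is an alternative under the assumption that rows with any other number of fields should simply be skipped, as in the original loop; the object name is hypothetical.

```scala
object SchemaClean {
  def schemaClean(x: Array[String]): Array[String] =
    x.map(_.split('|')).collect {
      // 5 fields: keep the first two
      case parts if parts.length == 5 => parts.dropRight(3).mkString(" ")
      // 4 fields: keep the first two
      case parts if parts.length == 4 => parts.dropRight(2).mkString(" ")
    }
}
```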