I want to remove the substring after the 'Y' character. The final output should be EBAY in all three cases below, but the output comes out as EBAYSK.
object AndreaTest extends SparkSessionWrapper {
  def main(args: Array[String]): Unit = {
    var string1 = "EBAY-SK"
    var string2 = "EBAY SK"
    var string3 = "EBAY- SK"
    val finalString1 = string1.replaceAll("[-]/' '", "")
    val finalString2 = string2.replaceAll("[-]/' '", "")
    val finalString3 = string3.replaceAll("[-]/' '", "")
    println(finalString1)
    println(finalString2)
    println(finalString3)
  }
}
Try this regex:
val strings = List("EBAY-SK", "EBAY SK", "EBAY- SK", "EBAY", "EBAYEBAY")
val pattern = """([^.]*?Y).*""".r
strings.foreach(a => pattern.findAllIn(a).matchData foreach {
m => println(a + " -> " + m.group(1))
})
output:
EBAY-SK -> EBAY
EBAY SK -> EBAY
EBAY- SK -> EBAY
EBAY -> EBAY
EBAYEBAY -> EBAY
Either use a working pattern, i.e. replace - or \s (whitespace) plus everything that follows it:
val finalString1 = string1.replaceAll("[-\\s].*", "")
val finalString2 = string2.replaceAll("[-\\s].*", "")
val finalString3 = string3.replaceAll("[-\\s].*", "")
Or just use substring instead of replaceAll:
val finalString1 = string1.substring(0, 4)
val finalString2 = string2.substring(0, 4)
val finalString3 = string3.substring(0, 4)
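For reference, a quick check of the replaceAll fix on the question's three inputs (a minimal sketch):
val inputs = List("EBAY-SK", "EBAY SK", "EBAY- SK")
// The fixed pattern drops '-' or whitespace and everything after it.
inputs.foreach(s => println(s + " -> " + s.replaceAll("[-\\s].*", "")))
// EBAY-SK -> EBAY
// EBAY SK -> EBAY
// EBAY- SK -> EBAY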
I want to test multiple methods, one that outputs a map and one that outputs a list. I have two separate tests, one for each method, but I want a way to combine them and test both methods at the same time.
test("test 1 map") {
val testCases: Map[String, Map[String, Int]] = Map(
"Andorra" -> Map("la massana" -> 7211)
)
for ((input, expectedOutput) <- testCases) {
var computedOutput: mutable.Map[String, Int] = PaleBlueDot.cityPopulations(countriesFile, citiesFilename, input, "04")
assert(computedOutput == expectedOutput, input + " -> " + computedOutput)
}
}
test(testName="test 1 list") {
val testCases: Map[String, List[String]] = Map{
"Andorra" -> List("les escaldes")
}
for ((input, expectedOutput) <- testCases) {
var computedOutput: List[String] = PaleBlueDot.aboveAverageCities(countriesFile, citiesFilename, input)
assert(computedOutput.sorted == expectedOutput.sorted, input + " -> " + computedOutput)
}
Firstly, it is better to use a List rather than a Map for testCases as a Map can return values in any order. Using List ensures that tests are done in the order they are written in the list.
You can then make testCases into a List containing a tuple with test data for both tests, like this:
test("test map and list") {
val testCases = List {
"Andorra" -> (Map("la massana" -> 7211), List("les escaldes"))
}
for ((input, (mapOut, listOut)) <- testCases) {
val computedMap: mutable.Map[String, Int] =
PaleBlueDot.cityPopulations(countriesFile, citiesFilename, input, "04")
val computedList: List[String] =
PaleBlueDot.aboveAverageCities(countriesFile, citiesFilename, input)
assert(computedMap == mapOut, input + " -> " + computedMap)
assert(computedList.sorted == listOut.sorted, input + " -> " + computedList)
}
}
I have an ArrayBuffer with data in the following format: period_name:character varying(15), year:bigint. The data represents a table's column names and their datatypes. My requirement is to extract the column name, e.g. period_name, and the datatype, just character varying, excluding the substring from "(" to ")", and then send all the elements to a ListBuffer. I came up with the following logic:
for (i <- receivedGpData) {
  gpTypes = i.split("\\:")
  if (gpTypes(1).contains("(")) {
    gpColType = gpTypes(1).substring(0, gpTypes(1).indexOf("("))
    prepList += gpTypes(0) + " " + gpColType
  } else {
    prepList += gpTypes(0) + " " + gpTypes(1)
  }
}
The above code works, but I am trying to implement the same thing using Scala's map and filter functions. What I don't understand is how to apply the if-else logic after the filter condition:
var reList = receivedGpData.map(element => element.split(":"))
  .filter { x => x(1).contains("(") }
Could anyone let me know how I can implement the same for-loop logic using Scala's map and filter functions?
val receivedGpData = Array("bla:bla(1)", "bla2:cat")
val res = receivedGpData
  .map(_.split(":"))
  .map(s => (s(0), s(1).takeWhile(_ != '(')))
  .map(s => s"${s._1} ${s._2}")
  .toList
println(res)
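Applied to the data from the question, the same chain gives (a quick sketch):
val data = Array("period_name:character varying(15)", "year:bigint")
val res = data
  .map(_.split(":"))
  .map(s => (s(0), s(1).takeWhile(_ != '('))) // drop everything from '(' onwards
  .map(s => s"${s._1} ${s._2}")
  .toList
println(res) // List(period_name character varying, year bigint)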
Using regex:
val p = "(\\w+):([.[^(]]*)(\\(.*\\))?".r
val res = data.map{case p(x,y,_)=>x+" "+y}
In Scala REPL:
scala> val data = Array("period_name:character varying(15)","year:bigint")
data: Array[String] = Array(period_name:character varying(15), year:bigint)
scala> val p = "(\\w+):([.[^(]]*)(\\(.*\\))?".r
p: scala.util.matching.Regex = (\w+):([.[^(]]*)(\(.*\))?
scala> val res = data.map{case p(x,y,_)=>x+" "+y}
res: Array[String] = Array(period_name character varying, year bigint)
In this code we have two files: athletes.csv, which contains names, and twitter.test, which contains the tweet messages. For every line in twitter.test we want to find the name that matches a name in athletes.csv. We applied a map function to store the names from athletes.csv and want to check each name against every line in the test file.
object twitterAthlete {

  def loadAthleteNames(): Map[String, String] = {
    // Handle character encoding issues:
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    // Build a Map keyed by athlete name, populated from athletes.csv.
    var athleteInfo: Map[String, String] = Map()
    val lines = Source.fromFile("../athletes.csv").getLines()
    for (line <- lines) {
      var fields = line.split(',')
      if (fields.length > 1) {
        athleteInfo += (fields(1) -> fields(7))
      }
    }
    return athleteInfo
  }

  def parseLine(line: String): String = {
    var athleteInfo = loadAthleteNames()
    var hello = new String
    for ((k, v) <- athleteInfo) {
      if (line.toString().contains(k)) {
        hello = k
      }
    }
    return hello
  }

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "twitterAthlete")
    val lines = sc.textFile("../twitter.test")
    var athleteInfo = loadAthleteNames()
    val splitting = lines.map(x => x.split(";")).map(x => if (x.length == 4 && x(2).length <= 140) x(2))
    var hello = new String()
    val container = splitting.map(x => for ((key, value) <- athleteInfo) if (x.toString().contains(key)) { key }).cache
    container.collect().foreach(println)
    // val mapping = container.map(x => (x, 1)).reduceByKey(_ + _)
    // mapping.collect().foreach(println)
  }
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like that ...
But I got an empty result after running this code; any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
But the result I expected is something like:
(name,1)
(other name,1)
You need to use yield to return a value from the for comprehension inside your map:
val container = splitting.map(x => for ((key, value) <- athleteInfo if x.toString().contains(key)) yield (key, 1)).cache
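If the goal is the (name, 1) counts shown in the expected output, a flatMap plus reduceByKey along the same lines could look like this (a sketch, assuming the splitting RDD and athleteInfo map from the question):
// Sketch: emit one (name, 1) pair per athlete mentioned in each tweet line,
// then sum the counts per name.
val counts = splitting
  .flatMap(x => athleteInfo.keys.filter(k => x.toString.contains(k)).map(k => (k, 1)))
  .reduceByKey(_ + _)
counts.collect().foreach(println)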
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how big or small the athlete-name data is, you may end up with a more optimized broadcast join and avoid a shuffle.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
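If the athletes DataFrame is known to be small, you can also hint the broadcast explicitly (a small sketch; broadcast comes from the org.apache.spark.sql.functions import already shown above):
// Explicitly mark the small side so Spark plans a broadcast hash join.
val withAthleteBroadcast = exploded.join(broadcast(athletes), 'word === 'name)
withAthleteBroadcast.select(exploded("id"), 'name).show()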
I want to apply a list of regexes to a string. My current approach is not very functional.
My current code:
val stopWords = List[String](
  "the",
  "restaurant",
  "bar",
  "[^a-zA-Z -]"
)

def CanonicalName(name: String): String = {
  var nameM = name
  for (reg <- stopWords) {
    nameM = nameM.replaceAll(reg, "")
  }
  nameM = nameM.replaceAll(" +", " ").trim
  return nameM
}
I think this does what you're looking for.
def CanonicalName(name: String): String = {
val stopWords = List("the", "restaurant", "bar", "[^a-zA-Z -]")
stopWords.foldLeft(name)(_.replaceAll(_, "")).replaceAll(" +"," ").trim
}
'replaceAll' has the possibility of replacing part of a word; for example, "the thermal & barbecue restaurant" becomes "rmal becue". If what you want is "thermal barbecue", you may split the name first and then apply your stopword rules word by word:
def isStopWord(word: String): Boolean = stopWords.exists(word.matches)
def CanonicalName(name: String): String =
name.replaceAll(" +", " ").trim.split(" ").flatMap(n => if (isStopWord(n)) List() else List(n)).mkString(" ")
The Scala class below parses a file using Jsoup and populates the values from the file into a Scala immutable Map. Using the + operator on the Map does not seem to have any effect, as the Map size is always zero.
import java.io.File
import org.jsoup.nodes.Document
import org.jsoup.Jsoup
import org.jsoup.select.Elements
import org.jsoup.nodes.Element
import scala.collection.immutable.TreeMap
class JdkElementDetail() {

  var fileLocation: String = _

  def this(fileLocation: String) = {
    this()
    this.fileLocation = fileLocation
  }

  def parseFile: Map[String, String] = {
    val jdkElementsMap: Map[String, String] = new TreeMap[String, String]
    val input: File = new File(fileLocation)
    val doc: Document = Jsoup.parse(input, "UTF-8", "http://example.com/")
    val e: Elements = doc.getElementsByAttribute("href")
    val href: java.util.Iterator[Element] = e.iterator()
    while (href.hasNext()) {
      var objectName = href.next()
      var hrefValue = objectName.attr("href")
      var name = objectName.text()
      jdkElementsMap + name -> hrefValue
      println("size is "+jdkElementsMap.size)
    }
    jdkElementsMap
  }
}
println("size is "+jdkElementsMap.size) always prints "size is 0"
Why is the size always zero? Am I not adding to the Map correctly?
Is the only fix for this to convert jdkElementsMap to a var and then use the following?
jdkElementsMap += name -> hrefValue
Removing the while loop, here is my updated object:
package com.parse
import java.io.File
import org.jsoup.nodes.Document
import org.jsoup.Jsoup
import org.jsoup.select.Elements
import org.jsoup.nodes.Element
import scala.collection.immutable.TreeMap
import scala.collection.JavaConverters._
class JdkElementDetail() {

  var fileLocation: String = _

  def this(fileLocation: String) = {
    this()
    this.fileLocation = fileLocation
  }

  def parseFile: Map[String, String] = {
    var jdkElementsMap: Map[String, String] = new TreeMap[String, String]
    val input: File = new File(fileLocation)
    val doc: Document = Jsoup.parse(input, "UTF-8", "http://example.com/")
    val elements: Elements = doc.getElementsByAttribute("href")
    val elementsScalaIterator = elements.iterator().asScala
    elementsScalaIterator.foreach { keyVal =>
      var hrefValue = keyVal.attr("href")
      var name = keyVal.text()
      println("size is "+jdkElementsMap.size)
      jdkElementsMap += name -> hrefValue
    }
    jdkElementsMap
  }
}
Immutable data structures -- be they lists or maps -- are just that: immutable. You don't ever change them, you create new data structures based on changes to the old ones.
If you do val x = jdkElementsMap + (name -> hrefValue), then you'll get the new map on x, while jdkElementsMap continues to be the same.
If you change jdkElementsMap into a var, then you could do jdkElementsMap = jdkElementsMap + (name -> hrefValue), or just jdkElementsMap += (name -> hrefValue). The latter will also work for mutable maps.
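A minimal illustration of that point, independent of the Jsoup code:
val original = Map("a" -> "1")
val updated = original + ("b" -> "2") // + builds a new map; original is untouched
println(original) // Map(a -> 1)
println(updated)  // Map(a -> 1, b -> 2)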
Is that the only way? No, but you have to let go of while loops to achieve the same thing. You could replace these lines:
val href: java.util.Iterator[Element] = e.iterator()
while (href.hasNext()) {
  var objectName = href.next()
  var hrefValue = objectName.attr("href")
  var name = objectName.text()
  jdkElementsMap + name -> hrefValue
  println("size is "+jdkElementsMap.size)
}
jdkElementsMap
With a fold, such as in:
import scala.collection.JavaConverters.asScalaIteratorConverter

e.iterator().asScala.foldLeft(jdkElementsMap) {
  case (accumulator, href) => // href here is an Element, not an iterator
    val objectName = href
    val hrefValue = objectName.attr("href")
    val name = objectName.text()
    val newAccumulator = accumulator + (name -> hrefValue)
    println("size is "+newAccumulator.size)
    newAccumulator
}
Or with recursion:
def createMap(hrefIterator: java.util.Iterator[Element],
              jdkElementsMap: Map[String, String]): Map[String, String] = {
  if (hrefIterator.hasNext()) {
    val objectName = hrefIterator.next()
    val hrefValue = objectName.attr("href")
    val name = objectName.text()
    val newMap = jdkElementsMap + (name -> hrefValue)
    println("size is "+newMap.size)
    createMap(hrefIterator, newMap)
  } else {
    jdkElementsMap
  }
}

createMap(e.iterator(), new TreeMap[String, String])
Performance-wise, the fold will be rather slower, and the recursion should be very slightly faster.
Mind you, Scala does provide mutable maps, and not just to be able to say it has them: if they fit your problem better, then go ahead and use them! If you want to learn how to use the immutable ones, then the two approaches above are the ones you should learn.
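For completeness, a sketch of what the mutable-map variant could look like (assuming the same Jsoup Elements value e from the question; toMap hands an immutable Map back to callers that expect one):
import scala.collection.JavaConverters._
import scala.collection.mutable

val buffer = mutable.Map.empty[String, String]
for (href <- e.iterator().asScala) {
  buffer += href.text() -> href.attr("href") // in-place update on the mutable map
}
buffer.toMap // immutable snapshot for the caller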
The map is immutable, so any modification returns a new, modified map. jdkElementsMap + (name -> hrefValue) returns a new map containing the new pair, but you are discarding the modified map after it is created.
EDIT: It looks like you can convert Java iterables to Scala iterables, so you can then fold over the resulting sequence and accumulate a map:
import scala.collection.JavaConverters._
val e: Elements = doc.getElementsByAttribute("href");
val jdkElementsMap = e.asScala
  .foldLeft(new TreeMap[String, String])((map, href) => map + (href.text() -> href.attr("href")))
If you don't care what kind of map you create, you can use toMap:
val jdkElementsMap = e.asScala
.map(href => (href.text(), href.attr("href")))
.toMap