Extracting a substring using pattern matching in Scala

I want to extract the domain name from a URI.
For example, the input to the regular expression may be any one of the following:
test.net
https://www.test.net
https://test.net
http://www.test.net
http://test.net
In all cases the result should be test.net.
Below is the code I implemented for this purpose:
val re = "([http[s]?://[w{3}\\.]?]+)(.*)".r
But I didn't get the expected result.
Below is my output:
val re(prefix, domain) = "https://www.test.net"
prefix: String = https://www.t
domain: String = est.net
What is the problem with my regular expression, and how can I fix it?

You are using a character class
[http.?://(www.)?]
This means:
either an h
or a t
or a t
or a p
or a .
or a ?
or a :
or a /
or a /
or a (
or a w
or a w
or a w
or a .
or a )
or a ?
It does not include an s, so it will not match https://.
It is not clear to me why you are using a character class here, nor why you are using duplicate characters in the class.
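As a rough illustration of the difference (the object and value names here are made up for the example): square brackets match a single character out of a set, while parentheses group a sequence of characters.

```scala
object CharClassVsGroup {
  // A character class matches exactly ONE character out of the listed set.
  val charClass = "[https]"
  // A group matches the whole sequence; the ? after s makes the s optional.
  val group = "(https?)"

  // String.matches tests the whole string against the pattern.
  val classMatchesSingleChar: Boolean = "h".matches(charClass)     // one character from the set
  val classMatchesWord: Boolean       = "https".matches(charClass) // five characters, so no full match
  val groupMatchesHttp: Boolean       = "http".matches(group)
  val groupMatchesHttps: Boolean      = "https".matches(group)
}
```

So a class like [https] never means "the text https"; it only ever consumes one character per repetition.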
Ideally, you shouldn't try to parse URIs yourself; someone else has already done the hard work. You could, for example, use the java.net.URI class:
import java.net.URI
val u1 = new URI("test.net")
u1.getHost
// res: String = null
val u2 = new URI("https://www.test.net")
u2.getHost
// res: String = www.test.net
val u3 = new URI("https://test.net")
u3.getHost
// res: String = test.net
val u4 = new URI("http://www.test.net")
u4.getHost
// res: String = www.test.net
val u5 = new URI("http://test.net")
u5.getHost
// res: String = test.net
Unfortunately, as you can see, what you want to achieve does not actually comply with the official URI syntax.
If you can fix that, then you can use java.net.URI. Otherwise, you will need to go back to your old solution and parse the URI yourself:
val re = "(?>https?://)?(?>www\\.)?([^/?#]*)".r
val re(domain1) = "test.net"
//=> domain1: String = test.net
val re(domain2) = "https://www.test.net"
//=> domain2: String = test.net
val re(domain3) = "https://test.net"
//=> domain3: String = test.net
val re(domain4) = "http://www.test.net"
//=> domain4: String = test.net
val re(domain5) = "http://test.net"
//=> domain5: String = test.net
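If you would rather stay with java.net.URI, one possible workaround is to add a scheme when the input has none, so that the input is parsed as a host rather than as a path. This is only a sketch: the object name DomainExtractor, the helper domainOf, and the choice of http:// as the default scheme are assumptions made for the example.

```scala
import java.net.URI
import scala.util.Try

object DomainExtractor {
  // Hypothetical helper: prepend a scheme when none is present so that
  // java.net.URI treats the input as a host instead of a path.
  def domainOf(input: String): Option[String] = {
    val withScheme = if (input.contains("://")) input else "http://" + input
    Try(new URI(withScheme)).toOption      // malformed input becomes None
      .flatMap(u => Option(u.getHost))     // getHost returns null when there is no host
      .map(_.stripPrefix("www."))          // drop a leading www. if present
  }
}
```

Returning an Option means malformed input shows up as None instead of a MatchError or an exception.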

Related

Function without var in scala

I have a function that takes a string and a case class as input and returns a string as output.
Different case class instances are converted to JSON and appended to a list, and the final case class, which holds that list, is returned.
I want to do this without using var, so that the list is an immutable val to which nothing is added afterwards. Is there an idiomatic Scala way of doing this?
def getResult(eventName: Option[String], content: Content): String = {
var list = List.empty[Json]
val device = Device(
DEVICE_SCHEMA,
data = content.data.device
)
list = list :+ device.asJson
val parser = Parser(
PARSER_SCHEMA,
data = content.data.parser
)
list = list :+ parser.asJson
val res = Result(
RESULT_SCHMEA,
data = list
)
res.asJson.noSpaces
}
Try inlining list creation like so
def getResult(eventName: Option[String], content: Content): String = {
val device = Device(
DEVICE_SCHEMA,
data = content.data.device
)
val parser = Parser(
PARSER_SCHEMA,
data = content.data.parser
)
Result(
RESULT_SCHMEA,
data = List(device.asJson, parser.asJson) // <== inline list creation
).asJson.noSpaces
}
Just a few small changes to the previous answer.
You don't need val res, and it's preferable to create the list outside Result for easier reading and later debugging:
def getResult(eventName: Option[String], content: Content): String = {
val device = Device(
DEVICE_SCHEMA,
data = content.data.device
)
val parser = Parser(
PARSER_SCHEMA,
data = content.data.parser
)
val jsons = List(device.asJson, parser.asJson)
Result(
RESULT_SCHMEA,
data = jsons
).asJson.noSpaces
}
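The same idea in a self-contained form, in case it helps to try it out without the JSON library. Part is a made-up stand-in for the Json values in the question, and both method names are hypothetical.

```scala
object NoVarSketch {
  // Hypothetical stand-in for the Json values in the question.
  final case class Part(schema: String, data: String)

  // Mutable style: a var that is reassigned after every append.
  def withVar(device: String, parser: String): List[Part] = {
    var list = List.empty[Part]
    list = list :+ Part("device", device)
    list = list :+ Part("parser", parser)
    list
  }

  // Immutable style: the whole list is built in a single expression.
  def withoutVar(device: String, parser: String): List[Part] =
    List(Part("device", device), Part("parser", parser))
}
```

Both versions return the same list; the second simply never needs reassignment.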

Error in code Regex

I am trying to find only the words that contain a given letter exactly 3 times (e in the example below),
and I need to do this using a regex.
val inputString = """edepak,suman,employdee,eeeee,eme,ev"""
and I have written the code below.
val numberPatteren = "([a-z]*e){3,}".r
but I am getting the output below, which is not what I expected.
employdee,eeeee
but the output should be only employdee.
Can you please help me with this?
You can achieve that simply by doing the following
scala> inputString.split(",").filter(word => word.count(_ == 'e') == 3).mkString(",")
//res16: String = employdee
If you want to use a regex, you can remove every letter and digit other than e or E and check that exactly three characters remain:
scala> val numberPatteren = "[a-df-zA-DF-Z0-9]".r
//numberPatteren: scala.util.matching.Regex = [a-df-zA-DF-Z0-9]
scala> inputString.split(",").filter(numberPatteren.replaceAllIn(_, "").length == 3).mkString(",")
//res0: String = employdee
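The original pattern uses {3,}, which means "three or more", and it is not anchored to the whole word, which would explain why eeeee is accepted. A possible regex-only variant is sketched below; the object and method names are made up for the example.

```scala
object ExactlyThreeEs {
  // [^e]* skips leading non-e characters; (?:e[^e]*){3} then requires exactly
  // three e's, each optionally followed by non-e characters.
  // String.matches tests the whole word, so a fourth e makes the match fail.
  val exactlyThree = "[^e]*(?:e[^e]*){3}"

  def filterWords(input: String): String =
    input.split(",").filter(_.matches(exactlyThree)).mkString(",")
}
```

Calling ExactlyThreeEs.filterWords(inputString) on the input from the question keeps only employdee.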

Looping through Map Spark Scala

In this code we have two files: athletes.csv, which contains names, and twitter.test, which contains the tweet messages. For every line in twitter.test, we want to find the names that match a name in athletes.csv. We used a Map to store the names from athletes.csv and want to check every name against every line in the test file.
object twitterAthlete {
def loadAthleteNames() : Map[String, String] = {
// Handle character encoding issues:
implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
// Create a Map of Strings to Strings, and populate it from athletes.csv.
var athleteInfo:Map[String, String] = Map()
//var movieNames:Map[Int, String] = Map()
val lines = Source.fromFile("../athletes.csv").getLines()
for (line <- lines) {
var fields = line.split(',')
if (fields.length > 1) {
athleteInfo += (fields(1) -> fields(7))
}
}
return athleteInfo
}
def parseLine(line:String): (String)= {
var athleteInfo = loadAthleteNames()
var hello = new String
for((k,v) <- athleteInfo){
if(line.toString().contains(k)){
hello = k
}
}
return (hello)
}
def main(args: Array[String]){
Logger.getLogger("org").setLevel(Level.ERROR)
val sc = new SparkContext("local[*]", "twitterAthlete")
val lines = sc.textFile("../twitter.test")
var athleteInfo = loadAthleteNames()
val splitting = lines.map(x => x.split(";")).map(x => if(x.length == 4 && x(2).length <= 140)x(2))
var hello = new String()
val container = splitting.map(x => for((key,value) <- athleteInfo)if(x.toString().contains(key)){key}).cache
container.collect().foreach(println)
// val mapping = container.map(x => (x,1)).reduceByKey(_+_)
//mapping.collect().foreach(println)
}
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like this ...
But I got an empty result after running this code. Any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
but the result I expected is something like:
(name,1)
(other name,1)
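A for loop without yield evaluates to Unit, which is why () is printed for every line. You need to use yield so that the for comprehension returns a value to your map: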
val container = splitting.map(x => for((key,value) <- athleteInfo ; if(x.toString().contains(key)) ) yield (key, 1)).cache
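The effect of yield can be seen without Spark. This is a small sketch with made-up sample data standing in for the athlete map and a tweet:

```scala
object YieldSketch {
  // Hypothetical sample data standing in for the athlete map and one tweet.
  val athletes: Map[String, String] = Map("Michael" -> "USA", "Martin" -> "GBR")
  val tweet = "Great race by Michael today"

  // Without yield the for loop is a statement, so its value is Unit.
  val withoutYield: Unit =
    for ((name, _) <- athletes) if (tweet.contains(name)) name

  // With a guard and yield, the comprehension produces the matching names.
  val withYield: List[String] =
    for ((name, _) <- athletes.toList if tweet.contains(name)) yield name
}
```

Here withYield holds the names found in the tweet, while withoutYield is just ().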
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending how big/small the data with athlete names is you'll end up with a more optimized broadcast join and avoid a shuffle.
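import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
// Needed for the 'symbol column syntax; assumes a SparkSession named spark, as in spark-shell.
import spark.implicits._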
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
// Tokenizer lowercases its output, so compare against the lowercased name.
val withAthlete = exploded.join(athletes, 'word === lower('name))
withAthlete.select(exploded("id"), 'name).show()

Convert RDF4J stream filter (lambda?) from Java to Scala

A follow-up to Are typed literals "tricky" in RDF4J?
I have some triples about the weight of dump trucks, using literal objects with different data types. I'm only interested in the integer values, so I want to filter based on the data type. Jeen Broekstra sent a Java solution about a week ago, and I'm having trouble converting it into Scala, my team's preferred language.
This is what I have so far. Eclipse is complaining
not found: value l
val rdf4jServer = "http://host.domain:7200"
val repositoryID = "trucks"
val MyRepo = new HTTPRepository(rdf4jServer, repositoryID)
MyRepo.initialize()
var con = MyRepo.getConnection()
val f = MyRepo.getValueFactory()
val DumpTruck = f.createIRI("http://example.com/dumpTruck")
val Weight = f.createIRI("http://example.com/weight")
val m = QueryResults.asModel(con.getStatements(DumpTruck, Weight, null))
val intValuesStream = Models.objectLiterals(m).stream()
// OK up to here
// errors start below
val intValuesFiltered =
intValuesStream.filter(l -> l.getDatatype().equals(XMLSchema.INTEGER))
val intValues = intValuesFiltered.collect(Collectors.toList())
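Scala function literals use =>, whereas -> is Java's lambda syntax. In Scala, l -> ... tries to build a tuple from a value named l, hence "not found: value l". Replace the -> with =>: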
val intValuesFiltered = intValuesStream.filter(l => l.getDatatype().equals(XMLSchema.INTEGER))
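The same conversion can be tried on a plain Java stream without RDF4J. Note that passing a Scala function literal where a Java functional interface such as Predicate is expected relies on Scala 2.12 or later. The sample data and names below are made up for the example.

```scala
import java.util.stream.Collectors

object StreamLambdaSketch {
  // Hypothetical sample data; any java.util.stream.Stream behaves the same way.
  val words: java.util.List[String] = java.util.Arrays.asList("1", "two", "3")

  // Scala 2.12+ converts the function literal to java.util.function.Predicate.
  val digitsOnly: java.util.List[String] =
    words.stream()
      .filter(w => w.forall(_.isDigit))
      .collect(Collectors.toList[String]())
}
```

The explicit type argument on Collectors.toList[String]() is there only to keep type inference simple.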

Converting a String to a Map

Given a String : {'Name':'Bond','Job':'Agent','LastEntry':'15/10/2015 13:00'}
I want to parse it into a Map[String,String]. I already tried this answer, but it doesn't work when the character : is inside the parsed value. The same goes for the ' character, which seems to break every JSON mapper...
Thanks for any help.
Let
val s0 = "{'Name':'Bond','Job':'Agent','LastEntry':'15/10/2015 13:00'}"
val s = s0.stripPrefix("{").stripSuffix("}")
Then
(for (e <- s.split(",") ; xs = e.split(":",2)) yield xs(0) -> xs(1)).toMap
Here we split each key-value pair at the first occurrence of ":". This relies on strong assumptions: no key contains ":" and no key or value contains ",". Note also that the single quotes are kept in the resulting keys and values.
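If the quotes should be stripped and values may contain ":" or ",", a regex over the quoted pairs is one possible alternative. This is only a sketch: the names QuotedMapParser and parse are made up, and it assumes there are no escaped quotes inside keys or values.

```scala
object QuotedMapParser {
  // Matches 'key':'value'. [^']* stops at the closing quote, so colons and
  // commas inside a value are kept. Assumes no escaped quotes inside.
  private val pair = "'([^']*)':'([^']*)'".r

  def parse(s: String): Map[String, String] =
    pair.findAllMatchIn(s).map(m => m.group(1) -> m.group(2)).toMap
}
```

With the string from the question, parse returns a map whose LastEntry value is 15/10/2015 13:00, with the time left intact.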
You can use the familiar jackson-module-scala, which handles this more robustly and scales much better.
For example:
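import com.fasterxml.jackson.core.JsonParser
val src = "{'Name':'Bond','Job':'Agent','LastEntry':'15/10/2015 13:00'}"
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
// Standard JSON requires double quotes; this feature lets Jackson accept single-quoted names and values.
mapper.configure(JsonParser.Feature.ALLOW_SINGLE_QUOTES, true)
val myMap = mapper.readValue[Map[String,String]](src)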