Why is immutable map size always zero?

The Scala class below parses a file using Jsoup and populates the values from the file into a Scala immutable Map. Using the + operator on the Map does not seem to have any effect, as the Map's size is always zero.
import java.io.File
import org.jsoup.nodes.Document
import org.jsoup.Jsoup
import org.jsoup.select.Elements
import org.jsoup.nodes.Element
import scala.collection.immutable.TreeMap

class JdkElementDetail() {

  var fileLocation: String = _

  def this(fileLocation: String) = {
    this()
    this.fileLocation = fileLocation;
  }

  def parseFile : Map[String , String] = {
    val jdkElementsMap: Map[String, String] = new TreeMap[String , String];
    val input: File = new File(fileLocation);
    val doc: Document = Jsoup.parse(input, "UTF-8", "http://example.com/");
    val e: Elements = doc.getElementsByAttribute("href");
    val href: java.util.Iterator[Element] = e.iterator();
    while (href.hasNext()) {
      var objectName = href.next();
      var hrefValue = objectName.attr("href");
      var name = objectName.text();
      jdkElementsMap + name -> hrefValue
      println("size is "+jdkElementsMap.size)
    }
    jdkElementsMap
  }
}
println("size is "+jdkElementsMap.size) always prints "size is 0"
Why is the size always zero, am I not adding to the Map correctly?
Is the only fix for this to convert jdkElementsMap to a var and then use the following?
jdkElementsMap += name -> hrefValue
After removing the while loop, here is my updated object:
package com.parse

import java.io.File
import org.jsoup.nodes.Document
import org.jsoup.Jsoup
import org.jsoup.select.Elements
import org.jsoup.nodes.Element
import scala.collection.immutable.TreeMap
import scala.collection.JavaConverters._

class JdkElementDetail() {

  var fileLocation: String = _

  def this(fileLocation: String) = {
    this()
    this.fileLocation = fileLocation;
  }

  def parseFile : Map[String , String] = {
    var jdkElementsMap: Map[String, String] = new TreeMap[String , String];
    val input: File = new File(fileLocation);
    val doc: Document = Jsoup.parse(input, "UTF-8", "http://example.com/");
    val elements: Elements = doc.getElementsByAttribute("href");
    val elementsScalaIterator = elements.iterator().asScala
    elementsScalaIterator.foreach { keyVal =>
      var hrefValue = keyVal.attr("href");
      var name = keyVal.text();
      println("size is "+jdkElementsMap.size)
      jdkElementsMap += name -> hrefValue
    }
    jdkElementsMap
  }
}

Immutable data structures -- be they lists or maps -- are just that: immutable. You don't ever change them; you create new data structures based on changes to the old ones.
If you do val x = jdkElementsMap + (name -> hrefValue), then you'll get the new map in x, while jdkElementsMap continues to be the same.
If you change jdkElementsMap into a var, then you could do jdkElementsMap = jdkElementsMap + (name -> hrefValue), or just jdkElementsMap += (name -> hrefValue). The latter will also work for mutable maps.
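For example, here is a minimal sketch (with invented sample values) showing that + leaves the original map untouched and produces a new one:

import scala.collection.immutable.TreeMap

val original: Map[String, String] = new TreeMap[String, String]
val updated = original + ("Iterator" -> "java/util/Iterator.html") // builds a brand-new map

println("original size is " + original.size) // 0 -- the original map is unchanged
println("updated size is " + updated.size)   // 1 -- only the new map has the entry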
Is that the only way? No, but you have to let go of while loops to achieve the same thing. You could replace these lines:
val href: java.util.Iterator[Element] = e.iterator();
while (href.hasNext()) {
  var objectName = href.next();
  var hrefValue = objectName.attr("href");
  var name = objectName.text();
  jdkElementsMap + name -> hrefValue
  println("size is "+jdkElementsMap.size)
}
jdkElementsMap
With a fold, such as in:
import scala.collection.JavaConverters.asScalaIteratorConverter

e.iterator().asScala.foldLeft(jdkElementsMap) {
  case (accumulator, href) => // href here is not an iterator
    val objectName = href
    val hrefValue = objectName.attr("href")
    val name = objectName.text()
    val newAccumulator = accumulator + (name -> hrefValue)
    println("size is " + newAccumulator.size)
    newAccumulator
}
Or with recursion:
def createMap(hrefIterator: java.util.Iterator[Element],
              jdkElementsMap: Map[String, String]): Map[String, String] = {
  if (hrefIterator.hasNext()) {
    val objectName = hrefIterator.next()
    val hrefValue = objectName.attr("href")
    val name = objectName.text()
    val newMap = jdkElementsMap + (name -> hrefValue)
    println("size is " + newMap.size)
    createMap(hrefIterator, newMap)
  } else {
    jdkElementsMap
  }
}

createMap(e.iterator(), new TreeMap[String, String])
Performance-wise, the fold will be rather slower than the while loop, and the recursion should be very slightly faster.
Mind you, Scala does provide mutable maps, and not just to be able to say it has them: if they fit your problem better, then go ahead and use them! If you want to learn how to use the immutable ones, then the two approaches above are the ones you should learn.
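For comparison, a minimal sketch of the same loop with a mutable map (assuming the same Jsoup Elements value e from the question):

import scala.collection.mutable
import scala.collection.JavaConverters.asScalaIteratorConverter

val mutableMap = mutable.Map.empty[String, String]
e.iterator().asScala.foreach { element =>
  mutableMap += element.text() -> element.attr("href") // updates the map in place
}
println("size is " + mutableMap.size)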

The map is immutable, so any modification returns a new, modified map. jdkElementsMap + (name -> hrefValue) returns a new map containing the new pair, but you are discarding that new map after it is created.
EDIT: It looks like you can convert Java iterables to Scala iterables, so you can then fold over the resulting sequence and accumulate a map:
import scala.collection.JavaConverters._

val e: Elements = doc.getElementsByAttribute("href");
val jdkElementsMap = e.asScala
  .foldLeft(new TreeMap[String, String])((map, href) => map + (href.text() -> href.attr("href")))
If you don't care what kind of map you create, you can use toMap:
val jdkElementsMap = e.asScala
  .map(href => (href.text(), href.attr("href")))
  .toMap

Related

List All objects in S3 with given Prefix in scala

I am trying to list all objects in an AWS S3 bucket with an input bucket name and filter prefix, using the following code.
import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ListObjectsV2Request

val bucket_name = "Mybucket"
val fiter_prefix = "Test/a/"

def list_objects(str: String): mutable.Buffer[String] = {
  val request : ListObjectsV2Request = new ListObjectsV2Request().withBucketName(bucket_name).withPrefix(str)
  var result: ListObjectsV2Result = new ListObjectsV2Result()
  do {
    result = s3_client.listObjectsV2(request)
    val token = result.getNextContinuationToken
    System.out.println("Next Continuation Token: " + token)
    request.setContinuationToken(token)
  } while (result.isTruncated)
  result.getObjectSummaries.asScala.map(_.getKey).size
}

list_objects(fiter_prefix)
I have applied the continuation method, but I am only getting the last list of objects. For example, if the prefix has 2210 objects, I get back only 210 objects.
Regards
Mahi
listObjectsV2 returns some or all (up to 1,000) of the objects in a bucket, as stated here. You need to use the continuation token to iterate over the rest of the objects in the bucket.
There is example code here for Java.
This is the code which worked for me.
import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsV2Request, ListObjectsV2Result}

val bucket_name = "Mybucket"
val fiter_prefix = "Test/a/"

def list_objects(str: String): List[String] = {
  val s3_client = new AmazonS3Client
  var final_list: List[String] = List()
  var list: List[String] = List()
  val request: ListObjectsV2Request = new ListObjectsV2Request().withBucketName(bucket_name).withPrefix(str)
  var result: ListObjectsV2Result = new ListObjectsV2Result()
  do {
    result = s3_client.listObjectsV2(request)
    val token = result.getNextContinuationToken
    System.out.println("Next Continuation Token: " + token)
    request.setContinuationToken(token)
    list = result.getObjectSummaries.asScala.map(_.getKey).toList
    println(list.size)
    final_list = final_list ::: list
    println(final_list)
  } while (result.isTruncated)
  println("size", final_list.size)
  final_list
}

list_objects(fiter_prefix)
A solution using vanilla Scala, avoiding vars and using tail recursion:
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{ListObjectsV2Request, ListObjectsV2Response}

import scala.annotation.tailrec
import scala.collection.JavaConverters.asScalaBufferConverter
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

val sourceBucket = "yourbucket"
val sourceKey = "yourKey"
val subFolderPrefix = "yourprefix"

def getAllPaths(s3Client: S3Client, initReq: ListObjectsV2Request): List[String] = {

  @tailrec
  def listAllObjectsV2(
      s3Client: S3Client,
      req: ListObjectsV2Request,
      tokenOpt: Option[String],
      isFirstTime: Boolean,
      initList: ListBuffer[String]
  ): ListBuffer[String] = {
    println(s"IsFirstTime = ${isFirstTime}, continuationToken = ${tokenOpt}")
    (isFirstTime, tokenOpt) match {
      case (true, Some(x)) =>
        // this combo is not possible..
        initList
      case (false, None) =>
        // end
        initList
      case (_, _) =>
        // possible scenarios are :
        // true, None     : First iteration
        // false, Some(x) : Second iteration onwards
        val response =
          s3Client.listObjectsV2(tokenOpt.fold(req)(token => req.toBuilder.continuationToken(token).build()))
        val keys: Seq[String] = response.contents().asScala.toList.map(_.key())
        val nextTokenOpt = Option(response.nextContinuationToken())
        listAllObjectsV2(s3Client, req, nextTokenOpt, isFirstTime = false, keys ++: initList)
    }
  }

  listAllObjectsV2(s3Client, initReq, None, true, mutable.ListBuffer.empty[String]).toList
}

val s3Client = S3Client.builder().region(Region.US_WEST_2).build()

val request: ListObjectsV2Request =
  ListObjectsV2Request.builder
    .bucket(sourceBucket)
    .prefix(sourceKey + "/" + subFolderPrefix)
    .build

val listofAllKeys: List[String] = getAllPaths(s3Client, request)

Looping through Map Spark Scala

Within this code we have two files: athletes.csv, which contains names, and twitter.test, which contains tweet messages. We want to find, for every single line in twitter.test, the names that match a name in athletes.csv. We applied a map function to store the names from athletes.csv and want to check all of the names against every line in the test file.
object twitterAthlete {

  def loadAthleteNames() : Map[String, String] = {
    // Handle character encoding issues:
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    // Create a Map of Strings to Strings, and populate it from athletes.csv.
    var athleteInfo: Map[String, String] = Map()
    //var movieNames:Map[Int, String] = Map()
    val lines = Source.fromFile("../athletes.csv").getLines()
    for (line <- lines) {
      var fields = line.split(',')
      if (fields.length > 1) {
        athleteInfo += (fields(1) -> fields(7))
      }
    }
    return athleteInfo
  }

  def parseLine(line: String): (String) = {
    var athleteInfo = loadAthleteNames()
    var hello = new String
    for ((k, v) <- athleteInfo) {
      if (line.toString().contains(k)) {
        hello = k
      }
    }
    return (hello)
  }

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "twitterAthlete")
    val lines = sc.textFile("../twitter.test")
    var athleteInfo = loadAthleteNames()
    val splitting = lines.map(x => x.split(";")).map(x => if (x.length == 4 && x(2).length <= 140) x(2))
    var hello = new String()
    val container = splitting.map(x => for ((key, value) <- athleteInfo) if (x.toString().contains(key)) { key }).cache
    container.collect().foreach(println)
    // val mapping = container.map(x => (x,1)).reduceByKey(_+_)
    //mapping.collect().foreach(println)
  }
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like this ...
But I got an empty result after running this code; any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
But the result I expected is something like:
(name,1)
(other name,1)
You need to use yield to return a value from the for comprehension inside your map:
val container = splitting.map(x => for((key,value) <- athleteInfo ; if(x.toString().contains(key)) ) yield (key, 1)).cache
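To illustrate what yield changes, here is a minimal, Spark-free sketch with made-up sample data:

val athleteInfo = Map("Michael" -> "USA", "Json" -> "GBR")
val tweet = "some message that mentions Michael"

// Without yield the for comprehension returns Unit, so nothing is collected.
// With yield the matching keys are gathered into a new collection:
val matches = for ((key, _) <- athleteInfo if tweet.contains(key)) yield (key, 1)
println(matches) // Map(Michael -> 1)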
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how big or small the athlete-names data is, you may end up with a more optimized broadcast join and avoid a shuffle.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()

scala class member function as UDF

I am trying to define a member function in a class that will be used as a UDF while parsing data from a JSON file. I am using a trait to define a set of methods and a class to override those methods.
trait geouastr {
  def getGeoLocation(ipAddress: String): Map[String, String]
  def uaParser(ua: String): Map[String, String]
}

class GeoUAData(appName: String, sc: SparkContext, conf: SparkConf, combinedCSV: String) extends geouastr with Serializable {

  val spark = SparkSession.builder.config(conf).getOrCreate()
  val GEOIP_FILE_COMBINED = combinedCSV;
  val logger = LogFactory.getLog(this.getClass)
  val allDF = spark.
    read.
    option("header", "true").
    option("inferSchema", "true").
    csv(GEOIP_FILE_COMBINED).cache
  val emptyMap = Map(
    "country" -> "",
    "state" -> "",
    "city" -> "",
    "zipCode" -> "",
    "latitude" -> 0.0.toString(),
    "longitude" -> 0.0.toString())

  override def getGeoLocation(ipAddress: String): Map[String, String] = {
    val ipLong = ipToLong(ipAddress)
    try {
      logger.error("Entering UDF " + ipAddress + " allDF " + allDF.count())
      val resultDF = allDF.
        filter(allDF("network").cast("long") <= ipLong.get).
        filter(allDF("broadcast") >= ipLong.get).
        select(allDF("country_name"), allDF("subdivision_1_name"), allDF("city_name"),
          allDF("postal_code"), allDF("latitude"), allDF("longitude"))
      val matchingDF = resultDF.take(1)
      val matchRow = matchingDF(0)
      logger.error("Lookup for " + ipAddress + " Map " + matchRow.toString())
      val geoMap = Map(
        "country" -> nullCheck(matchRow.getAs[String](0)),
        "state" -> nullCheck(matchRow.getAs[String](1)),
        "city" -> nullCheck(matchRow.getAs[String](2)),
        "zipCode" -> nullCheck(matchRow.getAs[String](3)),
        "latitude" -> matchRow.getAs[Double](4).toString(),
        "longitude" -> matchRow.getAs[Double](5).toString())
      geoMap
    } catch {
      case (nse: NoSuchElementException) => {
        logger.error("No such element", nse)
        emptyMap
      }
      case (npe: NullPointerException) => {
        logger.error("NPE for " + ipAddress + " allDF " + allDF.count(), npe)
        emptyMap
      }
      case (ex: Exception) => {
        logger.error("Generic exception " + ipAddress, ex)
        emptyMap
      }
    }
  }

  def nullCheck(input: String): String = {
    if (input != null) input
    else ""
  }

  override def uaParser(ua: String): Map[String, String] = {
    val client = Parser.get.parse(ua)
    return Map(
      "os" -> client.os.family,
      "device" -> client.device.family,
      "browser" -> client.userAgent.family)
  }

  def ipToLong(ip: String): Option[Long] = {
    Try(ip.split('.').ensuring(_.length == 4)
      .map(_.toLong).ensuring(_.forall(x => x >= 0 && x < 256))
      .zip(Array(256L * 256L * 256L, 256L * 256L, 256L, 1L))
      .map { case (x, y) => x * y }
      .sum).toOption
  }
}
I notice uaParser works fine, while getGeoLocation returns emptyMap (running into an NPE). Here is a snippet that shows how I am using this in the main method.
val appName = "SampleApp"
val conf: SparkConf = new SparkConf().setAppName(appName)
val sc: SparkContext = new SparkContext(conf)
val spark = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
val geouad = new GeoUAData(appName, sc, conf, args(1))
val uaParser = Sparkudf(geouad.uaParser(_: String))
val geolocation = Sparkudf(geouad.getGeoLocation(_: String))
val sampleRdd = sc.textFile(args(0))
val json = sampleRdd.filter(_.nonEmpty)
import spark.implicits._
val sampleDF = spark.read.json(json)
val columns = sampleDF.select($"user-agent", $"source_ip")
.withColumn("sourceIp", $"source_ip")
.withColumn("geolocation", geolocation($"source_ip"))
.withColumn("uaParsed", uaParser($"user-agent"))
.withColumn("device", ($"uaParsed") ("device"))
.withColumn("os", ($"uaParsed") ("os"))
.withColumn("browser", ($"uaParsed") ("browser"))
.withColumn("country" , ($"geolocation")("country"))
.withColumn("state" , ($"geolocation")("state"))
.withColumn("city" , ($"geolocation")("city"))
.withColumn("zipCode" , ($"geolocation")("zipCode"))
.withColumn("latitude" , ($"geolocation")("latitude"))
.withColumn("longitude" , ($"geolocation")("longitude"))
.drop("geolocation")
.drop("uaParsed")
Questions:
1. Should we switch from class to object for defining UDFs? (I can keep it as a singleton.)
2. Can a class member function be used as a UDF?
3. When such a UDF is invoked, will a class member like allDF remain initialized?
4. Will a val declared as a member variable get initialized at the time geouad is constructed?
I am new to Scala; thanks in advance for guidance/suggestions.
No, switching from class to object is not necessary for defining a UDF; it only differs in how the UDF is called.
Yes, you can use a class member function as a UDF, but first you need to register the function as a UDF.
spark.sqlContext.udf.register("registeredName", Class Method _)
No, other methods are initialized when calling one UDF
Yes, a val class member will be initialized at the time geouad is constructed and actions are performed on it.
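As a rough sketch of that registration (assuming a SparkSession named spark and a simplified stand-in class; the names here are illustrative, not the poster's actual code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

class UaUtils extends Serializable {
  // simplified stand-in for geouad.uaParser
  def uaParser(ua: String): Map[String, String] = Map("os" -> ua.take(3))
}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val uaUtils = new UaUtils

// Register the member function for use in SQL expressions:
spark.udf.register("uaParser", uaUtils.uaParser _)

// Or wrap it for the DataFrame API:
val uaParserUdf = udf(uaUtils.uaParser _)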

Scala 2.11.8, OS:Windows 7, Java: JDK1.8

I am creating companion objects. How do I traverse these objects? I have written code, but it is not working; an error is thrown.
Please help here
scala> :paste
object Network {
  class Member(val name: String) {
    var strName = name
    val contacts = new collection.mutable.ArrayBuffer[Member]
    println(" name -->" + strName)
  }
}

class Network {
  private val members = new collection.mutable.ArrayBuffer[Network.Member]
  def join(name: String) = {
    val m = new Network.Member(name)
    members += m
    m
  }
}

val chatter = new Network
val myFace = new Network
val fred = chatter.join("Fred")
val wilma = chatter.join("Wilma")
fred.contacts += wilma // OK
val barney = myFace.join("Barney") // Has type myFace.Member
fred.contacts += barney // allowed
Here is the traversal code I have written, which is not working and throws an error:
for (a <- fred.contacts) {
  var Network.Member m = a
  println("m -->" + m.strName)
  //println("m -->" + a)
}
The declaration of the m variable is not correct.
var m:Network.Member = a
That's the correct way to declare a variable in Scala. Or you can just omit the type and let Scala infer it.
var m = a
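Putting it together, the traversal could look like this (a minimal sketch reusing the classes above):

for (a <- fred.contacts) {
  val m: Network.Member = a // the type annotation is optional; Scala can infer it
  println("m -->" + m.strName)
}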

Chisel: Access to Module Parameters from Tester

How does one access the parameters used to construct a Module from inside the Tester that is testing it?
In the test below I am passing the parameters explicitly both to the Module and to the Tester. I would prefer not to have to pass them to the Tester but instead extract them from the module that was also passed in.
Also, I am new to Scala/Chisel, so any tips on bad techniques I'm using would be appreciated :).
import Chisel._
import math.pow

class TestA(dataWidth: Int, arrayLength: Int) extends Module {
  val dataType = Bits(INPUT, width = dataWidth)
  val arrayType = Vec(gen = dataType, n = arrayLength)
  val io = new Bundle {
    val i_valid = Bool(INPUT)
    val i_data = dataType
    val i_array = arrayType
    val o_valid = Bool(OUTPUT)
    val o_data = dataType.flip
    val o_array = arrayType.flip
  }
  io.o_valid := io.i_valid
  io.o_data := io.i_data
  io.o_array := io.i_array
}

class TestATests(c: TestA, dataWidth: Int, arrayLength: Int) extends Tester(c) {
  val maxData = pow(2, dataWidth).toInt
  for (t <- 0 until 16) {
    val i_valid = rnd.nextInt(2)
    val i_data = rnd.nextInt(maxData)
    val i_array = List.fill(arrayLength)(rnd.nextInt(maxData))

    poke(c.io.i_valid, i_valid)
    poke(c.io.i_data, i_data)
    (c.io.i_array, i_array).zipped foreach {
      (element, value) => poke(element, value)
    }

    expect(c.io.o_valid, i_valid)
    expect(c.io.o_data, i_data)
    (c.io.o_array, i_array).zipped foreach {
      (element, value) => poke(element, value)
    }
    step(1)
  }
}

object TestAObject {
  def main(args: Array[String]): Unit = {
    val tutArgs = args.slice(0, args.length)
    val dataWidth = 5
    val arrayLength = 6
    chiselMainTest(tutArgs, () => Module(
      new TestA(dataWidth = dataWidth, arrayLength = arrayLength))) {
      c => new TestATests(c, dataWidth = dataWidth, arrayLength = arrayLength)
    }
  }
}
If you make the arguments dataWidth and arrayLength members of TestA you can just reference them. In Scala this can be accomplished by inserting val into the argument list:
class TestA(val dataWidth: Int, val arrayLength: Int) extends Module ...
Then you can reference them from the test as members with c.dataWidth or c.arrayLength.
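Applied to the code above, a sketch of what that change looks like (bodies elided for brevity):

class TestA(val dataWidth: Int, val arrayLength: Int) extends Module {
  // ... module body unchanged ...
}

class TestATests(c: TestA) extends Tester(c) {
  // The parameters are now read back from the module under test:
  val maxData = pow(2, c.dataWidth).toInt
  val i_array = List.fill(c.arrayLength)(rnd.nextInt(maxData))
  // ... rest of the test unchanged ...
}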