Function without var in Scala

I have a function that takes a string and a case class as input and returns a string as output.
Different case classes get appended to a list, and the final case class, which holds the list, is returned.
I want to do this without using var. A val list would be immutable, so no data could be added to it. Is there another way of doing it, the Scala way?
def getResult(eventName: Option[String], content: Content): String = {
  var list = List.empty[Json]
  val device = Device(
    DEVICE_SCHEMA,
    data = content.data.device
  )
  list = list :+ device.asJson
  val parser = Parser(
    PARSER_SCHEMA,
    data = content.data.parser
  )
  list = list :+ parser.asJson
  val res = Result(
    RESULT_SCHMEA,
    data = list
  )
  res.asJson.noSpaces
}

Try inlining the list creation, like so:
def getResult(eventName: Option[String], content: Content): String = {
  val device = Device(
    DEVICE_SCHEMA,
    data = content.data.device
  )
  val parser = Parser(
    PARSER_SCHEMA,
    data = content.data.parser
  )
  Result(
    RESULT_SCHMEA,
    data = List(device.asJson, parser.asJson) // <== inline list creation
  ).asJson.noSpaces
}

Just a few small changes to the previous answer.
You don't need val res, and it's preferable to create the list outside Result for easier reading and later debugging:
def getResult(eventName: Option[String], content: Content): String = {
  val device = Device(
    DEVICE_SCHEMA,
    data = content.data.device
  )
  val parser = Parser(
    PARSER_SCHEMA,
    data = content.data.parser
  )
  val jsons = List(device.asJson, parser.asJson)
  Result(
    RESULT_SCHMEA,
    data = jsons
  ).asJson.noSpaces
}
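If some elements are only appended conditionally (which is usually why the var creeps in), the same var-free style still works by concatenating optional pieces. A minimal sketch building on the values above; the includeParser flag is hypothetical:

// Hypothetical condition: append parser.asJson only when it applies.
val maybeParser: Option[Json] = if (includeParser) Some(parser.asJson) else None
val jsons: List[Json] = List(device.asJson) ++ maybeParser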

Related

How to create a DataFrame using a string consisting of key-value pairs?

I'm getting logs from a firewall in CEF format as a string, which looks like:
ABC|XYZ|F123|1.0|DSE|DSE|4|externalId=e705265d0d9e4d4fcb218b cn2=329160 cn1=3053998 dhost=SRV2019 duser=admin msg=Process accessed NTDS fname=ntdsutil.exe filePath=\\Device\\HarddiskVolume2\\Windows\\System32 cs5="C:\\Windows\\system32\\ntdsutil.exe" "ac i ntds" ifm "create full ntdstest3" q q fileHash=80c8b68240a95 dntdom=adminDomain cn3=13311 rt=1610948650000 tactic=Credential Access technique=Credential Dumping objective=Gain Access patternDisposition=Detection. outcome=0
How can I create a DataFrame from this kind of string, where I'm getting key-value pairs separated by =?
My objective is to infer a schema from this string using the keys dynamically, i.e. extract the keys from the left side of the = operator and create a schema using them.
What I have been doing currently is pretty lame (IMHO) and not very dynamic in approach (because the number of key-value pairs can change for different types of logs):
val a: String = "ABC|XYZ|F123|1.0|DSE|DCE|4|externalId=e705265d0d9e4d4fcb218b cn2=329160 cn1=3053998 dhost=SRV2019 duser=admin msg=Process accessed NTDS fname=ntdsutil.exe filePath=\\Device\\HarddiskVolume2\\Windows\\System32 cs5="C:\\Windows\\system32\\ntdsutil.exe" "ac i ntds" ifm "create full ntdstest3" q q fileHash=80c8b68240a95 dntdom=adminDomain cn3=13311 rt=1610948650000 tactic=Credential Access technique=Credential Dumping objective=Gain Access patternDisposition=Detection. outcome=0"
val ttype: String = "DCE"
type parseReturn = (String, String, List[String], Int)

def cefParser(a: String, ttype: String): parseReturn = {
  val firstPart = a.split("\\|")
  var pD = new ListBuffer[String]()
  var listSize: Int = 0
  if (firstPart.size == 8 && firstPart(4) == ttype) {
    pD += firstPart(0)
    pD += firstPart(1)
    pD += firstPart(2)
    pD += firstPart(3)
    pD += firstPart(4)
    pD += firstPart(5)
    pD += firstPart(6)
    val secondPart = parseSecondPart(firstPart(7), ttype)
    pD ++= secondPart
    listSize = pD.toList.length
    (firstPart(2), ttype, pD.toList, listSize)
  } else {
    val temp: List[String] = List(a)
    (firstPart(2), "IRRELEVANT", temp, temp.length)
  }
}
The method parseSecondPart is:
def parseSecondPart(m: String, ttype: String): ListBuffer[String] = ttype match {
  case auditActivity.ttype => parseAuditEvent(m)
Another function call that just replaces some text in the logs:
def parseAuditEvent(msg: String): ListBuffer[String] = {
  val updated_msg = msg.replace("cat=", "metadata_event_type=")
    .replace("destinationtranslatedaddress=", "event_user_ip=")
    .replace("duser=", "event_user_id=")
    .replace("deviceprocessname=", "event_service_name=")
    .replace("cn3=", "metadata_offset=")
    .replace("outcome=", "event_success=")
    .replace("devicecustomdate1=", "event_utc_timestamp=")
    .replace("rt=", "metadata_event_creation_time=")
  parseEvent(updated_msg)
}
Final function to get only the values:
def parseEvent(msg: String): ListBuffer[String] = {
  val newMsg = msg.replace("\\=", "$_equal_$")
  val pD = new ListBuffer[String]()
  val splitData = newMsg.split("=")
  val mSize = splitData.size
  for (i <- 1 until mSize) {
    if (i < mSize - 1) {
      val a = splitData(i).split(" ")
      val b = a.size - 1
      val c = a.slice(0, b).mkString(" ")
      pD += c.replace("$_equal_$", "=")
    } else if (i == mSize - 1) {
      val a = splitData(i).replace("$_equal_$", "=")
      pD += a
    } else {
      logExceptions(newMsg)
    }
  }
  pD
}
The return value contains a ListBuffer[String] at the 3rd position, using which I create a DataFrame as follows:
val df = ss.sqlContext
  .createDataFrame(tempRDD.filter(x => x._1 != "IRRELEVANT")
  .map(x => Row.fromSeq(x._3)), schema)
People of Stack Overflow, I really need your help in improving my code, both for performance and approach.
Any kind of help and/or suggestions will be highly appreciated.
Thanks in advance.
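For what it's worth, here is a minimal sketch of a more dynamic approach. It is a sketch under assumptions, not a drop-in replacement: ss is the SparkSession from the snippet above, a is the sample log line, each value is assumed to run until the next key= token, and one log line becomes one single-row DataFrame.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Capture key=value pairs; a value runs until the next " key=" or end of line.
val kvPattern = """(\w+)=(.*?)(?=\s+\w+=|$)""".r

def parseKeyValues(msg: String): Seq[(String, String)] =
  kvPattern.findAllMatchIn(msg).map(m => (m.group(1), m.group(2))).toSeq

// Infer the schema dynamically from the keys of the sample line.
val pairs = parseKeyValues(a.split("\\|").last)
val schema = StructType(pairs.map { case (k, _) => StructField(k, StringType) })
val df = ss.createDataFrame(
  ss.sparkContext.parallelize(Seq(Row.fromSeq(pairs.map(_._2)))),
  schema
)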

Reading multiple files with akka streams in scala

I'm trying to read multiple files with Akka Streams and put the result in a list.
I can read one file with no problem; the return type is Future[Seq[String]]. The problem is that processing the sequence inside the Future must go inside an onComplete {}.
I'm trying the following code, but obviously it will not work: the list acc outside the onComplete is empty, but it holds values inside the onComplete. I understand the problem, but I don't know how to approach this.
// works fine
def readStream(path: String, date: String): Future[Seq[String]] = {
  implicit val system = ActorSystem("Sys")
  val settings = ActorMaterializerSettings(system)
  implicit val materializer = ActorMaterializer(settings)

  val result: Future[Seq[String]] =
    FileIO.fromPath(Paths.get(path + "transactions_" + date + ".data"))
      .via(Framing.delimiter(ByteString("\n"), 256, true))
      .map(_.utf8String)
      .toMat(Sink.seq)(Keep.right)
      .run()

  var aa: List[scala.Array[String]] = Nil
  result.onComplete(x => {
    aa = x.get.map(line => line.split('|')).toList
  })
  result
}
// this won't work
def concatFiles(path: String, date: String, numberOfDays: Int): List[scala.Array[String]] = {
  val formatter = DateTimeFormatter.ofPattern("yyyyMMdd")
  val formattedDate = LocalDate.parse(date, formatter)
  var acc = List[scala.Array[String]]()
  for (a <- 0 to numberOfDays) {
    val date = formattedDate.minusDays(a).toString().replace("-", "")
    val transactions = readStream(path, date)
    var result: List[scala.Array[String]] = Nil
    transactions.onComplete(x => {
      result = x.get.map(line => line.split('|')).toList
      acc = acc ++ result
    })
  }
  acc
}
General Solution
Given an Iterator of Path values, a Source of the file lines can be created by combining FileIO & flatMapConcat:
val lineSourceFromPaths: (() => Iterator[Path]) => Source[String, _] = pathsIterator =>
  Source
    .fromIterator(pathsIterator)
    .flatMapConcat { path =>
      FileIO
        .fromPath(path)
        .via(Framing.delimiter(ByteString("\n"), 256, true))
        .map(_.utf8String)
    }
Application to Question
The reason your List is empty is that the Future values have not completed, and therefore your mutable list has not been updated before the function returns it.
Critique of Code in Question
The organization and style of the code within the question suggest several misunderstandings related to akka & Future. I think you are attempting a rather complex workflow without understanding the fundamentals of the tools you are trying to use.
1. You should not create an ActorSystem each time the function is called. There is usually one ActorSystem per application, and it's created only once:
implicit val system = ActorSystem("Sys")
val settings = ActorMaterializerSettings(system)
implicit val materializer = ActorMaterializer(settings)
def readStream(...
2. You should try to avoid mutable collections and instead use Iterator with the corresponding functionality:
def concatFiles(path: String, date: String, numberOfDays: Int): Source[String, _] = {
  val formattedDate = LocalDate.parse(date, DateTimeFormatter.ofPattern("yyyyMMdd"))
  val pathsIterator: () => Iterator[Path] = () =>
    Iterator
      .range(0, numberOfDays + 1)
      .map(i => formattedDate.minusDays(i).toString.replace("-", ""))
      .map(day => Paths.get(path + "transactions_" + day + ".data"))
  lineSourceFromPaths(pathsIterator)
}
3. Since you are dealing with Futures, you should not wait for them to complete; instead, change the return type of concatFiles to Future[List[Array[String]]].
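For that last point, a minimal sketch (the name concatFileLines and the implicit parameters are assumptions) that materializes the Source returned by concatFiles above, splitting each line on '|':

import akka.stream.Materializer
import akka.stream.scaladsl.Sink
import scala.concurrent.{ExecutionContext, Future}

// Run the stream and collect all split lines into a Future-wrapped List.
def concatFileLines(path: String, date: String, numberOfDays: Int)(
    implicit mat: Materializer, ec: ExecutionContext): Future[List[Array[String]]] =
  concatFiles(path, date, numberOfDays)
    .map(_.split('|'))
    .runWith(Sink.seq)
    .map(_.toList)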

Looping through Map Spark Scala

Within this code we have two files: athletes.csv, which contains names, and twitter.test, which contains tweet messages. We want to find, for every single line in twitter.test, the names that match a name in athletes.csv. We applied a map function to store the names from athletes.csv, and we want to check every name against every line in the test file.
object twitterAthlete {

  def loadAthleteNames(): Map[String, String] = {
    // Handle character encoding issues:
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    // Create a Map of athlete names to one of the CSV fields, populated from athletes.csv.
    var athleteInfo: Map[String, String] = Map()
    val lines = Source.fromFile("../athletes.csv").getLines()
    for (line <- lines) {
      val fields = line.split(',')
      if (fields.length > 1) {
        athleteInfo += (fields(1) -> fields(7))
      }
    }
    athleteInfo
  }

  def parseLine(line: String): String = {
    val athleteInfo = loadAthleteNames()
    var hello = new String
    for ((k, v) <- athleteInfo) {
      if (line.contains(k)) {
        hello = k
      }
    }
    hello
  }

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "twitterAthlete")
    val lines = sc.textFile("../twitter.test")
    val athleteInfo = loadAthleteNames()
    val splitting = lines.map(x => x.split(";")).map(x => if (x.length == 4 && x(2).length <= 140) x(2))
    val container = splitting.map(x => for ((key, value) <- athleteInfo) if (x.toString().contains(key)) { key }).cache
    container.collect().foreach(println)
    // val mapping = container.map(x => (x, 1)).reduceByKey(_ + _)
    // mapping.collect().foreach(println)
  }
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like this ...
But I got an empty result after running this code; any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
But the result I expected is something like:
(name,1)
(other name,1)
You need to use yield to return a value from the for comprehension inside your map:
val container = splitting.map(x => for ((key, value) <- athleteInfo; if x.toString().contains(key)) yield (key, 1)).cache
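If you then want the (name, 1) counts from the question, one way (a sketch building on the splitting and athleteInfo values above) is to flatten the per-line matches and reduce:

// Keep every athlete name a tweet mentions, then count occurrences.
val counts = splitting
  .flatMap(x => athleteInfo.keys.filter(key => x.toString.contains(key)))
  .map(name => (name, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println)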
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how big or small the athlete-name data is, you'll end up with a more optimized broadcast join and avoid a shuffle.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
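If the athlete table is known to be small, you can also make the broadcast explicit instead of relying on the optimizer's size estimate; a small sketch (withAthleteHinted is just an illustrative name):

import org.apache.spark.sql.functions.broadcast

// Hint that `athletes` should be shipped to every executor, avoiding a
// shuffle of the much larger exploded tweets.
val withAthleteHinted = exploded.join(broadcast(athletes), 'word === 'name)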

Add scoped variable per row iteration in Apache Spark

I'm reading multiple HTML files into a DataFrame in Spark.
I'm converting elements of the HTML to columns in the DataFrame using a custom udf:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
  .withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
  ...

def parseDocValue(cssSelectorQuery: String) =
  udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())
This works perfectly; however, each withColumn call will result in parsing the HTML string again, which is redundant.
Is there a way (without using lookup tables or such) that I can generate one parsed Document (Jsoup.parse(html)) based on the "filecontent" column per row and make that available for all withColumn calls in the DataFrame?
Or shouldn't I even try using DataFrames, and just use RDDs?
So the final answer was in fact quite simple:
just map over the rows and create the Document object once there:
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
  val domObject = document.select(cssSelectorQuery)
  val domValue = attr match {
    case Some(a) => domObject.attr(a)
    case None => domObject.text()
  }
  domValue match {
    case x if x == null || x.isEmpty => None
    case y => Some(y)
  }
}

val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath, minPartitions = 265)
  .map {
    case (filepath, filecontent) => {
      implicit val document = Jsoup.parse(filecontent)
      val customDataJson = docJson(filecontent, customJsonRegex)
      DataEntry(
        biz_name = docValue(".biz-page-title"),
        biz_website = docValue(".biz-website a"),
        url = docValue("meta[property=og:url]", attr = Some("content")),
        ...
        filename = Some(fileName(filepath)),
        fileTimestamp = Some(fileTimestamp(filepath))
      )
    }
  }
  .toDS()
I'd probably rewrite it as follows, to do the parsing and selecting in one go and put them in a temporary column:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent") // needed before withColumn, as in the question
  .withColumn("temp", parseDocValue(Array(".biz-page-title", ".biz-website a"))('filecontent))
  .withColumn("biz_name", col("temp")(0))
  .withColumn("biz_website", col("temp")(1))
  .drop("temp")

def parseDocValue(cssSelectorQueries: Array[String]) =
  udf((html: String) => {
    val j = Jsoup.parse(html)
    cssSelectorQueries.map(query => j.select(query).text())
  })

Parse a single RDF string

I have two strings of RDF Turtle data:
val a: String = "<http://www.test.com/meta#0001> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class>"
val b: String = "<http://www.test.com/meta#0002> <http://www.test.com/meta#CONCEPT_hasType> \"BEAR\"^^<http://www.w3.org/2001/XMLSchema#string>"
Each line has 3 items in it. I want to run one line through an RDF parser and get:
val items : Array[String] = magicallyParse(a)
items(0) == "http://www.test.com/meta#0001"
Bonus if I can also extract the local names from each parsed item:
0001, type, Class
0002, CONCEPT_hasType, (BEAR, string)
Is there a library out there (Java or Scala) that would do this split for me? I have looked at Jena and OpenRDF but could not find a way to do this single-line split-up.
Thanks to @AndyS's suggestion, I came up with this for triples:
val line1: String = "<http://www.test.com/meta#0001> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> ."
val reader: Reader = new StringReader(line1)
val tokenizer = TokenizerFactory.makeTokenizer(reader)
val graph: Graph = GraphFactory.createDefaultGraph()
val sink: StreamRDF = StreamRDFLib.graph(graph)
val turtle: LangTurtle = new LangTurtle(tokenizer, new ParserProfileBase(new Prologue(), null), sink)
turtle.parse()
println("item is this: " + graph)
println(graph.size())
println(graph.find(null, null, null).next())
val trip = graph.find(null, null, null).next()
val sub = trip.getSubject
val pred = trip.getPredicate
val obj = trip.getObject
println(s"subject[$sub] predicate[$pred] object[$obj]")
val subLoc = sub.getLocalName
val predLoc = pred.getLocalName
val objLoc = obj.getLocalName
println(s"subject[$subLoc] predicate[$predLoc] object[$objLoc]")
And then for quads I referenced this code to get this:
def extractRdfLineAsQuad(line: String): Option[Quad] = {
  val reader: Reader = new StringReader(line)
  val tokenizer = TokenizerFactory.makeTokenizer(reader)
  val parser: LangNQuads = new LangNQuads(tokenizer, RiotLib.profile(Lang.NQUADS, null), null)
  if (parser.hasNext) Some(parser.next())
  else None
}
Far from pretty, it does what I require.
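For comparison, a shorter sketch of the same single-line parse using Jena's RDFDataMgr (assuming a Jena version with the Model/InputStream/Lang overload of read, and the line1 value from above):

import java.io.ByteArrayInputStream
import org.apache.jena.rdf.model.ModelFactory
import org.apache.jena.riot.{Lang, RDFDataMgr}

// Parse one N-Triples line into a fresh Model and take its single statement.
val model = ModelFactory.createDefaultModel()
RDFDataMgr.read(model, new ByteArrayInputStream(line1.getBytes("UTF-8")), Lang.NTRIPLES)
val stmt = model.listStatements().next()
println(s"subject[${stmt.getSubject.getLocalName}] predicate[${stmt.getPredicate.getLocalName}] object[${stmt.getObject}]")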