for loop into map method with Spark using Scala
Hi, I want to use a for loop inside a map method in Scala with Spark.
How can I do it?
For example, for each line read I want to generate a random word:
val rdd = file.map(line => (line, {
  val chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
  val word = new String;
  val res = new String;
  val rnd = new Random;
  val len = 4 + rnd.nextInt((6 - 4) + 1);
  for (i <- 1 to len) {
    val char = chars(rnd.nextInt(51));
    word.concat(char.toString);
  }
  word;
}))
My current output is:
Array[(String, String)] = Array((1,""), (2,""), (3,""), (4,""), (5,""), (6,""), (7,""), (8,""), (9,""), (10,""), (11,""), (12,""), (13,""), (14,""), (15,""), (16,""), (17,""), (18,""), (19,""), (20,""), (21,""), (22,""), (23,""), (24,""), (25,""), (26,""), (27,""), (28,""), (29,""), (30,""), (31,""), (32,""), (33,""), (34,""), (35,""), (36,""), (37,""), (38,""), (39,""), (40,""), (41,""), (42,""), (43,""), (44,""), (45,""), (46,""), (47,""), (48,""), (49,""), (50,""), (51,""), (52,""), (53,""), (54,""), (55,""), (56,""), (57,""), (58,""), (59,""), (60,""), (61,""), (62,""), (63,""), (64,""), (65,""), (66,""), (67,""), (68,""), (69,""), (70,""), (71,""), (72,""), (73,""), (74,""), (75,""), (76,""), (77,""), (78,""), (79,""), (80,""), (81,""), (82,""), (83,""), (84,""), (85,""), (86...
I don't know why the right side is empty.
There's no need for a var here. It's a one-liner:
Seq.fill(len)(chars(rnd.nextInt(51))).mkString
This creates a sequence of Char of length len by repeatedly calling chars(rnd.nextInt(51)), then turns it into a String.
Thus you'll get something like this:
import org.apache.spark.rdd.RDD
import scala.util.Random

val chars = ('a' to 'z') ++ ('A' to 'Z')

val rdd = file.map(line => {
  val randomWord = {
    val rnd = new Random
    val len = 4 + rnd.nextInt((6 - 4) + 1)
    Seq.fill(len)(chars(rnd.nextInt(chars.length))).mkString
  }
  (line, randomWord)
})
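As a quick sanity check (a sketch only; file is assumed to be an RDD[String], e.g. from sc.textFile), you could peek at a few of the generated pairs:

// Assumption: file is an RDD[String]; print a handful of (line, randomWord) pairs
rdd.take(5).foreach { case (line, word) => println(s"$line -> $word") }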
word.concat doesn't modify word; it returns a new String. You can make word a var and append new strings to it:
var word = new String
....
for {
  ...
  word += char
  ...
}
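For completeness, a minimal compiling version of that approach (a sketch, reusing the chars string, Random and length logic from the question) would be:

// Sketch of the mutable-variable approach; word is a var so += can reassign it
val rdd = file.map(line => (line, {
  val chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
  val rnd = new Random
  val len = 4 + rnd.nextInt((6 - 4) + 1)
  var word = ""
  for (i <- 1 to len) {
    word += chars(rnd.nextInt(chars.length))  // builds a new String and reassigns word
  }
  word
}))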
Related
How to create a DataFrame using a string consisting of key-value pairs?
I'm getting logs from a firewall in CEF format as a string, which looks like:

ABC|XYZ|F123|1.0|DSE|DSE|4|externalId=e705265d0d9e4d4fcb218b cn2=329160 cn1=3053998 dhost=SRV2019 duser=admin msg=Process accessed NTDS fname=ntdsutil.exe filePath=\\Device\\HarddiskVolume2\\Windows\\System32 cs5="C:\\Windows\\system32\\ntdsutil.exe" "ac i ntds" ifm "create full ntdstest3" q q fileHash=80c8b68240a95 dntdom=adminDomain cn3=13311 rt=1610948650000 tactic=Credential Access technique=Credential Dumping objective=Gain Access patternDisposition=Detection. outcome=0

How can I create a DataFrame from this kind of string, where I'm getting key-value pairs separated by = ? My objective is to infer the schema from this string using the keys dynamically, i.e. extract the keys from the left side of the = operator and create a schema using them.

What I have been doing currently is pretty lame (IMHO) and not very dynamic in approach (because the number of key-value pairs can change for different types of logs):

val a: String = "ABC|XYZ|F123|1.0|DSE|DCE|4|externalId=e705265d0d9e4d4fcb218b cn2=329160 cn1=3053998 dhost=SRV2019 duser=admin msg=Process accessed NTDS fname=ntdsutil.exe filePath=\\Device\\HarddiskVolume2\\Windows\\System32 cs5="C:\\Windows\\system32\\ntdsutil.exe" "ac i ntds" ifm "create full ntdstest3" q q fileHash=80c8b68240a95 dntdom=adminDomain cn3=13311 rt=1610948650000 tactic=Credential Access technique=Credential Dumping objective=Gain Access patternDisposition=Detection. outcome=0"
val ttype: String = "DCE"

type parseReturn = (String, String, List[String], Int)

def cefParser(a: String, ttype: String): parseReturn = {
  val firstPart = a.split("\\|")
  var pD = new ListBuffer[String]()
  var listSize: Int = 0
  if (firstPart.size == 8 && firstPart(4) == ttype) {
    pD += firstPart(0)
    pD += firstPart(1)
    pD += firstPart(2)
    pD += firstPart(3)
    pD += firstPart(4)
    pD += firstPart(5)
    pD += firstPart(6)
    val secondPart = parseSecondPart(firstPart(7), ttype)
    pD ++= secondPart
    listSize = pD.toList.length
    (firstPart(2), ttype, pD.toList, listSize)
  } else {
    val temp: List[String] = List(a)
    (firstPart(2), "IRRELEVANT", temp, temp.length)
  }
}

The method parseSecondPart is:

def parseSecondPart(m: String, ttype: String): ListBuffer[String] = ttype match {
  case auditActivity.ttype => parseAuditEvent(m)
}

Another function call to just replace some text in the logs:

def parseAuditEvent(msg: String): ListBuffer[String] = {
  val updated_msg = msg.replace("cat=", "metadata_event_type=")
    .replace("destinationtranslatedaddress=", "event_user_ip=")
    .replace("duser=", "event_user_id=")
    .replace("deviceprocessname=", "event_service_name=")
    .replace("cn3=", "metadata_offset=")
    .replace("outcome=", "event_success=")
    .replace("devicecustomdate1=", "event_utc_timestamp=")
    .replace("rt=", "metadata_event_creation_time=")
  parseEvent(updated_msg)
}

Final function to get only the values:

def parseEvent(msg: String): ListBuffer[String] = {
  val newMsg = msg.replace("\\=", "$_equal_$")
  val pD = new ListBuffer[String]()
  val splitData = newMsg.split("=")
  val mSize = splitData.size
  for (i <- 1 until mSize) {
    if (i < mSize - 1) {
      val a = splitData(i).split(" ")
      val b = a.size - 1
      val c = a.slice(0, b).mkString(" ")
      pD += c.replace("$_equal_$", "=")
    } else if (i == mSize - 1) {
      val a = splitData(i).replace("$_equal_$", "=")
      pD += a
    } else {
      logExceptions(newMsg)
    }
  }
  pD
}

The return value contains a ListBuffer[String] at the 3rd position, which I use to create a DataFrame as follows:

val df = ss.sqlContext
  .createDataFrame(tempRDD.filter(x => x._1 != "IRRELEVANT")
  .map(x => Row.fromSeq(x._3)), schema)

People of Stack Overflow, I really need your help in improving my code, both for performance and approach. Any kind of help and/or suggestions will be highly appreciated. Thanks in advance.
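One possible direction (only a sketch, not a tested solution; the regex and the single-row DataFrame construction are assumptions) is to extract the key=value tokens from the part after the last pipe and build the schema from the keys:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Take the message part after the 7th '|' of the CEF line from the question.
val body = a.split("\\|").last

// A value runs until the next "key=" token starts; quoted values containing
// '=' would need extra handling, so this is a simplification.
val kvPattern = """(\w+)=(.*?)(?=\s+\w+=|$)""".r
val pairs = kvPattern.findAllMatchIn(body).map(m => (m.group(1), m.group(2))).toSeq

// The keys become column names, so the schema adapts to whatever pairs appear.
val schema = StructType(pairs.map { case (k, _) => StructField(k, StringType) })
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row.fromSeq(pairs.map(_._2)))),
  schema
)
df.show(false)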
Join two strings in Scala with one to one mapping
I have two strings in Scala.

Input 1: "a,c,e,g,i,k"
Input 2: "b,d,f,h,j,l"

How do I join the two strings in Scala?

Required output = "ab,cd,ef,gh,ij,kl"

I tried something like:

var columnNameSetOne: Array[String] = Array(); // v1 = "a,c,e,g,i,k"
var columnNameSetTwo: Array[String] = Array(); // v2 = "b,d,f,h,j,l"

After I get the input data as mentioned above:

columnNameSetOne = v1.split(",")
columnNameSetTwo = v2.split(",");

val newColumnSet = IntStream.range(0, Math.min(columnNameSetOne.length, columnNameSetTwo.length)).mapToObj(j => (columnNameSetOne(j) + columnNameSetTwo(j))).collect(Collectors.joining(","));

println(newColumnSet)

But I am getting an error on j. Also, I am not sure if this would work!
object Solution1 extends App {
  val input1 = "a,c,e,g,i,k"
  val input2 = "b,d,f,h,j,l"

  val i1 = input1.split(",")
  val i2 = input2.split(",")

  val x = i1.zipAll(i2, "", "").map { case (a, b) => a + b }

  println(x.mkString(","))
}
// output: ab,cd,ef,gh,ij,kl
Easy to do using the zip function on lists.

val v1 = "a,c,e,g,i,k"
val v2 = "b,d,f,h,j,l"
val list1 = v1.split(",").toList
val list2 = v2.split(",").toList

list1.zip(list2).mkString(",")
// res0: String = (a,b),(c,d),(e,f),(g,h),(i,j),(k,l)
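Note that this result still has the tuple parentheses rather than the required "ab,cd,ef,gh,ij,kl"; building on the same list1 and list2, one extra map concatenates each pair:

// Concatenate each zipped pair to get the required output
list1.zip(list2).map { case (a, b) => a + b }.mkString(",")
// res1: String = ab,cd,ef,gh,ij,kl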
Looping through Map Spark Scala
Within this code we have two files: athletes.csv, which contains names, and twitter.test, which contains the tweet messages. We want to find, for every single line in twitter.test, the name that matches a name in athletes.csv. We applied a map function to store the names from athletes.csv and want to iterate over all of the names for every line in the test file.

object twitterAthlete {

  def loadAthleteNames(): Map[String, String] = {
    // Handle character encoding issues:
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    // Create a Map of Ints to Strings, and populate it from u.item.
    var athleteInfo: Map[String, String] = Map()
    //var movieNames: Map[Int, String] = Map()
    val lines = Source.fromFile("../athletes.csv").getLines()
    for (line <- lines) {
      var fields = line.split(',')
      if (fields.length > 1) {
        athleteInfo += (fields(1) -> fields(7))
      }
    }
    return athleteInfo
  }

  def parseLine(line: String): (String) = {
    var athleteInfo = loadAthleteNames()
    var hello = new String
    for ((k, v) <- athleteInfo) {
      if (line.toString().contains(k)) {
        hello = k
      }
    }
    return (hello)
  }

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "twitterAthlete")
    val lines = sc.textFile("../twitter.test")
    var athleteInfo = loadAthleteNames()
    val splitting = lines.map(x => x.split(";")).map(x => if (x.length == 4 && x(2).length <= 140) x(2))
    var hello = new String()
    val container = splitting.map(x => for ((key, value) <- athleteInfo) if (x.toString().contains(key)) { key }).cache
    container.collect().foreach(println)
    // val mapping = container.map(x => (x,1)).reduceByKey(_+_)
    //mapping.collect().foreach(println)
  }
}

The first file looks like:

id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...

The second file looks like:

time, id , tweet .....
12:00, 03043, some message that contains some athletes names , .....
02:00, 03023, some message that contains some athletes names , .....

something like this ... but I got an empty result after running this code; any suggestions are much appreciated. The result I got is empty:

()....
()...
()...

but the result that I expected is something like:

(name,1)
(other name,1)
You need to use yield so the for comprehension returns a value to your map:

val container = splitting.map(x => for ((key, value) <- athleteInfo; if x.toString().contains(key)) yield (key, 1)).cache
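Since each element of container is then a collection of (key, 1) pairs, the expected (name, count) output can be produced with flatMap and reduceByKey (a sketch reusing the names from the question):

// Flatten the per-tweet collections of (name, 1) pairs, then count per name
val counts = splitting
  .flatMap(x => for ((key, value) <- athleteInfo; if x.toString().contains(key)) yield (key, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println)  // prints (name, count) pairs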
I think you should just start with the simplest option first...

I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc. Then you can use the built-in Tokenizer to split the tweets into words, explode them, and do a simple join. Depending on how big/small the data with athlete names is, you'll end up with a more optimized broadcast join and avoid a shuffle.

import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer

val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)

val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")

val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
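If the athletes DataFrame is small enough, the broadcast join mentioned above can also be requested explicitly; the join line could become (same DataFrames as above, functions._ is already imported):

// Explicitly mark the small athletes DataFrame for broadcasting
val withAthlete = exploded.join(broadcast(athletes), 'word === 'name)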
Task not serializable in scala
In my application, I'm using the parallelize method to save an Array into a file. The code is as follows:

val sourceRDD = sc.textFile(inputPath + "/source")
val destinationRDD = sc.textFile(inputPath + "/destination")

val source_primary_key = sourceRDD.map(rec => (rec.split(",")(0).toInt, rec))
val destination_primary_key = destinationRDD.map(rec => (rec.split(",")(0).toInt, rec))

val extra_in_source = source_primary_key.subtractByKey(destination_primary_key)
val extra_in_destination = destination_primary_key.subtractByKey(source_primary_key)

val source_subtract = source_primary_key.subtract(destination_primary_key)
val Destination_subtract = destination_primary_key.subtract(source_primary_key)

val exact_bestmatch_src = source_subtract.subtractByKey(extra_in_source).sortByKey(true).map(rec => (rec._2))
val exact_bestmatch_Dest = Destination_subtract.subtractByKey(extra_in_destination).sortByKey(true).map(rec => (rec._2))

val exact_bestmatch_src_p = exact_bestmatch_src.map(rec => (rec.split(",")(0).toInt))
val primary_key_distinct = exact_bestmatch_src_p.distinct.toArray()

for (i <- primary_key_distinct) {
  var dummyVar: String = ""
  val src = exact_bestmatch_src.filter(line => line.split(",")(0).toInt.equals(i))
  var dest = exact_bestmatch_Dest.filter(line => line.split(",")(0).toInt.equals(i)).toArray

  for (print1 <- src) {
    var sourceArr: Array[String] = print1.split(",")
    var exactbestMatchCounter: Int = 0
    var index: Array[Int] = new Array[Int](1)
    println(print1 + "source")

    for (print2 <- dest) {
      var bestMatchCounter = 0
      var i: Int = 0
      println(print1 + "source + destination" + print2)
      for (i <- 0 until sourceArr.length) {
        if (print1.split(",")(i).equals(print2.split(",")(i))) {
          bestMatchCounter += 1
        }
      }
      if (exactbestMatchCounter < bestMatchCounter) {
        exactbestMatchCounter = bestMatchCounter
        dummyVar = print2
        index +:= exactbestMatchCounter //9,8,9
      }
    }

    var z = index.zipWithIndex.maxBy(_._1)._2

    if (exactbestMatchCounter >= 0) {
      var samparr: Array[String] = new Array[String](4)
      samparr +:= print1 + " BEST_MATCH " + dummyVar
      var deletedest: Array[String] = new Array[String](1)
      deletedest = dest.take(z) ++ dest.drop(1)
      dest = deletedest

      val myFile = sc.parallelize((samparr)).saveAsTextFile(outputPath)

I have used the parallelize method, and I even tried the method below to save it as a file:

val myFile = sc.textFile(samparr.toString())
val finalRdd = myFile
finalRdd.coalesce(1).saveAsTextFile(outputPath)

but it keeps throwing the error:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
You can't treat an RDD like a local collection. All operations against it happen over a distributed cluster, and every function you run on that RDD must be serializable.

The line

for (print1 <- src) {

iterates over the RDD src, so everything inside the loop must be serializable, as it will be run on the executors. Inside it, however, you try to call sc.parallelize while still inside that loop. SparkContext is not serializable. Working with RDDs and the SparkContext are things you do on the driver and cannot do within an RDD operation.

I'm not entirely sure what you are trying to accomplish, but it looks like some sort of hand-coded join operation between the source and the destination. You can't work with loops over RDDs like you can with local collections. Make use of the APIs map, join, groupBy, and the like to create your final RDD, then save that.

If you absolutely feel you must use a foreach loop over the RDD like this, then you can't use sc.parallelize().saveAsTextFile(). Instead, open an output stream using the Hadoop file API and write your array to the file manually.
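For that last suggestion, a minimal driver-side sketch with the Hadoop FileSystem API could look like this (samparr and outputPath come from the question; the file name is an assumption):

import org.apache.hadoop.fs.{FileSystem, Path}

// Write a local Array[String] to a single file from the driver,
// instead of calling sc.parallelize(...).saveAsTextFile inside a loop
val fs = FileSystem.get(sc.hadoopConfiguration)
val out = fs.create(new Path(outputPath + "/best_matches.txt"))
out.write(samparr.mkString("\n").getBytes("UTF-8"))
out.close()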
Finally, this piece of code helped me save an array to a file:

import java.io.PrintWriter

new PrintWriter(outputPath) { write(array.mkString(" ")); close }
Scala Appending to an empty Array
I am trying to append to an array, but for some reason it is just appending blanks into my array.

def schemaClean(x: Array[String]): Array[String] = {
  val array = Array[String]()
  for (i <- 0 until x.length) {
    val convert = x(i).toString
    val split = convert.split('|')
    if (split.length == 5) {
      val drop = split.dropRight(3).mkString(" ")
      array :+ drop
    } else if (split.length == 4) {
      val drop = split.dropRight(2).mkString(" ")
      println(drop)
      array :+ drop.toString
      println(array.mkString(" "))
    }
  }
  array
}

val schema1 = schemaClean(schema)

prints this:

record_id string
assigned_offer_id string
accepted_offer_flag string
current_offer_flag string

If I try to print schema1, it's just 1 blank line.
Scala's Array size is immutable. From Scala's reference:

def :+(elem: A): Array[A]
[use case] A copy of this array with an element appended.

Thus :+ returns a new array whose reference you are not using.

val array = ...

should be:

var array = ...

and you should update that reference with the new array obtained after each append operation.

Since there are no variable-size arrays in Scala, the alternative to an Array var reassigned after each insertion is ArrayBuffer: use its += operator to append new elements and obtain the resulting array from the buffer, e.g.:

import scala.collection.mutable.ArrayBuffer

val ab = ArrayBuffer[String]()
ab += "hello"
ab += "world"
ab.toArray
// res2: Array[String] = Array(hello, world)

Applied to your code:

def schemaClean(x: Array[String]): Array[String] = {
  val arrayBuf = ArrayBuffer[String]()
  for (i <- 0 until x.length) {
    val convert = x(i).toString
    val split = convert.split('|')
    if (split.length == 5) {
      val drop = split.dropRight(3).mkString(" ")
      arrayBuf += drop
    } else if (split.length == 4) {
      val drop = split.dropRight(2).mkString(" ")
      println(drop)
      arrayBuf += drop.toString
      println(arrayBuf.toArray.mkString(" "))
    }
  }
  arrayBuf.toArray
}
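For a quick check, calling the rewritten function on a couple of pipe-delimited lines (hypothetical sample input, guessed from the printed output) behaves as expected:

// Hypothetical sample input; the real schema lines may have different trailing fields
val schema = Array("record_id|string|a|b|c", "assigned_offer_id|string|a|b")
val schema1 = schemaClean(schema)
schema1.foreach(println)
// record_id string
// assigned_offer_id string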