In the input below, I have to replace the last comma (,) between two colons (:) with ",".
println(input)
//[level:1,File:one,three,Flag:NA][level:1,File:two,Flag:NA]
println(input.replace(",", "\",\""))
I am getting this result:
//[level:1","File:one","three","Flag:NA][level:1","File:two","Flag:NA]
The expected result is:
[level:1","File:one,three","Flag:NA][level:1","File:two","Flag:NA]
Kindly help me.
val str1 = "[level:1,File:one,three,Flag:NA][level:1,File:two,Flag:NA]"
val regex1 = raw"(,)(\w+:)".r
val matches = regex1.findAllMatchIn(str1)
val str2 = matches.foldLeft(str1)({ case (str, m) =>
str.replaceFirst(m.group(0), "\",\"" + m.group(2))
})
// str2: String = [level:1","File:one,three","Flag:NA][level:1","File:two","Flag:NA]
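As a possible shorter variant of the answer above, a lookahead lets a single replaceAll do the work: it matches only a comma that is directly followed by a key and a colon. This is a sketch under that assumption about the input format; the object name QuoteCommas is made up for illustration.

```scala
object QuoteCommas {
  val input = "[level:1,File:one,three,Flag:NA][level:1,File:two,Flag:NA]"

  // ",(?=\\w+:)" matches a comma only when a word and a colon follow it,
  // so the comma inside "one,three" is left untouched.
  val result: String = input.replaceAll(",(?=\\w+:)", "\",\"")
}
```

Because the lookahead consumes nothing, the key itself does not have to be re-inserted in the replacement.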
This code works with two files: athletes.csv, which contains athlete names, and twitter.test, which contains the tweet messages. For every line in twitter.test we want to find the names from athletes.csv that appear in it. We load the names from athletes.csv into a Map and want to check every name against every line of the tweet file.
object twitterAthlete {
def loadAthleteNames() : Map[String, String] = {
// Handle character encoding issues:
implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
// Create a Map of Ints to Strings, and populate it from u.item.
var athleteInfo:Map[String, String] = Map()
//var movieNames:Map[Int, String] = Map()
val lines = Source.fromFile("../athletes.csv").getLines()
for (line <- lines) {
var fields = line.split(',')
if (fields.length > 1) {
athleteInfo += (fields(1) -> fields(7))
}
}
return athleteInfo
}
def parseLine(line:String): (String)= {
var athleteInfo = loadAthleteNames()
var hello = new String
for((k,v) <- athleteInfo){
if(line.toString().contains(k)){
hello = k
}
}
return (hello)
}
def main(args: Array[String]){
Logger.getLogger("org").setLevel(Level.ERROR)
val sc = new SparkContext("local[*]", "twitterAthlete")
val lines = sc.textFile("../twitter.test")
var athleteInfo = loadAthleteNames()
val splitting = lines.map(x => x.split(";")).map(x => if(x.length == 4 && x(2).length <= 140)x(2))
var hello = new String()
val container = splitting.map(x => for((key,value) <- athleteInfo)if(x.toString().contains(key)){key}).cache
container.collect().foreach(println)
// val mapping = container.map(x => (x,1)).reduceByKey(_+_)
//mapping.collect().foreach(println)
}
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
Something like this ...
But I get an empty result after running this code. Any suggestions are much appreciated.
The result I get is empty:
()....
()...
()...
But I expected a result like:
(name,1)
(other name,1)
You need to use yield so that the for expression returns a value to your map. Without yield, a for loop evaluates to Unit, which is why you see ():
val container = splitting.map(x => for((key,value) <- athleteInfo ; if(x.toString().contains(key)) ) yield (key, 1)).cache
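To see the difference without Spark, here is a minimal sketch on plain Scala collections. The sample map and tweet are invented for illustration; only the for / for-yield contrast is the point.

```scala
object YieldDemo {
  // Hypothetical sample data standing in for athletes.csv and one tweet.
  val athleteInfo: Map[String, String] = Map("Michael" -> "USA", "Martin" -> "GBR")
  val tweet = "Michael wins gold"

  // A for loop without yield runs its body for side effects only and evaluates to Unit.
  val withoutYield: Unit =
    for ((key, _) <- athleteInfo) if (tweet.contains(key)) key

  // With a guard and yield, the for expression builds a collection of the matching keys.
  val withYield: List[(String, Int)] =
    (for ((key, _) <- athleteInfo.toList if tweet.contains(key)) yield (key, 1))
}
```

On an RDD, mapping each tweet this way gives one collection per tweet, so flatMap instead of map would be the natural choice if flat (name, 1) pairs are wanted for reduceByKey.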
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how large the athlete-names data is, you may end up with a more optimized broadcast join and avoid a shuffle.
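import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
// Needed for the 'column symbol syntax when not running in spark-shell
// (assumes a SparkSession named spark is in scope).
import spark.implicits._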
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
I came across the following example from the book "Fast Processing with Spark" by Holden Karau. I did not understand what the following lines of code do in the program:
val splitLines = inFile.map(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
The program is :
package pandaspark.examples
import spark.SparkContext
import spark.SparkContext._
import spark.SparkFiles;
import au.com.bytecode.opencsv.CSVReader
import java.io.StringReader
object LoadCsvExample {
def main(args: Array[String]) {
if (args.length != 2) {
System.err.println("Usage: LoadCsvExample <master> <inputfile>")
System.exit(1)
}
val master = args(0)
val inputFile = args(1)
val sc = new SparkContext(master, "Load CSV Example",
System.getenv("SPARK_HOME"),
Seq(System.getenv("JARS")))
sc.addFile(inputFile)
val inFile = sc.textFile(inputFile)
val splitLines = inFile.map(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
println(summedData.collect().mkString(","))
}
}
I broadly understand what the above program does: it parses the input CSV and sums each row. What I am unable to understand is how exactly those 3 lines of code achieve that.
Also, could anyone explain how the output would change if map is replaced with flatMap in those lines? Like:
val splitLines = inFile.flatMap(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
val numericData = splitLines.flatMap(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
val splitLines = inFile.map(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
This code basically reads the data in a CSV file and sums the values in each row.
Suppose your CSV file looks something like this -
10,12,13
1,2,3,4
1,2
Here we fetch the data from the CSV file into inFile like this -
val inFile = sc.textFile("your CSV file path")
Here inFile is an RDD of text lines.
When you apply collect on it, it looks like this -
Array[String] = Array("10,12,13", "1,2,3,4", "1,2")
When you apply map over it, the function receives one line at a time -
line = 10,12,13
line = 1,2,3,4
line = 1,2
To parse each line as CSV, it uses -
val reader = new CSVReader(new StringReader(line))
reader.readNext()
After parsing each line as CSV, splitLines looks like this (the elements are still Strings at this point) -
Array(
Array("10", "12", "13"),
Array("1", "2", "3", "4"),
Array("1", "2")
)
On splitLines, it applies
splitLines.map(line => line.map(_.toDouble))
Here line is one array, such as Array("10", "12", "13"), and on it the code calls
line.map(_.toDouble)
which converts every element from String to Double.
So numericData has the same structure,
Array(Array(10.0, 12.0, 13.0), Array(1.0, 2.0, 3.0, 4.0), Array(1.0, 2.0))
but all elements are now Doubles.
Finally it sums each individual row (array), so the answer is something like -
Array(35.0, 10.0, 3.0)
You will get it when you call summedData.collect().
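The same three steps can be tried without Spark on an ordinary List. This is only a sketch: it swaps CSVReader for a plain split(","), which is an assumption that holds only for simple, unquoted CSV like the sample above.

```scala
object CsvRowSums {
  // Stand-in for sc.textFile(...): one String per line of the file.
  val inFile: List[String] = List("10,12,13", "1,2,3,4", "1,2")

  // Step 1: each line becomes an array of String fields
  // (split is used here in place of CSVReader.readNext).
  val splitLines: List[Array[String]] = inFile.map(line => line.split(","))

  // Step 2: the inner map converts every field of every row to Double.
  val numericData: List[Array[Double]] = splitLines.map(line => line.map(_.toDouble))

  // Step 3: each row collapses to the sum of its values.
  val summedData: List[Double] = numericData.map(row => row.sum)
}
```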
First of all, there is no flatMap operation in your code sample, so the title is misleading. But in general, map called on a collection returns a new collection with the function applied to each element of the original.
Going line by line through your code snippet:
val splitLines = inFile.map(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
The type of inFile is RDD[String]. You take every such string, create a CSV reader from it and call readNext (which returns an array of strings). So in the end you get RDD[Array[String]].
val numericData = splitLines.map(line => line.map(_.toDouble))
This line is a bit trickier, with two nested map operations. Again, you take each element of the RDD (which is now an Array[String]) and apply the _.toDouble function to every element of that array. In the end you get RDD[Array[Double]].
val summedData = numericData.map(row => row.sum)
You take each element of the RDD and apply sum to it. Since every element is an Array[Double], sum produces a single Double value. In the end you get RDD[Double].
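On the flatMap part of the question: flatMap applies the function and then flattens the results, so the row boundaries are lost. A small sketch on a plain List (again using split(",") as a stand-in for CSVReader, which is an assumption) shows the change in shape:

```scala
object MapVsFlatMap {
  val lines: List[String] = List("10,12,13", "1,2")

  // map keeps one result per input line: a list of rows.
  val mapped: List[Array[String]] = lines.map(_.split(","))

  // flatMap concatenates the rows into a single flat list of fields.
  val flattened: List[String] = lines.flatMap(_.split(","))
}
```

With the flat version there are no rows left to sum, so the later row => row.sum step from the question no longer has arrays to work on.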
Hi, I want to use a "for" loop inside a map method in Scala.
How can I do it?
For example, here I want to generate a random word for each line read:
val rdd = file.map(line => (line,{
val chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
val word = new String;
val res = new String;
val rnd = new Random;
val len = 4 + rnd.nextInt((6-4)+1);
for(i <- 1 to len){
val char = chars(rnd.nextInt(51));
word.concat(char.toString);
}
word;
}))
My current output is :
Array[(String, String)] = Array((1,""), (2,""), (3,""), (4,""), (5,""), (6,""), (7,""), (8,""), (9,""), (10,""), (11,""), (12,""), (13,""), (14,""), (15,""), (16,""), (17,""), (18,""), (19,""), (20,""), (21,""), (22,""), (23,""), (24,""), (25,""), (26,""), (27,""), (28,""), (29,""), (30,""), (31,""), (32,""), (33,""), (34,""), (35,""), (36,""), (37,""), (38,""), (39,""), (40,""), (41,""), (42,""), (43,""), (44,""), (45,""), (46,""), (47,""), (48,""), (49,""), (50,""), (51,""), (52,""), (53,""), (54,""), (55,""), (56,""), (57,""), (58,""), (59,""), (60,""), (61,""), (62,""), (63,""), (64,""), (65,""), (66,""), (67,""), (68,""), (69,""), (70,""), (71,""), (72,""), (73,""), (74,""), (75,""), (76,""), (77,""), (78,""), (79,""), (80,""), (81,""), (82,""), (83,""), (84,""), (85,""), (86...
I don't know why the right side is empty.
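There's no need for a var here. It's a one-liner:
Seq.fill(len)(chars(rnd.nextInt(chars.length))).mkString
This creates a sequence of Char of length len by repeatedly calling chars(rnd.nextInt(chars.length)), then turns it into a String. Note that nextInt(n) excludes n, so the bound should be chars.length (52 here) rather than 51, otherwise the last character is never picked.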
Thus you'll get something like this :
import org.apache.spark.rdd.RDD
import scala.util.Random
val chars = ('a' to 'z') ++ ('A' to 'Z')
val rdd = file.map(line => {
val randomWord = {
val rnd = new Random
val len = 4 + rnd.nextInt((6 - 4) + 1)
Seq.fill(len)(chars(rnd.nextInt(chars.length))).mkString // nextInt's bound is exclusive
}
(line, randomWord)
})
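For trying this outside Spark, the same idea can be written against an ordinary Seq. This is a sketch only; the names RandomWord, randomWord and tag are invented for illustration, and the 4 to 6 length range is taken from the question.

```scala
import scala.util.Random

object RandomWord {
  private val chars: IndexedSeq[Char] = ('a' to 'z') ++ ('A' to 'Z')

  // Builds a word of minLen..maxLen random letters.
  // nextInt(n) returns 0 until n, so chars.length covers every letter.
  def randomWord(rnd: Random, minLen: Int = 4, maxLen: Int = 6): String = {
    val len = minLen + rnd.nextInt(maxLen - minLen + 1)
    Seq.fill(len)(chars(rnd.nextInt(chars.length))).mkString
  }

  // Pairs every line with a freshly generated word, like the map in the question.
  def tag(lines: Seq[String], rnd: Random = new Random): Seq[(String, String)] =
    lines.map(line => (line, randomWord(rnd)))
}
```

Passing the Random in makes the function easy to seed when a repeatable result is wanted.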
word.concat doesn't modify word; it returns a new String. You can make word a var and append each new character to it:
var word = new String
....
for {
...
word += char
...
}
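A tiny self-contained sketch of that point, with made-up names (ConcatDemo, build), in case it helps to see it run:

```scala
object ConcatDemo {
  val word = ""
  // concat returns a new String; the original value is left as it was.
  val concatenated: String = word.concat("a")

  // Accumulating into a var keeps each new String that += produces.
  def build(chars: Seq[Char]): String = {
    var acc = ""
    for (c <- chars) acc += c
    acc
  }
}
```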
I was hoping somebody could help. I'm new to Scala and I'm having some issues writing my output to a text file.
I have a data table and I've written some code to read it in one line at a time, do what I want it to do, and now I need it to write that line to a text file.
So for example, I have the following table of data type
Name, Date, goX, goY, stopX, stopY
1, 12/01/01, 1166, 2299, 3300, 4477
My code takes the first characters of goX and goY and creates a new number, in this instance 1.2, and does the same for stopX and stopY, so in this case you get 3.4.
What I want to get in the text file is essentially the following:
go, stop
1.2, 3.4
and I want it to go through hundreds of lines doing this until I have a long list of on and off in the text file.
My current code is as follows. This is almost certainly not the most elegant solution, but it is my first ever Scala/Java code:
import scala.io.Source
object FT2 extends App {
for(line<-Source.fromFile("C://Users//Data.csv").getLines){
var array = line.split(",")
val gox = (array(2));
val xStringGo = gox.toString
val goX =xStringGo.dropRight(1|2)
val goy = (array(3));
val yStringGo = goy.toString
val goY = yStringGo.dropRight(1|2)
val goXY = goX+"."+goY
val stopx = (array(4));
val xStringStop = stopx.toString
val stopX =xStringStop.dropRight(1|2)
val stopy = (array(5)); // stopY is the sixth column (index 5), not index 3
val yStringStop = stopy.toString
val stopY = yStringStop.dropRight(1|2)
val stopXY = stopX+"."+stopY
val GoStop = List(goXY,stopXY)
//This is where I want to print GoStop to a text file
}
}
Any help is much appreciated!
This should do it:
import java.io._
val data = List("everything", "you", "want", "to", "write", "to", "the", "file")
val file = "whatever.txt"
val writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))
for (x <- data) {
writer.write(x + "\n") // however you want to format it
}
writer.close()
But you can make it a little nicer by creating a method that will automatically close stuff for you:
def using[T <: Closeable, R](resource: T)(block: T => R): R = {
try { block(resource) }
finally { resource.close() }
}
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))) {
writer =>
for (x <- data) {
writer.write(x + "\n") // however you want to format it
}
}
So:
using(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("output.txt")))) {
writer =>
for(line <- io.Source.fromFile("input.txt").getLines) {
writer.write(line + "\n") // however you want to format it
}
}
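Applied to the data from the question, it could look roughly like the sketch below. Several things here are assumptions rather than anything from the original code: the names GoStopWriter, firstDigits, toGoStop and convert are invented, the first line of the input is treated as a header, and "first characters" is read as the first digit of each trimmed field.

```scala
import java.io.{BufferedWriter, FileWriter}
import scala.io.Source

object GoStopWriter {
  // First character of each coordinate joined with a dot: ("1166", "2299") gives "1.2".
  def firstDigits(x: String, y: String): String =
    s"${x.trim.take(1)}.${y.trim.take(1)}"

  // Turns one CSV row (Name, Date, goX, goY, stopX, stopY) into a "go, stop" line.
  def toGoStop(line: String): String = {
    val f = line.split(",")
    s"${firstDigits(f(2), f(3))}, ${firstDigits(f(4), f(5))}"
  }

  // Reads inputPath, skips the header row, and writes one "go, stop" line per data row.
  def convert(inputPath: String, outputPath: String): Unit = {
    val source = Source.fromFile(inputPath)
    val writer = new BufferedWriter(new FileWriter(outputPath))
    try {
      writer.write("go, stop\n")
      source.getLines().drop(1).foreach(line => writer.write(toGoStop(line) + "\n"))
    } finally {
      writer.close()
      source.close()
    }
  }
}
```

Opening the writer once, outside the loop, matters: creating it inside the for would overwrite the file on every line.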