Scala: bug with getLines?

I'm facing a problem with a very simple file-reading task in Scala, and I don't understand whether it comes from a bug or from a misunderstanding on my part.
It is even reproducible from a worksheet in the Scala/Eclipse IDE. I'm using Scala IDE 4.6.1 and Scala 2.12.2.
The code is very simple:
//********************************
import scala.io.Source
import java.io.File
import java.io.PrintWriter
object Embed {
  val filename = "proteins.csv"
  val handler = Source.fromFile(filename)

  val header: String = handler.getLines().next()
  println(">" + header)
  val header2: String = handler.getLines().next()
  println(">" + header2)
  val header3: String = handler.getLines().next()
  println(">" + header3)
}
//**********************
The first 3 lines of the file are a bit long and make no sense to non-biologists:
Protein Group,Protein ID,Accession,Significance,Coverage (%),#Peptides,#Unique,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity,Cond_D Intensity,Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity,Group 3 Intensity,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Description
261,247,P0AFG4|ODO1_ECOL6,200.00,39,30,30,Carbamidomethylation; Deamidation (NQ); Oxidation (M),1.7E5,9.87E4,5.51E4,3.09E4,3.09:1.79:1.00:0.56,1.7E5,9.87E4,5.51E4,3.09E4,3.09:1.79:1.00:0.56,105062,2-oxoglutarate dehydrogenase E1 component OS=Escherichia coli O6:H1 (strain CFT073 / ATCC 700928 / UPEC) GN=sucA PE=3 SV=1
287,657,B7NDL4|MDH_ECOLU,200.00,54,14,1,Carbamidomethylation; Deamidation (NQ); Oxidation (M),6.27E4,4.14E4,1.81E4,1.28E4,3.47:2.29:1.00:0.71,6.27E4,4.14E4,1.81E4,1.28E4,3.47:2.29:1.00:0.71,32336,Malate dehydrogenase OS=Escherichia coli O17:K52:H18 (strain UMN026 / ExPEC) GN=mdh PE=3 SV=1
I won't go into the details of this file, but it is a 3600-line file, each line containing 20 fields separated by commas and terminated by an end-of-line sequence. The first line is the header.
I also tried with each line-ending variant alone, with the same result:
The first line is read correctly, but the second read returns only the final part of the 8th line of the file, and so on, so I cannot read/parse my file.
The following is the result I get:
val filename = "proteins.csv"
//> filename : String = proteins.csv
val handler = Source.fromFile(filename) //> handler : scala.io.BufferedSource = non-empty iterator
val header:String = handler.getLines().next() //> header : String = Protein Group,Protein ID,Accession,Significance,Coverage
//| (%),#Peptides,#Unique,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity
//| ,Cond_D Intensity,Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity
//| ,Group 3 Intensity,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Descrip
//| tion
println (">"+header) //> >Protein Group,Protein ID,Accession,Significance,Coverage (%),#Peptides,#Uni
//| que,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity,Cond_D Intensity,
//| Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity,Group 3 Intensity
//| ,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Description
val header2:String = handler.getLines().next() //> header2 : String = TCC 700928 / UPEC) GN=fumA PE=3 SV=2
println (">"+header2) //> >TCC 700928 / UPEC) GN=fumA PE=3 SV=2
val header3:String = handler.getLines().next() //> header3 : String = n SE11) GN=zapB PE=3 SV=1
println (">"+header3) //> >n SE11) GN=zapB PE=3 SV=1
Any idea what I'm doing wrong?
Many thanks for helping.
No hurry: this is part of an attempt to pick up Scala, and for now I'll go back to Python to get the job done!

If I understand you correctly, the problem is that every time you call handler.getLines() you receive a new Iterator[String], and all of these iterators read from the same underlying source, so calling next() on different ones skips around in the file. You should try something like this:
val lineIterator = Source.fromFile("proteins.csv").getLines() // Get the iterator object
val firstLine = lineIterator.next()
val secondLine = lineIterator.next()
val thirdLine = lineIterator.next()
Or this:
val lines = Source.fromFile("proteins.csv").getLines().toIndexedSeq // Materialise the iterator into an indexed sequence of lines
val n = 2
val nLine = lines(n)
println(nLine)
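If you want to read everything up front and still release the file handle, a minimal sketch along these lines also works (Scala 2.12, so plain try/finally instead of scala.util.Using; the variable names are only illustrative):
import scala.io.Source

val source = Source.fromFile("proteins.csv")
val lines: Vector[String] =
  try source.getLines().toVector // read all lines once
  finally source.close()         // release the file handle

val header = lines.head   // the CSV header
val dataRows = lines.tail // the remaining data lines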

Your mistake is that you called handler.getLines() three times, i.e. a BufferedLineIterator is instantiated three times, and each instance calls next() against the same underlying source. Because they all consume from the same buffered reader, you get what looks like random output.
The correct way is to create a single iterator with handler.getLines() and call next() on that:
val linesIterator = handler.getLines()
val header:String = linesIterator.next()
println (">"+header)
val header2:String = linesIterator.next()
println (">"+header2)
val header3:String = linesIterator.next()
println (">"+header3)
More precisely, you don't even need to call next(); you can simply iterate over the lines:
for (line <- handler.getLines()) {
  println(">" + line)
}
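A small variation on the same idea (just a sketch, with illustrative names): take the header once from the single iterator, stream the remaining lines, and close the source when done.
val it = handler.getLines()             // one iterator, created once
val header = it.next()                  // first line = CSV header
it.foreach(line => println(">" + line)) // remaining data lines, in order
handler.close()                         // release the underlying file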

Related

How to set type of dataset when applying transformations and how to implement transformations without using spark.sql.functions._?

I am a beginner in Scala and have been working on the following problem:
An example dataset named given_dataset, with player number and points scored:
player_no | points
1 25.0
1 20.0
1 21.0
2 15.0
2 18.0
3 24.0
3 25.0
3 29.0
Problem 1:
I have a dataset and need to calculate the total points scored, average points per game, and number of games played. I am unable to explicitly set the data type to "double", "int", or "float" when I apply the transformations (perhaps because they are untyped transformations?). Would anyone be able to help with this and correct my error?
No data type specified (but the code runs):
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").orderBy("player_no")
Please note that I would like to retain the player number as I plan to merge total_points_dataset, games_played_dataset, and avg_points_dataset together.
Data type specified, but code crashes!
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[Double].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[Int].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[Double].orderBy("player_no")
Problem 2:
I would like to implement the above without using the library spark.sql.functions e.g. through functions such as map, groupByKey etc. If possible, could anyone provide an example for this and point me towards the right direction?
If you don't want to use import org.apache.spark.sql.types.{FloatType, IntegerType, StructType}, then you have to cast either at read time or with as[(Int, Double)] on the dataset. Below is an example that reads your dataset from a CSV file:
/** A function that splits a line of input into a (player_no, points) tuple. */
def parseLine(line: String): (Int, Float) = {
  // Split by commas
  val fields = line.split(",")
  // Extract the player_no and points fields, converting to Int and Float
  val player_no = fields(0).toInt
  val points = fields(1).toFloat
  // Create the tuple that is our result
  (player_no, points)
}
And then read as below:
val sc = new SparkContext("local[*]", "StackOverflow75354293")
val lines = sc.textFile("data/stackoverflowdata-noheader.csv")
val dataset = lines.map(parseLine)
val total_points_dataset2 = dataset.reduceByKey((x, y) => x + y)
val total_points_dataset2_sorted = total_points_dataset2.sortByKey(ascending = true)
total_points_dataset2_sorted.foreach(println)
val games_played_dataset2 = dataset.countByKey().toList.sorted
games_played_dataset2.foreach(println)
val avg_points_dataset2 = dataset
  .mapValues(x => (x, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
  .mapValues(x => x._1 / x._2)
  .sortByKey(ascending = true)
avg_points_dataset2.collect().foreach(println)
I tried running it both ways locally and both work fine; the output is below:
(3,78.0)
(1,66.0)
(2,33.0)
(1,3)
(2,2)
(3,3)
(1,22.0)
(2,16.5)
(3,26.0)
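If the typed Dataset API is acceptable for "Problem 2", groupByKey/mapGroups also avoids spark.sql.functions entirely. A rough sketch, assuming given_dataset is a DataFrame with an integer player_no column and a double points column, and a SparkSession named spark:
import spark.implicits._

val stats = given_dataset
  .groupByKey(row => row.getAs[Int]("player_no"))
  .mapGroups { (playerNo, rows) =>
    val points = rows.map(_.getAs[Double]("points")).toSeq // all points for this player
    (playerNo, points.sum, points.size, points.sum / points.size)
  } // (player_no, total_points, games_played, avg_points)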
Regarding "Problem 1" try
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[(Int, Double)].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[(Int, Long)].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[(Int, Double)].orderBy("player_no")
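Since the three aggregates are meant to be merged anyway, it may be simpler to compute them in one pass with agg. This still relies on spark.sql.functions, so it only addresses "Problem 1"; a sketch assuming the same given_dataset and that spark.implicits._ is in scope, as in the question:
import org.apache.spark.sql.functions.{avg, count, sum}

val summary = given_dataset
  .groupBy($"player_no")
  .agg(
    sum($"points").as("total_points"),   // total points per player
    count($"points").as("games_played"), // number of games per player
    avg($"points").as("avg_points"))     // average points per game
  .orderBy($"player_no")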

How do I extract each word from a text file in Scala

I'm pretty new to Scala. I have a text file that has only one line, with five words separated by semi-colons (;).
I want to extract each word, remove the white space, convert everything to lowercase, and access each word by its index. Below is how I approached it:
newListUpper2.txt contains (Bed; chairs;spoon; CARPET;curtains )
val file = sc.textFile("myfile.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";"))
result.collect.foreach(println)
Below is a copy of the REPL session when I executed the code:
scala> val file = sc.textFile("newListUpper2.txt")
file: org.apache.spark.rdd.RDD[String] = newListUpper2.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val lower = file.map(x=>x.toLowerCase)
lower: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at map at <console>:26
scala> val result = lower.flatMap(x=>x.trim.split(";"))
result: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at flatMap at <console>:28
scala> result.collect.foreach(println)
bed
chairs
spoon
carpet
curtains
scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters
result(0)
The results are not trimmed, and passing an index as a parameter to get the word at that position gives an error. My expected outcome, if I pass the index of each word as a parameter, is as stated below:
result(0)= bed
result(1) = chairs
result(2) = spoon
result(3) = carpet
result(4) = curtains
What am I doing wrong?
newListUpper2.txt contains (Bed; chairs;spoon; CARPET;curtains )
val file = sc.textFile("myfile.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";")) // x = "bed; chairs;spoon; carpet;curtains"; x.trim does not help here, because trim only removes leading and trailing whitespace from the whole line
result.collect.foreach(println)
Try this:
val result = lower.flatMap(x=>x.split(";").map(x=>x.trim))
1) Issue 1
scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters
result is an RDD and it can't be indexed like that. Instead you can inspect it with, for example, result.take(10).foreach(println).
2) Issue 2 - to get result(0) = bed, result(1) = chairs, and so on:
scala> var result = scala.io.Source.fromFile("/path/to/File").getLines().flatMap(x=>x.split(";").map(x=>x.trim)).toList
result: List[String] = List(Bed, chairs, spoon, CARPET, curtains)
scala> result(0)
res21: String = Bed
scala> result(1)
res22: String = chairs
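If you want to keep the Spark pipeline and still index individual words, one option (just a sketch, reusing the file name from the question) is to collect the RDD into a local Array, which does support indexing:
val words: Array[String] = sc.textFile("newListUpper2.txt")
  .flatMap(_.split(";").map(_.trim.toLowerCase)) // split, trim and lowercase each word
  .collect()                                     // bring the small result to the driver

println(words(0)) // bed
println(words(3)) // carpet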

counting lines in http logs, global line number not updated

I'm trying to use Scala/Spark to parse HTTP log files (488 files in one directory):
scala> val logs2 = sc.textFile("D:/temp/tests/wwwlogs")
logs2: org.apache.spark.rdd.RDD[String] = D:/temp/tests/wwwlogs MapPartitionsRDD[3] at textFile at <console>:24
scala> logs2.count
res1: Long = 230712
scala> logs2.filter(l => l.contains("92.50.64.234")).count()
res2: Long = 47
Then I manually edit one file and add the following line:
2017-12-31 03:48:32 ... GET /status full=true 80 - 92.50.64.234 Python-urllib/2.7 - 404 0 2 416
scala> logs2.filter(l => l.contains("92.50.64.234")).count()
res3: Long = 48
Great, but then I execute again:
scala> logs2.count
res4: Long = 230712
That is the same number of lines, whereas I expect 230712 + 1 since I added one line to a file.
Why is the filter result updated, but the global count not?
Is caching already enabled on that RDD? If so, the count may be served from the cached data, while the filter builds a new RDD that is not cached and therefore re-reads the files.
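One way to test that hypothesis (a sketch only): drop any cached copy and rebuild the RDD, so that both actions have to re-read the files from disk.
logs2.unpersist()                                // discard any cached partitions (a no-op if nothing was cached)
val logs3 = sc.textFile("D:/temp/tests/wwwlogs") // rebuild the RDD over the directory
println(logs3.count())                           // should now include the added line
println(logs3.filter(_.contains("92.50.64.234")).count())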

Filtering unique values from a text file

How do I find and filter unique values from a text file?
I tried the following, but it is not working:
val spark = SparkSession.builder().master("local").appName("distinct").getOrCreate()
var data = spark.sparkContext.textFile("text/file/opath")
val uniqueval = data.map { rec => (rec.split(",")(3).distinct) }
var fils = data.filter(line => line.split(",")(3).equals(uniqueval)).map(x => (x)).foreach { println }
Sample Data:
ID | Name
1 john
1 john
2 david
3 peter
4 steve
Required Output:
1 john
2 david
3 peter
4 steve
You almost have it right. .distinct() must just be called on the RDD.
I'd replace statement 3 with:
val uniqueval = data.distinct().map...
This assumes that similar records will have identical lines in the text file.
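If duplicate records only share a key column rather than being byte-for-byte identical, a hedged sketch along these lines (using column 0 as a hypothetical key, and assuming comma-separated lines as in your own split) keeps one line per key:
val deduped = data
  .map(line => (line.split(",")(0), line)) // key each line by its first field
  .reduceByKey((first, _) => first)        // keep exactly one line per key
  .values
deduped.foreach(println)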
Is core scala allowed?
scala> val text = List ("single" , "double", "mono", "double")
text: List[String] = List(single, double, mono, double)
scala> val u = text.distinct
u: List[String] = List(single, double, mono)
scala> val d = text.diff(u)
d: List[String] = List(double)
scala> val s = u.diff (d)
s: List[String] = List(single, mono)
Your code can be something like:
sparkContext.textFile("sample-data.txt").distinct()
.saveAsTextFile("sample-data-dist.txt");
The distinct method does what you want.

Want to parse a file and reformat it to create a pairRDD in Spark through Scala

I have a dataset in a file of the form:
1: 1664968
2: 3 747213 1664968 1691047 4095634 5535664
3: 9 77935 79583 84707 564578 594898 681805 681886 835470 880698
4: 145
5: 8 57544 58089 60048 65880 284186 313376
6: 8
I need to transform this into something like the below using Spark and Scala, as part of data preprocessing:
1 1664968
2 3
2 747213
2 1664968
2 4095634
2 5535664
3 9
3 77935
3 79583
3 84707
And so on....
Can anyone provide input on how this can be done?
The length of the original rows in the file varies, as shown in the dataset example above.
I am not sure how to go about this transformation.
I tried something like the code below, which gives me a pair of the key and the first element after the colon.
But I am not sure how to iterate over the entire dataset and generate the pairs as needed.
def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf().setAppName("Graphx").setMaster("local"))
  val rawLinks = sc.textFile("src/main/resources/links-simple-sorted-top100.txt")
  rawLinks.take(5).foreach(println)

  val formattedLinks = rawLinks.map { rows =>
    val fields = rows.split(":")
    val fromVertex = fields(0)
    val toVerticesArray = fields(1).split(" ")
    (fromVertex, toVerticesArray(1))
  }

  val topFive = formattedLinks.take(5)
  topFive.foreach(println)
}
val rdd = sc.parallelize(List("1: 1664968", "2: 3 747213 1664968 1691047 4095634 5535664"))
val keyValues = rdd.flatMap(line => {
  val Array(key, values) = line.split(":", 2)
  for (value <- values.trim.split("""\s+"""))
    yield (key, value.trim)
})
keyValues.collect
Split the row into 2 parts and map over the variable number of columns:
def transform(s: String): Array[String] = {
  val Array(head, tail) = s.split(":", 2)
  tail.trim.split("""\s+""").map(x => s"$head $x")
}
> transform("2: 3 747213 1664968 1691047 4095634 5535664")
// Array(2 3, 2 747213, 2 1664968, 2 1691047, 2 4095634, 2 5535664)
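To wire transform into a full job, a usage sketch (reusing the path from the question's own attempt):
val rawLinks = sc.textFile("src/main/resources/links-simple-sorted-top100.txt")
val pairLines = rawLinks.flatMap(transform) // one "key value" line per target vertex
pairLines.take(10).foreach(println)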