Counting lines in HTTP logs: global line count not updated - Scala

I'm trying to use Scala/Spark to parse HTTP log files (488 files in one directory):
scala> val logs2 = sc.textFile("D:/temp/tests/wwwlogs")
logs2: org.apache.spark.rdd.RDD[String] = D:/temp/tests/wwwlogs MapPartitionsRDD[3] at textFile at <console>:24
scala> logs2.count
res1: Long = 230712
scala> logs2.filter(l => l.contains("92.50.64.234")).count()
res2: Long = 47
Then I manually edit one file and add the following line:
2017-12-31 03:48:32 ... GET /status full=true 80 - 92.50.64.234 Python-urllib/2.7 - 404 0 2 416
scala> logs2.filter(l => l.contains("92.50.64.234")).count()
res3: Long = 48
Great, but then I execute again:
scala> logs2.count
res4: Long = 230712
That is the same number of lines, whereas I expect 230712 + 1 since I added one line to a file.
Why is the filter result updated, but the global count is not?

Is the RDD already cached? The count may be served from the cache, while the filter produces a new RDD, to which the cache may not apply.
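Either way, a minimal check (a sketch, not something confirmed in the question; logs3 is just a new name used here) is to re-create the RDD, which discards any cached state and recomputes the input splits from the files' current sizes:
val logs3 = sc.textFile("D:/temp/tests/wwwlogs")   // fresh RDD, input splits recomputed
logs3.count()                                      // expected to include the added line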

Related

How do I extract each word from a text file in Scala

I'm pretty new to Scala. I have a text file that has only one line, with five words separated by a semicolon (;).
I want to extract each word, remove the whitespace, convert everything to lowercase, and access each word by its index. Below is how I approached it:
newListUpper2.txt contains (Bed; chairs;spoon; CARPET;curtains )
val file = sc.textFile("myfile.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";"))
result.collect.foreach(println)
Below is a copy of the REPL session when I executed the code:
scala> val file = sc.textFile("newListUpper2.txt")
file: org.apache.spark.rdd.RDD[String] = newListUpper2.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val lower = file.map(x=>x.toLowerCase)
lower: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at map at <console>:26
scala> val result = lower.flatMap(x=>x.trim.split(";"))
result: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at flatMap at <console>:28
scala> result.collect.foreach(println)
bed
chairs
spoon
carpet
curtains
scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters
result(0)
The results are not trimmed, and passing an index as a parameter to get the word at that index gives an error. My expected outcome, if I pass the index of each word as a parameter, is as stated below:
result(0)= bed
result(1) = chairs
result(2) = spoon
result(3) = carpet
result(4) = curtains
What am I doing wrong?
newListUpper2.txt contains (Bed; chairs;spoon; CARPET;curtains )
val file = sc.textFile("myfile.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";")) // x is the whole line "bed; chairs;spoon; carpet;curtains"; x.trim only removes whitespace at the head and tail of the line, not around each word
result.collect.foreach(println)
Try it:
val result = lower.flatMap(x=>x.split(";").map(x=>x.trim))
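To see why the per-token trim matters, here is a small sketch on the single input line from the question (line is just an illustrative local value):
val line = "bed; chairs;spoon; carpet;curtains"
line.trim.split(";")          // Array(bed, " chairs", spoon, " carpet", curtains) - inner spaces survive
line.split(";").map(_.trim)   // Array(bed, chairs, spoon, carpet, curtains)   - every token is clean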
1) Issue 1
scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters
result is an RDD, and it cannot be applied to an index like that. To inspect a few elements you can use result.take(10).foreach(println) instead.
2) Issue 2 - To get indexed access such as result(0) = bed, result(1) = chairs, and so on:
scala> var result = scala.io.Source.fromFile("/path/to/File").getLines().flatMap(x=>x.split(";").map(x=>x.trim)).toList
result: List[String] = List(Bed, chairs, spoon, CARPET, curtains)
scala> result(0)
res21: String = Bed
scala> result(1)
res22: String = chairs
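If the data should stay in Spark rather than go through scala.io.Source, a hedged alternative is to collect the (small) RDD into an Array on the driver, since an Array does support indexed access (words is a name introduced here):
val words: Array[String] = lower.flatMap(_.split(";").map(_.trim)).collect()
words(0)   // bed
words(1)   // chairs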

Why does Spark increment the RDD ID by 2 instead of 1 when reading in text files?

I noticed something interesting when working with the spark-shell and I'm curious as to why this is happening. I load a text file into Spark using the basic syntax, and then I just simply repeat this command. The output of the REPL is below:
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[1] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[3] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[5] at textFile at <console>:24
scala> val myreviews = sc.textFile("Reviews.csv")
myreviews: org.apache.spark.rdd.RDD[String] = Reviews.csv MapPartitionsRDD[7] at textFile at <console>:24
I know that the MapPartitionsRDD[X] portion features X as the RDD identifier. However, based upon this SO post on RDD identifiers, I'd expect that the identifier integer increments by one each time a new RDD is created. So why exactly is it incrementing by 2?
Is my guess correct that loading a text file creates an intermediate RDD? Clearly, creating an RDD with parallelize() only increments the RDD counter by 1 (before, it was 7):
scala> val arrayrdd = sc.parallelize(Array(3,4,5))
arrayrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24
Note: I don't believe the number has anything to do with partitions. If I check, I see that my RDD is split into 9 partitions:
scala> myreviews.partitions.size
res2: Int = 9
This is because a single method call can create more than one intermediate RDD. It becomes obvious if you check the debug string:
sc.textFile("README.md").toDebugString
String =
(2) README.md MapPartitionsRDD[1] at textFile at <console>:25 []
| README.md HadoopRDD[0] at textFile at <console>:25 []
As you can see, the lineage consists of two RDDs.
The first one is a HadoopRDD, which corresponds to the data import:
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions)
The second one is a MapPartitionsRDD and corresponds to the subsequent map, which drops the keys (offsets) and converts each Text value to a String:
.map(pair => pair._2.toString).setName(path)
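As a quick sketch, the id field of each RDD makes the jump of 2 visible directly (r1 and r2 are illustrative names):
val r1 = sc.textFile("README.md")
val r2 = sc.textFile("README.md")
println(s"${r1.id} -> ${r2.id}")   // ids differ by 2, e.g. 1 -> 3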

Scala: bug with getLines?

I'm facing a problem with very simple file usage in Scala. I don't understand whether this comes from a bug or from a misunderstanding of what I'm doing...
It is even reproducible from a worksheet in the Scala/Eclipse IDE. I'm using IDE 4.6.1 and Scala 2.12.2.
Code is very simple :
//********************************
import scala.io.Source
import java.io.File
import java.io.PrintWriter

object Embed {
  val filename = "proteins.csv"
  val handler = Source.fromFile(filename)

  val header: String = handler.getLines().next()
  println(">" + header)
  val header2: String = handler.getLines().next()
  println(">" + header2)
  val header3: String = handler.getLines().next()
  println(">" + header3)
}
//**********************
The first 3 lines of the file are a bit long and make no sense to non-bio specialists:
Protein Group,Protein ID,Accession,Significance,Coverage (%),#Peptides,#Unique,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity,Cond_D Intensity,Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity,Group 3 Intensity,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Description
261,247,P0AFG4|ODO1_ECOL6,200.00,39,30,30,Carbamidomethylation; Deamidation (NQ); Oxidation (M),1.7E5,9.87E4,5.51E4,3.09E4,3.09:1.79:1.00:0.56,1.7E5,9.87E4,5.51E4,3.09E4,3.09:1.79:1.00:0.56,105062,2-oxoglutarate dehydrogenase E1 component OS=Escherichia coli O6:H1 (strain CFT073 / ATCC 700928 / UPEC) GN=sucA PE=3 SV=1
287,657,B7NDL4|MDH_ECOLU,200.00,54,14,1,Carbamidomethylation; Deamidation (NQ); Oxidation (M),6.27E4,4.14E4,1.81E4,1.28E4,3.47:2.29:1.00:0.71,6.27E4,4.14E4,1.81E4,1.28E4,3.47:2.29:1.00:0.71,32336,Malate dehydrogenase OS=Escherichia coli O17:K52:H18 (strain UMN026 / ExPEC) GN=mdh PE=3 SV=1
I won't go into the details of this file, but it is a 3,600-line file, each line containing 20 fields separated by commas and an end-of-line character. The first line is the header.
I also tried with different line endings, with the same result:
The first line is read correctly, but the second read returns only the final part of the 8th line of the file, and so on, so I cannot read/parse my file.
The following is the result I get:
val filename = "proteins.csv"
//> filename : String = proteins.csv
val handler = Source.fromFile(filename) //> handler : scala.io.BufferedSource = non-empty iterator
val header:String = handler.getLines().next() //> header : String = Protein Group,Protein ID,Accession,Significance,Coverage
//| (%),#Peptides,#Unique,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity
//| ,Cond_D Intensity,Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity
//| ,Group 3 Intensity,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Descrip
//| tion
println (">"+header) //> >Protein Group,Protein ID,Accession,Significance,Coverage (%),#Peptides,#Uni
//| que,PTM,Cond_A Intensity,Cond_B Intensity,Cond_C Intensity,Cond_D Intensity,
//| Sample Profile (Ratio),Group 1 Intensity,Group 2 Intensity,Group 3 Intensity
//| ,Group 4 Intensity,Group Profile (Ratio),Avg. Mass,Description
val header2:String = handler.getLines().next() //> header2 : String = TCC 700928 / UPEC) GN=fumA PE=3 SV=2
println (">"+header2) //> >TCC 700928 / UPEC) GN=fumA PE=3 SV=2
val header3:String = handler.getLines().next() //> header3 : String = n SE11) GN=zapB PE=3 SV=1
println (">"+header3) //> >n SE11) GN=zapB PE=3 SV=1
Any idea what I'm doing wrong?
Many thanks for helping.
No hurry: this is part of an attempt to use Scala, and I'll go back to Python for now to get the job done!
If I understand you correctly, the problem is that every time you call handler.getLines() you receive a new Iterator[String] object, but all of them read from the same underlying buffered source, so the reads interfere with each other. You should try something like this:
val lineIterator = Source.fromFile("proteins.csv").getLines() // Get the iterator object
val firstLine = lineIterator.next()
val secondLine = lineIterator.next()
val thirdLine = lineIterator.next()
Or this:
val lines = Source.fromFile("proteins.csv").getLines().toIndexedSeq // Convert iterator to the list of lines
val n = 2
val nLine = lines(n)
println(nLine)
Your mistake is that you called handler.getLines() three times, i.e. a BufferedLineIterator is instantiated three times, and each call to next reads from the same underlying source. That is why you are getting seemingly random outputs.
The correct way is to create only one iterator with handler.getLines() and call next on it:
val linesIterator = handler.getLines()
val header:String = linesIterator.next()
println (">"+header)
val header2:String = linesIterator.next()
println (">"+header2)
val header3:String = linesIterator.next()
println (">"+header3)
More simply, you don't even need to call next() explicitly; you can iterate directly:
for (line <- handler.getLines()) {
  println(">" + line)
}
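For completeness, here is a small sketch of the same idea with the whole file materialised once through a single iterator and the source closed afterwards (plain Scala 2.12, no extra libraries assumed):
import scala.io.Source

val src = Source.fromFile("proteins.csv")
try {
  val lines = src.getLines().toVector   // a single iterator, consumed exactly once
  lines.take(3).foreach(l => println(">" + l))
} finally {
  src.close()                           // release the file handle
}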

Spark RDD default number of partitions

Version: Spark 1.6.2, Scala 2.10
I am executing the commands below in the spark-shell.
I am trying to see the number of partitions that Spark creates by default.
val rdd1 = sc.parallelize(1 to 10)
println(rdd1.getNumPartitions) // ==> Result is 4
//Creating rdd for the local file test1.txt. It is not HDFS.
//File content is just one word "Hello"
val rdd2 = sc.textFile("C:/test1.txt")
println(rdd2.getNumPartitions) // ==> Result is 2
As per the Apache Spark documentation, spark.default.parallelism is the number of cores in my laptop (which has a 2-core processor).
My question is: rdd2 seems to give the correct result of 2 partitions, as stated in the documentation. But why does rdd1 give a result of 4 partitions?
The minimum number of partitions is actually a lower bound set by the SparkContext. Since Spark uses Hadoop under the hood, the Hadoop InputFormat will still drive the behaviour by default.
The first case should reflect defaultParallelism, as mentioned here, which may differ depending on settings and hardware (number of cores, etc.).
So unless you provide the number of slices, the first case is defined by the number described by sc.defaultParallelism:
scala> sc.defaultParallelism
res0: Int = 6
scala> sc.parallelize(1 to 100).partitions.size
res1: Int = 6
As for the second case, with sc.textFile, the default number of slices is the minimum number of partitions, which is equal to 2, as you can see in this section of the code.
Thus, you should consider the following:
sc.parallelize will take numSlices or defaultParallelism.
sc.textFile will take the maximum of minPartitions and the number of input splits computed by the Hadoop InputFormat (which depends on the total input size and the block size).
sc.textFile calls sc.hadoopFile, which creates a HadoopRDD that uses InputFormat.getSplits under the hood [Ref. InputFormat documentation].
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException: Logically split the set of input files for the job.
Each InputSplit is then assigned to an individual Mapper for processing.
Note: The split is a logical split of the inputs and the input files are not physically split into chunks. For e.g. a split could be a <input-file-path, start, offset> tuple.
Parameters: job - job configuration; numSplits - the desired number of splits, a hint.
Returns: an array of InputSplits for the job.
Throws: IOException
Example:
Let's create some dummy text files:
fallocate -l 241m bigfile.txt
fallocate -l 4G hugefile.txt
This creates 2 files of size 241MB and 4GB, respectively.
We can see what happens when we read each of the files:
scala> val rdd = sc.textFile("bigfile.txt")
// rdd: org.apache.spark.rdd.RDD[String] = bigfile.txt MapPartitionsRDD[1] at textFile at <console>:27
scala> rdd.getNumPartitions
// res0: Int = 8
scala> val rdd2 = sc.textFile("hugefile.txt")
// rdd2: org.apache.spark.rdd.RDD[String] = hugefile.txt MapPartitionsRDD[3] at textFile at <console>:27
scala> rdd2.getNumPartitions
// res1: Int = 128
Both of them are actually backed by HadoopRDDs:
scala> rdd.toDebugString
// res2: String =
// (8) bigfile.txt MapPartitionsRDD[1] at textFile at <console>:27 []
// | bigfile.txt HadoopRDD[0] at textFile at <console>:27 []
scala> rdd2.toDebugString
// res3: String =
// (128) hugefile.txt MapPartitionsRDD[3] at textFile at <console>:27 []
// | hugefile.txt HadoopRDD[2] at textFile at <console>:27 []
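As a further sketch (not part of the original answer), the optional second argument of textFile only raises the lower bound on partitions; the actual number still comes from the Hadoop split computation:
val rdd3 = sc.textFile("bigfile.txt", 16)   // ask for at least 16 partitions
rdd3.getNumPartitions                       // expected >= 16 for the 241MB file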

Get objects out of scala interpreter

I spent hours investigating this topic, and I am definitely out of my depth here. What I want is to run the Scala interpreter programmatically and be able to extract object values from the interpreter. For example, if I send
val a = 1
val b = a + 1
I want to be able to read out b as an Int, not just a string printed out like
b = 2
The source code is dense, and so far I don't see any part that would allow such an extraction. Are there any experts here who can give me a tip, or tell me this is utter nonsense?
How do I get typed objects out of the scala interpreter between sessions?
Use JSR 223.
Welcome to Scala version 2.11.7 [...]
scala> import javax.script._
import javax.script._
scala> val engine = (new ScriptEngineManager).getEngineByName("scala")
engine: javax.script.ScriptEngine = scala.tools.nsc.interpreter.IMain@4233e892
scala> engine.eval("val a = 1")
res0: Object = 1
scala> engine.eval("val b = a + 1")
res1: Object = 2
scala> engine.eval("b").asInstanceOf[Int]
res2: Int = 2
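Building on that answer, here is a hedged sketch of a tiny helper that evaluates an expression and casts the result to the expected type (evalAs is a name introduced here; it assumes the Scala compiler is on the classpath, as in the session above):
import javax.script._

val engine = (new ScriptEngineManager).getEngineByName("scala")

// evaluate a snippet and cast its result to the expected type
def evalAs[T](code: String): T = engine.eval(code).asInstanceOf[T]

val b: Int = evalAs[Int]("val a = 1; a + 1")   // b: Int = 2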