Scala Spark loop goes through without any error, but does not produce an output

I have a file in HDFS containing paths of various other files. Here is the file called file1:
path/of/HDFS/fileA
path/of/HDFS/fileB
path/of/HDFS/fileC
.
.
.
I am using a for loop in Scala Spark as follows to read each line of the above file and process it in another function:
val lines=Source.fromFile("path/to/file1.txt").getLines.toList
for(i<-lines){
i.toString()
val firstLines=sc.hadoopFile(i,classOf[TextInputFormat],classOf[LongWritable],classOf[Text]).flatMap {
case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
}
}
When I run the above loop, it runs through without returning any errors and I get the Scala prompt on a new line: scala>
However, when I try to see a few lines of the output that should be stored in firstLines, it does not work:
scala> firstLines
<console>:38: error: not found: value firstLines
firstLines
^
What is wrong with the above loop, given that it runs without any errors but does not produce the output?
Additional info
The function hadoopFile accepts a String path name as its first parameter. That is why I am trying to pass each line of file1 (each line is a path name) as a String in the first parameter i. The flatMap functionality is taking the first line of the file that has been passed to hadoopFile and stores that alone and dumps all the other lines. So the desired output (firstLines) should be the first line of all the files that are being passed to hadoopFile through their path names (i).
I tried running the function for just a single file, without a loop, and that produces the output:
val firstLines=sc.hadoopFile("path/of/HDFS/fileA",classOf[TextInputFormat],classOf[LongWritable],classOf[Text]).flatMap {
case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
}
scala> firstLines.take(3)
res27: Array[String] = Array(<?xml version="1.0" encoding="utf-8"?>)
fileA is an XML file, so you can see the resulting first line of that file. So I know the function works fine, it is just a problem with the loop that I am not able to figure out. Please help.

The variable firstLines is defined in the body of the for loop and its scope is therefore limited to this loop. This means you cannot access the variable outside of the loop, and this is why the Scala compiler tells you error: not found: value firstLines.
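The scoping rule can be seen without Spark at all. The following is only a minimal sketch in plain Scala: the object name ScopeDemo and the baseNames value are made up for illustration, and splitting the path stands in for the real per-file work.

```scala
object ScopeDemo {
  val paths = List("path/of/HDFS/fileA", "path/of/HDFS/fileB")

  // A val declared inside the loop body is gone once the iteration ends:
  //   for (p <- paths) { val name = p.split('/').last }
  //   name   // does not compile: not found: value name

  // for ... yield (or map) returns the per-element results as a new List,
  // which is bound to a name that is visible after the loop.
  val baseNames: List[String] = for (p <- paths) yield p.split('/').last
}
```

Binding the result of the whole expression, rather than a value inside the body, is what makes it available afterwards.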
From your description, I understand that you want to collect the first line of every file listed in lines.
"Every" can be expressed with different constructs in Scala. We can use something like the for loop you wrote or, even better, adopt a functional approach and apply a map function to the list of files. In the code below, I put inside the map the code from your description, which creates a HadoopRDD and applies flatMap with your function to retrieve the first line of a file.
We then obtain a list of RDD[String]. At this stage, note that we have not started to do any actual work. To trigger the evaluation of the RDDs and collect the results, we need an additional call to the collect method for each RDD in our list.
import scala.io.Source
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
// Renamed "lines" to "fileNames" as it is more explicit.
val fileNames = Source.fromFile("path/to/file1.txt").getLines.toList
val firstLinesRDDs = fileNames.map(sc.hadoopFile(_,classOf[TextInputFormat],classOf[LongWritable],classOf[Text]).flatMap {
  // The key is the byte offset of the line, so offset 0 is the first line.
  case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
})
// firstLinesRDDs is a list of RDD[String]. Based on this code, each RDD
// should consist of a single String value. We collect them using RDD#collect:
val firstLines = firstLinesRDDs.map(_.collect)
However, this approach suffers from a flaw that prevents us from benefiting from any advantage Spark can provide.
When we apply the operation in map to fileNames, we are not working with an RDD, so the file names are processed sequentially on the driver (the process which hosts your Spark session) and are not part of a single parallelizable Spark job. This is equivalent to doing what you wrote in your second block of code, one file name at a time.
What can we do to address the problem? A good thing to keep in mind when working with Spark is to push the declaration of the RDDs as early as possible in our code. Why? Because this allows Spark to parallelize and optimize the work we want to do. Your example could be a textbook illustration of this concept, though the requirement to manipulate files adds some complexity here.
In our present case, we can benefit from the fact that hadoopFile accepts a comma-separated list of paths as input. Therefore, instead of sequentially creating an RDD for every file, we create one RDD for all of them:
val firstLinesRDD = sc.hadoopFile(fileNames.mkString(","), classOf[TextInputFormat],classOf[LongWritable],classOf[Text]).flatMap {
case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
}
And we retrieve our first lines with a single collect:
val firstLines = firstLinesRDD.collect
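Since everything now hinges on one comma-separated string, it may be worth cleaning the list before joining it. This is only a sketch: the helper name toCommaSeparated is hypothetical, and the idea that the listing file may contain blank lines or stray whitespace is an assumption, not something stated in the question.

```scala
object PathList {
  // Build the single comma-separated path string from the raw lines of the
  // listing file, dropping surrounding whitespace and empty lines first.
  def toCommaSeparated(lines: Seq[String]): String =
    lines.map(_.trim).filter(_.nonEmpty).mkString(",")
}
```

The result could then be passed as the first argument in place of fileNames.mkString(",").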

Related

Transforming `PCollection` with many elements into a single element

I am trying to convert a PCollection, that has many elements, into a PCollection that has one element. Basically, I want to go from:
[1,2,3,4,5,6]
to:
[[1,2,3,4,5,6]]
so that I can work with the entire PCollection in a DoFn.
I've tried CombineGlobally(lambda x: x), but only a portion of the elements get combined into an array at a time, giving me the following result:
[1,2,3,4,5,6] -> [[1,2],[3,4],[5,6]]
Or something to that effect.
This is the relevant portion of the script that I'm trying to run:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
raw_input = range(1024)
def run_test():
    # The pipeline runs when the with block exits.
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)
        def combine(x):
            print(x)
            return x
        (
            input
            | "Global aggregation" >> beam.CombineGlobally(combine)
        )
run_test()
I figured out a pretty painless way to do this, which I missed in the docs:
The more general way to combine elements, and the most flexible, is
with a class that inherits from CombineFn.
CombineFn.create_accumulator(): This creates an empty accumulator. For
example, an empty accumulator for a sum would be 0, while an empty
accumulator for a product (multiplication) would be 1.
CombineFn.add_input(): Called once per element. Takes an accumulator
and an input element, combines them and returns the updated
accumulator.
CombineFn.merge_accumulators(): Multiple accumulators could be
processed in parallel, so this function helps merging them into a
single accumulator.
CombineFn.extract_output(): It allows to do additional calculations
before extracting a result.
I suppose supplying a lambda function that simply passes its argument to the "vanilla" CombineGlobally wouldn't do what I expected initially. That functionality has to be specified by me (although I still think it's weird this isn't built into the API).
You can find more about subclassing CombineFn here, which I found very helpful:
A CombineFn specifies how multiple values in all or part of a
PCollection can be merged into a single value—essentially providing
the same kind of information as the arguments to the Python “reduce”
builtin (except for the input argument, which is an instance of
CombineFnProcessContext). The combining process proceeds as follows:
Input values are partitioned into one or more batches.
For each batch, the create_accumulator method is invoked to create a fresh initial “accumulator” value representing the combination of
zero values.
For each input value in the batch, the add_input method is invoked to combine more values with the accumulator for that batch.
The merge_accumulators method is invoked to combine accumulators from separate batches into a single combined output accumulator value,
once all of the accumulators have had all the input value in their
batches added to them. This operation is invoked repeatedly, until
there is only one accumulator value left.
The extract_output operation is invoked on the final accumulator to get the output value. Note: If this CombineFn is used with a transform
that has defaults, apply will be called with an empty list at
expansion time to get the default value.
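To make the batching concrete, here is a rough model of those steps in plain Scala. It is not Beam code: the object name CombineModel and the batchSize parameter are made up for illustration, and a real runner chooses its own batches.

```scala
object CombineModel {
  // The four steps described above, with a List[Int] as the accumulator.
  def createAccumulator(): List[Int] = Nil
  def addInput(acc: List[Int], element: Int): List[Int] = acc :+ element
  def mergeAccumulators(accs: Seq[List[Int]]): List[Int] = accs.flatten.toList
  def extractOutput(acc: List[Int]): List[Int] = acc

  // Split the input into batches, fold each batch into its own accumulator,
  // then merge the per-batch accumulators into one output value.
  def combineGlobally(input: Seq[Int], batchSize: Int): List[Int] = {
    val perBatch = input
      .grouped(batchSize)
      .map(batch => batch.foldLeft(createAccumulator())((acc, e) => addInput(acc, e)))
      .toSeq
    extractOutput(mergeAccumulators(perBatch))
  }
}
```

Without the merge step, each batch would surface as its own partial list, which matches the [[1,2],[3,4],[5,6]] shape seen with the pass-through lambda.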
So, by subclassing CombineFn, I wrote this simple implementation, Aggregated, that does exactly what I want:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
raw_input = range(1024)
class Aggregated(beam.CombineFn):
    def create_accumulator(self):
        return []
    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator
    def merge_accumulators(self, accumulators):
        merged = []
        for a in accumulators:
            for item in a:
                merged.append(item)
        return merged
    def extract_output(self, accumulator):
        return accumulator
def run_test():
    # The pipeline runs when the with block exits.
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)
        (
            input
            | "Global aggregation" >> beam.CombineGlobally(Aggregated())
            | "print" >> beam.Map(print)
        )
run_test()
You can also accomplish what you want with side inputs, e.g.
with beam.Pipeline() as p:
    pcoll = ...
    (p
     # Create a PCollection with a single element.
     | beam.Create([None])
     # This will process the singleton exactly once,
     # with the entirety of pcoll passed in as a second argument as a list.
     | beam.Map(
         lambda _, pcoll_as_side: ...consume pcoll_as_side here...,
         pcoll_as_side=beam.pvalue.AsList(pcoll)))

How to use filter dynamically in Scala?

I have a raw log file of about 1 TB, with lines like the ones below.
Test X1 SET WARN CATALOG MAP1,MAP2
INFO X2 SET WARN CATALOG MAPX,MAP2,MAP3
I read the log file using Spark with Scala and create an RDD from it.
I need to keep only the lines that contain:
1. SET
2. INFO
3. CATALOG
I wrote the filter like this:
val filterRdd = rdd.filter(f => f.contains("SET")).filter(f => f.contains("INFO")).filter(f => f.contains("CATALOG"))
Can we do the same if these keywords are stored in a list, so that the filter is built dynamically without writing so many lines? In this example I use only three keywords, but in practice there can be up to 15. Can this be done dynamically?
Something like this could work when you require all words to appear in a line:
val words = Seq("SET", "INFO", "CATALOG")
val filterRdd = rdd.filter(f => words.forall(w => f.contains(w)))
and if you want any:
val filterRdd = rdd.filter(f => words.exists(w => f.contains(w)))
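The difference between the two predicates can be checked on an ordinary Seq with the two sample lines from the question. This is just a sketch: the object and method names (KeywordFilter, containsAll, containsAny) are made up, and a local Seq stands in for the RDD.

```scala
object KeywordFilter {
  val words = Seq("SET", "INFO", "CATALOG")

  // True only when every keyword occurs in the line.
  def containsAll(line: String): Boolean = words.forall(w => line.contains(w))
  // True when at least one keyword occurs in the line.
  def containsAny(line: String): Boolean = words.exists(w => line.contains(w))

  val lines = Seq(
    "Test X1 SET WARN CATALOG MAP1,MAP2",
    "INFO X2 SET WARN CATALOG MAPX,MAP2,MAP3"
  )

  val matchingAll: Seq[String] = lines.filter(containsAll) // only the second line
  val matchingAny: Seq[String] = lines.filter(containsAny) // both lines
}
```

The first sample line has no INFO, so it passes the exists version but not the forall version.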

Spark - Create a DataFrame from a list of Rows generated in a loop

I have a loop which generates rows in each iteration. My goal is to create a dataframe, with a given schema, that contains just those rows. I have in mind a set of steps to follow, but I am not able to add a new Row to a List[Row] in each loop iteration.
I am trying the following approach:
var listOfRows = List[Row]()
val dfToExtractValues: DataFrame = ???
dfToExtractValues.foreach { x =>
//Not really important how to generate here the variables
//So to simplify all the rows will have the same values
var col1 = "firstCol"
var col2 = "secondCol"
var col3 = "thirdCol"
val newRow = RowFactory.create(col1,col2,col3)
//This step I am not able to do
//listOfRows += newRow -> Just for strings
//listOfRows.add(newRow) -> This add doesn't exist, it is an addString
//listOfRows.aggregate(1)(newRow) -> This is not how aggregate works...
}
val rdd = sc.makeRDD[RDD](listOfRows)
val dfWithNewRows = sqlContext.createDataFrame(rdd, myOriginalDF.schema)
Can someone tell me what I am doing wrong, or what I could change in my approach to generate a dataframe from the rows I'm generating?
Maybe there is a better way to collect the Rows instead of List[Row]. But then I need to convert that other type of collection into a dataframe.
Can someone tell me what am I doing wrong
Closures:
First of all, it looks like you skipped over Understanding Closures in the Programming Guide. Any attempt to modify variables captured in a closure is futile: all you can do is modify a copy on the executor, and the changes won't be reflected on the driver.
Variable doesn't make object mutable:
The following
var listOfRows = List[Row]()
creates a variable, but the assigned List is just as immutable as before. Outside of a Spark closure you could create a new List and reassign it:
listOfRows = newRow :: listOfRows
Note that we prepend rather than append: appending to an immutable List copies the whole list each time, so you don't want to do it in a loop.
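Outside of Spark, the reassign-and-prepend pattern might look like the sketch below. The object name PrependDemo and the use of String elements instead of Row are assumptions made to keep it free of Spark dependencies.

```scala
object PrependDemo {
  // Build a List by prepending inside a loop and reassigning the var.
  // Prepending with :: is constant time, so the list is built back to
  // front and reversed once at the end to restore the input order.
  def collectAll(values: Seq[String]): List[String] = {
    var acc = List.empty[String]
    for (v <- values) acc = v :: acc
    acc.reverse
  }
}
```

This only works because the loop runs in a single JVM; inside a DataFrame foreach the same code would update a copy of acc on the executors.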
Variables holding immutable objects are useful when you want to share data (it is a common pattern in Akka, for example), but they don't have many applications in Spark.
Keep things distributed:
Finally, never fetch data to the driver just to distribute it again. You should also avoid unnecessary conversions between RDDs and DataFrames. It is best to use DataFrame operators all the way:
dfToExtractValues.select(...)
but if you need something more complex, use map:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
dfToExtractValues.map(x => ...)(RowEncoder(schema))

How to mimic the function map.getOrElse for a CSV file

I have a CSV file that represents a Map[String,Int], and I am reading the file as follows:
def convI2N (vkey:Int):String={
val in = new Scanner("dictionaryNV.csv")
loop.breakable{
while (in.hasNext) {
val nodekey = in.next(',')
val value = in.next('\n')
if (value == vkey.toString){
n=nodekey
loop.break()}
}}
in.close
n
}
The function returns the String for a given Int. The problem is that I must scan the whole file, and the file is too big, so the procedure is too slow. Someone told me that this is O(n) time complexity and recommended that I move to O(log n). I suppose that the function map.getOrElse is O(log n).
Can someone help me find a way to get better performance out of this code?
As an additional comment, the dictionaryNV file is sorted by the Int values.
Maybe I can divide the file by lines, or sets of lines. The CSV has about 167000 tuples [String,Int].
Or, alternatively, how would you do some kind of binary search through the CSV in Scala?
If you are calling the convI2N function many times, the job will definitely be slow, because each call has to scan the big file. In that case it is recommended to load the entries once into a temporary variable (properties, a hash map, or a collection of Tuple2) and to change whatever other code is using up the memory.
You can try the following approach, which should be faster than the Scanner approach.
Assuming that your CSV file is comma separated as
key1,value1
key2,value2
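Using Source.fromFile can be your solution:
import scala.io.Source
def convI2N (vkey:Int):String={
  var n = "not found"
  val source = Source.fromFile("<your path to dictionaryNV.csv>")
  val filtered = source
    .getLines()
    .map(line => line.split(","))
    // Match on the Int column (second field), as in the question.
    .filter(sline => sline(1).trim == vkey.toString)
  for(str <- filtered){
    // Return the String column (first field).
    n = str(0)
  }
  source.close()
  n
}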
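If the function is called many times, loading the file once into a Map, as suggested above, could look roughly like this. The object name Dictionary and the parse helper are made up for illustration, and the sketch assumes every line has exactly two comma-separated fields with a valid Int in the second one.

```scala
object Dictionary {
  // Parse "name,number" lines once into a Map keyed by the Int value.
  // Lines that do not split into exactly two fields are skipped;
  // a non-numeric second field would still throw NumberFormatException.
  def parse(lines: Iterator[String]): Map[Int, String] =
    lines
      .map(_.split(","))
      .collect { case Array(name, value) => value.trim.toInt -> name }
      .toMap

  // Each lookup is now a hash lookup instead of a scan of the file.
  def convI2N(dict: Map[Int, String], vkey: Int): String =
    dict.getOrElse(vkey, "not found")
}
```

With a real file, the iterator would come from scala.io.Source.fromFile(...).getLines(), and the resulting Map would be built once and reused for every call.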

Scala, user input till only newline is given

I have tried to read multiple user inputs and print them in Scala IDE.
I have tried this piece of code
println(scala.io.StdIn.readLine())
which works, as the IDE takes my input and then prints it on the next line, but this works only for a single input.
I want the code to keep taking inputs until an empty line is entered. For example,
1
2
3
So I decided I needed an iterator for the input, which led me to try the following two lines of code separately:
var in = Iterator.continually{ scala.io.StdIn.readLine() }.takeWhile { x => x != null}
and
var in = io.Source.stdin.getLines().takeWhile { x => x != null}
Unfortunately, neither of them worked, as the IDE is not taking my input at all.
You're really close.
val in = Iterator.continually(io.StdIn.readLine).takeWhile(_.nonEmpty).toList
This will read input until an empty string is entered and save the input in a List[String]. The reason for toList is that an Iterator element doesn't become real until next is called on it, so readLine won't be called until the next element is required. The conversion to List forces all the elements of the Iterator to be created.
update
As @vossad01 has pointed out, this can be made safer for unexpected input, such as the null that readLine returns at the end of the input stream.
val in = Iterator.continually(io.StdIn.readLine)
.takeWhile(Option(_).fold(false)(_.nonEmpty))
.toList
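To try the stopping condition without a console, the line reader can be passed in as a function. This is a sketch under that assumption: the names ReadUntilBlank and readAll are hypothetical, and with real input the argument would be () => scala.io.StdIn.readLine().

```scala
object ReadUntilBlank {
  // Keep calling readLine until it returns an empty string or null
  // (null is what readLine gives back at end of input), then
  // materialize the lines read so far into a List.
  def readAll(readLine: () => String): List[String] =
    Iterator
      .continually(readLine())
      .takeWhile(line => line != null && line.nonEmpty)
      .toList
}
```

For example, feeding it the lines "1", "2", "3" followed by an empty line yields List("1", "2", "3"), and anything after the empty line is never read.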