pyspark: 'PipelinedRDD' object is not iterable

I am getting this error but I do not know why.
Basically, the error comes from this code:
a = data.mapPartitions(helper(locations))
where data is an RDD and my helper is defined as:
def helper(iterator, locations):
    for x in iterator:
        c = locations[x]
        yield c
(locations is just an array of data points)
I do not see what the problem is, but I am also not the best at PySpark, so can someone please tell me why I am getting 'PipelinedRDD' object is not iterable from this code?

An RDD can be iterated over by using map and lambda functions. I have iterated through a PipelinedRDD using the method below:
lines1 = sc.textFile("\..\file1.csv")
lines2 = sc.textFile("\..\file2.csv")
pairs1 = lines1.map(lambda s: (int(s), 'file1'))
pairs2 = lines2.map(lambda s: (int(s), 'file2'))
pair_result = pairs1.union(pairs2)
pair = pair_result.reduceByKey(lambda a, b: a + ',' + b)
result = pair.map(lambda l: tuple(l[:1]) + tuple(l[1].split(',')))
result_ll = [list(elem) for elem in result]
===> result_ll = [list(elem) for elem in result]
TypeError: 'PipelinedRDD' object is not iterable
Instead, I replaced the iteration with the map function:
result_ll = result.map( lambda elem: list(elem))
Hope this helps you modify your code accordingly.

I prefer the answer given in another question, linked below:
Can not access Pipelined Rdd in pyspark
You cannot iterate over an RDD; you first need to call an action to get your data back to the driver.
Quick sample:
>>> test = sc.parallelize([1,2,3])
>>> for i in test:
... print i
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'RDD' object is not iterable
but, for example, you can use .collect():
>>> for i in test.collect():
... print i
1
2
3

Related

Spark print result of Arrays or saveAsTextFile

I have been bogged down by this for some hours now... I tried collect and mkString(",") and still I am not able to print to the console or save as a text file.
scala> val au1 = sc.parallelize(List(("a",Array(1,2)),("b",Array(1,2))))
scala> val au2 = sc.parallelize(List(("a",Array(3)),("b",Array(2))))
scala> val au3 = au1.union(au2)
Result of the union is
Array[(String, Array[Int])] = Array((a,Array(1,2)), (b,Array(1,2)), (a,Array(3)), (b,Array(2)))
All the print attempts result in the following when I do x(0) and x(1):
(String, Array[Int]) does not take parameters
As a last attempt, I performed the following and it resulted in the following error:
scala> val au4 = au3.map(x => (x._1, x._2._1._1, x._2._1._2))
<console>:33: error: value _1 is not a member of Array[Int]
val au4 = au3.map(x => (x._1, x._2._1._1, x._2._1._2))
._1 or ._2 can be used on tuples, not on arrays.
("a",Array(1,2)) is a tuple, so ._1 is "a" and ._2 is Array(1,2),
so if you want to get an element of an array you need to use (), as in x._2(0).
But au2's arrays have only one element, so x._2(1) will work for au1 and not for au2. You can use Option or Try as below:
import scala.util.Try
val au4 = au3.map(x => (x._1, x._2(0), Try(x._2(1)) getOrElse(0)))
The result of au3 is not Array[(String, Array[Int])], it is RDD[(String, Array[Int])],
so this is how you could write the output to a file:
au3.map( r => (r._1, r._2.map(_.toString).mkString(":")))
.saveAsTextFile("data/result")
You need to map over the array and create a string from it so that it can be written to the file as:
(a,1:2)
(b,1:2)
(a,3)
(b,2)
You could write to the file without brackets as below:
import org.apache.spark.sql.Row
au3.map( r => Row(r._1, r._2.map(_.toString).mkString(":")).mkString(","))
.saveAsTextFile("data/result")
Output:
a,1:2
b,1:2
a,3
b,2
The values are comma (",") separated and the array values are separated by ":".
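If you only want to print the result to the console rather than save it to a file, a minimal sketch (my own addition, reusing the au3 defined above) is to collect the RDD to the driver first:
au3.map { case (k, arr) => (k, arr.mkString(":")) } // turn each array into a "1:2"-style string
  .collect()                                        // bring the (small) result back to the driver
  .foreach(println)                                 // prints e.g. (a,1:2); order is not guaranteed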
Hope this helps!

Spark scala filter multiple rdd based on string length

I am trying to solve one of the quiz questions, which is as below:
Write the missing code in the given program to display the expected output to identify animals that have names with four
letters.
Output: Array((4,lion))
Program
val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant","falcon","squid"),2)
val d = c.keyBy(_.length)
I have tried to write the code in the Spark shell but got stuck on the syntax to combine the four RDDs and apply a filter.
How about using the PairRDD lookup method:
b.lookup(4).toArray
// res1: Array[String] = Array(lion)
d.lookup(4).toArray
// res2: Array[String] = Array()
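Alternatively, if the quiz expects the keyed RDDs to be combined with union and then filtered, a minimal sketch (my own guess at the intended solution, not part of the answer above) would be:
// keep only the pairs whose key (the name length) is 4
val e = b.union(d).filter { case (len, _) => len == 4 }
e.collect()
// Array[(Int, String)] = Array((4,lion))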

How to concat multiple columns in a data frame using Scala

I am trying to concat multiple columns in a data frame. My column list is present in a variable. I am trying to pass that variable into the concat function but am not able to do that.
For example, base_tbl_columns contains the list of columns, and I am using the code below to select all the columns mentioned in the variable.
scala> val base_tbl_columns = scd_table_keys_df.first().getString(5).split(",")
base_tbl_columns: Array[String] = Array(acct_nbr, account_sk_id, zip_code, primary_state, eff_start_date, eff_end_date, load_tm, hash_key, eff_flag)
val hist_sk_df_ld = hist_sk_df.select(base_tbl_columns.head,base_tbl_columns.tail: _*)
Similarly, I have one more list which I want to use for concatenation. But there the concat function does not accept the .head and .tail arguments.
scala> val hash_key_cols = scd_table_keys_df.first().getString(4)
hash_key_cols: String = primary_state,zip_code
Here I am hard-coding the values primary_state and zip_code:
.withColumn("hash_key_col",concat($"primary_state",$"zip_code"))
Here I am passing the variable hash_key_cols:
.withColumn("hash_key_col",concat(hash_key_cols ))
I was able to do this in Python by using the code below.
hist_sk_df = hist_tbl_df.join(broadcast(hist_tbl_lkp_df), primary_key_col, 'inner').withColumn("eff_start_date", lit(load_dt)).withColumn('hash_key_col', F.concat(*hash_key_cols)).withColumn("hash_key", hash_udf('hash_key_col')).withColumn("eff_end_date", lit(eff_close_dt)).withColumn("load_tm", lit(load_tm)).withColumn("eff_flag", lit(eff_flag_curr))
Either:
import org.apache.spark.sql.functions.{col, concat, expr}
val base_tbl_columns: Array[String] = ???
df.select(concat(base_tbl_columns.map(c => col(c)): _*))
or:
df.select(expr(s"""concat(${base_tbl_columns.mkString(",")})"""))
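Applied to the hash_key_cols from the question, here is a sketch under the assumption that it stays a comma-separated string such as "primary_state,zip_code" (the hashCols and df_with_hash names are just illustrative):
// split the string into column names, turn them into Column objects, then concat
val hashCols = hash_key_cols.split(",").map(c => col(c.trim))
val df_with_hash = hist_sk_df_ld.withColumn("hash_key_col", concat(hashCols: _*))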

Finding lines that start with a digit in Scala using filter() method

I am a Python programmer, and as the Python API is too slow for my Spark application, I decided to port my code to the Spark Scala API to compare the computation time.
I am trying to filter the lines that start with numeric characters from a huge file using the Scala API in Spark. In my file, some lines have numbers and some have words, and I want only the lines that start with numbers.
So, in my Python application, I have these lines.
l = sc.textFile("my_file_path")
l_filtered = l.filter(lambda s: s[0].isdigit())
which works exactly as I want.
This is what I have tried so far.
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.forall(_.isDigit))
This throws an error saying that Char does not have a forall() function.
I also tried taking the first character of the line using s.take(1) and applying the isDigit() function on that in the following way:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).isDigit)
and this too...
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).Character.isDigit)
This also throws an error.
This is basically a small error, and as I am not accustomed to Scala syntax, I am having a hard time figuring it out. Any help would be appreciated.
Edit: As suggested in an answer to this question, I tried writing a function, but I am unable to use it inside the filter() function in my application to apply it to all the lines in the file.
In Scala indexing syntax uses parens () instead of brackets []. The exact translation of your Python code would be this:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_(0).isDigit)
A more idiomatic extraction of the first symbol would be to use the head method:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.head.isDigit)
Both of these methods would fail if your file contains empty lines.
If that's the case, then you probably want this:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.map(_.isDigit).getOrElse(false))
UPD.
As curious noted, map(predicate).getOrElse(false) on an Option can be shortened to exists(predicate):
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.exists(_.isDigit))
You can use regular expressions:
scala> List("1hello","2world","good").filter(_.matches("^[0-9].*$"))
res0: List[String] = List(1hello, 2world)
or you can do it like this with fewer operations, since the file might contain a huge number of lines to filter:
scala> List("1hello","world").filter(_.headOption.exists(_.isDigit))
res1: List[String] = List(1hello)
Replace the List[String] with your lines l to make it work in your case.
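Regarding the edit about using a function inside filter(): a minimal sketch (my own illustration, not taken from the answers above) of defining the predicate once and passing it by name would be:
// headOption keeps this safe on empty lines
def startsWithDigit(line: String): Boolean = line.headOption.exists(_.isDigit)
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(startsWithDigit) // the method is passed where a function is expected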

Storing the contents of a file in an immutable Map in scala

I am trying to implement a simple word count in Scala using an immutable map (this is intentional), and the way I am trying to accomplish it is as follows:
Create an empty immutable map
Create a scanner that reads through the file.
While the scanner.hasNext() is true:
Check if the map contains the word; if it does not, initialize the count to zero
Create a new entry with the key=word and the value=count+1
Update the map
At the end of the iteration, the map is populated with all the values.
My code is as follows:
val wordMap = Map.empty[String,Int]
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
  val token = input.next()
  val currentCount = wordMap.getOrElse(token,0) + 1
  val wordMap = wordMap + (token,currentCount)
}
The idea is that wordMap will have all the word counts at the end of the iteration...
Whenever I try to run this snippet, I get the following error:
recursive value wordMap needs type.
Can somebody point out why I am getting this exception and what can I do to remedy it?
Thanks
val wordMap = wordMap + (token,currentCount)
This line is redefining an already-defined value (val declares a value). If you want to reassign it, you need to define wordMap as a var and then just use:
wordMap = wordMap + (token -> currentCount)
Though how about this instead?
io.Source.fromFile("textfile.txt") // read from the file
.getLines.flatMap{ line => // for each line
line.split("\\s+") // split the line into tokens
.groupBy(identity).mapValues(_.size) // count each token in the line
} // this produces an iterator of token counts
.toStream // make a Stream so we can groupBy
.groupBy(_._1).mapValues(_.map(_._2).sum) // combine all the per-line counts
.toList
Note that the per-line pre-aggregation is used to try to reduce the memory required; counting across the entire file at once might require too much memory.
If your file is really massive, I would suggest doing this in parallel (since word counting is trivial to parallelize) using either Scala's parallel collections or Hadoop (using one of the cool Scala Hadoop wrappers like Scrunch or Scoobi).
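As a rough illustration of the parallel-collections route, here is a minimal sketch (my own addition, assuming pre-2.13 Scala, where .par is available on the standard collections):
val counts = io.Source.fromFile("textfile.txt").getLines
  .flatMap(_.split("\\s+"))                   // all tokens, lazily
  .toVector.par                               // switch to a parallel collection
  .groupBy(identity)                          // word -> all of its occurrences
  .map { case (word, ws) => word -> ws.size } // word -> count
  .seq.toMap                                  // back to an ordinary immutable Map
Unlike the Stream-based version above, this reads all tokens into memory, so it trades streaming for parallelism.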
EDIT: Detailed explanation:
Ok, first look at the inner part of the flatMap. We take a string, and split it apart on whitespace:
val line = "a b c b"
val tokens = line.split("\\s+") // Array(a, b, c, b)
Now identity is a function that just returns its argument, so if we groupBy(identity), we map each distinct word type to all of its tokens:
val grouped = tokens.groupBy(identity) // Map(c -> Array(c), a -> Array(a), b -> Array(b, b))
And finally, we want to count up the number of tokens for each type:
val counts = grouped.mapValues(_.size) // Map(c -> 1, a -> 1, b -> 2)
Since we map this over all the lines in the file, we end up with token counts for each line.
So what does flatMap do? Well, it runs the token-counting function over each line, and then combines all the results into one big collection.
Assume the file is:
a b c b
b c d d d
e f c
Then we get:
val countsByLine =
io.Source.fromFile("textfile.txt") // read from the file
.getLines.flatMap{ line => // for each line
line.split("\\s+") // split the line into tokens
.groupBy(identity).mapValues(_.size) // count each token in the line
} // this produces an iterator of token counts
println(countsByLine.toList) // List((c,1), (a,1), (b,2), (c,1), (d,3), (b,1), (c,1), (e,1), (f,1))
So now we need to combine the counts of each line into one big set of counts. The countsByLine variable is an Iterator, so it doesn't have a groupBy method. Instead we can convert it to a Stream, which is basically a lazy list. We want laziness because we don't want to have to read the entire file into memory before we start. Then the groupBy groups all counts of the same word type together.
val groupedCounts = countsByLine.toStream.groupBy(_._1)
println(groupedCounts.mapValues(_.toList)) // Map(e -> List((e,1)), f -> List((f,1)), a -> List((a,1)), b -> List((b,2), (b,1)), c -> List((c,1), (c,1), (c,1)), d -> List((d,3)))
And finally we can sum up the counts from each line for each word type by grabbing the second item from each tuple (the count), and summing:
val totalCounts = groupedCounts.mapValues(_.map(_._2).sum)
println(totalCounts.toList)
List((e,1), (f,1), (a,1), (b,3), (c,3), (d,3))
And there you have it.
You have a few mistakes: you've defined wordMap twice (val is to declare a value). Also, Map is immutable, so you either have to declare it as a var or use a mutable map (I suggest the former).
Try this:
var wordMap = Map.empty[String,Int] withDefaultValue 0
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
val token = input.next()
wordMap += token -> (wordMap(token) + 1)
}