Spark Data Loading Issue - scala

I am getting an ArrayIndexOutOfBoundsException while running the following in spark-shell:
val input = sc.textFile("demo.txt")
input.collect
Both of the calls above work fine.
val out = input.map(_.split(",")).map(r => r(1))
I get the exception for the line above.
demo.txt looks like this (header: Name,Gender,age):
Danial,,14
,Male,18
Hema,,
With Pig, the same file works without any issue!

You can try this out yourself, just start the Scala console and enter your sample lines.
scala> "Danial,,14".split(",")
res0: Array[String] = Array(Danial, "", 14)
scala> ",Male,18".split(",")
res1: Array[String] = Array("", Male, 18)
scala> "Hema,,".split(",")
res2: Array[String] = Array(Hema)
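So the last line is the problem: split drops trailing empty strings, so only one element is left and r(1) is out of bounds. Pass the number of expected columns to split as the limit: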
scala> "Hema,,".split(",", 3)
res3: Array[String] = Array(Hema, "", "")
Or, even better, use a real CSV parser. String.split isn't suitable for production code.
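As a minimal sketch of the fix above (not a full CSV parser, and assuming fields never contain quoted commas), a negative limit keeps every trailing empty field without hard-coding the column count. The names fields and gender are made up for this illustration.

```scala
// Split a comma-separated line, keeping trailing empty fields.
// A negative limit tells split not to discard trailing empty strings.
def fields(line: String): Array[String] = line.split(",", -1)

// Hypothetical helper: the second column, or None when it is missing or empty.
def gender(line: String): Option[String] =
  fields(line).lift(1).filter(_.nonEmpty)

val sample = List("Danial,,14", ",Male,18", "Hema,,")
sample.foreach(l => println(s"${fields(l).length} fields, gender = ${gender(l)}"))
```

On the RDD from the question this would be input.map(gender), which should yield None for rows with a missing gender instead of throwing.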

FlatMap a mutable list in a Spark application using Scala

I am new to Spark/Scala development and trying to get my hands dirty, so please bear with me if you find the question stupid.
Sample dataset
[29430500,1104296400000,1938,F,11,2131,
MutableList([123291654450,1440129600000,100121,0,1440734400000],[234564535,2345129600000,345121,1,14567734400000])
]
The last field is an array, and I want the output to look like this:
Row 1:
[29430500,1104296400000,1938,F,11,2131,
123291654450,1440129600000,100121,0,1440734400000]
Row 2:
[29430500,1104296400000,1938,F,11,2131,
234564535,2345129600000,345121,1,14567734400000]
I think I have to use flatMap, but the following code gives an error:
def getMasterRdd(sc: SparkContext, hiveContext: HiveContext, outputDatabase:String, jobId:String,MasterTableName:String, dataSourceType: DataSourceType, startDate:Long, endDate:Long):RDD[Row]={}
val Rdd1= ClassName.getMasterRdd(sc, hiveContext, "xyz", "test123", "xyz.abc", DataSourceType.SS, 1435723200000L, 1451538000000L)
Rdd1 holds the sample dataset.
val mapRdd1= Rdd1.map(Row => Row.get(6))
val flatmapRdd1 = mapRdd1.flatMap(_.split(","))
When I hover over (_.split(",")), the IDE shows the following:
Type mismatch, expected:(Any) => TraversableOnce[NotInferedU], actual: (Any) =>Any
I think there is a better way to structure this (maybe using tuples instead of Lists) but anyway this works for me:
scala> val myRDD = sc.parallelize(Seq(Seq(29430500L,1104296400000L,1938L,"F",11L,2131L,Seq(Seq(123291654450L,1440129600000L,100121L,0L,1440734400000L),Seq(234564535L,2345129600000L,345121L,1L,14567734400000L)))))
myRDD: org.apache.spark.rdd.RDD[Seq[Any]] = ParallelCollectionRDD[11] at parallelize at <console>:27
scala> :pa
// Entering paste mode (ctrl-D to finish)
val myRDD2 = myRDD.flatMap(row => {
val (beginning, end) = (row.dropRight(1), row.last)
end.asInstanceOf[List[List[Any]]].map(beginning++_)
})
// Exiting paste mode, now interpreting.
myRDD2: org.apache.spark.rdd.RDD[Seq[Any]] = MapPartitionsRDD[10] at flatMap at <console>:29
scala> myRDD2.foreach{println}
List(29430500, 1104296400000, 1938, F, 11, 2131, 123291654450, 1440129600000, 100121, 0, 1440734400000)
List(29430500, 1104296400000, 1938, F, 11, 2131, 234564535, 2345129600000, 345121, 1, 14567734400000)
Use:
rdd.flatMap(row => row.getSeq[String](6).map(_.split(",")))
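The flattening step itself can be tried without Spark. The sketch below uses plain Scala collections and a hypothetical flattenRow helper; it assumes the nested rows sit in the last position, as in the sample dataset, and uses shortened made-up values.

```scala
// Hypothetical helper: repeat the leading fields once per nested row.
// Assumes the last element of `row` is a sequence of sequences.
def flattenRow(row: Seq[Any]): Seq[Seq[Any]] = {
  val prefix = row.init
  row.last match {
    case nested: Seq[_] =>
      nested.collect { case inner: Seq[_] => prefix ++ inner }
    case _ => Seq(row) // nothing to flatten, keep the row as it is
  }
}

// One input row whose last field holds two nested rows.
val rows = Seq(
  Seq(29430500L, "F", Seq(Seq(1L, 2L), Seq(3L, 4L)))
)
val flat = rows.flatMap(flattenRow)
flat.foreach(println)
```

RDD.flatMap accepts a function of the same shape, so with the REPL session above, myRDD.flatMap(flattenRow) should be the equivalent call.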

How to iterate over records in Spark with Scala?

I have a variable "myrdd" that is an avro file with 10 records loaded through hadoopfile.
When I do
myrdd.first._1.datum.getName()
I can get the name. Problem is, I have 10 records in "myrdd". When I do:
myrdd.map(x => {println(x._1.datum.getName())})
it does not work and prints out a weird object a single time. How can I iterate over all records?
Here is a log from a session using spark-shell with a similar scenario.
Given
scala> persons
res8: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> persons.first
res7: org.apache.spark.sql.Row = [Justin,19]
Your issue looks like
scala> persons.map(t => println(t))
res4: org.apache.spark.rdd.RDD[Unit] = MapPartitionsRDD[10]
so map just returns another RDD. The function is not applied immediately; it is applied "lazily", when you actually iterate over the result.
So when you materialize (using collect()) you get a "normal" collection:
scala> persons.collect()
res11: Array[org.apache.spark.sql.Row] = Array([Justin,19])
over which you can map. Note that in this case the closure passed to map has a side effect (the println), and the result of println is Unit:
scala> persons.collect().map(t => println(t))
[Justin,19]
res5: Array[Unit] = Array(())
Same result if collect is applied at the end:
scala> persons.map(t => println(t)).collect()
[Justin,19]
res19: Array[Unit] = Array(())
But if you just want to print the rows, you can simplify it to using foreach:
scala> persons.foreach(t => println(t))
[Justin,19]
As @RohanAletty has pointed out in a comment, this works for a local Spark job. If the job runs on a cluster, the println output ends up in the executors' stdout rather than on the driver, so collect is required as well:
persons.collect().foreach(t => println(t))
Notes
The same behaviour can be observed in the Iterator class.
The output of the session above has been reordered
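The Iterator note can be checked in a plain Scala console, with no Spark involved. This is only an illustration of lazy map on an Iterator; the names seen, lengths and result are made up for the example.

```scala
import scala.collection.mutable.ArrayBuffer

val seen = ArrayBuffer.empty[String] // records each call to the closure
val names = List("Justin", "Michael", "Andy")

// map on an Iterator is lazy: the closure is not run on this line.
val lengths = names.iterator.map { n => seen += n; n.length }
val callsBefore = seen.size

// Consuming the iterator runs the closure once per element.
val result = lengths.toList
val callsAfter = seen.size

println(s"before: $callsBefore, after: $callsAfter, result: $result")
```

The side effect only shows up once the iterator is consumed, which mirrors why map with println on an RDD prints nothing until an action such as collect runs.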
Update
As for filtering: the position of collect is "bad" if you apply filters after collect that could have been applied before it.
For example these expressions give the same result:
scala> persons.filter("age > 20").collect().foreach(println)
[Michael,29]
[Andy,30]
scala> persons.collect().filter(r => r.getInt(1) > 20).foreach(println)
[Michael,29]
[Andy,30]
but the 2nd case is worse, because that filter could have been applied before collect.
The same applies to any type of aggregation as well.

Spark fails while calling scala class method to comma split strings

I have the following class in the Scala shell in Spark.
class StringSplit(val query:String)
{
def getStrSplit(rdd:RDD[String]):RDD[String]={
rdd.map(x=>x.split(query))
}
}
I am trying to call the method in this class like
val inputRDD=sc.parallelize(List("one","two","three"))
val strSplit=new StringSplit(",")
strSplit.getStrSplit(inputRDD)
-> This step fails with the error: getStrSplit is not a member of StringSplit.
Can you please let me know what is wrong with this?
It seems like a reasonable thing to do, but...
The result type of getStrSplit is wrong, because .split returns Array[String], not String, so rdd.map(x=>x.split(query)) yields an RDD[Array[String]].
Parallelizing List("one","two","three") results in "one", "two" and "three" being stored, so there are no strings needing a comma split.
Another way:
val input = sc.parallelize(List("1,2,3,4","5,6,7,8"))
input: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>
The test input here is a list of two strings that each require some comma splitting to get to the data.
Parsing the input by splitting can be as easy as:
val parsedInput = input.map(_.split(","))
parsedInput: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[19] at map at <console>:25
Here _.split(",") is an anonymous function with one parameter _, whose type Scala infers from the surrounding calls rather than requiring it to be declared explicitly.
Notice the type is RDD[Array[String]] not RDD[String]
We could extract the 3rd element of each line with
parsedInput.map(_(2)).collect()
res27: Array[String] = Array(3, 7)
So how about the original question, doing the same operation in a class? I tried:
class StringSplit(query:String){
def get(rdd:RDD[String]) = rdd.map(_.split(query));
}
val ss = new StringSplit(",");
ss.get(input);
---> org.apache.spark.SparkException: Task not serializable
I'm guessing that this occurs because the function passed to map refers to query, which is a field of the StringSplit instance. To send the function to the workers, Spark has to serialize the whole instance, and StringSplit is not serializable.
scala> class commaSplitter {
def get(rdd:RDD[String])=rdd.map(_.split(","));
}
defined class commaSplitter
scala> val cs = new commaSplitter;
cs: commaSplitter = $iwC$$iwC$commaSplitter@262f1580
scala> cs.get(input);
res29: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[23] at map at <console>:10
scala> cs.get(input).collect()
res30: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))
This parameter-free class works.
EDIT
You can tell Scala that you want your class to be serializable by adding extends Serializable, like so:
scala> class stringSplitter(s:String) extends Serializable {
def get(rdd:RDD[String]) = rdd.map(_.split(s));
}
defined class stringSplitter
scala> val ss = new stringSplitter(",");
ss: stringSplitter = $iwC$$iwC$stringSplitter@2a33abcd
scala> ss.get(input)
res33: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[25] at map at <console>:10
scala> ss.get(input).collect()
res34: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))
and this works.
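Another option that is often suggested for this kind of error is to copy the field into a local val, so that the function closes over a String only and not over the whole instance. The sketch below shows the pattern in plain Scala, without Spark; splitter is a name made up for this illustration, and whether it removes the exception in a given spark-shell version is an assumption to verify.

```scala
class StringSplit(query: String) {
  // Copy the constructor parameter into a local val first. The function
  // literal then captures only `q` (a String), not `this`.
  def splitter: String => Array[String] = {
    val q = query
    line => line.split(q)
  }
}

val ss = new StringSplit(",")
val parsed = List("1,2,3,4", "5,6,7,8").map(ss.splitter)
parsed.foreach(a => println(a.mkString(" | ")))
```

With the RDD from the session above, the corresponding call would be input.map(ss.splitter).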

Scala: read a file and save each line to a variable?

scalaresult.txt
0~250::250~500::500~750::750~1000::1000~1250
481::827::750::256::1000
scala code
val filename = "/home/user/scalaresult.txt"
for ( (line,index) <- Source.fromFile(filename).getLines().zipWithIndex){
println(line)
println(index)
}
//val step_x = "0~250::250~500::500~750::750~1000::1000~1250"
//val step_y = "481::827::750::256::1000"
Seq("java", "-jar", "/home/user/birt2.jar" , step_x , step_y , "BarChart").lines
I have a file, scalaresult.txt.
I need to save the first line (index 0) to step_x
and the second line (index 1) to step_y.
How can I do this? Please guide me. Thank you.
This is not the optimal solution, but you can try the following: (I'm not a scala expert yet! :P)
scala> val it = Source.fromFile(filename).getLines().toList
it: List[String] = List(0~250::250~500::500~750::750~1000::1000~1250, "481::827::750::256::1000 ")
scala> it(1)
res7: String = "481::827::750::256::1000 "
scala> it(0)
res8: String = 0~250::250~500::500~750::750~1000::1000~1250
If all you are trying to do is take the two lines from the file and insert them into the sequence, the indexer on the list will do the trick. Mind you, indexing is an O(n) operation on a List, so if there were a lot of lines, it wouldn't be the best approach.
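import scala.io.Source
val filename = "/home/user/scalaresult.txt"
// getLines() returns an Iterator, so convert it to a List before indexing
val lines = Source.fromFile(filename).getLines().toList
val seq = Seq("java", "-jar", "/home/user/birt2.jar" , lines(0) , lines(1), "BarChart")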
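Indexing with lines(0) and lines(1) throws if the file has fewer than two lines, and the Source is never closed. A hedged sketch that handles both, assuming Scala 2.13 or later (for scala.util.Using) and using a made-up helper name firstTwoLines:

```scala
import scala.io.Source
import scala.util.Using

// Hypothetical helper: the first two lines of a file, trimmed,
// or None when the file has fewer than two lines.
def firstTwoLines(path: String): Option[(String, String)] =
  Using.resource(Source.fromFile(path)) { src =>
    // toList forces the read before Using closes the source.
    src.getLines().take(2).map(_.trim).toList match {
      case x :: y :: Nil => Some((x, y))
      case _             => None
    }
  }
```

A caller can then pattern match on the result, for example case Some((step_x, step_y)) => Seq("java", "-jar", "/home/user/birt2.jar", step_x, step_y, "BarChart"). The trim is there because the second line in the session above carries a trailing space.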

Problem with outputting map values in scala

I have the following code snippet:
val map = new LinkedHashMap[String,String]
map.put("City","Dallas")
println(map.get("City"))
This outputs Some(Dallas) instead of just Dallas. What's the problem with my code?
Thank You
Use the apply method: it returns the String directly and throws a NoSuchElementException if the key is not found:
scala> import scala.collection.mutable.LinkedHashMap
import scala.collection.mutable.LinkedHashMap
scala> val map = new LinkedHashMap[String,String]
map: scala.collection.mutable.LinkedHashMap[String,String] = Map()
scala> map.put("City","Dallas")
res2: Option[String] = None
scala> map("City")
res3: String = Dallas
It's not really a problem.
While Java's Map uses null to indicate that a key doesn't have an associated value, Scala's Map[A,B].get returns an Option[B], which can be Some[B] or None, and None plays a similar role to Java's null.
REPL session showing why this is useful:
scala> map.get("State")
res6: Option[String] = None
scala> map.get("State").getOrElse("Texas")
res7: String = Texas
Or the simple but not recommended get:
scala> map.get("City").get
res8: String = Dallas
scala> map.get("State").get
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:262)
Check the Option documentation for more goodies.
There are two more ways you can handle Option results.
You can pattern match them:
scala> map.get("City") match {
| case Some(value) => println(value)
| case _ => println("found nothing")
| }
Dallas
Or there is another neat approach that appears somewhere in Programming in Scala. Use foreach to process the result. If a result is of type Some, then it will be used. Otherwise (if it's None), nothing happens:
scala> map.get("City").foreach(println)
Dallas
scala> map.get("Town").foreach(println)
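The approaches above can be put side by side in one short script. This is only a sketch with made-up value names (cities, city, state, shout, described); getOrElse on the map itself is a shortcut for get followed by getOrElse on the Option.

```scala
import scala.collection.mutable.LinkedHashMap

val cities = LinkedHashMap("City" -> "Dallas")

// Default value when the key is missing, without going through Option.get.
val city  = cities.getOrElse("City", "unknown")
val state = cities.getOrElse("State", "Texas")

// Transform the value only if it is present; the result stays an Option.
val shout = cities.get("City").map(_.toUpperCase)

// fold handles both cases in one expression: default first, then the Some case.
val described = cities.get("Town").fold("found nothing")(v => s"found $v")

println(Seq(city, state, shout, described).mkString(", "))
```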