Difference between pipe and comma delimiter in spark-scala

Can someone tell me why we have two separate ways of representing the pipe (|) and comma (,) delimiters? For example:
sc.textFile(file).map( x => x.split(","))
for comma, and
sc.textFile(file).map( x => x.split('|'))
for pipe.
If I keep both in double quotes, the pipe version fails while the comma version gives me the correct result.
Below is the full code I am running:
package com.rakesh.singh
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
object MPMovie {
  def namex(x: String) = {
    val fields = x.split('|')
    val id = fields(0).toInt
    val name = fields(1).toString
    (id, name)
  }
  def main(rakesh: Array[String]) = {
    Logger.getLogger("yoyo").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName("Movies").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("F:/Raakesh/ml-100k/movies.data")
    val names = sc.textFile("F:/Raakesh/ml-100k/names.data")
    val mappednames = names.map(namex)
    val splited = rdd.map(x => (x.split("\t")(1).toInt, 1))
    //.map(x => (x,1))
    val counteachmovie = splited.reduceByKey((a, b) => a + b).map(x => (x._2, x._1))
    val mpm = counteachmovie.max()
    println(s"the final value of mpm is $mpm")
    mappednames.foreach(println)
    val finalname = mappednames.lookup(mpm._2)(0)
    println(s"the final value of mpm is $finalname")
  }
}
and the data files are:
movies.data
196 101 3 881250949
186 101 3 891717742
22 103 1 878887116
244 102 2 880606923
names.data
101|Sajan
102|Mela
103|Hum

There are two different split methods:
The split(",") method comes originally from String.split(regex: String). It works with arbitrary regexes as separators, e.g.
scala> "helloABCworldCABfooBBACCAbar".split("[ABC]+")
res0: Array[String] = Array(hello, world, foo, bar)
The other split('|') comes from StringOps.split(separator: Char), and is rather like a generic Scala-collection operation. It doesn't work with regex, but it works on all StringLike collections, for example on StringBuilders:
scala> val b = new StringBuilder
b: StringBuilder =
scala> b ++= "hello|"
res2: b.type = hello|
scala> b ++= "world"
res3: b.type = hello|world
scala> b.split('|')
res4: Array[String] = Array(hello, world)
The "|" doesn't work with the first method because, as a regex, an unescaped | is the alternation ("OR") operator with nothing on either side, so it matches the empty string rather than the pipe character. To use the pipe | with the split(regex: String) version, you have to either escape it as "\\|" or (often easier) enclose it in a character class, "[|]".
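To see the difference concretely, here is a small sketch in plain Scala (no Spark needed). The sample line is taken from the question's names.data; the object name SplitDemo is illustrative, and Pattern.quote from java.util.regex is an extra option that the answer above does not mention.

```scala
import java.util.regex.Pattern

object SplitDemo {
  val line = "101|Sajan" // sample record from the question's names.data

  // String.split(regex): an unescaped "|" is an alternation of two empty
  // patterns, so it matches between characters instead of at the pipe.
  val wrong: Array[String] = line.split("|")

  // Three ways to make the regex version treat the pipe literally.
  val escaped: Array[String] = line.split("\\|")
  val charClass: Array[String] = line.split("[|]")
  val quoted: Array[String] = line.split(Pattern.quote("|"))

  // StringOps.split(separator: Char): no regex involved at all.
  val byChar: Array[String] = line.split('|')

  def main(args: Array[String]): Unit = {
    println(wrong.length) // more than 2, because the line was cut into small pieces
    println(escaped.mkString(", "))
    println(byChar.mkString(", "))
  }
}
```

All three regex variants and the Char variant should yield the same two fields; only the unescaped "|" misbehaves.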

Related

Transform a list of object to lists of its field

I have a List[MyObject], with MyObject containing the fields field1, field2 and field3.
I'm looking for an efficient way of doing :
Tuple3(_.map(_.field1), _.map(_.field2), _.map(_.field3))
In java I would do something like :
List<Field1Type> f1 = new ArrayList<Field1Type>();
List<Field2Type> f2 = new ArrayList<Field2Type>();
List<Field3Type> f3 = new ArrayList<Field3Type>();
for (MyObject mo : myObjects) {
    f1.add(mo.getField1());
    f2.add(mo.getField2());
    f3.add(mo.getField3());
}
I would like something more functional since I'm in scala but I can't put my finger on it.
Get 2/3 sub-groups with unzip/unzip3
Assuming the starting point:
val objects: Seq[MyObject] = ???
You can unzip to get all 3 sub-groups:
val (firsts, seconds, thirds) =
  objects
    .unzip3((o: MyObject) => (o.f1, o.f2, o.f3))
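As a runnable illustration of the unzip3 call, here is a minimal sketch. The case class below is a hypothetical stand-in for the question's MyObject (its field names and types are assumptions), and the sample values are made up.

```scala
object UnzipDemo {
  // Hypothetical stand-in for the question's MyObject.
  final case class MyObject(f1: Int, f2: Double, f3: String)

  val objects: Seq[MyObject] = Seq(MyObject(1, 1.5, "a"), MyObject(2, 2.5, "b"))

  // unzip3 turns each element into a triple and collects the three
  // components into three separate sequences, preserving element order.
  val (firsts, seconds, thirds) =
    objects.unzip3((o: MyObject) => (o.f1, o.f2, o.f3))

  def main(args: Array[String]): Unit = {
    println(firsts)
    println(seconds)
    println(thirds)
  }
}
```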
What if I have more than 3 relevant sub-groups?
If you really need more sub-groups, you need to implement your own unzipN. However, instead of working with Tuple22, I would personally use an adapter:
case class MyObjectsProjection(private val objs: Seq[MyObject]) {
  lazy val f1s: Seq[String] =
    objs.map(_.f1)
  lazy val f2s: Seq[String] =
    objs.map(_.f2)
  ...
  lazy val f22s: Seq[String] =
    objs.map(_.f22)
}
val objects: Seq[MyObject] = ???
val objsProjection = MyObjectsProjection(objects)
objsProjection.f1s
objsProjection.f2s
...
objsProjection.f22s
Notes:
Change MyObjectsProjection according to your needs.
This is from a Scala 2.12/2.11 vanilla perspective.
The following will decompose your objects into three lists:
case class MyObject[T,S,R](f1: T, f2: S, f3: R)
val myObjects: Seq[MyObject[Int, Double, String]] = ???
val (l1, l2, l3) = myObjects.foldRight((List.empty[Int], List.empty[Double], List.empty[String]))((nxt, acc) => {
  // foldRight with prepend keeps the original element order (foldLeft with :: would reverse it)
  (nxt.f1 :: acc._1, nxt.f2 :: acc._2, nxt.f3 :: acc._3)
})

How do I write a Dataset encoder to support mapping a function to a org.apache.spark.sql.Dataset[String] in Scala Spark

Moving from Spark 1.6 to Spark 2.2* has brought the error “error: Unable to find encoder for type stored in a 'Dataset'. Primitive types (Int, String, etc)” when trying to apply a method to a dataset returned from querying a parquet table.
I have oversimplified my code to demonstrate the same error. The code queries a parquet file to return the following datatype:
'org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]'
I apply a function that extracts a string and an integer and returns a string, which yields the following
datatype: Array[String]
Next, I need to perform extensive manipulations requiring a separate function. In this test function, I try to append a string producing the same error as my detailed example.
I have tried some encoder examples and use of the ‘case’ but have not come up with a workable solution. Any suggestions/ examples would be appreciated
scala> var d1 = hive.executeQuery(st)
d1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cvdt35_message_id_d: string,
cvdt35_input_timestamp_s: decimal(16,5) ... 2 more fields]
val parseCVDP_parquet = (s:org.apache.spark.sql.Row) => s.getString(2).split("0x")(1)+","+s.getDecimal(1);
scala> var d2 = d1.map(parseCVDP_parquet)
d2: org.apache.spark.sql.Dataset[String] = [value: string]
scala> d2.take(1)
20/03/25 19:01:08 WARN TaskSetManager: Stage 3 contains a task of very large size (131 KB). The
maximum recommended task size is 100 KB.
res10: Array[String] = Array(ab04006000504304,1522194407.95162)
scala> def dd(s:String){
| s + "some string"
| }
dd: (s: String)Unit
scala> var d3 = d2.map{s=> dd(s) }
<console>:47: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int,
String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support
for serializing other types will be added in future releases.
To distill the problem further, I believe this scenario can be reduced to the following code (though I have not tried all possible solutions on it):
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String){
| s + "hi"
| }
f: (s: String)Unit
scala> var test2 = test.map{ s => f(s) }
<console>:42: error: Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are
supported by importing spark.implicits._ Support for serializing other types
will be added in future releases.
var test2 = test.map{ s => f(s) }
I have a solution at least to my simplified Problem (below).
I will be testing more....
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String): String = {
| val r = s + "hi"
| return r
| }
f: (s: String)String
scala> var test2 = test.rdd.map{ s => f(s) }
test2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at map at <console>:43
scala> test2.take(1)
res9: Array[String] = Array(just some wordshi)
The first solution does not work on my initial (production) data set; it instead produces the error "org.apache.spark.SparkException: Task not serializable", which I believe to be related (interestingly, both data sets are stored as the same data type, org.apache.spark.sql.Dataset[String] = [value: string]).
I have included yet another solution for my test data set below. It eliminates the initial Encoder error and, as shown, works on my toy problem, but it does not scale to the production data set.
I am a bit confused as to exactly why my application is sidelined by the move from Spark 1.6 to 2.3, as I have not had to make any special accommodations to it for years and have run it successfully for calculations that most likely number in the trillions. Other explorations have included wrapping my method as Serializable, trying the @transient annotation, using "org.apache.spark.serializer.KryoSerializer", writing my methods as functions, and changing all vars to vals (following related posts on 'stack').
scala> import spark.implicits._
import spark.implicits._
scala> var test = ( 1 to 3).map( _ => "just some words").toDS()
test: org.apache.spark.sql.Dataset[String] = [value: string]
scala> def f(s: String): String = {
| val r = s + "hi"
| return r
| }
f: (s: String)String
scala> var d2 = test.map{s => f(s)}(Encoders.STRING)
d2: org.apache.spark.sql.Dataset[String] = [value: string]
scala> d2.take(1)
res0: Array[String] = Array(just some wordshi)
scala>
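The encoder error in the earlier transcripts follows from the method signature rather than from the data: def f(s: String){ ... } uses procedure syntax, so the REPL reports f: (s: String)Unit, and Dataset.map then needs an encoder for Unit, which spark.implicits._ does not provide. Once the method is declared as returning String, the result type is a String again, and the explicit Encoders.STRING argument should not be required when spark.implicits._ is in scope. A minimal Spark-free sketch of the difference (the object and method names are illustrative only):

```scala
object ReturnTypeDemo {
  // Same meaning as the procedure syntax `def discards(s: String) { s + "hi" }`:
  // the concatenated string is computed and then thrown away, so callers get Unit.
  def discards(s: String): Unit = { s + "hi" }

  // With `: String =` the last expression is the result.
  def returns(s: String): String = s + "hi"

  def main(args: Array[String]): Unit = {
    println(discards("just some words")) // prints ()
    println(returns("just some words"))  // prints just some wordshi
  }
}
```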

Scala sort output on Key and then alphabetically

I'm trying out my first Scala program to sort the following output such that when the value is identical, words are sorted alphabetically.
cookie 8
document 6
function 5
name 5
start 5
My current code is as follows:
import org.apache.spark.{SparkConf, SparkContext}
object Problem1 {
  def main(args: Array[String]): Unit = {
    val inputFile = args(0)
    val outputFolder = args(1)
    val kValue = args(2)
    val conf = new SparkConf().setAppName("Problem1").setMaster("local")
    val sc = new SparkContext(conf)
    val input = sc.textFile(inputFile)
    val words = input.flatMap(line => line.toLowerCase().split("[\\s*&#^'''\\,..:;?!\\[\\](){}<>~\\-_]+"))
      .filter(x => x.matches("[A-Za-z]+") && x.length > 2)
      .map(word => (word, 1)).reduceByKey(_ + _).map(_.swap)
    val freq = words.sortByKey(false, 1).map(_.swap).take(kValue.toInt)
    val topKrdd = sc.parallelize(freq)
    val tabSeperated = topKrdd.map(f => f._1 + "\t" + f._2)
    tabSeperated.saveAsTextFile(outputFolder)
  }
}
Can someone help me with the alphabetical sort for the lines where the numerical value is identical?
Usually Scala provides and uses an implicit Ordering for methods like sortByKey, but you can also construct a custom one and pass it in explicitly. The Ordering trait and companion object have a fair few helpful methods for this. You could do this:
val ord = Ordering.Tuple2(Ordering[Int].reverse, Ordering[String])
val freq = words.takeOrdered(kValue.toInt)(ord).map(_.swap)
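The same Ordering can be tried on a plain Scala collection without Spark. A minimal sketch, using the sample counts from the question in shuffled order (the object and value names are illustrative):

```scala
object OrderingDemo {
  // (word, count) pairs from the question, deliberately out of order.
  val counts: Seq[(String, Int)] =
    Seq(("name", 5), ("cookie", 8), ("start", 5), ("document", 6), ("function", 5))

  // Compare the count first, descending; break ties on the word, ascending.
  val ord: Ordering[(Int, String)] =
    Ordering.Tuple2(Ordering[Int].reverse, Ordering[String])

  // Swap to (count, word) so the tuple matches the ordering, sort, swap back.
  val sorted: Seq[(String, Int)] = counts.map(_.swap).sorted(ord).map(_.swap)

  def main(args: Array[String]): Unit =
    sorted.foreach { case (word, n) => println(s"$word\t$n") }
}
```

On an RDD, takeOrdered(k)(ord) applies the same comparison but only keeps the first k elements.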

SCALA : Read the text file and create tuple of it

How to create a tuple from the below-existing RDD?
// reading a text file "b.txt" and creating RDD
val rdd = sc.textFile("/home/training/desktop/b.txt")
b.txt dataset -->
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
If you intend to have an Array of Tuple4, you can do the following:
scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))
Then you can access each field through the tuple accessors:
scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
Updated
If you have an input file with a variable number of fields, such as
Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big
you can use pattern matching:
scala> val arrayTuples = rdd.map(line => line.split(",") match {
| case Array(a, b, c, d) => (a,b,c,d)
| case Array(a,b,c) => (a,b,c)
| }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
Updated again
As @eliasah pointed out, the above procedure is bad practice because it relies on the product iterator. Following his suggestion, we should know the maximum number of elements in the input data and use the following logic, where we assign default values for missing elements:
import scala.util.Try
val arrayTuples = rdd.map(line => line.split(",")).map(array => (
  Try(array(0)).getOrElse("Empty"),
  Try(array(1)).getOrElse(0),
  Try(array(2)).getOrElse("Empty"),
  Try(array(3)).getOrElse("Empty")
)).collect
And as @philantrovert pointed out, we can verify the output in the following way if we are not using the REPL:
arrayTuples.foreach(println)
which results in
(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)
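The per-line logic can also be checked outside Spark. The sketch below is a variation rather than the code above: it uses lift on the split fields instead of Try, and it keeps every field as a String (so the default for the second field is "0" rather than 0), which keeps the tuple type at (String, String, String, String). The names TupleDefaults and toTuple are illustrative.

```scala
object TupleDefaults {
  // Turn one comma-separated line into a fixed-width tuple, substituting
  // defaults for fields that are missing at the end of the line.
  def toTuple(line: String): (String, String, String, String) = {
    // lift converts index lookup into Int => Option[String], so an
    // out-of-range index yields None instead of throwing.
    val field = line.split(",").map(_.trim).toIndexedSeq.lift
    (
      field(0).getOrElse("Empty"),
      field(1).getOrElse("0"),
      field(2).getOrElse("Empty"),
      field(3).getOrElse("Empty")
    )
  }

  def main(args: Array[String]): Unit =
    Seq("Ankita,26,BigData,newbie", "Anita,26,big").map(toTuple).foreach(println)
}
```

With an RDD[String], the same function could be passed as rdd.map(TupleDefaults.toTuple).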

how to join two datasets by key in scala spark

I have two datasets, and each dataset has two elements.
Below are examples.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Results expected: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets on the first column, 'name'. I have tried to do this for a couple of hours, but I couldn't figure it out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
Above code is not working!
join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)].
The first step needed is to transform the input data into the right type.
We first need to transform the original data of type String into pairs of (Key, Value):
val parse: String => (String, String) = s => {
  val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
  s match {
    case regex(k, v) => (k, v)
    case _ => ("", "")
  }
}
(Note that we can't use a simple split(",") expression because the key contains commas)
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sc.parallelize(s1)
val rdd2 = sc.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs
val joined = kvRdd1.join(kvRdd2)
// Let's check out results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
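The parsing step and the shape of the join result can be checked without a Spark context. The sketch below reuses the regex from above; innerJoin is a hypothetical helper written on plain Scala collections that only mimics the inner-join semantics of the pair-RDD join (keys present on both sides, each left value paired with each right value). It is not Spark's implementation.

```scala
object JoinSketch {
  private val Record = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r

  // Same parsing logic as above: unparseable lines become ("", "").
  def parse(s: String): (String, String) = s match {
    case Record(k, v) => (k, v)
    case _            => ("", "")
  }

  // Hypothetical local stand-in for an inner join by key.
  def innerJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] = {
    val rightByKey: Map[K, Seq[W]] =
      right.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    for {
      (k, v) <- left
      w <- rightByKey.getOrElse(k, Seq.empty[W])
    } yield (k, (v, w))
  }

  val s1 = Seq("('abc,def', 'monkey(1)')", "('df,gh', 'zebra')")
  val s2 = Seq("('a,efg', 'apple')", "('abc,def', 'banana(1)')")

  val joined: Seq[(String, (String, String))] =
    innerJoin(s1.map(parse), s2.map(parse))

  def main(args: Array[String]): Unit = joined.foreach(println)
}
```

Only the key 'abc,def' appears in both inputs, so it is the only key in the result, matching the Spark output above.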
You have to create pair RDDs from your data sets first and then apply the join transformation. Your sample data sets do not look quite right.
Please consider the below example.
**Dataset1**
a 1
b 2
c 3
**Dataset2**
a 8
b 4
Your code should look like the following in Scala:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from the Scala shell:
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))