How to use the split function in Spark Scala

I am feeding the program one line at a time, and each line contains a date in the format MM/DD/YYYY. How can I use the split function here?
val data = line.split("/")
val year = data[2]
println(year)
I am not getting any output. Can anyone explain where I am going wrong?

This is Scala, not Java: array elements are accessed with parentheses, data(2), not with square brackets, data[2]. Please look at the snippet below and make the required changes in your code.
scala> val str = "12/05/2018"
str: String = 12/05/2018
scala> str.split("/")
res0: Array[String] = Array(12, 05, 2018)
scala> res0(2)
res1: String = 2018
So make below changes in your code:
val data = line.split("/")
val year = data(2)
println(year)
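If some lines might not contain a well-formed date, a slightly more defensive variant is to pattern match on the array that split returns instead of indexing it directly. This is only a minimal sketch: the object name DateFields and the helper yearOf are made up for illustration and are not part of the question.

```scala
object DateFields {
  // Returns the year part of an MM/DD/YYYY string, or None if the
  // line does not split into exactly three "/"-separated parts.
  def yearOf(line: String): Option[String] =
    line.trim.split("/") match {
      case Array(_, _, year) => Some(year)
      case _                 => None
    }

  def main(args: Array[String]): Unit = {
    println(yearOf("12/05/2018")) // Some(2018)
    println(yearOf("not a date")) // None
  }
}
```

Returning an Option means a malformed line yields None rather than an ArrayIndexOutOfBoundsException from data(2).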

Related

How do I extract each word from a text file in Scala

I'm pretty new to Scala. I have a text file that has only one line, with five words separated by semicolons (;).
I want to extract each word, remove the whitespace, convert everything to lowercase, and access each word by its index. Below is how I approached it:
newListUpper2.txt contains (Bed; chairs;spoon; CARPET;curtains )
val file = sc.textFile("myfile.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";"))
result.collect.foreach(println)
Below is the copy of the REPL when I executed the code
scala> val file = sc.textFile("newListUpper2.txt")
file: org.apache.spark.rdd.RDD[String] = newListUpper2.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val lower = file.map(x=>x.toLowerCase)
lower: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at map at <console>:26
scala> val result = lower.flatMap(x=>x.trim.split(";"))
result: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at flatMap at <console>:28
scala> result.collect.foreach(println)
bed
chairs
spoon
carpet
curtains
scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters
result(0)
The results are not trimmed, and passing an index as a parameter to get the word at that position gives an error. This is the outcome I expect when I pass the index of each word:
result(0)= bed
result(1) = chairs
result(2) = spoon
result(3) = carpet
result(4) = curtains
What am I doing wrong?
newListUpper2.txt contains (Bed; chairs;spoon; CARPET;curtains )
val file = sc.textFile("myfile.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";")) // x is the whole line `bed; chairs;spoon; carpet;curtains `, so x.trim only strips whitespace at the start and end of the line, not around each word
result.collect.foreach(println)
Try this instead, which trims each word after splitting:
val result = lower.flatMap(x=>x.split(";").map(x=>x.trim))
1) Issue 1
scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters
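result is an RDD, which cannot be indexed with parentheses. Instead, you can print its first elements with result.take(10).foreach(println), or collect it into a local Array and index that: result.collect()(0)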
2) Issue 2 - To index the words as result(0) = bed, result(1) = chairs, and so on, read the file into a local List:
scala> var result = scala.io.Source.fromFile("/path/to/File").getLines().flatMap(x=>x.split(";").map(x=>x.trim)).toList
result: List[String] = List(Bed, chairs, spoon, CARPET, curtains)
scala> result(0)
res21: String = Bed
scala> result(1)
res22: String = chairs
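The two fixes above (split first, then trim each piece) can be combined with the lowercasing step from the question. The following is a minimal sketch on a plain string rather than an RDD or a file; the names WordSplit and words are hypothetical, and the input is the sample line from the question.

```scala
object WordSplit {
  // Split on ";" first, then trim and lowercase each piece.
  // Trimming the whole line before splitting would only remove
  // whitespace at the two ends of the line.
  def words(line: String): List[String] =
    line.split(";").map(_.trim.toLowerCase).toList

  def main(args: Array[String]): Unit = {
    val result = words("Bed; chairs;spoon; CARPET;curtains ")
    println(result(0)) // bed
    println(result(3)) // carpet
  }
}
```

Because the result is a local List, indexing with result(0) works, unlike on an RDD.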

Replace multiple occurrences of a duplicate string in Scala with empty

I have a string as
something,'' something,nothing_something,op nothing_something,'' cat,cat
I want to achieve my output as
'' something,op nothing_something,cat
Is there any way to achieve it?
If I understand your requirement correctly, here's one approach with the following steps:
1) Split the input string by "," and create a list of indexed-CSVs and convert it to a Map
2) Generate 2-combinations of the indexed-CSVs
3) Check each of the indexed-CSV pairs and capture the index of any CSV which is contained within the other CSV
4) Since the CSVs corresponding to the captured indexes are contained within some other CSV, removing these indexes will result in remaining indexes we would like to keep
5) Use the remaining indexes to look up CSVs from the CSV Map and concatenate them back to a string
Here is sample code applying to a string with slightly more general comma-separated values:
val str = "cats,a cat,cat,there is a cat,my cat,cats,cat"
val csvIdxList = (Stream from 1).zip(str.split(",")).toList
val csvMap = csvIdxList.toMap
val csvPairs = csvIdxList.combinations(2).toList
val csvContainedIdx = csvPairs.collect{
  case List(x, y) if x._2.contains(y._2) => y._1
  case List(x, y) if y._2.contains(x._2) => x._1
}.distinct
// csvContainedIdx: List[Int] = List(3, 6, 7, 2)
val csvToKeepIdx = (1 to csvIdxList.size) diff csvContainedIdx
// csvToKeepIdx: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 4, 5)
val strDeduped = csvToKeepIdx.map( csvMap.getOrElse(_, "") ).mkString(",")
// strDeduped: String = cats,there is a cat,my cat
Applying the above to your sample string something,'' something,nothing_something,op nothing_something would yield the expected result:
strDeduped: String = '' something,op nothing_something
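The same idea (drop every value that is contained in some other value, and keep the first of any exact duplicates) can also be sketched without the intermediate Map and combinations. This is an alternative formulation under that reading of the requirement, not the code above; CsvDedupe and dropContained are hypothetical names.

```scala
object CsvDedupe {
  // Keeps a value unless another value (at a different index) contains it.
  // For exact duplicates, only the first occurrence is kept.
  def dropContained(csv: String): String = {
    val items = csv.split(",").toVector.zipWithIndex
    items.collect {
      case (s, i) if !items.exists { case (t, j) =>
            j != i && t.contains(s) && (t != s || j < i)
          } => s
    }.mkString(",")
  }

  def main(args: Array[String]): Unit = {
    val str = "cats,a cat,cat,there is a cat,my cat,cats,cat"
    println(dropContained(str)) // cats,there is a cat,my cat
  }
}
```

Like the pairwise version, this compares every value with every other value, so it is quadratic in the number of values.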
First split the given String on commas to get an Array of words, then filter it and rejoin the remaining words with mkString, as below:
s.split(",").filter(_.contains(' ')).mkString(",")
In Scala REPL:
scala> val s = "something,'' something,nothing_something,op nothing_something"
s: String = something,'' something,nothing_something,op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res27: String = '' something,op nothing_something
Following Leo C's comment, I tested it with another String:
scala> val s = "something,'' something anything anything anything anything,nothing_something,op op op nothing_something"
s: String = something,'' something anything anything anything anything,nothing_something,op op op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res43: String = '' something anything anything anything anything,op op op nothing_something

Spark scala filter multiple rdd based on string length

I am trying to solve a quiz question, which is as follows:
Write the missing code in the given program to display the expected output to identify animals that have names with four letters.
Output: Array((4,lion))
Program
val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant","falcon","squid"),2)
val d = c.keyBy(_.length)
I have tried to write the code in spark-shell, but I am stuck on the syntax for combining the 4 RDDs and applying a filter.
How about using the PairRDD lookup method:
b.lookup(4).toArray
// res1: Array[String] = Array(lion)
d.lookup(4).toArray
// res2: Array[String] = Array()
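Note that lookup returns only the values, while the expected output keeps the key as well. The sketch below illustrates the key-and-filter idea with plain Scala collections so that it runs without a Spark context; keyByLength is a made-up stand-in for keyBy(_.length), and FourLetterAnimals is a hypothetical name. In spark-shell, the corresponding RDD calls should be b.union(d).filter(_._1 == 4).collect.

```scala
object FourLetterAnimals {
  // Plain-collection stand-in for keyBy(_.length):
  // pair each name with its length.
  def keyByLength(names: List[String]): List[(Int, String)] =
    names.map(n => (n.length, n))

  // Keep only the pairs whose key is 4.
  def fourLetter(pairs: List[(Int, String)]): List[(Int, String)] =
    pairs.filter { case (len, _) => len == 4 }

  def main(args: Array[String]): Unit = {
    val b = keyByLength(List("dog", "tiger", "lion", "cat", "spider", "eagle"))
    val d = keyByLength(List("ant", "falcon", "squid"))
    println(fourLetter(b ++ d)) // List((4,lion))
  }
}
```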

Get objects out of scala interpreter

I spent hours investigating one topic, and I am definitely out of my depth here. What I want is to run the Scala interpreter programmatically and be able to extract object values from it. For example, if I send
val a = 1
val b = a + 1
I want to be able to read out b as an Int, not just a string printed out like
b = 2
The source code is dense. So far I don't see any part which would allow such an extraction. Any experts here who can give me a tip or tell me this is utter nonsense?
How do I get typed objects out of the scala interpreter between sessions?
Use JSR 223.
Welcome to Scala version 2.11.7 [...]
scala> import javax.script._
import javax.script._
scala> val engine = (new ScriptEngineManager).getEngineByName("scala")
engine: javax.script.ScriptEngine = scala.tools.nsc.interpreter.IMain@4233e892
scala> engine.eval("val a = 1")
res0: Object = 1
scala> engine.eval("val b = a + 1")
res1: Object = 2
scala> engine.eval("b").asInstanceOf[Int]
res2: Int = 2

Splitting a number into parts in Scala

I am trying to split a number, like
20130405
into three parts: year, month and day. One way is to convert it to a string and use a regex. Something like:
"""(\d{4})(\d{2})(\d{2})""".r
A better way is to divide it by 100. Something like:
var date = dateNumber
val day = date % 100
date /= 100
val month = date % 100
date /= 100
val year = date
I get itchy while using 'var' in Scala. Is there a better way to do it?
I would go with the former:
scala> val regex = """(\d{4})(\d{2})(\d{2})""".r
regex: scala.util.matching.Regex = (\d{4})(\d{2})(\d{2})
scala> val regex(year, month, day) = "20130405"
year: String = 2013
month: String = 04
day: String = 05
This is probably not much better than your own solution, but it doesn't use var and doesn't require transforming the number to a string. On the other hand, it's not very safe - if you're not 100% sure that your number is going to be well formatted, better use a SimpleDateFormat - granted, it's more expensive, but at least you're safe from illegal input.
val num = 20130405
val (date, month, year) = (num % 100, num / 100 % 100, num / 10000)
println(date) // Prints 5
println(month) // Prints 4
println(year) // Prints 2013
I'd personally use a SimpleDateFormat even if I were sure the input will always be legal. The only certainty there is is that I'm wrong and the input will someday be illegal.
Better than substring would be to use the java Date and SimpleDateFormat classes, see:
https://stackoverflow.com/a/4216767/88588
Not very scala-ish but...
scala> import java.util.Calendar
import java.util.Calendar
scala> val format = new java.text.SimpleDateFormat("yyyyMMdd")
format: java.text.SimpleDateFormat = java.text.SimpleDateFormat@ef87e460
scala> format.format(new java.util.Date())
res0: String = 20131025
scala> val d=format.parse("20130405")
d: java.util.Date = Fri Apr 05 00:00:00 CEST 2013
scala> val calendar = Calendar.getInstance
calendar: java.util.Calendar = [cut...]
scala> calendar.setTime(d)
scala> calendar.get(Calendar.YEAR)
res1: Int = 2013
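On Java 8 and later, the same validation can be had from java.time instead of SimpleDateFormat and Calendar. This swaps in a different API from the one used above, so treat it as a minimal sketch: NumberToDate and parse are hypothetical names, and it assumes the number is always meant to be read as yyyyMMdd.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

object NumberToDate {
  // Parses a yyyyMMdd number such as 20130405. Returns None for
  // values that are not a valid calendar date (e.g. 20131345).
  def parse(num: Int): Option[LocalDate] =
    Try(LocalDate.parse(num.toString, DateTimeFormatter.BASIC_ISO_DATE)).toOption

  def main(args: Array[String]): Unit = {
    parse(20130405).foreach { d =>
      println((d.getYear, d.getMonthValue, d.getDayOfMonth)) // (2013,4,5)
    }
    println(parse(20131345)) // None
  }
}
```

No var is needed, and illegal input surfaces as None rather than as silently wrong year, month, and day values.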