split command in Scala not working correctly with a special character like ~

Hi, I have a string like this:
var ma_test="~0.000000~~~"
I am using the split function with ~ as the delimiter, but it does not split correctly.
What I tried:
scala> var ma_test="~0.000000~~~"
scala> val split_val = ma_test.split("~")
split_val: Array[String] = Array("", 0.000000)
scala> val split_dis = split_val(2)
java.lang.ArrayIndexOutOfBoundsException: 2
... 32 elided
I also tried val split_val = ma_test.split("\~") and ma_test.split('~'), but I'm still not able to split correctly.

split removes all trailing empty strings, so there are only 2 elements after splitting (the leading ~ also produces a split boundary), indexed starting from 0.
Note that you get an empty first entry in the Array because of the ~ at the start, so you should use index 1.
var ma_test="~0.000000~~~"
val split_val = ma_test.split("~")
val split_dis = split_val(1)
Output
var ma_test: String = ~0.000000~~~
val split_val: Array[String] = Array("", 0.000000)
val split_dis: String = 0.000000
You can pass -1 as the second argument to see all parts, and using index 2 will then give you an empty string.
var ma_test="~0.000000~~~"
val split_val = ma_test.split("~", -1)
val split_dis = split_val(2)
Output
var ma_test: String = ~0.000000~~~
val split_val: Array[String] = Array("", 0.000000, "", "", "")
val split_dis: String = ""

The output is the same with the "special character" ~ or the character x, which suggests that the split function is not the issue. In either case, if you try to access split_val(2) you will get an ArrayIndexOutOfBoundsException, since that index does not exist in the array; index 0 or 1 will work fine.
scala> var ma_test="x0.000000xxx"
var ma_test: String = x0.000000xxx
scala> ma_test.split("x")
val res1: Array[String] = Array("", 0.000000)
scala> var ma_test="~0.000000~~~"
var ma_test: String = ~0.000000~~~
scala> ma_test.split("~")
val res0: Array[String] = Array("", 0.000000)

Related

How do I extract each word from a text file in Scala

I'm pretty new to Scala. I have a text file that has only one line, with five words separated by a semicolon (;).
I want to extract each word, remove the white space, convert all to lowercase, and access each word by its index. Below is how I approached it:
newListUpper2.txt contains (Bed; chairs;spoon; CARPET;curtains )
val file = sc.textFile("myfile.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";"))
result.collect.foreach(println)
Below is the copy of the REPL when I executed the code
scala> val file = sc.textFile("newListUpper2.txt")
file: org.apache.spark.rdd.RDD[String] = newListUpper2.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> val lower = file.map(x=>x.toLowerCase)
lower: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at map at <console>:26
scala> val result = lower.flatMap(x=>x.trim.split(";"))
result: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at flatMap at <console>:28
scala> result.collect.foreach(println)
bed
chairs
spoon
carpet
curtains
scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters
result(0)
The results are not trimmed, and passing an index as a parameter to get the word at that index gives an error. My expected outcome, if I pass the index of each word as a parameter, is as stated below:
result(0)= bed
result(1) = chairs
result(2) = spoon
result(3) = carpet
result(4) = curtains
What am I doing wrong?
newListUpper2.txt contains (Bed; chairs;spoon; CARPET;curtains )
val file = sc.textFile("myfile.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";")) // here x is the whole line "bed; chairs;spoon; carpet;curtains"; x.trim only strips leading/trailing whitespace of the line, not around each word
result.collect.foreach(println)
Try this instead:
val result = lower.flatMap(x=>x.split(";").map(x=>x.trim))
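With that change, collecting the RDD should print the trimmed words (a quick sketch, assuming the same Spark shell session as above):
result.collect.foreach(println)
// bed
// chairs
// spoon
// carpet
// curtains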
1) Issue 1
scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters
result is an RDD and it cannot be indexed like that. To inspect a few elements you can instead use result.take(10).foreach(println)
2) Issue 2 - to get indexed access like result(0) = bed, result(1) = chairs, and so on:
scala> var result = scala.io.Source.fromFile("/path/to/File").getLines().flatMap(x=>x.split(";").map(x=>x.trim)).toList
result: List[String] = List(Bed, chairs, spoon, CARPET, curtains)
scala> result(0)
res21: String = Bed
scala> result(1)
res22: String = chairs

replace multiple occurrences of duplicate strings in Scala with empty

I have a string as
something,'' something,nothing_something,op nothing_something,'' cat,cat
I want to achieve my output as
'' something,op nothing_something,cat
Is there any way to achieve it?
If I understand your requirement correctly, here's one approach with the following steps:
Split the input string by "," and create a list of indexed-CSVs and convert it to a Map
Generate 2-combinations of the indexed-CSVs
Check each of the indexed-CSV pairs and capture the index of any CSV which is contained within the other CSV
Since the CSVs at the captured indexes are contained within some other CSV, removing those indexes leaves exactly the indexes we want to keep
Use the remaining indexes to look up CSVs from the CSV Map and concatenate them back to a string
Here is sample code applying to a string with slightly more general comma-separated values:
val str = "cats,a cat,cat,there is a cat,my cat,cats,cat"
val csvIdxList = (Stream from 1).zip(str.split(",")).toList
val csvMap = csvIdxList.toMap
val csvPairs = csvIdxList.combinations(2).toList
val csvContainedIdx = csvPairs.collect {
  case List(x, y) if x._2.contains(y._2) => y._1
  case List(x, y) if y._2.contains(x._2) => x._1
}.distinct
// csvContainedIdx: List[Int] = List(3, 6, 7, 2)
val csvToKeepIdx = (1 to csvIdxList.size) diff csvContainedIdx
// csvToKeepIdx: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 4, 5)
val strDeduped = csvToKeepIdx.map( csvMap.getOrElse(_, "") ).mkString(",")
// strDeduped: String = cats,there is a cat,my cat
Applying the above to your sample string something,'' something,nothing_something,op nothing_something would yield the expected result:
strDeduped: String = '' something,op nothing_something
First create an Array of comma-separated values using the split method on the given String, then use filter and mkString as below:
s.split(",").filter(_.contains(' ')).mkString(",")
In Scala REPL:
scala> val s = "something,'' something,nothing_something,op nothing_something"
s: String = something,'' something,nothing_something,op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res27: String = '' something,op nothing_something
As per Leo C's comment, I tested it as below with another String:
scala> val s = "something,'' something anything anything anything anything,nothing_something,op op op nothing_something"
s: String = something,'' something anything anything anything anything,nothing_something,op op op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res43: String = '' something anything anything anything anything,op op op nothing_something

How to change a column position in a spark dataframe with more than 22 columns?

I'm changing the column order of my DataFrame because I will put it into Cassandra.
The problem is that I have more than 22 columns and I get this error:
<console>:1: error: too many elements for tuple: 38, allowed: 22
I am using this procedure:
scala> val columns: Array[String] = firstDF.columns
columns: Array[String] = Array(HOCPNY, HOCOL, HONUMR, HOLINH, HODTTO, HOTOUR, HOCLIC, HOOE, HOTPAC, HODTAC, HOHRAC, HODESF, HOCDAN, HOCDRS, HOCDSL, HOOBS, HOTDSC, HONRAC, HOLINR, HOUSCA, HODTEA, HOHREA, HOUSEA, HODTCL, HOHRCL, HOUSCL, HODTRC, HOHRRC, HOUSRC, HODTRA, HOHRRA, HOUSRA, HODTCM, HOHRCM, HOUSCM, HODTUA, HOHRUA, HOUSER)
scala> val reorderedColumnNames: Array[String] = (hoclic,hotpac, hocdan, hocdrs,hocdsl,hocol,hocpny,hodesf,hodtac,hodtcl,hodtcm,hodtea,hodtra,hodtrc,hodtto,hodtua,hohrac,hohrcl,hohrcm,hohrea,hohrra,hohrrc,hohrua,holinh,holinr,honrac,honumr,hoobs,hooe,hotdsc,hotour,housca,houscl,houscm,housea,houser,housra,housrc)
<console>:1: error: too many elements for tuple: 38, allowed: 22
val reorderedColumnNames: Array[String] = (hoclic,hotpac,hocdan,hocdrs,hocdsl,hocol,hocpny,hodesf,hodtac,hodtcl,hodtcm,hodtea,hodtra,hodtrc,hodtto,hodtua,hohrac,hohrcl,hohrcm,hohrea,hohrra,hohrrc,hohrua,holinh,holinr,honrac,honumr,hoobs,hooe,hotdsc,hotour,housca,houscl,houscm,housea,houser,housra,housrc)
How can I solve this?
P.S. The table in Cassandra has this structure:
CREATE TABLE tfm.foehis(hocpny text, hocol text,honumr int,holinh text,hodtto date,hotour text,hoclic int,hooe text,hotpac text,hodtac int,hohrac int,hodesf text,hocdan text,hocdrs text,hocdsl text, hoobs text,hotdsc int,honrac int,holinr int,housca text,hodtea int,hohrea int,housea text,hodtcl int,hohrcl int,houscl text,hodtrc int,hohrrc int,housrc text,hodtra int,hohrra int,housra text,hodtcm int,hohrcm int,houscm text,hodtua int,hohrua int,houser text, PRIMARY KEY((hoclic),hotpac,hocdan));
val reorderedColumnNames: Array[String] = (hoclic,hotpac,hocdan,hocdrs,hocdsl,hocol,hocpny,hodesf,hodtac,hodtcl,hodtcm,hodtea,hodtra,hodtrc,hodtto,hodtua,hohrac,hohrcl,hohrcm,hohrea,hohrra,hohrrc,hohrua,holinh,holinr,honrac,honumr,hoobs,hooe,hotdsc,hotour,housca,houscl,houscm,housea,houser,housra,housrc)
The issue is in the right-hand side of this assignment. Let's take a quick look at what happens with a smaller example:
scala> val x = ("hello", "world")
x: (String, String) = (hello,world)
x became a two-element tuple! That's because in Scala (...) is syntax for making a tuple, not a sequence. Instead you should use something like
scala> val x = Seq("hello", "world")
x: Seq[String] = List(hello, world)
to make a sequence, or
scala> val x = Array("hello", "world")
x: Array[String] = Array(hello, world)
to make an array, depending on what you need.
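Once the column names are in an Array, one way to apply the new ordering is with select; a minimal sketch, assuming Spark's DataFrame API and the firstDF from the question:
import org.apache.spark.sql.functions.col
// Use Array(...) instead of (...) so this is an Array[String], not a tuple
val reorderedColumnNames: Array[String] = Array("hoclic", "hotpac", "hocdan" /* ...remaining column names... */)
// select takes varargs of Column, so expand the Array with : _*
val reorderedDF = firstDF.select(reorderedColumnNames.map(col): _*)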

Filtering unique values from a text file

How do I find and filter unique values from a text file?
I tried the code below, but it's not working.
val spark = SparkSession.builder().master("local").appName("distinct").getOrCreate()
var data = spark.sparkContext.textFile("text/file/opath")
val uniqueval = data.map { rec => (rec.split(",")(3).distinct) }
var fils = data.filter(line => line.split(",")(3).equals(uniqueval)).map(x => (x)).foreach { println }
Sample Data:
ID | Name
1 john
1 john
2 david
3 peter
4 steve
Required Output:
1 john
2 david
3 peter
4 steve
You almost have it right; .distinct() just needs to be called on the RDD itself.
I'd replace statement 3 with:
val uniqueval = data.distinct().map...
This assumes that duplicate records appear as identical lines in the text file.
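Putting it together, a minimal sketch of the corrected snippet, assuming the goal is simply to keep one copy of each line:
val data = spark.sparkContext.textFile("text/file/opath")
// distinct() is called on the RDD itself and removes duplicate lines
val uniqueval = data.distinct()
uniqueval.collect.foreach(println)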
Is core Scala allowed?
scala> val text = List ("single" , "double", "mono", "double")
text: List[String] = List(single, double, mono, double)
scala> val u = text.distinct
u: List[String] = List(single, double, mono)
scala> val d = text.diff(u)
d: List[String] = List(double)
scala> val s = u.diff (d)
s: List[String] = List(single, mono)
Your code can be something like:
sparkContext.textFile("sample-data.txt").distinct()
.saveAsTextFile("sample-data-dist.txt");
The distinct method does what you want.

Can a Scala Array add a new element?

I created a Scala Array and added one element, but the array length is still 0, and I cannot get the added element, although I can see it in the result of the append.
scala> val arr = Array[String]()
arr: Array[String] = Array()
scala> arr:+"adf"
res9: Array[String] = Array(adf)
scala> println(arr(0))
java.lang.ArrayIndexOutOfBoundsException: 0
... 33 elided
You declared an Array of size 0, so it cannot hold any elements; Array[String]() is array constructor syntax.
:+ creates a new Array with the given element appended, so the old array is still empty even after the :+ operation.
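For example, capturing the result of :+ in a new value makes the appended element accessible (a small sketch of the same REPL steps):
val arr = Array[String]()
val arr2 = arr :+ "adf" // :+ returns a new Array; arr itself stays empty
println(arr2(0))        // prints adf
println(arr.length)     // still 0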
If you instead want an array you can write into by index, use the ofDim function to declare it with a certain size first, and then put elements inside using the arr(index) = value syntax.
Once declared, an array's size does not grow dynamically like a List's; trying to append values creates a new array instance.
Or you can initialize the array during creation itself using the Array("apple", "ball") syntax.
val size = 1
val arr = Array.ofDim[String](size)
arr(0) = "apple"
Scala REPL
scala> val size = 1
size: Int = 1
scala> val arr = Array.ofDim[String](size)
arr: Array[String] = Array(null)
scala> arr(0) = "apple"
scala> arr(0)
res12: String = apple