I'm fairly new to Scala. I have a text file containing a single line of five words separated by semicolons (;).
I want to extract each word, strip the surrounding whitespace, convert everything to lowercase, and then access each word by its index. This is how I approached it:
newListUpper2.txt contains (Bed; chairs;spoon; CARPET;curtains )
val file = sc.textFile("newListUpper2.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";"))
result.collect.foreach(println)
Below is the copy of the REPL when I executed the code
scala> val file = sc.textFile("newListUpper2.txt")
file: org.apache.spark.rdd.RDD[String] = newListUpper2.txt MapPartitionsRDD[5] at textFile at
<console>:24
scala> val lower = file.map(x=>x.toLowerCase)
lower: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at map at <console>:26
scala> val result = lower.flatMap(x=>x.trim.split(";"))
result: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at flatMap at <console>:28
scala> result.collect.foreach(println)
bed
chairs
spoon
carpet
curtains
scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters
result(0)
The words are not trimmed, and passing an index as a parameter to get the word at that position gives an error. This is the outcome I expect when I pass the index of each word:
result(0)= bed
result(1) = chairs
result(2) = spoon
result(3) = carpet
result(4) = curtains
What am I doing wrong?
newListUpper2.txt contains (Bed; chairs;spoon; CARPET;curtains )
val file = sc.textFile("newListUpper2.txt")
val lower = file.map(x=>x.toLowerCase)
val result = lower.flatMap(x=>x.trim.split(";")) // here x is the whole line `bed; chairs;spoon; carpet;curtains `, so x.trim only strips whitespace at the start and end of the line, not around each word
result.collect.foreach(println)
Split first, then trim each word:
val result = lower.flatMap(x=>x.split(";").map(x=>x.trim))
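The difference between trimming the line and trimming each word can be checked without Spark. The following is a minimal sketch in plain Scala, assuming the file's single line is the string shown in the question; the names `line`, `trimmedLine` and `trimmedWords` are made up for this illustration.

```scala
// Sample line from the question (assumed to be the only line in the file).
val line = "Bed; chairs;spoon; CARPET;curtains "

// trim before split: only the two ends of the whole line are stripped,
// so the spaces that follow a semicolon survive.
val trimmedLine = line.toLowerCase.trim.split(";").toList
// List("bed", " chairs", "spoon", " carpet", "curtains")

// split first, then trim every token.
val trimmedWords = line.toLowerCase.split(";").map(_.trim).toList
// List("bed", "chairs", "spoon", "carpet", "curtains")
```

The same `split(";").map(_.trim)` expression is what the `flatMap` above applies to each line of the RDD.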
1) Issue 1
scala> result(0)
<console>:31: error: org.apache.spark.rdd.RDD[String] does not take parameters
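result is an RDD, and an RDD cannot be indexed with result(0). Bring the data to the driver first, for example result.collect()(0), or print the first few elements with result.take(10).foreach(println). (show(10, false) is a DataFrame/Dataset method and is not available on an RDD.)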
2) Issue 2 - To get indexed access such as result(0) = bed, result(1) = chairs, ..., read the file into a local List instead:
scala> var result = scala.io.Source.fromFile("/path/to/File").getLines().flatMap(x=>x.split(";").map(x=>x.trim)).toList
result: List[String] = List(Bed, chairs, spoon, CARPET, curtains)
scala> result(0)
res21: String = Bed
scala> result(1)
res22: String = chairs
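The transcript above trims the words but does not lowercase them. A minimal sketch that does both and then reads words by index, assuming the line from the question is already available as a string; `rawLine` and `wordsByIndex` are names invented for this example.

```scala
// Assumed content of the one-line file from the question.
val rawLine = "Bed; chairs;spoon; CARPET;curtains "

// Split on ';', then trim and lowercase every token.
// A Vector is a local collection, so it supports apply(index).
val wordsByIndex: Vector[String] =
  rawLine.split(";").map(_.trim.toLowerCase).toVector

val firstWord = wordsByIndex(0)       // "bed"
val fourthWord = wordsByIndex(3)      // "carpet"

// lift returns an Option instead of throwing for an out-of-range index.
val outOfRange = wordsByIndex.lift(10) // None
```

With Spark, the array returned by `result.collect()` can be indexed in the same way, since `collect` returns a local `Array`.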
I have a string as
something,'' something,nothing_something,op nothing_something,'' cat,cat
I want to achieve my output as
'' something,op nothing_something,cat
Is there any way to achieve it?
If I understand your requirement correctly, here's one approach with the following steps:
Split the input string by "," and create a list of indexed-CSVs and convert it to a Map
Generate 2-combinations of the indexed-CSVs
Check each of the indexed-CSV pairs and capture the index of any CSV which is contained within the other CSV
Since the CSVs corresponding to the captured indexes are contained within some other CSV, removing these indexes will result in remaining indexes we would like to keep
Use the remaining indexes to look up CSVs from the CSV Map and concatenate them back to a string
Here is sample code applying to a string with slightly more general comma-separated values:
val str = "cats,a cat,cat,there is a cat,my cat,cats,cat"
val csvIdxList = (Stream from 1).zip(str.split(",")).toList
val csvMap = csvIdxList.toMap
val csvPairs = csvIdxList.combinations(2).toList
val csvContainedIdx = csvPairs.collect {
  case List(x, y) if x._2.contains(y._2) => y._1
  case List(x, y) if y._2.contains(x._2) => x._1
}.distinct
// csvContainedIdx: List[Int] = List(3, 6, 7, 2)
val csvToKeepIdx = (1 to csvIdxList.size) diff csvContainedIdx
// csvToKeepIdx: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 4, 5)
val strDeduped = csvToKeepIdx.map( csvMap.getOrElse(_, "") ).mkString(",")
// strDeduped: String = cats,there is a cat,my cat
Applying the above to your sample string something,'' something,nothing_something,op nothing_something would yield the expected result:
strDeduped: String = '' something,op nothing_something
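The steps above can also be folded into one function. This is only a sketch of the same idea (drop every value that is contained in some other value), not the answer's exact code; the name `dropContained` and the tie-breaking rule for identical values (keep the first occurrence) are assumptions chosen to reproduce the results shown above.

```scala
// Drops every comma-separated value that is contained in another value.
// For exact duplicates, only the first occurrence is kept.
def dropContained(csv: String): String = {
  val items = csv.split(",").toVector.zipWithIndex
  val kept = items.filter { case (item, i) =>
    !items.exists { case (other, j) =>
      j != i && other.contains(item) && (other != item || j < i)
    }
  }
  kept.map(_._1).mkString(",")
}

val dedupedCats = dropContained("cats,a cat,cat,there is a cat,my cat,cats,cat")
// "cats,there is a cat,my cat"

val dedupedSample =
  dropContained("something,'' something,nothing_something,op nothing_something")
// "'' something,op nothing_something"
```

Like the pairwise version, this compares every value with every other value, so it is quadratic in the number of values; that should be fine for short strings.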
First create an Array of values by calling split(",") on the given String, then apply filter and mkString as below:
s.split(",").filter(_.contains(' ')).mkString(",")
In Scala REPL:
scala> val s = "something,'' something,nothing_something,op nothing_something"
s: String = something,'' something,nothing_something,op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res27: String = '' something,op nothing_something
Following Leo C's comment, I tested it with another String:
scala> val s = "something,'' something anything anything anything anything,nothing_something,op op op nothing_something"
s: String = something,'' something anything anything anything anything,nothing_something,op op op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res43: String = '' something anything anything anything anything,op op op nothing_something
I'm changing the column order of my DataFrame because I will write it to Cassandra.
The problem is that I have more than 22 columns and I get this error:
<console>:1: error: too many elements for tuple: 38, allowed: 22
I am using this procedure:
scala> val columns: Array[String] = firstDF.columns
columns: Array[String] = Array(HOCPNY, HOCOL, HONUMR, HOLINH, HODTTO, HOTOUR, HOCLIC, HOOE, HOTPAC, HODTAC, HOHRAC, HODESF, HOCDAN, HOCDRS, HOCDSL, HOOBS, HOTDSC, HONRAC, HOLINR, HOUSCA, HODTEA, HOHREA, HOUSEA, HODTCL, HOHRCL, HOUSCL, HODTRC, HOHRRC, HOUSRC, HODTRA, HOHRRA, HOUSRA, HODTCM, HOHRCM, HOUSCM, HODTUA, HOHRUA, HOUSER)
scala> val reorderedColumnNames: Array[String] = (hoclic,hotpac, hocdan, hocdrs,hocdsl,hocol,hocpny,hodesf,hodtac,hodtcl,hodtcm,hodtea,hodtra,hodtrc,hodtto,hodtua,hohrac,hohrcl,hohrcm,hohrea,hohrra,hohrrc,hohrua,holinh,holinr,honrac,honumr,hoobs,hooe,hotdsc,hotour,housca,houscl,houscm,housea,houser,housra,housrc)
<console>:1: error: too many elements for tuple: 38, allowed: 22
val reorderedColumnNames: Array[String] = (hoclic,hotpac,hocdan,hocdrs,hocdsl,hocol,hocpny,hodesf,hodtac,hodtcl,hodtcm,hodtea,hodtra,hodtrc,hodtto,hodtua,hohrac,hohrcl,hohrcm,hohrea,hohrra,hohrrc,hohrua,holinh,holinr,honrac,honumr,hoobs,hooe,hotdsc,hotour,housca,houscl,houscm,housea,houser,housra,housrc)
How can I solve this?
P.S. The table in Cassandra has this structure:
CREATE TABLE tfm.foehis(hocpny text, hocol text,honumr int,holinh text,hodtto date,hotour text,hoclic int,hooe text,hotpac text,hodtac int,hohrac int,hodesf text,hocdan text,hocdrs text,hocdsl text, hoobs text,hotdsc int,honrac int,holinr int,housca text,hodtea int,hohrea int,housea text,hodtcl int,hohrcl int,houscl text,hodtrc int,hohrrc int,housrc text,hodtra int,hohrra int,housra text,hodtcm int,hohrcm int,houscm text,hodtua int,hohrua int,houser text, PRIMARY KEY((hoclic),hotpac,hocdan));
val reorderedColumnNames: Array[String] = (hoclic,hotpac,hocdan,hocdrs,hocdsl,hocol,hocpny,hodesf,hodtac,hodtcl,hodtcm,hodtea,hodtra,hodtrc,hodtto,hodtua,hohrac,hohrcl,hohrcm,hohrea,hohrra,hohrrc,hohrua,holinh,holinr,honrac,honumr,hoobs,hooe,hotdsc,hotour,housca,houscl,houscm,housea,houser,housra,housrc)
The issue is in the right-hand side of this assignment. Let's take a quick look at what happens with a smaller example:
scala> val x = ("hello", "world")
x: (String, String) = (hello,world)
x became a two-element tuple! That's because in Scala (...) is the syntax for a tuple, not a sequence, and tuples are limited to 22 elements. Instead you should use something like
scala> val x = Seq("hello", "world")
x: Seq[String] = List(hello, world)
to make a sequence, or
scala> val x = Array("hello", "world")
x: Array[String] = Array(hello, world)
to make an array, depending on what you need.
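Applied to the column list, this might look like the sketch below. Only six of the 38 column names are used to keep it short, and `originalColumns`, `reorderedNames` and `sameColumns` are names made up for this illustration. Note that the names are written as quoted string literals; unquoted, `hoclic` would be read as a reference to a variable named `hoclic`.

```scala
// A shortened stand-in for firstDF.columns (assumption: 6 of the 38 columns).
val originalColumns: Array[String] =
  Array("HOCPNY", "HOCOL", "HONUMR", "HOCLIC", "HOTPAC", "HOCDAN")

// Array(...) has no 22-element limit, unlike a tuple literal (...).
val reorderedNames: Array[String] =
  Array("hoclic", "hotpac", "hocdan", "hocol", "hocpny", "honumr")

// Sanity check: the reordered list names exactly the same columns,
// ignoring case.
val sameColumns: Boolean =
  originalColumns.map(_.toLowerCase).toSet == reorderedNames.toSet
```

Assuming `firstDF` is the DataFrame from the question, such an array could then be passed to `firstDF.select(reorderedNames.head, reorderedNames.tail: _*)` to obtain the columns in the new order.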
How can I find and keep only the unique rows of a text file?
I tried the code below, but it's not working.
val spark = SparkSession.builder().master("local").appName("distinct").getOrCreate()
var data = spark.sparkContext.textFile("text/file/opath")
val uniqueval = data.map { rec => (rec.split(",")(3).distinct) }
var fils = data.filter(line => line.split(",")(3).equals(uniqueval)).map(x => (x)).foreach { println }
Sample Data:
ID | Name
1 john
1 john
2 david
3 peter
4 steve
Required Output:
1 john
2 david
3 peter
4 steve
You almost have it right. distinct() just needs to be called on the RDD itself, not on a field inside map.
I'd replace statement 3 with:
val uniqueval = data.distinct().map...
This assumes that similar records will have identical lines in the text file.
Is core Scala allowed?
scala> val text = List ("single" , "double", "mono", "double")
text: List[String] = List(single, double, mono, double)
scala> val u = text.distinct
u: List[String] = List(single, double, mono)
scala> val d = text.diff(u)
d: List[String] = List(double)
scala> val s = u.diff (d)
s: List[String] = List(single, mono)
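Note that `u` and `s` in the transcript answer two different questions: `u` keeps one copy of every value, while `s` keeps only the values that occur exactly once. A small sketch that makes the distinction explicit by counting occurrences; `sampleWords`, `counts`, `distinctWords` and `onlyOnce` are names invented here.

```scala
val sampleWords = List("single", "double", "mono", "double")

// How often each value occurs.
val counts: Map[String, Int] =
  sampleWords.groupBy(identity).map { case (word, occurrences) =>
    word -> occurrences.size
  }

// One copy of every value (what the question's required output shows).
val distinctWords = sampleWords.distinct      // List(single, double, mono)

// Only the values that were never duplicated.
val onlyOnce = sampleWords.filter(counts(_) == 1) // List(single, mono)
```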
Your code can be something like:
sparkContext.textFile("sample-data.txt").distinct()
.saveAsTextFile("sample-data-dist.txt");
The distinct method does what you want.
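The effect of `distinct` on the sample data can be previewed with an ordinary Scala collection. This is a sketch under the assumption that each row of the file is one line of text and that duplicate rows are character-for-character identical; `sampleRows` and `uniqueRows` are hypothetical names.

```scala
// Rows from the question's sample data, one string per line.
val sampleRows = List("1 john", "1 john", "2 david", "3 peter", "4 steve")

// distinct keeps the first occurrence of each identical line.
val uniqueRows = sampleRows.distinct
// List("1 john", "2 david", "3 peter", "4 steve")
```

On an RDD, `distinct()` removes duplicates in the same sense, but the order of the resulting elements is not guaranteed.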