Removing a row in a CSV file that contains a comma in Scala?

I have this csv file called data.csv:
Name, Animal, and Total are the headers of the file:
Name Animal Total
Ann Fish 6
Bob Cat 4
Jim Dog, Cat 5
I want to drop the row if any cells contain a comma, so that the result is this:
Name Animal Total
Ann Fish 6
Bob Cat 4
Here's what I tried to do in Scala:
val data = sc.textFile("file:/home/user/data.csv")
val new_data = data.filter(x => x.contains(","))
Unfortunately, this code did not produce the results I wanted. What can I do? Any help is greatly appreciated.

First of all, if your input data is CSV, the condition x.contains(",") will always be true, because the comma is what separates the columns.
With .textFile, every item in data is one line of your file. Since each line is split by commas, you can instead keep only the lines that have exactly the expected number of fields:
val new_data = data.filter(x => x.split(",", -1).length == 3) // 3 is the expected number of columns
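For the sample data.csv above, a minimal sketch of the whole thing (assuming the file really is comma-separated with three columns, header included):
val data = sc.textFile("file:/home/user/data.csv")
// a line with a comma inside a cell splits into more than 3 fields and is dropped
val new_data = data.filter(_.split(",", -1).length == 3)
new_data.collect().foreach(println) // header, Ann and Bob rows remain; Jim's row is gone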

Here is another way to do this, splitting on commas only when they are outside double quotes:
import scala.io.Source

def readCsv(filePath: String): List[List[String]] = {
  val source = Source.fromFile(filePath)
  try {
    source.getLines
      .drop(1) // skip the header row
      .map { line =>
        // split on commas that are not inside quoted fields
        line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1).toList
      }
      .toList
  } finally source.close()
}
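If you then want to drop the rows from the original question, a small sketch on top of readCsv (the path is just the one from the question):
val rows = readCsv("/home/user/data.csv")
// keep only rows in which no cell contains a comma
val cleaned = rows.filter(row => !row.exists(_.contains(",")))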

Related

Reading multiple integers from line in text file

I am using Scala and reading input from the console. I am able to regurgitate the strings that make up each line, but if my input has the following format, how can I access each integer within each line?
2 2
1 2 2
2 1 1
Currently I just regurgitate the input back to the console using
object Main {
  def main(args: Array[String]): Unit = {
    for (ln <- io.Source.stdin.getLines) println(ln)
    // how can I access each individual number within each line?
  }
}
And I need to compile this project like so:
$ scalac main.scala
$ scala Main <input01.txt
2 2
1 2 2
2 1 1
A reasonable algorithm would be:
for each line, split it into words
parse each word into an Int
An implementation of that algorithm:
io.Source.stdin.getLines   // for each line...
  .flatMap(
    _.split("""\s+""")     // split it into words
      .map(_.toInt)        // parse each word into an Int
  )
The result of this expression will be an Iterator[Int]; if you want a Seq, you can call toSeq on that Iterator (if there's a reasonable chance there will be more than 7 or so integers, it's probably worth calling toVector instead). It will blow up with a NumberFormatException if there's a word which isn't an integer. You can handle this a few different ways... if you want to ignore words that aren't integers, you can:
import scala.util.Try

io.Source.stdin.getLines
  .flatMap(
    _.split("""\s+""")
      .flatMap(w => Try(w.toInt).toOption) // drop words that aren't integers
  )
The following gives you a flat sequence of numbers (an Iterator, since getLines is an Iterator):
val integers =
  for {
    line <- io.Source.stdin.getLines
    number <- line.split("""\s+""").map(_.toInt)
  } yield number
Note that some care must be taken when parsing the numbers, as discussed above.
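Putting it together with the original Main object, a minimal sketch (the toVector call is just one choice for materialising the result) could look like this:
object Main {
  def main(args: Array[String]): Unit = {
    // split every stdin line on whitespace and parse each word as an Int
    val numbers = io.Source.stdin.getLines
      .flatMap(_.split("""\s+""").map(_.toInt))
      .toVector
    println(numbers) // for the sample input: Vector(2, 2, 1, 2, 2, 2, 1, 1)
  }
}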

Reading and processing data in Spark: output is not delimited correctly

So my stored output looks like this; it is one column containing
\N|\N|\N|8931|\N|1
Where | is supposed to be the column delimiter. So it should have 6 columns, but it only has one.
My code to generate this is
val distData = sc.textFile(inputFileAdl).repartition(partitions.toInt)
val x = new UdfWrapper(inputTempProp, "local")
val wrapper = sc.broadcast(x)
distData.map({ s =>
  wrapper.value.exec(s.toString)
}).toDF().write.parquet(outFolder)
Nothing inside the map can be changed. wrapper.value.exec(s.toString) returns a delimited string (this cannot be changed). I want to write this delimited string to a parquet file, but have it correctly split on a given delimiter. How can I accomplish this?
So the current output is one column containing a delimited string.
The expected output is six columns from the single delimited string.
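One possible way (a sketch, not taken from an answer in this thread): keep the map exactly as it is, then split the resulting delimited string into six columns with Spark SQL's split function; the column names c0..c5 are placeholders.
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._ // assumes a SparkSession named spark is in scope

val raw = distData.map(s => wrapper.value.exec(s.toString)).toDF("raw")
// "|" is a regex metacharacter, so it has to be escaped in the pattern
val cols = (0 until 6).map(i => split(col("raw"), "\\|").getItem(i).as(s"c$i"))
raw.select(cols: _*).write.parquet(outFolder)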

Split a file into multiple files based on a string in Spark/Scala

I have a text file with the data below, which has no particular format:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
Based on the string abc, I want to split the file, so that the output is two files as below:
file 1:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
file 2:
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
And the file names should come from the IT name (on the line that starts with k7), so the first file's name should be IT_1234 and the second file's name should be IT_8876.
There is a little dirty trick that I used for a project:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")
You can set the record delimiter of your Spark context for reading files. So you could do something like this:
import org.apache.spark.sql.functions.col
import spark.implicits._ // for .toDF; already in scope in spark-shell

val delimit = "abc"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)
val df = sc.textFile("your_original_file.txt")
  .map(x => delimit ++ x)                    // put the delimiter back on each record
  .toDF("delimit_column")
  .filter(col("delimit_column") =!= delimit) // `!==` in Spark 1.x
Then you can map each element of your DataFrame (or RDD) to be written to a file.
It's a dirty method but it might help you !
Have a good day
PS: The filter at the end drops the first record, which is empty apart from the concatenated delimiter.
You can benefit from sparkContext's wholeTextFiles function to read the file, then parse it to separate the records (here I have used #### as a distinct combination of characters that won't appear in the text):
val rdd = sc.wholeTextFiles("path to the file")
  .flatMap(tuple => tuple._2.replace("\r\nabc", "####abc").split("####"))
  .collect()
And then loop the array to save the texts to output
for (str <- rdd) {
  // saving code here
}
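For the file-naming part of the question (IT_1234, IT_8876), one possible sketch of that saving step is below; the regex and the local PrintWriter output are assumptions, not part of the original answer:
import java.io.PrintWriter

val itName = """k7\*IT (\w+)""".r
for (str <- rdd) {
  // use the IT number from the k7 line as the file name, e.g. IT_1234
  val name = itName.findFirstMatchIn(str).map(m => s"IT_${m.group(1)}").getOrElse("unknown")
  val writer = new PrintWriter(s"$name.txt")
  try writer.write(str.trim) finally writer.close()
}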

How to remove the first few lines/header from multiple files using Scala in Spark

I was able to remove the first few lines of a single file using the code below:
scala> val file = sc.textFile("file:///root/path/file.csv")
Removing first 5 lines:
scala> val Data = file.mapPartitionsWithIndex{ (idx, iter) => if (idx == 0) iter.drop(5) else iter }
The problem is: suppose that I have multiple files with the same columns, and I want to load all of them into an RDD, removing the first few lines of each file.
Is this actually possible?
I'd appreciate any help. Thanks in advance!
Let's assume there are 2 files.
ravis-MacBook-Pro:files raviramadoss$ cat file.csv
first_file_first_record
first_file_second_record
first_file_third_record
first_file_fourth_record
first_file_fifth_record
first_file_sixth_record
ravis-MacBook-Pro:files raviramadoss$ cat file_2.csv
second_file_first_record
second_file_second_record
second_file_third_record
second_file_fourth_record
second_file_fifth_record
second_file_sixth_record
second_file_seventh_record
second_file_eight_record
Scala Code
sc.wholeTextFiles("/Users/raviramadoss/files").flatMap( _._2.lines.drop(5) ).collect()
Output:
res41: Array[String] = Array(first_file_sixth_record, second_file_sixth_record, second_file_seventh_record, second_file_eight_record)
In Spark/Hadoop, if you give the input path as the directory containing all the files, then the code you have written will work on all the individual files separately.
So, to achieve your objective, just give the directory containing all the files as the input path, and the first few lines will be removed from each file.

Reading a csv file and selecting three columns in Scala

I need to read a CSV file and then make a new file containing three specified columns.
I know how to read a text file, but not a CSV file.
import scala.io.Source._
val lines = fromFile("file.txt").getLines
Or if you just want the first three columns, try this:
val lines = fromFile("file.txt")
  .getLines
  .map(_.split(",", 4).take(3))
  .toList
Assuming a collection of indices idx that refer to columns in the csv file, consider first
val idx = Array(1,3,4)
val xs = (1 to 10).toArray
and so we can fetch the 2nd, 4th and 5th columns (index 0 refers to the first column),
idx.map(xs)
Array(2, 4, 5)
We can apply this idea to the array obtained from splitting each line, as follows:
Source.fromFile("file.csv").getLines.map(line => idx.map(line.split(",")))
This approach allows the indices of interest to be defined at runtime (no hard-coding).
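To also write the selected columns to a new file, as the question asks, a minimal sketch (the output file name and the idx values are just examples):
import scala.io.Source
import java.io.PrintWriter

val idx = Array(1, 3, 4) // 0-based positions of the columns to keep
val selected = Source.fromFile("file.csv").getLines
  .map(line => idx.map(line.split(",")).mkString(",")) // pick the chosen columns, re-join with commas
  .toList

val out = new PrintWriter("selected_columns.csv")
try selected.foreach(out.println) finally out.close()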