Remove pipe delimiter from data using Spark - Scala

I am new to Spark. I am using Scala to split a pipe-delimited file and save it in HDFS without the pipe delimiters. For that I have written this code:
object WordCount {
  def main(args: Array[String]) {
    val textfile = sc.textFile("/user/cloudera/xxxx/xxxx")
    val word = textfile.map(l => l.split("|"))
    word.saveAsTextFile("/user/cloudera/xxxxx/Sparktest")
  }
}
When I execute it I get no errors, but in HDFS I get the data below.
[Ljava.lang.String;@10ed847f
[Ljava.lang.String;@4316ebe
[Ljava.lang.String;@495d7e18
[Ljava.lang.String;@19017f49
[Ljava.lang.String;@314b9e72
[Ljava.lang.String;@5b8f67a6
[Ljava.lang.String;@23ddf240
[Ljava.lang.String;@404b5a25
[Ljava.lang.String;@130b541d
[Ljava.lang.String;@4cbf45af
[Ljava.lang.String;@21780b86
[Ljava.lang.String;@503c9b94
[Ljava.lang.String;@3b0a3ab3
I don't know what I am doing wrong. Please help.

That's because you are splitting each string into an Array of Strings, and saveAsTextFile writes each array's toString, which is the [Ljava.lang.String;@... output you see. To save it as a text file you would need to join each array back together, e.g. with mkString(","), but I don't see much purpose in splitting just to re-join.
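If you do want to keep the split-then-join approach, the fix would look like this (note the escaped pipe; the reason is explained below):
val word = textfile.map(_.split("\\|").mkString(","))
word.saveAsTextFile("/user/cloudera/xxxxx/Sparktest")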
If you want to replace the pipe separator with a comma, you can use _.replaceAll("\\|", ",") instead and save the result:
// Replace every pipe with a comma, drop the first comma, and trim whitespace
val word = textfile.map(_.replaceAll("\\|", ",").replaceFirst(",", "").trim)
word.saveAsTextFile("/user/cloudera/xxxxx/Sparktest")
PS: You can replace the comma with anything you want, e.g. a whitespace, a word, etc.
So why does the pipe need to be escaped?
Both split and replaceAll take a regular expression, and | is the regex alternation operator. An unescaped "|" is parsed as "empty string or empty string", which matches at every position instead of matching a literal pipe.
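You can see the difference in the REPL:
scala> "a|b|c".split("|")
val res0: Array[String] = Array(a, |, b, |, c)
scala> "a|b|c".split("\\|")
val res1: Array[String] = Array(a, b, c)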

Related

How do I replace whitespace with underscore and encode values in a Scala Array / List

I have a Spark Scala DataFrame which has a column "Name".
I have extracted the values of that column into a Scala Array[String]:
org_name: Array[String] = Array(SARATOGA SENIOR HIGH SCHOOL)
I want to replace whitespace with _ and encode the value as UTF-8 (any encoding is fine as long as it replaces special chars with something else),
so that any special chars are removed. Later I want to use these values in a file path.
var org_name = orgsFlatDF.rdd.collect
  .map(_.getString(2))
This is how I am extracting those values ^^. I haven't found any method I can use to do this; replace or replaceAll doesn't work on an Array.
I tried this:
org_name.replace("\\s", "")
That didn't work.
Expected output: SARATOGA_SENIOR_HIGH_SCHOOL
If the name is new $ high school, it should get converted to new_$_high_school and then encoded to new_%24_high_school.
There are a couple of issues with what you are asking.
Java/Scala Arrays don't have a replace method. Even if they did, would it replace the values the array holds or the characters inside the Strings it holds?
Let's set the line org_name.replace("\\s", "") aside (it doesn't compile) and assume org_name is indeed an Array[String] holding one element.
scala> val org_name=Array("SARATOGA SENIOR HIGH SCHOOL")
val org_name: Array[String] = Array(SARATOGA SENIOR HIGH SCHOOL)
scala> org_name(0).replace(" ","_")
val res15: String = SARATOGA_SENIOR_HIGH_SCHOOL
replace("\\s", "_") wouldn't work anyway, because "\\s" represents the literal two-character string \s: "\\" is how you write a single \, and that's the only way you'd be able to define strings containing characters that clash with escape codes like \n or \t. Unlike replaceAll, replace matches literal text rather than a regex, so \s has no special meaning to it.
PS: to transform all the strings in the array, use org_name.map(_.replace(" ", "_")); this gives you back another array.
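For the encoding half of the question, here's a minimal sketch using the JDK's java.net.URLEncoder, assuming UTF-8 percent-encoding is acceptable for your file paths:
import java.net.URLEncoder
// Underscore first, then percent-encode whatever special characters remain.
val encoded = org_name.map(name => URLEncoder.encode(name.replace(" ", "_"), "UTF-8"))
// "new $ high school" becomes "new_%24_high_school"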

How do you split one regex literal across multiple lines?

For example, I have a regex string:
val myRegex:Regex = "blahblah".r
but if the 'blahblah' is more than a thousand characters long, I want to split it across multiple lines so it's easier to read, like so:
val myRegex:Regex = "blah".r
+ "blah".r
This does not work, failing with value unary_+ is not a member of scala.util.matching.Regex.
Is there a proper way?
One possible solution:
val myRegex: Regex =
  """a
    |very
    |long
    |pattern
    |"""
    .stripMargin
    .replaceAll("\n", "")
    .r
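An alternative sketch, if concatenation reads better to you: keep the pieces as plain Strings and call .r once on the concatenated result. The original attempt failed because + was applied to Regex values rather than Strings:
import scala.util.matching.Regex

val myRegex: Regex =
  ("blah" +
    "blah").r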

Apply a text-preprocessing function to a DataFrame column in Scala Spark

I want to create a function to handle the text preprocessing in a problem I am facing with text data. I am familiar with Python and pandas DataFrames, and my usual approach is to write a function and then apply it to all the elements in a column with pandas' apply method. However, I don't know where to begin to accomplish this in Spark.
So I created two functions to handle the replacements. The problem is that I don't know how to put more than one replace inside each method. I need to make about 20 replacements for three separate DataFrames, so solving it this way would take me 60 lines of code. Is there a way to do all the replacements inside a single function and then apply it to all the elements of a DataFrame column in Scala?
def removeSpecials: String => String = _.replaceAll("$", " ")
def removeSpecials2: String => String = _.replaceAll("?", " ")
val udf_removeSpecials = udf(removeSpecials)
val udf_removeSpecials2 = udf(removeSpecials2)
val consolidated2 = consolidated.withColumn("product_description", udf_removeSpecials($"product_description"))
val consolidated3 = consolidated2.withColumn("product_description", udf_removeSpecials2($"product_description"))
consolidated3.show()
Well, you can simply chain each replacement after the previous one, like this (note that the metacharacters must be escaped here too, otherwise replaceAll("?", " ") throws a PatternSyntaxException):
def removeSpecials: String => String = _.replaceAll("\\$", " ").replaceAll("\\?", " ")
But in this case, where the replacement character is the same, it is better to combine the patterns into one regular expression and avoid multiple replaceAll calls.
def removeSpecials: String => String = _.replaceAll("\\$|\\?", " ")
Note that \\ is used as the escape character, because $ and ? are regex metacharacters.
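As a side note, here is a sketch of an alternative that avoids the UDF entirely: Spark's built-in regexp_replace (DataFrame and column names taken from the question), which lets Catalyst optimize the expression:
import org.apache.spark.sql.functions.regexp_replace

// One pass over the column; inside a character class, $ and ? need no escaping.
val consolidated2 = consolidated.withColumn(
  "product_description",
  regexp_replace($"product_description", "[$?]", " "))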

Split a file into multiple files based on a string in Spark Scala

I have a text file with the data below, which has no particular format:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
I want to split the file based on the string abc, producing two output files as below.
file 1:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
file 2:
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
And the file names should be the IT name (from the line starting with k7), so the first file's name should be IT_1234 and the second file's name should be IT_8876.
There is this little dirty trick that I used for a project :
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")
You can set the record delimiter your Spark context uses when reading files. So you could do something like this:
val delimit = "abc"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)

val df = sc.textFile("your_original_file.txt")
  .map(x => delimit ++ x)                      // re-attach the delimiter stripped on read
  .toDF("delimit_column")                      // needs spark.implicits._ in scope
  .filter(col("delimit_column") =!= delimit)   // drop the empty first record
Then you can map each element of your DataFrame (or RDD) to be written to a file (see the saving sketch after the second answer below).
It's a dirty method, but it might help you!
Have a good day
PS: The filter at the end drops the first record, which is empty apart from the concatenated delimiter.
You can benefit from SparkContext's wholeTextFiles function to read the file, then parse it to separate the records (here I have used #### as a combination of characters that won't occur in the text):
val records = sc.wholeTextFiles("path to the file")
  .flatMap(tuple => tuple._2.replace("\r\nabc", "####abc").split("####"))
  .collect()
And then loop over the array to save the texts to output files:
for (record <- records) {
  // saving code here
}
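A minimal sketch of that saving loop, assuming the collected records fit on the driver, that the records keep their \r\n line endings, and that every record contains a k7 line carrying the IT name:
import java.io.PrintWriter

for (record <- records if record.trim.nonEmpty) {
  // Derive the file name from the k7 line, e.g. "k7*IT 1234*P*234df~" => "IT_1234"
  val fileName = record.split("\r\n")
    .find(_.startsWith("k7"))
    .map(_.split("\\*")(1).trim.replace(" ", "_"))
    .getOrElse("unknown")
  new PrintWriter(s"$fileName.txt") { write(record); close() }
}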

Scala - write Windows file paths that contain spaces as string literals

I need to turn some Windows file paths that contain spaces into string literals in Scala. I have tried wrapping the entire path in double quotes, AND wrapping the entire path in double quotes with single quotes around each directory name that contains a space. Now it wants an escape character for "\Jun" in both places and I don't know why.
Here are the strings:
val input = "R:\'Unclaimed Property'\'CT State'\2015\Jun\ct_finderlist_2015.pdf"
val output = "R:\'Unclaimed Property'\'CT State'\2015\Jun"
Here is the latest error:
The problem is the \ character, which has to be escaped.
This should work:
val input = "R:\\Unclaimed Property\\CT State\\2015\\Jun\\ct_finderlist_2015.pdf"
val output = "R:\\Unclaimed Property\\CT State\\2015\\Jun"
A cleaner way to create string literals is to use triple quotes: you can wrap your string directly in triple quotes without escaping special characters, and you can put multi-line strings in them. It's much easier to write and read.
For example:
val input =
  """R:\Unclaimed Property\CT State\2015\Jun\ct_finderlist_2015.pdf"""
To add a variable to the string, use the raw interpolator and reference the variable as $variableName. (The s interpolator processes escape sequences even inside triple quotes, so the backslashes would break it; raw interpolates variables but leaves \ alone.)
val input =
  raw"""R:\Unclaimed Property\$variablePath\CT State\2015\Jun\ct_finderlist_2015.pdf"""