Spark (Scala): modify the contents of a Dataset column

I would like to have a Dataset, where the first column contains single words and the second column contains the filenames of the files where these words appear.
My current code looks something like this:
val path = "path/to/folder/with/files"
val tokens = spark.read.textFile(path)
  .flatMap(line => line.split(" "))
  .withColumn("filename", input_file_name())
tokens.show()
However, this returns something like:
|word1 |whole/path/to/some/file |
|word2 |whole/path/to/some/file |
|word1 |whole/path/to/some/otherfile|
(I don't need the whole path, just the last bit.) My idea to fix this was to use the map function:
val tokensNoPath = tokens
  .map(r => (r(0), r(1).asInstanceOf[String].split("/").lastOption))
So basically, just going to every row, grabbing the second entry and deleting everything before the last slash.
However, since I'm very new to Spark and Scala, I can't figure out how to get the syntax for this right.

Docs:
substring_index "substring_index(str, delim, count) Returns the substring from str before count occurrences of the delimiter delim... If count is negative, everything to the right of the final delimiter (counting from the right) is returned."
.withColumn("filename", substring_index(input_file_name, "/", -1))

You can split by slash and get the last element:
val tokens2 = tokens.withColumn("filename", element_at(split(col("filename"), "/"), -1))
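If you want to sanity-check what that expression returns, here is a tiny self-contained example (element_at requires Spark 2.4+; the sample path is made up):
import org.apache.spark.sql.functions.{element_at, split}
import spark.implicits._

val demo = Seq("whole/path/to/some/file.txt").toDF("filename")
demo.select(element_at(split($"filename", "/"), -1).as("basename")).show()
// prints a single row containing: file.txt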

Related

How do I replace whitespace with underscore and encode values in a Scala array / list

I have a Spark Scala DataFrame which has a column "Name".
I have extracted the values of that column into a Scala Array[String]:
org_name: Array[String] = Array(SARATOGA SENIOR HIGH SCHOOL)
I want to replace whitespace with _ and encode that value as UTF-8 (any encoding is fine as long as it replaces special chars with something else),
so if there are any special chars those will be removed. Later I want to use those values in a file path.
var org_name = orgsFlatDF.rdd.collect
  .map(_.getString(2))
This is how I am extracting those values. I haven't found any method I can use to do this; replace or replaceAll doesn't work on an Array.
I tried this:
org_name.replace("\\s", "")
That didn't work.
Expected output: SARATOGA_SENIOR_HIGH_SCHOOL
If the name is new $ high school, it should get converted to new_$_high_school and then encoded to new_%24_high_school.
There are a couple of issues with what you are asking.
Java/Scala Arrays don't have a replace method. Even if they did have a replace method, would they replace the values they hold or the characters in a String they hold?
Let's set aside the fact that the line org_name.replace("\\s", "") doesn't compile, and assume org_name is indeed an Array[String] holding one element.
scala> val org_name=Array("SARATOGA SENIOR HIGH SCHOOL")
val org_name: Array[String] = Array(SARATOGA SENIOR HIGH SCHOOL)
scala> org_name(0).replace(" ","_")
val res15: String = SARATOGA_SENIOR_HIGH_SCHOOL
replace("\\s","_") wouldn't work because it represents a \s string. "\" represents \. That's only way you'd be able to define strings containing other escape codes like \n or \t.
PS: to transform all the strings in the array, use org_name.map(_.replace(" ", "_")); this gives you back another array.
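For the encoding part of the question, here is a sketch of the whole transformation using java.net.URLEncoder (any percent-encoder would do; spaces are replaced with "_" first, since URLEncoder would otherwise turn them into "+"):
import java.net.URLEncoder

val encoded = org_name.map { name =>
  URLEncoder.encode(name.replace(" ", "_"), "UTF-8")  // "_" is left untouched, "$" becomes "%24"
}
// "SARATOGA SENIOR HIGH SCHOOL" -> "SARATOGA_SENIOR_HIGH_SCHOOL"
// "new $ high school"           -> "new_%24_high_school"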

Spark - read text file, strip off first X and last Y rows using monotonically_increasing_id

I have to read in files from vendors, that can get potentially pretty big (multiple GB). These files may have multiple header and footer rows I want to strip off.
Reading the file in is easy:
val rawData = spark.read
.format("csv")
.option("delimiter","|")
.option("mode","PERMISSIVE")
.schema(schema)
.load("/path/to/file.csv")
I can add a simple row number using monotonically_increasing_id:
val withRN = rawData.withColumn("aIndex",monotonically_increasing_id())
That seems to work fine.
I can easily use that to strip off header rows:
val noHeader = withRN.filter($"aIndex".geq(2))
but how can I strip off footer rows?
I was thinking about getting the max of the index column, and using that as a filter, but I can't make that work.
val MaxRN = withRN.agg(max($"aIndex")).first.toString
val noFooter = noHeader.filter($"aIndex".leq(MaxRN))
That returns no rows, because MaxRN is a string.
If I try to convert it to a long, that fails:
noHeader.filter($"aIndex".leq(MaxRN.toLong))
java.lang.NumberFormatException: For input string: "[100000]"
How can I use that max value in a filter?
Is trying to use monotonically_increasing_id like this even a viable approach? Is it really deterministic?
This happens because first returns a Row. To access the first element of the Row you must do:
val MaxRN = withRN.agg(max($"aIndex")).first.getLong(0)
By converting the Row to a string you get [100000], which of course is not a valid Long; that's why the conversion fails.
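A sketch that puts the header and footer trimming together (the counts of 2 header and 2 footer rows are made up; also note that monotonically_increasing_id is only guaranteed to be increasing, not consecutive, so the maxRN - 2 arithmetic assumes the file ends up in a single partition):
import org.apache.spark.sql.functions.{max, monotonically_increasing_id}
import spark.implicits._

val withRN = rawData.withColumn("aIndex", monotonically_increasing_id())
val maxRN  = withRN.agg(max($"aIndex")).first.getLong(0)

val trimmed = withRN
  .filter($"aIndex".geq(2))           // drop the first 2 header rows
  .filter($"aIndex".leq(maxRN - 2))   // drop the last 2 footer rows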

split the file into multiple files based on a string in spark scala

I have a text file with the data below, which has no particular format:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
I want to split the file based on the string abc, so the output is the two files below.
file 1:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
file 2:
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
And the file names should be the IT name (from the line that starts with k7), so the first file name should be IT_1234 and the second file name should be IT_8876.
There is a little dirty trick that I used for a project:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")
You can set the record delimiter that your Spark context uses when reading files. So you could do something like this:
val delimit = "abc"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)
val df = sc.textFile("your_original_file.txt")
.map(x => (delimit ++ x))
.toDF("delimit_column")
.filter(col("delimit_column") !== delimit)
Then you can map each element of your DataFrame (or RDD) to be written to a file.
It's a dirty method, but it might help you!
Have a good day.
PS: The filter at the end drops the first record, which is empty apart from the concatenated delimiter.
You can benefit from SparkContext's wholeTextFiles function to read the file, then parse it to separate the strings (here I have used #### as a distinct combination of characters that won't repeat in the text):
val rdd = sc.wholeTextFiles("path to the file")
  .flatMap(tuple => tuple._2.replace("\r\nabc", "####abc").split("####"))
  .collect()
And then loop over the array to save the texts to output:
for (str <- rdd) {
  // saving code here
}
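One possible way to fill in the saving part, assuming the chunks are small enough to handle on the driver: pull the IT name from the k7 line of each chunk and use it as the file name (the output directory and the regex are illustrative, and the directory must already exist):
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

val itName = """k7\*IT (\w+)\*""".r   // captures e.g. "1234" from "k7*IT 1234*P*234df~"

for (str <- rdd) {
  val name = itName.findFirstMatchIn(str)
    .map(m => "IT_" + m.group(1))
    .getOrElse("UNKNOWN")
  Files.write(Paths.get(s"/tmp/output/$name.txt"),
    str.getBytes(StandardCharsets.UTF_8))
}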

find line number in an unstructured file in scala

Hi guys, I am parsing an unstructured file for some keywords, but I can't seem to easily find the line number of the results I am getting.
val filePath: String = "myfile"
val myfile = sc.textFile(filePath)
var ora_temp = myfile.filter(line => line.contains("MyPattern")).collect
ora_temp.length
However, I not only want to find the lines that contain MyPattern, but I want something more like a tuple: (MyPattern line, line number).
Thanks in advance,
You can use zipWithIndex, as eliasah pointed out in a comment (probably the most succinct way to do this uses the direct tuple accessor syntax), or like so, using pattern matching in the filter:
val matchingLineAndLineNumberTuples = sc.textFile("myfile").zipWithIndex().filter({
  case (line, lineNumber) => line.contains("MyPattern")
}).collect
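The more succinct variant the comment refers to, using the tuple accessor directly instead of a pattern match:
val matches = sc.textFile("myfile")
  .zipWithIndex()
  .filter(_._1.contains("MyPattern"))
  .collect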

_.split(" ") more fields in scala RDD

I'm trying to extract data from an RDD[String] into another RDD[String].
The RDD contains data similar to this:
17.808 15.749 6.649 -0.548 15.9994
I need to multiply the 4th and 5th fields of each row and store them in a different RDD[String].
I can use the following code to pull out one field:
val ansRDD = rawRDD.map(_.split(" ")(4)).map(_.toFloat)
rawRDD contains the strings.
But I need to pull out both fields into a single RDD, as
-0.548 15.9994
so that I can simply do
val answer = ansRDD.map { case (a, b) => a * b }
You could use:
rawRDD.map(_.split(' ').view(3, 5).map(_.toFloat).reduce(_ * _).toString)
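A quick local sanity check of that expression on the sample row (no Spark needed; the 4th and 5th fields sit at indices 3 and 4 of the split array):
val sample  = "17.808 15.749 6.649 -0.548 15.9994"
val product = sample.split(' ').view(3, 5).map(_.toFloat).reduce(_ * _)
// product is roughly -8.77 (-0.548 * 15.9994)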
You could define ansRDD as:
val ansRDD = rawRDD.map(item => { val comps = item.split(" "); (comps(3), comps(4)) })
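With that tuple RDD in hand, the per-row product the question is after could then be computed like this (converting the string fields to Float first):
val products = ansRDD.map { case (a, b) => a.toFloat * b.toFloat }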