I am reading data from two Hive tables. The token table has the tokens that need to be matched against the input data. The input data has a description column along with other columns. I need to split the description and compare each split element with every element from the token table.
Currently I am using the me.xdrop.fuzzywuzzy.FuzzySearch library for the fuzzy match.
Below is my code snippet:
val tokens = sqlContext.sql("select token from tokens")
val desc = sqlContext.sql("select description from desceriptiontable")
val desc_tokens = desc.flatMap(_.toString().split(" "))
Now I need to iterate over desc_tokens; each element of desc_tokens should be fuzzy matched with each element of tokens, and if the match exceeds 85% I need to replace the element from desc_tokens with the element from tokens.
Example --
My token list is
hello
this
is
token
file
sample
and my input description is
helo this is input desc sampl
code should return
hello this is input desc sample
since hello and helo fuzzy match above 85%, helo is replaced by hello. Similarly for sampl.
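For what it's worth, a minimal sketch of that replacement using the me.xdrop.fuzzywuzzy library mentioned above, assuming the token list is small enough to hold in a plain Scala collection (the names here are illustrative):
import me.xdrop.fuzzywuzzy.FuzzySearch

// Token list, assumed small enough to keep in memory.
val tokenList = Seq("hello", "this", "is", "token", "file", "sample")

def replaceWord(word: String): String = {
  // FuzzySearch.ratio returns a similarity score between 0 and 100.
  val (best, score) = tokenList.map(t => (t, FuzzySearch.ratio(word, t))).maxBy(_._2)
  if (score > 85) best else word
}

val description = "helo this is input desc sampl"
val fixed = description.split(" ").map(replaceWord).mkString(" ")
// expected: "hello this is input desc sample"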
I made a test with this library: https://github.com/rockymadden/stringmetric
Another idea (not optimized):
// The stringmetric library provides JaroMetric
import com.rockymadden.stringmetric.similarity.JaroMetric

// I changed the order of the tokens
val tokens = Array("this", "is", "sample", "token", "file", "hello")
val desc_tokens = Array("helo", "this", "is", "token", "file", "sampl")

val res = desc_tokens.map { str =>
  // Compute the score between str and every token
  val elem = tokens.zipWithIndex.map { case (tok, index) => (tok, index, JaroMetric.compare(str, tok).get) }
  // Take the token with the maximum score
  val emax = elem.maxBy { case (_, _, score) => score }
  // If emax has a score > 0.85, use that token; otherwise keep the input
  if (emax._3 > 0.85) tokens(emax._2) else str
}
res.foreach { println }
My output:
hello
this
is
token
file
sample
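To run the same idea over the two DataFrames from the question, one option (a sketch, not tested at scale) is to collect the presumably small token table to the driver and broadcast it, reusing JaroMetric as above; sc is the SparkContext:
// Collect the token column and broadcast it to the executors.
val tokenArray = tokens.rdd.map(_.getString(0)).collect()
val tokensBc = sc.broadcast(tokenArray)

val fixedDesc = desc.rdd.map(_.getString(0)).map { description =>
  description.split(" ").map { word =>
    // Score the word against every token and keep the best match.
    val (best, score) = tokensBc.value
      .map(tok => (tok, JaroMetric.compare(word, tok).getOrElse(0.0)))
      .maxBy(_._2)
    if (score > 0.85) best else word
  }.mkString(" ")
}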
I would like to have a Dataset, where the first column contains single words and the second column contains the filenames of the files where these words appear.
My current code looks something like this:
val path = "path/to/folder/with/files"
val tokens = spark.read.textFile(path)
  .flatMap(line => line.split(" "))
  .withColumn("filename", input_file_name())
tokens.show()
However this returns something like
|word1 |whole/path/to/some/file |
|word2 |whole/path/to/some/file |
|word1 |whole/path/to/some/otherfile|
(I don't need the whole path, just the last bit). My idea to fix this was to use the map function
val tokensNoPath = tokens.
map(r => (r(0), r(1).asInstanceOf[String].split("/").lastOption))
So basically, just going to every row, grabbing the second entry and deleting everything before the last slash.
However, since I'm very new to Spark and Scala, I can't figure out how to get the syntax for this right.
Docs:
substring_index "substring_index(str, delim, count) Returns the substring from str before count occurrences of the delimiter delim... If count is negative, everything to the right of the final delimiter (counting from the right) is returned."
.withColumn("filename", substring_index(input_file_name, "/", -1))
You can split by slash and get the last element:
val tokens2 = tokens.withColumn("filename", element_at(split(col("filename"), "/"), -1))
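Putting it together, a minimal end-to-end sketch using the path and column handling from the question (the "word" column name is just an example):
import org.apache.spark.sql.functions.{input_file_name, substring_index}
import spark.implicits._

val path = "path/to/folder/with/files"

val tokens = spark.read.textFile(path)
  .flatMap(line => line.split(" "))
  .withColumn("filename", substring_index(input_file_name(), "/", -1))
  .toDF("word", "filename")

tokens.show(truncate = false)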
I have a text file, and I want to pad the output fields in the file as shown in Exp1 and Exp2 below.
What should I do?
This is my input:
a
a a
a a a
a a a a
a a a a a
Exp1. When a record in the file has fewer than n=4 fields, fill the remaining fields with the _ character.
a _ _ _
a a _ _
a a a _
a a a a
a a a a a
Exp2. Same as above, but delete any fields after the n=4th field when the number of fields in the record exceeds n.
a _ _ _
a a _ _
a a a _
a a a a
a a a a
My code:
val df = spark.read.text("data.txt")
val result = df.columns.foldLeft(df) { (newdf, colname) =>
  newdf.withColumnRenamed(colname, colname.replace("a", "_"))
}
result.show
This resembles a homework-style problem, so I will help guide you based on your provided code and try to lead you on the right path here.
Your current code is only changing the name of the columns. In this case, the column name "value" is being changed to "v_lue".
You want to change the actual records themselves.
First, you want to read this data into an RDD. It can be done with a dataframe, but being able to map on the row strings
instead of Row objects might make this easier to understand conceptually. I'll get you started.
val data = sc.textFile("data.txt")
Data will be an RDD of strings, where each element is a line in the data file.
We're going to want to map this data to some new data, and transform each row.
data.map(row => {
// transform each row here
})
Inside this map we make some change to row, which is a string. The code inside applies to every string in the RDD.
You will probably want to split the row to get an array of strings, so that you can count how many occurrences
of 'a' there are. Depending on the size of the array, you will want to create a new string and output that from this map.
If there are fewer 'a's than n, you will probably want to create a string with enough '_'s. If there are too many,
you will probably want to return a string with the correct number.
Hope this helps.
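For reference, a minimal sketch of the padding/truncation logic described above, assuming n = 4 and space-separated fields:
val n = 4
val data = sc.textFile("data.txt")

// Exp1: pad short records with "_" up to n fields; leave longer records alone.
val padded = data.map { row =>
  val fields = row.split(" ")
  if (fields.length < n) (fields ++ Array.fill(n - fields.length)("_")).mkString(" ")
  else row
}

// Exp2: pad as above, but also drop any fields beyond the nth.
val truncated = data.map { row =>
  val fields = row.split(" ")
  (fields.take(n) ++ Array.fill(math.max(0, n - fields.length))("_")).mkString(" ")
}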
I am trying to filter the good and bad rows by counting the number of delimiters in a TSV.gz file and write them to separate files in HDFS.
I ran the below commands in spark-shell
Spark Version: 1.6.3
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(line => line.split("\t"))
val good = data.filter(a => a.size == 995)
val bad = data.filter(a => a.size < 995)
When I checked the first record, the value could be seen in the spark shell:
good.first()
But when I try to write to an output file, I see the records below:
good.saveAsTextFile("good.tsv")
Output in HDFS (top 2 rows):
[Ljava.lang.String;#1287b635
[Ljava.lang.String;#2ef89922
Could you please let me know how to get the required output file in HDFS?
Thanks!
Your final RDD is of type org.apache.spark.rdd.RDD[Array[String]], which leads to writing the objects' default toString representation instead of string values in the write operation.
You should convert each array of strings back to a tab-separated string before saving. Just try:
good.map(item => item.mkString("\t")).saveAsTextFile("goodFile.tsv")
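If you also want the bad rows in their own HDFS location, the same conversion applies (the output paths below are just examples):
// Join the split fields back with tabs before writing each RDD out.
good.map(_.mkString("\t")).saveAsTextFile("/abc/good_tsv")
bad.map(_.mkString("\t")).saveAsTextFile("/abc/bad_tsv")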
My requirement is to retrieve the order number from the comment column; it always starts with R. The order number should be added as a new column to the table.
Input data:
code,id,mode,location,status,comment
AS-SD,101,Airways,hyderabad,D,order got delayed R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged
TY-OP,103,Airways,Pune,D,Order number R5463 not received
Expected output:
AS-SD,101,Airways,hyderabad,D,order got delayed R1657,R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged,R7856
TY-OP,103,Airways,Pune,D,Order number R5463 not received,R5463
I have tried it in Spark SQL; the query I am using is given below:
val r = sqlContext.sql("select substring(comment, PatIndex('%[0-9]%',comment, length(comment))) as number from A")
However, I'm getting the following error:
org.apache.spark.sql.AnalysisException: undefined function PatIndex; line 0 pos 0
You can use regexp_extract, which has the definition:
def regexp_extract(e: Column, exp: String, groupIdx: Int): Column
(R\\d{4}) means R followed by 4 digits. You can easily accommodate any other case with a suitable regex, e.g. (R\\d+) if the number of digits varies.
df.withColumn("orderId", regexp_extract($"comment", "(R\\d{4})" , 1 )).show
+-----+---+-------+---------+------+--------------------+-------+
| code| id| mode| location|status| comment|orderId|
+-----+---+-------+---------+------+--------------------+-------+
|AS-SD|101|Airways|hyderabad| D|order got delayed...| R1657|
|FY-YT|102|Airways| Delhi| ND|R7856 package dam...| R7856|
|TY-OP|103|Airways| Pune| D|Order number R546...| R5463|
+-----+---+-------+---------+------+--------------------+-------+
You can use a udf function as follows:
import org.apache.spark.sql.functions._
def extractString = udf((comment: String) => comment.split(" ").filter(_.startsWith("R")).head)
df.withColumn("newColumn", extractString($"comment")).show(false)
Here the comment column is split on spaces and only the words that start with R are kept; head takes the first such word.
Updated
To ensure that the returned string is an order number starting with R and that the rest of the string is digits, you can add an additional filter:
import scala.util.Try
def extractString = udf((comment: String) => comment.split(" ").filter(x => x.startsWith("R") && Try(x.substring(1).toDouble).isSuccess).head)
You can edit the filter according to your need.
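As a side note, .head will throw an exception on comments that contain no matching token; a hedged variant using headOption returns null instead, so the new column is simply empty for those rows:
import org.apache.spark.sql.functions._
import scala.util.Try

def extractStringSafe = udf((comment: String) =>
  comment.split(" ")
    .filter(x => x.startsWith("R") && Try(x.substring(1).toDouble).isSuccess)
    .headOption
    .orNull
)

df.withColumn("newColumn", extractStringSafe($"comment")).show(false)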
I'm trying to extract data from an RDD[String] into another RDD[String].
The RDD contains data similar to this:
17.808 15.749 6.649 -0.548 15.9994
I need to multiply the 4th and 5th fields of each row and store them in a different RDD[String].
I can use the following code to pull out one field:
val ansRDD = rawRDD.map(_.split(" ")(4)).map(_.toFloat)
rawRDD contains the strings.
But I need to pull out both the fields into a single RDD as
-0.548 15.9994
so that I can simply do
val answer = ansRDD.map { case (a, b) => a * b }
You could use:
rawRDD.map(_.split(' ').slice(3, 5).map(_.toFloat).reduce(_ * _).toString)
You could define ansRDD as:
val ansRDD = rawRDD.map { item => val comps = item.split(" "); (comps(3), comps(4)) }
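Note that comps(3) and comps(4) are still strings in that version, so a small follow-up sketch to get the products the question asks for:
// Convert both fields to Float before multiplying, e.g. -0.548 * 15.9994 ≈ -8.77
val products = ansRDD.map { case (a, b) => a.toFloat * b.toFloat }
products.collect().foreach(println)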