Scala: Handling special encoding characters in String

I have a string containing many umlauts (ä, ö, ü) and the euro (€) symbol. Is there a library or an existing method in Scala that transforms them to (a, o, u) and "Euro" (or its equivalent) respectively?
I am aware of similar libraries in Python that do the job, but I can't seem to find one in Scala.
Consider this example: val str = "Köln and München are great cities. The average bus ticket costs €4.5"
I want it to be converted to something like this, or equivalent: "Koln and Munchen are great cities. The average bus ticket costs Euros 4.5"

You can build your own translator with whatever rules you need to apply.
val str = "Köln and München are great cities. The average bus ticket costs €4.5"
// Characters without a rule fall through unchanged via withDefault.
val deUm: Map[Char, String] =
  Map('ä' -> "a", 'ö' -> "o", 'ü' -> "u", '€' -> "Euros ").withDefault(_.toString)
str.flatMap(deUm(_))
//res0: String = Koln and Munchen are great cities. The average bus ticket costs Euros 4.5

Do you really need a library for this? You can just use the string replace function, as shown here:
http://gordon.koefner.at/blog/coding/replacing-german-umlauts/

Related

NLP ELMo model pruning input

I am trying to retrieve embeddings for words based on the pretrained ELMo model available on tensorflow hub. The code I am using is modified from here: https://www.geeksforgeeks.org/overview-of-word-embedding-using-embeddings-from-language-models-elmo/
The sentence that I am inputting is
bod =" is coming up in and every project is expected to do a video due on we look forward to discussing this with you at our meeting this this time they have laid out the selection criteria for the video award s go for the top spot this time "
and these are the keywords I want embeddings for:
words=["do", "a", "video"]
embeddings = elmo([bod],
                  signature="default",
                  as_dict=True)["elmo"]
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
This sentence is 236 characters long.
But when I put this sentence into the ELMo model, the tensor that is returned only contains a sequence of length 48.
This becomes a problem when I try to extract embeddings for keywords that fall outside the 48-length limit, because the indices of the keywords are outside this length.
This is the code I used to get the indices for the words in 'bod' (as shown above):
num_list = []
for item in words:
    print(item)
    index = bod.index(item)
    num_list.append(index)
num_list
But I keep running into this error:
I tried looking for ELMo documentation to explain why this is happening but I have not found anything related to this problem of pruned input.
Any advice is much appreciated!
Thank You
This is not really an AllenNLP issue since you are using a tensorflow-based implementation of ELMo.
That said, I think the problem is that ELMo embeds tokens, not characters. You are getting 48 embeddings because the string has 48 tokens.
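The token-versus-character distinction can be checked without running the model. This is a minimal sketch, assuming the hub module's "default" signature splits its input on whitespace (worth confirming against the module documentation). The names tokens and token_indices are introduced here for illustration only.

```python
bod = " is coming up in and every project is expected to do a video due on we look forward to discussing this with you at our meeting this this time they have laid out the selection criteria for the video award s go for the top spot this time "
words = ["do", "a", "video"]

# Assumption: the model sees whitespace-separated tokens, not characters.
tokens = bod.split()

# Position of each keyword in the token sequence (first occurrence).
token_indices = [tokens.index(word) for word in words]

print(len(tokens))    # 48
print(token_indices)  # [10, 11, 12]
```

By contrast, bod.index("a") returns a character offset, and it matches the "a" inside "and" rather than the standalone word, so character offsets cannot be used to index the embedding tensor.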

Conditional text replacement

I'm pre-processing some ecommerce product titles, such as:
Input:
1. Jersey Shore: Family Vac Season 2
2. Robotic Vac Cleaner with Max Power Suction
Notice that both titles contain the word Vac. I would like to correct only the second one, replacing it with Vacuum.
Desired output:
1. Jersey Shore: Family Vac Season 2
2. Robotic Vacuum Cleaner with Max Power Suction
I could write an algorithm (for instance, checking whether the string contains "clean" or "suction"), but first I would like to know if there are any frameworks, libraries, etc. that already do this kind of task. It seems to be a common problem. It could be in any language (Java, Python, C, etc.).
I assume you are getting those titles from an API, or are they hardcoded in the site?
If you have them in JSON format, or even in something simpler such as an array of strings:
var products = [{
  'title': "Jersey Shore: Family Vac Season 2",
}, {
  'title': 'Robotic Vac Cleaner with Max Power Suction',
}]
There is a very useful JavaScript tool for this, Fuse.js:
https://fusejs.io/
With it you can search and also set parameters such as:
threshold -> controls whether you want a perfect match or also similar words, etc.
Visit the site; the documentation is great.
After that, you can use JavaScript's String.prototype.replace to substitute the word you want, in this case Vacuum:
https://developer.mozilla.org/es/docs/Web/JavaScript/Referencia/Objetos_globales/String/replace
You then get your final object or array to place on your ecommerce site.
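The rule-based idea from the question (only expand Vac when the title also mentions cleaning or suction) can be sketched in a few lines. This is a minimal illustration, not a library: the function name expand_vac and the cue list CONTEXT_CUES are hypothetical, and the cues are just the two words suggested in the question.

```python
import re

# Assumption: these two cue words (taken from the question) are enough
# to tell a vacuum-cleaner title from a "vacation" title.
CONTEXT_CUES = ("clean", "suction")


def expand_vac(title: str) -> str:
    """Expand the standalone word 'Vac' to 'Vacuum' only when a cue word is present."""
    lowered = title.lower()
    if any(cue in lowered for cue in CONTEXT_CUES):
        # \b keeps words such as "Vacation" from being touched.
        return re.sub(r"\bVac\b", "Vacuum", title)
    return title


titles = [
    "Jersey Shore: Family Vac Season 2",
    "Robotic Vac Cleaner with Max Power Suction",
]
for title in titles:
    print(expand_vac(title))
```

A fixed cue list will miss titles that describe the product differently, so it is a starting point rather than a general solution.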

UIMA Ruta StringList

Is there a way to iterate StringList in Ruta, provided the strings in the StringList are not present in the input document?
Sample StringList
Television
Computer
Tablet
Sound
Sample Input Document
Flat-screen televisions for sale at a consumer electronics store in 2008.
Television (TV), sometimes shortened to tele or telly is a telecommunication medium used for transmitting moving images in monochrome (black and white), or in colour, and in two or three dimensions and sound. The term can refer to a television set, a television program ("TV show"), or the medium of television transmission. Television is a mass medium for advertising, entertainment and news.
Problem
I want to get the values Computer and Tablet as a result in the output CAS (say, as an annotation or a feature). Is there a way to do so?
There is currently (2.6.1) no way to iterate over a StringList in Ruta as far as I know.
Do you want to return all entries that are not present in the text?
I have not tried it, but you could perhaps use multiple lists: add an entry to one list if it occurs in the text and to the other if it does not. Then you store the second StringList in a feature.
(I would probably use a simple Java analysis engine instead of Ruta.)
DISCLAIMER: I am a developer of UIMA Ruta
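The logic such an analysis engine would need is small. The sketch below is written in Python only to illustrate the idea and does not use any UIMA or Ruta API. The function name missing_entries is hypothetical, and matching is assumed to be case-insensitive on whole words.

```python
import re


def missing_entries(entries, text):
    """Return the list entries that do not occur as a whole word in the text."""
    # Assumption: case-insensitive, whole-word matching is what is wanted.
    words_in_text = {word.lower() for word in re.findall(r"\w+", text)}
    return [entry for entry in entries if entry.lower() not in words_in_text]


string_list = ["Television", "Computer", "Tablet", "Sound"]
document = (
    "Television (TV), sometimes shortened to tele or telly is a "
    "telecommunication medium used for transmitting moving images in "
    "monochrome (black and white), or in colour, and in two or three "
    "dimensions and sound."
)

print(missing_entries(string_list, document))  # ['Computer', 'Tablet']
```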

java.lang.NumberFormatException: For input string: "Some(12)"

Can anyone please tell me what is wrong with my code?
Below is my Spark code in Scala:
import java.text.SimpleDateFormat
import org.apache.spark.sql.SparkSession
import scala.xml.XML
object TopTenTags09 {
  def main(args:Array[String]){
    val format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS")
    val format2 = new SimpleDateFormat("yyyy-MM")
    val spark = SparkSession.builder().appName("Number of posts which are questions and contains specified words").master("local").getOrCreate()
    val data = spark.read.textFile("/home/harsh/Hunny/HadoopPractice/Spark/DF/StackOverFlow/Posts.xml").rdd
    val result = data.filter{line=>{line.trim().startsWith("<row")}}
      .filter{line=>{line.contains("PostTypeId=\"1\"")}}
      .map { line=>{
        val xml = XML.loadString(line)
        if(xml.attribute("Tags").mkString.toLowerCase().contains("hadoop") ||
            xml.attribute("Tags").mkString.toLowerCase().contains("spark")){
          (Integer.parseInt(xml.attribute("Score").toString()),Integer.parseInt(xml.attribute("Score").toString()))
        }
      }}/*.filter(line=>line._1>2)
      .sortByKey(false)*/
    result.foreach(println) //throwing error while printing
    spark.stop
  }
}
And below is the error I am getting while running it:
java.lang.NumberFormatException: For input string: "Some(12)"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
I am new to Spark, and the error is driving me crazy because, as far as I can see, there is no "Some" in the code or in the data. Can anyone help me, please?
Sample data
<row Id="5" PostTypeId="1" CreationDate="2014-05-13T23:58:30.457" Score="7" ViewCount="286" Body="<p>I've always been interested in machine learning, but I can't figure out one thing about starting out with a simple "Hello World" example - how can I avoid hard-coding behavior?</p>
<p>For example, if I wanted to "teach" a bot how to avoid randomly placed obstacles, I couldn't just use relative motion, because the obstacles move around, but I don't want to hard code, say, distance, because that ruins the whole point of machine learning.</p>
<p>Obviously, randomly generating code would be impractical, so how could I do this?</p>
" OwnerUserId="5" LastActivityDate="2014-05-14T00:36:31.077" Title="How can I do simple machine learning without hard-coding behavior?" Tags="<machine-learning>" AnswerCount="1" CommentCount="1" FavoriteCount="1" ClosedDate="2014-05-14T14:40:25.950" />
<row Id="7" PostTypeId="1" AcceptedAnswerId="10" CreationDate="2014-05-14T00:11:06.457" Score="2" ViewCount="266" Body="<p>As a researcher and instructor, I'm looking for open-source books (or similar materials) that provide a relatively thorough overview of data science from an applied perspective. To be clear, I'm especially interested in a thorough overview that provides material suitable for a college-level course, not particular pieces or papers.</p>
" OwnerUserId="36" LastEditorUserId="97" LastEditDate="2014-05-16T13:45:00.237" LastActivityDate="2014-05-16T13:45:00.237" Title="What open-source books (or other materials) provide a relatively thorough overview of data science?" Tags="<education><open-source>" AnswerCount="3" CommentCount="4" FavoriteCount="1" ClosedDate="2014-05-14T08:40:54.950" />
<row Id="9" PostTypeId="2" ParentId="5" CreationDate="2014-05-14T00:36:31.077" Score="4" Body="<p>Not sure if this fits the scope of this SE, but here's a stab at an answer anyway.</p>
<p>With all AI approaches you have to decide what it is you're modelling and what kind of uncertainty there is. Once you pick a framework that allows modelling of your situation, you then see which elements are "fixed" and which are flexible. For example, the model may allow you to define your own network structure (or even learn it) with certain constraints. You have to decide whether this flexibility is sufficient for your purposes. Then within a particular network structure, you can learn parameters given a specific training dataset.</p>
<p>You rarely hard-code behavior in AI/ML solutions. It's all about modelling the underlying situation and accommodating different situations by tweaking elements of the model.</p>
<p>In your example, perhaps you might have the robot learn how to detect obstacles (by analyzing elements in the environment), or you might have it keep track of where the obstacles were and which way they were moving.</p>
" OwnerUserId="51" LastActivityDate="2014-05-14T00:36:31.077" CommentCount="0" />
<row Id="10" PostTypeId="2" ParentId="7" CreationDate="2014-05-14T00:53:43.273" Score="9" Body="<p>One book that's freely available is "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman (published by Springer): <a href="http://statweb.stanford.edu/~tibs/ElemStatLearn/">see Tibshirani's website</a>.</p>
<p>Another fantastic source, although it isn't a book, is Andrew Ng's Machine Learning course on Coursera. This has a much more applied-focus than the above book, and Prof. Ng does a great job of explaining the thinking behind several different machine learning algorithms/situations.</p>
" OwnerUserId="22" LastActivityDate="2014-05-14T00:53:43.273" CommentCount="1" />
<row Id="14" PostTypeId="1" CreationDate="2014-05-14T01:25:59.677" Score="14" ViewCount="686" Body="<p>I am sure data science as will be discussed in this forum has several synonyms or at least related fields where large data is analyzed.</p>
<p>My particular question is in regards to Data Mining. I took a graduate class in Data Mining a few years back. What are the differences between Data Science and Data Mining and in particular what more would I need to look at to become proficient in Data Mining?</p>
" OwnerUserId="66" LastEditorUserId="322" LastEditDate="2014-06-17T16:17:20.473" LastActivityDate="2014-06-20T17:36:05.023" Title="Is Data Science the Same as Data Mining?" Tags="<data-mining><definitions>" AnswerCount="4" CommentCount="1" FavoriteCount="2" />
I assume that
Integer.parseInt(xml.attribute("Score").toString())
throws the above-mentioned exception because xml is of type Elem, and calling the attribute method on it returns an Option[Seq[Node]], not just a single string with the number. Calling toString on that Option yields "Some(12)", which parseInt cannot parse.
You probably want to replace both occurrences of the above expression with
Integer.parseInt(xml.attribute("Score").get.toString())
Moreover, you could also replace the cumbersome Integer.parseInt with
xml.attribute("Score").get.toString.toInt
Isolated demo:
scala> val e = XML.loadString("""<foo Score="42" Bar="58"/>""")
e: scala.xml.Elem = <foo Bar="58" Score="42"/>
scala> e.attribute("Score").get.toString.toInt
res4: Int = 42

Searching Array for similar spelled words/objects

I am looking for a solution similar to Google's "Did You Mean: Word"
I have an array of vehicles entered as
2011ChevroletMalibu
2011FordF150
2009FordProbe
etc...
In my app I have three Textfields.
Year Make Model .
When the user types in 2011 Chevrolet Malabu (notice that Malibu is misspelled) and hits search...
I'd like to reply "Did you mean: 2011 Chevrolet Malibu".
Anyone have any suggestions on how to "search for similar"? Thanks!
AFAIK there is no function in the iOS SDK that would solve this for you; however, there are many algorithms available that do just that.
Look for:
Phonetic Soundex String Comparison
Levenshtein Distance
Oliver 1993
Louie, I think it is not that simple. You need to use some kind of phonetic search. If your app uses data provided by a web service and you have MS SQL Server 2000 or later behind that web service, you can use the SOUNDEX function in your searches. If you want to implement your own phonetic search, though, it is a big challenge.
Look at Soundex "hashes" or, if you have the CPU for it, the Levenshtein distance. Soundex will be cheaper to compute; the Levenshtein distance will give better results.
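The Levenshtein approach mentioned above can be sketched as follows. This is an illustrative Python version rather than iOS code. The names levenshtein and did_you_mean are introduced here, and picking the single closest entry with no distance cutoff is an assumption that a real app would likely refine.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def did_you_mean(query: str, candidates):
    """Return the candidate with the smallest case-insensitive edit distance."""
    return min(candidates, key=lambda c: levenshtein(query.lower(), c.lower()))


vehicles = ["2011ChevroletMalibu", "2011FordF150", "2009FordProbe"]
# The three text fields (Year, Make, Model) concatenated as in the stored array.
print(did_you_mean("2011" + "Chevrolet" + "Malabu", vehicles))  # 2011ChevroletMalibu
```

Comparing against every entry costs time proportional to the array size times the string lengths, which is fine for a short vehicle list; for large lists a cheaper prefilter such as Soundex could narrow the candidates first.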