Partial text match - UIMA Ruta

Is a partial text match possible in Ruta (WORDTABLE)?
Sample Input:
yearbook
book
databook
worship
friendship
yearbook
Sample CSV:
book;b.
ship;sh.
I have a sample CSV file and a sample input where I need to match words that end with "book" or "ship" and assign the feature value from column 2.

Yes, you could make use of regular expressions:
DECLARE MyWord(String ending);
W{REGEXP(".*book") -> CREATE(MyWord, "ending"="b.")};
Or you could use string functions:
w:W{endsWith(w.ct, "book") -> CREATE(MyWord, "ending"="b.")};

Related

In Spark Scala, how to create a column with substring() using locate() as a parameter?

I have a dataset that is like the following:
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
I want to create a column that will be a string containing only the values after id. The result would be something like:
val df_result = Seq(("samb id 12",12), ("car id 13",13), ("lxu id 88",88)).toDF("list", "id_value")
For that, I am trying to use substring. For the parameter giving the starting position of the substring, I am trying to use locate. But it gives me an error saying that the argument should be an Int and not a Column type.
What I am trying is like:
df
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
The error I get is:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
^
How can I fix this and continue using locate() as a parameter?
UPDATE
Updating to give an example in which @wBob's answer doesn't work for my real-world data: my data is indeed a bit more complicated than the examples above.
It is something like this:
val df = Seq(":option car, lorem :ipsum: :ison, ID R21234, llor ip", "lst ID X49329xas ipsum :ion: ip_s-")
The values are very long strings that don't have a specific pattern.
Somewhere in the string there is always a part written as ID XXXXX. The XXXXX varies, but it is always the same size (5 characters) and always comes after "ID ".
I have not been able to use either split or regexp_extract to get something matching this pattern.
It is not clear if you want the third item or the first number from the list, but here are a couple of examples which should help:
// Assign sample data to dataframe
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
df
.withColumn("t1", split($"list", "\\ ")(2))
.withColumn("t2", regexp_extract($"list", "\\d+", 0))
.withColumn("t3", regexp_extract($"list", "(id )(\\d+)", 2))
.withColumn("t4", regexp_extract($"list", "ID [A-Z](\\d{5})", 1))
.show()
You can use functions like split and regexp_extract with withColumn to create new columns based on existing values. split breaks the list column into an array based on the delimiter you pass in; I have used a space here, escaped with two backslashes. The array is zero-based, hence specifying 2 gets the third item in the array. regexp_extract uses regular expressions to extract from strings; here I've used \\d, which represents a digit, and +, which matches the digit one or more times. The third column, t3, again uses regexp_extract with a similar regex, but uses brackets to group up sections and 2 to get the second group from the regex, i.e. the (\\d+). NB I'm using additional backslashes in the regex to escape the backslash used in \d.
If your real data is more complicated please post a few simple examples where this code does not work and explain why.
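Separately, to keep locate() as a parameter as the original question asks: Column.substr, unlike functions.substring, accepts Column arguments, so a sketch like the following should work (the +3 offset assumes the digits always start one space after id):
import org.apache.spark.sql.functions.{col, lit, locate}

val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")

// locate returns a 1-based position as a Column; Column.substr takes
// Column arguments for both start and length, so no Int is needed.
df.withColumn("id_value", col("list").substr(locate("id", col("list")) + 3, lit(2)))
  .show()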

Filter pyspark DataFrame by partial string match

I would like to check for a substring match between the comments and keywords columns and find whether any one of the keywords is present in that particular row.
name: ['paul', 'john', 'max', 'peter']
comments: ['account is active', 'account is activated', 'account is activateds', 'account is activ']
keywords: ['active,activated,activ', 'active,activated,activ', 'active,activated,activ', 'null']
expected output
match = ['True','True','True','True']
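One possible approach (sketched here in Scala; the pyspark API is analogous) is to split the keywords column on commas and test whether any piece occurs in comments. The sketch assumes Spark 3.0+ for the exists function, and it returns false for the row whose keyword list is null, so adjust that if the expected output above really wants True there:
import org.apache.spark.sql.functions._

// Invented reconstruction of the sample data above.
val df = Seq(
  ("paul",  "account is active",     "active,activated,activ"),
  ("john",  "account is activated",  "active,activated,activ"),
  ("max",   "account is activateds", "active,activated,activ"),
  ("peter", "account is activ",      null)
).toDF("name", "comments", "keywords")

// exists (Spark 3.0+) is true if any keyword is a substring of comments;
// coalesce maps the null-keywords row to false.
df.withColumn(
  "match",
  coalesce(exists(split($"keywords", ","), k => $"comments".contains(k)), lit(false))
).show(false)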

pyspark: how to get_json_object for names with spaces (or other weird characters in the name)?

Normally,
F.get_json_object(name, "$.element_name")
works fine to extract the element_name from a JSON object like this
{"element_name" : 1}
But what if the name has a space in it? How do I quote the name?
{"element name" : 1}
This obviously doesn't work:
F.get_json_object(name, "$.elementname")
Normally this is not a pyspark-specific problem, but it seems that pyspark (and maybe Java) can have a slightly different spec for the JSONPath.
For Spark, one of the following two should work: (1) dot notation .name, where name must not contain any dot . or opening bracket [; or (2) bracket notation ['name'], where name must not contain any single quote ' or question mark ?. For example:
F.get_json_object('name', "$['element name']")
F.get_json_object('name', "$.element name")
See below, from the source code of Spark's Scala JsonPathParser:
// parse `.name` or `['name']` child expressions
def named: Parser[List[PathInstruction]] =
  for {
    name <- '.' ~> "[^\\.\\[]+".r | "['" ~> "[^\\'\\?]+".r <~ "']"
  } yield {
    Key :: Named(name) :: Nil
  }
Thus, if the name contains a dot or an opening bracket, use ['name']; if the name contains a single quote or a question mark, use .name; otherwise you can use either one. More examples of working expressions:
F.get_json_object('name', "$.Trader Joe's")
F.get_json_object('name', "$['amazon.com']")
For JSON keys that have names that are unfriendly to properties, you'll need to use the indexer syntax.
$['element name']
(Per the parser source above, Spark accepts single quotes here; many other JSONPath implementations allow double quotes as well.)
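As a quick check of both notations in the Scala shell (the column name payload is made up for the example):
import org.apache.spark.sql.functions.get_json_object

val df = Seq("""{"element name": 1}""").toDF("payload")

// Both path styles should return 1 for the spaced key.
df.select(
  get_json_object($"payload", "$['element name']").alias("bracket"),
  get_json_object($"payload", "$.element name").alias("dot")
).show()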

kdb+: Save table into a csv file

I have the below table "dates"; it has a sym column with symbols and a d column with lists of strings, and I would like to save it into a regular CSV file. I couldn't find a good way to do it. Any suggestions?
q)dates
sym d
----------------------------------------------------------------------------
6AH0 "1970.03.16" "1980.03.17" "1990.03.19" "2010.03.15"
6AH6 "1976.03.15" "1986.03.17" "1996.03.18" "2016.03.14"
6AH7 "1977.03.14" "1987.03.16" "1997.03.17" "2017.03.13"
6AH8 "1978.03.13" "1988.03.14" "1998.03.16" "2018.03.19"
6AH9 "1979.03.19" "1989.03.13" "1999.03.15" "2019.03.18"
When I try to do a regular save, the below error happens:
q)save `:dates.csv
k){$[t&77>t:#y;$y;x;-14!'y;y]}
'type
q))
The internal table->csv conversion function within Kdb+ is not able to handle nested lists in columns. The d column in your table is a list of list of chars. However, the conversion function is able to handle a simply nested column (depth of 1).
Therefore, you can convert the d column to a list of chars and then save to CSV using the internal function:
/ generate a table of dummy data
q)show dates:flip `sym`d!(`6AH0`6AH6`6AH7;string (3;0N)#12?.z.d)
sym d
--------------------------------------------------------
6AH0 "2008.02.04" "2015.01.02" "2003.07.05" "2005.02.25"
6AH6 "2012.10.25" "2008.08.28" "2017.01.25" "2007.12.27"
6AH7 "2004.02.01" "2005.06.06" "2013.02.11" "2010.12.20"
/ convert 'd' column to simple list - the (" " sv') is the conversion func here
q)#[`dates;`d;" " sv']
`dates
/ review what was done
q)show dates
sym d
--------------------------------------------------
6AH0 "2008.02.04 2015.01.02 2003.07.05 2005.02.25"
6AH6 "2012.10.25 2008.08.28 2017.01.25 2007.12.27"
6AH7 "2004.02.01 2005.06.06 2013.02.11 2010.12.20"
/ save to csv
q)save `:dates.csv
`:dates.csv
/ review saved csv
q)\cat dates.csv
"sym,d"
"6AH0,2008.02.04 2015.01.02 2003.07.05 2005.02.25"
"6AH6,2012.10.25 2008.08.28 2017.01.25 2007.12.27"
"6AH7,2004.02.01 2005.06.06 2013.02.11 2010.12.20"
As per the CSV specification, you'll want to flatten the list out, separate the items with a comma, and double-quote the resulting field.
'save' is limited in that the file must be named the same as the global variable you are saving.
If I were tasked with your question, I'd do it like so:
`:myFileNamedWhatever.csv 0: csv 0: select sym,csv sv'd from dates
Explanation:
csv 0: table    / csv is a variable, literally defined as ",", which is good for readability; csv 0: table converts the table to a comma-separated list of strings
`:file 0: listOfStrings    / this takes a LIST of strings and writes them to the file handle; each element of the list becomes a new line in the file
I'd prefer this approach as it is general and allows the saving of various types. You can use it within a function, etc.
At a later date I decided that I wanted it saved as a pipe-separated (or anything-separated) file:
`:myNewFile.psv 0: "|" 0: select sym,"|"sv'd from dates

How to use orderby() with descending order in Spark window functions?

I need a window function that partitions by some keys (= column names), orders by another column, and returns the rows with the top x ranks.
This works fine for ascending order:
def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
  val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
  val w = Window.partitionBy(top_keys.head, top_keys.tail: _*) // all keys, not just the first
    .orderBy(top_value)
  val rankCondition = "rn < " + top_x
  df.withColumn("rn", row_number().over(w))
    .where(rankCondition)
    .drop("rn")
}
But when I try to change it to orderBy(desc(top_value)) or orderBy(top_value.desc) in line 4, I get a syntax error. What's the correct syntax here?
There are two versions of orderBy, one that works with strings and one that works with Column objects (API). Your code is using the first version, which does not allow for changing the sort order. You need to switch to the column version and then call the desc method, e.g., myCol.desc.
Now, we get into API design territory. The advantage of passing Column parameters is that you have a lot more flexibility, e.g., you can use expressions, etc. If you want to maintain an API that takes in a string as opposed to a Column, you need to convert the string to a column. There are a number of ways to do this and the easiest is to use org.apache.spark.sql.functions.col(myColName).
Putting it all together, we get
.orderBy(org.apache.spark.sql.functions.col(top_value).desc)
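For instance, a minimal end-to-end sketch (the data and column names here are invented for illustration):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical data: keep the top 2 salaries per department.
val df = Seq(("a", "eng", 100), ("b", "eng", 90), ("c", "eng", 80), ("d", "ops", 70))
  .toDF("name", "dept", "salary")

val w = Window.partitionBy("dept").orderBy(col("salary").desc)

df.withColumn("rn", row_number().over(w))
  .where(col("rn") <= 2)
  .drop("rn")
  .show()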
Say, for example, we need to order by a column called Date in descending order in the Window function; use the $ symbol before the column name, which will enable us to use the asc or desc syntax.
Window.orderBy($"Date".desc)
After specifying the column name in double quotes, add .desc, which will sort in descending order.
Column col = new Column("ts");
col = col.desc();
WindowSpec w = Window.partitionBy("col1", "col2").orderBy(col);