Remove Backslash in a word in R - numbers

I have been trying to do topic modeling for articles. I cleaned the raw data, which contains a lot of backslashes and numbers. Even after removing punctuation, backslashes, and numbers, I still get backslashes along with numbers in the top terms of topic 1.
The code snippet which I used for the preprocessing is
articles <- tm::tm_map(articles, content_transformer(tolower))
# Remove numbers
articles<- tm_map(articles, removeNumbers)
# Remove english common stopwords
articles<- tm_map(articles, removeWords, stopwords("english"))
# Remove punctuations
articles<- tm_map(articles, removePunctuation)
# Eliminate extra white spaces
articles <- tm_map(articles, stripWhitespace)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
articles <- tm_map(articles,toSpace, "\\\\" )
Even after this cleaning, backslashes and numbers still show up in the top terms of the topics:
design
robot
class
medical
device
wkh\003
students
dcbl
ri\003
course
The backslashes and numbers in the topics should not be there. Kindly help me with a solution.

You can use the stringr package. For example:
library(tidyverse)
df <- tibble(text = c("robot", "class", "medical", "device wkh\\003", "students", "dcbl", "ri\\003", "course", NA))
df %>%
  mutate(text = str_remove_all(text, "\\\\"))
# A tibble: 9 × 1
text
<chr>
1 robot
2 class
3 medical
4 device wkh003
5 students
6 dcbl
7 ri003
8 course
9 NA
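One thing worth checking first: a term displayed as wkh\003 may contain a literal backslash followed by digits, or it may contain a single control character (U+0003) that is only printed with an escape. The sketch below is a Python analogue rather than R, and strip_backslash_artifacts is a hypothetical helper name. It removes both forms so the difference is visible.

```python
import re

def strip_backslash_artifacts(term):
    """Hypothetical helper: remove literal backslashes and ASCII control characters."""
    term = term.replace("\\", "")            # literal backslash character
    # Control characters such as U+0003 (this range also covers tab and newline).
    return re.sub(r"[\x00-\x1f]", "", term)

literal = "wkh\\003"   # seven characters, including a real backslash
control = "wkh\x03"    # four characters, the last one is the control character U+0003

print(strip_backslash_artifacts(literal))  # wkh003
print(strip_backslash_artifacts(control))  # wkh
```

If the second form is what the corpus contains, a pattern that targets only the backslash character will not match anything, which would be consistent with the terms surviving the cleaning step.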


str_detect for multiple patterns

I am using str_detect within the stringr package and I am having trouble searching a string with more than one pattern.
Here is the code I am using, however it is not returning anything even though my vector ("Notes-Title") contains these patterns.
filter(str_detect(`Notes-Title`, c("quantity","single")))
The logic I want to code is:
Search each row and filter it if it contains the string "quantity" or "single".
You need to use the regex alternation operator | in your search, with both words inside a single pattern string.
> words <- c("quantity", "single", "double", "triple", "awful")
> set.seed(1234)
> df = tibble(col = sample(words,10, replace = TRUE))
> df
# A tibble: 10 x 1
col
<chr>
1 triple
2 single
3 awful
4 triple
5 quantity
6 awful
7 triple
8 single
9 single
10 triple
> df %>% filter(str_detect(col, "quantity|single"))
# A tibble: 4 x 1
col
<chr>
1 single
2 quantity
3 single
4 single
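The same idea in a Python sketch, for comparison only (the list contents are copied from the sample output above, and the variable names are illustrative): a single pattern with | matches a row if either word occurs anywhere in it. Passing a vector of patterns is different, because stringr pairs each pattern with a string position by position rather than trying every pattern on every row.

```python
import re

notes = ["triple", "single", "awful", "triple", "quantity",
         "awful", "triple", "single", "single", "triple"]

# One pattern with alternation: matches if either word occurs anywhere in the string.
pattern = re.compile(r"quantity|single")

matches = [note for note in notes if pattern.search(note)]
print(matches)  # ['single', 'quantity', 'single', 'single']
```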

How to get number of lines from RDD which contain any digits

Lines of the document as follows:
I am 12 year old.
I go to school.
I am playing.
Its 4 pm.
Two lines of the document contain numbers. I want to count how many lines in the document contain a number.
This is to be implemented in scala spark.
val lineswithnum=linesRdd.filter(line => (line.contains([^0-9]))).count()
I expect the output to be 2, but I am getting 0.
You can use the exists method:
val lineswithnum=linesRdd.filter(line => line.exists(_.isDigit)).count()
In line with your original approach and not discounting the other answer(s):
val textFileLines = sc.textFile("/FileStore/tables/so99.txt")
val linesWithNumCollect = textFileLines.filter(_.matches(".*[0-9].*")).count
The .* on both sides is needed because matches tests the whole line against the pattern, not a substring of it.
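To make the whole-string versus substring distinction concrete, here is a small Python sketch over the four sample lines (not Spark; the variable names are illustrative). re.fullmatch plays the role of matches, re.search looks for the pattern anywhere in the line, and the last count mirrors the exists(_.isDigit) answer.

```python
import re

lines = ["I am 12 year old.", "I go to school.", "I am playing.", "Its 4 pm."]

# Whole-string match on a bare digit class: no line consists of a single digit.
bare = sum(1 for line in lines if re.fullmatch(r"[0-9]", line))
# Whole-string match padded with .* so other characters are allowed around the digit.
padded = sum(1 for line in lines if re.fullmatch(r".*[0-9].*", line))
# Substring search needs no padding.
searched = sum(1 for line in lines if re.search(r"[0-9]", line))
# Character-level check, analogous to line.exists(_.isDigit).
by_char = sum(1 for line in lines if any(ch.isdigit() for ch in line))

print(bare, padded, searched, by_char)  # 0 2 2 2
```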

Finding the three longest substrings in a string using SPARQL on the Wikidata Query Service, and ranking them across strings

I'm trying to identify the longest three substrings from a string using SPARQL and the Wikidata Query Service and then rank:
the substrings within a string by length,
the strings by the lengths of any of those longest substrings.
I managed to identify the first and second substring from a string and could of course just create similar additional lines to tackle the problem, but this seems ugly and inefficient, so I am wondering if anyone here knows of a better way to get there.
This is a simplified version of the code, though I have left some auxiliary variables in that I am using for tracking progress on the way. You can try it here.
Clarification in response to this comment: if it is necessary to treat this query as a subquery and to feed it with results from another subquery, that's fine with me. To get an idea of the kinds of use I have in mind, see this demo.
SELECT * WHERE {
{
VALUES (?title) {
("What are the longest three words in this string?")
("A really complicated title")
("OneWordTitleInCamelCase")
("Thanks for your help!")
}
}
BIND(STRLEN(REPLACE(?title, " ", "")) AS ?titlelength)
BIND(STRBEFORE(?title, " ") AS ?substring1)
BIND(STRLEN(REPLACE(?substring1, " ", "")) AS ?substring1length)
BIND(STRAFTER(?title, " ") AS ?postfix)
BIND(STRLEN(REPLACE(?postfix, " ", "")) AS ?postfixlength)
BIND(STRBEFORE(?postfix, " ") AS ?substring2)
BIND(STRLEN(REPLACE(?substring2, " ", "")) AS ?substring2length)
}
ORDER BY DESC(?substring1length)
Expected results:
longsubstring substringlength
OneWordTitleInCamelCase 23
complicated 11
longest 7
really 6
string 6
Thanks 6
title 5
three 5
your 4
help 4
Actual results:
title | titlelength | substring1 | substring1length | postfix | postfixlength | substring2 | substring2length
Thanks for your help! | 18 | Thanks | 6 | for your help! | 12 | for | 3
What are the longest three words in this string? | 40 | What | 4 | are the longest three words in this string? | 36 | are | 3
A really complicated title | 23 | A | 1 | really complicated title | 22 | really | 6
OneWordTitleInCamelCase | 23 |  | 0 |  | 0 |  | 0
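To pin down what the expected results describe, here is a Python sketch of the ranking logic. It is not a SPARQL solution and says nothing about how to express this on the Wikidata Query Service. It assumes that words are runs of letters (so "string?" counts as "string") and that ties keep their order of appearance; longest_words is a hypothetical helper name.

```python
import re

titles = [
    "What are the longest three words in this string?",
    "A really complicated title",
    "OneWordTitleInCamelCase",
    "Thanks for your help!",
]

def longest_words(title, n=3):
    """Hypothetical helper: the n longest alphabetic words, ties in order of appearance."""
    words = re.findall(r"[A-Za-z]+", title)
    return sorted(words, key=len, reverse=True)[:n]  # sorted() is stable

# Pool the top three words of every title, then rank the pool by length.
ranked = sorted(
    (word for title in titles for word in longest_words(title)),
    key=len,
    reverse=True,
)

for word in ranked:
    print(word, len(word))
```

The printed words and lengths agree with the expected table; only the order among equal-length words may differ, since the question does not specify a tie-break.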

Splitting a list of strings using cut - KDB

For the following list:
q)a:("ua#1100#1";"sba#2220#2";"r#4444#a")
I want the following output:
("1100#1";"2220#2";"4444#a")
? gives the first index of #:
q)(a?\:"#")
2 3 1
but using cut does not give the desired result:
q)(a?\:"#")cut'a
(("ua";"#1";"10";"0#";"1");("sba";"#22";"20#";"2");("r";"#";"4";"4";"4";"4";"#";"a"))
You can also parse the data rather than drop chars from each string.
It'll be somewhat more efficient if your dataset is large.
q)("J#*"0:/:a)[;1]
"1100#1"
"2220#2"
"4444#a"
Notice I've set the 'key' to 'J' which will result in nulls in your example case, but you only care about the values anyway.
If you can join (sv) the strings together first, it is better still:
q)last "J#;"0:";" sv a
"1100#1"
"2220#2"
"4444#a"
HTH,
Sean
When the left argument of cut is an atom, cut behaves differently from _.
q)2 cut 2 3 4 5 6
(2 3;4 5;,6)
q)2 _ 2 3 4 5 6
4 5 6
Use _ to cut the string
q)(1+a?\:"#")_'a
("1100#1";"2220#2";"4444#a")
or
q)"#"sv/:1_/:"#" vs/:a
("1100#1";"2220#2";"4444#a")
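For readers who do not think in q, the difference between the two operators and the two working solutions can be mirrored in a Python sketch. The helper names chunks and drop are made up for illustration, and this is an analogue, not a description of how q implements them.

```python
a = ["ua#1100#1", "sba#2220#2", "r#4444#a"]

def chunks(xs, n):
    """Analogue of q's cut with an atom on the left: split into pieces of length n."""
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def drop(xs, n):
    """Analogue of q's _ with an atom on the left: discard the first n items."""
    return xs[n:]

print(chunks([2, 3, 4, 5, 6], 2))  # [[2, 3], [4, 5], [6]]
print(drop([2, 3, 4, 5, 6], 2))    # [4, 5, 6]

# Drop everything up to and including the first '#', one string at a time.
by_index = [drop(s, s.index("#") + 1) for s in a]
# Split once on '#' and keep the remainder.
by_split = [s.split("#", 1)[1] for s in a]

print(by_index)  # ['1100#1', '2220#2', '4444#a']
print(by_split)  # ['1100#1', '2220#2', '4444#a']
```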

Understanding how to read each-right and each-left combined in kdb

From Q for Mortals, I'm struggling to understand how to read this, and understand it logically.
1 2 3,/:\:10 20
I understand the result is a cross product when in full form: raze 1 2 3,/:\:10 20.
But reading from left to right, I'm currently lost at understanding what this yields (in my head)
\:10 20
combined with 1 2 3,/: ??
Help in understanding how to read this clearly (in words or clear logic) would be appreciated.
I found myself saying the following in my head whilst I program the syntax in q. q works from right to left.
Internal Monologue -> Join the string on the right onto each of the strings on the left
code -> "ABC",\:"-D"
result -> "A-D"
"B-D"
"C-D"
I think that's an easy way to understand it. 'join' can be replaced with whatever...
Internal Monologue -> Does the string on the right match any of the strings on the left
code -> ("Cat";"Dog";"CAT";"dog")~\:"CAT"
result -> 0010b
Each-right is the same concept and combining them is straightforward also;
Internal Monologue -> Does each of the strings on the right match each of the strings on the left
code -> ("Cat";"Dog";"CAT";"dog")~\:/:("CAT";"Dog")
result -> 0010b
0100b
So in your example 1 2 3,/:\:10 20 - you're saying 'Join each of the elements on the right to each of the elements on the left'
Hope this helps!!
EDIT To add a real world example.... - consider the following table
q)show tab:([] upper syms:10?`2; names:10?("Robert";"John";"Peter";"Jenny"); amount:10?til 10)
syms names amount
--------------------
CF "Peter" 8
BP "Robert" 1
IC "John" 9
IN "John" 5
NM "Peter" 4
OJ "Jenny" 6
BJ "Robert" 6
KH "John" 1
HJ "Peter" 8
LH "John" 5
q)
If you want to get all records where the name is Robert, you can do: select from tab where names like "Robert"
But if you want to get the results where the name is either Robert or John, then it is a perfect scenario to use our each-left and each-right.
Consider the names column - it's a list of strings (a list where each element is a list of chars). What we want to ask is 'does any of the strings in the names column match any of the strings we want to find'... that translates to (namesList)~\:/:(list;of;names;to;find). Here's the steps;
q)(tab`names)~\:/:("Robert";"John")
0100001000b
0011000101b
From that result we want a compiled list of booleans where each element is true if it is true for Robert OR John. For example, if you look at index 1 of both lists, it's 1b for Robert and 0b for John, so in our result the value at index 1 should be 1b. Index 2 should be 1b, index 3 should be 1b, index 4 should be 0b, etc. To do this, we can apply the any function (or max or sum!). The result is then;
q)any(tab`names)~\:/:("Robert";"John")
0111001101b
Putting it all together, we get;
q)select from tab where any names~\:/:("Robert";"John")
syms names amount
--------------------
BP "Robert" 1
IC "John" 9
IN "John" 5
BJ "Robert" 6
KH "John" 1
LH "John" 5
q)
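The mask-then-any step can be spelled out in a Python sketch using the names column from the table above. It is an analogue for illustration, with made-up variable names: one boolean row per target name, then a column-wise any.

```python
names = ["Peter", "Robert", "John", "John", "Peter",
         "Jenny", "Robert", "John", "Peter", "John"]
targets = ["Robert", "John"]

# One row per target, one column per name: the analogue of names ~\:/: targets.
mask_rows = [[name == target for name in names] for target in targets]
# Combine the rows column by column, as any does in the q version.
combined = [any(column) for column in zip(*mask_rows)]

def bits(flags):
    """Render a list of booleans the way q prints a boolean vector (without the b)."""
    return "".join("1" if flag else "0" for flag in flags)

for row in mask_rows:
    print(bits(row))   # 0100001000 then 0011000101
print(bits(combined))  # 0111001101
```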
Firstly, q is executed (and hence generally read) right to left. This means that it's interpreting the \: as a modifier to be applied to the previous function, which itself is a simple join modified by the /: adverb. So the way to read this is "Apply join each-right to each of the left-hand arguments."
In this case, you're applying the two adverbs to the join - \:10 20 on its own has no real meaning here.
I find it helpful to also look at the converse case 1 2 3,\:/:10 20. Running that code produces two rows of three pairs each (the console shows it as a 2x6 grid), which I'd describe more like "apply join each-left to each of the right-hand arguments". I hope that makes sense.
An alternative syntax which also might help is ,/:\:[1 2 3;10 20] - this might be useful as it makes it very clear what the function you're applying is, and is equivalent to your in-place notation.
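If nested loops are easier to picture, the two adverb orders correspond to swapping the inner and outer loop. The Python sketch below is an analogue with illustrative names, where join is modelled as building a two-item list.

```python
xs, ys = [1, 2, 3], [10, 20]

# 1 2 3,/:\:10 20 : outer loop over the left items, inner loop over the right items.
each_left_outer = [[[x, y] for y in ys] for x in xs]
# 1 2 3,\:/:10 20 : outer loop over the right items, inner loop over the left items.
each_right_outer = [[[x, y] for x in xs] for y in ys]
# raze removes one level of nesting, which gives the flat cross product.
cross = [pair for row in each_left_outer for pair in row]

print(each_left_outer)   # [[[1, 10], [1, 20]], [[2, 10], [2, 20]], [[3, 10], [3, 20]]]
print(each_right_outer)  # [[[1, 10], [2, 10], [3, 10]], [[1, 20], [2, 20], [3, 20]]]
print(cross)             # [[1, 10], [1, 20], [2, 10], [2, 20], [3, 10], [3, 20]]
```

The first form gives three rows of two pairs, the second two rows of three pairs. Both contain the same six pairs, just grouped differently.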