R stringr using str_subset

In R, using the stringr package, how would you keep only the words in which a letter occurs three times, using str_subset?
Example: the letter "a" three times within a word.
Expected results: banana and Canada.

library(stringr)
text <- c("Canada", "and", "banana", "baobab")
# Any character repeated three times:
#
# maybe something followed by a marked character, maybe followed by
# something different, followed by that character, maybe followed by
# something different, followed by that character, maybe followed by
# something different
pattern <- "^.*(.)+.*\\1.*\\1.*$"
are_matching <- str_detect(text, pattern)
are_matching
#> [1] TRUE FALSE TRUE TRUE
words_extracted <- str_subset(text, pattern)
words_extracted
#> [1] "Canada" "banana" "baobab"
letter_repeated <- str_replace(words_extracted, pattern, "\\1")
letter_repeated
#> [1] "a" "a" "b"
# That gives you the "last" repeated character:
str_replace("baobaba", pattern, "\\1")
#> [1] "a"
# Note: If you want the first repeated character (if multiple), you
# should be lazy both at the initial optional set of character and at
# the first marked matching. (Not relevant for "detect" and "subset")
lazy_text <- c("bananan", "baobaba")
lazy_pattern <- "^.*?(.)+?.*\\1.*\\1.*$"
str_replace(lazy_text, pattern, "\\1")
#> [1] "n" "a"
str_replace(lazy_text, lazy_pattern, "\\1")
#> [1] "a" "b"
Created on 2020-09-02 by the reprex package (v0.3.0)
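If the target is a specific letter appearing exactly three times (the "a" in the example), counting with str_count and subsetting on the count is simpler; a minimal sketch alongside the answer above:
library(stringr)
text <- c("Canada", "and", "banana", "baobab")
# keep words containing exactly three "a"s (either case)
text[str_count(text, "[Aa]") == 3]
#> [1] "Canada" "banana"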

This will give you all words in which at least one letter appears exactly 3 times:
library(tidyverse)
vec <- "banana and Canada"
words <- vec %>% str_split(" ") %>% .[[1]]
lgl_vec <- words %>% map_lgl(
  ~ str_split(.x, "") %>%
    .[[1]] %>%
    factor() %>%
    summary() %>%
    `==`(3) %>%
    any()
)
words[lgl_vec]
[1] "banana" "Canada"

Use str_extract_all:
input <- c("apple", "banana", "Canada")
regex <- "\\b[^\\WAa]*[Aa][^\\WAa]*[Aa][^\\WAa]*[Aa][^\\WAa]*\\b"
matches <- str_extract_all(input, regex)
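For the sample input, only words containing exactly three a/A characters match, so the expected value of matches (a sketch of the printed result, not a verified run) is:
matches
# [[1]]
# character(0)
#
# [[2]]
# [1] "banana"
#
# [[3]]
# [1] "Canada"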


How to strip extra spaces when writing from dataframe to csv

I read in multiple sheets (6) from an xlsx file and created individual dataframes. I want to write each one out to a pipe-delimited csv.
ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
Currently it outputs like this:
1|value1 |value2 |word1 word2 word3 etc.
I want to strip the trailing blanks.
Suggestion
Chain the method .apply(lambda x: x.str.rstrip()) onto your DataFrame (prior to the .to_csv() call) to strip the trailing blanks from each field across the DataFrame. It would look like:
Change:
ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
To:
ind_dim.apply(lambda x: x.str.rstrip()).to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
It can be easily inserted into the output line using '.' chaining. To handle multiple data types, we can enforce the 'object' dtype on import by including the argument dtype='str':
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
Or on the DataFrame itself by:
df = pd.DataFrame(df, dtype='str')
Proof
I did a mock-up where the .xlsx document has 5 sheets, with each sheet having three columns: the first column with all numbers except an empty cell in row 2; the second column with both a leading blank and a trailing blank on strings, an empty cell in row 3, and a number in row 4; and the third column with all strings having a leading blank, and an empty value in row 4. Integer indexes and integer columns have been included. The text in each sheet is:
       0        1        2
0  11111  valueB1  valueC1
1         valueB2  valueC2
2  33333           valueC3
3  44444    44444
4  55555  valueB5  valueC5
This code reads our .xlsx file testing_xlsx_nums.xlsx into the DataFrame dictionary ind_dim.
Next, it loops through each sheet using a for loop to place the sheet name variable as a key to reference the individual sheet DataFrame. It applies the .str.rstrip() method to the entire sheet/DataFrame by passing the lambda x: x.str.rstrip() lambda function to the .apply() method called on the sheet/DataFrame.
Finally, it outputs the sheet/DataFrame as a .csv with the pipe delimiter using .to_csv() as seen in the OP post.
# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loops through sheets, applies rstrip(), output as csv '|' delimit
for sheet in ind_dim:
    ind_dim[sheet].apply(lambda x: x.str.rstrip()).to_csv(sheet + '_ind_dim_out.csv', sep='|')
Returns:
|0|1|2
0|11111| valueB1| valueC1
1|| valueB2| valueC2
2|33333|| valueC3
3|44444|44444|
4|55555| valueB5| valueC5
(Note our column 2 strings no longer have the trailing space).
We can also reference each sheet using a loop that cycles through the dictionary items; the syntax would look like for k, v in dict.items() where k and v are the key and value:
# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loops through sheets, applies rstrip(), output as csv '|' delimit
for k, v in ind_dim.items():
    v.apply(lambda x: x.str.rstrip()).to_csv(k + '_ind_dim_out.csv', sep='|')
Notes:
We'll still need to apply the correct arguments for selecting/ignoring indexes and columns with the header= and names= parameters as needed. For these examples I just passed None for simplicity.
The other methods that strip leading and leading & trailing spaces are: .str.lstrip() and .str.strip() respectively. They can also be applied to an entire DataFrame using the .apply(lambda x: x.str.strip()) lambda function passed to the .apply() method called on the DataFrame.
Only 1 Column: If we only wanted to strip from one column, we can call the .str methods directly on the column itself. For example, to strip leading & trailing spaces from a column named column2 in DataFrame df we would write: df.column2.str.strip().
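For instance (a hypothetical df with a column named column2), remember to assign the result back, since the .str methods return a new Series rather than modifying in place:
df['column2'] = df['column2'].str.strip()  # strip leading & trailing spaces in one column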
Data types not string: When importing our data, pandas will infer a data type for each column. We can override this by passing dtype='str' to the pd.read_excel() call when importing.
pandas 1.0.1 documentation (04/30/2020) on pandas.read_excel:
"dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}. Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion."
We can pass the argument dtype='str' when importing with pd.read_excel() (as seen above). If we want to enforce a single data type on a DataFrame we are working with, we can set it equal to itself and pass it to pd.DataFrame() with the argument dtype='str', like: df = pd.DataFrame(df, dtype='str')
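A closely related alternative (not from the answer above, but standard pandas) is the DataFrame's own astype method:
df = df.astype(str)  # cast every column to string ('object' dtype)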
Hope it helps!
The following trims left and right spaces fairly easily:
if (!require(dplyr)) {
  install.packages("dplyr")
}
library(dplyr)
if (!require(stringr)) {
  install.packages("stringr")
}
library(stringr)
setwd("~/wherever/you/need/to/get/data")
outputWithSpaces <- read.csv("CSVSpace.csv", header = FALSE)
print(head(outputWithSpaces), quote=TRUE)
#str_trim(string, side = c("both", "left", "right"))
outputWithoutSpaces <- outputWithSpaces %>% mutate_all(str_trim)
print(head(outputWithoutSpaces), quote=TRUE)
Starting Data:
V1 V2 V3 V4
1 "Something is interesting. " "This is also Interesting. " "Not " "Intereting "
2 " Something with leading space" " Leading" " Spaces with many words." " More."
3 " Leading and training Space. " " More " " Leading and trailing. " " Spaces. "
Resulting:
V1 V2 V3 V4
1 "Something is interesting." "This is also Interesting." "Not" "Intereting"
2 "Something with leading space" "Leading" "Spaces with many words." "More."
3 "Leading and training Space." "More" "Leading and trailing." "Spaces."

How do I find letters in words that are part of a string and remove them? (List comprehensions with if statements)

I'm trying to remove vowels from a string. Specifically, remove vowels from words that have more than 4 letters.
Here's my thought process:
(1) First, split the string into an array.
(2) Then, loop through the array and identify words that are more than 4 letters.
(3) Third, replace vowels with "".
(4) Lastly, join the array back into a string.
Problem: I don't think the code is looping through the array.
Can anyone find a solution?
def abbreviate_sentence(sent):
    split_string = sent.split()
    for word in split_string:
        if len(word) > 4:
            abbrev = word.replace("a", "").replace("e", "").replace("i", "").replace("o", "").replace("u", "")
    sentence = " ".join(abbrev)
    return sentence
print(abbreviate_sentence("follow the yellow brick road")) # => "fllw the yllw brck road"
I just figured out that the "abbrev = words.replace..." line was incomplete.
I changed it to:
abbrev = [words.replace("a", "").replace("e", "").replace("i", "").replace("o", "").replace("u", "") if len(words) > 4 else words for words in split_string]
I found the part of the solution here: Find and replace string values in list.
It is called a List Comprehension.
I also found List Comprehension with If Statement
The new lines of code look like:
def abbreviate_sentence(sent):
    split_string = sent.split()
    for words in split_string:
        abbrev = [words.replace("a", "").replace("e", "").replace("i", "").replace("o", "").replace("u", "")
                  if len(words) > 4 else words for words in split_string]
    sentence = " ".join(abbrev)
    return sentence

print(abbreviate_sentence("follow the yellow brick road")) # => "fllw the yllw brck road"
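Worth noting: the outer for loop above is redundant, because the list comprehension already iterates over split_string. A minimal equivalent sketch without it:

def abbreviate_sentence(sent):
    # drop the vowels from words longer than 4 characters
    abbrev = [word.replace("a", "").replace("e", "").replace("i", "").replace("o", "").replace("u", "")
              if len(word) > 4 else word for word in sent.split()]
    return " ".join(abbrev)

print(abbreviate_sentence("follow the yellow brick road"))  # => "fllw the yllw brck road"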

How to remove double quotes and extra delimiter(s) with in double quotes of TextQualifier file in Scala

I have a lot of delimited files with a text qualifier (every column starts and ends with a double quote). The delimiter is not consistent, i.e. it can be any delimiter such as comma (,), pipe (|), ~, or tab (\t).
I need to read each file with spark.read.textFile (single column) and then remove the text qualifier along with the delimiter(s) inside the double quotes (the delimiter needs to be replaced with a space). I want to do this without considering columns, i.e. I should not split into columns.
Below is test data with 3 columns: ID, Name, and DESC. The DESC column has extra delimiters.
val y = """4 , "XAA" , "sf,sd\nsdfsf""""
val pattern = """"[^"]*(?:""[^"]*)*"""".r
val output = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\n]", " "))
I got the above code, which works fine for a static value, but I am not able to apply it to a DataFrame.
"ID","Name","DESC"
"1" , "ABC", "A,B C"
"2" , "XYZ" , "ABC is bother"
"3" , "YYZ" , "FER" sfsf,sfd f"
4 , "XAA" , "sf,sd sdfsf"
I need output as
ID,Name,DESC
1 , ABC , A B C
2 , XYZ , ABC is bother
3 , YYZ , FER" sfsf sfd f
4 , XAA , sf sd sdfsf
Thanks in Advance.
Resolved
var SourceFile = spark.read.textFile("/data/test.csv")
val SourceFileDF = SourceFile.withColumn("value", RemoveQualifier(col("value")))

def RemoveQualifier = udf((RawData: String) => {
  var Data = RawData
  val pattern = """"[^"]*(?:""[^"]*)*"""".r
  Data = pattern replaceAllIn (Data, m => m.group(0).replaceAll("[,]", " "))
  Data
})
Thanks.
You can chain two replaceAll() calls like this:
val output = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\\\\n]", " ").replaceAll("\"|\"", ""))
output: String = 4 , XAA , sf sd sdfsf
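To get the quote removal in the DataFrame version too, the same second replaceAll() can be folded into the UDF from the resolved code above; a sketch along those lines:
def RemoveQualifier = udf((RawData: String) => {
  val pattern = """"[^"]*(?:""[^"]*)*"""".r
  // replace in-quote commas with spaces, then drop the qualifier quotes
  pattern.replaceAllIn(RawData, m => m.group(0).replaceAll("[,]", " ").replaceAll("\"", ""))
})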

Scala : How to split words using multiple delimeters

Suppose I have the text file like this:
Apple#mango&banana#grapes
The data needs to be split on multiple delimiters before performing the word count.
How to do that?
Use split method:
scala> "Apple#mango&banana#grapes".split("[#&#]")
res0: Array[String] = Array(Apple, mango, banana, grapes)
If you just want to count words, you don't need to split. Something like this will do:
val numWords = """\b\w""".r.findAllIn(string).length
This is a regex that matches the start of a word (\b is a zero-length word boundary; \w is any "word" character: letter, digit, or underscore), so you get all the matches in your string and then just check how many there are.
If you are looking to count each word separately, and to do it across multiple lines, then split is probably a better option:
source
  .getLines
  .flatMap(_.split("\\W+"))
  .filterNot(_.isEmpty)
  .toList // materialize the Iterator, which has no groupBy
  .groupBy(identity)
  .mapValues(_.size)
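Applied to the sample line (a sketch, substituting scala.io.Source.fromString for the file source):
scala.io.Source.fromString("Apple#mango&banana#grapes")
  .getLines
  .flatMap(_.split("\\W+"))
  .filterNot(_.isEmpty)
  .toList
  .groupBy(identity)
  .mapValues(_.size)
// e.g. Map(banana -> 1, Apple -> 1, mango -> 1, grapes -> 1)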

Combine leaflet and markdown in loop

This question shows how to loop over/apply leaflet objects within a markdown file. I'd like to do a similar thing, though I'd like to add additional markdown content.
---
title: "Test"
output: html_document
---
```{r setup, echo=T,results='asis'}
library(leaflet)
library(dplyr) ### !!! uses development version with tidyeval !!!
library(htmltools)
##Add A Random Year Column
data(quakes)
quakes <- tbl_df(quakes) %>%
  mutate(year = sample(2008:2010, n(), replace=TRUE))
```
```{r maps, echo=T,results='asis'}
createMaps <- function(year){
  cat(paste("###", year, "\n"))
  leaflet(quakes %>% filter(year == !!year)) %>%
    addTiles() %>%
    addMarkers(
      lng = ~long,
      lat = ~lat,
      popup = ~as.character(mag))
  cat("\n\n")
}
htmltools::tagList(lapply(as.list(2008:2010), function(x) createMaps(x) ))
```
If I leave out the cat statements in the createMaps function, this code prints all three maps. If I put in the cat statements, I get the markdown, but no maps. Any way to combine both types of element?
The problem is that your cat statements are evaluated before lapply returns its result list.
Delete the cat statements and change your createMaps function to
createMaps <- function(year){
  mymap <- leaflet(quakes %>% filter(year == !!year)) %>%
    addTiles() %>%
    addMarkers(
      lng = ~long,
      lat = ~lat,
      popup = ~as.character(mag))
  return(list(tags$h1(year), mymap))
}
and change tags$h1() to whatever size of header you want (tags$h2(), ...).
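With that version, the tagList call from the question should render the headers and maps together, e.g.:
htmltools::tagList(lapply(2008:2010, createMaps))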