How to determine the lines, and how many words in my song using split() in python - python-3.7

Music lyrics
Look what you made me do, I'm with somebody new
Ohh, baby, baby, I'm dancing with a stranger
I tried using re.split also tried loop.

Just a heads up, you should always post your code with the question. Anyway, .split() works fine, I would count lines by counting the number of newline characters '\n'
s = "Look what you made me do, I'm with somebody new Ohh, baby, baby, I'm dancing with a stranger"
lines = s.count('\n')
lst = s.replace('\n', '').split()
words = len(lst)

lyric = "Look what you made me do, I'm with somebody new Ohh, baby, baby, I'm dancing with a stranger"
print(len(lyric.split()))
This splits each word into an element of a list and prints the length of the list which is equal to the number of words in the song.

Related

How to find common patterns in thousands of strings?

I don't want to find "abc" in strings ["kkkabczzz", "shdirabckai"]
Not like that.
But bigger patterns like this:
If I have to __, then I will ___.
["If I have to do it, then I will do it right.", "Even if I have to make it, I will not make it without Jack.", "....If I have to do, I will not...."]
I want to discover patterns in a large array or database of strings. Say going over the contents of an entire book.
Is there a way to find patterns like this?
I can work with JavaScript, Python, PHP.
The following could be a starting point:
The RegExp rx=/(\b\w+(\s+\w+\b)+)(?=.+\1)+/g looks for small (multiple word) patterns that occur at least twice in the text.
By playing around with the repeat quantifier + after (\s+\w+\b) (i.e. changing it to something like {2}) you can restrict your word patterns to any number of words (in the above case to 3: original + 2 repetitions) and you will get different results.
(?=.+\1)+ is a look ahead pattern that will not consume any of the matched parts of the string, so there is "more string" left for the remaining match attempts in the while loop.
const str="If I have to do it, then I will do it right. Even if I have to make it, I will not make it without Jack. If I have to do, I will not."
const rx=/(\b\w+(\s+\w+\b)+)(?=.+\1)+/g, r={};
let t;
while (t=rx.exec(str)) r[t[1]]=(rx.lastIndex+=1-t[1].length);
const res=Object.keys(r).map(p=>
[p,[...str.matchAll(p)].length]).sort((a,b)=>b[1]-a[1]||b[0].localeCompare(a[0]));
// list all repeated patterns and their occurrence counts,
// ordered by occurrence count and alphabet:
console.log(res);
I extended my snippet a little bit by collecting all the matches as keys in an object (r). At the end I list all the keys of this object alphabetically with Object.keys(r).sort().
In the while loop I also reset the rx.lastIndex property to start the search for that next pattern immediately after the start of the last one found: rx.lastIndex+=1-t[1].length.

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me so I'm really hoping for help from this community.
What I've tried (in piece meal but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be splitting the input via [\s\W]+ (e.g. with python's re.split, if you really need strings with more than one word, you can check the length of the result), filtering each resulting word if the first character is uppercase (e.g with python's string.isupper) and finally filtering against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" in a regex is kinda impossible, using "isupper" also works with non-latin letters (e.g. when your strings are unicode, [A-Z] won't be sufficient to detect uppercase). Filtering utilizing a dictionary is a way more forward approach and much easier to maintain (I would recommend using set or other data type suited for fast lookups.
Maybe if you can define your use case more clearer we can work out a pure regex solution...

How can I convert text to title case?

I have a text file containing a list of titles that I need to change to title case (words should begin with a capital letter except for most articles, conjunctions, and prepositions).
For example, this list of book titles:
barbarians at the gate
hot, flat, and crowded
A DAY LATE AND A DOLLAR SHORT
THE HITCHHIKER'S GUIDE TO THE GALAXY
should be changed to:
Barbarians at the Gate
Hot, Flat, and Crowded
A Day Late and a Dollar Short
The Hitchhiker's Guide to the Galaxy
I wrote the following code:
while(<DATA>)
{
$_=~s/(\s+)([a-z])/$1.uc($2)/eg;
print $_;
}
But it capitalizes the first letter of every word, even words like "at," "the," and "a" in the middle of a title:
Barbarians At The Gate
Hot, Flat, And Crowded
A Day Late And A Dollar Short
The Hitchhiker's Guide To The Galaxy
How can I do this?
Thanks to See also Lingua::EN::Titlecase – Håkon Hægland given the way to get the output.
use Lingua::EN::Titlecase;
my $tc = Lingua::EN::Titlecase->new();
while(<DATA>)
{
my $line = $_;
my $tc = Lingua::EN::Titlecase->new($line);
print $tc;
}
You can also try using this regex: ^(.)(.*?)\b|\b(at|to|that|and|this|the|a|is|was)\b|\b(\w)([\w']*?(?:[^\w'-]|$)) and replace with \U$1\L$2\U$3\L$4. It works my matching the first letter of words that are not articles, capitalizing it, then matching the rest of the word. This seems to work in PHP, I don't know about Perl but it will likely work.
^(.)(.*?)\b matches the first letter of the first word (group 1) and the rest of the word (group 2). This is done to prevent not capitalizing the first word because it's an article.
\b(word|multiple words|...)\b matches any connecting word to prevent capitalizing them.
(\w)([\w']*?(?:[^\w'-]|$)) matches the first letter of a word (group 3) and the rest of the word (group 4). Here I used [^\w'-] instead of \b so hyphens and apostrophes are counted as word characters too. This prevent 's from becoming 'S
The \U in replacement capitalizes the following characters and \L lowers them. If you want you can add more articles or words to the regex to prevent capitalizing them.
UPDATE: I changed the regex so you can include connecting phrases too (multiple words). But that will still make a very long regex...

Removing multiple character types from a string

Is this an acceptable approach for removing multiple character types from a string or is there a better (more efficient way)? The "ilr".contains(_) bit feels a little like cheating considering it will be done for each and every character, but then again, maybe this is the right way. Is there a faster or more efficient way to do this?
val sentence = "Twinkle twinkle little star, oh I wander what you are"
val words = sentence.filter(!"ilr".contains(_))
// Result: "Twnke twnke tte sta, oh I wande what you ae"
I'd just use Java's good old replaceAll (it takes a regexp):
"Twinkle twinkle little star, oh I wander what you are" replaceAll ("[ilr]", "")
// res0: String = Twnke twnke tte sta, oh I wande what you ae
In contrast to working with chars (as in filtering a Seq[Char]), using regular expressions should be Unicode-safe even if you're working with code points outside the basic multilingual plane. "There Ain't No Such Thing As Plain Text."
There would be no significant difference, since there is only 3 characters to remove and no so big string to filter, but you may consider to use Set for this purpose. E.g.
val toRemove = "ilr".toSet
val words = sentence.filterNot(toRemove)

regex to get string within 2 strings

"artistName":"Travie McCoy", "collectionName":"Billionaire (feat. Bruno Mars) - Single", "trackName":"Billionaire (feat. Bruno Mars)",
i wish to get the artist name so Travie McCoy from within that code using regex, please not i am using regexkitlite for the iphone sdk if this changes things.
Thanks
"?artistName"?\s*:\s*"([^"]*)("|$) should do the trick. It even handles some variations in the string:
White space before and after the :
artistName with and without the quotes
missing " at the end of the artist name if it is the last thing on the line
But there will be many more variations in the input you might encounter that this regex will not match.
Also you don’t want to use a regex for matching this for performance reasons. Right now you might only be interested in the artistName field. But some time later you will want information from the other fields. If you just change the field name in the regex you’ll have to match the whole string again. Much better to use a parser and transform the whole string into a dictionary where you can access the different fields easily. Parsing the whole string shouldn’t take much longer than matching the last key/value pair using a regex.
This looks like some kind of JSON, there are lots of good and complete parsers available. It isn’t hard to write one yourself though. You could write a simple recursive descent parser in a couple of hours. I think this is something every programmer should have done at least once.
\"?artistName\"?\s*:\s*\"([^\"]*)(\"|$)
Thats for objective c