Splitting unicode sentence into words array using XRegExp

I'm using the following script to split a Unicode sentence into an array of words:
XRegExp.matchChain("læseWEB læser teksten på dit website op.", [XRegExp("[\\p{Alphabetic}\\p{Nd}\\p{Pc}\\p{M}]+", "g")])
It returns:
[ "læseWEB", "læser", "teksten", "på", "dit", "website", "op" ]
What I actually want is each word together with the whitespace or punctuation that follows it:
['læseWEB ', 'læser ', 'teksten ', 'på ', 'dit ', 'website ', 'op.']
Someone said I need to use a split function instead of matchChain.
Any suggestions?
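
One possible way to get that result is to match each run of non-whitespace characters together with the whitespace that follows it, rather than splitting. The sketch below shows the idea in Python instead of XRegExp; the pattern `\S+\s*` and the function name are illustrative assumptions, not something taken from the question.

```python
import re

def split_keep_trailing(sentence):
    # Each match is a run of non-whitespace characters plus any
    # whitespace that follows it, so spaces and punctuation stay
    # attached to the word they belong to.
    return re.findall(r"\S+\s*", sentence)

words = split_keep_trailing("læseWEB læser teksten på dit website op.")
print(words)
```

Because the pattern only distinguishes whitespace from non-whitespace, it needs no Unicode property classes, so it should also be expressible as an ordinary JavaScript regex with the g flag.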

Related

Arabic Dataset Cleaning: Removing everything but Arabic text

I have a huge dataset in Arabic. I have already cleaned out special characters and English characters, but I discovered that the dataset also contains many other languages, such as Chinese, Japanese, and Russian. The problem is that I can't tell exactly which other languages are mixed in with the Arabic, so I need a way to remove everything other than Arabic characters from the text in a pandas data frame.
Here is my code:
import re

def clean_txt(input_str):
    try:
        if input_str:  # only process non-empty input
            input_str = re.sub(r'[?؟!#$%&*+~/=><]+', '', input_str)  # remove some special chars
            input_str = re.sub(r'[a-zA-Z?]', '', input_str).strip()  # remove English chars
            input_str = re.sub(r'\s+', ' ', input_str)  # collapse runs of whitespace into one space
            input_str = input_str.replace("_", ' ')  # replace underscores with spaces
            input_str = input_str.replace("ـ", '')  # remove Arabic tatweel
            input_str = input_str.replace('"', '')  # remove "
            input_str = input_str.replace("''", '')  # remove ''
            input_str = input_str.replace("'", '')  # remove '
            input_str = input_str.replace(".", '')  # remove .
            input_str = input_str.replace(",", '')  # remove ,
            input_str = input_str.replace(":", ' ')  # replace : with a space
            input_str = re.sub(r" ?\([^)]+\)", "", str(input_str))  # remove text between ()
            input_str = input_str.strip()  # trim the string
    except Exception:
        return input_str  # on any error, return the string as it stands
    return input_str

Finally, I found the answer:
text ='大谷育江 صباح الخيرfff :"""%#$#&!~(2009 مرحباً Добро пожаловать fffff أحمــــد ݓ'
t = re.sub(r'[^0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD]+', ' ', text)
t
' صباح الخير 2009 مرحباً أحمــــد ݓ'
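
A minimal sketch of how that character class could be wrapped in a reusable function and applied to many strings. The function name `keep_arabic` and the sample rows are hypothetical; the ranges are the ones from the regex above (with the duplicated `\ufd50-\ufd8f` range listed once), and ASCII digits are kept as in the original.

```python
import re

# Anything that is NOT an ASCII digit or inside one of these Arabic
# ranges (copied from the regex above) is treated as noise.
NON_ARABIC = re.compile(
    r'[^0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f'
    r'\ufd50-\ufd8f\ufe70-\ufefc\ufdf0-\ufdfd]+'
)

def keep_arabic(text):
    # Replace each run of unwanted characters with one space, then trim.
    return NON_ARABIC.sub(' ', text).strip()

rows = ['abc مرحبا 123 привет', '大谷育江 صباح الخير fff']
print([keep_arabic(row) for row in rows])
```

With pandas, a function like this could be applied to a text column through `Series.apply`, for example `df['text'].apply(keep_arabic)`, where `df` and the column name are placeholders.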

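input_str = regex.sub(r'[^ \p{Arabic}]', '', input_str)
Everything that is neither a space nor an Arabic character is removed. Note that \p{...} is not supported by Python's built-in re module, so this relies on the third-party regex module (import regex). You might want to add punctuation to the class and take care of leftovers such as empty (), and you could look into Unicode script and category names.
Correction: the property name is Arabic, not InArabic; see Unicode scripts.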
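
If a third-party regex engine is not an option, a rough standard-library approximation is to keep only characters whose Unicode name starts with "ARABIC". This is a sketch under that assumption, not an exact equivalent of a script property, and the helper name `arabic_only` is hypothetical.

```python
import unicodedata

def arabic_only(text):
    # Keep spaces and any character whose Unicode name begins with
    # "ARABIC" (letters, Arabic-Indic digits, Arabic punctuation).
    # Characters without a name yield '' and are dropped.
    return ''.join(
        ch for ch in text
        if ch == ' ' or unicodedata.name(ch, '').startswith('ARABIC')
    )

print(arabic_only('Hello مرحبا 2009 Добро'))
```

Unlike the range-based regex, this version drops ASCII digits as well, and it leaves the surrounding spaces in place, so a follow-up whitespace cleanup may still be needed.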

Language detection is a solved problem.
The simplest algorithmic approach is to scan a set of single-language texts for character bigrams and compute the distance between those bigram frequencies and the bigram frequencies of the target text.
The simplest thing for you to implement is to call this NLTK routine:
import nltk
from nltk.classify.textcat import TextCat
nltk.download(['crubadan', 'punkt'])
tc = TextCat()
>>> tc.guess_language('Now is the time for all good men to come to the aid of their party.')
'eng'
>>> tc.guess_language('Il est maintenant temps pour tous les hommes de bien de venir en aide à leur parti.')
'fra'
>>> tc.guess_language('لقد حان الوقت الآن لجميع الرجال الطيبين لمساعدة حزبهم.')
'arb'
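
The bigram approach described above can be sketched in a few lines. This is a toy illustration, not what TextCat does internally: the reference texts are tiny made-up samples, the L1 distance is one arbitrary choice of metric, and all function names are hypothetical.

```python
from collections import Counter

def bigram_profile(text):
    # Relative frequency of each pair of adjacent characters.
    text = text.lower()
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {bigram: n / total for bigram, n in counts.items()}

def distance(p, q):
    # L1 distance between two profiles; a missing bigram counts as 0.
    return sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in set(p) | set(q))

# Tiny reference corpora for illustration; real profiles need far more text.
REFERENCE = {
    'eng': bigram_profile('now is the time for all good men to come to the aid of their party'),
    'fra': bigram_profile('il est maintenant temps pour tous les hommes de bien de venir en aide à leur parti'),
}

def guess_language(text):
    # Pick the reference language whose profile is closest to the target's.
    target = bigram_profile(text)
    return min(REFERENCE, key=lambda lang: distance(target, REFERENCE[lang]))

print(guess_language('the time for all good men'))
```

For the cleaning task, one could build a profile from known-good Arabic text and drop rows whose closest profile is some other language, given enough reference text per language.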

How to replace 2 spaces middle of string to a character and a space in Dart?

I have a string as shown below. In Dart, replaceFirst() removes all of the whitespace, which is not what I want. My question is: how do I replace the two spaces in the middle of a string with a character and a space?
Example:
- Original: String _myText = "abc  23";
- Expected text: "abcd 23"
- Result with replaceFirst(): "abcd23"

"abc  23".replaceFirst(' ', 'd') returns the expected output.

Using a Wrap widget may solve your problem.
E.g.:
Wrap(children: [Text("hii Byy")])

Use the code below:
String _myText = "abc  23";
_myText.replaceAll('  ', ' '); // returns the string 'abc 23'

"abc  23".replaceFirst(' ', 'd') returns the expected output. Are you sure you are passing a single whitespace character, and not two, as the first parameter of replaceFirst?

How to auto-escape a special char in VS Code Snippets?

I want to write a snippet for debugging in TYPO3.
This is my snippet code in the php.json file:
"TYPO3 Extbase DebuggerUtility": {
    "prefix": "ee",
    "body": [
        "\\TYPO3\\CMS\\Extbase\\Utility\\DebuggerUtility::var_dump($1,'$1');",
        "$0"
    ],
    "description": "TYPO3 Extbase DebuggerUtility"
},
If I want to debug something like this: $this->settings['key'], I get this code:
\TYPO3\CMS\Extbase\Utility\DebuggerUtility::var_dump($this->settings['key'],'$this->settings['key']');
But it should look like this:
\TYPO3\CMS\Extbase\Utility\DebuggerUtility::var_dump($this->settings['key'],'$this->settings[\'key\']');
with the ' escaped in the second part of the snippet.
EDIT
Thank you, but I think you misunderstood the question.
I don't want to escape a static character. When I use the snippet and type the content of the first $1, it should be $this->settings['someKey'], but the second $1 (which is nearly the same) should automatically escape the ' characters I type, so that I don't have to do this by hand.
So if I type ', the first $1 should contain ' and the second $1 should contain \', so that my debug output looks like this:
Debug:
$this->settings['someKey']
contentOfsomeKey
If I don't escape the ' signs inside the debug title, it breaks the string, because the debug title is wrapped in '.
In other words: I want to escape the content of the second $1 variable, not the variable itself or the ' that wraps it in the snippet.
I hope this clarifies my issue.

If you want backslash characters (\) in your output, you need to write them as escaped backslashes: \\ in the snippet results in a single \ in the output.
You might need a further backslash if the following character itself needs escaping. For one backslash before a quote: \' = \\ + \' = \\\'
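
The first layer of escaping here is plain JSON: the snippet file is JSON, so every backslash that should survive parsing has to be doubled in the file. The sketch below illustrates only that layer, using Python's json module for demonstration; it does not model any further escaping that VS Code's snippet syntax applies on top.

```python
import json

# What is written in the JSON file (a raw string, so backslashes are literal).
in_file = r'"\\TYPO3\\CMS\\Extbase"'

# What the JSON parser hands on: each \\ has become a single backslash.
decoded = json.loads(in_file)
print(decoded)

# Going the other way, json.dumps shows how a backslash followed by a
# single quote would have to be written inside the file.
print(json.dumps("\\'"))
```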

Ionic Jade mask text input

I'm building an iOS app with Ionic and am using Jade for the UI. There is an input box whose typed value I want to be masked. The input could be numeric or alphanumeric. So far, I couldn't find out how to do it.
Can anyone help with what I should do to mask the input while the user is typing? For example, if the user types '123', the UI should show 'XXX'. I also need the unmasked value when I submit to the API.
Any example would be useful as well, since I'm not too familiar with Ionic and Jade.
UPDATE:
It should behave like a password input field: when the user types the password, the UI shows "...." or "XXXX" instead of the characters actually typed. This is the effect that I want.

For masking you can use the angular2-text-mask package from npm.
On the input, set:
[textMask]="{mask: mask, guide: false, modelClean: true}"
In the .ts file, define the mask as an array of literal characters and regexes:
mask: any[] = ['+', /\d/, ' ', '(', /[1-9]/, /\d/, /\d/, ')', ' ', /\d/, /\d/, /\d/, '-', /\d/, /\d/, '-', /\d/, /\d/];

Matlab: How to print " ' " character

I am trying to create the following string:
javaaddpath ('C:\MatlabUserLib\ParforProgMonv2')
However, I could only do the following
command = sprintf('%s ', varargin{1}, '(', varargin{2}, ')');
and that gives me:
javaaddpath ( C:\MatlabUserLib\ParforProgMonv2 )
UPDATE:
Based on Dan's suggestion, I used the following:
command = sprintf('%s', varargin{1}, '(', '''', varargin{2}, '''', ')')

Use two single quotation marks. See the docs for formatting strings. By the way, this concept is known as an escape character (to help you search for such things in the future).
command = sprintf('%s ', varargin{1}, '(''', varargin{2}, ''')')
Although I think you might prefer
command = sprintf('%s (''%s'')', varargin{1}, varargin{2})
or if you have no other varargins (which I guess is very unlikely but anyway)
command = sprintf('%s (''%s'')', varargin{:})

There are a couple of ways around this. First, you could store your path in a string variable and then pass that variable to your command, e.g.,
path = 'my/path'
javaaddpath (path)
Or you can use special characters to insert things like a single quote or a newline character, so for a single quote,
EDIT: the display command was wrong, as pointed out by Dan below
myString = '" Hi there! "'
disp(myString)
