Arabic Dataset Cleaning: Removing everything but Arabic text - data-cleaning

I have a huge dataset in Arabic. I have already removed special characters and English characters, but I discovered that the dataset also contains many other languages, such as Chinese, Japanese, and Russian. I can't tell exactly which other languages are mixed in with the Arabic, so I need a way to remove everything other than Arabic characters from the text in a pandas DataFrame.
Here is my code:
import re

def clean_txt(input_str):
    try:
        if input_str:  # only process non-empty input
            input_str = re.sub(r'[?؟!#$%&*+~/=><]+', '', input_str)  # remove some special chars
            input_str = re.sub(r'[a-zA-Z?]', '', input_str).strip()  # remove English chars
            input_str = re.sub(r'\s+', ' ', input_str)  # collapse whitespace runs into one space
            input_str = input_str.replace('_', ' ')  # replace underscore with a space
            input_str = input_str.replace('ـ', '')  # remove Arabic tatweel
            input_str = input_str.replace('"', '')  # remove "
            input_str = input_str.replace("''", '')  # remove ''
            input_str = input_str.replace("'", '')  # remove '
            input_str = input_str.replace('.', '')  # remove .
            input_str = input_str.replace(',', '')  # remove ,
            input_str = input_str.replace(':', ' ')  # replace : with a space
            input_str = re.sub(r' ?\([^)]+\)', '', str(input_str))  # remove text between ()
            input_str = input_str.strip()  # trim the result
    except Exception:
        # non-string input (e.g. NaN) is returned unchanged
        return input_str
    return input_str

Finally, I found the answer:
text ='大谷育江 صباح الخيرfff :"""%#$#&!~(2009 مرحباً Добро пожаловать fffff أحمــــد ݓ'
t = re.sub(r'[^0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufdf0-\ufdfd\ufe70-\ufefc]+', ' ', text)
t
' صباح الخير 2009 مرحباً أحمــــد ݓ'

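input_str = regex.sub(r'[^ \p{Arabic}]', '', input_str)
Everything that is neither a space nor an Arabic-script character is removed. Note that Python's built-in re module does not support \p{Arabic}; the line above assumes the third-party regex module (import regex). You may want to add punctuation to the character class, and you will need to deal with leftovers such as an empty (). It is also worth looking into Unicode script and category names.
Correction: the property name should be Arabic rather than InArabic; see Unicode scripts.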
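If installing a third-party module is not an option, a similar filter can be approximated with the standard library alone. The sketch below is an assumption-laden alternative, not the method from the answers above: it treats a character as Arabic when its official Unicode name starts with "ARABIC", which is a heuristic rather than a true script-property lookup. The function name keep_arabic is made up for illustration.

```python
import re
import unicodedata


def keep_arabic(text: str) -> str:
    """Replace every character that is neither whitespace nor Arabic with a space.

    Heuristic (assumption): a character counts as Arabic when its Unicode
    name starts with "ARABIC", e.g. "ARABIC LETTER SAD".
    """
    kept = [
        ch if ch.isspace() or unicodedata.name(ch, "").startswith("ARABIC") else " "
        for ch in text
    ]
    # Collapse the runs of spaces left behind by removed characters.
    return re.sub(r"\s+", " ", "".join(kept)).strip()
```

On a pandas column this could be applied with something like df["text"].map(keep_arabic), where the column name is hypothetical. Unlike the explicit range list above, this version also drops ASCII digits; add a check for them if they should be kept.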

Language detection is largely a solved problem.
The simplest algorithmic approach is to scan a set of single-language texts for character bi-grams,
and then compute the distance between each language's bi-gram frequencies and those of the target text.
The simplest thing for you to implement is to call this NLTK routine:
import nltk
from nltk.classify.textcat import TextCat
nltk.download(['crubadan', 'punkt'])
tc = TextCat()
>>> tc.guess_language('Now is the time for all good men to come to the aid of their party.')
'eng'
>>> tc.guess_language('Il est maintenant temps pour tous les hommes de bien de venir en aide à leur parti.')
'fra'
>>> tc.guess_language('لقد حان الوقت الآن لجميع الرجال الطيبين لمساعدة حزبهم.')
'arb'
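The bi-gram approach described at the start of this answer can be sketched in a few lines. This is a toy illustration under stated assumptions, not how NLTK's TextCat is implemented: the reference profiles are built from single sentences (real ones need large corpora), the distance is a plain L1 distance between relative frequencies, and all names (bigram_profile, distance, guess_language, PROFILES) are made up for this example.

```python
def bigram_profile(text: str) -> dict:
    """Relative frequency of each character bi-gram in the text."""
    text = " ".join(text.lower().split())  # normalize whitespace
    counts = {}
    for i in range(len(text) - 1):
        bigram = text[i:i + 2]
        counts[bigram] = counts.get(bigram, 0) + 1
    total = sum(counts.values()) or 1
    return {bigram: n / total for bigram, n in counts.items()}


def distance(a: dict, b: dict) -> float:
    """L1 distance between two bi-gram frequency profiles."""
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in a.keys() | b.keys())


def guess_language(text: str, profiles: dict) -> str:
    """Return the language whose profile is closest to the text's profile."""
    target = bigram_profile(text)
    return min(profiles, key=lambda lang: distance(target, profiles[lang]))


# Toy reference texts (assumption): far too small for real-world use.
PROFILES = {
    "eng": bigram_profile("now is the time for all good men to come to the aid of their party"),
    "fra": bigram_profile("il est maintenant temps pour tous les hommes de bien de venir en aide"),
    "arb": bigram_profile("لقد حان الوقت الآن لجميع الرجال الطيبين لمساعدة حزبهم"),
}
```

Rows whose guessed language is not Arabic could then be dropped or inspected by hand. For mixed-language rows, a character-level filter is likely still needed, because a language guess applies to the whole string.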

Related

Determine if a string only contains invisible characters in Swift

I was parsing a messy XML. I found many of the nodes contain invisible characters only, for instance:
"\n "
" "
"\t "
"\n "
"\n\n"
I saw some posts and answers about filtering on letters and digits, but the XML being parsed in my project includes arbitrary UTF-8 characters, and I am not sure how I could list every visible character in a filter.
How can I determine if a string is made up of completely invisible characters like above, so I can filter them out? Thanks!
Use CharacterSet for that.
let nonWhitespace = CharacterSet.whitespacesAndNewlines.inverted
let containsNonWhitespace = (string.rangeOfCharacter(from: nonWhitespace) != nil)
Trim the string of whitespaces and newlines and see what's left.
if someString.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty {
    // someString only contains whitespaces and newlines
}

Replace emdash with double dash

I want to replace ― back into --
I tried with the utf8 encodings but that doesn't work
string = "blablabla -- blablabla ―"
I want to replace the long dash (if there is one) with double hyphens. I tried it the simple way but that didn't work:
string= string.replace ("―", "--")
I also tried to encode it with utf8 and use the codes of the special characters
stringutf8= string.encode("utf-8")
emdash= u"\u2014"
hyphen= u"\u002D"
if emdash in stringutf8:
    stringutf8.replace(emdash, 2*hyphen)
Any suggestions?
I am working with text files in which two hyphens are apparently sometimes replaced automatically by a long dash...
Thanks a lot!
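You are dealing with strings here. A string is a sequence of characters, so replace the character directly and leave the encoding out of the equation. Note also that replace returns a new string rather than modifying the original, so the result has to be assigned.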
string = 'blablabla -- blablabla \u2014'
emdash = '\u2014'
hyphen = '\u002D'
string2 = string.replace(emdash, 2*hyphen)
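Long dashes come in several look-alike code points, for example U+2014 (EM DASH) and U+2015 (HORIZONTAL BAR), so if replacing one of them seems to have no effect, the text may contain another. A minimal sketch that handles both at once follows; the choice of which code points to include and the name normalize_dashes are assumptions made for illustration.

```python
# Map each long-dash code point to a double hyphen.
# Which characters belong in this table is an assumption; extend as needed.
DASH_TABLE = str.maketrans({
    "\u2014": "--",  # EM DASH
    "\u2015": "--",  # HORIZONTAL BAR
})


def normalize_dashes(text: str) -> str:
    """Return a copy of text with long dashes replaced by two hyphens."""
    return text.translate(DASH_TABLE)
```

To find out which code point a file really contains, printing hex(ord(ch)) for the suspicious character is usually enough.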

Filtering out all non-kanji characters in a text with Python 3

I have a text containing Latin letters and Japanese characters (hiragana, katakana and kanji).
I want to filter out all Latin characters, hiragana and katakana, but I am not sure how to do this elegantly.
My direct approach would be to filter out every single letter of the Latin alphabet plus every single hiragana and katakana character, but I am sure there is a better way.
I am guessing that I have to use a regex, but I am not quite sure how to go about it. Are characters somehow classified as Roman, Japanese, Chinese, etc.?
If so, could I use that?
Here is some sample text:
"Lesson 1:",, "私","わたし","I" "私たち","わたしたち","We" "あ なた","あなた","You" "あの人","あのひと","That person" "あの方","あのかた","That person (polite)" "皆さん","みなさん"
The program should return only the kanji (Chinese characters), like this:
`私、人,方,皆`
I found the answer thanks to Olsgaarddk on reddit.
https://github.com/olsgaard/Japanese_nlp_scripts/blob/master/jp_regex.py
# -*- coding: utf-8 -*-
import re
''' This is a library of functions and variables that are helpful to have handy
when manipulating Japanese text in python.
This is optimized for Python 3.x, and takes advantage of the fact that all strings are unicode.
Copyright (c) 2014-2015, Mads Sørensen Ølsgaard
All rights reserved.
Released under BSD3 License, see http://opensource.org/licenses/BSD-3-Clause or license.txt '''
## UNICODE BLOCKS ##
# Regular expression unicode blocks collected from
# http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/
hiragana_full = r'[ぁ-ゟ]'
katakana_full = r'[゠-ヿ]'
kanji = r'[㐀-䶵一-鿋\uF900-\uFA6A]'  # last range (compatibility ideographs) written as escapes
radicals = r'[⺀-⿕]'
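katakana_half_width = r'[\uFF5F-\uFF9F]'  # half-width katakana and punctuation, written as escapes
alphanum_full = r'[\uFF01-\uFF5E]'  # full-width Roman letters, digits and punctuation, written as escapes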
symbols_punct = r'[、-〿]'
misc_symbols = r'[ㇰ-ㇿ㈠-㉃㊀-㋾㌀-㍿]'
ascii_char = r'[ -~]'
## FUNCTIONS ##
def extract_unicode_block(unicode_block, string):
    ''' extracts and returns all texts from a unicode block from string argument.
    Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.findall(unicode_block, string)

def remove_unicode_block(unicode_block, string):
    ''' removes all characters from a unicode block and returns all remaining texts from string argument.
    Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.sub(unicode_block, '', string)
## EXAMPLES ##
text = '初めての駅 自由が丘の駅で、大井町線から降りると、ママは、トットちゃんの手を引っ張って、改札口を出ようとした。ぁゟ゠ヿ㐀䶵一鿋豈頻⺀⿕⦅゚abc!~、〿ㇰㇿ㈠㉃㊀㋾㌀㍿'
print('Original text string:', text, '\n')
print('All kanji removed:', remove_unicode_block(kanji, text))
print('All hiragana in text:', ''.join(extract_unicode_block(hiragana_full, text)))
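Applied to the sample in the question, the same idea can be reduced to a short, self-contained sketch. It is not part of the linked library: the name unique_kanji is made up, and the character class uses whole Unicode blocks (CJK Unified Ideographs, Extension A, and CJK Compatibility Ideographs) as an assumption about what should count as kanji.

```python
import re

# Assumed definition of "kanji": three whole Unicode blocks.
KANJI = re.compile(r'[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]')


def unique_kanji(text: str) -> list:
    """Return each kanji found in text once, in order of first appearance."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(KANJI.findall(text)))


sample = '"私","わたし","I" "私たち","わたしたち","We" "あの人","あのひと" "あの方","あのかた" "皆さん","みなさん"'
print(','.join(unique_kanji(sample)))
```

Dropping the dict.fromkeys step keeps repeated kanji, which may be preferable when counting frequencies.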

Clean string from html tags and special characters

I want to clean my text of HTML tags, HTML special characters, and characters like < > [ ] / \ * ,
I used $str = preg_replace("/&#?[a-zA-Z0-9]+;/i", "", $str);
It works well for HTML special characters, but some characters are not removed, for example:
( /*/*]]>*/ )
How can I remove these characters?
If you are using PHP, as it appears, you can simply use:
$str = htmlspecialchars($str);
All HTML special characters will be escaped (which may be better than just stripping them). If you really just want to filter these characters out, escape them inside a character class:
$str = preg_replace("/[\&#\?\]\[\/\\\<\>\*\:\(\);]*/i","",$str);
Note that this is a single character class, "/[...]*/i". I removed a-zA-Z0-9 because you presumably want to keep those characters. Alternatively, you can whitelist only the characters you want to keep. This causes trouble with accented characters such as á, é and ü, because you have to list every accepted character explicitly:
$str = preg_replace("/[^a-zA-Z0-9áÁéÉíÍãÃüÜõÕñÑ\.\+\-\_\%\$\#\!\=;]*/","",$str);
Note also that escaping punctuation characters unnecessarily is harmless, except inside ranges: a-z is a range, whereas a\-z matches a, or -, or z.
I hope it helps. :)
A regular expression for HTML tags is:
/<[^>]*>/
so use something like this:
// The regular expression that matches HTML tags
$htmltagsregex = '/<[^>]*>/';
// what will be substituted for each tag
$nothing = '';
// the string I want to apply it to
$string = 'this is a string with <b>HTML tags</b> that I want to <strong>remove</strong>';
// apply it
$result = preg_replace($htmltagsregex, $nothing, $string);
and it will return
this is a string with HTML tags that I want to remove
That's all

How do I print a tab character in Pascal?

I'm trying to figure out in all the Internets what's the special character for printing a simple tab in Pascal. I have to format a table in a CLI program and that would be handy.
A single non-printable character can be written as its ASCII code prefixed with #.
Since the ASCII value for tab is 9, a tab is #9. Characters written this way must appear outside string literals, but no + is needed to concatenate them.
E.g.
const
  sometext = 'firstfield'#9'secondfield'#13#10;
contains two fields separated by a tab, terminated by a carriage return (#13) and a line feed (#10).
The ' character can be produced the same way (as #39), or more briefly by ending the literal and immediately reopening it:
const
  some2 = '''bla'''; // will contain 'bla' with the ticks.
  some3 = 'start''bla''end'; // will contain start'bla'end
write( ^i );
:-)