PySpark: how to remove punctuation marks and convert letters to lowercase in an RDD?

I would like to remove punctuation marks and convert all letters to lowercase in an RDD.
Below is my data set:
l=sc.parallelize(["How are you","Hello\ then% you"\
,"I think he's fine+ COMING"])
I tried the function below, but I got an error message:
punc='!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
def lower_clean_str(x):
    lowercased_str = x.lower()
    clean_str = lowercased_str.translate(punc)
    return clean_str
one_RDD = l.flatMap(lambda x: lower_clean_str(x).split())
one_RDD.collect()
What might be the problem, and how can I fix it?
Thank you.

You are using Python's translate function incorrectly: it expects a translation table, not a string of the characters to remove.
Since I am not sure whether you are using Python 2.7 or Python 3, and translate behaves differently between the two, I am suggesting an alternative approach.
The following code works regardless of the Python version.
def lower_clean_str(x):
    punc='!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
    lowercased_str = x.lower()
    for ch in punc:
        lowercased_str = lowercased_str.replace(ch, '')
    return lowercased_str
l=sc.parallelize(["How are you","Hello\ then% you","I think he's fine+ COMING"])
one_RDD = l.map(lower_clean_str)
one_RDD.collect()
Output:
['how are you', 'hello then you', 'i think hes fine coming']
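
On Python 3 specifically, the translate approach from the question can be made to work by building a translation table first. The following is a minimal sketch, assuming Python 3 and using the standard library's string.punctuation (which, unlike the punc string above, also contains '@'). It runs on a plain list so that it needs no Spark installation; the commented line at the end shows how it would plug into the RDD from the question, assuming an existing SparkContext named sc.

```python
import string

# Translation table that maps every ASCII punctuation character to None,
# which tells str.translate to delete it.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def lower_clean_str(x):
    """Lowercase x and strip ASCII punctuation."""
    return x.lower().translate(PUNCT_TABLE)

lines = ["How are you", "Hello\\ then% you", "I think he's fine+ COMING"]
cleaned = [lower_clean_str(s) for s in lines]
print(cleaned)

# With the RDD from the question (assumes an existing SparkContext `sc`):
# one_RDD = l.map(lower_clean_str)
```

Building the table once at module level avoids recreating it for every record, which matters when the function is mapped over a large RDD.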

Related

How can I manipulate a string in dart?

I'm currently working on a project with Flutter, and I need to manipulate some of the string variables I'm using.
Basically, I want to delete the last character of a string I'm concatenating. Given something like this:
String varString = 'My text';
I would like some method or function that gives me:
'My tex'
In other words, I'm looking for a way to 'pop' the last character of a string (like the pop function in JavaScript).
Is there something like that? I searched the Dart docs, but I didn't find anything about it.
Thank you in advance.

You can take a substring, like this:
string.substring(0, string.length - 1)
If you need the last character before popping, you can do this:
string[string.length - 1]
Strings in Dart are immutable, so the only way to do the operation you are describing is to construct a new string instance, as shown above.

var str = 'My text';
var newStr = (str.split('')..removeLast()).join();
print(newStr);
Another way:
var newStr2 = str.replaceFirst(RegExp(r'.$') , '');
print(newStr2);

Using \n in a string

I'm working on a Nim project with a GUI, and I want to display some texts that I fetch from my local MongoDB.
I uploaded some of these texts in a form like:
"something \nsomething \nsomething"
as a string. I also wrote a query (sorry for the formatting):
proc getTexts*(section : Bson) : seq[string] =
  for i in 0..<len(section["texts"]):
    result.add(section["texts"][i])
Then, when I want to set one of these seq items as a label, or simply echo it, it looks like this:
"something \nsomething \nsomething"
not this:
"something
something
something"
Thanks in advance.

We figured out that I had stored each newline as two separate characters (a backslash followed by n).
So, in the end, I wrote this short proc
import strutils
proc myEscape*(str: string): string =
  result = str.replace("\\n", $'\n').replace("\\t", $'\t')
to replace each such pair with a single character.
Not the nicest solution, but it works.
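
The underlying issue is not specific to Nim. As a minimal illustration in Python (not the Nim code above; the function name unescape_basic is made up for this sketch), the stored text contains a backslash followed by n, which is two characters. Replacing that pair with a real newline gives the expected multi-line output:

```python
def unescape_basic(text):
    """Replace the two-character sequences backslash+n and backslash+t
    with a real newline and a real tab, respectively."""
    return text.replace("\\n", "\n").replace("\\t", "\t")

# As stored in the database: a literal backslash followed by 'n'.
stored = "something \\nsomething \\nsomething"
fixed = unescape_basic(stored)

# The escaped pair is two characters long, a real newline is one.
print(len("\\n"), len("\n"))
print(fixed)
```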

phpbb preg_replace deprecated error

Hi, I have recently moved to PHP 5.6 and am now getting some deprecation errors from a phpBB3 installation. The offending line of code is:
$tpl = preg_replace('/{L_([A-Z_]+)}/e', "(!empty(\$user->lang['\$1'])) ? \$user->lang['\$1'] : ucwords(strtolower(str_replace('_', ' ', '\$1')))", $tpl);
Can anyone advise on how to convert this to preg_replace_callback?

I just managed to convert the expression to the new format. I am not a PHP wizard, so I am a bit proud of it!
Here is what I have written to remove the error messages (bbcode.inc line 494):
$tpl = preg_replace_callback('/{L_([A-Z0-9_]+)}/', function ($m) use ($user) { return (!empty($user->lang[$m[1]])) ? $user->lang[$m[1]] : ucwords(strtolower(str_replace('_', ' ', $m[1]))); }, $tpl);
Note that the closure imports $user with use ($user) (assuming $user is in scope where the callback is defined), and that $m[1] is used directly rather than inside a quoted string.
There is another similar line in bbcode.inc at line 370 that can be transformed in exactly the same manner, but I can't fix the one at line 113,
presumably because the pattern comes from a variable, so it will take a little more work to figure that one out.

Procedural macro parsing weirdness in Rust

I'm trying to parse a macro similar to this one:
annoying!({
    hello({
        // some stuff
    });
})
I'm trying to do this with a procedural macro definition similar to the following, but I'm getting behaviour I didn't expect, and I'm not sure whether I'm doing something I'm not supposed to or I've found a bug. In the following example, I'm trying to find the line where each block is. For the first block (the one just inside annoying!) it reports the correct line, but for the inner block it always prints 1, no matter where the code is.
#![crate_type="dylib"]
#![feature(macro_rules, plugin_registrar)]
extern crate syntax;
extern crate rustc;
use macro_result::MacroResult;
use rustc::plugin::Registry;
use syntax::ext::base::{ExtCtxt, MacResult};
use syntax::ext::quote::rt::ToTokens;
use syntax::codemap::Span;
use syntax::ast;
use syntax::parse::tts_to_parser;
mod macro_result;
#[plugin_registrar]
pub fn plugin_registrar(registry: &mut Registry) {
    registry.register_macro("annoying", macro_annoying);
}
pub fn macro_annoying(cx: &mut ExtCtxt, _: Span, tts: &[ast::TokenTree]) -> Box<MacResult> {
    let mut parser = cx.new_parser_from_tts(tts);
    let lo = cx.codemap().lookup_char_pos(parser.span.lo);
    let hi = cx.codemap().lookup_char_pos(parser.span.hi);
    println!("FIRST LO {}", lo.line); // real line for annoying! all cool
    println!("FIRST HI {}", hi.line); // real line for annoying! all cool
    let block_tokens = parser.parse_block().to_tokens(cx);
    let mut block_parser = tts_to_parser(cx.parse_sess(), block_tokens, cx.cfg());
    block_parser.bump(); // skip {
    block_parser.parse_ident(); // hello
    block_parser.bump(); // skip (
    // block lines
    let lo = cx.codemap().lookup_char_pos(block_parser.span.lo);
    let hi = cx.codemap().lookup_char_pos(block_parser.span.hi);
    println!("INNER LO {}", lo.line); // line 1? wtf?
    println!("INNER HI {}", hi.line); // line 1? wtf?
    MacroResult::new(vec![])
}
I think the problem might be that I'm creating a second parser to parse the inner block, and that might be corrupting the Span values inside it, but I'm not sure that's the problem or how to proceed from here. The reason I'm creating this second parser is so that I can recursively parse what's inside each of the blocks. I might be doing something I'm not supposed to, in which case a better suggestion would be very welcome.

I believe this is #15962 (and #16472): to_tokens has a generally horrible implementation. Specifically, anything non-trivial uses ToSource, which just turns the code into a string and then retokenises that (yes, it's not great at all!).
Until those issues are fixed, you should just handle the original tts directly as much as possible. You could approximate the right span using the .span of the parsed block (i.e. return value of parse_block), which will at least focus the user's attention on the right area.

preg_replace troubles

I am struggling with this regular expression.
$glossary_search[] = "/(^|>|\\s)".$glossary["glossary_name"]."($|<|\\s)/i";
$glossary_replace[] = "\$1<a href='/jargon-buster/".tapestry_hyphenate($glossary["glossary_name"]).".html' title='".$glossary["glossary_name"]."' target='_blank'>".$glossary["glossary_name"]."</a>\$2";
return preg_replace($glossary_search,$glossary_replace,$text);
I am trying to replace words in a product description with a hyperlink. The code above works if the word has a space either side but does not work if it has a full stop, comma or "<". Can anyone spot my mistake?
Thanks,
Simon

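I think you might need to use preg_quote and htmlentities, so that special characters in the glossary name are escaped for the regex and match the HTML-encoded text:
$glossary_search[] = "/(^|>|\\s)".preg_quote(htmlentities($glossary["glossary_name"],ENT_COMPAT,'UTF-8'),'/')."($|<|\\s)/i";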
$glossary_replace[] = "\$1<a href='/jargon-buster/".tapestry_hyphenate($glossary["glossary_name"]).".html' title='".$glossary["glossary_name"]."' target='_blank'>".$glossary["glossary_name"]."</a>\$2";
return preg_replace($glossary_search,$glossary_replace,$text);
