How do I count unique grapheme clusters in a string in Rust? - unicode

For example, for
let n = count_unique_grapheme_clusters("🇧🇷 🇷🇺 🇧🇷 🇺🇸 🇧🇷");
println!("{}", n);
the expected output is (space and three flags: " ", "🇧🇷", "🇷🇺", "🇺🇸"):
4

We can use the graphemes method from the unicode-segmentation crate to iterate over the grapheme clusters and collect them into a HashSet<&str>, which discards the duplicates. The .len() of that set is then the number of unique clusters.
extern crate unicode_segmentation; // 1.2.1
use std::collections::HashSet;
use unicode_segmentation::UnicodeSegmentation;
fn count_unique_grapheme_clusters(s: &str) -> usize {
let is_extended = true;
s.graphemes(is_extended).collect::<HashSet<_>>().len()
}
fn main() {
assert_eq!(count_unique_grapheme_clusters(""), 0);
assert_eq!(count_unique_grapheme_clusters("a"), 1);
assert_eq!(count_unique_grapheme_clusters("🇺🇸"), 1);
assert_eq!(count_unique_grapheme_clusters("🇷🇺é"), 2);
assert_eq!(count_unique_grapheme_clusters("🇧🇷🇷🇺🇧🇷🇺🇸🇧🇷"), 3);
}

Related

How to write FCGI_PARAMS using unix sockets

I ask for your help to understand part of the specification of the FastCGI protocol.
Currently this is the code I have:
#![allow(non_snake_case)]
#![allow(unused_must_use)]
use std::os::unix::net::{UnixStream};
use std::io::{Read, Write};
fn main() {
pub const FCGI_VERSION_1: u8 = 1;
pub const FCGI_BEGIN_REQUEST:u8 = 1;
pub const FCGI_RESPONDER: u16 = 1;
pub const FCGI_PARAMS: &str = "FCGI_PARAMS";
let socket_path = "/run/php-fpm/php-fpm.sock";
let mut socket = match UnixStream::connect(socket_path) {
Ok(sock) => sock,
Err(e) => {
println!("Couldn't connect: {e:?}");
return
}
};
let requestId: u16 = 1;
let role: u16 = FCGI_RESPONDER;
let beginRequest = vec![
// FCGI_Header
FCGI_VERSION_1, FCGI_BEGIN_REQUEST,
(requestId >> 8) as u8, (requestId & 0xFF) as u8,
0x00, 0x08, // This is the size of `FCGI_BeginRequestBody`
0, 0,
// FCGI_BeginRequestBody
(role >> 8) as u8, (role & 0xFF) as u8,
0, // Flags
0, 0, 0, 0, 0, // Reserved
];
socket.write_all(&beginRequest).unwrap();
let data = vec![
(100) as u8, // this value is just an example
];
let contentLength = data.len();
assert!(contentLength <= u16::MAX as usize); // the header's length field is only two bytes
let requestHeader = vec![
FCGI_VERSION_1, FCGI_BEGIN_REQUEST,
(requestId >> 8) as u8, (requestId & 0xFF) as u8,
(contentLength >> 8) as u8, (contentLength & 0xFF) as u8,
0, 0,
];
socket.write_all(&requestHeader).unwrap();
}
I have this code thanks to the answer to my previous question on this topic. Using that example code (which works perfectly for me), I would like to ask the following.
How can I write the FCGI_PARAMS?
I mean, if I understand correctly, the documentation says:
FCGI_PARAMS is a stream record type used in sending name-value pairs from the Web server to the application
This means that the FCGI_PARAMS are Name-Value Pairs. And the part of the documentation that describes the Name-Value Pairs says:
FastCGI transmits a name-value pair as the length of the name, followed by the length of the value, followed by the name, followed by the value
So I think it would look like this in code:
let param = vec![
"SCRIPT_FILENAME".len(),
"index.php".len(),
"SCRIPT_FILENAME",
"index.php",
]; // it is just an example, but i think it represents what i am talking about
But if I add this code and write it to the socket with the following line:
socket.write_all(&param);
then, when I read from the socket, nothing comes back. What am I doing wrong? How should I send the data? I hope you can help me with this. I am quite new to FastCGI and Unix sockets, so I apologize if any part of my code is a poor example.
Rust doesn't support heterogeneous vectors, so your let param =… shouldn't compile. The way to send the params is to use multiple writes:
let param_name = "SCRIPT_FILENAME".as_bytes(); // could also be written `b"SCRIPT_FILENAME"`
let param_value = "index.php".as_bytes();
let lengths = [ param_name.len() as u8, param_value.len() as u8 ];
socket.write_all(&lengths).unwrap();
socket.write_all(param_name).unwrap();
socket.write_all(param_value).unwrap();
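Two details from the FastCGI specification are worth adding to the snippet above. A single-byte length is only valid for lengths up to 127 (longer ones take four bytes with the high bit set), and the pairs have to travel inside a record whose header has type FCGI_PARAMS (4), followed by an empty FCGI_PARAMS record that ends the stream. Here is a rough sketch of both. The helper names `push_length`, `encode_pair` and `params_record` are made up for illustration, and the bytes go into a `Vec<u8>` rather than the socket so that it runs anywhere:

```rust
use std::io::Write;

const FCGI_VERSION_1: u8 = 1;
const FCGI_PARAMS: u8 = 4; // record type of the params stream

/// Append a length in FastCGI's name-value encoding: one byte if it is
/// at most 127, otherwise four big-endian bytes with the top bit set.
fn push_length(buf: &mut Vec<u8>, len: usize) {
    if len <= 127 {
        buf.push(len as u8);
    } else {
        // Assumption of this sketch: the length fits in 31 bits.
        assert!(len <= 0x7fff_ffff);
        let encoded = (len as u32) | 0x8000_0000;
        buf.extend_from_slice(&encoded.to_be_bytes());
    }
}

/// Encode one pair: name length, value length, name, value.
fn encode_pair(name: &[u8], value: &[u8]) -> Vec<u8> {
    let mut buf = Vec::new();
    push_length(&mut buf, name.len());
    push_length(&mut buf, value.len());
    buf.extend_from_slice(name);
    buf.extend_from_slice(value);
    buf
}

/// Wrap `content` in an FCGI_PARAMS record (8-byte header, no padding).
fn params_record(request_id: u16, content: &[u8]) -> Vec<u8> {
    assert!(content.len() <= u16::MAX as usize);
    let mut rec = vec![FCGI_VERSION_1, FCGI_PARAMS];
    rec.extend_from_slice(&request_id.to_be_bytes());
    rec.extend_from_slice(&(content.len() as u16).to_be_bytes());
    rec.extend_from_slice(&[0, 0]); // padding length, reserved
    rec.extend_from_slice(content);
    rec
}

fn main() -> std::io::Result<()> {
    let mut out: Vec<u8> = Vec::new(); // stand-in for the UnixStream
    let pair = encode_pair(b"SCRIPT_FILENAME", b"index.php");
    out.write_all(&params_record(1, &pair))?;
    // An empty FCGI_PARAMS record marks the end of the params stream.
    out.write_all(&params_record(1, &[]))?;
    println!("{:?}", out);
    Ok(())
}
```

With the question's code, `out` would be the connected `socket`, and these writes would come after the begin-request record.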

Enumerate String from specific index in nested loop in Swift

I can get the characters of a String using the enumerated() method:
let str = "Hello"
for (i,val) in str.enumerated() {
print("\(i) -> \(val)")
}
Now I am trying to enumerate the same string inside the for loop, but starting from position i, like this:
for (i,val) in str.enumerated() {
print("\(i) -> \(val)")
for (j,val2) in str.enumerated() {
// val2 should start from position i instead of from zero
}
}
How can I enumerate so that j starts from i?
Thanks
You can use dropFirst() to create a view in the string starting with a specific position:
for (i,val) in str.enumerated() {
print("i: \(i) -> \(val)")
for (j, val2) in str.dropFirst(i).enumerated() {
print("j: \(i+j) -> \(val2)")
}
}
You need to add i in the second loop if you want the index corresponding to the original string; otherwise j holds the index within the view created by dropFirst().
You can also iterate over str.indices and start the inner loop from the outer loop's index.
let str = "abc"
for i in str.indices {
for j in str[i...].indices {
print(str[j])
}
}
Output: a b c b c c
(Suggestion from @LeoDabus)
If I understood the problem statement correctly, the OP requires this -
Hello ello llo lo o
We can use the suffix method provided by Apple.
let str = "Hello"
let lengthOfString = str.count
for (i, val) in str.enumerated() {
print(String(str.suffix(lengthOfString - i)))
}
We don't need the val in the for loop above, so we can rewrite the above for loop as below.
let str = "Hello"
let lengthOfString = str.count
for i in 0..<lengthOfString {
print(String(str.suffix(lengthOfString - i)))
}
Both the above for loops will give the same desired output.

How to return a result from the Blake2 crate in Rust?

I am struggling to get the hash of a file, given its name, using the blake2 crate. From the documentation:
extern crate blake2;
use blake2::{Blake2b, Digest};
use std::env;
use std::fs;
use std::io::{self, Read};
const BUFFER_SIZE: usize = 1024;
fn print_result(sum: &[u8]) {
for byte in sum {
print!("{:02x}", byte);
}
}
fn process<D: Digest + Default, R: Read>(reader: &mut R) {
let mut sh = D::default();
let mut buffer = [0u8; BUFFER_SIZE];
loop {
let n = match reader.read(&mut buffer) {
Ok(n) => n,
Err(_) => return,
};
sh.input(&buffer[..n]);
if n == 0 || n < BUFFER_SIZE {
break;
}
}
print_result(&sh.result());
}
fn main() {
let args = env::args();
if args.len() > 1 {
for path in args.skip(1) {
if let Ok(mut file) = fs::File::open(&path) {
process::<Blake2b, _>(&mut file);
}
}
} else {
process::<Blake2b, _>(&mut io::stdin());
}
}
blake-test $ cargo run hoge.txt
Compiling blake-test v0.1.0 (/Users/hoge/blake-test)
Finished dev [unoptimized + debuginfo] target(s) in 0.61s
Running `target/debug/blake-test hoge.txt`
eefea9ae6b7fb678ed54e6d58d46aed9eae6d003f29419948cdb42a44a7016dee3eb566e7e95c68ac7587d5debd516a3b195eed0db84d72819e387d687fd06a6
It can successfully print the &[u8] slice.
However, I want to receive/return the results instead of printing them.
When you're returning a newly created object, you have to return it as an owned value.
Borrowed references, such as &[u8], are temporary and can't exist by themselves; they're merely views of data that is stored in an owned form elsewhere.
You can, for example, call .to_vec() on the slice and return a Vec<u8>.
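As a rough illustration of that advice, here is `process` reshaped to return an owned `Vec<u8>` instead of printing. To keep the sketch runnable without external crates, `ToyDigest` is a hypothetical stand-in for the hasher. It is not the blake2 API and not a real hash; with blake2 you would keep the question's `D: Digest + Default` bound. The loop also reads until `read` returns 0, on the assumption that a short read does not necessarily mean end of input.

```rust
use std::io::{self, Read};

const BUFFER_SIZE: usize = 1024;

/// Hypothetical stand-in for a real hasher: XOR-folds the input into 4 bytes.
#[derive(Default)]
struct ToyDigest {
    state: [u8; 4],
    pos: usize,
}

impl ToyDigest {
    fn input(&mut self, data: &[u8]) {
        for &b in data {
            self.state[self.pos % 4] ^= b;
            self.pos += 1;
        }
    }

    fn result(self) -> [u8; 4] {
        self.state
    }
}

/// Hash everything from `reader` and hand the bytes back to the caller.
fn process<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut sh = ToyDigest::default();
    let mut buffer = [0u8; BUFFER_SIZE];
    loop {
        let n = reader.read(&mut buffer)?;
        if n == 0 {
            break; // end of input
        }
        sh.input(&buffer[..n]);
    }
    // The result lives in a local value, so copy it into an owned Vec
    // rather than trying to return a borrowed slice of it.
    Ok(sh.result().to_vec())
}

fn main() -> io::Result<()> {
    let mut input: &[u8] = b"hello";
    let sum = process(&mut input)?;
    let hex: String = sum.iter().map(|b| format!("{:02x}", b)).collect();
    println!("{}", hex);
    Ok(())
}
```

Returning `io::Result<Vec<u8>>` also lets the caller see read errors, which the original version silently dropped.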

Is there a way to count with macros?

I want to create a macro that prints "Hello" a specified number of times. It's used like:
many_greetings!(3); // expands to three `println!("Hello");` statements
The naive way to create that macro is:
macro_rules! many_greetings {
($times:expr) => {{
println!("Hello");
many_greetings!($times - 1);
}};
(0) => ();
}
However, this doesn't work because the compiler does not evaluate expressions; $times - 1 isn't calculated, but fed as a new expression into the macro.
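A small sketch that makes this visible: the two macros below (names invented for this example) pass `$times - 1` along and then turn whatever arrives into a string. What arrives is the unevaluated expression, not the number 2.

```rust
// Turns the tokens it receives into a string, without evaluating them.
macro_rules! show_arg {
    ($times:expr) => {
        stringify!($times)
    };
}

// Mimics the recursive call in the question: forwards `$times - 1`.
macro_rules! decrement_then_show {
    ($times:expr) => {
        show_arg!($times - 1)
    };
}

fn main() {
    // Prints the forwarded expression (something like `3 - 1`), never `2`.
    println!("{}", decrement_then_show!(3));
}
```

For the same reason, a `(0)` rule can never be reached by subtracting inside the macro.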
While the ordinary macro system does not enable you to repeat the macro expansion many times, there is no problem with using a for loop in the macro:
macro_rules! many_greetings {
($times:expr) => {{
for _ in 0..$times {
println!("Hello");
}
}};
}
If you really need to repeat the macro, you have to look into procedural macros/compiler plugins (which as of 1.4 are unstable, and a bit harder to write).
Edit: There are probably better ways of implementing this, but I've spent long enough on this for today, so here goes. repeat!, a macro that actually duplicates a block of code a number of times:
main.rs
#![feature(plugin)]
#![plugin(repeat)]
fn main() {
let mut n = 0;
repeat!{ 4 {
println!("hello {}", n);
n += 1;
}};
}
lib.rs
#![feature(plugin_registrar, rustc_private)]
extern crate syntax;
extern crate rustc;
use syntax::codemap::Span;
use syntax::ast::TokenTree;
use syntax::ext::base::{ExtCtxt, MacResult, MacEager, DummyResult};
use rustc::plugin::Registry;
use syntax::util::small_vector::SmallVector;
use syntax::ast::Lit_;
use std::error::Error;
fn expand_repeat(cx: &mut ExtCtxt, sp: Span, tts: &[TokenTree]) -> Box<MacResult + 'static> {
let mut parser = cx.new_parser_from_tts(tts);
let times = match parser.parse_lit() {
Ok(lit) => match lit.node {
Lit_::LitInt(n, _) => n,
_ => {
cx.span_err(lit.span, "Expected literal integer");
return DummyResult::any(sp);
}
},
Err(e) => {
cx.span_err(sp, e.description());
return DummyResult::any(sp);
}
};
let res = parser.parse_block();
match res {
Ok(block) => {
let mut stmts = SmallVector::many(block.stmts.clone());
for _ in 1..times {
let rep_stmts = SmallVector::many(block.stmts.clone());
stmts.push_all(rep_stmts);
}
MacEager::stmts(stmts)
}
Err(e) => {
cx.span_err(sp, e.description());
DummyResult::any(sp)
}
}
}
#[plugin_registrar]
pub fn plugin_registrar(reg: &mut Registry) {
reg.register_macro("repeat", expand_repeat);
}
added to Cargo.toml
[lib]
name = "repeat"
plugin = true
Note that if we really don't want to do looping, but expanding at compile-time, we have to do things like requiring literal numbers. After all, we are not able to evaluate variables and function calls that reference other parts of the program at compile time.
As the other answers already said: no, you can't count like this with declarative macros (macro_rules!).
But you can implement the many_greetings! example as a procedural macro. Procedural macros were stabilized a while ago, so the definition works on stable. However, we can't yet expand macros into statements on stable; that's what the #![feature(proc_macro_hygiene)] is for.
This looks like a lot of code, but most code is just error handling, so it's not that complicated!
examples/main.rs
#![feature(proc_macro_hygiene)]
use count_proc_macro::many_greetings;
fn main() {
many_greetings!(3);
}
Cargo.toml
[package]
name = "count-proc-macro"
version = "0.1.0"
authors = ["me"]
edition = "2018"
[lib]
proc-macro = true
[dependencies]
quote = "0.6"
src/lib.rs
extern crate proc_macro;
use std::iter;
use proc_macro::{Span, TokenStream, TokenTree};
use quote::{quote, quote_spanned};
/// Expands into multiple `println!("Hello");` statements. E.g.
/// `many_greetings!(3);` will expand into three `println`s.
#[proc_macro]
pub fn many_greetings(input: TokenStream) -> TokenStream {
let tokens = input.into_iter().collect::<Vec<_>>();
// Make sure at least one token is provided.
if tokens.is_empty() {
return err(Span::call_site(), "expected integer, found no input");
}
// Make sure we don't have too many tokens.
if tokens.len() > 1 {
return err(tokens[1].span(), "unexpected second token");
}
// Get the number from our token.
let count = match &tokens[0] {
TokenTree::Literal(lit) => {
// Unfortunately, `Literal` doesn't have nice methods right now, so
// the easiest way for us to get an integer out of it is to convert
// it into string and parse it again.
if let Ok(count) = lit.to_string().parse::<usize>() {
count
} else {
let msg = format!("expected unsigned integer, found `{}`", lit);
return err(lit.span(), msg);
}
}
other => {
let msg = format!("expected integer literal, found `{}`", other);
return err(other.span(), msg);
}
};
// Return multiple `println` statements.
iter::repeat(quote! { println!("Hello"); })
.map(TokenStream::from)
.take(count)
.collect()
}
/// Report an error with the given `span` and message.
fn err(span: Span, msg: impl Into<String>) -> TokenStream {
let msg = msg.into();
quote_spanned!(span.into()=> {
compile_error!(#msg);
}).into()
}
Running cargo run --example main prints three "Hello"s.
For those looking for a way to do this, there is also the seq_macro crate.
It is fairly easy to use and works out of the box with stable Rust.
use seq_macro::seq;
macro_rules! many_greetings {
($times:literal) => {
seq!{ N in 0..$times {
println!("Hello");
}}
};
}
fn main() {
many_greetings!(3);
many_greetings!(12);
}
As far as I know, no. The macro language is based on pattern matching and variable substitution, and only evaluates macros.
Now, you can implement counting by spelling out every case by hand; it is just tedious:
macro_rules! many_greetings {
(3) => {{
println!("Hello");
many_greetings!(2);
}};
(2) => {{
println!("Hello");
many_greetings!(1);
}};
(1) => {{
println!("Hello");
many_greetings!(0);
}};
(0) => ();
}
Based on this, I am pretty sure one could invent a set of macros to "count" and invoke various operations at each step (with the count).
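One way this idea can be sketched on stable Rust is to give the count in unary, as a run of marker tokens, and peel off one token per recursive expansion. The macro name `repeat_unary` and the `x` marker are arbitrary choices for this example. It is an illustration of the pattern-matching approach, not a replacement for a numeric argument:

```rust
/// Expands the braced body once per `x` token that follows it.
macro_rules! repeat_unary {
    // No `x` tokens left: expand to nothing.
    ($body:tt) => {};
    // Peel off one `x`, emit the body once, recurse on the rest.
    ($body:tt x $($rest:tt)*) => {
        $body
        repeat_unary!($body $($rest)*);
    };
}

fn greetings() -> Vec<&'static str> {
    let mut out = Vec::new();
    // Three `x` tokens, so the body is duplicated three times.
    repeat_unary!({ out.push("Hello"); } x x x);
    out
}

fn main() {
    for line in greetings() {
        println!("{}", line);
    }
}
```

Each expansion matches one `x` structurally, which is all that pattern matching allows, so no arithmetic is ever needed.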

How do I eliminate the spurious warning "value assigned is never read" in a macro?

I am new to Rust and am learning to write my own macros. This macro should fill my struct MatrixXf like the macro vec! does for Vec<T>.
//fills matrix with matlab like syntax
macro_rules! mat {
[ $($( $x: expr ),*);* ] => {{
let mut tmp_vec = Vec::new();
let mut rows = 0;
let mut cols = 0;
let mut is_first_row_collected = false;
$(
let mut inner_cols = 0;
$(
tmp_vec.push($x);
inner_cols += 1;
)*
if is_first_row_collected {//if we read first row we can check that other rows have same length
assert!(inner_cols == cols);
} else {
is_first_row_collected = true;
cols = inner_cols;
}
rows += 1;
)*
MatrixXf::construct(tmp_vec, rows, cols)//fills MatrixXf fields
}}
}
And I use it this way:
let mat = mat![1.0, 2.0, 3.0; 4.0, 5.0, 6.0];
Everything is ok, but the compiler shows me the following warning:
7:23 warning: value assigned to is_first_row_collected is never read, #[warn(unused_assignments)] on by default
:7 is_first_row_collected = true ; cols = inner_cols ; } rows += 1 ; ) *
Maybe I misunderstood something, but I do use is_first_row_collected when checking that the first row was visited. Is it possible to rewrite my code to avoid this warning?
Instead of using a boolean variable, you could wrap cols in an Option to make it clear that cols has no valid value until you read the first row.
//fills matrix with matlab like syntax
macro_rules! mat {
[ $($( $x: expr ),*);* ] => {{
let mut tmp_vec = Vec::new();
let mut rows = 0;
let mut cols = None;
$(
let mut inner_cols = 0;
$(
tmp_vec.push($x);
inner_cols += 1;
)*
if let Some(cols) = cols {//if we read first row we can check that other rows have same length
assert!(inner_cols == cols);
} else {
cols = Some(inner_cols);
}
rows += 1;
)*
MatrixXf::construct(tmp_vec, rows, cols.unwrap_or(0))//fills MatrixXf fields
}}
}
Another option is to handle the first row and the following rows differently by separating them in the macro's pattern. This way, we can avoid the flag entirely because when we handle the following rows, we already know the number of columns.
//fills matrix with matlab like syntax
macro_rules! mat {
[] => { MatrixXf::construct(Vec::new(), 0, 0) };
[ $( $x: expr ),* $(; $( $y: expr ),*)* ] => {{
let mut tmp_vec = Vec::new();
let mut rows = 0;
let mut inner_cols = 0;
$(
tmp_vec.push($x);
inner_cols += 1;
)*
let cols = inner_cols; // remember how many columns the first row has
rows += 1;
$(
inner_cols = 0;
$(
tmp_vec.push($y);
inner_cols += 1;
)*
assert!(inner_cols == cols); // check that the following rows have as many columns as the first row
rows += 1;
)*
MatrixXf::construct(tmp_vec, rows, cols)//fills MatrixXf fields
}}
}
In this version of the macro, I added another rule to construct an empty matrix when there are no arguments and I moved the location of the semicolon so that you don't need a trailing semicolon when you have only one row.
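To try this version out, something has to play the role of MatrixXf, which the question does not show. The struct below is a hypothetical minimal stand-in (field names and types are assumptions), used only to exercise the three cases: no arguments, a single row, and several rows.

```rust
// Hypothetical stand-in; the asker's real MatrixXf is not shown.
#[derive(Debug)]
struct MatrixXf {
    data: Vec<f32>,
    rows: usize,
    cols: usize,
}

impl MatrixXf {
    fn construct(data: Vec<f32>, rows: usize, cols: usize) -> Self {
        MatrixXf { data, rows, cols }
    }
}

// Same macro as above: first row and following rows matched separately.
macro_rules! mat {
    [] => { MatrixXf::construct(Vec::new(), 0, 0) };
    [ $( $x: expr ),* $(; $( $y: expr ),*)* ] => {{
        let mut tmp_vec = Vec::new();
        let mut rows = 0;
        let mut inner_cols = 0;
        $(
            tmp_vec.push($x);
            inner_cols += 1;
        )*
        let cols = inner_cols; // column count of the first row
        rows += 1;
        $(
            inner_cols = 0;
            $(
                tmp_vec.push($y);
                inner_cols += 1;
            )*
            assert!(inner_cols == cols); // later rows must match the first
            rows += 1;
        )*
        MatrixXf::construct(tmp_vec, rows, cols)
    }}
}

fn main() {
    let empty = mat![];
    let row = mat![1.0, 2.0];
    let m = mat![1.0, 2.0, 3.0; 4.0, 5.0, 6.0];
    println!("{}x{}", empty.rows, empty.cols);
    println!("{}x{}", row.rows, row.cols);
    println!("{}x{} {:?}", m.rows, m.cols, m.data);
}
```

A row of a different length, such as `mat![1.0, 2.0; 3.0]`, trips the `assert!` at run time rather than at compile time.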
The warning is real; let's use this modified example that doesn't rely on a structure that you didn't provide in your question:
macro_rules! mat {
[ $($( $x: expr ),*);* ] => {{
let mut tmp_vec = Vec::new();
let mut rows = 0;
let mut cols = 0;
let mut is_first_row_collected = false;
$(
let mut inner_cols = 0;
$(
tmp_vec.push($x);
inner_cols += 1;
)*
if is_first_row_collected {//if we read first row we can check that other rows have same length
assert!(inner_cols == cols);
} else {
is_first_row_collected = true;
cols = inner_cols;
}
rows += 1;
)*
(tmp_vec, rows, cols)
}}
}
fn main() {
let _mat = mat![1.0, 2.0, 3.0; 4.0, 5.0, 6.0];
}
We can then use the compiler to see what the expanded version is:
rustc -Z unstable-options --pretty expanded example.rs
This is a big, ugly blob of code, so I'll trim it down to the relevant parts:
fn main() {
let mut is_first_row_collected = false;
if is_first_row_collected {
// removed
} else {
is_first_row_collected = true;
}
if is_first_row_collected {
// removed
} else {
is_first_row_collected = true;
}
}
So, indeed, the value you assigned is never read. Of course, as a human you can see that that particular flow shouldn't happen, and perhaps you could request an enhancement to the compiler to track that.
Ideally, you'd rework your macro to not have the underlying problem. Francis Gagné shows a great way of doing that. If you can't rework the macro, you can allow that warning. Unfortunately, I don't know of any way to add the #[allow(unused_assignments)] declaration on anything but a fn or a mod, so it seems like you'd have to make some changes to your macro anyway.
Yes, it would look like something in the lint is off here. If you manually expand the code yourself, do you still get the warning? Or is it just when it's in a macro?