OpenSUSE, in their infinite wisdom, has decided that ld -v will return
GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-7.26
I need to extract the 2 and 37 values and throw out the rest, and this also needs to work with an ld whose version string isn't so screwed up.
I have tried numerous examples found here and elsewhere for extracting the version, but they all get hung up on 15. Does anyone have any idea on how I can extract this using sed?
Currently in the Makefile I am using
LD_MAJOR_VER := $(shell $(LD) -v | perl -pe '($$_)=/([0-9]+([.][0-9]+)+)/' | cut -f1 -d. )
LD_MINOR_VER := $(shell $(LD) -v | perl -pe '($$_)=/([0-9]+([.][0-9]+)+)/' | cut -f2 -d. )
though I would much prefer to use sed like it did before SuSE screwed up our build process with their 15.3 update. Any help would be greatly appreciated.
You can use
LD_MAJOR_VER := $(shell $(LD) -v | sed -n 's/.* \([0-9]*\).*/\1/p')
LD_MINOR_VER := $(shell $(LD) -v | sed -n 's/.* [0-9]*\.\([0-9]*\).*/\1/p')
Details:
-n - an option that suppresses sed's default printing of each line
.* \([0-9]*\).* - a regex that matches the whole string:
.* - any zero or more chars (greedy, so it runs up to the last space)
' ' - a single literal space
\([0-9]*\) - Group 1 (the parentheses are escaped to form a capturing group since this is a POSIX BRE pattern): any zero or more digits
.* - any zero or more chars
\1 - the replacement is the Group 1 value
p - prints only the result of the substitution.
In the second regex, [0-9]*\. additionally matches the major version number and the dot after it, so that value is skipped.
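A quick check against the problematic string (using a bash here-string):
$ s='GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-7.26'
$ sed -n 's/.* \([0-9]*\).*/\1/p' <<< "$s"
2
$ sed -n 's/.* [0-9]*\.\([0-9]*\).*/\1/p' <<< "$s"
37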
I would do it in two steps; that keeps each part clear:
get the version information
get the major/minor or whatever from the version information
It would be easier to use awk to solve it, but since you said you prefer sed:
kent$ ver=$(sed 's/.*[[:space:]]//' <<< "GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-7.26")
kent$ echo $ver
2.37.20211103-7.26
kent$ major=$(sed 's/[.].*//' <<< $ver)
kent$ echo $major
2
kent$ minor=$(sed 's/^[^.-]*[.]//;s/[.].*//' <<< $ver)
kent$ echo $minor
37
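Wired back into the Makefile, the same two steps might look like this (LD_VERSION_FULL is a name invented for this sketch):
LD_VERSION_FULL := $(shell $(LD) -v | sed 's/.*[[:space:]]//')
LD_MAJOR_VER := $(shell echo '$(LD_VERSION_FULL)' | sed 's/[.].*//')
LD_MINOR_VER := $(shell echo '$(LD_VERSION_FULL)' | sed 's/^[^.-]*[.]//;s/[.].*//')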
If you use GNU make then its Functions for Transforming Text solve all this:
LD_VERSION := $(subst ., ,$(lastword $(shell $(LD) -v)))
LD_MAJOR_VER := $(word 1,$(LD_VERSION))
LD_MINOR_VER := $(word 2,$(LD_VERSION))
Moreover, it is probably quite robust and should work with any version string where the version is the last word and its components are separated by dots. Demo (where the version string is passed as a make variable instead of being returned by $(LD) -v):
$ cat Makefile
LD_VERSION := $(subst ., ,$(lastword $(LD_VERSION_STRING)))
LD_MAJOR_VER := $(word 1,$(LD_VERSION))
LD_MINOR_VER := $(word 2,$(LD_VERSION))
.PHONY: all
all:
	@echo $(LD_MAJOR_VER)
	@echo $(LD_MINOR_VER)
$ make LD_VERSION_STRING='blah blah blah 1.2.3.4.5.6.7'
1
2
$ make LD_VERSION_STRING='GNU ld (GNU Binutils for Debian) 2.35.2'
2
35
$ make LD_VERSION_STRING='GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.37.20211103-7.26'
2
37
I have a function that prints a header that needs to be applied across several files, but if I inject it into sed via command substitution, every line prior to the last ends up with a trailing backslash \.
E.g.
function print_header() {
cat << EOF
-------------------------------------------------------------------
$(date '+%B %d, %Y # ~ %r') ID:$(echo $RANDOM)
EOF
}
If I then take a file such as test.txt:
line 1
line 2
line 3
line 4
line 5
sed "1 i $(print_header | sed 's/$/\\/g')" test.txt
I get:
-------------------------------------------------------------------\
November 24, 2015 # ~ 11:18:28 AM ID:13187
line 1
line 2
line 3
line 4
line 5
Notice the troublesome backslash at the end of the first line, I'd like to not have that backslash appear. Any ideas?
I would use cat for that:
cat <(print_header) file > file_with_header
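If the header has to end up in the same file, a temporary file keeps it simple (sponge from moreutils would do it in one step; the file names are just for illustration):
print_header | cat - test.txt > test.txt.tmp && mv test.txt.tmp test.txt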
This behavior depends on the sed dialect; unfortunately, it's one of the things that varies from one version to another.
To simplify debugging, try specifying verbatim text. Here's one from a Debian system.
vnix$ sed '1i\
> foo\
> bar' <<':'
> hello
> goodbye
> :
foo
bar
hello
goodbye
Your diagnostics appear to indicate that your sed dialect does not in fact require the backslash after the first i.
Since you are generating the contents of the header programmatically anyway, my recommended solution would be to refactor the code so that you can avoid this conundrum. If you don't want the cat approach, then maybe experiment with sed '1r /dev/stdin' <<EOF test.txt (I could not get 1r - to work, but /dev/stdin should be portable to any Linux.)
Here is my kludgy fix, if you can find something more elegant I'll gladly credit you:
sed "1 i $(print_header | sed 's/$/\\/g;$s/$/\x01/')" test.txt | tr -d '\001'
This appends an unprintable SOH (\x01) ASCII Start Of Header character after the inserted text, so the insert no longer ends in a bare backslash, and then I run the output through tr to delete the SOH chars.
I'm trying to use sed to edit a file within a makefile. When I edit a date in the format xxxx-yy-zz, it works fine. When I try to edit a version number with a format of x.y.z, it fails. I'm pretty certain this is because I need to escape the . in the version for the pattern part of the sed command. I used this answer but it doesn't work, and I'm not good enough at this to figure it out (similar advice here). I can't give a working example due to use of external files, but here is the basic idea:
SHELL := /bin/bash # bash is needed for manipulation of version number
PKG_NAME=FuncMap
TODAY=$(shell date +%Y-%m-%d)
PKG_VERSION := $(shell grep -i '^version' $(PKG_NAME)/DESCRIPTION | cut -d ':' -f2 | cut -d ' ' -f2)
PKG_DATE := $(shell grep -i '^date' $(PKG_NAME)/DESCRIPTION | cut -d ':' -f2)
## Increment the z of x.y.z
XYZ=$(subst ., , $(PKG_VERSION))
X=$(word 1, $(XYZ))
Y=$(word 2, $(XYZ))
Z=$(word 3, $(XYZ))
Z2=$$(($(Z)+1))
NEW_VERSION=$(addsuffix $(addprefix .,$(Z2)), $(addsuffix $(addprefix ., $(Y)), $(X)))
OLD_VERSION=$(echo "$(PKG_VERSION)" | sed -e 's/[]$.*[\^]/\\&/g' )
all: info update
info:
#echo "Package: " $(PKG_NAME)
#echo "Current/Pending version numbers: " $(PKG_VERSION) $(NEW_VERSION)
#echo "Old date: " $(PKG_DATE)
#echo "Today: " $(TODAY)
#echo "OLD_VERSION: " $(OLD_VERSION)
update: $(shell find $(PKG_NAME) -name "DESCRIPTION")
#echo "Editing DESCRIPTION to increment version"
$(shell sed 's/$(OLD_VERSION)/$(NEW_VERSION)/' $(PKG_NAME)/DESCRIPTION > $(PKG_NAME)/TEST)
#echo "Editing DESCRIPTION to update the date"
$(shell sed 's/$(PKG_DATE)/$(TODAY)/' $(PKG_NAME)/DESCRIPTION > $(PKG_NAME)/TEST)
And this gives as output:
Package: FuncMap
Current/Pending version numbers: 1.0.1000 1.0.1001
Current date: 2000-07-99
Today: 2015-07-11
OLD_VERSION:
sed: first RE may not be empty
Editing DESCRIPTION to increment version
Editing DESCRIPTION to update the date
Obviously the sed on the version number is not working (the date is handled fine, and current/pending versions are correct, and the date is properly changed in the external file). Besides this particular problem, I'm sure a lot of this code is suboptimal - don't laugh! I don't know make nor shell scripting very well...
OLD_VERSION is empty, because you omitted the makefile shell function:
OLD_VERSION=$(shell echo "$(PKG_VERSION)" | sed -e 's/[]$.*[\^]/\\&/g' )
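The why: inside a makefile, $(echo ...) is simply a reference to a make variable whose name happens to be echo ..., and since no such variable exists it silently expands to nothing. A minimal illustration (hypothetical Makefile):
$ cat Makefile
X = $(echo hello)
Y = $(shell echo hello)
all:
	@echo "X=[$(X)] Y=[$(Y)]"
$ make
X=[] Y=[hello]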
I would like to replace the tabs in each file of a directory with the corresponding number of spaces. I already found a solution (11094383) where tabs are replaced with a given number of spaces:
find ./ -type f -exec sed -i 's/\t/    /g' {} \;
In that solution tabs are replaced with four spaces, but in my case a tab can stand for more spaces - e.g. 8.
An example of a file whose tabs should be replaced with 8 spaces:
NSMl1 100 PSHELL 0.00260 400000 400200 400300
400400 400500 400600 400700 400800 400900
401000 401100 400100 430000 430200 430300
430400 430500 430600 430700 430800 430900
431000 431100 430100 401200 431200
here the lines with tabs are the 3rd to the 5th line.
An example of a file whose tabs should be replaced with 4 spaces:
RBE2 1101001 5000511 123456 1100
Could anybody help?
The classic answer is to use the pr command with options to expand tabs into an appropriate number of spaces, turning off the pagination features:
pr -e8 -l1 -t …files…
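A quick check of the expansion (tab stops every 8 columns; the single leading non-tab means the tab expands to 7 spaces to reach column 9):
$ printf 'a\tb\n' | pr -e8 -t -l1
a       b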
The tricky part is getting the file overwritten, which seems to be part of the question. Of course, sed in the GNU and BSD (Mac OS X) incarnations supports overwriting with the -i option - with variant behaviours between the two: BSD sed requires a suffix for the backup files and GNU sed does not. However, sed does not (readily) support converting tabs to an appropriate number of blanks, so it isn't wholly appropriate.
There's a script overwrite (which I abbreviate to ow) in The UNIX Programming Environment that can do that. I've been using the script since 1987 (first checkin — last updated in 2005).
#!/bin/sh
# Overwrite file
# From: The UNIX Programming Environment by Kernighan and Pike
# Amended: remove PATH setting; handle file names with blanks.
case $# in
0|1) echo "Usage: $0 file command [arguments]" 1>&2
exit 1;;
esac
file="$1"
shift
new=${TMPDIR:-/tmp}/ovrwr.$$.1
old=${TMPDIR:-/tmp}/ovrwr.$$.2
trap "rm -f '$new' '$old' ; exit 1" 0 1 2 15
if "$#" >"$new"
then
cp "$file" "$old"
trap "" 1 2 15
cp "$new" "$file"
rm -f "$new" "$old"
trap 0
exit 0
else
echo "$0: $1 failed - $file unchanged" 1>&2
rm -f "$new" "$old"
trap 0
exit 1
fi
It would be possible and arguably better to use the mktemp command on most systems these days; it didn't exist way back then.
In the context of the question, you could then use:
find . -type f -exec ow {} pr -e8 -t -l1 \;
You do need to process each file separately.
If you are truly determined to use sed for the job, then you have your work cut out for you. There's a gruesome way to do it. There is a notational problem: how to represent a literal tab? I will use \t to denote it. The script would be stored in a file, which I'll assume is script.sed:
:again
/^\(\([^\t]\{8\}\)*\)\t/s//\1        /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{1\}\)\t/s//\1\3       /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{2\}\)\t/s//\1\3      /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{3\}\)\t/s//\1\3     /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{4\}\)\t/s//\1\3    /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{5\}\)\t/s//\1\3   /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{6\}\)\t/s//\1\3  /
/^\(\([^\t]\{8\}\)*\)\([^\t]\{7\}\)\t/s//\1\3 /
t again
That's using the classic sed notation.
You can then write:
sed -f script.sed …data-files…
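For a quick sanity check (same expectation as the pr example earlier: the tab after a one-character prefix becomes 7 spaces):
$ printf 'a\tb\n' | sed -f script.sed
a       b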
If you have GNU sed or BSD (Mac OS X) sed, you can use the extended regular expressions instead:
:again
/^(([^\t]{8})*)\t/s//\1        /
/^(([^\t]{8})*)([^\t]{1})\t/s//\1\3       /
/^(([^\t]{8})*)([^\t]{2})\t/s//\1\3      /
/^(([^\t]{8})*)([^\t]{3})\t/s//\1\3     /
/^(([^\t]{8})*)([^\t]{4})\t/s//\1\3    /
/^(([^\t]{8})*)([^\t]{5})\t/s//\1\3   /
/^(([^\t]{8})*)([^\t]{6})\t/s//\1\3  /
/^(([^\t]{8})*)([^\t]{7})\t/s//\1\3 /
t again
and then run:
sed -r -f script.sed …data-files… # GNU sed
sed -E -f script.sed …data-files… # BSD sed
What do the scripts do?
The first line sets a label; the last line jumps to that label if any of the s/// operations in between made a substitution. So, for each line of the file, the script loops until there are no matches made, and hence no substitutions performed.
The 8 substitutions deal with:
A block of zero or more sequences of 8 non-tabs, which is captured, followed by
a sequence of 0-7 more non-tabs, which is also captured, followed by
a tab.
It replaces that match with the captured material, followed by an appropriate number of spaces.
One curiosity found during the testing is that if a line ends with white space, the pr command removes that trailing white space.
There's also the expand command on some systems (BSD or Mac OS X at least), which preserves the trailing white space. Using that is simpler than pr or sed.
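For example, combined with the overwrite script above (expand's -t option sets the tab stop, mirroring pr -e8):
find . -type f -exec ow {} expand -t 8 \;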
With these sed scripts, and using the BSD or GNU sed with backup files, you can write:
find . -type f -exec sed -i.bak -r -f script.sed {} +
(GNU sed notation; substitute -E for -r for BSD sed.)
I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:
grep -e "[\x{00FF}-\x{FFFF}]" file.xml
But this returns every line in the file, regardless of whether the line contains a character in the range specified.
Do I have the syntax wrong or am I doing something else wrong? I've also tried:
egrep "[\x{00FF}-\x{FFFF}]" file.xml
(with both single and double quotes surrounding the pattern).
You can use the command:
grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This will give you the line number, and will highlight non-ascii chars in red.
On some systems, depending on your settings, the above will not work, so you can grep for the inverse:
grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
Note also that the important bit is the -P flag, which equates to --perl-regexp: it interprets your pattern as a Perl regular expression. The man page also warns that this is highly experimental and grep -P may warn of unimplemented features.
Instead of making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters instead.
So the first solution for instance would become:
grep --color='auto' -P -n '[^\x00-\x7F]' file.xml
(which basically greps for any character outside of the hexadecimal ASCII range: from \x00 up to \x7F)
On Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:
pcregrep --color='auto' -n '[^\x00-\x7F]' file.xml
Any pros or cons that anyone can think of?
The following works for me:
grep -P "[\x80-\xFF]" file.xml
Non-ASCII characters start at 0x80 and go to 0xFF when looking at bytes. Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want. The -P option in my grep allows the use of \xdd escapes in character classes to accomplish what you want.
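A quick check (a hypothetical sample file; the é is written with octal escapes for its UTF-8 bytes 0xC3 0xA9):
$ printf 'plain line\ncaf\303\251 line\n' > sample.txt
$ grep -P "[\x80-\xFF]" sample.txt
café line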
The easy way is to define a non-ASCII character... as a character that is not an ASCII character.
LC_ALL=C grep '[^ -~]' file.xml
Add a tab after the ^ if necessary.
Setting LC_COLLATE=C avoids nasty surprises about the meaning of character ranges in many locales. Setting LC_CTYPE=C is necessary to match single-byte characters — otherwise the command would miss invalid byte sequences in the current encoding. Setting LC_ALL=C avoids locale-dependent effects altogether.
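A similar sanity check for this approach (again using octal escapes for the UTF-8 bytes of é, both of which fall outside the space-to-tilde range):
$ printf 'ascii only\ncaf\303\251\n' | LC_ALL=C grep '[^ -~]'
café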
In perl
perl -ane '{ if(m/[[:^ascii:]]/) { print } }' fileName > newFile
Here is another variant I found that produced completely different results from the grep search for [\x80-\xFF] in the accepted answer. Perhaps it will be useful to someone to find additional non-ascii characters:
grep --color='auto' -P -n "[^[:ascii:]]" myfile.txt
Note: my computer's grep (a Mac) did not have -P option, so I did brew install grep and started the call above with ggrep instead of grep.
Searching for non-printable chars. TL;DR executive summary:
search for control chars AND extended unicode
a locale setting such as LC_ALL=C is needed to make grep do what you might expect with extended unicode
SO the preferred non-ascii char finders:
$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
as in top answer, the inverse grep:
$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test
as in top answer but WITH LC_ALL=C:
$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
. . more . . excruciating detail on this: . . .
I agree with Harvey above, buried in the comments: it is often more useful to search for non-printable characters, or it is easy to think non-ASCII when you really should be thinking non-printable. Harvey suggests: use "[^\n -~]", and add \r for DOS text files; that translates to "[^\x0A\x20-\x7E]", adding \x0D for CR.
Also, adding -c (show count of patterns matched) to grep is useful when searching for non-printable chars, as the matched strings can mess up the terminal.
I found that adding the ranges 0-8 and 0x0E-0x1F (to the 0x80-0xFF range) gives a useful pattern. This excludes TAB, CR and LF and one or two more uncommon printable chars. So IMHO quite a useful (albeit crude) grep pattern is THIS one:
grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *
ACTUALLY, generally you will need to do this:
LC_ALL=C grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *
breakdown:
LC_ALL=C - set locale to C, otherwise many extended chars will not match (even though they look like they are encoded > 0x80)
\x00-\x08 - non-printable control chars 0 - 8 decimal
\x0E-\x1F - more non-printable control chars 14 - 31 decimal
\x80-\xFF - non-printable chars 128 decimal and up
-c - print count of matching lines instead of lines
-P - perl style regexps
Instead of -c you may prefer to use -n (and optionally -b) or -l
-n, --line-number
-b, --byte-offset
-l, --files-with-matches
E.g. practical example of use find to grep all files under current directory:
LC_ALL=C find . -type f -exec grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" {} +
You may wish to adjust the grep at times. e.g. BS(0x08 - backspace) char used in some printable files or to exclude VT(0x0B - vertical tab). The BEL(0x07) and ESC(0x1B) chars can also be deemed printable in some cases.
Non-Printable ASCII Chars
** marks PRINTABLE but CONTROL chars that are useful to exclude sometimes
Dec Hex Ctrl Char description Dec Hex Ctrl Char description
0 00 ^# NULL 16 10 ^P DATA LINK ESCAPE (DLE)
1 01 ^A START OF HEADING (SOH) 17 11 ^Q DEVICE CONTROL 1 (DC1)
2 02 ^B START OF TEXT (STX) 18 12 ^R DEVICE CONTROL 2 (DC2)
3 03 ^C END OF TEXT (ETX) 19 13 ^S DEVICE CONTROL 3 (DC3)
4 04 ^D END OF TRANSMISSION (EOT) 20 14 ^T DEVICE CONTROL 4 (DC4)
5 05 ^E END OF QUERY (ENQ) 21 15 ^U NEGATIVE ACKNOWLEDGEMENT (NAK)
6 06 ^F ACKNOWLEDGE (ACK) 22 16 ^V SYNCHRONIZE (SYN)
7 07 ^G BEEP (BEL) 23 17 ^W END OF TRANSMISSION BLOCK (ETB)
8 08 ^H BACKSPACE (BS)** 24 18 ^X CANCEL (CAN)
9 09 ^I HORIZONTAL TAB (HT)** 25 19 ^Y END OF MEDIUM (EM)
10 0A ^J LINE FEED (LF)** 26 1A ^Z SUBSTITUTE (SUB)
11 0B ^K VERTICAL TAB (VT)** 27 1B ^[ ESCAPE (ESC)
12 0C ^L FF (FORM FEED)** 28 1C ^\ FILE SEPARATOR (FS) RIGHT ARROW
13 0D ^M CR (CARRIAGE RETURN)** 29 1D ^] GROUP SEPARATOR (GS) LEFT ARROW
14 0E ^N SO (SHIFT OUT) 30 1E ^^ RECORD SEPARATOR (RS) UP ARROW
15 0F ^O SI (SHIFT IN) 31 1F ^_ UNIT SEPARATOR (US) DOWN ARROW
UPDATE: I had to revisit this recently. And, YMMV depending on terminal settings/solar weather forecast BUT . . I noticed that grep was not finding many unicode or extended characters. Even though intuitively they should match the range 0x80 to 0xff, 3- and 4-byte unicode characters were not matched. ??? Can anyone explain this? YES. @frabjous asked and @calandoa explained that LC_ALL=C should be used to set the locale for the command to make grep match.
e.g. my locale LC_ALL= empty
$ locale
LANG=en_IE.UTF-8
LC_CTYPE="en_IE.UTF-8"
.
.
LC_ALL=
grep with LC_ALL empty matches 2-byte encoded chars but not 3- and 4-byte encoded ones:
$ grep -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" notes_unicode_emoji_test
5:© copyright c2a9
7:call underscore c2a0
9:CTRL
31:5 © copyright
32:7 call underscore
grep with LC_ALL=C does seem to match all extended characters that you would want:
$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test
1:���� unicode dashes e28090
3:��� Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5:� copyright c2a9
7:call� underscore c2a0
11:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29:1 ���� unicode dashes
30:3 ��� Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31:5 � copyright
32:7 call� underscore
33:11 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
34:52 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
81:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ���� YEOW, mix of japanese and chars from other
THIS perl match (partially found elsewhere on stackoverflow) OR the inverse grep on the top answer DO seem to find ALL the ~weird~ and ~wonderful~ "non-ascii" characters without setting locale:
$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test
$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test
1 ‐‐ unicode dashes e28090
3 💘 Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5 © copyright c2a9
7 call underscore c2a0
9 CTRL-H CHARS URK URK URK
11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29 1 ‐‐ unicode dashes
30 3 💘 Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31 5 © copyright
32 7 call underscore
33 11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
34 52 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
73 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ YEOW, mix of japanese and chars from other
The following code works:
find /tmp | perl -ne 'print if /[^[:ascii:]]/'
Replace /tmp with the name of the directory you want to search through.
This method should work with any POSIX-compliant version of awk and iconv, and we can take advantage of file and tr as well (curl is not POSIX, of course). Solutions above may be better in some cases, but they seem to depend on GNU/Linux implementations or additional tools.
Just get a sample file somehow:
$ curl -LOs http://gutenberg.org/files/84/84-0.txt
$ file 84-0.txt
84-0.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
Search for UTF-8 characters:
$ awk '/[\x80-\xFF]/ { print }' 84-0.txt
or non-ASCII
$ awk '/[^[:ascii:]]/ { print }' 84-0.txt
Convert UTF-8 to ASCII, removing problematic characters (including BOM which should not be in UTF-8 anyway):
$ iconv -c -t ASCII 84-0.txt > 84-ascii.txt
Check it:
$ file 84-ascii.txt
84-ascii.txt: ASCII text, with CRLF line terminators
Tweak it to remove DOS line endings / ^M ("CRLF line terminators"):
$ tr -d '\015' < 84-ascii.txt > 84-tweaked.txt && file 84-tweaked.txt
84-tweaked.txt: ASCII text
This method discards any "bad" characters it cannot deal with, so you may need to sanitize / validate the output. YMMV
Strangely, I had to do this today! I ended up using Perl because I couldn't get grep/egrep to work (even in -P mode). Something like:
cat blah | perl -ne '/\xCA\xFE\xBA\xBE/ && print "found"'
For Unicode characters (like \u2212 in the example below) use this:
find . ... -exec perl -CA -e '$ARGV = $ARGV[0]; open IN, $ARGV; binmode(IN, ":utf8"); binmode(STDOUT, ":utf8"); while (<IN>) { next unless /\N{U+2212}/; print "$ARGV: $&: $_"; exit }' '{}' \;
It could be interesting to know how to search for a single Unicode character; you only need to know its code point in UTF-8. This command, for instance, filters out the lines that contain U+200D (ZERO WIDTH JOINER):
grep -v $'\u200d'
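To select rather than exclude such lines, drop the -v; for example, to number the lines containing a zero-width joiner in a hypothetical file (bash's $'\u200d' quoting assumed):
$ grep -n $'\u200d' file.txt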
Finding all non-ascii characters gives the impression that one is either looking for unicode strings or intends to strip said characters individually.
For the former, try one of these (variable file is used for automation):
file=file.txt ; LC_ALL=C grep -Piao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
file=file.txt ; pcregrep -iao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
file=file.txt ; pcregrep -iao '[^\x00-\x19\x21-\x7F]{7,}' $file | iconv -f $(uchardet $file) -t utf-8
Vanilla grep doesn't work correctly without LC_ALL=C as noted in the previous answers.
ASCII range is x00-x7F, space is x20, since strings have spaces the negative range omits it.
Non-ASCII range is x80-xFF, since strings have spaces the positive range adds it.
String is presumed to be at least 7 consecutive characters within the range. {7,}.
For shell-readable output, uchardet $file returns a guess of the file's encoding, which is passed to iconv for conversion.
If you're trying to grab/grep UTF-8-compliant multibyte characters, use this:
( [\302-\337][\200-\277]|
[\340][\240-\277][\200-\277]|
[\355][\200-\237][\200-\277]|
[\341-\354\356-\357][\200-\277][\200-\277]|
[\360][\220-\277][\200-\277][\200-\277]|
[\361-\363][\200-\277][\200-\277][\200-\277]|
[\364][\200-\217][\200-\277][\200-\277] )
* please delete all newlines, spaces, or tabs in between (..)
* feel free to use bracket ranges {1,3} etc. to optimize
  the redundant listings of [\200-\277], but don't collapse
  them into [\200-\277]+, as that might result in invalid encodings
  due to either insufficient or too many continuation bytes
* although some historical UTF-8 references consider 5- and
  6-byte encodings to be valid, as of Unicode 13 only
  sequences of up to 4 bytes are considered valid
I've tested this pattern even against random binary files, and it reports the same multi-byte character count as GNU wc.
Add another [\000-\177]| at the front, just after the opening (, if you need the pattern to match full UTF-8 strings including ASCII.
This regex is truly hideous, yes, but it's also POSIX-compliant, cross-language and cross-platform compatible (it doesn't depend on any special regex notation), (should be) fully UTF-8 compliant (Unicode 13), and completely independent of locale settings.
If you're running grep with this, please use grep -P.
If you just need the other bytes, then others have suggested already.
If you need the 11,172 characters of NFC-composed Korean hangul, it's
(([\352][\260-\277]|[\353\354][\200-\277]|
[\355][\200-\235])[\200-\277]|[\355][\236][\200-\243])
and if you need Japanese hiragana+katakana, it's
([\343]([\201-\203][\200-\277]|[\207][\260-\277]))
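For instance, to count the lines containing hiragana or katakana with that pattern (whitespace removed as noted above; LC_ALL=C keeps grep byte-oriented; the file name is hypothetical):
$ LC_ALL=C grep -cP '[\343]([\201-\203][\200-\277]|[\207][\260-\277])' file.txt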