Cleaning up text files with sed?

Cleaning up text files with sed? - sed

I have a bunch of text files that need cleaning up. Example
`E..4B?#.#...
..9J5.....P0.z.n9.9.. ........
.k#a..5
E...y^#.r...J5..
E...y_#.r...J5..
..9.P..n9..0.z............
….2..3..9…n7…..#.yr`
Is there any way sed can do this? Like notice weird patterns?

For this answer, I will assume that you have access to standard unix/linux tools.
Your file might be in some word-processor format. If so, the best way to get rid of the junk is to open it with that program. You may be able to find out which with file:
$ file mysteryfile
mysteryfile: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1 ....
If that doesn't work, there is a standard unix utility for extracting text from binary files. It is called strings:
$ strings mysteryfile
Some
Recovered Text
...
The behavior of strings can be fine tuned with several options. See man strings.

Related

Programmatically change text config files in Linux with minimal effort

I am looking for a tool that would ease the modification of text configuration files for tasks like:
Set ForwardAgent yes on /etc/ssh/ssh_config
Append HGUSER to AcceptEnv in /etc/ssh/sshd_config (that's more complex as it does accept several params, if yours is not alread there it should add it)
Most important:
running it several times should have no side effects.
if something looks weird, it should complain (for example if you find the same line several times in a file, or if the expected syntax does not match).
Is there any linux tool that can easily be used to automate things like this?
The whole point is to be able to write these config patches somewhere so you can deploy them on several machines or on a new machine when needed.

I would certainly do this with bash scripting. Here is a great tutorial.
http://linuxconfig.org/Bash_scripting_Tutorial
to change a line in a file you could do something like:
check the file exists
grep for the value you want to change - error if it appears multiple times or something
use sed to change that line
to append something to a file
check if file exists
grep to ensure it hasn't been appended to already
echo whatever >> file - the double greater than appends to a file
with each of these I would make a backup copy of the file first, just in case something goes wrong

You might want to have a look at the Unified Configuration Interface (UCI) used in Embedded Linux systems. If you have the flexibility to adapt the UCI format for your config files, this is pretty similar to what you are looking for.

How can I force emacs (or any editor) to read a file as if it is in ASCII format?

I could not find this answer in the man or info pages, nor with a search here or on Google. I have a file which is, in essence, a text file, but it somehow got screwed up upon saving. (I think there are a few strange bytes at the front of the file accidentally.)
I am able to open the file, and it makes sense, using head or cat, but not using any sort of editor.
In the end, all I wish to do is open the file in emacs, delete the "messy" characters, and save it once cleaned up. The file, however, is huge, so I need something powerful like emacs to be able to open it.
Otherwise, I suppose I can try to create a script to read this in line by line, forcing the script to read it in text format, then write it. But I wanted something quick, since I won't be doing this over & over.
Thanks!
Mike

perl -i.bk -pe 's/[^[:ascii:]]//g;' file
Found this perl one liner here: http://www.perlmonks.org/?node_id=619792

Try M-xfind-file-literally in Emacs.

You could edit the file using hexl-mode, which lets you edit the file in hexadecimal. That would let you see precisely what those offending characters are, and remove them.
It sounds like you either got a different line ending in the file (eg: carriage returns on a *nix system) or it got saved in an unexpected encoding.

You could use strings to grab "printable characters in file". You might have to play with the --encoding though I have only ever used it to grab ascii strings from executable files.

Rename file containing '©' character

We received as input in our application (running on Windows) a list of files. These files were automatically extracted from a database with a script.
Apparently some of the names are containing special characters (like accents) and these characters are rendered as '©' on our side.
How can rename programmatically these text files (around 900'000) to get rid of this character?
We cannot change the source neither re-extract the files.
The problem is that because of this character another program involved with our system does not accept the files.

Have a look at the unix command rename. It allows you to apply a perl regex to the names of a bunch of files. In this case you might want something like:
$ rename 's/[^a-zA-Z0-9]//' *
In debian the rename command is part of the perl package. It should also be available on CPAN.

I ended up creating a new script that reads the input files and search for special characters in their title.
It was quite easy indeed:
string filename = filename.Replace("©", "e");
Since the '©' is in the filename, the script (in C#) is able to recognize it and replace the match accordingly. In this way I can loop through all the folders and subfolders simply reading the filename and change specials characters.
Thank you all for the contributions!

How to see contents of exe in solaris

I have an exe file(eg naming.exe) on my Solaris server.
I want to see the contents of the exe file.
Is it possible ?

See the strings command which will extract readable text from a file. See the article on wikipedia for more about it.

Although Solaris and Unix in general doesn't care that much about suffixes, especially for executables.".exe" isn't a common file suffix there, it looks like a Windows thing to me.
Start by running file naming.exe to get an idea about what kind of file it is.

Often data beyond simply strings is packed in too. (For example software installers sometimes have useful cross-platform data files embedded within executable files). On linux you can extract this using cabextract. I can't see any reason why porting this to Solaris would be hard if it isn't already working on Solaris.

How do I find the data segment of Mac OSX executable with Perl?

I'm writing a tool in Perl that needs to scan for certain binary patterns inside an executable file on a Mac OSX. To avoid getting very many false positives, I want to restrict my search to the data/text segment of the executable, excluding the code segment and a few other things. How can I accomplish this?

How about using otool?
-t Display the contents of the (__TEXT,__text) section.
-d Display the contents of the (__DATA,__data) section.

You should look at the ELF file format specification. It contains headers and tables that tell you exactly which segments live where. Parsing it is tedious but straightforward.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Cleaning up text files with sed? - sed

I have a bunch of text files that need cleaning up. Example `E..4B?#.#... ..9J5.....P0.z.n9.9.. ........ .k#a..5 E...y^#.r...J5.. E...y_#.r...J5.. ..9.P..n9..0.z............ ….2..3..9…n7…..#.yr` Is there any way sed can do this? Like notice weird patterns?

Related

Programmatically change text config files in Linux with minimal effort

How can I force emacs (or any editor) to read a file as if it is in ASCII format?

Rename file containing '©' character

How to see contents of exe in solaris

How do I find the data segment of Mac OSX executable with Perl?

Categories

Resources