What is a good way to encode diffs in absence of the source text? - diff

The situation is like this: there's some text
hello world!
It is processed by my tool and is converted to some symbolic form, e.g.
[hello#0, world#6]
(notice how the ! is discarded).
Now my tool wants to recommend adding there to the original source text. My tool can send textual data back, so it makes sense to encode the delta in some format and send it back. Here's an example with diff:
1c1
< hello world!
---
> hello there world!
But the problem is that I cannot use the classical diff format because I don't have the original text any longer, and I can't produce that text from my model accurately (for example, because the ! is missing).
My question is, is there some standard textual format that can encode modifications in the middle of the line without knowing the entire line? Something like:
insert 'there ' at 1:6
I know diff itself has a few other possible output formats, but I could not spot any that can add things to the middle of a line without needing the entire new line content.

One of the output formats of diff is an ed script with diff -e. Now, diff produces ed scripts that make line-oriented edits, like delete lines or insert lines.
But since you're not necessarily using diff, you can make your tool output a finer-grained ed script which performs insertions and substitutions within a line.
Ed does not support numeric addressing of the characters within a line, but it can be done with regular expression match/replace.
To replace an n-character sequence starting at column m (counting from 1) with the text rep, you can use this command:
s/\(.\{m-1\}\).\{n\}/\1rep/
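Here m-1 and n are replaced by decimal numbers. If m happens to be 1, there is no prefix to preserve, so the command is just
s/.\{n\}/rep/
Your program has to be careful about escaping the characters of rep, of course: a backslash, the / delimiter and & are all special in the replacement text.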
The edits are then applied to a file like this:
$ cp file file.tmp # operate in-place on file.tmp
$ (cat diffs ; echo wq) | ed -q file.tmp # edits are in file "diffs"
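As a rough sketch of how a tool could emit such commands, here is a small Python generator. The function names (escape_replacement, ed_substitute) are made up for illustration, the line number is simply prefixed as an ed address, and the escaping covers only the replacement half of the command. Columns are 1-based, as in the formula above.

```python
def escape_replacement(rep: str) -> str:
    """Escape text for the replacement half of an ed `s` command delimited by /."""
    escaped = rep.replace("\\", "\\\\")    # backslash first, so later escapes are not doubled
    escaped = escaped.replace("/", "\\/")  # the delimiter
    escaped = escaped.replace("&", "\\&")  # & would otherwise mean "the matched text"
    return escaped.replace("\n", "\\\n")   # a literal newline must be backslash-escaped


def ed_substitute(line: int, col: int, length: int, rep: str) -> str:
    """Build an ed command replacing `length` characters at 1-based `col` on `line`."""
    if line < 1 or col < 1 or length < 0:
        raise ValueError("line and col are 1-based; length must be >= 0")
    prefix = r"\(.\{%d\}\)" % (col - 1) if col > 1 else ""
    target = r".\{%d\}" % length if length > 0 else ""
    backref = r"\1" if col > 1 else ""
    # An empty pattern would reuse the previous regex in ed, so anchor at line start instead.
    pattern = (prefix + target) or "^"
    return f"{line}s/{pattern}/{backref}{escape_replacement(rep)}/"


if __name__ == "__main__":
    # Pure insertion: length 0 at column 7 of line 1.
    print(ed_substitute(1, 7, 0, "there "))
```

If "insert 'there ' at 1:6" is read as a zero-based column, that is column 7 here, and the generated command is 1s/\(.\{6\}\)/\1there /, which turns hello world! into hello there world! without the tool ever knowing about the !.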

Related

Matlab dataimport

My Matlab code for data import is giving me different results for what appear to be similar text files as input. Input1 gives me a normal cell with all lines from the text file as entries in the cell, which I can reference using {i}.
Input2 gives me a scalar data structure where all numeric entries in my text file are converted to the input.data structure. I want all files to be converted to regular cell entries and I do not understand why for some files they are converted to scalar data structures.
Code: input = importdata(strcat(direct,'\',filename));
Input1 example: Correctly working dataimport, with text file on the right
File link: https://drive.google.com/open?id=1aHK4xivqEqJEmaA8Dv8Y0uW5giG-Bbip
Input2 example: Incorrectly working data import, with text file on the right
File link: https://drive.google.com/open?id=1nzUj_wR1bNXFcGaSLGva6uVsxrk-R5vA
UTSL!
I'm guessing you are using GNU Octave although you are writing "Matlab" as topic of your question.
In importdata.m around line 178, the code tries to automatically detect the delimiter for your data:
delim = regexpi (row, '[-+\d.e*ij ]+([^-+\de.ij])[-+\de*.ij ]','tokens', 'once');
If you run this against W40A0060; you get A as delimiter because there is basically a number before and after it.
If you run this against W39E0016; you get {} (empty) as delimiter, because the E could be part of a number in scientific notation and is therefore excluded.
Solution:
You really should pass the correct delimiter to the importdata call and not trust that it's magically detected.
And if you just want the lines in a cell, use
strsplit (fileread ("W39E0016_Input2.txt"), "\n")
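To see the detection behaviour outside Octave, the same pattern can be tried with Python's re module. This is only an approximation: Octave's regexpi uses PCRE, and the helper name detect_delimiter is invented here, but the character classes behave the same way for these two inputs.

```python
import re

# The delimiter-detection pattern quoted above, compiled case-insensitively
# to mimic regexpi.
DELIM_RE = re.compile(r"[-+\d.e*ij ]+([^-+\de.ij])[-+\de*.ij ]", re.IGNORECASE)


def detect_delimiter(row: str):
    """Return the captured delimiter character, or None if nothing matches."""
    match = DELIM_RE.search(row)
    return match.group(1) if match else None


if __name__ == "__main__":
    for row in ("W40A0060;", "W39E0016;"):
        print(row, "->", detect_delimiter(row))
```

For W40A0060; the A sits between two digit runs and is captured. For W39E0016; the E is swallowed by the number-like class, so no delimiter is found.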
Analysis
This looks indeed strange!
EDIT: The cause of this strange-looking behaviour has been deciphered by @Andy (see his solution).
When you use all outputs of importdata() function you can see what happens when reading the data:
[dat1,del1,headerrows1]=importdata('Input1.txt')
[dat2,del2,headerrows2]=importdata('Input2.txt')
For your first file it recognizes 69 header rows and no delimiter:
del1 = []
headerrows1 = 69
while in your second file only two header rows and a comma , delimiter is recognized
del2 = ','
headerrows2 = 2
I cannot find an obvious reason in your files causing this different interpretation of data.
Suggestion
Your data format is rather complex. It is not a simple table like one exported from Excel. It has multiple lines with a different number of fields per line and varying data types. importdata() is not designed for this type of data. I suggest writing a specific import function for this kind of file. Have a look at textread() as a starting point. You can use it to read the lines of the file as text and later interpret them with sscanf(), or use strsplit() to split the line contents into fields.

How to replace the same character in multiple text files?

So I have over 100 text files, all of which are too large to be opened in a normal text editor (e.g. Notepad, Notepad++), meaning I cannot use those.
All text files contain the same format, they contain:
abc0001:00000009a
abc0054:000000809a
abc00888:054450000009a
and so on..
I was wondering how to replace the ":" in each of those text files with "\n" (a newline).
So then it would be:
abc0001
00000009a
abc0054
000000809a
abc00888
054450000009a
How would I do this to all of the 100 text files, without doing this manually and individually. (if there's any way?)
Any help is appreciated.
You can use sed. The following does something similar to what you want. The question concerns Unix, but a lot of Unix utilities have been ported to MS Windows (even sed): http://gnuwin32.sourceforge.net/packages/sed.htm
UNIX: Replace Newline w/ Colon, Preserving Newline Before EOF
Something like (where you provide your text file as input, and the output becomes your new text file):
sed 's/:/\n/g'
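If sed is not available, the same replacement can be sketched in Python. This is a hypothetical helper, not part of the answer above: the function names, the "*.txt" pattern and the 1 MiB chunk size are all assumptions. It reads each file in chunks so that very large files never have to fit in memory, writes to a temporary file, and then swaps it into place.

```python
from pathlib import Path


def replace_in_file(src, dst, old=b":", new=b"\n", chunk_size=1 << 20):
    """Stream src to dst, replacing the single byte `old` with `new`."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # A one-byte pattern cannot straddle a chunk boundary, so
        # replacing chunk by chunk is safe.
        while chunk := fin.read(chunk_size):
            fout.write(chunk.replace(old, new))


def replace_in_directory(directory, pattern="*.txt"):
    """Apply replace_in_file to every matching file, overwriting it in place."""
    for src in sorted(Path(directory).glob(pattern)):
        tmp = src.with_name(src.name + ".tmp")
        replace_in_file(src, tmp)
        tmp.replace(src)  # swap the finished copy over the original


if __name__ == "__main__":
    replace_in_directory(".")
```

Try it on a copy of the files first, since the originals are overwritten.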

Exiftool - modify metadata format

Suppose I have 5000 images with following metadata in the LABEL field.
0001 ELEPHANT
0002 ELEPHANT
0003 ELEPHANT
...
4999 ELEPHANT
5000 ELEPHANT
I wish to change the format to:
ELEPHANT-0001
ELEPHANT-0002
ELEPHANT-0003
…
ELEPHANT-4999
ELEPHANT-5000
In other words, I want to do the following for a metadata field of multiple images:
#### NAME —> NAME-####
From what I can gather there could be two ways of doing this
Ignore the current metadata in the images, and reference a (plain text? csv?) file that I prepare separately; or
Read the file's metadata as a string, identify the space and the number preceding the space, save that number, and finally make a new string by concatenating the number and space, and adding a hyphen in between!
Any suggestions?
Expanding upon the answer I gave in the exiftool forums.
The basic command would be
exiftool "-LABEL<${LABEL;s/(\d{4}) (.*)/$2-$1/}" <FileOrDir>
You basically want to copy a tag into the same tag, with some modifications. The option to copy a tag is the less than (or greater than) symbol < or >. A common mistake is to use the equal sign = which is used to assign a static value to a tag.
To modify the tag, the command uses the advanced formatting option, which is actually some in-line Perl code. In this example, the tag is treated as a Perl string and a regex substitution is used. It matches and captures the first four digits (\d{4}), matches the space (but doesn't capture it), then matches and captures the rest of the tag (.*). The two captures are assigned to the variables $1 and $2, respectively. In the replace half of the substitution $2-$1, the two captures are reversed with the hyphen between them.
To take full advantage of the advanced formatting, some basic Perl and regex knowledge is helpful.
Once you are sure of the command, you can add -overwrite_original to suppress the generation of backup files and -r to recurse into subdirectories.
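To check the regex on a few sample values before touching any images, the substitution can be tried on its own. The sketch below uses Python's re module rather than Perl (the function name swap_label is made up), with \2 and \1 playing the role of $2 and $1.

```python
import re


def swap_label(label: str) -> str:
    """Turn '#### NAME' into 'NAME-####'; other labels are returned unchanged."""
    return re.sub(r"(\d{4}) (.*)", r"\2-\1", label)


if __name__ == "__main__":
    for label in ("0001 ELEPHANT", "5000 ELEPHANT"):
        print(label, "->", swap_label(label))
```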

Linux Sort vs Perl String Comparison

Because I was dealing with very large files, I sorted my base and candidate files before comparing them to see what lines were missing from the other. I did this to avoid keeping the records in memory. The sorting was done by using the Linux command-line tool, sort.
In my Perl script, I would look at whether the string in the line was lt, gt, or eq to the line in the other file, advancing the pointers in the file where necessary. However, I hit a problem when I noticed that my string comparison thought the strings in the base file were lt a string in the candidate file which contained special characters.
Is there a surefire way of making sure my Linux sort and Perl string comparisons are using the same type of string comparator?
The sort command uses the current locale, as specified by the environment variable LC_ALL, to determine the sort order for characters. Usually the easiest way to fix sorting issues is to manually set this to the C locale, which treats each 8-bit byte as a single character and compares by simple numeric value. In most shells this can be done as a one-off just for a single command by prefixing it like so:
LC_ALL=C sort < infile > outfile
This will also solve similar problems for some other text-processing programs. (E.g. I recall problems working with CSV files on a German person's computer -- this was traced back to the fact that Germans use a comma instead of a decimal point. Putting LC_ALL=C in front of the relevant commands fixed that issue too.)
[EDIT] Although Perl can be directed to treat some strings as Unicode, by default it still treats input and output as streams of 8-bit bytes, so the above approach should produce an order that is the same as Perl's sort() function. (Thanks to Ven'Tatsu for this nugget.)
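The same idea, sketched in Python rather than Perl: if both files were sorted with LC_ALL=C and are read as bytes, a plain < comparison uses the same byte-value order, so the two-pointer walk stays in step with the sort. The function name and the return shape are just for illustration.

```python
def diff_sorted(base, candidate):
    """Walk two byte-sorted iterables of lines; return (only_in_base, only_in_candidate)."""
    only_base, only_cand = [], []
    base_it, cand_it = iter(base), iter(candidate)
    b, c = next(base_it, None), next(cand_it, None)
    while b is not None and c is not None:
        if b == c:                      # present in both: advance both pointers
            b, c = next(base_it, None), next(cand_it, None)
        elif b < c:                     # bytes compare by numeric byte value
            only_base.append(b)
            b = next(base_it, None)
        else:
            only_cand.append(c)
            c = next(cand_it, None)
    while b is not None:                # leftovers in base
        only_base.append(b)
        b = next(base_it, None)
    while c is not None:                # leftovers in candidate
        only_cand.append(c)
        c = next(cand_it, None)
    return only_base, only_cand


if __name__ == "__main__":
    base = [b"a\n", b"b\n", b"d\n"]
    cand = [b"b\n", b"c\n", b"d\n"]
    print(diff_sorted(base, cand))
```

With real files, open both in binary mode ("rb") and pass the file objects directly, so no decoding or locale-aware collation gets involved.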

How to use '^@' in Vim scripts?

I'm trying to work around a problem with using ^@ (i.e., <ctrl-@>) characters in Vim scripts. I can insert them into a script, but when the script runs it seems the line is truncated at the point where a ^@ was located.
My kludgy solution so far is to have a ^@ stored in a variable, then reference the variable in the script whenever I would have quoted a literal ^@. Can someone tell me what's going on here? Is there a better way around this problem?
That is one reason why I never use raw special character values in scripts. While a raw ^@ does not work, the string <C-@> in mappings works as expected, so you may use one of
nnoremap <C-@> {rhs}
nnoremap <Nul> {rhs}
It is strange, but you cannot use <Char-0x0> here. Some notes about null byte in strings:
Inserting a null byte into a string truncates it: Vim uses old C-style strings that end with a null byte, thus it cannot appear in strings. These strings are very inefficient, so if you want to generate a very large text, try accumulating it into a list of lines (using setline is very fast, as the buffer is represented as a list of lines).
Most functions that return list of strings (like readfile, getline(start, end)) or take list of strings (like writefile, setline, append) treat \n (NL) as Null. It is also the internal representation of buffer lines, see :h NL-used-for-Nul.
If you try to insert \n character into the command-line, you will get Null shown (but this is really a newline). If you want to edit a file that has \n in a filename (it is possible on *nix), you will need to prepend newline with backslash.
The byte ctrl-@ is also known as '\0'. Many languages, programs, etc. use it as an "end of string" marker, so it's not surprising that vim gets confused there. If you must use this byte in the middle of a script string, it sounds like your workaround is a decent one.
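The "end of string" behaviour is easy to reproduce outside Vim. The following is only an analogy, not Vim's actual code: Python's ctypes exposes a C-style view of a buffer, and that view stops at the first null byte even though the rest of the data is still in memory. The helper name c_string_view is invented for this sketch.

```python
import ctypes


def c_string_view(data: bytes):
    """Return (what C-style string code sees, what is really stored)."""
    buf = ctypes.create_string_buffer(data)  # copies data and appends a terminating NUL
    return buf.value, buf.raw                # .value stops at the first NUL, .raw does not


if __name__ == "__main__":
    seen, stored = c_string_view(b"abc\x00def")
    print(seen)    # everything before the embedded null byte
    print(stored)  # the full buffer, including both null bytes
```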