How to extract certain columns from a big Notepad text file? - notepad

I have a big text file whose data are in 5 columns, but I need only the first and the last column.
Copying these two columns into another file one entry at a time would take many days and would probably introduce mistakes.
Is there a fast way to do this?
For example:
1 1.0000000000000000 0.0000000000 S {0}
2 1.5000000000000000 0.3010299957 C {2}
3 1.7500000000000000 0.6020599913 S {0,2}
4 2.0000000000000000 0.7781512504 C {3}
5 2.3333333333333333 1.0791812460 C {3,2}
6 2.5000000000000000 1.3802112417 S {3,0,2}
7 2.5277777777777778 1.5563025008 S {0,3}
8 2.5833333333333333 1.6812412374 S {3,0,0,2}
9 2.8000000000000000 1.7781512504 C {5,2}
10 3.0000000000000000 2.0791812460 C {5,0,2}
I need the first column (the numbering) and the last column (the values inside { }).

ALT + Left Mouse Click puts you in Column Mode Select. It's quite a useful shortcut that may help you.

In Notepad++, you can do the replacement with a regular expression.
The find and replace patterns are:
^( +\d+).+\{([\d,]+)\}$
\1 \2
This changes:
1 1.0000000000000000 0.0000000000 S {0}
2 1.5000000000000000 0.3010299957 C {2}
3 1.7500000000000000 0.6020599913 S {0,2}
4 2.0000000000000000 0.7781512504 C {3}
5 2.3333333333333333 1.0791812460 C {3,2}
6 2.5000000000000000 1.3802112417 S {3,0,2}
7 2.5277777777777778 1.5563025008 S {0,3}
8 2.5833333333333333 1.6812412374 S {3,0,0,2}
9 2.8000000000000000 1.7781512504 C {5,2}
10 3.0000000000000000 2.0791812460 C {5,0,2}
to:
1 0
2 2
3 0,2
4 3
5 3,2
6 3,0,2
7 0,3
8 3,0,0,2
9 5,2
10 5,0,2
If you do not want the leading spaces, move them outside the capture group:
^ *(\d+).+\{([\d,]+)\}$
\1 \2
This changes it to:
1 0
2 2
3 0,2
4 3
5 3,2
6 3,0,2
7 0,3
8 3,0,0,2
9 5,2
10 5,0,2
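
For a file too large to handle comfortably in an editor, the same pattern can also be applied from a script. The following is a minimal Python sketch, not part of the original answer; the function name first_and_last and the sample list are made up for illustration, and the pattern assumes every data line ends with a {...} group.

```python
import re

# Same idea as the Notepad++ pattern, but tolerant of zero or more leading spaces.
LINE_RE = re.compile(r"^ *(\d+).+\{([\d,]+)\}$")

def first_and_last(line):
    """Return 'index braces-content' for one data line, or None if it does not match."""
    m = LINE_RE.match(line.rstrip("\n"))
    return f"{m.group(1)} {m.group(2)}" if m else None

# A few lines taken from the question's sample data.
sample = [
    "1 1.0000000000000000 0.0000000000 S {0}",
    "6 2.5000000000000000 1.3802112417 S {3,0,2}",
    "10 3.0000000000000000 2.0791812460 C {5,0,2}",
]
result = [first_and_last(s) for s in sample]
print(result)
```

To process a real file, iterate over the open file object instead of the sample list and write each non-None result to the output file.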

You should use awk or gawk, which is also available on Windows. Use gawk "{print $1,$5}" inpfile > outfile. I copied your data into a file named 'one'. The output below consists of the 1st and 5th columns of your file.
>gawk "{print $1, $5}" one
1 {0}
2 {2}
3 {0,2}
4 {3}
5 {3,2}
6 {3,0,2}
7 {0,3}
8 {3,0,0,2}
9 {5,2}
10 {5,0,2}

You can import it into Excel and manipulate it there.

If you are using .NET, FileHelpers may save you a lot of time. From your post we can't tell what technology you are hoping to use to accomplish this.

UltraEdit has a tool for selecting columns and opens large files (I tried a 900 MB file on a 2008 desktop and it opened in 3 minutes). I think it has a fully functional demo version.
Excel could work if you do not have too many rows.
Cheers,

Another way is to copy the data into an MS Word file.
Then hold Alt and drag with the left mouse button over the column you want; only that single column is selected.
Copy and paste wherever you want.

There is only one way to process ungodly amounts of data: the command prompt.
$ sed 's/  */ /g;s/[{}]//g' text.txt | awk '{print $1","$5}' > clean_text.csv
This 15-second fix is not available out of the box on Windows. It will take you less time to download and install Linux on that old dead computer in your closet than it will to get your data in and out of Excel.
Happy coding!

Related

Powershell Output Question when running Forms

When I run my Forms code, different objects are added to the form (buttons, labels, etc.). I attach the objects to the form with the command $Form1.Controls.Add([ObjectType]).
My question is: when I run my code, I instantly get a sequence of numbers in my console and output dialogue box:
0 1 2 3 4 5 6 7 8 9 0 1 2
After I click the Submit button, the string "OK" is added to the numbers shown above:
0 1 2 3 4 5 6 7 8 9 0 1 2 OK
Why is this happening, and how can I remove these or at least keep them from being displayed?
The OK displays once the Submit button is pressed.
OK
Some actions like .Add() produce output. To prevent this, discard the output by adding | Out-Null at the end of the line or by putting [void] directly in front of the expression, like:
$foo.SomethingThatGeneratesOutput() | Out-Null
or
[void]$foo.SomethingThatGeneratesOutput()
As T-Me has stated, use [void] to prevent output from being generated when you call methods this way.
[Void]$Form1.Controls.Add([ObjectType])
If your code is still returning unwanted data, open the script in PowerShell ISE, and execute the script line by line (select the line and press F8). This will help you determine which line of code is generating output still.

Using sed to copy data between two numerical patterns to a new file

I'm running a bunch (~320) computational chemistry experiments and I need to pull a small amount of the data out of each of the files so that I can do some work on it in MatLab.
I'm pretty sure I can use sed to make this work, but try as I might I don't seem to be able to do so.
I need all of the data starting at the line beginning with "1 1" and ending with the line starting with "33 33".
I J FI(I,J) k(I,J) K(I,J)
1 1 -337.13279 -0.06697 -0.00430
2 2 3804.89120 8.52972 0.54787
3 3 3195.69653 6.01702 0.38648
4 4 3189.18684 5.99253 0.38490
5 5 3183.73262 5.97205 0.38359
6 6 3174.47525 5.93737 0.38136
7 7 3167.88746 5.91275 0.37978
8 8 1628.80868 1.56311 0.10040
9 9 1623.56055 1.55306 0.09975
10 10 1518.21620 1.35806 0.08723
11 11 1476.93012 1.28520 0.08255
12 12 1341.24087 1.05990 0.06808
13 13 1312.30373 1.01466 0.06517
14 14 1264.73004 0.94242 0.06053
15 15 1185.62592 0.82822 0.05320
16 16 1175.54013 0.81419 0.05230
17 17 1170.41211 0.80710 0.05184
18 18 1090.20196 0.70027 0.04498
19 19 1039.29190 0.63639 0.04088
20 20 1015.00116 0.60699 0.03899
21 21 1005.05773 0.59516 0.03823
22 22 986.55965 0.57345 0.03683
23 23 917.65537 0.49615 0.03187
24 24 842.93089 0.41863 0.02689
25 25 819.00146 0.39520 0.02538
26 26 758.39720 0.33888 0.02177
27 27 697.11173 0.28632 0.01839
28 28 628.75684 0.23292 0.01496
29 29 534.75856 0.16849 0.01082
30 30 499.35579 0.14692 0.00944
31 31 422.01320 0.10493 0.00674
32 32 409.30255 0.09870 0.00634
33 33 227.12411 0.03039 0.00195
33 2nd derivatives larger than 0.371D-04 over 561
MatLab is not a fan of text, so I'd like to not use text delimiters (though there are some in the header of this data section) and keep the data contained to only the numeric lines.
The data files contain a lot of other numbers as well, so I need to match the occurrence of "1 1" at the start of the line and "33 33" as the end of the copy. These 'indices' exist only in this block of info.
I attempted to use
% sed -n /"1 1"/,/"33 33"/p input.file > output.file
But I get a WHOLE BUNCH of data in the output file as it copies everything that shows up between any "1" and "33"
Is there any way to do what I'm looking for?
Also, I'm using the tcsh as that is what my servers run.
How about using awk
awk '$1=="1"&&$2=="1"{t=1};t;$1=="33"&&$2=="33"{t=0}' file
As recommended by @mklement0, if there is only one block, you can avoid processing the remainder of the file by changing the command to:
awk '$1=="1"&&$2=="1"{t=1};t;$1=="33"&&$2=="33"{exit}' file
Your problem is twofold. First, there are two blanks between the ones, but your regex only allows for one (judging from the now indented code). Second, you are probably not precise enough; the /1 1/ pattern matches 11 11, for example, and 111 111 and so on.
So, you should consider:
sed -n -e '/^ *1  *1 /,/^ *33  *33 /p' -e '/^ *33  *33 /q' input.file > output.file
The patterns are anchored to the start of line by the ^ (caret). The numbers are separated by one or more blanks (there are other, longer-winded ways of writing that in standard sed; the + option is not standard sed but is widely available). And the numbers are terminated by a blank. The chances are that the first expression alone will give you what you want. The second expression terminates the search early when it recognizes the 33 33 input line, which can save a significant amount of file I/O and hence processing time if the input file is big enough.
If the lines with ID numbers in the hundreds have some different format, then it should be fairly straight-forward to tweak the regexes to match what is used. If the data contains tabs instead of (or as well as) blanks, you can tweak the regexes to manage that, too.
If your data is all formatted exactly the same as this file, then you can use sed to just read the 3rd through the 35th line (rows 1 1 - 33 33). This is a lot easier than parsing the values, but does require that the files have a standard format:
sed -n 3,35p data.txt
Another cheap way would be to grep for only numeric lines, and take only the first 33:
grep "^[0-9 ][0-9 .-]*$" data.txt | head -n 33
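
The field-matching approach used by the awk answer can also be sketched in Python, which may be convenient if the data is post-processed in a script anyway. This is only an illustration under the assumption that the block is delimited by lines whose first two whitespace-separated fields are "1 1" and "33 33"; the function name extract_block and the abbreviated sample are hypothetical.

```python
def extract_block(lines, first=("1", "1"), last=("33", "33")):
    """Collect lines from the one whose first two fields equal `first`
    through the one whose first two fields equal `last` (inclusive)."""
    out, inside = [], False
    for line in lines:
        fields = tuple(line.split()[:2])
        if not inside and fields == first:
            inside = True
        if inside:
            out.append(line)
            if fields == last:
                break  # stop reading once the block is complete
    return out

# Abbreviated sample: header, three data rows, and the trailing text line.
sample = [
    "I J FI(I,J) k(I,J) K(I,J)",
    "1 1 -337.13279 -0.06697 -0.00430",
    "2 2 3804.89120 8.52972 0.54787",
    "33 33 227.12411 0.03039 0.00195",
    "33 2nd derivatives larger than 0.371D-04 over 561",
]
block = extract_block(sample)
print("\n".join(block))
```

Because whole fields are compared rather than substrings, a line starting with "11 11" or "33 2nd" does not start or end the block.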

Copying every nth line and duplicating it on its following line

I am trying to make test files for a project, and I figured I could make a bradycardia test file from an example file of a normal ECG.
To do that, I need to copy every third line and insert the copy as the next line.
for example:
a = [
1
2
3
4
5
6
7
8
9
10]
and I want:
b = [
1
2
3
3
4
5
6
6
7
8
9
9
10]
and so on... Since the file is 6000 characters long, I obviously cannot copy it manually, and I need it to be 9000 characters long. I've tried looking online for how to do this and am having no luck.
Any suggestions?
b=zeros(floor(4/3*length(a)),1);
b(1:4:end)=a(1:3:end);
b(2:4:end)=a(2:3:end);
b(3:4:end)=a(3:3:end);
b(4:4:end)=a(3:3:end);
Another way:
b = a(sort([1:numel(a) 3:3:numel(a)]))
And here is a third, faster and simpler method:
b = a(round(1:0.75:numel(a)))
This only works if length(a) is a multiple of 3, but seems to be faster than the other answers, at least for large vectors:
b = reshape([reshape(a,3,[]); a(3:3:end).'],[],1);
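
For comparison, here is the same transformation written as a plain loop in Python. It is a sketch for readers who want to check the expected result outside MATLAB, not a replacement for the vectorized answers above; the function name duplicate_every_nth is made up for this example.

```python
def duplicate_every_nth(items, n=3):
    """Return a new list in which every n-th element appears twice in a row."""
    out = []
    for position, item in enumerate(items, start=1):
        out.append(item)
        if position % n == 0:
            out.append(item)  # repeat the 3rd, 6th, 9th, ... element
    return out

a = list(range(1, 11))
b = duplicate_every_nth(a)
print(b)
```

Unlike the reshape version, this does not require the input length to be a multiple of 3.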

PowerShell's call operator (&) syntax and double-quotes

Can someone explain this result to me? I've wasted a lot of time over the years trying to master PowerShell's syntax for calling commands, but this...I can't even make a guess how to get this result from the input.
PS C:\Users\P> & echoargs " ""1"" 2 3 ""4 5 6"" 7 8 9"
Arg 0 is < 1 2 3 4>
Arg 1 is <5>
Arg 2 is <6 7 8 9>
Bueller?
The doubled-up double quotes inside a double-quoted string are a way to insert a literal double quote. The updated version of echoargs.exe shows this a bit more clearly, as it shows you the command line used to invoke the exe:
PS> echoargs " ""1"" 2 3 ""4 5 6"" 7 8 9"
Arg 0 is < 1 2 3 4>
Arg 1 is <5>
Arg 2 is <6 7 8 9>
Command line:
"C:\...\Modules\Pscx\Apps\EchoArgs.exe" " "1" 2 3 "4 5 6" 7 8 9"
If you take that command line (after it has been parsed by PowerShell) you get the same result in CMD.exe:
CMD> EchoArgs.exe " "1" 2 3 "4 5 6" 7 8 9"
Arg 0 is < 1 2 3 4>
Arg 1 is <5>
Arg 2 is <6 7 8 9>
Command line:
C:\...\Modules\Pscx\Apps\EchoArgs.exe " "1" 2 3 "4 5 6" 7 8 9"
As to why .NET or the C++ startup code parses the command line that way, I'm not entirely sure. This MSDN topic covers it a bit and if you look at the examples at the bottom of the topic, you will see some equally weird parsing behavior e.g. a\\\b d"e f"g h gives a\\\b, de fg and h.
Note that PowerShell is known for some serious bugs when it comes to passing arguments to applications and quoting those arguments: http://connect.microsoft.com/PowerShell/feedback/details/376207/executing-commands-which-require-quotes-and-variables-is-practically-impossible
This is how I understand how it is being (mis)parsed:
The string is " ""1"" 2 3 ""4 5 6"" 7 8 9"
Because of the bug, the doubled double quotes, which become literal double quotes, never reach the application as escaped quotes.
The string would be like " "1" 2 3 "4 5 6" 7 8 9"
So <space>1 2 3 4 becomes an argument because it is the first section with matching quotes and 4 occurs before the next unquoted space. Then comes a space, and hence 5 becomes the second argument. Then comes another space, so the next part is a separate argument. Here the same rule applies as for the first argument, except that 6 occurs before the quote without a space in between, and hence 6 7 8 9 becomes the next argument.
Bottom line: PowerShell argument passing to external applications is pretty messed up.
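
The splitting described above can be reproduced with a deliberately simplified model in which a double quote only toggles "inside quotes" and spaces split arguments only outside quotes. The Python sketch below is an assumption-laden illustration, not the actual C runtime or .NET parser: it ignores backslash escaping and other special cases, and the function name split_args_simplified is made up.

```python
def split_args_simplified(cmdline):
    """Split on spaces outside quotes; a double quote toggles quoting and is dropped.
    Simplified on purpose: backslash escapes and other special cases are ignored."""
    args, current, in_quotes, have_arg = [], [], False, False
    for ch in cmdline:
        if ch == '"':
            in_quotes = not in_quotes
            have_arg = True
        elif ch == " " and not in_quotes:
            if have_arg:
                args.append("".join(current))
                current, have_arg = [], False
        else:
            current.append(ch)
            have_arg = True
    if have_arg:
        args.append("".join(current))
    return args

# The argument string as it arrives on the native command line.
args = split_args_simplified('" "1" 2 3 "4 5 6" 7 8 9"')
for i, arg in enumerate(args):
    print(f"Arg {i} is <{arg}>")
```

Under this model the three arguments match the EchoArgs output shown earlier, which supports the reading that the inner quotes reach the executable unescaped.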

What do the numbers on a Git Diff header mean? [duplicate]

Every time I run git diff, for each change I made, I get some sort of header with numbers, for example:
@@ -169,14 +167,12 @@ function Browser(window, document, body, XHR, $log) {.....
I wonder what the four numbers mean. I guess -169 means that this particular line of code that follows was originally on line 169 but is now on line 167? And what do 14 and 12 mean?
This header introduces a set of changes, called a hunk. Each hunk starts with a line that contains, enclosed in @@, the starting line and the number of lines (start,count) in the file before (with a -) and after (with a +) the changes. After that come the lines from the file. Lines starting with a - are deleted; lines starting with a + are added. Each changed region is surrounded by 3 lines of context before and after.
An addition looks like this:
@@ -75,6 +103,8 @@
foo
bar
baz
+line1
+line2
more context
and more
and still context
That means, in the original file before line 78 (= 75 + 3 lines of context) add two lines. These will be lines 106 (= 103 + 3 lines of context) through 107 after all changes.
Note the difference in the start numbers (-75 vs +103): it means that other changes earlier in this file, before this particular hunk, added a net 28 (103 - 75) lines.
A deletion looks like this:
@@ -75,7 +75,6 @@
foo
bar
baz
-line1
more context
and more
and still context
That means, delete line 78 (= 75 + 3 lines of context) in the original file. The unchanged context will be on lines 75 to 80 after all changes.
Note that the start numbers in this hunk are equal (-75 and +75): either there were no changes before this hunk, or the earlier changes added and deleted the same number of lines.
Finally, a change looks like this:
@@ -70,7 +70,7 @@
foo
bar
baz
-red
+blue
more context
and more
still context
That means, change line 73 (= 70 + 3 lines of context) in the file before all changes, which contains red to blue. The changed line is also line 73 (= 70 + 3 lines of context) in the file after all changes.
I wonder what the four numbers mean?
Let's analyze a simple example
The format is basically the same as the diff -u unified diff.
We start with numbers from 1 to 16 and remove 2, 3, 14 and 15:
diff -u <(seq 16) <(seq 16 | grep -Ev '^(2|3|14|15)$')
Output:
@@ -1,6 +1,4 @@
1
-2
-3
4
5
6
@@ -11,6 +9,4 @@
11
12
13
-14
-15
16
@@ -1,6 +1,4 @@ means:
-1,6 means that this piece of the first file starts at line 1 and shows a total of 6 lines. Therefore it shows lines 1 to 6.
1
2
3
4
5
6
- means "old", as we usually invoke it as diff -u old new.
+1,4 means that this piece of the second file starts at line 1 and shows a total of 4 lines. Therefore it shows lines 1 to 4.
+ means "new".
We only have 4 lines instead of 6 because 2 lines were removed! The new hunk is just:
1
4
5
6
@@ -11,6 +9,4 @@ for the second hunk is analogous:
on the old file, we have 6 lines, starting at line 11 of the old file:
11
12
13
14
15
16
on the new file, we have 4 lines, starting at line 9 of the new file:
11
12
13
16
Note that line 11 is the 9th line of the new file because we have already removed 2 lines on the previous hunk: 2 and 3.
Summary:
Assume git diff will output [0-3] lines of context [before/after] [first/last] changes
@@ -[original file's number of first line displayed],[context lines + removed lines] +[changed file's number of first line displayed],[context lines + added lines] @@
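
The hunk header layout can be checked programmatically. The sketch below uses Python's standard difflib module to rebuild the 1-to-16 example from this answer and a small regular expression to pull the four numbers out of each header. The helper name parse_hunk_header and the regex are illustrative assumptions, not something git itself exposes.

```python
import difflib
import re

# @@ -old_start[,old_count] +new_start[,new_count] @@ ; a missing count means 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count) for a hunk header line."""
    m = HUNK_RE.match(line)
    if m is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = m.groups()
    return (int(old_start), int(old_count or 1), int(new_start), int(new_count or 1))

# Numbers 1 to 16, then the same list with 2, 3, 14 and 15 removed.
old = [str(n) for n in range(1, 17)]
new = [s for s in old if s not in {"2", "3", "14", "15"}]

# unified_diff defaults to 3 lines of context, like diff -u.
headers = [l for l in difflib.unified_diff(old, new, lineterm="") if l.startswith("@@")]
parsed = [parse_hunk_header(h) for h in headers]
for header, numbers in zip(headers, parsed):
    print(header, "->", numbers)
```

The two headers produced are the same as in the diff -u output above, and each parsed tuple reads as (first old line, old line count, first new line, new line count).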