Group the results of grep command - perl

I've been using grep -f to obtain patterns from one file and extract lines from the other.
The results are like below:
1 11294199 11294322 40 10 123 0.0813008
1 11294199 11294322 41 6 123 0.0487805
1 11294199 11294322 42 10 123 0.0813008
1 11294199 11294322 43 2 123 0.0162602
1 11293454 11293544 51 1 90 0.0111111
1 11293454 11293544 52 2 90 0.0222222
1 11291356 11291491 54 6 135 0.0444444
1 11291356 11291491 55 8 135 0.0592593
1 11291356 11291491 56 3 135 0.0222222
Now I need to group the results by the first three columns and calculate the sum of column 4 for each group:
1 11294199 11294322 (40+41+42+43)
1 11293454 11293544 (51+52)
1 11291356 11291491 (54+55+56)
How can I get such results? Any options in grep to achieve this?
thx

You will need awk to do what you want. Try this:
awk '{ array[$1 "\t" $2 "\t" $3] += $4 } END { for (i in array) print i "\t" array[i] }' file.txt
Results:
1 11294199 11294322 166
1 11291356 11291491 165
1 11293454 11293544 103
HTH
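For anyone who would rather do the grouping outside awk, here is a minimal Python sketch of the same idea: key on the first three fields and accumulate the fourth. The function name group_sums and the inline sample are made up for illustration; the input is assumed to be whitespace-separated like the grep output above.

```python
from collections import defaultdict

def group_sums(lines):
    """Sum column 4 for each distinct (col1, col2, col3) key."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or short lines
        key = tuple(fields[:3])
        totals[key] += int(fields[3])
    return dict(totals)

# Sample copied from the question's grep output.
sample = """\
1 11294199 11294322 40 10 123 0.0813008
1 11294199 11294322 41 6 123 0.0487805
1 11294199 11294322 42 10 123 0.0813008
1 11294199 11294322 43 2 123 0.0162602
1 11293454 11293544 51 1 90 0.0111111
1 11293454 11293544 52 2 90 0.0222222
1 11291356 11291491 54 6 135 0.0444444
1 11291356 11291491 55 8 135 0.0592593
1 11291356 11291491 56 3 135 0.0222222
""".splitlines()

for key, total in group_sums(sample).items():
    print("\t".join(key), total, sep="\t")
```

Unlike awk's for (i in array), a Python dict keeps keys in first-seen order, so the groups come out in the order they appear in the input.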

Make histogram of pixel intensities without imhist

I have used the unique command to get the unique pixel intensities from my image. Then I tried to make a histogram using them, but it doesn't use all of the intensity values
I = imread('pout.tif');
[rows, columns] = size(I);
UniquePixels=unique(I);
hist=histogram(UniquePixels)
An alternative approach would be to use accumarray combined with unique. I would specifically use the third output of unique to transform your data into a consecutive sequence from 1 up to N, where N is the total number of unique intensities, then use the first output of unique, which gives you the list of unique intensities. Therefore, if the first output of unique is A and the output of accumarray is B, then B(i) is the number of pixels whose intensity is A(i).
Therefore:
[UniquePixels, ~, id] = unique(I);
histo = accumarray(id, 1);
UniquePixels gives you all unique pixels while histo gives you the counts of each unique pixel corresponding to each element in UniquePixels.
Here's a quick example:
>> I = randi(255, 10, 10)
I =
42 115 28 111 218 107 199 60 140 237
203 22 246 233 159 13 100 91 76 198
80 59 2 47 90 231 62 210 190 125
135 233 198 68 131 241 103 4 49 112
43 39 209 38 103 126 25 11 176 114
154 211 222 35 20 125 34 44 47 79
68 138 22 222 62 87 241 166 94 130
167 255 102 148 32 230 244 187 160 131
176 20 67 141 47 95 147 166 199 209
191 113 205 37 62 29 16 115 21 203
>> [UniquePixels, ~, id] = unique(I);
>> histo = accumarray(id, 1);
>> [UniquePixels histo]
ans =
2 1
4 1
11 1
13 1
16 1
20 2
21 1
22 2
25 1
28 1
29 1
32 1
34 1
35 1
37 1
38 1
39 1
42 1
43 1
44 1
47 3
49 1
59 1
60 1
62 3
67 1
68 2
76 1
79 1
80 1
87 1
90 1
91 1
94 1
95 1
100 1
102 1
103 2
107 1
111 1
112 1
113 1
114 1
115 2
125 2
126 1
130 1
131 2
135 1
138 1
140 1
141 1
147 1
148 1
154 1
159 1
160 1
166 2
167 1
176 2
187 1
190 1
191 1
198 2
199 2
203 2
205 1
209 2
210 1
211 1
218 1
222 2
230 1
231 1
233 2
237 1
241 2
244 1
246 1
255 1
If you double check the input example and the final output, you will see that only the unique pixels are shown combined with their counts. Any bins that were zero in count are not shown.
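The same "unique values plus counts" idea can be sketched in plain Python with collections.Counter. This is only an illustration of the technique, not a replacement for the MATLAB code; the function name intensity_histogram and the tiny 2x3 image are invented for the example.

```python
from collections import Counter

def intensity_histogram(image):
    """Return sorted (intensity, count) pairs for the values present in a 2-D list."""
    counts = Counter(value for row in image for value in row)
    return sorted(counts.items())

# Hypothetical 2x3 "image"; real data would come from an image reader.
image = [
    [42, 115, 28],
    [28, 42, 42],
]

hist = intensity_histogram(image)
for intensity, count in hist:
    print(intensity, count)
```

As with the accumarray version, intensities that never occur simply do not appear in the result.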

Equivalent of *nix fold in PowerShell

Today I had a few hundred items (IDs from a SQL query) and needed to paste them into another query in a form readable by an analyst. I needed the *nix fold command. I wanted to take the 300 lines and reformat them as multiple numbers per line separated by a space. I would have used fold -w 100 -s.
Similar tools on *nix include fmt and par.
On Windows, is there an easy way to do this in PowerShell? I expected one of the Format-* cmdlets to do it, but I couldn't find one. I'm using PowerShell v4.
See https://unix.stackexchange.com/questions/25173/how-can-i-wrap-text-at-a-certain-column-size
# Input Data
# simulate a set of 300 numeric IDs from 100,000 to 150,000
100001..100330 |
Out-File _sql.txt -Encoding ascii
# I want output like:
# 100001, 100002, 100003, 100004, 100005, ... 100010, 100011
# 100012, 100013, 100014, 100015, 100016, ... 100021, 100021
# each line less than 100 characters.
Depending on how big the file is, you could read it all into memory, join it with spaces, and then split on 100 characters or the next space after that:
(Get-Content C:\Temp\test.txt) -join " " -split '(.{100,}?[ |$])' | Where-Object{$_}
That regex looks for 100 characters, then the first space after that. That match is then used by -split, but since the pattern is wrapped in parentheses the match is returned instead of discarded. The Where-Object removes the empty entries that are created in between the matches.
Small sample to prove the theory:
@"
134
124
1
225
234
4
34
2
42
342
5
5
2
6
"@.split("`n") -join " " -split '(.{10,}?[ |$])' | Where-Object{$_}
The above splits on 10 characters where possible. If it cannot the numbers are still preserved. Sample is based on me banging on the keyboard with my head.
134 124 1
225 234 4
34 2 42
342 5 5
2 6
You could then make this into a function to get the simplicity back that you are most likely looking for. It can get better but this isn't really the focus of the answer.
Function Get-Folded{
Param(
[string[]]$Strings,
[int]$Wrap = 50
)
$strings -join " " -split "(.{$wrap,}?[ |$])" | Where-Object{$_}
}
Again with the samples
PS C:\Users\mcameron> Get-Folded -Strings (Get-Content C:\temp\test.txt) -wrap 40
"Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. Ut enim
ad minim veniam, quis nostrud exercitation
... output truncated...
You can see that it was supposed to split on 40 characters but the second line is longer. It split on the next space after 40 to preserve the word.
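To make the "at least N characters, then the next space" behaviour easier to see, here is a rough Python sketch of the same split-and-keep-the-match trick. The name fold_after is made up. The pattern is adapted rather than copied: it uses (?: |$) so that end-of-string also ends a piece, whereas inside the character class [ |$] of the PowerShell version the | and $ are, as far as I can tell, treated as literal characters.

```python
import re

def fold_after(text, width):
    """Split text at the first space found at or after `width` characters."""
    # The capturing group makes re.split return the matched pieces too.
    pattern = r"(.{%d,}?(?: |$))" % width
    pieces = re.split(pattern, text)
    # Drop the empty strings left between adjacent matches.
    return [p.strip() for p in pieces if p.strip()]

ids = "134 124 1 225 234 4 34 2 42 342 5 5 2 6"
lines = fold_after(ids, 10)
for line in lines:
    print(line)
```

Note that a line can run past the requested width, because the split only happens at the first space after the minimum length has been reached.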
If it's one item per line, and you want to join every 100 items onto a single line separated by a space you could put all the output into a text file then do this:
gc c:\count.txt -readcount 100 | % {$_ -join " "}
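The -ReadCount approach is just fixed-size chunking followed by a join. A small Python sketch of that idea, with the hypothetical name join_in_chunks and the ID range taken from the question:

```python
def join_in_chunks(items, count, sep=" "):
    """Join every `count` consecutive items into one line."""
    return [
        sep.join(str(item) for item in items[start:start + count])
        for start in range(0, len(items), count)
    ]

ids = list(range(100001, 100331))  # same range as 100001..100330
lines = join_in_chunks(ids, 10)
for line in lines[:3]:
    print(line)
```

This only keeps lines under a given width when every item has the same length, which is the case for the IDs in the question.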
When I saw this, the first thing that came to my mind was abusing Format-Table to do this, mostly because it knows how to break the lines properly when you specify a width. After coming up with a function, it seems that the other solutions presented are shorter and probably easier to understand, but I figured I'd still go ahead and post this solution anyway:
function fold {
[CmdletBinding()]
param(
[Parameter(ValueFromPipeline)]
$InputObject,
[Alias('w')]
[int] $LineWidth = 100,
[int] $ElementWidth
)
begin {
$SB = New-Object System.Text.StringBuilder
if ($ElementWidth) {
$SBFormatter = "{0,$ElementWidth} "
}
else {
$SBFormatter = "{0} "
}
}
process {
foreach ($CurrentObject in $InputObject) {
[void] $SB.AppendFormat($SBFormatter, $CurrentObject)
}
}
end {
# Format-Table wanted some sort of an object assigned to it, so I
# picked the first static object that popped in my head:
([guid]::Empty | Format-Table -Property @{N="DoesntMatter"; E={$SB.ToString()}; Width = $LineWidth } -Wrap -HideTableHeaders |
Out-String).Trim("`r`n")
}
}
Using it gives output like this:
PS C:\> 0..99 | Get-Random -Count 100 | fold
1 73 81 47 54 41 17 87 2 55 30 91 19 50 64 70 51 29 49 46 39 20 85 69 74 43 68 82 76 22 12 35 59 92
13 3 88 6 72 67 96 31 11 26 80 58 16 60 89 62 27 36 37 18 97 90 40 65 42 15 33 24 23 99 0 32 83 14
21 8 94 48 10 4 84 78 52 28 63 7 34 86 75 71 53 5 45 66 44 57 77 56 38 79 25 93 9 61 98 95
PS C:\> 0..99 | Get-Random -Count 100 | fold -ElementWidth 2
74 89 10 42 46 99 21 80 81 82 4 60 33 45 25 57 49 9 86 84 83 44 3 77 34 40 75 50 2 18 6 66 13
64 78 51 27 71 97 48 58 0 65 36 47 19 31 79 55 56 59 15 53 69 85 26 20 73 52 68 35 93 17 5 54 95
23 92 90 96 24 22 37 91 87 7 38 39 11 41 14 62 12 32 94 29 67 98 76 70 28 30 16 1 61 88 43 8 63
72
PS C:\> 0..99 | Get-Random -Count 100 | fold -ElementWidth 2 -w 40
21 78 64 18 42 15 40 99 29 61 4 95 66
86 0 69 55 30 67 73 5 44 74 20 68 16
82 58 3 46 24 54 75 14 11 71 17 22 94
45 53 28 63 8 90 80 51 52 84 93 6 76
79 70 31 96 60 27 26 7 19 97 1 59 2
65 43 81 9 48 56 25 62 13 85 47 98 33
34 12 50 49 38 57 39 37 35 77 89 88 83
72 92 10 32 23 91 87 36 41
This is what I ended up using.
# simulate a set of 300 SQL IDs from 100,000 to 150,000
100001..100330 |
%{ "$_, " } | # I'll need this decoration in the SQL script
Out-File _sql.txt -Encoding ascii
gc .\_sql.txt -ReadCount 10 | %{ $_ -join ' ' }
Thanks everyone for the effort and the answers. I'm really surprised there wasn't a way to do this with Format-Table without the use of [guid]::Empty in Rohn Edward's answer.
My IDs are much more consistent than the example I gave, so Noah's use of gc -ReadCount is by far the simplest solution in this particular data set, but in the future I'd probably use Matt's answer or the answers linked to by Emperor in comments.
I came up with this:
$array =
(@'
1
2
3
10
11
100
101
'@).split("`n") |
foreach {$_.trim()}
$array = $array * 40
$SB = New-Object Text.StringBuilder(100,100)
foreach ($item in $array) {
Try { [void]$SB.Append("$item ") }
Catch {
$SB.ToString()
[void]$SB.Clear()
[Void]$SB.Append("$item ")
}
}
#don't forget the last line
$SB.ToString()
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
Maybe not as compact as you were hoping for, and there may be better ways to do it, but it seems to work.
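The StringBuilder trick above is a greedy packer: keep appending until the next item would not fit, then start a new line. A Python sketch of that logic with an explicit length check instead of a capacity exception follows. The name pack_lines is invented, and unlike the StringBuilder version it does not count a trailing space, so line breaks may fall slightly differently.

```python
def pack_lines(items, width=100, sep=" "):
    """Greedily pack items into lines no longer than `width` characters."""
    lines, current = [], ""
    for item in items:
        token = str(item)
        candidate = token if not current else current + sep + token
        if current and len(candidate) > width:
            lines.append(current)  # current line is full, start a new one
            current = token
        else:
            current = candidate
    if current:
        lines.append(current)  # don't forget the last line
    return lines

array = ["1", "2", "3", "10", "11", "100", "101"] * 40
lines = pack_lines(array, 100)
for line in lines:
    print(line)
```

A single item longer than the width still gets a line of its own rather than being cut.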

Functional addition of Columns in kdb+q

I have a q table in which the number of non-keyed columns is variable. Each of these column names also contains an integer. I want to apply a function to these columns without hard-coding their names.
How can I achieve this?
For Example:
table:
a | col10 col20 col30
1 | 2 3 4
2 | 5 7 8
// Assume that I have numbers 10, 20 ,30 obtained from column names
I want something like **update NewCol:10*col10+20*col20+30*col30 from table**
except that neither the number of columns nor the numbers embedded in their names is fixed.
We want to use a functional update (simple example shown here: http://www.timestored.com/kdb-guides/functional-queries-dynamic-sql#functional-update)
For this particular query we want to generate the computation tree of the select clause, i.e. the last part of the functional update statement. The easiest way to do that is to parse a similar statement then recreate that format:
q)/ create our table
q)t:([] c10:1 2 3; c20:10 20 30; c30:7 8 9; c40:0.1*4 5 6)
q)t
c10 c20 c30 c40
---------------
1 10 7 0.4
2 20 8 0.5
3 30 9 0.6
q)parse "update r:(10*c10)+(20*col20)+(30*col30) from t"
!
`t
()
0b
(,`r)!,(+;(*;10;`c10);(+;(*;20;`col20);(*;30;`col30)))
q)/ notice the last value, the parse tree
q)/ we want to recreate that using code
q){(*;x;`$"c",string x)} 10
*
10
`c10
q){(+;x;y)} over {(*;x;`$"c",string x)} each 10 20
+
(*;10;`c10)
(*;20;`c20)
q)makeTree:{{(+;x;y)} over {(*;x;`$"c",string x)} each x}
/ now write as functional update
q)![t;();0b; enlist[`res]!enlist makeTree 10 20 30]
c10 c20 c30 c40 res
-------------------
1 10 7 0.4 420
2 20 8 0.5 660
3 30 9 0.6 900
q)update r:(10*c10)+(20*c20)+(30*c30) from t
c10 c20 c30 c40 r
-------------------
1 10 7 0.4 420
2 20 8 0.5 660
3 30 9 0.6 900
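If the fold with over is hard to picture, here is a rough Python analogue that builds the same left-nested tree with functools.reduce and then evaluates it row by row. The names make_tree and evaluate, and the tuple encoding of the tree, are assumptions for illustration only; q's real parse trees are not Python tuples.

```python
from functools import reduce
import operator

OPS = {"+": operator.add, "*": operator.mul}

def make_tree(weights, prefix="c"):
    """Build ('+', ('+', term1, term2), term3)... from the weights."""
    terms = [("*", w, f"{prefix}{w}") for w in weights]
    return reduce(lambda left, right: ("+", left, right), terms)

def evaluate(node, row):
    """Evaluate a tree against one row (a dict of column name -> value)."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, row), evaluate(right, row))
    if isinstance(node, str):
        return row[node]  # a column reference
    return node  # a literal number

# Same values as columns c10, c20 and c30 of table t above.
table = [
    {"c10": 1, "c20": 10, "c30": 7},
    {"c10": 2, "c20": 20, "c30": 8},
    {"c10": 3, "c20": 30, "c30": 9},
]

tree = make_tree([10, 20, 30])
res = [evaluate(tree, row) for row in table]
print(tree)
print(res)
```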
I think functional select (as suggested by @Ryan) is the way to go if the table is quite generic, i.e. column names might vary and the number of columns is unknown.
Yet I prefer the way @JPC uses a vector to solve the multiplication and summation problem, i.e. update res:sum 10 20 30*(col10;col20;col30) from table
Let's combine both approaches, with some extreme cases:
q)show t:1!flip(`a,`$((10?2 3 4)?\:.Q.a),'string 10?10)!enlist[til 100],0N 100#1000?10
a | vltg4 pnwz8 mifz5 pesq7 fkcx4 bnkh7 qvdl5 tl5 lr2 lrtd8
--| -------------------------------------------------------
0 | 3 3 0 7 9 5 4 0 0 0
1 | 8 4 0 4 1 6 0 6 1 7
2 | 4 7 3 0 1 0 3 3 6 4
3 | 2 4 2 3 8 2 7 3 1 7
4 | 3 9 1 8 2 1 0 2 0 2
5 | 6 1 4 5 3 0 2 6 4 2
..
q)show n:"I"$string[cols get t]inter\:.Q.n
4 8 5 7 4 7 5 5 2 8i
q)show c:cols get t
`vltg4`pnwz8`mifz5`pesq7`fkcx4`bnkh7`qvdl5`tl5`lr2`lrtd8
q)![t;();0b;enlist[`res]!enlist({sum x*y};n;enlist,c)]
a | vltg4 pnwz8 mifz5 pesq7 fkcx4 bnkh7 qvdl5 tl5 lr2 lrtd8 res
--| -----------------------------------------------------------
0 | 3 3 0 7 9 5 4 0 0 0 176
1 | 8 4 0 4 1 6 0 6 1 7 226
2 | 4 7 3 0 1 0 3 3 6 4 165
3 | 2 4 2 3 8 2 7 3 1 7 225
4 | 3 9 1 8 2 1 0 2 0 2 186
5 | 6 1 4 5 3 0 2 6 4 2 163
..
You can create a functional form query as @Ryan Hamilton indicated, and overall that will be the best approach since it is very flexible. But if you're just looking to add these up, multiplied by some weight, I'm a fan of going through other avenues.
EDIT: missed that you said the number in the columns name could vary, in which case you can easily adjust this. If the column names are all prefaced by the same number of letters, just drop those and then parse the remaining into int or what have you. Otherwise if the numbers are embedded within text, check out this other question
//Create our table with a random number of columns (up to 9 value columns) and 1 key column
q)show t:1!flip (`$"c",/:string til n)!flip -1_(n:2+first 1?10) cut neg[100]?100
c0| c1 c2 c3 c4 c5 c6 c7 c8 c9
--| --------------------------
28| 3 18 66 31 25 76 9 44 97
60| 35 63 17 15 26 22 73 7 50
74| 64 51 62 54 1 11 69 32 61
8 | 49 75 68 83 40 80 81 89 67
5 | 4 92 45 39 57 87 16 85 56
48| 88 34 55 21 12 37 53 2 41
86| 52 91 79 33 42 10 98 20 82
30| 71 59 43 58 84 14 27 90 19
72| 0 99 47 38 65 96 29 78 13
q)update res:sum (1+til -1+count cols t)*flip value t from t
c0| c1 c2 c3 c4 c5 c6 c7 c8 c9 res
--| -------------------------------
28| 3 18 66 31 25 76 9 44 97 2230
60| 35 63 17 15 26 22 73 7 50 1551
74| 64 51 62 54 1 11 69 32 61 1927
8 | 49 75 68 83 40 80 81 89 67 3297
5 | 4 92 45 39 57 87 16 85 56 2582
48| 88 34 55 21 12 37 53 2 41 1443
86| 52 91 79 33 42 10 98 20 82 2457
30| 71 59 43 58 84 14 27 90 19 2134
72| 0 99 47 38 65 96 29 78 13 2336
q)![t;();0b; enlist[`res]!enlist makeTree 1+til -1+count cols t] ~ update res:sum (1+til -1+count cols t)*flip value t from t
1b
q)\ts do[`int$1e4;![t;();0b; enlist[`res]!enlist makeTree 1+til 9]]
232 3216j
q)\ts do[`int$1e4;update nc:sum (1+til -1+count cols t)*flip value t from t]
69 2832j
I haven't tested this on a large table, so caveat emptor
Here is another solution which is also faster.
t,'([]res:(+/)("I"$(string tcols) inter\: .Q.n) *' (value t) tcols:(cols t) except keys t)
By spending some time, we can decrease the word count as well. Logic goes like this:
a:"I"$(string tcols) inter\: .Q.n
Here I am first extracting out the integers from column names and storing them in a vector. Variable 'tcols' is declared at the end of query which is nothing but columns of table except key columns.
b:(value t) tcols:(cols t) except keys t
Here I am extracting out each column vector.
c:(+/) a *' b
Multiplying each column vector (var b) by its integer (var a) and adding corresponding values from each resulting list.
t,'([]res:c)
Finally storing result in a temp table and joining it to t.
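The three steps above (pull the integer out of each column name, multiply each column by it, add across columns) can be sketched in Python as follows. The table is modelled as a dict of lists using the values from the question; weighted_sum and key_cols are hypothetical names.

```python
import re

def weighted_sum(table, key_cols=("a",)):
    """Weight each non-key column by the digits in its name, then sum per row."""
    value_cols = [col for col in table if col not in key_cols]
    # Keep only the digit characters of each name, like `inter\: .Q.n` in q.
    weights = [int("".join(re.findall(r"\d", col))) for col in value_cols]
    n_rows = len(table[value_cols[0]])
    return [
        sum(w * table[col][i] for w, col in zip(weights, value_cols))
        for i in range(n_rows)
    ]

# Table from the question: key column a, value columns col10/col20/col30.
table = {"a": [1, 2], "col10": [2, 5], "col20": [3, 7], "col30": [4, 8]}
res = weighted_sum(table)
print(res)
```

A column name without any digits would raise a ValueError here, so this assumes every value column carries a number.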

Stacked-area with date format at x-axis on Gnuplot

I have created graphs using filledcurves. Now the graphs look bad because the data covers a long range.
This is my data:
a b c d e f g h i
201312 49 26 34 30 14 25 9 4 1
201311 38 22 47 30 9 9 4 3 1
201310 44 24 43 38 9 14 5 7 0
201309 65 18 33 39 15 12 4 5 1
201308 42 31 44 30 5 11 0 2 2
201307 58 27 35 29 8 4 2 4 2
201306 30 22 15 17 2 6 3 4 0
201305 61 52 20 16 11 12 2 3 0
201304 62 60 33 18 13 9 5 6 0
201303 43 53 49 27 9 11 7 0 0
201302 31 30 42 27 10 8 4 2 0
201301 42 30 20 47 9 13 3 2 1
201212 26 19 39 24 9 11 0 0 0
201211 26 26 30 28 1 2 0 2 1
201210 55 46 34 30 13 5 0 2 1
201209 56 31 27 28 27 13 2 4 1
201208 48 75 38 46 22 10 0 1 0
201207 60 56 37 47 19 11 2 1 0
201206 60 41 37 28 17 12 5 1 0
201205 49 43 38 46 15 16 2 2 0
201204 43 50 36 33 4 7 3 0 2
201203 49 63 35 43 16 7 1 2 0
201202 43 59 59 52 16 13 3 4 1
201201 51 44 30 37 20 9 4 1 0
201112 50 38 36 36 8 2 3 1 1
201111 75 35 30 36 16 7 3 3 1
201110 68 53 41 27 11 15 1 2 1
201109 68 46 48 47 16 19 4 0 1
201108 45 41 20 36 17 10 1 0 0
201107 48 34 30 24 13 7 3 3 1
201106 49 29 24 25 5 6 0 3 0
201105 45 35 21 37 1 7 2 1 0
201104 53 35 23 18 4 6 1 5 1
201103 58 42 20 18 6 4 1 0 4
201102 54 32 19 20 4 10 0 2 0
201101 42 41 21 28 3 6 1 2 1
and this is my gnuplot file:
set terminal postscript eps color font 20
set xtics 1 out
set tics front
#set style fill transparent solid 0.5 noborder
set key below autotitle columnheader
set ylabel "Count"
set xlabel "across time"
set output 't1.eps'
set title "t1-Across time of Aspects"
set xtics 1
plot for [i=10:2:-1] \
"< awk 'NR==1 {print \"year\",$".(i-1)."} NR>=2 {for (i=2; i<=".i."; i++) \
{sum+= $i} {print $1, sum; sum=0} }' data.dat" \
using (column(2)):xtic(1) with filledcurves x1 t column(2)
When I add time in xdata:
set xdata time
set timefmt "%Y%m"
set xtics format "%b"
Error message:
Need full using spec for x time data
Is the error caused by my date format? I have searched for this and have not found an answer. Any suggestions would be appreciated.
In the script you show, you specify only a single column in the using statement (besides the xtic). That means that this value is taken as the y-value and the row number is implicitly used as the x-value.
When using time data, you must explicitly specify all columns which are needed for the plotting style; there is no assumption about what might be the first column. Use:
set key below autotitle columnheader
set ylabel "Count"
set xlabel "across time"
set tics front
set xdata time
set timefmt "%Y%m"
set xtics format "%b'%y"
set autoscale xfix
plot for [i=10:2:-1] \
"< awk 'NR==1 {print \"year\",$".(i-1)."} NR>=2 {for (i=2; i<=".i."; i++) \
{sum+= $i} {print $1, sum; sum=0} }' data.dat" \
using 1:2 with filledcurves x1 t column(2)
Result with 4.6.4:
I guess you don't want xtic(1) if you have time data and specify the x format.
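For readers who find the inline awk hard to follow: for each value of i it emits the date plus the running total of columns 2..i, which is what stacks the areas. A Python sketch of that preprocessing, computing all running totals at once and parsing the date with the same %Y%m format, is below. The name stack_rows is made up and only two data rows are included.

```python
from datetime import datetime
from itertools import accumulate

def stack_rows(lines):
    """Parse the header, then return (date, running totals) for each data row."""
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        when = datetime.strptime(fields[0], "%Y%m")
        totals = list(accumulate(int(value) for value in fields[1:]))
        rows.append((when, totals))
    return header, rows

# First two data rows from the question's data.dat.
data = """\
a b c d e f g h i
201312 49 26 34 30 14 25 9 4 1
201311 38 22 47 30 9 9 4 3 1
""".splitlines()

header, rows = stack_rows(data)
for when, totals in rows:
    print(when.strftime("%Y-%m"), totals)
```

The last running total in each row is the top edge of the stack, which is why the plot loop draws the widest sum first.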

Matlab: sum column elements with restrictions

We have an MxN matrix and a constraint cstrn = 100;.
The constraint is an upper limit on the sum of the selected elements in each column:
sum(matrix(:,:))<=cstrn.
For a given example as the following:
Columns 1 to 5:
15 18 -5 22 19
50 98 -15 39 -8
70 -15 80 45 38
31 52 9 80 72
-2 63 52 71 6
7 99 32 58 41
I want to find the maximum number of elements per column that satisfy this constraint.
How can I sum the elements of each column in every combination and find which combination uses the most elements per column?
In the given example solution is:
4 3 5 2 5
where
column 1: 15 + 31 + 7 + (-2)
column 2: 18 +(-15) + 52 or 63 etc.
Thank you in advance.
Since it is always easier to fit small elements into a sum, you can do a sort, followed by the cumulative sum:
m= [
15 18 -5 22 19
50 98 -15 39 -8
70 -15 80 45 38
31 52 9 80 72
-2 63 52 71 6
7 99 32 58 41];
cs = cumsum(sort(m))
cs =
-2 -15 -15 22 -8
5 3 -20 61 -2
20 55 -11 106 17
51 118 21 164 55
101 216 73 235 96
171 315 153 315 168
Now you can easily identify at which element you cross the threshold cstrn (thanks, @sevenless)!
out = sum(cs <= cstrn)
out =
4 3 5 2 5
I'd add to Jonas's answer, that you can impose your constraint in a way that outputs a logical matrix then sum over the 1's and 0's of that matrix like so:
cstrn = 100
m= [
15 18 -5 22 19
50 98 -15 39 -8
70 -15 80 45 38
31 52 9 80 72
-2 63 52 71 6
7 99 32 58 41];
val_matrix = cumsum(sort(m))
logical_matrix = val_matrix<=cstrn
output = sum(logical_matrix)
Giving output:
cstrn =
100
val_matrix =
-2 -15 -15 22 -8
5 3 -20 61 -2
20 55 -11 106 17
51 118 21 164 55
101 216 73 235 96
171 315 153 315 168
logical_matrix =
1 1 1 1 1
1 1 1 1 1
1 1 1 0 1
1 0 1 0 1
0 0 1 0 1
0 0 0 0 0
output =
4 3 5 2 5
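For comparison, the same sort, cumulative sum, and count steps in plain Python, as a sketch only. The name max_elements_per_column is invented; the matrix is the one from the question, stored row by row.

```python
from itertools import accumulate

def max_elements_per_column(matrix, limit):
    """Count, per column, how many sorted running sums stay within the limit."""
    counts = []
    for column in zip(*matrix):  # zip(*matrix) iterates over columns
        running = accumulate(sorted(column))
        counts.append(sum(1 for total in running if total <= limit))
    return counts

m = [
    [15, 18, -5, 22, 19],
    [50, 98, -15, 39, -8],
    [70, -15, 80, 45, 38],
    [31, 52, 9, 80, 72],
    [-2, 63, 52, 71, 6],
    [7, 99, 32, 58, 41],
]

out = max_elements_per_column(m, 100)
print(out)
```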
Here is the logic; I'm on mobile so I can't give code.
Go to a column, sort it in ascending order, and loop to accumulate the sum, breaking as soon as the next element would push it above 100. The loop counter is the answer. Then refer back to the original column to get the indices of the elements you just summed up :-)
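A sketch of that loop in Python, including the step of mapping back to the original positions. The name pick_indices is hypothetical, indices are 0-based (MATLAB would be 1-based), and the early break assumes a non-negative limit.

```python
def pick_indices(column, limit):
    """Return the original indices of the smallest elements whose sum fits the limit."""
    # Indices of the column ordered by ascending value.
    order = sorted(range(len(column)), key=lambda i: column[i])
    chosen, total = [], 0
    for i in order:
        if total + column[i] > limit:
            break  # the next element would exceed the limit
        total += column[i]
        chosen.append(i)
    return sorted(chosen), total

column = [15, 50, 70, 31, -2, 7]  # first column of the example matrix
indices, total = pick_indices(column, 100)
print(indices, total)
```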